Posts - A Petition 2020-06-25T23:29:46.491Z · score: 60 (33 votes)
RobBensinger's Shortform 2019-09-23T19:44:20.095Z · score: 6 (1 votes)
New edition of "Rationality: From AI to Zombies" 2018-12-15T23:39:22.975Z · score: 22 (11 votes)
AI Summer Fellows Program: Applications open 2018-03-23T21:20:05.203Z · score: 5 (5 votes)
Anonymous EA comments 2017-02-07T21:42:24.686Z · score: 28 (30 votes)
Ask MIRI Anything (AMA) 2016-10-11T19:54:25.621Z · score: 18 (20 votes)
MIRI is seeking an Office Manager / Force Multiplier 2015-07-05T19:02:24.163Z · score: 8 (8 votes)


Comment by robbensinger on Some thoughts on the EA Munich // Robin Hanson incident · 2020-09-02T16:27:47.084Z · score: 35 (16 votes) · EA · GW
I believe that the social dynamics leading to development of CC do not depend on the balance of opinions favoring CC, and only require that those who are against it are afraid to speak up honestly and publicly

I agree with this. This seems like an opportune time for me to say in a public, easy-to-google place that I think cancel culture is a real thing, and very harmful.

Comment by robbensinger on Study results: The most convincing argument for effective donations · 2020-08-30T02:48:23.426Z · score: 2 (1 votes) · EA · GW

I haven't talked to Schwitzgebel, and you're of course welcome to pass this all on. :)

Comment by robbensinger on Study results: The most convincing argument for effective donations · 2020-08-27T17:22:47.948Z · score: 4 (2 votes) · EA · GW

An anonymous comment someone asked me to post for them (similar to Mati Roy's recent comment):

I always felt kind of uneasy about Eric Schwitzgebel's competition to find the most convincing argument for making people donate. It felt kind of symmetric in a way I don't like, like you could do that for any action you want to convince people to take. I also pattern matched it to starting with a conclusion and then trying to find all the best arguments for that conclusion, which is a grave sin in my culture.
I thought of something I could do to make it better, which I am probably not going to do because I feel like I would get yelled at.
edit: I'm not sure it did actually work like this, but it was something similar.
The way his competition worked was that he had subjects read the arguments submitted by competitors, then gave them 10 usd, and had them decide how much of that 10 usd if any they wanted to give to charity. People can now publicize the argument that turns out to most reliably cause readers to donate most. My idea to correct for this is to hold the opposite competition. What argument is best for convincing people not to donate, measured the same way? Then we could publicize both arguments together. Probably not going to do this, but I wish somebody would, and I would support them.

I took "It felt kind of symmetric" to be referring to Guided By The Beauty Of Our Weapons, and "starting with a conclusion and then trying to find all the best arguments for that conclusion" to be referring to The Bottom Line.

I replied (edited):

This is a cool idea!
Another potential problem is that persuasiveness isn't the same thing as accuracy, informational value, honesty, epistemic empowerment, etc.
And a third problem is that since each submission is trying to be maximally convincing, there's no incentive for any submission to note the weaknesses, limitations, or caveats affecting its own arguments. A 'debate' format, allowing for the two sides to respond to each other, seems better than a 'we each make our own arguments in separate rooms from each other' format, since it's a lot easier to come away uninformed or lacking important context if you don't hear anyone pick apart the original arguments.
Maybe the best version of this contest would be one where people submit arguments for or against a proposition, and the most compelling argument wins; then people submit rebuttals to the most compelling argument, and the most compelling rebuttal wins; then the original winner gets a chance to respond, and the second winner gets a chance to counter-respond. Then all four entries get posted together, so the argument gets a fairer hearing.
Comment by robbensinger on The academic contribution to AI safety seems large · 2020-07-31T16:43:07.211Z · score: 19 (6 votes) · EA · GW

I agree with Max's take. MIRI researchers still look at Alignment for Advanced Machine Learning Systems (AAMLS) problems periodically, but per Why I am not currently working on the AAMLS agenda, they mostly haven't felt the problems are tractable enough right now to warrant a heavy focus.

Nate describes our new work here: 2018 Update: Our New Research Directions.

Since 2016, actually “about half” of MIRI’s research has been on their ML agenda, apparently to cover the chance of prosaic AGI.

I don't think any of MIRI's major research programs, including AAMLS, have been focused on prosaic AI alignment. (I'd be interested to hear if Jessica or others disagree with me.)

Paul introduced prosaic AI alignment (in November 2016) with:

It’s conceivable that we will build “prosaic” AGI, which doesn’t reveal any fundamentally new ideas about the nature of intelligence or turn up any “unknown unknowns.” I think we wouldn’t know how to align such an AGI; moreover, in the process of building it, we wouldn’t necessarily learn anything that would make the alignment problem more approachable. So I think that understanding this case is a natural priority for research on AI alignment.

In contrast, I think of AAMLS as assuming that we'll need new deep insights into intelligence in order to actually align an AGI system. There's a large gulf between (1) "Prosaic AGI alignment is feasible" and (2) "AGI may be produced by techniques that are descended from current ML techniques" or (3) "Working with ML concepts and systems can help improve our understanding of AGI alignment", and I think of AAMLS as assuming some combination of 2 and 3, but not 1. From a post I wrote in July 2016:

[... AAMLS] is intended to help more in scenarios where advanced AI is relatively near and relatively directly descended from contemporary ML techniques, while our agent foundations agenda is more agnostic about when and how advanced AI will be developed.
As we recently wrote, we believe that developing a basic formal theory of highly reliable reasoning and decision-making “could make it possible to get very strong guarantees about the behavior of advanced AI systems — stronger than many currently think is possible, in a time when the most successful machine learning techniques are often poorly understood.” Without such a theory, AI alignment will be a much more difficult task.
The authors of “Concrete problems in AI safety” write that their own focus “is on the empirical study of practical safety problems in modern machine learning systems, which we believe is likely to be robustly useful across a broad variety of potential risks, both short- and long-term.” Their paper discusses a number of the same problems as the [AAMLS] agenda (or closely related ones), but directed more toward building on existing work and finding applications in present-day systems.
Where the agent foundations agenda can be said to follow the principle “start with the least well-understood long-term AI safety problems, since those seem likely to require the most work and are the likeliest to seriously alter our understanding of the overall problem space,” the concrete problems agenda [by Amodei, Olah, Steinhardt, Christiano, Schulman, and Mané] follows the principle “start with the long-term AI safety problems that are most applicable to systems today, since those problems are the easiest to connect to existing work by the AI research community.”
Taylor et al.’s new [AAMLS] agenda is less focused on present-day and near-future systems than “Concrete problems in AI safety,” but is more ML-oriented than the agent foundations agenda.
Comment by robbensinger on Are Humans 'Human Compatible'? · 2020-01-31T07:01:11.548Z · score: 2 (1 votes) · EA · GW

I agree that suffering is bad in all universes, for the reasons described in I'd say that "ethics... is not constituted by any feature of the universe" in the sense you note, but I'd point to our human brains if we were asking any question like:

  • What explains why "suffering is bad" is true in all universes? How could an agent realistically discover this truth -- how do we filter out the false moral claims and zero in on the true ones, and how could an alien do the same?
Comment by robbensinger on Are Humans 'Human Compatible'? · 2019-12-10T05:05:57.209Z · score: 2 (1 votes) · EA · GW
I doubt that a utilitarian ethic is useful for maximizing of human preferences, since utilitarianism is impartial in the sense that it takes everyone's wellbeing into account, human or otherwise.

The view I would advocate is that something like utilitarianism (i.e., some form of impartial, species-indifferent welfare maximization) is a core part of human values. What I mean by 'human values' here isn't on your list; it's closer to an idealized version of our preferences: what we would prefer if we were smarter, more knowledgeable, had greater self-control.

Russell's assumption that "The machine’s only objective is to maximize the realization of human preferences" seems to assume some controversial and (to my judgement) highly implausible moral views. In particular, it is speciesistic, for why should only human preferences be maximized? Why not animal or machine preferences?

The language of "human-compatible" is very speciesist, since ethically we should want AGI to be "compatible" with all moral patients, human or not.

On the other hand, the idea of using human brains as a "starting point" for identifying what's moral makes sense. "Which ethical system is correct?" isn't written in the stars or in Plato's heaven; it seems like if the answer is encoded anywhere in the universe, it must be encoded in our brains (or in logical constructs out of brains).

The same is true for identifying the right notion of "impartial", "fair", "compassionate", "taking other species' welfare into account", etc.; to figure out the correct moral account of those important values, you would primarily need to learn facts about human brains. You'd then need to learn facts about non-humans' brains in order to implement the resultant impartiality procedure (because the relevant criterion, "impartiality", says that whether you have human DNA is utterly irrelevant to moral conduct).

The need to bootstrap from values encoded in our brains doesn't and shouldn't mean that humans are the only moral patients (or even that we're particularly important moral patients; insects could turn out to be utility monsters, for all we know today). Hence "human-compatible" is an unfortunate phrase here.

But it does mean that if, e.g., it turns out that cats' ultimate true preferences are to torture all species forever, we shouldn't give that particular preference equal decision weight. Speaking very loosely, the goal is more like 'ensuring all beings get to have a good life', not like 'ensuring all species (however benevolent or sadistic they turn out to be) get an equal say in what kind of life all beings get to live'.

If there's a more benevolent species than humans, I'd hope that sufficiently advanced science could identify that species, and pass the buck to them. (In an odd sense, we're already building an alien species to defer to if we're constructing 'an idealized version of human preferences', since I would expect sufficiently idealized preferences to turn out to be pretty alien compared to the views human beings espouse today.)

I think it's reasonable to worry that given humans' flaws, humans might not in fact build AGI that 'ensures all beings get to have a good life'. But I do think that something like the latter is the goal; and when you ask me what physical facts in the world make that 'the goal', and what we would need to investigate in order to work out all the wrinkles and implementation details, I'm forced to initially point to facts about human brains (if only to identify the right notions of 'what a moral patient is' and 'how one ought to impartially take into account all moral patients' welfare').

Comment by robbensinger on EA Leaders Forum: Survey on EA priorities (data and analysis) · 2019-12-08T22:30:56.377Z · score: 35 (11 votes) · EA · GW

Speaking as a random EA — I work at an org that attended the forum (MIRI), but I didn't personally attend and am out of the loop on just about everything that was discussed there — I'd consider it a shame if CEA stopped sharing interesting take-aways from meetings based on an "everything should either be 100% publicly disclosed or 100% secret" policy.

I also don't think there's anything particularly odd about different orgs wanting different levels of public association with EA's brand, or having different levels of risk tolerance in general. EAs want to change the world, and the most leveraged positions in the world don't perfectly overlap with 'the most EA-friendly parts of the world'. Even in places where EA's reputation is fine today, it makes sense to have a diversified strategy where not every wonkish, well-informed, welfare-maximizing person in the world has equal exposure if EA's reputation takes a downturn in the future.

MIRI is happy to be EA-branded itself, but I'd consider it a pretty large mistake if MIRI started cutting itself off from everyone in the world who doesn't want to go all-in on EA (refuse to hear their arguments or recommendations, categorically disinvite them from any important meetings, etc.). So I feel like I'm logically forced to say this broad kind of thing is fine (without knowing enough about the implementation details in this particular case to weigh in on whether people are making all the right tradeoffs).

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-28T00:49:35.393Z · score: 2 (1 votes) · EA · GW

+1, I agree with all this.

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T07:24:25.987Z · score: 5 (5 votes) · EA · GW

This is a good discussion! Ben, thank you for inspiring so many of these different paths we've been going down. :) At some point the hydra will have to stop growing, but I do think the intuitions you've been sharing are widespread enough that it's very worthwhile to have public discussion on these points.

Therefore, when a member of the rationalist community uses the word "decision theory" to refer to a decision procedure, they are talking about something that's pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems don't directly speak to the questions that most academic "decision theorists" are actually debating with one another.

On the contrary:

  • MIRI is more interested in identifying generalizations about good reasoning ("criteria of rightness") than in fully specifying a particular algorithm.
  • MIRI does discuss decision algorithms in order to better understand decision-making, but this isn't different in kind from the ordinary way decision theorists hash things out. E.g., the traditional formulation of CDT is underspecified in dilemmas like Death in Damascus. Joyce and Arntzenius' response to this wasn't to go "algorithms are uncouth in our field"; it was to propose step-by-step procedures that they think capture the intuitions behind CDT and give satisfying recommendations for how to act.
  • MIRI does discuss "what decision procedure performs best", but this isn't any different from traditional arguments in the field like "naive EDT is wrong because it performs poorly in the smoking lesion problem". Compared to the average decision theorist, the average rationalist puts somewhat more weight on some considerations and less weight on others, but this isn't different in kind from the ordinary disagreements that motivate different views within academic decision theory, and these disagreements about what weight to give categories of consideration are themselves amenable to argument.
  • As I noted above, MIRI is primarily interested in decision theory for the sake of better understanding the nature of intelligence, optimization, embedded agency, etc., not for the sake of picking a "decision theory we should build into future AI systems". Again, this doesn't seem unlike the case of philosophers who think that decision theory arguments will help them reach conclusions about the nature of rationality.
I think it's totally conceivable that no criterion of rightness is correct (e.g. because the concept of a "criterion of rightness" turns out to be some spooky bit of nonsense that doesn't really map onto anything in the real world.)

Could you give an example of what the correctness of a meta-criterion like "Don't Make Things Worse" could in principle consist in?

I’m not looking here for a “reduction” in the sense of a full translation into other, simpler terms. I just want a way of making sense of how human brains can tell what’s “decision-theoretically normative” in cases like this.

Human brains didn’t evolve to have a primitive “normativity detector” that beeps every time a certain thing is Platonically Normative. Rather, different kinds of normativity can be understood by appeal to unmysterious matters like “things brains value as ends”, “things that are useful for various ends”, “things that accurately map states of affairs”...

When I think of other examples of normativity, my sense is that in every case there's at least one good account of why a human might be able to distinguish "truly" normative things from non-normative ones. E.g. (considering both epistemic and non-epistemic norms):

1. If I discover two alien species who disagree about the truth-value of "carbon atoms have six protons", I can evaluate their correctness by looking at the world and seeing whether their statement matches the world.

2. If I discover two alien species who disagree about the truth value of "pawns cannot move backwards in chess" or "there are statements in the language of Peano arithmetic that can neither be proved nor disproved in Peano arithmetic", then I can explain the rules of 'proving things about chess' or 'proving things about PA' as a symbol game, and write down strings of symbols that collectively constitute a 'proof' of the statement in question.

I can then assert that if any member of any species plays the relevant 'proof' game using the same rules, from now until the end of time, they will never prove the negation of my result, and (paper, pen, time, and ingenuity allowing) they will always be able to re-prove my result.

(I could further argue that these symbol games are useful ones to play, because various practical tasks are easier once we've accumulated enough knowledge about legal proofs in certain games. This usefulness itself provides a criterion for choosing between "follow through on the proof process" and "just start doodling things or writing random letters down".)

The above doesn't answer questions like "do the relevant symbols have Platonic objects as truthmakers or referents?", or "why do we live in a consistent universe?", or the like. But the above answer seems sufficient for rejecting any claim that there's something pointless, epistemically suspect, or unacceptably human-centric about affirming Gödel's first incompleteness theorem. The above is minimally sufficient grounds for going ahead and continuing to treat math as something more significant than theology, regardless of whether we then go on to articulate a more satisfying explanation of why these symbol games work the way they do.

3. If I discover two alien species who disagree about the truth-value of "suffering is terminally valuable", then I can think of at least two concrete ways to evaluate which parties are correct. First, I can look at the brains of a particular individual or group, see what that individual or group terminally values, and see whether the statement matches what's encoded in those brains. Commonly the group I use for this purpose is human beings, such that if an alien (or a housecat, etc.) terminally values suffering, I say that this is "wrong".

Alternatively, I can make different "wrong" predicates for each species: wrong_human, wrong_alien1, wrong_alien2, wrong_housecat, etc.

This has the disadvantage of maybe making it sound like all these values are on "equal footing" in an internally inconsistent way ("it's wrong to put undue weight on what's wrong_human!", where the first "wrong" is secretly standing in for "wrong_human"), but has the advantage of making it easy to see why the aliens' disagreement might be important and substantive, while still allowing that aliens' normative claims can be wrong (because they can be mistaken about their own core values).

The details of how to go from a brain to an encoding of "what's right" seem incredibly complex and open to debate, but it seems beyond reasonable dispute that if the information content of a set of terminal values is encoded anywhere in the universe, it's going to be in brains (or constructs from brains) rather than in patterns of interstellar dust, digits of pi, physical laws, etc.

If a criterion like “Don’t Make Things Worse” deserves a lot of weight, I want to know what that weight is coming from.

If the answer is “I know it has to come from something, but I don’t know what yet”, then that seems like a perfectly fine placeholder answer to me.

If the answer is “This is like the ‘terminal values’ case, in that (I hypothesize) it’s just an ineradicable component of what humans care about”, then that also seems structurally fine, though I’m extremely skeptical of the claim that the “warm glow of feeling causally efficacious” is important enough to outweigh other things of great value in the real world.

If the answer is “I think ‘Don’t Make Things Worse’ is instrumentally useful, i.e., more useful than UDT for achieving the other things humans want in life”, then I claim this is just false. But, again, this seems like the right kind of argument to be making; if CDT is better than UDT, then that betterness ought to consist in something.

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T06:48:02.423Z · score: 3 (2 votes) · EA · GW
Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there's no relevant sense in which the agent should have two-boxed.

No, I don't endorse this argument. To simplify the discussion, let's assume that the Newcomb predictor is infallible. FDT agents, CDT agents, and EDT agents each get a decision: two-box (which gets you $1000 plus an empty box), or one-box (which gets you $1,000,000 and leaves the $1000 behind). Obviously, insofar as they are in fact following the instructions of their decision theory, there's only one possible outcome; but it would be odd to say that a decision stops being a decision just because it's determined by something. (What's the alternative?)

I do endorse "given the predictor's perfect accuracy, it's impossible for the P_UDT agent to two-box and come away with $1,001,000". I also endorse "given the predictor's perfect accuracy, it's impossible for the P_CDT agent to two-box and come away with $1,001,000". Per the problem specification, no agent can two-box and get $1,001,000 or one-box and get $0. But this doesn't mean that no decision is made; it just means that the predictor can predict the decision early enough to fill the boxes accordingly.

(Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this "dominance" argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself. The reason non-CDT agents get more utility than CDT agents in Newcomb's problem is that they take into account that the predictor is a predictor when they construct their counterfactuals.)
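To make the payoff structure concrete, here is a minimal Python sketch of the infallible-predictor version of the problem described above. The function and constant names are hypothetical, chosen only for illustration; the dollar amounts are the ones from the problem statement. It separates the two outcomes that are actually reachable from the full four-cell table that the "dominance" argument appeals to.

```python
# Illustrative sketch only: names are hypothetical, amounts are from the problem statement.
SMALL = 1_000        # always available in the second box
BIG = 1_000_000      # placed in the opaque box only if one-boxing was predicted

ACTIONS = ("one-box", "two-box")


def payoff(action: str, prediction: str) -> int:
    """Money received for an action, given what the predictor predicted."""
    opaque = BIG if prediction == "one-box" else 0
    return opaque if action == "one-box" else opaque + SMALL


def reachable_outcomes() -> dict:
    """With an infallible predictor, the prediction always matches the action."""
    return {action: payoff(action, prediction=action) for action in ACTIONS}


def dominance_table() -> dict:
    """All four (action, prediction) cells, including the two the problem rules out."""
    return {(a, p): payoff(a, p) for a in ACTIONS for p in ACTIONS}


if __name__ == "__main__":
    print(reachable_outcomes())
    print(dominance_table())
```

The mismatched cells, ("two-box", "one-box") and ("one-box", "two-box"), are exactly the $1,001,000 and $0 outcomes that the problem specification excludes.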

In the transparent version of this dilemma, the agent who sees the $1M and one-boxes also "could have two-boxed", but if they had two-boxed, it would only have been after making a different observation. In that sense, if the agent has any lingering uncertainty about what they'll choose, the uncertainty goes away as soon as they see whether the box is full.

In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So -- insofar as we accept this kind of objection from determinism -- there seems to be something problematically non-naturalistic about discussing what "would have happened" if we built in one decision procedure or another.

No, there's nothing non-naturalistic about this. Consider the scenario you and I are in. Simplifying somewhat, we can think of ourselves as each doing meta-reasoning to try to choose between different decision algorithms to follow going forward; where the new things we learn in this conversation are themselves a part of that meta-reasoning.

The meta-reasoning process is deterministic, just like the object-level decision algorithms are. But this doesn't mean that we can't choose between object-level decision algorithms. Rather, the meta-reasoning (in spite of having deterministic causes) chooses either "I think I'll follow P_FDT from now on" or "I think I'll follow P_CDT from now on". Then the chosen decision algorithm (in spite of also having deterministic causes) outputs choices about subsequent actions to take. Meta-processes that select between decision algorithms (to put into an AI, or to run in your own brain, or to recommend to other humans, etc.) can make "real decisions", for exactly the same reason (and in exactly the same sense) that the decision algorithms in question can make real decisions.

It isn't problematic that all these processes require us to consider counterfactuals that (if we were omniscient) we would perceive as inconsistent/impossible. Deliberation, both at the object level and at the meta level, just is the process of determining the unique and only possible decision. Yet because we are uncertain about the outcome of the deliberation while deliberating, and because the details of the deliberation process do determine our decision (even as these details themselves have preceding causes), it feels from the inside of this process as though both options are "live", are possible, until the very moment we decide.

(See also Decisions are for making bad outcomes inconsistent.)

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T05:42:47.141Z · score: 4 (3 votes) · EA · GW
But there's nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT.

If the thing being argued for is "R_CDT plus P_SONOFCDT", then that makes sense to me, but is vulnerable to all the arguments I've been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT's "Don't Make Things Worse" principle.

If the thing being argued for is "R_CDT plus P_FDT", then I don't understand the argument. In what sense is P_FDT compatible with, or conducive to, R_CDT? What advantage does this have over "R_FDT plus P_FDT"? (Indeed, what difference between the two views would be intended here?)

So why shouldn't I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the additional step here involves an intuition that the criterion of rightness would not be self-effacing.

The argument against "R_CDT plus P_SONOFCDT" doesn't require any mention of self-effacingness; it's entirely sufficient to note that P_SONOFCDT gets less utility than P_FDT.

The argument against "R_CDT plus P_FDT" seems to demand some reference to self-effacingness or inconsistency, or triviality / lack of teeth. But I don't understand what this view would mean or why anyone would endorse it (and I don't take you to be endorsing it).

For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don't think this ambiguity matters much for the argument.

We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what "expected utility" means.
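As a rough illustration of what "actual average utility" could mean operationally, the following Python sketch tallies the realized payoff of two fixed policies over many simulated Newcomb rounds, without appealing to any decision theory's own notion of expectation. Everything here is an assumption made for the example: the function names, the use of Newcomb's problem as the test dilemma, the fallible predictor, and its 0.9 accuracy.

```python
import random

# Illustrative sketch only: the predictor model and its accuracy are assumptions.
SMALL, BIG = 1_000, 1_000_000


def always_one_box() -> str:
    return "one-box"


def always_two_box() -> str:
    return "two-box"


def play_round(policy, accuracy: float, rng: random.Random) -> int:
    """Run one Newcomb round against a predictor that is right with probability `accuracy`."""
    action = policy()
    other = "two-box" if action == "one-box" else "one-box"
    prediction = action if rng.random() < accuracy else other
    opaque = BIG if prediction == "one-box" else 0
    return opaque if action == "one-box" else opaque + SMALL


def average_utility(policy, accuracy: float = 0.9, rounds: int = 10_000, seed: int = 0) -> float:
    """Mean realized payoff of a policy over many simulated rounds."""
    rng = random.Random(seed)
    return sum(play_round(policy, accuracy, rng) for _ in range(rounds)) / rounds


if __name__ == "__main__":
    print("one-box policy:", average_utility(always_one_box))
    print("two-box policy:", average_utility(always_two_box))
```

The comparison is between totals that the policies actually collect, so it does not presuppose a causal or an evidential reading of "expected utility".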

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T05:22:21.379Z · score: 3 (2 votes) · EA · GW

So, as an experiment, I'm going to be a very obstinate reductionist in this comment. I'll insist that a lot of these hard-seeming concepts aren't so hard.

Many of them are complicated, in the fashion of "knowledge" -- they admit an endless variety of edge cases and exceptions -- but these complications are quirks of human cognition and language rather than deep insights into ultimate metaphysical reality. And where there's a simple core we can point to, that core generally isn't mysterious.

It may be inconvenient to paraphrase the term away (e.g., because it packages together several distinct things in a nice concise way, or has important emotional connotations, or does important speech-act work like encouraging a behavior). But when I say it "isn't mysterious", I mean it's pretty easy to see how the concept can crop up in human thought even if it doesn't belong on the short list of deep fundamental cosmic structure terms.

I would say that there's also at least a fourth way that philosophers often use the word "rational," which is also the main way I use the word "rational." This is to refer to an irreducibly normative concept.

Why is this a fourth way? My natural response is to say that normativity itself is either a messy, parochial human concept (like "love," "knowledge," "France"), or it's not (in which case it goes in bucket 2).

Some examples of concepts that are arguably irreducible are "truth," "set," "property," "physical," "existence," and "point."

Picking on the concept here that seems like the odd one out to me: I feel confident that there isn't a cosmic law (of nature, or of metaphysics, etc.) that includes "truth" as a primitive (unless the list of primitives is incomprehensibly long). I could see an argument for concepts like "intentionality/reference", "assertion", or "state of affairs", though the former two strike me as easy to explain in simple physical terms.

Mundane empirical "truth" seems completely straightforward. Then there's the truth of sentences like "Frodo is a hobbit", "2+2=4", "I could have been the president", "Hamburgers are more delicious than battery acid"... Some of these are easier or harder to make sense of in the naive correspondence model, but regardless, it seems clear that our colloquial use of the word "true" to refer to all these different statements is pre-philosophical, and doesn't reflect anything deeper than that "each of these sentences at least superficially looks like it's asserting some state of affairs, and each sentence satisfies the conventional assertion-conditions of our linguistic community".

I think that philosophers are really good at drilling down on a lot of interesting details and creative models for how we can try to tie these disparate speech-acts together. But I think there's also a common failure mode in philosophy of treating these questions as deeper, more mysterious, or more joint-carving than the facts warrant. Just because you can argue about the truthmakers of "Frodo is a hobbit" doesn't mean you're learning something deep about the universe (or even something particularly deep about human cognition) in the process.

[Parfit:] It is hard to explain the concept of a reason, or what the phrase ‘a reason’ means. Facts give us reasons, we might say, when they count in favour of our having some attitude, or our acting in some way. But ‘counts in favour of’ means roughly ‘gives a reason for’. Like some other fundamental concepts, such as those involved in our thoughts about time, consciousness, and possibility, the concept of a reason is indefinable in the sense that it cannot be helpfully explained merely by using words.

Suppose I build a robot that updates hypotheses based on observations, then selects actions that its hypotheses suggest will help it best achieve some goal. When the robot is deciding which hypotheses to put more confidence in based on an observation, we can imagine it thinking, "To what extent is observation o a [WORD] to believe hypothesis h?" When the robot is deciding whether it assigns enough probability to h to choose an action a, we can imagine it thinking, "To what extent is P(h)=0.7 a [WORD] to choose action a?" As a shorthand, when observation o updates a hypothesis h that favors an action a, the robot can also ask to what extent o itself is a [WORD] to choose a.

When two robots meet, we can moreover add that they negotiate a joint "compromise" goal that allows them to work together rather than fight each other for resources. In communicating with each other, they then start also using "[WORD]" where an action is being evaluated relative to the joint goal, not just the robot's original goal.

Thus when Robot A tells Robot B "I assign probability 90% to 'it's noon', which is [WORD] to have lunch", A may be trying to communicate that A wants to eat, or that A thinks eating will serve A and B's joint goal. (This gets even messier if the robots have an incentive to obfuscate which actions and action-recommendations are motivated by the personal goal vs. the joint goal.)

If you decide to relabel "[WORD]" as "reason", I claim that this captures a decent chunk of how people use the phrase "a reason". "Reason" is a suitcase word, but that doesn't mean there are no similarities between e.g. "data my goals endorse using to adjust the probability of a given hypothesis" and "probabilities-of-hypotheses my goals endorse using to select an action", or that the similarity is mysterious and ineffable.

(I recognize that the above story leaves out a lot of important and interesting stuff. Though past a certain point, I think the details will start to become Gettier-case nitpicks, as with most concepts.)
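
To make the robot story a bit more concrete, here is a minimal Python sketch. Everything in it (the two hypotheses, the likelihoods, the utilities, and the function names) is invented for illustration; the only point is that "to what extent is o a [WORD] to believe h?" and "to what extent is P(h) a [WORD] to choose a?" can both be read off ordinary Bayesian updating plus expected-utility maximization.

```python
# Toy robot. All hypotheses, numbers, and names here are hypothetical.

def update(prior, likelihood, observation):
    """Bayes' rule: return P(h | observation) for every hypothesis h."""
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

def expected_utility(action, beliefs, utility):
    """Average the utility of an action over the robot's current beliefs."""
    return sum(beliefs[h] * utility[(action, h)] for h in beliefs)

def choose(beliefs, utility, actions):
    """Pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, beliefs, utility))

prior = {"noon": 0.5, "not_noon": 0.5}
likelihood = {
    "noon": {"sun_overhead": 0.9, "sun_low": 0.1},
    "not_noon": {"sun_overhead": 0.1, "sun_low": 0.9},
}
# Utilities relative to the robot's personal goal (made-up values).
utility = {
    ("eat", "noon"): 10, ("eat", "not_noon"): -5,
    ("wait", "noon"): 0, ("wait", "not_noon"): 0,
}

posterior = update(prior, likelihood, "sun_overhead")
# One crude measure of how much the observation is a "[WORD]" to believe
# "noon": the shift it produces from prior to posterior.
shift_toward_noon = posterior["noon"] - prior["noon"]
action = choose(posterior, utility, ["eat", "wait"])
```

In this toy setup the observation moves P(noon) from 0.5 to roughly 0.9, and that probability in turn favors "eat". Swapping in a different utility table (say, one for a negotiated joint goal) changes what counts as a "[WORD]" for which action without changing the machinery.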

For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of "a rational choice" to the concept of "a winning choice" (or, in line with the type-2 conception you mention, a "utility-maximizing choice").

That essay isn't trying to "reduce" the term "rationality" in the sense of taking a pre-existing word and unpacking or translating it. The essay is saying that what matters is utility, and if a human being gets too invested in verbal definitions of "what the right thing to do is", they risk losing sight of the thing they actually care about and were originally in the game to try to achieve (i.e., their utility).

Therefore: if you're going to use words like "rationality", make sure that the words in question won't cause you to shoot yourself in the foot and take actions that will end up costing you utility (e.g., costing human lives, costing years of averted suffering, costing money, costing anything or everything). And if you aren't using "rationality" in a safe "nailed-to-utility" way, make sure that you're willing to turn on a dime and stop being "rational" the second your conception of rationality starts telling you to throw away value.

It ultimately seems hard, at least to me, to make non-vacuous true claims about what it's "rational" to do without evoking a non-reducible notion of "rationality."

"Rationality" is a suitcase word. It refers to lots of different things. On LessWrong, examples include not just "(systematized) winning" but (as noted in the essay) "Bayesian reasoning", or in Rationality: Appreciating Cognitive Algorithms, "cognitive algorithms or mental processes that systematically produce belief-accuracy or goal-achievement". In philosophy, the list is a lot longer.

The common denominator seems to largely be "something something reasoning / deliberation" plus (as you note) "something something normativity / desirability / recommendedness / requiredness".

The idea of "normativity" doesn't currently seem that mysterious to me either, though you're welcome to provide perplexing examples. My initial take is that it seems to be a suitcase word containing a bunch of ideas tied to:

  • Goals/preferences/values, especially overridingly strong ones.
  • Encouraged, endorsed, mandated, or praised conduct.

Encouraging, endorsing, mandating, and praising are speech-acts that seem very central to how humans perceive and intervene on social situations; and social situations seem pretty central to human cognition overall. So I don't think it's particularly surprising if words associated with such loaded ideas would have fairly distinctive connotations and seem to resist reduction, especially reduction that neglects the pragmatic dimensions of human communication and only considers the semantic dimension.

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T10:03:14.906Z · score: 1 (5 votes) · EA · GW
Whereas others think self-consistency is more important.

The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.

It's not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn't it come down to which principles you favor?

FDT gets you more utility than CDT. If you value literally anything in life more than you value "which ritual do I use to make my decisions?", then you should go with FDT over CDT; that's the core argument.

This argument for FDT would be question-begging if CDT proponents rejected utility as a desirable thing. But instead CDT proponents who are familiar with FDT agree utility is a positive, and either (a) they think there's no meaningful sense in which FDT systematically gets more utility than CDT (which I think is adequately refuted by Abram Demski), or (b) they think that CDT has other advantages that outweigh the loss of utility (e.g., CDT feels more intuitive to them).

The latter argument for CDT isn't circular, but as a fan of utility (i.e., of literally anything else in life), it seems very weak to me.

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T09:34:17.060Z · score: 3 (2 votes) · EA · GW

My impression is that most CDT advocates who know about FDT think FDT is making some kind of epistemic mistake, where the most popular candidate (I think) is some version of magical thinking.

Superstitious people often believe that it's possible to directly causally influence things across great distances of time and space. At a glance, FDT's prescription ("one-box, even though you can't causally affect whether the box is full") as well as its account of how and why this works ("you can somehow 'control' the properties of abstract objects like 'decision functions'") seem weird and spooky in the manner of a superstition.

FDT's response: if a thing seems spooky, that's a fine first-pass reason to be suspicious of it. But at some point, the accusation of magical thinking has to cash out in some sort of practical, real-world failure -- in the case of decision theory, some systematic loss of utility that isn't balanced by an equal, symmetric loss of utility from CDT. After enough experience of seeing a tool outperforming the competition in scenario after scenario, at some point calling the use of that tool "magical thinking" starts to ring rather hollow. At that point, it's necessary to consider the possibility that FDT is counter-intuitive but correct (like Einstein's "spukhafte Fernwirkung"), rather than magical.

In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:

2. I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on.

The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the "deterministic subprocess" view of our decision-making, we would find nothing strange about the idea that it's sometimes right for this subprocess to do locally incorrect things for the sake of better global results.

E.g., consider the transparent Newcomb problem with a 1% chance of predictor error. If we think of the brain's decision-making as a rule-governed system whose rules we are currently determining (via a meta-reasoning process that is itself governed by deterministic rules), then there's nothing strange about enacting a rule that gets us $1M in 99% of outcomes and $0 in 1% of outcomes; and following through when the unlucky 1% scenario hits us is nothing to agonize over, it's just a consequence of the rule we already decided. In that regard, steering the rule-governed system that is your brain is no different than designing a factory robot that performs well enough in 99% of cases to offset the 1% of cases where something goes wrong.
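
As a rough illustration of the arithmetic, here is a small Python sketch of that rule-level comparison. It is a simplified model, not a full treatment of the problem: I'm assuming the agent's rule is either "always one-box" or "always two-box", that the predictor guesses the rule correctly 99% of the time, and that the big box is full exactly when the predictor guessed "one-box". The function and constant names are made up.

```python
from fractions import Fraction

# Simplified transparent Newcomb problem with a fallible predictor.
ACCURACY = Fraction(99, 100)  # predictor is right 99% of the time
BIG = 1_000_000               # contents of the big box when it is full
SMALL = 1_000                 # contents of the small box

def payoff(rule, box_full):
    """Dollars received by an agent following `rule` in one world-state."""
    big = BIG if box_full else 0
    return big if rule == "one-box" else big + SMALL

def expected_payoff(rule):
    """Average payoff of a rule over the predictor's hits and misses."""
    # Assumption: the big box is full iff the predictor guessed "one-box".
    p_full = ACCURACY if rule == "one-box" else 1 - ACCURACY
    return p_full * payoff(rule, True) + (1 - p_full) * payoff(rule, False)

one_box_ev = expected_payoff("one-box")
two_box_ev = expected_payoff("two-box")

# Local view inside the unlucky branch (box already observed to be empty):
local_gain_from_swerving = payoff("two-box", False) - payoff("one-box", False)
```

Under these assumptions the one-boxing rule averages $990,000 and the two-boxing rule averages $11,000, while the "swerve" in the empty-box branch is worth exactly $1,000 locally. The disagreement is about which of those two numbers the decision should be evaluated against.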

(Note how a lot of these points are more intuitive in CS language. I don't think it's a coincidence that people coming from CS were able to improve on academic decision theory's ideas on these points; I think it's related to what kinds of stumbling blocks get in the way of thinking in these terms.)

Suppose you initially tell yourself:

"I'm going to one-box in all strictly-future transparent Newcomb problems, since this produces more expected causal (and evidential, and functional) utility. One-boxing and receiving $1M in 99% of future states is worth the $1000 cost of one-boxing in the other 1% of future states."

Suppose that you then find yourself facing the 1%-likely outcome where Omega leaves the box empty regardless of your choice. You then have a change of heart and decide to two-box after all, taking the $1000.

I claim that the above description feels from the inside like your brain is escaping the iron chains of determinism (even if your scientifically literate system-2 verbal reasoning fully recognizes that you're a deterministic process). And I claim that this feeling (plus maybe some reluctance to fully accept the problem description as accurate?) is the only thing that makes CDT's decision seem reasonable in this case.

In reality, however, if we end up not following through on our verbal commitment and we two-box in that 1% scenario, then this would just prove that we'd been mistaken about what rule we had successfully installed in our brains. As it turns out, we were really following the lower-global-utility rule from the outset. A lack of follow-through or a failure of will is itself a part of the decision-making process that Omega is predicting; however much it feels as though a last-minute swerve is you "getting away with something", it's really just you deterministically following through on an algorithm that will get you less utility in 99% of scenarios (while happening to be bad at predicting your own behavior and bad at following through on verbalized plans).

I should emphasize that the above is my own attempt to characterize the intuitions behind CDT and FDT, based on the arguments I've seen in the wild and based on what makes me feel more compelled by CDT, or by FDT. I could easily be wrong about the crux of disagreement between some CDT and FDT advocates.

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T08:48:32.471Z · score: 5 (4 votes) · EA · GW

I mostly agree with this. I think the disagreement between CDT and FDT/UDT advocates is less about definitions, and more about which of these things feels more compelling:

  • 1. On the whole, FDT/UDT ends up with more utility.

(I think this intuition tends to hold more force with people the more emotionally salient "more utility" is to them. E.g., consider a version of Newcomb's problem where two-boxing gets you $100, while one-boxing gets you $100,000 and saves your child's life.)

  • 2. I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on.

(I think this intuition tends to hold more force with people the more emotionally salient it is to imagine the dollars sitting right there in front of you and you knowing that it's "too late" for one-boxing to get you any more utility in this world.)

There are other considerations too, like how much it matters to you that CDT isn't self-endorsing. CDT prescribes self-modifying in all future dilemmas so that you behave in a more UDT-like way. It's fine to say that you personally lack the willpower to follow through once you actually get into the dilemma and see the boxes sitting in front of you; but it's still the case that a sufficiently disciplined and foresightful CDT agent will generally end up behaving like FDT in the very dilemmas that have been cited to argue for CDT.

If a more disciplined and well-prepared version of you would have one-boxed, then isn't there something off about saying that two-boxing is in any sense "correct"? Even the act of praising CDT seems a bit self-destructive here, inasmuch as (a) CDT prescribes ditching CDT, and (b) realistically, praising or identifying with CDT is likely to make it harder for a human being to follow through on switching to son-of-CDT (as CDT prescribes).

Mind you, if the sentence "CDT is the most rational decision theory" is true in some substantive, non-trivial, non-circular sense, then I'm inclined to think we should acknowledge this truth, even if it makes it a bit harder to follow through on the EDT+CDT+UDT prescription to one-box in strictly-future Newcomblike problems. When the truth is inconvenient, I tend to think it's better to accept that truth than to linguistically conceal it.

But the arguments I've seen for "CDT is the most rational decision theory" to date have struck me as either circular, or as reducing to "I know CDT doesn't get me the most utility, but something about it just feels right".

It's fine, I think, if "it just feels right" is meant to be a promissory note for some forthcoming account — a clue that there's some deeper reason to favor CDT, though we haven't discovered it yet. As the FDT paper puts it:

These are odd conclusions. It might even be argued that sufficiently odd behavior provides evidence that what FDT agents see as “rational” diverges from what humans see as “rational.” And given enough divergence of that sort, we might be justified in predicting that FDT will systematically fail to get the most utility in some as-yet-unknown fair test.

On the other hand, if "it just feels right" is meant to be the final word on why "CDT is the most rational decision theory", then I feel comfortable saying that "rational" is a poor choice of word here, and neither maps onto a key descriptive category nor maps onto any prescription or norm worthy of being followed.

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T07:52:20.219Z · score: 5 (3 votes) · EA · GW

I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!

If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesn't really have an answer.

Some ancient Greeks thought that the planets were intelligent beings; yet many of the Greeks' astronomical observations, and some of their theories and predictive tools, were still true and useful.

I think that terms like "normative" and "rational" are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauser's pluralistic moral reductionism).

I would say that (1) some philosophers use "rational" in a very human-centric way, which is fine as long as it's done consistently; (2) others have a much thinner conception of "rational", such as 'tending to maximize utility'; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of "rationality", but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.

I think that type-1, type-2, and type-3 decision theorists have all contributed valuable AI-relevant conceptual progress in the past (most obviously, by formulating Newcomb's problem, EDT, and CDT), and I think all three could do more of the same in the future. I think the type-3 decision theorists are making a mistake, but often more in the fashion of an ancient astronomer who's accumulating useful and real knowledge but happens to have some false side-beliefs about the object of study, not in the fashion of a theologian whose entire object of study is illusory. (And not in the fashion of a developmental psychologist or historian whose subject matter is too human-centric to directly bear on game theory, AI, etc.)

I'd expect type-2 decision theorists to tend to be interested in more AI-relevant things than type-1 decision theorists, but on the whole I think the flavor of decision theory as a field has ended up being more type-2/3 than type-1. (And in this case, even type-1 analyses of "rationality" can be helpful for bringing various widespread background assumptions to light.)

If I'm someone with a twin and I'm implementing P_CDT, I still don't think I will choose to modify myself to cooperate in twin prisoner's dilemmas. The reason is that modifying myself won't cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.

This is true if your twin was copied from you in the past. If your twin will be copied from you in the future, however, then you can indeed cause your twin to cooperate, assuming you have the ability to modify your own future decision-making so as to follow son-of-CDT's prescriptions from now on.

Making the commitment to always follow son-of-CDT is an action you can take; the mechanistic causal consequence of this action is that your future brain and any physical systems that are made into copies of your brain in the future will behave in certain systematic ways. So from your present perspective (as a CDT agent), you can causally control future copies of yourself, as long as the act of copying hasn't happened yet.

(And yes, by the time you actually end up in the prisoner's dilemma, your future self will no longer be able to causally affect your copy. But this is irrelevant from the perspective of present-you; to follow CDT's prescriptions, present-you just needs to pick the action that you currently judge will have the best consequences, even if that means binding your future self to take actions contrary to CDT's future prescriptions.)

(If it helps, don't think of the copy of you as "you": just think of it as another environmental process you can influence. CDT prescribes taking actions that change the behavior of future copies of yourself in useful ways, for the same reason CDT prescribes actions that change the future course of other physical processes.)
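
Here is a deliberately crude Python sketch of that timing dependence, in case it helps. The timestamps, the payoff numbers, and the function names are all hypothetical, and the assumption that the twin ends up making the same move as you is a simplification I'm building in, not something the code derives.

```python
# Hypothetical prisoner's dilemma payoffs, keyed by (my move, twin's move).
PAYOFF = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5,
    ("defect", "defect"): 1,
}

def son_of_cdt_move(adopted_at, copied_at):
    """Move chosen in a twin prisoner's dilemma.

    adopted_at: time at which the agent committed to son-of-CDT
    copied_at:  time at which the twin was copied from the agent
    """
    if copied_at > adopted_at:
        # The copy was made after the commitment, so the commitment is a
        # causal ancestor of the twin's behavior: cooperate.
        return "cooperate"
    # The copy predates the commitment, so nothing the agent did at
    # adoption time could have causally shaped the twin: defect.
    return "defect"

def twin_dilemma_payoff(adopted_at, copied_at):
    """My payoff, assuming (as a simplification) the twin mirrors my move."""
    my_move = son_of_cdt_move(adopted_at, copied_at)
    twin_move = my_move
    return PAYOFF[(my_move, twin_move)]

copied_later = twin_dilemma_payoff(adopted_at=1, copied_at=2)
copied_earlier = twin_dilemma_payoff(adopted_at=2, copied_at=1)
```

With these made-up payoffs, the agent whose twin is copied after the commitment ends up at mutual cooperation, and the agent whose twin was copied beforehand ends up at mutual defection.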

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-23T03:06:40.896Z · score: 7 (5 votes) · EA · GW
But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the “Don’t Make Things Worse Principle,” which says that: It’s not rational to take decisions that will definitely make things worse. Will’s Bomb case is an example of a case where R_UDT violates this principle, which is very similar to his “Guaranteed Payoffs Principle.”

I think "Don't Make Things Worse" is a plausible principle at first glance.

One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility). The general policy of following the "Don't Make Things Worse Principle" makes things worse.

Once you've already adopted son-of-CDT, which says something like "act like UDT in future dilemmas insofar as the correlations were produced after I adopted this rule, but act like CDT in those dilemmas insofar as the correlations were produced before I adopted this rule", it's not clear to me why you wouldn't just go: "Oh. CDT has lost the thing I thought made it appealing in the first place, this 'Don't Make Things Worse' feature. If we're going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?"

A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state. From Abram Demski's comments:

[...] In Bomb, the problem clearly stipulates that an agent who follows the FDT recommendation has a trillion trillion to one odds of doing better than an agent who follows the CDT/EDT recommendation. Complaining about the one-in-a-trillion-trillion chance that you get the bomb while being the sort of agent who takes the bomb is, to an FDT-theorist, like a gambler who has just lost a trillion-trillion-to-one bet complaining that the bet doesn't look so rational now that the outcome is known with certainty to be the one-in-a-trillion-trillion case where the bet didn't pay well.
[...] One way of thinking about this is to say that the FDT notion of "decision problem" is different from the CDT or EDT notion, in that FDT considers the prior to be of primary importance, whereas CDT and EDT consider it to be of no importance. If you had instead specified 'bomb' with just the certain information that 'left' is (causally and evidentially) very bad and 'right' is much less bad, then CDT and EDT would regard it as precisely the same decision problem, whereas FDT would consider it to be a radically different decision problem.
Another way to think about this is to say that FDT "rejects" decision problems which are improbable according to their own specification. In cases like Bomb where the situation as described is by its own description a one in a trillion trillion chance of occurring, FDT gives the outcome only one-trillion-trillion-th consideration in the expected utility calculation, when deciding on a strategy.


[...] This also hopefully clarifies the sense in which I don't think the decisions pointed out in (III) are bizarre. The decisions are optimal according to the very probability distribution used to define the decision problem.
There's a subtle point here, though, since Will describes the decision problem from an updated perspective -- you already know the bomb is in front of you. So UDT "changes the problem" by evaluating "according to the prior". From my perspective, because the very statement of the Bomb problem suggests that there were also other possible outcomes, we can rightly insist to evaluate expected utility in terms of those chances.
Perhaps this sounds like an unprincipled rejection of the Bomb problem as you state it. My principle is as follows: you should not state a decision problem without having in mind a well-specified way to predictably put agents into that scenario. Let's call the way-you-put-agents-into-the-scenario the "construction". We then evaluate agents on how well they deal with the construction.
For examples like Bomb, the construction gives us the overall probability distribution -- this is then used for the expected value which UDT's optimality notion is stated in terms of.
For other examples, as discussed in Decisions are for making bad outcomes inconsistent, the construction simply breaks when you try to put certain decision theories into it. This can also be a good thing; it means the decision theory makes certain scenarios altogether impossible.
Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-23T02:48:53.788Z · score: 4 (4 votes) · EA · GW
Another way to express the distinction I have in mind is that it’s between (a) a normative claim and (b) a process of making decisions.

This is similar to how you described it here:

Let’s suppose that some decisions are rational and others aren’t. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. [...]
When philosophers talk about “CDT,” for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, “CDT” is the claim that a decision is rational iff taking it would cause the largest expected increase in value. To avoid any ambiguity, let’s label this claim R_CDT.
We can also talk about “decision procedures.” A decision procedure is just a process or algorithm that an agent follows when making decisions.

This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it's normative, it can be either an algorithm/procedure that's being recommended, or a criterion of rightness like "a decision is rational iff taking it would cause the largest expected increase in value" (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are "normative" or "endorsed").

Some of your discussion above seems to be focusing on the "algorithmic?" dimension, while other parts seem focused on "normative?". I'll say more about "normative?" here.

The reason I proposed the three distinctions in my last comment and organized my discussion around them is that I think they're pretty concrete and crisply defined. It's harder for me to accidentally switch topics or bundle two different concepts together when talking about "trying to optimize vs. optimizing as a side-effect", "directly optimizing vs. optimizing via heuristics", "initially optimizing vs. self-modifying to optimize", or "function vs. algorithm".

In contrast, I think "normative" and "rational" can mean pretty different things in different contexts, it's easy to accidentally slide between different meanings of them, and their abstractness makes it easy to lose track of what's at stake in the discussion.

E.g., "normative" is often used in the context of human terminal values, and it's in this context that statements like this ring obviously true:

I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding “how decision-making works,” since normative claims don’t typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don’t think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making.

If we're treating decision-theoretic norms as being like moral norms, then sure. I think there are basically three options:

  • Decision theory isn't normative.
  • Decision theory is normative in the way that "murder is bad" or "improving aggregate welfare is good" is normative, i.e., it expresses an arbitrary terminal value of human beings.
  • Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).

Probability theory has obvious normative force in the context of reasoning and decision-making, but it's not therefore arbitrary or irrelevant to understanding human cognition, AI, etc.

A lot of the examples you've cited are theories from moral philosophy about what's terminally valuable. But decision theory is generally thought of as the study of how to make the right decisions, given a set of terminal preferences; it's not generally thought of as the study of which decision-making methods humans happen to terminally prefer to employ. So I would put it in category 1 or 3.

You could indeed define an agent that terminally values making CDT-style decisions, but I don't think most proponents of CDT or EDT would claim that their disagreement with UDT/FDT comes down to a values disagreement like that. Rather, they'd claim that rival decision theorists are making some variety of epistemic mistake. (And I would agree that the disagreement comes down to one party or the other making an epistemic mistake, though I obviously disagree about who's mistaken.)

I actually don’t think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won’t cause you to achieve better outcomes here.)

In the twin prisoner's dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).

I think you can model the voting dilemma the same way, just with noise added because the level of correlation is imperfect and/or uncertain. Ten agents following the same decision procedure are trying to decide whether to stay home and watch a movie (which gives a small guaranteed benefit) or go to the polls (which costs them the utility of the movie, but gains them a larger utility iff the other nine agents go to the polls too). Ten FDT agents will vote in this case, if they know that the other agents will vote under similar conditions.
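
One very rough way to put numbers on this, as a Python sketch: treat the "noise" as a single probability that each other agent's choice matches yours, independently across agents. That independence assumption, the utilities, and the function names are all mine and purely illustrative; a real treatment of imperfect correlation would be messier.

```python
# Toy model of the ten-agent voting dilemma. All numbers are hypothetical.

N_OTHERS = 9  # the other agents running the same decision procedure

def eu_vote(match_prob, win_utility):
    """Expected utility of going to the polls.

    match_prob: assumed chance that each other agent's choice matches
    mine, treated as independent across agents (a simplification).
    Voting only pays off if all nine others vote too.
    """
    return (match_prob ** N_OTHERS) * win_utility

def decide(match_prob, movie_utility, win_utility):
    """Vote iff the expected gain beats the guaranteed movie."""
    if eu_vote(match_prob, win_utility) > movie_utility:
        return "vote"
    return "stay home"

perfect = decide(1.0, movie_utility=1, win_utility=10)
noisy = decide(0.9, movie_utility=1, win_utility=10)
very_noisy = decide(0.5, movie_utility=1, win_utility=10)
```

With perfect correlation the agents vote whenever the win is worth more than the movie. As the assumed correlation drops, the payoff from voting shrinks quickly (it scales with the ninth power of the match probability here), so at 0.5 the toy agent stays home even though the win is worth ten movies.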

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-22T22:23:33.949Z · score: 12 (6 votes) · EA · GW

I agree that these three distinctions are important:

  • "Picking policies based on whether they satisfy a criterion X" vs. "Picking policies that happen to satisfy a criterion X". (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
  • "Trying to follow a decision rule Y 'directly' or 'on the object level'" vs. "Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y". (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you've come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)
  • "A decision rule that prescribes outputting some action or policy and doesn't care how you do it" vs. "A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy". (E.g., a rule that says 'maximize the aggregate welfare of moral patients' vs. a specific mental algorithm intended to achieve that end.)

The first distinction above seems less relevant here, since we're mostly discussing AI systems and humans that are self-aware about their decision criteria and explicitly "trying to do what's right".

As a side-note, I do want to emphasize that from the MIRI cluster's perspective, it's fine for correct reasoning in AGI to arise incidentally or implicitly, as long as it happens somehow (and as long as the system's alignment-relevant properties aren't obscured and the system ends up safe and reliable).

The main reason to work on decision theory in AI alignment has never been "What if people don't make AI 'decision-theoretic' enough?" or "What if people mistakenly think CDT is correct and so build CDT into their AI system?" The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we've even been misunderstanding basic things at the level of "decision-theoretic criterion of rightness".

It's not that I want decision theorists to try to build AI systems (even notional ones). It's that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That's part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).

The second distinction ("following a rule 'directly' vs. following it by adopting a sub-rule or via self-modification") seems more relevant. You write:

My understanding is that when philosophers talk about “CDT,” they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.
The difference matters, because people who believe that R_CDT is true don’t generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves.

Far from being a distinction proponents of UDT/FDT neglect, this is one of the main grounds on which UDT/FDT proponents criticize CDT (from within the "success-first" tradition). This is because agents that are reflectively inconsistent in the manner of CDT -- ones that take actions they know they'll regret taking, wish they were following a different decision rule, etc. -- can be money-pumped and can otherwise lose arbitrary amounts of value.

A human following CDT should endorse "stop following CDT," since CDT isn't self-endorsing. It's not even that they should endorse "keep following CDT, but adopt a heuristic or sub-rule that helps us better achieve CDT ends"; they need to completely abandon CDT even at the meta-level of "what sort of decision rule should I follow?" and modify themselves into purely following an entirely new decision rule, or else they'll continue to perform poorly by CDT's lights.

The decision rule that CDT does endorse loses a lot of the apparent elegance and naturalness of CDT. This rule, "son-of-CDT", is roughly:

  • Have whatever disposition-to-act gets the most utility, unless I'm in future situations like "a twin prisoner's dilemma against a perfect copy of my future self where the copy was forked from me before I started following this rule", in which case ignore my correlation with that particular copy and make decisions as though our behavior is independent (while continuing to take into account my correlation with any copies of myself I end up in prisoner's dilemmas with that were copied from my brain after I started following this rule).
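As a rough sketch of how odd this rule looks when written down (the `Copy` type, the timestamps, and the function names are hypothetical, just for illustration), the only thing that decides whether son-of-CDT cooperates with a perfect copy is when the copy was forked:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """A perfect copy of the agent; `forked_at` is a hypothetical timestamp."""
    name: str
    forked_at: float


def treats_as_correlated(copy: Copy, switchover_time: float) -> bool:
    """Son-of-CDT only takes into account correlations that were
    created after it started following the new rule."""
    return copy.forked_at > switchover_time


def twin_dilemma_move(copy: Copy, switchover_time: float) -> str:
    """Move in a twin prisoner's dilemma against `copy`."""
    if treats_as_correlated(copy, switchover_time):
        return "cooperate"  # the copy's choice is treated as mirroring mine
    return "defect"         # the copy's choice is treated as independent
```

Two copies that are equally good mirrors of the agent get opposite treatment purely because of the fork date.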

The fact that CDT doesn't endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don't), and the fact that the theory it endorses is a strange Frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.

But this decision rule CDT endorses also still performs suboptimally (from the perspective of success-first decision theory). See the discussion of the Retro Blackmail Problem in "Toward Idealized Decision Theory", where "CDT and any decision procedure to which CDT would self-modify see losing money to the blackmailer as the best available action."

In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agents' due to events that happened after she turned 20 (such as "the summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theory"). But she'll refuse to coordinate for reasons like "we hung out a lot the summer before my 20th birthday", "we spent our whole childhoods and teen years living together and learning from the same teachers", and "we all have similar decision-making faculties due to being members of the same species". There's no principled reason to draw this temporal distinction; it's just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.

Regarding the third distinction ("prescribing a certain kind of output vs. prescribing a step-by-step mental procedure for achieving that kind of output"), I'd say that it's primarily the criterion of rightness that MIRI-cluster researchers care about. This is part of why the paper is called "Functional Decision Theory" and not (e.g.) "Algorithmic Decision Theory": the focus is explicitly on "what outcomes do you produce?", not on how you produce them.

(Thus, an FDT agent can cooperate with another agent whenever the latter agent's input-output relations match FDT's prescription in the relevant dilemmas, regardless of what computations they do to produce those outputs.)
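A toy illustration of "same input-output relations, different computations" (the agents and dilemma labels here are made up for the example, not taken from the FDT paper):

```python
from typing import Callable, Iterable

# A policy maps a dilemma label to an action label.
Policy = Callable[[str], str]


def lookup_table_agent(dilemma: str) -> str:
    """Produces its outputs from a hard-coded table."""
    table = {"twin_prisoners_dilemma": "cooperate", "newcomb": "one-box"}
    return table[dilemma]


def rule_based_agent(dilemma: str) -> str:
    """Produces the same outputs via a different computation."""
    if dilemma == "newcomb":
        return "one-box"
    return "cooperate"


def always_defect_agent(dilemma: str) -> str:
    """Behaves differently in the relevant dilemmas."""
    return "two-box" if dilemma == "newcomb" else "defect"


def same_input_output(a: Policy, b: Policy, dilemmas: Iterable[str]) -> bool:
    """True if the two policies give the same output on every listed dilemma."""
    return all(a(d) == b(d) for d in dilemmas)
```

On this picture, what matters for cooperation is whether `same_input_output` holds on the relevant dilemmas, not which of the two implementations is running.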

The main reasons I think academic decision theory should spend more time coming up with algorithms that satisfy their decision rules are that (a) this has a track record of clarifying what various decision rules actually prescribe in different dilemmas, and (b) this has a track record of helping clarify other issues in the "understand what good reasoning is" project (e.g., logical uncertainty) and how they relate to decision theory.

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-21T22:41:33.710Z · score: 20 (15 votes) · EA · GW

The comments here have been very ecumenical, but I'd like to propose a different account of the philosophy/AI divide on decision theory:

1. "What makes a decision 'good' if the decision happens inside an AI?" and "What makes a decision 'good' if the decision happens inside a brain?" aren't orthogonal questions, or even all that different; they're two different ways of posing the same question.

MIRI's AI work is properly thought of as part of the "success-first decision theory" approach in academic decision theory, described by Greene (2018) (who also cites past proponents of this way of doing decision theory):

[...] Consider a theory that allows the agents who employ it to end up rich in worlds containing both classic and transparent Newcomb Problems. This type of theory is motivated by the desire to draw a tighter connection between rationality and success, rather than to support any particular account of expected utility. We might refer to this type of theory as a "success-first" decision theory.
[...] The desire to create a closer connection between rationality and success than that offered by standard decision theory has inspired several success-first decision theories over the past three decades, including those of Gauthier (1986), McClennen (1990), and Meacham (2010), as well as an influential account of the rationality of intention formation and retention in the work of Bratman (1999). McClennen (1990: 118) writes: “This is a brief for rationality as a positive capacity, not a liability—as it must be on the standard account.” Meacham (2010: 56) offers the plausible principle, “If we expect the agents who employ one decision making theory to generally be richer than the agents who employ some other decision making theory, this seems to be a prima facie reason to favor the first theory over the second.” And Gauthier (1986: 182–3) proposes that “a [decision-making] disposition is rational if and only if an actor holding it can expect his choices to yield no less utility than the choices he would make were he to hold any alternative disposition.” In slogan form, Gauthier (1986: 187) calls the idea “utility-maximization at the level of dispositions,” Meacham (2010: 68–9) a “cohesive” decision theory, McClennen (1990: 6–13) a form of “pragmatism,” and Bratman (1999: 66) a “broadly consequentialist justification” of rational norms.
[...] Accordingly, the decision theorist’s job is like that of an engineer in inventing decision theories, and like that of a scientist in testing their efficacy. A decision theorist attempts to discover decision theories (or decision “rules,” “algorithms,” or “processes”) and determine their efficacy, under certain idealizing conditions, in bringing about what is of ultimate value.
Someone who holds this view might be called a methodological hypernaturalist, who recommends an experimental approach to decision theory. On this view, the decision theorist is a scientist of a special sort, but their goal should be broadly continuous with that of scientific research. The goal of determining efficacy in bringing about value, for example, is like that of a pharmaceutical scientist attempting to discover the efficacy of medications in treating disease.
For game theory, Thomas Schelling (1960) was a proponent of this view. The experimental approach is similar to what Schelling meant when he called for “a reorientation of game theory” in Part 2 of The Strategy of Conflict. Schelling argues that a tendency to focus on first principles, rather than upshots, makes game-theoretic theorizing shockingly blind to rational strategies in coordination problems.

The FDT paper does a poor job of contextualizing itself because it was written by AI researchers who are less well-versed in the philosophical literature.

MIRI's work is both advocating a particular solution to the question "what kind of decision theory satisfies the 'success' criterion?", and lending some additional support to the claim that "success-first" is a coherent and reasonable criterion for decision theorists to orient towards. (In a world without ideas like UDT, it was harder to argue that we should try to reduce decision theory to 'what decision-making approach yields the best utility?', since neither CDT nor EDT strictly outperforms the other; whereas there's a strong case that UDT does strictly outperform both CDT and EDT, to the extent it's possible for any decision theory to strictly outperform another; though there may be even-better approaches.)

You can go with Paul and say that a lot of these distinctions are semantic rather than substantive -- that there isn't a true, ultimate, objective answer to the question of whether we should evaluate decision theories by whether they're successful, vs. some other criterion. But dissolving contentious arguments and showing why they're merely verbal is itself a hallmark of analytic philosophy, so this doesn't do anything to make me think that these issues aren't the proper province of academic decision theory.

2. Rather than operating in separate magisteria, people like Wei Dai are making contrary claims about how humans should make decisions. This is easiest to see in contexts where a future technology comes along: if whole-brain emulation were developed tomorrow and it was suddenly trivial to put CDT proponents in literal twin prisoner's dilemmas, the CDT recommendation to defect (two-box, etc.) suddenly makes a very obvious and real difference.
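Here is a minimal sketch of why the difference becomes so visible against a literal copy. The payoff numbers are the usual textbook-style ones and are an assumption on my part, as are the function names:

```python
# Hypothetical prisoner's dilemma payoffs for (my_move, twin_move).
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def cdt_choice(p_twin_cooperates: float) -> str:
    """Treat the twin's move as causally fixed, with some credence that
    the twin cooperates, and maximize expected payoff."""
    def expected(move: str) -> float:
        return (p_twin_cooperates * PAYOFF[(move, "C")]
                + (1 - p_twin_cooperates) * PAYOFF[(move, "D")])
    return max(("C", "D"), key=expected)


def twin_aware_choice() -> str:
    """Treat a perfect copy's move as necessarily equal to my own,
    so only (C, C) and (D, D) are reachable outcomes."""
    return max(("C", "D"), key=lambda move: PAYOFF[(move, move)])
```

Whatever credence the first agent has about its twin, it defects, so two such copies end up at (D, D); two copies reasoning the second way end up at (C, C), which pays more to each.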

I claim (as someone who thinks UDT/FDT is correct) that the reason it tends to be helpful to think about advanced technologies is that it draws out the violations of naturalism that are often implicit in how we talk about human reasoning. Our native way of thinking about concepts like "control," "choice," and "counterfactual" tends to be confused, and bringing in things like predictors and copies of our reasoning draws out those confusions in much the same way that sci-fi thought experiments and the development of new technologies have repeatedly helped clarify confused thinking in philosophy of consciousness, philosophy of personal identity, philosophy of computation, etc.

3. Quoting Paul:

Most causal decision theorists would agree that if they had the power to stop doing the right thing, they should stop taking actions which are right. They should instead be the kind of person that you want to be.
And so there, again, I agree it has implications, but I don't think it's a question of disagreement about truth. It's more a question of, like: you're actually making some cognitive decisions. How do you reason? How do you conceptualize what you're doing?

I would argue that most philosophers who feel "trapped by rationality" or "unable to stop doing what's 'right,' even though they know they 'should,'" could in fact escape the trap if they saw the flaws in whatever reasoning process led them to their current idea of "rationality" in the first place. I think a lot of people are reasoning their way into making worse decisions (at least in the future/hypothetical scenarios noted above, though I would be very surprised if correct decision-theoretic views had literally no implications for everyday life today) due to object-level misconceptions about the prescriptions and flaws of different decision theories.

And all of this strikes me as very much the bread and butter of analytic philosophy. Philosophers unpack and critique the implicit assumptions in different ways of modeling the world (e.g., "of course I can 'control' physical outcomes but can't 'control' mathematical facts", or "of course I can just immediately tell that I'm in the 'real world'; a simulation of me isn't me, or wouldn't be conscious, etc."). I think MIRI just isn't very good at dialoguing with philosophers, and has had too many competing priorities to put the amount of effort into a scholarly dialogue that I wish were being made.

4. There will obviously be innumerable practical differences between the first AGI systems and human decision-makers. However, putting a huge amount of philosophical weight on this distinction will tend to violate naturalism: ceteris paribus, changing whether you run a cognitive process in carbon or in silicon doesn't change whether the process is doing the right thing or working correctly.

E.g., the rules of arithmetic are the same for humans and calculators, even though we don't use identical algorithms to answer particular questions. Humans tend to correctly treat calculators naturalistically: we often think of them as an extension of our own brains and reasoning, we freely switch back and forth between running a needed computation in our own brain vs. in a machine, etc. Running a decision-making algorithm in your brain vs. in an AI shouldn't be fundamentally different, I claim.

5. For similar reasons, a naturalistic way of thinking about the task "delegating a decision-making process to a reasoner outside your own brain" will itself not draw a deep philosophical distinction between "a human building an AI to solve a problem" and "an AI building a second AI to solve a problem" or for that matter "an agent learning over time and refining its own reasoning process so it can 'delegate' to its future self".

There will obviously be practical differences, but there will also be practical differences between two different AI designs. We don't assume that switching to a different design within AI means that the background rules of decision theory (or arithmetic, etc.) go out the window.

(Another way of thinking about this is that the distinction between "natural" and "artificial" intelligence is primarily a practical and historical one, not one that rests on a deep truth of computer science or rational agency; a more naturalistic approach would think of humans more as a weird special case of the extremely heterogeneous space of "(A)I" designs.)

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-19T17:26:19.168Z · score: 9 (4 votes) · EA · GW

Oops, I saw your question when you first posted it on LessWrong but forgot to get back to you, Issa. My apologies.

I think there are two main kinds of strategic thought we had in mind when we said "details forthcoming":

  • 1. Thoughts on MIRI's organizational plans, deconfusion research, and how we think MIRI can help play a role in improving the future. This is covered by our November 2018 update post.
  • 2. High-level thoughts on things like "what we think AGI developers probably need to do" and "what we think the world probably needs to do" to successfully navigate the acute risk period.

Most of the stuff discussed in "strategic background" is about 2: not MIRI's organizational plan, but our model of some of the things humanity likely needs to do in order for the long-run future to go well. Some of these topics are reasonably sensitive, and we've gone back and forth about how best to talk about them.

Within the macrostrategy / "high-level thoughts" part of the post, the densest part was maybe 7a. The criteria we listed for a strategically adequate AGI project were "strong opsec, research closure, trustworthy command, a commitment to the common good, security mindset, requisite resource levels, and heavy prioritization of alignment work".

With most of these it's reasonably clear what's meant in broad strokes, though there's a lot more I'd like to say about the specifics. "Trustworthy command" and "a commitment to the common good" are maybe the most opaque. By "trustworthy command" we meant things like:

  • The organization's entire command structure is fully aware of the difficulty and danger of alignment.
  • Non-technical leadership can't interfere and won't object if technical leadership needs to delete a code base or abort the project.

By "a commitment to the common good" we meant a commitment to both short-term goodness (the immediate welfare of present-day Earth) and long-term goodness (the achievement of transhumanist astronomical goods), paired with a real commitment to moral humility: not rushing ahead to implement every idea that sounds good to them.

We still plan to produce more long-form macrostrategy exposition, but given how many times we've failed to word our thoughts in a way we felt comfortable publishing, and given how much other stuff we're also juggling, I don't currently expect us to have any big macrostrategy posts in the next 6 months. (Note that I don't plan to give up on trying to get more of our thoughts out sooner than that, if possible. We'll see.)

Comment by robbensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-18T18:16:02.651Z · score: 16 (7 votes) · EA · GW

See also Paul Christiano's take:

Comment by robbensinger on Only a few people decide about funding for community builders world-wide · 2019-10-25T19:39:21.391Z · score: 6 (3 votes) · EA · GW
I think MIRI has/had a programme to fund career transitions

Tangential to the topic, but yes, we have an AI Safety Retraining Program for people interested in reskilling for full-time AI alignment research.

Comment by robbensinger on Is pain just a signal to enlist altruists? · 2019-10-03T14:21:59.351Z · score: 17 (3 votes) · EA · GW

[Epistemic status: Thinking out loud; copying my EA Forum comment about these papers from a couple weeks ago]

If the evolutionary logic here is right, I'd naively also expect non-human animals to suffer more to the extent they're (a) more social, and (b) better at communicating specific, achievable needs and desires.

There are reasons the logic might not generalize, though. Humans have fine-grained language that lets us express very complicated propositions about our internal states. That puts a lot of pressure on individual humans to have a totally ironclad, consistent "story" they can express to others. I'd expect there to be a lot more evolutionary pressure to actually experience suffering, since a human will be better at spotting holes in the narratives of a human who fakes it (compared to, e.g., a bonobo trying to detect whether another bonobo is really in that much pain).

It seems like there should be an arms race across many social species to give increasingly costly signals of distress, up until the costs outweigh the amount of help they can hope to get. But if you don't have the language to actually express concrete propositions like "Bob took care of me the last time I got sick, six months ago, and he can attest that I had a hard time walking that time too", then those costly signals might be mostly or entirely things like "shriek louder in response to percept X", rather than things like "internally represent a hard-to-endure pain-state so I can more convincingly stick to a verbal narrative going forward about how hard-to-endure this was".

Comment by robbensinger on Is pain just a signal to enlist altruists? · 2019-10-03T14:10:57.968Z · score: 12 (5 votes) · EA · GW

(Warning: Graphic medical descriptions)

I'm reminded of Reflections on pain, from the burn unit:

[...] But the one thing that did seem to dramatically affect my pain level was my belief about what was causing the pain. At one point, I was lying on my side and a nurse was pulling a bandage off of one of my burns; I couldn’t see what she was doing, but it felt like the bandage was sticking to the wound, and it was agonizing. But then she said: “Now, keep in mind, I’m just taking off the edges of the bandage here, so this is all normal skin. It just hurts because it’s like pulling tape off your skin.” And once she said that — once I started picturing tape being pulled off of normal, intact skin rather than an open wound — the pain didn’t bother me nearly as much. It really drove home to me how much of my experience of pain is psychological; if I believe the cause of the pain is something frightening or upsetting, then the pain seems much worse.
And in fact, I’d had a similar thought a few months ago, which I’d then forgotten about until the burn experience called it back to mind. I’d been carrying a heavy shopping bag on my shoulder one day, and the weight of the bag’s straps was cutting into the skin on my shoulder. But I barely noticed it. And then it occurred to me that if I had been experiencing that exact same sensation on my shoulder, in the absence of a shopping bag, it would have seemed quite painful. The fact that I knew the sensation was caused by something mundane and harmless reduced the pain so much it didn’t even register in my mind as a negative experience.
Of course, I probably can’t successfully lie to myself about what’s causing me pain, so there’s a limit to how directly useful this observation can be for managing pain in the future. But it was indirectly useful for me, because it proved to me something I’d heard but never quite believed: that the unpleasantness of pain is substantially (entirely?) psychologically constructed. A bit of subsequent reading led me to some fascinating science that underlines that conclusion – for example, the fact that the physical sensation of pain is processed by one region of the brain while the unpleasantness of that sensation is processed by another region. And the existence of a condition called pain asymbolia, in which people with certain kinds of brain damage say they’re able to feel pain but that they don’t find it the slightest bit unpleasant. [...]
Comment by robbensinger on RobBensinger's Shortform · 2019-09-23T19:45:12.777Z · score: 2 (1 votes) · EA · GW

[Epistemic status: Thinking out loud]

If the evolutionary logic here is right, I'd naively also expect non-human animals to suffer more to the extent they're (a) more social, and (b) better at communicating specific, achievable needs and desires.

There are reasons the logic might not generalize, though. Humans have fine-grained language that lets us express very complicated propositions about our internal states. That puts a lot of pressure on individual humans to have a totally ironclad, consistent "story" they can express to others. I'd expect there to be a lot more evolutionary pressure to actually experience suffering, since a human will be better at spotting holes in the narratives of a human who fakes it (compared to, e.g., a bonobo trying to detect whether another bonobo is really in that much pain).

It seems like there should be an arms race across many social species to give increasingly costly signals of distress, up until the costs outweigh the amount of help they can hope to get. But if you don't have the language to actually express concrete propositions like "Bob took care of me the last time I got sick, six months ago, and he can attest that I had a hard time walking that time too", then those costly signals might be mostly or entirely things like "shriek louder in response to percept X", rather than things like "internally represent a hard-to-endure pain-state so I can more convincingly stick to a verbal narrative going forward about how hard-to-endure this was".

Comment by robbensinger on RobBensinger's Shortform · 2019-09-23T19:44:20.263Z · score: 4 (2 votes) · EA · GW

Rolf Degen, summarizing part of Barbara Finlay's "The neuroscience of vision and pain":

Humans may have evolved to experience far greater pain, malaise and suffering than the rest of the animal kingdom, due to their intense sociality giving them a reasonable chance of receiving help.

From the paper:

Several years ago, we proposed the idea that pain, and sickness behaviour had become systematically increased in humans compared with our primate relatives, because human intense sociality allowed that we could ask for help and have a reasonable chance of receiving it. We called this hypothesis ‘the pain of altruism’ [68]. This idea derives from, but is a substantive extension of Wall’s account of the placebo response [43]. Starting from human childbirth as an example (but applying the idea to all kinds of trauma and illness), we hypothesized that labour pains are more painful in humans so that we might get help, an ‘obligatory midwifery’ which most other primates avoid and which improves survival in human childbirth substantially ([67]; see also [69]). Additionally, labour pains do not arise from tissue damage, but rather predict possible tissue damage and a considerable chance of death. Pain and the duration of recovery after trauma are extended, because humans may expect to be provisioned and protected during such periods. The vigour and duration of immune responses after infection, with attendant malaise, are also increased. Noisy expression of pain and malaise, coupled with an unusual responsivity to such requests, was thought to be an adaptation.
We noted that similar effects might have been established in domesticated animals and pets, and addressed issues of ‘honest signalling’ that this kind of petition for help raised. No implication that no other primate ever supplied or asked for help from any other was intended, nor any claim that animals do not feel pain. Rather, animals would experience pain to the degree it was functional, to escape trauma and minimize movement after trauma, insofar as possible.

Finlay's original article on the topic: "The pain of altruism".

Comment by robbensinger on Is preventing child abuse a plausible Cause X? · 2019-06-01T06:30:22.309Z · score: 12 (5 votes) · EA · GW

I also think this suggests something is going wrong. I'm guessing a lot of it is that people feel a need to justify posts as on-topic. If they post a thing because it seems interesting, confusing, exciting, etc., they're likely to get challenged about why the post belongs on the EA Forum.

This means that EAs can't talk about ideas and areas unless either (a) they've already been sufficiently well-explored by EAs elsewhere (e.g., in an 80K blog post or an Open Phil report) that there's a pre-existing consensus this is an especially good thing to talk about; or (b) they're willing to make the discussion very meta-oriented and general. ("Why don't EAs care more about reducing rates of medical error?", as opposed to "Hey, here's an interesting study on things that mediate medical error rates!")

This seems OK iff the EA Forum is only intended to intervene on a particular part of the idea pipeline — maybe the idea is for individuals and groups to explore new frontiers elsewhere, and bring them to the EA Forum once they're already well-established enough that everyone can agree they make sense as an EA priority. In that case, it might be helpful to have canonical locations people can go to have those earlier discussions.

Comment by robbensinger on Is EA unscalable central planning? · 2019-05-08T03:00:45.550Z · score: 6 (4 votes) · EA · GW

Small terminology note: an "existential risk" is anything that would drastically reduce the future's value, so s-risk is a special case of x-risk.

Comment by robbensinger on How do we check for flaws in Effective Altruism? · 2019-05-08T00:44:59.964Z · score: 7 (2 votes) · EA · GW

Yeah, strong upvote to this too.

Comment by robbensinger on How do we check for flaws in Effective Altruism? · 2019-05-07T03:02:43.800Z · score: 21 (7 votes) · EA · GW

+1 to this.

I partly agree with Nathan's post, for a few reasons:

  • If Alice believes X because she trusts that Bob looked into it, then it's useful for Alice to note her reason. Otherwise, you can get bad situations like 'Bob did not in fact look into X, but he observes Alice's confidence and concludes that she must have looked into it, so he takes X for granted too and Alice never realizes why'. This isn't a big problem in two-person groups, but can lead to a lot of double-counted evidence in thousand-person groups.
  • It's important to distinguish 'this feels compelling' from 'this is Bayesian evidence about the physical world'. If an argument seems convincing, but would seem equally convincing if it were false, then you shouldn't actually treat the convincingness as evidence.
  • Getting the right answer here is important enough, and blind spots and black-swan errors are common enough, that it can make a lot of sense to check your work even in cases where you'd be super surprised to learn you'd been wrong. Getting outside feedback can be a good way to do this.
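The second point above is just Bayes' rule with a likelihood ratio of 1. A small sketch, with made-up probabilities and a hypothetical function name:

```python
def posterior(prior: float,
              p_convincing_if_true: float,
              p_convincing_if_false: float) -> float:
    """Probability that a claim is true after finding an argument for it
    convincing, via Bayes' rule."""
    numerator = prior * p_convincing_if_true
    denominator = numerator + (1 - prior) * p_convincing_if_false
    return numerator / denominator


# The argument would feel just as convincing if the claim were false:
unchanged = posterior(prior=0.3, p_convincing_if_true=0.9, p_convincing_if_false=0.9)

# The argument would feel much less convincing if the claim were false:
updated = posterior(prior=0.3, p_convincing_if_true=0.9, p_convincing_if_false=0.3)
```

In the first case the posterior stays at the prior, so the feeling of being convinced carries no evidential weight; only in the second case does it move you.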

I've noticed that when I worry "what if everything I believe is wrong?", sometimes it's a real worry that I'm biased in a specific way, or that I might just be missing something. Other times, it's more like an urge to be dutifully/performatively skeptical or to get a certain kind of emotional reassurance; see for a good discussion of this.


Arguably this forum kind of does this job, though A) we are all tremendously biased B) are people *really* checking the minutiae? I am not.

Some people check some minutiae. The end of is a cool example that comes to mind.

I haven't had any recent massive updates about EA sources' credibility after seeing a randomized spot check, which is one way of trying to guess at the expected utility of more marginal spot-checking vs. putting the same resources into something else.

My main suggestion, though, would be to check out various examples of arguments between EAs, criticisms of EAs by other EAs, etc., and use that to start building a mental model of EA's epistemic hygiene and likely biases or strengths. "Everyone on the EA Forum must be tremendously biased because otherwise they surely wouldn't visit the forum" is a weak starting point by comparison; you can't figure out which groups in the real world are biased (or how much, or in what ways) from your armchair.

Comment by robbensinger on Will splashy philanthropy cause the biosecurity field to focus on the wrong risks? · 2019-05-02T01:10:01.881Z · score: 15 (7 votes) · EA · GW

What are some reasons people think GCBRs deserve less attention (relative to how Open Phil prioritizes this work)?

I'd be interested to learn more about reasons beyond "a diversity of perspectives and research focuses is good for the field", or background on why diversifying outside of GCR might be really important for biosecurity in particular. (E.g., "demanding that biosecurity researchers demonstrate relevance to GCBR is likely to stunt more basic or early-stage research that's also critical for GCBR, but at a greater temporal and causal remove"; or "GCBR is a bad way of thinking about the relationship between GCR and biosecurity, because the main GCR risks in this context are second-order effects from smaller-scale biosecurity incidents rather than e.g. global pandemics".)

The main object-level argument in Lentzos' article seems to be that GCBR is "extremely unlikely":

Biosecurity covers a spectrum of risks, ranging from naturally occurring disease, through unintended consequences of research, lab accidents, negligence, and reckless behavior, to deliberate misuse of pathogens or technology by state and non-state actors. The scenarios all have different likelihoods of playing out—and risks with potential catastrophic consequences on a global scale are among the least likely. But Open Phil dollars are flooding into biosecurity and are absorbing much of the field’s experienced research capacity, focusing the attention of experts on this narrow, extremely unlikely, aspect of biosecurity risk.

If this argument can be made in a compelling way from a perspective that's longtermist and focused on EV, I'd be really interested to learn more about it.

Comment by robbensinger on Long-Term Future Fund: April 2019 grant recommendations · 2019-04-26T03:26:19.735Z · score: 9 (6 votes) · EA · GW

Thanks for continuing to write up your thoughts in so much detail, Oliver; this is super interesting and useful stuff.

When you say "Note: Greg responded to this and I now think this point is mostly false", I assume that "this" refers to the previous point (1) rather than the subsequent point (2)?

Comment by robbensinger on Thoughts on 80,000 Hours’ research that might help with job-search frustrations · 2019-04-20T19:28:37.135Z · score: 3 (2 votes) · EA · GW

Yep, the quiz may be an exception! I was commenting on the general thread of discussion on this page "just take everything down that's out of date," and the quiz subthread was just the one that caught my eye. My apologies for making it sound like the quiz in particular is the thing I want preserved; I don't have a strong view on that.

Comment by robbensinger on Thoughts on 80,000 Hours’ research that might help with job-search frustrations · 2019-04-19T19:44:02.089Z · score: 3 (2 votes) · EA · GW

Yeah, I don't have a strong object-level view about exactly which advice is best for most EAs; I just wanted to voice some support for letting those recommendations drift apart if it does end up looking like EAs and non-EAs benefit from different things. I think "if X then Y" can definitely be a good solution.

Comment by robbensinger on Thoughts on 80,000 Hours’ research that might help with job-search frustrations · 2019-04-19T05:38:38.540Z · score: 15 (5 votes) · EA · GW
One of 80K's strongest features was (since they seem to be moving in a different direction) giving good generic career advice, especially for undergraduates. It would be a shame to lose this because I think it makes a great initial impression to newcomers and convinces them straight off the bat of how useful EA can be in helping them make meaningful impact, even if they aren't convinced by all of the ideas behind EA immediately.

That sounds plausible to me if the same recommendations apply to newcomers and to die-hard EAs, such that "do we give advice that's useful for general audiences?" is just a question of "which good-to-follow advice do we emphasize?" and not "which advice is good for a given demographic to follow?".

On the other hand, I don't want 80K to give advice that's actively bad for die-hard EAs to follow, no matter how useful that advice is to students in general. From my perspective (which might reflect a different set of goals than yours, since I'm not coming at this question from your position), that would make it too hard to just zip over to 80K's website for advice and trust that I'm getting relevant information.

I don't think we should underestimate the value of being able to trust that an information source is giving us exactly what it thinks the very best advice is, without having to worry about how much the source might be diluting that advice to make it more memetic or easy-to-sell. Being able to just take statements on the 80K website at face value is a big deal.

If a certain piece of advice turns out to be good for most students but bad for most EA students, then I could see it being possibly interesting and useful for 80K to make a page like "Here's how our advice to most students would differ from our advice to EA students." That could then serve a dual purpose by clarifying what sensible "baseline" advice looks like. I think it would also be fine for 80K to link to some offsite, non-80K-branded career advice that they especially endorse for other students, even though they specifically don't endorse it for maximizing your career's altruistic impact.

Comment by robbensinger on Thoughts on 80,000 Hours’ research that might help with job-search frustrations · 2019-04-19T05:24:26.025Z · score: 5 (3 votes) · EA · GW

I like there being a record of out-of-date recommendations and tools on the 80K site [edit: so I know how they've updated, and so I can access the parts of old resources that aren't out of date].

A curated list of Archive links might work OK as a replacement, I suppose. But in general, given that various pages have accumulated offsite hyperlinks over the years, I think it's more informative to plaster giant "this content is out-of-date because X" disclaimers on the relevant pages, rather than just taking the page down.

Comment by robbensinger on Is Modern Monetary Theory a good idea? · 2019-04-18T19:45:12.029Z · score: 13 (5 votes) · EA · GW

I agree with Aaron, though I might be more in favor than most people of "high-quality EA Forum discussion of things that are important for how the world works but aren't obviously EA-flavored or actionable". Vox probably isn't a great source for topics, but I do think EA should be branching out increasingly into topics that don't feel EA-ish, to build up models and questions that might feed into interventions further down the road.

Comment by robbensinger on Long-Term Future Fund: April 2019 grant recommendations · 2019-04-10T20:03:09.925Z · score: 50 (23 votes) · EA · GW

The main thing that pinged me about anoneaagain's comment was that it's saying things that aren't true, and saying them in ways that aren't epistemically cooperative, more so than that it's merely unkind. If you're going to assert 'this person's youtube videos are unsuccessful', you should say what you mean by that and why you think it. If the thing you're responding to is a long, skimmable 75-page post, you should make sure your readers didn't miss the fact that the person you're alluding to is a Computerphile contributor whose videos there tend to get hundreds of thousands of views, and you should say something about why that's not relevant to your success metric (or to the broader goals LTFF should be focusing on).

Wink-and-nudge, connotation-based argument makes it hard to figure out what argument's being made, which makes it hard to have a back-and-forth. If we strip away the connotation, it's harder to see what's laughable about ideas like "it can be useful to send people books" or "it can be useful to send people books that aren't textbooks, essay collections, or works of original fiction". Likewise, it doesn't seem silly to me for people with disabilities to work on EA projects, or to suggest that disability treatment could be relevant to some EA projects. But I have no idea where to go from there in responding to anoneaagain, because the comment's argument structure is hidden.

Comment by robbensinger on Long-Term Future Fund: April 2019 grant recommendations · 2019-04-09T19:26:58.637Z · score: 3 (2 votes) · EA · GW

That all makes sense. In principle I like the idea of trying both options at some point, in case one turns out to be obviously better. I do think that splitting things up into 6 books is better than 4, cost allowing, so that the first effort chunk feels smaller.

Comment by robbensinger on Long-Term Future Fund: April 2019 grant recommendations · 2019-04-09T06:21:44.792Z · score: 42 (21 votes) · EA · GW

Money-wise this strikes me as a fine thing to try. I'm a little worried that sending people the entire book set might cause some people to not read it who would have read a booklet, because they're intimidated by the size of the thing.

Psychologically, people generally need more buy-in to decide "I'll read the first few chapters of this 1800-page multi-volume book and see what I think" than to decide "I'll read the first few chapters of this 200-page book that has five sequels and see what I think", and even if the intended framing is the latter one, sending all 1800 pages at once might cause some people to shift to the former frame.

One thing that can help with this is to split HPMoR up into six volumes rather than four, corresponding to the book boundaries Eliezer proposed (though it seems fine to me if they're titled 'HPMoR Vol. 1' etc.). Then the first volume or two will be shorter, and feel more manageable. Then perhaps just send the first 3 (or 2?) volumes, and include a note saying something like 'If you like these books, shoot us an email at [email] and we'll ship you the second half of the story, also available on'

This further carves up the reading into manageable subtasks in a physical, perceptual way. It does carry the risk that some people might stop when they get through the initial volumes. It might be a benefit in its own right to cause email conversations to happen, though, since a back-and-forth can lead to other useful things happening.

Comment by robbensinger on After one year of applying for EA jobs: It is really, really hard to get hired by an EA organisation · 2019-02-28T18:46:38.803Z · score: 19 (12 votes) · EA · GW

MIRI (and other EA orgs, I'd wager) would strongly second "we don't think it's a waste of our time to process applications. If we don't have the staff capacity to process all the applications we receive, we can always just drop a larger fraction of applicants at each stage."

I second the rest of Luke's comment too. That run of applications sounds incredibly rough. The account above makes me wonder if we could be doing a better job of communicating expectations to people applying for jobs early in the process. It's much, much easier to avoid setting misleadingly low or misleadingly high expectations when the information can be personalized and there's an active back-and-forth, vs. in a blog post.

Comment by robbensinger on After one year of applying for EA jobs: It is really, really hard to get hired by an EA organisation · 2019-02-27T04:40:41.736Z · score: 5 (9 votes) · EA · GW

I agree with Stefan, but I do in fact think (for other reasons) that earning to give is a great idea, and one of the main approaches EAs should consider. I think it's a smaller constraint overall than hiring, but still a very important constraint.

Comment by robbensinger on EA Forum 2.0 Initial Announcement · 2018-08-02T21:31:21.395Z · score: 4 (4 votes) · EA · GW

Arbital uses a system where you can separately "upvote" things based on how much you like them, and give an estimate of how much probability you assign to claims. I like this system, and have recommended it be added to LW too. Among other things, I think it has a positive effect on people's mindsets if they practice keeping separate mental accounts of those two quantities.

Comment by robbensinger on EA Forum 2.0 Initial Announcement · 2018-08-02T21:25:18.632Z · score: 3 (3 votes) · EA · GW

I think the LW mods are considering features that will limit how many strong upvotes users can give out. I think the goal is for strong upvotes to look less like "karma totals get determined strictly by what forum veterans think" and more like "if you're a particularly respected and established contributor, you get the privilege of occasionally getting to 'promote/feature' site content so that a lot more people see it, and getting to dish out occasional super-karma rewards".

Comment by robbensinger on Job opportunity at the Future of Humanity Institute and Global Priorities Institute · 2018-04-07T14:05:37.402Z · score: 2 (2 votes) · EA · GW

If clutter is the main concern, might it be useful for 80K to post a regular (say, monthly) EA Forum post noting updates to their job board, and to have other job ad posts get removed and centralized to that post? I personally would have an easier time keeping track of what's new vs. old if there were a canonical location that mentioned key job listing updates.

Comment by robbensinger on Status Regulation and Anxious Underconfidence · 2017-11-18T22:34:19.934Z · score: 0 (0 votes) · EA · GW

See also: "Do Rational People Exist?"

Comment by robbensinger on Status Regulation and Anxious Underconfidence · 2017-11-18T22:19:29.081Z · score: 3 (3 votes) · EA · GW

Cross-posting a reply from FB:

It strikes me as much more prevalent for people to be overconfident in their own idiosyncratic opinions. If you see half of people are 90% confident in X and half of people are 90% confident in not-X, then you know on average they are overconfident. That's how most of the world looks to me.

This seems consistent with Eliezer's claim that "commenters on the Internet are often overconfident" while EAs and rationalists he interacts with in person are more often underconfident. In Dunning and Kruger's original experiment, the worst performers were (highly) overconfident, but the best performers were underconfident.

Your warnings that overconfidence and power-grabbing are big issues seem right to me. Eliezer's written a lot warning about those problems too. My main thought about this is just that different populations can exhibit different social dynamics and different levels of this or that bias; and these can also change over time. Eliezer's big-picture objection to modesty isn't "overconfidence and power-grabbing are never major problems, and you should never take big steps to try to combat them"; his objection is "biases vary a lot between individuals and groups, and overcorrection in debiasing is commonplace, so it's important that whatever debiasing heuristics you use be sensitive to context rather than generically endorsing 'hit the brakes' or 'hit the accelerator'".

He then makes the further claim that top EAs and rationalists as a group are in fact currently more prone to reflexive deference, underconfidence, fear-of-failure, and not-sticking-their-neck-out than to the biases of overconfident startup founders. At least on Eliezer's view, this should be a claim that we can evaluate empirically, and our observations should then inform how much we push against overconfidence v. underconfidence.

The evolutionary just-so story isn't really necessary for that critique, though it's useful to keep in mind if we were originally thinking that humans only have overactive status-grabbing instincts, and don't also have overactive status-grab-blocking instincts. Overcorrection is already a common problem, but it's particularly likely if there are psychological drives pushing in both directions.

Comment by robbensinger on Multiverse-wide cooperation in a nutshell · 2017-11-13T03:47:37.061Z · score: 2 (2 votes) · EA · GW

After the boxes have already been placed in front of me, however, I can no longer influence their contents, so it would be good if I two-boxed

You would get more utility if you were willing to one-box even when there's no external penalty or opportunity to bind yourself to the decision. Indeed, functional decision theory can be understood as a formalization of the intuition: "I would be better off if only I could behave in the way I would have precommitted to behave in every circumstance, without actually needing to anticipate each such circumstance in advance." Since the predictor in Newcomb's problem fills the boxes based on your actual action, regardless of the reasoning or contract-writing or other activities that motivate the action, this suffices to always get the higher payout (compared to causal or evidential decision theory).
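As a rough illustration of the payout claim (not a formalization of functional decision theory), here is a toy version of Newcomb's problem. The dollar amounts are the conventional ones and are an assumption here, as is the perfectly accurate predictor, which is modeled simply by running the agent's own policy. All names are hypothetical.

```python
from typing import Callable

BIG_PRIZE = 1_000_000  # assumed: opaque box contents when one-boxing is predicted
SMALL_PRIZE = 1_000    # assumed: transparent box contents, always present

Policy = Callable[[], str]  # a policy returns "one-box" or "two-box"


def play_newcomb(policy: Policy) -> int:
    """Return the agent's payout against an (assumed) perfect predictor."""
    # The predictor fills the opaque box based on what the agent will actually do,
    # which is modeled here by consulting the same policy the agent will use.
    predicted = policy()
    opaque = BIG_PRIZE if predicted == "one-box" else 0

    # The boxes are now fixed; the agent acts.
    action = policy()
    return opaque if action == "one-box" else opaque + SMALL_PRIZE


def one_boxer() -> str:
    return "one-box"


def two_boxer() -> str:
    return "two-box"


print(play_newcomb(one_boxer))  # 1000000
print(play_newcomb(two_boxer))  # 1000
```

Under these assumptions, the agent disposed to one-box does better without ever signing a contract or binding itself in advance, because the prediction tracks what it actually does.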

There are also dilemmas where causal decision theory gets less utility even if it has the opportunity to precommit to the dilemma; e.g., retro blackmail.

For a fuller argument, see the paper "Functional Decision Theory" by Yudkowsky and Soares.

Comment by robbensinger on Moloch's Toolbox (1/2) · 2017-11-09T02:21:17.247Z · score: 1 (1 votes) · EA · GW

I'm not an expert in this area and haven't seen that study, but I believe Eliezer generally defers to Bryan Caplan's analysis on this topic. Caplan's view, discussed in The Case Against Education (which is scheduled to come out in two months), is that something like 80% of the time students spend in school is signaling, and something like 80% of the financial reward students enjoy from school is due to signaling. So the claim isn't that school does nothing to build human capital, just that a very large chunk of schooling is destroying value.