Ben Garfinkel: How sure are we about this AI stuff? 2019-02-09T19:17:31.671Z · score: 81 (41 votes)


Comment by bmg on What posts do you want someone to write? · 2020-04-04T23:55:55.191Z · score: 6 (3 votes) · EA · GW

I think that chapter in the Precipice is really good, but it's not exactly the sort of thing I have in mind.

Although Toby's less optimistic than I am, he's still only arguing for a 10% probability of existentially bad outcomes from misalignment.* The argument in the chapter is also, by necessity, relatively cursory. It's aiming to introduce the field of artificial intelligence and the concept of AGI to readers who might be unfamiliar with it, explain what misalignment risk is, make the idea vivid to readers, clarify misconceptions, describe the state of expert opinion, and add in various other nuances all within the span of about fifteen pages. I think that it succeeds very well in what it's aiming to do, but I would say that it's aiming for something fairly different.

*Technically, if I remember correctly, it's a 10% probability within the next century. So the implied overall probability is at least somewhat higher.

Comment by bmg on What posts do you want someone to write? · 2020-03-26T21:34:48.791Z · score: 25 (10 votes) · EA · GW

I'd be really interested in reading an updated post that makes the case for there being an especially high (e.g. >10%) probability that AI alignment problems will lead to existentially bad outcomes.

There still isn't a lot of writing explaining the case for existential misalignment risk. And a significant fraction of what's been produced since Superintelligence is either: (a) roughly summarizing arguments in Superintelligence, (b) pretty cursory, or (c) written by people who are relative optimists and are in large part trying to explain their relative optimism.

Since I have the (possibly mistaken) impression that a decent number of people in the EA community are quite pessimistic regarding existential misalignment risk, on the basis of reasoning that goes significantly beyond what's in Superintelligence, I'd really like to understand this position a lot better and be in a position to evaluate the arguments for it.

(My ideal version of this post would probably assume some degree of familiarity with contemporary machine learning, and contemporary safety/robustness issues, but no previous familiarity with arguments that AI poses an existential risk.)

Comment by bmg on Request for Feedback: Draft of a COI policy for the Long Term Future Fund · 2020-02-08T19:22:13.531Z · score: 13 (4 votes) · EA · GW

More broadly, I just feel really uncomfortable with having to write all of our documents to make sense on a purely associative level. I as a donor would be really excited to see a COI policy as concrete as the one above, similarly to how all the concrete mistake pages on all the EA org websites make me really excited. I feel like making the policy less concrete trades off getting something right and as such being quite exciting to people like me, in favor of being more broadly palatable to some large group of people, and maybe making a bit fewer enemies. But that feels like it's usually going to be the wrong strategy for a fund like ours, where I am most excited about having a small group of really dedicated donors who are really excited about what we are doing, much more than being very broadly palatable to a large audience, without anyone being particularly excited about it.

It seems to me like there's probably an asymmetry here. I would be pretty surprised if the inclusion of specific references to drug use and metamours was the final factor that tipped anyone into a decision to donate to the fund. I wouldn't be too surprised, though, if the inclusion tipped at least some small handful of potential donors into bouncing. At least, if I were encountering the fund for the first time, I can imagine these inclusions being one minor input into any feeling of wariness I might have.

(The obvious qualifier here, though, is that you presumably know the current and target demographics of the fund better than I do. I expect different groups of people will tend to react very differently.)

I feel like the thing that is happening here makes me pretty uncomfortable, and I really don't want to further incentivize this kind of assessment of stuff.

Apologies if I'm misreading, but it feels to me like the suggestion here might be that intentionally using a more "high-level" COI is akin to trying to 'mislead' potential donors by withholding information. If that's the suggestion, then I think I at least mostly disagree. I think that having a COI that describes conflicts in less concrete terms is mostly about demonstrating an expected form of professionalism.

As an obviously extreme analogy, suppose that someone applying for a job decides to include information about their sexual history on their CV. There's some sense in which this person is being more "honest" than someone who doesn't include that information. But any employer who receives this CV will presumably have a negative reaction. This reaction also won't be irrational, since it suggests the applicant is either unaware of norms around this sort of thing or (admittedly a bit circularly) making a bad decision to willfully transgress them. In either case, it's reasonable for the employer to be a lot more wary of the applicant than they otherwise would be.

I think the dynamic is roughly the same as the dynamic that leads people to (rationally) prefer to hire lawyers who wear suits over those who don't, to trust think tanks that format and copy-edit their papers properly over those who don't, and so on.

This case is admittedly more complicated than the case of lawyers and suits, since you are in fact depriving potential donors of some amount of information. (At worst, suits just hide information about lawyers' preferred style of dress.) So there's an actual trade-off to be balanced. But I'm inclined to agree with Howie that the extra clarity you get from moving beyond 'high-level' categories probably isn't all that decision-relevant.

I'm not totally sure, though. In part, it's sort of an empirical question whether a merely high-level COI would give any donors an (in their view) importantly inaccurate or incomplete impression of how COIs are managed. If enough potential donors do seem to feel this way, then it's presumably worth being more detailed.

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-08T12:38:46.900Z · score: 5 (3 votes) · EA · GW

No, I think that would be far worse.

But if two people were (for example) betting on a prediction platform that's been set up by public health officials to inform prioritization decisions, then this would make the bet better. The reason is that, in this context, it would obviously matter if their expressed credences are well-calibrated and honestly meant. To the extent that the act of making the bet helps temporarily put some observers "on their toes" when publicly expressing credences, the most likely people to be put "on their toes" (other users of the platform) are also people whose expressed credences have an impact. So there would be an especially solid pro-social case for making the bet.

I suppose this bullet point is mostly just trying to get at the idea that a bet is better if it can clearly be helpful. (I should have said "positively influence" instead of just "influence.") If a bet creates actionable incentives to kill people, on the other hand, that's not a good thing.

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-03T18:57:36.654Z · score: 5 (3 votes) · EA · GW

Maybe you are describing a distinction that is more complicated than I am currently comprehending, but I at least would expect Chi and Greg to object to bets of the type "what is the expected number of people dying in self-driving car accidents over the next decade?", "Will there be an accident involving an AGI project that would classify as a 'near-miss', killing at least 10000 people or causing at least 10 billion dollars in economic damages within the next 50 years?" and "what is the likelihood of this new bednet distribution method outperforming existing methods by more than 30%, saving 30000 additional people over the next year?".

Just as an additional note, to speak directly to the examples you gave: I would personally feel very little discomfort if two people (esp. people actively making or influencing decisions about donations and funding) wanted to publicly bet on the question: "What is the likelihood of this new bednet distribution method outperforming existing methods by more than 30%, saving 30000 additional people over the next year?" I obviously don't know, but I would guess that Chi and Greg would both feel more comfortable about that question as well. I think that some random "passerby" might still feel some amount of discomfort, but probably substantially less.

I realize that there probably aren't very principled reasons to view one bet here as intrinsically more objectionable than others. I listed some factors that seem to contribute to my judgments in my other comment, but they're obviously a bit of a hodgepodge. My fully reflective moral view is also that there probably isn't anything intrinsically wrong with any category of bets. For better or worse, though, I think that certain bets will predictably be discomforting and wrong-feeling to many people (including me). Then I think this discomfort is worth weighing against the plausible social benefits of the individual bet being made. At least on rare occasions, the trade-off probably won't be worth it.

I ultimately don't think my view here is that different than common views on lots of other more mundane social norms. For example: I don't think there's anything intrinsically morally wrong about speaking ill of the dead. I recognize that a blanket prohibition on speaking ill of the dead would be a totally ridiculous and socially/epistemically harmful form of censorship. But it's still true that, in some hard-to-summarize class of cases, criticizing someone who's died is going to strike a lot of people as especially uncomfortable and wrong. Even without any specific speech "ban" in place, I think that it's worth giving weight to these feelings when you decide what to say.

What this general line of thought implies about particular bets is obviously pretty unclear. Maybe the value of publicly betting is consistently high enough to, in pretty much all cases, render feelings of discomfort irrelevant. Or maybe, if the community tries to have any norms around public betting, then the expected cost of wise bets avoided due to "false positives" would just be much higher than the expected cost of unwise bets made due to "false negatives." I don't believe this, but I obviously don't know. My best guess is that it probably makes sense to strike a (messy/unprincipled/disputed) balance that's not too dissimilar from balances we strike in other social and professional contexts.

(As an off-hand note, for whatever it's worth, I've also updated in the direction of thinking that the particular bet that triggered this thread was worthwhile. I also, of course, feel a bit weird having somehow now written so much about the fine nuances of betting norms in a thread about a deadly virus.)

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-02T17:46:52.974Z · score: 8 (5 votes) · EA · GW

Thanks! I do want to stress that I really respect your motives in this case and your evident thoughtfulness and empathy in response to the discussion; I also think this particular bet might be overall beneficial. I also agree with your suggestion that explicitly stating intent and being especially careful with tone/framing can probably do a lot of work.

It's maybe a bit unfortunate that I'm making this comment in a thread that began with your bet, then, since my comment isn't really about your bet. I realize it's probably pretty unpleasant to have an extended ethics debate somehow spring up around one of your posts.

I mainly just wanted to say that it's OK for people to raise feelings of personal/moral discomfort and that these feelings of discomfort can at least sometimes be important enough to justify refraining from a public bet. It seemed to me like some of the reaction to Chi's comment went too far in the opposite direction. Maybe wrongly/unfairly, it seemed to me that there was some suggestion that this sort of discomfort should basically just be ignored or that people should feel discouraged from expressing their discomfort on the EA Forum.

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-02T16:53:37.301Z · score: 11 (4 votes) · EA · GW

To clarify a bit, I'm not in general against people betting on morally serious issues. I think it's possible that this particular bet is also well-justified, since there's a chance some people reading the post and thread might actually be trying to make decisions about how to devote time/resources to the issue. Making the bet might also cause other people to feel more "on their toes" in the future, when making potentially ungrounded public predictions, if they now feel like there's a greater chance someone might challenge them. So there are potential upsides, which could outweigh the downsides raised.

At the same time, though, I do find certain kinds of bets discomforting and expect a pretty large portion of people (esp. people without much EA exposure) to feel discomforted too. I think that the cases where I'm most likely to feel uncomfortable would be ones where:

  • The bet is about an ongoing, pretty concrete tragedy with non-hypothetical victims. One person "profits" if the victims become more numerous and suffer more.

  • The people making the bet aren't, even pretty indirectly, in a position to influence the management of the tragedy or the dedication of resources to it. It doesn't actually matter all that much, in other words, if one of them is over- or under-confident about some aspect of the tragedy.

  • The bet is made in an otherwise "casual"/"social" setting.

  • (Importantly) It feels like the people are pretty much just betting to have fun, embarrass the other person, or make money.

I realize these aren't very principled criteria. It'd be a bit weird if the true theory of morality made a principled distinction between bets about "hypothetical" and "non-hypothetical" victims. Nevertheless, I do still have a pretty strong sense of moral queasiness about bets of this sort. To use an implausibly extreme case again, I'd feel like something was really going wrong if people were fruitlessly betting about stuff like "Will troubled person X kill themselves this year?"

I also think that the vast majority of public bets that people have made online are totally fine. So maybe my comments here don't actually matter very much. I mainly just want to make the point that: (a) Feelings of common-sense moral discomfort shouldn't be totally ignored or dismissed and (b) it's at least sometimes the right call to refrain from public betting in light of these feelings.

At a more general level, I really do think it's important for the community in terms of health, reputation, inclusiveness, etc., if common-sense feelings of moral and personal comfort are taken seriously. I'm definitely happy that the community has a norm of it typically being OK to publicly challenge others to bets. But I also want to make sure we have a strong norm against discouraging people from raising their own feelings of discomfort.

(I apologize if it turns out I'm disagreeing with an implicit straw-man here.)

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-02T04:23:51.667Z · score: 24 (14 votes) · EA · GW

I can guess that the primary motivation is not "making money" or "the feeling of winning and being right" - which would be quite inappropriate in this context

I don't think these motivations would be inappropriate in this context. Those are fine motivations that we healthily leverage in large parts of the world to cause people to do good things, so of course we should leverage them here to allow us to do good things.

The whole economy relies on people being motivated to make money, and it has been a key ingredient to our ability to sustain the most prosperous period humanity has ever experienced (cf. more broadly the stock market). Of course I want people to have accurate beliefs by giving them the opportunity to make money. That is how you get them to have accurate beliefs!

At least from a common-sense morality perspective, this doesn't sit right with me. I do feel that it would be wrong for two people to get together to bet about some horrible tragedy -- "How many people will die in this genocide?" "Will troubled person X kill themselves this year?" etc. -- purely because they thought it'd be fun to win a bet and make some money off a friend. I definitely wouldn't feel comfortable if a lot of people around me were doing this.

When the motives involve working to form more accurate and rigorous beliefs about ethically pressing issues, as they clearly were in this case, I think that's a different story. I'm sympathetic to the thought that it would be bad to discourage this sort of public bet. I think it might also be possible to argue that, if the benefits of betting are great enough, then it's worth condoning or even encouraging more ghoulishly motivated bets too. I guess I don't really buy that, though. I don't think that a norm specifically against public bets that are ghoulish from a common-sense morality perspective would place very important limitations on the community's ability to form accurate beliefs or do good.

I do also think there are significant downsides, on the other hand, to having a culture that disregards common-sense feelings of discomfort like the ones Chi's comment expressed.

[[EDIT: As a clarification, I'm not classifying the particular bet in this thread as "ghoulish." I share the general sort of discomfort that Chi's comment describes, while also recognizing that the bet was well-motivated and potentially helpful. I'm more generally pushing back against the thought that evident motives don't matter much or that concerns about discomfort/disrespectfulness should never lead people to refrain from public bets.]]

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2020-01-18T19:40:56.247Z · score: 5 (3 votes) · EA · GW

I think I disagree with the claim (or implication) that keeping P is more often more natural. Well, you're just saying it's "often" natural, and I suppose it's natural in some cases and not others. But I think we may disagree on how often it's natural, though hard to say at this very abstract level. (Did you see my comment in response to your Realism and Rationality post?)

In particular, I'm curious what makes you optimistic about finding a "correct" criterion of rightness. In the case of the politician, it seems clear that learning they don't have some of the properties you thought shouldn't call into question whether they exist at all.

But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment), is that there's no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and I'm not sure I understand why.

Hey again!

I appreciated your comment on the LW post. I started writing up a response to this comment and your LW one, back when the thread was still active, and then stopped because it had become obscenely long. Then I ended up badly needing to procrastinate doing something else today. So here’s an over-long document I probably shouldn’t have written, which you are under no social obligation to read.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-30T15:24:37.370Z · score: 5 (7 votes) · EA · GW

Just to say slightly more on this, I think the Bomb case is again useful for illustrating my (I think not uncommon) intuitions here.

Bomb Case: Omega puts a million dollars in a transparent box if he predicts you'll open it. He puts a bomb in the transparent box if he predicts you won't open it. He's only wrong about one in a trillion times.

Now suppose you enter the room and see that there's a bomb in the box. You know that if you open the box, the bomb will explode and you will die a horrible and painful death. If you leave the room and don't open the box, then nothing bad will happen to you. You'll return to a grateful family and live a full and healthy life. You understand all this. You want so badly to live. You then decide to walk up to the bomb and blow yourself up.

Intuitively, this decision strikes me as deeply irrational. You're intentionally taking an action that you know will cause a horrible outcome that you want badly to avoid. It feels very relevant that you're flagrantly violating the "Don't Make Things Worse" principle.

Now, let's step back a time step. Suppose you know that you're the sort of person who would refuse to kill yourself by detonating the bomb. You might decide that -- since Omega is such an accurate predictor -- it's worth taking a pill to turn you into the sort of person who would detonate it, to increase your odds of getting a million dollars. You recognize that this may lead you, in the future, to take an action that makes things worse in a horrifying way. But you calculate that the decision you're making now is nonetheless making things better in expectation.

This decision strikes me as pretty intuitively rational. You're violating the second principle -- the "Don't Commit to a Policy..." Principle -- but this violation just doesn't seem that intuitively relevant or remarkable to me. I personally feel like there is nothing too odd about the idea that it can be rational to commit to violating principles of rationality in the future.

(This is obviously just a description of my own intuitions, as they stand, though.)
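To make the two evaluations in the Bomb Case concrete, here is a minimal sketch that compares the ex ante value of a disposition with the value of the act once the bomb is visible. Everything numeric except the one-in-a-trillion error rate is an assumption chosen for illustration (`U_MONEY`, `U_DEATH`, `U_WALK_AWAY`), and the function names are hypothetical. It also assumes that Omega's prediction is about what you would do on seeing a bomb, and that a person who refuses bombs still opens a box containing money.

```python
# Illustrative numbers only: apart from the error rate, none of these come from the comment.
EPSILON = 1e-12          # Omega's error rate: about one in a trillion
U_MONEY = 1_000_000.0    # assumed utility of getting the million dollars
U_DEATH = -1e9           # assumed (hypothetical) utility of dying in the blast
U_WALK_AWAY = 0.0        # assumed utility of leaving with nothing


def policy_value(opens_on_bomb: bool) -> float:
    """Ex ante expected utility of being a given sort of person.

    Assumption: Omega fills the box with money iff it predicts that you
    would open the box even when you see a bomb inside it.
    """
    if opens_on_bomb:
        # Usually predicted correctly -> money. Rarely mispredicted -> bomb, opened anyway.
        return (1 - EPSILON) * U_MONEY + EPSILON * U_DEATH
    # Usually predicted correctly -> bomb, left alone. Rarely mispredicted -> money, taken.
    return (1 - EPSILON) * U_WALK_AWAY + EPSILON * U_MONEY


def act_value_given_bomb(open_box: bool) -> float:
    """Utility of the act itself once the bomb is already visible."""
    return U_DEATH if open_box else U_WALK_AWAY


if __name__ == "__main__":
    print("commit to opening :", policy_value(True))
    print("commit to refusing:", policy_value(False))
    print("open, bomb visible:", act_value_given_bomb(True))
    print("leave, bomb visible:", act_value_given_bomb(False))
```

Under these assumed numbers, committing to the opening disposition scores higher in expectation, while the act of opening is strictly worse once the bomb is in view. The sketch only restates the tension between the two principles; it does not settle which evaluation tracks rationality.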

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-30T01:23:42.653Z · score: 2 (2 votes) · EA · GW

Yep, thanks for the catch! Edited to fix.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-30T01:14:37.483Z · score: 5 (5 votes) · EA · GW
  • So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R_'s, and not about P_'s

  • But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn't get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)

Here are two logically inconsistent principles that could be true:

Don't Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.

Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.

I have strong intuitions that the first one is true. I have much weaker (comparatively negligible) intuitions that the second one is true. Since they're mutually inconsistent, I reject the second and accept the first. I imagine this is also true of most other people who are sympathetic to R_CDT.

One could argue that R_CDT sympathists don't actually have much stronger intuitions regarding the first principle than the second -- i.e. that their intuitions aren't actually very "targeted" on the first one -- but I don't think that would be right. At least, it's not right in my case.

A more viable strategy might be to argue for something like a meta-principle:

The 'Don't Make Things Worse' Meta-Principle: If you find "Don't Make Things Worse" strongly intuitive, then you should also find "Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse" just about as intuitive.

If the meta-principle were true, then I guess this would sort of imply that people's intuitions in favor of "Don't Make Things Worse" should be self-neutralizing. They should come packaged with equally strong intuitions for another position that directly contradicts it.

But I don't see why the meta-principle should be true. At least, my intuitions in favor of the meta-principle are way less strong than my intuitions in favor of "Don't Make Things Worse" :)

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-28T02:20:08.950Z · score: 3 (2 votes) · EA · GW

I may write up more object-level thoughts here, because this is interesting, but I just wanted to quickly emphasize the upshot that initially motivated me to write up this explanation.

(I don't really want to argue here that non-naturalist or non-analytic naturalist normative realism of the sort I've just described is actually a correct view; I mainly wanted to give a rough sense of what the view consists of and what leads people to it. It may well be the case that the view is wrong, because all true normative-seeming claims are in principle reducible to claims about things like preferences. I think the comments you've just made cover some reasons to suspect this.)

The key point is just that when these philosophers say that "Action X is rational," they are explicitly reporting that they do not mean "Action X suits my terminal preferences" or "Action X would be taken by an agent following a policy that maximizes lifetime utility" or any other such reduction.

I think that when people are very insistent that they don't mean something by their statements, it makes sense to believe them. This implies that the question they are discussing -- "What are the necessary and sufficient conditions that make a decision rational?" -- is distinct from questions like "What decision would an agent that tends to win take?" or "What decision procedure suits my terminal preferences?"

It may be the case that the question they are asking is confused or insensible -- because any sensible question would be reducible -- but it's in any case different. So I think it's a mistake to interpret at least these philosophers' discussions of "decision theories" or "criteria of rightness" as though they were discussions of things like terminal preferences or winning strategies. And it doesn't seem to me like the answer to the question they're asking (if it has an answer) would likely imply anything much about things like terminal preferences or winning strategies.

[[NOTE: Plenty of decision theorists are not non-naturalist or non-analytic naturalist realists, though. It's less clear to me how related or unrelated the thing they're talking about is to issues of interest to MIRI. I think that the conception of rationality I'm discussing here mainly just presents an especially clear case.]]

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-28T01:08:22.944Z · score: 4 (3 votes) · EA · GW

If the thing being argued for is "R_CDT plus P_SONOFCDT" ... If the thing being argued for is "R_CDT plus P_FDT...

Just as a quick sidenote:

I've been thinking of P_SONOFCDT as, by definition, the decision procedure that R_CDT implies that it is rational to commit to implementing.

If we define P_SONOFCDT this way, then anyone who believes that R_CDT is true must also believe that it is rational to implement P_SONOFCDT.

The belief that R_CDT is true and the belief that it is rational to implement P_FDT would only then be consistent if P_SONOFCDT is equivalent to P_FDT (which of course they aren't). So I would be inclined to say that no one should believe in both the correctness of R_CDT and the rationality of implementing P_FDT.

[[EDIT: Actually, I need to distinguish between the decision procedure that it would be rational to commit to yourself and the decision procedure that it would be rational to build into an agent. These can sometimes be different. For example, suppose that R_CDT is true and that you're building twin AI systems and you would like them both to succeed. Then it would be rational for you to give them decision procedures that will cause them to cooperate if they face each other in a prisoner's dilemma (e.g. some version of P_FDT). But if R_CDT is true and you've just been born into the world as one of the twins, it would be rational for you to commit to a decision procedure that would cause you to defect if you face the other AI system in a prisoner's dilemma (i.e. P_SONOFCDT). I slightly edited the above comment to reflect this. My tentative view -- which I've alluded to above -- is that the various proposed criteria of rightness don't in practice actually diverge all that much when it comes to the question of what sorts of decision procedures we should build into AI systems. Although I also understand that MIRI is not mainly interested in the question of what sorts of decision procedures we should build into AI systems.]]
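The builder/twin distinction in the edit above can be sketched numerically. This is only an illustration under assumptions: the payoff table is a textbook prisoner's dilemma chosen for the example (it is not from the comment), the builder is assumed to care about the twins' combined payoff, and the function names are hypothetical. Both evaluations are causal; they differ only in what the chooser's decision can causally affect.

```python
# Assumed textbook prisoner's dilemma payoffs: (my_move, other_move) -> my payoff.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def builder_value(shared_move: str) -> int:
    """Combined payoff when a builder installs one procedure in both twins.

    The builder's choice causally fixes both moves, so the twins always match.
    """
    return 2 * PAYOFF[(shared_move, shared_move)]


def twin_causal_value(my_move: str, p_other_cooperates: float) -> float:
    """Causal expected payoff for a twin who already exists.

    The other twin's move is held fixed (cooperating with the given
    probability), since this twin's choice cannot causally change it.
    """
    p = p_other_cooperates
    return p * PAYOFF[(my_move, "C")] + (1 - p) * PAYOFF[(my_move, "D")]


if __name__ == "__main__":
    print("builder, both cooperate:", builder_value("C"))
    print("builder, both defect   :", builder_value("D"))
    for p in (0.0, 0.5, 1.0):
        print(p, twin_causal_value("C", p), twin_causal_value("D", p))
```

With these assumed payoffs, the builder does better installing a cooperating procedure, while a twin reasoning causally about its own move finds that defecting dominates for every value of `p_other_cooperates`.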

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-28T00:44:20.797Z · score: 6 (4 votes) · EA · GW

Hm, I think I may have misinterpreted your previous comment as emphasizing the point that P_CDT "gets you less utility" rather than the point that P_SONOFCDT "gets you less utility." So my comment was aiming to explain why I don't think the fact that P_CDT gets less utility provides a strong challenge to the claim that R_CDT is true (unless we accept the "No Self-Effacement Principle"). But it sounds like you might agree that this fact doesn't on its own provide a strong challenge.

If the thing being argued for is "R_CDT plus P_SONOFCDT", then that makes sense to me, but is vulnerable to all the arguments I've been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT's "Don't Make Things Worse" principle.

In response to the first argument alluded to here: "Gets the most [expected] utility" is ambiguous, as I think we've both agreed.

My understanding is that P_SONOFCDT is definitionally the policy that, if an agent decided to adopt it, would cause the largest increase in expected utility. So -- if we evaluate the expected utility of a decision to adopt a policy from a causal perspective -- it seems to me that P_SONOFCDT "gets the most expected utility."

If we evaluate the expected utility of a policy from an evidential or subjunctive perspective, however, then another policy may "get the most utility" (because policy adoption decisions may be non-causally correlated.)
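As a rough illustration of how "gets the most expected utility" can come apart across perspectives, the sketch below scores the same two options in a twin prisoner's dilemma in two ways. The payoff table, the 0.99 match probability, and the function names are all assumptions made for the example, not anything specified in this thread. The causal score holds the other agent's move fixed; the evidential score treats your own move as evidence about theirs.

```python
# Assumed textbook prisoner's dilemma payoffs: (my_move, other_move) -> my payoff.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def other(move: str) -> str:
    """Return the opposite move."""
    return "D" if move == "C" else "C"


def causal_eu(my_move: str, p_other_cooperates: float) -> float:
    """Expected payoff with the other agent's move held fixed at a prior probability."""
    p = p_other_cooperates
    return p * PAYOFF[(my_move, "C")] + (1 - p) * PAYOFF[(my_move, "D")]


def evidential_eu(my_move: str, match_prob: float) -> float:
    """Expected payoff when the other agent matches my move with probability match_prob."""
    return (match_prob * PAYOFF[(my_move, my_move)]
            + (1 - match_prob) * PAYOFF[(my_move, other(my_move))])


if __name__ == "__main__":
    for move in ("C", "D"):
        print(move, "causal:", causal_eu(move, 0.5), "evidential:", evidential_eu(move, 0.99))
```

Under these assumptions, defecting scores higher on the causal measure and cooperating scores higher on the evidential one, so which option "gets the most expected utility" depends on which definition is in play.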

Apologies if I'm off-base, but it reads to me like you might be suggesting an argument along these lines:

  1. R_CDT says that it is rational to decide to follow a policy that would not maximize "expected utility" (defined in evidential/subjunctive terms).

  2. (Assumption) But it is not rational to decide to follow a policy that would not maximize "expected utility" (defined in evidential/subjunctive terms).

  3. Therefore R_CDT is not true.

The natural response to this argument is that it's not clear why we should accept the assumption in Step 2. R_CDT says that the rationality of a decision depends on its "expected utility" defined in causal terms. So someone starting from the position that R_CDT is true obviously won't accept the assumption in Step 2. R_EDT and R_FDT say that the rationality of a decision depends on its "expected utility" defined in evidential or subjunctive terms. So we might allude to R_EDT or R_FDT to justify the assumption, but of course this would also mean arguing backwards from the conclusion that the argument is meant to reach.

Overall at least this particular simple argument -- that R_CDT is false because P_SONOFCDT gets less "expected utility" as defined in evidential/quasi-evidential terms -- would seemingly fail due to circularity. But you may have in mind a different argument.

We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what "expected utility" means.

I felt confused by this comment. Doesn't even R_FDT judge the rationality of a decision by its expected value (rather than its actual value)? And presumably you don't want to say that someone who accepts unpromising gambles and gets lucky (ending up with high actual average utility) has made more "rational" decisions than someone who accepts promising gambles and gets unlucky (ending up with low actual average utility)?

You also correctly point out that the decision procedure that R_CDT implies agents should rationally commit to -- P_SONOFCDT -- sometimes outputs decisions that definitely make things worse. So "Don't Make Things Worse" implies that some of the decisions outputted by P_SONOFCDT are irrational.

But I still don't see what the argument is here unless we're assuming "No Self-Effacement." It still seems to me like we have a few initial steps and then a missing piece.

  1. (Observation) R_CDT implies that it is rational to commit to following the decision procedure P_SONOFCDT.

  2. (Observation) P_SONOFCDT sometimes outputs decisions that definitely make things worse.

  3. (Assumption) It is irrational to take decisions that definitely make things worse. In other words, the "Don't Make Things Worse" Principle is true.

  4. Therefore, as an implication of Step 2 and Step 3, P_SONOFCDT sometimes outputs irrational decisions.

  5. ???

  6. Therefore, R_CDT is false.

The "No Self-Effacement" Principle is equivalent to the principle that: If a criterion of rightness implies that it is rational to commit to a decision procedure, then that decision procedure only produces rational actions. So if we were to assume "No Self-Effacement" in Step 5 then this would allow us to arrive at the conclusion that R_CDT is false. But if we're not assuming "No Self-Effacement," then it's not clear to me how we get there.

Actually, in the context of this particular argument, I suppose we don't really have the option of assuming that "No Self-Effacement" is true -- because this assumption would be inconsistent with the earlier assumption that "Don't Make Things Worse" is true. So I'm not sure it's actually possible to make this argument schema work in any case.

There may be a pretty different argument here, which you have in mind. I at least don't see it yet though.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T22:42:27.890Z · score: 3 (2 votes) · EA · GW

Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this "dominance" argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself.

Suppose that we accept the principle that agents never "should" try to achieve outcomes that are impossible from the problem specification -- with one implication being that it's false that (as R_CDT suggests) agents that see a million dollars in the first box "should" two-box.

This seems to imply that it's also false that (as R_UDT suggests) an agent that sees that the first box is empty "should" one box. By the problem specification, of course, one boxing when there is no money in the first box is also an impossible outcome. Since decisions to two box only occur when the first box is empty, this would then imply that decisions to two box are never irrational in the context of this problem. But I imagine you don't want to say that.

I think I probably still don't understand your objection here -- so I'm not sure this point is actually responsive to it -- but I initially have trouble seeing what potential violations of naturalism/determinism R_CDT could be committing that R_UDT would not also be committing.

(Of course, just to be clear, both R_UDT and R_CDT imply that the decision to commit yourself to a one-boxing policy at the start of the game would be rational. They only diverge in their judgments of what actual in-room boxing decision would be rational. R_UDT says that the decision to two-box is irrational and R_CDT says that the decision to one-box is irrational.)

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T02:07:30.366Z · score: 6 (3 votes) · EA · GW

I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!

Thank you for taking the time to respond as well! :)

I think that terms like "normative" and "rational" are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauser's pluralistic moral reductionism).

I would say that (1) some philosophers use "rational" in a very human-centric way, which is fine as long as it's done consistently; (2) others have a much more thin conception of "rational", such as 'tending to maximize utility'; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of "rationality", but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.

I'm not positive I understand what (1) and (3) are referring to here, but I would say that there's also at least a fourth way that philosophers often use the word "rational" (which is also the main way I use the word "rational.") This is to refer to an irreducibly normative concept.

The basic thought here is that not every concept can be usefully described in terms of more primitive concepts (i.e. "reduced"). As a close analogy, a dictionary cannot give useful non-circular definitions of every possible word -- it requires the reader to have a pre-existing understanding of some foundational set of words. As a wonkier analogy, if we think of the space of possible concepts as a sort of vector space, then we sort of require an initial "basis" of primitive concepts that we use to describe the rest of the concepts.

Some examples of concepts that are arguably irreducible are "truth," "set," "property," "physical," "existence," and "point." Insofar as we can describe these concepts in terms of slightly more primitive ones, the descriptions will typically fail to be very useful or informative and we will typically struggle to break the slightly more primitive ones down any further.

To focus on the example of "truth," some people have tried to reduce the concept substantially. Some people have argued, for example, that when someone says that "X is true" what they really mean or should mean is "I personally believe X" or "believing X is good for you." But I think these suggested reductions pretty obviously don't entirely capture what people mean when they say "X is true." The phrase "X is true" also has an important meaning that is not amenable to this sort of reduction.

[[EDIT: "Truth" may be a bad example, since it's relatively controversial and since I'm pretty much totally unfamiliar with work on the philosophy of truth. But insofar as any concepts seem irreducible to you in this sense, or buy the more general argument that some concepts will necessarily be irreducible, the particular choice of example used here isn't essential to the overall point.]]

Some philosophers also employ normative concepts that they say cannot be reduced in terms of non-normative (e.g. psychological) properties. These concepts are said to be irreducibly normative.

For example, here is Parfit on the concept of a normative reason (OWM, p. 1):

We can have reasons to believe something, to do something, to have some desire or aim, and to have many other attitudes and emotions, such as fear, regret, and hope. Reasons are given by facts, such as the fact that someone’s finger-prints are on some gun, or that calling an ambulance would save someone’s life.

It is hard to explain the concept of a reason, or what the phrase ‘a reason’ means. Facts give us reasons, we might say, when they count in favour of our having some attitude, or our acting in some way. But ‘counts in favour of’ means roughly ‘gives a reason for’. Like some other fundamental concepts, such as those involved in our thoughts about time, consciousness, and possibility, the concept of a reason is indefinable in the sense that it cannot be helpfully explained merely by using words. We must explain such concepts in a different way, by getting people to think thoughts that use these concepts. One example is the thought that we always have a reason to want to avoid being in agony.

When someone says that a concept they are using is irreducible, this is obviously some reason for suspicion. A natural suspicion is that the real explanation for why they can't give a useful description is that the concept is seriously muddled or fails to grip onto anything in the real world. For example, whether this is fair or not, I have this sort of suspicion about the concept of "dao" in daoist philosophy.

But, again, it will necessarily be the case that some useful and valid concepts are irreducible. So we should sometimes take evocations of irreducible concepts seriously. A concept that is mostly undefined is not always problematically "underdefined."

When I talk about "normative anti-realism," I mostly have in mind the position that claims evoking irreducibly normative concepts are never true (either because these claims are all false or because they don't even have truth values). For example: Insofar as the word "should" is being used in an irreducibly normative sense, there is nothing that anyone "should" do.

[[Worth noting, though: The term "normative realism" is sometimes given a broader definition than the one I've sketched here. In particular, it often also includes a position known as "analytic naturalist realism" that denies the relevance of irreducibly normative concepts. I personally feel I understand this position less well and I think sometimes waffle between using the broader and narrower definition of "normative realism." I also more generally want to stress that not everyone who makes claims about "criterion of rightness" or employs other seemingly normative language is actually a normative realist in the narrow or even broad sense; what I'm doing here is just sketching one common especially salient perspective.]]

One motivation for evoking irreducibly normative concepts is the observation that -- in the context of certain discussions -- it's not obvious that there's any close-to-sensible way to reduce the seemingly normative concepts that are being used.

For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of "a rational choice" to the concept of "a winning choice" (or, in line with the type-2 conception you mention, a "utility-maximizing choice"). It seems difficult to make sense of a lot of basic claims about rationality if we use this reduction -- and other obvious alternative reductions don't seem to fare much better. To mostly quote from a comment I made elsewhere:

Suppose we want to claim that it is rational to try to maximize the expected winning (i.e. the expected fulfillment of your preferences). Due to randomness/uncertainty, though, an agent that tries to maximize expected "winning" won't necessarily win compared to an agent that does something else. If I spend a dollar on a lottery ticket with a one-in-a-billion chance of netting me a billion-and-one "win points," then I'm taking the choice that maximizes expected winning but I'm also almost certain to lose. So we can't treat "the rational action" as synonymous with "the action taken by an agent that wins."

We can try to patch up the issue here by reducing "the rational action" to "the action that is consistent with the VNM axioms," but in fact either action in this case is consistent with the VNM axioms. The VNM axioms don't imply that an agent must maximize the expected desirability of outcomes. They just imply that an agent must maximize the expected value of some function. It is totally consistent with the axioms, for example, to be effectively risk averse and instead maximize the expected square root of desirability. If we try to define "the action I should take" in this way, then the claim "it is rational to act consistently with the VNM axioms" also becomes an empty tautology.

We could of course instead reduce "the rational action" to "the action that maximizes expected winning." But now, of course, the claim "it is rational to maximize expected winning" no longer has any substantive content. When we make this claim, do we really mean to be stating an empty tautology? And do we really consider it trivially incoherent to wonder -- e.g. in a Pascal's mugging scenario -- whether it might be "rational" to take an action other than the one that maximizes expected winning? If not, then this reduction is a very poor fit too.
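
As a rough illustration of the lottery and VNM points in the quoted passage, here is a minimal sketch. The ticket price, prize, and odds come from the example; the starting level of 100 units of "desirability," and the treatment of one dollar as one "win point," are assumptions added only so the numbers can be computed.

```python
from fractions import Fraction
import math

# From the lottery example: a 1-point ticket, a one-in-a-billion chance
# of a prize of a billion-and-one points.
COST = Fraction(1)
PRIZE = Fraction(10**9 + 1)
P_WIN = Fraction(1, 10**9)

# Assumption (not in the original): the agent starts with 100 points.
START = Fraction(100)


def outcomes(buy):
    """Return (probability, final desirability) pairs for a choice."""
    if not buy:
        return [(Fraction(1), START)]
    return [
        (P_WIN, START - COST + PRIZE),
        (1 - P_WIN, START - COST),
    ]


def expected_linear(buy):
    """Expected desirability itself (exact arithmetic)."""
    return sum(p * x for p, x in outcomes(buy))


def expected_sqrt(buy):
    """Expected square root of desirability (a risk-averse objective)."""
    return sum(float(p) * math.sqrt(float(x)) for p, x in outcomes(buy))


# The "expected winning" maximizer buys the ticket, yet almost surely loses.
print(expected_linear(True) > expected_linear(False))
print("chance of losing:", 1 - P_WIN)

# The square-root maximizer declines, while still maximizing the
# expected value of *some* function, as the VNM axioms require.
print(expected_sqrt(True) < expected_sqrt(False))
```

Under these assumed numbers, the two agents disagree about the ticket even though each maximizes the expectation of some function, which is the sense in which the axioms alone don't single out either action.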

It ultimately seems hard, at least to me, to make non-vacuous true claims about what it's "rational" to do without evoking a non-reducible notion of "rationality." If we are evoking a non-reducible notion of rationality, then it makes sense that we can't provide a satisfying reduction.

FN15 in my post on normative realism elaborates on this point.

At the same time, though, I do think there are also really good and hard-to-counter epistemological objections to the existence of irreducibly normative properties (e.g. the objection described in this paper). You might also find the difficulty of reducing normative concepts a lot less obvious-seeming or problematic than I do. You might think, for example, that the difficulty of reducing "rationality" is less like the difficulty of reducing "truth" (which IMO mainly reflects the fact that truth is an important primitive concept) and more like the difficulty of defining the word "soup" in a way that perfectly matches our intuitive judgments about what counts as "soup" (which IMO mainly reflects the fact that "soup" is a high-dimensional concept). So I definitely don't want to say normative realism is obviously or even probably right.

I mainly just want to communicate the sort of thing that I think a decent chunk of philosophers have in mind when they talk about a "rational decision" or a "criterion of rightness." Although, philosophy being philosophy, plenty of people do of course have in mind plenty of different things.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T23:03:09.721Z · score: 6 (5 votes) · EA · GW

Sorry to drop in in the middle of this back and forth, but I am curious -- do you think it's quite likely that there is a single criterion of rightness that is objectively "correct"?

It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. "don't make things worse", or "don't be self-effacing"). And so far there doesn't seem to be any single criterion that satisfies all of them.

So why not just conclude that, similar to the case with voting and Arrow's theorem, perhaps there's just no single perfect criterion of rightness.

Happy to be dropped in on :)

I think it's totally conceivable that no criterion of rightness is correct (e.g. because the concept of a "criterion of rightness" turns out to be some spooky bit of nonsense that doesn't really map onto anything in the real world.)

I suppose the main things I'm arguing are just that:

  1. When a philosopher expresses support for a "decision theory," they are typically saying that they believe some claim about what the correct criterion of rightness is.

  2. Claims about the correct criterion of rightness are distinct from decision procedures.

  3. Therefore, when a member of the rationalist community uses the word "decision theory" to refer to a decision procedure, they are talking about something that's pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems [[EDIT: or what decision procedure most closely matches our preferences about decision procedures]] don't directly speak to the questions that most academic "decision theorists" are actually debating with one another.

I also think that, conditional on there being a correct criterion of rightness, R_CDT is more plausible than R_UDT. But this is a relatively tentative view. I'm definitely not a super hardcore R_CDT believer.

It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. "don't make things worse", or "don't be self-effacing"). And so far there doesn't seem to be any single criterion that satisfies all of them.

So why not just conclude that, similar to the case with voting and Arrow's theorem, perhaps there's just no single perfect criterion of rightness.

I guess here -- in almost definitely too many words -- is how I think about the issue here. (Hopefully these comments are at least somewhat responsive to your question.)

It seems like the following general situation is pretty common: Someone is initially inclined to think that anything with property P will also have property Q1 and Q2. But then they realize that properties Q1 and Q2 are inconsistent with one another.

One possible reaction to this situation is to conclude that nothing actually has property P. Maybe the idea of property P isn't even conceptually coherent and we should stop talking about it (while continuing to independently discuss properties Q1 and Q2). Often the more natural reaction, though, is to continue to believe that some things have property P -- but just drop the assumption that these things will also have both property Q1 and property Q2.

This is obviously a pretty abstract description, so I'll give a few examples. (No need to read the examples if the point seems obvious.)

Ethics: I might initially be inclined to think that it's always ethical (property P) to maximize happiness and that it's always unethical to torture people. But then I may realize that there's an inconsistency here: in at least rare circumstances, such as ticking time-bomb scenarios where torture can extract crucial information, there may be no decision that is both happiness maximizing (Q1) and torture-avoiding (Q2). It seems like a natural reaction here is just to drop either the belief that maximizing happiness is always ethical or that torture is always unethical. It doesn't seem like I need to abandon my belief that some actions have the property of being ethical.

Theology: I might initially be inclined to think that God is all-knowing, all-powerful, and all-good. But then I might come to believe (whether rightly or not) that, given the existence of evil, these three properties are inconsistent. I might then continue to believe that God exists, but just drop my belief that God is all-good. (To very awkwardly re-express this in the language of properties: This would mean dropping my belief that any entity that has the property of being God also has the property of being all-good).

Politician-bashing: I might initially be inclined to characterize some politician both as an incompetent leader and as someone who's successfully carrying out an evil long-term plan to transform the country. Then I might realize that these two characterizations are in tension with one another. A pretty natural reaction, then, might be to continue to believe the politician exists -- but just drop my belief that they're incompetent.

To turn to the case of the decision-theoretic criterion of rightness, I might initially be inclined to think that the correct criterion of rightness will satisfy both "Don't Make Things Worse" and "No Self-Effacement." It's now become clear, though, that no criterion of rightness can satisfy both of these principles. I think it's pretty reasonable, then, to continue to believe that there's a correct criterion of rightness -- but just drop the belief that the correct criterion of rightness will also satisfy "No Self-Effacement."

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T21:39:54.068Z · score: 4 (3 votes) · EA · GW

In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:

  1. I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on.

The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the "deterministic subprocess" view of our decision-making, we would find nothing strange about the idea that it's sometimes right for this subprocess to do locally incorrect things for the sake of better global results.

Is the following a roughly accurate re-characterization of the intuition here?

"Suppose that there's an agent that implements P_UDT. Because it is following P_UDT, when it enters the box room it finds a ton of money in the first box and then refrains from taking the money in the second box. People who believe R_CDT claim that the agent should have also taken the money in the second box. But, given that the universe is deterministic, this doesn't really make sense. From before the moment the agent entered the room, it was already determined that the agent would one-box. Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there's no relevant sense in which the agent should have two-boxed."

If so, then I suppose my first reaction is that this seems like a general argument against normative realism rather than an argument against any specific proposed criterion of rightness. It also applies, for example, to the claim that a P_CDT agent "should have" one-boxed -- since in a physically deterministic sense it could not have. Therefore, I think it's probably better to think of this as an argument against the truth (and possibly conceptual coherence) of both R_CDT and R_UDT, rather than an argument that favors one over the other.

In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So -- insofar as we accept this kind of objection from determinism -- there seems to be something problematically non-naturalistic about discussing what "would have happened" if we built in one decision procedure or another.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T20:15:49.491Z · score: 7 (4 votes) · EA · GW

The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.

I do think the argument ultimately needs to come down to an intuition about self-effacingness.

The fact that agents earn less expected utility if they implement P_CDT than if they implement some other decision procedure seems to support the claim that agents should not implement P_CDT.

But there's nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT. To again draw an analogy with a similar case, there's also nothing logically inconsistent about believing both (a) that utilitarianism is true and (b) that agents should not in general make decisions by carrying out utilitarian reasoning.

So why shouldn't I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.

More formally, it seems like the argument needs to be something along these lines:

  1. Over their lifetimes, agents who implement P_CDT earn less expected utility than agents who implement certain other decision procedures.
  2. (Assumption) Agents should implement whatever decision procedure will earn them the most expected lifetime utility.
  3. Therefore, agents should not implement P_CDT.
  4. (Assumption) The criterion of rightness is not self-effacing. Equivalently, if agents should not implement some decision procedure P_X, then it is not the case that R_X is true.
  5. Therefore -- as an implication of points (3) and (4) -- R_CDT is not true.

Whether you buy the "No Self-Effacement" assumption in Step 4 -- or, alternatively, the countervailing "Don't Make Things Worse" assumption that supports R_CDT -- seems to ultimately be a matter of intuition. At least, I don't currently know what else people can appeal to here to resolve the disagreement.

[[SIDENOTE: Step 2 is actually a bit ambiguous, since it doesn't specify how expected lifetime utility is being evaluated. For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don't think this ambiguity matters much for the argument.]]

[[SECOND SIDENOTE: I'm using the phrase "self-effacing" rather than "self-contradictory" here, because I think it's more standard and because "self-contradictory" seems to suggest logical inconsistency.]]

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-23T15:35:31.542Z · score: 5 (4 votes) · EA · GW

This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it's normative, it can be either an algorithm/procedure that's being recommended, or a criterion of rightness like "a decision is rational iff taking it would cause the largest expected increase in value" (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are "normative" or "endorsed").

Just on this point: I think you're right I may be slightly glossing over certain distinctions, but I might still draw them slightly differently (rather than doing a 2x2 grid). Some different things one might talk about in this context:

  1. Decisions
  2. Decision procedures
  3. The decision procedure that is optimal with regard to some given metric (e.g. the decision procedure that maximizes expected lifetime utility for some particular way of calculating expected utility)
  4. The set of properties that makes a decision rational ("criterion of rightness")
  5. A claim about what the criterion of rightness is ("normative decision theory")
  6. The decision procedure that it would be rational to decide to build into an agent (as implied by the criterion of rightness)

(4), (5), and (6) have to do with normative issues, while (1), (2), and (3) can be discussed without getting into normativity.

My current-although-not-firmly-held view is also that (6) probably isn't very sensitive to what the criterion of rightness is, so in practice can be reasoned about without going too deep into the weeds thinking about competing normative decision theories.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-23T13:17:52.242Z · score: 7 (4 votes) · EA · GW

I think there are basically three options:

  • Decision theory isn't normative.

  • Decision theory is normative in the way that "murder is bad" or "improving aggregate welfare is good" is normative, i.e., it expresses an arbitrary terminal value of human beings.

  • Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).

[[Disclaimer: I'm not sure this will be useful, since it seems like most discussions that verge on meta-ethics end up with neither side properly understanding the other.]]

I think the kind of decision theory that philosophers tend to work on is typically explicitly described as "normative." (For example, the SEP article on decision theory is about "normative decision theory.") So when I'm talking about "academic decision theories" or "proposed criteria of rightness" I'm talking about normative theories. When I use the word "rational" I'm also referring to a normative property.

I don't think there's any very standard definition of what it means for something to be normative, maybe because it's often treated as something pretty close to a primitive concept, but a partial account is that a "normative theory" is a claim about what someone should do. At least this is what I have in mind. This is different from the second option you list (and I think the third one).

Some normative theories concern "ends." These are basically claims about what people should do, if they can freely choose outcomes. For example: A subjectivist theory might say that people should maximize the fulfillment of their own personal preferences (whatever they are). Whereas a hedonistic utilitarian theory might say that people should maximize total happiness. I'm not sure what the best terminology is, and think this choice is probably relatively non-standard, but let's label these "moral theories."

Some normative theories, including "decision theories," concern "means." These theories put aside the question of which ends people should pursue and instead focus on how people should respond to uncertainty about the results/implications of their actions. For example: Expected utility theory says that people should take whatever actions maximize expected fulfillment of the relevant ends. Risk-weighted expected utility theory (and other alternative theories) say different things. Typical versions of CDT and EDT flesh out expected utility theory in different ways to specify what the relevant measure of "expected fulfillment" is.
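
To make the contrast between the causal and evidential measures of "expected fulfillment" concrete, here is a minimal sketch using the standard Newcomb payoffs ($1,000,000 in the opaque box, $1,000 in the transparent one). The function names and the 0.99 predictor accuracy are illustrative assumptions, not anything specified in this discussion.

```python
MILLION = 1_000_000
THOUSAND = 1_000


def evidential_eu(action, accuracy):
    """Expected payoff conditional on the action being taken.

    The action is treated as evidence about the prediction: the opaque
    box is full with probability `accuracy` given one-boxing, and with
    probability `1 - accuracy` given two-boxing.
    """
    p_full = accuracy if action == "one-box" else 1 - accuracy
    bonus = THOUSAND if action == "two-box" else 0
    return p_full * MILLION + bonus


def causal_eu(action, p_full):
    """Expected payoff holding the box contents fixed.

    `p_full` is the agent's credence that the opaque box is full; the
    action cannot change it, so it is the same for both actions.
    """
    bonus = THOUSAND if action == "two-box" else 0
    return p_full * MILLION + bonus


# Evidential evaluation favors one-boxing for an accurate predictor.
print(evidential_eu("one-box", 0.99) > evidential_eu("two-box", 0.99))

# Causal evaluation favors two-boxing whatever the credence is.
print(all(causal_eu("two-box", p) > causal_eu("one-box", p)
          for p in (0.0, 0.5, 1.0)))
```

Both functions compute an expectation over the same payoffs; they differ only in which probabilities they plug in, which is one way of seeing why "gets the most expected utility" is ambiguous between the two theories.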

Moral theory and normative decision theory seem to me to have pretty much the same status. They are both bodies of theory that bear on what people should do. On some views, the division between them is more a matter of analytic convenience than anything else. For example, David Enoch, a prominent meta-ethicist, writes: "In fact, I think that for most purposes [the line between the moral and the non-moral] is not a line worth worrying about. The distinction within the normative between the moral and the non-moral seems to me to be shallow compared to the distinction between the normative and the non-normative" (Taking Morality Seriously, 86).

One way to think of moral theories and normative decision theories is as two components that fit together to form more fully specified theories about what people should do. Moral theories describe the ends people should pursue; given these ends, decision theories then describe what actions people should take when in states of uncertainty. To illustrate, two examples of more complete normative theories that combine moral and decision-theoretic components would be: "You should take whatever action would in expectation cause the largest increase in the fulfillment of your preferences" and "You should take whatever action would, if you took it, lead you to anticipate the largest expected amount of future happiness in the world." The first is subjectivism combined with CDT, while the second is total view hedonistic utilitarianism combined with EDT.

(On this conception, a moral theory is not a description of "an arbitrary terminal value of human beings." Decision theory here also is not "the study of which decision-making methods humans happen to terminally prefer to employ." These are both theories about what people should do, rather than theories about what people's preferences are.)

Normativity is obviously pretty often regarded as a spooky or insufficiently explained thing. So a plausible position is normative anti-realism: It might be the case that no normative claims are true, either because they're all false or because they're not even well-formed enough to take on truth values. If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesn't really have an answer.

In the twin prisoner's dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).

If I'm someone with a twin and I'm implementing P_CDT, I still don't think I will choose to modify myself to cooperate in twin prisoner's dilemmas. The reason is that modifying myself won't cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.

(The fact that P_CDT agents won't modify themselves to cooperate with their twins could of course be interpreted as a mark against R_CDT.)
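To make the causal reasoning concrete, here is a minimal sketch with assumed prisoner's dilemma payoffs (the numbers and names are hypothetical). It treats the twin's action as causally independent of my own self-modification, which is the assumption doing the work in the argument above.

```python
# Assumed payoffs to me, indexed by (my action, twin's action); "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def causal_value(my_action, p_twin_cooperates):
    """Expected payoff of my action, holding fixed the chance that my twin cooperates."""
    return (p_twin_cooperates * PAYOFF[(my_action, "C")]
            + (1 - p_twin_cooperates) * PAYOFF[(my_action, "D")])


def gain_from_self_modifying(p_twin_cooperates):
    """Change in my causal expected payoff if I modify myself to cooperate.

    Modifying only myself leaves p_twin_cooperates untouched.
    """
    return causal_value("C", p_twin_cooperates) - causal_value("D", p_twin_cooperates)


def value_of_designing_both(shared_action):
    """Payoff per twin if a designer gives both twins the same action."""
    return PAYOFF[(shared_action, shared_action)]
```

Under these assumptions `gain_from_self_modifying` is negative for every probability, so the P_CDT agent declines to self-modify. By contrast, a designer choosing a single action for both twins does better with `value_of_designing_both("C")` than with `value_of_designing_both("D")`.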

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-23T12:12:17.161Z · score: 7 (3 votes) · EA · GW

One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility).

A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state.

This just seems to be the point that R_CDT is self-effacing: It says that people should not follow P_CDT, because following other decision procedures will produce better outcomes in expectation.

I definitely agree that R_CDT is self-effacing in this way (at least in certain scenarios). The question is just whether self-effacingness or failure to satisfy "Don't Make Things Worse" is more relevant when trying to judge the likelihood of a criterion of rightness being correct. I'm not sure whether it's possible to do much here other than present personal intuitions.

The point that R_UDT violates the "Don't Make Things Worse" principle only infrequently seems relevant, but I'm still not sure this changes my intuitions very much.

If we're going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?

I may just be missing something, but I don't see what this theoretical ugliness is. And I don't intuitively find the ugliness/elegance of the decision procedure recommended by a criterion of rightness to be very relevant when trying to judge whether the criterion is correct.

[[EDIT: Just an extra thought on the fact that R_CDT is self-effacing. My impression is that self-effacingness is typically regarded as a relatively weak reason to reject a moral theory. For example, a lot of people regard utilitarianism as self-effacing both because it's costly to directly evaluate the utility produced by actions and because others often react poorly to people who engage in utilitarian-style reasoning -- but this typically isn't regarded as a slam-dunk reason to believe that utilitarianism is false. I think the SEP article on consequentialism is expressing a pretty mainstream position when it says: "[T]here is nothing incoherent about proposing a decision procedure that is separate from one’s criterion of the right.... Criteria can, thus, be self-effacing without being self-refuting." Insofar as people don't tend to buy self-effacingness as a slam-dunk argument against the truth of moral theories, it's not clear why they should buy it as a slam-dunk argument against the truth of normative decision theories.]]

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-23T00:50:49.111Z · score: 10 (4 votes) · EA · GW

I agree that these three distinctions are important:

  • "Picking policies based on whether they satisfy a criterion X" vs. "Picking policies that happen to satisfy a criterion X". (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)

  • "Trying to follow a decision rule Y 'directly' or 'on the object level'" vs. "Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y". (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you've come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)

  • "A decision rule that prescribes outputting some action or policy and doesn't care how you do it" vs. "A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy". (E.g., a rule that says 'maximize the aggregate welfare of moral patients' vs. a specific mental algorithm intended to achieve that end.)

The second distinction here is most closely related to the one I have in mind, although I wouldn’t say it’s the same. Another way to express the distinction I have in mind is that it’s between (a) a normative claim and (b) a process of making decisions.

“Hedonistic utilitarianism is correct” would be a non-decision-theoretic example of (a). “Making decisions on the basis of coinflips” would be an example of (b).

In the context of decision theory, of course, I am thinking of R_CDT as an example of (a) and P_CDT as an example of (b).

I now have the sense I’m probably not doing a good job of communicating what I have in mind, though.

The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we've even been misunderstanding basic things at the level of "decision-theoretic criterion of rightness".

I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding “how decision-making works,” since normative claims don’t typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don’t think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making. Although we might have different intuitions here.

It's that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That's part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).

I agree that this is a worthwhile goal and that philosophers can probably contribute to it. I guess I’m just not sure that the question that most academic decision theorists are trying to answer -- and the literature they’ve produced on it -- will ultimately be very relevant.

The fact that CDT doesn't endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don't), and the fact that the theory it endorses is a strange frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.

The fact that R_CDT is “self-effacing” -- i.e. the fact that it doesn’t always recommend following P_CDT -- definitely does seem like a point of intuitive evidence against R_CDT.

But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the “Don’t Make Things Worse Principle,” which says that: It’s not rational to take decisions that will definitely make things worse. Will’s Bomb case is an example of a case where R_UDT violates this principle, which is very similar to his “Guaranteed Payoffs Principle.”

There’s then a question of which of these considerations is more relevant, when judging which of the two normative theories is more likely to be correct. The failure of R_UDT to satisfy the “Don’t Make Things Worse Principle” seems more important to me, but I don’t really know how to argue for this point beyond saying that this is just my intuition. I think that the failure of R_UDT to satisfy this principle -- or something like it -- is also probably the main reason why many philosophers find it intuitively implausible.

(IIRC the first part of Reasons and Persons is mostly a defense of the view that the correct theory of rationality may be self-effacing. But I’m not really familiar with the state of arguments here.)

In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agents' due to events that happened after she turned 20 (such as "the summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theory"). But she'll refuse to coordinate for reasons like "we hung out a lot the summer before my 20th birthday", "we spent our whole childhoods and teen years living together and learning from the same teachers", and "we all have similar decision-making faculties due to being members of the same species". There's no principled reason to draw this temporal distinction; it's just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.

I actually don’t think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won’t cause you to achieve better outcomes here.) So I don’t think there should be any weird “Frankenstein” decision procedure thing going on.

….Thinking more about it, though, I’m now less sure how much the different normative decision theories should converge in their recommendations about AI design. I think they all agree that we should build systems that one-box in Newcomb-style scenarios. I think they also agree that, if we’re building twins, then we should design these twins to cooperate in twin prisoner’s dilemmas. But there may be some other contexts where acausal cooperation considerations do lead to genuine divergences. I don’t have very clear/settled thoughts about this, though.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-22T14:01:24.647Z · score: 20 (10 votes) · EA · GW

"What makes a decision 'good' if the decision happens inside an AI?" and "What makes a decision 'good' if the decision happens inside a brain?" aren't orthogonal questions, or even all that different; they're two different ways of posing the same question.

I actually agree with you about this. I have in mind a different distinction, although I might not be explaining it well.

Here’s another go:

Let’s suppose that some decisions are rational and others aren’t. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. The phrase “decision theory” in this context typically refers to a claim about necessary and/or sufficient conditions for a decision being rational. To use different jargon, in this context a “decision theory” refers to a proposed “criterion of rightness.”

When philosophers talk about “CDT,” for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, “CDT” is the claim that a decision is rational only if taking it would cause the largest expected increase in value. To avoid any ambiguity, let’s label this claim R_CDT.

We can also talk about “decision procedures.” A decision procedure is just a process or algorithm that an agent follows when making decisions.

For each proposed criterion of rightness, it’s possible to define a decision procedure that only outputs decisions that fulfill the criterion. For example, we can define P_CDT as a decision procedure that involves only taking actions that R_CDT claims are rational.

My understanding is that when philosophers talk about “CDT,” they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.

The difference matters, because people who believe that R_CDT is true don’t generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves. R_CDT claims that we should do whatever will have the best effects -- and, in many cases, building agents that follow a decision procedure other than P_CDT is likely to have the best effects. More generally: Most proposed criteria of rightness imply that it can be rational to build agents that sometimes behave irrationally.
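A minimal sketch of this point, under assumed Newcomb payoffs and a hypothetical predictor accuracy: the designer's choice of which agent to build is itself evaluated by causal expected value. Because the predictor is assumed to examine the agent after it is built, the design choice causally influences the prediction.

```python
# Assumed Newcomb payoffs (illustrative only).
MILLION = 1_000_000
THOUSAND = 1_000
POLICIES = ("one-box", "two-box")


def causal_value_of_building(agent_policy, predictor_accuracy):
    """Expected payoff caused by building an agent with a fixed policy.

    Assumes the predictor inspects the finished agent, so the probability
    that the opaque box is full depends causally on the design choice.
    """
    if agent_policy == "one-box":
        return predictor_accuracy * MILLION
    if agent_policy == "two-box":
        return (1 - predictor_accuracy) * MILLION + THOUSAND
    raise ValueError(f"unknown policy: {agent_policy}")


def design_recommended_by_r_cdt(predictor_accuracy):
    """Policy whose construction has the highest causal expected value."""
    return max(POLICIES, key=lambda p: causal_value_of_building(p, predictor_accuracy))
```

With a reasonably accurate predictor, `design_recommended_by_r_cdt` returns `"one-box"`, even though an agent running P_CDT inside the scenario would two-box. That is the sense in which believing R_CDT does not commit someone to building P_CDT agents.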

MIRI's AI work is properly thought of as part of the "success-first decision theory" approach in academic decision theory.

One possible criterion of rightness, which I’ll call R_UDT, is something like this: An action is rational only if it would have been chosen by whatever decision procedure would have produced the most expected value if consistently followed over an agent’s lifetime. For example, this criterion of rightness says that it is rational to one-box in the transparent Newcomb scenario because agents who consistently follow one-boxing policies tend to do better over their lifetimes.

I could be wrong, but I associate the “success-first approach” with something like the claim that R_UDT is true. This would definitely constitute a really interesting and significant divergence from mainstream opinion within academic decision theory. Academic decision theorists should care a lot about whether or not it’s true.

But I’m also not sure if it matters very much, practically, whether R_UDT or R_CDT is true. It’s not obvious to me that they recommend building different kinds of decision procedures into AI systems. For example, both seem to recommend building AI systems that would one-box in the transparent Newcomb scenario.

You can go with Paul and say that a lot of these distinctions are semantic rather than substantive -- that there isn't a true, ultimate, objective answer to the question of whether we should evaluate decision theories by whether they're successful, vs. some other criterion.

I disagree that any of the distinctions here are purely semantic. But one could argue that normative anti-realism is true. In this case, there wouldn’t really be any such thing as the criterion of rightness for decisions. Neither R_CDT nor R_UDT nor any other proposed criterion would be “correct.”

In this case, though, I think there would be even less reason to engage with academic decision theory literature. The literature would be focused on a question that has no real answer.

[[EDIT: Note that Will also emphasizes the importance of the criterion-of-rightness vs. decision-procedure distinction in his critique of the FDT paper: "[T]hey’re [most often] asking what the best decision procedure is, rather than what the best criterion of rightness is... But, if that’s what’s going on, there are a whole bunch of issues to dissect. First, it means that FDT is not playing the same game as CDT or EDT, which are proposed as criteria of rightness, directly assessing acts. So it’s odd to have a whole paper comparing them side-by-side as if they are rivals."]]

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-20T20:21:30.219Z · score: 21 (11 votes) · EA · GW

I think it’s plausible that Paul is being overly charitable to decision theorists; I’d love to hear whether skeptics of updateless decision theories actually agree that you shouldn’t build a CDT agent.

FWIW, I could probably be described as a "skeptic" of updateless decision theories; I’m pretty sympathetic to CDT. But I also don’t think we should build AI systems that consistently take the actions recommended by CDT. I know at least a few other people who favor CDT, but again (although small sample size) I don’t think any of them advocate for designing AI systems that consistently act in accordance with CDT.

I think the main thing that’s going on here is that academic decision theorists are primarily interested in normative principles. They’re mostly asking the question: “What criterion determines whether or not a decision is ‘rational’?” For example, standard CDT claims that an action is rational only if it’s the action that can be expected to cause the largest increase in value.

On the other hand, AI safety researchers seem to be mainly interested in a different question: “What sort of algorithm would it be rational for us to build into an AI system?” The first question doesn’t seem very relevant to the second one, since the different criteria of rationality proposed by academic decision theorists converge in most cases. For example: No matter whether CDT, EDT, or UDT is correct, it will not typically be rational to build a two-boxing AI system. It seems to me, then, that it's probably not very pressing for the AI safety community to think about the first question or engage with the academic decision theory literature.

At the same time, though, AI safety writing on decision theory sometimes seems to ignore (or implicitly deny?) the distinction between these two questions. For example: The FDT paper seems to be pitched at philosophers and has an abstract that frames the paper as an exploration of “normative principles.” I think this understandably leads philosophers to interpret FDT as an attempt to answer the first question and to criticize it on those grounds.

they aren’t as oriented by the question of “how do I write down a decision theory which would have good outcomes if I created an intelligent agent which used it”

I would go further and say that (so far as I understand the field) most academic decision theorists aren't at all oriented by this question. I think the question they're asking is again mostly independent. I'm also not sure it would even make sense to talk about "using" a "decision theory" in this context, insofar as we're conceptualizing decision theories the way most academic decision theorists do (as normative principles). Talking about "using" CDT in this context is sort of like talking about "using" deontology.

[[EDIT: See also this short post for a better description of the distinction between a "criterion of rightness" and a "decision procedure." Another way to express my impression of what's going on is that academic decision theorists are typically talking about criteria of rightness and AI safety decision theorists are typically (but not always) talking about decision procedures.]]

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-20T03:42:24.268Z · score: 4 (2 votes) · EA · GW

Lightly editing some thoughts I previously wrote up on this issue, somewhat in line with Paul's comments:

Rationalist community writing on decision theory sometimes seems to switch back and forth between describing decision theories as normative principles (which I believe is how academic philosophers typically describe decision theories) and as algorithms to be used (which seems to be inconsistent with how academic philosophers typically describe decision theories). I think this tendency to switch back and forth between describing decision theories in these two distinct ways can be seen both in papers proposing new decision theories and in online discussions. I also think this switching tendency can make things pretty confusing. Although it makes sense to discuss how an algorithm “performs” when “implemented,” once we specify a sufficiently precise performance metric, it does not seem to me to make sense to discuss the performance of a normative principle. I think the tendency to blur the distinction between algorithms and normative principles -- or, as Will MacAskill puts it in his recent and similar critique, between "decision procedures" and "criteria of rightness" -- partly explains why proponents of FDT and other new decision theories have not been able to get much traction with academic decision theorists.

For example, causal decision theorists are well aware that people who always take the actions that CDT says they should take will tend to fare less well in Newcomb scenarios than people who always take the actions that EDT says they should take. Causal decision theorists are also well aware that there are some scenarios -- for example, a Newcomb scenario with a perfect predictor and the option to get brain surgery to pre-commit yourself to one-boxing -- in which there is no available sequence of actions such that CDT says you should take each of the actions in the sequence. If you ask a causal decision theorist what sort of algorithm you should (according to CDT) put into an AI system that will live in a world full of Newcomb scenarios, if the AI system won’t have the opportunity to self-modify, then I think it's safe to say a causal decision theorist won’t tell you to put in an algorithm that only produces actions that CDT says it should take. This tells me that we really can’t fluidly switch back and forth between making claims about the correctness of normative principles and claims about the performance of algorithms, as though there were an accepted one-to-one mapping between these two sorts of claims. Insofar as rationalist writing on decision theory tends to do this sort of switching, I suspect that it contributes to confusion and dismissiveness on the part of many academic readers.

Comment by bmg on Are we living at the most influential time in history? · 2019-09-13T13:55:42.598Z · score: 10 (6 votes) · EA · GW
If we treat the priors as hypotheses about the distribution of events in the world, then past data can provide evidence about which one is right, and (the principle of) Will's prior would have given excessively low credence to humanity's first million years being the million years when life traveled to the Moon, humanity becoming such a large share of biomass, the first 10,000 years of agriculture leading to the modern world, and so forth.

On the other hand, the kinds of priors Toby suggests would also typically give excessively low credence to these events taking so long. So the data doesn't seem to provide much active support for the proposed alternative either.

It also seems to me like different kinds of priors are probably warranted for predictions about when a given kind of event will happen for the first time (e.g. the first year in which someone is named Steve) and predictions about when a given property will achieve its maximum value (e.g. the year with the most Steves). It can therefore be consistent to expect the kinds of "firsts" you list to be relatively bunched up near the start of human history, while also expecting relevant "mosts" (such as the most hingey year) to be relatively spread out.

That being said, I find it intuitive that periods with lots of "firsts" should tend to be disproportionately hingey. I think this intuition could be used to construct a model in which early periods are especially likely to be hingey.

Comment by bmg on Are we living at the most influential time in history? · 2019-09-12T20:19:40.441Z · score: 15 (4 votes) · EA · GW
As a general rule if you have a domain like this that extends indefinitely in one direction, the correct prior is one that diminishes as you move further away in that direction, rather than picking a somewhat arbitrary end point and using a uniform prior on that.

Just a quick thought on this issue: Using Laplace's rule of succession (or any other similar prior) also requires picking a somewhat arbitrary start point. You suggest 200000BC as a start point, but one could of course pick earlier or later years and get out different numbers. So the uniform prior's sensitivity to decisions about how to truncate the relevant time interval isn't a special weakness; it doesn't seem to provide grounds for preferring the Laplacian prior.
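A rough sketch of this sensitivity, using the standard form of Laplace's rule of succession: after n trials with s successes, the probability of success on the next trial is (s + 1) / (n + 2). Treating each elapsed century as one failed trial is an assumption about how the prior would be applied here, and the alternative start dates are hypothetical.

```python
def laplace_next_probability(trials_so_far, successes_so_far=0):
    """Laplace's rule of succession: P(success on the next trial)."""
    return (successes_so_far + 1) / (trials_so_far + 2)


def centuries_elapsed(start_year_bc, current_year_ad=2019):
    """Approximate number of whole centuries between the start point and now."""
    return (start_year_bc + current_year_ad) // 100


def probability_for_start_point(start_year_bc):
    """Probability assigned to the next century, given a chosen start point."""
    return laplace_next_probability(centuries_elapsed(start_year_bc))


if __name__ == "__main__":
    # Hypothetical alternative start points, alongside the suggested 200000BC.
    for start in (10_000, 200_000, 2_000_000):
        print(f"start {start} BC -> {probability_for_start_point(start):.6f}")
```

The three start points give probabilities that differ by more than two orders of magnitude, which is the sense in which the choice of start point matters for this kind of prior as well.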

I think that for some notion of an "arbitrary superlative," a uniform prior also makes a lot more intuitive sense than a Laplacian prior. The Laplacian prior would give very strange results, for example, if you tried to use it to estimate the hottest day on Earth, the year with the highest portion of Americans named Zach, or the year with the most supernovas.

Moreover in your case in particular, there are also good reasons to suspect that the chance of a century being the most influential should diminish over time.

I agree with this intuition, but I suppose I see it as a reason to shift away from a uniform prior rather than to begin from something as lopsided as a Laplacian. I think that this intuition is also partially (but far from entirely) counterbalanced by the countervailing intuitions Will lists for expecting influence to increase over time.

Comment by bmg on Altruistic Motivations · 2019-01-06T21:46:55.644Z · score: 27 (14 votes) · EA · GW

I'm a bit concerned that this post is blurring the distinction between two different questions: “Do we have obligations to others?” and “What way of 'framing' effective altruism to yourself is most productive or sits best emotionally?”

For example, it may be the case that "guilt and shame are poor motivators," but this would have no bearing on the question of whether or not we have moral obligations. People who say that we "ought to" help others don't normally say it because they think that obligation is an instrumentally useful framing -- they say it because they believe that what they're saying is true.

Just do what you want to do.

Internalising this principle might make many people happier -- and might even lead many altruistically-inclined people to do more good in the long run.

But I also think the principle is probably false. It implies, for example, that sadists and abusers should just do what they want to do as well. If there are actually any "oughtthorities to ordain what is right and what is wrong," then it seems unlikely these oughtthorities would endorse harming others in such cases. On the other hand, if the post is right about there not being any oughtthorities (i.e. normative facts), then the principle is still at minimum no more correct than the principle that people should "just do what helps others the most."