Posts

Julia Galef and Matt Yglesias on bioethics and "ethics expertise" 2021-03-30T03:06:41.561Z
Politics is far too meta 2021-03-17T23:57:33.321Z
DontDoxScottAlexander.com - A Petition 2020-06-25T23:29:46.491Z
RobBensinger's Shortform 2019-09-23T19:44:20.095Z
New edition of "Rationality: From AI to Zombies" 2018-12-15T23:39:22.975Z
AI Summer Fellows Program: Applications open 2018-03-23T21:20:05.203Z
Anonymous EA comments 2017-02-07T21:42:24.686Z
Ask MIRI Anything (AMA) 2016-10-11T19:54:25.621Z
MIRI is seeking an Office Manager / Force Multiplier 2015-07-05T19:02:24.163Z

Comments

Comment by RobBensinger on [deleted post] 2021-04-21T13:34:40.533Z

Seems fine to start with the simpler system, as you propose, and add wrinkles only if problems actually arise in practice.

Comment by RobBensinger on [deleted post] 2021-04-20T23:57:45.802Z

A natural way of tagging AI-related content on the EA Forum might be something like:

  1. Discussion of the sort of AI that existential risk EAs are worried about.
  2. Discussion of other sorts of AI.

And within 1:

  • 1a. Technical work aimed at increasing the probability that well-intentioned developers can reliably produce good outcomes from category-1 AI systems.
  • 1b. Attempts to forecast AI progress as it bears on category-1 AI systems.
  • 1c. Attempts to answer macrostrategy questions about such AI systems: How should they be used? What kind of group(s) should develop them? How do we ensure that developers are informed, responsible, and/or well-intentioned? Etc.

Plausibly 1b and 1c should just be one tag (which can then link to multiple different daughter wiki articles explaining different subtopics), since there's lots of overlap and keeping the number of tags small makes it easier to find articles you're looking for and remember what the tags are.

(There may also be no need to make 1 its own category, if everything falls under at least one of 1a/1b/1c anyway. But maybe some things will be meta enough to benefit from a supertag -- e.g., discussions of the orgs working on AI x-risk.)

If those are good categories, then the next question is what to name them. Some established options for category 1 (roughly in increasing order of how much I like them for this purpose):

  • Superintelligent AI - Defined by Bostrom as AI "that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills". This seems overly specific: AI might destroy the world with superhuman science and engineering skills even if it lacks superhuman social skills, for example.
  • Transformative AI - Defined by Open Phil as "AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution". I think this is too vague, and wouldn't help people discussing Bostrom/Christiano/etc. doomsday scenarios find each other on the forum. E.g., Logan Zoellner wonders whether existing AI is already "transformative" in this sense; whatever the answer, it seems like a question that's tangential to the kinds of considerations x-risk folks mostly care about.
  • Advanced AI - A vague term that can variously mean any of the terms on this list. Its main advantage is that its lack of a clear definition would let the EA Forum stipulate some definition just for the sake of the wiki and tagging systems.
  • Artificial general intelligence - AI that can do the same sort of general reasoning about messy physical environments that allowed humans to land on the Moon and build particle accelerators (even though those are very different tasks, neither capability was directly selected for in our ancestral environment, and chimpanzees can't do either). This seems like a good option relative to my own way of thinking about AI x-risk. The main disadvantage of the term (for this use case) is that some thinkers worried about AI x-risk are more skeptical that "general intelligence" is a natural or otherwise useful category. A slightly more theory-neutral term might be better for a tag or overview article.
  • Prepotent AI - A new term defined by Andrew Critch and David Krueger to mean AI whose deployment "would transform the state of humanity’s habitat—currently the Earth—in a manner that is at least as impactful as humanity and unstoppable to humanity". This seems to nicely subsume both the slower-takeoff Christiano doomsday scenarios and the faster-takeoff Bostromite doomsday scenarios, without weighing in on whether "AGI" is a good category.

Category 2 could then be called "Non-prepotent AI" or "Narrow AI" or similar.

I haven't used the term "prepotent AI" much, so it might have issues that I'm not tracking. But if so, giving it a test run on the EA Forum might be a good way to reveal such issues.

I think the best term for 1a is AI alignment, with one wrinkle: most researchers focus on "intent alignment" (getting the AI to try to produce good outcomes), and Paul Christiano thinks it would be more natural to define the field's goal as intent alignment, while some other researchers want 'AI alignment' to also cover topics like boxing (I've called this larger category "outcome alignment" to disambiguate).

For 1b+1c, I like AI strategy and forecasting.  The term "AI governance" is popular, but seems too narrow (and potentially alienating or confusing) to me. You could also maybe call it 'Prepotent AI strategy and forecasting' or 'AGI strategy and forecasting' to clarify that we aren't talking about the strategic implications of using existing AI tech to augment anti-malaria efforts or what-have-you.

Comment by RobBensinger on [deleted post] 2021-04-20T18:06:13.658Z

Makes sense to me!

Comment by RobBensinger on Avoiding the Repugnant Conclusion is not necessary for population ethics: new many-author collaboration. · 2021-04-20T17:17:30.358Z · EA · GW

while I respect the urge to move a stale conversation on, I don't think the authors provide new object-level reasons to do so.

If adequate object-level reasons were already provided for something, but a field hasn't updated on those reasons, then what should a field do?

Two ideas that come to mind:

  • Summarize and/or signal-boost the existing reasons.
  • Write a paper speculating about why, psychologically or sociologically, the field hasn't updated enough, in the hope that this will cause the field to reflect on its mistakes and change.

The Utilitas paper falls in the first category. (It does summarize / signal-boost past psychological accounts of why people have put too much weight on anti-repugnant-conclusion intuitions; but it doesn't offer new explanations of why people didn't update on those past psychological accounts and other arguments.) 

Regardless of the merits of the second category, I'm not keen on the idea of getting rid of the first category, because I think one of the bigger reasons the world's institutions are failing today, and one of the bigger reasons science is dysfunctional, is an over-emphasis on advancing-the-frontiers-of-knowledge over summarizing-and-synthesizing-what's-known within science and academia.

Cf. Holden Karnofsky's account of science.

Comment by RobBensinger on [deleted post] 2021-04-20T16:47:25.839Z

Previous name for this tag was "epistemic humility", which seems confusing because 'modesty vs. humility' is an old distinction on LessWrong, and 'epistemic humility' here is referring more to the 'modesty' side than the 'humility' side.  I've changed it to "deference and social epistemology" for now.

Twelve Virtues of Rationality: "To be humble is to take specific actions in anticipation of your own errors. [...] Who are most humble? Those who most skillfully prepare for the deepest and most catastrophic errors in their own beliefs and plans. [...] To be human is to make ten thousand errors. No one in this world achieves perfection."

The Proper Use of Humility: "This is social modesty, not humility. It has to do with regulating status in the tribe, rather than scientific process. If you ask someone to 'be more humble,' by default they’ll associate the words to social modesty—which is an intuitive, everyday, ancestrally relevant concept. Scientific humility is a more recent and rarefied invention, and it is not inherently social. Scientific humility is something you would practice even if you were alone in a spacesuit, light years from Earth with no one watching. Or even if you received an absolute guarantee that no one would ever criticize you again, no matter what you said or thought of yourself. You’d still double-check your calculations if you were wise."

Status Regulation and Anxious Underconfidence: "I try to be careful to distinguish the virtue of avoiding overconfidence, which I sometimes call 'humility,' from the phenomenon I’m calling 'modest epistemology.'"

Comment by RobBensinger on Avoiding the Repugnant Conclusion is not necessary for population ethics: new many-author collaboration. · 2021-04-18T22:40:43.451Z · EA · GW

but then what else do you want people to do?

The paper doesn't provide a roadmap for this, but it does indicate what kinds of problems it thinks are more worthy of population ethicists' time: problems that help us make real-world moral decisions.

"Ethical arguments are widely used in public debate, everyday decision-making, and policy-making. For example, ethical arguments against social inequality and discrimination are common – although not universal, not always successful, and not always correct. Many public decisions affect the world's future population. Population ethics is therefore an essential foundation for making these decisions properly. It is not simply an academic exercise, and we should not let it be governed by undue attention to one consideration."

Comment by RobBensinger on Launching a new resource: 'Effective Altruism: An Introduction' · 2021-04-17T23:37:41.537Z · EA · GW

Yeah, I endorse all of these things:

  • Criticizing 80K when you think they're wrong (especially about object-level factual questions like "is longtermism true?").
  • Criticizing EAs when you think they're wrong even if you think they've spent hundreds of hours reaching some conclusion, or producing some artifact.
    • (I.e.: try to model how much thought and effort people have put into things, and keep in mind that no amount of effort makes you infallible. Even if it turns out the person didn't make a mistake, raising the question of whether they messed up can help make it clearer why a choice was made.)
  • Using the comment section on a post like this to solicit interest in developing a competitor-podcast-episode-intro-resource.
  • Loudly advertising your competitor episode here, so people can compare the merits of 80K's playlist to yours.

The thing I don't endorse is what I talk about in my comments.

Comment by RobBensinger on Launching a new resource: 'Effective Altruism: An Introduction' · 2021-04-17T22:02:47.900Z · EA · GW

(1) (far) future lives, (2) near-term humans, (3) (near-term) animals

This isn't the main problem I had in mind, but it's worth noting that EA animal advocacy is also aimed at improving welfare and/or preventing suffering in future minds, even when it's not aimed at far-future animals. The goal of factory farm reform for chickens is to affect (or prevent) future chickens, not chickens that are alive at the time people develop or push for the reform.

Comment by RobBensinger on Launching a new resource: 'Effective Altruism: An Introduction' · 2021-04-17T16:22:54.732Z · EA · GW

The point of EA is to work out how to do the most good, then do it. There are three target groups one might try to benefit - (1) (far) future lives, (2) near-term humans, (3) (near-term) animals. Given this, one cannot, in good faith, call something an 'introduction' when it focuses almost exclusively on object-level attempts to benefit just one group.

This is a specific way of framing EA, and one that I think feels natural in part for 'sociology and history of EA' reasons: individual EAs often self-identify as either interested in existential risk, interested in animal welfare, or interested in third-world development, in large part due to the early influence of Peter Singer, GiveWell, LessWrong, and the Oxford longtermists, who broke in different directions on these questions. The EA Funds use a division like this, and early writing about EA liked to emphasize this division.

But I don't agree that this is the most natural (much less the only reasonable) way of dividing up the space of high-impact altruistic goals or projects, so I don't think all intro resources should emphasize this framing. 

If you'd framed EA as being about '(1) causing positive experiences and (2) preventing negative ones', you could have argued that EA is about the choice between negative-leaning and positive-leaning utilitarianism, and that all intro resources must put similar emphasis on those two perspectives (regardless of the merits of the perspectives in the eyes of the intro-resource-maker).

If you'd framed EA as being about 'direct aid, institution reform, cause prioritization, and improving EAs' effectiveness', you could argue that any intro resource is obviously bad if it neglects any one of those categories, even if it's just because they're carving up the space differently.

If you'd framed EA as being about 'helping people in the developed world, helping people in the developing world, helping animals, or helping far-future lives', then we'd have needed to give equal prominence to more nationalist and regionalist perspectives on altruism as well.

My main objection is to the structure of this argument. There are worlds where EA initially considered it an open question whether nationalism is a reasonable perspective to bring to cause prioritization; and worlds where lots of EAs later realized they were wrong and nationalism isn't a good perspective. In those worlds, it's important that we not be so wedded to early framings of 'the key disagreements in the movement' that no one can ever move on from treating nationalist-EA as a contender.

(This isn't intended as an argument for 'our situation is analogous to the nationalism one'; it's intended as a structural objection to arguments that take for granted a certain framing of EA, require all intro sources to fit that frame, and make it hard to update away from that frame in worlds where some of the options do turn out to be bad.)

Comment by RobBensinger on Launching a new resource: 'Effective Altruism: An Introduction' · 2021-04-16T19:17:34.254Z · EA · GW

Conversely, if the 80K intro podcast list was just tossed together in a few minutes without much concern for narrative flow / sequencing / cohesiveness, then I'm much less averse to redesign-via-quick-EA-Forum-comments. :)

Comment by RobBensinger on Launching a new resource: 'Effective Altruism: An Introduction' · 2021-04-16T19:15:37.090Z · EA · GW

Possible-bias disclosure: am longtermist, focused on x-risk.

I haven't heard all of the podcast episodes under consideration, but methodologically I like the idea of there being a wide variety of 'intro' EA resources that reflect different views of what EA causes and approaches are best, cater to different audiences, and employ different communication/pedagogy methods. If there's an unresolved disagreement about one of those things, I'd usually rather see people make new intro resources to compete with the old one, rather than trying to make any one resource universally beloved (which can lead to mediocre or uncohesive designed-by-committee end products).

In this case, I'd rather see someone put together a new, more shorttermist collection of podcast episodes, and see whether a cohesive, useful playlist can be designed that way.

And if hours went into carefully picking the original ten episodes and deciding how to sequence them, I'd like to see modifications made via a process of re-listening to different podcasts for hours and experimenting with their effects in different orders, seeing what "arcs" they form, etc., rather than via quick EA Forum comments and happy recollections of isolated episodes.

Comment by RobBensinger on Avoiding the Repugnant Conclusion is not necessary for population ethics: new many-author collaboration. · 2021-04-16T15:03:55.268Z · EA · GW

Intuition check: If philosophy were a brain, and published articles were how it did its "thinking", then would it reach better conclusions if it avoided thinking about whether it's giving too much attention to a topic?

In the case of an individual, we value the idea of reflecting on your thought process and methodology. Reasoning about your own reasoning is good -- indeed, such thoughts can be among the most leveraged parts of a person's life, since any improvements you make to your allocation of effort or the quality of your reasoning will improve all your future reasoning.

The group version of this is reasoning about whether the group is reasoning well, or whether the group is misallocating its attention and effort.

You could argue that articles like this are unnecessary even when a field goes totally off the rails, because academic articles aren't the only way the field can think. Individuals in the field can think in the privacy of their own head, and reach correlated conclusions because the balance of evidence is easy to assess. They can talk at conferences, or send emails to each other.

But if those are an essential part of the intellectual process anyway, I guess I don't see the value in trying to hide that process from the public eye or the public record. And once they're public, I'm not sure it matters much whether it's a newspaper article, a journal article, or a blog post.

Comment by RobBensinger on What are your main reservations about identifying as an effective altruist? · 2021-03-30T22:28:44.506Z · EA · GW

Yeah, I'm an EA: an Estimated-as-Effective-in-Expectation (in Excess of Endeavors with Equivalent Ends I've Evaluated) Agent with an Audaciously Altruistic Agenda.

Comment by RobBensinger on Julia Galef and Matt Yglesias on bioethics and "ethics expertise" · 2021-03-30T03:42:54.170Z · EA · GW

Recommended! :) 

Comment by RobBensinger on Politics is far too meta · 2021-03-19T01:48:06.324Z · EA · GW

It's frustrating to have people who agree with you bat for the other team.

I don't like "bat for the other team" here; it reminds me of "arguments are soldiers" and the idea that people on your "side" should agree your ideas are great, while the people who criticize your ideas are the enemy.

Criticism is good! Having accurate models of tractability (including political tractability) is good!

What I would say is:

  • Some "criticisms" are actually self-fulfilling prophecies, rather than being objective descriptions of reality. EAs aren't wary enough of these, and don't have strong enough norms against meta/PR becoming overrepresented or leaking into object-level discussions. This is especially bad in early-stage brainstorming and discussion.
  • On Doing the Improbable + Status Regulation and Anxious Underconfidence: EAs are far too inclined to abandon high-EV ideas that are <50% likely to succeed. There should be a far larger number of failures, weird experiments, and risky bets in EA. If you're too willing to give up at the smallest problem, then "seeking out criticism" can turn into "seeking out rationalizations for inaction" (or "seeking out rationalizations for only doing normal/simple/predictable things").

Using a general reference class when you have a better, more specific class available

I agree this is one of the biggest things EAs currently tend to get wrong. I'd distinguish two kinds of mistake here, both of which I think EAs tend to make:

  • Over-relying on outside views over inside views. Inside views (making predictions based on details and causal mechanisms) and outside views (making predictions based on high-level similarities) are both important, but EA currently puts too much thought into outside views and not enough into inside views. If you're NASA, your outside views help you predict budget and time overruns and build in good safety/robustness margins, while your inside views let you build a rocket at all.
  • Picking the wrong outside view / reference class, or not even considering the different reference classes on offer. Picking a good reference class can be extremely difficult; in some cases, many years of accumulated domain expertise may be the only thing that allows you to spot the right surface similarities to put your weight down on.

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-18T17:28:55.170Z · EA · GW

Relatedly, in my experience 'writing an article or blog post' can have bad effects on my ability to reason about stuff. I want to say things that are relevant and congruent and that flow together nicely; but my actual thought process includes a bunch of zig-zagging and updating and sorting-through-thoughts-that-don't-initially-make-perfect-crisp-sense. So focusing on the writing makes me focus less on my thought process, and it becomes tempting for me to confuse the writing process or written artifact for my thought process or beliefs.

You've spent a lot of time living and breathing EA/rationalist stuff, so I don't know that I have any advice that will be useful to you. But if I were giving advice to a random reporter, I'd warn about the above phenomenon and say that this can lead to overconfidence when someone's just getting started adding probabilistic forecasts to their blogging.

I think this calibration-and-reflection bug is important -- it's a bug in your ability to recognize what you believe, not just in your ability to communicate it -- and I think it's fixable with some practice, without having to do the superforecaster 'sink lots of hours into getting expertise about every topic you predict' thing.

(And I don't know, maybe the journey to fixing this could be an interesting one that generates an article of its own? Maybe a thing that could be linked to at the bottom of posts to give context for readers who are confused about why the numbers are there and why they're so low-confidence?)

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-18T17:25:09.081Z · EA · GW

If you haven't spent time on calibration training, I recommend it! Open Phil has a tool here: https://www.openphilanthropy.org/blog/new-web-app-calibration-training. Making good forecasts is a mix of 'understand the topic you're making a prediction about' and 'understand yourself well enough to interpret your own feelings of confidence'. Even if they mostly don't have expertise in the topic they're writing about, I think most people can become pretty well-calibrated with an hour or two of practice.
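
To make "calibration" a bit more concrete, here's a toy sketch (my own made-up forecasts, not the Open Phil tool): group your predictions by stated confidence and check how often each group actually came true; for a well-calibrated forecaster the two numbers roughly match.

```python
from collections import defaultdict

def calibration_report(forecasts):
    """forecasts: list of (stated_probability, came_true) pairs.

    Groups forecasts into coarse confidence buckets and compares the average
    stated probability in each bucket to the fraction that actually came true.
    For a well-calibrated forecaster the two numbers roughly match.
    """
    buckets = defaultdict(list)
    for p, outcome in forecasts:
        buckets[round(p, 1)].append((p, outcome))
    for level in sorted(buckets):
        entries = buckets[level]
        stated = sum(p for p, _ in entries) / len(entries)
        observed = sum(1 for _, o in entries if o) / len(entries)
        print(f"stated ~{stated:.0%}, observed {observed:.0%} (n={len(entries)})")

# Made-up forecaster who says "90%" but is only right ~70% of the time,
# and says "60%" and is right about 60% of the time.
example = [(0.9, True)] * 7 + [(0.9, False)] * 3 + [(0.6, True)] * 6 + [(0.6, False)] * 4
calibration_report(example)
```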

And that's a valuable service in its own right, I think. It would be a major gift to the public even if the only take-away readers got from predictions at the end of articles were 'wow, even though these articles sound confident, the claims almost always tend to be 50% or 60% probable according to the reporter; guess I should keep in mind these topics are complex and these articles are being banged out in a few hours rather than being the product of months of study, so of course things are going to end up being pretty uncertain'.

If you also know enough about a topic to make a calibrated 80% or 90% (or 99%!) prediction about it, that's great. But one of the nice things about probabilities is just that they clarify what you're saying -- they can function like an epistemic status disclaimer that notes how uncertain you really are, even if it was hard to make your prose flow without sounding kinda confident in the midst of the article. Making probabilistic predictions doesn't have to be framed as 'here's me using my amazing knowledge of the world to predict the future'; it can just be framed as an attempt to disambiguate what you were saying in the article.

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-18T17:04:25.088Z · EA · GW

Thanks, Tom. :) I'm interested to hear about reporters who aren't "EA-ish" but are worth paying attention to anyway — I think sometimes EA's blind spots arise from things that don't have the EA "vibe" but that would come up in a search anyway if you just classified writers by "awesome", "insightful", "unusually rigorous and knowledgeable", "getting at something important", etc.

For people who missed my post: Politics Is Far Too Meta

Comment by RobBensinger on Politics is far too meta · 2021-03-18T00:45:30.654Z · EA · GW

Thanks to Chana Messinger for discussing some of these topics with me. Any remaining errors in the post are society's fault for raising me wrong, not Chana's.

Note on the definitions: People use the word "meta" to refer to plenty of other things. If you're in a meeting to discuss Clinton's electability and someone raises a point of process, you might want to call that "meta" and distinguish it from "object-level" discussion of electability. When I define "meta", I'm just clarifying terminology in the post itself, not insisting that other posts use "meta" to refer to the exact same things.

Comment by RobBensinger on Strong Evidence is Common · 2021-03-14T21:35:59.915Z · EA · GW

More discussion on LessWrong: https://www.lesswrong.com/posts/JD7fwtRQ27yc8NoqS/strong-evidence-is-common#comments 

Comment by RobBensinger on Strong Evidence is Common · 2021-03-14T02:29:01.424Z · EA · GW

More Facebook discussion of this post:

___________________________

Ronny Fernandez:  I think maybe what’s actually going on here is that extraordinary claims usually have much lower prior prob than 10^-6

Genuinely extraordinary claims, not claims that seem weird

Comment by RobBensinger on Strong Evidence is Common · 2021-03-14T02:27:55.507Z · EA · GW

More Facebook discussion of this post:

___________________________

Satvik Beri:  I think Bayes' Theorem is extremely hard to apply usefully, to the point that I rarely use it at all despite working in data science.

A major problem that leads people to be underconfident is the temptation to round down evidence to reasonable odds, like the post mentions. A major problem that leads people to be overconfident is applying lots of small pieces of information while discounting the correlations between them.

A comment [on LessWrong] mentions that if you have excellent returns for a year, that's strong evidence you're a top 1% trader. That's not really true, the market tends to move in regimes for long periods of time, so a strategy that works well for a year is pretty likely to have average performance the next year. Studies on hedge fund managers have found it is extremely difficult to find consistent outperformers, e.g. 5-year performance on pretty much any metric is uncorrelated to the performance on that metric next year.
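
To illustrate Satvik's point about correlations with some made-up numbers: if two signals are really just one signal seen twice, multiplying their likelihood ratios as though they were independent overstates the update.

```python
def posterior_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by each likelihood ratio, as if the signals were independent."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

prior = 1.0   # 1:1 prior odds on the hypothesis
lr = 4.0      # each signal favors the hypothesis 4:1 taken on its own

# Two perfectly correlated signals treated as independent: 16:1 odds (~94%).
naive = posterior_odds(prior, [lr, lr])

# Recognizing that the second signal adds nothing given the first: 4:1 odds (80%).
correlation_aware = posterior_odds(prior, [lr])

print(naive, correlation_aware)  # 16.0 4.0
```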

Comment by RobBensinger on Strong Evidence is Common · 2021-03-14T02:24:40.511Z · EA · GW

Facebook discussion of this post:

___________________________

Duncan Sabien:  This is ... not a clean argument. Haven't read the full post, but I feel the feeling of someone trying to do sleight-of-hand on me.

[Added by Duncan: "my apologies for not being able to devote more time to clarity and constructivity.  Mark Xu is good people in my experience."]

Rob Bensinger:  Isn't 'my prior odds were x, my posterior odds were y, therefore my evidence strength must be z' already good enough?

Are you worried that the person might not actually have a posterior that extreme? Like, if they actually took 21 bets like that they'd get more than 1 of them wrong?

Guy Srinivasan:  I feel like "fight! fight!" except with the word "unpack!"

Duncan Sabien:  > The prior odds that someone’s name is 'Mark Xu' are generously 1:1,000,000. Posterior odds of 20:1 implies that the odds ratio of me saying 'Mark Xu' is 20,000,000:1, or roughly 24 bits of evidence. That’s a lot of evidence.

This is beyond "spherical frictionless cows" and into disingenuous adversarial levels of oversimplification. I'm having a hard time clarifying what's sending up red flags here, except to say "the claim that his mere assertion provided 24 bits of evidence is false, and saying it in this oddly specific and confident way will cow less literate reasoners into just believing him, and I feel gross."

Guy Srinivasan:  Could it be that there's a smuggled intuition here that we're trying to distinguish between names in a good faith world, and that the bad faith hypothesis is important in ways that "the name might be John" isn't, and that just rounding it off to bits of evidence makes it seem like the extra 0.1 bits "maybe this exchange is bad faith" are small in comparison when actually they are the most important bits to gain?

(the above is not math)

Marcello Herreshoff:  I share Duncan's intuition that there's a sleight of hand happening here. Here's my candidate for where the sleight of hand might live:

Vast odds ratios do lurk behind many encounters, but specifically, they show up much more often in situations that raise an improbable hypothesis to consideration worthiness (as in Mark Xu's first set of examples) than in the situation where they raise consideration worthy hypotheses to very high levels of certainty (as in Mark Xu's second set of examples.)

Put another way, how correlated your available observations are to some variable puts a ceiling on how certain you're ever allowed to get about that variable. So we should often expect the last mile of updates in favor of a hypothesis to be much harder to obtain than the first mile.

Ronny Fernandez:  @Duncan Sabien   So is the prior higher or is the posterior lower?

Chana Messinger:  I wonder if this is similar to my confusion at whether expected conservation of evidence is violated if you have a really good experiment that would give you strong evidence for A if it comes out one way and strong evidence for B if it comes out the other way.

Ronny Fernandez:  @Marcello Mathias Herreshoff I don’t think I actually understand the last paragraph in your explanation. Feel like elaborating?

Marcello Herreshoff:  Consider the driver's license example. If we suppose 1/1000 of people are identity thieves carrying perfect driver's license forgeries (of randomly selected victims), then there is absolutely nothing you can do (using drivers licenses alone) to get your level of certainty that the person you're talking to is Mark Xu above 99.9%, because the evidence you can access can't separate the real Mark Xu from a potential impersonator. That's the flavor of effect the first sentence of the last paragraph was trying to point at.
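
For anyone who wants the arithmetic in this thread spelled out, here's a quick sketch using the numbers quoted above (a toy calculation, not anything from the original post): the strength of the evidence is just posterior odds divided by prior odds, and Marcello's forgery scenario caps the posterior no matter how convincing the license looks.

```python
from math import log2

# Mark Xu's example: prior odds of 1:1,000,000 that a stranger is named "Mark Xu",
# posterior odds of 20:1 after he says so.
prior_odds = 1 / 1_000_000
posterior_odds = 20 / 1
bayes_factor = posterior_odds / prior_odds        # 20,000,000
print(f"{bayes_factor:,.0f}:1, about {log2(bayes_factor):.1f} bits")  # ~24.3 bits

# Marcello's ceiling: if 1 in 1,000 license-presenters is an impersonator with a
# perfect forgery, the license is equally likely under "real" and "impersonator",
# so no amount of license-checking moves you past 99.9%.
p_real, p_impersonator = 0.999, 0.001
likelihood_ratio = 1.0  # a perfect forgery looks identical to the real thing
posterior_real = (p_real * likelihood_ratio) / (p_real * likelihood_ratio + p_impersonator)
print(f"{posterior_real:.3%}")  # 99.900%
```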

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-11T15:41:20.996Z · EA · GW

Are there journalists or outlets you think EAs or rationalists should especially be following? Particularly ones who might not already be on our radar?

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-11T15:37:45.048Z · EA · GW

How do stats get misrepresented in the news? What can you do to spot it when they are? :)

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-11T15:35:42.470Z · EA · GW

And: Are there common proposals for reforming or improving journalism that you don't think are good ideas?

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-11T15:35:08.191Z · EA · GW

If you could snap your fingers and change some things about journalistic norms, what kinds of articles tend to get written, how articles and news sites are structured, etc., what would you change?

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-11T15:33:57.942Z · EA · GW

What's UnHerd? Is there anything unusual about it, or should I approximately treat it as "typical news outlet that happens to host your content"?

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-11T15:31:07.626Z · EA · GW

What are two problems/bottlenecks you wish EAs spent more time thinking/working on?

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-11T15:30:11.512Z · EA · GW

What books or articles (or movies, podcast episodes, etc.) would you most recommend to a random thinker (or, say, a random science journalist) if they wanted to level up their rationality and/or their altruistic efficacy?

Comment by RobBensinger on AMA: Tom Chivers, science writer, science editor at UnHerd · 2021-03-11T15:25:10.646Z · EA · GW

If you could re-write The Rationalist's Guide to the Galaxy today (and you weren't too worried about making the book too long), what are the ~three things you'd add that aren't covered there?

Comment by RobBensinger on Some thoughts on the EA Munich // Robin Hanson incident · 2020-09-02T16:27:47.084Z · EA · GW

I believe that the social dynamics leading to development of CC do not depend on the balance of opinions favoring CC, and only require that those who are against it are afraid to speak up honestly and publicly

I agree with this. This seems like an opportune time for me to say in a public, easy-to-google place that I think cancel culture is a real thing, and very harmful.

Comment by RobBensinger on Study results: The most convincing argument for effective donations · 2020-08-30T02:48:23.426Z · EA · GW

I haven't talked to Schwitzgebel, and you're of course welcome to pass this all on. :)

Comment by RobBensinger on Study results: The most convincing argument for effective donations · 2020-08-27T17:22:47.948Z · EA · GW

An anonymous comment someone asked me to post for them (similar to Mati Roy's recent comment):

I always felt kind of uneasy about Eric Schwitzgebel's competition to find the most convincing argument for making people donate. It felt kind of symmetric in a way I don't like, like you could do that for any action you want to convince people to take. I also pattern matched it to starting with a conclusion and then trying to find all the best arguments for that conclusion, which is a grave sin in my culture.
I thought of something I could do to make it better, which I am probably not going to do because I feel like I would get yelled at.
edit: I'm not sure it did actually work like this, but it was something similar.
The way his competition worked was that he had subjects read the arguments submitted by competitors, then gave them 10 usd, and had them decide how much of that 10 usd if any they wanted to give to charity. People can now publicize the argument that turns out to most reliably cause readers to donate most. My idea to correct for this is to hold the opposite competition. What argument is best for convincing people not to donate, measured the same way? Then we could publicize both arguments together. Probably not going to do this, but I wish somebody would, and I would support them.

I took "It felt kind of symmetric" to be referring to Guided By The Beauty Of Our Weapons, and "starting with a conclusion and then trying to find all the best arguments for that conclusion" to be referring to The Bottom Line.

I replied (edited):

This is a cool idea!
Another potential problem is that persuasiveness isn't the same thing as accuracy, informational value, honesty, epistemic empowerment, etc.
And a third problem is that since each submission is trying to be maximally convincing, there's no incentive for any submission to note the weaknesses, limitations, or caveats affecting its own arguments. A 'debate' format, allowing for the two sides to respond to each other, seems better than a 'we each make our own arguments in separate rooms from each other' format, since it's a lot easier to come away uninformed or lacking important context if you don't hear anyone pick apart the original arguments.
Maybe the best version of this contest would be some version of https://rationalconspiracy.com/2017/01/03/four-layers-of-intellectual-conversation/, where people submit arguments for or against a proposition, and the most compelling argument wins; then people submit rebuttals to the most compelling argument, and the most compelling rebuttal wins; then the original winner gets a chance to respond, and the second winner gets a chance to counter-respond. Then all four entries get posted together, so the argument gets a fairer hearing.

Comment by RobBensinger on The academic contribution to AI safety seems large · 2020-07-31T16:43:07.211Z · EA · GW

I agree with Max's take. MIRI researchers still look at Alignment for Advanced Machine Learning Systems (AAMLS) problems periodically, but per Why I am not currently working on the AAMLS agenda, they mostly haven't felt the problems are tractable enough right now to warrant a heavy focus.

Nate describes our new work here: 2018 Update: Our New Research Directions.

Since 2016, actually “about half” of MIRI’s research has been on their ML agenda, apparently to cover the chance of prosaic AGI.

I don't think any of MIRI's major research programs, including AAMLS, have been focused on prosaic AI alignment. (I'd be interested to hear if Jessica or others disagree with me.)

Paul introduced prosaic AI alignment (in November 2016) with:

It’s conceivable that we will build “prosaic” AGI, which doesn’t reveal any fundamentally new ideas about the nature of intelligence or turn up any “unknown unknowns.” I think we wouldn’t know how to align such an AGI; moreover, in the process of building it, we wouldn’t necessarily learn anything that would make the alignment problem more approachable. So I think that understanding this case is a natural priority for research on AI alignment.

In contrast, I think of AAMLS as assuming that we'll need new deep insights into intelligence in order to actually align an AGI system. There's a large gulf between (1) "Prosaic AGI alignment is feasible" and (2) "AGI may be produced by techniques that are descended from current ML techniques" or (3) "Working with ML concepts and systems can help improve our understanding of AGI alignment", and I think of AAMLS as assuming some combination of 2 and 3, but not 1. From a post I wrote in July 2016:

[... AAMLS] is intended to help more in scenarios where advanced AI is relatively near and relatively directly descended from contemporary ML techniques, while our agent foundations agenda is more agnostic about when and how advanced AI will be developed.
As we recently wrote, we believe that developing a basic formal theory of highly reliable reasoning and decision-making “could make it possible to get very strong guarantees about the behavior of advanced AI systems — stronger than many currently think is possible, in a time when the most successful machine learning techniques are often poorly understood.” Without such a theory, AI alignment will be a much more difficult task.
The authors of “Concrete problems in AI safety” write that their own focus “is on the empirical study of practical safety problems in modern machine learning systems, which we believe is likely to be robustly useful across a broad variety of potential risks, both short- and long-term.” Their paper discusses a number of the same problems as the [AAMLS] agenda (or closely related ones), but directed more toward building on existing work and finding applications in present-day systems.
Where the agent foundations agenda can be said to follow the principle “start with the least well-understood long-term AI safety problems, since those seem likely to require the most work and are the likeliest to seriously alter our understanding of the overall problem space,” the concrete problems agenda [by Amodei, Olah, Steinhardt, Christiano, Schulman, and Mané] follows the principle “start with the long-term AI safety problems that are most applicable to systems today, since those problems are the easiest to connect to existing work by the AI research community.”
Taylor et al.’s new [AAMLS] agenda is less focused on present-day and near-future systems than “Concrete problems in AI safety,” but is more ML-oriented than the agent foundations agenda.

Comment by RobBensinger on Are Humans 'Human Compatible'? · 2020-01-31T07:01:11.548Z · EA · GW

I agree that suffering is bad in all universes, for the reasons described in https://www.lesswrong.com/posts/zqwWicCLNBSA5Ssmn/by-which-it-may-be-judged. I'd say that "ethics... is not constituted by any feature of the universe" in the sense you note, but I'd point to our human brains if we were asking any question like:

  • What explains why "suffering is bad" is true in all universes? How could an agent realistically discover this truth -- how do we filter out the false moral claims and zero in on the true ones, and how could an alien do the same?

Comment by RobBensinger on Are Humans 'Human Compatible'? · 2019-12-10T05:05:57.209Z · EA · GW

I doubt that a utilitarian ethic is useful for maximizing of human preferences, since utilitarianism is impartial in the sense that it takes everyone's wellbeing into account, human or otherwise.

The view I would advocate is that something like utilitarianism (i.e., some form of impartial, species-indifferent welfare maximization) is a core part of human values. What I mean by 'human values' here isn't on your list; it's closer to an idealized version of our preferences: what we would prefer if we were smarter, more knowledgeable, had greater self-control.

Russell's assumption that "The machine’s only objective is to maximize the realization of human preferences" seems to assume some controversial and (to my judgement) highly implausible moral views. In particular, it is speciesistic, for why should only human preferences be maximized? Why not animal or machine preferences?

The language of "human-compatible" is very speciesist, since ethically we should want AGI to be "compatible" with all moral patients, human or not.

On the other hand, the idea of using human brains as a "starting point" for identifying what's moral makes sense. "Which ethical system is correct?" isn't written in the stars or in Plato's heaven; it seems like if the answer is encoded anywhere in the universe, it must be encoded in our brains (or in logical constructs out of brains).

The same is true for identifying the right notion of "impartial", "fair", "compassionate", "taking other species' welfare into account", etc.; to figure out the correct moral account of those important values, you would primarily need to learn facts about human brains. You'd then need to learn facts about non-humans' brains in order to implement the resultant impartiality procedure (because the relevant criterion, "impartiality", says that whether you have human DNA is utterly irrelevant to moral conduct).

The need to bootstrap from values encoded in our brains doesn't and shouldn't mean that humans are the only moral patients (or even that we're particularly important moral patients; insects could turn out to be utility monsters, for all we know today). Hence "human-compatible" is an unfortunate phrase here.

But it does mean that if, e.g., it turns out that cats' ultimate true preferences are to torture all species forever, we shouldn't give that particular preference equal decision weight. Speaking very loosely, the goal is more like 'ensuring all beings gets to have a good life', not like 'ensuring all species (however benevolent or sadistic they turn out to be) get an equal say in what kind of life all beings get to live'.

If there's a more benevolent species than humans, I'd hope that sufficiently advanced science could identify that species, and pass the buck to them. (In an odd sense, we're already building an alien species to defer to if we're constructing 'an idealized version of human preferences', since I would expect sufficiently idealized preferences to turn out to be pretty alien compared to the views human beings espouse today.)

I think it's reasonable to worry that given humans' flaws, humans might not in fact build AGI that 'ensures all beings get to have a good life'. But I do think that something like the latter is the goal; and when you ask me what physical facts in the world make that 'the goal', and what we would need to investigate in order to work out all the wrinkles and implementation details, I'm forced to initially point to facts about human brains (if only to identify the right notions of 'what a moral patient is' and 'how one ought to impartially take into account all moral patients' welfare').

Comment by RobBensinger on EA Leaders Forum: Survey on EA priorities (data and analysis) · 2019-12-08T22:30:56.377Z · EA · GW

Speaking as a random EA — I work at an org that attended the forum (MIRI), but I didn't personally attend and am out of the loop on just about everything that was discussed there — I'd consider it a shame if CEA stopped sharing interesting take-aways from meetings based on an "everything should either be 100% publicly disclosed or 100% secret" policy.

I also don't think there's anything particularly odd about different orgs wanting different levels of public association with EA's brand, or having different levels of risk tolerance in general. EAs want to change the world, and the most leveraged positions in the world don't perfectly overlap with 'the most EA-friendly parts of the world'. Even in places where EA's reputation is fine today, it makes sense to have a diversified strategy where not every wonkish, well-informed, welfare-maximizing person in the world has equal exposure if EA's reputation takes a downturn in the future.

MIRI is happy to be EA-branded itself, but I'd consider it a pretty large mistake if MIRI started cutting itself off from everyone in the world who doesn't want to go all-in on EA (refuse to hear their arguments or recommendations, categorically disinvite them from any important meetings, etc.). So I feel like I'm logically forced to say this broad kind of thing is fine (without knowing enough about the implementation details in this particular case to weigh in on whether people are making all the right tradeoffs).

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-28T00:49:35.393Z · EA · GW

+1, I agree with all this.

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T07:24:25.987Z · EA · GW

This is a good discussion! Ben, thank you for inspiring so many of these different paths we've been going down. :) At some point the hydra will have to stop growing, but I do think the intuitions you've been sharing are widespread enough that it's very worthwhile to have public discussion on these points.

Therefore, when a member of the rationalist community uses the word "decision theory" to refer to a decision procedure, they are talking about something that's pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems don't directly speak to the questions that most academic "decision theorists" are actually debating with one another.

On the contrary:

  • MIRI is more interested in identifying generalizations about good reasoning ("criteria of rightness") than in fully specifying a particular algorithm.
  • MIRI does discuss decision algorithms in order to better understand decision-making, but this isn't different in kind from the ordinary way decision theorists hash things out. E.g., the traditional formulation of CDT is underspecified in dilemmas like Death in Damascus. Joyce and Arntzenius' response to this wasn't to go "algorithms are uncouth in our field"; it was to propose step-by-step procedures that they think capture the intuitions behind CDT and give satisfying recommendations for how to act.
  • MIRI does discuss "what decision procedure performs best", but this isn't any different from traditional arguments in the field like "naive EDT is wrong because it performs poorly in the smoking lesion problem". Compared to the average decision theorist, the average rationalist puts somewhat more weight on some considerations and less weight on others, but this isn't different in kind from the ordinary disagreements that motivate different views within academic decision theory, and these disagreements about what weight to give categories of consideration are themselves amenable to argument.
  • As I noted above, MIRI is primarily interested in decision theory for the sake of better understanding the nature of intelligence, optimization, embedded agency, etc., not for the sake of picking a "decision theory we should build into future AI systems". Again, this doesn't seem unlike the case of philosophers who think that decision theory arguments will help them reach conclusions about the nature of rationality.

I think it's totally conceivable that no criterion of rightness is correct (e.g. because the concept of a "criterion of rightness" turns out to be some spooky bit of nonsense that doesn't really map onto anything in the real world.)

Could you give an example of what the correctness of a meta-criterion like "Don't Make Things Worse" could in principle consist in?

I’m not looking here for a “reduction” in the sense of a full translation into other, simpler terms. I just want a way of making sense of how human brains can tell what’s “decision-theoretically normative” in cases like this.

Human brains didn’t evolve to have a primitive “normativity detector” that beeps every time a certain thing is Platonically Normative. Rather, different kinds of normativity can be understood by appeal to unmysterious matters like “things brains value as ends”, “things that are useful for various ends”, “things that accurately map states of affairs”...

When I think of other examples of normativity, my sense is that in every case there's at least one good account of why a human might be able to distinguish "truly" normative things from non-normative ones. E.g. (considering both epistemic and non-epistemic norms):


1. If I discover two alien species who disagree about the truth-value of "carbon atoms have six protons", I can evaluate their correctness by looking at the world and seeing whether their statement matches the world.

2. If I discover two alien species who disagree about the truth value of "pawns cannot move backwards in chess" or "there are statements in the language of Peano arithmetic that can neither be proved nor disproved in Peano arithmetic", then I can explain the rules of 'proving things about chess' or 'proving things about PA' as a symbol game, and write down strings of symbols that collectively constitute a 'proof' of the statement in question.

I can then assert that if any member of any species plays the relevant 'proof' game using the same rules, from now until the end of time, they will never prove the negation of my result, and (paper, pen, time, and ingenuity allowing) they will always be able to re-prove my result.

(I could further argue that these symbol games are useful ones to play, because various practical tasks are easier once we've accumulated enough knowledge about legal proofs in certain games. This usefulness itself provides a criterion for choosing between "follow through on the proof process" and "just start doodling things or writing random letters down".)

The above doesn't answer questions like "do the relevant symbols have Platonic objects as truthmakers or referents?", or "why do we live in a consistent universe?", or the like. But the above answer seems sufficient for rejecting any claim that there's something pointless, epistemically suspect, or unacceptably human-centric about affirming Gödel's first incompleteness theorem. The above is minimally sufficient grounds for going ahead and continuing to treat math as something more significant than theology, regardless of whether we then go on to articulate a more satisfying explanation of why these symbol games work the way they do.

3. If I discover two alien species who disagree about the truth-value of "suffering is terminally valuable", then I can think of at least two concrete ways to evaluate which parties are correct. First, I can look at the brains of a particular individual or group, see what that individual or group terminally values, and see whether the statement matches what's encoded in those brains. Commonly the group I use for this purpose is human beings, such that if an alien (or a housecat, etc.) terminally values suffering, I say that this is "wrong".

Alternatively, I can make different "wrong" predicates for each species: wrong_human, wrong_alien1, wrong_alien2, wrong_cat, etc.

This has the disadvantage of maybe making it sound like all these values are on "equal footing" in an internally inconsistent way ("it's wrong to put undue weight on what's wrong_alien!", where the first "wrong" is secretly standing in for "wrong_human"), but has the advantage of making it easy to see why the aliens' disagreement might be important and substantive, while still allowing that aliens' normative claims can be wrong (because they can be mistaken about their own core values).

The details of how to go from a brain to an encoding of "what's right" seem incredibly complex and open to debate, but it seems beyond reasonable dispute that if the information content of a set of terminal values is encoded anywhere in the universe, it's going to be in brains (or constructs from brains) rather than in patterns of interstellar dust, digits of pi, physical laws, etc.


If a criterion like “Don’t Make Things Worse” deserves a lot of weight, I want to know what that weight is coming from.

If the answer is “I know it has to come from something, but I don’t know what yet”, then that seems like a perfectly fine placeholder answer to me.

If the answer is “This is like the ‘terminal values’ case, in that (I hypothesize) it’s just an ineradicable component of what humans care about”, then that also seems structurally fine, though I’m extremely skeptical of the claim that the “warm glow of feeling causally efficacious” is important enough to outweigh other things of great value in the real world.

If the answer is “I think ‘Don’t Make Things Worse’ is instrumentally useful, i.e., more useful than UDT for achieving the other things humans want in life”, then I claim this is just false. But, again, this seems like the right kind of argument to be making; if CDT is better than UDT, then that betterness ought to consist in something.

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T06:48:02.423Z · EA · GW

Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there's no relevant sense in which the agent should have two-boxed.

No, I don't endorse this argument. To simplify the discussion, let's assume that the Newcomb predictor is infallible. FDT agents, CDT agents, and EDT agents each get a decision: two-box (which gets you $1000 plus an empty box), or one-box (which gets you $1,000,000 and leaves the $1000 behind). Obviously, insofar as they are in fact following the instructions of their decision theory, there's only one possible outcome; but it would be odd to say that a decision stops being a decision just because it's determined by something. (What's the alternative?)

I do endorse "given the predictor's perfect accuracy, it's impossible for the P_UDT agent to two-box and come away with $1,001,000". I also endorse "given the predictor's perfect accuracy, it's impossible for the P_CDT agent to two-box and come away with $1,001,000". Per the problem specification, no agent can two-box and get $1,001,000 or one-box and get $0. But this doesn't mean that no decision is made; it just means that the predictor can predict the decision early enough to fill the boxes accordingly.

(Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this "dominance" argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself. The reason non-CDT agents get more utility than CDT agents in Newcomb's problem is that they take into account that the predictor is a predictor when they construct their counterfactuals.)
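
To make the payoff structure concrete, here's a toy sketch of the dilemma with a perfect predictor (an illustration only, not anyone's formal decision theory): the predictor fills the opaque box based on the agent's policy, so a one-boxing policy nets $1,000,000, a two-boxing policy nets $1,000, and the $1,001,000 and $0 outcomes never occur.

```python
def newcomb_payoff(policy):
    """policy: 'one-box' or 'two-box'.

    A perfect predictor fills the opaque box with $1,000,000 iff it predicts
    one-boxing; the transparent box always contains $1,000.
    """
    prediction = policy  # a perfect predictor's prediction matches the policy
    opaque = 1_000_000 if prediction == "one-box" else 0
    transparent = 1_000
    return opaque if policy == "one-box" else opaque + transparent

print(newcomb_payoff("one-box"))   # 1000000
print(newcomb_payoff("two-box"))   # 1000
```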

In the transparent version of this dilemma, the agent who sees the $1M and one-boxes also "could have two-boxed", but if they had two-boxed, it would only have been after making a different observation. In that sense, if the agent has any lingering uncertainty about what they'll choose, the uncertainty goes away as soon as they see whether the box is full.

In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So -- insofar as we accept this kind of objection from determinism -- there seems to be something problematically non-naturalistic about discussing what "would have happened" if we built in one decision procedure or another.

No, there's nothing non-naturalistic about this. Consider the scenario you and I are in. Simplifying somewhat, we can think of ourselves as each doing meta-reasoning to try to choose between different decision algorithms to follow going forward; where the new things we learn in this conversation are themselves a part of that meta-reasoning.

The meta-reasoning process is deterministic, just like the object-level decision algorithms are. But this doesn't mean that we can't choose between object-level decision algorithms. Rather, the meta-reasoning (in spite of having deterministic causes) chooses either "I think I'll follow P_FDT from now on" or "I think I'll follow P_CDT from now on". Then the chosen decision algorithm (in spite of also having deterministic causes) outputs choices about subsequent actions to take. Meta-processes that select between decision algorithms (to put into an AI, or to run in your own brain, or to recommend to other humans, etc.) can make "real decisions", for exactly the same reason (and in exactly the same sense) that the decision algorithms in question can make real decisions.

It isn't problematic that all these processes require us to consider counterfactuals that (if we were omniscient) we would perceive as inconsistent/impossible. Deliberation, both at the object level and at the meta level, just is the process of determining the unique and only possible decision. Yet because we are uncertain about the outcome of the deliberation while deliberating, and because the details of the deliberation process do determine our decision (even as these details themselves have preceding causes), it feels from the inside of this process as though both options are "live", are possible, until the very moment we decide.

(See also Decisions are for making bad outcomes inconsistent.)


Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T05:42:47.141Z · EA · GW
But there's nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT.

If the thing being argued for is "R_CDT plus P_SONOFCDT", then that makes sense to me, but is vulnerable to all the arguments I've been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT's "Don't Make Things Worse" principle.

If the thing being argued for is "R_CDT plus P_FDT", then I don't understand the argument. In what sense is P_FDT compatible with, or conducive to, R_CDT? What advantage does this have over "R_FDT plus P_FDT"? (Indeed, what difference between the two views would be intended here?)

So why shouldn't I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.

The argument against "R_CDT plus P_SONOFCDT" doesn't require any mention of self-effacingness; it's entirely sufficient to note that P_SONOFCDT gets less utility than P_FDT.

The argument against "R_CDT plus P_FDT" seems to demand some reference to self-effacingness or inconsistency, or triviality / lack of teeth. But I don't understand what this view would mean or why anyone would endorse it (and I don't take you to be endorsing it).

For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don't think this ambiguity matters much for the argument.

We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what "expected utility" means.
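As a concrete (if toy) illustration of what "actual average utility" buys us: a quick simulation sketch, assuming a 99%-accurate predictor and the standard Newcomb payoffs; the accuracy figure is a placeholder of mine, and nothing hangs on it.

```python
# Comparing policies by realized average payoff over many Newcomb instances,
# rather than by any one theory's notion of "expected utility".
# (Placeholder accuracy of 99%; standard payoffs assumed.)
import random

def average_payoff(one_boxer: bool, trials: int = 100_000, accuracy: float = 0.99) -> float:
    total = 0
    for _ in range(trials):
        predicted_one_box = one_boxer if random.random() < accuracy else not one_boxer
        opaque = 1_000_000 if predicted_one_box else 0
        total += opaque if one_boxer else opaque + 1_000
    return total / trials

random.seed(0)
print(average_payoff(one_boxer=True))   # roughly 990,000 on average
print(average_payoff(one_boxer=False))  # roughly 11,000 on average
```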

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-27T05:22:21.379Z · EA · GW

So, as an experiment, I'm going to be a very obstinate reductionist in this comment. I'll insist that a lot of these hard-seeming concepts aren't so hard.

Many of them are complicated, in the fashion of "knowledge" -- they admit an endless variety of edge cases and exceptions -- but these complications are quirks of human cognition and language rather than deep insights into ultimate metaphysical reality. And where there's a simple core we can point to, that core generally isn't mysterious.

It may be inconvenient to paraphrase the term away (e.g., because it packages together several distinct things in a nice concise way, or has important emotional connotations, or does important speech-act work like encouraging a behavior). But when I say it "isn't mysterious", I mean it's pretty easy to see how the concept can crop up in human thought even if it doesn't belong on the short list of deep fundamental cosmic structure terms.

I would say that there's also at least a fourth way that philosophers often use the word "rational," which is also the main way I use the word "rational." This is to refer to an irreducibly normative concept.

Why is this a fourth way? My natural response is to say that normativity itself is either a messy, parochial human concept (like "love," "knowledge," "France"), or it's not (in which case it goes in bucket 2).

Some examples of concepts that are arguably irreducible are "truth," "set," "property," "physical," "existence," and "point."

Picking on the concept here that seems like the odd one out to me: I feel confident that there isn't a cosmic law (of nature, or of metaphysics, etc.) that includes "truth" as a primitive (unless the list of primitives is incomprehensibly long). I could see an argument for concepts like "intentionality/reference", "assertion", or "state of affairs", though the former two strike me as easy to explain in simple physical terms.

Mundane empirical "truth" seems completely straightforward. Then there's the truth of sentences like "Frodo is a hobbit", "2+2=4", "I could have been the president", "Hamburgers are more delicious than battery acid"... Some of these are easier or harder to make sense of in the naive correspondence model, but regardless, it seems clear that our colloquial use of the word "true" to refer to all these different statements is pre-philosophical, and doesn't reflect anything deeper than that "each of these sentences at least superficially looks like it's asserting some state of affairs, and each sentence satisfies the conventional assertion-conditions of our linguistic community".

I think that philosophers are really good at drilling down on a lot of interesting details and creative models for how we can try to tie these disparate speech-acts together. But I think there's also a common failure mode in philosophy of treating these questions as deeper, more mysterious, or more joint-carving than the facts warrant. Just because you can argue about the truthmakers of "Frodo is a hobbit" doesn't mean you're learning something deep about the universe (or even something particularly deep about human cognition) in the process.

[Parfit:] It is hard to explain the concept of a reason, or what the phrase ‘a reason’ means. Facts give us reasons, we might say, when they count in favour of our having some attitude, or our acting in some way. But ‘counts in favour of’ means roughly ‘gives a reason for’. Like some other fundamental concepts, such as those involved in our thoughts about time, consciousness, and possibility, the concept of a reason is indefinable in the sense that it cannot be helpfully explained merely by using words.

Suppose I build a robot that updates hypotheses based on observations, then selects actions that its hypotheses suggest will help it best achieve some goal. When the robot is deciding which hypotheses to put more confidence in based on an observation, we can imagine it thinking, "To what extent is observation o a [WORD] to believe hypothesis h?" When the robot is deciding whether it assigns enough probability to h to choose an action a, we can imagine it thinking, "To what extent is P(h)=0.7 a [WORD] to choose action a?" As a shorthand, when observation o updates a hypothesis h that favors an action a, the robot can also ask to what extent o itself is a [WORD] to choose a.

When two robots meet, we can moreover add that they negotiate a joint "compromise" goal that allows them to work together rather than fight each other for resources. In communicating with each other, they then start also using "[WORD]" where an action is being evaluated relative to the joint goal, not just the robot's original goal.

Thus when Robot A tells Robot B "I assign probability 90% to 'it's noon', which is [WORD] to have lunch", A may be trying to communicate that A wants to eat, or that A thinks eating will serve A and B's joint goal. (This gets even messier if the robots have an incentive to obfuscate which actions and action-recommendations are motivated by the personal goal vs. the joint goal.)

If you decide to relabel "[WORD]" as "reason", I claim that this captures a decent chunk of how people use the phrase "a reason". "Reason" is a suitcase word, but that doesn't mean there are no similarities between e.g. "data my goals endorse using to adjust the probability of a given hypothesis" and "probabilities-of-hypotheses my goals endorse using to select an action", or that the similarity is mysterious and ineffable.

(I recognize that the above story leaves out a lot of important and interesting stuff. Though past a certain point, I think the details will start to become Gettier-case nitpicks, as with most concepts.)
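For what it's worth, here's one toy way to cash out the robot story in code; the likelihood-ratio and expected-utility readings of "[WORD]" are my own operationalizations, not anything forced by the story:

```python
# A minimal sketch of the "[WORD]"/reason story: how strongly an observation counts
# in favor of a hypothesis, and how strongly a credence counts in favor of an action.
# The specific operationalizations (likelihood ratio, expected utility) are my choices.

def observation_reason_strength(p_obs_given_h: float, p_obs_given_not_h: float) -> float:
    """To what extent is observation o a [WORD] to believe hypothesis h? (likelihood ratio)"""
    return p_obs_given_h / p_obs_given_not_h

def credence_reason_strength(p_h: float, utility_if_h: float, utility_if_not_h: float) -> float:
    """To what extent is P(h) a [WORD] to choose this action? (expected utility of the action)"""
    return p_h * utility_if_h + (1 - p_h) * utility_if_not_h

# o is nine times likelier if h is true, so o strongly favors h:
print(observation_reason_strength(0.9, 0.1))                                     # about 9
# P(h) = 0.7 makes "have lunch" look good if lunch mostly pays off when h ("it's noon") holds:
print(credence_reason_strength(0.7, utility_if_h=10.0, utility_if_not_h=-2.0))   # about 6.4
```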

For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of "a rational choice" to the concept of "a winning choice" (or, in line with the type-2 conception you mention, a "utility-maximizing choice").

That essay isn't trying to "reduce" the term "rationality" in the sense of taking a pre-existing word and unpacking or translating it. The essay is saying that what matters is utility, and if a human being gets too invested in verbal definitions of "what the right thing to do is", they risk losing sight of the thing they actually care about and were originally in the game to try to achieve (i.e., their utility).

Therefore: if you're going to use words like "rationality", make sure that the words in question won't cause you to shoot yourself in the foot and take actions that will end up costing you utility (e.g., costing human lives, costing years of averted suffering, costing money, costing anything or everything). And if you aren't using "rationality" in a safe "nailed-to-utility" way, make sure that you're willing to turn on a dime and stop being "rational" the second your conception of rationality starts telling you to throw away value.

It ultimately seems hard, at least to me, to make non-vacuous true claims about what it's "rational" to do without evoking a non-reducible notion of "rationality."

"Rationality" is a suitcase word. It refers to lots of different things. On LessWrong, examples include not just "(systematized) winning" but (as noted in the essay) "Bayesian reasoning", or in Rationality: Appreciating Cognitive Algorithms, "cognitive algorithms or mental processes that systematically produce belief-accuracy or goal-achievement". In philosophy, the list is a lot longer.

The common denominator seems to largely be "something something reasoning / deliberation" plus (as you note) "something something normativity / desirability / recommendedness / requiredness".

The idea of "normativity" doesn't currently seem that mysterious to me either, though you're welcome to provide perplexing examples. My initial take is that it seems to be a suitcase word containing a bunch of ideas tied to:

  • Goals/preferences/values, especially overridingly strong ones.
  • Encouraged, endorsed, mandated, or praised conduct.

Encouraging, endorsing, mandating, and praising are speech-acts that seem very central to how humans perceive and intervene on social situations; and social situations seem pretty central to human cognition overall. So I don't think it's particularly surprising if words associated with such loaded ideas would have fairly distinctive connotations and seem to resist reduction, especially reduction that neglects the pragmatic dimensions of human communication and only considers the semantic dimension.

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T10:03:14.906Z · EA · GW
Whereas others think self-consistency is more important.

The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.

It's not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn't it come down to which principles you favor?

FDT gets you more utility than CDT. If you value literally anything in life more than you value "which ritual do I use to make my decisions?", then you should go with FDT over CDT; that's the core argument.

This argument for FDT would be question-begging if CDT proponents rejected utility as a desirable thing. But instead CDT proponents who are familiar with FDT agree utility is a positive, and either (a) they think there's no meaningful sense in which FDT systematically gets more utility than CDT (which I think is adequately refuted by Abram Demski), or (b) they think that CDT has other advantages that outweigh the loss of utility (e.g., CDT feels more intuitive to them).

The latter argument for CDT isn't circular, but as a fan of utility (i.e., of literally anything else in life), it seems very weak to me.

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T09:34:17.060Z · EA · GW

My impression is that most CDT advocates who know about FDT think FDT is making some kind of epistemic mistake, where the most popular candidate (I think) is some version of magical thinking.

Superstitious people often believe that it's possible to directly causally influence things across great distances of time and space. At a glance, FDT's prescription ("one-box, even though you can't causally affect whether the box is full") as well as its account of how and why this works ("you can somehow 'control' the properties of abstract objects like 'decision functions'") seem weird and spooky in the manner of a superstition.

FDT's response: if a thing seems spooky, that's a fine first-pass reason to be suspicious of it. But at some point, the accusation of magical thinking has to cash out in some sort of practical, real-world failure -- in the case of decision theory, some systematic loss of utility that isn't balanced by an equal, symmetric loss of utility from CDT. After enough experience of seeing a tool outperforming the competition in scenario after scenario, at some point calling the use of that tool "magical thinking" starts to ring rather hollow. At that point, it's necessary to consider the possibility that FDT is counter-intuitive but correct (like Einstein's "spukhafte Fernwirkung"), rather than magical.

In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:

2. I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on.

The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the "deterministic subprocess" view of our decision-making, we would find nothing strange about the idea that it's sometimes right for this subprocess to do locally incorrect things for the sake of better global results.

E.g., consider the transparent Newcomb problem with a 1% chance of predictor error. If we think of the brain's decision-making as a rule-governed system whose rules we are currently determining (via a meta-reasoning process that is itself governed by deterministic rules), then there's nothing strange about enacting a rule that gets us $1M in 99% of outcomes and $0 in 1% of outcomes; and following through when the unlucky 1% scenario hits us is nothing to agonize over, it's just a consequence of the rule we already decided on. In that regard, steering the rule-governed system that is your brain is no different from designing a factory robot that performs well enough in 99% of cases to offset the 1% of cases where something goes wrong.
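Putting rough numbers on the two rules, assuming the standard $1,000,000 / $1,000 payoffs and one natural way of filling in how the predictor treats a committed two-boxer:

```python
# Transparent Newcomb with a 1% chance of predictor error (standard payoffs assumed).
# A rule is evaluated over future states, before the agent sees whether the box is full.

def rule_expected_value(one_boxing_rule: bool, error_rate: float = 0.01) -> float:
    big, small = 1_000_000, 1_000
    if one_boxing_rule:
        # Predictor usually foresees one-boxing and fills the box; on an error it leaves it empty.
        return (1 - error_rate) * big + error_rate * 0
    # Predictor usually foresees two-boxing and leaves the box empty; on an error
    # the box is full and the two-boxer walks away with both.
    return (1 - error_rate) * small + error_rate * (big + small)

print(rule_expected_value(one_boxing_rule=True))   # about 990,000
print(rule_expected_value(one_boxing_rule=False))  # about 11,000
```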

(Note how a lot of these points are more intuitive in CS language. I don't think it's a coincidence that people coming from CS were able to improve on academic decision theory's ideas on these points; I think it's related to what kinds of stumbling blocks get in the way of thinking in these terms.)

Suppose you initially tell yourself:

"I'm going to one-box in all strictly-future transparent Newcomb problems, since this produces more expected causal (and evidential, and functional) utility. One-boxing and receiving $1M in 99% of future states is worth the $1000 cost of one-boxing in the other 1% of future states."

Suppose that you then find yourself facing the 1%-likely outcome where Omega leaves the box empty regardless of your choice. You then have a change of heart and decide to two-box after all, taking the $1000.

I claim that the above description feels from the inside like your brain is escaping the iron chains of determinism (even if your scientifically literate system-2 verbal reasoning fully recognizes that you're a deterministic process). And I claim that this feeling (plus maybe some reluctance to fully accept the problem description as accurate?) is the only thing that makes CDT's decision seem reasonable in this case.

In reality, however, if we end up not following through on our verbal commitment and we two-box in that 1% scenario, then this would just prove that we'd been mistaken about what rule we had successfully installed in our brains. As it turns out, we were really following the lower-global-utility rule from the outset. A lack of follow-through or a failure of will is itself a part of the decision-making process that Omega is predicting; however much it feels as though a last-minute swerve is you "getting away with something", it's really just you deterministically following through on an algorithm that will get you less utility in 99% of scenarios (while happening to be bad at predicting your own behavior and bad at following through on verbalized plans).

I should emphasize that the above is my own attempt to characterize the intuitions behind CDT and FDT, based on the arguments I've seen in the wild and based on what makes me feel more compelled by CDT, or by FDT. I could easily be wrong about the crux of disagreement between some CDT and FDT advocates.

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T08:48:32.471Z · EA · GW

I mostly agree with this. I think the disagreement between CDT and FDT/UDT advocates is less about definitions, and more about which of these things feels more compelling:

  • 1. On the whole, FDT/UDT ends up with more utility.

(I think this intuition tends to hold more force with people the more emotionally salient "more utility" is to you. E.g., consider a version of Newcomb's problem where two-boxing gets you $100, while one-boxing gets you $100,000 and saves your child's life.)

  • 2. I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on.

(I think this intuition tends to hold more force with people the more emotionally salient it is to imagine the dollars sitting right there in front of you and you knowing that it's "too late" for one-boxing to get you any more utility in this world.)

There are other considerations too, like how much it matters to you that CDT isn't self-endorsing. CDT prescribes self-modifying in all future dilemmas so that you behave in a more UDT-like way. It's fine to say that you personally lack the willpower to follow through once you actually get into the dilemma and see the boxes sitting in front of you; but it's still the case that a sufficiently disciplined and foresightful CDT agent will generally end up behaving like FDT in the very dilemmas that have been cited to argue for CDT.

If a more disciplined and well-prepared version of you would have one-boxed, then isn't there something off about saying that two-boxing is in any sense "correct"? Even the act of praising CDT seems a bit self-destructive here, inasmuch as (a) CDT prescribes ditching CDT, and (b) realistically, praising or identifying with CDT is likely to make it harder for a human being to follow through on switching to son-of-CDT (as CDT prescribes).


Mind you, if the sentence "CDT is the most rational decision theory" is true in some substantive, non-trivial, non-circular sense, then I'm inclined to think we should acknowledge this truth, even if it makes it a bit harder to follow through on the EDT+CDT+UDT prescription to one-box in strictly-future Newcomblike problems. When the truth is inconvenient, I tend to think it's better to accept that truth than to linguistically conceal it.

But the arguments I've seen for "CDT is the most rational decision theory" to date have struck me as either circular, or as reducing to "I know CDT doesn't get me the most utility, but something about it just feels right".

It's fine, I think, if "it just feels right" is meant to be a promissory note for some forthcoming account — a clue that there's some deeper reason to favor CDT, though we haven't discovered it yet. As the FDT paper puts it:

These are odd conclusions. It might even be argued that sufficiently odd behavior provides evidence that what FDT agents see as “rational” diverges from what humans see as “rational.” And given enough divergence of that sort, we might be justified in predicting that FDT will systematically fail to get the most utility in some as-yet-unknown fair test.

On the other hand, if "it just feels right" is meant to be the final word on why "CDT is the most rational decision theory", then I feel comfortable saying that "rational" is a poor choice of word here, and neither maps onto a key descriptive category nor maps onto any prescription or norm worthy of being followed.

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-26T07:52:20.219Z · EA · GW

I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!

If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesn't really have an answer.

Some ancient Greeks thought that the planets were intelligent beings; yet many of the Greeks' astronomical observations, and some of their theories and predictive tools, were still true and useful.

I think that terms like "normative" and "rational" are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauser's pluralistic moral reductionism).

I would say that (1) some philosophers use "rational" in a very human-centric way, which is fine as long as it's done consistently; (2) others have a much more thin conception of "rational", such as 'tending to maximize utility'; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of "rationality", but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.

I think that type-1, type-2, and type-3 decision theorists have all contributed valuable AI-relevant conceptual progress in the past (most obviously, by formulating Newcomb's problem, EDT, and CDT), and I think all three could do more of the same in the future. I think the type-3 decision theorists are making a mistake, but often more in the fashion of an ancient astronomer who's accumulating useful and real knowledge but happens to have some false side-beliefs about the object of study, not in the fashion of a theologian whose entire object of study is illusory. (And not in the fashion of a developmental psychologist or historian whose field of study is too human-centric to directly bear on game theory, AI, etc.)

I'd expect type-2 decision theorists to tend to be interested in more AI-relevant things than type-1 decision theorists, but on the whole I think the flavor of decision theory as a field has ended up being more type-2/3 than type-1. (And in this case, even type-1 analyses of "rationality" can be helpful for bringing various widespread background assumptions to light.)

If I'm someone with a twin and I'm implementing P_CDT, I still don't think I will choose to modify myself to cooperate in twin prisoner's dilemmas. The reason is that modifying myself won't cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.

This is true if your twin was copied from you in the past. If your twin will be copied from you in the future, however, then you can indeed cause your twin to cooperate, assuming you have the ability to modify your own future decision-making so as to follow son-of-CDT's prescriptions from now on.

Making the commitment to always follow son-of-CDT is an action you can take; the mechanistic causal consequence of this action is that your future brain and any physical systems that are made into copies of your brain in the future will behave in certain systematic ways. So from your present perspective (as a CDT agent), you can causally control future copies of yourself, as long as the act of copying hasn't happened yet.

(And yes, by the time you actually end up in the prisoner's dilemma, your future self will no longer be able to causally affect your copy. But this is irrelevant from the perspective of present-you; to follow CDT's prescriptions, present-you just needs to pick the action that you currently judge will have the best consequences, even if that means binding your future self to take actions contrary to CDT's future prescriptions.)

(If it helps, don't think of the copy of you as "you": just think of it as another environmental process you can influence. CDT prescribes taking actions that change the behavior of future copies of yourself in useful ways, for the same reason CDT prescribes actions that change the future course of other physical processes.)

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-23T03:06:40.896Z · EA · GW
But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the “Don’t Make Things Worse Principle,” which says that: It’s not rational to take decisions that will definitely make things worse. Will’s Bomb case is an example of a case where R_UDT violates this principle, which is very similar to his “Guaranteed Payoffs Principle.”

I think "Don't Make Things Worse" is a plausible principle at first glance.

One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility). The general policy of following the "Don't Make Things Worse Principle" makes things worse.

Once you've already adopted son-of-CDT, which says something like "act like UDT in future dilemmas insofar as the correlations were produced after I adopted this rule, but act like CDT in those dilemmas insofar as the correlations were produced before I adopted this rule", it's not clear to me why you wouldn't just go: "Oh. CDT has lost the thing I thought made it appealing in the first place, this 'Don't Make Things Worse' feature. If we're going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?"

A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state. From Abram Demski's comments:

[...] In Bomb, the problem clearly stipulates that an agent who follows the FDT recommendation has a trillion trillion to one odds of doing better than an agent who follows the CDT/EDT recommendation. Complaining about the one-in-a-trillion-trillion chance that you get the bomb while being the sort of agent who takes the bomb is, to an FDT-theorist, like a gambler who has just lost a trillion-trillion-to-one bet complaining that the bet doesn't look so rational now that the outcome is known with certainty to be the one-in-a-trillion-trillion case where the bet didn't pay well.
[...] One way of thinking about this is to say that the FDT notion of "decision problem" is different from the CDT or EDT notion, in that FDT considers the prior to be of primary importance, whereas CDT and EDT consider it to be of no importance. If you had instead specified 'bomb' with just the certain information that 'left' is (causally and evidentially) very bad and 'right' is much less bad, then CDT and EDT would regard it as precisely the same decision problem, whereas FDT would consider it to be a radically different decision problem.
Another way to think about this is to say that FDT "rejects" decision problems which are improbable according to their own specification. In cases like Bomb where the situation as described is by its own description a one in a trillion trillion chance of occurring, FDT gives the outcome only one-trillion-trillion-th consideration in the expected utility calculation, when deciding on a strategy.

And:

[...] This also hopefully clarifies the sense in which I don't think the decisions pointed out in (III) are bizarre. The decisions are optimal according to the very probability distribution used to define the decision problem.
There's a subtle point here, though, since Will describes the decision problem from an updated perspective -- you already know the bomb is in front of you. So UDT "changes the problem" by evaluating "according to the prior". From my perspective, because the very statement of the Bomb problem suggests that there were also other possible outcomes, we can rightly insist to evaluate expected utility in terms of those chances.
Perhaps this sounds like an unprincipled rejection of the Bomb problem as you state it. My principle is as follows: you should not state a decision problem without having in mind a well-specified way to predictably put agents into that scenario. Let's call the way-you-put-agents-into-the-scenario the "construction". We then evaluate agents on how well they deal with the construction.
For examples like Bomb, the construction gives us the overall probability distribution -- this is then used for the expected value which UDT's optimality notion is stated in terms of.
For other examples, as discussed in Decisions are for making bad outcomes inconsistent, the construction simply breaks when you try to put certain decision theories into it. This can also be a good thing; it means the decision theory makes certain scenarios altogether impossible.
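To put a number on Abram's point, here's a minimal sketch of the Bomb comparison from the prior, with the disutility of burning set to an arbitrary placeholder figure of mine; the comparison only flips if burning is worth more than roughly 10^26 dollars to avoid.

```python
# Will's Bomb case, evaluated from the prior (Abram's "construction") rather than after
# the bomb has been observed. The dollar figure for burning is an arbitrary placeholder.

ERROR_RATE = 1e-24        # one-in-a-trillion-trillion predictor failure rate
COST_OF_RIGHT = 100       # taking Right always costs $100
COST_OF_BURNING = 1e9     # placeholder disutility (in dollars) for taking the bomb

# Rule "take Left": the predictor almost always foresees this and leaves Left bomb-free.
ev_left_rule = -(ERROR_RATE * COST_OF_BURNING)
# Rule "take Right": you pay the $100 no matter what the predictor did.
ev_right_rule = -COST_OF_RIGHT

print(ev_left_rule)    # about -1e-15 dollars
print(ev_right_rule)   # -100 dollars
# Left only comes out worse if COST_OF_BURNING exceeds COST_OF_RIGHT / ERROR_RATE, i.e. ~$1e26.
```
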
Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-23T02:48:53.788Z · EA · GW
Another way to express the distinction I have in mind is that it’s between (a) a normative claim and (b) a process of making decisions.

This is similar to how you described it here:

Let’s suppose that some decisions are rational and others aren’t. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. [...]
When philosophers talk about “CDT,” for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, “CDT” is the claim that a decision is rational iff taking it would cause the largest expected increase in value. To avoid any ambiguity, let’s label this claim R_CDT.
We can also talk about “decision procedures.” A decision procedure is just a process or algorithm that an agent follows when making decisions.

This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it's normative, it can be either an algorithm/procedure that's being recommended, or a criterion of rightness like "a decision is rational iff taking it would cause the largest expected increase in value" (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are "normative" or "endorsed").

Some of your discussion above seems to be focusing on the "algorithmic?" dimension, while other parts seem focused on "normative?". I'll say more about "normative?" here.

The reason I proposed the three distinctions in my last comment and organized my discussion around them is that I think they're pretty concrete and crisply defined. It's harder for me to accidentally switch topics or bundle two different concepts together when talking about "trying to optimize vs. optimizing as a side-effect", "directly optimizing vs. optimizing via heuristics", "initially optimizing vs. self-modifying to optimize", or "function vs. algorithm".

In contrast, I think "normative" and "rational" can mean pretty different things in different contexts, it's easy to accidentally slide between different meanings of them, and their abstractness makes it easy to lose track of what's at stake in the discussion.

E.g., "normative" is often used in the context of human terminal values, and it's in this context that statements like this ring obviously true:

I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding “how decision-making works,” since normative claims don’t typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don’t think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making.

If we're treating decision-theoretic norms as being like moral norms, then sure. I think there are basically three options:

  • Decision theory isn't normative.
  • Decision theory is normative in the way that "murder is bad" or "improving aggregate welfare is good" is normative, i.e., it expresses an arbitrary terminal value of human beings.
  • Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).

Probability theory has obvious normative force in the context of reasoning and decision-making, but it's not therefore arbitrary or irrelevant to understanding human cognition, AI, etc.

A lot of the examples you've cited are theories from moral philosophy about what's terminally valuable. But decision theory is generally thought of as the study of how to make the right decisions, given a set of terminal preferences; it's not generally thought of as the study of which decision-making methods humans happen to terminally prefer to employ. So I would put it in category 1 or 3.

You could indeed define an agent that terminally values making CDT-style decisions, but I don't think most proponents of CDT or EDT would claim that their disagreement with UDT/FDT comes down to a values disagreement like that. Rather, they'd claim that rival decision theorists are making some variety of epistemic mistake. (And I would agree that the disagreement comes down to one party or the other making an epistemic mistake, though I obviously disagree about who's mistaken.)

I actually don’t think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won’t cause you to achieve better outcomes here.)

In the twin prisoner's dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).

I think you can model the voting dilemma the same way, just with noise added because the level of correlation is imperfect and/or uncertain. Ten agents following the same decision procedure are trying to decide whether to stay home and watch a movie (which gives a small guaranteed benefit) or go to the polls (which costs them the utility of the movie, but gains them a larger utility iff the other nine agents go to the polls too). Ten FDT agents will vote in this case, if they know that the other agents will vote under similar conditions.
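A toy rendering of that ten-agent dilemma, with placeholder payoffs of my own (the exact numbers don't matter, only that the election is worth more than the movie):

```python
# Ten correlated voters: staying home adds a small sure benefit (the movie), and the
# large payoff arrives iff the other nine agents vote. Payoff numbers are placeholders.

MOVIE_VALUE = 1        # small guaranteed benefit of staying home
ELECTION_VALUE = 100   # per-agent value of the preferred outcome, won iff the other nine vote

def payoff(i_vote: bool, others_vote: bool) -> int:
    return (ELECTION_VALUE if others_vote else 0) + (0 if i_vote else MOVIE_VALUE)

# Holding the other nine fixed (the CDT-style counterfactual), staying home is a free +1:
assert payoff(i_vote=False, others_vote=True) > payoff(i_vote=True, others_vote=True)
assert payoff(i_vote=False, others_vote=False) > payoff(i_vote=True, others_vote=False)

# Treating ten copies of one decision procedure as correlated, the live comparison is:
print(payoff(i_vote=True, others_vote=True))    # 100: everyone votes
print(payoff(i_vote=False, others_vote=False))  # 1: everyone stays home and watches the movie
```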

Comment by RobBensinger on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-22T22:23:33.949Z · EA · GW

I agree that these three distinctions are important:

  • "Picking policies based on whether they satisfy a criterion X" vs. "Picking policies that happen to satisfy a criterion X". (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
  • "Trying to follow a decision rule Y 'directly' or 'on the object level'" vs. "Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y". (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you've come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)
  • "A decision rule that prescribes outputting some action or policy and doesn't care how you do it" vs. "A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy". (E.g., a rule that says 'maximize the aggregate welfare of moral patients' vs. a specific mental algorithm intended to achieve that end.)

The first distinction above seems less relevant here, since we're mostly discussing AI systems and humans that are self-aware about their decision criteria and explicitly "trying to do what's right".

As a side-note, I do want to emphasize that from the MIRI cluster's perspective, it's fine for correct reasoning in AGI to arise incidentally or implicitly, as long as it happens somehow (and as long as the system's alignment-relevant properties aren't obscured and the system ends up safe and reliable).

The main reason to work on decision theory in AI alignment has never been "What if people don't make AI 'decision-theoretic' enough?" or "What if people mistakenly think CDT is correct and so build CDT into their AI system?" The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we've even been misunderstanding basic things at the level of "decision-theoretic criterion of rightness".

It's not that I want decision theorists to try to build AI systems (even notional ones). It's that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That's part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).


The second distinction ("following a rule 'directly' vs. following it by adopting a sub-rule or via self-modification") seems more relevant. You write:

My understanding is that when philosophers talk about “CDT,” they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.
The difference matters, because people who believe that R_CDT is true don’t generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves.

Far from being a distinction proponents of UDT/FDT neglect, this is one of the main grounds on which UDT/FDT proponents criticize CDT (from within the "success-first" tradition). This is because agents that are reflectively inconsistent in the manner of CDT -- ones that take actions they know they'll regret taking, wish they were following a different decision rule, etc. -- can be money-pumped and can otherwise lose arbitrary amounts of value.

A human following CDT should endorse "stop following CDT," since CDT isn't self-endorsing. It's not even that they should endorse "keep following CDT, but adopt a heuristic or sub-rule that helps us better achieve CDT ends"; they need to completely abandon CDT even at the meta-level of "what sort of decision rule should I follow?" and modify themselves into purely following an entirely new decision rule, or else they'll continue to perform poorly by CDT's lights.

The decision rule that CDT does endorse loses a lot of the apparent elegance and naturalness of CDT. This rule, "son-of-CDT", is roughly:

  • Have whatever disposition-to-act gets the most utility, unless I'm in future situations like "a twin prisoner's dilemma against a perfect copy of my future self where the copy was forked from me before I started following this rule", in which case ignore my correlation with that particular copy and make decisions as though our behavior is independent (while continuing to take into account my correlation with any copies of myself I end up in prisoner's dilemmas with that were copied from my brain after I started following this rule).

The fact that CDT doesn't endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don't), and the fact that the theory it endorses is a strange frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.

But this decision rule CDT endorses also still performs suboptimally (from the perspective of success-first decision theory). See the discussion of the Retro Blackmail Problem in "Toward Idealized Decision Theory", where "CDT and any decision procedure to which CDT would self-modify see losing money to the blackmailer as the best available action."

In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agents' due to events that happened after she turned 20 (such as "the summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theory"). But she'll refuse to coordinate for reasons like "we hung out a lot the summer before my 20th birthday", "we spent our whole childhoods and teen years living together and learning from the same teachers", and "we all have similar decision-making faculties due to being members of the same species". There's no principled reason to draw this temporal distinction; it's just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.
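To spell out how arbitrary that cutoff is, here's a toy rendering of son-of-CDT's temporal rule; the formalization is mine, not something from the literature:

```python
# Son-of-CDT, as described above, coordinates on a correlation only if that correlation
# came into existence after the agent self-modified away from CDT. (Toy formalization.)

ADOPTION_AGE = 20  # age at which this agent switched from CDT to son-of-CDT

def correlation_counts(correlation_formed_at_age: float) -> bool:
    return correlation_formed_at_age > ADOPTION_AGE

print(correlation_counts(20.5))   # True:  "we converged on voting theory the summer after I turned 20"
print(correlation_counts(15))     # False: "we grew up learning from the same teachers"
print(correlation_counts(-1e6))   # False: "we share decision-making faculties as members of one species"
```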


Regarding the third distinction ("prescribing a certain kind of output vs. prescribing a step-by-step mental procedure for achieving that kind of output"), I'd say that it's primarily the criterion of rightness that MIRI-cluster researchers care about. This is part of why the paper is called "Functional Decision Theory" and not (e.g.) "Algorithmic Decision Theory": the focus is explicitly on "what outcomes do you produce?", not on how you produce them.

(Thus, an FDT agent can cooperate with another agent whenever the latter agent's input-output relations match FDT's prescription in the relevant dilemmas, regardless of what computations they do to produce those outputs.)

The main reasons I think academic decision theory should spend more time coming up with algorithms that satisfy their decision rules are that (a) this has a track record of clarifying what various decision rules actually prescribe in different dilemmas, and (b) this has a track record of helping clarify other issues in the "understand what good reasoning is" project (e.g., logical uncertainty) and how they relate to decision theory.