[Linkpost] Shorter version of report on existential risk from power-seeking AI 2023-03-22T18:06:49.610Z
A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my dissertation) 2023-02-21T17:16:19.005Z
Seeing more whole 2023-02-17T05:14:02.248Z
Why should ethical anti-realists do ethics? 2023-02-16T16:27:02.008Z
[Linkpost] Human-narrated audio version of "Is Power-Seeking AI an Existential Risk?" 2023-01-31T19:19:31.416Z
On sincerity 2022-12-23T17:14:13.539Z
Against meta-ethical hedonism 2022-12-02T00:19:53.552Z
Against the normative realist's wager 2022-10-13T16:36:46.018Z
Video and Transcript of Presentation on Existential Risk from Power-Seeking AI 2022-05-08T03:52:43.864Z
On expected utility, part 4: Dutch books, Cox, and Complete Class 2022-03-24T07:49:25.806Z
On expected utility, part 3: VNM, separability, and more 2022-03-22T03:05:11.731Z
On expected utility, part 2: Why it can be OK to predictably lose 2022-03-18T08:28:49.825Z
On expected utility, part 1: Skyscrapers and madmen 2022-03-16T21:54:47.342Z
Simulation arguments 2022-02-18T10:57:01.803Z
On infinite ethics 2022-01-31T07:17:21.298Z
The ignorance of normative realism bot 2022-01-18T05:18:32.474Z
Morality and constrained maximization, part 2 2022-01-12T01:49:07.219Z
Morality and constrained maximization, part 1 2021-12-22T08:53:05.198Z
Reviews of "Is power-seeking AI an existential risk?" 2021-12-16T20:50:46.574Z
Anthropics and the Universal Distribution 2021-11-28T20:46:45.356Z
On the Universal Distribution 2021-10-29T17:52:11.607Z
SIA > SSA, part 4: In defense of the presumptuous philosopher 2021-10-01T07:00:15.779Z
SIA > SSA, part 3: An aside on betting in anthropics 2021-10-01T06:59:31.124Z
SIA > SSA, part 2: Telekinesis, reference classes, and other scandals 2021-10-01T06:58:52.266Z
SIA > SSA, part 1: Learning from the fact that you exist 2021-10-01T06:58:04.972Z
Can you control the past? 2021-08-27T19:34:18.926Z
In search of benevolence (or: what should you get Clippy for Christmas?) 2021-07-20T01:11:10.343Z
On the limits of idealized values 2021-06-22T02:00:36.022Z
Draft report on existential risk from power-seeking AI 2021-04-28T21:41:03.856Z
Problems of evil 2021-04-19T10:33:00.000Z
The innocent gene 2021-04-05T03:26:16.961Z
The importance of how you weigh it 2021-03-29T04:58:17.862Z
On future people, looking back at 21st century longtermism 2021-03-22T08:21:04.205Z
Against neutrality about creating happy lives 2021-03-15T01:54:22.612Z
Care and demandingness 2021-03-08T06:59:27.554Z
Subjectivism and moral authority 2021-03-01T08:59:29.742Z
Two types of deference 2021-02-22T03:27:32.368Z
Contact with reality 2021-02-15T04:53:34.381Z
Killing the ants 2021-02-07T23:16:50.147Z
Believing in things you cannot see 2021-02-01T07:24:16.051Z
On clinging 2021-01-24T23:25:41.907Z
A ghost 2021-01-21T07:14:14.326Z
Actually possible: thoughts on Utopia 2021-01-18T08:27:32.025Z
Alienation and meta-ethics (or: is it possible you should maximize helium?) 2021-01-15T07:06:54.124Z
The impact merge 2021-01-13T07:26:47.630Z
Shouldn't it matter to the victim? 2021-01-11T07:14:20.069Z
Thoughts on personal identity 2021-01-08T04:19:09.765Z
Grokking illusionism 2021-01-06T05:50:07.646Z
The despair of normative realism bot 2021-01-03T22:59:06.126Z
Thoughts on being mortal 2021-01-01T19:16:55.944Z


Comment by Joe_Carlsmith on A Critique of AI Takeover Scenarios · 2022-11-07T21:55:43.947Z · EA · GW

Noting that the passage you quote in your appendix from my report isn't my definition of the type of AI I'm focused on. I'm focused on AI systems with the following properties:

  • Advanced capability: they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation). 
  • Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world. 
  • Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.

See section 2.1 in the report for more in-depth description.

Comment by Joe_Carlsmith on Against the normative realist's wager · 2022-10-13T18:45:49.480Z · EA · GW

Hi Jake, 

Thanks for this comment. I discuss this sort of case in footnote 33 here -- I think it's a good place to push back on the argument. Quoting what I say there:

there is, perhaps, some temptation to say “even if I should be indifferent to these people burned alive, I’m not! Screw indifference-ism world! Sounds like a shitty objective normative order anyway – let’s rebel against it.” That is, it feels like indifference-ism worlds have told me what the normative facts are, but they haven’t told me about my loyalty to the normative facts, and the shittyness of these normative facts puts that loyalty even more in question.  

And perhaps, as well, there’s some temptation to think that “Well, indifference-ism world is morally required to be indifferent to my overall decision-procedure as well – so I’ll use a decision-procedure that isn’t indifferent to what happens in indifference-ism world. Indifference-ism world isn't allowed to care!”  

These responses might seem dicey, though. If they (or others) don't end up working, ultimately I think that biting the bullet and taking this sort of deal is in fact less bad than doing so in the nihilism-focused version or the original. So it’s an option if necessary – and one I’d substantially prefer to biting the bullet in all of them.

That is, I'm interested in some combination of: 

  • Not taking the deal because you're uncertain of your loyalty to the normative facts (e.g., something about internalism/externalism etc)
  • Not taking the deal because indifference-ism world is indifferent to your decision procedure (or to your actions more generally), so whatever, let's save my family in those worlds. 
  • Biting the bullet and taking the deal if it comes to that, but not taking it in the other cases discussed in the post. 

Adding a few more thoughts, I think part of what I'm interested in here is the question of what you would be "trying" to do (from some kind of "I endorse this" perspective, even if the endorsement doesn't have any external backing from the normative facts) conditional on a given world. If, in indifference-ism world, you wouldn't be trying, in this sense, to protect your family, such that your representative from indifference-ism world would indeed be like "yeah, go ahead, burn my family alive," then taking the deal looks more OK to me. But if, conditional on indifference-ism, you would be trying to protect your family anyway (maybe because: the normative facts are indifferent, so might as well), such that your representative from indifference-ism world would be like "I'm against this deal," then taking the deal looks worse to me. And the second thing seems more like where I'd expect to end up.

Comment by Joe_Carlsmith on A Bird's Eye View of the ML Field [Pragmatic AI Safety #2] · 2022-07-18T19:28:04.054Z · EA · GW

I found this helpful, thanks for writing.

Comment by Joe_Carlsmith on On infinite ethics · 2022-01-31T18:36:55.080Z · EA · GW

A few questions about this: 

  1. Does this view imply that it is actually not possible to have a world where e.g. a machine creates one immortal happy person per day, forever, who then form an ever-growing line?
  2. How does this view interpret cosmological hypotheses on which the universe is infinite? Is the claim that actually, on those hypotheses, the universe is finite after all? 
  3. It seems like lots of the (countable) worlds and cases discussed in the post can simply be reframed as never-ending processes, no? And then similar (identical?) questions will arise? Thus, for example, w5 is equivalent to a machine that creates a1 at -1, then a3 at -1, then a5 at -1, etc. w6 is equivalent to a machine that creates a1 at -1, then a2 at -1, a3 at -1, etc. What would this view say about which of these machines we should create, given the opportunity? How should we compare these to a w8 machine that creates b1 at -1, b2 at -1, b3 at -1, b4 at -1, etc?

Re: the Jaynes quote: I'm not sure I've understood the full picture here, but in general, to me it doesn't feel like the central issues here have to do with dependencies on "how the limit is approached," such that requiring that each scenario pin down an "order" solves the problems. For example, I think that a lot of what seems strange about Neutrality-violations in these cases is that even if we pin down an order for each case, the fact that you can re-arrange one into the other makes it seem like they ought to be ethically equivalent. Maybe we deny that, and maybe we do so for reasons related to what you're talking about - but it seems like the same bullet. 

Comment by Joe_Carlsmith on The ignorance of normative realism bot · 2022-01-18T17:38:01.252Z · EA · GW

Thanks! Fixed.

Comment by Joe_Carlsmith on Listen to more EA content with The Nonlinear Library · 2021-11-16T21:43:55.350Z · EA · GW

Thanks for doing this! I've found it useful, and I expect that it will increase my engagement with EA Forum/LW content going forward.

Comment by Joe_Carlsmith on SIA > SSA, part 3: An aside on betting in anthropics · 2021-10-02T17:20:10.309Z · EA · GW

"that just indicates that EDT-type reasoning is built into the plausibility of SIA"

If by this you mean "SIA is only plausible if you accept EDT," then I disagree. I think many of the arguments for SIA -- for example, "you should be 1/4 on each of tails-mon, tails-tues, heads-mon, and heads-tues in Sleeping Beauty with two wakings each, and then update to being a thirder if you learn you're not in heads-tues," "telekinesis doesn't work," "you should be one-half on not-yet-flipped fair coins," "reference classes aren't a thing," etc -- don't depend on EDT, or even on EDT-ish intuitions.

you talk about contorting one's epistemology in order to bet a particular way, but what's the alternative? If I'm an EDT agent who wants to bet at odds of a third, what is the principled reasoning that leads me to have credence of a half?

The alternative is to just bet the way you want to anyway, in the same way that the (most attractive, imo) alternative to two-boxing in transparent newcomb is not "believe that the boxes are opaque" but "one-box even though you know they're transparent." You don't need to have a credence of a half to bet how you want to -- especially if you're updateless. And note that EDT-ish SSA-ers have the fifthing problem too, in cases like the "wake up twice regardless, then learn that you're not heads-tuesday" version I just mentioned (where SSA ends up at 1/3rd on heads, too).
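
The update in the "two wakings each" version of Sleeping Beauty mentioned above can be checked in a few lines. This is only a minimal sketch: the `condition` helper and the `(coin, day)` encoding are illustrative assumptions, not anything from the post. It starts from 1/4 on each of the four centered possibilities and conditions on not being in heads-tues.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical encoding: each centered possibility is a (coin, day) pair,
# and all four start with credence 1/4.
credence = {
    (coin, day): F(1, 4)
    for coin, day in product(["heads", "tails"], ["mon", "tues"])
}

def condition(cred, keep):
    """Drop possibilities that fail `keep`, then renormalize the rest."""
    kept = {k: v for k, v in cred.items() if keep(k)}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}

# Learn that you are not in heads-tues.
updated = condition(credence, lambda k: k != ("heads", "tues"))

# Total credence in heads after the update.
p_heads = sum(v for (coin, _), v in updated.items() if coin == "heads")
print(p_heads)  # 1/3
```

The remaining three possibilities end up at 1/3 each, which is the thirder answer.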

You argue that questions like "could I have been a chimpanzee" seem ridiculous. But these are closely analogous to the types of questions that one needs to ask when making decisions according to FDT (e.g. "are the decisions of chimpanzees correlated with my own?") So, if we need to grapple with these questions somehow in order to make decisions, grappling with them via our choice of a reference class doesn't seem like the worst way to do so.

I think that "how much are my decisions correlated with those of the chimps?" is a much more meaningful and tractable question, with a much more determinate answer, than "are the chimps in my reference class?" Asking questions about correlations between things is the bread and butter of Bayesianism. Asking questions about anthropic reference classes isn't -- or, doesn't need to be.

I'm reminded of Yudkowsky's writing about why he isn't prepared to get rid of the concept of "anticipated subjective experience", despite the difficulties it poses from a quantum-mechanical perspective.

Thanks for the link. I haven't read this piece, but fwiw, to me it feels like "there is a truth about the way that the world is/about what world I'm living in, I'm trying to figure out what that truth is" is something we shouldn't give up lightly. I haven't engaged much with the QM stuff here, and I can imagine it moving me, but "how are you going to avoid fifth-ing?" doesn't seem like a strong enough push on its own.

Comment by Joe_Carlsmith on SIA > SSA, part 1: Learning from the fact that you exist · 2021-10-02T16:52:00.663Z · EA · GW

It’s a good question, and one I considered going into in more detail in the post (I'll add a link to this comment). I think it’s helpful to have in mind two types of people: “people who see the exact same evidence you do” (e.g., they look down on the same patterns of wrinkles on your hands, the same exact fading on the jeans they’re wearing, etc) and “people who might, for all you know about a given objective world, see the exact same evidence you do” (an example here would be “the person in room 2”). By “people in your epistemic situation,” I mean the former. The latter I think of as actually a disguised set of objective worlds, which posit different locations (and numbers) of the former-type people. But SIA, importantly, likes them both (though on my gloss, liking the former is more fundamental).

Here are some cases to illustrate. Suppose that God creates either one person in room 1 (if heads) or two people (if tails) in rooms 1 and 2. And suppose that there are two types of people: “Alices” and “Bobs.” Let’s say that any given Alice sees the exact same evidence as the other Alices (the same wrinkles, faded jeans, etc), and that the same holds for Bobs, and that if you’re an Alice or a Bob, you know it. Now consider three cases: 

  1. For each person God creates, he flips a second coin. If it’s heads, he creates an Alice. If tails, a Bob. 
  2. God flips a second coin. If it’s heads, he makes the person in room 1 Alice; if tails, Bob. But if the first coin was tails and he needs to create a second person, he makes that person different from the first. Thus, if tails-heads, it’s an Alice in room 1, and a Bob in room 2. But if it’s tails-tails, then it’s a Bob in room 1, and an Alice in room 2. (I talk about this case in part 4, XV.)
  3. God creates all Alices no matter what. 

Let’s write people’s names with “A” or “B,” in order of room number. And let’s say you wake up as an Alice. 

  • In case one, “coin 1 heads” (I’ll write the coin-1 results in parentheses) corresponds to two objective worlds -- A, and B -- each with 1/4 prior probability. Coin 1 tails corresponds to four objective worlds -- AA, AB, BA, and BB -- each with 1/8th prior probability. So as Alice, you start by crossing off B and BB, because those worlds contain no Alices. So you’re left with 1/4 on A, and 1/8th on each of AA, AB, and BA, so an overall odds-ratio of 2:1:1:1. But now, as SIA, you scale the prior in proportion to the number of Alices there are, so AA gets double weight. Now you’re 2:2:1:1. Thus, you end up with 1/3rd on A, 1/3rd on AA (with 1/6th on each of the corresponding centered worlds), and 1/6th on each of AB and BA. And you’re a “thirder” overall.
  • Now let’s look at case two. Here, the prior is 1/4 on A, 1/4 on B, 1/4 on AB, and 1/4 on BA. So SIA doesn’t actually do any scaling of the prior: there’s a maximum of one A in each world. Rather, it crosses off B, and ends up with 1/3rd on anything else, and stays a “thirder” overall. 
  • Case three is just Sleeping Beauty: SIA scales in proportion to the number of Alices, and ends up a thirder overall. 
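
The arithmetic in the three cases above can be checked mechanically. What follows is a minimal sketch under an assumed encoding, not a definitive formalization of SIA: objective worlds are written as strings of "A"/"B" in room order (as in the comment), one-letter worlds stand for "coin 1 heads," and the names `sia_posterior` and `p_heads` are illustrative. Each world's prior is weighted by the number of Alices it contains and then renormalized.

```python
from fractions import Fraction as F

def sia_posterior(prior, observer="A"):
    """Weight each objective world's prior by how many observers of the
    given type it contains, then renormalize. Worlds containing no such
    observer get weight 0, i.e. they are "crossed off"."""
    weights = {w: p * w.count(observer) for w, p in prior.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items() if wt > 0}

def p_heads(posterior):
    """One-letter worlds are the 'coin 1 heads' worlds in this encoding."""
    return sum(p for w, p in posterior.items() if len(w) == 1)

# Priors over objective worlds, taken from the three cases in the text.
case_1 = {"A": F(1, 4), "B": F(1, 4),
          "AA": F(1, 8), "AB": F(1, 8), "BA": F(1, 8), "BB": F(1, 8)}
case_2 = {"A": F(1, 4), "B": F(1, 4), "AB": F(1, 4), "BA": F(1, 4)}
case_3 = {"A": F(1, 2), "AA": F(1, 2)}

for name, prior in [("case 1", case_1), ("case 2", case_2), ("case 3", case_3)]:
    post = sia_posterior(prior)
    print(name, {w: str(p) for w, p in post.items()},
          "P(heads) =", p_heads(post))
```

In all three cases the credence in coin 1 heads comes out to 1/3, even though the posterior over objective worlds differs from case to case.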

So in each of these cases, SIA gives the same result, even though the distribution of Alices is in some sense pretty different. And notice, we can redescribe case 1 and 2 in terms of SIA liking “people who, for all you know about a given objective world, might be an Alice” instead of in terms of SIA liking Alices. E.g., in both cases, there are twice as many such people on tails. But importantly, their probability of being an Alice isn’t correlated with coin 1 heads vs. coin 1 tails. 

Anthropics cases are sometimes ambiguous about whether they’re talking about cases of type 1 or of type 3. God’s coin toss is closer to case 1: e.g., you wake up as a person in a room, but we didn’t specify that God was literally making exact copies of you in the other rooms -- your reasoning, though, treats his probability of giving any particular objective-world person your exact evidence as constant across people. Sleeping Beauty is often treated as more like case 3, but it’s compatible with being more of a case 1 type (e.g., if the experimenters also flip another coin on each waking, and leave it for Beauty to see, this doesn’t make a difference; and in general, the Beauties could have different subjective experiences on each waking, as long as -- as far as Beauty knows -- these variations in experience are independent of the coin toss outcome). I'm not super careful about these distinctions in the post, partly because actually splitting out all of the possible objective worlds in type-1 cases isn't really do-able (there's no well-defined distribution that God is "choosing from" when he creates each person in God's coin toss -- but his choice is treated, from your perspective, as independent from the coin toss outcome); and as noted, SIA's verdicts end up the same.

Comment by Joe_Carlsmith on Can you control the past? · 2021-09-02T18:47:45.373Z · EA · GW

Cool, this gives me a clearer picture of where you're coming from. I had meant the central question of the post to be whether it ever makes sense to do the EDT-ish try-to-control-the-past thing, even in pretty unrealistic cases -- partly because I think answering "yes" to this is weird and disorienting in itself, even if it doesn't end up making much of a practical difference day-to-day; and partly because a central objection to EDT is that the past, being already fixed, is never controllable in any practically-relevant sense, even in e.g. Newcomb's cases. It sounds like your main claim is that in our actual everyday circumstances, with respect to things like the WWI case, EDTish and CDT recommendations don't come apart -- a topic I don't spend much time on or have especially strong views about.

"you’re going to lean on the difference between 'cause' and 'control'" -- indeed, and I had meant the "no causal interaction with" part of the opening sentence to indicate this. It does seem like various readers objected to/were confused by the use of the term "control" here, and I think there's room for more emphasis early on as to what specifically I have in mind; but at a high level, I'm inclined to keep the term "control," rather than trying to rephrase things solely in terms of e.g. correlations, because I think it makes sense to think of yourself as, for practical purposes, "controlling" what your copy writes on his whiteboard, what Omega puts in the boxes, etc; and because, more broadly, EDT-ish decision-making is in fact weird in the way that trying to control the past is weird, which makes it all the more striking and worth highlighting that EDT-ish decision-making seems, sometimes, like the right way to go.

Comment by Joe_Carlsmith on Can you control the past? · 2021-09-02T02:00:31.459Z · EA · GW

Not sure exactly what words people have used, but something like this idea is pretty common in the non-CDT literature, and I think e.g. MIRI explicitly talks about "controlling" things like your algorithm.

Comment by Joe_Carlsmith on Can you control the past? · 2021-09-02T01:52:49.730Z · EA · GW

I think this is an interesting objection. E.g., "if you're into EDT ex ante, shouldn't you be into EDT ex post, and say that it was a 'good action' to learn about the Egyptians, because you learned that they were better off than you thought in expectation?" I think it depends, though, on how you are doing the ex post evaluation: and the objection doesn't work if the ex post evaluation conditions on the information you learn. 

That is, suppose that before you read Wikipedia, you were 50% that the Egyptians were at 0 welfare, and 50% that they were at 10 welfare, so 5 in expectation, but reading is 0 EV. After reading, you find out that their welfare was 10. OK, should we count this action, in retrospect, as worth 5 welfare for the Egyptians? I'd say no, because the ex post evaluation should go: "Granted that the Egyptians were at 10 welfare, was it good to learn that they were at 10 welfare?" And the answer is no: the learning was a 0-welfare change.
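
The numbers in this example can be laid out explicitly. The sketch below is one hedged reading of it, with all variable names being illustrative assumptions: "reading is 0 EV" is interpreted as "the expected shift in your estimate is zero before you read," and the ex post evaluation compares welfare with and without reading while holding the learned fact fixed.

```python
from fractions import Fraction as F

# Credences over the Egyptians' welfare before reading (from the example).
prior = {0: F(1, 2), 10: F(1, 2)}

def expected(dist):
    """Expected welfare under a distribution {welfare: probability}."""
    return sum(w * p for w, p in dist.items())

ex_ante = expected(prior)  # 5

# Before reading: averaged over what you might learn, your estimate is
# not expected to move, so reading has zero expected "news value".
ev_of_reading = sum(p * (w - ex_ante) for w, p in prior.items())  # 0

# After reading, you learn welfare was 10.
observed = 10
news_shift = observed - ex_ante  # 5: how far your estimate moved

# Ex post evaluation that conditions on what was learned: granted that
# welfare was 10, it is 10 whether or not you read about it.
welfare_if_read = observed
welfare_if_not_read = observed
ex_post_value = welfare_if_read - welfare_if_not_read  # 0

print(ex_ante, ev_of_reading, news_shift, ex_post_value)  # 5 0 5 0
```

The 5-point shift shows up only if the ex post evaluation compares against the old expectation rather than conditioning on the information learned.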

Comment by Joe_Carlsmith on Can you control the past? · 2021-09-02T01:39:53.694Z · EA · GW

"the emphasis here seems to be much more about whether you can actually have a causal impact on the past" -- I definitely didn't mean to imply that you could have a causal impact on the past. The key point is that the type of control in question is acausal. 

I agree that many of these cases involve unrealistic assumptions, and that CDT may well be an effective heuristic most of the time (indeed, I expect that it is). 

I don't feel especially hung up on calling it "control" -- ultimately it's the decision theory (e.g., rejecting CDT) that I'm interested in. I like the word "control," though, because I think there is a very real sense in which you get to choose what your copy writes on his whiteboard, and that this is pretty weird; and because, more broadly, one of the main objections to non-CDT decision theories is that it feels like they are trying to "control" the past in some sense (and I'm saying: this is OK).

Simulation stuff does seem like it could be one in-principle application here, e.g.: "if we create civilization simulations, then this makes it more likely that others whose actions are correlated with ours create simulations, in which case we're more likely to be in a simulation, so because we don't want to be in a simulation, this is a reason to not create simulations." But it seems there are various empirical assumptions about the correlations at stake here, and I haven't thought about cases like this much (and simulation stuff gets gnarly fast, even without bringing weird decision-theory in).

Comment by Joe_Carlsmith on Can you control the past? · 2021-09-02T01:22:34.387Z · EA · GW

Thanks for these comments. 

Re: “physics-based priors,” I don't think I have a full sense of what you have in mind, but at a high level, I don’t yet see how physics comes into the debate. That is, AFAICT everyone agrees about the relevant physics — and in particular, that you can’t causally influence the past, “change” the past, and so on. The question as I see it (and perhaps I should’ve emphasized this more in the post, and/or put things less provocatively) is more conceptual/normative: whether when making decisions we should think of the past the way CDT does — e.g., as a set of variables whose probabilities our decision-making can’t alter — or in the way that e.g. EDT does — e.g., as a set of variables whose probabilities our decision-making can alter (and thus, a set of variables that EDT-ish decision-making implicitly tries to “control” in a non-causal sense). Non-causal decision theories are weird; but they aren’t actually “I don’t believe in normal physics” weird. They’re more “I believe in managing the news about the already-fixed past” weird. 

Re: CDT’s domain of applicability, it sounds like your view is something like: “CDT generally works, but it fails in the type of cases that Joe treats as counter-examples to CDT.” I agree with this, and I think most people who reject CDT would agree, too (after all, most decision theories agree on what to do in most everyday cases; the traditional questions have been about what direction to go when their verdicts come apart). I’m inclined to think of this as CDT being wrong, because I’m inclined to think of decision theory as searching for the theory that will get the full range of cases right -- but I’m not sure that much hinges on this. That said, I do think that even acknowledging that CDT fails sometimes involves rejecting some principles/arguments one might’ve thought would hold good in general (e.g. “c’mon, man, it’s no use trying to control the past,” the “what would your friend who can see what's in the boxes say is better” argument, and so on) and thereby saying some striking and weird stuff (e.g. “Ok, it makes sense to try to control the past sometimes, just not that often”).

Re: 1-4, I agree that whether or not CDT leads you astray in a given case is an empirical question. I don’t have strong views about what range of actual cases are like this — though I’m sympathetic to your view re: 1, and as I mention in the post, I generally think we should just err on the side of not doing stuff that looks silly by normal lights. I also don’t have strong views about the relevance of non-causal decision-theory research for AGI safety (this project mostly emerged from personal interest).

Comment by Joe_Carlsmith on Can you control the past? · 2021-08-28T08:25:19.894Z · EA · GW

I'm imagining computers with sufficiently robust hardware to function deterministically at the software level, in the sense of very reliably performing the same computation, even if there's quantum randomness at a lower level. Imagine two good-quality calculators, manufactured by the same factory using the same process, which add together the same two numbers using the same algorithm, and hence very reliably move through the same high-level memory states and output the same answer. If quantum randomness makes them output different answers, I count that as a "malfunction."

Comment by Joe_Carlsmith on Can you control the past? · 2021-08-28T08:12:21.211Z · EA · GW

I have sympathy for responses like "look, it's just so clear that you can't control the past in any practically relevant sense that we should basically just assume the type of arguments in this post are wrong somehow." But I'm curious where you think the arguments actually go wrong, if you have a view about that? For example, do you think defecting in perfect deterministic twin prisoner's dilemmas with identical inputs is the way to go?

Comment by Joe_Carlsmith on Narration: "Against neutrality about creating happy lives" · 2021-07-10T22:33:11.362Z · EA · GW

Thanks for doing this!

Comment by Joe_Carlsmith on On the limits of idealized values · 2021-06-24T07:33:47.616Z · EA · GW

Thanks, Richard :). Re: arbitrariness, in a sense the relevant choices might well end up arbitrary (and as you say, subjectivists need to get used to some level of unavoidable arbitrariness), but I do think that it at least seems worth trying to capture/understand some sort of felt difference between e.g. picking between Buridan's bales of hay, and choosing e.g. what career to pursue, even if you don't think there's a "right answer" in either case. 

I agree that "infallible" maybe has the wrong implications, here, though I do think that part of the puzzle is the sense in which these choices feel like candidates for mistake or success; e.g., if I choose the puppies, or the crazy galaxy Joe world, I have some feeling like "man, I hope this isn't a giant mistake." That said, things we don't have control over, like desires, do feel like they have less of this flavor.

Comment by Joe_Carlsmith on On the limits of idealized values · 2021-06-24T07:24:43.966Z · EA · GW

I'm glad you liked it, Lukas. It does seem like an interesting question how your current confidence in your own values relates to your interest in further "idealization," of what kind, and how much convergence makes a difference. Prima facie, it does seem plausible that greater confidence speaks in favor of "conservatism" about what sorts of idealization you go in for, though I can imagine very uncertain-about-their-values people opting for conservatism, too. Indeed, it seems possible that conservatism is just generally pretty reasonable, here.

Comment by Joe_Carlsmith on Draft report on existential risk from power-seeking AI · 2021-05-08T00:07:40.815Z · EA · GW

Hi Ben, 

This does seem like a helpful kind of content to include (here I think of Luke’s section on this here, in the context of his work on moral patienthood). I’ll consider revising to say more in this vein. In the meantime, here are a few updates off the top of my head:

  • It now feels more salient to me just how many AI applications may be covered by systems that either aren’t agentic planning/strategically aware (including e.g. interacting modular systems, especially where humans are in the loop for some parts, and/or intuitively “sphexish/brittle” non-APS systems), or by systems which are specialized/myopic/limited in capability in various ways. That is, a generalized learning agent that’s superhuman (let alone better than e.g. all of human civilization) in ~all domains, with objectives as open-ended and long-term as “maximize paperclips,” now seems to me a much more specific type of system, and one whose role in an automated economy -- especially early on -- seems more unclear. (I discuss this a bit in Section 3, section, and section 4.3.2).
  • Thinking about the considerations discussed in the "unusual difficulties" section generally gave me more clarity about how this problem differs from safety problems arising in the context of other technologies (I think I had previously been putting more weight on considerations like "building technology that performs function F is easier than building some technology that performs function F safely and reliably," which apply more generally).
  • I realized how much I had been implicitly conceptualizing the “alignment problem” as “we must give these AI systems objectives that we’re OK seeing pursued with ~arbitrary degrees of capability” (something akin to the “omni test”). Meeting standards in this vicinity (to the extent that they're well defined in a given case) seems like a very desirable form of robustness (and I’m sympathetic to related comments from Eliezer to the effect that “don’t build systems that are searching for ways to kill you, even if you think the search will come up empty”), but I found it helpful to remember that the ultimate problem is “we need to ensure that these systems don’t seek power in misaligned ways on any inputs they’re in fact exposed to” (e.g., what I’m calling “practical PS-alignment”) -- a framing that leaves more conceptual room, at least, for options that don’t “get the objectives exactly right,” and/or that involve restricting a system’s capabilities/time horizons, preventing it from “intelligence exploding,” controlling its options/incentives, and so on (though I do think options in this vein raise their own issues, of the type that the "omni test" is meant to avoid, see,, and 4.3.3). I discuss this a bit in section 4.1.
  • I realized that my thinking re: “races to the bottom on safety” had been driven centrally by abstract arguments/models that could apply in principle to many industries (e.g., pharmaceuticals). It now seems to me a knottier and more empirical question how models of this kind will actually apply in a given real-world case re: AI. I discuss this a bit in section 5.3.1.

Comment by Joe_Carlsmith on Draft report on existential risk from power-seeking AI · 2021-05-07T18:39:51.542Z · EA · GW

Hi Ben, 

A few thoughts on this: 

  • It seems possible that attempting to produce “great insight” or “simple arguments of world-shattering importance” warrants a methodology different from the one I’ve used here. But my aim here is humbler: to formulate and evaluate an existing argument that I and various others take seriously, and that lots of resources are being devoted to; and to come to initial, informal, but still quantitative best-guesses about the premises and conclusion, which people can (hopefully) agree/disagree with at a somewhat fine-grained level -- e.g., a level that just giving overall estimates, or just saying e.g. “significant probability,” “high enough to worry about,” etc can make more difficult to engage on.
  • In that vein, I think it’s possible you’re over-estimating how robust I take the premises and numbers here to be (I'm thinking here of your comments re: “very accurately carve the key parts of reality that are relevant,” and "trust the outcome number"). As I wrote in response to Rob above, my low-end/high-end range here is .1% to 40% (see footnote 179, previously 178), and in general, I hold the numbers here very lightly (I try to emphasize this in section 8). 
  • FWIW, I think Superintelligence can be pretty readily seen as a multi-step argument (e.g., something like: superintelligence will happen eventually; fast take-off is plausible; if fast-take-off, then a superintelligence will probably get a decisive strategic advantage; alignment will be tricky; misalignment leads to power-seeking; therefore plausible doom). And more broadly, I think that people make arguments with many premises all the time (though sometimes the premises are suppressed). It’s true that people don’t usually assign probabilities to the premises (and Bostrom doesn’t, in Superintelligence -- a fact that leaves the implied p(doom) correspondingly ambiguous) -- but I think this is centrally because assigning informal probabilities to claims (whether within a multi-step argument, or in general) just isn’t a very common practice, for reasons not centrally to do with e.g. multi-stage-fallacy type problems. Indeed, I expect I’d prefer a world where people assigned informal, lightly-held probabilities to their premises and conclusions (and formulated their arguments in premise-premise-conclusion form) more frequently.
  • I’m not sure exactly what you have in mind re: “examining a single worldview to see whether it’s consistent,” but consistency in a strict sense seems too cheap? E.g., “Bob has always been wrong before, but he’ll be right this time”; “Mortimer Snodgrass did it”; etc are all consistent. That said, my sense is that you have something broader in mind -- maybe something like "plausible," "compelling," "sense-making," etc. But it seems like these still leave the question of overall probabilities open...

Overall, my sense is that disagreement here is probably more productively focused on the object level -- e.g., on the actual probabilities I give to the premises, and/or on pointing out and giving weight to scenarios that the premises don’t cover -- rather than on the methodology in the abstract. In particular, I doubt that people who disagree a lot with my bottom line will end up saying: “If I was to do things your way, I’d roughly agree with the probabilities you gave to the premises; I just disagree that you should assign probabilities to premises in a multi-step argument as a way of thinking about issues like this.” Rather, I expect a lot of it comes down to substantive disagreement about the premises at issue (and perhaps, to people assigning significant credence to scenarios that don’t fit these premises, though I don't feel like I've yet heard strong candidates -- e.g., ones that seem to me to plausibly account for, say, >2/3rds of the overall X-risk from power-seeking, misaligned AI by 2070 -- in this regard).

Comment by Joe_Carlsmith on Draft report on existential risk from power-seeking AI · 2021-05-01T01:18:33.099Z · EA · GW

Hi Hadyn, 

Thanks for your kind words, and for reading. 

  1. Thanks for pointing out these pieces. I like the breakdown of the different dimensions of long-term vs. near-term. 
  2. Broadly, I agree with you that the document could benefit from more about premise 5. I’ll consider revising to add some.
  3. I’m definitely concerned about misuse scenarios too (and I think lines here can get blurry -- see e.g. Katja Grace’s recent post); but I wanted, in this document, to focus on misalignment in particular. The question of how to weigh misuse vs. misalignment risk, and how the two are similar/different more generally, seems like a big one, so I’ll mostly leave it for another time (one big practical difference is that misalignment makes certain types of technical work more relevant).
  4. Eventually, the disempowerment has to scale to ~all of humanity (a la premise 5), so that would qualify as TAI in the “transition as big of a deal as the industrial revolution” sense. However, it’s true that my timelines condition in premise 1 (e.g., APS systems become possible and financially feasible) is weaker than Ajeya’s.
Comment by Joe_Carlsmith on Draft report on existential risk from power-seeking AI · 2021-05-01T00:12:52.463Z · EA · GW

(Continued from comment on the main thread)

I'm understanding your main points/objections in this comment as: 

  1. You think the multiple stage fallacy might be the methodological crux behind our disagreement. 
  2. You think that >80% of AI safety researchers at MIRI, FHI, CHAI, OpenAI, and DeepMind would assign >10% probability to existential catastrophe from technical problems with AI (at some point, not necessarily before 2070). So it seems like 80k saying 1-10% reflects a disagreement with the experts, which would be strange in the context of e.g. climate change, and at least worth flagging/separating. (Presumably, something similar would apply to my own estimates.)
  3. You worry that there are social reasons not to sound alarmist about weird/novel GCRs, and that it can feel “conservative” to low-ball rather than high-ball the numbers. But low-balling (and/or focusing on/making salient lower-end numbers) has serious downsides. And you worry that EA folks have a track record of mistakes in this vein.

(as before, let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p)

Re 1 (and 1c, from my response to the main thread): as I discuss in the document, I do think there are questions about multiple-stage fallacies, here, though I also think that not decomposing a claim into sub-claims can risk obscuring conjunctiveness (and I don’t see “abandon the practice of decomposing a claim into subclaims” as a solution to this). As an initial step towards addressing some of these worries, I included an appendix that reframes the argument using fewer premises (and also, in positive (e.g., “p is false”) vs. negative (“p is true”) forms). Of course, this doesn’t address e.g. the “the conclusion could be true, but some of the premises false” version of the “multiple stage fallacy” worry; but FWIW, I really do think that the premises here capture the majority of my own credence on p, at least. In particular, the timelines premise is fairly weak, premises 4-6 are implied by basically any p-like scenario, so it seems like the main contenders for false premises (even while p is true) are 2: (“There will be strong incentives to build APS systems”) and 3: (“It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway”). Here, I note the scenarios most salient to me in footnote 173, namely: “we might see unintentional deployment of practical PS-misaligned APS systems even if they aren’t superficially attractive to deploy” and “practical PS-misaligned might be developed and deployed even absent strong incentives to develop them (for example, simply for the sake of scientific curiosity).” But I don’t see these as constituting more than e.g. 50% of the risk. 
If your own probability is driven substantially by scenarios where the premises I list are false, I’d be very curious to hear which ones (setting aside scenarios that aren’t driven by power-seeking, misaligned AI), and how much credence you give them. I’d also be curious, more generally, to hear your more specific disagreements with the probabilities I give to the premises I list. 

Re: 2, your characterization of the distribution of views amongst AI safety researchers (outside of MIRI) is in some tension with my own evidence; and I consulted with a number of people who fit your description of “specialists”/experts in preparing the document. That said, I’d certainly be interested to see more public data in this respect, especially in a form that breaks down in (rough) quantitative terms the different factors driving the probability in question, as I’ve tried to do in the document (off the top of my head, the public estimates most salient to me are Ord (2020) at 10% by 2100, Grace et al (2017)’s expert survey (5% median, with no target date), and FHI’s (2008) survey (5% on extinction from superintelligent AI by 2100), though we could gather up others from e.g. LW and previous X-risk books.) That said, importantly, and as indicated in my comment on the main thread, I don’t think of the community of AI safety researchers at the orgs you mention as in an epistemic position analogous to e.g. the IPCC, for a variety of reasons (and obviously, there are strong selection effects at work). Less importantly, I also don’t think the technical aspects of this problem are the only factors relevant to assessing risk; at this point I have some feeling of having “heard the main arguments”; and >10% (especially if we don’t restrict to pre-2070 scenarios) is within my “high-low” range mentioned in footnote 178 (e.g., .1%-40%). 

Re: 3, I do think that the “conservative” thing to do here is to focus on the higher-end estimates (especially given uncertainty/instability in the numbers), and I may revise to highlight this more in the text. But I think we should distinguish between the project of figuring out “what to focus on”/what’s “appropriately conservative,” and what our actual best-guess probabilities are; and just as there are risks of low-balling for the sake of not looking weird/alarmist, I think there are risks of high-balling for the sake of erring on the side of caution. My aim here has been to do neither; though obviously, it’s hard to eliminate biases (in both directions).

Comment by Joe_Carlsmith on Draft report on existential risk from power-seeking AI · 2021-04-30T23:57:00.299Z · EA · GW

Hi Rob, 

Thanks for these comments. 

Let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p. I’m understanding your main objections in this comment as: 

  1. It seems to you like we’re in a world where p is true, by default. Hence, 5% on p seems too low to you. In particular:
    1. It implies 95% confidence on not p, which seems to you overly confident.
    2. If p is true by default, you think the world would look like it does now; so if this world isn’t enough to get me above 5%, what would be?
    3. Because p seems true to you by default, you suspect that an analysis that only ends up putting 5% on p involves something more than “the kind of mistake you should make in any ordinary way,” and requires some kind of mistake in methodology.

One thing I’ll note at the outset is the content of footnote 178, which (partly prompted by your comment) I may revise to foreground more in the main text: “In sensitivity tests, where I try to put in ‘low-end’ and ‘high-end’ estimates for the premises above, this number varies between ~.1% and ~40% (sampling from distributions over probabilities narrows this range a bit, but it also fails to capture certain sorts of correlations). And my central estimate varies between ~1-10% depending on my mood, what considerations are salient to me at the time, and so forth. This instability is yet another reason not to put too much weight on these numbers. And one might think variation in the direction of higher risk especially worrying.”
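
To make the shape of such a sensitivity test concrete, here is a minimal sketch. It assumes the overall estimate is simply the product of the premise probabilities (each conditional on the premises before it). The six (low, central, high) triples are hypothetical placeholders chosen for illustration, not the numbers used in the report, and the uniform sampling is likewise an assumed stand-in for "sampling from distributions over probabilities."

```python
import random
from math import prod

# Hypothetical (low, central, high) probabilities for six premises.
# Illustrative placeholders only; NOT the values used in the report.
PREMISES = [
    (0.30, 0.60, 0.90),
    (0.40, 0.70, 0.90),
    (0.20, 0.50, 0.80),
    (0.30, 0.60, 0.90),
    (0.20, 0.50, 0.80),
    (0.70, 0.90, 0.99),
]


def endpoint_estimates(premises):
    """Multiply all low-end, all central, and all high-end values."""
    low = prod(p[0] for p in premises)
    central = prod(p[1] for p in premises)
    high = prod(p[2] for p in premises)
    return low, central, high


def sampled_interval(premises, n=10_000, seed=0):
    """Sample each premise independently and uniformly between its low and
    high value (an assumption), and return the 5th and 95th percentiles of
    the resulting products. Independent sampling ignores correlations
    between premises."""
    rng = random.Random(seed)
    products = sorted(
        prod(rng.uniform(lo, hi) for lo, _, hi in premises) for _ in range(n)
    )
    return products[int(0.05 * n)], products[int(0.95 * n)]


if __name__ == "__main__":
    low, central, high = endpoint_estimates(PREMISES)
    p5, p95 = sampled_interval(PREMISES)
    print(f"all-low: {low:.3%}  central: {central:.3%}  all-high: {high:.3%}")
    print(f"sampled 5th-95th percentile: {p5:.3%} to {p95:.3%}")
```

Setting every premise to its low end (or high end) at once gives the widest possible range; sampling independently rarely lands every premise at the same extreme, which is one way to see why sampling narrows the range while missing correlations between premises.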

Re 1a: I’m open to 5% being too low. Indeed, I take “95% seems awfully confident,” and related worries in that vein, seriously as an objection. However, as the range above indicates, I also feel open to 5% being too high (indeed, at times it seems that way to me), and I don’t see “it would be strange to be so confident that all of humanity won’t be killed/disempowered because of X” as a forceful argument on its own (quite the contrary): rather, I think we really need to look at the object-level evidence and argument for X, which is what the document tries to do (not saying that quote represents your argument; but hopefully it can illustrate why one might start from a place of being unsurprised if the probability turns out low).

Re 1b: I’m not totally sure I’ve understood you here, but here are a few thoughts. At a high level, one answer to “what sort of evidence would make me update towards p being more likely” is “the considerations discussed in the document that I see as counting against p don’t apply, or seem less plausible” (examples here include considerations related to longer timelines, non-APS/modular/specialized/myopic/constrained/incentivized/not-able-to-easily-intelligence-explode systems sufficing in lots/maybe ~all of incentivized applications, questions about the ease of eliminating power-seeking behavior on relevant inputs during training/testing given default levels of effort, questions about why and in what circumstances we might expect PS-misaligned systems to be superficially/sufficiently attractive to deploy, warning shots, corrective feedback loops, limitations to what APS systems with lopsided/non-crazily-powerful capabilities can do, general incentives to avoid/prevent ridiculously destructive deployment, etc, plus more general considerations like “this feels like a very specific way things could go”). 

But we could also imagine more “outside view” worlds where my probability would be higher: e.g., there is a body of experts as large and established as the experts working on climate change, which uses quantitative probabilistic models of the quality and precision used by the IPCC, along with an understanding of the mechanisms underlying the threat as clear and well-established as the relationship between carbon emissions and climate change, to reach a consensus on much higher estimates. Or: there is a significant, well-established track record of people correctly predicting future events and catastrophes of this broad type decades in advance, and people with that track record predict p with >5% probability.

That said, I think maybe this isn’t getting at the core of your objection, which could be something like: “if in fact this is a world where p is true, is your epistemology sensitive enough to that? E.g., show me that your epistemology is such that, if p is true, it detects p as true, or assigns it significant probability.” I think there may well be something to objections in this vein, and I'm interested in thinking about them more; but I also want to flag that at a glance, it feels kind of hard to articulate them in general terms. Thus, suppose Bob has been wrong about 99/100 predictions in the past. And you say: “OK, but if Bob was going to be right about this one, despite being consistently wrong in the past, the world would look just like it does now. Show me that your epistemology is sensitive enough to assign high probability to Bob being right about this one, if he’s about to be.” But this seems like a tough standard; you just should have low probability on Bob being right about this one, even if he is. Not saying that’s the exact form of your objection, or even that it's really getting at the heart of things, but maybe you could lay out your objection in a way that doesn’t apply to the Bob case?

(Responses to 1c below)

Comment by Joe_Carlsmith on Problems of evil · 2021-04-20T05:11:45.826Z · EA · GW

Sounds right to me. Per a conversation with Aaron a while back, I've been relying on the moderators to tag posts as personal blog, and had been assuming this one would be.

Comment by Joe_Carlsmith on The importance of how you weigh it · 2021-04-08T06:03:51.832Z · EA · GW

Glad to hear you found it helpful. Unfortunately, I don't think I have a lot to add at the moment re: how to actually pursue moral weighting research, beyond what I gestured at in the post (e.g., trying to solicit lots of your own/other people's intuitions across lots of cases, trying to make them consistent, that kind of thing). Re: articles/papers/posts, you could also take a look at GiveWell's process here, and the moral weight post from Luke Muehlhauser I mentioned has a few references at the end that might be helpful (though most of them I haven't engaged with myself). I'll also add, FWIW, that I actually think the central point in the post is more applicable outside of the EA community than inside it, as I think of EA as fairly "basic-set oriented" (though there are definitely some questions in EA where weightings matter).

Comment by Joe_Carlsmith on Against neutrality about creating happy lives · 2021-03-18T09:20:59.574Z · EA · GW

Hi Michael — 

I meant, in the post, for the following paragraphs to address the general issue you mention: 

Some people don’t think that gratitude of this kind makes sense. Being created, we might say, can’t have been “better for” me, because if I hadn’t been created, I wouldn’t exist, and there would be no one that Wilbur’s choice was “worse for.” And if being created wasn’t better for me, the thought goes, then I shouldn’t be grateful to Wilbur for creating me.

Maybe the issues here are complicated, but at a high level: I don’t buy it. It seems to me very natural to see Wilbur as having done, for me, something incredibly significant — to have given me, on purpose, something that I value deeply. One option, for capturing this, is to say that something can be good for me, without being “better” for me (see e.g. McMahan (2009)). Another option is just to say that being created is better for me than not being created, even if I only exist — at least concretely — in one of the cases. Overall, I don’t feel especially invested in the metaphysics/semantics of “good for” and “better for” in this sort of case. I don’t have a worked out account of these issues, but neither do I see them as especially forceful reason not to be glad that I’m alive, or grateful to someone who caused me to be so.

That is, I don’t take myself to be advocating directly for comparativism here (though a few bits of the language in the post, in particular the reference to “better off dead,” do suggest that). As the quoted paragraphs note, comparativism is one option; another is to say that creating me is good for me, even if it’s not better for me (a la McMahan). 

FWIW, though, I do currently feel intuitively open/sympathetic to comparativism, partly because it seems plausible that we can say truly things like “Joe would prefer to live rather than not to live,” even if Joe doesn’t and never will exist; and clear that we can truly say "Joe prefers to live" in worlds where he does exist; and I tend to think about treating people well as centrally about being responsive to what they care about/would care about. But I haven’t tried to dig in on this stuff, partly because I see things like being glad I’m alive, and grateful to someone who caused me to be so, as on more generally solid ground than things like “betterness for Joe is a relation that requires two concrete Joe lives as relata" (see e.g. the Menagerie argument in Hilary's powerpoint, p. 13, for the type of thing that makes me think that metaphysical premises like that aren't a "super solid ground" type area). 

At a higher level, though: the point I’m arguing against is specifically that the neutrality intuition is directly intuitive. I don’t see it that way, and the point of “poetically tugging at people’s intuitions” was precisely to try to illustrate and make vivid the intuitive situation as I see it. But as I note at the end -- e.g., “direct intuitions about neutrality aren’t the only data available” -- it’s a further question whether there is more to be said for neutrality overall (indeed, I think there is -- though metaphysical issues like the ones you mention aren’t very central for me here). That said, I tend to see much of person-affecting ethics as driven at least in substantial part by appeal to direct intuition, so I do think it would change the overall dialectical landscape a bit if people come in going “intuitively, we have strong reasons to create happy lives. But there are some metaphysical/semantic questions about how to make sense of this…” 

Comment by Joe_Carlsmith on Contact with reality · 2021-02-18T06:41:14.505Z · EA · GW

Thanks! Re: mental manipulation, do you have similar worries even granted that you’ve already been being manipulated in these ways? We can stipulate that there won’t be any increase in the manipulation in question, if you stay. One analogy might be: extreme cognitive biases that you’ve had all along. They just happen to be machine-imposed. 

That said, I don’t think this part is strictly necessary for the thought experiment, so I’m fine with folks leaving it out if it trips them up.

Comment by Joe_Carlsmith on On clinging · 2021-02-01T08:58:32.492Z · EA · GW

Glad to hear you enjoyed it. 

I haven't engaged much with tranquilism. Glancing at that piece, I do think that the relevant notions of "craving" and "clinging" are similar; but I wouldn't say, for example, that an absence of clinging makes an experience as good as it can be for someone.

Comment by Joe_Carlsmith on Actually possible: thoughts on Utopia · 2021-01-25T07:31:48.954Z · EA · GW

Thanks :). I haven't thought much about personal universes, but glancing at the paper, I'd expect resource-distribution, for example, to remain an issue.

Comment by Joe_Carlsmith on Alienation and meta-ethics (or: is it possible you should maximize helium?) · 2021-01-20T08:33:40.456Z · EA · GW

Glad to hear it :)

Re: "my motivational system is broken, I'll try to fix it" as the thing to say as an externalist realist: I think this makes sense as a response. The main thing that seems weird to me is the idea that you're fundamentally "cut off" from seeing what's good about helium, even though there's nothing you don't understand about reality. But it's a weird case to imagine, and the relevant notions of "cut off" and "understanding" are tricky.

Comment by Joe_Carlsmith on Alienation and meta-ethics (or: is it possible you should maximize helium?) · 2021-01-16T09:58:59.860Z · EA · GW

Thanks for reading. Re: your version of anti-realism: is "I should create flourishing (or whatever your endorsed theory says)" in your mouth/from your perspective true, or not truth-apt? 

To me Clippy's having or not having a moral theory doesn't seem very central. E.g., we can imagine versions in which Clippy (or some other human agent) is quite moralizing, non-specific, universal, etc about clipping, maximizing pain, or whatever.