Have faux-evil EA energy 2022-08-23T22:55:46.314Z
The inordinately slow spread of good AGI conversations in ML 2022-06-29T04:02:23.445Z
Twitter-length responses to 24 AI alignment arguments 2022-03-14T19:34:22.452Z
AI views and disagreements AMA: Christiano, Ngo, Shah, Soares, Yudkowsky 2022-03-01T01:13:34.685Z
Animal welfare EA and personal dietary options 2022-01-05T18:53:52.698Z
Conversation on technology forecasting and gradualism 2021-12-09T19:00:00.000Z
Discussion with Eliezer Yudkowsky on AGI interventions 2021-11-11T03:21:50.685Z
2020 PhilPapers Survey Results 2021-11-02T05:06:42.834Z
Quick general thoughts on suffering and consciousness 2021-10-30T18:09:17.811Z
Outline of Galef's "Scout Mindset" 2021-08-10T00:18:01.442Z
"Existential risk from AI" survey results 2021-06-01T20:19:33.282Z
Predict responses to the "existential risk from AI" survey 2021-05-28T01:38:02.530Z
Julia Galef and Matt Yglesias on bioethics and "ethics expertise" 2021-03-30T03:06:41.561Z
Politics is far too meta 2021-03-17T23:57:33.321Z
- A Petition 2020-06-25T23:29:46.491Z
RobBensinger's Shortform 2019-09-23T19:44:20.095Z
New edition of "Rationality: From AI to Zombies" 2018-12-15T23:39:22.975Z
AI Summer Fellows Program: Applications open 2018-03-23T21:20:05.203Z
Anonymous EA comments 2017-02-07T21:42:24.686Z
Ask MIRI Anything (AMA) 2016-10-11T19:54:25.621Z
MIRI is seeking an Office Manager / Force Multiplier 2015-07-05T19:02:24.163Z


Comment by RobBensinger on On how various plans miss the hard bits of the alignment challenge · 2022-09-22T19:30:10.128Z · EA · GW

scientists who are trying to understand human brains do spend a lot (most?) of their time looking at nonhuman brains, no?

My sense is that this is mostly for ethics reasons, rather than representing a strong stance that animal models are the fastest way to make progress on understanding human cognition.

Comment by RobBensinger on Have faux-evil EA energy · 2022-09-22T15:53:20.901Z · EA · GW

Whatever works for you!

For me, "guilty pleasure" is a worse tag because it encourages me to feel guilty, which is exactly what I don't want to encourage myself to do.

"I'm being so evil by listening to Barry Manilow" works well for me exactly because it's too ridiculous to take seriously, so it defuses guilt. I'm making light of the feel-guilty impulse, not just acknowledging it.

Comment by RobBensinger on Have faux-evil EA energy · 2022-08-24T17:44:09.609Z · EA · GW

Yep. As the author of , I think of myself as an MtG red person who cosplays as Esper because I happen to have found myself in a world that has a lot of urgent WUB-shaped problems.

(The world has plenty of fiery outrage and impulsive action and going-with-the-flow, but not a lot of cold utilitarian calculus, principled integrity, rationalism, utopian humanism, and technical alignment research. If I don't want humanity's potential to be snuffed out, I need to prioritize filling those gaps over pure self-expression.)

The specific phrasing "ruthlessly do whatever you want all the time" sounds more MtG-black to me than MtG-red, but if I interpret it as MtG-red, I think I understand what it's trying to convey. :)

Comment by RobBensinger on AGI ruin scenarios are likely (and disjunctive) · 2022-08-04T18:01:55.207Z · EA · GW

My basic reason for thinking "early rogue [AGI] will inevitably succeed in defeating us" is:

  • I think human intelligence is crap. E.g.:
    • STEM ability arose in humans as an accidental side effect: our brains underwent zero selection for the ability to do STEM in our EEA, and barely any selection to optimize this skill since the Scientific Revolution. We should expect that much more is possible when humans are deliberately optimizing brains to be good at STEM.
    • There are many embarrassingly obvious glaring flaws in human reasoning.
    • One especially obvious example is "ability to think mathematically at all". This seems in many respects like a reasoning ability that's both relatively simple (it doesn't require grappling with the complexity of the physical world) and relatively core. Yet the average human can't even do trivial tasks like 'multiply two eight-digit numbers together in your head in under a second'. This gap on its own seems sufficient for AGI to blow humans out of the water.
    • (E.g., I expect there are innumerable scientific fields, subfields, technologies, etc. that are easy to find when you can hold a hundred complex mathematical structures in your 200 slots of working memory simultaneously and perceive connections between those structures. Many things are hard to do across a network of separated brains, calculators, etc. that are far easier to do within a single brain that can hold everything in view at once, understand the big picture, consciously think about many relationships at once, etc.)
    • Example: AlphaGo Zero. There was a one-year gap between 'the first time AI ever defeated a human professional' and 'the last time a human professional ever beat a SotA AI'. AlphaGo Zero in particular showed that 2500 years of human reasoning about Go was crap compared to what was pretty easy to do with 2017 hardware and techniques and ~72 hours of self-play. This isn't a proof that human intelligence is similarly crap in physical-data-dependent STEM work, or in other formal settings, but it seems like a strong hint.
  • I'd guess we already have a hardware overhang for running AGI. (Considering, e.g., that we don't need to recapitulate everything a human brain is doing in order to achieve AGI. Indeed, I'd expect that we only need to capture a small fraction of what the human brain is doing in order to produce superhuman STEM reasoning. I expect that AGI will be invented in the future (i.e., we don't already have it), and that we'll have more than enough compute.)

I'd be curious to know (1) whether you disagree with these points, and (2) whether you disagree that these points are sufficient to predict that at least one early AGI system will be capable enough to defeat humans, if we don't succeed on the alignment problem.

(I usually think of "early AGI systems" as 'AGI systems built within five years of when humanity first starts building a system that could be deployed to do all the work human engineers can do in at least one hard science, if the developers were aiming at that goal'.)

Comment by RobBensinger on AGI ruin scenarios are likely (and disjunctive) · 2022-08-04T17:27:46.031Z · EA · GW

It's strange to me that this is aimed at people who aren't aware that MIRI staffers are quite pessimistic about AGI risk.

It's not. It's mainly aimed at people who found it bizarre and hard-to-understand that Nate views AGI risk as highly disjunctive. (Even after reading all the disjunctive arguments in AGI Ruin.) This post is primarily aimed at people who understand that MIRI folks are pessimistic, but don't understand where "it's disjunctive" is coming from.

Comment by RobBensinger on AGI ruin scenarios are likely (and disjunctive) · 2022-08-04T17:23:22.653Z · EA · GW

Alex Lintz's take: 

Comment by RobBensinger on Brainstorm of things that could force an AI team to burn their lead · 2022-07-27T03:35:51.369Z · EA · GW

Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.

By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that

  • is highly safety-conscious.
  • has a large resource advantage over the other groups, so that it can hope to reach AGI with more than a year of lead time — including accumulated capabilities ideas and approaches that it hasn’t been publishing.
  • has adequate closure and opsec practices, so that it doesn’t immediately lose its technical lead if it successfully acquires one.

The magnitude and variety of difficulties likely to arise in aligning the first AGI systems also suggest three things: that attempts to align systems as opaque as current SotA systems are very likely to fail; that an AGI developer likely needs to have spent the preceding years deliberately steering toward relatively alignable approaches to AGI; and that we need to up our game in general, approaching the problem in ways closer to the engineering norms at (for example) NASA than to the engineering norms that are standard in ML today.

Comment by RobBensinger on On how various plans miss the hard bits of the alignment challenge · 2022-07-12T22:47:00.262Z · EA · GW

Oops, thanks! I checked for those variants elsewhere but forgot to do so here. :)

It is also possible that dignity is a good framing overall and I'm just weird, in which case I fully endorse using it.

I think it's a good framing for some people and not for others. I'm confident that many people shouldn't use this framing regularly in their own thinking. I'm less sure whether the people who do find it valuable should steer clear of mentioning it; that's a bit more extreme.

Comment by RobBensinger on On how various plans miss the hard bits of the alignment challenge · 2022-07-12T21:46:10.443Z · EA · GW

"Dignity" indeed only occurs once, and I assume it's calling back to the same "death with dignity" concept from the April Fool's post (which I agree shouldn't have been framed as an April Fool's thing).

I assume EY didn't expect the post to have such a large impact, in part because he'd already said more or less the same thing, with the same terminology, in a widely-read post back in November 2021:


At a high level one thing I want to ask about is research directions and prioritization. For example, if you were dictator for what researchers here (or within our influence) were working on, how would you reallocate them?

Eliezer Yudkowsky 

The first reply that came to mind is "I don't know." I consider the present gameboard to look incredibly grim, and I don't actually see a way out through hard work alone. We can hope there's a miracle that violates some aspect of my background model, and we can try to prepare for that unknown miracle; preparing for an unknown miracle probably looks like "Trying to die with more dignity on the mainline" (because if you can die with more dignity on the mainline, you are better positioned to take advantage of a miracle if it occurs).

The term also shows up a ton in the Late 2021 MIRI Conversations, e.g., here and here.

I appreciate the data point about the term being one you find upsetting to run into; thanks for sharing about that, Devin. And, for whatever it's worth, I'm sorry. I don't like sharing info (or framings) that cause people distress like that.

I don't know whether data points like this will update Nate and/or Eliezer all the way to thinking the term is net-negative to use. If not, and this is a competing access needs issue ('one group finds it much more motivating to use the phrase X; another group finds that exact same phrase extremely demotivating'), then I think somebody should make a post walking folks through a browser text-replacement method that can swap out words like 'dignity' and 'dignified' (on LW, the EA Forum, the MIRI website, etc.) for something more innocuous/silly.
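As a minimal sketch of the kind of text-replacement method described above, something like the following could be pasted into a userscript manager. It assumes whole-word, case-insensitive matching and that the map keys are plain lowercase words; the names `REPLACEMENTS`, `replaceWords`, and `replaceInPage`, and the placeholder word "pizzazz", are my own hypothetical choices, not an existing tool.

```typescript
// Hypothetical word map: keys must be plain lowercase words (no regex characters).
// Swap in whatever replacements you find innocuous or silly.
const REPLACEMENTS: Record<string, string> = {
  dignity: "pizzazz",
  dignified: "pizzazzy",
};

// Keep a leading capital, so "Dignity" becomes "Pizzazz".
function matchCase(original: string, replacement: string): string {
  const first = original.charAt(0);
  const isCapital = first === first.toUpperCase() && first !== first.toLowerCase();
  return isCapital
    ? replacement.charAt(0).toUpperCase() + replacement.slice(1)
    : replacement;
}

// Replace whole words only (\b boundaries, so "indignity" is left alone),
// ignoring case when matching.
function replaceWords(text: string, map: Record<string, string>): string {
  const pattern = new RegExp(`\\b(${Object.keys(map).join("|")})\\b`, "gi");
  return text.replace(pattern, (word: string) =>
    matchCase(word, map[word.toLowerCase()]),
  );
}

// Walk every text node on the page and rewrite it in place.
// Does nothing outside a browser (e.g. when run under Node for testing).
function replaceInPage(map: Record<string, string>): void {
  const doc = (globalThis as any).document;
  if (!doc || !doc.body) return;
  const SHOW_TEXT = 4; // numeric value of NodeFilter.SHOW_TEXT
  const walker = doc.createTreeWalker(doc.body, SHOW_TEXT);
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    const before: string = node.nodeValue ?? "";
    const after = replaceWords(before, map);
    if (after !== before) node.nodeValue = after;
  }
}

replaceInPage(REPLACEMENTS);
```

This only rewrites text present when the script runs. Pages that load comments later would also need a `MutationObserver` to catch newly inserted text; I've left that out to keep the sketch short.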

Comment by RobBensinger on Co-Creation of the Library of Effective Altruism [Information Design] (1/2) · 2022-07-11T21:56:43.906Z · EA · GW

On Bullshit doesn't seem important to me. (Having read it.)

I'd guess Expert Political Judgment might be better than the average book in the rationality section? (But I haven't read it.)

Comment by RobBensinger on Co-Creation of the Library of Effective Altruism [Information Design] (1/2) · 2022-07-11T21:52:39.596Z · EA · GW

My picks for a Core Longtermist EA Bookshelf (I don't see myself as having any expertise on what belongs in a Core Neartermist EA Bookshelf) would be:

  • HPMoR ↔ Scout Mindset
  • Rationality: A-Z ↔ Good and Real
  • SSC (Abridged)
  • Superintelligence
  • Inadequate Equilibria ↔ Modern Principles of Economics (Cowen and Tabarrok)
  • Getting Things Done (Allen)

Some people hate Eliezer's style, so I tried to think of books that might serve as replacements for at least some of the core content in RAZ etc.

If I got a slightly longer list, I might add: How to Measure Anything, MPE, The Blank Slate (Pinker), Zero to One (Thiel), Focusing (Gendlin).

Note that I tried to pick books based on what I'd expect to have a maximally positive impact if lots of people-who-might-help-save-the-future read them, not based on whether the books 'feel EA' or cover EA topics.

Including R:AZ is sort of cheating, though, since it's more like six books in a trenchcoat and therefore uses up my Recommended EA Reading Slots all on its own. :p

I haven't read the vast majority of books on the longer list, and if I did read them, I'd probably change my recommendations a bunch.

I've read only part of The Blank Slate and Good and Real, and none of MPE, How to Measure Anything, or Focusing, so I'm including those partly based on how strongly others have recommended them and partly on my abstract sense of the skills and knowledge the books impart.

Comment by RobBensinger on Co-Creation of the Library of Effective Altruism [Information Design] (1/2) · 2022-07-11T21:19:51.430Z · EA · GW

I like this project, and the book selection looks good to me! :)

I would vote against The Singularity is Near, because I don't think Kurzweil meets the epistemic bar for EA and I don't think he contributes any important new ideas. If you want more intro AI books, there's always Smarter Than Us (nice for being short).

Though honestly, a smaller AI section seems fine to me too; I would rather trade away some AI space on the EA Bookshelf in exchange for extra space in a hypothetical future EA Blog Post Shelf. :P The only published AI-risk book I'm super attached to is Superintelligence (in spite of its oldness).

+1 for adding Elephant in the Brain in the next version. :)

I don't much like The Structure of Scientific Revolutions; a lot of people acquire a sort of mystical, non-gearsy model of scientific progress from Kuhn. And I'd guess a blog post or two suffices for learning the key concepts?

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-07-06T22:54:58.762Z · EA · GW

Could someone point to the actual quotes where Eliezer compares heliocentrism to MWI? I don't generally assume that when people are 'comparing' two very-high-probability things, they're saying they have the same probability. Among other things, I'd want confirmation that 'Eliezer and Paul assign roughly the same probability to MWI, but they have different probability thresholds for comparing things to heliocentrism' is false.

E.g., if I compare Flat Earther beliefs, beliefs in psychic powers, belief 'AGI was secretly invented in the year 2000', geocentrism, homeopathy, and theism to each other, it doesn't follow that I'd assign the same probabilities to all of those six claims, or even probabilities that are within six orders of magnitude of each other.

In some contexts it might indeed Griceanly imply that all six of those things pass my threshold for 'unlikely enough that I'm happy to call them all laughably silly views', but different people have their threshold for that kind of thing in different places.

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-07-06T22:28:15.291Z · EA · GW

I happen to agree with Eliezer that careful thought shows MWI to be unambiguously correct, and given that, the more extreme his confidence in this (IMO correct) claim, the more credit he deserves.

'The more probability someone assigns to a claim, the more credit they get when the claim turns out to be true' is true as a matter of Bayesian math. And I agree with you that MWI is true, and that we have enough evidence to say it's true with very high confidence, if by 'MWI' we just mean a conjunction like "Objective collapse is false" and "Quantum non-realism is false / the entire complex amplitude is in some important sense real".

(I think Eliezer had a conjunction like this in mind when he talked about 'MWI' in the Sequences; he wasn't claiming that decoherence explains the Born rule, and he certainly wasn't claiming that we need to reify 'worlds' as a fundamental thing. I think a better term for MWI might be the 'Much World Interpretation', since the basic point is about how much stuff there is, not about a division of that stuff into discrete 'worlds'.)

That said, I have no objection in principle to someone saying 'Eliezer was right about MWI (and gets more points insofar as he was correct), but I also dock him more points than he gained because I think he was massively overconfident'.

E.g., imagine someone who assigns probability 1 (or probability .999999999) to a coin flip coming up heads. If the coin then comes up heads, then I'm going to either assume they were trolling me, or I'm going to infer that they're very bad at reasoning. Even if they somehow rigged the coin, .999999999 is just too extreme a probability to be justified here.

By the same logic, if Eliezer had said that MWI is true with probability 1, or if he'd put too many '9s' at the end of his .99... probability assignment, then I'd probably dock him more points than he gained for being object-level-correct. (Or I'd at least assume he has a terrible understanding of how Bayesian probability works. Someone could indeed be very miscalibrated and bad at talking in probabilistic terms, and yet be very knowledgeable and correct on object-level questions like MWI.)

I'm not sure exactly how many 9s is too many in the case of MWI, but it's obviously possible to have too many 9s here. E.g., a hundred 9s would be too many! So I think this objection can make sense; I just don't think Eliezer is in fact overconfident about MWI.
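The "credit" and "too many 9s" points can be made concrete with a proper scoring rule. The sketch below uses the logarithmic score in bits (my choice for illustration; the comments above don't commit to a particular rule): assigning probability p to the outcome that actually happens scores log2(p), which rises monotonically with p. The function names are hypothetical.

```typescript
// Log score in bits for assigning probability p to the outcome that occurred.
// Always <= 0; closer to 0 is better, and it rises monotonically with p.
function logScore(p: number): number {
  return Math.log2(p);
}

// Expected log score for someone who says P(heads) = p
// when the coin actually lands heads with probability q.
function expectedScore(p: number, q: number): number {
  return q * logScore(p) + (1 - q) * logScore(1 - p);
}

// Bits lost when you put n nines on a claim (probability 1 - 10^-n)
// and it turns out false. The complement 10^-n is computed directly,
// because 1 - 10^-n rounds to exactly 1 in floating point for n of roughly 17 or more.
function costOfNines(n: number): number {
  return -logScore(Math.pow(10, -n));
}

const calibrated = expectedScore(0.5, 0.5); // -1 bit per fair flip
const overconfident = expectedScore(0.999999999, 0.5); // roughly -14.95 bits per fair flip
const gainIfHeads = logScore(0.999999999) - logScore(0.5); // just under 1 bit

console.log({ calibrated, overconfident, gainIfHeads });
console.log(costOfNines(9).toFixed(1)); // about 29.9 bits
console.log(costOfNines(100).toFixed(1)); // about 332.2 bits
```

Under a fair coin, the near-certain forecaster gains less than one bit over the calibrated forecaster when heads comes up, but loses about 29.9 bits when tails does, so their expected score is far worse. Each extra 9 raises the stake by another log2(10) ≈ 3.3 bits, which is one way to see why a hundred 9s would be too many.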

Comment by RobBensinger on Why AGI Timeline Research/Discourse Might Be Overrated · 2022-07-03T21:55:22.453Z · EA · GW

Yeah, I'm specifically interested in AGI / ASI / "AI that could cause us to completely lose control of the future in the next decade or less", and I'm more broadly interested in existential risk / things that could secure or burn the cosmic endowment. If I could request one thing, it would be clarity about when you're discussing "acutely x-risky AI" (or something to that effect) versus other AI things; I care much more about that than about you flagging personal views vs. consensus views.

Comment by RobBensinger on Why AGI Timeline Research/Discourse Might Be Overrated · 2022-07-03T21:50:23.865Z · EA · GW

Clarifying the kind of timelines work I think is low-importance:

I think there's value in distinguishing worlds like "1% chance of AGI by 2100" versus "10+% chance", and distinguishing "1% chance of AGI by 2050" versus "10+% chance".

So timelines work enabling those updates was good.[1]

But I care a lot less about, e.g., "2/3 by 2050" versus "1/3 by 2050".

And I care even less about distinguishing, e.g., "30% chance of AGI by 2030, 80% chance of AGI by 2050" from "15% chance of AGI by 2030, 50% chance of AGI by 2050".

  1. ^

    Though I think it takes very little evidence or cognition to rationally reach 10+% probability of AGI by 2100.

    One heuristic way of seeing this is to note how confident you'd need to be in 'stuff like the deep learning revolution (as well as everything that follows it) won't get us to AGI in the next 85 years', in order to make a 90+% prediction to that effect.

    Notably, you don't need a robust or universally persuasive 10+% in order to justify placing the alignment problem at or near the top of your priority list.

    You just need that to be your subjective probability at all, coupled with a recognition that AGI is an absurdly big deal and aligning the first AGI systems looks non-easy.

Comment by RobBensinger on Why AGI Timeline Research/Discourse Might Be Overrated · 2022-07-03T09:21:21.282Z · EA · GW

Agreed! Indeed, I think AGI timelines research is even less useful than this post implies; I think just about all of the work to date didn't help and shouldn't have been a priority.

I disagree with Reason 6 as a thing that should influence our behavior; if we let our behavior be influenced by reputational risks as small as this, IMO we'll generally be way too trigger-happy about hiding our honest views in order to optimize reputation, which is not a good way to make intellectual progress or build trust.

Regardless of timelines, there are many things we need to be making progress on as quickly as possible. These include improving discourse and practice around publication norms in AI; improving the level of rigor for risk assessment and management for developed and deployed AI systems;


improving dialogue and coordination among actors building powerful AI systems, to avoid reinvention of the wheel re: safety assessments and mitigations;

I'm not sure exactly what you have in mind here, but at a glance, this doesn't sound like a high priority to me. I don't think we have wheels to reinvent; the priority is to figure out how to do alignment at all, not to improve communication channels so we can share our current absence-of-ideas.

I would agree, however, that it's very high-priority to get people on the same page about basic things like 'we should be trying to figure out alignment at all', insofar as people aren't on that page.

getting competent, well-intentioned people into companies and governments to work on these things;

Getting some people into gov seems fine to me, but probably not on the critical path. Getting good people into companies seems closer to the critical path, but the framing still seems wrong to me, given my background model that (e.g.) we're hopelessly far from knowing how to do alignment today.

I think the priority should be to cause people to think about alignment who might give humanity a better idea of a realistic way we could actually align AGI systems, not to find nice smart people and reposition them to places that vaguely seem more important. I'd guess most 'placing well-intentioned people at important-seeming AI companies' efforts to date have been net-negative.

 getting serious AI regulation started in earnest;

Seems like plausibly a bad idea to me. I don't see a way this can realistically help outside of generically slowing the field down, and I'm not sure even this would be net-positive, given the likely effect on ML discourse?

I'd at least want to hear more detail, rather than just "let's regulate AI, because something must be done, and this is something".

 and doing basic safety and policy research.

I would specifically say 'figure out how to do technical alignment of AGI systems'. (Still speaking from my own models.)

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-29T02:01:59.312Z · EA · GW

I think that part of why Eliezer's early stuff sounds weird is:

  • He generally had a lower opinion of the competence of elites in business, science, etc. (Which he later updated about.)
  • He had a lower opinion of the field of AI in particular, as it existed in the 1990s and 2000s. Maybe more like nutrition science or continental philosophy than like chemistry, on the scale of 'field rigor and intellectual output'.

If you think of A(G)I as a weird, neglected, pre-paradigmatic field that gets very little attention outside of science fiction writing, then it's less surprising to think it's possible to make big, fast strides in the field. Outperforming a competitive market is very different from outperforming a small, niche market where very little high-quality effort is going into trying new things.

Similarly, if you have a lower opinion of elites, you should be more willing to endorse weird, fringe ideas, because you should be less confident that the mainstream is efficient relative to you. (And I think Eliezer still has a low opinion of elites on some very important dimensions, compared to a lot of EAs. But not to the same degree as teenaged Eliezer.)

From Competent Elites:


I used to think—not from experience, but from the general memetic atmosphere I grew up in—that executives were just people who, by dint of superior charisma and butt-kissing, had managed to work their way to the top positions at the corporate hog trough.

No, that was just a more comfortable meme, at least when it comes to what people put down in writing and pass around.  The story of the horrible boss gets passed around more than the story of the boss who is, not just competent, but more competent than you.


But the business world is not the only venue where I've encountered the upper echelons and discovered that, amazingly, they actually are better at what they do.

Case in point:  Professor Rodney Brooks, CTO of iRobot and former director of the MIT AI Lab, who spoke at the 2007 Singularity Summit.  I had previously known "Rodney Brooks" primarily as the promoter of yet another dreadful nouvelle paradigm in AI—the embodiment of AIs in robots, and the forsaking of deliberation for complicated reflexes that didn't involve modeling.  Definitely not a friend to the Bayesian faction.  Yet somehow Brooks had managed to become a major mainstream name, a household brand in AI...

And by golly, Brooks sounded intelligent and original.  He gave off a visible aura of competence.

And from Above-Average AI Scientists:

At one of the first conferences organized around the tiny little subfield of Artificial General Intelligence, I met someone who was heading up a funded research project specifically declaring AGI as a goal, within a major corporation.  I believe he had people under him on his project.  He was probably paid at least three times as much as I was paid (at that time).  His academic credentials were superior to mine (what a surprise) and he had many more years of experience.  He had access to lots and lots of computing power.

And like nearly everyone in the field of AGI, he was rushing forward to write code immediately—not holding off and searching for a sufficiently precise theory to permit stable self-improvement.

In short, he was just the sort of fellow that...  Well, many people, when they hear about Friendly AI, say:  "Oh, it doesn't matter what you do, because [someone like this guy] will create AI first."  He's the sort of person about whom journalists ask me, "You say that this isn't the time to be talking about regulation, but don't we need laws to stop people like this from creating AI?"

"I suppose," you say, your voice heavy with irony, "that you're about to tell us, that this person doesn't really have so much of an advantage over you as it might seem.  Because your theory—whenever you actually come up with a theory—is going to be so much better than his.  Or," your voice becoming even more ironic, "that he's too mired in boring mainstream methodology—"

No.  I'm about to tell you that I happened to be seated at the same table as this guy at lunch, and I made some kind of comment about evolutionary psychology, and he turned out to be...

...a creationist.

This was the point at which I really got, on a gut level, that there was no test you needed to pass in order to start your own AGI project.

One of the failure modes I've come to better understand in myself since observing it in others, is what I call, "living in the should-universe".  The universe where everything works the way it common-sensically ought to, as opposed to the actual is-universe we live in.  There's more than one way to live in the should-universe, and outright delusional optimism is only the least subtle.  Treating the should-universe as your point of departure—describing the real universe as the should-universe plus a diff—can also be dangerous.

Up until the moment when yonder AGI researcher explained to me that he didn't believe in evolution because that's not what the Bible said, I'd been living in the should-universe.  In the sense that I was organizing my understanding of other AGI researchers as should-plus-diff.  I saw them, not as themselves, not as their probable causal histories, but as their departures from what I thought they should be.

[...] When Scott Aaronson was 12 years old, he: "set myself the modest goal of writing a BASIC program that would pass the Turing Test by learning from experience and following Asimov's Three Laws of Robotics.  I coded up a really nice tokenizer and user interface, and only got stuck on the subroutine that was supposed to understand the user's question and output an intelligent, Three-Laws-obeying response."  It would be pointless to try and construct a diff between Aaronson12 and what an AGI researcher should be.  You've got to explain Aaronson12 in forward-extrapolation mode:  He thought it would be cool to make an AI and didn't quite understand why the problem was difficult.

It was yonder creationist who let me see AGI researchers for themselves, and not as departures from my ideal.


The really striking fact about the researchers who show up at AGI conferences, is that they're so... I don't know how else to put it...


Not at the intellectual level of the big mainstream names in Artificial Intelligence.  Not at the level of John McCarthy or Peter Norvig (whom I've both met).

More like... around, say, the level of above-average scientists, which I yesterday compared to the level of partners at a non-big-name venture capital firm.  Some of whom might well be Christians, or even creationists if they don't work in evolutionary biology.

The attendees at AGI conferences aren't literally average mortals, or even average scientists.  The average attendee at an AGI conference is visibly one level up from the average attendee at that random mainstream AI conference I talked about yesterday.

[...] But even if you just poke around on Norvig or McCarthy's website, and you've achieved sufficient level yourself to discriminate what you see, you'll get a sense of a formidable mind.  Not in terms of accomplishments—that's not a fair comparison with someone younger or tackling a more difficult problem—but just in terms of the way they talk.  If you then look at the website of a typical AGI-seeker, even one heading up their own project, you won't get an equivalent sense of formidability.

[...] If you forget the should-universe, and think of the selection effect in the is-universe, it's not difficult to understand.  Today, AGI attracts people who fail to comprehend the difficulty of AGI.  Back in the earliest days, a bright mind like John McCarthy would tackle AGI because no one knew the problem was difficult.  In time and with regret, he realized he couldn't do it.  Today, someone on the level of Peter Norvig knows their own competencies, what they can do and what they can't; and they go on to achieve fame and fortune (and Research Directorship of Google) within mainstream AI.

And then...

Then there are the completely hopeless ordinary programmers who wander onto the AGI mailing list wanting to build a really big semantic net.

Or the postdocs moved by some (non-Singularity) dream of themselves presenting the first "human-level" AI to the world, who also dream an AI design, and can't let go of that.

Just normal people with no notion that it's wrong for an AGI researcher to be normal.

Indeed, like most normal people who don't spend their lives making a desperate effort to reach up toward an impossible ideal, they will be offended if you suggest to them that someone in their position needs to be a little less imperfect.

This misled the living daylights out of me when I was young, because I compared myself to other people who declared their intentions to build AGI, and ended up way too impressed with myself; when I should have been comparing myself to Peter Norvig, or reaching up toward E. T. Jaynes.  (For I did not then perceive the sheer, blank, towering wall of Nature.)

I don't mean to bash normal AGI researchers into the ground.  They are not evil.  They are not ill-intentioned.  They are not even dangerous, as individuals.  Only the mob of them is dangerous, that can learn from each other's partial successes and accumulate hacks as a community.

And that's why I'm discussing all this—because it is a fact without which it is not possible to understand the overall strategic situation in which humanity finds itself, the present state of the gameboard.  It is, for example, the reason why I don't panic when yet another AGI project announces they're going to have general intelligence in five years.  It also says that you can't necessarily extrapolate the FAI-theory comprehension of future researchers from present researchers, if a breakthrough occurs that repopulates the field with Norvig-class minds.

Even an average human engineer is at least six levels higher than the blind idiot god, natural selection, that managed to cough up the Artificial Intelligence called humans, by retaining its lucky successes and compounding them.  And the mob, if it retains its lucky successes and shares them, may also cough up an Artificial Intelligence, with around the same degree of precise control.  But it is only the collective that I worry about as dangerous—the individuals don't seem that formidable.

If you yourself speak fluent Bayesian, and you distinguish a person-concerned-with-AGI as speaking fluent Bayesian, then you should consider that person as excepted from this whole discussion.

Of course, among people who declare that they want to solve the AGI problem, the supermajority don't speak fluent Bayesian.

Why would they?  Most people don't.

I think this, plus Eliezer's general 'fuck it, I'm gonna call it like I see it rather than be reflexively respectful to authority' attitude, explains most of Ben's 'holy shit, your views were so weird!!' thing.

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-29T00:25:09.650Z · EA · GW

but he doesn’t appear to have any such record

I want to register a gripe: when Eliezer says that he, Demis Hassabis, and Dario Amodei have a good "track record" because of their qualitative prediction successes, Jotto objects that the phrase "track record" should be reserved for things like Metaculus forecasts.

But when Ben Garfinkel says that Eliezer has a bad "track record" because he made various qualitative predictions Ben disagrees with, Jotto sets aside his terminological scruples and slams the retweet button.

I already thought this narrowing of the term "track record" was weird. If you're saying that we shouldn't count Linus Pauling's achievements in chemistry, or his bad arguments for vitamin C megadosing, as part of Pauling's "track record", because they aren't full probability distributions over concrete future events, then I worry a lot that this new word usage will cause confusion and lend itself to misuse.

As long as it's used even-handedly, though, it's ultimately just a word. On my model, the main consequence of this is just that "track records" matter a lot less, because they become a much smaller slice of the evidence we have about a lot of people's epistemics, expertise, etc. (Jotto apparently disagrees, but this is orthogonal to the thing his post focuses on, which is 'how dare you use the phrase "track record"'.)

But if you're going to complain about "track record" talk when the track record is alleged to be good but not when it's alleged to be bad, then I have a genuine gripe with this terminology proposal. It already sounded a heck of a lot like an isolated demand for rigor to me, but if you're going to redefine "track record" to refer to a narrow slice of the evidence, you at least need to do this consistently, and not crow some variant of 'Aha! His track record is terrible after all!' as soon as you find equally qualitative evidence that you like.

This was already a thing I worried would happen if we adopted this terminological convention, and it happened immediately.

</end of gripe>

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-28T23:03:32.893Z · EA · GW

Of course it is meaningful that Eliezer Yudkowsky has made a bunch of terrible predictions in the past that closely echo predictions he continues to make in slightly different form today.

I assume you're mainly talking about young-Eliezer worrying about near-term risk from molecular nanotechnology, and current-Eliezer worrying about near-term risk from AGI?

I think age-17 Eliezer was correct to think widespread access to nanotech would be extremely dangerous. See my comment. If you or Ben disagree, why do you disagree?

Age-20 Eliezer was obviously wrong about the timing for nanotech, and this is obviously Bayesian evidence for 'Eliezer may have overly-aggressive tech timelines in general'.

I don't think this is generally true. E.g., if you took a survey of EAs worried about AI risk in 2010 or in 2014, I suspect Eliezer would have had longer AI timelines than others at the time. (E.g., he expected it to take longer to solve Go than Carl Shulman did.) When I joined MIRI, the standard way we summarized MIRI's view was roughly 'We think AI risk is high, but not because we think AGI is imminent; rather, our worry is that alignment is likely to take a long time, and that civilization may need to lay groundwork decades in advance in order to have a realistic chance of building aligned AGI.'

But nanotech is a totally fair data point regardless.

Of course it is relevant that he has neither owned up to those earlier terrible predictions nor explained how he has learned from those mistakes.

Eliezer wrote a 20,000-word essay series on his update, and the mistakes he thought he was making. Essay titles include "My Childhood Death Spiral", "The Sheer Folly of Callow Youth", "Fighting a Rearguard Action Against the Truth", and "The Magnitude of His Own Folly".

He also talks a lot about how he's updated and revised his heuristics and world-models in other parts of the Sequences. (E.g., he writes that he underestimated elite competence when he was younger.)

What specific cognitive error do you want him to write about, that he hasn't already written on?

This is sensible advice in any complex domain, and saying that we should "evaluate every argument in isolation on its merits" is a type of special pleading or sophistry.

I don't think the argument I'm making (or most others are making) is 'don't update on people's past mistakes' or 'never do deference'. Rather, a lot of the people discussing this matter within EA (Wei Dai, Gwern Branwen, Richard Ngo, Rohin Shah, Carl Shulman, Nate Soares, Ajeya Cotra, etc.) are the world's leading experts in this area, and a lot of the world's frontier progress on this topic is happening on Internet fora like the EA Forum and LessWrong. It makes sense for domain specialists to put much more focus into evaluating arguments on the merits; object-level conversations like these are how the intellectual advances occur that can then be reflected in aggregators like Metaculus.

Metaculus and prediction markets will be less accurate if frontier researchers replace object-level discussion with debates about who to defer to, in the same way that stock markets would be less efficient if everyone overestimated the market's efficiency and put minimal effort into beating the market.

Insofar as we're trying to grow the field, it also makes sense to encourage more EAs to try to think about these topics and build their own inside-view models; and this has the added benefit of reducing the risk of deference cascades.

(I also think there are other reasons it would be healthy for EA to spend a lot more time on inside-view building on topics like AI, normative ethics, and global poverty, as I briefly said here. But it's possible to practice model-building and then decide at the end of the day, nonetheless, that you don't put much weight on the domain-specific inside views you've built.)

extreme claims

When people use words like "extreme" here, I often get the sense that they aren't crisply separating "extreme" in the sense of "weird-sounding" from "extreme" in the sense of "low prior probability". I think Eliezer's views are weird-sounding, not unlikely on priors.

E.g., why should we expect generally intelligent machines to be low-impact if built, or to never be built?

The idea that a post-AGI world looks mostly the same as a pre-AGI world might sound more normal and unsurprising to an early-21st-century well-off Anglophone intellectual, but I think this is just an error. It's a clear case of the availability heuristic misfiring, not a prior anyone should endorse upon reflection.

I view the Most Important Century series as an attempt to push back against many versions of this conflation.

Epistemically, I view Paul's model as much more "extreme" than Eliezer's because I think it's much more conjunctive. I obviously share the view that soft takeoff sounds more normal in some respects, but I don't think this should inform our prior much. I'd guess we should start with a prior that assigns lots of weight to soft takeoff as well as to hard takeoff, and then mostly arrive at a conclusion based on the specific arguments for each view.
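A conjunctive view needs every one of its component claims to hold, which is why it tends to earn a lower prior. Below is a minimal Python sketch with made-up numbers. The 0.8 figure and the five-claim count are illustrative assumptions, not estimates anyone in this thread gave. It computes the joint probability under an independence assumption, along with the upper bound that holds without one.

```python
from math import prod


def conjunction_probability(probabilities):
    """P(all claims hold), assuming the claims are independent of each other."""
    return prod(probabilities)


def conjunction_upper_bound(probabilities):
    """Without independence, P(all claims hold) can't exceed the least likely claim."""
    return min(probabilities)


# Hypothetical: a view resting on five sub-claims, each 80% likely on its own.
sub_claims = [0.8] * 5
print(round(conjunction_probability(sub_claims), 4))  # 0.3277
print(conjunction_upper_bound(sub_claims))            # 0.8
```

Dropping the independence assumption changes the exact number but not the basic point: adding required sub-claims can only keep the joint probability the same or lower it. The sketch shows the direction of the effect, not how conjunctive either takeoff model actually is.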

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-28T20:42:55.206Z · EA · GW

Commenting on a few minor points from Scott's post, since I meant to write a full reply at some point but haven't had the time:

But also, there are about 10^15 synapses in the brain, each one spikes about once per second, and a synaptic spike probably does about one FLOP of computation. [...] So a human-level AI would also need to do 10^15 floating point operations per second? Unclear.

I'd say 'clearly not, for some possible AI designs'; but maybe it will be true for the first AIs we actually build, shrug.
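The quoted back-of-envelope estimate is a product of three round numbers. Here is a minimal Python sketch that takes Scott's figures as given assumptions rather than measurements. The constant and function names are hypothetical.

```python
# Round figures from the quoted estimate; all three are rough assumptions.
SYNAPSES = 10**15         # synapses in a human brain
SPIKES_PER_SECOND = 1     # average spike rate per synapse
FLOP_PER_SPIKE = 1        # computation attributed to one synaptic spike


def brain_flops(synapses, spikes_per_second, flop_per_spike):
    """Naive brain-compute estimate in FLOP/s: the product of three factors."""
    return synapses * spikes_per_second * flop_per_spike


estimate = brain_flops(SYNAPSES, SPIKES_PER_SECOND, FLOP_PER_SPIKE)
print(f"{estimate:.0e} FLOP/s")  # 1e+15 FLOP/s
```

The product estimates what a brain spends, not what a designed system must spend. Nothing in the arithmetic constrains an AI to run the brain's algorithm, which is why the answer can be "clearly not" for some designs.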

Or you might do what OpenPhil did and just look at a bunch of examples of evolved vs. designed systems and see which are generally better:

Why aren't there examples like 'amount of cargo a bird can carry compared to an airplane', or 'number of digits a human can multiply together in ten seconds compared to a computer'?

Seems like you'll get a skewed number if your brainstorming process steers away from examples like these altogether.

'AI physicist' is less like an artificial heart (trying to exactly replicate the structure of a biological organ functioning within a specific body), more like a calculator (trying to do a certain kind of cognitive work, without any constraint at all to do it in a human-like way).

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-28T20:39:17.932Z · EA · GW

Eliezer's post was less a takedown of the report, and more a takedown of the idea that the report provides a strong basis for expecting AGI in ~2050, or for discriminating scenarios like 'AGI in 2030', 'AGI in 2050', and 'AGI in 2070'.

The report itself was quite hedged, and Holden posted a follow-up clarification emphasizing that “biological anchors” is about bounding, not pinpointing, AI timelines. So it's not clear to me that Eliezer and Ajeya/Holden/etc. even disagree about the core question "do biological anchors provide a strong case for putting a median AGI year in ~2050?", though maybe they disagree on the secondary question of how useful the "bounds" are.

Copying over my high-level view, which I recently wrote on Twitter:

I agree with the basic Eliezer argument in Biology-Inspired AGI Timelines that the bio-anchors stuff isn't important or useful because AGI is a software problem, and we neither know which specific software insights are needed, nor how long it will take to get to those software insights, nor the relationship between those insights and hardware requirements.

Focusing on things like bio-anchors and hardware trends is streetlight-fallacy reasoning: it's taking the 2% of the territory we do know about and heavily heavily focusing on that 2%, while shrugging our shoulders at the other 98%.

Like, bio-anchors reasoning might help tell you whether to expect AGI this century versus expecting it in a thousand years, but it won't help you discriminate 2030 from 2050 from 2070 at all.

Insofar as we need to think about timelines at all, it's true that we need some sort of prior, at least a very vague one.

The problem with the heuristic 'look under the streetlight and anchor your prior to whatever you found under the streetlight, however marginal' is that the info under the streetlight isn't a random sampling from the space of relevant unknown facts about AGI; it's a very specific and unusual kind of information.

IMO you'd be better off thinking first about that huge space of unknowns and anchoring to far fuzzier and more uncertain guesses about the whole space, rather than fixating on a very specific much-more-minor fact that's easier to gather data about.

E.g., consider five very different a priori hypotheses about 'what insights might be needed for AGI', another five very different hypotheses about 'how might different sorts of software progress relate to hardware requirements', etc.

Think about different world-histories that might occur, and how surprised you'd be by those world-histories.

Think about worlds where things go differently than you're expecting in 2060, and about what those worlds would genuinely retrodict about the present / past.

E.g., I think scenario analysis makes it more obvious that in worlds where AGI is 30 years away, current trends will totally break at some point on that path, radically new techniques will be developed, etc.

Think about how different the field of AI was in 1992 compared to today, or in 1962 compared to 1992.

When you're spending most of your time looking under the streetlight — rather than grappling with how little is known, trying to painstakingly refine your instincts and intuitions about the harder-to-reason-about aspects of the problem, etc. — I think it becomes overly tempting to treat current trendlines as laws of nature that will be true forever (or that at least have a strong default of being true forever), rather than as 'patterns that arose a few years ago and will plausibly continue for a few years more, before being replaced by new patterns and growth curves'.


Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-28T19:33:52.230Z · EA · GW

However, I feel I have to evaluate each of his arguments on its own merit rather than deferring to what I see as his appeal to authority (where he’s the authority)

Eliezer isn't saying "believe me because I'm a trustworthy authority"; just the opposite. Eliezer is explicitly claiming that we're all dead if we base our beliefs on this topic on deference, as opposed to evaluating arguments on their merits, figuring out the domain for ourselves, generating our own arguments for and against conclusions, refining our personal inside views of AGI alignment, etc.

(At least, his claim is that we need vastly, vastly more people doing that. Not every EA needs to do that, but currently we're far below water on this dimension, on Eliezer's model and on mine.)

Comment by RobBensinger on "Two-factor" voting ("two dimensional": karma, agreement) for EA forum? · 2022-06-25T18:33:17.121Z · EA · GW

I agree it's not a panacea, but I could imagine it helping mitigate bias/politicization in a few ways:

  • It prompts people to think about 'liking' and 'agreeing' as two separate questions at all. I don't expect this to totally de-bias either 'liking' or 'agreeing', but I do expect some progress if people are prompted like this.
  • Goodwill and trust are generated when people are upvoted in spite of having an unpopular-on-the-forum view. This can create virtuous cycles, where those people reciprocate and in general there are fewer comment sections that turn into 'one side mass-downvotes the other side, the other side retaliates, etc.'.

Example: Improving EA Forum discourse by 8% would obviously be worth it, even if this is via a "superficial improvement" that doesn't fix the whole problem.
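The first bullet's point is that "liking" and "agreeing" become two independent signals. Below is a minimal Python sketch of a per-comment tally that keeps them separate. The class and method names are hypothetical, and the sketch makes no claim about how the EA Forum actually implements two-factor voting.

```python
from dataclasses import dataclass


@dataclass
class TwoAxisTally:
    """Hypothetical per-comment tally keeping 'liking' and 'agreeing' separate."""
    karma: int = 0
    agreement: int = 0

    def vote(self, karma_delta: int, agreement_delta: int) -> None:
        # Each axis accepts -1, 0, or +1, independently of the other axis.
        for delta in (karma_delta, agreement_delta):
            if delta not in (-1, 0, 1):
                raise ValueError("each axis takes -1, 0, or +1")
        self.karma += karma_delta
        self.agreement += agreement_delta


tally = TwoAxisTally()
tally.vote(karma_delta=1, agreement_delta=-1)  # "good comment, but I disagree"
tally.vote(karma_delta=1, agreement_delta=-1)
print(tally)  # TwoAxisTally(karma=2, agreement=-2)
```

Because the two counters never feed into each other, a well-argued minority view can end up with high karma and negative agreement at the same time.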

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-24T03:06:45.151Z · EA · GW

Cool! I figured your reasoning was probably something along those lines, but I wanted to clarify that the survey is anonymous and hear your reasoning. I personally don't know who wrote the response you're talking about, and I'm very uncertain how many researchers at MIRI have 90+% p(doom), since only five MIRI researchers answered the survey (and marked that they're from MIRI).

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-23T17:58:26.269Z · EA · GW

Why do you think it suggests that? There are two MIRI responses in that range, but responses are anonymous, and most MIRI staff didn't answer the survey.

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-23T05:30:46.235Z · EA · GW

Telling people you're the responsible adult, or the only one who notices things, still means telling them you're smarter than them and they should just defer to you.

Those are four very different claims. In general, I think it's bad to collapse all (real or claimed) differences in ability into a single status hierarchy, for the reasons stated in Inadequate Equilibria.

Eliezer is claiming that other people are not taking the problem sufficiently seriously, claiming ownership of it, trying to form their own detailed models of the full problem, and applying enough rigor and clarity to make real progress on the problem.

He is specifically not saying "just defer to me", and in fact is saying that he and everyone else are going to die if people rely on deference here. A core claim in AGI Ruin is that we need more people with "not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you".

Deferring to Eliezer means that Eliezer is the bottleneck on humanity solving the alignment problem; which means we die. The thing Eliezer claims we need is a larger set of people who arrive at true, deep, novel insights about the problem on their own (without Eliezer even mentioning the insights, much less spending a ton of time trying to persuade anyone of them) and who write them up.

It's true that Eliezer endorses his current stated beliefs; this goes without saying, or he obviously wouldn't have written them down. It doesn't mean that he thinks humanity has any path to survival via deferring to him, or that he thinks he has figured out enough of the core problems (or could ever conceivably do so, on his own) to give humanity a significant chance of surviving. Quoting AGI Ruin:

It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction.  The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it[.]

The end of the "death with dignity" post is also alluding to Eliezer's view that it's pretty useless to figure out what's true merely via deferring to Eliezer.

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-23T04:51:54.692Z · EA · GW

Since Eliezer thinks something like 99.99% chance of doom from AI

I could be wrong, but I'd guess Eliezer's all-things-considered p(doom) is less extreme than that.

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-23T04:37:26.786Z · EA · GW

The post is serious. Details: 

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-23T03:41:55.696Z · EA · GW

I noted some places I agree with your comment here, Ben. (Along with my overall take on the OP.)

Some additional thoughts:

Notably, since that post didn’t really have substantial arguments in it (although the later one did), I think the fact it had an impact is seemingly a testament to the power of deference

The “death with dignity” post came in the wake of Eliezer writing hundreds of thousands of words about why he thinks alignment is hard in the Late 2021 MIRI Conversations (in addition to the many specific views and arguments about alignment difficulty he’s written up in the preceding 15+ years). So it seems wrong to say that everyone was taking it seriously based on deference alone.

The post also has a lot of content beyond “p(doom) is high”. Indeed, I think the post’s focus (and value-add) is mostly in its discussion of rationalization, premature/excessive conditionalizing, and ethical injunctions, not in the bare assertion that p(doom) is high. Eliezer was already saying pretty similar stuff about p(doom) back in September.

I’d make it clearer that my main claim is: it would have been unreasonable to assign a very high credence to fast take-offs back in (e.g.) the early- or mid-2000s, since the arguments for fast take-offs had significant gaps. For example, there were a lot of possible countervailing arguments for slow take-offs that pro-fast-take-off authors simply hadn’t addressed yet, as evidenced, partly, by the later publication of slow-take-off arguments leading a number of people to become significantly more sympathetic to slow take-offs.

I disagree; I think that, e.g., noting how powerful and widely applicable general intelligence has historically been, and noting a bunch of standard examples of how human cognition is a total shitshow, is sufficient to have a very high probability on hard takeoff.

I think the people who updated a bunch toward hard takeoff based on the recent debate were making a mistake, and should have already had a similarly high p(hard takeoff) going back to the Foom debate, if not earlier.

Insofar as others disagree, I obviously think it’s a good thing for people to publish arguments like “but ML might be very competitive”, and for people to publicly respond to them. But I don’t think “but ML might be very competitive” and related arguments ought to look compelling at a glance (given the original simple arguments for hard takeoff), so I don’t think someone should need to consider the newer discussion in order to arrive at a confident hard-takeoff view.

(Also, insofar as Paul recently argued for X and Eliezer responded with a valid counter-argument for Y, it doesn’t follow that Eliezer had never considered anything like X or Y in initially reaching his confidence. Eliezer’s stated view is that the new Paul arguments seem obviously invalid and didn’t update him at all when he read them. Your criticism would make more sense here if Eliezer had said “Ah, that’s an important objection I hadn’t considered; but now that I’m thinking about it, I can generate totally new arguments that deal with the objections, and these new counter-arguments seem correct to me.”)

The main questions that matter are: What has the intellectual gotten wrong and right? Beyond whether they were wrong or right, about a given case, does it also seem like their predictions were justified?

At least as important, IMO, is the visible quality of their reasoning and arguments, and their retrodictions.

AGI, moral philosophy, etc. are not topics where we can observe extremely similar causal processes today and test all the key claims and all the key reasoning heuristics with simple experiments. Tossing out ‘argument evaluation’ and ‘how well does this fit what I already know?’ altogether would mean tossing out the majority of our evidence about how much weight to put on people’s views.

Ultimately, I don’t buy the comparison. I think it’s really out-of-distribution for someone in their late teens and early twenties to pro-actively form the view that an emerging technology is likely to kill everyone within a decade, found an organization and devote years of their professional life to address the risk, and talk about how they’re the only person alive who can stop it.

I take the opposite view on this comparison. I agree that this is really unusual, but I think the comparison is unfavorable to the high school students, rather than unfavorable to Eliezer. Having unusual views and then not acting on them in any way is way worse than actually acting on your predictions.

I agree that Eliezer acting on his beliefs to this degree suggests he was confident; but in a side-by-side comparison of a high schooler who’s expressed equal confidence in some other unusual view, but takes no unusual actions as a result, the high schooler is the one I update negatively about.

(This also connects up to my view that EAs generally are way too timid/passive in their EA activity, don’t start enough new things, and (when they do start new things) start too many things based on ‘what EA leadership tells them’ rather than based on their own models of the world. The problem crippling EA right now is not that we're generating and running with too many wildly different, weird, controversial moonshot ideas. The problem is that we're mostly just passively sitting around, over-investing in relatively low-impact meta-level interventions, and/or hoping that the most mainstream already-established ideas will somehow suffice.)

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-23T03:26:07.590Z · EA · GW

I work at MIRI, but as usual, this comment is me speaking for myself, and I haven’t heard from Eliezer or anyone else on whether they'd agree with the following.

My general thoughts:

  • The primary things I like about this post are that (1) it focuses on specific points of disagreement, encouraging us to then hash out a bunch of object-level questions; and (2) it might help wake some people from their dream if they hero-worship Eliezer, or if they generally think that leaders in this space can do no wrong.

    • By "hero-worshipping" I mean a cognitive algorithm, not a set of empirical conclusions. I'm generally opposed to faux egalitarianism and the Modest-Epistemology reasoning discussed in Inadequate Equilibria: if your generalized anti-hero-worship defenses force the conclusion that there just aren't big gaps in skills or knowledge (or that skills and knowledge always correspond to mainstream prestige and authority), then your defenses are ruling out reality a priori. In saying "people need to hero-worship Eliezer less", I'm opposing a certain kind of reasoning process and mindset, not a specific factual belief like "Eliezer is the clearest thinker about AI risk".

      In a sense, I want to promote the idea that the latter is a boring claim, to be evaluated like any other claim about the world; flinching away from it (e.g., because Eliezer is weird and says sci-fi-sounding stuff) and flinching toward it (e.g., because you have a bunch of your identity invested in the idea that the Sequences are awesome and rationalists are great) are both errors of process.
  • The main thing I dislike about this post is that it introduces a bunch of not-obviously-false Eliezer-claims — claims that EAs either widely disagree about, or haven’t discussed — and then dives straight into ‘therefore Eliezer has a bad track record'.

    E.g., I disagree that molecular nanotech isn't a big deal (if that's a claim you're making?), that Robin better predicted deep learning than Eliezer did, and that your counter-arguments against Eliezer and Bostrom are generally strong. Certainly I don't think these points have been well-established enough that it makes sense to cite them in the mode 'look at these self-evident ways Yudkowsky got stuff wrong; let us proceed straight to psychoanalysis, without dwelling on the case for why I think he's wrong about this stuff'. At this stage of the debate on those topics, it would be more appropriate to talk in terms of cruxes like 'I think the history of tech shows it's ~always continuous in technological change and impact', so it's clear why you disagree with Eliezer in the first place.
  • I generally think that EA’s core bottlenecks right now are related to ‘willingness to be candid and weird enough to make intellectual progress (especially on AI alignment), and to quickly converge on our models of the world’.

    My own models suggest to me that EA’s path to impact is almost entirely as a research community and a community that helps produce other research communities, rather than via ‘changing the culture of the world at large’ or going into politics or what-have-you. In that respect, rigor and skepticism are good, but singling out Eliezer because he’s unusually weird and candid is bad, because it discourages others from expressing weird/novel/minority views and from blurting out their true thought processes. (I recognize that this isn’t the only reason you’re singling Eliezer out, but it’s obviously a contributing factor.)
  • I am a big fan of Ben’s follow-up comment, especially the part where he outlines the thought process that led to him generating the post’s contents. I think this is an absolutely wonderful thing to include in a variety of posts, or to add in the comment sections for a lot of posts.

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · 

Some specific thoughts on Ben's follow-up comment:

1. I agree with Ben on this: “If a lot of people in the community believe AI is probably going to kill everyone soon, then (if they’re wrong) this can have really important negative effects”.

I think they’re not wrong, and I think the benefits of discussing this openly strongly outweigh the costs. But the negative effects are no less real for that.

(Separately, I think the “death with dignity” post was a suboptimal way to introduce various people to the view that p(doom) is very high. I’m much more confident that we should discuss this at all, than that Eliezer or I or others have been discussing this optimally.)

2. “Directly and indirectly, deference to Yudkowsky has a significant influence on a lot of people’s views.”


Roughly speaking, my own view is:

  • EAs currently do a very high amount of deferring to others (both within EA and outside of EA) on topics like AI, global development, moral philosophy, economics, cause prioritization, organizational norms, personal career development, etc.
  • On the whole, EAs currently do a low amount of model-building and developing their own inside views.
  • EAs should switch to doing a medium amount of deference on topics like the ones I listed, and a very high amount of personal model-building.
    • Note that model-building can be useful even if you think all your conclusions will be strictly worse than the models of some other person you've identified. I'm pretty radical on this topic, and think that nearly all EAs should spend a nontrivial fraction of their time developing their own inside-view models of EA-relevant stuff, in spite of the obvious reasons (like gains from specialization) that this would normally not make sense.
      • Happy to say more about my views here, and I'll probably write a post explaining why I think this.
    • I think the Alignment Research Field Guide, in spite of nominally being about “alignment”, is the best current intro resource for “how should I go about developing my own models on EA stuff?” A lot of the core advice is important and generalizes extremely well, IMO.
  • Insofar as EAs should do deference at all, Eliezer is in the top tier of people it makes sense to defer to.
  • But I’d guess the current amount of Eliezer-deference is way too high, because the current amount of deference overall is way too high. Eliezer should get a relatively high fraction of the deference pie IMO, but the overall pie should shrink a lot.

3. I also agree with Ben on “The track records of influential intellectuals (including Yudkowsky) should be publicly discussed.”

I don’t like the execution of the OP, but I strongly disagree with the people in the comments who have said “let us never publicly talk about individuals’ epistemic track records at all”—both because I think ‘how good is EY’s reasoning’ is a genuine crux for lots of people, and because I think this is a very common topic people think about, both in more pro-Eliezer and in more anti-Eliezer camps.

Discussing cruxes is obviously good, but even if this weren’t a crux for anyone, I’m strongly in favor of EAs doing a lot more “sharing their actual thoughts out loud”, including the more awkward and potentially inflammatory ones. (I’m happy to say more about why I think this.)

I do think it’s worth talking about what the best way is to discuss individuals' epistemic track records, without making EA feel hostile/unpleasant/scary. I think EAs are currently way too timid (on average) about sharing their thoughts, so I worry about any big norm shifts that might make that problem even worse.

But Eliezer’s views are influential enough (and cover a topic, AGI, that is complicated and difficult enough to reason about) that this just seems like an important topic to me (similar to ‘how much should we defer to Paul?’, etc.). I’d rather see crappy discussion of this in the community than zero discussion whatsoever.

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · 

Some specific thoughts on claims in the OP:

such that all we can hope to do is “die with dignity.”

This is in large part Eliezer's fault for picking such a bad post title, but I should still note that this is a very misleading summary. "Dying with dignity" often refers to giving up on taking any actions to keep yourself alive.

Eliezer's version of "dying with dignity" is exactly the opposite: he's advocating for doing whatever it takes to maximize the probability that humanity survives.

It's true that he thinks we'll probably fail (and I agree), and he thinks we should emotionally reconcile ourselves with that fact (because he thinks this emotional reconciliation will itself increase our probability of surviving!!), but he doesn't advocate giving up.

Quoting the post:

"Q1:  Does 'dying with dignity' in this context mean accepting the certainty of your death, and not childishly regretting that or trying to fight a hopeless battle?

"Don't be ridiculous.  How would that increase the log odds of Earth's survival?"

At least up until 1999, admittedly when he was still only about 20 years old, Yudkowsky argued that transformative nanotechnology would probably emerge suddenly and soon (“no later than 2010”) and result in human extinction by default.

I think the "no later than 2010" prediction is from when Eliezer was 20, but the bulk of the linked essay was written when he was 17. The quotation here is: "As of '95, Drexler was giving the ballpark figure of 2015.  I suspect the timetable has been accelerated a bit since then.  My own guess would be no later than 2010."

The argument for worrying about extinction via molecular nanotech to some non-small degree seems pretty straightforward and correct: molecular nanotech lets you build arbitrary structures, including dangerous ones, and some humans would want to destroy the world given the power to do so.

Eliezer was overconfident about nanotech timelines (though roughly to the same degree as Drexler, the world's main authority on nanotech).

Eliezer may have also been overconfident about nanotech's riskiness, but the specific thing he said when he was 17 is that he considered it important for humanity to achieve AGI "before nanotechnology, given the virtual certainty of deliberate misuse - misuse of a purely material (and thus, amoral) ultratechnology, one powerful enough to destroy the planet".

It's not clear to me whether this is saying that human-extinction-scale misuse from nanotech is 'virtually certain', versus the more moderate claim that some misuse is 'virtually certain' if nanotech sees wide usage (and any misuse is pretty terrifying in EV terms). The latter seems reasonable to me, given how powerful molecular nanotechnology would be.

Eliezer denies that he has a general tendency toward alarmism:

(As a side note, I think that if Eliezer had been around in the 1930s, and you described to him what actually happened with nukes over the next 80 years, he would have called that "insanely optimistic".)

Mmmmmmaybe.  Do note that I tend to be more optimistic than the average human about, say, global warming, or everything in transhumanism outside of AGI. 

Nukes have going for them that, in fact, nobody has an incentive to start a global thermonuclear war.  Eliezer is not in fact pessimistic about everything and views his AGI pessimism as generalizing to very few other things, which are not, in fact, as bad as AGI. 

[...] So yeah, I picture 1930s-Eliezer pointing to technological trends and being like "by default, 30 years after the first nukes are built, you'll be able to build one in your back yard. And governments aren't competent enough to stop that happening."

And I don't think I could have come up with a compelling counterargument back then. 

So, I mean, in fact, I don't prophesize doom from very many trends at all!  It's literally just AGI that is anywhere near that unmanageable!  Many people in EA are more worried about biotech than I am, for example.

It seems fair to note that nanotech is a second example of Eliezer raising alarm bells. But this remains a pretty small number of data points, and in neither of those cases does it actually look unreasonable to worry a fair bit—those are genuinely some of the main ways we could destroy ourselves.

I think 'Eliezer predicted nanotech way too early' is a better data point here, as evidence for 'maybe Eliezer tends to have overly aggressive tech forecasts'.

If Eliezer was deferring to Drexler to some extent, that makes the data a bit less relevant, but 'I was deferring to someone else who was also wrong' is not in fact a general-purpose excuse for getting the wrong answer.

In 2001, and possibly later, Yudkowsky apparently believed that his small team would be able to develop a “final stage AI” that would “reach transhumanity sometime between 2005 and 2020, probably around 2008 or 2010.”

In the first half of the 2000s, he produced a fair amount of technical and conceptual work related to this goal. It hasn't ultimately had much clear usefulness for AI development, and, partly on this basis, my impression is that it has not held up well - but that he was very confident in the value of this work at the time.

That view seems very dumb to me — specifically the belief that SingInst's very first unvetted idea would pan out and result in them building AGI, more so than the timelines per se.

I don't fault 21-year-old Eliezer for trying (except insofar as he was totally wrong about the probability of Unfriendly AI at the time!), because the best way to learn that a weird new path is unviable is often to just take a stab at it. But insofar as 2001-Eliezer thought his very first idea was very likely to work, this seems like a totally fair criticism of the quality of his reasoning at the time.

Looking at the source text, I notice that the actual text is much more hedged than Ben's summary (though it still sounds foreseeably overconfident to me, to the extent I can glean likely implicit probabilities from tone):

[...] The Singularity Institute is fully aware that creating true intelligence will not be easy.  In addition to the enormous power deficit between modern computers and the human brain, there is an even more severe software deficit.  The software of the human brain is the result of millions of years of evolution and contains perhaps tens of thousands of complex functional adaptations.  The human brain itself is not a homogenous lump but a highly modular supersystem; the cerebral cortex is divided into two hemispheres, each containing 52 areas, each area subdivided into a half-dozen distinguishable maps.  Cortical neurons group into minicolumns of perhaps a hundred neurons and macrocolumns of a few hundred minicolumns, with perhaps 1,000 macrocolumns to a cortical map.  Of the 750 megabytes of human DNA, the vast majority is believed to be junk and 98% is identical to chimpanzee DNA, with perhaps 1% being concerned with intelligence - leaving 7.5 megabytes to specify, not the actual wiring of the brain, but the neuroanatomy of areas and maps and pathways, and the initial tiling patterns and learning algorithms for neurons and minicolumns and macrocolumns.

The Singularity Institute seriously intends to build a true general intelligence, possessed of all the key subsystems of human intelligence, plus design features unique to AI.  We do not hold that all the complex features of the human mind are "emergent", or that intelligence is the result of some simple architectural principle, or that general intelligence will appear if we simply add enough data or computing power.  We are willing to do the work required to duplicate the massive complexity of human intelligence; to explore the functionality and behavior of each system and subsystem until we have a complete blueprint for a mind.  For more about our Artificial Intelligence plans, see the document General Intelligence and Seed AI.

Our specific cognitive architecture and development plan forms our basis for answering questions such as "Will transhumans be friendly to humanity?" and "When will the Singularity occur?"  At the Singularity Institute, we believe that the answer to the first question is "Yes" with respect to our proposed AI design - if we didn't believe that, the Singularity Institute would not exist.  Our best guess for the timescale is that our final-stage AI will reach transhumanity sometime between 2005 and 2020, probably around 2008 or 2010.  As always with basic research, this is only a guess, and heavily contingent on funding levels. [...]


A later piece of work which I also haven’t properly read is “Levels of Organization in General Intelligence.”

Note that this paper was written much earlier than its publication date. Description from "Book chapter I wrote in 2002 for an edited volume, Artificial General Intelligence, which is now supposed to come out in late 2006. I no longer consider LOGI’s theory useful for building de novo AI. However, it still stands as a decent hypothesis about the evolutionary psychology of human general intelligence."

Although Hanson very clearly wasn’t envisioning something like deep learning either, his side of the argument seems to fit better with what AI progress has looked like over the past decade.

I agree that Eliezer loses Bayes points (e.g., relative to Shane Legg and Dario Amodei) for not predicting the enormous success of deep learning. See also Nate's recent post about this.

I disagree that Robin Hanson scored Bayes points off of Eliezer, on net, from the deep learning revolution, or that Hanson's side of the Foom debate looks good (compared to Eliezer's) with the benefit of hindsight. I side with Gwern here; I think Robin's predictions and arguments on this topic have been terrible, as a rule.

I think that Yudkowsky's prediction - that a small amount of code, run using only a small amount of computing power, was likely to abruptly jump economic output upward by more than a dozen orders-of-magnitude - was extreme enough to require very strong justifications.

I think Eliezer assigned too high a probability to 'it's easy to find relatively clean, understandable approaches to AGI', and too low a probability to 'it's easy to find relatively messy, brute-forced approaches to AGI'. A consequence of the latter is that he (IMO) underestimated how compute-intensive AGI was likely to be, and overestimated how important recursive self-improvement was likely to be.

I otherwise broadly agree with his picture. E.g.:

  • I expect AGI to represent a large, sharp capabilities jump. (I think this is unlikely to require a bunch of recursive self-improvement.)
  • I think AGI is mainly bottlenecked on software, rather than hardware. (E.g., I think GPT-3 is impressive, but isn't a baby AGI; rather than AGI just being 'current systems but bigger', I expect at least one more key insight lies on the shortest likely path to AGI.)
  • And I expect AGI to be much more efficient than current systems at utilizing small amounts of data. Though (because it's likely to come from a relatively brute-forced, unalignable approach) I still expect it to be more compute-intensive than 2009-Eliezer was imagining.

However, later analysis has suggested that coherence arguments have either no or very limited implications for how we should expect future AI systems to behave.

This seems completely wrong to me. See Katja Grace's Coherence arguments imply a force for goal-directed behavior.

Comment by RobBensinger on On Deference and Yudkowsky's AI Risk Estimates · 2022-06-23T01:08:35.785Z · EA · GW

So at least from Garfinkel's perspective, Yudkowsky and Soares do count as data points, they're just equal in weight to other relevant data points.

Ben has said this about Eliezer, but not about Nate, AFAIK.

Comment by RobBensinger on Pivotal outcomes and pivotal processes · 2022-06-22T10:24:34.147Z · EA · GW

An example of a possible "pivotal act" I like that isn't "melt all GPUs" is:

Use AGI to build fast-running high-fidelity human whole-brain emulations. Then run thousands of very-fast-thinking copies of your best thinkers. Seems to me this plausibly makes it realistic to keep tabs on the world's AGI progress, and locally intervene before anything dangerous happens, in a more surgical way rather than via mass property destruction of any sort.

Looking for pivotal acts that are less destructive (and, more importantly for humanity's sake, less difficult to align) than "melt all GPUs" seems like a worthy endeavor to me. But I prefer the framing 'let's discuss the larger space of pivotal acts, brainstorm new ideas, and try to find options that are easier to achieve, because that particular toy proposal seems suboptimally dangerous and there just hasn't been very much serious analysis and debate about pathways'. In the course of that search, if it then turns out that the most likely-to-succeed option is a process, then we should obviously go with a process.

But I don't like constraining that search to 'processes only, not acts', because:

  • (a) I'm guessing something more local, discrete, and act-like will be necessary, even if it's less extreme than "melt all GPUs";
  • (b) insofar as I'm uncertain about which paths will be viable and think the problem is already extremely hard and extremely constrained, I don't want to further narrow the space of options that humanity can consider and reason through;
  • (c) I worry that the "processes" framing will encourage more Rube-Goldberg-machine-like proposals, where the many added steps and layers and actors obscure the core world-saving cognition and action, making it harder to spot flaws and compare tradeoffs;
  • and (d) I worry that the extra steps, layers, and actors will encourage "design by committee" and slow-downs that doom otherwise-promising projects.

I suspect we also have different intuitions about pivotal acts because we have different high-level pictures of the world's situation.

I think that humanity as it exists today is very far off from thinking like a serious civilization would about these issues. As a consequence, our current trajectory has a negligible chance of producing good long-run outcomes. Rather than trying to slightly nudge the status quo toward marginally better thinking, we have more hope if we adopt a heuristic like speak candidly and realistically about things, as though we lived on the Earth that does take these issues seriously, and hope that this seriousness and sanity might be infectious.

On my model, we don't have much hope if we continue to half-say-the-truth, and continue to make small steady marginal gains, and continue to talk around the hard parts of the problem; but we do have the potential within us to just drop the act and start fully sharing our models and being real with each other, including being real about the parts where there will be harsh disagreements.

I think that a large part of the reason humanity is currently endangering itself is that everyone is too focused on 'what's in the Overton window?', and is too much trying to finesse each other's models and attitudes, rather than blurting out their actual views and accepting the consequences.

This makes the situation I described in The inordinately slow spread of good AGI conversations in ML much stickier: very little of the high-quality / informed public discussion of AGI is candid and honest, and people notice this, so updating and epistemic convergence is a lot harder; and everyone is dissembling in the same direction, toward 'be more normal', 'treat AGI more like business-as-usual', 'pretend that the future is more like the past'.

All of this would make me less eager to lean into proposals like "yes, let's rush into establishing a norm that large parts of the strategy space are villainous and not to be talked about" even if I agreed that pivotal processes are a better path to long-run good outcomes than pivotal acts. This is inviting even more of the central problem with current discourse, which is that people don't feel comfortable even talking about their actual views.

You may not think that a pivotal act is necessary, but there are many who disagree with you. Of those, I would guess that most aren't currently willing to discuss their thoughts, out of fear that the resultant discussion will toss norms of scholarly discussion out the window. This seems bad to me, and not like the right direction for a civilization to move in if it's trying to emulate 'the kind of civilization that handles AGI successfully'. I would rather a world where humanity's best and brightest were debating this seriously, doing scenario analysis, assigning probabilities and considering specific mainline and fallback plans, etc., over one where we prejudge 'discrete pivotal acts definitely won't be necessary' and decide at the outset to roll over and die if it does turn out that pivotal acts are necessary.

My alternative proposal would be: Let's do scholarship at the problem, discuss it seriously, and not let this topic be ruled by 'what is the optimal social-media soundbite?'.

If the best idea sounds bad in soundbite form, then let's have non-soundbite-length conversations about it. It's an important enough topic, and a complex enough one, that this would IMO be a no-brainer in a world well-equipped to handle developments like AGI.

it's safer to aim for a pivotal outcome to be carried out by a distributed process spanning multiple institutions and states, because the process can happen in a piecemeal fashion that doesn't change the whole world at once

We should distinguish "safer" in the sense of "less likely to cause a bad outcome" from "safer" in the sense of "less likely to be followed by a bad outcome".

E.g., the FDA banning COVID-19 testing in the US in the early days of the pandemic was "safer" in the narrow sense that they legitimately reduced the risk that COVID-19 tests would cause harm. But the absence of testing resulted in much more harm, and was "unsafe" in that sense.

Similarly: I'm mildly skeptical that humanity refusing to attempt any pivotal acts makes us safer from the particular projects that enact this norm. But I'm much more skeptical that humanity refusing to attempt any pivotal acts makes us safer from harm in general. These two versions of "safer" need to be distinguished and argued for separately.

Any proposal that adds red tape, inefficiencies, slow-downs, process failures, etc. will make AGI projects "safer" in the first sense, inasmuch as it cripples the project or slows it down to the point of irrelevance.

As someone who worries that timelines are probably way too short for us to solve enough of the "pre-AGI alignment prerequisites" to have a shot at aligned AGI, I'm a big fan of sane, non-adversarial ideas that slow down the field's AGI progress today.

But from my perspective, the situation is completely reversed when you're talking about slowing down a particular project's progress when they're actually building, aligning, and deploying their AGI.

At some point, a group will figure out how to build AGI. When that happens, I expect an AGI system to destroy the world within just a few years, if no pivotal act or pivotal process is completed first. And I expect safety-conscious projects to be at a major speed disadvantage relative to less safety-conscious projects.

Adding any unnecessary steps to the process—anything that further slows down the most safety-conscious groups—seems like suicide to me, insofar as it either increases the probability that the project fails to produce a pivotal outcome in time, or increases the probability that the project cuts more corners on safety because it knows that it has that much less time.

I obviously don't want the first AGI projects to rush into a half-baked plan and destroy the world. First and foremost, do not destroy the world by your own hands, or commit the fallacy of "something must be done, and this is something!".

But I feel more worried about AGI projects insofar as they don't have a lot of time to carefully align their systems (so I'm extremely reluctant to tack on any extra hurdles that might slow them down and that aren't crucial for alignment), and also more worried insofar as they haven't carefully thought about stuff like this in advance. (Because I think a pivotal act is very likely to be necessary, and I think disaster is a lot more likely if people don't feel like they can talk candidly about it, and doubly so if they're rushing into a plan like this at the last minute rather than having spent decades prior carefully thinking about and discussing it.)

Comment by RobBensinger on RobBensinger's Shortform · 2022-06-19T02:44:30.969Z · EA · GW

Well, it's less effort insofar as these are very low-confidence, unstable ass numbers. I wouldn't want to depend on a plan that assumes there will be no warning shots, or a plan that assumes there will be some.

Comment by RobBensinger on RobBensinger's Shortform · 2022-06-19T01:29:19.495Z · EA · GW

I'm thinking of a 'warning shot' roughly as 'an event where AI is widely perceived to have caused a very large amount of destruction'. Maybe loosely operationalized as 'an event about as sudden as 9/11, and at least one-tenth as shocking, tragic, and deadly as 9/11'.

I don't have stable or reliable probabilities here, and I expect that other MIRI people would give totally different numbers. But my current ass numbers are something like:

  • 12% chance of a warning shot happening at some point.
  • 6% chance of a warning shot happening more than 6 years before AGI destroys and/or saves the world.
  • 10% chance of a warning shot happening 6 years or less before AGI destroys and/or saves the world.
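At first glance these three numbers look inconsistent, since 6% + 10% is more than 12%. They fit together if more than one warning shot can occur, so that the "more than 6 years before" and "6 years or less before" events can both happen. The Python sketch below does the inclusion-exclusion arithmetic under that reading; treating the figures as exact probabilities of two overlapping events is an assumption here, not something the comment states.

```python
# Assumed reading (not stated in the comment):
#   A = "a warning shot more than 6 years before AGI"
#   B = "a warning shot 6 years or less before AGI"
# With multiple warning shots possible, A and B can both occur.
p_any = 0.12    # P(A or B): a warning shot at some point
p_early = 0.06  # P(A)
p_late = 0.10   # P(B)

# Inclusion-exclusion: P(A and B) = P(A) + P(B) - P(A or B)
p_both = p_early + p_late - p_any

# A coherent assignment needs 0 <= P(A and B) <= min(P(A), P(B)).
consistent = 0 <= p_both <= min(p_early, p_late)

print(f"implied P(both) = {p_both:.2f}, consistent = {consistent}")
```

On this reading the numbers imply roughly a 4% chance of getting both an early and a late warning shot.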

My current unconditional odds on humanity surviving are very low (with most of my optimism coming from the fact that the future is just inherently hard to predict). Stating some other ass numbers:

  • Suppose that things go super well for alignment and timelines aren't that short, such that we achieve AGI in 2050 and have a 10% chance of existential success. In that world, if we held as much as possible constant except that AGI comes in 2060 instead of 2050, then I'm guessing that would double our success odds to 20%.
  • If we invented AGI in 2050 and somehow impossibly had fifteen years to work with AGI systems before anyone would be able to destroy the world, and we knew as much, then I'd imagine our success odds maybe rising from 10% to 55%.
  • The default I expect instead is that the first AGI developer will have more than three months, and less than five years, before someone else destroys the world with AGI. (And on the mainline, I expect them to have ~zero chance of aligning their AGI, and I expect everyone else to have ~zero chance as well.)

If a warning shot had a large positive impact on our success probability, I'd expect it to look something like:

'20 or 30 or 40 years before we'd naturally reach AGI, a huge narrow-AI disaster unrelated to AGI risk occurs. This disaster is purely accidental (not terrorism or whatever). Its effect is mainly just to cause it to be in the Overton window that a wider variety of serious technical people can talk about scary AI outcomes at all, and maybe it slows timelines by five years or something. Also, somehow none of this causes discourse to become even dumber; e.g., people don't start dismissing AGI risk because "the real risk is narrow AI systems like the one we just saw", and there isn't a big ML backlash to regulatory/safety efforts, and so on.'

I don't expect anything at all like that to happen, not least because I suspect we may not have 20+ years left before AGI. But that's a scenario where I could (maybe, optimistically) imagine real, modest improvements.

If I imagine that absent any warning shots, AGI is coming in 2050, and there's a 1% chance of things going well in 2050, then:

  • If we add a warning shot in 2023, then I'd predict something like: 85% chance it has no major effect, 12% chance it makes the situation a lot worse, 3% chance it makes the situation a lot better. (I.e., an 80% chance that if it has a major effect, it makes things worse.)
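The parenthetical "80%" comes from conditioning on the warning shot having a major effect at all. A minimal Python check, assuming the three listed percentages are exact and exhaustive (a simplification of what are explicitly rough numbers):

```python
# Rough outcome probabilities for a 2023 warning shot, taken as exact here.
p_no_effect = 0.85
p_worse = 0.12
p_better = 0.03

# Sanity check: the three outcomes are assumed to be exhaustive.
assert abs(p_no_effect + p_worse + p_better - 1.0) < 1e-9

p_major = p_worse + p_better             # P(major effect) = 0.15
p_worse_given_major = p_worse / p_major  # P(worse | major) = 0.12 / 0.15

print(f"P(worse | major effect) = {p_worse_given_major:.0%}")
```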

This still strikes me as worth thinking about some, in part because these probabilities are super unreliable. But mostly I think EAs should set aside the idea of warning shots and think more about things we might be able to cause to happen, and things that have more effects like 'shift the culture of ML specifically' and/or 'transmit lots of bits of information to technical people', rather than 'make the world as a whole panic more'.

I'm much more excited by scenarios like: 'a new podcast comes out that has top-tier-excellent discussion of AI alignment stuff, it becomes super popular among ML researchers, and the culture, norms, and expectations of ML thereby shift such that water-cooler conversations about AGI catastrophe are more serious, substantive, informed, candid, and frequent'.

It's rare for a big positive cultural shift like that to happen; but it does happen sometimes, and it can result in very fast changes to the Overton window. And since it's a podcast containing many hours of content, there's the potential to seed subsequent conversations with a lot of high-quality background thoughts.

To my eye, that seems more like the kind of change that might shift us from a current trajectory of "~definitely going to kill ourselves" to a new trajectory of "viable chance of an existential win".

Whereas warning shots feel more unpredictable to me, and if they're helpful, I expect the helpfulness to at best look like "we were almost on track to win, and then the warning shot nudged us just enough to secure a win".

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-17T06:41:02.135Z · EA · GW

I assume Eliezer means Eric Drexler's book Nanosystems.

Comment by RobBensinger on A central AI alignment problem: capabilities generalization, and the sharp left turn · 2022-06-16T09:29:05.291Z · EA · GW

I'd mainly point to relatively introductory / high-level resources like Alignment research field guide and Risks from learned optimization, if you haven't read them. I'm more confident in the relevance of methodology and problem statements than of existing attempts to make inroads on the problem.

There's a lot of good high-level content on Arbital, but it's not very organized and a decent amount of it is in draft form.

Comment by RobBensinger on RobBensinger's Shortform · 2022-06-16T04:15:27.018Z · EA · GW

A thing I wrote on social media a few months ago, in response to someone asking if an AI warning shot might happen:

[... I]t's a realistic possibility, but I'd guess it won't happen before AI destroys the world, and if it happens I'm guessing the reaction will be stupid/panicky enough to just make the situation worse.

(It's also possible it happens before AI destroys the world, but six weeks before rather than six years before, when it's too late to make any difference.)

A lot of EAs feel confident we'll get a "warning shot" like this, and/or are mostly predicating their AI strategy around "warning-shot-ish things will happen and suddenly everyone will get serious and do much more sane things". Which doesn't sound like, e.g., how the world reacted to COVID or 9/11, though it sounds a bit like how the world (eventually) reacted to nukes and maybe to the recent Ukraine invasion?

Someone then asked why I thought a warning shot might make things worse, and I said:

It might not buy time, or might buy orders of magnitude less time than matters; and/or some combination of:

- the places that are likely to have the strictest regulations are (maybe) the most safety-conscious parts of the world. So you may end up slowing down the safety-conscious researchers much more than the reckless ones.

- more generally, it's surprising and neat that the front-runner (DeepMind) is currently one of the least allergic to thinking about AI risk. I don't think it's anywhere near sufficient, but if we reroll the dice we should by default expect a worse front-runner.

- regulations and/or safety research are misdirected, because people have bad models now and are therefore likely to have bad models when the warning shot happens, and warning shots don't instantly fix bad underlying models.

The problem is complicated, and steering in the right direction requires that people spend time (often years) setting multiple parameters to the right values in a world-model. Warning shots might at best fix a single parameter, 'level of fear', not transmit the whole model. And even if people afterwards start thinking more seriously and thereby end up with better models down the road, their snap reaction to the warning shot may lock in sticky bad regulations, policies, norms, culture, etc., because they don't already have the right models before the warning shot happens.

- people tend to make worse decisions (if it's a complicated issue like this, not just 'run from tiger') when they're panicking and scared and feeling super rushed. As AGI draws visibly close / more people get scared (if either of those things ever happen), I expect more person-hours spent on the problem, but I also expect more rationalization, rushed and motivated reasoning, friendships and alliances breaking under the emotional strain, uncreative and on-rails thinking, unstrategic flailing, race dynamics, etc.

- if regulations or public backlash do happen, these are likely to sour a lot of ML researchers on the whole idea of AI safety and/or sour them on xrisk/EA ideas/people. Politicians or the public suddenly caring or getting involved can easily cause a counter-backlash that makes AI alignment progress even more slowly than it would have by default.

- software is not very regulatable, software we don't understand well enough to define is even less regulatable, whiteboard ideas are less regulatable still, you can probably run an AGI on a not-expensive laptop eventually, etc.

So regulation is mostly relevant as a way to try to slow everything down indiscriminately, rather than as a way to specifically target AGI; and it would be hard to make it have a large effect on that front, even if this would have a net positive effect.

- a warning shot could convince everyone that AI is super powerful and important and we need to invest way more in it.

- (insert random change to the world I haven't thought of, because events like these often have big random hard-to-predict effects)

Any given big random change will tend to be bad on average, because the end-state we want requires setting multiple parameters to pretty specific values and any randomizing effect will be more likely to break a parameter we already have in approximately the right place, than to coincidentally set a parameter to exactly the right value.

There are far more ways to set the world to the wrong state than the right one, so adding entropy will usually make things worse.

We may still need to make some high-variance choices like this, if we think we're just so fucked that we need to reroll the dice and hope to have something good happen by coincidence. But this is very different from expecting the reaction to a warning shot to be a good one. (And even in a best-case scenario we'll need to set almost all of the parameters via steering rather than via rerolling; rerolling can maybe get us one or even two values close-to-correct if we're crazy lucky, but the other eight values will still need to be locked in by optimization, because relying on ten independent one-in-ten coincidences to happen is obviously silly.)
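The arithmetic behind the rerolling point can be made concrete with a toy model. The numbers here are illustrative assumptions, not from the original: a world-state of ten parameters, each with ten possible values, seven of which are already correct, and a "random change" that picks one parameter uniformly and resets it to a uniformly random value. The sketch computes the chance that such a reroll fixes a wrong parameter versus breaks a right one, plus the chance of getting all ten right by pure coincidence.

```python
from fractions import Fraction


def reroll_odds(n_params=10, n_correct=7, n_values=10):
    """Hypothetical toy model: pick one parameter uniformly at random and
    reset it to a uniformly random value. Returns (P(fix), P(break))."""
    p_pick_wrong = Fraction(n_params - n_correct, n_params)
    p_pick_right = Fraction(n_correct, n_params)
    # A wrong parameter is fixed only if the reroll lands on its one right value.
    p_fix = p_pick_wrong * Fraction(1, n_values)
    # A right parameter is broken unless the reroll happens to land on it again.
    p_break = p_pick_right * Fraction(n_values - 1, n_values)
    return p_fix, p_break


def all_by_coincidence(n_params=10, n_values=10):
    """Chance that n_params independent one-in-n_values rolls all come up right."""
    return Fraction(1, n_values) ** n_params


p_fix, p_break = reroll_odds()
print(f"P(fix) = {p_fix}, P(break) = {p_break}")  # P(fix) = 3/100, P(break) = 63/100
print(f"P(all ten by coincidence) = {all_by_coincidence()}")  # 1/10000000000
```

Under these assumptions, a single reroll is 21 times likelier to break a parameter than to fix one, and ten independent one-in-ten coincidences come to 1 in 10^10.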

- oh, [redacted]'s comments remind me of a special case of 'worse actors replace the current ones': AI is banned or nationalized and the UK or US government builds it instead. To my eye, this seems a lot likelier to go poorly than the status quo.

There are plenty of scenarios that I think make the world go a lot better, but I don't think warning shots are one of them.

(One might help somewhat, if it happened; it's mostly just hard to say, and we'll need other major changes to happen first. Those other major changes are more the thing I'd suggest focusing on.)

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-12T01:15:56.941Z · EA · GW

This makes me wish there were a popular LW- or EAF-ish forum that doesn't use the karma system (or uses a very different karma system). If karma sometimes makes errors sticky because disagreement gets downvoted, then it would be nice to have more venues that don't have exactly that issue.

(Also, if more forums exist, this makes it likelier that different forums will downvote different things, so a wider variety of ideas can be explored in at least one place.)

This is another reason to roll out separate upvotes for 'I like this' versus 'I agree with this'.

Comment by RobBensinger on Twitter-length responses to 24 AI alignment arguments · 2022-06-10T10:29:45.081Z · EA · GW

Are you expecting a general solution with a low false negative rate? Isn't this doable if the designs are simple enough to understand fully or fall within a well-known category that we do have a method to check for?

I don't know of a way to save the world using only blueprints that, e.g., a human could confirm (in a reasonable length of time) is a safe way to save the world, in the face of superintelligent optimization to manipulate the human.

From my perspective, the point here is that human checking might add a bit of extra safety or usefulness, but the main challenge is to get the AGI to want to help with the intended task. If the AGI is adversarial, you've already failed.

Also, why does it need to be fast, and how fast? To not give up your edge to others who are taking more risk?

Yes. My guess would be that the first AGI project will have less than five years to save the world (before a less cautious project destroys it), and more than three months. Time is likely to be of the essence, and I quickly become more pessimistic about save-the-world-with-AGI plans as they start taking more than e.g. one year in expectation.

Comment by RobBensinger on Twitter-length responses to 24 AI alignment arguments · 2022-06-10T10:22:59.642Z · EA · GW


Comment by RobBensinger on Twitter-length responses to 24 AI alignment arguments · 2022-06-10T10:20:55.540Z · EA · GW


Seems to me like a thing that's hard to be confident about. Misaligned AGI will want to kill humans because we're potential threats (e.g., we could build a rival AGI), and because we're using matter and burning calories that could be put to other uses. It would also want to use the resources that we depend on to survive (e.g., food, air, water, sunlight). I don't understand the logic of fixating on exactly which of these reasons is most mentally salient to the AGI at the time it kills us.

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-10T03:33:43.537Z · EA · GW

I don't think this shows that there is a 99% chance that AI systems will deceive their programmers.

Agreed. I wasn't trying to argue for a specific probability assignment; that seems hard, and it seems harder to reach extreme probabilities if you're new to the field and haven't searched around for counter-arguments, counter-counter-arguments, etc.

The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.

In the vast majority of 'AGI with a random goal trying to deceive you' scenarios, I think the random goal produces outcomes like paperclips, rather than 'sort-of-good' outcomes.

I think the same in the case of 'AGI with a goal sort-of related to advancing human welfare in the training set', though the argument for this is less obvious.

I think "Complex Value Systems are Required to Realize Valuable Futures" is a good overview: human values are highly multidimensional, and in such a way that there are many different dimensions where a slightly wrong answer can lose you all of the value. This is structurally like a combination lock, where getting 9/10 of the numbers correct gets you 0% of the value but getting 10/10 right gets you 100% of the value.

Also relevant is Stuart Russell's point:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want.
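A minimal toy version of Russell's point, under hypothetical assumptions that are not in the original: a fixed resource budget can be spent on x (the only variable the objective scores) or left as y (something we care about but omitted from the objective). A brute-force search over feasible plans drives y to its extreme value, zero.

```python
from itertools import product


def best_plan(budget=10):
    """Hypothetical toy: maximize an objective that depends only on x
    (k = 1 of n = 2 variables), subject to a shared budget x + y <= budget."""
    best = None
    for x, y in product(range(budget + 1), repeat=2):
        if x + y > budget:
            continue  # infeasible: the plan overspends the shared budget
        score = x  # y never appears in the objective
        if best is None or score > best[0]:
            best = (score, x, y)
    return best


score, x, y = best_plan()
print(f"score={score}, x={x}, y={y}")  # score=10, x=10, y=0
```

Nothing in the search is hostile to y; y is simply never counted, so any plan that trades y away for more x wins.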

And Goodhart's Curse:

Goodhart's Curse in this form says that a powerful agent neutrally optimizing a proxy measure U that we hoped to align with true values V, will implicitly seek out upward divergences of U from V.

In other words: powerfully optimizing for a utility function is strongly liable to blow up anything we'd regard as an error in defining that utility function.

[...] Suppose the humans have true values V. We try to convey these values to a powerful AI, via some value learning methodology that ends up giving the AI a utility function U.

Even if U is locally an unbiased estimator of V, optimizing U will seek out what we would regard as 'errors in the definition', places where U diverges upward from V. Optimizing for a high U may implicitly seek out regions where U - V is high; that is, places where V is lower than U. This may especially include regions of the outcome space or policy space where the value learning system was subject to great variance; that is, places where the value learning worked poorly or ran into a snag.

Goodhart's Curse would be expected to grow worse as the AI became more powerful. A more powerful AI would be implicitly searching a larger space and would have more opportunity to uncover what we'd regard as "errors"; it would be able to find smaller loopholes, blow up more minor flaws.

[...] We could see the genie as implicitly or emergently seeking out any possible loophole in the wish: Not because it is an evil genie that knows our 'truly intended' V and is looking for some place that V can be minimized while appearing to satisfy U; but just because the genie is neutrally seeking out very large values of U and these are places where it is unusually likely that U diverged upward from V.

So part of the issue is that human values inherently require getting a lot of bits correct simultaneously, in order to produce any value. (And also, getting a lot of the bits right while getting a few wrong can pose serious s-risks.)

Another part of the problem is that powerfully optimizing one value will tend to crowd out other values.

And a third part of the problem is that insofar as there are flaws in our specification of what we value, AGI is likely to disproportionately seek out and exploit those flaws, since "places where our specification of what's good was wrong" are especially likely to include more "places where you can score extremely high on the specification".
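The "optimizer seeks out upward divergences" effect from the Goodhart's Curse quote can be checked with a small simulation. The setup is an illustrative assumption, not from the original: each of n_options candidate outcomes has a true value V drawn from a standard normal, and the proxy is U = V plus independent Gaussian noise, so U is an unbiased estimate of V for any single option. Picking the option with the highest U and averaging U - V at that option shows the gap is near zero with one option and grows as the search space grows, in line with the claim that a more powerful optimizer (a larger search) uncovers more "errors".

```python
import random
import statistics


def goodhart_gap(n_options, trials=1000, noise_sd=1.0, seed=0):
    """Average of (U - V) at the option that maximizes the proxy U.

    Illustrative assumptions: true values V ~ Normal(0, 1) and
    U = V + Normal(0, noise_sd), so U is unbiased for each single option.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_vals = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        proxy_vals = [v + rng.gauss(0.0, noise_sd) for v in true_vals]
        # The optimizer only sees the proxy, so it picks the argmax of U.
        best = max(range(n_options), key=proxy_vals.__getitem__)
        gaps.append(proxy_vals[best] - true_vals[best])
    return statistics.fmean(gaps)


for n in (1, 10, 100):
    print(f"{n:>4} options: mean U - V at the chosen option = {goodhart_gap(n):.2f}")
```

With equal signal and noise variance, the expected gap at the chosen option works out to half the expected maximum of the proxy values, so it keeps rising, slowly, as the number of options searched increases.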

To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered. 

Agreed! If I thought a misaligned AGI were likely to produce an awesome flourishing civilization (but kill humans in the process), I would be vastly less worried. By far the main reason I'm worried is that I expect misaligned AGI to produce things morally equivalent to "granite spheres" instead.

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-10T03:14:07.358Z · EA · GW

"What Eliezer's saying here is that current ML doesn't have a way to point the system's goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal."

I'm not sure whether I have misunderstood, but doesn't this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)?

No; this is why I said "current ML doesn't have a way to point the system's goals at specific physical objects in the world", and why I said "getting a specific programmer-intended concept into the goal".

The central difficulty isn't 'getting the AGI to instrumentally care about the world's state' or even 'getting the AGI to terminally care about the world's state'. (I don't know how one would do the latter with any confidence, but maybe there's some easy hack.)

Instead, the central difficulty is 'getting the AGI to terminally care about a specific thing, as opposed to something relatively random'.

If we could build an AGI that we knew in advance, with confidence, would specifically optimize for the number of paperclips in the universe and nothing else, then that would mean that we've probably solved most of the alignment problem. It's not necessarily a huge leap from this to saving the world.

The problem is that we don't know how to do that, so AGI will instead (by default) end up with some random unintended goal. When I mentioned 'paperclips', 'granite spheres', etc. in my previous comments, I was using these as stand-ins for 'random goals that have little to do with human flourishing'. I wasn't saying we know how to specifically aim an AGI at paperclips, or at granite spheres, on purpose. If we could, that would be a totally different ball game.

If I gave an AI the aim of 'kill all humans' then don't the system's goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn't that mean it would be straightforward to give AIs the goal of 'kill all humans'?

The instrumental convergence thesis implies that it's straightforward, if you know how to build AGI at all, to build an AGI that has the instrumental strategy 'kill all humans' (if any humans exist in its environment).

This doesn't transfer over to 'we know how to robustly build AGI that has humane values', because (a) humane values aren't a convergent instrumental strategy, and (b) we only know how to build AGIs that pursue convergent instrumental strategies with high probability, not how to build AGIs that pursue arbitrary goals with high probability.

But yes, if 'kill all humans' or 'acquire resources' or 'make an AGI that's very smart' or 'make an AGI that protects itself from being destroyed' were the only thing we wanted from AGI, then the problem would already be solved.

Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world?

No, because (e.g.) a deceptive agent that is "playing nice" will be just as able to answer those questions well. There isn't an external behavioral test that reliably distinguishes deceptive agents from genuinely friendly ones; and most agents are unfriendly/deceptive, so the prior is strongly that you'll get those before you get real friendliness.

This doesn't mean that it's impossible to get real friendliness, but it means that you'll need some method other than just looking at external behaviors in order to achieve friendliness.

This argument doesn't tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they'll get past our current safety testing. 

The paragraph you quoted isn't talking about safety testing. It's saying 'gradient-descent-ish processes that score sufficiently well on almost any highly rich, real-world task will tend to converge on similar core capabilities, because these core capabilities are relatively simple and broadly useful for many tasks', plus 'there isn't an analogous process pushing arbitrary well-performing gradient-descent-ish processes toward being human-friendly'.

An important note in passing. At the start, Eliezer defines alignment as ">0 people survive" but in the remainder of the piece, he often seems to refer to alignment as the more prosaic 'alignment with the intent of the programmer'. I find this ambiguity pops up a lot in AI safety writing. 

He says "So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I'll take it." The "carries out some pivotal superhuman engineering task" is important too. This part, and the part where the AGI somehow respects the programmer's "don't kill people" goal, connect the two phrasings.

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-09T12:45:02.855Z · EA · GW

Quoting Scott Alexander here:

I agree it's not necessarily a good idea to go around founding the Let's Commit A Pivotal Act AI Company.

But I think there's room for subtlety somewhere like "Conditional on you being in a situation where you could take a pivotal act, which is a small and unusual fraction of world-branches, maybe you should take a pivotal act."

That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you).

Somewhere halfway between "found the Let's Commit A Pivotal Act Company" and "if you happen to stumble into a pivotal act, take it", there's an intervention to spread a norm of "if a good person who cares about the world happens to stumble into a pivotal-act-capable AI, take the opportunity". I don't think this norm would necessarily accelerate a race. After all, bad people who want to seize power can take pivotal acts whether we want them to or not. The only people who are bound by norms are good people who care about the future of humanity. I, as someone with no loyalty to any individual AI team, would prefer that (good, norm-following) teams take pivotal acts if they happen to end up with the first superintelligence, rather than not doing that.

Another way to think about this is that all good people should be equally happy with any other good person creating a pivotal AGI, so they won't need to race among themselves. They might be less happy with a bad person creating a pivotal AGI, but in that case you should race and you have no other option. I realize "good" and "bad" are very simplistic but I don't think adding real moral complexity changes the calculation much.

I am more concerned about your point where someone rushes into a pivotal act without being sure their own AI is aligned. I agree this would be very dangerous, but it seems like a job for normal cost-benefit calculation: what's the risk of your AI being unaligned if you act now, vs. someone else creating an unaligned AI if you wait X amount of time? Do we have any reason to think teams would be systematically biased when making this calculation?

I'm more confident than Scott that the first AGI systems will be capable enough to execute a pivotal act (though alignability is another matter!). And, unlike Scott, I think AGI orgs should take the option more seriously at an earlier date, and center more of their strategic thinking around this scenario class. But if you don't agree with me there, I think you should endorse a position more like Scott's.

The alternative seems to just amount to writing off futures where early AGI systems are highly capable or impactful — giving up in advance, effectively deciding that endorsing a strategy that sounds weirdly extreme is a larger price to pay than human extinction. Phrased in those terms, this seems obviously absurd. (More absurd if you agree with me that this would mean writing off most possible futures.)

Nuclear weapons were an extreme technological development in their day, and MAD was an extreme and novel strategy developed in response to the novel properties of nuclear weapons. Strategically novel technologies force us to revise our strategies in counter-intuitive ways. The responsible way to handle this is to seriously analyze the new strategic landscape, have conversations about it, and engage in dialogue between major players until we collectively have a clear-sighted picture of what strategy makes sense, even if that strategy sounds weirdly extreme relative to other strategic landscapes.

If there's some alternative to intervening on AGI proliferation, then that seems important to know as well. But we should discover that, if so, via investigation, argument, and analysis of the strategic situation, rather than encouraging a mindset under which most of the relevant strategy space is taboo or evil (and then just hoping that this part of the strategy space doesn't end up being relevant).

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-09T09:08:46.731Z · EA · GW

I don't think MIRI has tried this much; we were unusually excited about Edward Kmett.

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-08T23:07:44.046Z · EA · GW

FWIW, I strongly encourage and endorse folks engaging with whatever parts of Eliezer's post they want to, without feeling obliged to respond to every single topic or sub-topic or whatever.

(Also, I like your comment and find it helpful.)

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-08T23:05:28.279Z · EA · GW

Sounds right to me! I think we should try lots of things.

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-08T23:04:18.234Z · EA · GW

Technical AI alignment isn't impossible, we just don't currently know how to do it. (And it looks hard.)

Comment by RobBensinger on AGI Ruin: A List of Lethalities · 2022-06-08T23:00:44.965Z · EA · GW

I think that makes sense as a worry, but I think EAs' caution and reluctance to model-build and argue about this stuff has turned out to do more harm than good, so we should change tactics. (And we very probably should have done things differently from the get-go.)

If you're worried that it's dangerous to talk about something publicly, I'd start off by thinking about it privately and talking about it over Signal with friends, etc. Then you can progress to contacting more EAs privately, then to posting publicly, as it becomes increasingly clear "there's real value in talking about this stuff" and "there's not a strong-enough reason to keep quiet".

Step one in doing that, though, has to be a willingness to think about the topic at all, even if there isn't clear public social proof that this is a normal or "approved" direction to think in. I think a thing that helps here is to recognize how small the group of "EA leaders and elite researchers" is, how divided their attention is between hundreds of different tasks and subtasks, and how easy it is for many things to therefore fall through the cracks or just-not-happen.