What does it take to defend the world against out-of-control AGIs? 2022-10-25T14:47:42.007Z
Changing the world through slack & hobbies 2022-07-21T18:01:06.935Z
“Intro to brain-like-AGI safety” series—just finished! 2022-05-17T15:35:38.485Z
“Intro to brain-like-AGI safety” series—halfway point! 2022-03-09T15:21:02.710Z
A case for AGI safety research far in advance 2021-03-26T12:59:36.244Z
[U.S. specific] PPP: free money for self-employed & orgs (time-sensitive) 2021-01-09T19:39:14.250Z

I had a very bad time with RSI from 2006-7, followed by a crazy-practically-overnight-miracle-cure-happy-ending. See my recent blog post The “mind-body vicious cycle” model of RSI & back pain for details & discussion.  :)

The implications for "brand value" would depend on whether people learn about "EA" as the perpetrator vs. victim. For example, I think there were charitable foundations that got screwed over by Bernie Madoff, and I imagine that their wiki articles would have also had a spike in views when that went down, but not in a bad way.

See also Nate Soares arguing against Joe’s conjunctive breakdown of risk here, and me here.

Related:

I have some discussion of this area in general and one of David Jilk’s papers in particular at my post Two paths forward: “Controlled AGI” and “Social-instinct AGI”.

In short, it seems to me that if you buy into this post, then the next step should be to figure out how human social instincts work, not just qualitatively but in enough detail to write it into AGI source code.

I claim that this is an open problem, involving things like circuits in the hypothalamus and neuropeptide receptors in the striatum. And it’s the main thing that I’m working on myself.

Additionally, there are several very good reasons to work on the human social instincts problem, even if you don’t buy into other parts of David Jilk’s assertions here.

Additionally, figuring out human social instincts is (I claim) (at least mostly) orthogonal to work that accelerates AGI timelines, and therefore we should all be able to rally around it as a good idea.

Whether we should also try to accelerate anthropomorphic AGI timelines, e.g. by studying the learning algorithms in the neocortex, is bound to be a much more divisive question. I claim that on balance, it’s mostly a very bad idea, with certain exceptions including closed (and not-intended-to-be-published) research projects by safety/alignment-concerned people. [I’m stating this opinion without justifying it.]

I think things like “If we see Sign X of misalignment from the AI, we should shut it down and retrain” comprise a small fraction of AI safety research, and I think even that small fraction consists primarily of stating extremely obvious ideas (let’s use honeypots! let’s do sandbox tests! let’s use interpretability! etc.) and exploring whether or not they would work, rather than stating non-obvious ideas. The horse has long ago left the barn on “the idea of sandbox testing and honeypots” being somewhere in an LLM’s training data!

I think a much larger fraction of AI safety research is geared towards thinking about how to make the AI not misaligned in the first place. So if the AI is scheming against us, reading those posts won’t be very helpful to it, because those ideas have evidently already failed.

I also think you’re understating how secrecy would inhibit progress. And we need progress, if we are to succeed at the goal of knowing how to make an AI that’s not misaligned in the first place.

In fact, even in the “If we see Sign X of misalignment from the AI, we should shut it down and retrain” type of research, I would strongly vote for open-and-therefore-better research (that the AI can also see) versus closed-and-therefore-probably-worse research (that the AI can’t see). For example, really good interpretability could be robust enough that it still works even if the AI has read the same articles as the programmers, and bad interpretability won’t work even if the AI hasn’t.

But meanwhile a very big and real secrecy-related problem is the kind of conventional AGI-related infohazards that safety researchers talk about all the time, i.e. people don’t want to publicly share ideas that would make AGI happen sooner. For example, lots of people disagree with Eliezer Yudkowsky about important aspects of AGI doom, and it’s not getting resolved because Eliezer is not sharing important parts of his beliefs that he sees as sensitive. Ditto with me for sure, ditto with lots of people I’ve talked to.

Would this problem be solvable with a giant closed Manhattan Project thing like you talked about? I dunno. The Manhattan Project itself had a bunch of USSR spies in it. Not exactly reassuring! OTOH I’m biased because I like living in Boston and don’t want to move to a barbed-wire-enclosed base in the desert  :-P

My paraphrase of the SDO argument is:

With our best-guess parameters in the Drake equation, we should be surprised that there are no aliens. But for all we know, maybe one or more of the parameters in the Drake equation is many many orders of magnitude lower than our best guess. And if that’s in fact the case, then we should not be surprised that there are no aliens!

…which seems pretty obvious, right?

So back to the context of AI risk. We have:

1. a framework in which risk is a conjunctive combination of factors…
2. …in which, at several of the steps, a subset of survey respondents give rather low probabilities for that factor being present

So at each step in the conjunctive argument, we wind up with some weight on “maybe this factor is really low”. And those add up.
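
To make "those add up" concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the steps are independent and that each step carries the same weight on "this factor is really low". The function name and the numbers are hypothetical and are not taken from the survey.

```python
def prob_some_factor_tiny(weights_on_tiny):
    """Probability that at least one step's factor is 'really low'.

    Assumption (hypothetical): the steps are independent, so the chance
    that no step is 'really low' is the product of (1 - weight).
    """
    prob_none_tiny = 1.0
    for weight in weights_on_tiny:
        prob_none_tiny *= 1.0 - weight
    return 1.0 - prob_none_tiny


# Hypothetical example: six steps, each with 10% weight on "really low".
six_steps = [0.10] * 6
print(prob_some_factor_tiny(six_steps))
```

With six steps at 10% each, the result is 1 - 0.9^6, a bit under one half, even though each individual step puts only 10% on the low value. If the low values tend to come from the same respondents, the independence assumption fails and this calculation overstates the effect.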

I don’t find the correlation table (of your other comment) convincing. When I look at the review table, there seem to be obvious optimistic outliers—two of the three lowest numbers on the whole table came from the same person. And your method has those optimistic outliers punching above their weight.

(At least, you should be calculating correlations between log(probability), right? Because it’s multiplicative.)
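
Here is a rough sketch of what I mean, in Python. The respondent numbers are made up for illustration (they are not from the review table), and `pearson` and `log_pearson` are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_pearson(ps, qs):
    """Pearson correlation of log-probabilities, which is the natural
    space when the probabilities get multiplied together."""
    return pearson([math.log(p) for p in ps], [math.log(q) for q in qs])


# Made-up answers from three hypothetical respondents for two steps.
step_a = [0.001, 0.01, 0.1]
step_b = [1e-6, 1e-4, 1e-2]  # deliberately the square of step_a

print(pearson(step_a, step_b))      # correlation of raw probabilities
print(log_pearson(step_a, step_b))  # correlation of log(probability)
```

In this toy data the second step is exactly the square of the first, so the two are perfectly related in log space, while the raw-probability correlation comes out lower because it is dominated by the largest values.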

Anyway, I think that AI risk is more disjunctive than conjunctive, so I really disagree with the whole setup. Recall that Joe’s conjunctive setup is:

1. It will become possible and financially feasible to build APS systems.
2. There will be strong incentives to build APS systems | (1).
3. It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway | (1)-(2).
4. Some deployed APS systems will be exposed to inputs where they seek power in misaligned and high-impact ways (say, collectively causing >1 trillion 2021-dollars of damage) | (1)-(3).
5. Some of this misaligned power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity | (1)-(4).
6. This will constitute an existential catastrophe | (1)-(5).

Of these:

• 1 is legitimately a conjunctive factor: If there’s no AGI, then there’s no AGI risk. (Though I understand that 1 is out of scope for this post?)
• I don’t think 2 is a conjunctive factor. If there are not strong incentives to build APS systems, I expect people to do so anyway, sooner or later, because it’s scientifically interesting, it’s cool, it helps us better understand the human brain, etc. For example, I would argue that there are not strong incentives to do recklessly dangerous gain-of-function research, but that doesn’t seem to be stopping people. (Or if “doing this thing will marginally help somebody somewhere to get grants and tenure” counts as “strong incentives”, then that’s a very low bar!)
• I don’t think 3 is a conjunctive factor, because even if alignment is easy in principle, there are bound to be people who want to try something different just because they’re curious what would happen, and people who have weird bad ideas, etc. etc. It’s a big world!
• 4-5 does constitute a conjunctive factor, I think, but I would argue that avoiding 4-5 requires a conjunction of different factors, factors that get us to a very different world involving something like a singleton AI or extreme societal resilience against destructive actors, of a type that seems unlikely to me. (More on this topic in my post here.)
• 6 is also a conjunctive factor, I think, but again avoiding 6 requires (I think) a conjunction of other factors.

The way I see it, if you successfully get a grant in Year N, then that should be strong evidence that you can successfully get a grant in Year N+1. After all, you’ll now have an extra year of highly-relevant experience, plus better connections etc. Right? (Well, unless you waste the grant money and get a bad reputation.) (Or unless the cause area funding situation gets worse in general, but that would equally be a concern as an employee at a big nonprofit too, and anyway seems unlikely for major EA cause areas in the near future.)

And if not, whatever type of job you were doing before, you can apply for that type of job again! (If you leave on good terms, you could apply for literally the same job you left.)

Should they take their grant in small amounts spaced out year-by-year instead of all in the first year?

Do your taxes with accrual accounting! One time I wound up getting 26 months of pay in one calendar year. It would have been a catastrophe with cash-basis accounting, but it was perfectly lovely thanks to accrual accounting. :)

For tax efficiency, should grant recipients optimally incorporate themselves as an S-corporation, or a charitable foundation, or something else?

You can be self-employed automatically without filing any special paperwork. That’s the category I’m in.

IIUC, the advantages of being a charitable foundation are all on the grant-giver side, not the grant-receiver side. Namely: (1) If you’re a charitable foundation, and another nonprofit wants to give you money, it is extremely easy for them to do so. (2) If you’re a charitable foundation, and an individual wants to give you money, then they can tax-deduct it. However, some institutions including EA Funds have jumped through whatever hoops there are such that they can give money to individuals. If your grantor is willing to give you the money as an individual, I think there’s no reason on your end to do anything different than that.

(If you want the advantages of being a nonprofit, e.g. getting money from SFF, without filing all the paperwork to be a nonprofit, I vaguely recall that there is an institution in the EA space that will “take you in” under its umbrella. But I can’t remember which one. There are also “virtual research institutes” (Theiss, Ronin, IGDORE, maybe others), that offer the same advantage (i.e. that your grantor would be officially granting to a nonprofit), but they’ll take a cut of every grant you get. A different advantage of the “virtual research institutes”, I suspect, is their ability to handle government grants, which I imagine come with a ton of bureaucracy & paperwork.)

Certain kinds of incorporation give you liability protection, which would be relevant if your “business” is going to borrow money or where there’s a risk of getting sued. That hasn’t been applicable for me.

If you get a $50K grant, is this better or worse on net than earning $50K of traditional W-2 employment income? … How do EA freelance researchers deal with the things that are typically provided through the employer/employee relationship — things like healthcare, disability insurance, retirement savings accounts, and so forth?

If you want to know how big a grant is necessary to support your living expenses, you have to do the annoying spreadsheet where you calculate the major taxes and deductions and expenses etc. To answer your specific questions:

• For me, $X of grant income was considerably worse than $X of W-2 income, even leaving aside the fact that the latter often comes with employer-provided benefits.

I guess the question is: do we think of lesswrong as a “blogging platform” akin to substack? Or do we think of it as a “community forum” akin to hacker news? (Or both!)

The same question, of course, applies to people who “blog” exclusively on EA Forum!

You might say: Maybe my lesswrong posts don’t constitute a proper “blog” because people can’t see just my posts, separated from everyone else’s lesswrong posts? Ah, but they can! Not only that, they can also view just my posts on my solo RSS feed, or my solo twitter, or an index of my posts on my personal website!

For my part, I find lesswrong to be a nice “blogging platform”, and have not so far felt tempted to set up a separate substack / wordpress / whatever. If I did, I would probably wind up cross-posting to lesswrong anyway, and the end result would just be a split-up comment section and more hassle posting and editing, with no appreciable upside, it seems to me. However, maybe I’d do it anyway, if eligibility for this giant prize is on the line. Is it?

A better open-source human-legible world-model, to be incorporated into future ML interpretability systems

Artificial intelligence

[UPDATE 3 MONTHS LATER: Better description and justification is now available in Section 15.2.2.1 here.]

It is probable that future powerful AGI systems will involve a learning algorithm that builds a common-sense world-model in the form of a giant unlabeled black-box data structure—after all, something like this is true in both modern machine learning and (I claim) human brains. Improving our ability, as humans, to look inside and understand the contents of such a black box is overwhelmingly (maybe even universally) viewed by AGI safety experts as an important step towards safe and beneficial AGI.

A future interpretability system will presumably look like an interface, with human-legible things on one side of the interface, and things-inside-the-black-box on the other side of the interface.  For the former (i.e., human-legible) side of the interface, it would be helpful to have access to an open-source world-model / knowledge-graph data structure with the highest possible quality, comprehensiveness, and especially human-legibility, including clear and unambiguous labels. We are excited to fund teams to build, improve, and open-source such human-legible world-model data structures, so that they may be freely used as one component of current and future interpretability systems.

Note 1: For further discussion, see my post Let's buy out Cyc, for use in AGI interpretability systems? I still think that a hypothetical open-sourcing of Cyc would be a promising project along these lines. But I’m open-minded to the possibility that other approaches are even better (see the comment section of that post for some possible examples). As it happens, I’m not personally familiar with what open-source human-legible world-models are out there right now. I’d just be surprised if they're already so good that it wouldn't be helpful to make them even better (more human-legible, more comprehensive, fewer errors, uncertainty quantification, etc.). After all, there are people building knowledge webs right now, but nobody is doing it for the purpose of future AGI interpretability systems. So it would be quite a coincidence if they were already doing everything exactly right for that application.

Note 2: Speaking of which, there could also be a separate project—or a different aspect of this same project—which entails trying to build an automated tool that matches up (a subset of) the entries of an existing open-source human-legible world-model / web-of-knowledge data structure with (a subset of) the latent variables in a language model like GPT-3. (It may be a fuzzy, many-to-many match, but that would still be helpful.) I’m even less of an expert there; I have no idea if that would work, or if anyone is currently trying to do that. But it does strike me as the kind of thing we should be trying to do.

Note 3: To be clear, I don't think of myself as an interpretability expert. Don’t take my word for anything here.  :-)  [However, later in my post series I'll have more detailed discussion of exactly where this thing would fit into an AGI control system, as I see it. Check back in a few weeks. Here’s the link.]

One of my theories here is that it's helpful to pivot quickly towards "here's an example concrete research problem that seems hard but not impossible, and people are working on it, and not knowing the solution seems obviously problematic". This is good for several reasons, including "pattern-matching to serious research, safety engineering, etc., rather than pattern-matching to sci-fi comics", providing a gentler on-ramp (as opposed to wrenching things like "your children probably won't die of natural causes" or whatever), providing food for thought, etc. Of course this only works if you can engage in the technical arguments. Brian Christian's book is the extreme of this approach.

Vicarious and Numenta are both explicitly trying to build AGI, and neither does any safety/alignment research whatsoever. I don't think this fact is particularly relevant to OpenAI, but I do think it's an important fact in its own right, and I'm always looking for excuses to bring it up.  :-P

Comment by Steven Byrnes (steve2152) on Why does (any particular) AI safety work reduce s-risks more than it increases them? · 2021-10-07T20:34:10.631Z

I don't really distinguish between effects by order*

I agree that direct and indirect effects of an action are fundamentally equally important (in this kind of outcome-focused context) and I hadn't intended to imply otherwise.

Hmm, it seems to me (and you can correct me) that we should be able to agree that there are SOME technical AGI safety research publications that are positive under some plausible beliefs/values and harmless under all plausible beliefs/values, and then we don't have to talk about cluelessness and tradeoffs, we can just publish them.

And we both agree that there are OTHER technical AGI safety research publications that are positive under some plausible beliefs/values and negative under others. And then we should talk about your portfolios etc. Or more simply, on a case-by-case basis, we can go looking for narrowly-tailored approaches to modifying the publication in order to remove the downside risks while maintaining the upside.

I feel like we're arguing past each other: I keep saying the first category exists, and you keep saying the second category exists. We should just agree that both categories exist! :-)

Perhaps the more substantive disagreement is what fraction of the work is in which category. I see most but not all ongoing technical work as being in the first category, and I think you see almost all ongoing technical work as being in the second category. (I think you agreed that "publishing an analysis about what happens if a cosmic ray flips a bit" goes in the first category.)

(Luke says "AI-related" but my impression is that he mostly works on AGI governance not technical, and the link is definitely about governance not technical. I would not be at all surprised if proposed governance-related projects were much more heavily weighted towards the second category, and am only saying that technical safety research is mostly first-category.)

For example, if you didn't really care about s-risks, then publishing a useful considerations for those who are concerned about s-risks might take attention away from your own priorities, or it might increase cooperation, and the default position to me should be deep uncertainty/cluelessness here, not that it's good in expectation or bad in expectation or 0 in expectation.

This points to another (possible?) disagreement. I think maybe you have the attitude where (to caricature somewhat) if there's any downside risk whatsoever, no matter how minor or far-fetched, you immediately jump to "I'm clueless!". Whereas I'm much more willing to say: OK, I mean, if you do anything at all there's a "downside risk" in a sense, just because life is uncertain, who knows what will happen, but that's not a good reason to just sit on the sidelines and let nature take its course and hope for the best. If I have a project whose first-order effect is a clear and specific and strong upside opportunity, I don't want to throw that project out unless there's a comparably clear and specific and strong downside risk. (And of course we are obligated to try hard to brainstorm what such a risk might be.) Like if a firefighter is trying to put out a fire, and they aim their hose at the burning interior wall, they don't stop and think, "Well I don't know what will happen if the wall gets wet, anything could happen, so I'll just not pour water on the fire, y'know, don't want to mess things up."

The "cluelessness" intuition gets its force from having a strong and compelling upside story weighed against a strong and compelling downside story, I think.

If the first-order effect of a project is "directly mitigating an important known s-risk", and the second-order effects of the same project are "I dunno, it's a complicated world, anything could happen", then I say we should absolutely do that project.

In practice, we can't really know with certainty that we're making AI safer, and without strong evidence/feedback, our judgements of tradeoffs may be prone to fairly arbitrary subjective judgements, motivated reasoning and selection effects.

This strikes me as too pessimistic. Suppose I bring a complicated new board game to a party. Two equally-skilled opposing teams each get a copy of the rulebook to study for an hour before the game starts. Team A spends the whole hour poring over the rulebook and doing scenario planning exercises. Team B immediately throws the rulebook in the trash and spends the hour watching TV.

Neither team has "strong evidence/feedback"—they haven't started playing yet. Team A could think they have good strategy ideas but in fact they are engaging in arbitrary subjective judgments and motivated reasoning. In fact, their strategy ideas, which seemed good on paper, could in fact turn out to be counterproductive!

Still, I would put my money on Team A beating Team B. Because Team A is trying. Their planning abilities don't have to be all that good to be strictly better (in expectation) than "not doing any planning whatsoever, we'll just wing it". That's a low bar to overcome!

So by the same token, it seems to me that vast swathes of AGI safety research easily surpass the (low) bar of doing better in expectation than the alternative of "Let's just not think about it in advance, we'll wing it".

For example, compare (1) a researcher spends some time thinking about what happens if a cosmic ray flips a bit (or a programmer makes a sign error, like in the famous GPT-2 incident), versus (2) nobody spends any time thinking about that. (1) is clearly better, right? We can always be concerned that the person won't do a great job, or that it will be counterproductive because they'll happen across very dangerous information and then publish it, etc. But still, the expected value here is clearly positive, right?

You also bring up the idea that (IIUC) there may be objectively good safety ideas but they might not actually get implemented because there won't be a "strong and justified consensus" to do them. But again, the alternative is "nobody comes up with those objectively good safety ideas in the first place". That's even worse, right? (FWIW I consider "come up with crisp and rigorous and legible arguments for true facts about AGI safety" to be a major goal of AGI safety research.)

Anyway, I'm objecting to undirected general feelings of "gahhhh we'll never know if we're helping at all", etc. I think there's just a lot of stuff in the AGI safety research field which is unambiguously good in expectation, where we don't have to feel that way. What I don't object to—and indeed what I strongly endorse—is taking a more directed approach and saying "For AGI safety research project #732, what are the downside risks of this research, and how do they compare to the upsides?"

So that brings us to "ambitious value alignment". I agree that an ambitiously-aligned AGI comes with a couple potential sources of s-risk that other types of AGI wouldn't have, specifically via (1) sign flip errors, and (2) threats from other AGIs. (Although I think (1) is less obviously a problem than it sounds, at least in the architectures I think about.) On the other hand, (A) I'm not sure anyone is really working on ambitious alignment these days … at least Rohin Shah & Paul Christiano have stated that narrow (task-limited) alignment is a better thing to shoot for (and last anyone heard MIRI was shooting for task-limited AGIs too) (UPDATE: actually this was an overstatement, see e.g. 1,2,3); (B) my sense is that current value-learning work (e.g. at CHAI) is more about gaining conceptual understanding than creating practical algorithms / approaches that will scale to AGI. That said, I'm far from an expert on the current value learning literature; frankly I'm often confused by what such researchers are imagining for their longer-term game-plan.

BTW I put a note on my top comment that I have a COI. If you didn't notice. :)

Hmm, just a guess, but …

• Maybe you're conceiving of the field as "AI alignment", pursuing the goal "figure out how to bring an AI's goals as close as possible to a human's (or humanity's) goals, in their full richness" (call it "ambitious value alignment")
• Whereas I'm conceiving the field as "AGI safety", with the goal "reduce the risk of catastrophic accidents involving AGIs".

"AGI safety research" (as I think of it) includes not just how you would do ambitious value alignment, but also whether you should do ambitious value alignment. In fact, AGI safety research may eventually result in a strong recommendation against doing ambitious value alignment, because we find that it's dangerously prone to backfiring, and/or that some alternative approach is clearly superior (e.g. CAIS, or microscope AI, or act-based corrigibility or myopia or who knows what). We just don't know yet. We have to do the research.

"AGI safety research" (as I think of it) also includes lots of other activities like analysis and mitigation of possible failure modes (e.g. asking what would happen if a cosmic ray flips a bit in the computer), and developing pre-deployment testing protocols, etc. etc.

Does that help? Sorry if I'm missing the mark here.

Thanks!

(Incidentally, I don't claim to have an absolutely watertight argument here that AI alignment research couldn't possibly be bad for s-risks, just that I think the net expected impact on s-risks is to reduce them.)

If s-risks were increased by AI safety work near (C), why wouldn't they also be increased near (A), for the same reasons?

I think suffering minds are a pretty specific thing, in the space of "all possible configurations of matter". So optimizing for something random (paperclips, or "I want my field-of-view to be all white", etc.) would almost definitely lead to zero suffering (and zero pleasure). (Unless the AGI itself has suffering or pleasure.) However, there's a sense in which suffering minds are "close" to the kinds of things that humans might want an AGI to want to do. Like, you can imagine how if a cosmic ray flips a bit, "minimize suffering" could turn into "maximize suffering". Or at any rate, humans will try (and I expect succeed even without philanthropic effort) to make AGIs with a prominent human-like notion of "suffering", so that it's on the table as a possible AGI goal.

In other words, imagine you're throwing a dart at a dartboard.

• The bullseye has very positive point value.
• That's representing the fact that basically no human wants astronomical suffering, and basically everyone wants peace and prosperity etc.
• On other parts of the dartboard, there are some areas with very negative point value.
• That's representing the fact that if programmers make an AGI that desires something vaguely resembling what they want it to desire, that could be an s-risk.
• If you miss the dartboard entirely, you get zero points.
• That's representing the fact that a paperclip-maximizing AI would presumably not care to have any consciousness in the universe (except possibly its own, if applicable).

So I read your original post as saying "If the default is for us to miss the dartboard entirely, it could be s-risk-counterproductive to improve our aim enough that we can hit the dartboard", and my response to that was "I don't think that's relevant, I think it will be really easy to not miss the dartboard entirely, and this will happen 'by default'. And in that case, better aim would be good, because it brings us closer to the bullseye."
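
For what it's worth, the dartboard picture can be written down as a toy expected-value calculation. This is only a sketch in Python: the point values, the probabilities, and the function name `expected_points` are all hypothetical, chosen just to show the shape of the argument.

```python
def expected_points(p_bullseye, p_negative_area,
                    bullseye_value=1.0, negative_value=-1.0):
    """Expected score of one dart throw in the toy dartboard model.

    Whatever probability is left over is the chance of missing the
    board entirely, which scores zero. All values are hypothetical.
    """
    p_miss_board = 1.0 - p_bullseye - p_negative_area
    if p_bullseye < 0 or p_negative_area < 0 or p_miss_board < -1e-12:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    return p_bullseye * bullseye_value + p_negative_area * negative_value


# Missing the board entirely (the paperclip case) scores zero.
off_board = expected_points(0.0, 0.0)

# On the board "by default", with worse versus better aim.
worse_aim = expected_points(0.6, 0.4)
better_aim = expected_points(0.9, 0.1)

print(off_board, worse_aim, better_aim)
```

In this toy model, once the dart is landing on the board anyway, shifting probability from the negative areas to the bullseye can only raise the expected score. The worry in the original post corresponds to comparing a throw that lands on the board with poor aim against `off_board`, which is a different comparison.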

Sorry I'm not quite sure what you mean. If we put things on a number line with (A)=1, (B)=2, (C)=3, are you disagreeing with my claim "there is very little probability weight in the interval ", or with my claim "in the interval , moving down towards 1 probably reduces s-risk", or with both, or something else?

[note that I have a COI here]

That's because everybody wants the AI to do the thing they want it to do, not just long-term AGI risk people. And I think there are really obvious things that anyone would immediately think to try, and these really obvious techniques would be good enough to get us from (C) to (B) but not good enough to get us to (A).

[Warning: This claim is somewhat specific to a particular type of AGI architecture that I work on and consider most likely—see e.g. here. Other people have different types of AGIs in mind and would disagree. In particular, in the "deceptive mesa-optimizer" failure mode (which relates to a different AGI architecture than mine) we would plausibly expect failures to have random goals like "I want my field-of-view to be all white", even after reasonable effort to avoid that. So maybe people working in other areas would have different answers, I dunno.]

I agree that it's at least superficially plausible that (C) might be better than (B) from an s-risk perspective. But if (C) is off the table and the choice is between (A) and (B), I think (A) is preferable for both s-risks and x-risks.

The main argument of Stuart Russell's book focuses on reward modeling as a way to align AI systems with human preferences.

Hmm, I remember him talking more about IRL and CIRL and less about reward modeling. But it's been a little while since I read it, could be wrong.

If it's really difficult to write a reward function for a given task Y, then it seems unlikely that AI developers would deploy a system that does it in an unaligned way according to a misspecified reward function. Instead, reward modeling makes it feasible to design an AI system to do the task at all.

Maybe there's an analogy where someone would say "If it's really difficult to prevent accidental release of pathogens from your lab, then it seems unlikely that bio researchers would do research on pathogens whose accidental release would be catastrophic". Unfortunately there's a horrifying many-decades-long track record of accidental release of pathogens from even BSL-4 labs, and it's not like this kind of research has stopped. Instead it's like, the bad thing doesn't happen every time, and/or things seem to be working for a while before the bad thing happens, and that's good enough for the bio researchers to keep trying.

So as I talk about here, I think there are going to be a lot of proposals to modify an AI to be safe that do not in fact work, but do seem ahead-of-time like they might work, and which do in fact work for a while as training progresses. I mean, when x-risk-naysayers like Yann LeCun or Jeff Hawkins are asked how to avoid out-of-control AGIs, they can spout off a list of like 5-10 ideas that would not in fact work, but sound like they would. These are smart people and a lot of other smart people believe them too. Also, even something as dumb as "maximize the amount of money in my bank account" would plausibly work for a while and do superhumanly-helpful things for the programmers, before it starts doing superhumanly-bad things for the programmers.

Even with reward modeling, though, AI systems are still going to have similar drives due to instrumental convergence: self-preservation, goal preservation, resource acquisition, etc., even if they have goals that were well specified by their developers. Although maybe corrigibility and not doing bad things can be built into the systems' goals using reward modeling.

Yup, if you don't get corrigibility then you failed.

I really liked this!!!

Since you asked for feedback, here's a little suggestion, take it or leave it: I found a couple things at the end slightly out-of-place, in particular "If you choose to tackle the problem of nuclear security, what angle can you attack the problem from that will give you the most fulfillment?" and "Do any problems present even bigger risks than nuclear war?"

Immediately after such an experience, I think the narrator would not be thinking about the option of not bothering to work on nuclear security because other causes are more important, nor thinking about their own fulfillment. If other causes came to mind, I imagine it would be along the lines of "if I somehow manage to stop the nuclear war, what other potential catastrophes are waiting in the wings, ready to strike anytime in the months and years after that—and this time with no reset button?"

Or if you want it to fit better as written now, then shortly after the narrator snaps back to age 18 the text could say something along the lines of "You know about chaos theory and the butterfly effect; this will be a new re-roll of history, and there might not be a nuclear war this time around. Maybe last time was a fluke?" Then that might remove some of the single-minded urgency that I would otherwise expect the narrator to feel, and thus it would become a bit more plausible that the narrator might work on pandemics or whatever.

(Maybe that "new re-roll of history" idea is what you had in mind? Whereas I was imagining the Groundhog Day / Edge of Tomorrow / Terminator trope where the narrator knows 100% for sure that there will be a nuclear war on this specific hour of this specific day, if the narrator doesn't heroically stop it.)

Comment by Steven Byrnes (steve2152) on A mesa-optimization perspective on AI valence and moral patienthood · 2021-09-16T18:08:18.128Z · EA · GW

Hmm, yeah, I guess you're right about that.

Oh, you said "evolution-type optimization", so I figured you were thinking of the case where the inner/outer distinction is clear cut. If you don't think the inner/outer distinction will be clear cut, then I'd question whether you actually disagree with the post :) See the section defining what I'm arguing against, in particular the "inner as AGI" discussion.

Comment by Steven Byrnes (steve2152) on A mesa-optimization perspective on AI valence and moral patienthood · 2021-09-14T15:40:49.019Z · EA · GW

For example, for the situation that you're talking about (I called it "Case 2" in my post) I wrote "It seems highly implausible that the programmers would just sit around for months and years and decades on end, waiting patiently for the outer algorithm to edit the inner algorithm, one excruciatingly-slow step at a time. I think the programmers would inspect the results of each episode, generate hypotheses for how to improve the algorithm, run small tests, etc." If the programmers did just sit around for years not looking at the intermediate training results, yes I expect the project would still succeed sooner or later. I just very strongly expect that they wouldn't sit around doing nothing.

Comment by Steven Byrnes (steve2152) on A mesa-optimization perspective on AI valence and moral patienthood · 2021-09-14T00:07:49.936Z · EA · GW

AlphaGo has a human-created optimizer, namely MCTS. Normally people don't use the term "mesa-optimizer" for human-created optimizers.

Then maybe you'll say "OK there's a human-created search-based consequentialist planner, but the inner loop of that planner is a trained ResNet, and how do you know that there isn't also a search-based consequentialist planner inside each single run through the ResNet?"

Admittedly, I can't prove that there isn't. I suspect that there isn't, because there seems to be no incentive for that (there's already a search-based consequentialist planner!), and also because I don't think ResNets are up to such a complicated task.
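To make the distinction concrete, here is a minimal sketch of the structure I have in mind. It is a toy illustration under loud assumptions, not AlphaGo's actual code: the game is a made-up stone-taking game, the planner is plain depth-limited negamax rather than MCTS (just to keep it short), and `learned_value` is a hypothetical placeholder standing in for a single forward pass of the trained ResNet.

```python
from typing import Callable, List

# Toy game (an assumption for illustration): a pile of stones, players
# alternate removing 1-3 stones, and whoever takes the last stone wins.
def legal_moves(stones: int) -> List[int]:
    return [m for m in (1, 2, 3) if m <= stones]

def learned_value(stones: int) -> float:
    """Hypothetical stand-in for one forward pass of a trained network.

    It maps a state straight to a value estimate for the player to move,
    with no loop over future positions inside it.
    """
    return 0.0  # deliberately uninformative placeholder

def search(stones: int, depth: int, evaluate: Callable[[int], float]) -> float:
    """Human-written planner: depth-limited negamax over future positions."""
    if stones == 0:
        return -1.0  # the previous player took the last stone, so the mover lost
    if depth == 0:
        return evaluate(stones)  # the only place the "network" is consulted
    return max(-search(stones - m, depth - 1, evaluate)
               for m in legal_moves(stones))

def best_move(stones: int, depth: int = 4) -> int:
    """Pick the move whose resulting position is worst for the opponent."""
    return max(legal_moves(stones),
               key=lambda m: -search(stones - m, depth - 1, learned_value))
```

In this sketch all of the explicit lookahead lives in `search`, which a human wrote; `learned_value` is just a function call in the inner loop. The question in the hypothetical objection above is whether something search-like could also be hiding inside that one function call, which this toy obviously cannot settle.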

I find most justifications and arguments made in favor of a timeline of less than 50 years to be rather unconvincing.

If we don't have convincing evidence in favor of a timeline <50 years, and we also don't have convincing evidence in favor of a timeline ≥50 years, then we just have to say that this is a question on which we don't have convincing evidence of anything in particular. But we still have to take whatever evidence we have and make the best decisions we can. ¯\_(ツ)_/¯

(You don't say this explicitly but your wording kinda implies that ≥50 years is the default, and we need convincing evidence to change our mind away from that default. If so, I would ask why we should take ≥50 years to be the default. Or sorry if I'm putting words in your mouth.)

I am simply not able to understand why we are significantly closer to AGI today than we were in 1950s

Lots of ingredients go into AGI, including (1) algorithms, (2) lots of inexpensive chips that can do lots of calculations per second, (3) technology for fast communication between these chips, (4) infrastructure for managing large jobs on compute clusters, (5) frameworks and expertise in parallelizing algorithms, (6) general willingness to spend millions of dollars and roll custom ASICs to run a learning algorithm, (7) coding and debugging tools and optimizing compilers, etc. Even if you believe that we've made no progress whatsoever on algorithms since the 1950s, we've made massive progress in the other categories. I think that alone puts us "significantly closer to AGI today than we were in the 1950s": once we get the algorithms, at least everything else will be ready to go, and that wasn't true in the 1950s, right?

But I would also strongly disagree with the idea that we've made no progress whatsoever on algorithms since the 1950s. Even if you think that GPT-3 and AlphaGo have absolutely nothing whatsoever to do with AGI algorithms (which strikes me as an implausibly strong statement, although I would endorse much weaker versions of that statement), that's far from the only strand of research in AI, let alone neuroscience. For example, there's a (IMO plausible) argument that PGMs and causal diagrams will be more important to AGI than deep neural networks are. But that would still imply that we've learned AGI-relevant things about algorithms since the 1950s. Or as another example, there's a (IMO misleading) argument that the brain is horrifically complicated and we still have centuries of work ahead of us in understanding how it works. But even people who strongly endorse that claim wouldn't also say that we've made "no progress whatsoever" in understanding brain algorithms since the 1950s.

Sorry if I'm misunderstanding.

isn't there an infinite degree of freedom associated with a continuous function?

I'm a bit confused by this; are you saying that the only possible AGI algorithm is "the exact algorithm that the human brain runs"? The brain is wired up by a finite number of genes, right?

most contemporary progress on AI happens by running base-optimizers which could support mesa-optimization

GPT-3 is of that form, but AlphaGo/MuZero isn't (I would argue).

I'm not sure how to settle whether your statement about "most contemporary progress" is right or wrong. I guess we could count how many papers use model-free RL vs model-based RL, or something? Well anyway, given that I haven't done anything like that, I wouldn't feel comfortable making any confident statement here. Of course you may know more than me! :-)

If we forget about "contemporary progress" and focus on "path to AGI", I have a post arguing against what (I think) you're implying at Against evolution as an analogy for how humans will create AGI, for what it's worth.

Ideally we'd want a method for identifying valence which is more mechanistic than mine, in the sense that it lets you identify valence in a system just by looking inside the system without looking at how it was made.

Yeah I dunno, I have some general thoughts about what valence looks like in the vertebrate brain (e.g. this is related, and this) but I'm still fuzzy in places and am not ready to offer any nice buttoned-up theory. "Valence in arbitrary algorithms" is obviously even harder by far.  :-)