## Posts

What does it take to defend the world against out-of-control AGIs? 2022-10-25T14:47:42.007Z
Changing the world through slack & hobbies 2022-07-21T18:01:06.935Z
“Intro to brain-like-AGI safety” series—just finished! 2022-05-17T15:35:38.485Z
“Intro to brain-like-AGI safety” series—halfway point! 2022-03-09T15:21:02.710Z
A case for AGI safety research far in advance 2021-03-26T12:59:36.244Z
[U.S. specific] PPP: free money for self-employed & orgs (time-sensitive) 2021-01-09T19:39:14.250Z

Comment by Steven Byrnes (steve2152) on How to best address Repetitive Strain Injury (RSI)? · 2022-12-06T18:16:37.387Z · EA · GW

I had a very bad time with RSI from 2006-7, followed by a crazy-practically-overnight-miracle-cure-happy-ending. See my recent blog post The “mind-body vicious cycle” model of RSI & back pain for details & discussion.  :)

Comment by Steven Byrnes (steve2152) on RyanCarey's Shortform · 2022-11-20T14:10:24.724Z · EA · GW

The implications for "brand value" would depend on whether people learn about "EA" as the perpetrator vs. victim. For example, I think there were charitable foundations that got screwed over by Bernie Madoff, and I imagine that their wiki articles would have also had a spike in views when that went down, but not in a bad way.

Comment by Steven Byrnes (steve2152) on What should I ask Joe Carlsmith — Open Phil researcher, philosopher and blogger? · 2022-11-10T15:23:59.031Z · EA · GW

See also Nate Soares arguing against Joe’s conjunctive breakdown of risk here, and me here.

Comment by Steven Byrnes (steve2152) on Is AI forecasting a waste of effort on the margin? · 2022-11-07T19:09:41.097Z · EA · GW

Related:

Comment by Steven Byrnes (steve2152) on "Develop Anthropomorphic AGI to Save Humanity from Itself" (Future Fund AI Worldview Prize submission) · 2022-11-07T14:32:57.897Z · EA · GW

I have some discussion of this area in general and one of David Jilk’s papers in particular at my post Two paths forward: “Controlled AGI” and “Social-instinct AGI”.

In short, it seems to me that if you buy into this post, then the next step should be to figure out how human social instincts work, not just qualitatively but in enough detail to write it into AGI source code.

I claim that this is an open problem, involving things like circuits in the hypothalamus and neuropeptide receptors in the striatum. And it’s the main thing that I’m working on myself.

There are several very good reasons to work on the human social instincts problem, even if you don’t buy into other parts of David Jilk’s assertions here.

Additionally, figuring out human social instincts is (I claim) (at least mostly) orthogonal to work that accelerates AGI timelines, and therefore we should all be able to rally around it as a good idea.

Whether we should also try to accelerate anthropomorphic AGI timelines, e.g. by studying the learning algorithms in the neocortex, is bound to be a much more divisive question. I claim that on balance, it’s mostly a very bad idea, with certain exceptions including closed (and not-intended-to-be-published) research projects by safety/alignment-concerned people. [I’m stating this opinion without justifying it.]

Comment by Steven Byrnes (steve2152) on Why do we post our AI safety plans on the Internet? · 2022-10-31T17:27:53.640Z · EA · GW

I think things like “If we see Sign X of misalignment from the AI, we should shut it down and retrain” comprise a small fraction of AI safety research, and I think even that small fraction consists primarily of stating extremely obvious ideas (let’s use honeypots! let’s do sandbox tests! let’s use interpretability! etc.) and exploring whether or not they would work, rather than stating non-obvious ideas. The horse has long ago left the barn on “the idea of sandbox testing and honeypots” being somewhere in an LLM’s training data!

I think a much larger fraction of AI safety research is geared towards thinking about how to make the AI not misaligned in the first place. So if the AI is scheming against us, reading those posts won’t be very helpful to it, because those ideas have evidently already failed.

I also think you’re understating how secrecy would inhibit progress. And we need progress, if we are to succeed at the goal of knowing how to make an AI that’s not misaligned in the first place.

In fact, even in the “If we see Sign X of misalignment from the AI, we should shut it down and retrain” type of research, I would strongly vote for open-and-therefore-better research (that the AI can also see) versus closed-and-therefore-probably-worse research (that the AI can’t see). For example, really good interpretability could be robust enough that it still works even if the AI has read the same articles as the programmers, and bad interpretability won’t work even if the AI hasn’t.

But meanwhile a very big and real secrecy-related problem is the kind of conventional AGI-related infohazards that safety researchers talk about all the time, i.e. people don’t want to publicly share ideas that would make AGI happen sooner. For example, lots of people disagree with Eliezer Yudkowsky about important aspects of AGI doom, and it’s not getting resolved because Eliezer is not sharing important parts of his beliefs that he sees as sensitive. Ditto with me for sure, ditto with lots of people I’ve talked to.

Would this problem be solvable with a giant closed Manhattan Project thing like you talked about? I dunno. The Manhattan Project itself had a bunch of USSR spies in it. Not exactly reassuring! OTOH I’m biased because I like living in Boston and don’t want to move to a barbed-wire-enclosed base in the desert  :-P

Comment by Steven Byrnes (steve2152) on ‘Dissolving’ AI Risk – Parameter Uncertainty in AI Future Forecasting · 2022-10-19T19:48:41.174Z · EA · GW

My paraphrase of the SDO argument is:

With our best-guess parameters in the Drake equation, we should be surprised that there are no aliens. But for all we know, maybe one or more of the parameters in the Drake equation is many many orders of magnitude lower than our best guess. And if that’s in fact the case, then we should not be surprised that there are no aliens!

…which seems pretty obvious, right?
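
A toy numerical sketch of that paraphrase, in Python. Every number here is made up for illustration (the best-guess values, the uncertainty ranges, the helper name `fraction_below_one`), and the log-uniform sampling is just one convenient assumption, not the method of the SDO paper. It only shows the qualitative point: with fixed best guesses the product is large, but once a couple of parameters are allowed to range over many orders of magnitude, a noticeable share of samples land on "fewer than one civilization".

```python
import math
import random


def fraction_below_one(best_guesses, spread_orders, n_samples=20000, seed=0):
    """Fraction of samples in which the product of all parameters is below 1.

    Each parameter is drawn log-uniformly within +/- `spread` orders of
    magnitude of its best guess (a hypothetical modelling choice).
    """
    rng = random.Random(seed)
    below = 0
    for _ in range(n_samples):
        log10_product = 0.0
        for guess, spread in zip(best_guesses, spread_orders):
            log10_product += math.log10(guess) + rng.uniform(-spread, spread)
        if log10_product < 0:  # product < 1, i.e. "no aliens" in this toy model
            below += 1
    return below / n_samples


# Hypothetical Drake-style factors whose best-guess product is about 1e5.
best_guesses = [1e11, 0.1, 0.1, 0.1, 0.001]

# No uncertainty at all: the product is always the best-guess product.
certain = fraction_below_one(best_guesses, [0, 0, 0, 0, 0])

# Two factors uncertain by +/- 5 orders of magnitude, two by +/- 1.
uncertain = fraction_below_one(best_guesses, [0, 1, 1, 5, 5])

print("best-guess product:", math.prod(best_guesses))
print("fraction of samples below 1, no uncertainty:", certain)
print("fraction of samples below 1, wide uncertainty:", uncertain)
```

The spreads are symmetric, so the sampled parameters can come out higher than the best guess as well as lower; the "no aliens" mass comes entirely from the low tail.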

So back to the context of AI risk. We have:

1. a framework in which risk is a conjunctive combination of factors…
2. …in which, at several of the steps, a subset of survey respondents give rather low probabilities for that factor being present

So at each step in the conjunctive argument, we wind up with some weight on “maybe this factor is really low”. And those add up.

I don’t find the correlation table (of your other comment) convincing. When I look at the review table, there seem to be obvious optimistic outliers—two of the three lowest numbers on the whole table came from the same person. And your method has those optimistic outliers punching above their weight.

(At least, you should be calculating correlations between log(probability), right? Because it’s multiplicative.)
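
To make the log(probability) point concrete, here is a small Python sketch. The respondent numbers are invented for illustration (they are not from the review table), and `pearson` is just a hand-rolled helper. Since the conjunctive estimate is a product of per-step probabilities, its log is a sum of per-step logs, so log space is where the steps combine additively. The sketch computes the correlation both ways so the two can be compared.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical answers from five respondents for two steps of the argument.
# The last respondent is an optimistic outlier on both steps.
step_a = [0.9, 0.8, 0.7, 0.6, 0.001]
step_b = [0.5, 0.6, 0.4, 0.7, 0.002]

# The joint estimate multiplies the steps, so its log is a sum of logs.
joint = [a * b for a, b in zip(step_a, step_b)]
for a, b, j in zip(step_a, step_b, joint):
    assert math.isclose(math.log(j), math.log(a) + math.log(b))

raw_r = pearson(step_a, step_b)
log_r = pearson([math.log(p) for p in step_a], [math.log(p) for p in step_b])

print("correlation of raw probabilities:", raw_r)
print("correlation of log probabilities:", log_r)
```

With these made-up numbers the two correlations differ, which is the only thing the sketch is meant to show: the choice of raw versus log space is not a cosmetic one when a few respondents sit orders of magnitude away from the rest.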

Anyway, I think that AI risk is more disjunctive than conjunctive, so I really disagree with the whole setup. Recall that Joe’s conjunctive setup is:

1. It will become possible and financially feasible to build APS systems.
2. There will be strong incentives to build APS systems | (1).
3. It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway | (1)-(2).
4. Some deployed APS systems will be exposed to inputs where they seek power in misaligned and high-impact ways (say, collectively causing >1 trillion 2021-dollars of damage) | (1)-(3).
5. Some of this misaligned power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity | (1)-(4).
6. This will constitute an existential catastrophe | (1)-(5).

Of these:

• 1 is legitimately a conjunctive factor: If there’s no AGI, then there’s no AGI risk. (Though I understand that 1 is out of scope for this post?)
• I don’t think 2 is a conjunctive factor. If there are not strong incentives to build APS systems, I expect people to do so anyway, sooner or later, because it’s scientifically interesting, it’s cool, it helps us better understand the human brain, etc. For example, I would argue that there are not strong incentives to do recklessly dangerous gain-of-function research, but that doesn’t seem to be stopping people. (Or if “doing this thing will marginally help somebody somewhere to get grants and tenure” counts as “strong incentives”, then that’s a very low bar!)
• I don’t think 3 is a conjunctive factor, because even if alignment is easy in principle, there are bound to be people who want to try something different just because they’re curious what would happen, and people who have weird bad ideas, etc. etc. It’s a big world!
• 4-5 does constitute a conjunctive factor, I think, but I would argue that avoiding 4-5 requires a conjunction of different factors, factors that get us to a very different world involving something like a singleton AI or extreme societal resilience against destructive actors, of a type that seems unlikely to me. (More on this topic in my post here.)
• 6 is also a conjunctive factor, I think, but again avoiding 6 requires (I think) a conjunction of other factors.
Like, to avoid 6 being true, we’d probably need a unipolar outcome (…I would argue…), and the AI would need to have properties that are “good” in our judgment, and the AI would probably need to be able to successfully align its successors and avoid undesired value drift over the vast times and distances.

Comment by Steven Byrnes (steve2152) on What Do AI Safety Pitches Not Get About Your Field? · 2022-09-24T13:51:34.389Z · EA · GW

I join you in strongly disagreeing with people who say that we should expect unprecedented GDP growth from AI which is very much like AI today but better.

OTOH, at some point we'll have AI that is like a new intelligent species arriving on our planet, and then I think all bets are off.

Comment by Steven Byrnes (steve2152) on Changing the world through slack & hobbies · 2022-09-03T17:16:46.839Z · EA · GW

Principal Investigator, i.e. a professor in charge of a group of grad students and/or other underlings. I just changed the wording to "professor" or "advisor" instead of "PI".

Comment by Steven Byrnes (steve2152) on Changing the world through slack & hobbies · 2022-08-11T21:49:07.627Z · EA · GW

Principal Investigator, i.e. a professor in charge of a group of grad students and/or other underlings. [UPDATE: I changed the wording.]

Comment by Steven Byrnes (steve2152) on Grantees: how do you structure your finances & career? · 2022-08-05T01:06:23.564Z · EA · GW

(This whole answer is USA-specific)

This seems to me like a scary situation with essentially zero job security, but maybe I’m wrong about this?

The only real job security is to have marketable skills. Eternal perfect job security is extremely rare in the USA—I can’t think of anyone but tenured professors who have that. If you work at a startup, the startup could go under. If you work at a big firm, there could be layoffs. Etc.

The way I see it, if you successfully get a grant in Year N, then that should be strong evidence that you can successfully get a grant in Year N+1.
After all, you’ll now have an extra year of highly-relevant experience, plus better connections etc. Right? (Well, unless you waste the grant money and get a bad reputation.) (Or unless the cause area funding situation gets worse in general, but that would equally be a concern as an employee at a big nonprofit too, and anyway seems unlikely for major EA cause areas in the near future.)

And if not, whatever type of job you were doing before, you can apply for that type of job again! (If you leave on good terms, you could apply for literally the same job you left.)

Should they take their grant in small amounts spaced out year-by-year instead of all in the first year?

Do your taxes with accrual accounting! One time I wound up getting 26 months of pay in one calendar year. It would have been a catastrophe with cash-basis accounting, but it was perfectly lovely thanks to accrual accounting. :)

For tax efficiency, should grant recipients optimally incorporate themselves as an S-corporation, or a charitable foundation, or something else?

You can be self-employed automatically without filing any special paperwork. That’s the category I’m in.

IIUC, the advantages of being a charitable foundation are all on the grant-giver side, not the grant-receiver side. Namely: (1) If you’re a charitable foundation, and another nonprofit wants to give you money, it is extremely easy for them to do so. (2) If you’re a charitable foundation, and an individual wants to give you money, then they can tax-deduct it.

However, some institutions including EA Funds have jumped through whatever hoops there are such that they can give money to individuals. If your grantor is willing to give you the money as an individual, I think there’s no reason on your end to do anything different than that.

(If you want the advantages of being a nonprofit, e.g.
getting money from SFF, without filing all the paperwork to be a nonprofit, I vaguely recall that there is an institution in the EA space that will “take you in” under its umbrella. But I can’t remember which one. There are also “virtual research institutes” (Theiss, Ronin, IGDORE, maybe others), that offer the same advantage (i.e. that your grantor would be officially granting to a nonprofit), but they’ll take a cut of every grant you get. A different advantage of the “virtual research institutes”, I suspect, is their ability to handle government grants, which I imagine come with a ton of bureaucracy & paperwork.)

Certain kinds of incorporation give you liability protection, which would be relevant if your “business” is going to borrow money or where there’s a risk of getting sued. That hasn’t been applicable for me.

If you get a $50K grant, is this better or worse on net than earning $50K of traditional W-2 employment income? … How do EA freelance researchers deal with the things that are typically provided through the employer/employee relationship — things like healthcare, disability insurance, retirement savings accounts, and so forth?

If you want to know how big a grant is necessary to support your living expenses, you have to do the annoying spreadsheet where you calculate the major taxes and deductions and expenses etc. To answer your specific questions:

• For me, $X of grant income was considerably worse than $X of W-2 income, even leaving aside the fact that the latter often comes with employer-provided benefits. The QBI deduction helps, but not nearly enough to compensate for the employer contribution to payroll taxes etc. It’s possible that this is income-dependent, I’m just saying what it was for me.
• Yes I pay out-of-pocket for disability insurance, and (Roth & regular) IRAs, and an obamacare plan.
what lies do you tell your relatives to stop them from nagging you about your unorthodox career decisions

It was fine, partly because I didn’t quit my old job until my first 1-year grant was finalized, and so far I have gotten renewal grants well before the previous grants ran out. (Sample size = 1, but still.)

Comment by Steven Byrnes (steve2152) on What might decorticate rats tell us about the distribution of consciousness? · 2022-07-20T21:39:18.272Z · EA · GW

For Possibility 3, I guess you mean more specifically “Decorticate rats are not conscious, and neither are intact rats”, correct?

If so, I think you’re prematurely rejecting, let’s call it, Possibility 5: “Decorticate rats are not conscious, whereas intact rats are conscious.”

I think it’s just generally tricky to infer consciousness from behavior. For example, you mention “survive, navigate their environment, or interact with their peers… find their way around landmarks, solve basic reasoning tasks, and learn to avoid painful stimuli.” But deep-RL agents can do all those things too, right? Are deep-RL agents conscious? Well, maybe you think they are. But I and lots of people think they aren’t. At the very least, you need to make that argument, it doesn’t go without saying. And if we can’t unthinkingly infer consciousness from behavior in deep RL, then we likewise can’t unthinkingly infer consciousness from seeing not-very-different behaviors in decorticate rats (or any other animals).

I also am a bit confused by your suggestion that decorticate-from-birth rats are wildly different from decorticate-from-birth primates. Merker 2007a argues that humans with hydranencephaly are basically decorticate-from-birth, and discusses all their behavior on p79, which very much seemed conscious to both Merker and the parents of these children, just as decorticate rats seem conscious to you.
We don’t have to agree with Merker (and I don’t), but it seems that the basic issue is present in humans, unless of course Merker is mis-describing the nature of hydranencephaly. (I don’t know anything about hydranencephaly except from this one paper.)

(My actual [tentative] position is that, to the limited extent that phenomenal consciousness is a real meaningful notion in the first place, decorticate rats are not conscious, and intact rats might or might not be conscious, I don't know, I’m still a bit hazy on the relevant neuroanatomy. I’m mostly a Graziano-ist, a.k.a. Attention Schema Theory.)

(My take on superior colliculus versus visual cortex is that they’re doing two very different types of computations, see §3.2.1 here.)

(Separately, mammal cortex seems to have a lot in common with bird pallium, such that "all mammals are conscious and no birds are conscious" would be a very weird position from my perspective. I've never heard anyone take that position, have you?)

Comment by Steven Byrnes (steve2152) on Criticism of EA Criticism Contest · 2022-07-14T16:52:05.830Z · EA · GW

It feels intellectually lazy to "strongly disagree" with principles like "The best way to do good yourself is to act selflessly to do good" and then not explain why.

I could be wrong, but I was figuring that here Zvi was coming from a pro-market perspective, i.e. the perspective in which Jeff Bezos has made the world a better place by founding Amazon and thus we can now get consumer goods more cheaply and conveniently etc. (But Jeff Bezos did so, presumably, out of a selfish desire to make money.)

I also suggest replacing “feels intellectually lazy” with something like “I know you’re busy but I sure hope you’ll find the time to spell out your thoughts on this topic in the future”. (In which case, I second that!)
Comment by Steven Byrnes (steve2152) on On Deference and Yudkowsky's AI Risk Estimates · 2022-07-10T17:27:05.865Z · EA · GW

For what it's worth, consider the claim “The Judeo-Christian God, the one who listens to prayers and so on, doesn't exist.” I have such high confidence in this claim that I would absolutely state it as a fact without hedging, and psychoanalyze people for how they came to disagree with me. Yet there's a massive theology literature arguing to the contrary of that claim, including by some very smart and thoughtful people, and I've read essentially none of this theology literature, and if you asked me to do an anti-atheism ITT I would flunk it catastrophically.

I'm not sure what lesson you'll take from that; for all I know you yourself are very religious, and this anecdote will convince you that I have terrible judgment. But if you happen to be on the same page as me, then maybe this would be an illustration of the fact that (I claim) one can rationally and correctly arrive at extremely-confident beliefs without it needing to pass through a deep understanding and engagement with the perspectives of the people who disagree with you.

I agree that this isn’t too important a conversation, it’s just kinda interesting. :)

Comment by Steven Byrnes (steve2152) on On Deference and Yudkowsky's AI Risk Estimates · 2022-07-07T00:19:55.512Z · EA · GW

Gotcha, thanks. I guess we have an object-level disagreement: I think that careful thought reveals MWI to be unambiguously correct, with enough 9’s as to justify Eliezer’s tone. And you don’t. ¯\_(ツ)_/¯

(Of course, this is bound to be a judgment call; e.g. Eliezer didn’t state how many 9’s of confidence he has. It’s not like there’s a universal convention for how many 9’s are enough 9’s to state something as a fact without hedging, or how many 9’s are enough 9’s to mock the people who disagree with you.)
Comment by Steven Byrnes (steve2152) on On Deference and Yudkowsky's AI Risk Estimates · 2022-07-06T22:36:27.965Z · EA · GW

Fair enough, thanks.

Comment by Steven Byrnes (steve2152) on On Deference and Yudkowsky's AI Risk Estimates · 2022-07-06T03:25:53.728Z · EA · GW

Hmm, I’m a bit confused where you’re coming from.

Suppose that the majority of eminent mathematicians believe 5+5=10, but a significant minority believes 5+5=11. Also, out of the people in the 5+5=10 camp, some say “5+5=10 and anyone who says otherwise is just totally wrong”, whereas other people say “I happen to believe that the balance of evidence is that 5+5=10, but my esteemed colleagues are reasonable people and have come to a different conclusion, so we 5+5=10 advocates should approach the issue with appropriate humility, not overconfidence.”

In this case, the fact of the matter is that 5+5=10. So in terms of who gets the most credit added to their track-record, the ranking is:

• 1st place: The ones who say “5+5=10 and anyone who says otherwise is just totally wrong”,
• 2nd place: The ones who say “I think 5+5=10, but one should be humble, not overconfident”,
• 3rd place: The ones who say “I think 5+5=11, but one should be humble, not overconfident”,
• Last place: The ones who say “5+5=11 and anyone who says otherwise is just totally wrong”.

Agree so far? (See also: Bayes’s theorem, Brier score, etc.)

Back to the issue here. Yudkowsky is claiming “MWI, and anyone who says otherwise is just totally wrong”. (And I agree—that’s what I meant when I called myself a pro-MWI extremist.)

IF the fact of the matter is that careful thought shows MWI to be unambiguously correct, then Yudkowsky (and I) get more credit for being more confident. Basically, he’s going all in and betting his reputation on MWI being right, and (in this scenario) he won the bet.
Conversely, IF the fact of the matter is that careful thought shows MWI to be not unambiguously correct, then Eliezer loses the maximum number of points. He staked his reputation on MWI being right, and (in this scenario) he lost the bet.

So that’s my model, and in my model “overconfidence” per se is not really a thing in this context. Instead we first have to take a stand on the object-level controversy. I happen to agree with Eliezer that careful thought shows MWI to be unambiguously correct, and given that, the more extreme his confidence in this (IMO correct) claim, the more credit he deserves.

I’m trying to make sense of why you’re bringing up “overconfidence” here. The only thing I can think of is that you think that maybe there is simply not enough information to figure out whether MWI is right or wrong (not even for an ideal reasoner with a brain the size of Jupiter and a billion years to ponder the topic), and therefore saying “MWI is unambiguously correct” is “overconfident”? If that’s what you’re thinking, then my reply is: if “not enough information” were the actual fact of the matter about MWI, then we should criticize Yudkowsky first and foremost for being wrong, not for being overconfident.

As for your point (2), I forget what mistakes Yudkowsky claimed that anti-MWI-advocates are making, and in particular whether he thought those mistakes were “elementary”. I am open-minded to the possibility that Yudkowsky was straw-manning the MWI critics, and that they are wrong for more interesting and subtle reasons than he gives them credit for, and in particular that he wouldn’t pass an anti-MWI ITT. (For my part, I’ve tried harder, see e.g. here.) But that’s a different topic. FWIW I don’t think of Yudkowsky as having a strong ability to explain people’s wrong opinions in a sympathetic and ITT-passing way, or if he does have that ability, then I find that he chooses not to exercise it too much in his writings.
:-P

Comment by Steven Byrnes (steve2152) on On Deference and Yudkowsky's AI Risk Estimates · 2022-07-05T17:34:58.607Z · EA · GW

OTOH, I am (or I guess was?) a professional physicist, and when I read Rationality A-Z, I found that Yudkowsky was always reaching exactly the same conclusions as me whenever he talked about physics, including areas where (IMO) the physics literature itself is a mess—not only interpretations of QM, but also how to think about entropy & the 2nd law of thermodynamics, and, umm, I thought there was a third thing too but I forget.

That increased my respect for him quite a bit.

And who the heck am I? Granted, I can’t out-credential Scott Aaronson in QM. But FWIW, hmm let’s see, I had the highest physics GPA in my Harvard undergrad class and got the highest preliminary-exam score in my UC Berkeley physics grad school class, and I’ve played a major role in designing I think 5 different atomic interferometers (including an atomic clock) for various different applications, and in particular I was always in charge of all the QM calculations related to estimating their performance, and also I once did a semester-long (unpublished) research project on quantum computing with superconducting qubits, and also I have made lots of neat wikipedia QM diagrams and explanations including a pedagogical introduction to density matrices and mixed states.

I don’t recall feeling strongly that literally every word Yudkowsky wrote about physics was correct, more like “he basically figured out the right idea, despite not being a physicist, even in areas where physicists who are devoting their career to that particular topic are all over the place”. In particular, I don’t remember exactly what Yudkowsky wrote about the no-communication theorem. But I for one absolutely understand mixed states, and that doesn’t prevent me from being a pro-MWI extremist like Yudkowsky.
Comment by Steven Byrnes (steve2152) on Digital people could make AI safer · 2022-06-11T18:52:14.355Z · EA · GW

Strong agree, see for example my post Randal Koene on brain understanding before whole brain emulation

Comment by Steven Byrnes (steve2152) on A tale of 2.5 orthogonality theses · 2022-05-03T12:24:50.510Z · EA · GW

The quote above is an excerpt from here, and immediately after listing those four points, Eliezer says “But there are further reasons why the above problem might be difficult to solve, as opposed to being the sort of thing you can handle straightforwardly with a moderate effort…”.

Comment by Steven Byrnes (steve2152) on A tale of 2.5 orthogonality theses · 2022-05-02T02:17:21.024Z · EA · GW

My concern is with the non-experts…

My perspective is “orthogonality thesis is one little ingredient of an argument that AGI safety is an important cause area”. One possible different perspective is “orthogonality thesis is the reason why AGI safety is an important cause area”. Your belief is that a lot of non-experts hold the latter perspective, right? If so, I’m skeptical.

I think I’m reasonably familiar with popular expositions of the case for AGI safety, and with what people inside and outside the field say about why or why not to work on AGI safety. And I haven’t come across “orthogonality thesis is the reason why AGI safety is an important cause area” as a common opinion, or even a rare opinion, as far as I can recall.

For example, Brian Christian, Stuart Russell, and Nick Bostrom all talk about Goodhart’s law and/or instrumental convergence in addition to (or instead of) orthogonality, Sam Harris talks about arms races and fragility-of-value, Ajeya Cotra talks about inner misalignment, Rob Miles talks about all of the above, Toby Ord uses the “second species argument”, etc. People way outside the field don’t talk about “orthogonality thesis” because they’ve never heard of it.
So if lots of people are saying “orthogonality thesis is the reason why AGI safety is an important cause area”, I don’t know where they would have gotten that idea, and I remain skeptical that this is actually the case.

I don't know how to understand 'the space of all possible intelligent algorithms' as a statistical relationship without imagining it populated with actual instances.

My main claim here is that asking random EA people about the properties of “intelligence” (in the abstract) is different from asking them about the properties of “intelligent algorithms that will actually be created by future AI programmers”. I suspect that most people would feel that these are two different things, and correspondingly give different answers to questions depending on which one you ask about. (This could be tested, of course.)

A separate question is how random EA people conceptualize “intelligence” (in the abstract). I suspect “lots of different ways”, and those ways might be more or less coherent. For example, one coherent possibility is to consider the set of all 2^8000000 possible 1-megabyte source code algorithms, then select the subset that is “intelligent” (operationalized somehow), and then start talking about the properties of algorithms in that set.

Comment by Steven Byrnes (steve2152) on A tale of 2.5 orthogonality theses · 2022-05-01T16:08:50.308Z · EA · GW

I think the "real" orthogonality thesis is what you call the motte. I don't think the orthogonality thesis by itself proves "alignment is hard"; rather you need additional arguments (things like Goodhart's law, instrumental convergence, arguments about inner misalignment, etc.).

I don't want to say that nobody has ever made the argument "orthogonality, therefore alignment is hard"—people say all kinds of things, especially non-experts—but it's a wrong argument and I think you're overstating how popular it is among experts.
Armstrong initially states that he’s arguing for the thesis that ‘high-intelligence agents can exist having more or less any final goals’ - ie theoretical possibility - but then adds that he will ‘be looking at proving the … still weaker thesis [that] the fact of being of high intelligence provides extremely little constraint on what final goals an agent could have’ - which I think Armstrong meant as ‘there are very few impossible pairings of high intelligence and motivation’, but which much more naturally reads to me as ‘high intelligence is almost equally as likely to be paired with any set of motivations as any other’.

I think the last part of this excerpt ("almost equally") is unfair. I mean, maybe some readers are interpreting it that way, but if so, I claim that those readers don't know what the word "constraint" means. Right?

I posted one poll asking ‘what the orthogonality thesis implies about [a relationship between] intelligence and terminal goals’, to which 14 of 16 respondents selected the option ‘there is no relationship or only an extremely weak relationship between intelligence and goals’, but someone pointed out that respondents might have interpreted ‘no relationship’ as ‘no strict logical implication from one to the other’. The other options hopefully gave context, but in a differently worded version of the poll 10 of 13 people picked options describing theoretical possibility.

I think the key reason that knowledgeable optimistic people are optimistic is the fact that humans will be trying to make aligned AGIs. But neither of the polls mention that.
The statement “There is no statistical relationship between intelligence and goals” is very different from “An AGI created by human programmers will have a uniformly-randomly-selected goal”; I subscribe to (something like) the former (in the sense of selecting from “the space of all possible intelligent algorithms” or something) but I put much lower probability on (something like) the latter, despite being pessimistic about AGI doom. Human programmers are not uniformly-randomly sampling the space of all possible intelligent algorithms (I sure hope!)

Comment by Steven Byrnes (steve2152) on Is AI safety still neglected? · 2022-03-31T16:50:00.605Z · EA · GW

OK, thanks. Here’s a chart I made:

I think the problem is that when I said “technical AGI safety”, I was thinking the red box, whereas you were thinking “any technical topic in either the red or blue boxes”. I agree that there are technical topics in the top-right blue box in particular, and that’s where “conflict scenarios” would mainly be. My understanding is that working on those topics does not have much of a direct connection to AGI, in the sense that technologies for reducing human-human conflict tend to overlap with technologies for reducing AGI-AGI conflict. (At least, according to this comment thread, I haven’t personally thought about it much.)

Anyway, I guess you would say “in a more s-risk-focused world, we would be working more on the top-right blue box and less on the red box”. But really, in a more s-risk-focused world, we would be working more on all three colored boxes. :-P I’m not an expert on the ITN of particular projects within the blue boxes, and therefore don’t have a strong opinion about how to weigh them against particular projects within the red box. I am concerned / pessimistic about prospects for success in the red box. But maybe if I knew more about the state of the blue boxes, I would be equally concerned / pessimistic about those too!!
¯\_(ツ)_/¯

Comment by Steven Byrnes (steve2152) on Is AI safety still neglected? · 2022-03-31T12:56:52.524Z · EA · GW

maybe no one wants to do ambitious value learning

"Maybe no one" is actually an overstatement, sorry, here are some exceptions: 1,2,3. (I have corrected my previous comment.)

I guess I think of current value learning work as being principally along the lines of “What does value learning even mean? How do we operationalize that?” And if we’re still confused about that question, it makes it a bit hard to productively think about failure modes.

It seems pretty clear to me that “unprincipled, making-it-up-as-you-go-along, alignment schemes” would be bad for s-risks, for such reasons as you mentioned. So trying to gain clarity about the lay of the land seems good.

Comment by Steven Byrnes (steve2152) on Is AI safety still neglected? · 2022-03-31T11:43:00.315Z · EA · GW

Oops, I was thinking more specifically about technical AGI safety. Or do you think "conflict scenarios" impact that too?

Comment by Steven Byrnes (steve2152) on Is AI safety still neglected? · 2022-03-30T17:15:12.744Z · EA · GW

What is neglected within AI safety is suffering-focused AI safety for preventing S-risks. Most AI safety research and existential risk research in general seems to be focused on reducing extinction risks and on colonizing space, rather than on reducing the risk of worse than death scenarios.

I disagree, I think if AGI safety researchers cared exclusively about s-risk, their research output would look substantially the same as it does today, e.g. see here and discussion thread.

For example, suppose there is an AI aligned to reflect human values. Yet "human values" could include religious hells.

Ambitious value learning and CEV are not a particularly large share of what AGI safety researchers are working on on a day-to-day basis, AFAICT.
And insofar as researchers are thinking about those things, a lot of that work is trying to figure out whether those things are good ideas in the first place, e.g. whether they would lead to religious hell.

Comment by Steven Byrnes (steve2152) on The role of academia in AI Safety. · 2022-03-28T19:22:56.617Z · EA · GW

It's not obvious to me that "the academic community has a comparative advantage at solving sufficiently defined problems". For example, mechanistic interpretability has been a well-defined problem for the past two years at least, but it seems that a disproportionate amount of progress on it has been made outside of academia, by Chris Olah & collaborators at OpenAI & Anthropic. There are various concrete problems here but it seems that more progress is being made by independent researchers (e.g. Vanessa Kosoy, John Wentworth) and researchers at nonprofits (MIRI) than by anyone in academia.

In other domains, I tend to think of big challenging technical projects as being done more often by the private or public sector—for example, academic groups are not building rocket ships, or ultra-precise telescope mirrors, etc., instead companies and governments are.

Yet another example: In the domain of AI capabilities research, DeepMind and OpenAI and FAIR and Microsoft Research etc. give academic labs a run for their money in solving concrete problems. Also, quasi-independent-researcher Jeremy Howard beat a bunch of ML benchmarks while arguably kicking off the pre-trained-language-model revolution here.

My perspective is: academia has a bunch of (1) talent and (2) resources. I think it's worth trying to coax that talent and resources towards solving important problems like AI alignment, instead of the various less-important and less-time-sensitive things that they do.
However, I think it's MUCH less clear that any particular Person X would be more productive as a grad student than as a nonprofit employee, or more productive as a professor than as a nonprofit technical co-founder. In fact, I strongly expect the reverse.

And in that case, we should really be framing it as "There are tons of talented people in academia, and we should be trying to convince them that AGI x-risk is a problem they should work on. And likewise, there are tons of resources in academia, and we should be trying to direct those resources towards AGI x-risk research."

Note the difference: in this framing, we're not pre-supposing that pushing people and projects from outside academia to inside academia is a good thing. It might or might not be, depending on the details.

Comment by Steven Byrnes (steve2152) on The role of academia in AI Safety. · 2022-03-28T18:28:00.767Z · EA · GW

OK, thanks for clarifying. So my proposal would be: if a person wants to do / found / fund an AGI-x-risk-mitigating research project, they should consider their background, their situation, the specific nature of the research project, etc., and decide on a case-by-case basis whether the best home for that research project is academia (e.g. CHAI) versus industry (e.g. DeepMind, Anthropic) versus nonprofits (e.g. MIRI) versus independent research. And a priori, it could be any of those. Do you agree with that?

Comment by Steven Byrnes (steve2152) on The role of academia in AI Safety. · 2022-03-28T16:38:07.240Z · EA · GW

There are a few possible claims mixed up here:

• Possible claim 1: "We want people in academia to be doing lots of good AGI-x-risk-mitigating research." Yes, I don't think this is controversial.
• Possible claim 2: "We should stop giving independent researchers and nonprofits money to do AGI-x-risk-mitigating research, because academia is better." You didn't exactly say this, but sorta imply it. I disagree.
Academia has strengths and weaknesses, and certain types of projects and people might or might not be suited to academia, and I think we shouldn't make a priori blanket declarations about academia being appropriate for everything versus nothing. My wish-list of AGI safety research projects (blog post is forthcoming UPDATE: here it is) has a bunch of items that are clearly well-suited to academia and a bunch of others that are equally clearly a terrible fit to academia. Likewise, some people who might work on AGI safety are in a great position to do so within academia (e.g. because they're already faculty) and some are in a terrible position to do so within academia (e.g. because they lack relevant credentials). Let's just have everyone do what makes sense for them!
• Possible claim 3: "We should do field-building to make good AGI-x-risk-mitigating research more common, and better, within academia." The goal seems uncontroversially good. Whether any specific plan will accomplish that goal is a different question. For example, a proposal to fund a particular project led by such-and-such professor (say, Jacob Steinhardt or Stuart Russell) is very different from a proposal to endow a university professorship in AGI safety. In the latter case, I would suspect that universities will happily take the money and spend it on whatever their professors would have been doing anyway, and they'll just shoehorn the words "AGI safety" into the relevant press releases and CVs. Whereas in the former case, it's just another project, and we can evaluate it on its merits, including comparing it to possible projects done outside academia.
• Possible claim 4: "We should turn AGI safety into a paradigmatic field with well-defined widely-accepted research problems and approaches which contribute meaningfully towards x-risk reduction, and also would be legible to journals, NSF grant applications, etc."
Yes that would be great (and is already true to a certain extent), but you can't just wish that into existence! Nobody wants the field to be preparadigmatic! It's preparadigmatic not by choice, but rather to the extent that we are still searching for the right paradigms.

(COI note.)

Comment by Steven Byrnes (steve2152) on Milan Griffes on EA blindspots · 2022-03-19T14:35:12.532Z · EA · GW

I agree that the current nuclear weapon situation makes AI catastrophe more likely on the margin, and said as much here (The paragraph "You might reply: The thing that went wrong in this scenario is not the out-of-control AGI, it’s the fact that humanity is too vulnerable! And my response is: Why can’t it be both? ...")

That said, I do think the nuclear situation is a rather small effect (on AI risk specifically), in that there are many different paths for an intelligent motivated agent to cause chaos and destruction. Even if triggering nuclear war is the lowest-hanging fruit for a hypothetical future AGI aspiring to destroy humanity (it might or might not be, I dunno), I think other fruits are hanging only slightly higher, like causing extended blackouts, arranging for the release of bio-engineered plagues, triggering non-nuclear great power war (if someday nuclear weapons are eliminated), mass spearphishing / hacking, mass targeted disinformation, etc., even leaving aside more exotic things like nanobots. Solving all these problems would be that much harder (still worth trying!), and anyway we need to solve AI alignment one way or the other, IMO. :)

Comment by Steven Byrnes (steve2152) on We're announcing a $100,000 blog prize · 2022-03-08T02:30:53.917Z · EA · GW

I regularly write posts on lesswrong (and cross-post when applicable to alignmentforum). Am I a blogger? I certainly describe myself that way. But I get a strong impression from the Effective Ideas website that this doesn’t count. (You can correct me if I’m wrong.)

I guess the question is: do we think of lesswrong as a “blogging platform” akin to substack? Or do we think of it as a “community forum” akin to hacker news? (Or both!)

The same question, of course, applies to people who “blog” exclusively on EA Forum!

You might say: Maybe my lesswrong posts don’t constitute a proper “blog” because people can’t see just my posts, separated from everyone else’s lesswrong posts? Ah, but they can! Not only that, they can also view just my posts on my solo RSS feed, or my solo twitter, or an index of my posts on my personal website!

For my part, I find lesswrong to be a nice “blogging platform”, and have not so far felt tempted to set up a separate substack / wordpress / whatever. If I did, I would probably wind up cross-posting to lesswrong anyway, and the end result would just be a split-up comment section and more hassle posting and editing, with no appreciable upside, it seems to me. However, maybe I’d do it anyway, if eligibility for this giant prize is on the line. Is it?

Comment by Steven Byrnes (steve2152) on The Future Fund’s Project Ideas Competition · 2022-03-01T19:39:38.135Z · EA · GW

A better open-source human-legible world-model, to be incorporated into future ML interpretability systems

Artificial intelligence

[UPDATE 3 MONTHS LATER: Better description and justification is now available in Section 15.2.2.1 here.]

It is probable that future powerful AGI systems will involve a learning algorithm that builds a common-sense world-model in the form of a giant unlabeled black-box data structure—after all, something like this is true in both modern machine learning and (I claim) human brains. Improving our ability, as humans, to look inside and understand the contents of such a black box is overwhelmingly (maybe even universally) viewed  by AGI safety experts as an important step towards safe and beneficial AGI.

A future interpretability system will presumably look like an interface, with human-legible things on one side of the interface, and things-inside-the-black-box on the other side of the interface.  For the former (i.e., human-legible) side of the interface, it would be helpful to have access to an open-source world-model / knowledge-graph data structure with the highest possible quality, comprehensiveness, and especially human-legibility, including clear and unambiguous labels. We are excited to fund teams to build, improve, and open-source such human-legible world-model data structures, so that they may be freely used as one component of current and future interpretability systems.

~

Note 1: For further discussion, see my post Let's buy out Cyc, for use in AGI interpretability systems? I still think that a hypothetical open-sourcing of Cyc would be a promising project along these lines. But I’m open-minded to the possibility that other approaches are even better (see the comment section of that post for some possible examples). As it happens, I’m not personally familiar with what open-source human-legible world-models are out there right now. I’d just be surprised if they're already so good that it wouldn't be helpful to make them even better (more human-legible, more comprehensive, fewer errors, uncertainty quantification, etc.). After all, there are people building knowledge webs right now, but nobody is doing it for the purpose of future AGI interpretability systems. So it would be quite a coincidence if they were already doing everything exactly right for that application.

Note 2: Speaking of which, there could also be a separate project—or a different aspect of this same project—which entails trying to build an automated tool that matches up (a subset of) the entries of an existing open-source human-legible world-model / web-of-knowledge data structure with (a subset of) the latent variables in a language model like GPT-3. (It may be a fuzzy, many-to-many match, but that would still be helpful.) I’m even less of an expert there; I have no idea if that would work, or if anyone is currently trying to do that. But it does strike me as the kind of thing we should be trying to do.
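
To make the "fuzzy, many-to-many match" in Note 2 a bit more concrete, here is a minimal sketch. It assumes (and this is a big assumption, not something established above) that each world-model entry and each latent variable can be represented as a vector in a shared space, and that cosine similarity is a sensible score. The names `fuzzy_match`, `concepts`, `latents`, and the threshold value are all hypothetical, and the vectors are toy numbers rather than anything extracted from a real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fuzzy_match(concepts, latents, threshold=0.7):
    """Return every (concept label, latent id, similarity) pair at or above the threshold.

    Nothing forces a one-to-one pairing: a concept may match several latents
    and a latent may match several concepts (a many-to-many match).
    """
    matches = []
    for label, c_vec in concepts.items():
        for latent_id, l_vec in latents.items():
            sim = cosine(c_vec, l_vec)
            if sim >= threshold:
                matches.append((label, latent_id, round(sim, 3)))
    return matches

# Toy, hypothetical data: labeled world-model entries and unlabeled latents.
concepts = {"dog": [1.0, 0.0, 0.0], "pet": [1.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0]}
latents = {"z0": [0.9, 0.1, 0.0], "z1": [0.6, 0.8, 0.0], "z2": [0.0, 0.1, 0.9]}

matches = fuzzy_match(concepts, latents)
for label, latent_id, sim in matches:
    print(label, latent_id, sim)
```

In this toy data, "pet" matches two latents and latent "z0" matches two concepts, which is the many-to-many behavior described above. Whether real latent variables admit anything this clean is exactly the open question.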

Note 3: To be clear, I don't think of myself as an interpretability expert. Don’t take my word for anything here.  :-)  [However, later in my post series I'll have more detailed discussion of exactly where this thing would fit into an AGI control system, as I see it. Check back in a few weeks. Here’s the link.]

Comment by Steven Byrnes (steve2152) on [Discussion] Best intuition pumps for AI safety · 2021-11-06T16:22:02.104Z · EA · GW

One of my theories here is that it's helpful to pivot quickly towards "here's an example concrete research problem that seems hard but not impossible, and people are working on it, and not knowing the solution seems obviously problematic". This is good for several reasons, including "pattern-matching to serious research, safety engineering, etc., rather than pattern-matching to sci-fi comics", providing a gentler on-ramp (as opposed to wrenching things like "your children probably won't die of natural causes" or whatever), providing food for thought, etc. Of course this only works if you can engage in the technical arguments. Brian Christian's book is the extreme of this approach.

Comment by Steven Byrnes (steve2152) on Why aren't you freaking out about OpenAI? At what point would you start? · 2021-10-14T00:47:33.827Z · EA · GW

Vicarious and Numenta are both explicitly trying to build AGI, and neither does any safety/alignment  research whatsoever. I don't think this fact is particularly relevant to OpenAI, but I do think it's an important fact in its own right, and I'm always looking for excuses to bring it up.  :-P

Anyone who wants to talk about Vicarious or Numenta in the context of AGI safety/alignment, please DM or email me.  :-)

Comment by Steven Byrnes (steve2152) on Why does (any particular) AI safety work reduce s-risks more than it increases them? · 2021-10-07T20:34:10.631Z · EA · GW

I don't really distinguish between effects by order*

I agree that direct and indirect effects of an action are fundamentally equally important (in this kind of outcome-focused context) and I hadn't intended to imply otherwise.

Comment by Steven Byrnes (steve2152) on Why does (any particular) AI safety work reduce s-risks more than it increases them? · 2021-10-07T14:41:08.742Z · EA · GW

Hmm, it seems to me (and you can correct me) that we should be able to agree that there are SOME technical AGI safety research publications that are positive under some plausible beliefs/values and harmless under all plausible beliefs/values, and then we don't have to talk about cluelessness and tradeoffs, we can just publish them.

And we both agree that there are OTHER technical AGI safety research publications that are positive under some plausible beliefs/values and negative under others. And then we should talk about your portfolios etc. Or more simply, on a case-by-case basis, we can go looking for narrowly-tailored approaches to modifying the publication in order to remove the downside risks while maintaining the upside.

I feel like we're arguing past each other: I keep saying the first category exists, and you keep saying the second category exists. We should just agree that both categories exist! :-)

Perhaps the more substantive disagreement is what fraction of the work is in which category. I see most but not all ongoing technical work as being in the first category, and I think you see almost all ongoing technical work as being in the second category. (I think you agreed that "publishing an analysis about what happens if a cosmic ray flips a bit" goes in the first category.)

(Luke says "AI-related" but my impression is that he mostly works on AGI governance not technical, and the link is definitely about governance not technical. I would not be at all surprised if proposed governance-related projects were much more heavily weighted towards the second category, and am only saying that technical safety research is mostly first-category.)

For example, if you didn't really care about s-risks, then publishing a useful considerations for those who are concerned about s-risks might take attention away from your own priorities, or it might increase cooperation, and the default position to me should be deep uncertainty/cluelessness here, not that it's good in expectation or bad in expectation or 0 in expectation.

This points to another (possible?) disagreement. I think maybe you have the attitude where (to caricature somewhat) if there's any downside risk whatsoever, no matter how minor or far-fetched, you immediately jump to "I'm clueless!". Whereas I'm much more willing to say: OK, I mean, if you do anything at all there's a "downside risk" in a sense, just because life is uncertain, who knows what will happen, but that's not a good reason to just sit on the sidelines and let nature take its course and hope for the best. If I have a project whose first-order effect is a clear and specific and strong upside opportunity, I don't want to throw that project out unless there's a comparably clear and specific and strong downside risk. (And of course we are obligated to try hard to brainstorm what such a risk might be.)  Like if a firefighter is trying to put out a fire, and they aim their hose at the burning interior wall, they don't stop and think, "Well I don't know what will happen if the wall gets wet, anything could happen, so I'll just not pour water on the fire, y'know, don't want to mess things up."

The "cluelessness" intuition gets its force from having a strong and compelling upside story weighed against a strong and compelling downside story, I think.

If the first-order effect of a project is "directly mitigating an important known s-risk", and the second-order effects of the same project are "I dunno, it's a complicated world, anything could happen", then I say we should absolutely do that project.

Comment by Steven Byrnes (steve2152) on Why does (any particular) AI safety work reduce s-risks more than it increases them? · 2021-10-07T02:55:21.994Z · EA · GW

In practice, we can't really know with certainty that we're making AI safer, and without strong evidence/feedback, our judgements of tradeoffs may be prone to fairly arbitrary subjective judgements, motivated reasoning and selection effects.

This strikes me as too pessimistic. Suppose I bring a complicated new board game to a party. Two equally-skilled opposing teams each get a copy of the rulebook to study for an hour before the game starts. Team A spends the whole hour poring over the rulebook and doing scenario planning exercises. Team B immediately throws the rulebook in the trash and spends the hour watching TV.

Neither team has "strong evidence/feedback"—they haven't started playing yet. Team A could think they have good strategy ideas while actually engaging in arbitrary subjective judgments and motivated reasoning. Their strategy ideas, which seemed good on paper, could even turn out to be counterproductive!

Still, I would put my money on Team A beating Team B. Because Team A is trying. Their planning abilities don't have to be all that good to be strictly better (in expectation) than "not doing any planning whatsoever, we'll just wing it". That's a low bar to overcome!

So by the same token, it seems to me that vast swathes of AGI safety research easily surpass the (low) bar of doing better in expectation than the alternative of "Let's just not think about it in advance, we'll wing it".

For example, compare (1) a researcher spends some time thinking about what happens if a cosmic ray flips a bit (or a programmer makes a sign error, like in the famous GPT-2 incident), versus (2) nobody spends any time thinking about that. (1) is clearly better, right? We can always be concerned that the person won't do a great job, or that it will be counterproductive because they'll happen across very dangerous information and then publish it, etc. But still, the expected value here is  clearly positive, right?

You also bring up the idea that (IIUC) there may be objectively good safety ideas but they might not actually get implemented because there won't be a "strong and justified consensus" to do them. But again, the alternative is "nobody comes up with those objectively good safety ideas in the first place". That's even worse, right? (FWIW I consider "come up with crisp and rigorous and legible arguments for true facts about AGI safety" to be a major goal of AGI safety research.)

Anyway, I'm objecting to undirected general feelings of "gahhhh we'll never know if we're helping at all", etc. I think there's just a lot of stuff in the AGI safety research field which is unambiguously good in expectation, where we don't have to feel that way. What I don't object to—and indeed what I strongly endorse—is taking a more directed approach and saying "For AGI safety research project #732, what are the downside risks of this research, and how do they compare to the upsides?"

So that brings us to "ambitious value alignment". I agree that an ambitiously-aligned AGI comes with a couple potential sources of s-risk that other types of AGI wouldn't have, specifically via (1) sign flip errors, and (2) threats from other AGIs. (Although I think (1) is less obviously a problem than it sounds, at least in the architectures I think about.) On the other hand, (A) I'm not sure anyone is really working on ambitious alignment these days … at least Rohin Shah & Paul Christiano have stated that narrow (task-limited) alignment is a better thing to shoot for (and last anyone heard MIRI was shooting for task-limited AGIs too) (UPDATE: actually this was an overstatement, see e.g. 1,2,3); (B) my sense is that current value-learning work (e.g. at CHAI) is more about gaining conceptual understanding than creating practical algorithms / approaches that will scale to AGI. That said, I'm far from an expert on the current value learning literature; frankly I'm often confused by what such researchers are imagining for their longer-term game-plan.

BTW I put a note on my top comment that I have a COI. If you didn't notice. :)

Comment by Steven Byrnes (steve2152) on Why does (any particular) AI safety work reduce s-risks more than it increases them? · 2021-10-06T17:57:23.503Z · EA · GW

Hmm, just a guess, but …

• Maybe you're conceiving of the field as "AI alignment", pursuing the goal "figure out how to bring an AI's goals as close as possible to a human's (or humanity's) goals, in their full richness" (call it "ambitious value alignment")
• Whereas I'm conceiving the field as "AGI safety", with the goal "reduce the risk of catastrophic accidents involving AGIs".

"AGI safety research" (as I think of it) includes not just how you would do ambitious value alignment, but also whether you should do ambitious value alignment. In fact, AGI safety research may eventually result in a strong recommendation against doing ambitious value alignment, because we find that it's dangerously prone to backfiring, and/or that some alternative approach is clearly superior (e.g. CAIS, or microscope AI, or act-based corrigibility or myopia or who knows what). We just don't know yet. We have to do the research.

"AGI safety research" (as I think of it) also includes lots of other activities like analysis and mitigation of possible failure modes (e.g. asking what would happen if a cosmic ray flips a bit in the computer), and developing pre-deployment testing protocols, etc. etc.

Does that help? Sorry if I'm missing the mark here.

Comment by Steven Byrnes (steve2152) on Why does (any particular) AI safety work reduce s-risks more than it increases them? · 2021-10-04T17:42:01.521Z · EA · GW

Thanks!

(Incidentally, I don't claim to have an absolutely watertight argument here that AI alignment research couldn't possibly be bad for s-risks, just that I think the net expected impact on s-risks is to reduce them.)

If s-risks were increased by AI safety work near (C), why wouldn't they also be increased near (A), for the same reasons?

I think suffering minds are a pretty specific thing, in the space of "all possible configurations of matter". So optimizing for something random (paperclips, or "I want my field-of-view to be all white", etc.) would almost definitely lead to zero suffering (and zero pleasure). (Unless the AGI itself has suffering or pleasure.) However, there's a sense in which suffering minds are "close" to the kinds of things that humans might want an AGI to want to do. Like, you can imagine how if a cosmic ray flips a bit, "minimize suffering" could turn into "maximize suffering". Or at any rate, humans will try (and I expect succeed even without philanthropic effort) to make AGIs with a prominent human-like notion of "suffering", so that it's on the table as a possible AGI goal.

In other words, imagine you're throwing a dart at a dartboard.

• The bullseye has very positive point value.
• That's representing the fact that basically no human wants astronomical suffering, and basically everyone wants peace and prosperity etc.
• On other parts of the dartboard, there are some areas with very negative point value.
• That's representing the fact that if programmers make an AGI that desires something vaguely resembling what they want it to desire, that could be an s-risk.
• If you miss the dartboard entirely, you get zero points.
• That's representing the fact that a paperclip-maximizing AI would presumably not care to have any consciousness in the universe (except possibly its own, if applicable).

So I read your original post as saying "If the default is for us to miss the dartboard entirely, it could be s-risk-counterproductive to improve our aim enough that we can hit the dartboard", and my response to that was "I don't think that's relevant, I think it will be really easy to not miss the dartboard entirely, and this will happen "by default". And in that case, better aim would be good, because it brings us closer to the bullseye."
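
Here is a toy expected-value version of the dartboard picture, as a sketch only. Every number in it is made up for illustration (the point values and all the probabilities are hypothetical, not estimates), and the names `POINTS`, `pct`, and `expected_points` are just labels for this example. It contrasts the two premises about the default: "we miss the board entirely" versus "we are already on the board".

```python
from fractions import Fraction

# Hypothetical point values for the three regions of the dartboard.
POINTS = {"bullseye": 100, "negative_area": -100, "miss": 0}

def pct(bullseye, negative_area, miss):
    """Build a probability table from whole-number percentages."""
    return {
        "bullseye": Fraction(bullseye, 100),
        "negative_area": Fraction(negative_area, 100),
        "miss": Fraction(miss, 100),
    }

def expected_points(probs):
    """Expected score for a table mapping region name -> probability."""
    assert sum(probs.values()) == 1, "probabilities must sum to 1"
    return sum(POINTS[region] * p for region, p in probs.items())

# Premise 1 (the original post, as I read it): by default we miss the board,
# so slightly better aim mostly moves darts onto the risky parts of the board.
default_missing = pct(bullseye=0, negative_area=0, miss=100)
better_aim_from_missing = pct(bullseye=10, negative_area=30, miss=60)

# Premise 2 (my view): by default we already hit the board, just not the bullseye,
# so better aim moves darts from the negative areas towards the bullseye.
default_on_board = pct(bullseye=30, negative_area=60, miss=10)
better_aim_on_board = pct(bullseye=50, negative_area=40, miss=10)

for name, table in [
    ("default, premise 1", default_missing),
    ("better aim, premise 1", better_aim_from_missing),
    ("default, premise 2", default_on_board),
    ("better aim, premise 2", better_aim_on_board),
]:
    print(name, expected_points(table))
```

With these made-up numbers, better aim lowers the expected score under premise 1 and raises it under premise 2, so the disagreement comes down to which default is realistic rather than to whether better aim is good in the abstract.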

Comment by Steven Byrnes (steve2152) on Why does (any particular) AI safety work reduce s-risks more than it increases them? · 2021-10-04T00:31:00.535Z · EA · GW

Sorry I'm not quite sure what you mean. If we put things on a number line with (A)=1, (B)=2, (C)=3, are you disagreeing with my claim "there is very little probability weight in the interval from 2 to 3", or with my claim "in the interval from 1 to 2, moving down towards 1 probably reduces s-risk", or with both, or something else?

Comment by Steven Byrnes (steve2152) on Why does (any particular) AI safety work reduce s-risks more than it increases them? · 2021-10-03T20:18:29.157Z · EA · GW

[note that I have a COI here]

Hmm, I guess I've been thinking that the choice is between (A) "the AI is trying to do what a human wants it to try to do" vs (B) "the AI is trying to do something kinda weirdly and vaguely related to what a human wants it to try to do". I don't think (C) "the AI is trying to do something totally random" is really on the table as a likely option, even if the AGI safety/alignment community didn't exist at all.

That's because everybody wants the AI to do the thing they want it to do, not just long-term AGI risk people. And I think there are really obvious things that anyone would immediately think to try, and these really obvious techniques would be good enough to get us from (C) to (B) but not good enough to get us to (A).

[Warning: This claim is somewhat specific to a particular type of AGI architecture that I work on and consider most likely—see e.g. here. Other people have different types of AGIs in mind and would disagree. In particular, in the "deceptive mesa-optimizer" failure mode (which relates to a different AGI architecture than mine) we would plausibly expect failures to have random goals like "I want my field-of-view to be all white", even after reasonable effort to avoid that. So maybe people working in other areas would have different answers, I dunno.]

I agree that it's at least superficially plausible that (C) might be better than (B) from an s-risk perspective. But if (C) is off the table and the choice is between (A) and (B), I think (A) is preferable for both s-risks and x-risks.

Comment by Steven Byrnes (steve2152) on BrownHairedEevee's Shortform · 2021-09-27T11:53:51.509Z · EA · GW

The main argument of Stuart Russell's book focuses on reward modeling as a way to align AI systems with human preferences.

Hmm, I remember him talking more about IRL and CIRL and less about reward modeling. But it's been a little while since I read it, could be wrong.

If it's really difficult to write a reward function for a given task Y, then it seems unlikely that AI developers would deploy a system that does it in an unaligned way according to a misspecified reward function. Instead, reward modeling makes it feasible to design an AI system to do the task at all.

Maybe there's an analogy where someone would say "If it's really difficult to prevent accidental release of pathogens from your lab, then it seems unlikely that bio researchers would do research on pathogens whose accidental release would be catastrophic". Unfortunately there's a horrifying many-decades-long track record of accidental release of pathogens from even BSL-4 labs, and it's not like this kind of research has stopped. Instead it's like, the bad thing doesn't happen every time, and/or things seem to be working for a while before the bad thing happens, and that's good enough for the bio researchers to keep trying.

So as I talk about here, I think there are going to be a lot of proposals to modify an AI to be safe that do not in fact work, but do seem ahead-of-time like they might work, and which do in fact work for a while as training progresses. I mean, when x-risk-naysayers like Yann LeCun or Jeff Hawkins are asked how to avoid out-of-control AGIs, they can spout off a list of like 5-10 ideas that would not in fact work, but sound like they would. These are smart people and a lot of other smart people believe them too. Also, even something as dumb as "maximize the amount of money in my bank account" would plausibly work for a while and do superhumanly-helpful things for the programmers, before it starts doing superhumanly-bad things for the programmers.

Even with reward modeling, though, AI systems are still going to have similar drives due to instrumental convergence: self-preservation, goal preservation, resource acquisition, etc., even if they have goals that were well specified by their developers. Although maybe corrigibility and not doing bad things can be built into the systems' goals using reward modeling.

Yup, if you don't get corrigibility then you failed.

Comment by Steven Byrnes (steve2152) on [Creative Writing Contest] The Reset Button · 2021-09-20T01:18:21.853Z · EA · GW

I really liked this!!!

Since you asked for feedback, here's a little suggestion, take it or leave it: I found a couple things at the end slightly out-of-place, in particular "If you choose to tackle the problem of nuclear security, what angle can you attack the problem from that will give you the most fulfillment?" and "Do any problems present even bigger risks than nuclear war?"

Immediately after such an experience, I think the narrator would not be thinking about the option of not bothering to work on nuclear security because other causes are more important, nor thinking about their own fulfillment. If other causes came to mind, I imagine it would be along the lines of "if I somehow manage to stop the nuclear war, what other potential catastrophes are waiting in the wings, ready to strike anytime in the months and years after that, and this time with no reset button?"

Or if you want it to fit better as written now, then shortly after the narrator snaps back to age 18 the text could say something along the lines of "You know about chaos theory and the butterfly effect; this will be a new re-roll of history, and there might not be a nuclear war this time around. Maybe last time was a fluke?" Then that might remove some of the single-minded urgency that I would otherwise expect the narrator to feel, and thus it would become a bit more plausible that the narrator might work on pandemics or whatever.

(Maybe that "new re-roll of history" idea is what you had in mind? Whereas I was imagining the Groundhog Day / Edge of Tomorrow / Terminator trope where the narrator knows 100% for sure that there will be a nuclear war on this specific hour of this specific day, if the narrator doesn't heroically stop it.)

(I'm not a writer, don't trust my judgment.)

Comment by Steven Byrnes (steve2152) on A mesa-optimization perspective on AI valence and moral patienthood · 2021-09-16T18:08:18.128Z · EA · GW

Hmm, yeah, I guess you're right about that.

Comment by Steven Byrnes (steve2152) on A mesa-optimization perspective on AI valence and moral patienthood · 2021-09-15T13:49:37.395Z · EA · GW

Oh, you said "evolution-type optimization", so I figured you were thinking of the case where the inner/outer distinction is clear cut. If you don't think the inner/outer distinction will be clear cut, then I'd question whether you actually disagree with the post :) See the section defining what I'm arguing against, in particular the "inner as AGI" discussion.

Comment by Steven Byrnes (steve2152) on A mesa-optimization perspective on AI valence and moral patienthood · 2021-09-14T15:40:49.019Z · EA · GW

Nah, I'm pretty sure the difference there is "Steve thinks that Jacob is way overestimating the difficulty of humans building AGI-capable learning algorithms by writing source code", rather than "Steve thinks that Jacob is way underestimating the difficulty of computationally recapitulating the process of human brain evolution".

For example, for the situation that you're talking about (I called it "Case 2" in my post) I wrote "It seems highly implausible that the programmers would just sit around for months and years and decades on end, waiting patiently for the outer algorithm to edit the inner algorithm, one excruciatingly-slow step at a time. I think the programmers would inspect the results of each episode, generate hypotheses for how to improve the algorithm, run small tests, etc." If the programmers did just sit around for years not looking at the intermediate training results, yes I expect the project would still succeed sooner or later. I just very strongly expect that they wouldn't sit around doing nothing.

Comment by Steven Byrnes (steve2152) on A mesa-optimization perspective on AI valence and moral patienthood · 2021-09-14T00:07:49.936Z · EA · GW

AlphaGo has a human-created optimizer, namely MCTS. Normally people don't use the term "mesa-optimizer" for human-created optimizers.

Then maybe you'll say "OK there's a human-created search-based consequentialist planner, but the inner loop of that planner is a trained ResNet, and how do you know that there isn't also a search-based consequentialist planner inside each single run through the ResNet?"

Admittedly, I can't prove that there isn't. I suspect that there isn't, because there seems to be no incentive for that (there's already a search-based consequentialist planner!), and also because I don't think ResNets are up to such a complicated task.
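
To make the distinction concrete, here is a toy sketch of "human-written planner on the outside, learned evaluator on the inside". To be clear about the assumptions: this is not AlphaGo's code, it uses plain depth-limited lookahead rather than MCTS, and the game, `legal_moves`, `learned_value`, and `plan` are all hypothetical names made up for illustration. `learned_value` is a hard-coded stub standing in for a single forward pass through a trained network.

```python
from typing import Callable, List

State = int  # toy state: just a number


def legal_moves(state: State) -> List[State]:
    """Toy game (hypothetical): each move adds 1 or 2 to the state."""
    return [state + 1, state + 2]


def learned_value(state: State) -> float:
    """Stub standing in for one forward pass of a trained value network.

    Assumption for illustration: it happens to rate states near 10 highly.
    """
    return float(-abs(state - 10))


def best_value(state: State, depth: int,
               value_fn: Callable[[State], float]) -> float:
    """Hand-written search: best leaf value reachable in `depth` more moves."""
    if depth == 0:
        return value_fn(state)  # the only place the "network" is called
    return max(best_value(s, depth - 1, value_fn) for s in legal_moves(state))


def plan(state: State, depth: int,
         value_fn: Callable[[State], float]) -> State:
    """Pick the successor state whose lookahead value is highest."""
    return max(legal_moves(state),
               key=lambda s: best_value(s, depth - 1, value_fn))


if __name__ == "__main__":
    print(plan(6, 2, learned_value))
```

All the search and consequentialist planning lives in `best_value` and `plan`, which a human wrote; the evaluator is only ever called at the leaves. The open question in the comment above is whether something search-like could also be hiding inside that evaluator, which this sketch obviously cannot settle.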

Comment by Steven Byrnes (steve2152) on AI timelines and theoretical understanding of deep learning · 2021-09-13T14:14:19.195Z · EA · GW

I find most justifications and arguments made in favor of a timeline of less than 50 years to be rather unconvincing.

If we don't have convincing evidence in favor of a timeline <50 years, and we also don't have convincing evidence in favor of a timeline ≥50 years, then we just have to say that this is a question on which we don't have convincing evidence of anything in particular. But we still have to take whatever evidence we have and make the best decisions we can. ¯\_(ツ)_/¯

(You don't say this explicitly but your wording kinda implies that ≥50 years is the default, and we need convincing evidence to change our mind away from that default. If so, I would ask why we should take ≥50 years to be the default. Or sorry if I'm putting words in your mouth.)

I am simply not able to understand why we are significantly closer to AGI today than we were in the 1950s

Lots of ingredients go into AGI, including (1) algorithms, (2) lots of inexpensive chips that can do lots of calculations per second, (3) technology for fast communication between these chips, (4) infrastructure for managing large jobs on compute clusters, (5) frameworks and expertise in parallelizing algorithms, (6) general willingness to spend millions of dollars and roll custom ASICs to run a learning algorithm, (7) coding and debugging tools and optimizing compilers, etc. Even if you believe that we've made no progress whatsoever on algorithms since the 1950s, we've made massive progress in the other categories. I think that alone puts us "significantly closer to AGI today than we were in the 1950s": once we get the algorithms, at least everything else will be ready to go, and that wasn't true in the 1950s, right?

But I would also strongly disagree with the idea that we've made no progress whatsoever on algorithms since the 1950s. Even if you think that GPT-3 and AlphaGo have absolutely nothing whatsoever to do with AGI algorithms (which strikes me as an implausibly strong statement, although I would endorse much weaker versions of that statement), that's far from the only strand of research in AI, let alone neuroscience. For example, there's a (IMO plausible) argument that PGMs and causal diagrams will be more important to AGI than deep neural networks are. But that would still imply that we've learned AGI-relevant things about algorithms since the 1950s. Or as another example, there's a (IMO misleading) argument that the brain is horrifically complicated and we still have centuries of work ahead of us in understanding how it works. But even people who strongly endorse that claim wouldn't also say that we've made "no progress whatsoever" in understanding brain algorithms since the 1950s.

Sorry if I'm misunderstanding.

isn't there an infinite degree of freedom associated with a continuous function?

I'm a bit confused by this; are you saying that the only possible AGI algorithm is "the exact algorithm that the human brain runs"? The brain is wired up by a finite number of genes, right?

Comment by Steven Byrnes (steve2152) on A mesa-optimization perspective on AI valence and moral patienthood · 2021-09-13T01:35:33.828Z · EA · GW

most contemporary progress on AI happens by running base-optimizers which could support mesa-optimization

GPT-3 is of that form, but AlphaGo/MuZero isn't (I would argue).

I'm not sure how to settle whether your statement about "most contemporary progress" is right or wrong. I guess we could count how many papers use model-free RL vs model-based RL, or something? Well anyway, given that I haven't done anything like that, I wouldn't feel comfortable making any confident statement here. Of course you may know more than me! :-)

If we forget about "contemporary progress" and focus on "path to AGI", I have a post arguing against what (I think) you're implying at Against evolution as an analogy for how humans will create AGI, for what it's worth.

Ideally we'd want a method for identifying valence which is more mechanistic than mine, in the sense that it lets you identify valence in a system just by looking inside the system without looking at how it was made.

Yeah I dunno, I have some general thoughts about what valence looks like in the vertebrate brain (e.g. this is related, and this) but I'm still fuzzy in places and am not ready to offer any nice buttoned-up theory. "Valence in arbitrary algorithms" is obviously even harder by far.  :-)