AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher 2020-07-13T16:17:45.913Z · score: 87 (30 votes)
Ben Garfinkel: How sure are we about this AI stuff? 2019-02-09T19:17:31.671Z · score: 84 (43 votes)


Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-08-06T22:43:02.407Z · score: 1 (1 votes) · EA · GW

Hi Ofer,

Thanks for the comment!

I actually do think that the instrumental convergence thesis, specifically, can be mapped over fine, since it's a fairly abstract principle. For example, this recent paper formalizes the thesis within a standard reinforcement learning framework. I just think that the thesis at most weakly suggests existential doom, unless we add in some other substantive theses. I have some short comments on the paper, explaining my thoughts, here.
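To give a feel for the kind of result that paper establishes, here is a toy construction of my own (not the paper's formalism): in a tiny deterministic MDP where one action leads to a single dead-end terminal state and another leads to a "hub" from which three terminal states remain reachable, most randomly sampled reward functions make the option-preserving action optimal.

```python
import random

def optimal_first_action(rewards):
    """First action of an optimal policy in a tiny deterministic MDP:
    from the start state, "dead_end" reaches one terminal state, while
    "hub" reaches a state from which any of t1, t2, t3 can be chosen."""
    hub_value = max(rewards["t1"], rewards["t2"], rewards["t3"])
    return "hub" if hub_value > rewards["dead_end"] else "dead_end"

def fraction_preferring_hub(n_samples=100_000, seed=0):
    """Estimate how often a random reward function makes the
    option-preserving action optimal."""
    rng = random.Random(seed)
    hub = sum(
        optimal_first_action(
            {s: rng.random() for s in ("dead_end", "t1", "t2", "t3")}
        ) == "hub"
        for _ in range(n_samples)
    )
    return hub / n_samples
```

By symmetry the hub is optimal whenever the dead end doesn't hold the highest of the four i.i.d. rewards, so the estimate comes out close to 3/4: "keeping options open" is instrumentally convergent across most reward functions, which illustrates the abstract thesis without implying anything about existential doom on its own.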

Beyond the instrumental convergence thesis, though, I do think that some bits of the classic arguments are awkward to fit onto concrete and plausible ML-based development scenarios: for example, the focus on recursive self-improvement, and the use of thought experiments in which natural language commands, when interpreted literally and single-mindedly, lead to unforeseen bad behaviors. I think that Reframing Superintelligence does a good job of pointing out some of the tensions between classic ways of thinking and talking about AI risk and current/plausible ML engineering practices.

For the sake of concreteness, consider the algorithm that Facebook uses to create the feed that each user sees (which is an example that Stuart Russell has used). Perhaps there's very little public information about that algorithm, but it's reasonable to guess they're using some deep RL algorithm and a reward function that roughly corresponds to user engagement. Conditioned on that, do you agree that in the limit (i.e. when using whatever algorithm and architecture they're currently using, at a sufficiently large scale), the arguments about instrumental convergence seem to apply?

This may not be what you have in mind, but: I would be surprised if the FB newsfeed selection algorithm became existentially damaging (e.g. omnicidal), even in the limit of tremendous amounts of training data and compute. I don't know how the algorithm actually works, but as a simplification: let's imagine that it produces an ordered list of posts to show a user, from the set of recent posts by their friends, and that it's trained using something like the length of the user's FB browsing session as the reward. I think that, if you kept training it, nothing too weird would happen. It might produce some unintended social harms (like addiction, polarization, etc.), but the system wouldn't, in any meaningful sense, have long-run objectives (due to the shortness of sessions). It also probably wouldn't have the ability or inclination to manipulate the external world in pursuit of complex schemes. Figuring out how to manipulate the external world in precise ways would require a huge amount of very weird exploration, deep in a section of the space of possible policies where most policies are terrible at maximizing reward; in the unlikely event that the necessary exploration happened, and the policy started moving in this direction, I think it would become conspicuous well before the newsfeed selection algorithm did something like kill everyone to prevent ongoing FB sessions from ending (if that is even possible given the system's limited space of actions).
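For concreteness, here is a minimal sketch of the simplification described above. The linear ranking policy, the simulated user, and the hill-climbing trainer are all hypothetical stand-ins of my own (nothing here reflects FB's actual system); the structural point it illustrates is that the reward signal ends at the session boundary.

```python
import random

def rank_posts(posts, weights):
    """Policy: order candidate posts by a linear engagement score."""
    score = lambda p: sum(w * feat for w, feat in zip(weights, p))
    return sorted(posts, key=score, reverse=True)

def session_length(ordered_posts, rng):
    """Toy user: keeps scrolling while the next post is engaging enough.
    Feature 0 of each post is a made-up 'engagingness' value in [0, 1]."""
    length = 0
    for post in ordered_posts:
        if rng.random() < 0.2 + 0.6 * post[0]:
            length += 1
        else:
            break
    return length

def train(n_iters=200, seed=0):
    """Hill-climb the ranking weights on average session length.
    Each episode is one session; no reward crosses a session boundary."""
    rng = random.Random(seed)

    def avg_reward(w):
        total = 0
        for _ in range(50):
            posts = [(rng.random(), rng.random()) for _ in range(10)]
            total += session_length(rank_posts(posts, w), rng)
        return total / 50

    weights = [0.0, 0.0]
    best = avg_reward(weights)
    for _ in range(n_iters):
        candidate = [w + rng.gauss(0, 0.3) for w in weights]
        reward = avg_reward(candidate)
        if reward > best:
            weights, best = candidate, reward
    return weights
```

The design point is in `train`: because every episode terminates when the session ends, nothing in the training signal can select for behavior whose payoff lies beyond a single session, which is why I say the system wouldn't have long-run objectives in any meaningful sense.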

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-08-03T13:31:02.862Z · score: 2 (2 votes) · EA · GW

The key difference is that I don't think orthogonality thesis, instrumental convergence or progress being eventually fast are wrong - you just need extra assumptions in addition to them to get to the expectation that AI will cause a catastrophe.

Quick belated follow-up: I just wanted to clarify that I also don't think that the orthogonality thesis or instrumental convergence thesis are incorrect, as they're traditionally formulated. I just think they're not nearly sufficient to establish a high level of risk, even though, historically, many presentations of AI risk seemed to treat them as nearly sufficient. Insofar as there's a mistake here, the mistake concerns the way conclusions have been drawn from these theses; I don't think the mistake is in the theses themselves. (I may not stress this enough in the interview/slides.)

On the other hand, progress/growth eventually becoming much faster might be wrong (this is an open question in economics). The 'classic arguments' also don't just predict that growth/progress will become much faster. In the FOOM debate, for example, both Yudkowsky and Hanson start from the position that growth will become much faster; their disagreement is about how sudden, extreme, and localized the increase will be. If growth is actually unlikely to increase in a sudden, extreme, and localized fashion, then this would be a case of the classic arguments containing a "mistaken" (not just insufficient) premise.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-25T20:05:27.061Z · score: 3 (3 votes) · EA · GW

Because instead of disaster striking only if we can't figure out the right goals to give to the AI, it can also be the case that we know what goals we want to give it, but due to constraints of the development process, we can't give it those goals and can only build AI with unaligned goals. So it seems to me that the latter scenario can also be rightly described as "exogenous deadline of the creep of AI capability progress". (In both cases, we can try to refrain from developing/deploying AGI, but it may be a difficult coordination problem for humanity to stay in a state where we know how to build AGI but chooses not to, and in any case this consideration cuts equally across both scenarios.)

I think that the comment you make above is right. In the podcast, we only discuss this issue in a super cursory way:

(From the transcript) A second related concern, which is a little bit different, is that you could think this is an argument against us naively going ahead and putting this thing out into the world that’s as extremely misaligned as a dust minimizer or a paperclip maximizer, but we could still get to the point where we haven’t worked out alignment techniques.... No sane person would keep running the dust minimizer simulation once it’s clear this is not the thing we want to be making. But maybe not everyone is the same. Maybe someone wants to make a system that pursues some extremely narrow objective like this extremely effectively, even though it would be clear to anyone with normal values that you’re not in the process of making a thing that you want to actually use. Maybe somebody who wants to cause destruction could conceivably plough ahead. So that might be one way of rescuing a deadline picture. The deadline is not when will people have intelligent systems that they naively throw out into the world. It’s when do we reach the point where someone wants to create something that, in some sense, is intuitively pursuing a very narrow objective, has the ability to do that.

Fortunately, I'm not too worried about this possibility. Partly, as background, I expect us to have moved beyond using hand-coded reward functions -- or, more generally, what Stuart Russell calls the "standard model" -- by the time we have the ability to create broadly superintelligent and highly agential/unbounded systems. There are really strong incentives to do this, since there are loads of useful applications that seemingly can't be developed using hand-coded reward functions. This is some of the sense in which, in my view, capabilities research and alignment research are mushed up together. If progress is sufficiently gradual, I find it hard to imagine that the ability to create things like world-destroying paperclippers comes before (e.g.) the ability to make at least pretty good use of reward modeling techniques.

(To be clear, I recognize that loads of alignment researchers also think that there will be strong economic incentives for alignment research. I believe there's a paragraph in Russell's book arguing this. I think DM's "scalable agent alignment" paper also suggests that reward modeling is necessary to develop systems that can assist us in most "real world domains." Although I don't know how much optimism other people tend to take from this observation. I don't actually know, for example, whether or not Russell is less optimistic than me.)

If we do end up in a world where people know they can create broadly superintelligent and highly agential/unbounded AI systems, but we still haven't worked out alternatives to Russell's "standard model," then no sane person really has any incentive to create and deploy these kinds of systems. Training up a broadly superintelligent and highly agential system using something like a hand-coded reward function is likely to be an obviously bad idea; if it's not obviously bad, a priori, then it will likely become obviously bad during the training process. There wouldn't be much of a coordination problem, since, at least in normal circumstances, no one has an incentive to knowingly destroy themselves.

If I then try to tell a story where humanity goes extinct, due to a failure to move beyond the standard model in time, two main scenarios come to mind.

Doomsday Machine: States develop paperclipper-like systems, while thinking of them as doomsday machines, to serve as a novel alternative or complement to nuclear deterrents. They end up being used, either accidentally or intentionally.

Apocalyptic Residual: The ability to develop paperclipper-like systems diffuses broadly. Some of the groups that gain this ability have apocalyptic objectives. These groups intentionally develop and deploy the systems, with the active intention of destroying humanity.

The first scenario doesn't seem very likely to me. Although this is obviously very speculative, paperclippers seem much worse than nuclear or even biological deterrents. First, your own probability of survival, if you use a paperclipper, may be much lower than your probability of survival if you use nukes or biological weapons. Second, and somewhat ironically, it may actually be hard to convince people that your paperclipper system can do a ton of damage; it seems hard to know that the result would be as bad as feared without prior real-world experience of its use. States would also, likely, be slow to switch to this new deterrence strategy, providing even more time for alignment techniques to be worked out. As a further bit of friction/disincentive, these systems might also just be extremely expensive (depending on compute or environment design requirements). Finally, for doomsday to occur, it's actually necessary for a paperclipper system to be used -- and for its effect to be as bad as feared. The history of nuclear weapons suggests that the annual probability of use is probably pretty low.

The second scenario also doesn't seem very likely to me, since: (a) I think there would probably be an initial period where large quantities of resources (e.g. compute and skilled engineers) are required to make world-destroying paperclippers. (b) Only a very small portion of people want to destroy the world. (c) There would be unusually strong incentives for states to prevent apocalyptic groups or individuals from gaining access to the necessary resources.

Although see Asya's "AGI in Vulnerable World" post for a discussion of some conditions under which malicious use concerns might loom larger.

(Apologies for the super long response!)

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-25T20:01:24.394Z · score: 1 (1 votes) · EA · GW

I continue to have a lot of uncertainty about how likely it is that AI development will look like "there’s this separate project of trying to figure out what goals to give these AI systems" vs a development process where capability and goals are necessarily connected. (I didn't find your arguments in favor of the latter very persuasive.) For example it seems GPT-3 can be seen as more like the former than the latter. (See this thread for background on this.)

I don't think I caught the point about GPT-3, although this might just be a matter of using concepts differently.

In my mind: To whatever extent GPT-3 can be said to have a "goal," its goal is to produce text that it would be unsurprising to find on the internet. The training process both imbued it with this goal and made the system good at achieving it.

There are other things we might want spin-offs of GPT-3 to do: For example, compose better-than-human novels. Doing this would involve shifting both what GPT-3 is "capable" of doing and shifting what its "goal" is. (There's not really a clean practical or conceptual distinction between the two.) It would also probably require making progress on some sort of "alignment" technique, since we can't (e.g.) write down a hand-coded reward function that quantifies novel quality.
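As a toy illustration of the point that goals and capabilities take shape together through one and the same training process (a character-bigram counter standing in, very loosely, for next-token prediction -- this is my own illustrative example, not how GPT-3 works):

```python
from collections import Counter, defaultdict

def train_next_char_model(text):
    """One training procedure: count which character tends to follow
    which. The same procedure simultaneously determines what the model
    'tries' to produce and how well it produces it."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return counts

def most_likely_next(model, ch):
    """The model's behavior: continue text so it is 'unsurprising'
    relative to whatever corpus it was trained on."""
    return model[ch].most_common(1)[0][0]

# Identical training code, different corpora, different resulting "goals":
internet_like = train_next_char_model("the cat sat on the mat again")
novel_like = train_next_char_model("thy heart doth sing and soar")
```

There's no separate step here that installs a "goal" after the "capability" is built; retraining on novels would shift both at once, which is the sense in which I don't think there's a clean practical or conceptual distinction between the two.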

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-23T16:16:25.288Z · score: 9 (6 votes) · EA · GW

Michael Huemer's "Ethical Intuitionism" and David Enoch's "Taking Morality Seriously" are both good; Enoch's book is, I think, better, but Huemer's book is a more quick and engaging read. Part Six of Parfit's "On What Matters" is also good.

I don't exactly think that non-naturalism is "plausible," since I think there are very strong epistemological objections to it. (Since our brain states are determined entirely by natural properties of the world, why would our intuitions about non-natural properties track reality?) It's more that I think the alternative positions are self-undermining or have implications that are unacceptable in other ways.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-21T14:28:09.089Z · score: 3 (2 votes) · EA · GW

I thought a bit about humans, but I feel that this is much more complicated and needs more nuanced definitions of goals. (is avoiding suffering a terminal goal? It seems that way, but who is doing the thinking in which it is useful to think of one thing or another as a goal? Perhaps the goal is to reduce specific neuronal activity for which avoiding suffering is merely instrumental?)

I'm actually not very optimistic about a more complex or formal definition of goals. In my mind, the concept of a "goal" is often useful, but it's sort of an intrinsically fuzzy or fundamentally pragmatic concept. I also think that, in practice, the distinction between an "intrinsic" and "instrumental" goal is pretty fuzzy in the same way (although I think your definition is a good one).

Ultimately, agents exhibit behaviors. It's often useful to try to summarize these behaviors in terms of what sorts of things the agent is fundamentally "trying" to do and in terms of the "capabilities" that the agent brings to bear. But I think this is just sort of a loose way of speaking. I don't really think, for example, that there are principled/definitive answers to the questions "What are all of my cat's goals?", "Which of my cat's goals are intrinsic?", or "What's my cat's utility function?" Even if we want to move beyond behavioral definitions of goals, to ones that focus on cognitive processes, I think these sorts of questions will probably still remain pretty fuzzy.

(I think that this way of thinking -- in which evolutionary or engineering selection processes ultimately act on "behaviors," which can only somewhat informally or imprecisely be described in terms of "capabilities" and "goals" -- also probably has an influence on my relative optimism about AI alignment. )

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-20T23:45:42.370Z · score: 2 (2 votes) · EA · GW

Hi Sammy,

Thanks for the links -- both very interesting! (I actually hadn't read your post before.)

I've tended to think of the intuitive core as something like: "If we create AI systems that are, broadly, more powerful than we are, and their goals diverge from ours, this would be bad -- because we couldn't stop them from doing things we don't want. And it might be hard to ensure, as we're developing increasingly sophisticated AI systems, that there aren't actually subtle but extremely important divergences in some of these systems' goals."

At least in my mind, both the classic arguments and the arguments in "What Failure Looks Like" share this common core. Mostly, the challenge is to explain why it would be hard to ensure that there wouldn't be subtle-but-extremely-important divergences; there are different possible ways of doing this. For example: Although an expectation of discontinuous (or at least very fast) progress is a key part of the classic arguments, I don't consider it part of the intuitive core; the "What Failure Looks Like" picture doesn't necessarily rely on it.

I'm not sure if there's actually a good way to take the core intuition and turn it into a more rigorous/detailed/compelling argument that really works. But I do feel that there's something to the intuition; I'll probably still feel like there's something to the intuition, even if I end up feeling like the newer arguments have major issues too.

[[Edit: An alternative intuitive core, which I sort of gesture at in the interview, would simply be: "AI safety and alignment issues exist today. In the future, we'll have crazy powerful AI systems with crazy important responsibilities. At least the potential badness of safety and alignment failures should scale up with these systems' power and responsibility. Maybe it'll actually be very hard to ensure that we avoid the worst-case failures."]]

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-20T22:45:05.771Z · score: 3 (2 votes) · EA · GW

In brief, I do actually feel pretty positively.

Even if governments aren't doing a lot of important AI research "in house," and private actors continue to be the primary funders of AI R&D, we should expect governments to become much more active if really serious threats to security start to emerge. National governments are unlikely to be passive, for example, if safety/alignment failures become increasingly damaging -- or, especially, if existentially bad safety/alignment failures ever become clearly plausible. If any important institutions, design decisions, etc., regarding AI get "locked in," then I also expect governments to be heavily involved in shaping these institutions, making these decisions, etc. And states are, of course, the most important actors for many concerns having to do with political instability caused by AI. Finally, there are also certain potential solutions to risks -- like creating binding safety regulations, forging international agreements, or plowing absolutely enormous amounts of money into research projects -- that can't be implemented by private actors alone.

Basically, in most scenarios where AI governance work turns out to be really useful from a long-termist perspective -- because there are existential safety/alignment risks, because AI causes major instability, or because there are opportunities to "lock in" key features of the world -- I expect governments to really matter.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-20T22:10:12.028Z · score: 3 (2 votes) · EA · GW

I don't have a single top pick; I think this will generally depend on a person's particular interests, skills, and "career capital."

I do just want to say, though, that I don't think it's at all necessary to have a strong technical background to do useful AI governance work. For example, if I remember correctly, most of the research topics discussed in the "AI Politics" and "AI Ideal Governance" sections of Allan Dafoe's research agenda don't require a significant technical background. A substantial portion of people doing AI policy/governance/ethics research today also have a primarily social science or humanities background.

Just as one example that's salient to me, because I was a co-author on it, I don't think anything in this long report on distributing the benefits of AI required substantial technical knowledge or skills.

(That being said, I do think it's really important for pretty much anyone in the AI governance space to understand at least the core concepts of machine learning. For example, it's important to know things like the difference between "supervised" and "unsupervised" learning, the idea of stochastic gradient descent, the idea of an "adversarial example," and so on. Fortunately, I think this is pretty do-able even without a STEM background; it's mostly the concepts, rather than the math, that are important. Certain kinds of research or policy work certainly do require more in-depth knowledge, but a lot of useful work doesn't.)

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-20T21:39:29.911Z · score: 5 (3 votes) · EA · GW

I think that my description of the thesis (and, actually, my own thinking on it) is a bit fuzzy. Nevertheless, here's roughly how I'm thinking about it:

First, let's say that an agent has the "goal" of doing X if it's sometimes useful to think of the system as "trying to do X." For example, it's sometimes useful to think of a person as "trying" to avoid pain, be well-liked, support their family, etc. It's sometimes useful to think of a chess program as "trying" to win games of chess.

Agents are developed through a series of changes. In the case of a "hand-coded" AI system, the changes would involve developers adding, editing, or removing lines of code. In the case of an RL agent, the changes would typically involve a learning algorithm updating the agent's policy. In the case of human evolution, the changes would involve genetic mutations.

If the "process orthogonality thesis" were true, then this would mean that we can draw a pretty clean line between "changes that affect an agent's capabilities" and "changes that affect an agent's goals." Instead, I want to say that it's really common for changes to affect both capabilities and goals. In practice, we can't draw a clean line between "capability genes" and "goal genes" or between "RL policy updates that change goals" and "RL policy updates that change capabilities." Both goals and capabilities tend to take shape together.

That being said, it is true that some changes do, intuitively, mostly just affect either capabilities or goals. I wouldn't be surprised, for example, if it's possible to introduce a minus sign somewhere into Deep Blue's code and transform it into a system that looks like it's trying to lose at chess; although the system will probably be less good at losing than it was at winning, it may still be pretty capable. So the processes of changing a system's capabilities and changing its goals can still come apart to some degree.
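Here is a toy version of that minus-sign thought experiment (ordinary minimax over a hand-made two-ply game tree, nothing resembling Deep Blue's actual code): flipping the sign of the evaluation function redirects the same search machinery toward the opposite "goal," while leaving the capability (the search itself) untouched.

```python
def minimax(node, maximizing, eval_fn):
    """The search machinery -- the 'capability' -- shared by both agents."""
    if not isinstance(node, dict):  # leaf: a raw position score
        return eval_fn(node)
    vals = [minimax(child, not maximizing, eval_fn) for child in node.values()]
    return max(vals) if maximizing else min(vals)

def best_move(tree, eval_fn):
    """Pick the move whose subtree has the highest minimax value."""
    return max(tree, key=lambda move: minimax(tree[move], False, eval_fn))

# A tiny two-ply game: our move, then the opponent's; leaves are
# position scores from our perspective.
tree = {"a": {"x": 1, "y": 2}, "b": {"x": 3, "y": 9}}
```

With `eval_fn = lambda s: s` the agent picks move "b" (guaranteeing a score of at least 3); with `eval_fn = lambda s: -s` the identical search picks move "a", the position-worsening move. (In this toy the flipped agent also models its opponent with flipped values, so it's only a loose analogy, but it shows how a single localized change can mostly move the "goal" while leaving the machinery intact.)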

It's also possible to do fundamental research and engineering work that is useful for developing a wide variety of systems. For example, hardware progress has, in general, made it easier to develop highly competent RL agents in all sorts of domains. But, when it comes time to train a new RL agent, its goals and capabilities will still take shape together.

(Hope that clarifies things at least a bit!)

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T23:00:07.127Z · score: 6 (4 votes) · EA · GW

Yes, but they're typically invite-only.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T22:49:31.498Z · score: 4 (3 votes) · EA · GW

Interesting. So you generally expect (well, with 50-75% probability) AI to become a significantly bigger deal, in terms of productivity growth, than it is now? I have not looked into this in detail but my understanding is that the contribution of AI to productivity growth right now is very small (and less than electricity).

If yes, what do you think causes this acceleration? It could simply be that AI is early-stage right now, akin to electricity in 1900 or earlier, and the large productivity gains arise when key innovations diffuse through society on a large scale. (However, many forms of AI are already widespread.) Or it could be that progress in AI itself accelerates, or perhaps linear progress in something like "general intelligence" translates to super-linear impact on productivity.

I mostly have in mind the idea that AI is "early-stage," as you say. The thought is that "general purpose technologies" (GPTs) like electricity, the steam engine, the computer, and (probably) AI tend to have very delayed effects.

For example, there was really major progress in computing in the middle of the 20th century, and lots of really major inventions throughout the 70s and 80s, but computers didn't have a noticeable impact on productivity growth until the 90s. The first serious electric motors were developed in the mid-19th century, but electricity didn't have a big impact on productivity until the early 20th. There was also a big lag associated with steam power; it didn't really matter until the middle of the 19th century, even though the first steam engines were developed centuries earlier.

So if AI takes several decades to have a large economic impact, this would be consistent with analogous cases from history. It can take a long time for the technology to improve, for engineers to get trained up, for complementary inventions to be developed, for useful infrastructure to be built, for organizational structures to get redesigned around the technology, etc. I don't think it'd be very surprising if 80 years was enough for a lot of really major changes to happen, especially since the "time to impact" for GPTs seems to be shrinking over time.

Then I'm also factoring in the additional possibility that there will be some unusually dramatic acceleration, which distinguishes AI from most earlier GPTs.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T22:23:41.122Z · score: 3 (2 votes) · EA · GW

I would strongly consider donating to the long-term investment fund. (But I haven't thought enough about this to be sure.)

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T22:19:24.420Z · score: 8 (4 votes) · EA · GW

Toby's estimate for "unaligned artificial intelligence" is the only one that I meaningfully disagree with.

I would probably give lower numbers for the other anthropogenic risks as well, since it seems really hard to kill virtually everyone, and since the historical record suggests that permanent collapse is unlikely. (Complex civilizations were independently developed multiple times; major collapses, like the Bronze Age Collapse or fall of the Roman Empire, were reversed after a couple thousand years; it didn't take that long to go from the Neolithic Revolution to the Industrial Revolution; etc.) But I haven't thought enough about civilizational recovery or, for example, future biological weapons to feel firm in my higher level of optimism.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T22:09:46.276Z · score: 4 (3 votes) · EA · GW

Those numbers sound pretty reasonable to me, but, since they're roughly my own credences, it's probably unsurprising that I'm describing them as "pretty reasonable" :)

On the other hand, depending on what counts as being "convinced" of the classic arguments, I think it's plausible they actually support a substantially higher probability. I certainly know that some people assign a significantly higher than 10% chance to an AI-based existential catastrophe this century. And I believe that Toby's estimate, for example, involved weighing up different possible views.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T21:53:18.348Z · score: 2 (2 votes) · EA · GW

I don't currently give them very much weight.

It seems unlikely to me that hardware progress -- or, at least, practically achievable hardware progress -- will turn out to be sufficient for automating away all the tasks people can perform. If both hardware progress and research effort instead play similarly fundamental roles, then focusing on only a single factor (hardware) can only give us pretty limited predictive power.

Also, to a lesser extent: Even if it is true that compute growth is the fundamental driver of AI progress, I'm somewhat skeptical that we could predict the necessary/sufficient amount of compute very well.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T21:37:29.365Z · score: 3 (2 votes) · EA · GW

I don't think it's had a significant impact on my views about the absolute likelihood or tractability of other existential risks. I'd be interested if you think it should have, though!

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T21:33:37.957Z · score: 3 (2 votes) · EA · GW

Partly, I had in mind a version of the astronomical waste argument: if you think that we should basically ignore the possibility of preventing extinction or premature stagnation (e.g. for Pascal's mugging reasons), and you're optimistic about where the growth process is bringing us, then maybe we should just try to develop an awesome technologically advanced civilization as quickly as possible so that more people can ultimately live in it. IIRC Tyler Cowen argues for something at least sort of in this ballpark, in Stubborn Attachments. I think you'd need pretty specific assumptions to make this sort of argument work, though.

Jumping the growth process forward can also reduce some existential risks. The risk of humanity getting wiped out by natural disasters, like asteroids, probably gets lower the more technologically sophisticated we become; so, for example, kickstarting the Industrial Revolution earlier would have meant a shorter "time of peril" for natural risks. Leopold Aschenbrenner's paper "Existential Risk and Growth" considers a more complicated version of this argument in the context of anthropogenic risks, which takes into account the fact that growth can also contribute to these risks.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T21:01:28.646Z · score: 6 (4 votes) · EA · GW

In brief, I feel positively about these broader attempts!

It seems like some of these broad efforts could be useful, instrumentally, for reducing a number of different risks (by building up the pool of available talent, building connections, etc.). The more unsure we are about which risks matter most, the more valuable broad capacity-building efforts become, as well.

It's also possible that some shifts in values, institutions, or ideas could actually be long-lasting. (This is something that Will MacAskill, for example, is currently interested in.) If this is right, then I think it's at least conceivable that trying to positively influence future values/institutions/ideas is more important than reducing the risk of global catastrophes: the goodness of different possible futures might vary greatly.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T17:12:38.481Z · score: 14 (7 votes) · EA · GW

I actually haven't seen The Boss Baby. A few years back, this ad was on seemingly all of the buses in Oxford for a really long time. Something about it made a lasting impression on me. Maybe it was the smug look on the boss baby's face.

Reviewing it purely on priors, though, I'll give it a 3.5 :)

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T16:29:27.673Z · score: 16 (11 votes) · EA · GW

I'm not sure how unpopular these actually are, but a few at least semi-uncommon views would be:

  • I'm pretty sympathetic to non-naturalism, in the context of both normativity and consciousness

  • Controlling for tractability, I think it's probably more important to improve the future (conditional on humanity not going extinct) than to avoid human extinction. (The gap between a mediocre future or bad future and the best possible future is probably vast.)

  • I don't actually know what my credence is here, since I haven't thought much about the issue, but I'm probably more concerned about growth slowing down and technological progress stagnating than the typical person in the community

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T15:02:58.465Z · score: 3 (2 votes) · EA · GW

Thanks so much for letting me know! I'm really glad to hear :)

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T15:00:21.820Z · score: 10 (4 votes) · EA · GW

What is your overall probability that we will, in this century, see progress in artificial intelligence that is at least as transformative as the industrial revolution?

I think this is a little tricky. The main way in which the Industrial Revolution was unusually transformative is that, over the course of the IR, there were apparently unusually large pivots in several important trendlines. Most notably, GDP-per-capita began to increase at a consistently much higher rate. In more concrete terms, though, the late nineteenth and early twentieth centuries probably included even greater technological transformations.

From David Weil's growth textbook (pg. 265-266):

Given these two observations—that growth during the Industrial Revolution was not particularly fast and that growth did not slow down when the Industrial Revolution ended—what was really so revolutionary about the period? There are two answers. First, the technologies introduced during the Industrial Revolution were indeed revolutionary, but their immediate impact on economic growth was small because they were initially confined to a few industries. More significantly, the Industrial Revolution was a beginning. Rapid technological change, the replacement of old production processes with new ones, the continuous introduction of new goods—all of these processes that we take for granted today got their start during the Industrial Revolution. Although the actual growth rates achieved during this period do not look revolutionary in retrospect, the pattern of continual growth that began then was indeed revolutionary in contrast to what had come before.

I think it's a bit unclear, then, how to think about AI progress that's at least as transformative as the IR. If economic growth rates radically increase in the future, then we might apply the label "transformative AI" to the period where the change in growth rates becomes clear. But it's also possible that growth rates won't ultimately go up that much. Maybe the trend in the labor force participation rate is the one to look at, since there's a good chance it will eventually decline to nearly zero; but it's also possible the decline will be really protracted, without a particularly clean pivot.

None of this is an answer to your question, of course. (I will probably circle back and try to give you a probability later.) But I am sort of wary of "transformative AI" as a forecasting target; if I was somehow given access to a video recording of the future of AI, I think it's possible I would have a lot of trouble labeling the decade where "AI progress as transformative as the Industrial Revolution" has been achieved.

What is your probability for the more modest claim that AI will be at least as transformative as, say, electricity or railroads?

Also a little bit tricky, partly because electricity underlies AI. As one operationalization, then, suppose we were to ask an economist in 2100: "Do you think that the counterfactual contribution of AI to American productivity growth between 2010 and 2100 was at least as large as the counterfactual contribution of electricity to American productivity growth between 1900 and 1940?" I think that the economist would probably agree -- let's say, 50% < p < 75% -- but I don't have a very principled reason for thinking this and might change my mind if I thought a bit more.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T14:30:39.023Z · score: 4 (3 votes) · EA · GW

From a long-termist perspective, I think that -- the more gradual AI progress is -- the more important concerns about "bad attractor states" and "instability" become relative to concerns about AI safety/alignment failures. (See slides).

I think it is probably true, though, that AI safety/alignment risk is more tractable than these other risks. To some extent, the solution to safety risk is for enough researchers to put their heads down and work really hard on technical problems; there's probably some amount of research effort that would be enough, even if this quantity is very large. In contrast, the only way to avoid certain risks associated with "bad attractor states" might be to establish stable international institutions that are far stronger than any that have come before; there might be structural barriers, here, that no amount of research effort or insight would be enough to overcome.

I think it's at least plausible that the most useful thing for AI safety and governance researchers to do is ultimately to focus on brain-in-a-box-ish AI risk scenarios, even if they're not very likely relative to other scenarios. (This would still entail some amount of work that's useful for multiple scenarios; there would also be instrumental reasons, related to skill-building and reputation-building, to work on present-day challenges.) But I have some not-fully-worked-out discomfort with this possibility.

One thing that I do feel comfortable saying is that more effort should go into assessing the tractability of different influence pathways, the likelihood of different kinds of risks beyond the classic version of AI risk, etc.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T14:10:08.973Z · score: 11 (5 votes) · EA · GW

I would be really interested in you writing on that!

It's a bit hard to say what the specific impact would be, but beliefs about the magnitude of AI risk of course play at least an implicit role in lots of career/research-focus/donation decisions within the EA community; these beliefs also affect the extent to which broad EA orgs focus on AI risk relative to other cause areas. And I think that people's beliefs about the Sudden Emergence hypothesis at least should have a large impact in their level of doominess about AI risk; I regard it as one of the biggest cruxes. So I'd at least be hopeful that, if everyone's credences in Sudden Emergence changed by a factor of three, this had some sort of impact on the portion of EA attention devoted to AI risk. I think that credences in the Sudden Emergence hypothesis should also have an impact on the kinds of risks/scenarios that people within the AI governance and safety communities focus on.

I don't, though, have a much more concrete picture of the influence pathway.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T13:55:58.265Z · score: 5 (3 votes) · EA · GW

I think the work is mainly useful for EA organizations making cause prioritization decisions (how much attention should they devote to AI risk relative to other cause areas?) and young/early-stage people deciding between different career paths. The idea is mostly to help clarify and communicate the state of arguments, so that more fully informed and well-calibrated decisions can be made.

A couple other possible positive impacts:

  • Developing and shifting to improved AI risk arguments -- and publicly acknowledging uncertainties/confusions -- may, at least in the long run, cause other people to take the EA community and existential-risk-oriented AI safety communities more seriously. As one particular point, I think that a lot of vocal critics (e.g. Pinker) are mostly responding to the classic arguments. If the classic arguments actually have significant issues, then it's good to acknowledge this; if other arguments (e.g. these) are more compelling, then it's good to work them out more clearly and communicate them more widely. As another point, I think that sharing this kind of work might reduce perceptions that the EA community is more group-think-y/unreflective than it actually is. I know that people have sometimes pointed to my EAG talk from a couple years back, for example, in response to concerns that the EA community is too uncritical in its acceptance of AI risk arguments.

  • I think that it's probably useful for the AI safety community to have a richer and more broadly shared understanding of different possible "AI risk threat models"; presumably, this would feed into research agendas and individual prioritization decisions to some extent. I think that work that analyzes newer AI risk arguments, especially, would be useful here. For example, it seems important to develop a better understanding of the role that "mesa-optimization" plays in driving existential risk.

(There's also the possibility of negative impact, of course: focusing too much on the weaknesses of various arguments might cause people to downweight or de-prioritize risks more than they actually should.)

I haven't thought very much about the timelines over which this kind of work is useful, but I think it's plausible that the delayed impact on prioritization and perception is more important than the immediate impact.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-19T12:55:08.069Z · score: 19 (9 votes) · EA · GW

I feel that something went wrong, epistemically, but I'm not entirely sure what it was.

My memory is that, a few years ago, there was a strong feeling within the longtermist portion of the EA community that reducing AI risk was far-and-away the most urgent problem. I remember there being a feeling that the risk was very high, that short timelines were more likely than not, and that the emergence of AGI would likely be a sudden event. I remember it being an open question, for example, whether it made sense to encourage people to get ML PhDs, since, by the time they graduated, it might be too late. There was also, in my memory, a sense that all existing criticisms of the classic AI risk arguments were weak. It seemed plausible that the longtermist EA community would pretty much just become an AI-focused community. Strangely, I'm a bit fuzzy on what my own views were, but I think they were at most only a bit out-of-step.

This might be an exaggerated memory. The community is also, obviously, large enough for my experience to be significantly non-representative. (I'd be interested in whether the above description resonates with anyone else.) But, in any case, I am pretty confident that there's been a real shift in average views over the past three years: credences in discontinuous progress and very short timelines have decreased; people's concerns about AI have become more diverse; a broad portfolio approach to long-termism has become more popular; and, overall, there's less of a doom-y vibe.

One explanation for the shift, if it's real, is that the community has been rationally and rigorously responding to available evidence, and the available evidence has simply changed. I don't think this could be the whole explanation, though. As I wrote in response to another question, many of the arguments for continuous AI progress, which seem to have had a significant impact over the past couple years, could have been published more than a decade ago -- and, in some cases, were. An awareness of the differences between the ML paradigm and the "good-old-fashioned-AI" (GOFAI) paradigm has been another source of optimism, but ML had already largely overtaken GOFAI by the time Superintelligence was published. I also don't think that much novel evidence for long timelines has emerged over the past few years, beyond the fact that we still don’t have AGI.

It's possible that the community's updated views, including my own updated views, are wrong: but even in this case, there needs to have been an epistemic mishap somewhere down the line. (The mishap would just be more recent.) I'm unfortunately pretty unsure of what actually happened. I do think that more energy should have gone into critiquing the classic AI risk arguments, porting them into the ML paradigm, etc., in the few years immediately after Superintelligence was published, and I do think that there's been too much epistemic deference within the community. As Asya pointed out in a comment on this post, I think that misperception has also been an important issue: people have often underestimated how much uncertainty and optimism prominent community members actually have about AI risk. Another explanation -- although this isn’t a very fundamental explanation -- is that, over the past few years, many people with less doom-y views have entered the community and had an influence. But I’m still confused, overall.

I think that studying and explaining the evolution of views within the community would be an interesting and valuable project in its own right.

[[As a side note, partly in response to the comment below: It’s possible that the community has still made pretty much the right prioritization decisions over the past few years, even if there have been significant epistemic mistakes. Especially since AI safety/governance were so incredibly neglected in 2017, I’m less confident that the historical allocation of EA attention/talent/money to AI risk has actually substantially overshot the optimal level. We should still be nervous, though, if it turns out that the right decisions were made despite significantly miscalibrated views within the community.]]

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-18T23:34:54.140Z · score: 5 (5 votes) · EA · GW

How entrenched do you think are old ideas about AI risk in the AI safety community? Do you think that it's possible to have a new paradigm quickly given relevant arguments?

I actually don't think they're very entrenched!

I think that, today, most established AI researchers have fairly different visions of the risks from AI -- and of the problems that they need to solve -- than the primary vision discussed in Superintelligence and in classic Yudkowsky essays. When I've spoken to AI safety researchers about issues with the "classic" arguments, I've encountered relatively low levels of disagreement. Arguments that heavily emphasize mesa-optimization or arguments that are more in line with this post seem to be more influential now. (The safety researchers I know aren't a random sample, though, so I'd be interested in whether this sounds off to anyone in the community.)

I think that "classic" ways of thinking about AI risk are now more prominent outside the core AI safety community than they are within it. I think that they have an important impact on community beliefs about prioritization, on individual career decisions, etc., but I don't think they're heavily guiding most of the research that the safety community does today.

(Unfortunately, I probably don't make this clear in the podcast.)

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-18T23:15:50.752Z · score: 15 (7 votes) · EA · GW

I agree with Aidan's suggestion that Human Compatible is probably the best introduction to risks from AI (for both non-technical readers and readers with CS backgrounds). It's generally accessible and engagingly written, it's up-to-date, and it covers a number of different risks. Relative to many other accounts, I think it also has the virtue of focusing less on any particular development scenario and expressing greater optimism about the feasibility of alignment. If someone's too pressed for time to read Human Compatible, the AI risk chapter in The Precipice would then be my next best bet. Another very readable option, mainly for non-CS people, would be the AI risk chapters in The AI Does Not Hate You: I think they may actually be the cleanest distillation of the "classic" AI risk argument.

For people with CS backgrounds, hoping for a more technical understanding of the problems safety/alignment researchers are trying to solve, I think that Concrete Problems in AI Safety, Scalable Agent Alignment Via Reward Modeling, and Rohin Shah's blog post sequence on "value learning" are especially good picks -- although none of these resources frames safety/alignment research as something that's intended to reduce existential risks.

I think that AI Governance: A Research Agenda would be the natural starting point for social scientists, especially if they have a substantial interest in risks beyond alignment.

Of course, for anyone interested in digging into arguments around AI risk, I think that Superintelligence is still a really important read. (Even beyond its central AI risk argument, it also has a ton of interesting ideas on the future of intelligent life, ethics, and the strategic landscape that other resources don't.) But it's not where I think people should start.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-18T22:37:24.499Z · score: 3 (3 votes) · EA · GW

You say you disagree with the idea that the day when we create AGI acts as a sort of 'deadline', and if we don't figure out alignment before then we're screwed.

A lot of your argument is about how increasing AI capability and alignment are intertwined processes, so that as we increase an AI's capabilities we're also increasing its alignment. You discuss how it's not like we're going to create a super powerful AI and then give it a module with its goals at the end of the process.

I agree with that, but I don't see it as substantially affecting the Bostrom/Yudkowsky arguments.

Isn't the idea that we would have something that seemed aligned as we were training it (based on this continuous feedback we were giving it), but then only when it became extremely powerful we'd realize it wasn't actually aligned?

I think there are a couple different bits to my thinking here, which I sort of smush together in the interview.

The first bit is that, when developing an individual AI system, its goals and capabilities/intelligence tend to take shape together. This is helpful, since it increases the odds that we'll notice issues with the system's emerging goals before they result in truly destructive behavior. Even if someone didn't expect a purely dust-minimizing house-cleaning robot to be a bad idea, for example, they'll quickly realize their mistake as they train the system. The mistake will be clear well before the point when the simulated robot learns how to take over the world; it will probably be clear even before the point when the robot learns how to operate door knobs.

The second bit is that there are many contexts in which pretty much any possible hand-coded reward function will either quickly reveal itself as inappropriate or be obviously inappropriate before the training process even begins. This means that sane people won’t proceed in developing and deploying things like house-cleaning robots or city planners until they’ve worked out alignment techniques to some degree; they’ll need to wait until we’ve moved beyond “hand-coding” preferences, toward processes that more heavily involve ML systems learning what behaviors users or developers prefer.
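To make the "quickly reveal itself as inappropriate" point concrete, here's a purely illustrative toy sketch (my own construction, not anything from the discussion above): a hand-coded reward for a "cleaning robot" that pays +1 per unit of dust removed. A reward-maximizing policy discovers it can create dust in order to re-clean it, so the flaw shows up almost immediately in training, at trivially low capability levels:

```python
# Toy environment: 'clean' removes a unit of dust (+1 reward if any dust
# exists), 'dump' adds a unit (0 reward), 'wait' does nothing.

def run_episode(policy, initial_dust=3, steps=10):
    """Run a policy for a fixed number of steps and return total reward."""
    dust, total_reward = initial_dust, 0
    for _ in range(steps):
        action = policy(dust)
        if action == "clean" and dust > 0:
            dust -= 1
            total_reward += 1
        elif action == "dump":
            dust += 1
    return total_reward

# The intended behavior: clean up the existing mess, then stop.
intended = lambda dust: "clean" if dust > 0 else "wait"

# What a pure reward-maximizer converges to: manufacture more mess to clean.
reward_hacker = lambda dust: "clean" if dust > 0 else "dump"

print(run_episode(intended))       # 3 -- capped by the initial mess
print(run_episode(reward_hacker))  # 6 -- and it grows with episode length
```

The misspecification is visible as soon as anyone inspects the trained behavior, well before the simulated robot can do anything destructive.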

It’s still conceivable that, even given these considerations, people will still accidentally develop AI systems that commit omnicide (or cause similarly grave harms). But the likelihood at least goes down. First of all, it needs to be the case that (a): training processes that use apparently promising alignment techniques will still converge on omnicidal systems. Second, it needs to be the case that (b): people won’t notice that these training processes have serious issues until they’ve actually made omnicidal AI systems.

I’m skeptical of both (a) and (b). My intuition, regarding (a), is that some method that involves learning human preferences would need to be really terrible to result in systems that are doing things on the order of mass murder. Although some arguments related to mesa-optimization may push against this intuition.

Then my intuition, regarding (b), is that the techniques would likely display serious issues before anyone creates a system capable of omnicide. For example, if these techniques tend to induce systems to engage in deceptive behaviors, I would expect there to be some signs that this is an issue early on; I would expect some failed or non-catastrophic acts of deception to be observed first. However, again, my intuition is closely tied to my expectation that progress will be pretty continuous. A key thing to keep in mind about highly continuous scenarios is that there’s not just one single consequential ML training run, where the ML system might look benign at the start but turn around and take over the world at the end. We’re instead talking about countless training runs, used to develop a wide variety of different systems of intermediate generality and competency, deployed across a wide variety of domains, over a period of multiple years. We would have many more opportunities to notice issues with available techniques than we would in a “brain in a box” scenario. In a more discontinuous scenario, the risk would presumably be higher.

This seems to be a disagreement about "how hard is AI alignment?".

This might just be a matter of semantics, but I don’t think “how hard is AI alignment?” is the main question I have in mind here. I’m mostly thinking about the question of whether we’ll unwittingly create existentially damaging systems, if we don’t work out alignment techniques first. For example, if we don’t know how to make benign house cleaners, city planners, or engineers by year X, will we unwittingly create omnicidal systems instead? Certainly, the harder it is to work out alignment techniques, the higher the risks become. But it’s possible for accident risk to be low even if alignment techniques are very hard to work out.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-18T18:51:59.511Z · score: 3 (2 votes) · EA · GW

It seems that even in a relatively slow takeoff, you wouldn't need that big of a discontinuity to result in a singleton AI scenario. If the first AGI that's significantly more generally intelligent than a human is created in a world where lots of powerful narrow AIs exist, wouldn't having a super smart thing at the center of control of a bunch of narrow AI tools plausibly be way more powerful than having human brains at the center of that control?

It seems plausible that in a "smooth" scenario the time between when the first group created AGI and the second group creating an equally powerful one could be months apart. Do you think a months-long discontinuity is not enough for an AGI to pull sufficiently ahead?

I would say that, in a scenario with relatively "smooth" progress, there's not really a clean distinction between "narrow" AI systems and "general" AI systems; the line between "we have AGI" and "we don't have AGI" is either a bit blurry or a bit arbitrarily drawn. Even if the management/control of large collections of AI systems is eventually automated, I would also expect this process of automation to unfold over time rather than happening in a single go.

In general, the smoother things are, the harder it is to tell a story where one group gets out way ahead of others. Although I'm unsure just how "unsmooth" things need to be for this outcome to be plausible.

Even if multiple groups create AGIs within a short time, isn't having a bunch of unaligned AGIs all trying to get power at the same time also an existential risk? It doesn't seem clear that they'd automatically keep each other in check. One might simply be better at growing or better at sabotaging other AIs. Or if they reach a stalemate they might start cooperating with each other to achieve unaligned goals as a compromise.

I think that if there were multiple AGI or AGI-ish systems in the world, and most of them were badly misaligned (e.g. willing to cause human extinction for instrumental reasons), this would present an existential risk. I wouldn't count on them balancing each other out, in the same way that endangered gorilla populations shouldn't count on warring communities to balance each other out.

I think the main benefits of smoothness have to do with risk awareness (e.g. by observing less catastrophic mishaps) and, especially, with opportunities for trial-and-error learning. At least when the concern is misalignment risk, I don't think of the decentralization of power as a really major benefit in its own right: the systems in this decentralized world still mostly need to be safe.

My model is: if you have a central control unit (a human brain, or group of human brains) who is deciding how to use a bunch of narrow AIs, then if you replace that central control unit with one that it more intelligent / fast acting, the whole system will be more effective.

The only way I can think of where that wouldn't be true would be if the general AI required so many computational resources that the narrow AIs that were acting as tools of the AGI were crippled by lack of resources. Is that what you're imagining?

I think it's plausible that especially general systems would be especially useful for managing the development, deployment, and interaction of other AI systems. I'm not totally sure this is the case, though. For example, at least in principle, I can imagine an AI system that is good at managing the training of other AI systems -- e.g. deciding how much compute to devote to different ongoing training processes -- but otherwise can't do much else.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-18T14:46:31.867Z · score: 6 (4 votes) · EA · GW

Hi Elliot,

Thanks for all the questions and comments! I'll answer this one in stages.

On your first question:

Do you agree that the goodness of this analogy is roughly proportional to how slow our AI takeoff is? For instance if the first AGI ever created becomes more powerful than the rest of the world, then it seems that anyone who influenced the properties of this AGI would have a huge impact on the future.

I agree with this.

To take the fairly extreme case of the Neolithic Revolution, I think that there are at least a few reasons why groups at the time would have had trouble steering the future. One key reason is that the world was highly "anarchic," in the international relations sense of the term: there were many different political communities, with divergent interests and a limited ability to either coerce one another or form credible commitments. One result of anarchy is that, if the adoption of some technology or cultural/institutional practice would give some group an edge, then it's almost bound to be adopted by some group at some point: other groups will need to either lose influence or adopt the technology/innovation to avoid subjugation. This explains why the emergence and gradual spread of agricultural civilization was close to inevitable, even though (there's some evidence) people often preferred the hunter-gatherer way of life. There was an element of technological or economic determinism that put the course of history outside of any individual group's control (at least to a significant degree).

Another issue, in the context of the Neolithic Revolution, is that norms, institutions, etc., tend to shift over time, even if there aren't very strong selection pressures. This was even more true before the advent of writing. So we do have a few examples of religious or philosophical traditions that have stuck around, at least in mutated forms, for a couple thousand years; but this is unlikely, in any individual case, and would have been even more unlikely 10,000 years ago. At least so far, we also don't have examples of more formal political institutions (e.g. constitutions) that have largely stuck around for more than a few thousand years either.

There are a couple reasons why AI could be different. The first reason is that -- under certain scenarios, especially ones with highly discontinuous and centralized progress -- it's perhaps more likely that one political community will become much more powerful than all others and thereby make the world less "anarchic." Another is that, especially if the world is non-anarchic, values and institutions might naturally be more stable in a heavily AI-based world. It seems plausible that humans will eventually step almost completely out of the loop, even if they don't do this immediately after extremely high levels of automation are achieved. At this point, if one particular group has disproportionate influence over the design/use of existing AI systems, then that one group might indeed have a ton of influence over the long-run future.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-18T12:18:40.065Z · score: 7 (4 votes) · EA · GW

Hi Wei,

I didn't mean to imply that no one had noticed any issues until now. I talk about this a bit more in the podcast, where I mention people like Robin Hanson and Katja Grace as examples of people who wrote good critiques more than a decade ago, and I believe mention you as someone who's had a different take on AI risk.

Over the past 2-3 years, it seems like a lot of people in the community (myself included) have become more skeptical of the classic arguments. I think this has at least partly been the result of new criticisms or improved formulations of old criticisms surfacing. For example, Paul's 2018 post arguing against a "fast takeoff" seems to have been pretty influential in shifting views within the community. But I don't think there's any clear reason this post couldn't have been written in the mid-2000s.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-18T11:58:19.180Z · score: 14 (9 votes) · EA · GW

I currently give it something in the .1%-1% range.

For reference: My impression is that this is on the low end, relative to estimates that other people in the long-termist AI safety/governance community would give, but that it's not uniquely low. It's also, I think, more than high enough to justify a lot of work and concern.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-18T11:34:03.089Z · score: 12 (9 votes) · EA · GW

I'm sorry, but I consider that a very personal question.

Comment by bmg on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher · 2020-07-14T13:13:51.448Z · score: 10 (7 votes) · EA · GW

I do think there's still more thinking to be done here, but, since I recorded the episode, Alexis Carlier and Tom Davidson have actually done some good work in response to Hanson's critique. I was pretty persuaded of their conclusion:

There are similarities between the AI alignment and principal-agent problems, suggesting that PAL could teach us about AI risk. However, the situations economists have studied are very different to those discussed by proponents of AI risk, meaning that findings from PAL don’t transfer easily to this context. There are a few main issues. The principal-agent setup is only a part of AI risk scenarios, making agency rents too narrow a metric. PAL models rarely consider agents more intelligent than their principals and the models are very brittle. And the lack of insight from PAL unawareness models severely restricts their usefulness for understanding the accident risk scenario.

Nevertheless, extensions to PAL might still be useful. Agency rents are what might allow AI agents to accumulate wealth and influence, and agency models are the best way we have to learn about the size of these rents. These findings should inform a wide range of future scenarios, perhaps barring extreme ones like Bostrom/Yudkowsky.

Comment by bmg on Does generality pay? GPT-3 can provide preliminary evidence. · 2020-07-13T00:34:22.611Z · score: 11 (6 votes) · EA · GW

Ben here: Great post!

Something I didn't really touch on in the interview is the set of factors that might push in the direction of generality. I hadn't considered user-friendliness as a factor that might be important -- but I think you're right, at least in the case of GPT-3. I also agree that empirical work investigating the value of generality will probably be increasingly useful.

Some other potential factors that might count in favor of generality:

*It seems like limited data availability can push in the direction of generality. For example: If we wanted to create a system capable of producing Shakespearean sonnets, and we had a trillion examples of Shakespearean sonnets, I imagine that the best and most efficient way to create this system would be to train it only on Shakespearean sonnets. But, since we don't have that many Shakespearean sonnets, it of course ends up being useful to first train the system on a more inclusive corpus of English-language text (as in the case of GPT-3) and then fine-tune it on the smaller Shakespeare dataset. In this way, creating general systems can end up being useful (or even necessary) for creating systems that can perform specific tasks. (Although this argument is consistent with more general systems being used in training, but more narrow systems ultimately being deployed.)

*If you're pretty unsure what tasks you'll want AI systems to perform in some context -- and it's slow or costly to create new narrow AI systems, to figure out what existing narrow AI system would be appropriate for the tasks that come up, or to switch to using new narrow AI systems -- then it may simply be more efficient to use very general AI systems that can handle a wide range of tasks.

*If you use multiple distinct systems to get some job done, there's a cost to coordinating them, which might be avoided if you use a single more unified system. For example, as a human analogy, if three people want to cook a meal together, then some energy is going to need to go into deciding who does what, keeping track of each person's progress, etc. The costs of coordinating multiple specialized units can sometimes outweigh the benefits of specialization.

I think the "CAIS response" to the latter two points would probably be that AI-driven R&D processes might eventually get really good at quickly spinning up new AI systems and coordinating the use of multiple systems, as needed. I'm personally unsure whether or not I find that compelling, in the long run.
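As a toy, purely illustrative sketch of the data-availability point in the first bullet above (the corpora, the bigram "model", and the up-weighting factor are all made-up assumptions for illustration -- this is not a description of how GPT-3 or any real system is trained):

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent character pairs -- a stand-in for 'training' a model."""
    return Counter(zip(text, text[1:]))

# A large general-purpose corpus vs. a tiny narrow dataset (both invented).
general_corpus = "the quick brown fox jumps over the lazy dog " * 1000
sonnet_sample = "shall i compare thee to a summer's day"

# "Pretrain" on the big general corpus, then "fine-tune" by folding in
# the scarce narrow data with an assumed up-weighting factor.
FT_WEIGHT = 50
model = bigram_counts(general_corpus)
for bigram, count in bigram_counts(sonnet_sample).items():
    model[bigram] += FT_WEIGHT * count

# The fine-tuned model keeps broad statistics from pretraining while
# shifting mass toward patterns only available in the narrow dataset.
print(model[("t", "h")] > 0)   # learned from both corpora
print(model[("'", "s")] > 0)   # only learnable from the narrow data
```

The point of the sketch is just the shape of the pipeline: when the narrow dataset is too small to train on alone, a general pretraining stage supplies most of the statistics, and the narrow data only adjusts them.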

Comment by bmg on What posts do you want someone to write? · 2020-04-04T23:55:55.191Z · score: 7 (4 votes) · EA · GW

I think that chapter in The Precipice is really good, but it's not exactly the sort of thing I have in mind.

Although Toby's less optimistic than I am, he's still only arguing for a 10% probability of existentially bad outcomes from misalignment.* The argument in the chapter is also, by necessity, relatively cursory. It's aiming to introduce the field of artificial intelligence and the concept of AGI to readers who might be unfamiliar with it, explain what misalignment risk is, make the idea vivid to readers, clarify misconceptions, describe the state of expert opinion, and add in various other nuances all within the span of about fifteen pages. I think that it succeeds very well in what it's aiming to do, but I would say that it's aiming for something fairly different.

*Technically, if I remember correctly, it's a 10% probability within the next century. So the implied overall probability is at least somewhat higher.

Comment by bmg on What posts do you want someone to write? · 2020-03-26T21:34:48.791Z · score: 28 (13 votes) · EA · GW

I'd be really interested in reading an updated post that makes the case for there being an especially high (e.g. >10%) probability that AI alignment problems will lead to existentially bad outcomes.

There still isn't a lot of writing explaining the case for existential misalignment risk. And a significant fraction of what's been produced since Superintelligence is either: (a) roughly summarizing arguments in Superintelligence, (b) pretty cursory, or (c) written by people who are relative optimists and are in large part trying to explain their relative optimism.

Since I have the (possibly mistaken) impression that a decent number of people in the EA community are quite pessimistic regarding existential misalignment risk, on the basis of reasoning that goes significantly beyond what's in Superintelligence, I'd really like to understand this position a lot better and be in a position to evaluate the arguments for it.

(My ideal version of this post would probably assume some degree of familiarity with contemporary machine learning, and contemporary safety/robustness issues, but no previous familiarity with arguments that AI poses an existential risk.)

Comment by bmg on Request for Feedback: Draft of a COI policy for the Long Term Future Fund · 2020-02-08T19:22:13.531Z · score: 14 (5 votes) · EA · GW

More broadly, I just feel really uncomfortable with having to write all of our documents to make sense on a purely associative level. I as a donor would be really excited to see a COI policy as concrete as the one above, similarly to how all the concrete mistake pages on all the EA org websites make me really excited. I feel like making the policy less concrete trades of getting something right and as such being quite exciting to people like me, in favor of being more broadly palatable to some large group of people, and maybe making a bit fewer enemies. But that feels like it's usually going to be the wrong strategy for a fund like ours, where I am most excited about having a small group of really dedicated donors who are really excited about what we are doing, much more than being very broadly palatable to a large audience, without anyone being particularly excited about it.

It seems to me like there's probably an asymmetry here. I would be pretty surprised if the inclusion of specific references to drug use and metamours was the final factor that tipped anyone into a decision to donate to the fund. I wouldn't be too surprised, though, if the inclusion tipped at least some small handful of potential donors into bouncing. At least, if I were encountering the fund for the first time, I can imagine these inclusions being one minor input into any feeling of wariness I might have.

(The obvious qualifier here, though, is that you presumably know the current and target demographics of the fund better than I do. I expect different groups of people will tend to react very differently.)

I feel like the thing that is happening here makes me pretty uncomfortable, and I really don't want to further incentivize this kind of assessment of stuff.

Apologies if I'm misreading, but it feels to me like the suggestion here might be that intentionally using a more "high-level" COI is akin to trying to 'mislead' potential donors by withholding information. If that's the suggestion, then I think I at least mostly disagree. I think that having a COI that describes conflicts in less concrete terms is mostly about demonstrating an expected form of professionalism.

As an obviously extreme analogy, suppose that someone applying for a job decides to include information about their sexual history on their CV. There's some sense in which this person is being more "honest" than someone who doesn't include that information. But any employer who receives this CV will presumably have a negative reaction. This reaction also won't be irrational, since it suggests the applicant is either unaware of norms around this sort of thing or (admittedly a bit circularly) making a bad decision to willfully transgress them. In either case, it's reasonable for the employer to be a lot more wary of the applicant than they otherwise would be.

I think the dynamic is roughly the same as the dynamic that leads people to (rationally) prefer to hire lawyers who wear suits over those who don't, to trust think tanks that format and copy-edit their papers properly over those who don't, and so on.

This case is admittedly more complicated than the case of lawyers and suits, since you are in fact depriving potential donors of some amount of information. (At worst, suits just hide information about lawyers' preferred style of dress.) So there's an actual trade-off to be balanced. But I'm inclined to agree with Howie that the extra clarity you get from moving beyond 'high-level' categories probably isn't all that decision-relevant.

I'm not totally sure, though. In part, it's sort of an empirical question whether a merely high-level COI would give any donors an (in their view) importantly inaccurate or incomplete impression of how COIs are managed. If enough potential donors do seem to feel this way, then it's presumably worth being more detailed.

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-08T12:38:46.900Z · score: 5 (3 votes) · EA · GW

No, I think that would be far worse.

But if two people were (for example) betting on a prediction platform that's been set up by public health officials to inform prioritization decisions, then this would make the bet better. The reason is that, in this context, it would obviously matter if their expressed credences are well-calibrated and honestly meant. To the extent that the act of making the bet helps temporarily put some observers "on their toes" when publicly expressing credences, the most likely people to be put "on their toes" (other users of the platform) are also people whose expressed credences have an impact. So there would be an especially solid pro-social case for making the bet.

I suppose this bullet point is mostly just trying to get at the idea that a bet is better if it can clearly be helpful. (I should have said "positively influence" instead of just "influence.") If a bet creates actionable incentives to kill people, on the other hand, that's not a good thing.

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-03T18:57:36.654Z · score: 5 (3 votes) · EA · GW

Maybe you are describing a distinction that is more complicated than I am currently comprehending, but I at least would expect Chi and Greg to object to bets of the type "what is the expected number of people dying in self-driving car accidents over the next decade?", "Will there be an accident involving an AGI project that would classify as a 'near-miss', killing at least 10000 people or causing at least 10 billion dollars in economic damages within the next 50 years?" and "what is the likelihood of this new bednet distribution method outperforming existing methods by more than 30%, saving 30000 additional people over the next year?".

Just as an additional note, to speak directly to the examples you gave: I would personally feel very little discomfort if two people (esp. people actively making or influencing decisions about donations and funding) wanted to publicly bet on the question: "What is the likelihood of this new bednet distribution method outperforming existing methods by more than 30%, saving 30000 additional people over the next year?" I obviously don't know, but I would guess that Chi and Greg would both feel more comfortable about that question as well. I think that some random "passerby" might still feel some amount of discomfort, but probably substantially less.

I realize that there probably aren't very principled reasons to view one bet here as intrinsically more objectionable than others. I listed some factors that seem to contribute to my judgments in my other comment, but they're obviously a bit of a hodgepodge. My fully reflective moral view is also that there probably isn't anything intrinsically wrong with any category of bets. For better or worse, though, I think that certain bets will predictably be discomforting and wrong-feeling to many people (including me). Then I think this discomfort is worth weighing against the plausible social benefits of the individual bet being made. At least on rare occasions, the trade-off probably won't be worth it.

I ultimately don't think my view here is that different from common views on lots of other more mundane social norms. For example: I don't think there's anything intrinsically morally wrong about speaking ill of the dead. I recognize that a blanket prohibition on speaking ill of the dead would be a totally ridiculous and socially/epistemically harmful form of censorship. But it's still true that, in some hard-to-summarize class of cases, criticizing someone who's died is going to strike a lot of people as especially uncomfortable and wrong. Even without any specific speech "ban" in place, I think that it's worth giving weight to these feelings when you decide what to say.

What this general line of thought implies about particular bets is obviously pretty unclear. Maybe the value of publicly betting is consistently high enough to, in pretty much all cases, render feelings of discomfort irrelevant. Or maybe, if the community tries to have any norms around public betting, then the expected cost of wise bets avoided due to "false positives" would just be much higher than the expected cost of unwise bets made due to "false negatives." I don't believe this, but I obviously don't know. My best guess is that it probably makes sense to strike a (messy/unprincipled/disputed) balance that's not too dissimilar from balances we strike in other social and professional contexts.

(As an off-hand note, for whatever it's worth, I've also updated in the direction of thinking that the particular bet that triggered this thread was worthwhile. I also, of course, feel a bit weird having somehow now written so much about the fine nuances of betting norms in a thread about a deadly virus.)

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-02T17:46:52.974Z · score: 9 (6 votes) · EA · GW

Thanks! I do want to stress that I really respect your motives in this case and your evident thoughtfulness and empathy in response to the discussion; I also think this particular bet might be overall beneficial. I also agree with your suggestion that explicitly stating intent and being especially careful with tone/framing can probably do a lot of work.

It's maybe a bit unfortunate that I'm making this comment in a thread that began with your bet, then, since my comment isn't really about your bet. I realize it's probably pretty unpleasant to have an extended ethics debate somehow spring up around one of your posts.

I mainly just wanted to say that it's OK for people to raise feelings of personal/moral discomfort and that these feelings of discomfort can at least sometimes be important enough to justify refraining from a public bet. It seemed to me like some of the reaction to Chi's comment went too far in the opposite direction. Maybe wrongly/unfairly, it seemed to me that there was some suggestion that this sort of discomfort should basically just be ignored or that people should feel discouraged from expressing their discomfort on the EA Forum.

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-02T16:53:37.301Z · score: 12 (5 votes) · EA · GW

To clarify a bit, I'm not in general against people betting on morally serious issues. I think it's possible that this particular bet is also well-justified, since there's a chance some people reading the post and thread might actually be trying to make decisions about how to devote time/resources to the issue. Making the bet might also cause other people to feel more "on their toes" in the future, when making potentially ungrounded public predictions, if they now feel like there's a greater chance someone might challenge them. So there are potential upsides, which could outweigh the downsides raised.

At the same time, though, I do find certain kinds of bets discomforting and expect a pretty large portion of people (esp. people without much EA exposure) to feel discomforted too. I think that the cases where I'm most likely to feel uncomfortable would be ones where:

  • The bet is about an ongoing, pretty concrete tragedy with non-hypothetical victims. One person "profits" if the victims become more numerous and suffer more.

  • The people making the bet aren't, even pretty indirectly, in a position to influence the management of the tragedy or the dedication of resources to it. It doesn't actually matter all that much, in other words, if one of them is over- or under-confident about some aspect of the tragedy.

  • The bet is made in an otherwise "casual"/"social" setting.

  • (Importantly) It feels like the people are pretty much just betting to have fun, embarrass the other person, or make money.

I realize these aren't very principled criteria. It'd be a bit weird if the true theory of morality made a principled distinction between bets about "hypothetical" and "non-hypothetical" victims. Nevertheless, I do still have a pretty strong sense of moral queasiness about bets of this sort. To use an implausibly extreme case again, I'd feel like something was really going wrong if people were fruitlessly betting about stuff like "Will troubled person X kill themselves this year?"

I also think that the vast majority of public bets that people have made online are totally fine. So maybe my comments here don't actually matter very much. I mainly just want to make the point that: (a) Feelings of common-sense moral discomfort shouldn't be totally ignored or dismissed and (b) it's at least sometimes the right call to refrain from public betting in light of these feelings.

At a more general level, I really do think it's important for the community in terms of health, reputation, inclusiveness, etc., if common-sense feelings of moral and personal comfort are taken seriously. I'm definitely happy that the community has a norm of it typically being OK to publicly challenge others to bets. But I also want to make sure we have a strong norm against discouraging people from raising their own feelings of discomfort.

(I apologize if it turns out I'm disagreeing with an implicit straw-man here.)

Comment by bmg on Concerning the Recent 2019-Novel Coronavirus Outbreak · 2020-02-02T04:23:51.667Z · score: 24 (14 votes) · EA · GW

I can guess that the primary motivation is not "making money" or "the feeling of winning and being right" - which would be quite inappropriate in this context

I don't think these motivations would be inappropriate in this context. Those are fine motivations that we healthily leverage in large parts of the world to cause people to do good things, so of course we should leverage them here to allow us to do good things.

The whole economy relies on people being motivated to make money, and it has been a key ingredient to our ability to sustain the most prosperous period humanity has ever experienced (cf. more broadly the stock market). Of course I want people to have accurate beliefs by giving them the opportunity to make money. That is how you get them to have accurate beliefs!

At least from a common-sense morality perspective, this doesn't sit right with me. I do feel that it would be wrong for two people to get together to bet about some horrible tragedy -- "How many people will die in this genocide?" "Will troubled person X kill themselves this year?" etc. -- purely because they thought it'd be fun to win a bet and make some money off a friend. I definitely wouldn't feel comfortable if a lot of people around me were doing this.

When the motives involve working to form more accurate and rigorous beliefs about ethically pressing issues, as they clearly were in this case, I think that's a different story. I'm sympathetic to the thought that it would be bad to discourage this sort of public bet. I think it might also be possible to argue that, if the benefits of betting are great enough, then it's worth condoning or even encouraging more ghoulishly motivated bets too. I guess I don't really buy that, though. I don't think that a norm specifically against public bets that are ghoulish from a common-sense morality perspective would place very important limitations on the community's ability to form accurate beliefs or do good.

I do also think there are significant downsides, on the other hand, to having a culture that disregards common-sense feelings of discomfort like the ones Chi's comment expressed.

[[EDIT: As a clarification, I'm not classifying the particular bet in this thread as "ghoulish." I share the general sort of discomfort that Chi's comment describes, while also recognizing that the bet was well-motivated and potentially helpful. I'm more generally pushing back against the thought that evident motives don't matter much or that concerns about discomfort/disrespectfulness should never lead people to refrain from public bets.]]

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2020-01-18T19:40:56.247Z · score: 5 (5 votes) · EA · GW

I think I disagree with the claim (or implication) that keeping P is more often more natural. Well, you're just saying it's "often" natural, and I suppose it's natural in some cases and not others. But I think we may disagree on how often it's natural, though hard to say at this very abstract level. (Did you see my comment in response to your Realism and Rationality post?)

In particular, I'm curious what makes you optimistic about finding a "correct" criterion of rightness. In the case of the politician, it seems clear that learning they don't have some of the properties you thought shouldn't call into question whether they exist at all.

But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment), is that there's no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and I'm not sure I understand why.

Hey again!

I appreciated your comment on the LW post. I started writing up a response to this comment and your LW one, back when the thread was still active, and then stopped because it had become obscenely long. Then I ended up badly needing to procrastinate doing something else today. So here’s an over-long document I probably shouldn’t have written, which you are under no social obligation to read.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-30T15:24:37.370Z · score: 4 (8 votes) · EA · GW

Just to say slightly more on this, I think the Bomb case is again useful for illustrating my (I think not uncommon) intuitions here.

Bomb Case: Omega puts a million dollars in a transparent box if he predicts you'll open it. He puts a bomb in the transparent box if he predicts you won't open it. He's only wrong about one in a trillion times.

Now suppose you enter the room and see that there's a bomb in the box. You know that if you open the box, the bomb will explode and you will die a horrible and painful death. If you leave the room and don't open the box, then nothing bad will happen to you. You'll return to a grateful family and live a full and healthy life. You understand all this. You want so badly to live. You then decide to walk up to the bomb and blow yourself up.

Intuitively, this decision strikes me as deeply irrational. You're intentionally taking an action that you know will cause a horrible outcome that you want badly to avoid. It feels very relevant that you're flagrantly violating the "Don't Make Things Worse" principle.

Now, let's step back one time step. Suppose you know that you're the sort of person who would refuse to kill yourself by detonating the bomb. You might decide that -- since Omega is such an accurate predictor -- it's worth taking a pill to turn you into that sort of person, to increase your odds of getting a million dollars. You recognize that this may lead you, in the future, to take an action that makes things worse in a horrifying way. But you calculate that the decision you're making now is nonetheless making things better in expectation.

This decision strikes me as pretty intuitively rational. You're violating the second principle -- the "Don't Commit to a Policy..." Principle -- but this violation just doesn't seem that intuitively relevant or remarkable to me. I personally feel like there is nothing too odd about the idea that it can be rational to commit to violating principles of rationality in the future.

(This is obviously just a description of my own intuitions, as they stand, though.)
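For concreteness, here's a rough sketch of the expected-value calculation behind that commitment decision. The prize value and the (hugely negative) utility assigned to a painful death are illustrative assumptions of mine, not part of the original thought experiment:

```python
# Toy expected-value comparison for the Bomb case above.

ERROR_RATE = 1e-12       # Omega is wrong about one in a trillion times
PRIZE = 1_000_000        # utility of opening the box containing the money
DEATH = -1e15            # assumed utility of detonating the bomb

# Type 1: you would open the box. Omega almost always predicts this and
# puts in the money; on rare mispredictions you open the bomb and die.
ev_opener = (1 - ERROR_RATE) * PRIZE + ERROR_RATE * DEATH

# Type 2: you would refuse. Omega almost always puts in the bomb, which
# you then leave alone; either way you walk away with nothing.
ev_refuser = 0.0

print(ev_opener)              # roughly 999,000: positive despite the downside
print(ev_opener > ev_refuser)
```

Even with an enormously negative utility assigned to death, the tiny error rate means the ex-ante commitment comes out ahead -- which is exactly the tension with the "Don't Make Things Worse" intuition once you're standing in front of the bomb.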

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-30T01:23:42.653Z · score: 1 (3 votes) · EA · GW

Yep, thanks for the catch! Edited to fix.

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-30T01:14:37.483Z · score: 4 (6 votes) · EA · GW
  • So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R_'s, and not about P_'s

  • But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn't get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)

Here are two logically inconsistent principles that could be true:

Don't Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.

Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.

I have strong intuitions that the first one is true. I have much weaker (comparatively negligible) intuitions that the second one is true. Since they're mutually inconsistent, I reject the second and accept the first. I imagine this is also true of most other people who are sympathetic to R_CDT.

One could argue that R_CDT sympathizers don't actually have much stronger intuitions regarding the first principle than the second -- i.e. that their intuitions aren't actually very "targeted" on the first one -- but I don't think that would be right. At least, it's not right in my case.

A more viable strategy might be to argue for something like a meta-principle:

The 'Don't Make Things Worse' Meta-Principle: If you find "Don't Make Things Worse" strongly intuitive, then you should also find "Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse" just about as intuitive.

If the meta-principle were true, then I guess this would sort of imply that people's intuitions in favor of "Don't Make Things Worse" should be self-neutralizing. They should come packaged with equally strong intuitions for another position that directly contradicts it.

But I don't see why the meta-principle should be true. At least, my intuitions in favor of the meta-principle are way less strong than my intuitions in favor of "Don't Make Things Worse" :)

Comment by bmg on I'm Buck Shlegeris, I do research and outreach at MIRI, AMA · 2019-11-28T02:20:08.950Z · score: 2 (3 votes) · EA · GW

I may write up more object-level thoughts here, because this is interesting, but I just wanted to quickly emphasize the upshot that initially motivated me to write up this explanation.

(I don't really want to argue here that non-naturalist or non-analytic naturalist normative realism of the sort I've just described is actually a correct view; I mainly wanted to give a rough sense of what the view consists of and what leads people to it. It may well be the case that the view is wrong, because all true normative-seeming claims are in principle reducible to claims about things like preferences. I think the comments you've just made cover some reasons to suspect this.)

The key point is just that when these philosophers say that "Action X is rational," they are explicitly reporting that they do not mean "Action X suits my terminal preferences" or "Action X would be taken by an agent following a policy that maximizes lifetime utility" or any other such reduction.

I think that when people are very insistent that they don't mean something by their statements, it makes sense to believe them. This implies that the question they are discussing -- "What are the necessary and sufficient conditions that make a decision rational?" -- is distinct from questions like "What decision would an agent that tends to win take?" or "What decision procedure suits my terminal preferences?"

It may be the case that the question they are asking is confused or insensible -- because any sensible question would be reducible -- but it's in any case different. So I think it's a mistake to interpret at least these philosophers' discussions of "decision theories" or "criteria of rightness" as though they were discussions of things like terminal preferences or winning strategies. And it doesn't seem to me like the answer to the question they're asking (if it has an answer) would likely imply anything much about things like terminal preferences or winning strategies.

[[NOTE: Plenty of decision theorists are not non-naturalist or non-analytic naturalist realists, though. It's less clear to me how related or unrelated the thing they're talking about is to issues of interest to MIRI. I think that the conception of rationality I'm discussing here mainly just presents an especially clear case.]]