FLI AI Alignment podcast: Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI 2020-07-01T20:59:56.243Z


Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-24T23:45:35.777Z · EA · GW

To be clear, I agree with a lot of the points that you're making—the point of sketching out that model was just to show the sort of thing I'm doing; I wasn't actually trying to argue for a specific conclusion. The actual correct strategy for figuring out the right policy here, in my opinion, is to carefully weigh all the different considerations like the ones you're mentioning, which—at the risk of crossing object and meta levels—I suspect to be difficult to do in a low-bandwidth online setting like this.

Maybe it'll still be helpful to just give my take using this conversation as an example. In this situation, I expect that:

  • My models here are complicated enough that I don't expect to be able to convey them here to a point where you'd understand them without a lot of effort.
  • I expect I could properly convey them in a more high-bandwidth conversation (e.g. offline, not text) with you, which I'd be willing to have with you if you wanted.
  • To the extent that we try to do so online, I think there are systematic biases in the format which will lead to beliefs (of at least the readers) being systematically pushed in incorrect directions. As an example, I expect arguments/positions that use simple, universalizing arguments (e.g. Bayesian reasoning says we should do this, therefore we should do it) to win out over arguments that involve summing up a bunch of pros and cons and then concluding that the result is above or below some threshold (which in my opinion is what most actual true arguments look like).
Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-24T21:17:24.446Z · EA · GW

I think you're imagining that I'm doing something much more exotic here than I am. I'm basically just advocating for cooperating on what I see as a prisoner's-dilemma-style game (I'm sure you can also cast it as a stag hunt or make some really complex game-theoretic model to capture all the nuances, but I'm not trying to do that here; my point is just to explain the sort of thing that I'm doing).


A and B can each choose:

  • public) publicly argue against the other
  • private) privately discuss the right thing to do

And they each have utility functions such that

  • A = public; B = private:
    • u_A = 3
    • u_B = 0
    • Why: A is able to argue publicly that A is better than B and therefore gets a bunch of resources, but this costs resources and overall some of their shared values are destroyed due to public argument not directing resources very effectively.
  • A = private; B = public:
    • u_A = 0
    • u_B = 3
    • Why: ditto except the reverse.
  • A = public; B = public:
    • u_A = 1
    • u_B = 1
    • Why: Both A and B argue publicly that they're better than each other, which consumes a bunch of resources and leads to a suboptimal allocation.
  • A = private; B = private:
    • u_A = 2
    • u_B = 2
    • Why: Neither A nor B argues publicly that they're better than the other, which consumes fewer resources and allows for a better overall resource allocation.

Then, I'm saying that in this sort of situation you should play (private) rather than (public), and that therefore we shouldn't punish people for playing (private), since doing so has the effect of forcing us to the Nash equilibrium, where people always play (public), destroying overall welfare.
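The payoff structure above can be checked mechanically. The following is a minimal sketch, not part of the original comment: it encodes the four outcomes listed above and searches for pure-strategy Nash equilibria. The names `PAYOFFS`, `is_nash`, `nash_equilibria`, and `total_welfare` are made up for illustration, and summing u_A and u_B as a stand-in for "overall welfare" is an assumption.

```python
from itertools import product

ACTIONS = ("public", "private")

# Payoffs (u_A, u_B) copied from the list above,
# keyed by (A's action, B's action).
PAYOFFS = {
    ("public", "private"): (3, 0),
    ("private", "public"): (0, 3),
    ("public", "public"): (1, 1),
    ("private", "private"): (2, 2),
}


def is_nash(a, b):
    """True if neither player gains by unilaterally switching actions."""
    u_a, u_b = PAYOFFS[(a, b)]
    a_cannot_improve = all(PAYOFFS[(alt, b)][0] <= u_a for alt in ACTIONS)
    b_cannot_improve = all(PAYOFFS[(a, alt)][1] <= u_b for alt in ACTIONS)
    return a_cannot_improve and b_cannot_improve


def nash_equilibria():
    """All pure-strategy Nash equilibria of the 2x2 game."""
    return [(a, b) for a, b in product(ACTIONS, ACTIONS) if is_nash(a, b)]


def total_welfare(a, b):
    """Assumed welfare measure: the sum of both players' utilities."""
    return sum(PAYOFFS[(a, b)])


if __name__ == "__main__":
    print("Nash equilibria:", nash_equilibria())
    print("Welfare at (public, public):", total_welfare("public", "public"))
    print("Welfare at (private, private):", total_welfare("private", "private"))
```

Under these numbers the only pure-strategy equilibrium is both players choosing (public), even though both choosing (private) gives a higher combined payoff, which is the prisoner's-dilemma shape the comment is pointing at.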

Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-24T19:38:20.680Z · EA · GW

For example would you really not have thought worse of MIRI (Singularity Institute at the time) if it had labeled Holden Karnofsky's public criticism "hostile" and refused to respond to it, citing that its time could be better spent elsewhere?

To be clear, I think that ACE calling the OP “hostile” is a pretty reasonable thing to judge them for. My objection is only to judging them for the part where they don't want to respond any further. So as for the example, I definitely would have thought worse of MIRI if they had labeled Holden's criticisms as “hostile,” but not just for not responding. Perhaps a better example here would be MIRI still not having responded to Paul's arguments for slow takeoff: imo, Paul's arguments should update you, but MIRI not having responded shouldn't.

Would you update in a positive direction if an organization does effectively respond to public criticism?

I think you should update on all the object-level information that you have, but not update on the meta-level information coming from an inference like “because they chose not to say something here, that implies they don't have anything good to say.”

Do you update on the existence of the criticism itself, before knowing whether or how the organization has chosen to respond?


Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-22T19:08:49.071Z · EA · GW

That's a great point; I agree with that.

Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-22T07:07:50.944Z · EA · GW

I disagree, obviously, though I suspect that little will be gained by hashing it out more here. To be clear, I have certainly thought about this sort of issue in great detail as well.

Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-22T06:48:27.791Z · EA · GW

It clearly is actual, boring, normal, Bayesian evidence that they don't have a good response. It's not overwhelming evidence, but someone declining to respond sure is screening off the worlds where they had a great low-inferential-distance reply that was cheap to shoot off that addressed all the concerns. Of course I am going to update on that.

I think that you need to be quite careful with this sort of naive-CDT-style reasoning. Pre-commitments/norms against updating on certain types of evidence can be quite valuable—it is just not the case that you should always update on all evidence available to you.[1]

  1. To be clear, I don't think you need UDT or anything to handle this sort of situation, you just need CDT + the ability to make pre-commitments. ↩︎

Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-22T06:33:46.554Z · EA · GW

To be clear, I think it's perfectly reasonable for you to want ACE to respond if you expect that information to be valuable. The question is what you do when they don't respond. The response in that situation that I'm advocating for is something like “they chose not to respond, so I'll stick with my previous best guess” rather than “they chose not to respond, therefore that says bad things about them, so I'll update negatively.” I think that the latter response is not only corrosive in terms of pushing all discussion into the public sphere even when that makes it much worse, but it also hurts people's ability to feel comfortable holding onto non-public information.

Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-22T03:58:59.346Z · EA · GW

Yeah, I downvoted because it called the communication hostile without any justification for that claim. The comment it is replying to doesn't seem at all hostile to me, and asserting it is, feels like it's violating some pretty important norms about not escalating conflict and engaging with people charitably.

Yeah—I mostly agree with this.

I think it's pretty important for people to make themselves available for communication.

Are you sure that they're not available for communication? I know approximately nothing about ACE, but I'd be surprised if they wouldn't be willing to talk to you after e.g. sending them an email.

Importantly, the above also doesn't highlight any non-public communication channels that people who are worried about the negative effects of ACE can use instead. The above is not saying "we are worried about this conversation being difficult to have in public, please reach out to us via these other channels if you think we are causing harm". Instead it just declares a broad swath of potential communication "hostile" and doesn't provide any path forward for concerns to be addressed. That strikes me as quite misguided given the really substantial stakes of shared reputational, financial, and talent-related resources that ACE is sharing with the rest of the EA community.

I'm a bit skeptical of this sort of “well, if they'd also said X then it would be okay” argument. I think we should generally try to be charitable in interpreting unspecified context rather than assume the worst. I also think there's a strong tendency for goalpost-moving with this sort of objection—are you sure that, if they had said more things along those lines, you wouldn't still have objected?

I mean, it's fine if ACE doesn't want to coordinate with the rest of the EA community, but I do think that currently, unless something very substantial changes, ACE and the rest of EA are drawing from shared resource pools and need to coordinate somehow if we want to avoid tragedies of the commons.

To be clear, I don't have a problem with this post existing. I think it's perfectly reasonable for Hypatia to present their concerns regarding ACE in a public forum so that the EA community can discuss and coordinate around what to do regarding those concerns. What I have a problem with is the notion that we should punish ACE for not responding to those accusations: I don't think they should have an obligation to respond, and I don't think we should assume the worst about them from their refusal to do so (nor should we always assume the best; I think the correct response is to be charitable but uncertain).

Comment by evhub on Concerns with ACE's Recent Behavior · 2021-04-22T01:38:34.042Z · EA · GW

Why was this response downvoted so heavily? (This is not a rhetorical question—I'm genuinely curious what the specific reasons were.)

As Jakub has mentioned above, we have reviewed the points in his comment and fully support Anima International’s wish to share their perspective in this thread. However, Anima’s description of the events above does not align with our understanding of the events that took place, primarily within points 1, 5, and 6.

This is relevant, useful information.

The most time-consuming part of our commitment to Representation, Equity, and Inclusion has been responding to hostile communications in the EA community about the topic, such as this one.

Perhaps the objection is to ACE's description of the OP as “hostile”? I certainly didn't think the OP was hostile, so if that's the concern, I would agree, but...

We prefer to use our time and generously donated funds towards our core programs. Therefore, we will not be engaging any further in this thread.

I think this is an extremely reasonable position, and I don't think any person or group should be downvoted or otherwise shamed for not wanting to engage in any sort of online discussion. Online discussions are very often terrible and I think it's a problem if we have a norm that requires people or organizations to publicly engage with any online discussion that mentions them.

Comment by evhub on Should pretty much all content that's EA-relevant and/or created by EAs be (link)posted to the Forum? · 2021-01-15T22:26:47.302Z · EA · GW

I'd personally love to get more Alignment Forum content cross-posted to the EA Forum. Maybe some sort of automatic link-posting? Though that could pollute the EA Forum with a lot of link posts that probably should be organized separately somehow. I'd certainly be willing to start cross-posting my research to the EA Forum if that would be helpful.

Comment by evhub on FLI AI Alignment podcast: Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI · 2020-07-03T06:57:08.640Z · EA · GW

Glad you enjoyed it!

So, I think what you're describing in terms of a model with a pseudo-aligned objective pretending to have the correct objective is a good description of specifically deceptive alignment, though the inner alignment problem is a more general term that encompasses any way in which a model might be running an optimization process for a different objective than the one it was trained on.

In terms of empirical examples, there definitely aren't good empirical examples of deceptive alignment right now for the reason you mentioned, though whether or not there are good empirical examples of inner alignment problems in general is more questionable. There are certainly lots of empirical examples of robustness/distributional shift problems, but because we don't really know whether our models are internally implementing optimization processes or not, it's hard to say whether we're actually seeing inner alignment failures. This post provides a description of the sort of experiment which I think would need to be done to definitively demonstrate an inner alignment failure (Rohin Shah at CHAI also has a similar proposal here).

Comment by evhub on [deleted post] 2020-03-05T23:23:10.189Z

This thread on LessWrong has a bunch of information about precautions that might be worth taking.