The case for becoming a black-box investigator of language models

post by Buck · 2022-05-06T14:37:13.853Z · EA · GW · 7 comments

[Cross-posted from the Alignment Forum [AF · GW].]

Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.

I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.

I'd probably fund people to spend at least a few months on this; email me if you want to talk about it.

Some of the main activities I’d do if I were a black-box LM investigator are:

The skills you’d gain seem like they have a few different applications to alignment:

My guess is that this work would go slightly better if you had access to someone who was willing to write you some simple code tools for interacting with models, rather than just using the OpenAI playground. If you start doing work like this and want tools, get in touch with me and maybe someone from Redwood Research will build you some of them.
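As a concrete illustration of what such a tool might look like, here is a minimal sketch of a harness that runs a batch of prompt variations through a model and records the completions side by side for comparison. The function names and the stub model are hypothetical; the stub stands in for whatever API (e.g. the OpenAI playground's underlying endpoint) you would actually query:

```python
import json
from typing import Callable, List


def run_probe(prompts: List[str], query_model: Callable[[str], str]) -> List[dict]:
    """Run each prompt through the model and collect prompt/completion pairs."""
    results = []
    for prompt in prompts:
        results.append({"prompt": prompt, "completion": query_model(prompt)})
    return results


def save_results(results: List[dict], path: str) -> None:
    """Save the collected pairs as JSON so probes can be compared across runs."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    # Stub standing in for a real model call; replace with an actual API query.
    def echo_model(prompt: str) -> str:
        return prompt.upper()

    probe = run_probe(["The capital of France is", "2 + 2 ="], echo_model)
    save_results(probe, "probe_results.json")
```

Swapping in a real model call and varying the prompts systematically (paraphrases, pronoun swaps, added distractors) is the kind of small tooling that makes behavioral investigation much faster than pasting prompts into a playground one at a time.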


Comments sorted by top scores.

comment by Peter Wildeford (Peter_Hurford) · 2022-05-06T22:56:28.368Z · EA(p) · GW(p)

I’ve started doing a bunch of this and posting results to my Twitter.

comment by gabriel_wagner · 2022-05-06T20:14:28.500Z · EA(p) · GW(p)

Nice post!

Do you think a person working on this should also have some basic knowledge of ML? Or might it be better to NOT have that, to have a more "pure" outsider view on the behaviour of the models?

Replies from: Buck
comment by Buck · 2022-05-06T21:44:40.794Z · EA(p) · GW(p)

I think that knowing a bit about ML is probably somewhat helpful for this but not very important.

comment by Girish_Sastry · 2022-05-06T18:32:48.251Z · EA(p) · GW(p)

I'd also be interested in funding activities like this. This could inform how much we can learn about models without distributing their weights.

comment by tugbazsen · 2022-05-20T18:50:14.179Z · EA(p) · GW(p)

I'm not very well-versed in the CS side of ML or AI, but I really enjoyed reading about Redwood's work, and this post reminded me of something I found striking then. I am not a trained EFL teacher, but I have a decent grasp of some of the theory and some experience with classroom observation at different levels and with teaching/tutoring EFL. The examples in your "Redwood Research’s current project [AF · GW]" write-up are very similar in a lot of ways to mistakes that intermediate or nearly proficient EFL speakers make (not being able to track what pronouns refer to across longer bodies of text, not grasping the implications of certain verbs when applied to certain objects, etc.). This makes me think that getting the perspectives of both language acquisition experts and EFL researchers on your data may also be interesting and useful for this kind of research.

comment by Joe Pusey · 2022-05-07T21:15:32.832Z · EA(p) · GW(p)

Super interesting topic- I sent you a message Buck :)

comment by brb243 · 2022-05-06T15:56:54.140Z · EA(p) · GW(p)

This is so cool. I have spent a lot of time analyzing AI, but mainly from the perspective of visual, tone-of-voice, and symbolic rather than objective content. Maybe my analysis of abstract advertisement can be an inspiration to you. I see the greatest issue in AI persuasion [AF · GW], in particular persuasion that the audience cannot rationalize because it works on human intuition (prima facie, positive objectives are narrated, by advertisers and humans alike).

I have only seen a few GPT-3 texts, but it seems to me that the model optimizes for attention by making discussants 'submit': it normalizes aggression, expressing discussants' unwillingness to interact even as they keep interacting (I commented a bit here [LW(p) · GW(p)]). This is a suboptimal dynamic; norms of free cooperation on important problem solving would seem better.

This may not be an issue with GPT-3 itself; it may be an issue with the prominent or majority internet content that is co-developed by humans and AI (humans intuitively learn what to produce by seeing metrics related to engagement). If GPT-3 reflects other content, it may behave differently (has anyone already tried training it on the EA Forum?).

How could one use software to train language models in self-reflection, analyzing and qualifying the rationale behind the audience's actions or feelings (e.g. whether they are acting out of fear vs. a somewhat reasoned conclusion about the product's usefulness to their true objectives), and in fostering human objectives that are also virtuous and inclusive (e.g. be healthy, improve one's and others' problems, be well-enjoyed in one's environment, be informed about reasons for one's actions, empathize with animals, learn unbiasing information, etc.)? Is this already developed, with only the questions/prompts missing or limited?

I think that software which can sell genuinely good products by motivating reasoning is somewhat unbeatable (consumers will decrease their demand for products less aligned with their objectives), so any company or economy with a competitive advantage in this capacity (along with the ability to diversify and rapidly adjust production) benefits.

There is a risk of explaining AI in a way that omits the rationale behind people's emotions or actions: the AI can then seem prima facie safe while actually leading to a loss of human agency, without humans or regulators being able to detect it or having the infrastructure to react to it.