Steelmanning the Case Against Unquantifiable Interventions

post by Davidmanheim · 2019-11-13T08:34:07.820Z · score: 45 (21 votes) · EA · GW · 19 comments

Contents

    Outline
  The best quantifiable interventions
    Why are the best interventions so great?
    Why aren't we finding more of these effective interventions?
    Explaining Observed Success
  Objections to speculative interventions
    We Can't Guess What Works
    Isn't there a Track Record of Failures?
  Predicting Effectiveness
    Cost Effectiveness isn't Effectiveness
    Past Performance Doesn't Guarantee Future Results
    But what about the outcomes?
    Gloomy Clouds, with a Ray of Hope
19 comments

There is a strong argument that many of the most important, high-value interventions cannot be robustly quantified. For example, corruption reduction efforts and other policy changes that increase long-term economic growth, thereby saving lives and reducing suffering, are high-importance, plausibly tractable, and relatively neglected areas. Under the ITN framework, or almost any other expected-value prioritization approach, this means they should be prioritized highly. I want to briefly outline and steelman the counterargument, that these have very low expected value compared to the best short-term and quantifiable interventions, and pick apart some uncertainties and issues. (This argument is not particularly novel, but I haven't seen it made clearly before.)

This is not about long-term interventions or existential risk reduction, where a number of different concerns apply and thinking about them is even harder, so I'm not doing so here. For the sake of this discussion, this translates into something like assuming a relatively high discount rate, so that the long term doesn't matter.

Epistemic Status / Notes: When first encountering EA, one of my early reactions was to say that these "soft" / "squishy" policy interventions were underappreciated. (In my defense / clouding my judgement, I was in a public policy PhD program at the time.) On reflection, I have mostly changed my mind, or at least become far less certain, which is what led to this write-up.

I am NOT highly confident in all of these arguments, but I have put in time thinking about them. Lastly, hard-to-quantify and deeply uncertain interventions about existential risk reduction and the long-term future are different in key ways. Thinking about them often makes my head hurt, I don't have clear conclusions, and I have personal biases, so I'm not going to discuss them here.

Outline

First, to understand the distribution of expected outcomes, I'll discuss why the best interventions that we know of are orders of magnitude more effective than the average intervention, and why it's hard to find more.

Second, I'll review Peter Hurford's arguments about why we should be wary of "speculative" causes, focusing on the uncertainty and non-quantifiable outcomes rather than his discussion of long-termism and existential risk.

Lastly, I'll suggest that comparing the expected value of investments is fundamentally intractable, and explain why this would lead to focusing on short-term quantified investments rather than deeply uncertain or non-quantifiable issues.

The best quantifiable interventions

By now, it is a fairly well-accepted empirical observation that the distribution of impact of extant charities is fat-tailed. The items way out in the tail are things like distributing bed-nets, deworming, and giving directly. The question is why these are so good, and why there are so few of them.

Why are the best interventions so great?

Why are these so effective? It takes an unusual combination of factors to make an intervention highly effective. To start, the intervention must address an important issue, and the intervention must either be novel, or the area must be neglected.

To explain why this is true, first note that marginal investment has decreasing returns in most domains. That is, the first million dollars spent on preventing starvation is likely to be far more effective than the thousandth million. Second, for most important problems, billions have already been spent over decades. Even if only a fraction was spent relatively effectively, it will be very unusual for there to be low-hanging fruit unless a novel approach to the problem is found. So if a cause is not neglected (relative to the scale of the problem), and the intervention is not novel, it is unlikely that the intervention will be highly effective.

This is as true for current effective charities as it is in general - once almost all people impacted by malaria have bed nets, and almost everyone susceptible to schistosomiasis is dewormed, they will stop being priorities for effective charity. (This is, of course, a good thing.)

Even for novel interventions in neglected areas, however, we find very few highly effective interventions. Why?

Why aren't we finding more of these effective interventions?

Even given the insight that we are looking for novel approaches and neglected problems, most charities are comparatively ineffective. While it is easy to forget, it took significant amounts of research across a very broad set of interventions to identify the few that work really well. GiveWell considered 300 charities in 2009, and 400 up to that point, in order to identify what ended up as 10 charities to recommend. Of these, only 1 is still recommended as effective. (And none of the others lost their recommendation because of insufficient room for funding or because the problem was solved.)

The low "hit" rate for very effective interventions is even more noteworthy given the filters that were in place. Of the tens of thousands of programs that are tried, a small fraction are found worthwhile enough to be continued and made into ongoing charities. Of those, most weren't nominated for review by GiveWell, likely in part because there is little reason to even suspect they are highly effective. Even among those few, many had evidence that pointed to a high likelihood that they were lower-value than the best charities.

This is to be expected. Rossi's Iron Law Of Evaluation states that the expected value of any net impact assessment of any large scale social program is zero. Even given fat tails, most interventions are flops. If effective interventions are rare, and are exploited and driven to be ineffective once the problem is solved, there should be very few such interventions.

The next question is why we find any at all.

Explaining Observed Success

If good interventions are expected to be rare, how do we explain the fact that we do, in fact, find that some interventions are orders of magnitude more effective than others?

I claim that the question has a simple answer: Persi Diaconis and Frederick Mosteller's law of truly large numbers. With a large enough number of samples, any unlikely thing is likely to be observed. And since we keep trying interventions, and we actively search for those that are effective, and (slowly) abandon those that don't work, it should be unsurprising that we find some that are relatively far out in the distribution.
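To make the "large enough number of samples" point concrete, here is a minimal sketch. The function name and the 1-in-1,000 hit rate are illustrative assumptions, not figures from any evaluation, and the trials are treated as independent.

```python
def chance_of_at_least_one_hit(p_hit: float, n_tried: int) -> float:
    """Probability that at least one of n_tried independent attempts is a 'hit'."""
    return 1.0 - (1.0 - p_hit) ** n_tried


# Hypothetical assumption: each program has a 1-in-1,000 chance of being
# exceptionally effective. This number is made up for illustration.
P_HIT = 0.001

for n in (100, 1_000, 10_000):
    print(f"{n:>6} programs tried -> P(at least one hit) = "
          f"{chance_of_at_least_one_hit(P_HIT, n):.3f}")
```

With only a hundred attempts a hit is unlikely under this assumption, while with ten thousand it is close to certain, even though the per-program odds never changed.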

A bit more technically, if we're trying to sample with a bias towards the tail, the tail of the overall distribution doesn't need to be fat for the observed tail to be fat. What this (tentative) model implies, however, is that even when looking, we're still unlikely to find interventions that are orders of magnitude more effective than those we currently see.
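A toy simulation can illustrate this selection effect. It is only a sketch under stated assumptions: impacts are drawn from a thin-tailed standard normal distribution (a stand-in, not an empirical claim about charities), and the "search" simply keeps the best few draws. The function name and all numbers are hypothetical.

```python
import random
import statistics


def search_and_keep_best(n_tried: int, k_kept: int, seed: int = 0):
    """Draw n_tried impacts from a thin-tailed N(0, 1) and keep the best k_kept."""
    rng = random.Random(seed)
    impacts = [rng.gauss(0.0, 1.0) for _ in range(n_tried)]
    kept = sorted(impacts, reverse=True)[:k_kept]
    return kept, impacts


# Assumed scale of the search: 100,000 programs tried, 10 kept.
kept, everything = search_and_keep_best(n_tried=100_000, k_kept=10)
population_mean = statistics.fmean(everything)
kept_mean = statistics.fmean(kept)

print(f"mean impact of everything tried: {population_mean:.2f}")
print(f"mean impact of the programs kept: {kept_mean:.2f}")
```

The programs that survive the search sit several standard deviations above the average even though the underlying distribution has no fat tail. The same thin tail also means that searching harder yields only slightly better finds, not ones that are orders of magnitude better.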

Objections to speculative interventions

Peter Hurford notes that speculative interventions are a priori unlikely to be effective, for several reasons related to the above argument which are worth reviewing. (There are other arguments which seem less relevant, and so they will not be discussed.) I will review them somewhat critically, but they are all relevant and support his conclusion.

We Can't Guess What Works

In addition to the above issue with effective interventions being rare, he points out that people are bad at guessing which interventions are effective, and which are ineffective. People's inability to guess which programs from among those that are actively tried are helpful is evidence about their ability to predict the success of future programs.

However, I will claim that it is far weaker evidence than it seems. That's because the programs that get evaluated are pre-selected, and programs that most people would expect to be low value are never pursued. You might object that Rossi's Iron Law applies to interventions that are tried - the expected value of even the interventions that seem plausible is zero.

Still, Rossi's Iron Law is likely misleading because the sample of attempted interventions is biased as well. The mean is zero, he notes, only after we exclude the most obvious and effective programs. As Rossi's Zinc Law states, “Only those programs that are likely to fail are evaluated.” I will note that this non-evaluation was probably true in 1978 when the paper was written, but the movement towards cost-effectiveness analysis means that even effective interventions are now likely to be tested to see how effective they are. Given that, it is unclear whether the stylized fact / empirical observation of the Iron Law is still correct.

But going a step further, a complete sample would include not only likely effective programs, but obviously effective things, like whether schools teaching a subject, say, biology, is effective at increasing children's knowledge of that material, or whether distributing food in regions with starving people reduces deaths from starvation. In both cases, people would, presumably, correctly guess the direction of the effect.

I'll call this the Diamond Law of Evaluation: Programs that are effective in obvious enough ways are not evaluated on that basis. Any evaluation of GiveDirectly is about spillover effects, or effects over time, not about whether giving someone money makes the amount of money they have increase. That means that some proportion of proposed interventions should be expected to work well.

Isn't there a Track Record of Failures?

If we can predict what works, why is there a track record of failure? As noted above, 90% of the charities initially chosen by GiveWell are no longer recommended. Hurford argues that this means they failed, and this failure rate should be expected to apply to future programs.

But on review the track record doesn't imply these interventions failed, exactly. They were not found to be ineffective or harmful. Instead, most such charities were downgraded in effectiveness, but it seems that none were found to have negative or even near-zero impacts. The track record also shows that focusing on a priori important and neglected areas is a good way to find effective interventions.

So we can conclude that there are a bunch of factors we care about, and they have different influences.

Predicting Effectiveness

Given all of these factors, we need to look at several of them together to estimate expected value. Before doing that, we should reconsider what we're estimating.

Cost Effectiveness isn't Effectiveness

The previous discussion has mostly focused on impacts of interventions, not cost. Thankfully, it turns out that cost is much easier to predict than impact - it's not exact, but we're shocked if our estimate of cost is off by an order of magnitude, and we're only mildly surprised if our estimate of impact is off by a similar factor. This is critical, because it means impact estimates are far more important.

If we're looking at a potential intervention affecting 100,000 people that costs $10 million, the difference between it saving 0.1% of the people and it saving 10% is tremendous: the first case is $100,000 per life saved, a worthwhile but not amazing intervention, while the second is $1,000 per life saved, putting it among the most effective interventions. The difference between it costing $9 million and $11 million, on the other hand, is tiny.
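As a quick check of this arithmetic, here is a minimal sketch. The function name is hypothetical, and it assumes 100,000 people reached, which is the population under which the figures of $100,000 and $1,000 per life saved follow from a $10 million cost.

```python
def cost_per_life_saved(total_cost: float, people_reached: int,
                        fraction_saved: float) -> float:
    """Dollars spent per life saved, given the share of people reached who are saved."""
    lives_saved = people_reached * fraction_saved
    return total_cost / lives_saved


PEOPLE_REACHED = 100_000  # assumption chosen to match the per-life figures in the text

# Uncertainty about impact: 0.1% versus 10% of people saved, at a fixed $10M cost.
for fraction in (0.001, 0.10):
    per_life = cost_per_life_saved(10_000_000, PEOPLE_REACHED, fraction)
    print(f"impact {fraction:.1%}: ${per_life:,.0f} per life saved")

# Uncertainty about cost: $9M versus $11M, at a fixed 10% impact.
for cost in (9_000_000, 11_000_000):
    per_life = cost_per_life_saved(cost, PEOPLE_REACHED, 0.10)
    print(f"cost ${cost:,}: ${per_life:,.0f} per life saved")
```

The impact range moves the answer by a factor of one hundred, while the cost range moves it by roughly twenty percent.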

This is particularly true when we're uncertain if an intervention works at all. Impact evaluations might show an improvement of 0.1%, or an improvement of 10% - a difference of two orders of magnitude. They might also show that the impact is -0.1%, which is a much bigger deal, since it means we're paying money to make things worse. Cost, however, is rarely this uncertain, and it's not usually negative (it can happen, but that's not our current focus).

Past Performance Doesn't Guarantee Future Results

Given the discussion above, we should naively expect that the effectiveness of future interventions is distributed similarly to the effectiveness of past interventions. This isn't quite correct, because the future interventions we're looking at are ideas that we think are the most effective, rather than the full set of interventions that get tried.

We do have a comparison class for this, since GiveWell has been giving out Incubation Grants that (in part) fund exactly this sort of intervention that is expected to be effective. Unfortunately, there are only a few data points, so our inference will be weak. Not only that, but the differences in intervention types mean that we can't even compare these. I would even argue that our prior estimates are probably more informative than the data.

But what about the outcomes?

Not only are samples of comparable interventions too small to draw conclusions from, but our estimates of the actual impacts will have a comparable problem. There are only a few dozen countries where you might want to run a country-level anti-corruption effort. Suppose it increases GDP growth by 1% per year. Annual GDP growth is highly variable, and even if we ran the effort everywhere, the sample size isn't enough to let us control for the key variables and figure out whether it is working. Our estimate of the change isn't ever going to tell us the impact of our work. (I talked about this here, and came to an unsatisfactory conclusion.)
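A rough power calculation illustrates the sample-size problem. This is a sketch under loudly stated assumptions, not an analysis of real data: the 4-point standard deviation of annual growth, the 20 treated and 20 comparison countries, the single year of observation, and the independence of countries are all hypothetical, and the function names are my own.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def approx_power(effect: float, sd: float, n_treated: int, n_control: int,
                 z_crit: float = 1.96) -> float:
    """Approximate power of a two-sided z-test on a difference in mean growth.

    Uses the normal approximation and assumes independent observations
    with a common standard deviation.
    """
    se = sd * math.sqrt(1.0 / n_treated + 1.0 / n_control)
    shift = effect / se
    return (1.0 - normal_cdf(z_crit - shift)) + normal_cdf(-z_crit - shift)


# Hypothetical inputs: a true effect of +1 point of GDP growth, growth that
# varies with an SD of 4 points, and 20 countries in each group.
power = approx_power(effect=1.0, sd=4.0, n_treated=20, n_control=20)
print(f"approximate power with 20 + 20 countries: {power:.2f}")
```

Under these made-up inputs, the chance of detecting a real one-point effect is far below the conventional 80% target, and there are not enough countries in the world to fix that by adding more.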

That means that we can't conclude afterwards whether the intervention worked. Instead, we need theories of change, and surveys of corruption, and second-order estimates of the impact based on that. In short, we won't find out if our work helped. Instead, our feedback mechanism is based on an estimate that is usually impossible to test empirically, and we compare that estimate to our prior estimate of what we thought would happen. As Andrew Gelman said in a related vein, "the data have no direct connection to anything, so if these data are too noisy, the whole thing is essentially useless (except for informing us that, in retrospect, this approach to studying the problem was not helpful)."

If this seems like a hopeless muddle, it gets worse. If we can't see what the effect is, we cannot improve our interventions on that basis. That means that it's hard to selectively sample from the best types of interventions, since we don't know what the best types are.

Gloomy Clouds, with a Ray of Hope

Basically, I would conclude that any attempt to pick new interventions, or fund intervention types with impacts that are not quantifiable, is fundamentally problematic. Unless we can appeal to the Diamond Law of Evaluation, that the impact of our work is logically dictated by the interventions, funding this type of work can only be justified on the basis of our ability to predict the outcomes. Unless we have some super-ability to forecast, this is doomed.

In fact, however, we do have some amazing forecasting methods. Unfortunately, they are restricted in scope to events with quantifiable outcomes of the type that could be used as feedback. If we find that markets can predict replicability of studies, we might be able to say more. Even then, however, it seems unlikely that we'd get precise enough feedback to reliably tell the difference between an intervention that will be very, very impactful and those that are only somewhat impactful.

That doesn't mean that we shouldn't continue to look for effective ways to impact systems that are hard to predict, but it does mean that it's hard to justify any claims that most such programs will be nearly as impactful as just giving poor people money.


19 comments

Comments sorted by top scores.

comment by MichaelPlant · 2019-11-13T12:55:40.322Z · score: 9 (5 votes) · EA(p) · GW(p)

Thanks for writing this up - I found it helpful. I'm just trying to summarise this in my head and have some questions.

To get the claim that the best interventions are much better than the rest, don't you need to claim that interventions follow a (very) fat-tailed distribution, rather than the claim there are lots of interventions? If they were normally distributed, then (say) bednets would be a massive outlier in terms of effectiveness, right? Do you (or does someone else) have an argument that interventions should be heavy-tailed?

About predicting effectiveness, it seems your conclusion should be one of epistemic modesty relating to hard-to-quantify interventions, not that we should never think they are better. The thought seems to be people are bad at predicting interventions in general, but we can at least check for the easy-to-quantify predictions to overcome our bias; whereas we cannot do this for the hard ones. It seems the implication is that we should discount the naive cost-effectiveness of systemic interventions to account for this bias. But 'sophisticated' estimates of cost-effectiveness for hard-to-quantify interventions might still turn out to be better than those for estimates of simple interventions. Hence it's a note of caution about estimations, not a claim that, in fact, hard to quantify interventions are (always or generally) less cost-effective.


comment by Davidmanheim · 2019-11-13T16:47:57.289Z · score: 4 (2 votes) · EA(p) · GW(p)

You say that the distribution needs to be "very" fat tailed - implying that we have a decent chance of finding interventions orders of magnitude more effective than bed-nets. I disagree. The very most effective possible interventions, where the cost-benefit ratio is insanely large, are things that we don't need to run as interventions. For instance, telling people to eat when they have food so they don't starve would be really impactful if it weren't unnecessary because of how obviously beneficial it is.

So I don't think bednets are a massive outlier - they just have a relatively low saturation compared to most comparably effective interventions. The implication of my model is that most really effective interventions are saturated, often very quickly. Even expensive systemic efforts like vaccinations for smallpox got funded fairly rapidly after such universal eradication was possible, and the less used vaccines are either less effective, for less critical diseases, or are more expensive and/or harder to distribute. (And governments and foundations are running those campaigns, successfully, without needing EA pushing or funding.) And that's why we see few very effective outliers - and since the underlying distribution isn't fat tailed, even more effective interventions are even rarer, and those that did exist are gone very quickly.


On prediction, I agree that the conclusion is one of epistemic modesty rather than confident claims of non-effectiveness. But the practical implication of that modesty is that for any specific intervention, if we fund it thinking it may be really impactful, we're incredibly unlikely to be correct.

Also, I'm far more skeptical than you about 'sophisticated' estimates. Having taken graduate courses in econometrics, I'll say that the methods are sometimes really useful, but the assumptions never apply, and unless the system model is really fantastic, the prediction error once accounting for model specification uncertainty is large enough that most such econometric analyses of these sorts of really complex, poorly understood systems like corruption or poverty simply don't say anything.

comment by Milan_Griffes · 2019-11-13T13:13:56.356Z · score: 2 (1 votes) · EA(p) · GW(p)

"About predicting effectiveness, it seems your conclusion should be one of epistemic modesty relating to hard-to-quantify interventions, not that we should never think they are better."

This is where I'm at too – e.g. the impact of Bletchley Park would have been hard to quantify prospectively, and in retrospect was massively positive.

Curious if OP is actually saying the other thing (that hard-to-quantify implies lower cost-effectiveness).

comment by Davidmanheim · 2019-11-13T16:50:53.554Z · score: 4 (2 votes) · EA(p) · GW(p)

See my comment above. Bletchley Park was exactly the sort of intervention that doesn't need any pushing. It was funded immediately because of how obvious the benefit was. That's not retrospective.

If you were to suggest something similar now that were politically feasible and similarly important to a country, I'd be shocked if it wasn't already happening. Invest in AI and advanced technologies? Check. Invest in Global Health Security? Also check. So the things left to do are less obviously good ideas.

comment by Milan_Griffes · 2019-11-14T12:52:34.209Z · score: 3 (2 votes) · EA(p) · GW(p)

"Bletchley park was exactly the sort of intervention that doesn't need any pushing. It was funded immediately because of how obvious the benefit was."

Pretty sure that's not right, at least for Turing's work on Enigma:

"Turing decided to tackle the particularly difficult problem of German naval Enigma 'because no one else was doing anything about it and I could have it to myself'."


"If you were to suggest something similar now that were politically feasible and similarly important to a country, I'd be shocked if it wasn't already happening. Invest in AI and advanced technologies?..."

What about AI alignment work circa 2010?

Quick examples from the present day: preparing for risks from nanotechnology; working on geoengineering safety

comment by Davidmanheim · 2019-11-15T09:28:35.307Z · score: 1 (1 votes) · EA(p) · GW(p)

Bletchley Park as an intervention wasn't mostly focused on Enigma, at least in the first part of the war. It was tremendously effective anyway, as should have been expected. The fact that new and harder codes were being broken was obviously useful as well, and from what I understood, was being encouraged by the leadership alongside the day-to-day codebreaking work.

And re: AI alignment, it WAS being funded. Regarding nanotech risks and geoengineering safety now, it's been a focus of discussion at CSER and FHI, at least - and there is agreement about the relatively low priority of each compared to other work. (But if someone qualified and aligned with EA goals wanted to work on it more, there's certainly funding available.)

comment by Milan_Griffes · 2019-11-15T19:36:11.591Z · score: 10 (4 votes) · EA(p) · GW(p)

I feel confused about whether there's actually a disagreement here. Seems possible that we're just talking past each other.

  • I agree that Bletchley Park wasn't mostly focused on cracking Enigma.
  • I don't know enough about Bletchley's history to have an independent view about whether it was underfunded or not. I'll follow your view that it was well supported.
  • It does seem like Turing's work on Enigma wasn't highly prioritized when he started working on it ("...because no one else was doing anything about it and I could have it to myself"), and this work turned out to be very impactful. I feel confident claiming that Bletchley wasn't prioritizing Enigma highly enough before Turing decided to work on it. (Curious whether you disagree about this.)

On the present-day stuff:

  • My claim is that circa 2010 AI alignment work was being (dramatically) underfunded by institutions, not that it wasn't being funded at all.
  • It wouldn't surprise me if 20 years from now the consensus view was "Oh man, we totally should have been putting more effort towards figuring out what safe geoengineering looks like back in 2019."
  • I believe Drexler had a hard time getting support to work on nanotech stuff (believe he's currently working mostly on AI alignment), but I don't know the full story there. (I'm holding Drexler as someone who is qualified and aligned with EA goals.)

comment by JP Addison (jpaddison) · 2019-11-15T19:55:02.419Z · score: 4 (3 votes) · EA(p) · GW(p)

I thought this was a really good comment – well written and well structured.

"I feel confident claiming that Bletchley wasn't prioritizing Enigma highly enough before Turing decided to work on it."

This is obvious from hindsight, but to make that claim you need to show that they could predict that the expected value was high in advance, which does seem to be the whole game.

comment by Davidmanheim · 2019-11-17T08:00:46.727Z · score: 1 (1 votes) · EA(p) · GW(p)

I think we can drop the Bletchley Park discussion. On the present-day stuff, I think the key point is that future-focused interventions have a very different set of questions than present-day non-quantifiable interventions, and you're plausibly correct that they are underfunded - but I was trying to focus on the present-day non-quantifiable interventions.

comment by Milan_Griffes · 2019-11-17T13:56:16.859Z · score: -1 (2 votes) · EA(p) · GW(p)

"I think we can drop the Bletchley park discussion."

Okay, I take it that you agree with my view.


"... future-focused interventions have a very different set of questions than present-day non-quantifiable interventions"

How are you separating out "future-focused interventions" from "present-day non-quantifiable interventions"?

Plausibly geoengineering safety will be very relevant in 15-30 years. Assuming that's true, would you categorize geoengineering safety research as future-focused or present-day non-quantifiable?


comment by Davidmanheim · 2019-11-18T11:13:40.823Z · score: 1 (1 votes) · EA(p) · GW(p)

I think my example of corruption reduction captures most of the types of interventions that people have suggested are useful but hard-to-quantify, but other examples would be happiness-focused work, or pushing for systemic change of various sorts.

Tech risks involving GCRs that are a decade or more away are much more future-focused in the sense that different arguments apply, as I said in the original post.

comment by Eva · 2019-11-14T15:05:07.747Z · score: 8 (5 votes) · EA(p) · GW(p)

As a small note, we might get more precise estimates of the effects of a program by predicting magnitudes rather than whether something will replicate (which is what we're doing with the Social Science Prediction Platform). That said, I think a lot of work needs to be done before we can have trust in predictions, and there will always be a gap between how comfortable we are extrapolating to other things we could study vs. "unquantifiable" interventions.

(There's an analogy to external validity here, where you can do more if you can assume the study you predict is drawn from the same set as those you have studied, or the same set if weighted in some way. You could in principle make an ordering of how feasible something is to be studied, and regress your ability to predict on that, but that would be incredibly noisy and not practical as things stand, and past some threshold you don't observe studies anymore and have little to say without making strong assumptions about generalizing past that threshold.)

comment by Davidmanheim · 2019-11-15T09:23:07.949Z · score: 1 (1 votes) · EA(p) · GW(p)

Agreed on all points!

I'd note that the problem with predicting magnitudes is simply that it's harder to do than predicting a binary "will it replicate," though both are obviously valuable.


comment by Halffull · 2019-11-14T02:34:33.359Z · score: 6 (3 votes) · EA(p) · GW(p)

I'd be curious about your own view on unquantifiable interventions, rather than just the steelman of this particular view.

comment by Davidmanheim · 2019-11-14T05:55:32.826Z · score: 1 (1 votes) · EA(p) · GW(p)

As I said in the epistemic status, I'm far less certain than I once was, and on the whole I'm now skeptical. As I said in the post and earlier comments, I still think there are places where unquantifiable interventions are very valuable. But unless it's obvious that they will be (see: Diamond Law of Evaluation), I'd claim that quantifiably effective interventions are better in expectation.

comment by Michael_Wiebe · 2019-11-16T18:24:47.171Z · score: 4 (2 votes) · EA(p) · GW(p)

"But on review the track record doesn't imply these interventions failed, exactly. They were not found to be ineffective or harmful."

Another factor to consider: a cause area could be highly cost-effective, but GiveWell rejected it because the organizations working in that area were not sufficiently transparent or competent.

comment by Davidmanheim · 2019-11-17T08:05:38.329Z · score: 1 (1 votes) · EA(p) · GW(p)

Yes - but if it is expected to be very high value, I'd think that they'd be pushing for a new EA charity with it as a focus, as they have done in the past. Most were dropped because the work they did wasn't as valuable as the top charities.

comment by Michael_Wiebe · 2019-11-16T18:41:55.807Z · score: 1 (1 votes) · EA(p) · GW(p)

"That means that we can't conclude afterwards whether the intervention worked. Instead, we need theories of change, and surveys of corruption, and second order estimates of the impact based on that. In short, we won't find out if our work helped."

This seems too strong. We can't conclude with certainty whether the intervention worked, and we won't find out with certainty if our work helped. But we will have some information.

comment by Davidmanheim · 2019-11-17T08:06:46.238Z · score: 2 (2 votes) · EA(p) · GW(p)

Agreed - but as the link I included argues, the information we have is swamped by our priors, and isn't particularly useful for drawing objective conclusions.