Do Prof Eva Vivalt's results show 'evidence-based' development isn't all it's cut out to be?

post by Robert_Wiblin · 2018-05-21T16:28:27.239Z · EA · GW · Legacy · 16 comments

I recently interviewed a member of the EA community - Prof Eva Vivalt - at length about her research into the value of using trials in medical and social science to inform development work. The benefits of 'evidence-based giving' have been a core message of both GiveWell and Giving What We Can since they started.

Vivalt's findings somewhat challenge this, and are not as well known as I think they should be. The bottom line is that results from existing studies only weakly predict the results of similar future studies. They appear to have poor 'external validity' - they don't reliably indicate the measured result an intervention will seem to have in future. This means that developing an evidence base to figure out how well projects will work is more expensive than it otherwise would be.

Perversely, in some cases this can make further studies more informative, because we currently know less than we would if past results generalized well.

Note that Eva discussed an earlier version of this paper at EAG 2015.

Another result that conflicts with messages 80,000 Hours has used before is that experts on average are fairly good at guessing the results of trials (though you need to average many guesses). Aggregating these guesses may be a cheaper alternative to running studies, though the guesses may become worse without trial results to inform them.
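To illustrate why averaging many guesses matters, here is a minimal simulation sketch. It is not taken from Vivalt's work: it assumes each expert's guess equals the true effect plus independent, unbiased noise, and the function name and parameters are hypothetical. If experts share a common bias, averaging would not remove it.

```python
import random
import statistics


def simulate_forecast_error(n_experts, n_trials=2000, noise_sd=1.0, seed=0):
    """Compare one expert's error with the error of the averaged guess.

    Assumption (not from the post): guess = true effect + independent
    Gaussian noise with standard deviation `noise_sd`.
    Returns (mean absolute error of a single expert,
             mean absolute error of the average of all experts).
    """
    rng = random.Random(seed)
    single_errors = []
    pooled_errors = []
    for _ in range(n_trials):
        true_effect = rng.gauss(0.0, 1.0)
        guesses = [true_effect + rng.gauss(0.0, noise_sd) for _ in range(n_experts)]
        # Error of the first expert taken alone.
        single_errors.append(abs(guesses[0] - true_effect))
        # Error of the simple average across all experts.
        pooled_errors.append(abs(statistics.fmean(guesses) - true_effect))
    return statistics.fmean(single_errors), statistics.fmean(pooled_errors)


if __name__ == "__main__":
    for n in (1, 5, 25):
        single, pooled = simulate_forecast_error(n)
        print(f"{n:>2} experts: single={single:.2f}, averaged={pooled:.2f}")
```

Under these assumptions the averaged guess is noticeably closer to the truth than a typical individual guess, which is consistent with the caveat that you need to average many guesses.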

Eva's view is that there isn't much alternative to collecting evidence like this - if it's less useful we should just accept that, but continue to run and use studies of this kind.

I'm more inclined to say this should shift our approach. Here's one division of the sources of information that inform our beliefs:

  1. Foundational priors
  2. Trials in published papers
  3. Everything else (e.g. our model of how things work based on everyday experience).

Inasmuch as 2 looks less informative, we should rely more on the alternatives (1 and 3).

Of course Eva's results may also imply that 3 won't generalize between different situations either. In that case, we also have more reason to work within our local environment. It should nudge us towards thinking that useful knowledge is more local and tacit, and less universal and codified. We would then have greater reason to become intimately familiar with a particular organisation or problem and try to have most of our impact through those areas we personally understand well. 

It also suggests that problems which can be tackled with published social science may not have as high tractability - relative to alternative problems we could work on - as it first seems.

You can hear me struggle to figure out how much these results actually challenge conventional wisdom in the EA community later in the episode, and I'm still unsure.

For an alternative perspective from another economist in the community, Rachel Glennerster, you can read this article: Measurement & Evaluation: The Generalizability Puzzle. Glennerster believes that generalisability is much less of an issue than does Vivalt, and is not convinced by how she has tried to measure it.

There are more useful links and a full transcript on the blog post associated with the podcast episode.


Comments sorted by top scores.

comment by Eva · 2018-05-22T23:59:31.934Z · EA(p) · GW(p)

Great comment. I don't think anyone, myself included, would say the means are not the same and therefore everything is terrible. In the podcast, you can see my reluctance to do that when Rob is trying to get me to give one number that will easily summarize how much results in one context will extrapolate to another, and I just don't want to play ball (which is not at all to criticize!). The number I tend to focus on these days (tau squared) is not one that is easily interpretable in that way - instead, it's a measure of the unexplained variation in results - but how much is unexplained clearly depends on what model you are using (and because it is a variance, it really depends on units, making it hard to interpret across interventions except for those dealing with the same kind of outcome). On this view, if you can come up with a great model to explain away more of the heterogeneity, great! I am all for models that have better predictive power.

On the other hand:

1) I do worry that often people are not building more complicated models, but rather thinking about a specific study (if lucky, a group of studies), most likely being biased towards those which found particularly large effects as people seem to update more on positive results.

2) I am not convinced that focusing on mechanisms will completely solve the problem. I agree that interventions that are more theory-based should (in theory) have more similar results -- or at least results that are better able to be predicted, which is more to the point. On the other hand, implementation details matter. I agree with Glennerster and Bates that there is an undue focus on setting -- everyone wants an impact evaluation done in their particular location. But I think there is too much focus on setting because (perhaps surprisingly) when I look in the AidGrade data, there is little to no effect of geography on the impact found, by which I mean that a result from (say) Kenya does not even generalize to Kenya very well (and I believe James Rising and co-authors have found similar results using a case study of conditional cash transfers). This isn't always going to be true; for example, the effect of health interventions depends on the baseline prevalence of disease, and baseline prevalences can be geographically clustered. But what I worry about -- without convincing evidence yet so take this with a grain of salt -- is that small implementation details might frequently wash out the effects of knowing the mechanisms. Hopefully, we will have more evidence on this in the future (whichever way that evidence goes), and I very much hope that the more positive view turns out to be true.

I do agree with you that it's possible that researchers (and policymakers?) are able to account for some of the other factors when making predictions. I also said that there was some evidence that people were updating more on the positive results; I need to dig into the data a bit more to do subgroup analyses, but one way to reconcile these results (which would be consistent with what I have seen using different data) is that some people may be better at it than others. There are definitely times when people are wildly off, as well. I don't think I have a good enough sense yet of when predictions are good and when they are not, and that would be valuable.

Edit: I meant to add, there are a lot of frameworks that people use to try to get a handle on when they can export results or how to generalize. In addition to the work cited in Glennerster and Bates, see Williams for another example. And talking with people in government, there are a lot of other one-off frameworks or approaches people use internally. I am a fan of this kind of work and think it highly necessary, even though I am quite confident it won't get the appreciation it deserves within academia.
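For readers unfamiliar with the tau squared statistic mentioned in the comment above, the following is a minimal sketch of one standard way to estimate it, the DerSimonian-Laird moment estimator from random-effects meta-analysis. This is an assumption for illustration and not necessarily the estimator or model used in the AidGrade analysis. The study effects and variances are made-up numbers.

```python
def dersimonian_laird_tau2(effects, variances):
    """Estimate between-study variance (tau squared) with no covariates.

    effects:   per-study effect estimates
    variances: per-study sampling variances (squared standard errors)
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    c = total_w - sum(w * w for w in weights) / total_w
    # Truncate at zero, since a variance cannot be negative.
    return max(0.0, (q - (len(effects) - 1)) / c)


if __name__ == "__main__":
    # Hypothetical data: five studies of the same outcome.
    effects = [0.10, 0.30, -0.05, 0.45, 0.20]
    variances = [0.01] * 5
    tau2 = dersimonian_laird_tau2(effects, variances)

    # Re-express the same results in units 100 times smaller.
    scaled_effects = [100 * y for y in effects]
    scaled_variances = [100 ** 2 * v for v in variances]
    tau2_scaled = dersimonian_laird_tau2(scaled_effects, scaled_variances)

    print(tau2, tau2_scaled, tau2_scaled / tau2)
```

Rescaling the outcome by a factor of 100 multiplies the estimate by 100 squared, which is the unit dependence described above: the number only compares cleanly across interventions that measure the same kind of outcome.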

comment by Eva · 2018-05-22T10:20:46.467Z · EA(p) · GW(p)

This video might also add to the discussion - the closing panel at CSAE this year was largely on methodology, moderated by Hilary Greaves (head of the new Global Priorities Institute at Oxford), with Michael Kremer, Justin Sandefur, Joseph Ssentongo, and myself. Some of the comments from the other panellists still stick with me today.

comment by Michael_S · 2018-05-21T16:43:24.630Z · EA(p) · GW(p)

I agree that limitations on RCTs are a reason to devalue them relative to other methodologies. They still add value over our priors, but I think the best use cases for RCTs are when they're cheap and can be done at scale (e.g. in the context of online surveys) or when you are randomizing an expensive intervention that would be provided anyway such that the relative cost of the RCT is cheap.

When costs of RCTs are large, I think there's reason to favor other methodologies, such as regression discontinuity designs, which have fared quite well compared to RCTs (

Replies from: Eva, RomeoStevens
comment by Eva · 2018-05-22T00:02:01.261Z · EA(p) · GW(p)

I agree that it would be important to weigh the costs and benefits - I don't think it's exclusively an issue with RCTs, though.

One thing that could help in doing this calculus is a better understanding of when our non-study-informed beliefs are likely to be accurate.

I know at least some researchers are working in this area - Stefano DellaVigna and Devin Pope are looking to follow up their excellent papers on predictions with another one looking at how well people predict results based on differences in context, and Aidan Coville and I also have some work in this area using impact evaluations in development and predictions gathered from policymakers, practitioners, and researchers.

comment by RomeoStevens · 2018-05-21T18:42:06.853Z · EA(p) · GW(p)

Would the development of a VoI checklist be helpful here? Heuristics and decision criteria similar to the flowchart that the Campbell collab. has for experimental design heuristics.

comment by cole_haus · 2018-05-30T18:51:47.456Z · EA(p) · GW(p)

I think Evidence-Based Policy: A Practical Guide To Doing It Better is also a good source here. The blurb:

Over the last twenty or so years, it has become standard to require policy makers to base their recommendations on evidence. That is now uncontroversial to the point of triviality--of course, policy should be based on the facts. But are the methods that policy makers rely on to gather and analyze evidence the right ones? In Evidence-Based Policy, Nancy Cartwright, an eminent scholar, and Jeremy Hardie, who has had a long and successful career in both business and the economy, explain that the dominant methods which are in use now--broadly speaking, methods that imitate standard practices in medicine like randomized control trials--do not work. They fail, Cartwright and Hardie contend, because they do not enhance our ability to predict if policies will be effective.

The prevailing methods fall short not just because social science, which operates within the domain of real-world politics and deals with people, differs so much from the natural science milieu of the lab. Rather, there are principled reasons why the advice for crafting and implementing policy now on offer will lead to bad results. Current guides in use tend to rank scientific methods according to the degree of trustworthiness of the evidence they produce. That is valuable in certain respects, but such approaches offer little advice about how to think about putting such evidence to use. Evidence-Based Policy focuses on showing policymakers how to effectively use evidence, explaining what types of information are most necessary for making reliable policy, and offers lessons on how to organize that information.

comment by RomeoStevens · 2018-05-21T18:44:20.697Z · EA(p) · GW(p)

Really happy to see this get some attention. I think this is where the biggest potential value add of EA lies. Very very few groups are prepared to do work on methodological issues. Those that do seem to generally get bogged down in object level implementation details quickly (See: the output of METRICS for example.) Method work is hard, and connecting people and resources to advance it is neglected.

Replies from: Eva
comment by Eva · 2018-05-22T00:28:04.445Z · EA(p) · GW(p)

And when groups do work on these issues there is a tendency towards infighting.

Some things that could help:

  • Workshops that bring people together. It's harder to misinterpret someone's work when they are describing it in front of you, and it's easier to make fast progress towards a common goal (and to increase the salience of the goal).
  • Explicitly recognizing that the community is small and needs nurturing. It's natural for people to at first be scared that someone else is in their coveted area (career concerns), but overall I think it might be a good thing even on a personal level. It's such a neglected topic that if people work together and help bring attention to it real progress could be made. In contrast, sometimes you see a subfield where people are so busy tearing down each other's work that nothing can get published or funded - a much worse equilibrium.

Bringing people together is hugely important to working constructively.

Replies from: RomeoStevens
comment by RomeoStevens · 2018-05-22T21:35:41.457Z · EA(p) · GW(p)

when groups do work on these issues there is a tendency towards infighting.

Do you think this is a side effect of the-one-true-ontology issues?

Do you happen to know which conferences research results in this area tend to get presented at or which journals they tend to get published in? Could be useful to bootstrap from those networks. I've been tracing some citation chains from highly cited stat papers, but it's very low signal to noise for meta-research vs esoteric statistical methods.

Replies from: Eva
comment by Eva · 2018-05-23T00:10:16.060Z · EA(p) · GW(p)

I'll try to think about this some more. It's a good question.

comment by Anders_Huitfeldt · 2018-05-21T17:56:44.516Z · EA(p) · GW(p)

I also conduct research on the generalizability issue, but from a different perspective. In my view, any attempt to measure effect heterogeneity (and by extension, research generalizability) is scale dependent. It is very difficult to tease apart genuine effect heterogeneity from the appearance of heterogeneity due to using an inappropriate scale to measure the effects.

In order to get around this, I have constructed a new scale for measuring effects, which I believe is more natural than the alternative measures. My work on this is available on arXiv at . The paper has been accepted for publication at the journal Epidemiologic Methods, and I plan to post a full explanation of the idea here and on Less Wrong when it is published (presumably, this will be a couple of weeks from now).

I would very much appreciate feedback on this work, and as always, I operate according to Crocker's Rules.
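The scale dependence described in the comment above can be seen in a small worked example. This sketch uses two familiar effect measures, the risk difference and the risk ratio, and is not the new measure proposed in the paper. The site names and outcome probabilities are hypothetical.

```python
def effect_measures(p_control, p_treated):
    """Return the same treatment effect expressed on two different scales."""
    return {
        "risk_difference": p_treated - p_control,  # additive scale
        "risk_ratio": p_treated / p_control,       # multiplicative scale
    }


# Hypothetical sites: (outcome probability untreated, outcome probability treated).
HYPOTHETICAL_SITES = {
    "site_a": (0.10, 0.20),
    "site_b": (0.40, 0.50),
}

if __name__ == "__main__":
    for name, (p0, p1) in HYPOTHETICAL_SITES.items():
        measures = effect_measures(p0, p1)
        print(
            f"{name}: risk difference={measures['risk_difference']:.2f}, "
            f"risk ratio={measures['risk_ratio']:.2f}"
        )
```

On the additive scale the two sites show the same effect (10 percentage points), so there appears to be no heterogeneity. On the multiplicative scale the effect is a doubling at one site and a 25% increase at the other, so the same data appear heterogeneous. Which picture is "genuine" depends on the choice of scale.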

Replies from: RomeoStevens
comment by RomeoStevens · 2018-05-21T18:39:54.306Z · EA(p) · GW(p)

I think counterfactual outcome state transition parameters is a bad name in that it doesn't help people identify where and why they should use it, nor does it communicate all that well what it really is. I'd want to thesaurus each of the key terms in order to search for something punchier. You might object that essentially 'marketing' an esoteric statistics concept seems perverse, but papers with memorable titles do in fact outperform according to the data AFAIK. Sucks but what can you do?

I bother to go into this because this research area seems important enough to warrant attention and I worry it won't get it.

Replies from: Anders_Huitfeldt
comment by Anders_Huitfeldt · 2018-05-21T18:48:51.644Z · EA(p) · GW(p)

Thank you! I will think about whether I can come up with a catchier name for future publications (and about whether the benefits outweigh the costs of rebranding).

If anyone has suggestions for a better name (for an effect measure that intuitively measures the probability that the exposure switches a person's outcome state), please let me know!

comment by John G. Halstead (Halstead) · 2018-06-06T13:26:25.742Z · EA(p) · GW(p)

V interesting. I'd be curious to hear the arguments as to why we should persist with expensive RCTs if heterogeneity is as described.