# The Case for Funding New Long-Term Randomized Controlled Trials of Deworming

post by MHR · 2022-08-04T16:12:33.573Z · EA · GW · 8 comments

## Contents

  Summary
Introduction
Model
Robustness
Discussion


# Summary

Despite significant uncertainty in the cost-effectiveness of mass deworming, GiveWell has directed over a hundred million dollars in donations to deworming initiatives since 2011. Almost all the data underlying GiveWell’s cost-effectiveness estimate comes from a single 1998 randomized trial of deworming in 75 Kenyan schools. Errors in GiveWell’s estimate of cost-effectiveness (in either direction) could be driving an impactful misallocation of funding in the global health and development space, reducing the total welfare created by Effective Altruism (EA)-linked donations. A randomized controlled trial replicating the 1998 Kenya deworming trial could provide a substantial improvement in the accuracy of cost-effectiveness estimates, with a simplified model indicating the expected value of such a trial is in the millions of dollars per year. Therefore, EA-aligned donors may have made an error by not performing replication studies on the long-run economic impact of deworming and should prioritize running them in the future. More generally, this finding suggests that EA organizations may be undervaluing the information that could be gained from running experiments to replicate existing published results.

# Introduction

Chronic parasitic infections are common in many regions of the world, including sub-Saharan Africa and parts of East Asia. Two common types of parasitic disease are schistosomiasis, which is transmitted by contaminated water, and the soil-transmitted helminth infections (STHs) trichuriasis, ascariasis, and hookworm. Mass deworming is the process of treating these diseases in areas of high prevalence by administering antiparasitic medications to large groups of people without first testing each individual for infection. The antiparasitic medications involved, praziquantel for schistosomiasis and albendazole for STHs, are cheap, have relatively few side effects, and are considered safe to administer on a large scale. There is strong evidence that deworming campaigns reduce the prevalence of parasitic disease, as well as weaker evidence that deworming campaigns improve broader life outcomes.

# Model

This section introduces a highly simplified model that compares the total well-being created by GiveWell-directed donations with and without a new RCT investigating the long-term effects of deworming. The model operates in terms of the “units of value” used in GiveWell’s cost-effectiveness analysis. To give a sense of scale, GiveWell values preventing the death of a child under 5 from malaria at 116 units of value and increasing the natural log of one person’s consumption by one for one year at 1.44 units of value.[6] Working in these units, the model makes the following introductory assumptions:

• All dollars spent on non-deworming GiveWell top charities generate six times as many units of value per dollar as donations to GiveDirectly. Denote this constant value G = 0.0034 × 6 = 0.0204. This is based on GiveWell’s 2022 statement “we think we’ll be able to recommend up to approximately $750 million in grants that are at least 6x as cost-effective as cash transfers." Since the overall room for funding in non-deworming causes is much greater than that of deworming, it is assumed that shifting funding towards or away from deworming does not change the effectiveness of donations to other causes.
• Denote the value generated by the first dollar of spending on deworming as D. The prior probability distribution (the distribution without a new RCT) for D is assumed to be normal: D ~ N(0.049, 0.032²). Its mean equals the average of GiveWell’s value-per-dollar estimates across all regions in which SCI and Deworm the World operate (0.049 units of value), and it has the same ratio of standard deviation to mean as the Hamory et al. (2021) data on the consumption increase from deworming (giving a standard deviation of 0.032 units of value).
• The value per marginal dollar spent on deworming decreases linearly, with slope s estimated at −1.825 × 10⁻⁹ units of value per dollar. This was computed by assuming that GiveWell’s directed funding was perfectly rational in 2020, so that under the linearly-decreasing-value assumption, the last dollar GiveWell spent on deworming in 2020 (GiveWell directed $15,699,622 to deworming that year) generated the same value as a dollar spent on the average non-deworming intervention (6x the value per dollar of donations to GiveDirectly).
• This prior distribution is assumed to be well-calibrated, so a true value D_true for the first dollar spent on deworming is generated by taking a random sample from the prior.
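The full model is published as an R Jupyter notebook (see the footnotes); as a rough illustration only, the allocation logic implied by these assumptions can be sketched in Python. The variable names and the regret framing below are mine, not the post's:

```python
import numpy as np

rng = np.random.default_rng(0)

G = 0.0034 * 6               # units of value per dollar for non-deworming top charities (6x GiveDirectly)
D_MEAN, D_SD = 0.049, 0.032  # prior on D, the value of the first deworming dollar
s = -1.825e-9                # linear decline in deworming's value per marginal dollar

def optimal_spend(d):
    """Spend on deworming until the marginal value d + s*x falls to G (never negative)."""
    return max((G - d) / s, 0.0)

def relative_value(d_true, spend):
    """Value of `spend` dollars on deworming minus the value of spending them at G instead."""
    return (d_true - G) * spend + 0.5 * s * spend**2

# With the prior mean, optimal spending roughly reproduces GiveWell's 2020 total (~$15.7M):
print(f"optimal spend at prior mean: ${optimal_spend(D_MEAN)/1e6:.1f}M")

# Expected value lost per year by allocating on the prior mean rather than the true D:
d_true = rng.normal(D_MEAN, D_SD, 100_000)
regret = [relative_value(d, optimal_spend(d)) - relative_value(d, optimal_spend(D_MEAN))
          for d in d_true]
print(f"mean value forgone: {np.mean(regret):,.0f} units/year")
```

The mean regret here is only an upper bound on what a trial could recover, since even a new RCT would not reveal D_true exactly, but it conveys why better information about D is worth a meaningful number of value units per year.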

# Discussion

GiveWell has funded deworming programs for over a decade despite significant uncertainty in the cost-effectiveness of deworming. In that time, GiveWell’s researchers have done their best to extrapolate cost-effectiveness estimates from a single 1998 study in Kenya to the wide variety of regions in which GiveWell top charities are now running deworming programs. However, they have not obtained data from additional trials on deworming's long-run impact. Given the challenging tradeoffs GiveWell’s staff are forced to make in allocating funding across a wide range of highly effective charitable organizations, more accurate information on cost-effectiveness would likely deliver a high expected value by enabling more optimal funding allocation.

It is possible that the reason GiveWell or Open Philanthropy has not funded this kind of replication study is that they are aware of trials on the long-run economic impacts of deworming that are already occurring. However, searches of the American Economic Association's RCT Registry and ClinicalTrials.gov do not list any currently-active clinical trials of deworming with preregistered hypotheses related to long-term income or consumption. There are, however, several active clinical trials studying the efficacy of differing deworming methods in eliminating parasitic infections and in creating short-run improvements in health and school attendance. It is possible that EA organizations could work with the investigators in these trials to add plans for long-run follow-ups studying economic impacts.

Based on the simplified model discussed in the previous section, the expected return to a replication study on the long-run economic effects of deworming with similar statistical power to the 1998 study analyzed in Miguel and Kremer (2004) and its follow-ups is equivalent to the value of millions of dollars in additional funding per year. Given this high estimate, GiveWell or other EA-affiliated nonprofits working in the global health and development space should strongly consider running such a follow-up study to maximize their impact. Moreover, they may have lost an opportunity for maximizing their positive impact by not running such a study earlier. Had an EA-aligned organization decided to run a replication study during the first iteration of the “worm wars,” 5-year follow-up data would already be available for analysis. A 5-year follow-up would be enough to compare to Miguel and Kremer (2004), which would give GiveWell and other donors much better information to go on than is currently available.

The fact that no EA organization has tried to replicate Miguel and Kremer’s work is potentially suggestive of a more general blind spot in EA thinking. EA organizations may be overall putting too low a value on information relative to action, and should consider whether other kinds of replication studies would also deliver significant returns. For example, similar types of replication trials could help resolve the contradictions between different studies' efficacy estimates for vitamin A supplementation. Targeted investment in this kind of replication research might deliver benefits even beyond the global health and development space, such as in evaluating the efficacy of interventions to improve animal welfare. Truly doing the most possible good is likely to require active and continued investment in replicating published results, rather than accepting the state of evidence as is.

1. ^

Information on how GiveWell measures donations that were influenced by its recommendations is in the appendix to GiveWell’s Metrics Report.

2. ^

See line 106 of the cost-effectiveness analysis spreadsheet for the source of this 7% factor.

3. ^

See line 7 of the cost-effectiveness analysis spreadsheet for where the average of these estimates is incorporated into the overall model.

4. ^

See table S3 of the appendix to the paper.

5. ^

See line 11 of the cost-effectiveness model.

6. ^

See line 116 for the value of saving a life and line 127 for the value of increasing consumption.

7. ^

The model code is available as an R Jupyter notebook.

8. ^

See e.g. this paper for a derivation of these formulas.

comment by Alexander_Berger · 2022-08-07T23:12:29.377Z · EA(p) · GW(p)

Hi MHR,

I really appreciate substantive posts like this, thanks!

This response is just speaking for myself, doing rough math on the weekend that I haven't run by anyone else. Someone (e.g., from @GiveWell) should correct me if I'm wrong, but I think you're vastly understating the difficulty and cost of running an informative replication given the situation on deworming. (My math below seems intuitively too pessimistic, so I welcome corrections!)

If you look at slide 58 here, the minimum detectable effect (MDE) with 80% power can be approximated as 2.8 times the standard error (which is itself effectively inversely proportional to the square root of the sample size).

I didn't check the original sources, but this GiveWell doc on their deworming replicability adjustment implies that the standard error for log(income/consumption) in the most recent replications is ~.066 (on a "main effect" of .109). The original RCT involved 75 schools, and according to figure A1 here the followup KLPS 4 involved surveying 4,135 participants in the original trial. GiveWell's most recent cost-effectiveness analysis for Deworm the World makes 2 key adjustments to the main effect from the RCT:

• A replicability adjustment of .13 (row 11)
• A geography-specific adjustment for worm burden which averages about .12 (row 40) (this is because worm burdens are now much lower than they were at the time of MK)

Together, these adjustments imply that GiveWell projects the per-capita benefit to the people dewormed to be just .13*.12 = 1.56% of the .109 impact on log income in the late followups to the original Miguel and Kremer RCT. So if we wanted to detect the effect GiveWell expects to see in mass deworming, we'd have an MDE of ~.0017 on log income, which with 80% power and the formula above (MDE = 2.8*standard error) implies we'd need the standard error to be .0017/2.8 = ~.00061 log points. So a well-powered study to detect the effect GiveWell expects would need a standard error roughly 108 times smaller than the standard error (.066) GiveWell calculates on the actual followup RCTs.

But because standard errors are inversely proportional to the square root of sample size, if you used the same study design, getting a 108x smaller standard error would require a 108*108=11,664 times larger sample. I think that might imply a sample size of ~all the elementary schools in India (11,664*75=874K), which would presumably include many schools that do not in fact actually have significant worm burdens.
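As an editorial sanity check (not part of the original comment), the arithmetic above can be reproduced in a few lines of Python:

```python
se_followup = 0.066    # standard error on log consumption in the KLPS follow-ups
main_effect = 0.109    # long-run effect on log income in the M&K follow-ups
replicability = 0.13   # GiveWell's replicability adjustment
worm_burden = 0.12     # average geography-specific worm-burden adjustment

expected_effect = main_effect * replicability * worm_burden  # effect GiveWell expects
required_se = expected_effect / 2.8     # from MDE = 2.8 * SE at 80% power
se_ratio = se_followup / required_se    # how much smaller the SE must be
sample_multiplier = se_ratio ** 2       # SE scales as 1/sqrt(n)

print(f"expected effect: {expected_effect:.4f} log points")
print(f"SE must shrink ~{se_ratio:.0f}x, sample must grow ~{sample_multiplier:,.0f}x")
```

The small differences from the figures in the comment (108x vs. ~109x, 11,664x vs. ~11,800x) come only from rounding intermediate values; the conclusion is unchanged.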

If the original MK study and one followup cost $1M (which I think is the right order of magnitude but may be too high or too low), this implies that a followup powered to find the effect GiveWell expects would cost many billions of dollars. And of course it would take well over a decade to get the long-term followup results here. (That said, it wouldn't surprise me if I'm getting the math wrong here - someone please flag if so!) I'm sure there are better study designs than the one I'm implicitly modeling here that could generate more power, or places where worm burdens are still high enough to make this somewhat more economical, but I'm skeptical they can overcome the fundamental difficulty of detecting small effects in cluster RCTs. I think a totally reasonable reaction to this is to be more skeptical of small cheap interventions, because they're so hard to study and it's so easy to end up driven by your priors.

Replies from: MHR, Falk Lieder

comment by MHR · 2022-08-11T19:09:52.016Z · EA(p) · GW(p)

Thanks so much for taking the time to read the post and for really engaging with it. I very much appreciate your comment and I think there are some really good points in it. But based on my understanding of what you wrote, I’m not sure I currently agree with your conclusion.

In particular, I think that looking in terms of minimum detectable effect can be a helpful shorthand, but it might be misleading more than it’s helping in this case. We don’t really care about getting statistical significance at p < 0.05 in a replication, especially given that the primary effects seen in Hamory et al. (2021) weren’t significant at that level. Rather, we care about the magnitude of the update we’d make in response to new trial data. To give a sense of why that’s so different, I want to start off with an oversimplified example.
Consider two well-calibrated normal priors, one with a mean effect of 10 and standard deviation of 0.5, and one with a mean effect of 0.2 and the same standard deviation. By the simplified MDE criterion, a trial with a standard error of 3.5 would be required to detect the effect at p < 0.05 80% of the time in the first case, and a trial with a standard error of 0.07 would be required in the second case. But if new trial data came in with a given standard error and a given gap between its mean estimate and our prior mean, we would update our estimate of the mean by the same amount in both cases. (The situation for deworming is more complex because the prior distribution is probably truncated at around zero. But I think the basic concept still holds, in that the sample size required to keep the same value of new information wouldn’t grow as fast as the sample size required to keep the same statistical power.) Therefore, I don’t think the required sample size needs to be nearly as big as you estimated in order to get a valuable update to GiveWell’s current cost-effectiveness estimate.

However, your point is clearly correct in that the sample size will need to increase to handle the worm burden effect. That was something I hadn’t thought about in the original post, so I really appreciate you bringing it up in your comment. According to GiveWell, the highest-worm-burden regions in which Deworm the World operates (Kenya and Ogun State, Nigeria) have a worm burden adjustment of 20.5%. A replication trial would likely need to be substantially larger to account for that lower burden, but I don’t think that increase would be prohibitively large.

Regarding the replicability adjustment, I’m not sure it implies that a larger sample size would be needed to make a substantial update based on new trial data (separate from the larger sample needed to handle the worm burden effect).
The replicability adjustment was arrived at by starting with a prior based on short-term effect data and performing a Bayesian update based on the Miguel and Kremer follow-up results. If the follow-up study has the same statistical power as M&K, then the two can be pooled to make the update and they should be given equal weight.

Thinking about it qualitatively, if a replication trial showed a similar or greater effect size than Hamory et al. (2021) after accounting for the difference in worm burden, I would think that would imply a strong update away from GiveWell’s current replicability adjustment of 0.13. In fact, it might even suggest that deworming worked via an alternate mechanism than the ones considered in the analysis underlying GiveWell’s adjustment. On the flip side, I don’t think that GiveWell would be recommending deworming if the Miguel and Kremer follow-ups had found a point estimate of zero for the relevant effect sizes (the entire cost-effectiveness model starts with the Hamory et al. numbers and adjusts them). So if a replication study came in with a negative point estimate for the effect size, GiveWell should probably update noticeably towards zero.

Zooming out, I think that information on deworming’s effectiveness in the presence of current worm burdens and health conditions would be very valuable. GiveWell has done an admirable job of trying to extrapolate from the Miguel and Kremer trial and its follow-ups to a bunch of extremely different environments, but they’re changing the point estimate by a factor of ~66 in doing so. To me, that implies that there’s really tremendous uncertainty here, and that even imperfect evidence in the current environment would be very useful. Since deworming is so cheap, I’m particularly worried about the case where it’s noticeably more effective than GiveWell is currently estimating, in which case EA donors would be leaving a big opportunity to do good on the table.
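The claim that the size of the update depends on the trial's precision and surprise, not on the prior mean, can be illustrated with a conjugate normal-normal update. This sketch is an editorial addition, not code from the post's model:

```python
def normal_update(prior_mean, prior_sd, obs, obs_se):
    """Posterior mean and sd for a normal prior combined with a normal observation."""
    w = prior_sd**2 / (prior_sd**2 + obs_se**2)      # weight placed on the new data
    post_mean = prior_mean + w * (obs - prior_mean)
    post_sd = (prior_sd**-2 + obs_se**-2) ** -0.5
    return post_mean, post_sd

# The two priors from the example: mean 10 and mean 0.2, both with sd 0.5.
# Give each the same trial SE and the same surprise (observation minus prior mean):
trial_se = 1.0
m1, _ = normal_update(10.0, 0.5, 10.0 + 0.3, trial_se)
m2, _ = normal_update(0.2, 0.5, 0.2 + 0.3, trial_se)
print(m1 - 10.0, m2 - 0.2)  # the two shifts are equal
```

Both priors shift by the same amount, even though an MDE calculation would demand a 50x smaller standard error in the low-mean case.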
Thank you again for taking the time to read the post!

Replies from: Alexander_Berger

comment by Alexander_Berger · 2022-08-14T22:11:07.565Z · EA(p) · GW(p)

Thanks MHR. I agree that one shouldn't need to insist on statistical significance, but if GiveWell thinks that the actual expected effect is ~12% of the MK result, then I think if you're updating on a similarly-to-MK-powered trial, you're almost to the point of updating on a coinflip because of how underpowered you are to detect the expected effect. I agree it would be useful to do this in a more formal Bayesian framework which accurately characterizes the GW priors. It wouldn't surprise me if one of the conclusions was that I'm misinterpreting GiveWell's current views, or that it's hard to articulate a formal prior that gets you from the MK results to GiveWell's current views.

comment by Falk Lieder · 2022-08-13T08:48:39.844Z · EA(p) · GW(p)

I think your estimate of how costly it would be to run a replication study is too pessimistic. In addition to the issues that MHR identified, it strikes me as unrealistic that the cost of rerunning the data collection would be more than 10,000 times as high as the cost of the original research project. I think this is highly unlikely because data collection usually accounts for at most 10% of the cost of research. Moreover, the cost of data collection does not scale linearly with the number of participants, but with the number of researchers who are paid to coordinate data collection. The most difficult parts of organizing data collection, such as developing the strategy and establishing contact with high-ranking relevant officials, only have to be done once. Moreover, there are economies of scale: once you can collect data from 1 school, it is very little effort to replicate the process with 100 or 1,000 schools, and that work can then be done by local volunteers with minimal training, for minimal pay or free of charge.
It certainly won't require 10,000 times as many professors, postdocs, and graduate students as the original study, and it is almost exclusively the salaries of those people that make research expensive. To the contrary, collecting more data on an already-designed study with an existing data analysis pipeline requires minimal work from the scientists themselves, and that makes it much less expensive. Therefore, I think that the cost of data collection was probably only 10% of the cost of the research project and only scales logarithmically with the sample size. Based on that line of reasoning, I believe that the replication study could be conducted for one or a few million dollars.

comment by Joseph Lemien (jlemien) · 2022-08-04T23:31:41.968Z · EA(p) · GW(p)

Although I have followed the writings of Chris Blattman and Esther Duflo and similar "randomistas" on and off for several years, I don't know much about deworming. Nonetheless, my general bias leans in favor of having multiple replications of important findings. We are in the 2020s, but it is still common for core research that is widely cited and accepted to have only been looked into once. I'd be very happy to see a fund or an organization that focuses on creating multiple replications of "cruxy" research.

Replies from: MHR

comment by MHR · 2022-08-04T23:49:00.411Z · EA(p) · GW(p)

Yeah, that's a cool idea to have an org that specifically focuses on replication work. I think that if you fleshed out the modeling done here, you could pretty confidently show funders that it would be a cost-effective use of money to do this more widely.

comment by Otis Reid · 2022-08-08T02:11:52.369Z · EA(p) · GW(p)

Enjoyed the post!
Open Philanthropy hired Tom Adamczewski (not sure what his user name is on here) to create https://valueofinfo.com/ for us, which is designed to do this sort of calculation and accommodates a limited number of non-normal distributions (currently just Pareto and normal, though if you work with the core package, you can do more). Just an FYI if you want to play more with this!

comment by MHR · 2022-08-06T14:28:31.318Z · EA(p) · GW(p)

GiveWell's 2021 metrics report is out! Funding distributed to deworming increased greatly last year, from $15,699,622 to $44,124,942. Rerunning the model with the higher 2021 funding levels, the mean estimate of the value created by a replication study increases to approximately 370,000 GiveWell value units per year. This is equivalent to the value created by $18.5 million/year in donations to organizations with a cost-effectiveness of 6x GiveDirectly's.

In general, as EA-related organizations distribute more money per year, the value of information is naturally going to rise. So this kind of replication work will only get more important.
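The unit-to-dollar conversion in MHR's update can be checked directly (an editorial sketch; small rounding differences are expected since 370,000 is itself an approximation):

```python
G = 0.0034 * 6                  # units of value per dollar at 6x GiveDirectly
value_units_per_year = 370_000  # estimated value of the replication study, per year
dollar_equivalent = value_units_per_year / G
print(f"~${dollar_equivalent/1e6:.1f}M per year at 6x GiveDirectly cost-effectiveness")
```

This yields roughly $18.1M/year; the $18.5M quoted above presumably reflects rounding of the approximate 370,000-unit figure.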