# The Case for Funding New Long-Term Randomized Controlled Trials of Deworming

post by MHR · 2022-08-04T16:12:33.573Z

## Contents

- Summary
- Introduction
- The Case for Additional Studies
- Model
- Robustness
- Discussion

# Summary

Despite significant uncertainty in the cost-effectiveness of mass deworming, GiveWell has directed over a hundred million dollars in donations to deworming initiatives since 2011. Almost all the data underlying GiveWell’s cost-effectiveness estimate comes from a single 1998 randomized trial of deworming in 75 Kenyan schools. Errors in GiveWell’s estimate of cost-effectiveness (in either direction) could be driving an impactful misallocation of funding in the global health and development space, reducing the total welfare created by Effective Altruism (EA)-linked donations. A randomized controlled trial replicating the 1998 Kenya deworming trial could provide a substantial improvement in the accuracy of cost-effectiveness estimates, with a simplified model indicating the expected value of such a trial is in the millions of dollars per year. Therefore, EA-aligned donors may have made an error by not performing replication studies on the long-run economic impact of deworming and should prioritize running them in the future. More generally, this finding suggests that EA organizations may be undervaluing the information that could be gained from running experiments to replicate existing published results.

# Introduction

Chronic parasitic infections are common in __many regions of the world__, including sub-Saharan Africa and parts of East Asia. Two common types of parasitic disease are __schistosomiasis__, which is transmitted by contaminated water, and the __soil-transmitted helminth infections__ (STHs) trichuriasis, ascariasis, and hookworm. __Mass deworming__ is the process of treating these diseases in areas of high prevalence by administering antiparasitic medications to large groups of people without first testing each individual for infection. The __antiparasitic medications involved__, praziquantel for schistosomiasis and albendazole for STHs, are cheap, have relatively few side effects, and are considered safe to administer on a large scale. There is __strong evidence__ that deworming campaigns reduce the prevalence of parasitic disease, as well as weaker evidence that deworming campaigns improve broader life outcomes.

GiveWell has included charities working on deworming in its top charities list for __over a decade__, with the __SCI Foundation__ (formerly the Schistosomiasis Control Initiative) and __Evidence Action’s Deworm the World Initiative__ being the top recipients of GiveWell-directed deworming donations. As of 2020, GiveWell has directed __$163 million__ to charities working on deworming, with this funding coming from individual donors giving to deworming organizations based on GiveWell’s recommendation, GiveWell funding deworming organizations directly via its Maximum Impact Fund, and Open Philanthropy donating to deworming organizations based on GiveWell’s research.^{[1]}

GiveWell’s recommendation of deworming-focused charities is based almost entirely on the limited evidence linking deworming to long-term economic benefits, particularly increases in income and consumption. Regarding impacts on health, the __GiveWell brief on deworming__ states “evidence for the impact of deworming on short-term general health is thin. We would guess that deworming has small impacts on weight, but the evidence for its impact on other health outcomes is weak.” So-called “supplemental factors” other than the effect on income change GiveWell’s overall cost-effectiveness estimate for Deworm the World __by 7%__.^{[2]}

GiveWell’s estimate of the long-term economic benefit produced by deworming comes from “__Twenty-Year Economic Impacts of Deworming__” (2021), by Joan Hamory, Edward Miguel, Michael Walker, Michael Kremer, and Sarah Baird. This paper is a 20-year follow-up to “__Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities__” (2004) by Edward Miguel and Michael Kremer, which analyzed the results of the randomized introduction of deworming into 75 Kenyan schools over 3 years starting in 1998. The 20-year follow-up found that “Individuals who received two to three additional years of childhood deworming experienced a 14% gain in consumption expenditures and 13% increase in hourly earnings.” __GiveWell uses__ the average of the point estimates from the 20-year follow-up on the increases in individual earnings and consumption to derive its estimate of the long-term economic benefit produced by deworming.^{[3]} However, there is __enormous uncertainty__ in these point estimates: the mean consumption increase is $199/year with a standard error of $130/year, while the mean income increase is $85/year with a standard error of $171/year.^{[4]}

In light of this uncertainty, researchers and commentators have extensively debated whether deworming produces real benefits. The so-called “worm wars,” __as these debates have come to be known__, have involved multiple re-analyses of the data from the 1998 Kenya study and have led some researchers to conclude that deworming does not have significant economic effects. For example, Paul Garner, David Taylor-Robinson, and Harshpal Singh Sachdev of the Liverpool School of Tropical Medicine __wrote__ “it seems implausible that deworming itself would have an independent effect on school attendance or economic development.”

The goal of this essay is not to relitigate these discussions, but to argue that in light of this uncertainty, efforts to run new randomized controlled trials of deworming would likely have a high expected value. The next section provides an argument for this position in general terms, and the section after introduces a simplified statistical model of the potential benefits of such studies.

# The Case for Additional Studies

The size of the uncertainty in deworming’s cost-effectiveness cannot easily be overstated. Assuming a normal distribution, the mean and standard error in Hamory et al. (2021) indicate that the approximate 95% confidence interval (the mean plus or minus two standard errors) for the impact of deworming on income is between -$257/year and +$427/year, or alternately between -12% and +20%. This uncertainty is further compounded by the fact that many of the regions in which Deworm the World and the SCI Foundation operate have __important differences__ relative to the region studied in Hamory et al. (2021), such as in the rates of parasitic disease infection or in other population health factors such as nutrition. In computing cost-effectiveness estimates, GiveWell attempts to __make adjustments for some of these factors__, but notes that its cost-effectiveness estimates are highly sensitive to assumptions on how these adjustments are made. These adjustments would therefore further increase the uncertainty in deworming’s cost-effectiveness.
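The interval quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming a normal approximation; the variable and function names are illustrative. Using two standard errors reproduces the -$257 to +$427 figures, while the stricter 1.96 multiplier gives a slightly narrower interval.

```python
# Income effect from Hamory et al. (2021), as quoted in this post (USD per year).
mean_income_gain = 85.0
standard_error = 171.0


def normal_interval(mean, se, z):
    """Return (lower, upper) bounds of mean +/- z standard errors."""
    return mean - z * se, mean + z * se


# Two standard errors: the rule of thumb that matches the figures in the text.
approx_low, approx_high = normal_interval(mean_income_gain, standard_error, 2.0)
# 1.96 standard errors: the usual two-sided 95% normal interval.
exact_low, exact_high = normal_interval(mean_income_gain, standard_error, 1.96)

print(f"mean +/- 2.00 SE: {approx_low:.0f} to {approx_high:.0f} USD/year")
print(f"mean +/- 1.96 SE: {exact_low:.0f} to {exact_high:.0f} USD/year")
```

Either way the interval comfortably spans zero, which is the point the paragraph relies on.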

A naive interpretation of the confidence intervals in Hamory et al. (2021) would indicate that the true impacts of deworming on economic outcomes are as likely to be higher than the nominal point estimates as they are to be lower. However, there are reasons to believe that the point estimates given in that paper are likely to be high. Two that are especially worth highlighting are publication bias and the lack of a clear causal mechanism for the economic effects. __Publication bias__ in the public health field manifests as a tendency for negative results to not be published. This means that the lack of studies showing zero or negative effects of deworming on economic outcomes does not mean that no such studies were conducted; they may have been conducted and not published. Moreover, it is possible that the Miguel and Kremer (2004) study would not have been published if it had not found a substantial positive effect. Given the existence of publication bias, it is expected that the average effect size seen in published studies is an overestimate of the true positive effect of an intervention. The __lack of a clear causal mechanism__ relates to the issue that overall data on the short-term effects of deworming do not indicate large enough changes in weight, cognition, and years of schooling to fully explain the economic effects seen in Hamory et al. (2021). This short-term data comes both from the Miguel and Kremer (2004) study (which showed some decreases in effect sizes when reanalyzed) and from other studies of deworming that only performed short-term follow-ups on participants. Deworming may still produce positive outcomes via harder-to-measure or less well understood effects than the ones discussed here. But the lack of a clear and currently-understood causal mechanism somewhat increases the likelihood that the economic effects of deworming seen in Hamory et al. (2021) are a statistical outlier rather than reflecting the true effectiveness.

To handle these issues, GiveWell applies an extremely large “__replicability adjustment__” that downweights the cost-effectiveness estimate produced from the data in Hamory et al. (2021) in order to arrive at its final cost-effectiveness estimate. Currently, GiveWell __multiplies the raw cost-effectiveness estimate by 0.13__ to produce the adjusted estimate, decreasing the effectiveness by a factor of roughly 7.7.^{[5]} This adjustment has two implications. First, it further highlights the large uncertainties involved in current estimates of deworming’s effectiveness. Second, the fact that GiveWell has already applied such a large negative correction greatly increases the likelihood that further studies would increase GiveWell’s cost-effectiveness estimate, rather than decreasing it.

Regardless of the direction, a substantial error in GiveWell’s current estimate of the cost-effectiveness of deworming would significantly reduce global well-being. If GiveWell’s current estimate of deworming’s cost-effectiveness is too high, then well-being could likely be substantially improved by shifting donations away from deworming to other cause areas. Given that GiveWell reported in __early 2022__ that “we don’t expect to have enough funding to support all the cost-effective opportunities we find,” money diverted to ineffective causes directly detracts from highly effective organizations. However, the opposite case would also be a very serious issue. If deworming is substantially more cost-effective than GiveWell currently estimates, then funding other initiatives at the expense of deworming is forgoing a major opportunity to do good.

Further randomized controlled trials of deworming would not perfectly identify the impact of deworming on life outcomes. Just like Miguel and Kremer (2004) and its follow-ups, future trials will likely have substantial uncertainties on any estimates. But if these trials were run well, they would provide important evidence on the effectiveness of deworming that could be aggregated with existing research to produce more accurate cost-effectiveness estimates. Since in expectation, a replication trial will drive GiveWell’s estimate of the cost-effectiveness of deworming closer to the truth, one can expect a replication trial to improve funding allocations in a way that improves overall well-being.

Of course, a replication study would carry costs of its own. However, since deworming is already occurring, the only costs of running a controlled trial would be the costs of administering the randomization, collecting the additional data for both the treatment and control groups, and analyzing the results. Including the costs of the deworming drugs, existing studies analyzing the short-term effects of deworming have had treatment and data collection costs of approximately __$1 per participant per year__. Therefore, a study that enrolled 10,000 participants (to have similar statistical power to the 1998 trial in Kenya) and followed them for 10 years would likely cost less than $0.5 million in total, even after accounting for the costs of data analysis. When compared to the $163 million GiveWell directed to deworming in the last decade, as well as GiveWell’s aim of __directing up to a billion dollars per year by 2025__, it seems clear that a further randomized controlled trial of deworming would be a cost-effective use of funding.

# Model

This section introduces a highly simplified model that compares the total well-being created by GiveWell-directed donations with and without a new RCT investigating the long-term effects of deworming. The model operates in terms of the “units of value” used in GiveWell’s cost effectiveness analysis. To give a sense of scale, __GiveWell values preventing the death of a child under 5 from malaria at 116 units of value and increasing the natural log of consumption of one person by one for one year at 1.44 units of value__.^{[6]} Working in these units, the model makes the following introductory assumptions:

- All dollars spent on non-deworming GiveWell top charities generate six times the units of value per dollar as donations to GiveDirectly. Denote this constant value *G* = __0.0034__ × 6 = 0.0204. This is based on GiveWell’s __2022 statement__ “we think we’ll be able to recommend up to approximately $750 million in grants that are at least 6x as cost-effective as cash transfers.” Since the overall room for funding in non-deworming causes is much greater than that of deworming, it is assumed that shifting funding towards or away from deworming does not change the effectiveness of donations to other causes.
- Denote the value generated by the first dollar of spending on deworming as *D*. The prior probability distribution (the probability distribution without a new RCT) for *D* is assumed to be normal: *D*_{1} ~ *N*(*μ*_{1}, *σ*_{1}^{2}). Its mean *μ*_{1} is taken from the __value per dollar estimates__ across all regions in which the SCI Foundation and Deworm the World operate (0.049 units of value). It has the same ratio of standard deviation to mean as in the Hamory et al. (2021) data on the consumption increase from deworming (giving a standard deviation *σ*_{1} of 0.032 units of value).
- The value per marginal dollar spent on deworming decreases linearly. The slope *s* at which the value per dollar decreases is estimated at -1.825 × 10^{-9} units per dollar for each additional dollar spent. This was computed by assuming that GiveWell’s directed funding was perfectly rational in 2020, such that under the linearly-decreasing value assumption, the last dollar spent by GiveWell on deworming in 2020 (GiveWell directed $15,699,622 to deworming that year) generated the same value as dollars spent on the average non-deworming intervention (6x the value per dollar generated by donations to GiveDirectly).
- This prior distribution is assumed to be well-calibrated, so a true number for the value of the first dollar spent on deworming, *D*_{true}, is generated by taking a random sample from the prior.
- The total amount of funding available per year is *F*_{T} = $224,500,413, which is the __amount of money GiveWell directed in 2020__ (the last year for which full information is available). Denote the portion of funding allocated to deworming *F*_{D} and the portion of funding allocated to general non-deworming causes *F*_{G}.
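The parameter values in the assumptions above can be cross-checked against each other. The following is a minimal sketch using only the rounded figures quoted in this post; the constant names are illustrative, and because the inputs are rounded the derived slope lands close to, but not exactly on, the quoted -1.825 × 10^{-9}.

```python
# Parameters quoted in the assumptions above (GiveWell "units of value").
GIVEDIRECTLY_VALUE_PER_DOLLAR = 0.0034
G = 6 * GIVEDIRECTLY_VALUE_PER_DOLLAR      # value per dollar, non-deworming causes
PRIOR_MEAN = 0.049                          # first-dollar value of deworming
# Same SD-to-mean ratio as the consumption estimate ($199/year, SE $130/year).
PRIOR_SD = PRIOR_MEAN * 130 / 199
DEWORMING_FUNDING_2020 = 15_699_622         # USD directed to deworming in 2020
TOTAL_FUNDING_2020 = 224_500_413            # USD directed in total in 2020

# Linear decline: the last deworming dollar is assumed to be worth exactly G,
# i.e. PRIOR_MEAN + slope * DEWORMING_FUNDING_2020 == G.
slope = (G - PRIOR_MEAN) / DEWORMING_FUNDING_2020

print(f"G        = {G:.4f} units per dollar")
print(f"prior SD = {PRIOR_SD:.3f} units per dollar")
print(f"slope    = {slope:.3e} units per dollar, per dollar spent")
```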

With these assumptions introduced, two scenarios are considered:^{[7]}

Scenario 1: No new trial of deworming.

- Without a new trial of deworming, the true cost-effectiveness number for deworming is unknown.
- The fraction of funding per year distributed to each cause stays the same as it currently is. Describing this split in terms of the model’s assumptions, the amount of funding distributed to deworming is in this case computed based on the prior distribution. The amount is such that the marginal dollar given to deworming is no more effective than the marginal dollar given to other causes. All other funding is given to non-deworming causes.

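$$F_D = \frac{G - \mu_1}{s} = \$15{,}699{,}622, \qquad F_G = F_T - F_D = \$208{,}800{,}791$$

Here *μ*_{1} denotes the prior mean of *D*, so *F*_{D} is the funding level at which *μ*_{1} + *s* *F*_{D} = *G*.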

Scenario 2: New trial of deworming.

- Another randomized controlled trial is run with the same statistical power as the trial referenced in Hamory et al. (2021). The mean first-dollar cost-effectiveness estimate produced by this RCT, *X*_{r}, is drawn from a normal distribution *X*_{r} ~ *N*(*D*_{true}, *σ*_{r}^{2}) with a mean equal to the true value of the first dollar and the same standard deviation as the prior probability distribution (*σ*_{r} = *σ*_{1}).
- A posterior probability distribution for the first-dollar cost-effectiveness of deworming is computed by performing a Bayesian update from the prior using the new evidence. Since the prior (with mean *μ*_{1} and standard deviation *σ*_{1}) and likelihood function are both normal, the posterior distribution is also normal with the values given below for *μ*_{2} and *σ*_{2}.^{[8]}

$$\mu_2 = \frac{\mu_1/\sigma_1^2 + X_r/\sigma_r^2}{1/\sigma_1^2 + 1/\sigma_r^2}, \qquad \sigma_2^2 = \frac{1}{1/\sigma_1^2 + 1/\sigma_r^2}$$

- The amount of funding distributed to deworming is computed based on the posterior distribution. The amount donated to deworming is once again computed such that the marginal dollar given to deworming is no more effective than the marginal dollar given to other causes (additional dollars spent on deworming are still assumed to obey the same linear decrease in effectiveness). If the posterior estimate for the value of the first dollar given to deworming is less than the value of each dollar given to non-deworming causes, no money is given to deworming. All funding not spent on deworming is spent on non-deworming causes.
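The update and allocation rule in Scenario 2 can be sketched as two small functions. This is an illustrative Python sketch, not the author's notebook code: the function names are made up here, the trial results in the loop are hypothetical, and capping the allocation at total funding is an added assumption.

```python
def posterior(prior_mean, prior_sd, trial_estimate, trial_sd):
    """Normal-normal conjugate update; returns (posterior mean, posterior SD)."""
    prior_precision = 1.0 / prior_sd ** 2
    trial_precision = 1.0 / trial_sd ** 2
    post_var = 1.0 / (prior_precision + trial_precision)
    post_mean = post_var * (prior_mean * prior_precision
                            + trial_estimate * trial_precision)
    return post_mean, post_var ** 0.5


def deworming_allocation(first_dollar_estimate, g, slope, total_funding):
    """Dollars to deworming such that the estimated marginal dollar is worth g."""
    if first_dollar_estimate <= g:
        return 0.0  # deworming looks no better than the alternative: fund none
    # Assumption: the allocation can never exceed the total budget.
    return min((g - first_dollar_estimate) / slope, total_funding)


G, SLOPE, TOTAL = 0.0204, -1.825e-9, 224_500_413
PRIOR_MEAN, PRIOR_SD = 0.049, 0.032

# Hypothetical trial results, chosen only for illustration.
for trial_estimate in (0.0, 0.049, 0.10):
    mean, sd = posterior(PRIOR_MEAN, PRIOR_SD, trial_estimate, PRIOR_SD)
    dollars = deworming_allocation(mean, G, SLOPE, TOTAL)
    print(f"trial={trial_estimate:.3f} -> posterior mean={mean:.4f}, "
          f"sd={sd:.4f}, deworming=${dollars:,.0f}")
```

Because the trial is assumed to have the same standard deviation as the prior, the posterior mean is simply the midpoint of the prior mean and the trial estimate.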

In both scenarios, the average value per dollar provided by deworming is computed by averaging the true value of the first dollar spent on deworming with the true value of the last dollar spent on deworming (since the value per dollar is modeled as decreasing linearly). Then the total value *V* created by GiveWell-directed donations is computed by multiplying the average value per dollar spent on deworming by the dollars spent on deworming and adding the average value per dollar spent on non-deworming interventions multiplied by the dollars spent on non-deworming interventions.
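Written out as a formula, the calculation described above takes the following form. The symbols are the ones the model uses for the true first-dollar value ($D_{true}$), the negative slope ($s$), the two allocations ($F_D$, $F_G$), and the non-deworming value per dollar ($G$); the simplification on the right is ordinary algebra rather than something stated in the post.

$$V = F_D \cdot \frac{D_{true} + \left(D_{true} + s F_D\right)}{2} + G\,F_G = F_D\left(D_{true} + \tfrac{1}{2}\, s\, F_D\right) + G\,F_G$$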

This model was run for one million randomized cases with these parameters, producing the distributions of value created per year shown in Figure 1. Note that the spike seen in the “With Replication Study” curve includes the cases in which the replication study leads GiveWell to stop funding deworming. The mean value created without the replication study was 4.81 million units/year, while the mean value created with the replication study was 4.94 million units/year. This difference in value created of approximately 130,000 units/year is equivalent to the value created by $6.6 million/year in donations to causes with 6x the effectiveness of GiveDirectly. This is a very large return for a study with a one-time cost of $0.5 million, and implies that such a study would be highly cost-effective.
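For readers who want to experiment with the model, the two scenarios can be re-implemented in a few dozen lines. The sketch below is a Python re-implementation based only on the description in this post, not the author's R notebook. It uses the rounded parameters quoted above, a smaller number of draws, and an added cap of the allocation at total funding, so its output should land near the reported figures without matching them exactly.

```python
import random

G = 0.0204                        # value per dollar, non-deworming (6x GiveDirectly)
PRIOR_MEAN = 0.049                # prior mean, first-dollar value of deworming
PRIOR_SD = 0.032                  # prior SD; the trial is assumed to share it
SLOPE = -1.825e-9                 # change in value per dollar for each dollar spent
TOTAL = 224_500_413               # total funding per year (USD)
BASELINE_DEWORMING = 15_699_622   # Scenario 1 deworming funding (USD)


def total_value(d_true, deworming_dollars):
    """Units of value created, given the true first-dollar value and allocation."""
    # Average of the first-dollar and last-dollar values under linear decline.
    avg_deworming_value = d_true + 0.5 * SLOPE * deworming_dollars
    return (deworming_dollars * avg_deworming_value
            + G * (TOTAL - deworming_dollars))


def allocation(first_dollar_estimate):
    """Fund deworming until the estimated marginal dollar is worth G."""
    if first_dollar_estimate <= G:
        return 0.0
    return min((G - first_dollar_estimate) / SLOPE, TOTAL)


def simulate(n_draws, seed=0):
    """Return mean yearly value (without study, with study) over n_draws."""
    rng = random.Random(seed)
    without_study = with_study = 0.0
    for _ in range(n_draws):
        d_true = rng.gauss(PRIOR_MEAN, PRIOR_SD)   # truth sampled from the prior
        trial = rng.gauss(d_true, PRIOR_SD)        # noisy replication result
        post_mean = 0.5 * (PRIOR_MEAN + trial)     # equal-variance normal update
        without_study += total_value(d_true, BASELINE_DEWORMING)
        with_study += total_value(d_true, allocation(post_mean))
    return without_study / n_draws, with_study / n_draws


mean_without, mean_with = simulate(100_000)
gain = mean_with - mean_without
print(f"without study: {mean_without:,.0f} units/year")
print(f"with study:    {mean_with:,.0f} units/year")
print(f"gain:          {gain:,.0f} units/year "
      f"(~${gain / G:,.0f}/year at 6x GiveDirectly)")
```

Both scenarios are evaluated on the same random draws, which keeps the estimated difference between them much less noisy than either mean on its own.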

# Robustness

The model introduced in the previous section makes a number of assumptions and simplifications, which this section discusses in more detail. To start off, one of the more significant factors not addressed in the previous section is that a new deworming trial will not deliver significant information on long-run effects until sufficient time has passed to measure them. The first time at which a follow-up study’s results would likely be helpful in determining deworming’s cost-effectiveness is when 5-year follow-up data could be measured to compare to Miguel and Kremer (2004). This means that the value generated by a replication study would not be realized for several years, requiring some adjustment for temporal discounting. However, given that the cost of a study is likely to be under half a million dollars and the benefits are in the millions of dollars per year, the expected value will be positive under any reasonable choice of discount rate.
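The discounting point can be made concrete with a small net-present-value calculation. This sketch takes the $6.6 million/year and $0.5 million figures from the model section; the ten-year benefit horizon, the assumption that the full cost is paid upfront, and the particular discount rates are hypothetical choices made only for illustration.

```python
def net_present_value(annual_benefit, first_benefit_year, benefit_years,
                      upfront_cost, rate):
    """Discounted stream of yearly benefits minus a cost paid at year 0."""
    discounted = sum(
        annual_benefit / (1.0 + rate) ** year
        for year in range(first_benefit_year, first_benefit_year + benefit_years)
    )
    return discounted - upfront_cost


ANNUAL_BENEFIT = 6.6e6   # USD/year equivalent, from the model section
STUDY_COST = 0.5e6       # USD, treated here as a single upfront payment
FIRST_YEAR = 5           # first year with usable follow-up data
BENEFIT_YEARS = 10       # hypothetical horizon, not taken from the post

npv_by_rate = {
    rate: net_present_value(ANNUAL_BENEFIT, FIRST_YEAR, BENEFIT_YEARS,
                            STUDY_COST, rate)
    for rate in (0.03, 0.10, 0.20)   # hypothetical discount rates
}
for rate, npv in npv_by_rate.items():
    print(f"discount rate {rate:.0%}: NPV = ${npv:,.0f}")
```

Even the 20% rate, which is far higher than rates typically used for this kind of analysis, leaves the net present value positive under these assumptions.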

A further aspect of the model that is especially simplified is the assumption that deworming’s cost-effectiveness decreases linearly as spending on deworming increases. GiveWell __typically models__ decreasing returns to scale in terms of its “room for more funding” metric, which is similar to modeling the cost-effectiveness of donations to an organization as constant up until some funding level, at which point it becomes zero. GiveWell’s modeling approach is reasonable when considering a single organization, since a single nonprofit is likely to face bottlenecks on its expansion due to non-monetary factors such as the time required to onboard new employees. However, the approach taken by GiveWell is unlikely to be the best way to model effectiveness of an overall cause area, since spending on a cause can also be increased by adding new organizations to GiveWell’s top charities list (for example, both __Partners in Health__ and __Doctors Without Borders__ distribute deworming medication in some cases, but neither is a GiveWell top charity). Funding given to organizations that are not current top charities is likely to be less effective per dollar, but it should not be valued at zero. The exact shape of the relationship between funding level and marginal cost-effectiveness is unknown, but modeling it as linear is a common approach for this type of simplified modeling.

A third aspect of the model that is highly simplified is the assumption that GiveWell-directed donations are perfectly optimized for effectiveness. This is a simplification primarily because many of the donations directed by GiveWell are made by individual donors rather than through the Maximum Impact Fund. This assumption particularly impacts the estimate of the slope for deworming’s decreasing cost-effectiveness as funding increases. Given that GiveWell did __not fully fill deworming organizations’ room for additional funding in 2020__, it is likely that if GiveWell’s current cost-effectiveness model is assumed to be correct, the last dollar spent on deworming in 2020 was more than 6x as effective as donations to GiveDirectly. Therefore, the slope used in the model is likely an overestimate of the rate of decrease in cost-effectiveness per dollar. By manually varying the slope parameter, it was found that a steeper slope decreases the estimate of the value produced by the replication study, and vice versa, so an overestimate of the slope’s steepness would lead to an underestimate of the study’s value. Additionally, in modeling the impact of new information, it is possible that non-optimal funding allocations in the future will reduce the value of gaining new information. However, GiveWell does try to optimize its donations from the Maximum Impact Fund, and informed donors are also likely to shift their donations in response to new information. Modeling this behavior as perfect optimization is likely sufficient for this kind of simplified model.

The results of the model are also impacted by the choices of numerical parameters used, but the general conclusion of a replication study delivering over $1m/year in equivalent donation value is robust to reasonable variation of the parameters. Increasing the uncertainty in the current estimate of deworming’s cost-effectiveness increases the expected value of a replication study, while decreasing the uncertainty reduces it. However, the uncertainty would need to be reduced by a factor of approximately 2.5 before the expected value of a replication study falls below $1m/year. A reduction of this size seems unreasonable given that it would remove zero from the 95% confidence interval of the value per dollar of deworming, and as discussed in the introduction, there are a substantial number of researchers who think the true effectiveness is roughly zero. Changes in the mean estimate for the first-dollar value of deworming do not change the general conclusion unless the prior estimate is reduced to be very close to or below the 6x GiveDirectly value used for the value per dollar of non-deworming interventions. Similarly, changing the estimate used for the value of non-deworming interventions does not change the general conclusion unless it is increased above the mean estimate of the first-dollar value of deworming. Finally, a significantly steeper slope for the declining marginal value of donations spent on deworming would reduce the expected value of new information, but approximately a 6x increase in slope would be required before the expected value of a replication study falls below $1m/year. Moreover, as discussed above, the model assumptions likely lead to an overestimate of the slope’s steepness, not an underestimate.
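These sensitivity claims can be spot-checked without a simulation. The sketch below uses a closed-form shortcut that is not from the post: it assumes the trial has the same standard deviation as the prior, that the baseline allocation is optimal under the prior, and that the total-funding cap never binds. Under those assumptions the expected gain depends only on the second moment of the positive part of a normal variable. The function and constant names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def expected_gain_dollars(prior_mean, prior_sd, g, slope):
    """Expected yearly gain from one replication, in 6x-GiveDirectly dollars."""
    m = prior_mean - g                  # prior edge of deworming over alternatives
    tau = prior_sd / math.sqrt(2.0)     # SD of the posterior mean across trials
    z = m / tau
    # E[max(Y, 0)^2] for Y ~ Normal(m, tau^2)
    second_moment = (m * m + tau * tau) * normal_cdf(z) + m * tau * normal_pdf(z)
    gain_units = (second_moment - m * m) / (2.0 * abs(slope))
    return gain_units / g


G, MEAN, SD, SLOPE = 0.0204, 0.049, 0.032, -1.825e-9
baseline = expected_gain_dollars(MEAN, SD, G, SLOPE)
less_uncertain = expected_gain_dollars(MEAN, SD / 2.5, G, SLOPE)
steeper = expected_gain_dollars(MEAN, SD, G, 6 * SLOPE)

print(f"baseline:          ${baseline:,.0f}/year")
print(f"SD divided by 2.5: ${less_uncertain:,.0f}/year")
print(f"slope times 6:     ${steeper:,.0f}/year")
```

Under these assumptions the baseline comes out in the same range as the simulated figure, and both stress cases land close to the $1m/year threshold discussed above.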

# Discussion

GiveWell has funded deworming programs for over a decade despite significant uncertainty in the cost-effectiveness of deworming. In that time, GiveWell’s researchers have done their best to extrapolate cost-effectiveness estimates from a single 1998 study in Kenya to the wide variety of regions in which GiveWell top charities are now running deworming programs. However, they have not obtained data from additional trials on deworming's long-run impact. Given the challenging tradeoffs GiveWell’s staff are forced to make in allocating funding across a wide range of highly effective charitable organizations, more accurate information on cost-effectiveness would likely deliver a high expected value by enabling more optimal funding allocation.

It is possible that the reason GiveWell or Open Philanthropy has not funded this kind of replication study is that they are aware of trials on the long-run economic impacts of deworming that are already occurring. However, searches of the __American Economic Association's RCT Registry__ and __ClinicalTrials.gov__ do not turn up any currently active clinical trials of deworming with preregistered hypotheses related to long-term income or consumption. There are, however, several active clinical trials studying the efficacy of differing deworming methods in eliminating parasitic infections and in creating short-run improvements in health and school attendance. It is possible that EA organizations could work with the investigators in these trials to add plans for long-run follow-ups studying economic impacts.

Based on the simplified model discussed in the previous section, the expected return to a replication study on the long-run economic effects of deworming with similar statistical power to the 1998 study analyzed in Miguel and Kremer (2004) and its follow-ups is equivalent to the value of millions of dollars in additional funding per year. Given this high estimate, GiveWell or other EA-affiliated nonprofits working in the global health and development space should strongly consider running such a follow-up study to maximize their impact. Moreover, they may have lost an opportunity for maximizing their positive impact by not running such a study earlier. Had an EA-aligned organization decided to run a replication study during the first iteration of the “worm wars,” 5-year follow-up data would already be available for analysis. A 5-year follow-up would be enough to compare to Miguel and Kremer (2004), which would give GiveWell and other donors much better information to go on than is currently available.

The fact that no EA organization has tried to replicate Miguel and Kremer’s work is potentially suggestive of a more general blind spot in EA thinking. EA organizations may be overall putting too low a value on information relative to action, and should consider whether other kinds of replication studies would also deliver significant returns. For example, similar types of replication trials could help resolve the contradictions between different studies' efficacy estimates for __vitamin A supplementation__. Targeted investment in this kind of replication research might deliver benefits even beyond the global health and development space, such as in evaluating the efficacy of interventions to improve animal welfare. Truly doing the most possible good is likely to require active and continued investment in replicating published results, rather than accepting the state of evidence as is.

^{[1]} Information on how GiveWell measures donations that were influenced by its recommendations is in the appendix to GiveWell’s __Metrics Report__.

^{[2]} See line 106 of the cost-effectiveness analysis spreadsheet for the source of this 7% factor.

^{[3]} See line 7 of the cost-effectiveness analysis spreadsheet for where the average of these estimates is incorporated into the overall model.

^{[4]} See table S3 of the appendix to the paper.

^{[5]} See line 11 of the cost-effectiveness model.

^{[6]} See line 116 for the value of saving a life and line 127 for the value of increasing consumption.

^{[7]} The model code is available as an R Jupyter notebook.

^{[8]} See e.g. __this paper__ for a derivation of these formulas.

## 8 comments


## comment by Alexander_Berger · 2022-08-07T23:12:29.377Z

Hi MHR,

I really appreciate substantive posts like this, thanks!

This response is just speaking for myself, doing rough math on the weekend that I haven't run by anyone else. Someone (e.g., from @GiveWell) should correct me if I'm wrong, but I think you're vastly understating the difficulty and cost of running an informative replication given the situation on deworming. **(My math below seems intuitively too pessimistic, so I welcome corrections!)**

If you look at slide 58 here, you get that the minimum detectable effect (MDE) size with 80% power can be approximated as 2.8*the standard error (which is itself effectively inversely proportional to the square root of the sample size).

I didn't check the original sources, but this GiveWell doc on their deworming replicability adjustment implies that the standard error for log(income/consumption) in the most recent replications is ~.066 (on a "main effect" of .109). The original RCT involved 75 schools, and according to figure A1 here the followup KLPS 4 involved surveying 4,135 participants in the original trial. GiveWell's most recent cost-effectiveness analysis for Deworm the World makes 2 key adjustments to the main effect from the RCT:

- A replicability adjustment of .13 (row 11)
- A geography-specific adjustment for worm burden which averages about .12 (row 40) (this is because worm burdens are now much lower than they were at the time of MK)

Together, these adjustments imply that GiveWell projects the per-capita benefit to the people dewormed to be just .13*.12=1.56% of the .109 impact on log income in the late followups to the original Miguel and Kremer RCT. So if we wanted to detect the effect GiveWell expects to see in mass deworming, we'd have an MDE of ~.0017 on log income, which with 80% power and the formula above (MDE=2.8*standard error) implies we'd need the standard error to be .0017/2.8=~.00061 log points. **So a well-powered study to get the effect GiveWell expects would need a standard error roughly 108 times smaller than the standard error (.066) GiveWell calculates on the actual followup RCTs**.

But because standard errors are inversely proportional to the square root of sample size, if you used the same study design, **getting a 108x smaller standard error would require a 108*108=11,664 times larger sample**. I think that might imply a sample size of ~all the elementary schools in India (11,664*75=874K), which would presumably include many schools that do not in fact actually have significant worm burdens.
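The chain of arithmetic in this comment can be reproduced directly. The sketch below uses only the figures quoted in the comment, with illustrative constant names. It carries full precision through each step, so the results differ slightly from the comment's 108x and 11,664x, which come from rounding intermediate values.

```python
MAIN_EFFECT = 0.109         # effect on log income/consumption in the follow-ups
FOLLOWUP_SE = 0.066         # standard error implied by GiveWell's document
REPLICABILITY_ADJ = 0.13    # GiveWell replicability adjustment
WORM_BURDEN_ADJ = 0.12      # average geography-specific worm burden adjustment
MDE_MULTIPLIER = 2.8        # MDE ~= 2.8 * SE (about 1.96 + 0.84) at 80% power
ORIGINAL_SCHOOLS = 75       # schools in the original RCT

# Effect GiveWell expects in current programs, on the log-income scale.
expected_effect = MAIN_EFFECT * REPLICABILITY_ADJ * WORM_BURDEN_ADJ
# Standard error needed for that effect to equal the MDE.
required_se = expected_effect / MDE_MULTIPLIER
se_ratio = FOLLOWUP_SE / required_se
# Standard errors shrink with the square root of the sample size.
sample_multiplier = se_ratio ** 2
schools_needed = sample_multiplier * ORIGINAL_SCHOOLS

print(f"expected effect:    {expected_effect:.5f} log points")
print(f"required SE:        {required_se:.5f} log points")
print(f"SE must shrink by:  {se_ratio:.1f}x")
print(f"sample must grow:   {sample_multiplier:,.0f}x")
print(f"schools needed:     {schools_needed:,.0f}")
```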

If the original MK study and one followup cost $1M (which I think is the right order of magnitude but may be too high or too low), **this implies that a followup powered to find the effect GiveWell expects would cost many billions of dollars**. And of course it would take well over a decade to get the long term followup results here. (**That said, it wouldn't surprise me if I'm getting the math wrong here - someone please flag if so!**)

I'm sure there are better study designs than the one I'm implicitly modeling here that could generate more power, or places where worm burdens are still high enough to make this somewhat more economical, but I'm skeptical they can overcome the fundamental difficulty of detecting small effects in cluster RCTs.

I think a totally reasonable reaction to this is to be more skeptical of small cheap interventions, because they're so hard to study and it's so easy to end up driven by your priors.

Replies from: MHR, Falk Lieder

## ↑ comment by MHR · 2022-08-11T19:09:52.016Z · EA(p) · GW(p)

Thanks so much for taking the time to read the post and for really engaging with it. I very much appreciate your comment and I think there are some really good points in it. But based on my understanding of what you wrote, I’m not sure I currently agree with your conclusion. In particular, I think that looking in terms of minimum detectable effect can be a helpful shorthand, but it might be misleading more than it’s helping in this case. We don’t really care about getting statistical significance at p <0.05 in a replication, especially given that the primary effects seen in Hamory et al. (2021) weren’t significant at that level. Rather, we care about the magnitude of the update we’d make in response to new trial data.

To give a sense of why that’s so different, I want to start off with an oversimplified example. Consider two well-calibrated normal priors, one with a mean effect of 10 and standard deviation of 0.5, and one with a mean effect of 0.2 and the same standard deviation. By the simplified MDE criterion, a trial with a standard error of 3.5 would be required to detect the effect at p <0.05 80% of the time in the first case and a trial with a standard error of 0.07 would be required to detect the effect at p <0.05 80% of the time in the second case. But we would update our estimate of the mean by the same amount in the second case as in the first case if new trial data came in with a certain standard error and difference between its mean estimate and our prior mean. (The situation for deworming is more complex because the prior distribution is probably truncated at around zero. But I think the basic concept still holds, in that the sample size required to keep the same value of new information wouldn’t grow as fast as the sample size required to keep the same statistical power.)
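The oversimplified example above can be sketched in a few lines of Python using the standard normal-normal conjugate update. The function name, the trial standard error of 0.5, and the 0.3 gap between the trial estimate and the prior mean are hypothetical values chosen for illustration. The sketch ignores the truncation at zero mentioned above, so it illustrates the basic concept rather than modeling deworming.

```python
def posterior_mean(prior_mean, prior_sd, trial_estimate, trial_se):
    """Posterior mean for a normal prior and a normal likelihood with known SE."""
    prior_precision = 1.0 / prior_sd ** 2
    trial_precision = 1.0 / trial_se ** 2
    weighted_sum = prior_precision * prior_mean + trial_precision * trial_estimate
    return weighted_sum / (prior_precision + trial_precision)


PRIOR_SD = 0.5   # same in both cases, as in the example above
TRIAL_SE = 0.5   # hypothetical trial precision
GAP = 0.3        # hypothetical gap between trial estimate and prior mean

for prior_mean in (10.0, 0.2):
    post = posterior_mean(prior_mean, PRIOR_SD, prior_mean + GAP, TRIAL_SE)
    shift = post - prior_mean
    mde_se = prior_mean / 2.8  # SE needed by the simplified MDE criterion
    print(f"prior mean {prior_mean:>4}: update = {shift:.3f}, MDE-required SE = {mde_se:.3f}")
```

The size of the update depends only on the prior standard deviation, the trial standard error, and the gap between the two estimates. The prior mean drops out, even though the standard error required by the MDE criterion differs by a factor of 50 between the two cases.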

Therefore, I don’t think the required sample size is likely to be nearly as big as you estimated in order to get a valuable update to GiveWell’s current cost-effectiveness estimate. However, your point is clearly correct in that the sample size will need to increase to handle the worm burden effect. That was something I hadn’t thought about in the original post, so I really appreciate you bringing it up in your comment. According to GiveWell, the highest-worm-burden regions in which Deworm the World operates (Kenya and Ogun State, Nigeria) have a worm burden adjustment of 20.5%. A replication trial would likely need to be substantially larger to account for that lower burden, but I don’t think that increase would be prohibitively large.

Regarding the replicability adjustment, I’m not sure it implies that a larger sample size would be needed to make a substantial update based on new trial data (separate from the larger sample needed to handle the worm burden effect). The replicability adjustment was arrived at by starting with a prior based on short-term effect data and performing a Bayesian update based on the Miguel and Kremer followup results. If the follow-up study has the same statistical power as M&K, then the two can be pooled to make the update and they should be given equal weight.
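As a rough illustration of the equal-weight point, here is a sketch of fixed-effect inverse-variance pooling. This is a stand-in I'm using for illustration, not GiveWell's actual procedure. The function name is made up, the 0.05 replication estimate is hypothetical, and the 0.109 estimate and 0.066 standard error are the figures quoted earlier in this thread. It also ignores the worm burden difference between the two settings.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted pooling of independent study estimates."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    normalized_weights = [w / total_weight for w in weights]
    return pooled, pooled_se, normalized_weights


# M&K followup estimate plus a hypothetical replication with the same SE.
pooled, pooled_se, weights = pool_fixed_effect([0.109, 0.05], [0.066, 0.066])
print(f"weights:   {weights}")
print(f"pooled:    {pooled:.4f}")
print(f"pooled SE: {pooled_se:.4f}")
```

With equal standard errors each study gets half the weight, and the pooled standard error shrinks by a factor of √2 relative to either study alone.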

Thinking about it qualitatively, if a replication trial showed a similar or greater effect size than Hamory et al. (2021) after accounting for the difference in worm burden, I would think that would imply a strong update away from GiveWell’s current replicability adjustment of 0.13. In fact, it might even suggest that deworming worked via an alternate mechanism than the ones considered in the analysis underlying GiveWell’s adjustment. On the flip side, I don’t think that GiveWell would be recommending deworming if the Miguel and Kremer follow-ups had found a point estimate of zero for the relevant effect sizes (the entire cost-effectiveness model starts with the Hamory et al. numbers and adjusts them). So if a replication study came in with a negative point estimate for the effect size, GiveWell should probably update noticeably towards zero.

Zooming out, I think that information on deworming’s effectiveness in the presence of current worm burdens and health conditions would be very valuable. GiveWell has done an admirable job of trying to extrapolate from the Miguel and Kremer trial and its follow-ups to a bunch of extremely different environments, but they’re changing the point estimate by a factor of ~66 in doing so. To me, that implies that there’s really tremendous uncertainty here, and that even imperfect evidence in the current environment would be very useful. Since deworming is so cheap, I’m particularly worried about the case where it’s noticeably more effective than GiveWell is currently estimating, in which case EA donors would be leaving a big opportunity to do good on the table.

Thank you again for taking the time to read the post!

Replies from: Alexander_Berger

## ↑ comment by Alexander_Berger · 2022-08-14T22:11:07.565Z · EA(p) · GW(p)

Thanks MHR. I agree that one shouldn't need to insist on statistical significance, but if GiveWell thinks that the actual expected effect is ~12% of the MK result, then I think if you're updating on a similarly-to-MK-powered trial, you're almost to the point of updating on a coinflip because of how underpowered you are to detect the expected effect.

I agree it would be useful to do this in a more formal Bayesian framework which accurately characterizes the GW priors. It wouldn't surprise me if one of the conclusions was that I'm misinterpreting GiveWell's current views, or that it's hard to articulate a formal prior that gets you from the MK results to GiveWell's current views.

## ↑ comment by Falk Lieder · 2022-08-13T08:48:39.844Z · EA(p) · GW(p)

I think your estimate of how costly it would be to run a replication study is too pessimistic. In addition to the issues that MHR identified, it strikes me as unrealistic that the cost of rerunning the data collection would be more than 10,000 times as high as the cost of the original research project. I think this is highly unlikely because data collection usually accounts for at most 10% of the cost of research. Moreover, the cost of data collection does not scale linearly with the number of participants, but with the number of researchers who are paid to coordinate data collection. The most difficult parts of organizing data collection, such as developing the strategy and establishing contact with relevant high-ranking officials, only have to be done once. There are also economies of scale: once you can collect data from 1 school, it takes very little effort to replicate the process with 100 or 1,000 schools, and that work can then be done by local volunteers with minimal training for minimal pay or free of charge. It certainly won't require 10,000 times as many professors, postdocs, and graduate students as the original study, and it is almost exclusively the salaries of those people that make research expensive. On the contrary, collecting more data for an already designed study with an existing data analysis pipeline requires minimal work from the scientists themselves, which makes it much less expensive. Therefore, I think that the cost of data collection was probably only 10% of the cost of the research project and scales only logarithmically with the sample size. Based on that line of reasoning, I believe that the replication study could be conducted for one or a few million dollars.

## comment by Joseph Lemien (jlemien) · 2022-08-04T23:31:41.968Z · EA(p) · GW(p)

Although I have followed the writings of Chris Blattman and Esther Duflo and similar "randomistas" on and off for several years, I don't know much about deworming. Nonetheless, my general bias leans in favor of having multiple replications of important findings. We are in the 2020s, but it is still common for core research that is widely cited and accepted to have only been looked into once. I'd be very happy to see a fund or an organization that focuses on creating multiple replications of "cruxy" research.

Replies from: MHR

## ↑ comment by MHR · 2022-08-04T23:49:00.411Z · EA(p) · GW(p)

Yeah that's a cool idea to have an org that specifically focuses on replication work. I think that if you fleshed out the modeling done here, you could pretty confidently show funders that it would be a cost-effective use of money to do this more widely.

## comment by Otis Reid · 2022-08-08T02:11:52.369Z · EA(p) · GW(p)

Enjoyed the post! Open Philanthropy hired Tom Adamczewski (not sure what his user name is on here) to create https://valueofinfo.com/ for us, which is designed to do this sort of calculation and accommodates a limited number of non-normal distributions (currently just Pareto and normal, though if you work with the core package you can do more). Just an FYI if you want to play more with this!

## comment by MHR · 2022-08-06T14:28:31.318Z · EA(p) · GW(p)

GiveWell's 2021 metrics report is out! Funding distributed to deworming increased greatly last year, from $15,699,622 to $44,124,942. Rerunning the model with the higher 2021 funding levels, the mean estimate of the value created by a replication study increases to approximately 370,000 GiveWell value units per year. This is equivalent to the value created by $18.5 million/year in donations to organizations with a cost-effectiveness of 6x GiveDirectly's.

In general, as EA-related organizations distribute more money per year, the value of information is naturally going to rise. So this kind of replication work will only get more important.