How do EA Orgs Account for Uncertainty in their Analysis?
by Peter Wildeford (Peter_Hurford) · 2017-04-05T16:48:45.220Z
This essay was jointly written by Peter Hurford, Kathryn Mecrow, and Simon Beard.
Effective altruism is about figuring out how to do the most good. When working with limited resources, whether financial or otherwise, and faced with opportunity costs, how do we measure the good done by a particular program? To use examples from current projects: how good is it to give someone a bednet, hand out a pro-vegetarian leaflet, or do an hour of AI safety research? This article considers the methods that EA and EA-associated organizations use to deal with uncertainty in their cost-effectiveness analyses. To address this, we conducted a literature review using resources on the websites of EA orgs and reached out to the organizations for guidance.
When striving to answer these questions, we can model various parameters. For example, GiveWell mentions that the benefit of a deworming program depends on a number of factors that each carry their own uncertainties. Does deworming have short-term health benefits? Is it clear that deworming has long-term positive health or social impacts? How do we evaluate evidence from randomized controlled trials that differ from modern-day deworming programs in a number of important ways? How do we proceed if some analysis suggests that deworming programs are more likely than not to have very little or no impact, but there is some possibility that deworming has a very large impact? In combining these models, we can use ranges or intervals to express our uncertainty and aggregate the various inputs into a total estimate of impact.
How do we know that we have all the relevant parameters, that we have aggregated correctly, accurately assessed our uncertainty in each parameter, and made no errors? Given the emphasis of the EA movement on robustness of evidence, how do we ensure that we actually do the most good and retain the credibility of our assessments in the face of poorly researched areas, uncertain future impact scenarios, and multitudes of known and unknown program-specific factors?
For example, VillageReach was GiveWell’s top charity for three years (2009-2011) with an estimated "$200-$1,000 per life saved". However, a re-analysis in 2012 found that a significant amount of missing data had not been considered in the original review of VillageReach. This data could have a large effect on the endline conclusion about VillageReach’s cost-effectiveness. Additionally, GiveWell later discovered potential alternative explanations for some of VillageReach’s impact, further reducing confidence in the cost-effectiveness figure. Together, the missing data and alternative explanations potentially represent a large source of “model uncertainty” in the initial estimate.
This dilemma is not confined to the EA movement. As a now-famous example, Sam Wang at Princeton gave Clinton a 99% chance of winning the 2016 US presidential election. While Wang expressed his uncertainty, there were systematic errors in the model that made it more confident than was warranted. Nate Silver argues in The Signal and the Noise that a similar problem with correlated errors contributed to a lot of model uncertainty in financial forecasting, which was partially responsible for the 2008 recession.
Key points from EA orgs
Clearly describing sources of uncertainty is useful for readers. It rewards charities that fully disclose their successes and, especially, their failures in program implementation, which in turn makes good implementation of interventions more likely. It also makes it easier for the reader to “sanity check” the estimate and understand to what degree and under what circumstances the estimate might accurately represent future results.
Cost-effectiveness estimates frequently need to include value judgements. If the cost-effectiveness of an intervention depends upon a particular value judgement, e.g., how much one values saving the life of a child under age 5 versus improving the income of a 40-year-old adult, this should be acknowledged as a further source of uncertainty, and attempts should be made to calculate the implications of making different possible value judgements instead (i.e., we should do sensitivity analyses). When viewing a cost-effectiveness estimate, one should make sure that it reflects one’s own values or update the estimate so that it does.
Data may be inherently biased, for example through confirmation bias in the assessment of cause areas or in information provided by charities. Unacknowledged biases are a key driver of overconfidence.
Cost-effectiveness of the implementation of a program should take into consideration variation in the contexts of implementation. Different organizations run the same program with varying levels of effectiveness, in a diversity of implementation contexts.
A cost-effectiveness estimate should take into consideration all data. For example, it may be helpful to use a Bayesian framework where the cost-effectiveness estimate is used to perform a Bayesian update on a prior distribution of expected cost-effectiveness. One should also strive to gather data from a variety of sources, when applicable, such as randomized controlled trials.
Performing a cost-effectiveness analysis alone (as is typically done) may not be enough to adequately assess a program’s effectiveness. Your confidence in a cost-effectiveness estimate should depend on the quality of the work that has gone into it -- some cost-effectiveness estimates are more robust than others. As GiveWell and many EA orgs state, cost-effectiveness estimates are frequently oversimplified, overly sensitive, and unverified. Other criteria may be essential in building a more informed understanding of an intervention’s effectiveness.
Transparency is key. Not revealing in detail how the estimate was made can obscure errors for a long time. Organizations should strive to mention and elaborate on the sources of uncertainty in their models, as well as make the full details of their calculations public. A cost-effectiveness analysis should take into consideration uncertainty and be transparent not only about how its model attempts to deal with it but also that a degree of uncertainty is a reality of programs operating in under-researched or developing areas. People running programs should therefore be prepared to change their estimates based on new evidence without prejudice.
Emphasizing comparison against a threshold, rather than absolute estimates, may be less likely to confuse readers unintentionally. It is a lot easier to accurately assess whether a value exceeds a threshold than to assess the value precisely, which favors threshold approaches that try to identify “top charities” but do not rank those charities any further. This may go against the virtue of transparency, as absolute scores convey more about the estimate even if they are less robust. However, that transparency could unintentionally lead people to think that an estimate that scores higher is automatically better.
Consider information quality. When information quality is good, you should focus on quantifying your different options; when it isn’t, you should focus on raising information quality.
It can be useful to aggregate the uncertainty levels of independent estimates. There are tools that make this easier, like Guesstimate.
One shouldn’t treat all areas of uncertainty as equal, or be overly concerned about the mean or median possible outcome. When risks and benefits are not normally distributed there may be good reason to care disproportionately about the best or worst possible outcomes. It may also be important to consider issues such as fairness and the temporal distribution of costs and benefits across different outcomes as well as their overall cost effectiveness.
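The sensitivity analysis suggested in the point on value judgements above can be sketched in a few lines. This is a minimal illustration, not any organization's actual model: the two hypothetical programs, their outputs per $1,000, and the candidate moral weights are all invented for the example.

```python
# Hypothetical outputs per $1,000 donated (illustrative numbers only).
PROGRAMS = {
    "program_a": {"child_lives_saved": 0.30, "income_doublings": 0.0},
    "program_b": {"child_lives_saved": 0.0, "income_doublings": 12.0},
}


def value_per_1000(outputs, life_weight):
    """Total value in 'income-doubling equivalents' for a given moral weight.

    life_weight is the value judgement: how many years of doubled income
    one considers as good as saving the life of a child under 5.
    """
    return outputs["child_lives_saved"] * life_weight + outputs["income_doublings"]


def preferred_program(life_weight):
    """Return the name of the program with the higher value under this weight."""
    return max(PROGRAMS, key=lambda name: value_per_1000(PROGRAMS[name], life_weight))


# Sweep the value judgement and report which program comes out ahead.
for weight in (10, 25, 50, 100):
    print(weight, preferred_program(weight))
```

With these made-up inputs the ranking flips at a weight of 40, so a reader whose own weight is near that value should treat the comparison as unsettled rather than read either estimate literally.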
GiveWell
It’s hard to find an organization, in effective altruism or otherwise, that has invested as much time in thinking about and articulating thoughts on uncertainty as GiveWell. While organizations like UNICEF suggest they can save a life with $1 per dose and Nothing But Nets argues it can save a life with $10 per bednet, GiveWell’s page on cost-effectiveness argues that such figures are misleading, and GiveWell is not inclined to take these estimates literally.
Shortcomings with Cost-Effectiveness Estimates
As early as 2008, GiveWell pointed out that while they find the idea of a cost-effectiveness estimate to be an attractive way to compare causes, these estimates have shortcomings, such as incorporating value judgements that may differ dramatically between different people. In 2010, GiveWell elaborated on more shortcomings:
Frequently, cost-effectiveness estimates like DALYs or QALYs are point-value estimates that do not properly express the range of uncertainty involved.
These estimates often are based on ideal versions of the program and don’t take into account possible errors or the idea that efficacy may decline over time.
Estimates of particular interventions (e.g., deworming) do not take into account the large amount of variation that comes from different organizations implementing the same program in different contexts with different abilities.
Lastly, these estimates frequently ignore indirect effects.
The complexity of these estimates provides a large opportunity for model error. For example, in 2011, GiveWell found five separate errors in a DCP2 DALY figure for deworming that, combined, made the estimate off by a factor of 100.
GiveWell summarizes these problems into three core issues: cost-effectiveness estimates are frequently oversimplified (ignoring important indirect and long-term effects), overly sensitive (with small changes in assumptions producing big changes in value), and unverified (with errors persisting unnoticed for years).
Cluster Thinking and Extreme Model Uncertainty
These discussions led Holden Karnofsky to articulate a mathematical framework for assessing cost-effectiveness estimates that takes into account their level of rigor in "Why We Can’t Take Expected Value Estimates Literally", "Maximizing Cost-Effectiveness via Critical Inquiry", and "Cluster Thinking vs. Sequence Thinking", with further clarifications in "Modeling Extreme Model Uncertainty". In summary, the key insight is that rather than rely on a single, explicit cost-effectiveness estimate, one ought to evaluate interventions from many different angles and adjust for one’s prior expectation of the intervention and for the assumption that an estimate is likely to magnify dramatic effects.
Mathematically, the framework suggests that when you have a robust estimate of a particular intervention's cost-effectiveness, the key figure is how good the charity is, according to your estimations. Robustness can be achieved by (among other things) having multiple independent estimates. But when robustness is poor to moderate, variation in robustness can be as important as or more important than the point estimate. More broadly – when information quality is good, you should focus on quantifying your different options; when it isn’t, you should focus on raising information quality.
These points arise from the fact that when conducting a cost-effectiveness estimate, one must consider one’s prior distribution (i.e., what is predicted for the value of one’s actions by other life experience and evidence) and the variance of the estimate error around the cost-effectiveness estimate (i.e., how much room for error the estimate has) to produce a posterior estimate for cost-effectiveness.
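The update described above can be sketched numerically. The following assumes, purely for simplicity, that both the prior and the estimate error are normally distributed; the function name and all numbers are hypothetical and chosen only to show how the same headline estimate leads to very different posteriors depending on its robustness.

```python
def posterior(prior_mean, prior_sd, estimate, estimate_sd):
    """Combine a normal prior with a normally distributed, noisy estimate.

    Returns (posterior_mean, posterior_sd) using precision weighting:
    each input counts in proportion to 1 / variance.
    """
    prior_precision = 1.0 / prior_sd ** 2
    estimate_precision = 1.0 / estimate_sd ** 2
    total_precision = prior_precision + estimate_precision
    mean = (prior_mean * prior_precision + estimate * estimate_precision) / total_precision
    return mean, total_precision ** -0.5


# Same headline estimate (10 units of good per dollar), different robustness.
robust = posterior(prior_mean=1.0, prior_sd=1.0, estimate=10.0, estimate_sd=0.5)
shaky = posterior(prior_mean=1.0, prior_sd=1.0, estimate=10.0, estimate_sd=50.0)

print("robust estimate ->", robust)
print("shaky estimate  ->", shaky)
```

Under these assumptions the robust estimate pulls the posterior most of the way toward 10, while the shaky one barely moves it from the prior of 1. That is the sense in which variation in robustness can matter as much as the point estimate.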
The approach of looking at an organization or intervention from many different angles is further sketched out by Holden in what he calls "cluster thinking", which is distinct from what he calls "sequence thinking". With "cluster thinking", one seeks to evaluate a claim from multiple perspectives (or "clusters") and take an aggregated approach. In contrast, "sequence thinking" involves combining all factors into a single, sequential line of argument, usually in the form of an explicit expected value calculation.
This forms the basis for a general philosophy of supporting the charities that have a combination of reasonably high estimated cost-effectiveness and maximally robust evidence. GiveWell now looks for "meaningful differences" in modeled cost-effectiveness to determine whether an intervention appears to be meaningfully more cost-effective than GiveDirectly or direct financial grants. They also aim to determine whether there are meaningful differences between other interventions, for example, deworming versus bednets.
When assessing particular interventions, GiveWell tends to avoid using "error bars" and prefers to display their best guess values and state the reasons for uncertainty qualitatively instead of quantitatively, while urging people to not take the best guess value literally.
GiveWell does not aim to adjust downwards for uncertainty. Instead they aim to make best guesses about the key parameters that affect the final estimate. For example, GiveWell's evaluation of deworming includes a discount of 100x in the weighting of estimates from a key supporting study, due to concerns about replicability/external validity for this study.
More generally, GiveWell has noticed a trend that cost-effectiveness estimates frequently become worse (and rarely become better) upon further investigation, and occasionally they adjust estimates downward to try to “get ahead” of this trend when first looking at new interventions. However, in the March 2016 open thread, GiveWell expressed that staff have differing opinions on these uncertainty-based downward discounts and that they could appear overly subjective and controversial.
As an example of some of the uncertainty considered, GiveWell wrote an uncertainty section in their 2016 review of SCI’s room for more funding, highlighting context specific issues such as political unrest, expiring drug supplies, additional donated drugs becoming available, delays and budget changes due to coordination with other actors, results of disease mapping, and grants from other donors. GiveWell additionally acknowledged that they do not place much weight on the preliminary room for more funding estimates for SCI's work in 2017-2018. In the consideration of the final estimate of cost per treatment delivered, GiveWell emphasized that their estimate relies on a number of uncertain assumptions.
Animal Charity Evaluators
Just as GiveWell recommends the best non-profits for improving global health, ACE analyzes the effectiveness of different ways to improve the wellbeing of nonhuman animals. This is a task rife with uncertainty, as the empirical record for many of the analyzed interventions is sparse or nonexistent. Additionally, ACE has to consider very difficult and unresolved questions like animal consciousness and wild animal suffering. "Thoughts on the Reducetarian Study" contains a review of existing empirical research on pro-vegetarian interventions and its limitations.
ACE rates each charity on seven specific criteria -- an assessment of room for more funding, an explicit cost-effectiveness estimate, a more general assessment of individual intervention work having high implicit cost-effectiveness estimates, and four qualitative assessments of organizational health (e.g., track record, strong leadership, good culture). This could be seen as a form of cluster thinking where ACE looks at an organization in numerous ways to come to a holistic assessment, in which the impact of a cost-effectiveness calculation affects at most three of the seven criteria.
ACE has historically had challenges communicating cost-effectiveness information and now favors expressing their uncertainty about their estimates in the form of a range. For example, an aggregation over all of Mercy For Animals’ activities produced an estimate of between -4 and 20 years of suffering averted per dollar spent. To construct this estimate, ACE uses a third-party program called Guesstimate that can create and aggregate various confidence intervals, such as an estimate for all of MFA’s activities.
In addition to using Guesstimate, ACE also provides spreadsheet-driven calculators, such as their social media calculator, which aim to have three point estimates for every parameter -- a pessimistic (conservative) view, a realistic (best guess) view, and an optimistic view. Each category is aggregated individually, creating three final estimates that are also pessimistic, realistic, and optimistic. The Guesstimate blog argues that this approach leads to inflated ranges, with the pessimistic final view being extra pessimistic and the optimistic final view being extra optimistic, though the central, realistic, best guess view should remain unaffected.
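The inflation effect described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not ACE's or Guesstimate's actual method: it treats each (pessimistic, optimistic) pair as the 90% interval of an independent lognormal parameter, assumes the final estimate is the product of the parameters, and uses invented bounds.

```python
import math
import random

Z90 = 1.645  # z-score bounding the central 90% interval of a normal distribution


def lognormal_params(low, high):
    """Log-space mean and sd of a lognormal whose 90% interval is (low, high)."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return mu, sigma


def simulate_product(intervals, n=20000, seed=0):
    """Sample the product of independent lognormal parameters; return sorted draws."""
    rng = random.Random(seed)
    params = [lognormal_params(low, high) for low, high in intervals]
    draws = [
        math.prod(rng.lognormvariate(mu, sigma) for mu, sigma in params)
        for _ in range(n)
    ]
    return sorted(draws)


# Hypothetical (pessimistic, optimistic) bounds for three multiplied parameters.
intervals = [(1, 4), (2, 8), (0.5, 2)]

# Spreadsheet-style aggregation: multiply all lows together, all highs together.
naive_low = math.prod(low for low, _ in intervals)
naive_high = math.prod(high for _, high in intervals)

# Simulation-style aggregation: take the 5th and 95th percentiles of the product.
draws = simulate_product(intervals)
mc_low = draws[int(0.05 * len(draws))]
mc_high = draws[int(0.95 * len(draws))]

print("all-pessimistic / all-optimistic:", naive_low, naive_high)
print("simulated 90% interval:", round(mc_low, 2), round(mc_high, 2))
```

With these inputs the all-pessimistic and all-optimistic products span 1 to 64, while the simulated 90% interval is much narrower (roughly 2.4 to 27), because it is unlikely that every parameter lands at its extreme at the same time.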
Currently, ACE’s estimates are based almost entirely on short-term effects. ACE has thought a lot about how to include long-term effects, but currently does not include these in their models due to an incommensurable level of bias relative to short-term estimates, a very high degree of uncertainty, and estimates being dominated by subjective judgment calls.
ACE also considered a robustness adjustment to their estimate (i.e., reducing estimates downward that are less robust), but decided not to do this due to concerns about too much subjectivity and thinking that estimating uncertainty of individual parameters and communicating the overall range should be sufficient to account for most of the impacts of robustness.
80,000 Hours
In the outline of their Research Principles, 80,000 Hours state that they strive to employ Bayesian reasoning in their analysis of career decisions: they clarify a prior guess on an issue, such as the potential benefits of a particular career path, and then update in or out of favour based on the strength of the evidence. They state that Bayesian reasoning is regarded as the best practice for decision-making under high uncertainty. They use their research principles as aspirational goals to inform their programs.
In the face of uncertainty, 80,000 Hours also uses cluster thinking -- instead of relying upon one or two strong considerations, they consider the question from a broad variety of angles and talk to people with different views, weighing each perspective according to the robustness and importance of the potential consequences. They additionally seek to avoid bias by aiming to make their research transparent and aiming to state their initial position so readers can spot any potential sources of bias and receive feedback from experts on sources of uncertainty.
In order to develop “workable assumptions”, 80,000 Hours generally adopts an assumption of linearity. For instance, they assume that the value of a resource is likely to be linear when considering changes that are a small fraction of the current supply of that resource. When consuming a resource, the overall effect is therefore very likely to be diminishing through most of the range; and is likely to be increasing only as one comes to control the majority of that resource, and even then only in some cases. For example, a donation of $200 is likely to be twice as good as $100.
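A toy calculation shows why linearity is a reasonable working assumption for small changes. The logarithmic returns curve below is an illustrative stand-in for diminishing returns, not a model taken from 80,000 Hours, and the funding levels are hypothetical.

```python
import math


def value_of_donation(donation, existing_funding):
    """Value added by a donation under an assumed logarithmic returns curve."""
    return math.log1p(donation / existing_funding)


def ratio_200_vs_100(existing_funding):
    """How many times better a $200 donation is than a $100 donation."""
    return value_of_donation(200, existing_funding) / value_of_donation(100, existing_funding)


# Donation is a tiny fraction of existing funding: ratio is very close to 2.
print(ratio_200_vs_100(1_000_000_000))

# Donation is comparable to existing funding: ratio is noticeably below 2.
print(ratio_200_vs_100(100))
```

Even on a curve with diminishing returns, doubling a donation roughly doubles its value when the donation is small relative to the existing supply; the approximation only breaks down when the donation is large relative to that supply.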
Rob Wiblin emphasized that 80,000 Hours use their research principles as aspirational goals to inform their programs. He additionally drew our attention to the following ideas:
taking a 'risk management' approach to existential risk, thinking of it as insurance against the possibility that we really are dealing with an exceptional case here, i.e., the possibility that the inside view is right,
giving some weight to common sense,
doing things that aren't going to be a disaster, making sure that nothing they do will be catastrophically bad even if they misunderstood the situation.
Centre for the Study of Existential Risk
The Centre for the Study of Existential Risk is an interdisciplinary research centre within the University of Cambridge, dedicated to the study and mitigation of human extinction-level threats. As an academic institution, CSER does not have a corporate position on the evaluation of outcomes and risks, and contains a wide range of views. The following is therefore based on a specific project they are undertaking to develop a policy focused framework for the evaluation of Extreme Technological Risks (ETRs), i.e., technological developments that pose new existential risks. CSER states that standard cost-benefit analysis has serious deficiencies in the evaluation of ETRs and that the science of ETR management needs to take this into account when drawing conclusions about mitigating ETRs compared to other global priorities. Since much of the difficulty in evaluating ETRs stems from the significant degree of uncertainty about their risks and benefits, much of CSER’s work in this area involves developing alternatives to cost-benefit analysis that are better suited to evaluation under uncertainty.
One problem with using cost-benefit analysis to evaluate ETRs is that existential threats usually emerge only in worst case scenarios. Such scenarios are often very unlikely to occur. However, given that the costs associated with human extinction level threats are many orders of magnitude greater than those associated with the next worse scenario, such as a global catastrophe, they may significantly alter the overall balance of costs and benefits associated with developing a technology. One practical implication of this is that existential risk mitigation involves making predictions and preparing for outcomes that will be much worse than what we would expect to see in most outcomes. In practice, cost-benefit analyses often exclude or ignore such tail risks and most organizations are sensitive to being seen as making predictions that are persistently proven to be incorrect, so CSER is keen to identify when such pessimism is most justified and support those who it views as responding correctly to these kinds of extreme tail risk.
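The way an unlikely extreme outcome can dominate the overall balance is easy to see with a toy expected-cost calculation. All probabilities and costs below are invented for illustration and are in arbitrary units; they are not CSER figures.

```python
# Hypothetical scenarios for developing a technology: (probability, cost).
SCENARIOS = [
    (0.900, 0),           # nothing goes wrong
    (0.099, 1_000),       # serious but recoverable harm
    (0.001, 10_000_000),  # extinction-level tail outcome
]


def expected_cost(scenarios):
    """Probability-weighted cost across mutually exclusive scenarios."""
    return sum(p * cost for p, cost in scenarios)


full = expected_cost(SCENARIOS)
without_tail = expected_cost(SCENARIOS[:-1])  # an analysis that ignores the tail

print("expected cost, all scenarios:", full)
print("expected cost, tail ignored:", without_tail)
print("share of expected cost from the tail:", 1 - without_tail / full)
```

In this sketch a one-in-a-thousand scenario accounts for nearly all of the expected cost, so an analysis that drops it as too unlikely to model would understate the total by about two orders of magnitude.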
However, CSER’s work also goes beyond such practical concerns. They are also concerned that within the significant degree of uncertainty that surrounds ETRs and other existential risks there may be other morally salient concerns that go beyond what is captured by standard cost-benefit analysis. CSER use a fundamentally normative approach to identify and evaluate these concerns to create a framework that identifies where we have the greatest ethical imperative for precaution in the face of uncertainty, and where a more balanced approach to weighing costs and benefits remains appropriate. Three key issues in this analysis are population ethics, fairness and temporal discounting:
Population ethics
Philosophical debates about the value of future lives have thrown up many intriguing axiological theories, implying that one cannot directly derive the value of a life from its level of well-being, let alone from the interpretation of these wellbeing levels in monetary terms as required by cost-benefit analysis. Some people, such as Derek Parfit, have proposed a lexical theory about the value of lives, in which certain goods, such as science, friendship and culture, are morally more significant than any amount of individual welfare on its own. If taken seriously, this view greatly reduces the importance of some kinds of moral uncertainty. For instance, it implies that it does not matter if we do not know what the welfare costs and benefits of a technology will be if it threatens the existence of these ‘perfectionist goods’. There are several ways of incorporating such concerns into an evaluative framework, for instance by adopting a form of critical-level utilitarianism (giving priority to lives that are above some ‘critical level’ of welfare) or by implementing a more pluralist approach to moral value. As a starting point, CSER is analyzing whether possible future scenarios provide the potential resources necessary to foster perfectionist values at all, since this may be morally equivalent to the question of whether they pose an existential threat.
Fairness
Sometimes our evaluation of an action is sensitive not just to its costs and benefits, but also to the ways in which these come about and to their distribution. This view is common amongst a variety of moral theories, although it can be articulated in many ways. CSER is currently investigating accounts of fairness that allow us to integrate such concerns with a suitably consequentialist and aggregative approach to evaluating risks, for instance the "aggregate relevant claims" view. By introducing new evaluative frameworks, such views have the potential to remove large amounts of evaluative uncertainty. For instance, on some of these views it is always better to save a single life than cure any number of headaches, rendering any uncertainty over the number of headaches that one might potentially cure morally insignificant. Such views are already being used to evaluate public health choices, but are yet to be studied in the evaluation of technological risks.
Discounting
The relationship between uncertainty and the social discount rate (i.e., whether the fact that a cost or benefit will occur in the future makes it less important than if it occurred in the present) may seem less obvious. However, many theories about why we should discount future harms and benefits actually imply that we should use different discount rates for different kinds of costs and benefits. Whilst it seems legitimate to impose quite a high temporal discount rate on future benefits where these take the form of additional wellbeing for those who are already well off, the discount rate should be lower for assessing costs or the wellbeing of those who are worse off and should be even lower, or potentially negative, for costs associated with global catastrophes. This result is in fact well known in theory, but generally gets lost in practice, where it is easiest to apply a single social discount rate to all costs and benefits, regardless of when and where they fall. One upshot is that as we move further into the future it may matter less and less just how big certain costs and benefits are, and much more whether or not there could be an extreme event such as human extinction or a global catastrophe.
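The effect of applying different discount rates to different kinds of costs and benefits can be sketched with the standard present-value formula. The three rates below are hypothetical, chosen only to reflect the qualitative ordering described above; they are not rates proposed by CSER.

```python
def present_value(amount, annual_rate, years):
    """Value today of a cost or benefit that occurs `years` from now."""
    return amount / (1 + annual_rate) ** years


# Hypothetical rates for three kinds of future cost or benefit.
RATES = {
    "benefit to the already well off": 0.03,
    "cost to the worse off": 0.01,
    "global catastrophe": 0.0,
}

# The same nominal amount (100 units), 200 years in the future.
for label, rate in RATES.items():
    print(label, round(present_value(100.0, rate, years=200), 3))
```

Over a 200-year horizon the item discounted at 3% shrinks to a small fraction of a unit while the undiscounted catastrophe keeps its full weight, which illustrates why extreme events come to dominate the evaluation as the time horizon lengthens.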
CSER hopes that by developing these lines of research it will be possible to produce a middle way between Cost Benefit Analysis, which is often far too insensitive to risk and uncertainty, and a more blanket precautionary approach, which tends to overreact to it, yielding irrational results. This will form the basis of their integrated approach to managing extreme technological risks.
CSER is also interested in uncertainty in other areas that, although unlikely to produce existential threats in themselves, play an important role in framing humanity's future. One example is the evaluation of scientific research. Here they are concerned that relying on overly precise risk-benefit assessments, when there is significant uncertainty about the actual research outcomes, produces no real improvement in the quality of research output, but does encourage the perpetuation of selection bias and other irrationalities in the kinds of scientific research that are undertaken and promoted.
Endnotes
Peter Hurford is an independent researcher who works as a data scientist and is on the board of Charity Science Health, .impact, and Animal Charity Evaluators. Kathryn Mecrow is a member of the Operations Team of the Future of Humanity Institute. Simon Beard is a Research Associate at the Centre for the Study of Existential Risk at the University of Cambridge. We also thank (in no particular order) Michelle Hutchinson at the Oxford Institute for Effective Altruism; Rob Wiblin at 80,000 Hours; Allison Smith at Animal Charity Evaluators; Amanda Askell; and Elie Hassenfeld, Rebecca Raible, and Holden Karnofsky at GiveWell for answering initial questions that allowed us to research this essay and for reviewing a final draft of this essay prior to publication.
As a concrete example, Michael Dickens elaborates on how to use this framework to produce cost-effectiveness estimates for various different causes that may be more directly comparable, even across multiple levels of rigor. GiveWell also produces a worked example showing mathematically how one might combine three different models demonstrating uncertainty about investing in a start-up.
Note that one of the three authors, Peter Hurford, serves as the Treasurer of the ACE board and was the original designer of ACE’s evaluation criteria as an ACE volunteer. This could introduce potential bias when discussing this section.
Note that one of the three authors, Simon Beard, is a research associate at the Centre for the Study of Existential Risk and works on their project Evaluating Extreme Technological Risks. This could introduce potential bias when discussing this section.