How do EA Orgs Account for Uncertainty in their Analysis?

post by Peter Wildeford (Peter_Hurford) · 2017-04-05T16:48:45.220Z

Contents

    Key points from EA orgs
  GiveWell
      Shortcomings with Cost-Effectiveness Estimates
      Cluster Thinking and Extreme Model Uncertainty
      Communicating Uncertainty
  Animal Charity Evaluators
  80,000 Hours
  Centre for the Study of Existential Risk
      Population ethics
      Fairness
      Discounting
  Endnotes

This essay was jointly written by Peter Hurford, Kathryn Mecrow, and Simon Beard[1].

 

Effective altruism is about figuring out how to do the most good. When working with limited resources, financial or otherwise, and faced with opportunity costs, how do we measure the impact of a particular program? Using examples from current projects: how good is it to give someone a bednet, hand out a pro-vegetarian leaflet, or do an hour of AI safety research? This article considers the various methods employed by EA and EA-associated organizations for dealing with uncertainty in their cost-effectiveness analyses. To address this, we conducted a literature review using resources on the websites of EA orgs, and reached out to the organizations themselves for guidance.

When striving to answer these questions, we can model various parameters. For example, GiveWell mentions that the benefit of a deworming program depends on a number of factors that each carry their own uncertainties. Does deworming have short-term health benefits? Is it clear that deworming has long-term positive health or social impacts? How do we evaluate evidence from randomized controlled trials that differ from modern-day deworming programs in a number of important ways? How do we proceed if some analysis suggests that deworming programs are more likely than not to have very little or no impact, but there is some possibility that deworming has a very large impact? In combining these models, we can employ ranges or intervals to express our uncertainty and aggregate the various inputs into a total estimate of impact.
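The aggregation step described above can be sketched as a simple Monte Carlo simulation. All parameter names and ranges below are hypothetical illustrations, not GiveWell's actual figures:

```python
import random

random.seed(0)

def sample_cost_per_outcome(n_samples=100_000):
    """Monte Carlo aggregation of uncertain parameters into a single
    cost-effectiveness distribution. All ranges below are hypothetical."""
    results = []
    for _ in range(n_samples):
        cost_per_treatment = random.uniform(0.5, 1.5)         # dollars
        p_short_term_benefit = random.uniform(0.6, 0.9)       # probability
        long_term_multiplier = random.lognormvariate(0, 0.5)  # relative effect
        value = p_short_term_benefit * long_term_multiplier
        results.append(cost_per_treatment / value)
    return sorted(results)

samples = sample_cost_per_outcome()
median = samples[len(samples) // 2]
lo, hi = samples[int(0.05 * len(samples))], samples[int(0.95 * len(samples))]
print(f"median ${median:.2f} per outcome, 90% interval ${lo:.2f}-${hi:.2f}")
```

The output is a distribution rather than a point estimate, so the resulting interval communicates how much the input uncertainties compound.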

How do we know that we have all the relevant parameters, that we have aggregated correctly, accurately assessed our uncertainty in each parameter, and made no errors? Given the emphasis of the EA movement on robustness of evidence, how do we ensure that we actually do the most good and retain the credibility of our assessments in the face of poorly researched areas, uncertain future impact scenarios, and multitudes of known and unknown program-specific factors?

For example, VillageReach was GiveWell’s top charity for three years (2009-2011) with an estimated "$200-$1,000 per life saved". However, a 2012 re-analysis found a significant amount of missing data that was not considered in the original review, data that could have a large effect on the final conclusion about VillageReach’s cost-effectiveness. GiveWell also later discovered potential alternative explanations for some of VillageReach’s apparent impact, further reducing confidence in the cost-effectiveness figure. Together, the missing data and alternative explanations represent a large source of “model uncertainty” in the initial estimate.

This dilemma is not confined to the EA movement. In a now famous example, Sam Wang at Princeton gave Clinton a 99% chance of winning the 2016 US presidential election. While Wang expressed his uncertainty, systematic errors in the model made it more confident than was warranted. Nate Silver argues in The Signal and the Noise that a similar problem with correlated errors contributed to a great deal of model uncertainty in financial forecasting, which was ultimately partially responsible for the 2008 recession.
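The correlated-errors problem can be illustrated with a toy election simulation (all margins and noise levels here are made up): when polling errors share a common component, the chance of the favourite losing everywhere at once is far larger than independent errors would suggest.

```python
import random

random.seed(1)

def p_upset(n_states=5, state_margin=0.02, noise=0.03,
            shared_fraction=0.0, trials=100_000):
    """Estimate the probability the favourite loses *every* state when a
    fraction of each state's polling error is shared across states."""
    upsets = 0
    for _ in range(trials):
        shared = random.gauss(0, noise) * shared_fraction
        losses = 0
        for _ in range(n_states):
            independent = random.gauss(0, noise) * (1 - shared_fraction)
            if state_margin + shared + independent < 0:
                losses += 1
        if losses == n_states:
            upsets += 1
    return upsets / trials

print(f"independent errors:   {p_upset(shared_fraction=0.0):.4f}")
print(f"mostly shared errors: {p_upset(shared_fraction=0.8):.4f}")
```

With independent errors the upsets multiply out to a tiny probability; with a shared error component, one bad draw hits every state simultaneously, so the upset probability is orders of magnitude larger.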

 

Key points from EA orgs

 

GiveWell

It’s hard to find an organization, in effective altruism or otherwise, that has invested as much time thinking and writing about uncertainty as GiveWell. While UNICEF suggests it can save a life with a $1 dose and Nothing But Nets argues it can save a life with a $10 bednet, GiveWell’s page on cost-effectiveness argues that such figures are misleading, and GiveWell is not inclined to take these estimates literally.

 

Shortcomings with Cost-Effectiveness Estimates

As early as 2008, GiveWell pointed out that while they find cost-effectiveness estimates an attractive way to compare causes, these estimates have shortcomings, such as incorporating value judgements that may differ dramatically between people. In 2010, GiveWell elaborated on further shortcomings.

 

GiveWell summarizes these problems into three core issues: cost-effectiveness estimates are frequently oversimplified (ignoring important indirect and long-term effects), overly sensitive (with small changes in assumptions producing big changes in value), and unverified (with errors persisting unnoticed for years).

 

Cluster Thinking and Extreme Model Uncertainty

These discussions led Holden Karnofsky to articulate a mathematical framework for assessing cost-effectiveness estimates that takes their rigor into account, in "Why We Can’t Take Expected Value Estimates Literally", "Maximizing Cost-Effectiveness via Critical Inquiry", and "Cluster Thinking vs. Sequence Thinking", with further clarifications in "Modeling Extreme Model Uncertainty". In summary, the key insight is that rather than rely on a single, explicit cost-effectiveness estimate, one ought to evaluate interventions from many different angles, adjust for one’s prior expectation of the intervention, and assume that an explicit estimate is likely to magnify dramatic effects.

Mathematically, the framework suggests that when you have a robust estimate of a particular intervention's cost-effectiveness, the point estimate itself is the key figure; robustness can be achieved by (among other things) having multiple independent estimates. But when robustness is only poor to moderate, variation in robustness can be as important as, or more important than, the point estimate. More broadly: when information quality is good, you should focus on quantifying your different options; when it isn’t, you should focus on raising information quality.

These points arise from the fact that when conducting a cost-effectiveness estimate, one must combine one’s prior distribution (i.e., what other life experience and evidence predict for the value of one’s actions) with the variance of the error around the cost-effectiveness estimate (i.e., how much room for error the estimate has) to produce a posterior estimate of cost-effectiveness[2].
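Under the usual normal-normal model, this Bayesian adjustment has a closed form. The sketch below (with made-up numbers) shows the central behaviour: a high-variance estimate gets shrunk almost all the way back to the prior, while a robust one moves the posterior substantially.

```python
def posterior_cost_effectiveness(prior_mean, prior_var, estimate, estimate_var):
    """Normal-normal Bayesian update: shrink a noisy cost-effectiveness
    estimate toward the prior in proportion to its error variance."""
    k = prior_var / (prior_var + estimate_var)   # weight placed on the estimate
    post_mean = prior_mean + k * (estimate - prior_mean)
    post_var = prior_var * estimate_var / (prior_var + estimate_var)
    return post_mean, post_var

# A robust estimate (small error variance) moves the posterior a lot...
print(posterior_cost_effectiveness(10, 25, 100, 4))
# ...while a shaky one (huge error variance) barely moves it at all.
print(posterior_cost_effectiveness(10, 25, 100, 2500))
```

This is why an estimate that a charity is 10x better than your prior, made with enormous error bars, should barely change your view, while the same claim with tight error bars should change it a lot.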

The approach of looking at an organization or intervention from many different angles is further sketched out by Holden in what he calls "cluster thinking", which is distinct from what he calls "sequence thinking". With "cluster thinking", one seeks to evaluate a claim from multiple perspectives (or "clusters") and take an aggregated approach. In contrast, "sequence thinking" involves combining all factors into a single, sequential line of argument, usually in the form of an explicit expected value calculation.

 

Communicating Uncertainty

This forms the basis for a general philosophy of supporting the charities that combine reasonably high estimated cost-effectiveness with maximally robust evidence. GiveWell now looks for "meaningful differences" in modeled cost-effectiveness to determine whether an intervention appears meaningfully more cost-effective than GiveDirectly's direct financial grants. They also aim to determine whether there are meaningful differences between other interventions, for example, deworming versus bednets.

When assessing particular interventions, GiveWell tends to avoid using "error bars" and prefers to display their best guess values and state the reasons for uncertainty qualitatively instead of quantitatively, while urging people to not take the best guess value literally.

GiveWell does not aim to adjust downwards for uncertainty. Instead, they aim to make best guesses about the key parameters that affect the final estimate. For example, GiveWell's evaluation of deworming includes a 100x discount in the weighting of estimates from a key supporting study, due to concerns about that study's replicability and external validity.

More generally, GiveWell has noticed that cost-effectiveness estimates frequently become worse (and rarely become better) upon further investigation, and they occasionally adjust estimates downward to try to “get ahead” of this trend when first looking at new interventions. However, in the March 2016 open thread, GiveWell noted that staff hold differing opinions on these uncertainty-based downward discounts, which can appear overly subjective and controversial.

As an example of the uncertainty considered, GiveWell wrote an uncertainty section in their 2016 review of SCI’s room for more funding, highlighting context-specific issues such as political unrest, expiring drug supplies, additional donated drugs becoming available, delays and budget changes due to coordination with other actors, results of disease mapping, and grants from other donors. GiveWell additionally acknowledged that they do not place much weight on the preliminary room-for-more-funding estimates for SCI's work in 2017-2018. In presenting the final estimate of cost per treatment delivered, GiveWell emphasized that it relies on a number of uncertain assumptions.

 

Animal Charity Evaluators

Just as GiveWell recommends the best non-profits for improving global health, ACE analyzes the effectiveness of different ways to improve the wellbeing of nonhuman animals. This task is rife with uncertainty, as the empirical record for many of the analyzed interventions is sparse or nonexistent. Additionally, ACE has to consider very difficult and unresolved questions like animal consciousness and wild animal suffering. "Thoughts on the Reducetarian Study" contains a review of existing empirical pro-vegetarian intervention research and its limitations.

ACE rates each charity on seven specific criteria[3] -- an assessment of room for more funding, an explicit cost-effectiveness estimate, a more general assessment of individual intervention work having high implicit cost-effectiveness estimates, and four qualitative assessments of organizational health (e.g., track record, strong leadership, good culture). This could be seen as a form of cluster thinking where ACE looks at an organization in numerous ways to come to a holistic assessment, in which the impact of a cost-effectiveness calculation affects at most three of the seven criteria.

ACE has historically had challenges communicating cost-effectiveness information and now favors expressing uncertainty about their estimates in the form of a range. For example, an aggregation over all of Mercy for Animals’ activities produced an estimate of between -4 and 20 years of suffering averted per dollar spent. To construct such estimates, ACE uses a third-party tool called Guesstimate that can create and aggregate confidence intervals, such as the estimate for all of MFA’s activities.

In addition to using Guesstimate, ACE also provides spreadsheet-driven calculators, such as their social media calculator, that use three point estimates for every parameter -- a pessimistic (conservative) view, a realistic (best guess) view, and an optimistic view. Each category is aggregated separately, creating three final estimates that are likewise pessimistic, realistic, and optimistic. The Guesstimate blog argues that this approach leads to inflated ranges, with the pessimistic final view being extra pessimistic and the optimistic final view being extra optimistic, though the central, realistic best guess should remain unaffected.
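The inflation the Guesstimate blog describes can be demonstrated with a toy calculation (all parameter values are hypothetical): multiplying the all-pessimistic column together lands below even the Monte Carlo 5th percentile, because all parameters rarely hit their worst case simultaneously.

```python
import math
import random

random.seed(2)

# Three independent parameters, each given as
# (pessimistic, realistic, optimistic) values -- all hypothetical:
params = [(10, 50, 100), (0.5, 1.0, 2.0), (0.1, 0.3, 0.6)]

# Spreadsheet-style aggregation: multiply each scenario column separately.
pess = math.prod(p for p, r, o in params)
opt = math.prod(o for p, r, o in params)

# Monte Carlo alternative: sample each parameter between its extremes.
trials = 100_000
samples = sorted(
    math.prod(random.uniform(p, o) for p, r, o in params)
    for _ in range(trials)
)
mc_lo, mc_hi = samples[int(0.05 * trials)], samples[int(0.95 * trials)]

# The scenario range is wider than the Monte Carlo 90% interval.
print(f"scenario range:           {pess:.2f} to {opt:.2f}")
print(f"Monte Carlo 90% interval: {mc_lo:.2f} to {mc_hi:.2f}")
```

The effect grows with the number of parameters: each extra multiplied parameter makes the all-worst-case (or all-best-case) combination less probable, so the scenario endpoints drift further from any plausible percentile.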

Currently, ACE’s estimates are based almost entirely on short-term effects. ACE has thought a lot about how to include long-term effects, but currently does not include these in their models due to an incommensurable level of bias relative to short-term estimates, a very high degree of uncertainty, and estimates being dominated by subjective judgment calls.

ACE also considered a robustness adjustment to their estimate (i.e., reducing estimates downward that are less robust), but decided not to do this due to concerns about too much subjectivity and thinking that estimating uncertainty of individual parameters and communicating the overall range should be sufficient to account for most of the impacts of robustness.

 

80,000 Hours

In the outline of their Research Principles, 80,000 Hours state that they strive to employ Bayesian reasoning in their analysis of career decisions: forming a prior guess on an issue, such as the potential benefits of a particular career path, and then updating for or against it based on the strength of the evidence. They regard Bayesian reasoning as best practice for decision-making under high uncertainty.

In the face of uncertainty, 80,000 Hours also uses cluster thinking -- instead of relying upon one or two strong considerations, they consider the question from a broad variety of angles and talk to people with different views, weighing each perspective according to the robustness and importance of the potential consequences. They additionally seek to avoid bias by aiming to make their research transparent and aiming to state their initial position so readers can spot any potential sources of bias and receive feedback from experts on sources of uncertainty.

In order to develop “workable assumptions”, 80,000 Hours generally adopts an assumption of linearity for small changes: when a change is a small fraction of the current supply of a resource, its value is likely to scale linearly, so a donation of $200 is likely to be about twice as good as one of $100. Over the whole range, by contrast, the returns to a resource are very likely diminishing, and are likely to be increasing only as one comes to control the majority of that resource, and even then only in some cases.
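The combination of linearity-in-the-small with diminishing returns overall can be illustrated with a stylized logarithmic value function (the function and figures are illustrative, not 80,000 Hours' actual model):

```python
import math

def value(total_funding):
    """Stylized diminishing-returns value function for a cause area."""
    return math.log(total_funding)

supply = 1_000_000  # current funding in the area (hypothetical)

marginal_100 = value(supply + 100) - value(supply)
marginal_200 = value(supply + 200) - value(supply)

# For changes that are a tiny fraction of the supply, value is nearly
# linear: a $200 donation is almost exactly twice as good as $100.
print(f"ratio of marginal values: {marginal_200 / marginal_100:.4f}")

# Doubling the entire supply, by contrast, adds the same value (ln 2)
# no matter how large the supply already is -- diminishing returns.
print(f"value of doubling: {value(2 * supply) - value(supply):.4f}")
```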

Rob Wiblin emphasized that 80,000 Hours use their research principles as aspirational goals to inform their programs.

 

Centre for the Study of Existential Risk

The Centre for the Study of Existential Risk is an interdisciplinary research centre within the University of Cambridge, dedicated to the study and mitigation of human extinction-level threats[4]. As an academic institution, CSER does not have a corporate position on the evaluation of outcomes and risks, and its researchers hold a wide range of views. The following is therefore based on a specific project they are undertaking to develop a policy-focused framework for the evaluation of Extreme Technological Risks (ETRs), i.e., technological developments that pose new existential risks. CSER states that standard cost-benefit analysis has serious deficiencies in the evaluation of ETRs and that the science of ETR management needs to take this into account when drawing conclusions about mitigating ETRs compared to other global priorities. Since much of the difficulty in evaluating ETRs stems from the significant degree of uncertainty about their risks and benefits, much of CSER’s work in this area involves developing alternatives to cost-benefit analysis that are better suited to evaluation under uncertainty.

One problem with using cost-benefit analysis to evaluate ETRs is that existential threats usually emerge only in worst-case scenarios, which are often very unlikely to occur. However, given that the costs associated with human extinction are many orders of magnitude greater than those associated with the next-worst scenario, such as a global catastrophe, they may significantly alter the overall balance of costs and benefits of developing a technology. One practical implication is that existential risk mitigation involves making predictions about, and preparing for, outcomes that are much worse than what we would expect to see in most cases. In practice, cost-benefit analyses often exclude or ignore such tail risks, and most organizations are sensitive to being seen as making predictions that are persistently proven incorrect, so CSER is keen to identify when such pessimism is most justified and to support those whom it views as responding correctly to these kinds of extreme tail risk.
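A toy expected-cost calculation (with made-up probabilities and costs) shows why a tail scenario can dominate the balance even at a very low probability:

```python
def expected_cost(scenarios):
    """Expected cost over (probability, cost) pairs (hypothetical units)."""
    return sum(p * c for p, c in scenarios)

# Ordinary outcomes only: no cost, a bad year, a global catastrophe.
ordinary = [(0.89, 0), (0.10, 1e3), (0.01, 1e6)]

# Add a 1-in-100,000 extinction-level scenario, with a cost many orders
# of magnitude above a global catastrophe (moving probability mass from
# the no-cost outcome).
tail = [(0.89 - 1e-5, 0), (0.10, 1e3), (0.01, 1e6), (1e-5, 1e12)]

print(expected_cost(ordinary))  # modest expected cost
print(expected_cost(tail))      # the near-impossible tail dominates
```

Whether this kind of expected-value dominance should actually drive decisions is exactly the question CSER's alternatives to cost-benefit analysis are meant to address; the arithmetic only shows why ignoring the tail changes the answer so drastically.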

However, CSER’s work also goes beyond such practical concerns. They are concerned that within the significant uncertainty that surrounds ETRs and other existential risks, there may be other morally salient considerations that go beyond what standard cost-benefit analysis captures. CSER uses a fundamentally normative approach to identify and evaluate these concerns, aiming for a framework that identifies where we have the greatest ethical imperative for precaution in the face of uncertainty, and where a more balanced weighing of costs and benefits remains appropriate. Three key issues in this analysis are population ethics, fairness, and temporal discounting:

Population ethics

Philosophical debates about the value of future lives have thrown up many intriguing axiological theories, implying that one cannot directly derive the value of a life from its level of well-being, let alone interpret these well-being levels in monetary terms as cost-benefit analysis requires. Some philosophers, such as Derek Parfit, have proposed a lexical theory of the value of lives, on which certain goods, such as science, friendship, and culture, are morally more significant than any amount of individual welfare on its own. If taken seriously, this view greatly reduces the importance of some kinds of moral uncertainty. For instance, it implies that it does not matter if we do not know what the welfare costs and benefits of a technology will be, if the technology threatens the existence of these ‘perfectionist goods’. There are several ways of incorporating such concerns into an evaluative framework, for instance by adopting a form of critical-level utilitarianism (giving priority to lives that are above some ‘critical level’ of welfare) or by implementing a more pluralist approach to moral value. As a starting point, CSER is analyzing whether possible future scenarios provide the resources necessary to foster perfectionist values at all, since this may be morally equivalent to the question of whether they pose an existential threat.

Fairness

Sometimes our evaluation of an action is sensitive not only to its costs and benefits but also to the ways in which these come about and how they are distributed. This view is common across a variety of moral theories, although it can be articulated in many ways. CSER is currently investigating accounts of fairness that allow such concerns to be integrated with a suitably consequentialist and aggregative approach to evaluating risks, for instance the "aggregate relevant claims" view. By introducing new evaluative frameworks, such views have the potential to remove large amounts of evaluative uncertainty. For instance, on some of these views it is always better to save a single life than to cure any number of headaches, rendering any uncertainty over the number of headaches one might potentially cure morally insignificant. Such views are already being used to evaluate public health choices, but have yet to be studied in the evaluation of technological risks.

Discounting

The relationship between uncertainty and the social discount rate (i.e., whether the fact that a cost or benefit will occur in the future makes it less important than if it occurred in the present) may seem less obvious. However, many theories about why we should discount future harms and benefits actually imply that we should use different discount rates for different kinds of costs and benefits. While it may seem legitimate to impose quite a high temporal discount rate on future benefits that take the form of additional wellbeing for those who are already well off, the rate should be lower for costs and for the wellbeing of those who are worse off, and lower still, potentially even negative, for costs associated with global catastrophes. This result is well known in theory, but it generally gets lost in practice, where it is easiest to apply a single social discount rate to all costs and benefits regardless of when and where they fall. One upshot is that as we move further into the future, it may matter less and less exactly how big certain costs and benefits are, and much more whether or not there could be an extreme event such as human extinction or a global catastrophe.
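A quick sketch (with hypothetical figures) shows how much the choice of discount rate matters over long horizons:

```python
def present_value(amount, years, rate):
    """Discount a future cost or benefit back to the present."""
    return amount / (1 + rate) ** years

# Under a single 5% rate, a 10^9-unit catastrophe in 200 years looks
# trivial in present-value terms...
print(f"{present_value(1e9, 200, 0.05):,.0f}")

# ...while a near-zero rate, of the kind that might be argued for
# catastrophic costs specifically, keeps it dominant.
print(f"{present_value(1e9, 200, 0.001):,.0f}")
```

The gap between the two present values spans roughly four orders of magnitude, which is why applying a single rate to every kind of cost, rather than differentiating by what is being discounted, can swamp every other modelling decision.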


CSER hopes that by developing these lines of research it will be possible to produce a middle way between cost-benefit analysis, which is often far too insensitive to risk and uncertainty, and a more blanket precautionary approach, which tends to overreact to it, yielding irrational results. This will form the basis of their integrated approach to managing extreme technological risks.

CSER is also interested in uncertainty in other areas that, although unlikely to produce existential threats in themselves, play an important role in framing humanity's future. One example is the evaluation of scientific research, where they are concerned that a reliance on overly precise risk-benefit assessments of research when there is significant uncertainty about the actual research outcomes produces no real improvement in the quality of research output, but does encourage the perpetuation of selection bias and other irrationalities in the kinds of scientific research that is undertaken and promoted.

 

Endnotes

[1]: Peter Hurford is an independent researcher who works as a data scientist and is on the board of Charity Science Health, .impact, and Animal Charity Evaluators. Kathryn Mecrow is a member of the Operations Team of the Future of Humanity Institute. Simon Beard is a Research Associate at the Centre for the Study of Existential Risk at the University of Cambridge. We also thank (in no particular order) Michelle Hutchinson at the Oxford Institute for Effective Altruism, Rob Wiblin at 80,000 Hours; Allison Smith at Animal Charity Evaluators; Amanda Askell; and Elie Hassenfeld, Rebecca Raible, and Holden Karnofsky at GiveWell for answering initial questions that allowed us to research this essay and for reviewing a final draft of this essay prior to publication.

[2]: As a concrete example, Michael Dickens elaborates on how to use this framework to produce cost-effectiveness estimates for various different causes that may be more directly comparable, even across multiple levels of rigor. GiveWell also produces a worked example showing mathematically how one might combine three different models demonstrating uncertainty about investing in a start-up.

[3]: Note that one of the three authors, Peter Hurford, serves as the Treasurer of the ACE board and was the original designer of ACE’s evaluation criteria as an ACE volunteer. This could introduce potential bias when discussing this section.

[4]: Note that one of the three authors, Simon Beard, is a research associate at the Centre for the Study of Existential Risk and works on their project Evaluating Extreme Technological Risks. This could introduce potential bias when discussing this section.

4 comments


comment by drwahl · 2017-04-05T17:24:40.432Z

I've done a little work on this, using techniques from modern portfolio theory, and uncertainty estimates from GiveWell and ACE to generate optimal charity portfolios. See here for a background post, and here for my 2016 update.

comment by Phil_Thomas · 2017-04-11T03:31:09.929Z

That’s interesting! I worked on something similar, but it only allows for normal distributions and requires pre-calculated returns and variances. Using the GiveWell estimates to create your own probability distributions is an interesting idea -- I spent some time looking through sources like the DCP2 and Copenhagen Consensus data but couldn’t find a source that did a good job of quantifying their uncertainty (although DCP2 does at least include the spread of their point estimates, which I used for an analysis here).

One thing I wondered about while working on this was whether it made sense to choose the tangency portfolio, or just keep moving up the risk curve to the portfolio with the highest expected value (In the end, I think this would mean just putting all your money in the single charity with the highest expected value). I guess the answer depends on how much risk an individual wants to take with their donations, so a nice feature of this approach is that it allows people to select a portfolio according to their risk preference. Overall, this seems like a good way to communicate the tradeoffs involved in philanthropy.

comment by Peter Wildeford (Peter_Hurford) · 2017-04-05T17:31:50.541Z

Thanks, this is actually highly relevant to another piece I'm working on!