[Link] The Optimizer's Curse & Wrong-Way Reductions

post by Chris Smith · 2019-04-04T13:28:53.703Z · score: 65 (32 votes) · EA · GW · 50 comments

This is a linkpost for https://confusopoly.com/2019/04/03/the-optimizers-curse-wrong-way-reductions/.

Summary

I spent about two and a half years as a research analyst at GiveWell. For most of my time there, I was the point person on GiveWell’s main cost-effectiveness analyses. I’ve come to believe there are serious, underappreciated issues with the methods the effective altruism (EA) community at large uses to prioritize causes and programs. While effective altruists approach prioritization in a number of different ways, most approaches involve (a) roughly estimating the possible impacts funding opportunities could have and (b) assessing the probability that possible impacts will be realized if an opportunity is funded.
I discuss the phenomenon of the optimizer’s curse: when assessments of activities’ impacts are uncertain, engaging in the activities that look most promising will tend to have a smaller impact than anticipated. I argue that the optimizer’s curse should be extremely concerning when prioritizing among funding opportunities that involve substantial, poorly understood uncertainty. I further argue that proposed Bayesian approaches to avoiding the optimizer’s curse are often unrealistic. I maintain that it is a mistake to try and understand all uncertainty in terms of precise probability estimates.
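
To illustrate the phenomenon concretely, here is a minimal simulation sketch (toy numbers: normally distributed true impacts and unbiased, equally noisy estimates). The option that looks best is chosen, and its estimate is systematically higher than its true impact, even though no individual estimate is biased:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_options = 10_000, 20

true_impact = rng.normal(0.0, 1.0, size=(n_trials, n_options))   # actual (unknown) impacts
noise = rng.normal(0.0, 1.0, size=(n_trials, n_options))
estimate = true_impact + noise                                    # unbiased but noisy estimates

chosen = estimate.argmax(axis=1)                                  # fund whatever looks best
rows = np.arange(n_trials)
print("mean estimated impact of chosen option:", estimate[rows, chosen].mean())
print("mean true impact of chosen option:     ", true_impact[rows, chosen].mean())
# The first number comes out roughly twice the second: the winners' estimates are
# inflated by selection, even though each estimate is individually unbiased.
```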

I go into a lot more detail in the full post.

50 comments

Comments sorted by top scores.

comment by AGB · 2019-04-14T19:07:11.362Z · score: 45 (15 votes) · EA · GW

I'm feeling confused.

I basically agree with this entire post. Over many years of conversations with Givewell staff or former staff, I can't readily recall speaking to anyone affiliated with Givewell who I could identify as substantively disagreeing with the suggestions in this post. But you obviously feel that some (reasonably large?) group of people disagrees with some (reasonably large?) part of your post. I understand a reluctance to give names, but focusing on Givewell specifically, since much of their thinking on these matters is public record here, can you identify what specifically in that post or the linked extra reading you disagree with? Or are you talking to EAs-not-at-Givewell? Or do you think Givewell's blog posts are reasonable but their internal decision-making process nonetheless commits the errors they warn against? Or some possibility I'm not considering?

I particularly note that your first suggestion to 'entertain multiple models' sounds extremely similar to 'cluster thinking' as described and advocated-for here, and the other suggestions also don't sound like things I would expect Givewell to disagree with. This leaves me at a bit of a loss as to what you would like to see change, and how you would like to see it change.

comment by Chris Smith · 2019-04-15T21:41:05.608Z · score: 11 (4 votes) · EA · GW

Thanks for raising this.

To be clear, I'm still a huge fan of GiveWell. GiveWell only shows up in so many examples in my post because I'm so familiar with the organization.

I mostly agree with the points Holden makes in his cluster thinking post (and his other related posts). Despite that, I still have serious reservations about some of the decision-making strategies used both at GW and in the EA community at large. It could be that Holden and I mostly agree, but other people take different positions. It could be that Holden and I agree about a lot of things at a high-level but then have significantly different perspectives about how those things we agree on at a high-level should actually manifest themselves in concrete decision making.

For what it's worth, I do feel like the page you linked to from GiveWell's website may downplay the role cost-effectiveness plays in its final recommendations (though GiveWell may have a good rebuttal).

In a response to Taymon's comment, I left a specific example of something I'd like to see change. In general, I'd like people to be more reluctant to brute-force push their way through uncertainty by putting numbers on things. I don't think people need to stop doing that entirely, but I think it should be done while keeping in mind something like: "I'm using lots of probabilities in a domain where I have no idea if I'm well-calibrated...I need to be extra skeptical of whatever conclusions I reach."

comment by AGB · 2019-04-20T22:11:06.076Z · score: 5 (3 votes) · EA · GW

Fair enough. I remain in almost-total agreement, so I guess I'll just have to try and keep an eye out for what you describe. But based on what I've seen within EA, which is evidently very different to what you've seen, I'm more worried about little-to-zero quantification than excessive quantification.

comment by Chris Smith · 2019-04-15T21:50:26.461Z · score: 4 (3 votes) · EA · GW

I'd also be excited to see more people in the EA movement doing the sort of work that I think would put society in a good position for handling future problems when they arrive. E.g., I think a lot of people who associate with EA might be awfully good at pushing for progress in metascience/open science or promoting a free & open internet.

comment by Taymon · 2019-04-14T03:11:36.425Z · score: 27 (14 votes) · EA · GW

Can you give an example of a time when you believe that the EA community got the wrong answer to an important question as a result of not following your advice here, and how we could have gotten the right answer by following it?

comment by Chris Smith · 2019-04-15T21:04:08.390Z · score: 18 (7 votes) · EA · GW

Sure. To be clear, I think most of what I'm concerned about applies to prioritization decisions made in highly-uncertain scenarios. So far, I think the EA community has had very few opportunities to look back and conclusively assess whether highly-uncertain things it prioritized turned out to be worthwhile. (Ben makes a similar point at https://www.lesswrong.com/posts/Kb9HeG2jHy2GehHDY/effective-altruism-is-self-recommending.)

That said, there are cases where I believe mistakes are being made. For example, I think mass deworming in areas where almost all worm infections are light cases of trichuriasis or ascariasis is almost certainly not among the most cost-effective global health interventions.

Neither trichuriasis nor ascariasis appear to have common/significant/easily-measured symptoms when infections are light (i.e., when there are not many worms in an infected person's body). To reach the conclusion that treating these infections has a high expected value, extrapolations are made from the results of a study that had some weird features and occurred in a very different environment (an environment with far heavier infections and additional types of worm infections). When GiveWell makes its extrapolations, lots of discounts, assumptions, probabilities, etc. are used. I don't think people can make this kind of extrapolation reliably (even if they're skeptical, smart, and thinking carefully). When unreliable estimates are combined with an optimization procedure, I worry about the optimizer's curse.

Someone who is generally skeptical of people's ability to productively use models in highly-uncertain situations might instead survey experts about the value of treating light trichuriasis & ascariasis infections. Faced with the decision of funding either this kind of deworming or a different health program that looked highly effective, I think the person in this example who ran surveys would choose the latter.

comment by Chris Smith · 2019-04-15T21:10:35.489Z · score: 8 (6 votes) · EA · GW

Just to be clear, much of the deworming work supported by people in the EA community happens in areas where worm infections are more intense or are caused by worm species other than Trichuris & Ascaris. However, I believe a non-trivial amount of deworming done by charities supported by the EA community occurs in areas w/ primarily light infections from those worms.

comment by Max_Daniel · 2019-04-04T22:48:22.012Z · score: 27 (10 votes) · EA · GW

I haven't had time yet to think about your specific claims, but I'm glad to see attention on this issue. Thank you for contributing to what I suspect is an important discussion!

You might be interested in the following paper which essentially shows that under an additional assumption the Optimizer's Curse not only makes us overestimate the value of the apparent top option but in fact can make us predictably choose the wrong option.

Denrell, J. and Liu, C., 2012. Top performers are not the most impressive when extreme performance indicates unreliability. Proceedings of the National Academy of Sciences, 109(24), pp.9331-9336.

The crucial assumption roughly is that the reliability of our assessments varies sufficiently much between options. Intuitively, I'm concerned that this might apply when EAs consider interventions across different cause areas: e.g., our uncertainty about the value of AI safety research is much larger than our uncertainty about the short-term benefits of unconditional cash transfers.
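
To make that concern concrete, here is a toy sketch (my own numbers, not from the paper) where one well-understood option is truly better than a set of poorly understood ones, yet picking by the highest estimate usually selects a poorly understood option:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 10_000

true_a, true_b = 1.0, 0.5   # A is genuinely better than every B
est_a = true_a + rng.normal(0, 0.1, size=n_trials)          # A: well understood, precise estimate
est_b = true_b + rng.normal(0, 2.0, size=(n_trials, 10))    # B1..B10: poorly understood, very noisy

picked_noisy = est_b.max(axis=1) > est_a                     # take whichever estimate is highest
value_of_pick = np.where(picked_noisy, true_b, true_a)
print("picked a poorly understood option:", f"{picked_noisy.mean():.0%} of the time")
print("average true value of the pick:", round(value_of_pick.mean(), 2),
      "(always picking A would give 1.0)")
```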

(See also the part on the Optimizer's Curse and endnote [6] on Denrell and Liu (2012) in this post by me [EA · GW], though I suspect it won't teach you anything new.)

comment by kbog · 2019-04-05T17:52:55.083Z · score: 5 (3 votes) · EA · GW

Kind of an odd assumption that dependence on luck varies from player to player.

If we are talking about charity evaluations then reliability can be estimated directly so this is no longer a predictable error.

comment by Chris Smith · 2019-04-05T22:12:52.189Z · score: 8 (3 votes) · EA · GW

Can you expand on how you would directly estimate the reliability of charity evaluations? I feel like there are a lot of realistic situations where this would be extremely difficult to do well.

comment by kbog · 2019-04-06T00:54:25.831Z · score: 4 (3 votes) · EA · GW

I mean do the adjustment for the optimizer's curse. Or whatever else is in that paper.

I think talk of doing things "well" or "reliably" should be tabooed from this discussion, because no one has any coherent idea of what the threshold for 'well enough' or 'reliable enough' means or is in this context. "Better" or "more reliable" makes sense.

comment by Max_Daniel · 2019-04-06T12:35:48.099Z · score: 7 (5 votes) · EA · GW
Kind of an odd assumption that dependence on luck varies from player to player.

Intuitively, it strikes me as appropriate for some realistic situations. For example, you might try to estimate the performance of people based on quite different kinds or magnitudes of inputs; e.g. one applicant might have a long relevant track record, for another one you might just have a brief work test. Or you might compare the impact of interventions that are backed by very different kinds of evidence - say, a RCT vs. a speculative, qualitative argument.

Maybe there is something I'm missing here about why the assumption is odd, or perhaps even why the examples I gave don't have the property required in the paper? (The latter would certainly be plausible as I read the paper a while ago, and even back then not very closely.)

comment by Max_Daniel · 2019-04-06T12:41:13.394Z · score: 2 (2 votes) · EA · GW
If we are talking about charity evaluations then reliability can be estimated directly so this is no longer a predictable error.

Hmm. This made me wonder whether the paper's results depends on the decision-maker being uncertain about which options have been estimated reliably vs. unreliably. It seems possible that the effect could disappear if the reliability of my estimates varies but I know that the variance of my value estimate for option 1 is v_1, the one for option 2 v_2 etc. (even if the v_i vary a lot). (I don't have time to check the paper or get clear on this I'm afraid.)

Is this what you were trying to say here?

comment by Chris Smith · 2019-04-05T22:15:36.779Z · score: 3 (2 votes) · EA · GW

Thanks Max! That paper looks interesting—I'll have to give it a closer read at some point.

I agree with you that how the reliability of assessments varies between options is crucial.

comment by cole_haus · 2019-04-04T19:06:53.335Z · score: 10 (8 votes) · EA · GW

There's actually a thing called the Satisficer's Curse (pdf) which is even more general:

The Satisficer’s Curse is a systematic overvaluation that occurs when any uncertain prospect is chosen because its estimate exceeds a positive threshold. It is the most general version of the three curses, all of which can be seen as statistical artefacts.
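
The same selection effect shows up with a threshold rather than a maximum; a quick sketch (toy numbers, unbiased noisy estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = rng.normal(0.0, 1.0, size=100_000)
estimate = true_value + rng.normal(0.0, 1.0, size=100_000)   # unbiased but noisy

accepted = estimate > 1.0                                     # fund anything whose estimate clears the bar
print("mean estimate of accepted prospects:  ", estimate[accepted].mean())
print("mean true value of accepted prospects:", true_value[accepted].mean())
# Conditioning on clearing the threshold selects for positive noise, so the first
# number exceeds the second even though no single estimate is biased.
```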

comment by kbog · 2019-04-17T04:15:48.638Z · score: 4 (2 votes) · EA · GW

Also, if your criterion for choosing an intervention is how frequently it still looks good under different models and priors, as people seem to be suggesting in lieu of EV maximization, you will still get similar curses - they'll just apply to the number of models/priors, rather than to the numbers in the EV estimate.

comment by kbog · 2019-04-04T19:26:59.430Z · score: 9 (6 votes) · EA · GW
I believe effective altruists should find this extremely concerning.

Well it does not change the ordering of options. You're kind of doing a wrong-way reduction here: you're taking the question of what project should I support and "reducing" it to literal quantitative estimation of effectiveness. Optimizer's curse only matters when comparing better-understood projects to worse-understood projects, but you are talking about "prioritizing among funding opportunities that involve substantial, poorly understood uncertainty".

In most scenarios where effective altruists encounter the optimizer’s curse, this solution is unworkable. The necessary data doesn’t exist.

We can specify a prior distribution.

I don’t think ignorance must cash out as a probability distribution. I don’t have to use probabilistic decision theory to decide how to act.

Well no, but it's better if you do. That Deutsch quote seems to say that it could allow people to take bad reasons and overstate them; that sounds like a problem with thinking in general. And there is no reason to assume that probabilistic decision makers will overestimate as opposed to underestimate. There have been many times when I had a vague, scarce prejudice/suspicion based on personal ignorance, and deeper analysis of reliable sources showed that I was correct and underconfident. If you think your vague suspicions aren't useful, then just don't trust them! Every system of thinking is going to bottom out in "be rational, don't be irrational" at some point, so this is not a problem with probabilism in particular.

The reason it's better is that it allows better rigor and accuracy. For instance, look how this post revolves around the optimizer's curse. Here's a question: how are you going to adjust for the optimizer's curse if you don't use probability (implicitly or explicitly)? And if people weren't using probabilistic decision theory, no one would have discovered the optimizer's curse in the first place!

Kyle is an atheist. When asked what odds he places on the possibility that an all-powerful god exists, he says “2%.”

Hey! I didn't consent to being included in your post!!!

Here's what it means, formally: given that I have an equal desire to be right about the existence of God and the nonexistence of God, and given some basic assumptions about my money and my desire for money, I would make a bet with at most 50:1 odds that all-powerful-God exists.

Placing hazy probabilities on the same footing as better-grounded probabilities (e.g., the odds a coin comes up heads) can lead to problems.

But in Bayesian decision theory, they aren't on the same footing. They have very different levels of robustness. They are not well-grounded and this matters for how readily we update away from them. Is the notion of robustness inadequate for solving some problem here? In the Norton paper that you cite later on this point, I ctrl-F for "robust" and find nothing.

When discussing these ideas with members of the effective altruism community, I felt that people wanted me to propose a formulaic solution—some way to explicitly adjust expected value estimates that would restore the integrity of the usual prioritization methods. I don’t have any suggestions of that sort.

All of your suggestions make perfect sense under standard, Bayesian probability and decision theory. As stated, they are kind of platitudinous. Moreover, it's not clear to me that abandoning these principles in favor of a some deeper concept of ignorance actually helps motivate any of your recommendations. Why, exactly, is it important that I embrace model skepticism for instance - just because I have decided to abandon probabilities? Does abandoning probabilities reduce the variance in the usefulness of different models? It can't, actually, because without probabilities the variance is going to be undefined.

In practice, I haven't done things with multiple quantitative models because (a) models are tough to build, and (b) a good model accommodates all kinds of uncertainty anyway. It's never been the case where I've found some new information/ideas, decided to update my model, and then realized "uh oh, I can't do this in this model." I can always just add new calculations for the new considerations, and it becomes a bit kludgy but still seems more accurate. So yeah this is good in theory but the practical value seems very limited. To be sure, I haven't really tried it yet.

If we want to test the accuracy of a model, we need to test a statistically significant number of the things predicted by the model. It's not sufficient for us to donate to AMF, see that AMF seems to work pretty well (or not), and then judge Givewell accordingly. We need to see whether Givewell's ordering of multiple charities holds.

Testing works well in some contexts. In others it's just unrealistic.

Improving social capacity tends to work better when society is trusted to actually do the right thing.

In my experience, effective altruists are unusually skeptical of conventional wisdom, tradition, intuition, and similar concepts.

But these are exactly the things that you are objecting to. Where do you think probability estimates of deeply uncertain things come from? If there's some disagreement here about the actual reliability of things like intuition and tradition, it hasn't been made explicit. Instead, you've just said that such things should not be expressed in the form of quantitative probabilities.

comment by Chris Smith · 2019-04-04T20:39:44.581Z · score: 2 (4 votes) · EA · GW

Thanks for the detailed comment!

I expect we’ll remain in disagreement, but I’ll clarify where I stand on a couple of points you raised:

“Optimizer's curse only matters when comparing better-understood projects to worse-understood projects, but you are talking about "prioritizing among funding opportunities that involve substantial, poorly understood uncertainty."

Certainly, the optimizer’s curse may be a big deal when well-understood projects are compared with poorly-understood projects. However, I don’t think it’s the case that all projects involving "substantial, poorly understood uncertainty" are on the same footing. Rather, each project is on its own footing, and we're somewhat ignorant about how firm that footing is.

“We can use prior distributions.”

Yes, absolutely. What I worry about is how reliable those priors will be. I maintain that, in many situations, it’s very hard to defend any particular prior.

“And there is no reason to assume that probabilistic decision makers will overestimate as opposed to underestimate.”

This gets at what I’m really worried about! Let’s assume decisionmakers coming up with probabilistic estimates to assess potential activities don’t have a tendency to overestimate or underestimate. However, once a decisionmaker has made many estimates, there is reason to believe the activities that look most promising likely involve overestimates (because of the optimizer’s curse).

“Here's a question: how are you going to adjust for the optimizer's curse if you don't use probability (implicitly or explicitly)?”

This is a great question!

Rather than saying, "This is a hard problem, and I have an awesome solution no one else has proposed," I'm trying to say something more like, "This is a problem we should acknowledge! Let's also acknowledge that it's a damn hard problem and may not have an easy solution!"

That said, I think there are approaches that have promise (but are not complete solutions):
-Favoring opportunities that look promising under multiple models (see the sketch after this list).

-Being skeptical of opportunities that look promising under only a single model.

-Learning more (if that can cause probability estimates to become less uncertain & hazy).

-Doing more things to put society in a good position to handle problems when they arise (or become apparent) instead of trying to predict problems before they arise (or become apparent).
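
As a rough sketch of the first two suggestions (hypothetical cost-effectiveness numbers under three models), preferring the option that holds up across models rather than the one with the single highest score:

```python
import numpy as np

options = ["A", "B", "C", "D"]
# Hypothetical cost-effectiveness of each option under three models with different assumptions.
scores = np.array([
    [10.0, 2.0, 3.0],   # A: spectacular under one model, mediocre under the others
    [ 4.0, 5.0, 4.5],   # B: decent under all three
    [ 1.0, 1.5, 1.2],   # C: weak everywhere
    [ 6.0, 0.5, 0.8],   # D: only one model likes it
])

ranks = (-scores).argsort(axis=0).argsort(axis=0)   # rank of each option within each model (0 = best)
median_rank = np.median(ranks, axis=1)              # robustness across models
for name, r in sorted(zip(options, median_rank), key=lambda t: t[1]):
    print(name, "median rank across models:", r)
# B ranks best by this criterion, even though A has the single highest score anywhere.
```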

“Here's what it means, formally: given that I have an equal desire to be right about the existence of God and the nonexistence of God, and given some basic assumptions about my money and my desire for money, I would make a bet with at most 50:1 odds that all-powerful-God exists.”

This is how a lot of people think about statements of probability, and I think that’s usually reasonable. I’m concerned that people are sometimes accidentally equivocating between: “I would bet on this with at most 50:1 odds” and “this is as likely to occur as a perfectly fair 50-sided die being rolled and coming up ‘17’”

“But in Bayesian decision theory, they aren't on the same footing. They have very different levels of robustness. They are not well-grounded and this matters for how readily we update away from them. Is the notion of robustness inadequate for solving some problem here?”

The notion of robustness points in the right direction, but I think it’s difficult (perhaps impossible) to reliably and explicitly quantify robustness in the situations we’re concerned about.

comment by kbog · 2019-04-04T21:23:28.552Z · score: 3 (2 votes) · EA · GW
Certainly, the optimizer’s curse may be a big deal when well-understood projects are compared with poorly-understood projects. However, I don’t think it’s the case that all projects involving "substantial, poorly understood uncertainty" are on the same footing. Rather, each project is on its own footing, and we're somewhat ignorant about how firm that footing is.

"Footing" here is about the robustness of our credences, so I'm not sure that we can really be ignorant of them. Yes different projects in a poorly understood domain will have different levels of poorly understood uncertainty, but it's not clear that this is more important than the different levels of uncertainty in better-understood domains (e.g. comparisons across Givewell charities).

What I worry about is how reliable those priors will be.

What do you mean by reliable?

I maintain that, in many situations, it’s very hard to defend any particular prior.

Yes, but it's very hard to attack any particular prior as well.

Let’s assume decisionmakers coming up with probabilistic estimates to assess potential activities don’t have a tendency to overestimate or underestimate. However, once a decisionmaker has made many estimates, there is reason to believe the activities that look most promising likely involve overestimates (because of the optimizer’s curse).

Yes I know but again it's the ordering that matters. And we can correct for optimizer's curse, and we don't know if these corrections will overcorrect or undercorrect.

"This is a problem we should acknowledge! Let's also acknowledge that it's a damn hard problem and may not have an easy solution!"

"The problem" should be precisely defined. Identifying the correct intervention is hard because the optimizer's curse complicates comparisons between better- and worse-substantiated projects? Yes we acknowledge that. And you are not just saying that there's a problem, you are saying that there is a problem with a particular methodology, Bayesian probability. That is very unclear.

-Favoring opportunities that look promising under multiple models.
-Being skeptical of opportunities that look promising under only a single model.
-Learning more (if that can cause probability estimates to become less uncertain & hazy).
-Doing more things to put society in a good position to handle problems when they arise (or become apparent) instead of trying to predict problems before they arise (or become apparent).

This is just a generic bucket of "stuff that makes estimates more accurate, sometimes" without any more connection to the optimizer's curse than to any other facets of uncertainty.

Let's imagine I make a new group whose job is to randomly select projects and then estimate each project's expected utility as accurately and precisely as possible. In this case the optimizer's curse will not apply to me. But I'll still want to evaluate things with multiple models, learn more and use proxies such as social capacity.

What is some advice that my group should not follow, that Givewell or Open Philanthropy should follow? Aside from the existing advice for how to make adjustments for the Optimizer's Curse.

The notion of robustness points in the right direction, but I think it’s difficult (perhaps impossible) to reliably and explicitly quantify robustness in the situations we’re concerned about.

If you want, you can define some set of future updates (e.g. researching something for 1 week) and specify a probability distribution for your belief state after that process. I don't think that level of explicit detail is typically necessary though. You can just give a rough idea of your confidence level alongside likelihood estimates.

comment by MichaelStJules · 2019-04-14T17:57:45.144Z · score: 3 (4 votes) · EA · GW
Yes, but it's very hard to attack any particular prior as well.

I don't think this leaves you in a good position if your estimates and rankings are very sensitive to the choice of "reasonable" priors. Chris illustrated this in his post at the end of part 2 (with the atheist example), and in part 3.

You could try to choose some compromise between these priors, but there are multiple "reasonable" ways to compromise. You could introduce a prior on these priors, but you could run into the same problem with multiple "reasonable" choices for this new prior.

comment by kbog · 2019-04-14T20:15:51.358Z · score: 3 (2 votes) · EA · GW
I don't think this leaves you in a good position if your estimates and rankings are very sensitive to the choice of "reasonable" priors.

What do you mean by "a good position"?

You could try to choose some compromise between these priors, but there are multiple "reasonable" ways to compromise. You could introduce a prior on these priors, but you could run into the same problem with multiple "reasonable" choices for this new prior.

Ah, I guess we'll have to switch to a system of epistemology which doesn't bottom out in unproven assumptions. Hey hold on a minute, there is none.

I'm getting a little confused about what sorts of concrete conclusions we are supposed to take away from here.

comment by MichaelStJules · 2019-04-14T23:09:51.766Z · score: 3 (4 votes) · EA · GW
What do you mean by "a good position"?
(...)
I'm getting a little confused about what sorts of concrete conclusions we are supposed to take away from here.

I'm not saying we shouldn't use priors or that they'll never help. What I am saying is that they don't address the optimizer's curse just by including them, and I suspect they won't help at all on their own in some cases.

Maybe checking sensitivity to priors and further promoting interventions whose value depends less on them (among some set of "reasonable" priors) would help. You could see this as a special case of Chris's suggestion to "Entertain multiple models".

Perhaps you could even use an explicit model to combine the estimates or posteriors from multiple models into a single one in a way that either penalizes sensitivity to priors or gives less weight to more extreme estimates, but a simpler decision rule might be more transparent or otherwise preferable. From my understanding, GiveWell already uses medians of its analysts' estimates this way.
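
As a toy illustration of that idea (hypothetical numbers, normal-normal shrinkage, and a median as the simple aggregation rule), showing how the same noisy estimate lands very differently under three priors that might each seem "reasonable":

```python
import numpy as np

estimate, estimate_sd = 20.0, 10.0              # one noisy cost-effectiveness estimate

# Three priors that might each seem "reasonable" for this cause area (hypothetical).
priors = [(1.0, 2.0), (3.0, 5.0), (5.0, 10.0)]  # (prior mean, prior sd)

posterior_means = []
for prior_mean, prior_sd in priors:
    w = prior_sd**2 / (prior_sd**2 + estimate_sd**2)     # normal-normal shrinkage weight
    posterior_means.append(prior_mean + w * (estimate - prior_mean))

print("posterior means under each prior:", [round(m, 1) for m in posterior_means])
print("median across priors:", round(float(np.median(posterior_means)), 1))
print("spread (a crude sensitivity-to-prior measure):",
      round(max(posterior_means) - min(posterior_means), 1))
```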

Ah, I guess we'll have to switch to a system of epistemology which doesn't bottom out in unproven assumptions. Hey hold on a minute, there is none.

I get your point, but the snark isn't helpful.

comment by kbog · 2019-04-15T03:34:46.655Z · score: 2 (1 votes) · EA · GW
What I am saying is that they don't address the optimizer's curse just by including them, and I suspect they won't help at all on their own in some cases.

You seem to be using "people all agree" as a stand-in for "the optimizer's curse has been addressed". I don't get this. Addressing the optimizer's curse has been mathematically demonstrated. Different people can disagree about the specific inputs, so people will disagree, but that doesn't mean they haven't addressed the optimizer's curse.

Maybe checking sensitivity to priors and further promoting interventions whose value depends less on them (among some set of "reasonable" priors) would help. You could see this as a special case of Chris's suggestion to "Entertain multiple models".
Perhaps you could even use an explicit model to combine the estimates or posteriors from multiple models into a single one in a way that either penalizes sensitivity to priors or gives less weight to more extreme estimates, but a simpler decision rule might be more transparent or otherwise preferable.

I think combining into a single model is generally appropriate. And the sub-models need not be fully, explicitly laid out.

Suppose I'm demonstrating that poverty charity > animal charity. I don't have to build one model assuming "1 human = 50 chickens", another model assuming "1 human = 100 chickens", and so on.

Instead I just set a general standard for how robust my claims are going to be, and I feel sufficiently confident saying "1 human = at least 60 chickens", so I use that rather than my mean expectation (e.g. 90).

comment by MichaelStJules · 2019-04-16T06:42:19.149Z · score: 3 (2 votes) · EA · GW
You seem to be using "people all agree" as a stand-in for "the optimizer's curse has been addressed". I don't get this. Addressing the optimizer's curse has been mathematically demonstrated. Different people can disagree about the specific inputs, so people will disagree, but that doesn't mean they haven't addressed the optimizer's curse.

Maybe we're thinking about the optimizer's curse in different ways.

The proposed solution of using priors just pushes the problem to selecting good priors. It's also only a solution in the sense that it reduces the likelihood of mistakes happening (discovered in hindsight, and under the assumption of good priors), but not provably to its minimum, since it does not eliminate the impacts of noise. (I don't think there's any complete solution to the optimizer's curse, since, as long as estimates are at least somewhat sensitive to noise, "lucky" estimates will tend to be favoured, and you can't distinguish, in principle, between "lucky" and "better" interventions.)

If you're presented with multiple priors, and they all seem similarly reasonable to you, but depending on which ones you choose, different actions will be favoured, how would you choose how to act? It's not just a matter of different people disagreeing on priors, it's also a matter of committing to particular priors in the first place.

If one action is preferred with almost all of the priors (perhaps rare in practice), isn't that a reason (perhaps insufficient) to prefer it? To me, using this could be an improvement over just using priors, because I suspect it will further reduce the impacts of noise, and if it is an improvement, then just using priors never fully solved the problem in practice in the first place.

I agree with the rest of your comment. I think something like that would be useful.

comment by kbog · 2019-04-17T03:23:54.447Z · score: 9 (3 votes) · EA · GW
The proposed solution of using priors just pushes the problem to selecting good priors.

The problem of the optimizer's curse is that the EV estimates of high-EV-options are predictably over-optimistic in proportion with how unreliable the estimates are. That problem doesn't exist anymore.

The fact that you don't have guaranteed accurate information doesn't mean the optimizer's curse still exists.

I don't think there's any complete solution to the optimizer's curse

Well there is, just spend too much time worrying about model uncertainty and other people's priors and too little time worrying about expected value estimation. Then you're solving the optimizer's curse too much, so that your charity selections will be less accurate and predictably biased in favor of low EV, high reliability options. So it's a bad idea, but you've solved the optimizer's curse.

If you're presented with multiple priors, and they all seem similarly reasonable to you, but depending on which ones you choose, different actions will be favoured, how would you choose how to act?

Maximize the expected outcome over the distribution of possibilities.

If one action is preferred with almost all of the priors (perhaps rare in practice), isn't that a reason (perhaps insufficient) to prefer it?

What do you mean by "the priors"? Other people's priors? Well if they're other people's priors and I don't have reason to update my beliefs based on their priors, then it's trivially true that this doesn't give me a reason to prefer the action. But you seem to think that other people's priors will be "reasonable", so obviously I should update based on their priors, in which case of course this is true - but only in a banal, trivial sense that has nothing to do with the optimizer's curse.

To me, using this could be an improvement over just using priors

Hm? You're just suggesting updating one's prior by looking at other people's priors. Assuming that other people's priors might be rational, this is banal - of course we should be reasonable, epistemically modest, etc. But this has nothing to do with the optimizer's curse in particular, it's equally true either way.

I ask the same question I asked of OP: give me some guidance that applies for estimating the impact of maximizing actions that doesn't apply for estimating the impact of randomly selected actions. So far it still seems like there is none - aside from the basic idea given by Muelhauser.

just using priors never fully solved the problem in practice in the first place

Is the problem the lack of guaranteed knowledge about charity impacts, or is the problem the optimizer's curse? You seem to (incorrectly) think that chipping away at the former necessarily means chipping away at the latter.

comment by Chris Smith · 2019-04-18T04:08:03.080Z · score: 4 (3 votes) · EA · GW

It's always worth entertaining multiple models if you can do that at no cost. However, doing that often comes at some cost (money, time, etc). In situations with lots of uncertainty (where the optimizer's curse is liable to cause significant problems), it's worth paying much higher costs to entertain multiple models (or do other things I suggested) than it is in cases where the optimizer's curse is unlikely to cause serious problems.

comment by kbog · 2019-04-21T03:04:05.673Z · score: 2 (1 votes) · EA · GW
In situations with lots of uncertainty (where the optimizer's curse is liable to cause significant problems), it's worth paying much higher costs to entertain multiple models (or do other things I suggested) than it is in cases where the optimizer's curse is unlikely to cause serious problems.

I don't agree. Why is the uncertainty that comes from model uncertainty - as opposed to any other kind of uncertainty - uniquely important for the optimizer's curse? The optimizer's curse does not discriminate between estimates that are too high for modeling reasons, versus estimates that are too high for any other reason.

The mere fact that there's more uncertainty is not relevant, because we are talking about how much time we should spend worrying about one kind of uncertainty versus another. "Do more to reduce uncertainty" is just a platitude, we always want to reduce uncertainty.

comment by MichaelStJules · 2019-04-20T22:32:02.652Z · score: 1 (1 votes) · EA · GW

I made a long top-level comment [EA · GW] that I hope will clarify some problems with the solution proposed in the original paper.

I ask the same question I asked of OP: give me some guidance that applies for estimating the impact of maximizing actions that doesn't apply for estimating the impact of randomly selected actions.

This is a good point. Somehow, I think you’d want to adjust your posterior downward based on the set or the number of options under consideration and how unlikely the data that makes the intervention look good. This is not really useful, since I don't know how much you should adjust these. Maybe there's a way to model this explicitly, but it seems like you'd be trying to model your selection process itself before you've defined it, and then you look for a selection process which satisfies some properties.

You might also want to spend more effort looking for arguments and evidence against each option the more options you're considering.

When considering a larger number of options, you could use some [EA · GW] randomness in your selection process [EA · GW] or spread funding further (although the latter will be vulnerable to the satisficer's curse if you're using cutoffs).

What do you mean by "the priors"?

If I haven’t decided on a prior, and multiple different priors (even an infinite set of them) seem equally reasonable to me.

comment by kbog · 2019-04-21T03:01:59.345Z · score: 2 (1 votes) · EA · GW
Somehow, I think you’d want to adjust your posterior downward based on the set or the number of options under consideration and how unlikely the data that makes the intervention look good.

That's the basic idea given by Muelhauser. Corrected posterior EV estimates.
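
For concreteness, a minimal sketch of that correction (single shared prior, equal noise on every estimate, toy numbers). With these symmetric assumptions the ranking is unchanged, but the corrected value of the top pick is no longer systematically too high:

```python
import numpy as np

rng = np.random.default_rng(0)
prior_mean, prior_sd, noise_sd = 0.0, 1.0, 2.0
n_trials, n_options = 10_000, 20

true_value = rng.normal(prior_mean, prior_sd, size=(n_trials, n_options))
estimate = true_value + rng.normal(0, noise_sd, size=(n_trials, n_options))

# Shrink each estimate toward the prior in proportion to how noisy it is.
w = prior_sd**2 / (prior_sd**2 + noise_sd**2)
posterior = prior_mean + w * (estimate - prior_mean)

rows = np.arange(n_trials)
naive_pick = estimate.argmax(axis=1)
corrected_pick = posterior.argmax(axis=1)   # identical ranking here, since the shrinkage is uniform
print("naive:     estimated", round(estimate[rows, naive_pick].mean(), 2),
      "vs actual", round(true_value[rows, naive_pick].mean(), 2))
print("corrected: estimated", round(posterior[rows, corrected_pick].mean(), 2),
      "vs actual", round(true_value[rows, corrected_pick].mean(), 2))
```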

You might also want to spend more effort looking for arguments and evidence against each option the more options you're considering.

As opposed to equal effort for and against? OK, I'm satisfied. However, if I've done the corrected posterior EV estimation, and then my specific search for arguments-against turns up short, then I should increase my EV estimates back towards the original naive estimate.

When considering a larger number of options, you could use some [EA · GW] randomness in your selection process [EA · GW]

As I recall, that post found that randomized funding doesn't make sense. Which 100% matches my presumptions, I do not see how it could improve funding outcomes.

or spread funding further

I don't see how that would improve funding outcomes.

If I haven’t decided on a prior, and multiple different priors (even an infinite set of them) seem equally reasonable to me.

In Bayesian rationality, you always have a prior. You seem to be considering or defining things differently.

Here we would probably say that your actual prior exists and is simply some kind of aggregate of these possible priors, therefore it's not the case that we should leap outside our own priors in some sort of violation of standard Bayesian rationality.

comment by Milan_Griffes · 2019-04-16T14:21:57.678Z · score: 2 (4 votes) · EA · GW
The proposed solution of using priors just pushes the problem to selecting good priors.

+1

In conversations I've had about this stuff, it seems like the crux is often the question of how easy it is to choose good priors, and whether a "good" prior is even an intelligible concept.

Compare Chris' piece ("selecting good priors is really hard!") with this piece by Luke Muehlhauser ("the optimizer's curse is trivial, just choose an appropriate prior!")

comment by kbog · 2019-04-17T03:31:38.186Z · score: 3 (2 votes) · EA · GW
it seems like the crux is often the question of how easy it is to choose good priors

Before anything like a crux can be identified, complainants need to identify what a "good prior" even means, or what strategies are better than others. Until then, they're not even wrong - it's not even possible to say what disagreement exists. To airily talk about "good priors" or "bad priors", being "easy" or "hard" to identify, is just empty phrasing and suggests confusion about rationality and probability.

comment by Chris Smith · 2019-04-18T04:02:09.182Z · score: 4 (3 votes) · EA · GW

Hey Kyle, I'd stopped responding since I felt like we were well beyond the point where we were likely to convince one another or say things that those reading the comments would find insightful.

I understand why you think "good prior" needs to be defined better.

As I try to communicate (but may not quite say explicitly) in my post, I think that in situations where uncertainty is poorly understood, it's hard to come up with priors that are good enough that choosing actions based on explicit Bayesian calculations will lead to better outcomes than choosing actions based on a combination of careful skepticism, information gathering, hunches, and critical thinking.

comment by Chris Smith · 2019-04-18T04:15:54.376Z · score: 9 (4 votes) · EA · GW

As a real world example:

Venture capitalists frequently fund things that they're extremely uncertain about. It's my impression that Bayesian calculations rarely play into these situations. Instead, smart VCs think hard and critically and come to conclusions based on processes that they probably don't fully understand themselves.

It could be that VCs have just failed to realize the amazingness of Bayesianism. However, given that they're smart & there's a ton of money on the table, I think the much more plausible explanation is that hardcore Bayesianism wouldn't lead to better results than whatever it is that successful VCs actually do.

comment by Chris Smith · 2019-04-18T04:22:56.814Z · score: 5 (4 votes) · EA · GW

Again, none of this is to say that Bayesianism is fundamentally broken or that high-level Bayesian-ish things like "I have a very skeptical prior so I should not take this estimate of impact at face value" are crazy.

comment by kbog · 2019-04-21T03:23:18.331Z · score: 2 (1 votes) · EA · GW

Venture capitalists frequently fund things that they're extremely uncertain about. It's my impression that Bayesian calculations rarely play into these situations. Instead, smart VCs think hard and critically and come to conclusions based on processes that they probably don't fully understand themselves.

I interned for a VC, albeit a small and unknown one. Sure, they don't do Bayesian calculations, if you want to be really precise. But they make extensive use of quantitative estimates all the same. If anything, they are cruder than what EAs do. As far as I know, they don't bother correcting for the optimizer's curse! I never heard it mentioned. VCs don't primarily rely on the quantitative models, but other areas of finance do. If what they do is OK, then what EAs do is better. This is consistent with what finance professionals told me about the financial modeling that I did.

Plus, this is not about the optimizer's curse. Imagine that you told those VCs that they were no longer choosing which startups are best, instead they now have to select which ones are better-than-average and which ones are worse-than-average. The optimizer's curse will no longer interfere. Yet they're not going to start relying more on explicit Bayesian calculations. They're going to use the same way of thinking as always.

And explicit Bayesian calculation is rarely used by anyone anywhere. Humans encounter many problems which are not about optimizing, and they still don't use explicit Bayesian calculation. So clearly the optimizer's curse is not the issue. Instead, it's a matter of which kinds of cognition and calculation people are more or less comfortable with.

comment by kbog · 2019-04-21T03:23:00.807Z · score: 2 (1 votes) · EA · GW
it's hard to come up with priors that are good enough that choosing actions based on explicit Bayesian calculations will lead to better outcomes than choosing actions based on a combination of careful skepticism, information gathering, hunches, and critical thinking.

Explicit Bayesian calculation is a way of choosing actions based on a combination of careful skepticism, information gathering, hunches, and critical thinking. (With math too.)

I'm guessing you mean we should use intuition for the final selection, instead of quantitative estimates. OK, but I don't see how the original post is supposed to back it up; I don't see what the optimizer's curse has to do with it.

comment by Chris Smith · 2019-04-05T22:37:15.929Z · score: 3 (5 votes) · EA · GW

I'm struggling to understand how your proposed new group avoids the optimizer's curse, and I'm worried we're already talking past each other. To be clear, I don't believe there's something wrong with Bayesian methods in the abstract. Those methods are correct in a technical sense. They clearly work in situations where everything that matters can be completely quantified.

The position I'm taking is that the scope of real-world problems that those methods are useful for is limited because our ability to precisely quantify things is severely limited in many real-world scenarios. In my post, I try to build the case for why attempting Bayesian approaches in scenarios where things are really hard to quantify might be misguided.

comment by kbog · 2019-04-06T00:41:42.674Z · score: 2 (1 votes) · EA · GW
I'm struggling to understand how your proposed new group avoids the optimizer's curse,

Because I'm not optimizing!

Of course it is still the case that the highest-scoring estimates will probably be overestimates in my new group. The difference is, I don't care about getting the right scores on the highest-scoring estimates. Now I care about getting the best scores on all my estimates.

Or to phrase it another way, suppose that the intervention will be randomly selected rather than picked from the top.

The position I'm taking is that the scope of real-world problems that those methods are useful for is limited because our ability to precisely quantify things is severely limited in many real-world scenarios. In my post, I try to build the case for why attempting Bayesian approaches in scenarios where things are really hard to quantify might be misguided.

Well yes, but I think the methods work better than anything else for all these scenarios.

comment by Milan_Griffes · 2019-04-15T20:58:39.637Z · score: 7 (3 votes) · EA · GW

FYI I asked about this on GiveWell's most recent open thread, Josh replied:

Hi Milan,
We’ve considered this issue and discussed it internally; we spent some time last year exploring ways in which we might potentially adjust our models for it, but did not come up with any promising solutions (and, as the post notes, an explicit quantitative adjustment factor is not Chris’s recommended solution at this time).
So, we are left in a difficult spot: the optimizer’s curse (and related issues) seems like a real threat, but we do not see high-return ways to address it other than continuing to broadly deepen and question our research. In the case that Chris highlights most — our recommendation of deworming — we have put substantial effort into working along the lines that he recommends and we continue to do so. Examples of the kind of additional scrutiny that we have given to this recommendation includes:

- Embracing model skepticism: We put weight on qualitative factors relevant to specific charities’ operations and specific uses of marginal funding (more). We generally try not to put too much weight on minor differences in cost-effectiveness analyses (more). We place substantial weight on cost-effectiveness analyses while doing what we can to recognize their limitations and bring in other forms of evidence.

– Re-examining our assumptions through vetting: we asked Senior Advisor David Roodman to independently assess the evidence for deworming and he produced extensive reports with his thoughts: see here and here.

– Having conversations and engaging with a variety of deworming researchers, particularly including skeptics. E.g., we’ve engaged with work from skeptical Cochrane researchers (e.g. here and here), epidemiologist Nathan Lo, Melissa Parker and Tim Allen (who looked at deworming through an anthropological perspective), etc.

– Funding additional research with the goal of potentially falsifying our conclusions: see e.g. grants here and here.
We will continue to take high-return steps to assess whether our recommendations are justified. For example, this year we are deepening our assessment of how we should expect deworming’s effectiveness to vary in contexts with different levels of worm infection. It is also on our list to consider quantitative adjustments for the optimizer’s curse further at some point in the future, but given the challenges we encountered in our work so far, we are unlikely to prioritize it soon.
Finally, we hope to continue to follow discussions on the optimizer’s curse and would be interested if theoretical progress or other practical suggestions are made. As Chris notes, this seems to be a cross-cutting theoretical issue that applies to cause prioritization researchers outside of GiveWell, as well.

comment by Milan_Griffes · 2019-04-04T16:57:19.264Z · score: 5 (3 votes) · EA · GW

Thanks for this!

I’ve come to believe there are serious, underappreciated issues with the methods the effective altruism (EA) community at large uses to prioritize causes and programs.

Is there a tl;dr of these issues?

comment by Chris Smith · 2019-04-04T17:31:15.560Z · score: 20 (12 votes) · EA · GW

Thanks Milan—I probably should have been a bit more detailed in my summary.

Here are the main issues I see:

-The optimizer's curse is an underappreciated threat to those who prioritize among causes and programs that involve substantial, poorly understood uncertainty.

-I think EAs are unusually prone to wrong-way reductions: a fallacy where people try to solve messy, hard problems with tidy, formulaic approaches that actually create more issues than they resolve.

--I argue that trying to turn all uncertainty into something like numeric probability estimates is a wrong-way reduction that can have serious consequences.

--I argue that trying to use Bayesian methods in situations where well-grounded priors are unavailable is often a wrong-way reduction. (For what it's worth, I rarely see EAs actually deploy these Bayesian methods, but I often see people suggest that the proper approaches in hard situations involve "making Bayesian adjustments." In many of these situations, I'd argue that something closer to run-of-the-mill critical thinking beats Bayesianism.)

-I think EAs sometimes have an unwarranted bias towards numerical, formulaic approaches over less-quantitative approaches.

comment by Milan_Griffes · 2019-04-04T16:56:36.542Z · score: 5 (4 votes) · EA · GW

Perhaps a related phenomenon is that "adding maximal value on the margin" can look a lot like "defecting from longterm alliances & relationships" when viewed from a different framing?

Most clearly when ongoing support to longterm allies is no longer as leveraged as marginal support towards a new effort.

comment by Chris Smith · 2019-04-04T17:45:15.871Z · score: 4 (3 votes) · EA · GW

It's definitely an interesting phenomenon & worth thinking about seriously.

Any procedures for optimizing for expected impact could go wrong if the value of long-term alliances and relationships isn't accounted for.

comment by MichaelStJules · 2019-04-21T00:00:43.866Z · score: 4 (2 votes) · EA · GW

This paper (Schuyler, J. R., & Nieman, T. (2007, January 1). Optimizer's Curse: Removing the Effect of this Bias in Portfolio Planning. Society of Petroleum Engineers. doi:10.2118/107852-MS; earlier version) has some simple recommendations for dealing with the Optimizer's Curse:

The impacts of the OC will be evident for any decisions involving ranking and selection among alternatives and projects. As described in Smith and Winkler, the effects increase when the true values of alternatives are more comparable and when the uncertainty in value estimations is higher. This makes intuitive sense: We expect a higher likelihood of making incorrect decisions when there is little true difference between alternatives and where there is significant uncertainty in our ability to assess value.
(...) Good decision-analysis practice suggests applying additional effort when we face closely competing alternatives with large uncertainty. In these cases, we typically conduct sensitivity analyses and value-of-information assessments to evaluate whether to acquire additional information. Incremental information must provide sufficient additional discrimination between alternatives to justify the cost of acquiring the additional information. New information will typically reduce the uncertainty in our values estimates, with the additional benefit of reducing the magnitude of OC.

The paper's focus is actually on a more concrete Bayesian approach, based on modelling the population from which potential projects are sampled.
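
A rough empirical-Bayes-flavoured sketch of that population-modelling idea (toy numbers; not the paper's actual procedure): infer the spread of true project values from the estimates themselves, shrink accordingly, and the top-ranked project's value stops being badly overestimated:

```python
import numpy as np

rng = np.random.default_rng(0)
noise_sd, n_projects, n_trials = 2.0, 200, 2_000
gap_raw, gap_shrunk = [], []

for _ in range(n_trials):
    true_value = rng.normal(1.0, 1.0, size=n_projects)            # unknown population of project values
    estimate = true_value + rng.normal(0, noise_sd, size=n_projects)

    pop_mean = estimate.mean()
    pop_var = max(estimate.var() - noise_sd**2, 1e-6)             # observed spread minus known noise variance
    w = pop_var / (pop_var + noise_sd**2)
    shrunk = pop_mean + w * (estimate - pop_mean)

    top = estimate.argmax()                                        # the project that looks best
    gap_raw.append(estimate[top] - true_value[top])
    gap_shrunk.append(shrunk[top] - true_value[top])

print("avg overestimate of the top-ranked project, raw estimate:   ", round(float(np.mean(gap_raw)), 2))
print("avg overestimate of the top-ranked project, shrunk estimate:", round(float(np.mean(gap_shrunk)), 2))
```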

comment by John_Maxwell_IV · 2019-04-06T00:12:07.519Z · score: 4 (2 votes) · EA · GW

Do you have any thoughts on Tetlock's work [EA · GW] which recommends the use of probabilistic reasoning and breaking questions down to make accurate forecasts?

comment by Chris Smith · 2019-04-06T00:47:01.573Z · score: 1 (1 votes) · EA · GW

I think it's super exciting—a really useful application of probability!

I don't know as much as I'd like to about Tetlock's work. My understanding is that the work has focused mostly on geopolitical events where forecasters have been awfully successful. Geopolitical events are a kind of thing I think people are in an OK position for predicting—i.e. we've seen a lot of geopolitical events in the past that are similar to the events we expect to see in the future. We have decent theories that can explain why certain events came to pass while others didn't.

I doubt that Tetlock-style forecasting would be as fruitful in unfamiliar domains that involve Knightian-ish uncertainty. Forecasting may not be particularly reliable for questions like:

-Will we have a detailed, broadly accepted theory of consciousness this century?

-Will quantum computers take off in the next 50 years?

-Will any humans leave the solar system by 2100?

(That said, following Tetlock's guidelines may still be worthwhile if you're trying to predict hard-to-predict things.)

comment by cole_haus · 2019-04-04T19:11:10.759Z · score: 4 (3 votes) · EA · GW

I don't know how promising others think this is, but I quite liked Concepts for Decision Making under Severe Uncertainty with Partial Ordinal and Partial Cardinal Preferences. It tries to outline possible decision procedures once you relax some of the subjective expected utility theory assumptions you object to. For example, it talks about the possibility of having a credal set of beliefs (if one objects to the idea of assigning a single probability) and then doing maximin on this, i.e. selecting the option that has the best expected utility according to its least favorable credences.
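
A tiny sketch of that maximin-over-a-credal-set rule (toy numbers, two states of the world, three candidate probabilities in the credal set):

```python
credal_set = [0.2, 0.4, 0.6]        # candidate probabilities that state 1 obtains
actions = {                          # utility of each action in (state 0, state 1)
    "safe bet":  (2.0, 2.0),
    "long shot": (0.0, 8.0),
}

def expected_utility(action, p):
    u0, u1 = actions[action]
    return (1 - p) * u0 + p * u1

for action in actions:
    worst = min(expected_utility(action, p) for p in credal_set)
    print(action, "worst-case expected utility:", worst)
# Maximin picks "safe bet" (worst case 2.0 vs 1.6), even though "long shot" has the
# higher expected utility under the middle credence (3.2 at p = 0.4).
```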

comment by saulius · 2019-04-14T20:07:24.699Z · score: 2 (2 votes) · EA · GW

I'm interested in what you think about using subjective confidence intervals to estimate effectiveness of charities and then comparing them. To account for the optimizer's curse, we can penalize charities that have wider confidence intervals. Not sure how it would be done in practice, but there probably is a mathematical method to calculate how much they should be penalized. Confidence intervals communicate both value and uncertainty at the same time and therefore avoid some of the problems that you talk about.
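
One crude version of that idea, sketched with hypothetical intervals and an arbitrary penalty factor (a principled penalty would presumably come out of a prior-based shrinkage rather than a hand-picked constant):

```python
# Hypothetical subjective 90% intervals for cost-effectiveness (value per $1,000 donated).
intervals = {
    "Charity A": (8.0, 12.0),    # well studied, narrow interval
    "Charity B": (2.0, 30.0),    # speculative, wide interval
}
penalty = 0.3                     # arbitrary: how much to discount per unit of interval width

for name, (lo, hi) in intervals.items():
    midpoint, width = (lo + hi) / 2, hi - lo
    print(f"{name}: midpoint {midpoint:.1f}, penalized score {midpoint - penalty * width:.1f}")
# Charity A (10.0 -> 8.8) now beats Charity B (16.0 -> 7.6) despite the lower midpoint.
```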

comment by MichaelStJules · 2019-04-20T22:00:14.348Z · score: 1 (1 votes) · EA · GW

I’m going to try to clarify further why I think the Bayesian solution in the original paper on the Optimizer’s Curse is inadequate.

The Optimizer's Curse is defined by Proposition 1: informally, the expectation of the estimated value of your chosen intervention overestimates the expectation of its true value when you select the intervention with the maximum estimate.

The proposed solution is to instead maximize the posterior expected value of the variable being estimated (conditional on your estimates, the data, etc.), with a prior distribution for this variable, and this is purported to be justified by Proposition 2.

However, Proposition 2 holds no matter which priors and models you use; there are no restrictions at all in its statement (or proof). It doesn’t actually tell you that your posterior distributions will tend to better predict values you will later measure in the real world (e.g. by checking if they fall in your 95% credence intervals), because there need not be any connection between your models or priors and the real world. It only tells you that your maximum posterior EV equals your corresponding prior’s EV (taking both conditional on the data, or neither, although the posterior EV is already conditional on the data).

Something I would still call an “optimizer’s curse” can remain even with this solution when we are concerned with the values of future measurements rather than just the expected values of our posterior distributions based on our subjective priors. I’ll give 4 examples: the first is just to illustrate, and the other 3 are real-world examples:

1. Suppose you have many different fair coins, but you aren’t 100% sure they’re all fair, so you have a prior distribution over the future frequency of heads for each (it could be symmetric in heads and tails, so the expected value would be 1/2 for each), and you use the same prior for each coin. You want to choose the coin which has the maximum future frequency of landing heads, based on information about the results of finitely many new coin flips from each coin. If you select the one with the maximum posterior EV, and repeat this trial many times (flip each coin multiple times, select the one with the max posterior EV, and then repeat), you will tend to find the posterior EV of your chosen coin to be greater than 1/2, but since the coins are actually fair, your estimate will be too high more than half of the time on average. I would still call this an “optimizer’s curse”, even though it followed the recommendations of the original paper. Of course, in this scenario, it doesn’t matter which coin is chosen.

Now, suppose all the coins are as before except for one which is actually biased towards heads, and you have a prior for it which will give a lower posterior EV conditional on k heads and no tails than the other coins would (e.g. you’ve flipped it many times before with particular results to achieve this; or maybe you already know its bias with certainty). You will record the results of k new coin flips for each coin. With enough coins, and depending on the actual probabilities involved, you could be less likely to select the biased coin (on average, over repeated trials) based on maximum posterior EV than by choosing a coin randomly; you'll do worse than chance.

(Math to demonstrate the possibility of the posteriors working this way for k heads out of k flips: you could have a uniform prior on the true future long-run average frequency of heads p for the unbiased coins, i.e. density f(p) = 1 for p in the interval [0, 1]; then the posterior after k heads and no tails is Beta(k+1, 1), whose mean is (k+1)/(k+2), which goes to 1 as k goes to infinity. You could have a prior which gives certainty to your biased coin having some true average frequency q < 1, so any of the unbiased coins which lands heads k out of k times will beat it for k large enough.)

If you flip each coin k times, there’s a number of coins, N, so that the true probability (not your modelled probability) of at least one of the other coins getting k heads is strictly greater than (N-1)/N, i.e. 1 - (1 - 1/2^k)^(N-1) > (N-1)/N (for k = 1 you need N ≥ 3, and for k = 2 you need N ≥ 9, so N grows pretty fast as a function of k). This means that, with probability strictly greater than (N-1)/N, you won’t select the biased coin, so with probability strictly less than 1/N, you will select the biased coin. So you actually do worse than random choice, because of how many different coins you have and how likely one of them is to get very lucky. You would have even been better off on average ignoring all of the new coin flips and sticking to your priors, if you already suspected the biased coin was better (i.e. if you had a prior with mean above 1/2). (A rough simulation of this kind of setup is sketched after example 4 below.)

2. A common practice in machine learning is to select the model with the greatest accuracy on a validation set among multiple candidates. Suppose that the validation and test sets are a random split of a common dataset for each problem. You will find, under repeated trials (not necessarily identical; they could be over different datasets/problems, with different models), that if you choose the model with the greatest validation accuracy, this value will tend to be greater than its accuracy on the test set. If you build enough models each trial, you might find the models you select are actually overfitting to the validation set (memorizing it), sometimes to the point that the models with the highest validation accuracy will tend to have worse test accuracy than models with validation accuracy in a lower interval. This depends on the particular dataset and machine learning models being used. Part of this problem is just that we aren’t accounting for the possibility of overfitting in our model of the accuracies, but fixing this on its own wouldn’t solve the extra bias introduced by having more models to choose from.

3. Due to the related satisficer’s curse, when doing multiple hypothesis tests, you should adjust your p-values upward or your p-value cutoffs (false positive rate, significance level threshold) downward in specific ways to better predict replicability. There are corrections for the cutoff that account for the number of tests being performed; a simple one is that if you want a false positive rate of α and you’re doing m tests, you could instead use a cutoff of α/m.

4. The satisficer’s curse also guarantees that empirical study publication based on p-value cutoffs will cause published studies to replicate less often than their p-values alone would suggest. I think this is basically the same problem as 3.
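To make example 1 concrete, here is a rough Monte Carlo sketch of that kind of setup (the number of coins, the number of flips, and the known bias are my own choices, not the commenter's): many fair coins are modelled with a uniform prior, while one coin's modest bias toward heads is known with certainty. Selecting by maximum posterior EV then picks the genuinely better coin less often than picking a coin uniformly at random, because some fair coin usually gets lucky.

```python
import numpy as np

rng = np.random.default_rng(0)

n_fair = 30        # fair coins, each modelled with a uniform (Beta(1,1)) prior
k = 3              # new flips recorded per coin
q_biased = 0.6     # the one better coin's heads frequency, known with certainty
n_trials = 20_000

picked_better_coin = 0
for _ in range(n_trials):
    heads = rng.binomial(k, 0.5, size=n_fair)      # the fair coins' actual flips
    # Posterior mean under a uniform prior after seeing `heads` of k flips.
    posterior_means = (heads + 1) / (k + 2)
    # The biased coin's posterior EV stays at q_biased since its bias is known.
    # Ties are broken in favor of the known-better coin, generously.
    if q_biased >= posterior_means.max():
        picked_better_coin += 1

n_coins = n_fair + 1
print(f"picked the truly better coin: {picked_better_coin / n_trials:.3f}")
print(f"uniformly random choice:      {1 / n_coins:.3f}")
```

The exact numbers depend on the parameters, but with this configuration the known-better coin is selected noticeably less often than 1/(number of coins), which is the "worse than chance" behavior described above.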

Now, if you treat your priors as posteriors that are conditional on a sample of random observations and arguments you’ve been exposed to or thought of yourself, you’d similarly find a bias towards interventions with “lucky” observations and arguments. For the intervention you do select compared to an intervention chosen at random, you’re more likely to have been convinced by poor arguments that support it and less likely to have seen good arguments against it, regardless of the intervention’s actual merits, and this bias increases the more interventions you consider. The solution supported by Proposition 2 doesn’t correct for the number of interventions under consideration.

comment by kbog · 2019-04-21T04:18:23.349Z · score: 4 (2 votes) · EA · GW
It doesn’t actually tell you that your posterior distributions will tend to better predict values you will later measure in the real world (e.g. by checking if they fall in your 95% credence intervals), because there need not be any connection between your models or priors and the real world.

This is an issue of the models and priors. If your models and priors are not right... then you should update over your priors and use better models. Of course they can still be wrong... but that's true of all beliefs, all reasoning, etc.

you will tend to find the posterior EV of your chosen coin to be greater than 1/2, but since the coins are actually fair, your estimate will be too high more than half of the time on average.

If you assume from the outside (unbeknownst to the agent) that they are all fair, then you're not showing a problem with the agent's reasoning, you're just using relevant information which they lack.

you could have a uniform prior on the true future long-run average frequency of heads for the unbiased coins

My prior would not be uniform, it would be 0.5! What else could "unbiased coins" mean? This solves the problem, because then a coin with a few head flips and zero tail flips will always have a posterior of p > 0.5.

If you build enough models each trial, you might find the models you select are actually overfitting to the validation set (memorizing it), sometimes to the point that the models with highest validation accuracy will tend to have worse test accuracy than models with validation accuracy in a lower interval.

In this case we have a prior expectation that simpler models are more likely to be effective.

Do we have a prior expectation that one kind of charity is better? Well if so, just factor that in, business as usual. I don't see the problem exactly.

3. Due to the related satisficer’s curse, when doing multiple hypothesis tests, you should adjust your p-values upward or your p-value cutoffs (false positive rate, significance level threshold) downward in specific ways to better predict replicability.
4. The satisficer’s curse also guarantees that empirical study publication based on p-value cutoffs will cause published studies to replicate less often than their p-values alone would suggest.

Bayesian EV estimation doesn't do hypothesis testing with p-value cutoffs. This is the same problem popping up in a different framework; yes, it will require a different solution in that context, but the two are separate.

Now, if you treat your priors as posteriors that are conditional on a sample of random observations and arguments you’ve been exposed to or thought of yourself, you’d similarly find a bias towards interventions with “lucky” observations and arguments. For the intervention you do select compared to an intervention chosen at random, you’re more likely to have been convinced by poor arguments that support it and less likely to have seen good arguments against it, regardless of the intervention’s actual merits, and this bias increases the more interventions you consider. The solution supported by Proposition 2 doesn’t correct for the number of interventions under consideration.

The proposed solution applies here too: just do (simplistic, informal) posterior EV correction for your (simplistic, informal) estimates.

Of course that's not going to be very reliable. But that's the whole point of using such simplistic, informal thinking. All kinds of rigor get sacrificed when charities are dismissed for sloppy reasons. If you think your informally excluded charities might actually turn out to be optimal, then you shouldn't be informally excluding them in the first place.