External evaluation of GiveWell's research

post by Aaron Gertler (aarongertler) · 2020-05-22T04:09:01.964Z · score: 13 (5 votes) · EA · GW · 10 comments

This is a link post for https://blog.givewell.org/2013/03/01/external-evaluation-of-our-research/


  The challenges of external evaluation
  Improvements in informal evaluation
    Where we stand

Aaron's note: I stumbled over this today and got lost in the links; I hadn't realized how much effort GiveWell spent (at least in the early years) hiring people to evaluate their research. Here's a full list of their external reviews, which goes into a lot more depth than this post.

Also, this post was originally written in 2013; I just recently discovered it and decided to repost.

We’ve long been interested in the idea of subjecting our research to formal external evaluation. We publish the full details of our analysis so that anyone may critique it, but we also recognize that it can take a lot of work to digest and critique our analysis, and we want to be subjecting ourselves to constant critical scrutiny (not just to the theoretical possibility of it).

A couple of years ago, we developed a formal process for external evaluations, and had several such evaluations conducted and published. However, we haven’t had any such evaluations conducted recently. This post discusses why.

In brief,

Between these two factors, we aren’t currently planning to conduct more external evaluations in the near future. However, we remain interested in external evaluation and hope eventually to make frequent use of it again. And if someone volunteered to do (or facilitate) formal external evaluation, we’d welcome this and would be happy to prominently post or link to criticism.


The challenges of external evaluation

The challenges of external evaluation are significant:

On the “evaluating research” front, one plausible candidate for “qualified evaluator” would be an accomplished development economist. However, in practice many accomplished development economists (a) are extremely constrained in terms of the time they have available; (b) have affiliations of their own (the more interested in practical implications for aid, the more likely a scholar is to be directly involved with a particular organization or intervention) which may bias evaluation.

I felt that we found a good balance with a 2011 evaluation by Prof. Tobias Pfutze, a development economist. Prof. Pfutze took ten hours to choose a charity to give to – using GiveWell’s research as well as whatever other resources he found useful – and we “paid” him by donating funds to the charity he chose. However, developing this assignment, finding someone who was both qualified and willing to do it, and providing support as the evaluation was conducted involved significant capacity.

Given the time investment these sorts of activities require on our part, we’re hesitant to go forward with one until we feel confident that we are working with the right person in the right way and that the research they’re evaluating will be representative of our work for some time to come.


Improvements in informal evaluation

Over the last year, we feel that we’ve seen substantially more deep engagement with our research than ever before, even as our investments in formal external evaluation have fallen off.


Where we stand


We continue to believe that it is important to ensure that our work is subjected to in-depth scrutiny. However, at this time, the scrutiny we’re naturally receiving – combined with the high costs and limited capacity for formal external evaluation – makes us inclined to postpone major effort on external evaluation for the time being.

That said,


Comments sorted by top scores.

comment by Tsunayoshi · 2020-05-22T12:10:05.778Z · score: 5 (4 votes) · EA(p) · GW(p)

Related to external evaluations: 80,000 Hours used to have a little box at the bottom of an article, indicating a score given to it by internal and external evaluators. Does anybody know why this is not being done anymore?

comment by Ozzie Gooen (oagr) · 2020-05-26T11:42:20.835Z · score: 3 (2 votes) · EA(p) · GW(p)

Oh man, happy to have come across this. I'm a bit surprised people remember that article. I was one of the main people who set up the system; that was a while back.

I don't know specifically why it was changed. I left 80k in 2014 or so and haven't discussed this with them since. I could imagine some reasons why they stopped it though. I recommend reaching out to them if you want a better sense.

This was done when the site was a custom Ruby/Rails setup. This feature required a fair bit of custom coding to set up. Writing quality was more variable then than it is now; there were several newish authors and it was much earlier in the research process. I also remember that originally the scores disagreed a lot between evaluators, but over time (the first few weeks of use) they converged a fair bit.

After I left they migrated to WordPress, where I assume setting up a similar system would have taken a fair amount of effort. The blog posts seem to have become less important than they used to be, in favor of the career guide, coaching, the podcast, and other things. Also, the quality has become a fair bit more consistent, from what I can tell as an onlooker.

The ongoing costs of such a system are considerable. First, it just takes a fair bit of time from the reviewers. Second, unfortunately, the internet can be a hostile place for transparency. There are trolls and angry people who will actively search through details and then point them out without the proper context. I think this review system was kind of radical, and can imagine it not being very comfortable to maintain, unless it really justified a fair bit of effort.

I'm of course sad it's no longer in place, but can't really blame them.

comment by Nathan Young (nathan) · 2020-05-23T13:55:06.584Z · score: 3 (2 votes) · EA(p) · GW(p)

I'm gonna be a bit of a maverick and split my comment into separate ideas so you can upvote or downvote them separately. I think this is a better way to do comments, but it looks a bit spammy.

comment by Nathan Young (nathan) · 2020-05-23T13:59:38.250Z · score: 2 (2 votes) · EA(p) · GW(p)

There should be more context on the important decision making tools

I could be wrong, but I think most decisions are made using Google Sheets. I've read a few of these and I think there could be more context around which numbers are the most important.

comment by Nathan Young (nathan) · 2020-05-23T14:05:08.980Z · score: 1 (1 votes) · EA(p) · GW(p)

It should be possible to give feedback on specific points.

In the future I am confident that all articles will be able to have comments on any part of the text, like comments in a Google Doc. This means people can edit or comment on specific points. This is particularly important with Fermi models and could be implemented - people can comment on each part of an evaluation to criticise some specific bit. One wrong leap of logic in an argument makes the whole argument void, so GiveWell's models need this level of scrutiny.

comment by Nathan Young (nathan) · 2020-05-23T14:03:45.722Z · score: 1 (1 votes) · EA(p) · GW(p)

All the most important models should have crowdsourced answers also.

I *think* GiveWell uses models to make decisions. It would be possible to crowdsource numbers for each step. I predict you would get better answers if you did this. The wisdom of crowds is a thing. It breaks down when the crowd doesn't understand the model, but if you are getting them to guess individual parts of a model, it works again.

Linked to the Stack Overflow point I made, I think there could easily be a site for crowdsourcing the answers to GiveWell's questions. I think there is a 10% chance that with 20k you could build a better site that could come up with better answers if EAs enjoyed making guesses for fun - Wikipedia is the best encyclopaedia in the world. This is because it leverages the free time and energy of *loads* of nerds. GiveWell could do the same.
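
To make the "guess individual parts of a model" idea concrete, here is a minimal sketch. It is not GiveWell's method or any existing site's code: the step names, the guesses, and the helper functions `aggregate_steps` and `fermi_product` are all invented for illustration. It assumes a toy Fermi model that is a simple product of steps, and it takes the median guess for each step so that one wild guess doesn't dominate.

```python
from statistics import median

# Hypothetical crowd guesses for each step of a toy Fermi model.
# Step names and numbers are invented for illustration only.
crowd_guesses = {
    "people_reached": [900, 1000, 1200, 1100, 5000],   # one outlier guess
    "fraction_helped": [0.10, 0.20, 0.15, 0.25, 0.20],
    "benefit_per_person": [8.0, 10.0, 12.0, 9.0, 11.0],
}


def aggregate_steps(guesses):
    """Return the median crowd guess for each model step."""
    return {step: median(values) for step, values in guesses.items()}


def fermi_product(step_estimates):
    """Multiply the per-step estimates into one overall estimate."""
    total = 1.0
    for value in step_estimates.values():
        total *= value
    return total


step_estimates = aggregate_steps(crowd_guesses)
overall = fermi_product(step_estimates)
print(step_estimates)
print(overall)
```

Using the median rather than the mean is a design choice, not a requirement: here it keeps the 5000 guess for `people_reached` from pulling the aggregate away from what most guessers said.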

comment by Nathan Young (nathan) · 2020-05-23T13:58:53.570Z · score: 1 (1 votes) · EA(p) · GW(p)

I think it should be easier to give feedback on GiveWell. I would recommend not needing to log in and allowing people to give suggestions on the text of pages.

comment by Nathan Young (nathan) · 2020-05-23T13:57:06.319Z · score: 1 (1 votes) · EA(p) · GW(p)

I think Stack Overflow is the gold standard for criticism. It's a question answering website. It allows answers to be ranked and questions and answers to be edited. Not only do the best answers get upvoted, but answers and questions get clearer and higher quality over time. I suggest this should be the aim for GiveWell's analyses.

See examples of all such features on this question: https://rpg.stackexchange.com/questions/169345/can-a-druid-use-a-sending-stone-while-in-wild-shape


  • The question was answered by one individual but then edited by a much more experienced user. I think GiveWell could easily allow the community to suggest edits to its articles, which other readers could then upvote.
  • If GiveWell doesn't want to test this, maybe try it on this forum first - allow people to suggest edits to posts.

comment by Nathan Young (nathan) · 2020-05-23T14:21:31.250Z · score: 0 (4 votes) · EA(p) · GW(p)

I had a discussion with someone who was very negative about GiveWell's work and provided a series of concerning anecdotes: https://twitter.com/SimonDeDeo/status/1239569480063254530

I wrote the discussion up here: https://nathanpmyoung.com/givewell

If people have rebuttals, please add them to the doc.

(I'm not particularly worried about the reputational risk of this being public because A, the discussion is already on Twitter and B, no one scrolls to the bottom of my website and follows the links and C, the allegations *are* pretty concerning)

comment by Nathan Young (nathan) · 2020-05-23T16:40:36.913Z · score: 2 (2 votes) · EA(p) · GW(p)

Whoever strongly disliked this, feel free to say why.