Relative Impact of the First 10 EA Forum Prize Winners

post by NunoSempere · 2021-03-16T17:11:29.172Z · EA · GW · 35 comments

Contents

  Summary
  Introduction
  Methodology
      Title of the post
  Estimates
    2017 Donor Lottery Report
    Takeaways from EAF's Hiring Round
    Why we have over-rated Cool Earth
    Lessons Learned from a Prospective Alternative Meat Startup Team
    2018 AI Alignment Literature Review and Charity Comparison
    Cause profile: mental health
    EA Giving Tuesday Donation Matching Initiative 2018 Retrospective
    EA Survey 2018 Series: Cause Selection
    EAGx Boston 2018 Postmortem
    Will companies meet their animal welfare commitments?
  Table
  Comments and thoughts
    Calibration
    Comparison is still possible
    Future ideas

Summary

Introduction

The EA Forum—and local groups—have been seeing a decent number of projects, but few are evaluated for impact. This makes it difficult to choose between projects beforehand, beyond using personal intuition (however good it might be), a connection to a broader research agenda, or other rough heuristics. Ideally, we would have something more objective and more scalable.

As part of QURI’s efforts to evaluate and estimate the impact of things in general, and projects QURI itself might carry out in particular, I tried to evaluate the impact of 10 projects I expected to be fairly valuable.

Methodology

I chose the first 10 posts which won the EA Forum Prize [? · GW], back in 2017 and 2018, to evaluate. For each of the 10 posts, each estimate has a structure like the one below. Note that not all estimates will have each element:

Title of the post

If a writeup refers to a project distinct from the writeup, I generally try to estimate the impact of both the project and the writeup.

Where possible, I estimated their impact on an ad-hoc scale, Quality Adjusted Research Papers (QARPs for short), whose levels correspond to the following:

| Value | Description | Example |
|---|---|---|
| ~0.1 mQARPs | A thoughtful comment | A thoughtful comment about the details of setting up a charity [EA(p) · GW(p)] |
| ~1 mQARPs | A good blog post, a particularly good comment | What considerations influence whether I have more influence over short or long timelines? [? · GW] |
| ~10 mQARPs | An excellent blog post | Humans Who Are Not Concentrating Are Not General Intelligences [? · GW] |
| ~100 mQARPs | A fairly valuable paper | Categorizing Variants of Goodhart's Law |
| ~1 QARP | A particularly valuable paper | The Vulnerable World Hypothesis |
| ~10-100 QARPs | A research agenda | The Global Priorities Institute's Research Agenda |
| ~100-1000+ QARPs | A foundational popular book on a valuable topic | Superintelligence; Thinking, Fast and Slow |
| ~1000+ QARPs | A foundational research work | Shannon's "A Mathematical Theory of Communication" |

Ideally, this would both have relative meaning (i.e., I claim that an average thoughtful comment is worth less than an average good post), and absolute meaning (i.e., after thinking about it, a factor of 10x between an average thoughtful comment and an average good post seems roughly right). In practice, the second part is a work in progress. In an ideal world, this estimate would be cause-independent, but cause comparability is not a solved problem, and in practice the scale is more aimed towards long-term focused projects.

To elaborate on cause independence: upon reflection we might find that a fairly valuable paper on AI Alignment is 20 times as valuable as a fairly valuable paper on Food Security, and give both of their impacts in a common unit. But we are uncertain about their actual relative impacts, which depend not only on empirical uncertainty but also on moral preferences and values (e.g., the weight given to animals, or to people who don't currently exist). To get around this, I just estimated how valuable a project is within its field, leaving the work of categorizing and comparing fields as a separate endeavor: I don't adjust impact across causes, as long as the cause is an established Effective Altruist cause.

Some projects don't easily lend themselves to being rated in QARPs; in those cases I've also used "dollars moved". Impact is adjusted for Shapley values [EA · GW], which avoids double- or triple-counting impact. In every example here, this is equivalent to calculating counterfactual value and dividing by the number of necessary stakeholders, which requires a judgment call as to what counts as a "necessary stakeholder". Intervals are meant to be 80% confidence intervals, but in general all estimates are highly speculative and shouldn't be taken too seriously.
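As a rough illustration, the simplification described in this paragraph (counterfactual value divided by the number of necessary stakeholders) can be sketched as follows; the function name and the numbers are hypothetical:

```python
def shapley_adjusted_impact(value_with, value_without, n_necessary_stakeholders):
    """Approximate a project's Shapley-adjusted impact as its counterfactual
    value, split evenly among the stakeholders without whom it would fail."""
    counterfactual = value_with - value_without
    return counterfactual / n_necessary_stakeholders

# Hypothetical: a project worth 300 mQARPs that needed a funder, an
# organizer, and a writer (3 necessary stakeholders).
impact_each = shapley_adjusted_impact(300, 0, 3)  # 100.0 mQARPs each
```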

Estimates

2017 Donor Lottery Report [EA · GW]

Total project impact:

Counterfactual impact of Adam Gleave winning the donor lottery (as opposed to other winners):

Impact of the writeup alone:

Takeaways from EAF's Hiring Round [EA · GW]

Impact of the hiring round itself:

When reviewing this section, some commenters pointed out that, for them, calculating the opportunity cost didn’t make as much sense. I disagree with that. Further, I’m also not attempting to calculate the expected value ex ante; in this case this feels inelegant because the expected value will depend a whole bunch on the information, accuracy and calibration of the person doing the expected value calculation, and I don’t want to estimate how accurate or calibrated the piece’s author was at the time (though he is pretty good now).

Impact of the writeup (as opposed to impact of the hiring process):

Why we have over-rated Cool Earth [EA · GW]

Impact of the post and the research:

Lessons Learned from a Prospective Alternative Meat Startup Team [EA · GW]

Expected impact of the project:

Impact of the project:

Impact of the writeup:

2018 AI Alignment Literature Review and Charity Comparison [EA(p) · GW(p)]

Cause profile: mental health [EA · GW]

EA Giving Tuesday Donation Matching Initiative 2018 Retrospective [EA · GW]

EA Survey 2018 Series: Cause Selection [EA · GW]

Impact of the post alone:

EAGx Boston 2018 Postmortem [EA · GW]

Impact of the EAGx:

Impact of the writeup:

Will companies meet their animal welfare commitments? [EA · GW]

Table

| Project | Ballpark | Estimate |
|---|---|---|
| 2017 Donor Lottery Grant | Between 6 fairly valuable papers and 3 really good ones | 500 mQARPs to 4 QARPs |
| Adam Gleave winning the 2017 donor lottery (as opposed to other participants) | Roughly as valuable as a fairly valuable paper | -50 mQARPs to 500 mQARPs |
| 2017 Donor Lottery Report (Writeup) | A little less valuable than a fairly valuable paper | 50 mQARPs to 250 mQARPs |
| EAF's Hiring Round | Loss of between a fairly valuable paper and an excellent EA Forum blog post | -70 to 5 mQARPs |
| Takeaways from EAF's Hiring Round (Writeup) | Between two good EA Forum posts and a fairly valuable paper | 0 to 30 mQARPs |
| Why we have over-rated Cool Earth | -1/2 to +1.5 excellent EA Forum posts | -5 to 20 mQARPs |
| Alternative Meat Startup Team (Project) | 0 to 1 excellent EA Forum posts | 1 to 50 mQARPs |
| Lessons Learned from a Prospective Alternative Meat Startup Team (Writeup) | 0 to 5 good EA Forum posts | 0 to 20 mQARPs |
| 2018 AI Alignment Literature Review and Charity Comparison | Between two excellent EA Forum posts and 6 fairly valuable papers | 40 to 800 mQARPs |
| Cause profile: mental health | Very uncertain | 0 to 100 mQARPs |
| EA Giving Tuesday Donation Matching Initiative 2018 | | $130K to $230K in Shapley-adjusted funding towards EA charities |
| EA Survey 2018 Series: Cause Selection | 0 to an excellent EA Forum post | 0 to 20 mQARPs |
| EAGx Boston 2018 (Event) | | $100K to $350K in Shapley-adjusted funding towards EA charities |
| EAGx Boston 2018 Postmortem (Writeup) | | $0 to $500 in Shapley-adjusted donations towards EA charities |
| Will companies meet their animal welfare commitments? | 0 to a fairly valuable paper | 0 to 100 mQARPs |

Comments and thoughts

Calibration

An initial challenge in this domain is how to attain calibration. The way I would normally calibrate intuitions in a domain is by making a number of predictions at various levels of gut feeling, and then seeing empirically how often predictions made at each level come out right. For example, I've previously found that my gut feeling of "I would be very surprised if this was false" generally corresponds to 95% (so 1 in 20 times, I am in fact wrong). But in this case, when considering or creating a new domain, I can't check my predictions directly against reality, and instead have to check them against other people's intuitions.
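The calibration check described in this paragraph can be sketched in a few lines; the track record and the 0.95 encoding of "very surprised if false" are hypothetical:

```python
from collections import defaultdict

def empirical_calibration(predictions):
    """Group (stated_confidence, came_true) pairs by confidence level and
    return the empirical frequency of correct predictions at each level."""
    buckets = defaultdict(list)
    for confidence, came_true in predictions:
        buckets[confidence].append(came_true)
    return {c: sum(hits) / len(hits) for c, hits in buckets.items()}

# Hypothetical track record: 19 hits and 1 miss at the
# "I would be very surprised if this was false" (~95%) level.
history = [(0.95, True)] * 19 + [(0.95, False)]
rates = empirical_calibration(history)  # {0.95: 0.95}
```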

Comparison is still possible

Despite my wide levels of uncertainty, comparison is still possible. Even though I’m uncertain about the impact of both “Will companies meet their animal welfare commitments?” and “Lessons Learned from a Prospective Alternative Meat Startup Team”, I’d prefer to have the first over the second.

Similarly, while EAGx Boston 2018 and the EA Giving Tuesday Donation Matching Initiative might have taken similar amounts of time to organize, by comparably capable people, I prefer the second. This is in large part because EAGx events are scalable, whereas Giving Tuesdays are not.

I was also surprised by the high cost of producing papers when estimating the value of Larks’ review (though perhaps I shouldn’t have been). It could be the case that this was a problem with my estimates, or that papers truly are terribly inefficient. 

Future ideas

Ozzie Gooen has in the past suggested that one could build a consensus around these kinds of estimates, and scale them further. In addition, one could also use these kinds of estimates to choose one’s own projects, or to recommend projects to others, and see how that fares. Note how in principle, these kinds of estimates don’t have to be perfect or perfectly calibrated, they just have to be better than the implicit estimates which would otherwise have been made. 

In any case, there are also details to figure out or justify. For example, I've been using Shapley values [EA · GW], which I think are a more complicated but often more appropriate alternative to counterfactual values. Normally, this just means that I divide the total estimated impact by the estimated number of stakeholders, but sometimes, as in the case of a hiring round, I have the intuition that one might want to penalize the hiring organization for the opportunity cost imposed on applicants, even though that's not what Shapley values recommend. Further, it's also sometimes unclear how many necessary stakeholders there are, or how important each stakeholder is, which makes the Shapley value ambiguous, or subject to a judgment call.

I’ve also been using a cause-impartial value function. That is, I judge a post in the animal welfare space using the same units as for a post in the long-termist space. But maybe it’s a better idea to have a different scale for each cause area, and then have a conversion factor which depends on the reader’s specific values. If I continue working on this idea, I will probably go in that direction. 

Lastly, besides total impact, we also care about efficiency. For small and medium projects, I think that the most important kind of efficiency might be time efficiency. For example, when choosing between a project worth 100 mQARPs and one which is worth 10 mQARPs, one would also have to look at how long each takes, because maybe one can do 50 projects each worth 10 mQARPs in the time it takes to do a very elaborate 100 mQARPs project. 
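In other words, for small and medium projects one can compare a rough rate rather than total impact. A toy illustration of the comparison above, with hypothetical time figures:

```python
def mqarps_per_hour(mqarps, hours):
    """Crude time-efficiency: estimated impact divided by time spent."""
    return mqarps / hours

# Hypothetical: one elaborate 100 mQARPs project taking 500 hours,
# vs. fifty 10 mQARPs projects at 10 hours each.
elaborate = mqarps_per_hour(100, 500)           # 0.2 mQARPs/hour
many_small = mqarps_per_hour(50 * 10, 50 * 10)  # 1.0 mQARPs/hour
```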

Thanks to David Manheim, Ozzie Gooen and Peter Hurford for thoughts, comments and suggestions. 

35 comments

Comments sorted by top scores.

comment by NunoSempere · 2021-05-24T20:29:27.450Z · EA(p) · GW(p)

So here are the mistakes pointed out in the comments:

  • EAF's hiring round had a high value of information, which I didn't incorporate, per Misha's comment [EA · GW]
  • "Why we have over-rated Cool Earth" was more impactful than I thought, per Khorton's comment [EA(p) · GW(p)]
  • I likely underestimated the possible negative impact of the 2017 donor lottery report, which was quite positive on ALLFED, per MichaelA's comment [EA · GW].

I think this (a ~30% mistake rate) is quite brutal, and still only a lower bound (because there might be other mistakes which commenters didn't point out). I'm pointing this out here because I want to reference this error rate in a forthcoming post.

comment by Misha_Yagudin · 2021-03-17T15:11:36.596Z · EA(p) · GW(p)

There are a lot of things I like about this post, from small (e.g. the summary at the top, and the table at the end) to large (e.g. it's a good thing to do given a desire to understand how to quantify/estimate impact better).

Here are some things I am perplexed about or disagree with:
 

  • EAF hiring round estimate misses the enormous realized value of information. As far as I can see, EAF decided to move to London [EA · GW] (partly) because of that.
    • > We moved to London (Primrose Hill) to better attract and retain staff and collaborate with other researchers in London and Oxford.
    • > Budget 2020: $994,000 (7.4 expected full-time equivalent employees). Our per-staff expenses have increased compared with 2019 because we do not have access to free office space anymore, and the cost of living in London is significantly higher than in Berlin.

 

  • The donor lottery evaluation seems to miss that $100K would have been donated otherwise.
  • Further, I would suggest another decomposition.
    • Impact = impact of running donor lottery as a tool (as opposed to donating without ~aggregation) + the counterfactuals impact of particular grants (as opposed to ~expected grants) + misc. side-effects (like a grantmaker joining LTFF).
    • I can understand why you added the first two terms. But it seems to me that
      • we can get a principled estimate about the first one based on arguments for donor lotteries (e.g. epistemic advantage coming from spending more time per dollar donated; and freed time of donors);
        • One can get more empirical and have a quick survey here.
      • estimating the second term is trickier because you need to make a guess about the impact of an average epistemically advantaged donation (as opposed to an average donation of $100K, which I think is missing from your estimate)
        • Both of these are doable because we saw how other donor lottery winners gave their money and how wealthy/invested donors give their money.
        • A good proxy for an impact of average donation might come from (a) EA survey donation data, (b) a  quick survey of lottery participants. The latter seems superior because participating in an early donor lottery suggests a higher engagement with EA ideas &c.
    • After thinking a bit longer: the choice of decomposition depends on what you want to understand better. It seems like your choice is better if you want to empirically understand whether the donor lottery is valuable.

 

  • Another weird thing is to see the 2017 Donor Lottery Grant having x5..10 higher impact than 2018 AI Alignment Literature Review and Charity Comparison.
    • I think it might come down to you not subtracting the counterfactual impact of donating 100K w/o lottery from donors' lottery impact estimate.
    • The basic source of impact of both the donor lottery and the charity review is an epistemic advantage (someone dedicating more time to think about/evaluate donations; people being better informed about the charities they are likely to donate to). Given how well received the literature review is, it seems (quite likely) to be helpful to individual donors, and given that it (according to your guess) influenced $100K..1M, it should be about as impactful as, or more impactful than, an abstract donor lottery.
      • And it's hard to see this particular donor lottery as overwhelmingly more impactful than an average one.
Replies from: NunoSempere, NunoSempere
comment by NunoSempere · 2021-03-18T12:07:14.618Z · EA(p) · GW(p)

Another weird thing is to see the 2017 Donor Lottery Grant having x5..10 higher impact than 2018 AI Alignment Literature Review and Charity Comparison.

I see now; that is weird. Note that if I calculate the total impact of the $100K to $1M I think Larks moved, the impact of that would be 100mQ to 2Q (change the Shapley value fraction in the Guesstimate to 1), which is closer to the 500mQ to 4Q I estimated for the 2017 Donor Lottery. The difference can be attributed to a) investing in organizations which are starting up, b) the high cost of producing AI safety papers, coupled with cause neutrality, and c) further error.

comment by NunoSempere · 2021-03-17T15:37:39.285Z · EA(p) · GW(p)
  • Good point re: value of information
  • Re: "The donor lottery evaluation seems to miss that $100K would have been donated otherwise": I don't think it does. In the "total project impact" section, I clarify that "Note that in order to not double count impact, the impact has to be divided between the funding providers and the grantee (and possibly with the new hires as well)."
Replies from: Misha_Yagudin
comment by Misha_Yagudin · 2021-03-17T16:56:15.477Z · EA(p) · GW(p)

Thank you, Nuno! 

  • Am I understanding correctly that the Shapley value multiplier (0.3 to 0.5) is responsible for preventing double counting?
    • If so why don't you apply it to Positive status effects?  The effect was also partially enabled by the funding providers (maybe less so).
    • Huh! I am surprised that your Shapley value calculation is not explicit but is reasonable.
      • Let's limit ourselves to two players (= funding providers who are only capable of shallow evaluations, and grantmakers who are capable of in-depth evaluation but don't have their own funds). Your estimate of "0.3 to 0.5" implies that shallowly evaluated giving is as impactful as "0 to 0.4" of in-depth evaluated giving.
      • This x2.5..∞ multiplier is reasonable but doesn't feel quite right to put 10% on above ∞ :)
  • This makes me further confused about the gap between the donor lottery and the alignment review.
Replies from: NunoSempere
comment by NunoSempere · 2021-03-18T12:20:40.755Z · EA(p) · GW(p)

You are understanding correctly that the Shapley value multiplier is responsible for preventing double-counting, but you're making a mistake when you say that it "implies that shallowly evaluated giving is as impactful as "0 to 0.4" of in-depth evaluated giving"; the latter doesn't follow.

In the two-player game, you have Value({}), Value({1}), Value({2}), and Value({1,2}). The Shapley value of player 1 (the funders) is ([Value({1}) - Value({})] + [Value({1,2}) - Value({2})])/2, and the value of player 2 (the donor lottery winner) is ([Value({2}) - Value({})] + [Value({1,2}) - Value({1})])/2.

In this case, I'm taking [Value({2}) - Value({})] to be ~0 for simplicity, so the value of player 2 is [Value({1,2}) - Value({1})]/2. Note that this is just the counterfactual value divided by a fraction.

If there were more players, it would be a little bit more complicated, but you'd end up with something similar to [Value({1,2,3}) - Value({1,3})]/3. Note again that this is just the counterfactual value divided by a fraction.

But now, I don't know how many players there are, so I just consider [Value({The World}) - Value({The World without player 2})]/(some estimate of how many players there are).

And the Shapley value multiplier would be 1/(some estimate of how many players there are).

At no point am I assuming that "shallowly evaluated giving is as impactful as 0 to 0.4 of in-depth evaluated giving"; the thing that I'm doing is just allocating value so that the sum of the value of each player is equal to the total value.
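For concreteness, here is a sketch of the full permutation definition being referenced, applied to a two-player game with hypothetical numbers and with Value({1}) = Value({2}) = 0, as in the simplification above:

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution,
    averaged over all orders in which the coalition could have formed."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            totals[p] += value(coalition | {p}) - value(coalition)
            coalition.add(p)
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}

# Hypothetical two-player game: the funders and the lottery winner are
# both necessary, and neither produces value alone.
def v(coalition):
    return 100.0 if coalition == {"funders", "winner"} else 0.0

sv = shapley_values(["funders", "winner"], v)  # {'funders': 50.0, 'winner': 50.0}
```

With more than two symmetric players the same function splits the total value n ways, matching the 1/(number of players) multiplier above.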

Replies from: Misha_Yagudin
comment by Misha_Yagudin · 2021-03-18T14:39:06.841Z · EA(p) · GW(p)

Thank you for engaging!

  • First, "note that this [misha: Shapley value of evaluator] is just the counterfactual value divided by a fraction [misha: by two]." Right, this is exactly the same in my comment. I further divide by total impact to calculate the Shapley multiplier.
    • Do you think we disagree?
    • Why doesn't my conclusion follow?
  • Second, you conclude "And the Shapley value multiplier would be 1/(some estimates of how many players there are)", while your estimate is "0.3 to 0.5". There have been like 30 participants over the two lotteries that year, so you should have ended up with something an order of magnitude smaller, like "3% to 10%".
    • Am I missing something?
  • Third, for the model with more than two players, it's unclear to me who the players are. If these are funders + evaluators, you indeed will end up with the multiplier above, because
    • Shapley multipliers should add up to 1, and
    • the Shapley value of the funders is easy to calculate (any coalition without them lacks any impact).
    • Please note that the multiplier is the one from the comment above.
  • (Note that this model ignores that the beneficiary might win the lottery and no donations will be made.)

In the end,

  • I think that it is necessary to estimate X in "shallowly evaluated giving is as impactful as X times in-depth evaluated giving", because if X is close to 1, the impact of the evaluator is close to nil.
    • I might not understand how you model impact here, please, be more specific about the modeling setup and assumptions.
  • I don't think that you should split evaluators. Well, basically because you want to disentangle the impact of evaluation and funding provision and not to calculate Adam's personal impact.
    • Like, take it to the extreme: it would be pretty absurd to say that the overwhelmingly successful (e.g. seeding a new ACE Top Charity in yet unknown but highly tractable area of animal welfare and e.g. discovering AI alignment prodigy) donor lottery had an impact less than an average comment because there have been too many people (100K) contributing a dollar to participate in it.
Replies from: NunoSempere
comment by NunoSempere · 2021-03-22T11:48:31.342Z · EA(p) · GW(p)
  1. Yes, we agree
  2. No, we don't agree. I think that Adam did better than other potential donor lottery winners, so his counterfactual value is higher, and thus his Shapley value is also higher. If all the other donors had been clones of Adam, I agree that you'd just divide by n. Thus, the claim that "In every example here, this will be equivalent to calculating counterfactual value, and dividing by the number of necessary stakeholders" is in fact wrong, and I was implicitly doing both of the following in one step: a. calculating Shapley values with "evaluators" as one agent, and b. thinking of Adam's impact as a high proportion of the SV of the evaluator round.
  3. The rest of our disagreements hinge on 2., and I agree that judging the evaluator step alone would make more sense.
comment by Khorton · 2021-03-17T13:31:54.276Z · EA(p) · GW(p)

On Sanjay's Cool Earth post, I have seen it frequently referenced. Founders Pledge came out with some climate change recommendations shortly after and I think people have been largely donating to those now instead.

Replies from: NunoSempere
comment by NunoSempere · 2021-03-17T15:33:06.490Z · EA(p) · GW(p)

I'll flag the narrow and lowish estimates about the Cool Earth post as something I was most likely wrong about, then. Thanks.

comment by MichaelA · 2021-03-17T03:34:28.351Z · EA(p) · GW(p)

It’s hard to see how the writeup could have had a negative effect.

It seems plausible that people who gave to ALLFED, volunteered for ALLFED, worked for ALLFED, etc. due in part to Gleave's report would otherwise have done better things with their resources. 

The report may also have led to EAs/global catastrophic risk researchers/longtermists talking about ALLFED more often and more positively, which could perhaps have had a negative effect on perceptions of those communities, e.g. because:

  • Papers associated with them often present explicit quantitative models and estimates about very uncertain things (which some people are just averse to in general)
  • ALLFED and those models sometimes make claims that can seem intuitively fairly unlikely
  • Those models do seem to have some noticeable issues
    • (Though I'd personally say that this is to be expected with any models, and a great thing about models is that they often make it easier to identify and correct specific issues; I personally still basically agree with the qualitative conclusions drawn from the models.)
  • A big part of ALLFED's focus is making a catastrophe less bad if it does happen, which could seem callous to some people

I think it's unlikely that the donor lottery report would have those downsides to a substantial extent. 

And I'm personally quite positive about ALLFED, David Denkenberger, and their work, and ALLFED is one of the three places I've donated the most to [EA(p) · GW(p)] (along with GCRI and the Long-Term Future Fund). 

I'm just disagreeing with the claim "It’s hard to see how the writeup could have had a negative effect." I basically think most longtermism-related things could plausibly have negative effects, since they operate on variables that we think might be important for the long-term future and we're really bad at predicting precisely how the effects play out. (But this doesn't mean we just try to "do nothing", of course [? · GW]! Something with some downside risk can still be very positive in expectation.)

I'm not sure how often my 80% confidence interval would include negative effects, nor whether it'd include them in the ALLFED case. So maybe this is just a nit-pick about your specific phrasing, and we'd agree on the substance of your model/estimate.

Replies from: NunoSempere
comment by NunoSempere · 2021-03-17T09:33:48.700Z · EA(p) · GW(p)

Yeah, I see what you're saying. Do you think that it is hard for the writeup to have a negative total effect?

Replies from: MichaelA
comment by MichaelA · 2021-03-17T23:53:00.155Z · EA(p) · GW(p)

When I made my comment, I think I kind-of had in mind "negative total effect", rather than "at least one negative effect, whether or not it's offset". But I don't think I'd explicitly thought about the distinction (which is a bit silly), and my comment doesn't make it totally clear what I meant, so it's a good question.

I think my 80% confidence interval probably wouldn't include an overall negative impact of the writeup. But I think my 95% confidence interval would. 

Reasons why my 80% confidence interval probably wouldn't include an overall negative impact of the writeup, despite what I said in my previous comment:

  • I think we should have some degree of confidence that, if there's more public discussion by people with fairly good epistemics and good epistemic and discussion norms, that'll tend to update people towards more accurate beliefs.
    • (Not every time, but more often than it does the opposite.)
    • As such, I think we should start off skeptical of claims like "An EA Forum post that influenced people's beliefs and behaviours substantially influenced those things in a bad way, even though in theory someone else could've pointed that out convincingly and thus prevented that influence."
  • And then there's also the fact that Gleave later got a role on the LTFF, suggesting he's probably good at reasoning about these things.
  • And there's also my object-level positive impressions of ALLFED.
Replies from: NunoSempere
comment by NunoSempere · 2021-03-18T12:22:02.860Z · EA(p) · GW(p)

I have nothing to disagree about here :)

comment by MichaelA · 2021-03-17T03:17:03.520Z · EA(p) · GW(p)

Overall thoughts

Thanks, I found this post interesting. 

I don't know what I think about the reasonableness of these specific evaluations, about how useful this sort of evaluation approach is, or about whether I'd like to see more of this sort of thing in future and exactly what form it should take. (To be clear, I literally just mean "I don't know", rather than meaning "I think this all sucks, but I'm being polite.") But I think it's plausible that this or something like it would be very valuable and should be scaled up substantially, so I think exploring the idea at least a bit is definitely worthwhile in expectation.

I'd be interested to hear roughly how long this whole process took you (or how long it took minus writing the actual post, or something)? This seems relevant to how worthwhile and scalable this sort of thing is. 

(Of course, the process may become much faster as the people doing it become more experienced, better tools or templates for it are built, etc. But it may also become slower if one aims for more rigour / less pulling things out of thin air. In any case, I think how long this early attempt took should give at least a rough idea.)

I also had a bunch of reactions that aren't especially important, since they're focused on specific points about each evaluation rather than on the basic methods and how this sort of analysis can be useful. I'll split them into separate comments.

Replies from: Misha_Yagudin, NunoSempere
comment by Misha_Yagudin · 2021-03-18T12:19:31.242Z · EA(p) · GW(p)

Recently Nuño asked me to do similar (but shallower) forecasting for ~150 project ideas. It took me about 5 hours. I think I could have done the evaluation faster, but I left ~paragraph-long comments on like two projects and sentence-long comments on most others; I didn't do any advanced modeling or guesstimating.

comment by NunoSempere · 2021-03-17T09:41:38.135Z · EA(p) · GW(p)

I'd be interested to hear roughly how long this whole process took you (or how long it took minus writing the actual post, or something)? This seems relevant to how worthwhile and scalable this sort of thing is. 

Maybe an afternoon for the initial version, and then two weeks of occasional tweaks. Say 10h to 30h in total? I imagine that if one wanted to scale this, one could get it to 30 mins to an hour for each estimate. 

Replies from: MichaelA
comment by MichaelA · 2021-03-17T23:46:20.428Z · EA(p) · GW(p)

I think that that seems promisingly fast to me, given that this was an early attempt and could probably be sped up (holding quality/rigour constant) by experience, tools, templates, etc. So that updates me a bit further towards enthusiasm about this general idea. 

Replies from: oagr
comment by Ozzie Gooen (oagr) · 2021-03-18T06:39:20.752Z · EA(p) · GW(p)

I'd also note that the larger goals are to scale in non-human ways. If we have a bunch of examples, we could:

1) Open this up to a prediction-market style setup, with a mix of volunteers and possibly inexpensive hires.
2) As we get samples, some people could use data analysis to make simple algorithms to estimate the value of many more documents.
3) We could later use ML and similar to scale this further.

So even if each item were rather time-costly right now, this might be an important step for later. If we can't even do this, with a lot of work, that would be a significant blocker.

https://www.lesswrong.com/posts/kMmNdHpQPcnJgnAQF/prediction-augmented-evaluation-systems

comment by MichaelA · 2021-03-17T04:14:15.045Z · EA(p) · GW(p)

Some specific things I was confused about

  1. The estimated mQARPs per employee per month seems to differ substantially between sections. Is this based on something like dividing the posts/papers the org produced by the org's total budget or number of FTE employees? (Your comment on ALLFED vs AI safety papers seems to indicate this? Note that I didn't look closely at the Guesstimate model.)
  2. "I have reason to believe that the amount of money influenced was large, and Larks's writeup was further the only one available. My confidence interval is then $100,000 to $1M in moved funding". That seems surprisingly high, but I have no specific knowledge on this. Could you share your reason to believe that? (But no problem if the reasoning is based on private info or is just hard to communicate concisely and explicitly.)
  3. "You can see my first guesstimate model here, but I think that this was too low because it only took into account the value of papers, so here is a second model, which takes into account that the value which does not come from papers is 2 to 20x the magnitude of the value which does come from papers. I’m still uncertain, so my final estimate is even more uncertain than the Guesstimate model."
    • Could you give a sense of where you see the rest of that value coming from? (I'm not saying I disagree with you.)
    • Was that accounted for in the other evaluations too? My impression from your written description (without looking at the models) was that e.g. for ALLFED you estimated their impact as entirely coming from posts and papers?
  4. Is there a reason you estimated the impact of the Giving Tuesday and EAGx things in terms of dollars moved, without converting that into mQARPs?
    • Part of why this confuses me is that:
      • I'd guess that dollars moved is actually not the primary value of EAGx (though you can still convert the value into dollars-moved-equivalents if you want)
      • Meanwhile, I'd guess that dollars moved is the primary value of the animal welfare commitments post and maybe some other posts (though you can still convert the value into mQARPs if you want)
  5. "Similarly, while EAGx Boston 2018 and the EA Giving Tuesday Donation Matching Initiative might have taken similar amounts of time to organize, by comparably capable people, I prefer the second. This is in large part because EAGx events are scalable, whereas Giving Tuesdays are not." Did you mean to say Giving Tuesday is scalable, whereas EAGx events are not?
Replies from: NunoSempere
comment by NunoSempere · 2021-03-17T10:02:50.121Z · EA(p) · GW(p)
  1. In the case of ALLFED, this is based on picturing one employee going about their month, and asking myself how surprising it would be if they couldn't produce 10 mQARPs of value per month, or how surprising it would be if they could produce 50 mQARPs per month. In the case of the AI safety organizations, this is based on estimating the value of each of the papers that Larks thinks are valuable enough to mention, and then estimating what fraction of the total value of an organization those represent.
  2. Private info
  3. a) Building up researchers into more capable researchers, knowledge acquired that isn't published, information value of trying out dead ends, acquiring prestige, etc. b) I actually didn't estimate ALLFED's impact, I estimated the impact of the marginal hires, per 1.
  4. Personal taste; it's possible that was the inferior choice. I found it easier to picture the dollars moved than the improvement in productivity. In hindsight, maybe improving retention would be another main benefit which I didn't consider.
  5. I've gotten that comment before. The intuition here is that it would be really, really hard to find a project which moves as much money as Giving Tuesday and which you could do every day, every week, or every month. But if there are more than 52 local EA groups, an EAGx could be organized every week. If you thought that EA only did projects at maximum efficiency (which it doesn't), and knew only that Giving Tuesdays happen once a year while EAGx events happen more often, you'd expect one EAGx to be less valuable than one Giving Tuesday.
    • Or, in other words, I'd expect there to be some tradeoff between quality and scalability.
Replies from: MichaelA
comment by MichaelA · 2021-03-17T23:39:44.829Z · EA(p) · GW(p)

Thanks for the clarifications :)

I actually didn't estimate ALLFED's impact, I estimated the impact of the marginal hires, per 1.

So did that estimate of the impact of marginal hires also account for how much those hires would contribute to "Building up researchers [themselves or others] into more capable researchers, knowledge acquired that isn't published, information value of trying out dead ends, acquiring prestige"?

Replies from: NunoSempere
comment by NunoSempere · 2021-03-18T12:23:53.586Z · EA(p) · GW(p)

Oof, no it didn't, good point.

comment by MichaelA · 2021-03-17T03:46:33.530Z · EA(p) · GW(p)

How would something like this approach be used for decision-making?

You write:

  • We don’t normally estimate the value of small to medium-sized projects.
  • But we could!
  • If we could do this reliably and scalably, this might lead us to choose better projects

And:

In addition, one could also use these kinds of estimates to choose one’s own projects, or to recommend projects to others, and see how that fares.

But this post estimates the impact of already completed projects/writeups. So precisely this sort of method couldn't directly be used to choose what projects to do. Instead, I see at least two broad ways something like this method could be used as an input when choosing what projects to do:

  1. When choosing what projects to do, look at estimates like these, and either explicitly reason about or form intuitions about what these estimates suggest about the impact of different projects one is considering
    • One way to do this would be to create classifications for different types of projects, and then look up what has been the estimated impact per dollar of past projects in the same or similar classifications to each of the projects one is now choosing between
    • I think there'd be many other specific ways to do this as well
  2. When choosing what projects to do, explicitly make estimates like these for those specific future projects
    • If that's the idea, then the sort of approach taken in this post could be seen as: 
      • just a proof of concept, or 
      • a way to calibrate one's intuitions/forecasts (with the hope being that there'll be transfer between calibration when estimating the impact of past projects and calibration when forecasting the impact of future projects), or
      • a way of getting reference classes / base rates / outside views
        • In that case, this second approach would sort-of incorporate the first approach suggested above as one part of it

Is one of these what you had in mind? Or both? Or something else?
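The classification-and-lookup idea in approach 1 could start as something as small as a table of past per-category estimates. A minimal sketch, with categories and figures invented purely for illustration:

```python
# Hypothetical table of estimated past impact per dollar (mQARPs per $1k),
# keyed by project category. All figures are invented for illustration.
past_impact_per_1k = {
    "research post": [2.0, 5.0, 1.0],
    "event": [0.5, 1.5],
    "fundraising": [3.0, 8.0, 4.0],
}

def prior_for(category: str) -> float:
    """Mean of past per-category estimates, usable as a crude prior
    when comparing candidate projects of known type."""
    xs = past_impact_per_1k[category]
    return sum(xs) / len(xs)
```

Past estimates like the ones in this post would populate the table, and the priors would then feed into either explicit reasoning or intuition-forming when choosing between projects.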

Replies from: NunoSempere
comment by NunoSempere · 2021-03-17T10:22:50.243Z · EA(p) · GW(p)

Yeah, I think that the distinction between evaluation and forecasting is non-central. For example, these estimates can also be viewed as forecasts of what I would estimate if I spent 100x as much time on this, or as forecasts of what a really good system would output.

More to the point, if a project isn't completed I could just estimate the distribution of expected quality, and the expected impact given each degree of quality (or, do a simplified version of that). 

That said, I was thinking more about option 2, though having a classification/lookup scheme would also be a way to produce explicit estimates.

Replies from: MichaelA
comment by MichaelA · 2021-03-17T23:37:01.575Z · EA(p) · GW(p)

For example, these estimates can also be viewed as forecasts of what I would estimate if I spent 100x as much time on this, or as forecasts of what a really good system would output. 

Agreed, but that's still different from forecasting the impact of a project that hasn't happened yet, and the difference intuitively seems like it might be meaningful for our purposes. I.e., it's not immediately obvious that methods and intuitions that work well for the sort of estimation/forecasting done in this post would also work well for forecasting the impact of a project that hasn't happened yet. 

One could likewise say that it's not obvious that methods and intuitions that work well for forecasting how I'll do in job applications would also work well for forecasting GDP growth in developing countries. So I guess my point was more fundamentally about the potential significance of the domain being different, rather than whether the thing can be seen as a type of forecasting or not. 

So it sounds like you're thinking that the sort of thing done in this post would be "a way to calibrate one's intuitions/forecasts (with the hope being that there'll be transfer between calibration when estimating the impact of past projects and calibration when forecasting the impact of future projects)"?

That does seem totally plausible to me; it just adds a step to the argument. 

(I guess I'm also more generally interested in the question of how well forecasting accuracy and calibration transfers across domains - though at the same time I haven't made the effort to look into it at all...)

Replies from: NunoSempere
comment by NunoSempere · 2021-03-18T12:30:36.958Z · EA(p) · GW(p)

Yes, I expect the intuitions built up through estimation to generalize to, and help a great deal with, the forecasting step, though I agree that this might not be intuitively obvious. I understand that estimation and forecasting seem like different categories, but I don't expect that to be a significant hurdle in practice.

comment by MichaelA · 2021-03-17T04:01:10.604Z · EA(p) · GW(p)

Specific reactions to the evaluation of Takeaways from EAF's Hiring Round

The way to estimate impact here would be something like: "Counterfactual impact of the best hire(s) in the organizations it influenced, as opposed to the impact of the hires who would otherwise have been chosen" 

I think that's a substantial part of the impact, but that there may be other substantial parts too, such as:

  • Time saved by employees who have to design application processes (since it's usually easier to do things when one has a good writeup as guidance)
  • Causing orgs to hire sooner, since they're more confident they can do it well and without a huge time investment
  • Something along the lines of "health of the organisation"; if the post reduces the chance of making a hire who isn't a good fit, it reduces the chance of friction and of someone ending up fired or quitting, which I imagine are negative for the culture of the organisation
  • Something along the lines of "health of the community"; I imagine a better hiring round will mean better applicant experiences, which could reduce rates of value drift or burnout or the like

But these are just quick thoughts, and I haven't run application rounds myself, and I think that those things overlap somewhat such that there's a risk of double-counting.

As an upper bound: Let's say it influenced 1 to 5 hiring rounds, and advice in the post allowed advice-takers to hire 0 to 3 people per organization who were 1 to 10% more effective, and who stayed with the organization for 0.5 to 3 years

FWIW, I think the upper bound of my 80% confidence interval would be above 10% more effective and 3 years staying at the org, and definitely above 1% more effective and 0.5 years staying there.

I'm also not sure how to interpret your upper bound itself having a range? (Caveat that I haven't looked at your Guesstimate model.)

I also think that one other effect perhaps worth modelling is that better hiring rounds might mean hires stay at the org for longer (since better and more fitting hires are chosen). This could either be modelled as more output by those employees, or as less cost/output-reduction by employees involved in later hiring rounds, or maybe both.

There are also cases in which an org just doesn't hire anyone at all at a given time if they don't find a good enough fit, and presumably better hiring rounds somewhat reduce the odds of that.
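As a sanity check, interval estimates like the one quoted above ("1 to 5 hiring rounds", "0 to 3 people", "1 to 10% more effective", "0.5 to 3 years") can be re-sampled as a Monte Carlo product. This is a hypothetical sketch using uniform draws; the post's actual numbers come from its Guesstimate model:

```python
import random

def sample_hiring_impact(n=100_000, seed=0):
    """Monte Carlo product over the quoted intervals. Output is in rough
    'improved FTE-years' terms; distributional choices are invented."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        rounds = rng.uniform(1, 5)       # hiring rounds influenced
        hires = rng.uniform(0, 3)        # better hires per round
        gain = rng.uniform(0.01, 0.10)   # effectiveness improvement
        years = rng.uniform(0.5, 3)      # years the hire stays
        draws.append(rounds * hires * gain * years)
    draws.sort()
    # Rough 10th and 90th percentiles of the product
    return draws[n // 10], draws[9 * n // 10]

low, high = sample_hiring_impact()
```

Re-sampling like this makes it easy to see how much of the spread comes from each factor, and how a single upper bound (rather than a range of upper bounds) falls out of the product.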

Ballpark: 0 to 3 excellent EA Forum posts.

FWIW, intuitively, that seems like a pretty low upper bound for the value of improving other orgs' hiring rounds. I guess this is just for the reasons noted above. (And obviously it'd be better if I actually myself provided specific alternative numbers and models - I'm just taking the quicker option, sorry!)

but sometimes, like in the case of a hiring round, I have the intuition that one might want to penalize the hiring organization for the lost opportunity cost of applicants, even though that’s not what Shapley values recommends

Many of the other projects were stated to have an impact by increasing the funding certain organisations received, thereby helping them hire more people, thereby resulting in more useful output. So by that logic, shouldn't those projects also be penalised for the lost opportunity cost of applicants involved in the hiring rounds run by the orgs which received extra funding due to the project? 

Or am I misunderstanding the reasoning or the modelling approach? (That's very possible; I didn't actually look at any of your Guesstimate models.)

Replies from: NunoSempere, NunoSempere, NunoSempere
comment by NunoSempere · 2021-03-18T12:33:34.390Z · EA(p) · GW(p)

I think that's a substantial part of the impact, but that there may be other substantial parts too, such as...

Yes, those seem like at least somewhat important pathways to impact that I've neglected, particularly the first two points. I imagine that could easily lead to a 2x to 3x error (but probably not to a 10x error).

comment by NunoSempere · 2021-03-17T10:08:44.052Z · EA(p) · GW(p)

To answer this specifically:

FWIW, I think the upper bound of my 80% confidence interval would be above 10% more effective and 3 years staying at the org, and definitely above 1% more effect and 0.5 years staying there. 

Yeah, I disagree with this. I'd expect most interventions to have a small effect, and in particular I expect it to just be hard to change people's actions by writing words. My estimate would be much higher if I were thinking about the difference between a completely terrible hiring round and an excellent one, but I don't know that people start off all that terrible, or that this particular post brings people up all that much.

Replies from: MichaelA
comment by MichaelA · 2021-03-17T23:29:48.043Z · EA(p) · GW(p)

That seems reasonable. I think my intuitions would still differ from yours, but I don't have that much reason to expect my intuitions are well-calibrated here, nor have I thought about this carefully and explicitly.

comment by NunoSempere · 2021-03-18T12:34:27.427Z · EA(p) · GW(p)

I'm also not sure how to interpret your upper bound itself having a range?

The upper bound being a range was a mistake; it's fixed now.

comment by jimrandomh · 2021-03-18T07:56:29.624Z · EA(p) · GW(p)

I was somewhat confused by the scale using Categorizing Variants of Goodhart's Law as an example of a 100mQ paper, given that the LW post version [LW · GW] of that paper won the 2018 AI Alignment Prize [LW · GW] ($5k), which makes a pretty strong case for it being "a particularly valuable paper" (1Q, the next category up). I also think this scale significantly overvalues research agendas and popular books relative to papers. I don't think these aspects of the rubric wound up impacting the specific estimates made here, though.

Replies from: jackmalde, NunoSempere
comment by jackmalde · 2021-03-18T13:08:20.219Z · EA(p) · GW(p)

I'm not sure on the exact valuation research agendas should get, but I would argue that well thought-through research agendas can be hugely beneficial in that they can reorient many researchers in high-impact directions, leading them to write papers on topics that are vastly more important than they might have otherwise chosen.  

I would argue an 'ingenious' paper written on an unimportant topic isn't anywhere near as good as a 'pretty good' paper written on a hugely important topic.

comment by NunoSempere · 2021-03-18T12:41:32.566Z · EA(p) · GW(p)

Yes, the scale is under construction, and you're not the first person to mention that the specific research agenda mentioned is overvalued.