Yeah, the idea is that the lower the expected value of the future the less bad it is if AI causes existential catastrophes that don't involve lots of suffering. So my wording was sloppy here; lower EV of the future perhaps decreases the importance of (existential catastrophe-preventing) AI risk but not my credence in it.
I'm not sure that Eliezer having occasionally been overconfident while getting the general shape of things right is any evidence at all against >50% AGI in 30 years or a >15% chance of catastrophe this century (though it could be evidence against Eliezer's very high risk view).
I wouldn't go as far as no evidence at all, given that my understanding is Eliezer (+ MIRI) was heavily involved in influencing the OpenPhil cluster's views, so it's not entirely independent; but I agree it's much weaker evidence against the less extreme views.
Fewer 'smart people disagree' about the numbers in your footnote than about the more extreme view.
I was going to say that it seems like a big difference within our community, but both clusters of views are very far away from the median pretty reasonable person and the median AI researcher. Though I suppose the latter actually isn't far away on timelines (potentially depending on the framing?). It definitely seems to be in significant tension with how AI researchers and the general public / markets / etc. act, regardless of stated beliefs (e.g. I found it interesting how short the American public's timelines are, compared to their actions).
Anyway, overall I think you're right that it makes a difference but it seems like a substantive concern for both clusters of views.
The Carlsmith post you say you roughly endorse seems to have 65% on AGI in 50 years, with a 10% chance of existential catastrophe overall. So I'm not sure what that implies about your conclusion.
The conclusion I intend to convey is something like "I'm no longer as hesitant about adopting views which are at least as concerning as >50% of AGI/TAI/APS-AI within 30 years, and >15% chance of existential catastrophe this century" which as I referred to above seem to make AI clearly the most important cause area.
I’m now at ~20% by 2036; my median is now ~2050 though still with a fat right tail.
My timelines shortening [due to reflecting on MATH breakthrough] should also increase my p(AI doom by 2100) a bit, though I’m still working out my views here. I’m guessing I’ll land somewhere between 20 and 60% [TBC, most of the variance is coming from working out my views and not the MATH breakthrough].
Consequently, it's possible to be skeptical of the motivations of anyone in AI safety, expert or novice, on the grounds that "isn't it convenient the best way to save the world is to do cool AI stuff?"
Fair point overall, and I'll edit in a link to this comment in the post. It would be interesting to see data on what percentage of people working in AI safety due to EA motivations would likely be working in AI regardless of impact. I'd predict that it's significant but not a large majority (say, an 80% CI of 25-65%).
A few reactions to specific points/claims:
It's possible that they know the same amount about AI X-risk mitigation, and would perhaps have similar success rate working on some alignment research (which to a great deal involves GPT-3 prompt hacking with near-0 maths).
My understanding is that most alignment research involves either maths or skills similar to ML research/engineering; there is some ~GPT-3 prompt hacking (e.g. this post?) but it seems like <10% of the field?
Imagine that two groups wanted to organise an AI camp or event: a group of AI novice undergrads who have been engaged in EA vs a group of AI profs with no EA connections. Who is more likely to get funding?
I'm not sure about specifically organizing an event, but I'd guess that experienced AI profs with no EA connections but who seemed genuinely interested in reducing AI x-risk would be able to get substantial funding/support for their research.
EA-funded AI safety is actually a pretty sweet deal for an AI novice who gets to do something that's cool at very little cost.
The field has probably gotten easier to break into over time but I'd guess most people attempting to enter still experience substantial costs, such as rejections and mental health struggles.
tl;dr In the last 6 months I started a forecasting org, got fairly depressed and decided it was best to step down indefinitely, and am now figuring out what to do next. I note some lessons I’m taking away and my future plans.
But I don't think you learn all that much about how 'concrete and near mode' researchers who expect slower takeoff are being, from them not having given much thought to what to do in this (from their perspective) unlikely edge case.
I'm not sure how many researchers assign little enough credence to fast takeoff that they'd describe it as an unlikely edge case, which sounds like <=10%? E.g. in Paul's blog post he writes "I’m around 30% of fast takeoff".
ETA: One proxy could be the percentage that researchers assigned to "Superintelligence" in this survey.
I agree with the general thrust of the post, but when analyzing technological risks I think one can get substantial evidence just by considering the projected "power level" of the technology, while you focus on evidence that this power level will lead to extinction. I agree the latter is much harder to get evidence about, but I think the former is sufficient to be very worrisome without much evidence on the latter.
Specifically, re: AI you write:
we’re reliant on abstract arguments that use ambiguous concepts (e.g. “objectives” and “intelligence”), rough analogies, observations of the behaviour of present-day AI systems (e.g. reinforcement learners that play videogames) that will probably be very different than future AI systems, a single datapoint (the evolution of human intelligence and values) that has a lot of important differences with the case we’re considering, and attempts to predict the incentives and beliefs of future actors in development scenarios that are still very opaque to us.
I roughly agree with all of this, but by itself the argument that we will within the next century plausibly create AI systems that are more powerful than humans (e.g. Ajeya's timelines report) seems like enough to get the risk pretty high. I'm not sure what our prior should be on existential risk conditioned on a technology this powerful being developed, but honestly starting from 50% might not be unreasonable.
5 forecasters from Samotsvety Forecasting discussed the forecasts in this post.
First, I estimate that the chance of direct Great Power conflict this century is around 45%.
Our aggregated forecast was 23.5%. Considerations discussed were the changed incentives in the nuclear era, possible causes (climate change, AI, etc.) and the likelihood of specific wars (e.g. US-China fighting over Taiwan).
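As a hedged illustration of what "aggregated forecast" can mean here: one common way to combine probability forecasts is the geometric mean of odds (this is an assumption for illustration; the post doesn't specify which aggregation method was used, and the individual forecasts below are hypothetical):

```python
import math

def aggregate(probs):
    """Aggregate probability forecasts via the geometric mean of odds."""
    odds = [p / (1 - p) for p in probs]          # convert each probability to odds
    geo = math.prod(odds) ** (1 / len(odds))     # geometric mean of the odds
    return geo / (1 + geo)                       # convert back to a probability

# Hypothetical individual forecasts from five forecasters (illustration only)
print(aggregate([0.15, 0.20, 0.25, 0.30, 0.35]))  # roughly 0.24
```

The geometric mean of odds tends to be less dominated by extreme outlier forecasts than a simple arithmetic mean of probabilities.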
Second, I think the chance of a huge war as bad or worse than WWII is on the order of 10%.
Our aggregated forecast was 25%, though we were unsure if this was supposed to only count wars between great powers, in which case it’s bounded above by the first forecast.
There was some discussion of the offense-defense balance as tech capabilities increase; perhaps offense will have more of an advantage over time.
Some forecasters would have preferred to predict based on something like human suffering per capita rather than battle deaths, due to an expected shift in how a 21st century great power war would be waged.
Third, I think the chance of an extinction-level war is about 1%. This is despite the fact that I put more credence in the hypothesis that war has become less likely in the post-WWII period than I do in the hypothesis that the risk of war has not changed.
Our aggregated forecast was 0.1% for extinction. Forecasters were skeptical of using Braumoeller's model to estimate this, as it seems likely to break down at the tails; killing everyone via a war seems really hard. There was some uncertainty about whether borderline cases, such as war + another disaster to finish people off or war + future tech, would count.
(Noticed just now that MaxRa commented giving a similar forecast with similar reasoning)
Hey, thanks for sharing these other options. I agree that one of these choices makes more sense than forecasting in many cases, and likely (90%) the majority. But I still think forecasting is a solid contender and plausibly (25%) the best in the plurality of cases. Some reasons:
Which activity is best likely depends a lot on which is easiest to actually start doing, because I think the primary barrier to doing most of these usefully is "just" actually getting started and completing something. Forecasting may (40%) be the most fun and least intimidating of these for many (33%+) prospective researchers because of the framing of competing on a leaderboard and the intrigue of trying to predict the future.
I think the EA community has relatively good epistemics, but there is still room for improvement, and more researchers getting a forecasting background is one way to help with this (due to both epistemic training and identifying prospective researchers with good epistemics).
Depending on the question, forecasting can look a lot like a bite-sized chunk of research, so I don't think it's mutually exclusive with some of the activities you listed, and it's especially similar to summarizations/collections: for example, Ryan summarized relevant parts of papers and then formed some semblance of an inside view in his winning entry.
Also, I was speaking from personal experience here; e.g. Misha and I both have forecasted for a few years and enjoyed it while building skills and a track record, and are now doing ~generalist research or had the opportunity to and seriously considered it, respectively.
I think this will become especially true as the UX of forecasting platforms improves; let's say 55% that this is true 3 years from now, as I expect the UX here to improve more than the "UX" of other options like summarizing papers.
Note that patient philanthropy includes investing in resources besides money that will allow us to do more good later; e.g. the linked article lists "global priorities research" and "Building a long-lasting and steadily growing movement" as promising opportunities from a patient longtermist view.
Looking at the Future Fund's Areas of Interest, at least 5 of the 10 strike me as promising under patient philanthropy: "Epistemic Institutions", "Values and Reflective Processes", "Empowering Exceptional People", "Effective Altruism", and "Research That Can Help Us Improve".
At first I thought the scenarios were separate so they would be combined with an OR to get an overall probability, which then made me confused when you looked at only scenario 1 for determining your probability for technological feasibility.
I was also confused about why you assigned 30% to polygenic scores reaching 80% predictive power in Scenario 2 while assigning 80% to reaching saturation at 40% predictive power in Scenario 1: I read "80% to reach saturation at 40% predictive power" as capping out at around 40%, which would leave at most 20% for scenarios with much greater than 40%.
Finally, I was a little confused about where the likelihood of iterated embryo selection fit into your scenarios; this seems highly relevant/important and is maybe implicitly accounted for in e.g. "Must be able to generate 100 embryos to select from"? But could be good to make more explicit.
Great point. Perhaps we should have ideally reported the mean of this type of distribution, rather than our best guess percentages. I'm curious if you think I'm underconfident here?
Edit: Yeah I think I was underconfident, would now be at ~10% and ~0.5% for being 1 and 2 orders of magnitude too low respectively, based primarily on considerations Misha describes in another comment placing soft bounds on how much one should update from the base rate. So my estimate should still increase but not by as much (probably by about 2x, taking into account possibility of being wrong on other side as well).
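As a rough sanity check of the "probably by about 2x" figure above, here is a sketch of the expected correction factor implied by the stated probabilities (~10% chance the estimate is 1 order of magnitude too low, ~0.5% chance it is 2 orders too low; this ignores the possibility of the estimate being too high, which the comment notes would pull the multiplier down somewhat):

```python
# Probabilities from the comment: estimate is 10x too low with p=0.10,
# 100x too low with p=0.005, otherwise treated as about right.
p_10x, p_100x = 0.10, 0.005
p_right = 1 - p_10x - p_100x

# Expected multiplier on the original estimate
expected_multiplier = p_right * 1 + p_10x * 10 + p_100x * 100
print(expected_multiplier)  # roughly 2.4
```

So the expectation lands a bit above 2x before accounting for the "wrong on the other side" correction, consistent with the comment's rough "about 2x".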
The estimate being too low by 1-2 orders of magnitude seems plausible to me independently (e.g. see the wide distribution in my Squiggle model), but my confidence in the estimate is increased by it being the aggregate of several excellent forecasters who were reasoning independently to some extent. Given that, my all-things-considered view is that 1 order of magnitude off feels plausible but not likely (~25%?), and 2 orders of magnitude seems very unlikely (~5%?).
I agree the risk should be substantially higher than for an average month and I think most Samotsvety forecasters agree. I think a large part of the disagreement may be on how risky the average month is.
From the post:
(a) may be due to having a lower level of baseline risk before adjusting up based on the current situation. For example, while Luisa Rodríguez’s analysis puts the chance of a US/Russia nuclear exchange at .38%/year, we think this seems too high for the post-Cold War era, after new de-escalation methods have been implemented and lessons have been learned from close calls. Additionally, we trust the superforecaster aggregate the most out of the estimates aggregated in the post.
Speaking personally, I'd put the baseline risk at ~.1%/yr, then adjust up by a factor of 10 to ~1%/yr given the current situation, which gives me ~.08%/month, pretty close to the aggregate of ~.07%.
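The annual-to-monthly conversion above can be sketched as follows, assuming a constant, independent risk in each month (a simplifying assumption):

```python
# Baseline ~0.1%/yr, adjusted up by a factor of 10 given the current situation
annual = 0.001 * 10  # ~1%/yr

# Probability of at least one event in a month, assuming constant independent
# monthly risk: 1 - (probability of no event for 1/12 of a year)
monthly = 1 - (1 - annual) ** (1 / 12)
print(f"{monthly:.2%}")  # about 0.08%/month
```

For probabilities this small, `annual / 12` gives nearly the same answer; the compounding correction only matters for larger risks.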
We also looked at this from alternative perspectives, e.g. decomposing Putin's decision-making process, which gave estimates in the same ballpark.
There are many important topics, such as the level of risk from advanced artificial intelligence and how to reduce it, among which there are reasonable people with very different views. We are interested in experimenting with various types of adversarial collaborations, which we define as people with opposing views working to clarify their disagreement and either resolve the disagreement or identify an experiment/observation that would resolve it. We are especially excited about combining adversarial collaborations with forecasting on any double cruxes identified from them. Some ideas for experimentation might be varying the number of participants, varying the level of moderation and strictness of enforced structure, and introducing AI-based aids.
I found this thought-provoking. I'd be curious to hear more about your recommendations for readers. I'm wondering:
Would you recommend ~all readers try decreasing their sleep to ~6 hours a night and observe the effects? Or should they slowly decrease until the effects are negative?
If not, how should they decide whether it makes sense for them?
What percentage of readers do you estimate would be overall more productive with ~6 hours of sleep than their unrestricted amount?
How much individual variability do you think there is here?
Some background is that I read Why We Sleep a while ago and loved it, I think because it told a story I liked. My personal experience was that I generally dealt with little sleep badly, and reading that others around me who seemingly functioned better with little sleep actually didn't and were fooling themselves was a nice story selfishly.
Then I read your takedown, which was not as comforting to read but seemed right to me. What I took from your takedown is that most of the claims in Why We Sleep were unsubstantiated, so I should go back to something like the prior of my observed experiences: I and some others can't deal with little sleep very well, while those I'm a bit jealous of are much better at sleeping little while being productive and happy.
Reading this piece now, I'm convinced to a decent extent that this restriction will work for some subset of people. But how large is this subset? How much individual variability do you think there is here? Is it worth it for people with prior negative experiences with lack of sleep like me and Peter and Oli in other comments to do more rigorous experimentation? What exactly should that experimentation be?
While I haven't done any controlled self-experiments or anything, I generally have less ability to focus and stronger urges to rest and nap even after one night of <= ~6 hours of sleep, and they get stronger after more.
They generally only accept applications from registered charities, but speculation grants (a) might be a good fit for smaller projects (40%).
My read is that speculation grants are a way for projects applying to SFF to get funding more quickly, rather than a way for projects that aren't eligible for SFF to get funding (I believe SFP serves this purpose).
The results are pretty interesting! I'm surprised at how much optimism there is about 25 unique people/groups compared to 100 total entries; my intuition for expecting an average of about 4 entries per person/group was that most would only submit 1-2, but it only takes a few people submitting on many questions to drive the average up substantially.
My answer to your question depends on how you define "good for the long-term future". When I think about evaluating the chance an action is good including of long-run effects, specifying a few more dimensions matters to me. It feels like several combinations of these could be reasonable and would often lead to fairly different probabilities.
Expected value vs. realized value
Does "good for the long-term future" mean: good in expectation, or actually having good observed effects?
What is the ground truth evaluation?
Is the ground truth evaluation one that would be performed by:
An oracle that has all knowledge of all events?
The best evaluation that is in some sense realizable, e.g.:
A large (1,000 people?), well-organized team of competent people evaluating the action for a long time (1,000 years?)?
The best evaluation AI, 100 years after AI surpasses human level?
I think usually people mean (1), but in practice it often feels useful to me to think about some version of (2).
Foresight vs. hindsight
Does the ground truth evaluation occur before the action occurs, or after it occurs and all (or some) effects can be observed?
(Note: Using this as an excuse to ask this clarifying question that I've thought about some recently, that could apply to many posts. Haven't done a thorough lit review on this so apologies if this is already covered somewhere else)
More work put into creating impactful questions, e.g. via identifying forecastable cruxes in key EA debates and integrating with ongoing EA-aligned research.
Better incentives for deep, collaborative predictions on these impactful questions.
The question you ask about "studying forecasting to benefit EA" vs. "prediction markets as an EA cause area" is also important. I'm inclined to favor interventions closer to "studying forecasting to benefit EA" at present (though I might frame it more as "improve EA's wisdom/decision-making via various means including forecasting", h/t QURI/Ozzie for influence here), because we're a relatively young and growing movement with a lot of resources (money + people) to deploy and not much clarity on how to deploy them best. Once we get better at this ourselves, e.g. once forecasting platforms and prediction markets have clearly and substantially improved important EA decisions, I'd feel it's more time for "prediction markets/forecasting platforms improving non-EAs' decision-making" to be an EA cause area. I'm open to changing my mind on this, e.g. if I see more evidence of forecasting having already improved important EA decisions. https://forum.effectivealtruism.org/posts/Ds2PCjKgztXtQrqAF/disentangling-improving-institutional-decision-making-2 and https://forum.effectivealtruism.org/posts/YpaQcARgLHFNBgyGa/prioritization-research-for-advancing-wisdom-and are good pointers on this overall question as well.
On causal evidence of RCTs vs. observational data: I'm intuitively skeptical of this but the sources you linked seem interesting and worthwhile to think about more before setting an org up for this. (Edited to add:) Hearing your view already substantially updates mine, but I'd be really curious to hear more perspectives from others with lots of experience working on this type of stuff, to see if they'd agree, then I'd update more. If you have impressions of how much consensus there is on this question that would be valuable too.
On nudging scientific incentives to focus on important questions rather than working on them ourselves: this seems pretty reasonable to me. I think building an app to do this still seems plausibly very valuable and I'm not sure how much I trust others to do it, but maybe we combine the ideas and build an app then nudge other scientists to use this app to do important studies.
This all makes sense to me overall. I'm still excited about this idea (slightly less so than before) but I think/agree there should be careful considerations on which interventions make the most sense to test.
I think it's really telling that Google and Amazon don't have internal testing teams to study productivity/management techniques in isolation. In practice, I just don't think you learn that much, for the cost of it.
What these companies do do is allow different managers to try things out, survey them, and promote the seemingly best practices throughout. This happens very quickly. I'm sure we could make tools to make this process go much faster (better elicitation, better data collection of what already happens, lots of small estimates of impact to see what to focus more on, etc.).
A few things come to mind here:
The point about how much evidence Google/Amazon not doing it provides feels related to the discussion around our corporate prediction market analysis. Note that I was the author who probably weighted the evidence that most corporations discontinued their prediction markets the least (see my conclusion), though I still think it's fairly substantial.
I also agree with the point in your reply that setting up prediction markets and learning from them has positive externalities, and a similar thing should apply here.
I agree that more data collection tools for what already happens and other innovations in that vein seem good as well!
A variant I'd also be excited about (possibly even more so; I could go either way after more reflection), which could be contained within the same org or a separate one: the same thing but for companies (particularly startups). Edit to clarify: test policies/strategies across companies, not on people within companies.
I think the obvious answer is that doing controlled trials in these areas is a whole lot of work/expense for the benefit.
Some things like health effects can take a long time to play out; maybe 10-50 years. And I wouldn't expect the difference to be particularly amazing. (I'd be surprised if the average person could increase their productivity by more than ~20% with any of those)
I think our main disagreement is around the likely effect sizes; e.g. I think blocking out focused work could easily have an effect size of >50% (but am pretty uncertain which is why I want the trial!). I agree about long-term effects being a concern, particularly depending on one's TAI timelines.
On "challenge trials"; I imagine the big question is how difficult it would be to convince people to accept a very different lifestyle for a long time. I'm not sure if it's called "challenge trial" in this case.
Yeah, I'm most excited about challenges that last more like a few months to a year, though this isn't ideal in all domains (e.g. veganism), so maybe this wasn't best as the top example. I have no strong views on terminology.
EDIT (Jul 2022): I'm no longer nearly as confident in this idea, though if someone was excited about it it still might be cool.
Reflecting a little on my shortform from a few years ago, I think I wasn't ambitious enough in trying to actually move this forward.
I want there to be an org that does "human challenge"-style RCTs across lots of important questions that are extremely hard to get at otherwise, e.g. (top 2 are repeated from previous shortform. edited to clarify: these are some quick examples off the top of my head, should be more consideration into which are the best for this org):
Health effects of veganism
Health effects of restricting sleep
Productivity of remote vs. in-person work
Productivity effects of blocking out focused/deep work
Edited to add: I no longer think "human challenge" is really the best way to refer to this idea (see comment that convinced me); I mean to say something like "large scale RCTs of important things on volunteers who sign up on an app to randomly try or not try an intervention." I'm open to suggestions on succinct ways to refer to this.
I'd be very excited about such an org existing. I think it could even grow to become an effective megaproject, pending further analysis on how much it could increase wisdom relative to power. But, I don't think it's a good personal fit for me to found given my current interests and skills.
However, I think I could plausibly provide some useful advice/help to anyone who is interested in founding a many-domain human-challenge org. If you are interested in founding such an org or know someone who might be and want my advice, let me know. (I will also be linking this shortform to some people who might be able to help set this up.)
Some further inspiration I'm drawing on to be excited about this org:
Freakonomics' RCT on measuring the effects of big life changes like quitting your job or breaking up with your partner. This makes me optimistic about the feasibility of getting lots of people to sign up.
I really appreciate that you break down explanatory factors in the way you do.
I'm happy that this was useful for you!
I have a hard time making a mental model of their relative importance compared to each other. Do you think that such an exercise is feasible, and if so, do any of you have a conception of the relative explanatory strength of any factor when considered against the others?
Good question. We also had some trouble with this, as it's difficult to observe the reasons many corporate prediction markets have failed to catch on. That being said, my best guess is that it varies substantially based on the corporation:
For an average company, the most important factor might be some combination of (2) and (4): many employees wouldn't be that interested in predicting, so the cost of getting enough predictions might be high, and there also just isn't that much appetite to change things up.
For an average EA org, the most important factors might be a combination of (1) and (2): the tech is too immature and writing + acting on good questions takes too much time such that it's hard to find the sweet spot where the benefit is worth the cost. In particular, many EA orgs are quite small so fixed costs of setting up and maintaining the market as well as writing impactful questions can be significant.
This Twitter poll by Ozzie and the discussion under it is also interesting data here; my read is that the mapping between Ozzie's options and our requirements is:
They're undervalued: None of our requirements are substantial enough issues.
They're mediocre: Some combination of our requirements (1), (2), and (3) make prediction markets not worth the cost.
Politically disruptive: Our requirement (4).
(3) won the poll by quite a bit, but note it was retweeted by Hanson, which could skew the voting pool (h/t Ozzie for mentioning this somewhere else).
Also, do you think that it is likely that the true explanation has nothing to do with any of these? In that case, how likely?
The most likely possibility I can think of is the one Ozzie included in his poll: prediction markets are undervalued for a reason other than political fears, and all/most of the companies made a mistake by discontinuing them. I'd say 15% for this, given that the evidence is fairly strong but there could be correlated reasons companies are missing out on the benefits. In particular, they could be underestimating some of the positive effects Ozzie mentioned in his comment above.
As for an unlisted explanation being the main one, it feels like we covered most of the ground here and the main explanation is at least related to something we mentioned, but unknown unknowns are always a thing; I'd say 10% here.
So that gives me a quick gut estimate of 25%; would be curious to get others' takes.
Appreciate the compliment. I am interested in making it a Forum post, but might want to do some more editing/cleanup or writing over the next few weeks/months (it got more interest than I was expecting, so it seems more likely to be worth it now). Might also post as is; will think about it more soon.
Hi Lizka, thanks for your feedback! I think it touched on some of the sections that I'm most unsure about / that could most use revision, which is great.
[Bottlenecks] You suggest "Organizations and individuals (stakeholders) making important decisions are willing to use crowd forecasting to help inform decision making" as a crucial step in the "story" of crowd forecasting’s success (the "pathway to impact"?) --- this seems very true to me. But then you write "I doubt this is the main bottleneck right now but it may be in the future" (and don't really return to this).
I'll say up front it's possible I'm just wrong about the importance of the bottleneck here, and I think it also interacts with the other bottlenecks in a tricky way. E.g. if there were a clearer pipeline for creating important questions which get very high quality crowd forecasts which then affect decisions, more organizations would be interested.
That being said, my intuition that this is not the bottleneck comes from some personal experiences I've had with forecasts solicited by orgs that already are interested in using crowd forecasts to inform decision making. Speaking from the perspective of a forecaster, I personally wouldn't have trusted the forecasts produced as an input into important decisions.
Some examples: [Disclaimer: These are my personal impressions. Creating impactful questions and incentivizing forecaster effort is really hard, and I respect OP/RP/Metaculus a lot for giving it a shot, and would love to be proven wrong about the impact of current initiatives like these]
The Open Philanthropy/Metaculus Forecasting AI Progress Tournament is the most well-funded initiative I know of [ETA: potentially besides those contracting Good Judgment superforecasters], but my best guess is that the forecasts resulting from it will not be impactful. An example is the "deep learning" longest time horizon round, where despite Metaculus' best efforts most questions have no or few comments, and at least to me it felt like the bulk of the forecasting skill was forming a continuous distribution from trend extrapolation. See also this question, where the community failed to fully update on record-breaking scores appropriately. Also note that each question attracted only 25-35 forecasters.
I feel less sure about this, but RP's animal welfare questions authored by Neil Dullaghan seem to have the majority of comments on them by Neil himself. I feel intuitively skeptical that most of the 25-45 forecasters per question are doing more than skimming and making minor adjustments to the current community forecast, and this feels like an area where getting up to speed on domain knowledge is important for accurate forecasts.
So my argument is: given that AFAIK we haven't had consistent success using crowd forecasts to help institutions make important decisions, the main bottleneck seems to be helping the interested institutions rather than getting more institutions interested.
If, say, the CDC (or important people there, etc.) were interested in using Metaculus to inform their decision-making, do you think they would be unable to do so due to a lack of interest (among forecasters) and/or a lack of relevant forecasting questions? (But then, could they not suggest questions they felt were relevant to their decisions?) Or do you think that the quality of answers they would get (or the amount of faith they would be able to put into those answers) wouldn't be sufficient?
[Caveat: I don't feel too qualified to opine on this point, since I'm not a stakeholder nor have I interviewed any, but I'll give my best guess.]
I think for the CDC example:
Creating impactful questions seems relatively easier here than in e.g. the AI safety domain, though it still may be non-trivial to identify and operationalize cruxes for which predictions would actually lead to different decisions.
On average, I'd expect the forecasts to be a bit better than CDC models / domain experts, and perhaps substantially better on tail risks. I don't think we have a lot of evidence here; we have some from Metaculus tournaments with small sample sizes.
I think with better incentives to allocate more forecaster effort to this project, it's possible the forecasts could be much better.
Overall, I'd expect decent forecasts on good but not great questions, and I think this isn't really enough to move the needle, so to speak. I also think the forecasts would need to come with reasoning for stakeholders to understand, and trust in crowd forecasts would need to be built up over time.
Part of the reason it seems tricky to have impactful forecasts is that often there are competing people/"camps" with different world models, and a person the crowd forecast disagrees with may be reluctant to change their mind unless (a) the question is well targeted at the cruxes of the disagreement and (b) they have built up trust in the forecasters and their reasoning process. To the extent this is true within the CDC, it seems harder for forecasting questions to be impactful.
2. [Separate, minor confusion] You say: "Forecasts are impactful to the extent that they affect important decisions," and then you suggest examples a-d ("from an EA perspective") that range from career decisions or what seem like personal donation choices to widely applicable questions like "Should AI alignment researchers be preparing more for a world with shorter or longer timelines?" and "What actions should we recommend the US government take to minimize pandemic risk?" This makes me confused about the space (or range) of decisions and decision-makers that you are considering here.
Yeah I think this is basically right, I will edit the draft.
[Side note] I loved the section "Idea for question creation process: double crux creation," and in general the number of possible solutions that you list, and really hope that people try these out or study them more. (I also think you identify other really important bottlenecks).
I wrote a draft outline on bottlenecks to more impactful crowd forecasting that I decided to share in its current form rather than clean it up into a post [edited to add: I ended up revising it into a post here].
A third perspective roughly justifies the current position: we should discount the future at the rate current humans think is appropriate, but also separately place significant value on having a positive long-term future.
I feel that EA shouldn't spend all or nearly all of its resources on the far future, but I'm uncomfortable with incorporating a moral discount rate for future humans as part of "regular longtermism" since it's very intuitive to me that future lives should matter the same amount as present ones.
I prefer objections from the epistemic challenge, which I'm uncertain enough about to feel that various factors (e.g. personal fit, flow-through effects, gaining experience in several domains) mean that it doesn't make sense for EA to go "all-in". An important aspect of personal fit is comfort working on very low probability bets.
I'm curious how common this feeling is, vs. feeling okay with a moral discount rate as part of one's view. There's some relevant discussion under the comment linked in the post.
Overall I like this idea, appreciate the expansiveness of the considerations discussed in the post, and would be excited to hear takes from people working at social media companies.
Thoughts on the post directly
Broadly, we envision i) automatically suggesting questions of likely interest to the user—e.g., questions related to the user’s current post or trending topics—and ii) rewarding users with higher than average forecasting accuracy with increased visibility
I think some version of boosting visibility based on forecasting accuracy seems promising, but I feel uneasy about how this would be implemented. I'm concerned about (a) how this will be traded off with other qualities and (b) ensuring that current forecasting accuracy is actually a good proxy.
On (a), I think forecasting accuracy and the qualities it's a proxy for represent a small subset of the space that determines which content I'd like to see promoted; e.g. it seems likely to be only loosely correlated with writing quality. It may be tricky to strike the right balance in terms of how the promotion system works.
On (b), I'm worried about promoting and demoting content based on a small sample size of forecasts. In practice it often takes many resolved questions to discern which forecasters are more accurate, making it easy to increase/decrease visibility too early.
Even without a small sample size, there may be issues with many of the questions being correlated. I'm imagining a world in which lots of people predict on correlated questions about the 2016 presidential election, then Trump supporters get a huge boost in visibility after he wins because they do well on all of them.
That said, these issues can be mitigated with iteration on the forecasting feature if the people implementing it are careful and aware of these considerations.
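The correlated-questions concern can be made concrete with a small sketch (all numbers are invented for illustration, not taken from any real platform): if ten questions all secretly hinge on one underlying event, ten "wins" under a Brier score carry the evidential weight of one.

```python
# Hypothetical sketch: why correlated questions can distort accuracy-based
# rankings. Ten questions all resolve on the same underlying event (say,
# one election), and that event resolves Yes.

def brier(prediction: float, outcome: int) -> float:
    """Brier score for a single binary question (lower is better)."""
    return (prediction - outcome) ** 2

outcomes = [1] * 10  # all ten questions resolve Yes together

# Forecaster A is a partisan who put 90% on the surprise outcome everywhere;
# forecaster B tracked a reasonable 30% crowd probability on each question.
score_a = sum(brier(0.9, o) for o in outcomes) / len(outcomes)
score_b = sum(brier(0.3, o) for o in outcomes) / len(outcomes)

# A's average Brier (~0.01) looks like ten demonstrations of skill over
# B's (~0.49), but both numbers reflect a single lucky/unlucky event.
print(round(score_a, 2), round(score_b, 2))
```

The effective sample size here is one question, not ten, which is why a visibility system keyed to raw accuracy could massively over-reward the Trump-supporter cluster in the 2016 example above.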
Generally, it might be best if the recommendation algorithms don’t reward accurate forecasts in socially irrelevant domains such as sports—or reward them less so.
Insofar as the intent is to incentivize people to predict on more socially relevant domains, I agree. But I think forecasting accuracy on sports, etc. is likely strongly correlated with performance in other domains. Additionally, people may feel more comfortable forecasting on things like sports than other domains which may be more politically charged.
My experience with Facebook Forecast compared to Metaculus
I've been forecasting regularly on Metaculus for about 9 months and Forecast for about 1 month.
I don't feel as pressured to regularly go back and update my old predictions on Forecast as on Metaculus, since Forecast is a play-money prediction market rather than a scored prediction platform. On Metaculus, if I predict 60% and the community is at 50%, then don't update for 6 months while the community moves to 95%, I'm at a huge disadvantage in terms of score relative to predictors who did update. But with a prediction market, if I buy shares at 50 cents and the price of the shares goes up to 95 cents, it just helps me. The prediction market structure makes me feel less pressured to continually update on old questions, which has both its positives and negatives but seems good for a social media forecasting structure.
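This incentive difference can be sketched with the numbers from the paragraph above (treating Metaculus' scoring as a generic log score for illustration; the platform's actual scoring rule is more involved):

```python
import math

# Suppose the question eventually resolves Yes.

# On a log-scored platform, a stale 60% forecast keeps being compared
# against forecasters who updated to 95% (log score: higher is better).
stale_log_score = math.log(0.60)    # ~ -0.51: the stale forecaster bleeds score
updated_log_score = math.log(0.95)  # ~ -0.05: updaters do much better

# In a play-money market, buying Yes at 50 cents and never touching the
# position again still pays out $1.00 per share on resolution.
cost, payout = 0.50, 1.00
profit_per_share = payout - cost    # +0.50: the old position only helped

print(round(stale_log_score, 2), round(updated_log_score, 2), profit_per_share)
```

The asymmetry is that the market locks in the price you traded at, while a scored platform keeps re-evaluating your standing forecast until the question closes.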
The aggregate on Forecast is often decent, but it's occasionally horrible, more egregiously and more often than on Metaculus (e.g. this morning I bought some shares for Kelly Loeffler to win the Georgia senate runoff at as low as ~5 points, implying 5% odds, while election betting odds currently have Loeffler at 62%). The most common reasons I've noticed are:
People misunderstand how the market works and bet on whichever outcome they think is most probable, regardless of the prices.
People don't make the error described in (1) (that I can tell), but are over-confident.
People don't read the resolution criteria carefully.
There aren't many predictors so the aggregate can be swung easily.
As hinted at in the post, there's an issue with being able to copy the best predictors. I've followed 2 of the top predictors on Forecast and usually agree with their analyses and buy into the same markets with the same positions.
Forecast currently gives points when other people forecast based on your "reasons" (aka comments), and these points are then aggregated on the leaderboard with points gained from actual predictions. I wish there were separate leaderboards for these.
The forecasting accuracy of Forecast’s users was also fairly good: “Forecast's midpoint brier score [...] across all closed Forecasts over the past few months is 0.204, compared to Good Judgement's published result of 0.227 for prediction markets.”
For what it's worth, as noted in Nuño's comment, this comparison holds little weight when the questions aren't the same or on the same time scales; I'd take it as only weak evidence against my prior that real-money prediction markets are much more accurate.
My forecast is pretty heavily based on the GoodJudgment article How to Become a Superforecaster. According to it, they identify Superforecasters each autumn and require forecasters to have made 100 forecasts (I assume 100 resolved ones), so now might actually be the worst time to start forecasting. It looks like if you started predicting now, the 100th question wouldn't close until the end of 2020, so it seems very unlikely you'd be able to become a Superforecaster in this autumn's batch.
[Note: alexrjl clarified over PM that I should treat this as "Given that I make a decision in July 2020 to try to become a Superforecaster" and not assume he would persist for the whole 2 years.]
This left most of my probability mass given you becoming a Superforecaster eventually on you making the 2021 batch, which requires you to both stick with it for over a year and perform well enough to become a Superforecaster. If I were to spend more time on this I would refine my estimates of how likely each of those are.
I assumed that if you didn't make the 2021 batch, you'd probably call it quits before the 2022 batch or not be outperforming the GJO crowd by enough to make it; and even if you did make that batch, you might not officially become a Superforecaster before 2023.
Overall I ended up with a 36% chance of you becoming a Superforecaster in the next 2 years. I'm curious to hear if your own estimate would be significantly different.
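The structure of an estimate like this can be sketched as a simple decomposition (the component numbers below are my own hypothetical fill-ins chosen to land near 36%, not the ones actually behind the estimate above):

```python
# Illustrative decomposition only: P(Superforecaster within 2 years)
# ~= P(make the 2021 batch) + P(miss 2021 but make 2022 in time).
# All component probabilities are invented for illustration.

p_active_through_2021 = 0.65   # sticks with forecasting for over a year
p_qualify_given_active = 0.50  # outperforms the crowd enough to make the batch
p_2021_batch = p_active_through_2021 * p_qualify_given_active  # 0.325

p_2022_in_time = 0.035         # misses 2021 but still qualifies before the deadline

p_super_within_2_years = p_2021_batch + p_2022_in_time
print(round(p_super_within_2_years, 2))  # 0.36 with these assumed inputs
```

Spending more time on the estimate would mean refining each factor separately, as noted above, rather than adjusting the final number directly.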
I first tried to tease out whether there was a correlation in which months had more activity between 2020 and 2019. It seemed there was a weak negative correlation, so I figured my base rate should be just based on the past few months of data.
In addition to the past few months of data, I considered that part of the catalyst for record-setting July activity might be Aaron's "Why you should post on the EA Forum" EAGx talk. Due to this possibility, I gave August a 65% chance of exceeding the base rate of 105 posts with >=10 karma.
A relevant Metaculus question about whether the impact of the Effective Altruism movement will still be picked up by Google Trends in 2030 (specifically, whether it will have at least 0.2 times the total interest from 2017) has a community prediction of 70%.
The efforts by https://1daysooner.org/ to use human challenge trials to speed up vaccine development make me think about the potential of advocacy for "human challenge" type experiments in other domains where consequentialists might conclude there hasn't been enough "ethically questionable" randomized experimentation on humans. 2 examples come to mind:
My impression of the nutrition field is that it's very hard to get causal evidence because people won't change their diet at random for an experiment.
Why We Sleep has been a very influential book, but the sleep science research it draws upon is usually observational and/or relies on short time-spans. Alexey Guzey's critique and self-experiment have both cast doubt on its conclusions to some extent.
Getting 1,000 people to sign up and randomly contracting 500 of them to do X for a year, where X is something like being vegan or sleeping for 6.5 hours per day, could be valuable.
I think we have good reason to believe veg*ns will underestimate the cost of not-eating-meat for others due to selection effects: people for whom it's easier are more likely to both go veg*n and stick with it. Veg*ns generally underestimating the cost and non-veg*ns generally overestimating the cost can both be true.
The cost has been low for me, but the cost varies significantly based on factors such as culture, age, and food preferences. I think that in the vast majority of cases the benefits will still outweigh the costs and most would agree with a non-speciesist lens, but I fear down-playing the costs too much will discourage people who try to go veg*n and do find it costly. Luckily, this is becoming less of an issue as plant-based substitutes are becoming more widely available.
If I was donating 90% every year, I think my probability of giving up permanently would be even higher than 50% each year. If I had zero time and money left to enjoy myself, my future self would almost certainly get demotivated and give up on this whole thing. Maybe I’d come back and donate a bit less but, for simplicity, let’s just assume that if Agape gives up, she stays given up.
The assumption that if she gives up, she is most likely to give up on donating completely seems not obvious to me. I would think it's more likely she scales back to a lower level, which would change the conclusion. It would be helpful to have data to determine which of these intuitions is correct.
Perhaps we should be encouraging a strategy where people increase their percentage donated by a few percentage points per year until they find the highest sustainable level for them. Combined with a community norm of acceptance for reductions in amounts donated, people could determine their highest sustainable donation level while lowering the risk of stopping donations entirely.
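The trade-off here can be sketched with a simple expected-value calculation (all numbers are invented for illustration; "quitting" is treated as permanent, as in the quoted scenario):

```python
# Hypothetical sketch: expected cumulative donations under an aggressive
# donation rate with high annual quit risk vs. a sustainable rate with
# low quit risk. Quitting is permanent, matching the scenario above.

def expected_total_donated(rate: float, p_quit_per_year: float, years: int) -> float:
    """Expected cumulative donations, in multiples of one year's income."""
    total, p_still_donating = 0.0, 1.0
    for _ in range(years):
        total += p_still_donating * rate
        p_still_donating *= 1 - p_quit_per_year
    return total

# Donating 90%/yr with a 50% chance of quitting for good each year...
aggressive = expected_total_donated(0.90, 0.50, 30)
# ...vs. a sustainable 30%/yr with a 5% annual quit risk.
sustainable = expected_total_donated(0.30, 0.05, 30)

print(round(aggressive, 2), round(sustainable, 2))
```

With these assumed inputs the sustainable strategy donates several times more in expectation, which is the intuition behind ramping up gradually to find one's highest sustainable level.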