A bottom-up approach for improving public decision making 2021-12-01T09:14:11.662Z
Improving the Public Management of Global Catastrophic Risks in Spain 2021-12-01T09:13:55.482Z
Takeaways from our interviews of Spanish Civil Protection servants 2021-11-24T09:12:52.711Z
Can we influence the values of our descendants? 2021-11-16T10:36:17.162Z
Persistence - A critical review [ABRIDGED] 2021-11-10T11:30:51.522Z
Jsevillamol's Shortform 2021-10-23T23:04:14.012Z
My current best guess on how to aggregate forecasts 2021-10-06T08:33:20.349Z
Announcing 2021-09-14T15:42:55.087Z
When pooling forecasts, use the geometric mean of odds 2021-09-03T09:58:19.282Z
My first PhD year 2021-08-31T11:31:49.939Z
[Link post] Parameter counts in Machine Learning 2021-07-01T15:44:18.410Z
Everyday longtermism in practice 2021-04-06T14:42:14.117Z
Quantum computing timelines 2020-09-15T14:15:29.399Z
Assessing the impact of quantum cryptanalysis 2020-07-22T11:26:21.286Z
My experience as a CLR grantee and visiting researcher at CSER 2020-04-29T19:03:42.434Z
Modelling Vantage Points 2020-01-01T16:50:11.108Z
Quantum Computing : A preliminary research analysis report 2019-11-05T14:25:41.628Z
My experience on a summer research programme 2019-09-22T09:54:39.044Z
Implications of Quantum Computing for Artificial Intelligence alignment research (ABRIDGED) 2019-09-05T14:56:29.449Z
A summary of Nicholas Beckstead’s writing on Bayesian Ethics 2019-09-04T09:44:24.260Z
How to generate research proposals 2019-08-01T16:38:53.790Z


Comment by Jsevillamol on Takeaways from our interviews of Spanish Civil Protection servants · 2021-11-29T14:23:51.780Z · EA · GW

(disclaimer: this is my opinion)

In short: Spanish civil protection would not, as of today, consider making plans to address specific GCRs.

There is this weird tension where they believe that resilience is very important, and that planning in advance is nearly useless for non-recurring risks.

The civil protection system is very geared towards response. Foresight, mitigation and prevention seldom happen. This means they are quite keen on improving their general response capacity but they have no patience for hypotheticals. So they would not consider specific GCRs.

Even if they wanted to address GCRs, their hands are relatively tied (at least at the national level) - the risks they do specific preparation for are encoded in the law and modifying the list of priority risks would require passing an amendment.

In their opinion, some things like geomagnetic storms, which could theoretically unleash a global catastrophe, are to be addressed by the generalist response plans. And at least one high-ranking person thinks a specific plan for responding to solar storms and similar risks could be created, but not without a coordinated technical and policy response at the European level.

On the other hand, we have seen some autonomies that have enacted their own special civil protection plans independently, but for minor risks (e.g. coastal environmental protection). And for example, Madrid's city hall wants to have better maps of which expertise is needed and where to find it for conceivable future emergencies.

Also bear in mind that while civil protection is a very important part of risk management in Spain, it is not the only part. The national security system and other organizations might have different attitudes towards GCRs.

Comment by Jsevillamol on Takeaways from our interviews of Spanish Civil Protection servants · 2021-11-29T10:18:52.477Z · EA · GW

I have received a private comment asking about the role of civil protection during COVID.

This is what I answered:

About COVID, we asked everyone about their role on it. 

Basically the picture I got is: public health is delegated to the ministry of health, and in particular pandemics are seen as the business of the Centre of Sanitary Alerts (CCAES). The CCAES does have an early warning system, but we could not talk to them and we don't know how the systems reacted to COVID.

On the civil protection side, they basically did nothing until the government declared a state of alarm. It is not clear whether the activation was urged by the CCAES or by civil protection, or whether it came from the top. After the state of alarm was declared, civil protection organized decentralised committees at the state, autonomy and municipal levels to monitor the situation and discuss next steps. These committees incorporated public health experts, civil protection servants and other public and private figures with relevant expertise.

The military "forces" under command of civil protection (the UME) were mobilized, but they lacked expertise on dealing with pandemics and basically were ordered to help disinfect hospitals and other public places, even after it became obvious that the pandemic was airborne and that disinfecting helped little.

In general I think that the civil protection response was good and prompt once they were ordered to act. In particular, I think the decentralised response involving several kinds of expertise was as good as it could have been given their lack of preparation.

But they weren't able to anticipate and react to early signs, and their own forces were unprepared for a pandemic. It would also have helped to have mapped out in advance who the relevant experts to consult were at every level of the system.

Comment by Jsevillamol on Announcing my retirement · 2021-11-25T15:24:45.636Z · EA · GW

Echoing everyone else, thank you for all your hard work.

I do not exaggerate when I say you are the best forum moderator I have ever seen. I am really impressed with your availability, creativity and kindness. You have driven the culture of this website to a whole new level, and inspired me and I bet many others to write better content.

Good luck at OpenPhil!

Comment by Jsevillamol on Don’t wait – there’s plenty more need and opportunity today · 2021-11-24T15:37:36.266Z · EA · GW

I think that this kind of criticism is really useful, and I am glad it was written. 

That being said, there is something about this post that really rubbed me the wrong way. This is a shame, because the topic is very pertinent and deserves an in-depth discussion - how do opportunities for funding today compare to opportunities for funding 5 or 10 years from now? 

Let me try to give my best shot at a more thoughtful critique.

> In the midst of a global pandemic that pushed 150 million people into extreme poverty while billionaires’ wealth grew by $5.5 trillion, it’s odd timing.

This is pretty much an unrelated remark. While I agree that this reflects inequality and a sad arrangement of the world, it has very little to do with whether GiveWell should or not be patient.

I am pointing this out because this is one of the things that made me feel icky about the post, even though it is a great post otherwise.

> The team at GiveWell know all these numbers of course, and they would likely agree that influencing more money is better. But their decision here tries to optimize 0.1% of U.S. charitable giving in isolation from the other 99.9%; when, in reality, growing that 99.9% and allocating it better will mean a lot more for our world than asking those donors to hold out for possible silver bullets down the road.

The argument here is that focusing on fundraising is probably a better use of GiveWell's resources than  being patient with their money. But this assumes that fundraising trades off against patience, which I believe is false. GiveWell can focus on fundraising AND be patient.

I assume that the crux here is that GiveDirectly believes that spending more money now would have a good publicity effect, that would promote philanthropy and raise the total amount of donations overall.
I would change my mind if this was the case, but I don't see this as obvious.

> Since 2018, we have asked GiveWell to fully engage with this study and others, but they have opted not to, citing capacity constraints. Until they do, they may be underestimating the effects of cash transfers by 2.6x and overstating the benefits of waiting by the same amount.

This is the strongest argument in this document. If GiveWell has not taken these second-hand effects into account, and they would increase the effectiveness by 2x, that might push cash transfers over the effectiveness threshold that would justify giving more now.

On the other hand, I am unsure of how much this affects the math on whether to expect better giving opportunities in the future. In fact, it could counterintuitively mean that good spending opportunities are easier to find than we expected, which could incentivize waiting.


> We’re skeptical about GiveWell’s hopes to find much more without a more fundamental re-evaluation of the scale of the sector, the opportunities available, and their role within it. We hope we’re wrong about that, but if we’re not it’s another reason to move faster now, rather than hold out for better opportunities down the road.

I agree with your position - if I believed that GiveWell has run out of low hanging fruit then this definitely is a good argument in favour of spending more now instead of waiting.

On the other hand, GiveWell is incredibly young. And they seem to be capacity constrained enough that they cannot afford to look into very relevant studies on cost-effectiveness like the one you mention in the post. I would be quite surprised if there weren't better funding opportunities than the ones already identified.

So it seems prudent to expect that better opportunities will be identified later.


> However, people living in poverty have good reasons to answer this table’s impossible questions in a variety of different ways. They may also have good reasons to differ with GiveWell on what else matters, let alone which interventions will do the most for the things that matter most to them. Many of these potential differences of opinion are highly relevant to whether or not GiveWell should save funds and give further down the road.

I absolutely agree with this. And I commend GiveDirectly on its efforts to incorporate the voice of their recipients in their processes - I think it's exemplary and something to be imitated.

I'd love to see GiveWell expand their work on this area, and to hire a more diverse team to help integrate many valuable perspectives. 


I know very little about the chances of finding better funding opportunities in the future, and about GiveWell and GiveDirectly. So please take all of this with a grain of salt.

My TL;DR of the post is that GiveWell might be underestimating direct giving and needs to do more work on listening to charity beneficiaries to learn how to help them best. It was also argued that GiveWell is unlikely to find better funding opportunities in the future, but I didn't find those arguments particularly convincing.

Thank you to GiveDirectly for taking the time to engage with the community. Highlighting these problems is definitely something I would like to happen more often, so we stand a better chance at fixing them.

Comment by Jsevillamol on Can we influence the values of our descendants? · 2021-11-22T22:22:19.980Z · EA · GW

I do think so!

It's hard to contest that change across many dimensions has been accelerating.

And it would make sense that this accelerating change makes parental advice less applicable, and thus parents less influential overall. 

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-11-22T12:05:26.446Z · EA · GW

> The A,B,C example you came up with is certainly a strike against average log odds and in favor of average probs.

I have thought more about this. I now believe that this invariance property is not reasonable - aggregating outcomes is (surprisingly) not a natural operation in Bayesian reasoning. So I do not think this is a strike against log-odds pooling.

Comment by Jsevillamol on We need alternatives to Intro EA Fellowships · 2021-11-21T19:34:26.031Z · EA · GW

> Why did you end up being turned off by EA?


It's hard to pinpoint but I think it's something along the lines of a) the messaging didn't match my perceived self-image ("I am not an altruist"), b) they seemed weirdly fanatical ("donating 10% of my money seems crazy weird") and c) I was not impressed with the people I interacted with (concretely, the people from e.g. the rationality community seemed comparatively more thoughtful and to be working on cooler things).

I am unsure of whether I would have changed my mind had I interacted more with the community at that time - I think the quality of discussion has improved a lot since then.

> How did it end up being crucial for your long-term engagement?

After the MIRI Summer Fellows I started organizing a community in Spain (primarily about rationality, though some of the other people involved were self-identifying Effective Altruists and we also organized events about that). I also participated in several more Rationality and Effective Altruism events.

I kept talking to Effective Altruists regularly, and eventually became convinced that they were working on cool things and that it was a community I wanted to be a part of.

Comment by Jsevillamol on We need alternatives to Intro EA Fellowships · 2021-11-20T11:51:48.991Z · EA · GW

Retreats are awesome!

It was the MIRI Summer Fellows in 2015. For full disclosure, it was not about EA, and I came out of it turned off by EA aesthetics. But it was where I first heard about the movement, and it was crucial for my involvement in the long term.

Comment by Jsevillamol on We need alternatives to Intro EA Fellowships · 2021-11-19T14:58:09.499Z · EA · GW

> One hour (maybe two) fellowship sessions isn’t long enough to get into “late night life-changing conversations” mode, which is important for big changes.


This to me is the main downside. 

I got introduced to EA over a 3-week in-person summer program, and my experience is that 2-4 week in-person intensive programs have a good track record of getting people excited and engaged. Off the top of my head, 1 out of 3 participants in the camps I've been involved in became counterfactually engaged, 1 out of 3 was engaged but would have been anyway, and 1 out of 3 bounced off and didn't stay engaged.

Late night conversations and a great vibe were a big part of why I stayed engaged, and this matches my intuitions about what works best to help people grow and connect.

I would be interested in having more data about retention 6 months after the Intro EA Fellowships, both for the whole group and for the subgroups that are considered "more promising".

Comment by Jsevillamol on Persistence - A critical review [ABRIDGED] · 2021-11-18T11:52:12.355Z · EA · GW

Thank you!


These papers were ones that William MacAskill was considering citing in his forthcoming book. FF hired me to thoroughly check them.

There are definitely many other persistence papers I didn't cover!


  • Acemoglu et al, Colonial Origins
  • Acemoglu et al, Reversal of Fortune
  • Woodberry (2012). The Missionary Roots of Liberal Democracy
  • All the papers cited in Kelly's Understanding Persistence

And many others.

Comment by Jsevillamol on Persistence - A critical review [ABRIDGED] · 2021-11-16T18:36:40.381Z · EA · GW

EDIT: Faatima Osman questioned whether it was fair to exclude respondents from Benin, Ghana and Nigeria in Nunn and Wantchekon's paper, given that Nigeria is the most populous country in Africa by far.

And in hindsight I think she is totally right - respondents from these countries are ~25% of the sample! I now believe that it's unfair to call these respondents outliers. Correspondingly, my trust in Nunn and Wantchekon's paper has gone up, since Kelly's critique was my main concern about it.

Comment by Jsevillamol on Can we influence the values of our descendants? · 2021-11-16T18:16:22.696Z · EA · GW

EDIT: Faatima Osman questioned whether it was fair to exclude respondents from Benin, Ghana and Nigeria in Nunn and Wantchekon's paper, given that Nigeria is the most populous country in Africa by far.

And in hindsight I think she is totally right - respondents from these countries are ~25% of the sample! I now believe that it's unfair to call these respondents outliers. Correspondingly, my trust in Nunn and Wantchekon's paper has gone up, since Kelly's critique was my main concern about it.

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-11-13T10:37:47.261Z · EA · GW

Thank you for your thoughtful reply. I think you raise interesting points, which move my confidence in my conclusions down.

Here are some comments:

> [...] averaging log odds will always give more extreme pooled probabilities than averaging probabilities does

As in your post, averaging the probs effectively erases the information from extreme individual probabilities, so I think you will agree that averaging log odds is not merely a more extreme version of averaging probs.

I nonetheless think this is a very important issue - the difficulty of separating the extremizing effect of log odds from its actual effect.

> I don't expect optimally-extremized average log odds to outperform optimally-extremized average probabilities

This is an empirical question that we can settle empirically. Using Simon_M's script, I computed the Brier and log scores of the extremized mean of probabilities and the extremized mean of log odds on binary Metaculus questions, for extremizing factors between 1 and 3 in steps of 0.05.

In this setting, the top performing metric in terms of log loss is the "optimally" extremized average log odds, which surpasses the "optimally" extremized mean of probs.

Note that the Brier scores are identical, which is consistent with the average log odds outperforming the average probs only when extreme forecasts are involved.

Also notice that the optimal extremizing factor for the average of log odds is lower than for the average of probabilities - this relates to your observation that the average log odds are already relatively extremized compared to the mean of probs.

There are reasons to question the validity of this experiment - we are effectively overfitting the extremizing factor to whatever gives the best results. And of course this is just one experiment.  But I find it suggestive.
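A minimal sketch of this kind of sweep, for readers who want to try the idea themselves. This is not Simon_M's script: the toy questions, the helper names and the choice to extremize by scaling the pooled log odds by a factor `d` are all assumptions made for illustration.

```python
import math

EPS = 1e-9  # keeps probabilities away from 0 and 1 so logs stay finite


def clip(p):
    return min(max(p, EPS), 1 - EPS)


def logit(p):
    p = clip(p)
    return math.log(p / (1 - p))


def extremize(p, d):
    """Scale the log odds of p by d (d = 1 leaves p unchanged)."""
    return 1 / (1 + math.exp(-d * logit(p)))


def mean_probs(forecasts):
    return sum(forecasts) / len(forecasts)


def mean_log_odds(forecasts):
    avg = sum(logit(p) for p in forecasts) / len(forecasts)
    return 1 / (1 + math.exp(-avg))


def brier_loss(p, outcome):
    return (p - outcome) ** 2


def log_loss(p, outcome):
    p = clip(p)
    return -math.log(p if outcome else 1 - p)


def mean_loss(questions, pool, d, loss):
    return sum(loss(extremize(pool(f), d), o) for f, o in questions) / len(questions)


def best_factor(questions, pool, loss, factors):
    """Return the factor with the lowest mean loss (fitted in-sample)."""
    return min(factors, key=lambda d: mean_loss(questions, pool, d, loss))


# Extremizing factors from 1 to 3 in steps of 0.05.
FACTORS = [round(1 + 0.05 * k, 2) for k in range(41)]

# Made-up (forecasts, outcome) pairs; replace with real resolved questions.
TOY_QUESTIONS = [
    ([0.60, 0.70, 0.90], 1),
    ([0.20, 0.30, 0.05], 0),
    ([0.55, 0.80, 0.99], 1),
    ([0.40, 0.60, 0.70], 0),
]

for name, pool in [("mean of probs", mean_probs), ("mean of log odds", mean_log_odds)]:
    d = best_factor(TOY_QUESTIONS, pool, log_loss, FACTORS)
    print(name, d, mean_loss(TOY_QUESTIONS, pool, d, log_loss))
```

Swapping `log_loss` for `brier_loss` gives the Brier comparison. Because `best_factor` is fitted on the same questions it is scored on, the overfitting caveat applies to this sketch as well.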


> External Bayesianity seems like an actively undesirable property for probability pooling methods that treat experts symmetrically. When new evidence comes in, this should change how credible each expert is if different experts assigned different probabilities to that evidence.

I am not sure I follow your argument here.

I do agree that when new evidence comes in about the experts we should change how we weight them. But when we are pooling the probabilities we aren't receiving any extra evidence about the experts (?).


> I talked about the argument that averaging probabilities ignores extreme predictions in my post, but the way you stated it, you added the extra twist that the expert giving more extreme predictions is known to be more knowledgeable than the expert giving less extreme predictions. If you know one expert is more knowledgeable, then of course you should not treat them symmetrically.

I agree that the way I presented it I framed the extreme expert as more knowledgeable. I did this for illustrative purposes. But I believe the setting works just as well when we take both experts to be equally knowledgeable / calibrated.  Throwing away the information from the extreme prediction seems bad.


> Probabilities must add to 1.

I like invariance arguments - I think they can be quite illuminating. In fact I am quite puzzled by the fact that neither the average of probabilities nor the average of log odds seems to satisfy the basic invariance property of respecting annualized probabilities.

The A,B,C example you came up with is certainly a strike against average log odds and in favor of average probs.

It reminds me of Toby Ord's example with the Jack, Queen and King. I think dependency structures between events make the average log odds fail.

My personal takeaway here is that when you are aggregating probabilities derived from mutually exclusive conditions, then the average probability is the right way to go. But otherwise stick with log-odds.
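To illustrate why mutually exclusive outcomes favour averaging probabilities, here is a small sketch. The two experts and their numbers are hypothetical and are not the A,B,C example from the post being replied to; the function names are likewise just illustrative.

```python
import math


def mean_probs(values):
    return sum(values) / len(values)


def geo_mean_odds(values):
    """Pool probabilities via the geometric mean of their odds."""
    log_odds = [math.log(p / (1 - p)) for p in values]
    pooled = math.exp(sum(log_odds) / len(log_odds))
    return pooled / (1 + pooled)


# Two hypothetical experts, each giving a distribution over A, B and C.
experts = [
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.1, "B": 0.2, "C": 0.7},
]

pooled_mean = {k: mean_probs([e[k] for e in experts]) for k in "ABC"}
pooled_odds = {k: geo_mean_odds([e[k] for e in experts]) for k in "ABC"}

print(sum(pooled_mean.values()))  # sums to 1 (up to rounding)
print(sum(pooled_odds.values()))  # falls short of 1 for these inputs
```

Pooling each outcome separately with the geometric mean of odds does not preserve the constraint that the three probabilities add to 1, whereas the arithmetic mean does.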


> [...] I maintain that, if you want a quick and dirty heuristic, averaging probabilities is a better quick and dirty heuristic than anything as senseless as averaging log odds.

I notice this is very surprising to me, because averaging log odds is anything but senseless. 

This is a far lower confidence argument than the other points I raise here, but I think there is an aesthetic argument for averaging log odds - log odds make Bayes rule additive, and I expect means to work well when the underlying objects are additive (more about this from Owen CB here).

There is also the argument that average log odds are what you get when you try to optimize the minimum log loss in a certain situation - see Eric Neyman's comment here.

Again, these arguments appeal mostly to aesthetic considerations. But I think it is unfair to call them senseless - they arise naturally in some circumstances.
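For reference, the additivity mentioned above is the standard odds form of Bayes' rule, written here for a hypothesis $H$ and evidence $E$ (symbols chosen for this note, not taken from the linked comments):

```latex
\log\frac{P(H \mid E)}{P(\neg H \mid E)}
  = \log\frac{P(H)}{P(\neg H)} + \log\frac{P(E \mid H)}{P(E \mid \neg H)}
```

Each independent piece of evidence adds its log likelihood ratio to the running log odds, which is the sense in which the underlying objects are additive.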


> if the worst odds you'd be willing to bet on are bounds on how seriously you take the hypothesis that someone else knows something that should make you update a particular amount, and you want to get an actual probability, then you should average over probabilities you perhaps should end up at, weighted by how likely it is that you should end up at them. This is an arithmetic mean of probabilities, not a geometric mean of odds.

To be honest, I do not fully follow the reasoning here.

My gut feeling is this argument relies on an adversarial setting where you might get exploited. And this probably means that you should come up with a probability range for the additional evidence your opponent might have. 

So if you think their evidence is uniformly distributed between -1 and 1 bits, you should combine that with your evidence by adding that evidence to your logarithmic odds. This gives you a probability distribution over the possible values. Then use that spread to decide which bet odds are worth the risk of exploitation.
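A rough sketch of that calculation, under the assumption that "bits" means base-2 log odds and using a hypothetical starting credence of 0.5. The function names and the midpoint grid are illustrative choices.

```python
import math


def update(p, bits):
    """Posterior after adding `bits` of evidence to the log2 odds of p."""
    odds = (p / (1 - p)) * 2 ** bits
    return odds / (1 + odds)


def posterior_spread(p, lo=-1.0, hi=1.0, steps=1000):
    """Posteriors for evidence spread uniformly over [lo, hi] bits (midpoint grid)."""
    width = (hi - lo) / steps
    return [update(p, lo + (i + 0.5) * width) for i in range(steps)]


prior = 0.5  # hypothetical starting credence
spread = posterior_spread(prior)
print(min(spread), max(spread))   # approaches 1/3 and 2/3 as steps grows
print(sum(spread) / len(spread))  # mean posterior over the spread
```

The list returned by `posterior_spread` is the distribution over possible posteriors; which betting odds to accept given that spread is a separate decision that this sketch does not model.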

I do not understand how this is about pooling different expert probabilities. But I might be misunderstanding your point.

Thank you again for writing the post and your comments. I think this is an important and fascinating issue, and I'm glad to see more discussion around it!

Comment by Jsevillamol on Jsevillamol's Shortform · 2021-10-23T23:04:14.222Z · EA · GW

On getting research collaborators

(adapted from a private conversation)

The 80/20 advice I would give is: be proactive in reaching out to other people and suggesting to them to work for an evening on a small project, like writing a post. Afterwards you both can decide if you are excited enough to work together on something bigger, like a paper.

For more in depth advice, here are some ways I've started collaborations in the past:

  • Deconfusion sessions
    I often invite other researchers for short sessions of 1-2 hours to focus on a topic, with the goal of coding together a barebones prototype or a sketch of a paper.

    For example, I engaged in conversation with Pablo Moreno about Quantum Computing and AI Alignment. We found we disagreed, so I invited him to spend one hour discussing the topic more in depth. During the conversation we wrote down the key points of disagreement, and we resolved to expand them into an article.
  • Advertise through intermediate outputs
    I found it useful for many reasons to split big research projects into post-size bits. One of those reasons is to let other people know what I am working on, and that I am interested in collaborating.

    For example, for the project on studying macroscopic trends in Machine Learning, we resolved to first write a short article about parameter counts. I then advertised the post asking for potential collaborators to reach out.
  • Interview people on their interests
    Asking people what motivates them and what they want to work on can segue into an opportunity to say "actually, I am also interested in X, do you want to work together on it?". I think this requires some finesse, but it is a skill that can be practiced.

    For example, I had an in depth conversation with Laura González about her interests and what kinds of things she wanted to work on. It came up that she was interested in game design, so I prodded her on whether she would be interested in helping me refine a board game prototype I had previously shown her. This started our collaboration.
  • Join communities of practice.
    I found it quite useful to participate in small communities of people working towards similar goals. 

    For example, my supervisor helped me join a Slack group for people working on AI Explainability. I reached out to the people for one-on-one conversations, and suggested working together to a few. Miruna Clinciu accepted - and now we are building a small research project.

Comment by Jsevillamol on EA Forum Prize: Winners for May-July 2021 · 2021-10-23T16:10:51.345Z · EA · GW

Thank you! I am quite honoured. And congratulations to the other winners!

Comment by Jsevillamol on New Data Visualisations of the EA Forum · 2021-10-20T09:30:53.150Z · EA · GW

I absolutely love the work you have done, thank you so much!

Comment by Jsevillamol on Why aren't you freaking out about OpenAI? At what point would you start? · 2021-10-10T18:43:56.417Z · EA · GW

He is listed on the website:

> OpenAI is governed by the board of OpenAI Nonprofit, which consists of OpenAI LP employees Greg Brockman (Chairman & CTO), Ilya Sutskever (Chief Scientist), and Sam Altman (CEO), and non-employees Adam D’Angelo, Holden Karnofsky, Reid Hoffman, Shivon Zilis, Tasha McCauley, and Will Hurd.

It might not be up to date though

Comment by Jsevillamol on Why aren't you freaking out about OpenAI? At what point would you start? · 2021-10-10T18:42:14.740Z · EA · GW

Note that Eliezer Yudkowsky's argument in the opening link is that OpenAI's damage was done by fragmenting the AI Safety community at its launch.

This damage is done - and I am not sure it bears much relation to what OpenAI is trying to do going forward.

(I am not sure I agree with Eliezer on this one, but I lack details to tell if OpenAI's launch really was net negative)

Comment by Jsevillamol on My current best guess on how to aggregate forecasts · 2021-10-10T08:45:18.323Z · EA · GW

I found some relevant discussion in the EA Forum about extremizing in footnote 5 of this post.

> The aggregation algorithm was elitist, meaning that it weighted more heavily forecasters with good track-records who had updated their forecasts more often. In these slides, Tetlock describes the elitism differently: He says it gives weight to higher-IQ, more open-minded forecasters. The extremizing step pushes the aggregated judgment closer to 1 or 0, to make it more confident. The degree to which they extremize depends on how diverse and sophisticated the pool of forecasters is. The academic papers on this topic can be found here and here. Whether extremizing is a good idea is controversial; according to one expert I interviewed, more recent data suggests that the successes of the extremizing algorithm during the forecasting tournament were a fluke. After all, a priori one would expect extremizing to lead to small improvements in accuracy most of the time, but big losses in accuracy some of the time.

The post in general is quite good, and I recommend it.

Comment by Jsevillamol on [Creative Writing Contest] [Fiction] The Fey Deal · 2021-10-08T07:05:49.567Z · EA · GW

I liked this one a lot!

It was very easy to read and pulled me in. I felt compelled by the protagonist's inner turmoil, and by how he makes his decision. The writing was clear and it flowed very well. This is something I will send to some friends to introduce them to effective altruism.

The only part I didn't like was the ending. I like the intention of linking to GiveWell's page but it pulled me totally out of the fantasy. Also the friend felt a bit 2D. But these are minor quibbles. 

Thank you for writing this!

Comment by Jsevillamol on My current best guess on how to aggregate forecasts · 2021-10-07T14:56:05.709Z · EA · GW

> Thanks for this post - I think this was a very useful conversation to have started (at least for my own work!), even if I'm less confident than you in some of these conclusions

Thank you for your kind words! To dismiss any impression of confidence, this represents my best guesses. I am also quite confused.

> I've heard other people give good-sounding arguments for other conclusions

I'd be really curious if you can dig these up!

> You later imply that you think [the geo mean of probs outperforming the geo mean of odds] is at least partly because of a specific bias among Metaculus forecasts. But I'm not sure if you think it's fully because of that or whether that's the right explanation

I am confident that the geometric mean of probs outperformed the geo mean of odds because of this bias. If you change the coding of all binary questions so that True becomes False and vice versa, then you are going to get worse performance than the geo mean of odds.

This is because the geometric mean of probabilities does not map predictions and their complements consistently. As a basic example, suppose that we have two forecasts $p_1 \neq p_2$. Then $\sqrt{p_1 p_2} + \sqrt{(1-p_1)(1-p_2)} < 1$.

So the geometric mean of probabilities is not a consistent probability in this sense: it doesn't map the complement of the probabilities to the complement of the geometric mean, as we would expect (the geometric mean of odds, the mean of probabilities and the median all satisfy this basic property).
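A quick numerical check of this property. The two forecasts are illustrative values rather than ones taken from the original comment, and `complement_gap` is a helper name invented for this sketch.

```python
import math
from statistics import mean, median


def geo_mean_probs(ps):
    return math.exp(sum(math.log(p) for p in ps) / len(ps))


def geo_mean_odds(ps):
    pooled = math.exp(sum(math.log(p / (1 - p)) for p in ps) / len(ps))
    return pooled / (1 + pooled)


def complement_gap(pool, ps):
    """pool(ps) + pool(complements) - 1; zero when the pool respects complements."""
    return pool(ps) + pool([1 - p for p in ps]) - 1


forecasts = [0.1, 0.2]  # illustrative values
for pool in (geo_mean_probs, geo_mean_odds, mean, median):
    print(pool.__name__, complement_gap(pool, forecasts))
```

Only the geometric mean of probabilities shows a gap that is clearly different from zero; for the other three aggregates the gap vanishes up to floating point error.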

So I would recommend viewing the geometric mean of probabilities as a hack to adjust the geometric mean of odds down. This is also why I think better adjustments likely exist, since this isn't a particularly well motivated adjustment. It does however seem to slightly improve Metaculus predictions, so I included it in the flowchart.

To drill this point even more, here is what we would get if we aggregated the predictions in the last 860 resolved Metaculus binary questions by mapping each prediction to its complement, taking the geo mean of probs and taking the complement again:

The complement of the geometric mean of complement probabilities is called comp_geo_mean

As you can see, this change (that would not affect the other aggregates) significantly weakens the geo mean of probs.
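For concreteness, here is a sketch of how `comp_geo_mean` can be computed next to the other two aggregates. The forecasts are made up and the function layout is an assumption, not the code used for the Metaculus analysis.

```python
import math


def geo_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def geo_mean_probs(ps):
    return geo_mean(ps)


def comp_geo_mean(ps):
    """Complement of the geometric mean of the complement probabilities."""
    return 1 - geo_mean([1 - p for p in ps])


def geo_mean_odds(ps):
    # Equivalent to pooling odds geometrically and converting back to a probability.
    g, g_comp = geo_mean(ps), geo_mean([1 - p for p in ps])
    return g / (g + g_comp)


forecasts = [0.6, 0.8, 0.9]  # made-up forecasts for a single question
print(geo_mean_probs(forecasts), geo_mean_odds(forecasts), comp_geo_mean(forecasts))
```

Since the geometric means of the forecasts and of their complements never add up to more than 1, the geo mean of probs sits at or below the geo mean of odds, and `comp_geo_mean` sits at or above it. Recoding True as False therefore flips the direction of the adjustment.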

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-10-06T20:05:23.626Z · EA · GW

I get what you are saying, and I also harbor doubts about whether extremization is just pure hindsight bias or if there is something else to it.

Overall I still think it's probably justified in cases like Metaculus to extremize based on the extremization factor that would optimize the last 100 resolved questions, and I would expect the extremized geo mean with such a factor to outperform the unextremized geo mean in the next 100 binary questions to resolve (if pressed to put a number on it maybe ~70% confidence without thinking too much).

My reasoning here is something like:

  • There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Though on the other hand empirical studies have been sparse, and eg Satopaa et al are cheating by choosing the extremization factor with the benefit of hindsight.
  • In this case I didn't try too hard to find an extremization factor that would work, just two attempts. I didn't need to mine for a factor that would work. But obviously we cannot generalize from just one example.
  • Extremizing has an intuitive meaning as accounting for the different pieces of information across experts that gives it weight (pun not intended). On the other hand, every extra parameter in the aggregation is a chance to shoot ourselves in the foot.
  • Intuitively it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence in recent questions should be a good indicator of its confidence for the next few questions.

So overall I am not super convinced, and a big part of my argument is an appeal to authority. 

Also, it seems to be the case that extremization by 1.5 also works when looking at the last 330 questions.


I'd be curious about your thoughts here. Do you think that a 1.5-extremized geo mean will outperform the unextremized geo mean in the next 100 questions? What if we choose a finetuned extremization factor that would optimize the last 100?

Comment by Jsevillamol on My current best guess on how to aggregate forecasts · 2021-10-06T16:23:22.057Z · EA · GW

Hmm good question.

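For a quick foray into this we can see what would happen if we use as our estimate the mean of the maximum likelihood beta distribution implied by the sample of forecasts $p_1, \dots, p_N$.

The log-likelihood to maximize is then

$$\ln \mathcal{L}(\alpha, \beta \mid p_1, \dots, p_N) = (\alpha - 1) \sum_{i=1}^{N} \ln p_i + (\beta - 1) \sum_{i=1}^{N} \ln (1 - p_i) - N \ln B(\alpha, \beta)$$

The Wikipedia article on the Beta distribution discusses this maximization problem in depth, pointing out that, although no closed form exists, if $\alpha$ and $\beta$ can be assumed to be not too small the max likelihood estimate can be approximated as $\hat\alpha \approx \frac{1}{2} + \frac{\hat G_X}{2 (1 - \hat G_X - \hat G_{1-X})}$ and $\hat\beta \approx \frac{1}{2} + \frac{\hat G_{1-X}}{2 (1 - \hat G_X - \hat G_{1-X})}$, where $\hat G_X = \prod_{i} p_i^{1/N}$ and $\hat G_{1-X} = \prod_{i} (1 - p_i)^{1/N}$.

The mean of a beta with these max likelihood parameters is $\frac{\hat\alpha}{\hat\alpha + \hat\beta} = \frac{1 - \hat G_{1-X}}{2 - \hat G_X - \hat G_{1-X}}$.

By comparison, the geometric mean of odds estimate is:

$$\frac{\hat G_X}{\hat G_X + \hat G_{1-X}}$$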

Here are two examples of how the two methods compare aggregating five forecasts


I originally did this to convince myself that the two aggregates were different. And they seem to be! The method seems to be close to the arithmetic mean in this example. Let's see what happens when we extremize one of the predictions:


We have made p3 one hundred times smaller. The geometric mean is suitably affected. The maximum likelihood beta mean stays close to the arithmetic mean, unperturbed.

This makes me a bit less excited about this method, but I would be excited about people poking around with this method and related ones!
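For anyone who wants to poke around, here is a rough sketch of the comparison. It assumes the approximation from the Wikipedia article is alpha ≈ 1/2 + G / (2 (1 - G - H)) and beta ≈ 1/2 + H / (2 (1 - G - H)), with G and H the geometric means of the forecasts and of their complements. The five forecasts are made up, not the ones from the tables above, and the approximation is only meant for shape parameters that are not too small, so treat the perturbed case with extra caution.

```python
import math

def geo_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def beta_ml_mean_approx(ps):
    """Mean of the beta whose parameters follow the closed-form approximation
    to the maximum likelihood estimate (assumed form, see the text above).
    Fails with a division by zero if all forecasts are identical."""
    g = geo_mean(ps)
    h = geo_mean([1 - p for p in ps])
    denom = 2 * (1 - g - h)
    alpha = 0.5 + g / denom
    beta = 0.5 + h / denom
    return alpha / (alpha + beta)

def geo_mean_odds(ps):
    g = geo_mean(ps)
    h = geo_mean([1 - p for p in ps])
    return g / (g + h)

def arith_mean(ps):
    return sum(ps) / len(ps)

# Hypothetical forecasts; the second list makes the third one 100 times smaller.
original = [0.1, 0.2, 0.3, 0.4, 0.5]
perturbed = [0.1, 0.2, 0.003, 0.4, 0.5]

for label, ps in (("original", original), ("perturbed", perturbed)):
    print(label,
          "| arithmetic mean:", round(arith_mean(ps), 4),
          "| ML beta mean:", round(beta_ml_mean_approx(ps), 4),
          "| geo mean of odds:", round(geo_mean_odds(ps), 4))
```

In both cases the approximate beta mean lands much closer to the arithmetic mean than the geometric mean of odds does.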

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-10-06T08:48:00.206Z · EA · GW

I was curious about why the extremized geo mean of odds didn't seem to beat other methods. Eric Neyman suggested trying a smaller extremization factor, so I did that.

I tried an extremizing factor of 1.5, and reused your script to score the performance on recent binary questions. The result is that the extremized prediction comes out on top.


This has restored my faith in extremization. In hindsight, recommending a fixed extremization factor was silly, since the correct extremization factor is going to depend on the predictors being aggregated and the topics they are talking about.

Going forward I would recommend people who want to apply extremization to study what extremization factors would have made sense in past questions from the same community.
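A minimal sketch of what studying past extremization factors could look like. It assumes extremizing means raising the pooled odds to a power (equivalently, multiplying the mean log-odds by a factor), scores with the log loss, and picks the factor by a plain grid search. The `history` data, the candidate grid and the function names are all made up for illustration; this is not the script used for the Metaculus results.

```python
import math

def extremized_geo_mean_odds(ps, factor=1.0):
    """Geometric mean of odds with the pooled log-odds scaled by `factor`.
    factor=1.0 recovers the plain geometric mean of odds."""
    log_odds = [math.log(p / (1 - p)) for p in ps]
    pooled = factor * sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-pooled))

def log_loss(p, outcome):
    """Negative log-likelihood of a binary outcome (lower is better)."""
    return -math.log(p if outcome else 1 - p)

def best_factor(history, candidates):
    """Candidate factor with the lowest average log loss on past questions."""
    def avg_loss(factor):
        losses = [log_loss(extremized_geo_mean_odds(ps, factor), outcome)
                  for ps, outcome in history]
        return sum(losses) / len(losses)
    return min(candidates, key=avg_loss)

# Hypothetical resolved questions: (individual forecasts, did it resolve positively?)
history = [
    ([0.60, 0.70, 0.65], True),
    ([0.30, 0.20, 0.35], False),
    ([0.80, 0.70, 0.75], True),
    ([0.40, 0.45, 0.30], False),
    ([0.60, 0.55, 0.70], False),
]
candidates = [1 + 0.1 * k for k in range(21)]  # 1.0, 1.1, ..., 3.0

print("best factor on past questions:", best_factor(history, candidates))
```

The factor found this way is tuned with hindsight, so the honest check is to score it on questions that resolved after the ones used to fit it.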

I talk more about this in my new post.

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-10-05T10:41:28.271Z · EA · GW

I don't think I get your argument for why the approximation should not depend on the downstream task. Could you elaborate? 


Your best approximation of the summary distribution  is already "as good as it can get".  You think we should be cautious and treat this probability as if it could be higher for precautionary reasons? Then I argue that you should treat it as higher, regardless of how you arrived at the estimate.

In the end this circles back to basic Bayesian / Utility theory - in the idealized framework your credences about an event should be represented as a single probability. Departing from this idealization requires further justification.


a larger spread of forecasts does not seem to necessarily imply weaker evidence

You are right that "weaker evidence" is not exactly correct - this is more about the expected variance introduced by hypothetical additional predictions. I've realized I am confused about what is the best way to think about this in formal terms, so I wonder if my intuition was right after all.

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-10-05T09:05:28.549Z · EA · GW

I think this is a good account of the institutional failure example, thank you!

Comment by Jsevillamol on Honoring Petrov Day on the EA Forum: 2021 · 2021-09-26T18:56:40.862Z · EA · GW

I think it was an intentional false alarm, to better simulate Petrov's situation

Comment by Jsevillamol on EA Forum feature suggestion thread · 2021-09-25T14:57:10.413Z · EA · GW

Similarly, I would like for comments I have minimised to stay minimised between visits (unless there is a new reply in thread)

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-23T07:19:36.460Z · EA · GW

The ideal system would [not] aggregate first into a single number [...] Instead, the ideal system would use the whole distribution of estimates


I have been thinking a bit more about this.

And I have concluded that the ideal aggregation procedure should compress all the information into a single prediction - our best guess for the actual distribution of the event.

Concretely, I think that in an idealized framework we should be treating the expert predictions $p_1, \dots, p_N$ as Bayesian evidence for the actual distribution of the event of interest $E$. That is, the idealized aggregation $\hat p$ should just match the conditional probability of the event given the predictions: $\hat p = P(E \mid p_1, \dots, p_N)$.

Of course, for this procedure to be practical you need to know the generative model for the individual predictions $P(p_1, \dots, p_N \mid E)$. This is for the most part not realistic - the generative model needs to take into account details of how each forecaster is generating the prediction and the redundancy of information between the predictions. So in practice we will need to approximate the aggregate measure using some sort of heuristic.

But, crucially, the approximation does not depend on the downstream task we intend to use the aggregate prediction for.

This is something hard for me to wrap my head around, since I too feel the intuitive grasp of wanting to retain information about eg the spread of the individual probabilities. I would feel more nervous making decisions when the forecasters wildly disagree with each other, as opposed to when the forecasters are of one voice.

What is this intuition then telling us? What do we need the information about the spread for then?

My answer is that we need to understand the resilience of the aggregated prediction to new information. This already plays a role in the aggregated prediction, since it helps us weight the relative importance we should give to our prior beliefs $P(E)$ vs the evidence from the experts $P(p_1, \dots, p_N \mid E)$ - a wider spread or a smaller number of forecaster predictions will lead to weaker evidence, and therefore a higher relative weighting of our priors.

Similarly, the spread of distributions gives us information about how much would we gain from additional predictions.

I think this neatly resolves the tension between aggregating vs not, and clarifies when it is important to retain information about the distribution of forecasts: when value of information is relevant. Which, admittedly, is quite often! But when we cannot acquire new information, or we can rule out value of information as decision-relevant, then we should aggregate first into a single number, and make decisions based on our best guess, regardless of the task.  

Comment by Jsevillamol on Open Thread: September 2021 · 2021-09-21T13:01:52.008Z · EA · GW

I've been having some mixed feelings about some recent initiatives in the Forum.

These include things in the space of the creative fiction contest, posting humorous top level content and asking people to share memes.

I am having trouble articulating exactly what is causing my uneasiness. I think it's something along the lines of "I use the EA Forum to stay up to date on research, projects and considerations about Effective Altruism. Fun content distracts from that experience, and makes it harder for the work I publish in the Forum to be taken seriously".

On the other hand, I do see the value of having friendly content around. It makes the community more approachable. And the last thing I would want is to gatekeep people out for wanting to have fun together. I love hanging out with EAs too!

I trust the leadership of the Forum to have thought about these and other considerations. But I am voicing my opinion in case there are more who also share this uneasiness, to see if we can pinpoint it and figure out what to do about it.

Things that I think would help mitigate my uneasiness:

  • Create a peer-reviewed forum on top of the EA Forum, which curates research/thoughtful content. An interface like the Alignment Forum / LessWrong would work well for this.
  • Create a separate place of discourse (a Facebook group?) for fun content, perhaps linked somehow from the EA Forum.
  • Have the fun content be hidden by default, like personal posts, so people need to opt into it.

What do other people think? Do other people feel this way? 

Comment by Jsevillamol on Open Thread: September 2021 · 2021-09-21T06:56:19.830Z · EA · GW

Some discussion about profile pictures for the Forum here

Comment by Jsevillamol on Announcing · 2021-09-18T16:18:15.349Z · EA · GW

Done, thank you!

Comment by Jsevillamol on EA Forum Creative Writing Contest: Submission thread for work first published elsewhere · 2021-09-15T12:20:31.129Z · EA · GW

Here is something I wrote, and received some positive feedback on: Standing on a pile of corpses

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-13T10:00:22.332Z · EA · GW

Thank you! I learned too from the examples.

One question:

In particular, that the best approach for practical rationality involves calculating things out according to each  of the probabilities and then aggregating from there (or something like that), rather than aggregating first.

I am confused about this part. I think I said exactly the opposite? You need to aggregate first, then calculate whatever you are interested in. Otherwise you lose information (because eg taking the expected value of the individual predictions loses information that was contained in the individual predictions, about for example the standard deviation of the distribution, which depending on the aggregation method might affect the combined expected value).

What am I not seeing?

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-11T08:29:05.059Z · EA · GW

Thank you for your thoughts!

I agree with the general point of "different situations will require different approaches".

From that common ground, I am interested in seeing whether we can tease out when it is appropriate to use one method against the other.

*disclaimer: low confidence from here onwards

I do not find the  first example about value 0 vs value 500 entirely persuasive, though I see where you are coming from, and I think I can see when it might work.

The arithmetic mean of probabilities is entirely justified when aggregating predictions from  models that start from disjoint and exhaustive conditions (this was first pointed out to me by Ben Snodin, and Owen CB makes the same point in a comment above).

This suggests that if your experts are using radically different assumptions (and are not hedging their bets based on each other's arguments) then the average probability seems more appealing. I think this is implicitly what is happening in Linch's and your first and third examples - we are in a sense assuming that only one expert is correct in the assumptions that led them to their estimate, but you do not know which one.

My intuition is that once you have experts who are giving all-considered estimates, the geometric mean takes over again. I realize that this is a poor argument; but I am making a concrete claim about when it is correct to use arithmetic vs geo mean of probabilities.

In slogan form: the average of probabilities works for aggregating hedgehogs, the geometric mean works for aggregating foxes.

On the second example about institutional failure, the argument goes that the expected value of the aggregate probability ought to correspond to the mean of the expected values.

I do not think this is entirely correct - I think you lose information when taking the expected values before aggregating, and thus we should not in general expect this. This is an argument similar to (Lindley, 1983), where the author dismisses marginalization as a desirable property on similar grounds. For a concrete example, see this comment, where I worked out how the expected value of the aggregate of two log normals relates to the aggregate of the expected value.

What I think we should require is that the aggregate of the exponential distributions implied by the annual probabilities matches the exponential distribution implied by the  aggregated annual probabilities.

Interestingly, if you take the geometric mean aggregate of two exponential densities $f_i(t) = p_i e^{-p_i t}$ with associated annual probabilities $p_1, p_2$ then you end up with $f(t) = \frac{p_1 + p_2}{2} e^{-\frac{p_1 + p_2}{2} t}$.

That is, the geometric mean aggregation of the implied exponentials led to an exponential whose annual rate is the arithmetic mean of the individual rates.

EDIT: This is wrong, since the annualized probability does not match the rate parameter in an exponential. It still does not work after we correct it by substituting $p_i$ with the rate $\lambda_i = -\ln(1 - p_i)$.

I consider this a strong argument against the geometric mean.

Note that the arithmetic mean fails to meet this property too - the mixture distribution  is not even an exponential!  The harmonic mean does not satisfy this property either.
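Here is a small numerical sketch of the check described above, under my reading of it: convert each annual probability p into a rate via lambda = -ln(1 - p), pool the two exponential densities with a normalised geometric mean, and compare the annual probability implied by the pooled exponential with the geometric mean of odds of the original annual probabilities. The two annual probabilities are made up, and the helper names are hypothetical.

```python
import math

def rate_from_annual_prob(p):
    """Rate of the exponential whose probability of an event within one year is p."""
    return -math.log(1 - p)

def exp_density(rate, t):
    return rate * math.exp(-rate * t)

def geo_mean_odds(ps):
    yes = math.exp(sum(math.log(p) for p in ps) / len(ps))
    no = math.exp(sum(math.log(1 - p) for p in ps) / len(ps))
    return yes / (yes + no)

# Hypothetical annual probabilities from two experts.
p1, p2 = 0.01, 0.10
rate1, rate2 = rate_from_annual_prob(p1), rate_from_annual_prob(p2)

# The geometric mean of two exponential densities is proportional to an
# exponential density whose rate is the arithmetic mean of the two rates.
pooled_rate = (rate1 + rate2) / 2
ratios = [
    math.sqrt(exp_density(rate1, t) * exp_density(rate2, t)) / exp_density(pooled_rate, t)
    for t in (0.5, 1.0, 5.0, 20.0)
]  # constant in t, so normalising gives exactly that exponential

annual_prob_of_pooled_exponential = 1 - math.exp(-pooled_rate)
pooled_annual_prob = geo_mean_odds([p1, p2])

print("annual probability implied by pooled exponential:", annual_prob_of_pooled_exponential)
print("geometric mean of odds of annual probabilities:  ", pooled_annual_prob)
```

The two numbers differ, which is the sense in which the geometric mean fails the condition even after the correction.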

 What is the class of aggregation methods implied by imposing this condition? I do not know.

I do not have much to say about the Jack, Queen, King example. I agree with the general point that yes, there are some implicit assumptions that make the geometric mean work well in practice.

Definitely the JQK example does not feel like "business as usual". There is an unusual dependence between the beliefs of the experts. For example, had we pooled expert C as well then the example would no longer work.

I'd like to see whether we can derive some more intuitive examples that follow this pattern. There might be - but right now I am drawing a blank.

In sum, I think there is an important point here that needs to be acknowledged - the theoretical and empirical evidence I provided is not enough to pinpoint the conditions where the geometric mean is the better aggregate (as opposed to the arithmetic mean).

I think the intuition behind using mixture probabilities is correct when the experts are reasoning from mutually exclusive assumptions. I feel a lot less confident when aggregating experts giving all-considered views. In that case my current best guess is the geometric mean, but now I feel a lot less confident.

I think that first taking the expected value then aggregating loses you information. When taking a linear mixture this works by happy coincidence, but we should not expect this to generalize to situations where the correct pooling method is different.

I'd be interested in understanding better what is the class of pooling methods that "respects the exponential distribution" in the sense I defined above, of having the exponential associated with a pooled annual rate match the pooled exponentials implied by the individual annual rates.

And I'd be keen on more work identifying real life examples where the geometric mean approach breaks, and more work suggesting theoretical conditions where it does (not). Right now we only have external Bayesianity motivating it, which while compelling is clearly not enough.

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-08T16:12:00.940Z · EA · GW

Let's work this example through together! (but I will change the quantities to 10 and 20 for numerical stability reasons)

One thing we need to be careful with is not mixing the implied beliefs with the object level claims.

In this case, person A's claim that the value is $10$ is more accurately a claim that the beliefs of person A can be summed up as some distribution over the positive numbers, eg a log normal with parameters $\mu_A = \ln 10$ and $\sigma_A$. So the density distribution of beliefs of A is $f_A(x) = \frac{1}{x \sigma_A \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu_A)^2}{2 \sigma_A^2}\right)$ (and similar for person B, with $\mu_B = \ln 20$). The scale parameters $\sigma_A, \sigma_B$ intuitively represent the uncertainty of person A and person B.

Taking $\sigma_A = \sigma_B = 0.1$, these densities look like:

Note that the mean of these distributions is slightly displaced upwards from the median $e^{\mu}$. Concretely, the mean is computed as $e^{\mu + \sigma^2 / 2}$, and equals 10.05 and 20.10 for person A and person B respectively.

To aggregate the distributions, we can use the generalization of the geometric mean of odds referred to in footnote [1] of the post.

According to that, the aggregated distribution has a density $f(x) \propto \sqrt{f_A(x) f_B(x)}$.

The plot of the aggregated density looks like:

I actually notice that I am very surprised about this - I expected the aggregate distribution to be bimodal, but here it seems to have a single peak.

For this particular example, a numerical approximation of the expected value seems to equal around 14.21 - which exactly equals the geometric mean of the means.

I am not taking away any solid conclusions from this exercise - I notice I am still very confused about what the aggregated distribution looks like, and I encountered serious numerical stability issues when changing the parameters, which make me suspect a bug.

Maybe a Monte Carlo approach for estimating the expected value would solve the stability issues - I'll see if I can get around to that at some point.

Meanwhile, here is my code for the results above.

EDIT: Diego Chicharro has pointed out to me that the expected value can be easily computed analytically in Mathematica.

The resulting expected value of the aggregated distribution is $\exp\left(\frac{\mu_A \sigma_B^2 + \mu_B \sigma_A^2 + \sigma_A^2 \sigma_B^2}{\sigma_A^2 + \sigma_B^2}\right)$.

In the case where $\sigma_A = \sigma_B = \sigma$ we have then that the expected value is $\exp\left(\frac{\mu_A + \mu_B}{2} + \frac{\sigma^2}{2}\right) = \sqrt{e^{\mu_A + \sigma^2 / 2} \, e^{\mu_B + \sigma^2 / 2}}$, which is exactly the geometric mean of the expected values of the individual predictions.
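The equal-scale case can also be checked numerically with nothing but the standard library. This is a rough sketch, not the code linked above: it assumes log normals with medians 10 and 20 and a common scale of 0.1 (chosen because it reproduces the means 10.05 and 20.10 quoted earlier), and it integrates on a grid in log space, which sidesteps most of the stability problems of integrating over x directly.

```python
import math

def lognormal_pdf(x, mu, sigma):
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * math.sqrt(2 * math.pi))

# Assumed parameters: medians 10 and 20, common scale 0.1.
mu_a, mu_b, sigma = math.log(10), math.log(20), 0.1

# Riemann sum over y = ln(x); the extra factor x is the Jacobian dx = x dy.
n = 20000
low, high = mu_a - 1.0, mu_b + 1.0
step = (high - low) / n
numerator = 0.0    # integral of x * sqrt(f_A * f_B)
denominator = 0.0  # integral of sqrt(f_A * f_B), the normalising constant
for k in range(n + 1):
    x = math.exp(low + k * step)
    weight = math.sqrt(lognormal_pdf(x, mu_a, sigma) * lognormal_pdf(x, mu_b, sigma)) * x
    denominator += weight
    numerator += x * weight

numeric_mean = numerator / denominator

# Geometric mean of the two individual expected values.
mean_a = math.exp(mu_a + sigma ** 2 / 2)
mean_b = math.exp(mu_b + sigma ** 2 / 2)
geo_mean_of_means = math.sqrt(mean_a * mean_b)

print("numerical expected value of the aggregate:", numeric_mean)
print("geometric mean of the individual means:   ", geo_mean_of_means)
```

The step size cancels in the ratio, so only the grid range and resolution matter; both are generous for a scale of 0.1 but would need widening for much larger scales.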

Comment by Jsevillamol on How to get more academics enthusiastic about doing AI Safety research? · 2021-09-07T05:56:17.801Z · EA · GW

I think you mean bearish

Oops yes 🐻

You point out this highly skilled management/leadership/labor is not fungible

Yes, exactly.

I think what I am pointing towards is something like "if you are one such highly skilled editor, and your plan is to work on something like this part time delegating work to more junior people, then you are going to find yourself burnt out very soon. Managing a team of junior people / people who do not share your aesthetic sense to do highly skilled labor will be, at least for the first six months or so, much more work than if you do it on your own."

I think an editor will be ten times more likely to succeed if:

  1. They have a highly skilled co-founder who shares their vision
  2. They have a plan to work on something like this full time, at least for a while
  3. They have a plan for training aligned junior people on skills OR to teach taste to experts

In hindsight I think my comment was too negative, since I would still be excited about someone retrying a distill-like experiment and throwing money at it.

Comment by Jsevillamol on How to get more academics enthusiastic about doing AI Safety research? · 2021-09-05T07:09:31.838Z · EA · GW

I am more bullish about this. I think for distill to succeed it needs to have at least two full time editors committed to the mission.

Managing people is hard. Managing people, training them and making sure the vision of the project is preserved is insanely hard - a full time job for at least two people.

Plus the part Distill was bottlenecked on is very highly skilled labour, which needed a special aesthetic sensitivity and commitment.

50 senior hours per draft sounds insane - but I do believe the Distill staff when they say it is needed.

This wraps back to why new journals are so difficult: you need talented researchers with additional entrepreneurial skills to push it forward. But researchers by and large would much rather just work on their research than manage a journal.

Comment by Jsevillamol on How to get more academics enthusiastic about doing AI Safety research? · 2021-09-04T19:26:42.693Z · EA · GW

Create a journal of AI safety, and get prestigious people like Russell publishing in it.

Basically many people in academia are stuck chasing publications. Aligning that incentive seems important.

The problem is that journals are hard work, and require a very specific profile to push them forward.

Here is a post mortem of a previous attempt:

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-04T07:11:30.905Z · EA · GW

META: Do you think you could edit this comment to include...

  1. The number of questions, and aggregated predictions per question?
  2. The information on extremized geometric mean you computed below (I think it is not receiving as much attention due to being buried in the replies)?
  3. Possibly a code snippet to reproduce the results?

Thanks in advance!

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-03T17:38:36.562Z · EA · GW

You are right and I should be more mindful of this. 

I have reformulated the main equations using only commonly known symbols, moved the equations that were not critical for the text to a footnote and added plain language explanations to the rest.

(I hope it is okay that I stole your explanation of the geometric mean!)

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-03T10:38:43.402Z · EA · GW

I mean in the past people were underconfident (so extremizing would make their predictions better). Since then they've stopped being underconfident.  My assumption is that this is because the average predictor is now more skilled or because more predictors improves the quality of the average.



The bias isn't that more questions resolve positively than users expect.

Oh I see!

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-03T10:11:50.042Z · EA · GW

but also the average predictor improving their ability also fixed that underconfidence

What do you mean by this?

Metaculus has a known bias towards questions resolving positive

Oh I see!

It is very cool that this works.

One thing that confuses me - when you take the geometric mean of probabilities you end up with $\sqrt[N]{\prod_i p_i} \le \frac{\sqrt[N]{\prod_i p_i}}{\sqrt[N]{\prod_i p_i} + \sqrt[N]{\prod_i (1 - p_i)}}$. So the pooled probability gets slightly nudged towards 0 in comparison to what you would get with the geometric mean of odds. Doesn't that mean that it should be less accurate, given the bias towards questions resolving positively?

What am I missing?

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-03T09:35:08.329Z · EA · GW

(I note these scores are very different than in the first table; I assume these were meant to be the Brier scores instead?)

Comment by Jsevillamol on When pooling forecasts, use the geometric mean of odds · 2021-09-03T09:34:23.055Z · EA · GW

Thank you for the superb analysis!

This increases my confidence in the geo mean of the odds, and decreases my confidence in the extremization bit.

I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms.

One question / nitpick: what do you mean by geometric mean of the probabilities? If you just take the geometric mean of probabilities then you do not get a valid probability - the sum of the pooled ps and (1-p)s does not equal 1. You need to rescale them, at which point you end with the geometric mean of odds. 
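A short sketch of the nitpick, with made-up forecasts and hypothetical function names: the geometric means of the ps and of the (1-p)s do not add up to 1, and dividing by their sum gives exactly the geometric mean of odds.

```python
import math

def geo_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def rescaled_geo_mean_of_probs(ps):
    """Pool the ps and the (1-p)s separately, then rescale so they sum to 1.
    Returns the rescaled probability and the unrescaled total."""
    pooled_yes = geo_mean(ps)
    pooled_no = geo_mean([1 - p for p in ps])
    total = pooled_yes + pooled_no
    return pooled_yes / total, total

def geo_mean_of_odds(ps):
    pooled_odds = geo_mean([p / (1 - p) for p in ps])
    return pooled_odds / (1 + pooled_odds)

# Hypothetical forecasts, all strictly between 0 and 1.
forecasts = [0.05, 0.4, 0.7]
rescaled, raw_total = rescaled_geo_mean_of_probs(forecasts)

print("pooled p + pooled (1-p) before rescaling:", raw_total)
print("rescaled geo mean of probs:", rescaled)
print("geo mean of odds:          ", geo_mean_of_odds(forecasts))
```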

Unexpected values explains this better than me here.

Comment by Jsevillamol on My first PhD year · 2021-09-02T11:30:28.136Z · EA · GW

Thank you! 

The freedom for side projects is the best - though I should warn other people here that having a supportive supervisor who is okay with this is crucial.

I have definitely heard more than one horror story from colleagues who were constantly fighting their supervisors on the direction of their research, and felt they had little room for side projects.

Comment by Jsevillamol on What are the EA movement's most notable accomplishments? · 2021-08-23T06:47:00.059Z · EA · GW

EAF helped pass a ballot which doubled Zurich's development aid

Comment by Jsevillamol on More EAs should consider “non-EA” jobs · 2021-08-19T21:38:05.175Z · EA · GW

But when I look at this chart, my main takeaway is that there’s a ton of money being spent on welfare, and that working to make sure that money is spent as efficiently as possible could have a huge impact.

I think this is basically true.

A while back I thought that it was false - in particular, I thought that the public money was extremely tight, and that fighting to change a budget was an extremely political issue where one would face a lot of competition.

My experience collaborating with public organizations and hearing from public servants so far has been very different. I still think that budgets are hard to change. But within public organizations there is usually a fair amount of willingness to reconsider how their allocated budget is spent, and lots of space for savings and improvement.

This is still more anecdote than hard evidence. Yet I really think this is worth thinking more about. I think one very effective thing EAs can do is study closely public organizations (by eg interviewing their members or applying for jobs in the organization), and then think hard about how to help the organization better achieve their goals.

Comment by Jsevillamol on [PR FAQ] Sharing readership data with Forum authors · 2021-08-09T15:58:11.325Z · EA · GW

Reading time is one of my favourite features from Medium, and has helped me understand which of my posts are most useful. This in turn has informed my decisions on what to focus on writing. 

I expect to get similar benefits if a feature like this is implemented in the EA Forum. 

Comment by Jsevillamol on Forecasting Newsletter: July 2021 · 2021-08-01T20:30:00.128Z · EA · GW

In Section 4 we shift attention to the computational complexity of agreement, the subject of our deepest technical result. What we want to show is that, even if two agents are computationally bounded, after a conversation of reasonable length they can still probably approximately agree about the expectation of a [0, 1] random variable. A large part of the problem is to say what this even means. After all, if the agents both ignored their evidence and estimated (say) 1/2, then they would agree before exchanging even a single message. So agreement is only interesting if the agents have made some sort of “good-faith effort” to emulate Bayesian rationality.


TYPO: This belongs to the section on Aumann's agreement, but is listed in the problem of priors section