Yale has been one of the only groups in the past advocating for a selective fellowship. However, after we noticed a couple instances of people who had barely been accepted to the fellowship becoming extremely engaged with the group, we decided to do an analysis of our scoring of applications and eventual engagement. We found no correlation.
We think this shows the possibility that some of the people we have rejected in the past could have become extremely engaged members, which seems like a lot of missed value. We are still doing more analysis using different metrics and methods. For now we are tentatively recommending that groups do not follow our previous advice about being selective if they have the capacity to take on more fellows. We recommend either guaranteeing future acceptance to those over a baseline or encouraging applicants to apply to EA virtual programs if limited by capacity. This is not to say that there is no good way of selecting fellows but rather that ours in particular was not effective.
Rationale for Being Selective & Relevant Updates
These have been our reasons for being selective in the past and our updated thoughts
Only the most excited applicants participate (less engaged fellows who have poor attendance or involvement can set unwanted norms)
By emphasizing the time commitment in the interviews and making it easy for applicants to postpone doing the fellowship hopefully we will self select for this.
Fellows are incentivized to show up and be actively engaged (since they know they are occupying a spot another person did not receive)
The application and interview process alone should create the feeling of selectiveness even if we don’t end up being that selective.
We only need a few moderators that we are confident will be friendly, welcoming, and knowledgeable about EA
We were lucky enough to have several previous fellows who fit this description.
We made it a lot easier to become and be a facilitator by separating that role from the organizer role.
We create a stronger sense of community amongst Fellows
This is still a concern
Each Fellow can receive an appropriate amount of attention since organizers get to know each one individually
This is still a concern, though in the past Fellowship organizers were also taking on many different roles, and we now have one person whose only role is to manage the fellowship.
We don’t strain our organizing capacity and can run the Fellowship more smoothly
This is still a concern but the previous point also applies here
Overall, we still think these are good and important reasons for keeping the fellowship smaller. However, we are currently thinking that the possibility of rejecting an applicant who would have become really involved outweighs these concerns.*
*Although, there is an argument to be made that these people would have found a way to be involved anyways.
How we Measured Engagement and Why we Chose it
How we measured it
We brainstormed ways of being engaged with the group and estimated a general ranking for them. We ended up with:
Became YEA President >
Joined YEA Exec >
Joined the Board OR
Became a regular attendee of events and discussion groups OR
Became a mentor (after graduating) >
Became a Fellowship Facilitator (who is not also on the board) OR
Did the In-Depth Fellowship >
Became a top recruiter for the Fellowship OR
Had multiple 1-1 outside of the fellowship OR
Asked to be connected to the EA community in their post-graduation location OR
Attended the Student Summit >
Came to at least three post-fellowship events/1-1s >
Came to at least one post-fellowship event/1-1 >
Had good Fellowship attendance >
Did not drop out of the Fellowship (unless for particularly good reason) >
Dropped out of the fellowship never to be seen again
We ranked each set of fellows separately so we were ranking at most 17 people at a time. If people had the same level of engagement we gave them the same rank.
One potential issue is that this selects for people who are more interested in management/operations as becoming YEA President and YEA exec are at the top. However, we do think those roles are the ones that show the most engagement with the group. This method is overall imperfect and there was some subjectivity involved which is not ideal. However, we think that the rankings ended up being pretty accurate in depicting engagement and more accurate than if we had tried to assign “points” to all of the above.
Why we chose engagement
While we do not think engagement with EA is a perfect measure of the impact we try to achieve through the fellowship we think it does a decent job capturing that impact and is the easiest for us to use. For instance:
Many of YEA’s most engaged members have gone on to pursue high impact careers and have cited their continued engagement with the group as a large influence on this.
While we do post-fellowship surveys, we are unsure of whether the results hold over a longer period of time and answers such as “the fellowship had a significant impact on my career trajectory or donation plans” are ambiguous and difficult to quantify.
Since we have someone who has been running the group since the first revision of our fellowship in 2018, she was able to identify which members became the most engaged relatively easily.
There is the possibility that the Fellowship had a significant impact on a participant and their future path but that fellow chose not to stay engaged with Yale’s group. Some reasons for this might be:
They spent a significant amount of time on other high impact projects and didn’t have time to get involved with the group
They didn’t like the social atmosphere of the group or did not generally mesh well with the current community members
They didn’t find any of the other programs offered by our group particularly valuable or enticing and didn’t want to help organize
There was a latency period in the effect of the fellowship and they only realized its impact after graduating and entering the workforce
For each person we interviewed, we gave a composite interview score, which was the sum of their scores across the score categories. In practice, however, the raw scores were never particularly important. Rather, the important thing was the relative ranking of scores: the top ~15 people would all be admitted to the fellowship regardless of their raw score. For people who ultimately went on to do the fellowship, we later gave an engagement score. Again, we felt most confident ranking people by engagement, rather than trying to precisely quantify that engagement.
We chose to use Spearman’s Rho to evaluate the rankings. Spearman’s Rho is a measure of correlation that only considers the relative ordering of scores, rather than their absolute values, which we felt was more appropriate here. We allowed ties in both rankings, so p-values below are not considered exact.
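For readers who want to run a similar check on their own group's data, here is a minimal sketch in plain Python of Spearman's rho with tied ranks. It also includes a permutation p-value, which is one way to sidestep the inexact-p-value issue that ties cause. The function names are hypothetical and the scores below are made-up illustrative values, not our data; this is not necessarily how the numbers in this post were computed.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / (sx * sy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value for H0: rho = 0, by shuffling y relative to x."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical cohort (NOT real data): composite interview scores and
# engagement levels, where a higher number means more engaged.
interview = [27, 25, 25, 24, 22, 21, 21, 19]
engagement = [3, 7, 1, 5, 5, 8, 2, 4]
print("rho:", round(spearman_rho(interview, engagement), 3))
print("p:", round(permutation_p_value(interview, engagement), 3))
```

Because only the ordering matters, the same code works whether engagement is recorded as ranks or as ordered levels, as long as the direction is consistent with the interview scores.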
We completed the evaluation on data from Spring 2019, Fall 2019, Spring 2020, and Summer 2020 (Fall 2020 is too recent to properly score engagement). The data can be seen in Table 1.
Table 1: Spearman’s Rho Correlation of Interview Scores and Eventual Engagement
Observed rho (the closer to 1, the better interview scores predict engagement)
p-value (H0: rho = 0)
* Summer 2020 was an unusual fellowship, because we accepted students from outside Yale and got over 80 applicants. Regardless, our interview rankings were still not good predictors of engagement.
For all four fellowships, there was no significant correlation between fellows’ interview scores and engagement scores, and observed effects were quite small in any case. In other words, our scoring failed to predict what it was intended to predict.
Of course, we do not know what would have happened with applicants who we did not admit. However, if our scores could not differentiate among the top k applicants, whom we admitted, we see no reason to think they could differentiate applicant #k, who was admitted, from applicant #k+1, who was not. As a result, we believe there is no evidence to suggest that the applicants we admitted were any better than those we did not (in terms of their likelihood to engage further).
Our plan this semester
We will use our same application and run interviews but accept everyone over a baseline. Since we will have 6 facilitators we can have up to 30 fellows in this semester’s cohorts. If we have more than 30 people who are over the baseline then we will guarantee the excess applicants a spot in a future cohort as long as they apply again.
Rationale and Details
This semester we were lucky enough to get 6 previous fellows to volunteer to facilitate. In the past we would only have a few facilitators who we knew were very familiar with EA and experienced with leading discussions. This semester, however, we have brought on some facilitators whose only experience with EA was the Fellowship last semester. We thought they were particularly good in discussions and would make good facilitators, so we invited them to facilitate. We also had them go through the facilitator training for EA Virtual Programs.
Keeping our desired format for the fellowship, having 6 facilitators will allow us to have up to 30 fellows. We will still use our same application and run interviews as we feel they help set the standards high for the fellowship and weed out people who can’t commit. We plan on accepting everyone over a baseline. This includes everyone who filled out the entire application, showed up to their interview, is willing to commit the necessary amount of time, and didn’t say things that seem to directly conflict with EA (suggesting they would definitely not be receptive). If there are more than 30 people over the baseline who apply then we will give the excess applicants* a guaranteed spot in a future cohort given that they apply again.
We have guaranteed future spots in the past and have had several people take us up on the offer. This has generally been well-received and adds another filter for commitment. We make sure to send these applicants a personalized email explaining that we would really like them to do the fellowship but we simply don’t have the capacity this semester. We will also give them the option to share particularly strong reasons for doing it this semester (such as it being their only one with a light course load). Since we usually have at least 1-2 accepted applicants decide not to do the Fellowship, we can give people with strong reasons those spots.
EA Virtual Programs is currently in a trial stage but if it goes well they hope to be having new batches of fellows every month. If this happens, encouraging applicants to apply there could be a great option.
*We will prioritize keeping second semester seniors, final year graduate students and students who have a particularly strong reason to do the fellowship this semester (such as being on a leave of absence). We will then prioritize first-years since they are still choosing their extracurriculars and majors. Sophomores and Juniors will be the first to be cut if we have more than 30 good applicants. However, we plan to emphasize the amount of time required to successfully participate in the fellowship during interviews so that those who will not have the time can self-select out of the running.
Surveying past fellows
It is possible that our rankings had higher correlation with other measures of engagement. As noted above, it is possible that some fellows remained highly engaged with EA ideas but for various reasons did not engage much with our group. We have not done extensive surveying of past fellows, so it is unclear how many people fit this description. In the near future, we plan to survey fellows who completed the fellowship over a year ago to ask about their longer-term engagement with EA as a whole.
Testing a new scoring system
Our old scoring system involved scoring applicants along 6 axes and had interviewers give each applicant a score of 1-5 on each of these axes. While we gave guidelines for scoring and calibrated our scores, this still has a level of subjectivity involved. We will be testing a new way of scoring participants that involves check boxes rather than subjective scores in different categories. We plan on not using these to decide who to admit this semester but rather to analyze in the future whether it was more predictive of engagement than our previous method. If it is, then we will likely switch back to being selective and will publish another post.
It's hard to tell without seeing the data, but do you think you might have faced a range restriction problem here? i.e. if you're admitting only people with the highest scores, and then seeing whether the scores of those people correlate with outcomes, you will likely have relatively little variation in the predictor variable.
Yes, this is definitely a concern, for some cohorts more than others. Here are the number of people we interviewed each cohort:
Fall 2018: 37
Spring 2019: 22
Fall 2019: 30
Spring 2020: 22
Summer 2020: 40
So for Fall 2018 and Summer 2020, I think the case can be made that the range restriction effects might be high (given we have admitted ~15 fellows). For the Spring fellowships, we admitted the majority of applications and thus there should be more differentiation in the predictor variable.
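To get a feel for how much range restriction could matter at these cohort sizes, here is a rough simulation sketch. It assumes a true population correlation of 0.5 between interview score and engagement (a purely hypothetical value), draws normally distributed scores, admits the top 15, and compares the average Spearman correlation among everyone interviewed with the average among those admitted. The figure of 15 admits stands in for the "~15" above; the function names are illustrative.

```python
import random
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / (sx * sy) ** 0.5


def ranks(values):
    """Ranks 1..n; simulated values are continuous, so ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def simulate(true_r, n_interviewed, n_admitted, n_trials=2000, seed=0):
    """Return (mean rho among all interviewed, mean rho among admitted)."""
    rng = random.Random(seed)
    noise_scale = (1 - true_r ** 2) ** 0.5
    full, restricted = [], []
    for _ in range(n_trials):
        scores = [rng.gauss(0, 1) for _ in range(n_interviewed)]
        # Engagement correlates with score at true_r in the full population.
        engagement = [true_r * s + noise_scale * rng.gauss(0, 1) for s in scores]
        full.append(spearman(scores, engagement))
        # Admit only the highest-scoring applicants, then re-measure.
        top = sorted(range(n_interviewed), key=lambda i: scores[i], reverse=True)
        top = top[:n_admitted]
        restricted.append(
            spearman([scores[i] for i in top], [engagement[i] for i in top])
        )
    return mean(full), mean(restricted)


# Assumed true correlation of 0.5; 40 and 22 interviewed as listed above.
for interviewed in (40, 22):
    everyone, admitted_only = simulate(0.5, interviewed, 15)
    print(interviewed, round(everyone, 2), round(admitted_only, 2))
```

Under these assumptions the correlation among admitted fellows comes out noticeably lower than the correlation among everyone interviewed, and the gap is larger when a smaller share of interviewees is admitted.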
Thanks for the info! I guess that even if you aren't applying such strong selection pressure yourselves in some of these years, it could still be that all your applicants are sufficiently high in whatever the relevant factor is (there may be selection effects prior to your selection) that the measure doesn't make much difference. This might still suggest that you shouldn't select based on this measure (at least while the applicant pool remains similar), but the same might not apply to other groups (who may have a less selective applicant pool).
There are definitely a lot of selection effects prior to us making our selection. I think what we are trying to say is that our selections based on interview scores were not very helpful. Perhaps they would be helpful if our system worked very differently (for instance, if we just interviewed anyone who put down their email). But it seems like with the selection effects we had (have to make an effort to fill out the application, do a small amount of background reading, schedule and show up to an interview) we arrived at a place where our interview scoring system didn't do a good job further narrowing down the applicants.
We definitely do not mean to say that other groups definitively shouldn't be selective, or even shouldn't be selective using our criteria. We just don't have the evidence to suggest that our criteria were particularly helpful in our case, so we can't really recommend it for others.
Great post, and interesting and surprising result.
An obvious alternative selection criterion would be something like “how good would it be if this person got really into EA”; I wonder if you would be any better at predicting that. This one takes longer to get feedback on, unfortunately.
My instinctual response to this was: "well it is not very helpful to admit someone for whom it would be great if they got into EA if they really seem like they won't".
However, since it seems like we are not particularly good at predicting whether they will get involved or not maybe this is a metric we should incorporate. (My intuition is that we would still want a baseline? There could be someone it would be absolutely amazing to have get involved but if they are extremely against EA ideas and disruptive that might lower the quality of the fellowship for others.)
I am not super confident, though, that we would be very good at predicting this anyways. Are there certain indicators of this that you would suggest? I am also really not sure how/when we would collect feedback. Also open to thoughts here.
Thanks for the post, but I don't think you can conclude from your analysis that your criteria weren't helpful and the result is not necessarily that surprising.
If you look at professional NBA basketball players, there's not much of a correlation between how tall a basketball player is and how much they get paid or some other measure of how good they are. Does this mean NBA teams are making a mistake by choosing tall basketball players? Of course not!
The mistake your analysis is making is called 'selecting on the dependent variable' or 'collider bias'. You are looking at the correlation between two variables (interview score and engagement) in a specific subpopulation, the subpopulation that scored highly in interview score. However, that specific subpopulation correlation may not be representative of the correlation between interview score and engagement in the broader relevant population i.e., all students who applied to the fellowship. This is related to David Moss's comment on range restrictions.
The correlation in the population is the thing you care about, not the correlation in your subpopulation. You want to know whether the scores are helpful for selecting people into or out of the fellowship. For this, you need to know about engagement of people not in the fellowship as well as people in the fellowship.
This sort of thing comes up all the time, like in the basketball case. Another common example with a clear analogy to your case is grad school admissions. For admitted students, GRE scores are (usually) not predictive of success. Does that mean schools shouldn't select students based on GRE? Only if the relationship between success and GRE scores for admitted students is representative of the relationship for unadmitted students, which is unlikely to be the case.
The simplest thing you could do to improve this would be to measure engagement for all the people who applied (or who you interviewed if you only have scores for them) and then re-estimate the correlation on the full sample, rather than the selected subsample. This will provide a better answer to your question of whether scores are predictive of engagement. It seems like the things included in your engagement measure are pretty easy to observe so this should be easy to do. However, a lot of them are explicitly linked to participation in the fellowship which biases it towards fellows somewhat, so if you could construct an alternative engagement measure which doesn't include these, that would likely be better.
Broadly, I agree with your points. You're right that we don't care about the relationship in the subpopulation, but rather about the relationship in the broader population. However, there are a couple of things I think are important to note here:
As mentioned in my response on range restrictions, in some cases we did not reject many people at all. In those cases, our subpopulation was almost the entire population. This is not the case for the NBA or GRE examples.
Lastly, possibly more importantly: we only know of maybe 3 cases of people being rejected from the fellowship but becoming involved in the group in any way at all. All of these were people who were rejected and later reapplied and completed the fellowship. We suspect this is both due to the fact that the fellowship causes people to become engaged, and also because people who are rejected may be less likely to want to get involved. As a result, it wouldn't really make sense to try to measure engagement in this group.
In general, we believe that in order to use a selection method based on subjective interview rankings -- which are very time-consuming and open us up to the possibility of implicit bias -- we need to have some degree of evidence that our selection method actually works. After two years, we have found none using the best available data.
That being said -- this fall, we ended up admitting everyone who we interviewed. Once we know more about how engaged these fellows end up being, we can follow up with an analysis that is truly of the entire population.
The simplest thing you could do to improve this would be to measure engagement for all the people who applied and then re-estimate the correlation on the full sample, rather than the selected subsample... However, a lot of them are explicitly linked to participation in the fellowship which biases it towards fellows somewhat, so if you could construct an alternative engagement measure which doesn't include these, that would likely be better.
The other big issue with this approach is that this would likely be confounded by the treatment effect of being selected for and undertaking the fellowship. i.e. we would hope that going through the fellowship actually makes people more engaged, which would lead to the people with higher scores (who get accepted to the fellowship) also having higher engagement scores.
But perhaps what you had in mind was combining the simple approach with a more complex approach, like randomly selecting people for the fellowship across the range of predictor scores and evaluating the effects of the fellowship as well as the effect of the initial scores?
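As a rough illustration of what such a design might look like, here is a sketch that splits applicants into score bands and admits a fixed number at random from each band, so that admitted fellows span the whole range of predictor scores. The applicant IDs, scores, number of bands, and function name are all hypothetical.

```python
import random


def stratified_random_admit(scores, n_strata, per_stratum, seed=0):
    """scores: dict of applicant id -> interview score. Returns admitted ids.

    Applicants are sorted by score, cut into n_strata roughly equal bands,
    and per_stratum applicants are drawn at random from each band.
    """
    rng = random.Random(seed)
    ordered = sorted(scores, key=scores.get)  # lowest score first
    band_size = len(ordered) / n_strata
    admitted = []
    for s in range(n_strata):
        band = ordered[round(s * band_size): round((s + 1) * band_size)]
        admitted.extend(rng.sample(band, min(per_stratum, len(band))))
    return admitted


# Hypothetical pool of 30 interviewed applicants with made-up scores.
pool_rng = random.Random(1)
applicants = {f"applicant_{i}": pool_rng.randint(6, 30) for i in range(30)}
cohort = stratified_random_admit(applicants, n_strata=3, per_stratum=5)
print(len(cohort), "admitted across 3 score bands")
```

Because admission within each band is random, later engagement could be compared across bands for admitted fellows, and between admitted and non-admitted applicants within a band, which separates the effect of the score from the effect of doing the fellowship.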
I have done a decent amount of HR work focused on hiring over the past several years, as well as a lot of reading regarding how to do hiring well. While recruiting for a fellowship isn't completely identical to hiring an employee, there are enough similarities to justify learning from hiring practice.
I can't say I am surprised that a "hiring" process like the one described here failed to properly filter/select best fit candidates. Finding the right people for a job (or in this case a fellowship) can be really difficult. Not difficult in a way that requires more time or effort, but difficult in the way of many of the more fuzzy and ambiguous things in life: even if you are very careful and use best practices, your "accuracy rate" is still somewhere between 50% and 90%. However, I am very happy to see the willingness to analyze data and reveal flawed processes. That suggests that you are already on the right path. A+ for that. You seem to be far beyond where I was when I was your age.
Thanks for sharing. Firstly, as someone who went through the Yale EA Fellowship, I have to thank the organizers for their thoughtfulness through all stages of the Fellowship.
I have two questions: 1) How have you thought about DEI in the selection process and about minimizing any risks associated with unconscious bias or other systemic bias that may lead to certain individuals not being admitted to the program? 2) Has the team considered leveraging an experimental or quasi-experimental approach in thinking about how the selection process may influence the engagement of the cohort?
We actually originally created the scoring breakdown partly to help with unconscious biases. Before, we had given people a general score after an interview but we learned that that is often really influenced by biases and that breaking scores down into components with specific things to look for would reduce that. We are hoping the checkbox system we are trialing out this semester will reduce it even more as it aims to be even more objective. It is still possible, though, that it would lead to a systemic bias if the checkboxes themselves have one ingrained in them. We will be on the lookout for that and it is part of the reason we are not using it for selection this round :)
Additionally, at the end of the selection process we would look into the demographics of the selected fellows compared to all those being interviewed. Fortunately, for the past several years our selections actually were pretty representative of the total demographics of applicants. Unfortunately, our diversity of applicants, particularly racial, has not been as high as we would like and we are looking for ways to change that.
As for an experimental approach - I would be interested if you had any ideas on how to go about doing that?
These generally seem like very relevant criteria, so I'm definitely surprised by the results.
The only part I can think of that might have contributed to lower predictiveness of engagement is the "experience" criterion--I'd guess there might have been people who were very into EA since before the fellowship, and that this made them both score poorly on this metric and get very involved with the group later on. I wonder what the correlations look like after controlling for experience (although it's probably not that different, since it was only one of seven criteria).
I'm also curious: I'd guess that familiarity with EA-related books/blogs/cause areas/philosophies is a strong (positive) predictor of receptiveness to EA. Do you think this mostly factored into the scores as a negative contributor to the experience score, or was it also a big consideration for some of the other scores?
Interesting, but based on the small sample and limited range of scores (and I also agree with the points made by Moss and Rhys-Bernard) ...
I'm not sure whether you have enough data/statistical power to say anything substantially informative/conclusive. Even saying 'we have evidence that there is not a strong relation' may be too strong.
To help us understand this, can you report (frequentist) confidence intervals around your estimates? (Or even better, a Bayesian approach involving a flat but informative prior and a posterior distribution in light of the data?)
I'll try to say more on this later. A good reference is: Harms and Lakens (2018), “Making ‘null effects’ informative: statistical techniques and inferential frameworks”
Also, even 'insignificant' results may actually be rather informative for practical decision-making... if they cause us to reasonably substantially update our beliefs. We rationally make inferences and adjust our choices based on small amounts of data all the time, even if we can't say something like 'it is less than 1% likely that what I just saw would have been observed by chance'. Maybe 12% (p>0.05 !) of the time the dark cloud I see in the sky will fade away, but seeing this cloud still makes me decide to carry an umbrella... as now the expected benefits outweigh the costs.
I agree, I do not think I would say that "we have evidence that there is not a strong relation". But I do feel comfortable saying that we do not have evidence that there is any relation at all.
The confidence intervals are extremely wide, given our small sample sizes (the 95% interval is listed first, then the 75% interval):
Spring 2019: -0.75 to 0.50 (95%) and -0.55 to 0.16 (75%)
Fall 2019: -0.37 to 0.69 and -0.19 to 0.43
Spring 2020: -0.67 to 0.66 and -0.37 to 0.37
Summer 2020: -0.60 to 0.51 and -0.38 to 0.26
The upper ends are very high, and there is certainly a possibility that our interview scoring process is actually good. But, of the observed effects, two are negative, and two are positive. The highest positive observed correlation is only 0.10.
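For anyone who wants to see why intervals at these sample sizes come out so wide, here is a sketch of one common approximation: apply the Fisher z-transform to rho and use a standard error of sqrt(1.06 / (n - 3)), an adjustment often used for Spearman's rho. This is an assumption for illustration and not necessarily how the intervals above were computed. The cohort size of 15 is also an assumption; 0.10 is the highest observed correlation mentioned above.

```python
import math
from statistics import NormalDist


def spearman_ci(rho, n, level=0.95):
    """Approximate confidence interval for Spearman's rho via Fisher's z.

    Uses SE = sqrt(1.06 / (n - 3)), a common adjustment for rank correlations.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(rho)
    se = math.sqrt(1.06 / (n - 3))
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Assumed cohort of 15 fellows with an observed rho of 0.10.
for level in (0.95, 0.75):
    low, high = spearman_ci(0.10, 15, level)
    print(f"{int(level * 100)}% interval: {low:.2f} to {high:.2f}")
```

With only about 15 fellows per cohort, the standard error on the z scale is close to 0.3, so even the 75% interval spans a wide range of possible correlations.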
To somebody who has never been to San Francisco in the summer, it seems reasonable to expect it to rain. It's cloudy, it's dark, and it's humid. You might even bring an umbrella! But, after four days, you've noticed that it hasn't rained on any of them, despite continuing to be gloomy. You also notice that almost nobody else is carrying an umbrella; many of those who are are only doing so because you told them you were! In this situation, it seems unlikely that you would need to see historical weather charts to conclude that the cloudy weather probably doesn't imply what you thought it did.
This is analogous to our situation. We thought our interview scores would be helpful. But it's been several years, and we haven't seen any evidence that they have been. It's costly to use this process, and we would like to see some benefit if we are going to use it. We have not seen that benefit in any of our four cohorts. So, it makes sense to leave the umbrella at home, for now.
Thanks for sharing the confidence intervals. I guess it might be reasonable to conclude from your experience that the interview scores have not been informative enough to justify their cost.
What I am saying is that it doesn't seem (to me) that the data and evidence presented allows you to say that. (But maybe other analysis or inference from your experience might in fact drive that conclusion, the 'other people in San Francisco' in your example.)
But if I glance at just the evidence/confidence intervals it suggests to me that there may be a substantial probability that in fact there is a strongly positive relationship and the results are a fluke.
On the other hand I might be wrong. I hope to get a chance to follow up on this:
We could simulate a case where the measure has 'the minimum correlation to the outcome to make it worth using for selecting on', and see how likely it would be, in such a case, to observe the correlations as low as you observed
Or we could start with a minimally informative 'prior' over our beliefs about the measure, and do a Bayesian updating exercise in light of your observations; we could then consider the posterior probability distribution and consider whether it might justify discontinuing the use of these scores
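Both follow-ups can be sketched in a few lines. The first function simulates cohorts under an assumed true correlation and estimates how often every cohort would show an observed rho at or below 0.10 (the highest value reported above). The second puts a flat prior on rho over a grid and updates it using a Fisher-z normal approximation as the likelihood. Everything numeric here is an assumption for illustration: the cohort sizes of 15, the candidate "minimum useful" correlations, and especially the four observed rhos, which are made up because the actual values are not listed in this thread. The simulation also ignores range restriction.

```python
import math
import random
from statistics import NormalDist, mean


def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out


def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx)
    sy = sum((b - my) ** 2 for b in ry)
    return cov / (sx * sy) ** 0.5


def prob_all_cohorts_low(true_r, cohort_sizes, ceiling, n_sims=2000, seed=0):
    """Share of simulated worlds where every cohort's rho is <= ceiling."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(1 - true_r ** 2)
    hits = 0
    for _ in range(n_sims):
        all_low = True
        for n in cohort_sizes:
            scores = [rng.gauss(0, 1) for _ in range(n)]
            engagement = [true_r * s + noise_scale * rng.gauss(0, 1) for s in scores]
            if spearman(scores, engagement) > ceiling:
                all_low = False
                break
        hits += all_low
    return hits / n_sims


def posterior_prob_at_least(observed, threshold):
    """P(rho >= threshold | data) with a flat prior on a grid of rho values.

    observed: list of (rho_hat, n) pairs, one per cohort.
    """
    grid = [i / 100 for i in range(-99, 100)]
    weights = []
    for rho in grid:
        likelihood = 1.0
        for rho_hat, n in observed:
            se = math.sqrt(1.06 / (n - 3))
            likelihood *= NormalDist(math.atanh(rho), se).pdf(math.atanh(rho_hat))
        weights.append(likelihood)
    total = sum(weights)
    return sum(w for rho, w in zip(grid, weights) if rho >= threshold) / total


# Assumed: four cohorts of 15; candidate true correlations of 0.0, 0.3, 0.6.
for true_r in (0.0, 0.3, 0.6):
    print(true_r, prob_all_cohorts_low(true_r, [15] * 4, ceiling=0.10))

# Made-up observed rhos (two positive, two negative, none above 0.10).
hypothetical = [(0.10, 15), (-0.05, 15), (0.05, 15), (-0.10, 15)]
print(posterior_prob_at_least(hypothetical, threshold=0.3))
```

The interesting output is how quickly the first probability falls as the assumed true correlation rises, and how much posterior mass remains above whatever correlation would make the scores worth their cost.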
What are the community's thoughts on making the application super long so that only the most interested people apply (and basically accept everyone who applies)? Would this be considered selective in the same way as rejecting people?
It's an interesting idea, but even if this ends up producing very engaged participants you have to be careful.
If you (deliberately and successfully) only select for people who are super keen, you end up with a super keen cohort but potentially only minimal counterfactual impact as all those you selected would have ended up really involved anyway. This was briefly mentioned in the post and I think is worth exploring further.
Thank you so much for the insights! We've tried longer applications to ensure that the fellows are more engaged due to bad experiences of fellows dropping out / derailing the conversation in the past. However, the point about counterfactual impact has nudged me to shorten our application!
I agree that exactly that tradeoff is important! There's definitely a balance to be struck, and you certainly wouldn't want to exclude those who are already very aligned on the basis of low counterfactual impact, as the participation of those people will likely be very positive for other members!
I don't have a strong opinion about this in the context of fellowships, but I can refer to setting a high entry bar in recruiting community members and volunteers in general, and specifically, by asking them to invest time in reading content. I hope this helps and is not completely off-topic.
Though EA is a complex set of ideas and we want people to have a good understanding of what it's all about, demanding a lot from new people can be fairly offputting and counterproductive.
From my experience, people who are of high potential to be both highly-engaged and of high value to the community are often both quite busy people and relatively normal human beings.
As for the first point, if you sent someone a long list of content, he/she might just say "this is too demanding, can't handle this right now". As for the second point, we have to accept that people have much shorter attention spans than we would like to imagine, especially if they are not really familiar with the content.
Me and Gidon Kadosh from EA Israel have thought long and hard about how to lower the perceived effort of people who come to our website by creating this "Learn More" page on our website. Though it's in Hebrew, you might be able to understand what we tried to do here on a structural level. We plan to make it even more attractive for readers, possibly by splitting this page into individual pages focusing on a specific subject, and allowing the user to conveniently move on to the next/previous subject - This way we both lower perceived effort of reading this content and create a feeling of progress for the user.
I'm really not sure there is a correlation between the willingness of someone to invest a lot of time in reading lots of content or filling a long application before they have a clear view of the value in doing this. Going back to recruiting new community members and volunteers, there are brilliant people who are value-aligned, but just don't have the time (or are not willing) to invest the time needed to fill in a highly-demanding application, or read 20 articles about something they are not sure they care about yet.
Thank you so much for the insights! We've tried longer applications to ensure that the fellows are more engaged due to bad experiences of fellows dropping out / derailing the conversation in the past. However, the point about high-potential people being busy has convinced me to shorten our application!