## Posts

Immigration reform: a shallow cause exploration 2023-02-20T15:57:13.180Z
Pain relief: a shallow cause exploration 2023-01-27T09:55:26.991Z
Evaluating StrongMinds: how strong is the evidence? 2023-01-19T00:12:18.916Z
A can of worms: the non-significant effect of deworming on happiness in the KLPS 2022-12-21T13:10:17.916Z
A dozen doubts about GiveWell’s numbers 2022-11-01T02:25:09.459Z
Finding before funding: Why EA should probably invest more in research 2022-08-17T08:48:44.756Z
Deworming and decay: replicating GiveWell’s cost-effectiveness analysis 2022-07-25T20:26:53.344Z
Estimating the cost-effectiveness of scientific research 2022-07-16T12:20:24.202Z
JoelMcGuire's Shortform 2022-06-15T04:08:55.773Z
The Bearable Brightness of Wellbeing: The life of an HLI researcher 2022-04-08T14:13:58.266Z
The effect of cash transfers on subjective well-being and mental health 2020-11-20T18:12:04.570Z

Comment by JoelMcGuire on EA is three radical ideas I want to protect · 2023-03-29T21:24:46.454Z · EA · GW

Great piece. Short and sweet.

Given the stratospheric karma this post has reached, and the ensuing likelihood it becomes a referenced classic, I thought it'd be a good time to descend to some pedantry.

"Scope sensitivity" as a phrase doesn't click with me. For some reason, it bounces off my brain. Please let me know if I seem alone in this regard. What scope are we sensitive to? The scope of impact? Some of the related slogans, "shut up and multiply" and "cause neutral", aren't much clearer. "Shut up and multiply", which seems slightly off-putting / crass as a phrase stripped of context, gives no hint at what we're multiplying[1]. "Cause neutral", without elaboration, seems objectionable. We shouldn't be neutral about causes! We should prefer the ones that do the most good! Both require extra context and elaboration. If this post is used to introduce EA, which now seems likelier, I think this section will confuse a bit. A good slogan should have a clear, difficult-to-misinterpret meaning that requires little elaboration. "Radical compassion / empathy" does a good job of this. "Scout mindset" is slightly more in-groupy, but I don't think newbies would intuit that thinking like a scout involves careful exploration of ideas and the importance of reporting the truth of what you find.

Some alternatives to "scope sensitivity" are:

• "Follow the numbers" / "crunch the numbers": we no longer quite "follow the data / evidence" as our primary guide, but we certainly try to follow the numbers.
• "More is better" / "More-imization": okay, this is a bit silly, but I assume that Peter was intentionally avoiding saying something like "maximization mindset", which is more intuitive than "scope sensitivity" but has probably fallen a bit out of vogue. We think that doing more good for the same cost is always better.
• "Cost-effectiveness guided": while it sounds technocratic, that's kind of the point. Ultimately it all comes back to cost-effectiveness. Why not say so?
1. ^

If I knew nothing else, I'd guess it's a suggestion of the profound implications of viewing probabilities as dependent (multiplicative) instead of independent (additive) and, consequently, support for complex systems approaches / GEM modelling instead of reductive OLSing with sparse interaction terms. /Joke

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-29T20:01:45.310Z · EA · GW

Jason,

You raise a fair point. One we've been discussing internally. Given the recent and expected adjustments to StrongMinds, it seems reasonable to update and clarify our position on AMF to say something like, "Under more views, AMF is better than or on par with StrongMinds. Note that currently, under our model, when AMF is better than StrongMinds, it isn't wildly better.” Of course, while predicting how future research will pan out is tricky, we'd aim to be more specific.

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-29T19:25:41.748Z · EA · GW

A high neutral point implies that many people in developing countries believe their lives are not worth living.

This isn't necessarily the case. Even if people described their lives as having negative wellbeing, I don't think this would imply they thought their lives were not worth continuing.

• People can have negative wellbeing and still want to live for the sake of others or causes greater than themselves.
• Life satisfaction appears to be increasing over time in low-income countries. I think this progress is such that many people who may have negative wellbeing at present will not have negative wellbeing their whole lives.

Edit: To expand a little: for these reasons, as well as the very reasonable drive to survive (regardless of wellbeing), I find it difficult to interpret revealed preferences, and it's unclear they're a bastion of clarity in this confusing debate.

Anecdotally, I've clearly had periods of negative wellbeing before (sometimes starkly), but never wanted to die during those periods. If I knew that such periods were permanent, I'd probably think it was good for me to not-exist, but I'd still hesitate to say I'd prefer to not-exist, because I don't just care about my wellbeing. As Tyrion said, "Death is so final, and life is so full of possibilities."

I think this should highlight that the difficulties here aren't just localized to this area of the topic.

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-29T19:23:58.400Z · EA · GW

2. I don't think 38% is a defensible estimate for spillovers, which puts me closer to GiveWell's estimate of StrongMinds than HLI's estimate of StrongMinds.

I wrote this critique of your estimate that household spillovers were 52%. That critique had three parts. The third part was an error, which you corrected and brought the answer down to 38%. But I think the first two are actually more important: you're deriving a general household spillover effect from studies specifically designed to help household members, which would lead to an overestimate.

I thought you agreed with that from your response here, so I'm confused as to why you’re still defending 38%. Flagging that I'm not saying the studies themselves are weak (though it's true that they're not very highly powered). I'm saying they're estimating a different thing from what you're trying to estimate, and there are good reasons to think the thing they're trying to estimate is higher. So I think your estimate should be lower.

I could have been clearer: the 38% is a placeholder while I do the Barker et al. 2022 analysis. You did update me about the previous studies' relevance. My arguments are less about supporting the 38% figure - which I expect to update with more data - and more about explaining why I think I have a higher prior for household spillovers from psychotherapy than you and Alex seem to. But really, the hope is that we can soon be discussing more and better evidence.

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-24T23:03:15.184Z · EA · GW

My intuition, which is shared by many, is that the badness of a child's death is not merely due to the grief of those around them. Thus the question should not be comparing just the counterfactual grief of losing a very young child VS an [older adult], but also "lost wellbeing" from living a net-positive-wellbeing life in expectation.

I didn't mean to imply that the badness of a child's death is just due to grief. As I said in my main comment, I place substantial credence (2/3rds) in the view that death's badness is the wellbeing lost. Again, this is my view, not HLI's.

The 13 WELLBY figure is the household effect of a single person being treated by StrongMinds. But that uses the uncorrected household spillover (53% spillover rate). With the correction (38% spillover) it'd be 10.5 WELLBYs (3.7 WELLBYs for recipient + 6.8 for household).

GiveWell arrives at the figure of 80% because they take a year of life as valued at 4.55 WELLBYs (4.95 - 0.5, according to their preferred neutral point), and StrongMinds' benefit to the direct recipient, according to HLI, is 3.77 WELLBYs --> 3.77 / 4.55 = ~80%. I'm not sure where the 40% figure comes from.
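For concreteness, that ratio can be reproduced in a few lines (the figures are the ones quoted in this thread):

```python
# Reproducing the ratio described above, using the quoted figures.
value_of_life_year = 4.55  # WELLBYs per life-year under GiveWell's preferred neutral point
recipient_effect = 3.77    # HLI's estimate of StrongMinds' effect on the direct recipient

ratio = recipient_effect / value_of_life_year
print(round(ratio, 2))  # rounds to 0.83, i.e. roughly the 80% GiveWell cites
```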

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-24T17:08:25.916Z · EA · GW

To be clear on what the numbers are: we estimate that group psychotherapy has an effect of 10.5 WELLBYs on the recipient's household, and that the death of a child in a LIC has a -7.3 WELLBY effect on the bereaved household. But the estimate for grief was very shallow. The report this estimate came from was not focused on making a cost-effectiveness estimate of saving a life (with AMF). Again, I know this sounds weasel-y, but we haven't yet formed a view on the goodness of saving a life, so I can't say how much group therapy HLI thinks is preferable to averting the death of a child.

That being said, I'll explain why this comparison, as it stands, doesn't immediately strike me as absurd. Grief has an odd counterfactual. We can only extend lives. People who're saved will still die, and the people who love them will still grieve. The question is how much worse the total grief is for a very young child (the typical beneficiary of, e.g., AMF) than the grief for the adolescent, young adult, adult, or elder they'd become[1] -- all multiplied by mortality risk at those ages.

So is psychotherapy better than the counterfactual grief averted? Again, I'm not sure because the grief estimates are quite shallow, but the comparison seems less absurd to me when I hold the counterfactual in mind.

1. ^

I assume people who are not very young children also have larger social networks, and that this could also play into the counterfactual (e.g., non-children may be grieved for by more people who forged deeper bonds). But I'm not sure how much to make of this point.

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-24T00:46:39.544Z · EA · GW

I'd point to the literature on time-lagged correlations between household members' emotional states, which I quickly summarised in the last installment of the household spillover discussion. I think it implies a household spillover of 20%. But I don't know if this type of data should over- or under-estimate the spillover ratio relative to what we'd find in RCTs. I know I'm being really slippery about this, but the Barker et al. analysis so far makes me think it's larger than that.

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-23T23:34:25.210Z · EA · GW

I find nothing objectionable in that characterization. And if we only had these three studies to guide us, then I'd concede that a discount of some size seems warranted. But we also have (A) our priors and (B) some new evidence from Barker et al. Both of these point me away from very small spillovers, but again, I'm still very unsure. I think I'll have clearer views once I'm done analyzing the Barker et al. results and have had someone, ideally Nathanial Barker, check my work.

[Edit: Michael edited to add: "It's not clear any specific number away from 0 could be justified."] Well not-zero certainly seems more justifiable than zero. Zero spillovers implies that emotional empathy doesn't exist, which is an odd claim.

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-23T23:06:21.655Z · EA · GW

# Joel’s response

[Michael's response below provides a shorter, less-technical explanation.]

# Summary

Alex’s post has two parts. First, what is the estimated impact of StrongMinds in terms of WELLBYs? Second, how cost-effective is StrongMinds compared to the Against Malaria Foundation (AMF)? I briefly present my conclusions to both in turn. More detail about each point is presented in Sections 1 and 2 of this comment.

## The cost-effectiveness of StrongMinds

GiveWell estimates that StrongMinds generates 1.8 WELLBYs per treatment (17 WELLBYs per $1000, or 2.3x GiveDirectly[1]). Our most recent estimate[2] is 10.5 WELLBYs per treatment (62 WELLBYs per $1000, or 7.5x GiveDirectly). This represents an 83% discount (an 8.7 WELLBY gap)[3] to StrongMinds' effectiveness[4]. These discounts, while sometimes informed by empirical evidence, are primarily subjective in nature. Below I present the discounts, and our response to them, in more detail.
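The headline gap can be checked directly from the two per-treatment estimates:

```python
# The two per-treatment estimates quoted above.
hli_estimate = 10.5      # WELLBYs per treatment (HLI)
givewell_estimate = 1.8  # WELLBYs per treatment (GiveWell)

gap = hli_estimate - givewell_estimate  # the 8.7 WELLBY gap
discount = gap / hli_estimate           # the ~83% discount
print(round(gap, 1), round(discount, 2))  # 8.7 0.83
```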

Figure 1: Description of GiveWell’s discounts on StrongMinds’ effect, and their source

Notes: The graph shows the factors that make up the 8.7 WELLBY discount.

Table 1: Disagreements on StrongMinds per treatment effect (10.5 vs. 1.8 WELLBYs) and cost

Note: GiveWell estimates StrongMinds has an effect of 1.8 WELLBYs per recipient household; HLI estimates that this figure is 10.5. This represents an 8.7 WELLBY gap.

How do we assess GiveWell’s discounts? We summarise our position below.

Figure 2: HLI’s views on GiveWell’s total discount of 83% to StrongMind’s effects

• We think there’s sufficient evidence and reason to justify the size and magnitude of ~5% of GiveWell’s total discount.
• For ~45% of their total discount, we are sympathetic to including a discount, but we are unsure about the magnitude (generally, we think the discount would be lower). The adjustments that I think are the most plausible are:
  • A discount of up to 15% for conversion between depression and life-satisfaction SDs.
  • A discount of up to 20% for loss of effectiveness at scale.
  • A discount of up to 5% for response biases.
  • Reducing the household size down to 4.8 people.
• We are unsympathetic to ~35% of their total discount, because our intuitions differ, but there doesn’t appear to be sufficient existing evidence to settle the matter (i.e., household spillovers).
• We think that for ~15% of their total discount, the evidence that exists doesn’t seem to substantiate a discount (i.e., their discounts on StrongMinds' durability).

However, as Michael mentions in his comment, a general source of uncertainty we have is about how and when to make use of subjective discounts. We will make more precise claims about the cost-effectiveness of StrongMinds when we finalise our revision and expansion.

## The cost-effectiveness of AMF

The second part of Alex's post asks how cost-effective StrongMinds is compared to the Against Malaria Foundation (AMF). AMF, which prevents malaria with insecticide-treated bednets, is, in contrast to StrongMinds, primarily a life-saving intervention. Hence, as @Jason rightly pointed out elsewhere in the comments, its cost-effectiveness strongly depends on philosophical choices about the badness of death and the neutral point (see Plant et al., 2022). GiveWell takes a particular set of views (deprivationism with a neutral point of 0.5) that are very favourable to life-saving interventions. But there are other plausible views that can change the results, and even make GiveWell’s estimate of StrongMinds seem more cost-effective than AMF. Whether you accept our original estimate of StrongMinds or GiveWell’s lower estimate, the comparison is still incredibly sensitive to these philosophical choices. I think GiveWell is full of incredible social scientists, and I admire many of them, but I'm not sure that should privilege their philosophical intuitions.

## Further research and collaboration opportunities

We are truly grateful to GiveWell for engaging with our research on StrongMinds. I think we largely agree with GiveWell regarding promising steps for future research. We’d be keen to help make many of these come true, if possible. Particularly regarding: other interventions that may benefit from a SWB analysis, household spillovers, publication bias, the SWB effects of psychotherapy (i.e. not just depression), and surveys about views on the neutral point and the badness of death. I would be delighted if we could make progress on these issues, and doubly so if we could do so together.

# 1. Disagreements on the cost-effectiveness of StrongMinds

HLI estimates that psychotherapy produces 10.5 WELLBYs (or 62 per $1000, 7.5x GiveDirectly) for the household of the recipient, while GiveWell estimates that psychotherapy has about a sixth of the effect, 1.8 WELLBYs (17 per $1000, or 2.3x GiveDirectly[5]). In this section, I discuss the sources of our disagreement regarding StrongMinds in the order I presented in Table 1.

## 1.1 Household spillover differences

Household spillovers are our most important disagreement. When we discuss the household spillover effect or ratio we’re referring to the additional benefit each non-recipient member of the household gets, as a percentage of what the main recipient receives. We first analysed household spillovers in McGuire et al. (2022), which was recently discussed here. Notably, James Snowden pointed out a mistake we made in extracting some data, which reduces the spillover ratio from 53% to 38%.
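As a sketch of how this definition turns into a household total (the household size below is illustrative, not HLI's actual input):

```python
def household_total(recipient_effect, spillover_ratio, household_size):
    """Recipient's effect plus each other household member at spillover_ratio."""
    other_members = household_size - 1
    return recipient_effect + spillover_ratio * recipient_effect * other_members

# With HLI's 3.7 WELLBYs for the recipient, the corrected 38% ratio, and an
# assumed household of ~5.9 people, the total comes out near the 10.5 WELLBYs
# figure used elsewhere in this thread.
print(round(household_total(3.7, 0.38, 5.9), 1))
```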

GiveWell’s method relies on:

• Discounting the 38% figure, citing several general reasons: (A) specific concerns that the studies we use might overestimate the benefits because they focused on families with children that had high-burden medical conditions, and (B) a shallow review of correlational estimates of household spillovers that found spillover ratios ranging from 5% to 60%.
• And finally concluding that their best guess is that the spillover percentage is 15 or 20%[6], rather than 53% (what we used in December 2022) or 38% (what we would use now in light of Snowden’s analysis). Since their resulting figure is a subjective estimate, we aren’t exactly sure why they give that figure, or how much they weigh each piece of evidence.

Table 2: HLI and GiveWell’s views on household spillovers of psychotherapy

Note: The household spillover for cash transfers we estimated is 86%.

I reassessed the evidence very recently - as part of the aforementioned discussion with James Snowden - and Alex’s comments don’t lead me to update my view further. In my recent analysis, I explained that I think I should weigh the studies we previously used less because they do seem less relevant to StrongMinds, but I’m unsure what to use instead. And I also hold a more favourable intuition about household spillovers for psychotherapy, because parental mental health seems important for children (e.g., Goodman, 2020).

But I think we can agree that collecting and analysing new evidence could be very important here. The data from Barker et al. (2022), a high-quality RCT of the effect of CBT on the general population in Ghana (n = ~7,000), contains information on both partners' psychological distress when one of them received cognitive behavioural therapy, so this data can be used to estimate any spousal spillover effects from psychotherapy. I am in the early stages of analysing this data[7]. There also seems to be a lot of promising primary work that could be done to estimate household spillovers alongside the effects of psychotherapy.

## 1.2 Conversion between measures, data sources, and units

The conversion between depression and life-satisfaction (LS) scores ties with household spillovers in terms of importance for explaining our disagreements about the effectiveness of psychotherapy. We’ve previously assumed that a one standard deviation (SD) decrease in depression symptoms (or affective mental health; MHa) is equivalent to a one SD improvement in life-satisfaction or happiness (i.e., a 1:1 conversion), see here for our previous discussion and rationale.

GiveWell has two concerns with this:

1. Depression and life-satisfaction measures might not be sufficiently empirically or conceptually related to justify a 1:1 conversion. Because of this, they apply an empirically based 10% discount.
2. They are concerned that recipients of psychotherapy have a smaller variance in subjective wellbeing (SWB) than general populations (e.g., cash transfers), which leads to inflated effect sizes. They apply a 20% subjective discount to account for this.

Hence, GiveWell applied a 30% discount (see Table 3 below).

Table 3: HLI and GiveWell’s views on converting between SDs of depression and life satisfaction

Overall, I agree that there are empirical reasons for including a discount in this domain, but I’m unsure of its magnitude. I think it will likely be smaller than GiveWell’s 30% discount.

### 1.2.1 Differences between the two measures

First, GiveWell mentions a previous estimate of ours suggesting that mental health (MH) treatments[8] impact depression 11% more than SWB. Our original calculation used a naive average, but on reflection, it seems more appropriate to use a sample-size-weighted average (because of the large differences in samples between studies), which results in depression measures overestimating SWB measures by 4%, instead of 11%.
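The difference between a naive and a sample-size-weighted average is easy to see with invented numbers (these are not the actual study ratios):

```python
def naive_average(values):
    """Unweighted mean of the ratios."""
    return sum(values) / len(values)

def weighted_average(values, weights):
    """Mean weighted by, e.g., study sample sizes."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical depression-to-SWB effect ratios from two studies with very
# different sample sizes: the small study dominates the naive average but
# barely moves the weighted one.
ratios = [1.25, 1.02]
sample_sizes = [100, 2000]
print(round(naive_average(ratios), 3), round(weighted_average(ratios, sample_sizes), 3))
```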

Results between depression and happiness measures are also very close in Bhat et al. (2022; n = 589), the only study I've found so far that looks at effects of psychotherapy on both types of measures. We can standardise the effects in two ways. Depending on the method, the SWB effects are larger by 18% or smaller by 1% than MHa effects[9]. Thus, effects of psychotherapy on depression appear to be of similar size as effects on SWB. Given these results, I think the discount due to empirical differences could be smaller than 10%; I would guess 3%.

Another part of this is that depression and life satisfaction are not the same concept. So if the scores are different, there is a further moral question about which deserves more weight. The HLI ‘house view’, as our name indicates, favours happiness (how good/bad we feel) as what matters. Further, we suspect that measures of depression are conceptually closer to happiness than measures of life satisfaction are. Hence, if push came to shove, and there is a difference, we’d care more about the depression scores, so no discount would be justified. From our conversation with Alex, we understand that the GiveWell ‘house view’ is to care more about life satisfaction than happiness. In this case, GiveWell would be correct, by their lights, to apply some reduction here.

### 1.2.2 Differences in variance

In addition to their 11% conversion discount, GiveWell adds another 20% discount because they think a sample of people with depression has a smaller variance in life satisfaction scores.[10] Setting aside the technical topic of why variance in variances matters, I investigated whether there are lower SDs in life satisfaction when you screen for baseline depression, using a few datasets. I found that, if anything, the SDs are larger by 4% (see Table 4 below). Although I see the rationale behind GiveWell’s speculation, the evidence I’ve looked at suggests a different conclusion.
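GiveWell's intuition can be illustrated with a toy simulation (every parameter here is invented): if depression and life satisfaction are negatively correlated bivariate-normal scores, screening on high depression does shrink the life-satisfaction SD in the model, which is why checking against real survey data matters.

```python
import random

random.seed(0)
rho = -0.5  # assumed correlation between depression and life satisfaction

population = []
for _ in range(100_000):
    dep = random.gauss(0, 1)                                     # depression score
    ls = rho * dep + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)  # life satisfaction
    population.append((dep, ls))

def sd(xs):
    mean = sum(xs) / len(xs)
    return (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5

ls_all = [ls for _, ls in population]
ls_screened = [ls for dep, ls in population if dep > 1.0]  # "clinical" cutoff

# Under these assumptions, the screened group's life-satisfaction SD is smaller.
print(round(sd(ls_all), 2), round(sd(ls_screened), 2))
```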

Table 4: Life-satisfaction SD depending on clinical mental health cutoff

Note: BHPS = The British Household Panel Survey, HILDA = The Household Income and Labour Dynamics Survey, NIDS = National Income Dynamics Study. LS = life satisfaction, dep = depression.

However, I’m separately concerned that SD changes in trials where recipients are selected based on depression (i.e., psychotherapy) are inflated compared to trials without such selection (i.e., cash transfers)[11].

Overall, I think I agree with GiveWell that there should be a discount here that HLI doesn’t implement, but I’m unsure of its magnitude, and I think that it’d be smaller than GiveWell’s. More data could likely be collected on these topics, particularly how much effect sizes in practice differ between life-satisfaction and depression, to reduce our uncertainty.

## 1.3 Loss of effectiveness outside trials and at scale

GiveWell explains their concern, summarised in the table below:

“Our general expectation is that programs implemented as part of randomized trials are higher quality than similar programs implemented at scale. [...] For example, HLI notes that StrongMinds uses a reduced number of sessions and slightly reduced training, compared to Bolton (2003), which its program is based on. We think this type of modification could reduce program effectiveness relative to what is found in trials. [...] We can also see some evidence for lower effects in larger trials…”

Table 5: HLI and GiveWell’s views on an adjustment for StrongMinds losing effectiveness at scale

While GiveWell provides several compelling reasons why StrongMinds' efficacy will decrease as it scales, I can’t find the reason GiveWell provides for why these reasons result in a 25% discount. It seems like a subjective judgement informed by some empirical factors, and perhaps by previous experience studying this issue (e.g., cases like No Lean Season). Is there any quantitative evidence that suggests that when RCT interventions scale, they drop 25% in effectiveness? While GiveWell also mentions that larger psychotherapy trials have smaller effects, I assume this is driven by publication bias (discussed in Section 1.6). I’m also less sure that scaling has no offsetting benefits. I would be surprised if, when RCTs are run, the intervention has all of its kinks ironed out. In fact, there are many cases of the RCT version of an intervention being the “minimum viable product” (Karlan et al., 2016). While I think a discount here is plausible, I’m very unsure of its magnitude.

In our updated meta-analysis, we plan on doing a deeper analysis of the effect of expertise and time spent in therapy, and to use this to better predict the effect of StrongMinds. We’re awaiting the results from Baird et al., which should better reflect StrongMinds' new strategy, as StrongMinds trained but did not directly deliver the programme.

## 1.4 Disagreements on the durability of psychotherapy

GiveWell explains their concern, summarised in the table below: “We do think it’s plausible that lay-person-delivered therapy programs can have persistent long-term effects, based on recent trials by Bhat et al. 2022 and Baranov et al. 2020. However, we’re somewhat skeptical of HLI’s estimate, given that it seems unlikely to us that a time-limited course of group therapy (4-8 weeks) would have such persistent effects. We also guess that some of the factors that cause StrongMinds’ program to be less effective than programs studied in trials (see above) could also limit how long the benefits of the program endure. As a result, we apply an 80% adjustment factor to HLI’s estimates. We view this adjustment as highly speculative, though, and think it’s possible we could update our view with more work.”

Table 6: HLI and GiveWell’s views on a discount to account for a decrease in durability

Since this disagreement appears mainly based on reasoning, I’ll explain why my intuitions - and my interpretation of the data - differ from GiveWell’s here. We already assume that StrongMinds decays 4% more each year than does psychotherapy in general (see Table 3). Baranov et al. (2020) and Bhat et al. (2022) both find long-term effects that are greater than what our general model predicts. This means that we already assume a higher decay rate in general, and especially for StrongMinds, than the two best long-term studies of psychotherapy suggest. I show how these studies compare to our model in Figure 3 below.

Figure 3: Effects of our model over time, and the only long-term psychotherapy studies in LMICs

Edit: I updated the figure to add the StrongMinds model, which starts with a higher effect but has a faster decay.

Baranov et al. (2020, 16 intended sessions) and Bhat et al. (2022, 6-14 intended sessions, with a 70% completion rate) were both time-limited. StrongMinds historically used 12 sessions (it may be 8 now) of 90 minutes[12]. Therefore, our model is more conservative than the Baranov et al. result, and closer to Bhat et al., which has a similar range of sessions. Another reason in favour of the durability of StrongMinds' effects, which I mentioned in McGuire et al. (2021), is that 78% of groups continued meeting on their own at least six months after the programme formally ended.
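The decay model itself isn't spelled out in this comment; as a generic sketch of how a constant annual decay rate compounds (the rates and initial effects below are illustrative, not HLI's actual parameters):

```python
def effect_after(initial_effect, annual_decay, years):
    """Effect remaining after `years`, assuming a constant annual decay rate."""
    return initial_effect * (1 - annual_decay) ** years

# Illustrative only: a general-psychotherapy model vs. a StrongMinds-style
# model that starts higher but is assumed to decay 4 percentage points
# faster per year, so it eventually falls below the general model.
for year in range(5):
    general = effect_after(1.0, 0.30, year)
    strongminds = effect_after(1.2, 0.34, year)
    print(year, round(general, 2), round(strongminds, 2))
```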

Bhat et al. (2022) is also notable in another regard: They asked ~200 experts to predict the impact of the intervention after 4.5 years. The median prediction underestimated the effectiveness by nearly 1/3rd, which makes me inclined to weigh expert priors less here[13].

Additionally, there seems to be some double-counting in GiveWell’s adjustments. The initial effect is adjusted by 0.75 for “lower effectiveness at scale and outside of trial contexts”, and the duration is also adjusted by 0.80 for “lower effectiveness at scale and outside of trial contexts”. Combined, this is a 0.6 adjustment instead of one 0.8 adjustment. I feel like one concern should show up as one discount.
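The compounding worry above is just multiplication of the two adjustment factors:

```python
initial_adjustment = 0.75   # applied to the initial effect, for effectiveness at scale
duration_adjustment = 0.80  # applied to the duration, for the same stated reason

combined = initial_adjustment * duration_adjustment
print(round(combined, 2))  # 0.6 -- a stricter overall cut than either factor alone
```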

## 1.5 Disagreements on social desirability bias[14]

GiveWell explains their concern, which is summarised in the table below: “One major concern we have with these studies is that participants might report a lower level of depression after the intervention because they believe that is what the experimenter wants to see [...] HLI responded to this criticism [section 4.4] and noted that studies that try to assess experimenter-demand effects typically find small effects. [...] We’re not sure these tests would resolve this bias so we still include a downward adjustment (80% adjustment factor).”

Table 7: HLI and GiveWell’s views on a response bias discount

Participants might report bigger effects to be agreeable with the researchers (socially driven bias) or in the hopes of future rewards (cognitively driven bias; Bandiera et al., 2018), especially if they recognise the people delivering the survey to be the same people delivering the intervention[15].

But while I also worry about this issue, I am less concerned than GiveWell that response bias poses a unique threat to psychotherapy, because if this bias exists, it seems likely to apply to all RCTs of interventions with self-reported outcomes (and without active controls). So I think the relevant question is why the propensity for response bias might differ between cash transfers and psychotherapy. Here are some possibilities:

• It seems potentially more obvious that psychotherapy should alleviate depression than cash transfers should increase happiness. If so, questions about self-reported wellbeing may be more subject to bias in psychotherapy trials[16].
• We could expect that the later the follow-up, the less salient the intervention is, the less likely respondents are to be biased in this way (Park & Kumar, 2022). This is one possibility that could favour cash transfers because they have relatively longer follow-ups than psychotherapy.
• However, it is obvious to cash transfer participants whether they are in the treatment (they receive cash) or control conditions (they get nothing). This seems less true in psychotherapy trials where there are often active controls.

GiveWell responded to the previous evidence I cited (McGuire & Plant, 2021, Section 4.4)[17] by arguing that the tests run in the literature - which investigate the effect of the general propensity towards socially desirable responding, or the expectations of the surveyor - are not relevant because: “If the surveyor told them they expected the program to worsen their mental health or improve their mental health, it seems unlikely to overturn whatever belief they had about the program’s expected effect that was formed during their group therapy sessions.” But if participants' views about an intervention seem unlikely to be overturned by what the surveyor seems to want - when what the surveyor wants and the participant’s experience differ - then that’s a reason to be less concerned about socially motivated response bias in general.

However, I am more concerned with socially desirable responses driven by cognitive factors. Bandiera et al. (2018, p. 25) is the only study I found that discusses the issue, but they do not seem to think it was a problem in their trial: “Cognitive drivers could be present if adolescent girls believe providing desirable responses will improve their chances to access other BRAC programs (e.g. credit). If so, we might expect such effects to be greater for participants from lower socioeconomic backgrounds or those in rural areas. However, this implication runs counter to the evidence in Table A5, where we documented relatively homogenous impacts across indices and time periods, between rich/poor and rural/urban households.”

I agree with GiveWell that more research would be very useful, and could potentially update my views considerably, particularly with respect to the possibility of cognitively driven response bias in RCTs deployed in low-income contexts.

## 1.6 Publication bias

GiveWell explains their concern, which we summarise in the table below: “HLI’s analysis includes a roughly 10% downward adjustment for publication bias in the therapy literature relative to cash transfers literature. We have not explored this in depth but guess we would apply a steeper adjustment factor for publication bias in therapy relative to our top charities. After publishing its cost-effectiveness analysis, HLI published a funnel plot showing a high level of publication bias, with well-powered studies finding smaller effects than less-well-powered studies. This is qualitatively consistent with a recent meta-analysis of therapy finding a publication bias of 25%.”

Table 8: HLI and GiveWell’s views on a publication bias discount

After some recent criticism, we have revisited this issue and are working on estimating the bias empirically. Publication bias seems like a real issue, and a 10-25% correction like the one GiveWell suggests seems plausible, but we’re unsure about the magnitude as our research is ongoing. In our update of our psychotherapy meta-analysis, we plan to employ a more sophisticated quantitative approach to adjusting for publication bias.

## 1.7 Household size

GiveWell explains their concern, which we summarise in the table below: “HLI estimates household size using data from the Global Data Lab and UN Population Division. They estimate a household size of 5.9 in Uganda based on these data, which appears to be driven by high estimates for rural household size in the Global Data Lab data, which estimate a household size of 6.3 in rural areas in 2019. A recent Uganda National Household Survey, on the other hand, estimates household size of 4.8 in rural areas. We’re not sure what’s driving differences in estimates across these surveys, but our best guess is that household size is smaller than the 5.9 estimate HLI is using.”

Table 9: HLI and GiveWell’s views on the household size of StrongMinds’ recipients

I think the figures GiveWell cites are reasonable. I favour using international datasets because I assume that means greater comparability between countries, but I don’t feel strongly about this. I agree it could be easy and useful to try to understand StrongMinds recipients’ household sizes more directly. We will revisit this in our StrongMinds update.

## 1.8 Cost per person of StrongMinds treated

The one element where we differ that makes StrongMinds look more favourable is cost. As GiveWell explains: “HLI’s most recent analysis includes a cost of $170 per person treated by StrongMinds, but StrongMinds cited a 2022 figure of $105 in a recent blog post.”

Table 10: HLI and GiveWell’s views on cost per person for StrongMinds’ treatment

According to their most recent quarterly report, a cost per person of $105 was the goal, but they claim $74 per person for 2022[18]. We agree this is a more accurate/current figure, and the cost might well be lower now. A concern is that the reduction in costs may come at the expense of treatment fidelity – an issue we will review in our updated analysis.

# 2. GiveWell’s cost-effectiveness estimate of AMF is dependent on philosophical views

GiveWell estimates that AMF produces 70 WELLBYs per $1000[19], which would make it 4 times better than StrongMinds. GiveWell described the philosophical assumptions of their life-saving analysis as follows: “...Under the deprivationist framework and assuming a “neutral point” of 0.5 life satisfaction points. [...] we think this is what we would use and it seems closest to our current moral weights, which use a combination of deprivationism and time-relative interest account." Hence, they conclude that AMF produces 70 WELLBYs per $1000, which makes StrongMinds 0.24 times as cost-effective as AMF. However, the position they take is nearly the most favourable one can take towards interventions that save lives[20], and there are other plausible views about the neutral point and the badness of death (we discuss these in Plant et al., 2022). Indeed, assigning credences to higher neutral points[21] or to alternative philosophical views of death’s badness will reduce the cost-effectiveness of AMF relative to StrongMinds (see Figure 3). In some cases, AMF is less cost-effective than GiveWell’s estimate of StrongMinds[22].

Figure 4: Cost-effectiveness of charities under different philosophical assumptions (with updated StrongMinds value, and GiveWell’s estimate for StrongMinds)

To be clear, HLI does not (yet) take a stance on these different philosophical views. While I present some of my views here, these do not represent HLI as a whole.

Personally, I’d use a neutral point closer to 2 out of 10[23]. Regarding the philosophy, my credences would be close to uniformly distributed across the Epicurean, TRIA, and deprivationist views. If I plug these views into the model introduced in Plant et al. (2022), the result is a cost-effectiveness for AMF of 29 WELLBYs per $1000 (rather than 81 WELLBYs per $1000)[24], which is about half as good as the 62 WELLBYs per $1000 for StrongMinds. If GiveWell held these views, then AMF would fall within GiveWell’s pessimistic and optimistic estimates of 3-57 WELLBYs per $1000 for StrongMinds’ cost-effectiveness. For AMF to fall above this range, you need to (A) put almost all your credence in deprivationism and (B) have a neutral point lower than 2[25].
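To make the credence-weighting concrete, here is a minimal sketch in Python. The per-view WELLBY values below are hypothetical placeholders, chosen only so that their uniform-credence average matches the 29 WELLBYs per $1000 figure above; they are not outputs of HLI's actual model (Plant et al., 2022). Only the roughly uniform credences, the averaging mechanism, and the StrongMinds comparison figure come from the text.

```python
# Credence-weighted cost-effectiveness across three views of death's badness.
# Per-view WELLBY values are HYPOTHETICAL placeholders, not HLI model outputs.
wellbys_per_1000 = {
    "epicurean": 5.0,        # placeholder
    "tria": 25.0,            # placeholder
    "deprivationist": 57.0,  # placeholder
}
# Roughly uniform credences across the three views, as described in the text.
credences = {view: 1 / 3 for view in wellbys_per_1000}

amf_estimate = sum(credences[v] * wellbys_per_1000[v] for v in wellbys_per_1000)
strongminds_estimate = 62.0  # WELLBYs per $1000, from the text

print(round(amf_estimate, 1))                        # 29.0
print(round(amf_estimate / strongminds_estimate, 2))  # 0.47, i.e. about half as good
```

The mechanism, not the placeholder numbers, is the point: spreading credence away from pure deprivationism mechanically pulls AMF's expected WELLBYs down towards the lower-valued views.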

1. ^

Coincidentally, this is (barely) within our most recent confidence interval for comparing the cost-effectiveness of StrongMinds to GiveDirectly (95% CI: 2, 100).

2. ^

This calculation is based on a correction for a mistake in our spillover ratio discussed here (a spillover ratio of 38% instead of 53%). Our previous estimate was 77 WELLBYs per $1000 (Plant et al., 2022; McGuire et al., 2022).

3. ^

The discount on the effect per $1000 is smaller because GiveWell used a 38% smaller cost figure.

4. ^

Note that the reduction in cost-effectiveness is only 27% because they also use a smaller cost figure (62% of ours).

5. ^

Coincidentally, this is (barely) within our most recent confidence interval for comparing the cost-effectiveness of StrongMinds to GiveDirectly (95% CI: 2, 100).

6. ^

The text and the table give different values.

7. ^

But if you want to accept that the results could be very off, see here for a document with tables with my very preliminary results.

8. ^

These are positive psychology interventions (like mindfulness and forgiveness therapy) which might not completely generalise to psychotherapy in LMICs.

9. ^

Psychotherapy improved happiness by 0.38 on a 1-10 scale and reduced depression by 0.97 (on the PHQ-9’s 0-27 scale). If we convert the depression score to a 1-10 scale, using a stretch transformation, then the effect is a reduction in depression of 0.32. Hence, the SWB change is 18% larger than the MHa change. If we convert both results to Cohen’s d, we find a Cohen’s d of 0.167 for depression and a Cohen’s d of 0.165 for happiness. Hence, the change in MHa is 1% greater than the change in SWB.
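The stretch transformation used here is a linear rescaling between bounded scales. A minimal sketch of the calculation (my reconstruction from the figures in this footnote; the function name is mine):

```python
def stretch_transform(effect, old_min, old_max, new_min, new_max):
    """Linearly rescale a change score from one bounded scale to another."""
    return effect * (new_max - new_min) / (old_max - old_min)

# PHQ-9 depression reduction of 0.97 on its 0-27 scale, mapped onto a 1-10 scale:
depression_1_10 = stretch_transform(0.97, 0, 27, 1, 10)
print(round(depression_1_10, 2))  # 0.32

# Happiness improved by 0.38 on a 1-10 scale, so the SWB change is larger by:
print(round(0.38 / depression_1_10 - 1, 2))  # 0.18, i.e. ~18%
```

Note the transformation only scales the raw change by the ratio of scale ranges; it says nothing about whether respondents use the two scales comparably.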

10. ^

“it seems likely that SD in life satisfaction score is lower among StrongMinds recipients, who are screened for depression at baseline46 and therefore may be more concentrated at the lower end of the life satisfaction score distribution than the average individual.”

11. ^

Sample selection based on depression (i.e., selection based on the outcome used) could shrink the variance of depression scores in the sample, which would inflate standardised effect sizes for depression compared to trials without depression-based selection, because standardisation divides the raw effect by its standard deviation (i.e., standardised mean differences, such as Cohen’s d). To explore this, I used the datasets mentioned in Table XX, all of which also included measures of depression or distress, plus the data from Barker et al. (2022, n = 11,835). I found that the SD of depression in the general sample (both the mentally ill and healthy) was 18 to 21% larger than it was for those with clinically significant depression. This seems to indicate that psychotherapy trials provide inflated SD changes in depression compared to cash transfers, due to the smaller SDs of depression in their depression-screened samples. However, I think this may be offset by another technical adjustment. Our estimate of the life satisfaction SD we use to convert SD changes (in MHa or SWB) to WELLBYs might be larger, which would mean the effects of psychotherapy and cash transfers are underestimated by 14% compared to AMF. When we convert from SD-years to WELLBYs, we’ve used a mix of LMIC and HIC sources to estimate the general SD of LS. But I realised that there’s a version of the World Happiness Report that published data including the SDs of LS for many LMICs. Using this more direct data for Sub-Saharan countries suggests a higher SD of LS than I previously estimated (2.5 instead of 2.2, according to a crude estimate), a 14% increase.
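The two offsetting adjustments in this footnote can be sketched numerically. The ~20% SD difference and the 2.2 to 2.5 LS-SD revision are from the footnote; the raw effect and the relative SD levels are illustrative assumptions of mine:

```python
# (1) Cohen's d divides the raw effect by the sample SD, so a depression-screened
# sample with a ~20% smaller SD yields a ~20% larger d for the same raw change.
raw_effect = 1.0     # illustrative raw change on a depression scale
sd_general = 1.20    # relative SD in an unscreened sample
sd_screened = 1.00   # relative SD in a depression-screened sample (~20% smaller)
inflation = (raw_effect / sd_screened) / (raw_effect / sd_general) - 1
print(round(inflation, 2))  # 0.2

# (2) Converting SD-changes to WELLBYs multiplies by the SD of life satisfaction,
# so revising that SD from 2.2 to 2.5 raises the WELLBY estimate by ~14%.
conversion_change = 2.5 / 2.2 - 1
print(round(conversion_change, 2))  # 0.14
```

So the screening-driven inflation of psychotherapy's standardised effects and the larger LS-SD conversion push in roughly opposite directions and at similar magnitudes.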

12. ^

In one of the Bhat et al. trials, each session was 30 to 45 minutes (it’s unclear what the session length was for the other trials).

13. ^

Note: I was one of the predictors; my guess was in line with the crowd (~0.05 SDs), and you can’t see others' predictions beforehand on the Social Science Prediction Platform.

14. ^

Note: this is more about ‘experimenter demand effects’ (i.e., being influenced by the experimenters in a certain direction, because that’s what they want to find) than ‘social desirability bias’ (i.e., answering that one is happier than one is because it looks better). The latter is controlled for in an RCT. We keep the wording used by GW here.

15. ^

GiveWell puts it in the form of this scenario: “If a motivated and pleasant IPT facilitator comes to your village and is trying to help you to improve your mental health, you may feel some pressure to report that the program has worked to reward the effort that facilitator has put into helping you.” But these situations are why, in most RCTs, the implementers aren’t the surveyors. I’d be concerned if implementers acted as surveyors more often in psychotherapy studies than in cash transfer studies.

16. ^

On the other hand, who in poverty expects cash transfers to bring them misery? That seems about as rare as (or rarer than) thinking psychotherapy will deepen one’s suffering. However, I think the point is about what participants think the implementers most desire.

17. ^

Since then, I did some more digging. I found Dhar et al. (2018) and Islam et al. (2022), which use a questionnaire to test for the propensity to answer questions in a socially desirable manner, but find similarly small effects of socially motivated response bias. Park et al. (2022) take an alternative approach, randomising a subset of participants to self-survey, and argue that this does not change the results.

18. ^

This is mostly consistent with 2022 expenses / people treated = 8,353,149 / 107,471 = $78.

19. ^

81 WELLBYs per $1000 in our calculations, but they add some adjustments.

20. ^

The most favourable position would be assuming deprivationism and a neutral point of zero.

21. ^

People might hold that the neutral point is higher than 0.5 (on a 0-10 scale), which would reduce the cost-effectiveness of AMF. The IDinsight survey GiveWell uses covers people from Kenya and Ghana but has a small sample (n = 70) for its neutrality question. In our pilot report (n = 79; UK sample; Samuelsson et al., 2023) we find a neutral point of 1.3. See Samuelsson et al. (2023; Sections 1.3 and 6) for a review of the different findings in the literature and more detail on our findings. Recent unpublished work by Julian Jamison finds a neutral point of 2.5 in a sample of ~1,800 drawn from the USA, Brazil, and China. Note that, in all these cases, we recommend caution in concluding that any of these values is the neutral point. There is still more work to be done.

22. ^

Under GiveWell’s analysis, there are still some combinations of philosophical factors where AMF produces 17 WELLBYs per $1000 or less (i.e., is as good as or worse than StrongMinds in GiveWell’s analysis): (1) an Epicurean view, (2) deprivationism with neutral points above 4, and (3) TRIA with high ages of connectivity and neutral points above 3 or 4 (depending on the combination). This does not include the possibility of distributing credences across different views.

23. ^

I would put the most weight on the work by HLI and by Jamison and colleagues, mentioned above, which finds neutral points of 1.3/10 and 2.5/10, respectively.

24. ^

I average the results across each view.

25. ^

We acknowledge that many people may hold these views. We also want to highlight that many people may hold other views. We encourage more work investigating the neutral point and investigating the extent to which these philosophical views are held.

Comment by JoelMcGuire on Assessment of Happier Lives Institute’s Cost-Effectiveness Analysis of StrongMinds · 2023-03-22T21:59:34.150Z · EA · GW

Joel from HLI here,

Alex kindly shared a draft of this report and discussed feedback from Michael and me more than a year ago. He also recently shared this version before publication. We’re very pleased to finally see this published!

We will be responding in more (maybe too much) detail tomorrow. I'm excited to see more critical discussion of this topic.

Edit: the response (Joel's, Michael's, Sam's) has arrived.

Comment by JoelMcGuire on Why I don’t agree with HLI’s estimate of household spillovers from therapy · 2023-02-27T21:37:35.693Z · EA · GW

I'd assume that (1) you don't need the whole household: depending on the original sample size, it seems plausible to randomly select a subset of household members[1] (e.g., in house A you interview the recipient and son, in house B the recipient and partner, etc.); and (2) they wouldn't need to consent to participate, just to be surveyed, no?

If these assumptions didn't hold, I'd be more worried that this would introduce nettlesome selection issues.

1. ^

I recognise this isn't necessarily as simple as I make it out to be. I expect you'd need to be more careful with the timing of interviews to minimise the likelihood that certain household members are more likely to be missing (children at school, mother at the market, father in the fields, etc.).

Comment by JoelMcGuire on Why I don’t agree with HLI’s estimate of household spillovers from therapy · 2023-02-27T21:05:07.727Z · EA · GW

Given that this post has been curated, I wanted to follow up with a few points I’d like to emphasise that I forgot to include in the original comment.

• To my knowledge, we were the first to attempt to estimate household spillovers empirically. In hindsight, it shouldn't be too surprising that it’s been a messy enterprise. I think I've updated towards "messiness will continue".
• One hope of ours in the original report was to draw more attention to the yawning chasm of good data on this topic.
• "The lack of data on household effects seems like a gap in the literature that should be addressed by further research. We show that including household spillovers can change the relative cost-effectiveness of two interventions, which demonstrates the need to account for the impact of interventions beyond the direct recipient."
• Relatedly, spillovers don't have to be huge to be important. If you have a household of 5, with 1 recipient and 4 household non-recipients, household spillovers only need to be 25% of the recipient effect for the combined spillover to equal the recipient effect in size. I'm still pretty confident we omit an important parameter when we fail to estimate household spillovers.
• So I’m pleased with this conversation and hopeful that spillovers for all outcomes in the global health and wellbeing space will be given more empirical consideration as a consequence.
• There are probably relatively cost-effective ways to gather more data regarding psychotherapy spillovers in particular.
• I’ve heard that some people working with Vida-Plena are trying to find funding for an RCT that includes spillovers — but I haven’t spoken to Joy about this recently.
• StrongMinds could be willing to do more work here. I think they’re planning an RCT — if they get it funded, I think adding a module for household surveys shouldn’t be too expensive.
• There’s also a slew of meta-analyses of interventions aimed at families that didn’t always jointly target parents and children, which may include more RCTs from which we can infer spillovers. Many of these I missed before: Siegenthaler et al. (2012), Thanhäuser et al. (2017), Yap et al. (2016), Loechner et al. (2018), Lannes et al. (2018), and Havinga et al. (2021).
• In general, household spillovers should be relatively cheap to estimate if they just involve surveying a randomly selected additional household member and clarifying the relationships between those surveyed.
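The break-even arithmetic behind the "spillovers don't have to be huge" point above can be sketched as follows (the function name is mine; the numbers just restate the household-of-5 example):

```python
def household_effect(recipient_effect, n_other_members, spillover_ratio):
    """Total household benefit: recipient effect plus per-member spillovers."""
    spillover_total = n_other_members * spillover_ratio * recipient_effect
    return recipient_effect + spillover_total, spillover_total

# Household of 5: 1 recipient and 4 non-recipients with a 25% spillover each.
total, spillover = household_effect(1.0, 4, 0.25)
print(spillover)  # 1.0 -- the combined spillover equals the recipient's own effect
print(total)      # 2.0 -- so counting spillovers doubles the estimated benefit
```

In general, the break-even spillover ratio is just 1 divided by the number of non-recipient household members, which is why even modest per-member spillovers matter in large households.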

I still don't have the Barker et al. RCT spillover results, but will update this comment once I know.

Comment by JoelMcGuire on Taking a leave of absence from Open Philanthropy to work on AI safety · 2023-02-24T23:20:02.355Z · EA · GW

Neat! Cover jacket could use a graphic designer in my opinion. It's also slotted under engineering? Am I missing something?

Comment by JoelMcGuire on Immigration reform: a shallow cause exploration · 2023-02-24T22:28:59.601Z · EA · GW

Dear Srdjan,

I think we do address the potential for negative impacts. As we say in section 2.2 (and elaborate on in Appendix B.3):

"From 11 studies we estimate that a 1% increase in immigrants as a share of the population is associated with a (non-significant) decrease of -0.004 SDs of SWB (or -0.008 WELLBYs) for the native population."

Additionally, we have a subsection (3.2) called "risk of backlash effects". Again, these may not be the concerns you have in mind, but to say we're only mentioning positive effects is wrong. We mention throughout the report that we're concerned with potential negative impacts and include speculative estimates of these in our BOTECs.

And in section B.2 we say:

One concern is that emigration could limit the ability of the origin country to improve its institutions. This could be true if emigration drains high-skilled individuals from the country, making it poorer and less likely to reform. We expect this concern to be blunted somewhat by high rates of return (29% globally, Azose & Raftery, 2018) and remittances (which are three times the size of development aid). Remittances seem correlated to beneficial political (Williams, 2018) and economic (Yoshino et al., 2017; Kratou et al., 2015) effects at the country level. Historical quasi-experimental evidence from Sweden found that Swedish emigration led to a higher likelihood of reform for local governments in Sweden (Karadja & Prawitz, 2019). However, this study may not generalise to other contexts. We think that the positive effects of emigration may be less likely in authoritarian countries where potential reformers may emigrate at higher rates.

Comment by JoelMcGuire on Why I don’t agree with HLI’s estimate of household spillovers from therapy · 2023-02-24T21:54:40.656Z · EA · GW

James courteously shared a draft of this piece with me before posting, I really appreciate that and his substantive, constructive feedback.

1. I blundered

The first thing worth acknowledging is that he pointed out a mistake that substantially changes our results. And for that, I’m grateful. It goes to show the value of having skeptical external reviewers.

He pointed out that Kemp et al., (2009) finds a negative effect, while we recorded its effect as positive — meaning we coded the study as having the wrong sign.

What happened is that MH outcomes are often "higher = bad", while subjective wellbeing is "higher = better", so we note this in our code so that all effects that imply benefits are positive. What went wrong is that we coded Kemp et al. (2009), which used the GHQ-12, as "higher = bad" (which is usually the case) when the opposite was true here. Higher equalled good in this case because we had to do an extra calculation to extract the effect [footnote: since there was baseline imbalance in the PHQ-9, we took the difference in pre-post changes], which flipped the sign.

This correction would reduce the spillover effect from 53% to 38% and the cost-effectiveness comparison from 9.5x to 7.5x, a clear downwards correction.

This is how the forest plot should look.

2. James’s other critiques

I think James’s other critiques are also pretty reasonable. This updates me towards weighting these studies less. That said, what I should place weight on instead remains quite uncertain to me.

I’ve thought about it a bit, but I’m unsure what to make of the observational evidence. My reading of the observational literature mostly accords with James’s (I think), and it does appear to suggest smaller spillovers than the low-quality RCTs I previously referenced (20% versus the now-corrected 38%). Here's a little table I made while doing a brief review during my discussion with James.

However, I wonder if there’s something about the more observational literature that makes household spillovers appear smaller, regardless of the source. To investigate this further, I briefly compared household spillovers from unemployment and mental health shocks. This led me to an estimate of around 57% for the household spillover of unemployment, which I think we could use as a prior for other economic shocks. This is a bit lower than the 86% I estimated as the household spillover for cash transfers. Again, I'm not quite sure what to make of this.

3. Other factors that influence my priors / fuzzy tummy feelings about psychotherapy spillovers.

• A mother’s mental health seems really important, over and above a father’s. Augustijn (2022) finds a stronger relationship between mother and child MH than between father and child MH (a 1 LS point change in the mother predicts a 0.13 change in LS for the child, compared to 0.06 for fathers). Many of the studies above seem to have larger mother → child effects than father → child effects. This could be relevant because StrongMinds primarily treats women.
• Mental health appears important relative to the effect of income.
• See figure from Clark et al., (2018) -- shown below.
• Mcnamee et al. (2021) (the published version of Mendolia et al., 2018) finds that having a partner with a long-term emotional or nervous condition that requires treatment has a -0.08 effect on LS, and that log household income has a 0.064 effect. If we interpret 0.69 log-units as leading to a doubling, and assume that most $1000 CTs lead to about a doubling in household income, then the effect of doubling income on LS is 0.064 * 0.69 = 0.04. If I assume that the effect of depression is similar to “long-term emotional or nervous condition that requires treatment” and psychotherapy cures 40% of depression cases, then this leads to an effect of psychotherapy of 0.4 * 0.08 = 0.032 LS points. So the effect of psychotherapy on a partner, relative to doubling income, is 73%. Applying this to the 86% CT spillover would get us a 63% spillover ratio for psychotherapy.
• You could compare income and mental health effects on wellbeing in other studies -- but I haven’t had time to do so (and I’m not really sure how informative this would be).
• Powdthavee & Vignoles (2008) found the effect of mother's distress in the previous period on children is 14% of the effect that the child’s own wellbeing in the previous period had on their present wellbeing. But it also seems to give weirdly small (and non-significant) coefficients for the effect of log-income on life satisfaction (0.054 for fathers, -0.132 for mothers).
• Early life exposure to a parent’s low mental health seems plausibly related to very long-term wellbeing effects through a higher likelihood of worse parenting (abuse, fewer resources to support the child; Zheng et al., 2021).
• I’m unsure if positive and negative shocks spill over in the same way. Negativity seems more contagious than positivity. For instance, in Hurd et al. (2014) the spillover effects of re-employment seemed smaller than the harms of unemployment. Also see Adam et al. (2014)[4] – I’m sure there’s much more on this topic.
This makes me think that it may not be wild to see a relatively smaller gap between the spillovers of cash transfers and psychotherapy than we might initially expect.

• Most of these studies are in HICs. It seems plausible that spillovers for any intervention could be different, and I suspect higher, in LMICs than HICs. I assume emotional contagion is a function of spending time together, and spending more time together seems likelier when houses are smaller (you can’t shut yourself in your room as easily), transportation is relatively more expensive, difficult, and dangerous – and you may have fewer reasons to go elsewhere. One caveat is that household sizes are larger, so there may be less time directly spent with any given household member, which could push in the other direction.

4. What’s next?

I think James and I probably agree that making sense of the observational evidence is tricky, to say the least, and a high quality RCT would be very welcome for informing our views and settling our disagreements.

I think some further insight could come sooner rather than later. As I was writing this, I wondered if there was any possibility of household spillovers in Barker et al. (2022), a recent study about the effects of CBT on the general population in Ghana that looked into community spillovers of psychotherapy (-33% the size of the treatment effect, but non-significant – but that's a problem for another time). In section 3.2 the paper reads: “At the endline we administered only the “adult” survey, again to both the household head and their spouse… In our analysis of outcomes, we include the responses of both adults in control households; in households where an individual received CBT, we only include treated individuals.” This means that, while Barker et al. didn’t look into it, we should be able to estimate the spousal mental health spillover of having your partner receive CBT. In further good news, the replication materials are public.
But I’ll leave this as a teaser while I try to figure out how to run the analysis.

1. ^

Why a 25% discount? I guess partners are likelier to share tendencies towards a given level of wellbeing, but I think this "birds of a feather" effect is smaller than the genetic effects. Starting from a 50% guess for genetic effects (noted in the next footnote), I thought that the assortative mating effects would be about half the magnitude, or 25%.

2. ^

How did I impute the parent/child effect? The study was ambiguous about the household relations being analysed. So I assumed that it was 50-50 parents and children, and that the spouse-to-spouse spillover was 1/4th that of the parent-to-child spillover.

3. ^

Why a 50% discount? There appears to be an obvious genetic factor between a parent’s and child’s levels of wellbeing that could confound these estimates. Jami et al. (2021) reviews ~30 studies that try to disentangle the genetic and environmental links in families’ affective mental health. My reading is that environmental (pure contagion) effects dominate for anxiety transmission, and genetic and environmental factors seem roughly balanced for depression. Since we mostly consider psychotherapy to treat depression, I only reference the depression results when coming up with the 50% figure.

4. ^

“When positive posts were reduced in the News Feed, the percentage of positive words in people's status updates decreased by B = −0.1% compared with control [t(310,044) = −5.63, P < 0.001, Cohen's d = 0.02], whereas the percentage of words that were negative increased by B = 0.04% (t = 2.71, P = 0.007, d = 0.001). Conversely, when negative posts were reduced, the percent of words that were negative decreased by B = −0.07% [t(310,541) = −5.51, P < 0.001, d = 0.02] and the percentage of words that were positive, conversely, increased by B = 0.06% (t = 2.19, P < 0.003, d = 0.008).”
Comment by JoelMcGuire on Most people endorse some form of 'eugenics' · 2023-02-23T00:31:31.989Z · EA · GW

The guess is based on a recent (unpublished, and I'm not sure I can cite it) survey that I think did the best job yet of eliciting people's views on the neutral point in three countries (two LMICs).

I agree it's a big ask to get people to use the exact same scales. But I find it reassuring that populations who we wouldn't be surprised to find having the best and worst lives tend to rate themselves as having about the best and worst lives that a 0 to 10 scale allows (Afghans at ~2/10 and Finns at ~8/10).

That's not to dismiss the concern. I think it's plausible that there are systematic differences in scale use (non-systematic differences would wash out statistically). Still, I think people self-reporting about their wellbeing is informative enough that we should find and fix the issues rather than give up.

For those somehow interested in this nerdy aside, for further reading see Kaiser & Oswald (2022) on whether subjective scales behave how we'd expect (they do), Plant (2020) on the comparability of subjective scales, and Kaiser & Vendrik (2022) on how threatened subjective scales are by divergences from the linearity assumption (not terribly).

Full disclosure: I'm a researcher at the Happier Lives Institute, which does cause prioritization research using subjective well-being data, so it's probably not surprising that I'm defending the use of this type of data.

Comment by JoelMcGuire on Most people endorse some form of 'eugenics' · 2023-02-22T20:31:04.521Z · EA · GW

Trying to hold onto the word “eugenics” seems to indicate an unrealistically optimistic belief in people’s capacity to tolerate semantics. Letting go is a matter of will, not reason.
E.g., I pity the leftist who thinks they can, in every conversation with a non-comrade, explain the difference between the theory of a classless society, the history of ostensibly communist regimes committing omnicide, and the hitherto unrealised practice of “real communism” (outside of a few scores of 20th-century Israeli villages and towns). To avoid the reverse problem when discussing “communist” regimes, I refer to “authoritarian regimes with command economies”. And I’m convinced it’s almost always better to go with “Social Democracy”. Who cares if no other word has caught on yet? Marketing is a great and powerful force (one EAs seem only dimly to understand). Use more words if you have to. The key point is that “it’s a good idea to avoid tying yourself to words whose most common use is associated with mass murder.”[1]

Turning to the example: I’d pray to Hedone that most EAs can read the room well enough to avoid making such arguments while we still have nuclear wars to stop, pandemics to prevent, diseases to cure, global poverty to stamp out, and many cheap and largely uncontroversial treatments for depression and everyday sadness we’ve yet to scale.

But assuming they did make that argument, I think the response to “That’s eugenics” should be something like: “No, eugenics is associated with stripping a group of people of their right to reproduce. I’m discussing supporting families to make choices about their children’s health. Screening is already supported for many debilitating health conditions because of the suffering they produce; I’m saying that we should provide that same support when the conditions that produce the suffering are mental rather than physical.”

But maybe a takeaway here is: “don’t feed the trolls”?

1. ^

Note: One response is, “we can’t give up on every word once it’s tainted by association with some unseemly set of disreputes.” And that’s fair.
For instance, I'm fine being associated with “Happiness Science” because its most common use is associated with social science into self-reported wellbeing, not a genocide-denying Japanese cult. The point is that the choice of association depends on what most people associate the word with. Language will always be more bottom-up than top-down, and seems much closer to a rowdy democracy than a sober technocracy.

Comment by JoelMcGuire on Most people endorse some form of 'eugenics' · 2023-02-22T19:36:45.739Z · EA · GW

A note on the "positive utility" bit. I am very uncertain about this. We don't really know where on subjective wellbeing scales people construe wellbeing to go from positive to negative. My best bet is around 2.5 on a 0 to 10 scale. This would indicate that ~18% of people in SSA or South Asia have lives with negative wellbeing if what we care about is life satisfaction (debatable). For the world, this means 11%, which is similar to MacAskill's guess of 10% in WWOTF. And insofar as happiness is separate from life satisfaction, it's very rare for a country, on average, to report being more unhappy than happy.

Comment by JoelMcGuire on Most people endorse some form of 'eugenics' · 2023-02-21T21:52:28.575Z · EA · GW

I haven't downvoted or read the post, but one explanation is that the title "You're probably a eugenicist" seems clickbaity and aimed at persuasion. It reads as ripe for plucking out of context by our critics. I immediately see it cited in the next major critique published by a major news org: "In upvoted posts on the EA forum, EAs argue they can have 'reasonable' conversations about eugenics."

One idea for dealing with controversial ideas is to A. use a different word and/or B. make it more boring. If the title read something like "Most people favor selecting for valuable hereditary traits", my pulse would quicken less upon reading.

Comment by JoelMcGuire on What does Putin’s suspension of a nuclear treaty today mean for x-risk from nuclear weapons?
· 2023-02-21T18:51:49.097Z · EA · GW

I don't see this as much of an update. Mutual inspections under the treaty haven't taken place for a year; it's basically already been suspended since the invasion. I would be more concerned if he formally withdrew, but he didn't even do that.

Comment by JoelMcGuire on Deworming and decay: replicating GiveWell's cost-effectiveness analysis · 2023-02-20T22:55:19.288Z · EA · GW

In retrospect, I think my reply didn't do enough to acknowledge that A. using a different starting value seems reasonable and B. this would lead to a much smaller change in cost-effectiveness for deworming. While very belated, I'm updating the post to note this for posterity.

Comment by JoelMcGuire on Immigration reform: a shallow cause exploration · 2023-02-20T18:43:51.781Z · EA · GW

I agree that advocacy for high-skilled immigration is more likely to succeed, and that the benefits would probably come more from technological and material progress. The problem is we currently aren't prepared to try to estimate the benefits of these society-wide and worldwide spillover effects. Maybe we will return to this if (big if) we explore policies that may cost-effectively increase GDP growth (which some argue equals tech progress in the long run?), and through that subjective wellbeing[1].

Regarding Malengo, I asked Johannes a few questions about it, and I'm referencing that post whenever I cite Malengo numbers. I didn't add it here because most of our work was already done when they wrote a post about it, because I was too lazy, and because it didn't look particularly promising in my initial estimates. However, I now notice that my previous calculations didn't consider remittances, which seems like an omission. As we discussed in the report, it's unclear how remittances balance the negative effects of separation from the immigrant, but I think that separation pains are less of a concern if it's a young adult leaving -- as that's pretty normal in many cultures.
So here's a BOTEC with remittances considered. As Johannes said, they expect that 64% of students will settle permanently in Germany (or a similar country) after graduating. I interpret this to imply an expected stay of 38.4 years, which, if moving closes the life-satisfaction gap between the countries by 2 points, will mean 2 * 38.4 = 76.8 WELLBYs per student sent. It costs $15,408[2] to fund a student to matriculate in Germany.

If we're only concerned with the student, this implies a cost-effectiveness of 76.8 / $15,408 = ~5 WELLBYs per $1,000, which is a bit less than the 8 WELLBYs per $1k that I estimate come from GiveDirectly cash transfers. But this excludes remittances. They expect Malengo participants to remit ~$2k a year. If we assume a 1-to-1 equivalence between $1k of remittances and GiveDirectly cash transfers, and assume this continues for 20 years, this would imply 40 * 8 = 320 additional WELLBYs generated by remittances. This boosts the cost-effectiveness of Malengo from ~0.6x GiveDirectly to ~3x GiveDirectly. However, I'm unsure that the equivalence assumption is warranted, or how long we could expect remittances to continue[3]. One potential concern is that the families that can send students to university (regardless of cost) are going to be much wealthier, so remittances will matter relatively less.

1. ^ Note that the relationship between GDP <> SWB is rather contested -- see previous work from HLI on the topic that argues that maybe GDP doesn't matter much for subjective wellbeing, and a reply post from Founder's Pledge that argues against that point using the same evidence. In the latter post there's also a quite interesting discussion between Vadim and an author of one of the key papers (Kelsey O'Connor).

2. ^ Malengo costs 12k euros to send a student, and they hope to get overhead down to 20% once they scale. Converting this into dollars implies ~$15k.

3. ^

This question should, in principle, be easy to gain traction on. We could use a survey of immigrants that asks whether they send remittances, and how long they've lived in the country they moved to.
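The BOTEC above can be sketched in a few lines. This is a rough illustration using the figures from this comment; the 1-to-1 remittance/cash-transfer equivalence and the 20-year remittance horizon are assumptions, as flagged in the text.

```python
# Rough sketch of the Malengo BOTEC above. All figures come from the comment;
# the remittance equivalence and 20-year horizon are assumptions.
expected_years = 38.4        # expected stay (interpreted from the 64% settlement figure)
ls_gain = 2                  # assumed life-satisfaction gain (0-10 scale points)
cost_per_student = 15_408    # USD to fund one student

student_wellbys = ls_gain * expected_years            # 76.8 WELLBYs per student
direct_per_1k = student_wellbys / (cost_per_student / 1000)

givedirectly_per_1k = 8      # HLI's estimate for GiveDirectly cash transfers
remittance_wellbys = 2 * 20 * givedirectly_per_1k     # $2k/yr for 20 yrs -> 320 WELLBYs

total_per_1k = (student_wellbys + remittance_wellbys) / (cost_per_student / 1000)

print(round(direct_per_1k, 1))                        # ~5.0 WELLBYs per $1k
print(round(total_per_1k / givedirectly_per_1k, 1))   # ~3.2x GiveDirectly
```

Note how sensitive the bottom line is to the remittance assumptions: dropping them returns the ~0.6x GiveDirectly figure.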

Comment by JoelMcGuire on Unjournal's 1st eval is up: Resilient foods paper (Denkenberger et al) & AMA ~48 hours · 2023-02-08T04:52:47.169Z · EA · GW

Comment by JoelMcGuire on Unjournal's 1st eval is up: Resilient foods paper (Denkenberger et al) & AMA ~48 hours · 2023-02-07T20:42:37.322Z · EA · GW

Hi David, I'm excited about this! It certainly seems like a step in the right direction. A few vague questions that I'm hoping you'll divine my meaning from:

• Maybe this is redundant with Gideon's question, but I'd like to press further on this. What is the "depth" of research you think is best suited for the Unjournal? It seems like the vibe is "at least econ working paper level of rigor". But a great deal of EA work is shallower, or more weirdly formatted, than a working paper. I.e., Happier Lives Institute reports are probably a bit below that level of depth (and we spend a lot more time than many others), and GiveWell's CEAs have no dedicated write-ups. Would either of these research projects be suitable for the Unjournal?
•  It seems like the answer is no and you're focusing on more conventional academic work. I take the theory of change here to be, as you say, "make rigorous work more impactful". But that seems to rely on getting institutional buy-in from academia, which sounds quite uphill. An alternative path is "make impactful (i.e., EA) work more rigorous". I'd guess there's already a large appetite for this in EA. How do you see the tradeoffs here?
• I expect that once there's been quite a few unjournal reviews, that people will attempt to compare scores across projects. I.e., I can imagine a world in which a report / paper receives a 40/100 and people  point out "This is below the 60/100 of the average article in the same area". How useful do you expect such comparisons to be? Do you think there could / should be a way to control for the depth of the analysis?
Comment by JoelMcGuire on Shallow investigation: Loneliness · 2023-02-06T19:46:52.638Z · EA · GW

I wonder if a positive step would be to raise the retirement age?

This sounds plausible. I wonder if people's attitudes towards retirement reflect a huge affective forecasting error, where they think it'll be sublime, but it ends up isolating and boring (like school breaks were for me as a kid).

Anecdotally, children and grandchildren seem to form an increasing fraction of people's social networks as they age, because they attrit much more slowly than friends or colleagues do (assuming you're not a terrible parent).

But I wonder if that attrition isn't related to parenthood. I haven't had kids yet but my friends seem to drop off the map socially as they become parents. It's sort of concerning. Having children also seems pretty isolating to many living in environments that cater to nuclear families.

Surely there's evidence on this question as it pertains to loneliness. I know that having grandchildren is clearly good for subjective wellbeing, but the evidence for the effects of parenthood in general is much more mixed/ambiguous.

Comment by JoelMcGuire on Shallow investigation: Loneliness · 2023-02-06T19:36:20.110Z · EA · GW

I appreciate this post! Loneliness is something I think about often because, alongside mental health issues, it appears to be one of the things that's relatively worse for people's subjective wellbeing than, say, a low income.

That being said, it was always unclear what can be done, and this review doesn't seem to suggest there's a frontrunner in terms of interventions.

•  I'm puzzled that so many interventions involve 1-on-1 interactions. This doesn't seem scalable, or in the spirit of, well, decreasing loneliness.
• Group interventions seem more promising but there appears to be less evidence.
• I wonder if the reason why group-psychotherapy in some cases appears better than 1 on 1 therapy is because of an added loneliness reduction bonus.
• I'd also be curious to hear what ideas there are for new interventions / RCTs.
• Intergenerational cohabitation seems like a plausible way to solve two problems at once. Older people are lonelier and have more housing. Younger people don't have much housing. Why not connect them? I'd be curious to see the results for an RCT of a matchmaking programme.
• I feel like there are obvious interventions no one's tested: try to connect lonely people to one another, perhaps in a group setting with a facilitator? How much does a website/app like Meetup already do this?
• I'd also be interested to know if there are any quasi-experiments that could be studied related to larger scale interventions. I.e., does the walkability of a city increase wellbeing through decreases in loneliness? What about more social clubs?

A question I'm curious about is what are the biggest barriers to lonely people going out and making friends on their own? Is it transportation? Cost? Unclear where to go? Church used to be the easy button, but many people aren't religious anymore and we don't have a clear substitute. Why isn't this a problem that markets can solve?

What about digital interventions? Some people seem to be content with forming and maintaining relationships through a digital medium (e.g., online gaming).

I'd also really like to know how common loneliness is in low income countries, and how the barriers differ towards forming more positive social bonds.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-02-05T20:32:56.189Z · EA · GW

Gregory,

Thank you for pointing out two errors.

• First, the coding mistake with the standard error correction calculation.
• Second, and I didn't pick this up in the last comment, that the CT effect size change calculations were all referencing the same model, while the PT effect size changes were referencing their non-publication-bias analogs.

______________________________________________________

After correcting these errors, the picture does shift a bit, but the quantitative changes are relatively small.

Here are the results where only the change due to the publication bias adjustment alters the cost-effectiveness comparison. More of the tests indicate a downwards correction, and the average / median test now indicates an adjustment from 9.4x to 8x. However, when we remove all adjustments that favor PT in the comparison (models 19, 25, 23, 21, 17, 27, 15), the (average / median) ratio of PT to CT is now (7x / 8x). This is the same as it was before the corrections.

Note: I added vertical reference lines to mark the 3x, 7x and 9.44x multiples.

Next, I present the changes where we include the model choices as publication bias adjustments (e.g., any reduction in effect size that comes from using a fixed effect model or removing outliers is counted against PT -- Gregory and Ryan support this approach; I'm still unsure, but it seems plausible and I'll read more about it). The mean / median adjustment leads to a 6x / 7x comparison ratio. Excluding all PT-favorable results leads to an average / median correction of 5.6x / 5.8x, slightly below the 6x I previously reported.

Note: I added vertical reference lines to mark the 3x, 7x and 9.44x multiples.

Since the second approach bites into the cost-effectiveness comparison more and to a degree that's worth mentioning if true, I'll read more / raise this with my colleagues about whether using fixed effect models / discarding outliers are appropriate responses to suspicion of publication bias.

If it turns out this is a more appropriate approach, then I should eat my hat re:

My complete guess is that if StrongMinds went below 7x GiveDirectly we'd qualitatively soften our recommendation of StrongMinds and maybe recommend bednets to more donors.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-02-04T23:16:36.449Z · EA · GW

I may respond later after I’ve read more into this, but briefly — thank you! This is interesting and something I’m willing to change my mind about it. Also didn’t know about WAAP, but it sounds like a sensible alternative.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-02-04T20:38:33.674Z · EA · GW

I will try and summarise and comment on what I think are some possible suggestions you raise, which happen to align with your three sections.

1. Discard the results that don't result in a discount to psychotherapy [1]

If I do this, the average comparison of PT to CT goes from 9.4x --> 7x. That seems like a plausible correction, but I'm not sure it's the one I should use. I interpreted these results as indicating that none of the tests give reliable results. I'll quote myself:

I didn’t expect this behaviour from these tests. I’m not sure how I should update on using them in the future. I assume the likeliest issue is that they are unreliable in the conditions of these meta-analyses (high heterogeneity). Any thoughts on how to correct for publication bias in future analyses is welcome!

I'm really unsure if 9.4x --> 7x is a plausible magnitude of correction. A perfect test could suggest a greater or smaller correction; I'm really uncertain given the behavior of these tests. That leaves me scratching my head at what principled choice to make.

I think if we had discussed this beforehand and I had said, "Okay, you've made some good points, I'm going to run all the typical tests and publish their results", would you have advised me to not even try, and instead make ad hoc adjustments? If so, I'd be surprised, given that's the direction I've taken you to be arguing I should move away from.

2. Compare the change of all models to a single reference value of 0.5 [2]

When I do this, and again remove anything that doesn't produce a discount for psychotherapy, the average correction leads to a 6x cost-effectiveness ratio of PT to CT. This is a smaller shift than you seem to imply.

3. Fix the weighting between the general and StrongMinds specific evidence [3].

Gregory is referring to my past CEA of StrongMinds in Guesstimate, where assigning an effect size of 0 to the meta-analytic results only brings StrongMinds' cost-effectiveness down to 7x GiveDirectly. While such behavior is permissible in the model, obviously if I thought the effect of psychotherapy in general was zero or close to it, I would throw my StrongMinds CEA in the bin.

As I noted in my previous comment discussing the next version of my analysis, I said: " I expect to assign relatively lower weight to the StrongMinds specific evidence." To elaborate, I expect the effect estimate of StrongMinds to be based much more heavily on the meta-analytic results. This is something I already said I'd change.

I'll also investigate different ways of combining the charity-specific and general evidence. E.g., a model that pins the estimated StrongMinds effects relative to the general evidence. Say the effects of StrongMinds are always 5% higher; then if we reduce the effect of psychotherapy from 0.5 to 0.1, the estimate for StrongMinds would go from 0.525 to 0.105.

So what happens if we assign 100% of the weight to the meta-analytic results? The results would shrink by 20% [4]. If we apply this to the cost-effectiveness ratio that I so far think Gregory would endorse as the most correct (6x), this would imply a ~ 5x figure.
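The arithmetic behind that ~5x figure (footnote 4's ratio applied to the 6x estimate) is just:

```python
# Footnote 4's ratio: raw total effect of general psychotherapy (1.56, table 1)
# versus StrongMinds (1.92, end of section 4). Weighting the meta-analytic
# results 100% shrinks the StrongMinds estimate by this ratio.
general_effect = 1.56
strongminds_effect = 1.92

shrink = general_effect / strongminds_effect   # 0.8125, i.e. a ~20% reduction
adjusted_ratio = 6 * shrink                    # applied to the 6x figure

print(round(shrink, 4))          # 0.8125
print(round(adjusted_ratio, 1))  # 4.9, i.e. the ~5x figure above
```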

Is a reduction of 9.4x to 5x enough to make HLI pause its recommendation? As I said before:

Aside: Given that this appears to be the worst case scenario [a reduction to 3x], I’m not actually sure this would mean we drop our recommendation given that we haven’t found anything clearly better yet (see our analyses of anti-malarial bednets and deworming). I think it’s likelier that we would end up recommending anti-malaria bed nets to those with sympathetic philosophical views.

Gregory rightly pointed out that we haven't made it clear what sort of reduction would result in us abandoning our recommendation of StrongMinds. I can't speak for the team, but for me this would definitely be if it was less than 1x GiveDirectly. The reason why this is so low is I expect our recommendations to come in grades, and not just a binary. My complete guess is that if StrongMinds went below 7x GiveDirectly we'd qualitatively soften our recommendation of StrongMinds and maybe recommend bednets to more donors. If it was below 4x we'd probably also recommend GiveDirectly. If it was below 1x we'd drop StrongMinds. This would change if / when we find something much more (idk: 1.5-2x?) cost-effective and better evidenced than StrongMinds.

However, I suspect this is beating around the bush -- as I think the point Gregory is alluding to is "look at how much their effects appear to wilt with the slightest scrutiny. Imagine what I'd find with just a few more hours."

If that's the case, I understand why -- but that's not enough for me to reshuffle our research agenda. I need to think there's a big, clear issue now to ask the team to change our plans for the year. Again, I'll be doing a full re-analysis in a few months.

4. Use a fixed effects model instead?

I'm treating this as a separate point because I'm not sure if this is what Gregory suggests. While it's true that fixed effects models are less sensitive to small studies with large effects, fixed effects models are almost never used. I'll quote Harrer et al., (2021) again (emphasis theirs):

In many fields, including medicine and the social sciences, it is therefore conventional to always use a random-effects model, since some degree of between-study heterogeneity can virtually always be anticipated. A fixed-effect model may only be used when we could not detect any between-study heterogeneity (we will discuss how this is done in Chapter 5) and when we have very good reasons to assume that the true effect is fixed. This may be the case when, for example, only exact replications of a study are considered, or when we meta-analyze subsets of one big study. Needless to say, this is seldom the case, and applications of the fixed-effect model “in the wild” are rather rare.

I'm not an expert here, but I'm hesitant to use a fixed effects model for these reasons.

1. ^

"However, for these results, the ones that don't make qualitative sense should be discarded, and the key upshot should be: "Although a lot of statistical corrections give bizarre results, the ones which do make sense also tend to show significant discounts to the PT effect size".

2. ^

"This is the case because the % changes are being measured, not against the single reference value of 0.5 in the original model, but the equivalent model in terms of random/fixed, outliers/not, etc. but without any statistical correction technique."

3. ^

"On the guestimate, setting the meta-regressions to zero effect still results in ~7x multiples for Strongminds versus cash transfers."

4. ^

We estimate the raw total effects of general psychotherapy to be 1.56 (see table 1) and 1.92 for StrongMinds (see end of section 4, page 18). 1.56/ 1.92 = 0.8125. The adjusted effects are smaller but produce a very similar ratio (1.4 & 1.7, table 2).

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-02-03T18:30:29.126Z · EA · GW

Here’s my attempt to summarise some of the discussion that Ryan Briggs and Gregory Lewis instigated in the comments of this post, and the analyses it prompted on my end – as requested by Jason [Should I add this to the original post?]. I would particularly like to thank Gregory for his work replicating my analyses and raising several painful but important questions for my analysis. I found the dialogue here very useful, thought provoking, and civil -- I really want to thank everyone for making the next version of this analysis better.

## Summary

• The HLI analysis of the cost-effectiveness of StrongMinds relies not on a single study, but on a meta-analysis of multiple studies.
• Regarding this meta-analysis, some commenters (Ryan Briggs and Gregory Lewis) pointed out our lack of a forest plot and funnel plot, a common feature of meta-analyses.
• Including a forest plot shows some outlier studies with unusually large effects, and a wide variance in the effects between studies (high heterogeneity).
• Including a funnel plot shows evidence that there may be publication bias. Comparing this funnel plot to the one for cash transfers makes the diagnosis of publication bias appear worse in psychotherapy than cash transfers.
• My previous analysis employed an ad-hoc and non-standard method for correcting for publication bias. It suggested a smaller (15%) downward correction to psychotherapy’s effects than what some commenters (Ryan and Gregory) thought that a more standard approach would imply (50%+).
• Point taken, I tried to throw the book at my own analysis to see if it survived. Somewhat to my surprise, it seemed relatively unscathed.
• After applying the six standard publication bias correction methods to both the cash transfer and psychotherapy datasets in 42 different analyses, I found that, surprisingly:
• About half the tests increase the effectiveness of psychotherapy relative to cash transfers, and the average test suggests no adjustment.
• Only four tests decrease the cost-effectiveness ratio of psychotherapy to cash transfers from 9.4x to below 7x.
• The largest reduction brings psychotherapy from 9.4x to 3.1x as cost-effective as GiveDirectly cash transfers. It's based on the oldest correction method, trim and fill.
• I have several takeaways.
• I didn’t expect this behaviour from these tests. I’m not sure how I should update on using them in the future. I assume the likeliest issue is that they are unreliable in the conditions of these meta-analyses (high heterogeneity). Any thoughts on how to correct for publication bias in future analyses is welcome!
• Given the ambivalent results, it doesn't seem like any action is urgently needed (i.e., immediately pause the StrongMinds recommendation).
• However, this discussion has raised my sense of the priority of doing the re-analysis of psychotherapy and inspired me to do quite a few things differently next time. I hope to start working on this soon (but I don’t want to promise dates).
• I'm not saying "look, everything is fine!". I should have investigated publication bias more thoroughly in the original analysis. The fact that, now that I've done so, it doesn't appear to suggest substantial changes to my analysis is probably due more to luck than to a nose for corners I can safely cut.

## 1. The story so far

In the comments, Ryan Briggs and Gregory Lewis have pointed out that my meta-analysis of psychotherapy omits several typical and easy-to-produce figures: forest plots and funnel plots. A forest plot shows the individual study effects alongside the pooled overall effect. If I had included this, it would have shown two things.

First, that there is quite a bit of variation in the effects between studies (i.e., heterogeneity). What heterogeneity implies is a bit controversial in meta-analyses, and I’ll return to this, but for now I’ll note that some take the presence of high heterogeneity as an indication that meta-analytic results are meaningless. At the other end of professional opinion, other experts think that high heterogeneity is often inevitable and merely warrants prudence. However, even the most permissive towards heterogeneity think that it makes an analysis more complicated.

The second thing the forest plot shows is that there were a few considerable outliers. Notably, some of these outliers (Bolton et al., 2003; Bass et al., 2006) are part of the evidence I used to estimate that StrongMinds is more cost-effective than the typical psychotherapy intervention in LMICs. The other figure I omitted was a funnel plot. Funnel plots show whether there are many more small studies finding large effects, relative to small studies finding small, null, or negative effects, than we would expect from a random draw. In the funnel plots for the psychotherapy data, which Gregory first produced using a version of the data I use, he rightly pointed out that there is considerable asymmetry, which suggests there may be publication bias (i.e., small studies that find small, null, or negative effects are less likely to be published and included than small studies with larger effects). This finding seemed all the more concerning given that I found pretty much no asymmetry in the cash transfers data I compare psychotherapy to.
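For intuition, funnel-plot asymmetry is often formalised as a regression of the standardized effect on precision (the logic behind Egger's test): an intercept well away from zero flags asymmetry. A toy sketch on simulated data -- not the actual study data:

```python
import numpy as np

# Toy illustration of an Egger-style asymmetry check on SIMULATED data:
# regress the standardized effect (effect/SE) on precision (1/SE).
rng = np.random.default_rng(0)
se = rng.uniform(0.05, 0.5, 40)          # hypothetical standard errors
true_effect = 0.2
bias = 1.5                               # small studies report inflated effects
effect = true_effect + bias * se + rng.normal(0, se)

z = effect / se
precision = 1 / se
X = np.column_stack([np.ones_like(precision), precision])
intercept, slope = np.linalg.lstsq(X, z, rcond=None)[0]

print(round(intercept, 2))  # lands near the simulated bias term (1.5): asymmetry
print(round(slope, 2))      # recovers something near the true effect (0.2)
```

With no small-study bias (`bias = 0`), the fitted intercept hovers around zero and the funnel looks symmetric.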

I supplemented this with a newer illustration, the p-curve, meant to detect publication bias that’s not just about the size of an effect, but its precision. The p-curve suggests publication bias if there’s an uptick in the number of effect sizes near the 0.05 significance level relative to the 0.03 or 0.04 level. The idea is that researchers are inclined to fiddle with their specifications until they are significant, but that they’re limited in their ambitions to perform questionable research practices and will tend to push them just over the line. The p-curve for psychotherapy shows a slight uptick near the 0.05 level, compared to none in cash transfers. This is another sign that the psychotherapy evidence base appears to have more publication bias than cash transfers.
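The p-curve logic described above can be illustrated with a toy count (hypothetical p-values, purely for illustration -- not the study data):

```python
# Toy p-curve check: among significant results, an excess of p-values just
# under .05 relative to the .03-.04 band is the "uptick" described above.
p_values = [0.001, 0.008, 0.012, 0.021, 0.033, 0.041, 0.044, 0.046, 0.048, 0.049]

just_under_05 = sum(0.04 < p <= 0.05 for p in p_values)
mid_band = sum(0.03 < p <= 0.04 for p in p_values)

print(just_under_05, mid_band)  # 5 1 -- the pile-up near .05 is the warning sign
```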

Ryan and Gregory rightly pushed me on this – as I didn’t show these figures that make psychotherapy look bad. I have excuses, but they aren’t very good so I won’t repeat them here. I think it’s fair to say that these could have and should have been included.

The next, and most concerning point that Ryan and Gregory made was that if we take the Egger regression test seriously (a formal, less eye-bally way of testing for funnel plot asymmetry), it’d indicate that psychotherapy’s effect size should be dramatically reduced[1]. This frankly alarmed me. If this was true, I potentially made a large mistake [2].

## 2. Does correcting for publication bias substantially change our results?

To investigate this I decided to look into the issue of correcting for publication bias in more depth. To do so I heavily relied upon Harrer et al. (2021), a textbook for doing meta-analyses in R.

My idea for investigating this issue would be to go through every method for correcting publication bias mentioned in Harrer et al. (2021) and show how these methods change the cash transfers to psychotherapy comparison. I thought this would be more reasonable than trying to figure out which one was the method to rule them all. This is also in line with the recommendations of the textbook “No publication bias method consistently outperforms all the others. It is therefore advisable to always apply several techniques…” For those interested in an explanation of the methods, I found Harrer et al. (2021) to be unusually accessible. I don’t expect I’ll do better.

One issue is that these standard approaches don't seem readily applicable to the models we used. Our models are unusual in that they are 1. meta-regressions, where we try to explain the variation in effect sizes using study characteristics like time since the intervention ended, and 2. multi-level meta-analyses that attempt to control for the dependency introduced by adding multiple timepoints or outcomes from a single study. It doesn't seem like you can easily plug these models into the standard publication bias methods. Because of this uncertainty, we ran several different types of analyses (see details in 2.1) based on whether a model included the full data, excluded outliers or follow-ups, or used a fixed or random effects estimator[3].

With the help of my colleague Samuel[4], I ran the corrections for both psychotherapy and cash transfers and then applied the percentage change to their cost-effectiveness comparison. It doesn't seem principled to only run these corrections on psychotherapy. Even though the problem seems worse in psychotherapy, I think the appropriate thing to do is also run these corrections on the cash transfers evidence and see if the correction is greater for psychotherapy.

If you want to go straight to the raw results, I collected them in a spreadsheet that I hope is easy to understand. Finally, if you’re keen on replicating this analysis, we’ve posted the code we used here.

### 2.1 Model versions

Measures of heterogeneity and publication bias seem to be designed for simpler meta-analysis models than those we use in our analysis. We use a meta-regression with follow-up time (and sometimes dosage), so the estimate of the intercept is affected by the coefficients for time and other variables. Reading through Harrer et al. (2021) and a brief google search didn’t give us much insight as to whether these methods could easily apply to a meta-regression model. Furthermore, most techniques presented by Harrer et al. (2021) used a simple meta-analysis model which employed a different set of R functions (metagen rather than the rma.uni or rma.mv models we use).

Instead, we create a simple meta-analysis model to calculate the intercept for psychotherapy and for cash. We then apply the publication bias corrections to these models and get the % change this created. We then apply the % change of the correction to the effect for psychotherapy and cash and obtain their new cost-effectiveness ratio.

Hence, we are not using the model we directly use in our analysis, but we apply to our analysis the change in effectiveness that the correction method would produce on a model appropriate for said correction method.
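As a sketch, the procedure above amounts to something like the following (the function and the example inputs are illustrative, not HLI's actual code or results):

```python
# Illustrative sketch of the correction procedure described above: take the %
# change a publication-bias method produces on a simple meta-analytic model for
# each literature, then apply those changes to the original PT/CT ratio.
def corrected_ratio(base_ratio, pt_change, ct_change):
    """Apply each side's proportional correction to the PT-to-CT ratio."""
    return base_ratio * (1 + pt_change) / (1 + ct_change)

# e.g. a hypothetical method that shrinks PT's intercept by 30% and CT's by 15%:
print(round(corrected_ratio(9.44, -0.30, -0.15), 1))  # 7.8
```

This is why a correction can even raise the ratio: if a method shrinks the cash transfers intercept more than psychotherapy's, the PT/CT multiple goes up.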

Because of our uncertainty, we ran several different types of analyses based on whether a model included the full data[5] or excluded outliers[6] or follow-ups[7] or used a fixed or random effects estimator[8].

### 2.2 Results

The results of this investigation are shown below. Tests that are to the left of the vertical line represent decreases in the cost-effectiveness of psychotherapy relative to cash transfers. The reference models are the six right on the line (in turquoise).  I’ll add further commentary below.

Details of the results can be seen in this spreadsheet. We removed tests 28, 29, 30, 34, 35, and 36, which were generally favourable to psychotherapy. These were p-curve and Rücker’s limit corrections that we specified as fixed-effects models but that appear to have been forced into random-effects models, making their inclusion seem inappropriate[9].

Surprisingly, when we apply these tests, very few dramatically reduce the cost-effectiveness of psychotherapy compared to cash transfers, as indicated by changes to their intercepts / the estimated average overall effect.

• Only four tests show a decrease below 7x for PT.
• The largest correction (using trim and fill) reduces psychotherapy from 9.4x to 3.1x as cost-effective as GiveDirectly cash transfers.
• Aside: Given that this appears to be the worst case scenario, I’m not actually sure this would mean we drop our recommendation given that we haven’t found anything clearly better yet (see our analyses of anti-malarial bednets and deworming). I think it’s likelier that we would end up recommending anti-malaria bed nets to those with sympathetic philosophical views.
• The trim and fill and selection models are the ones most consistently negative for psychotherapy. However, trim and fill is the oldest (most outdated?) of these methods and seems to be the least recommended (Harrer et al. (2021) say it is “often outperformed by other methods”). The PET and PEESE models tend to actually make psychotherapy look even better compared to cash transfers.
• Surprisingly, many tests increase the cost-effectiveness ratio in favour of psychotherapy!

### 2.3 Uncertainties

• A big problem is that most of these tests are sensitive to heterogeneity, so we’re left with a relatively high level of uncertainty in interpreting these results. Are the differences between the smallest and the most negative updates due to heterogeneity? I’m not sure.
• This should partially be alleviated by adding in the tests with the outliers removed, but while this reduces heterogeneity a bit (PT I^2: 95% → 56%, CT I^2: 75% → 25%), it’s still relatively high.
• Further, the largest downwards adjustment that involves removing outliers is from 9.4x → 7.5x.
• It’s unclear whether these publication bias adjustments would differentially affect estimates of the decay rate of the benefit. Our analysis was about the average effect (i.e., the intercept); how publication bias should affect the estimated decay rate of psychotherapy (or cash transfers) is a separate question.

### 2.4 A note about heterogeneity

Sometimes it’s suggested that the high heterogeneity in a meta-analysis means it is impossible to interpret (see details of heterogeneity in my analyses in this spreadsheet). Whilst heterogeneity is important to report and discuss, we don’t think it disqualifies this analysis.

Moreover, high levels of heterogeneity appear to be a common problem with meta-analyses; it’s unclear that this is uniquely a problem with our meta-analysis of psychotherapy. In their big meta-analysis of psychotherapy, Cuijpers et al. (2023; see Table 2) also find high levels of heterogeneity. Our cash transfer meta-analysis also has high (albeit lower than psychotherapy) levels of heterogeneity.

High heterogeneity would be very problematic if it meant the studies are so different they are not measuring the same thing. Alternative explanations are that (1) psychotherapy is a phenomenon with high variance (supported by similar findings of psychotherapy in HICs), and/or (2) studies about psychotherapy in LMICs are few and implemented in different ways, so we expect this data is going to be messy.

## 3. Next Steps

• 4 tests suggest psychotherapy is 3-7x cash, 8 tests suggest it is 7-9.4x cash, and 18 tests suggest it is 9.4x cash or more. Because of the ambiguous nature of the results, I don’t plan on doing anything immediately, like suggesting we pause the StrongMinds recommendation.
• However, this analysis and the surrounding discussion has updated me on the importance of updating and expanding the psychotherapy meta-analysis sooner. Here are some things I’d like to commit to:
• Do a systematic search and include all relevant studies, not just a convenience sample.
• Seriously consider stricter inclusion criteria. And if we don’t adopt them, perform more subset analyses and communicate them more clearly.
• Include more analyses that include dosage (how many hours in session) and expertise of the person delivering the therapy.
• Include better data about the control group, paying special attention to whether the control group could be considered as receiving a high-quality, low-quality, or no placebo.
• In general, include and present many more robustness checks.
• Add an analogous investigation of publication bias like the one performed here.
• Make our data freely available and our analysis easily replicable at the time of publication.
• Am I missing anything?
• After updating the psychotherapy meta-analysis we will see how it changes our StrongMinds analysis.
• I also expect to make a couple of changes to that analysis[10], hopefully incorporating the new Baird et al. RCT. Note that if it comes soon and its results strongly diverge from our estimates, this could also expedite our re-analysis.
1. ^

Note that the Egger regression is a diagnostic test, not a form of correction. However, the PET and PEESE methods are correction methods and are quite similar in structure to the Egger regression test.

2. ^

Point taken that the omission is arguably a non-trivial mistake.

3. ^

Choosing a fixed or random effects model is another important and controversial question in modelling meta-analysis and we wanted to test whether the publication bias corrections were particularly sensitive to it. However, it seems like our data is not suitable to the assumptions of a fixed effects model – and this isn’t uncommon. As Harrer et al., (2021) say: “In many fields, including medicine and the social sciences, it is therefore conventional to always use a random-effects model, since some degree of between-study heterogeneity can virtually always be anticipated. A fixed-effect model may only be used when we could not detect any between-study heterogeneity (we will discuss how this is done in Chapter 5) and when we have very good reasons to assume that the true effect is fixed. This may be the case when, for example, only exact replications of a study are considered, or when we meta-analyze subsets of one big study. Needless to say, this is seldom the case, and applications of the fixed-effect model “in the wild'' are rather rare.”

4. ^

If my analyses are better in the future, it's because of my colleague Samuel Dupret. Look at the increase in quality between the first cash transfer and psychotherapy reports and the household spillover report. That was months apart. You know what changed? Sam.

5. ^

The same data we use in our full models.

6. ^

Some methods are not robust to high levels of heterogeneity, which is more often present when there are outliers. We select outliers for the fixed and random effects models based on the “‘non-overlapping confidence intervals’ approach, in which a study is defined as an outlier when the 95% confidence interval (CI) of the effect size does not overlap with the 95% CI of the pooled effect size” (Cuijpers et al., 2023; see Harrer et al., 2021 for a more detailed explanation).
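The rule in the quote above can be sketched in a few lines (all effect sizes and standard errors below are made up for illustration):

```python
# "Non-overlapping confidence intervals" outlier rule: a study is an
# outlier if its 95% CI does not overlap the 95% CI of the pooled effect.

def ci95(effect, se):
    return (effect - 1.96 * se, effect + 1.96 * se)

def is_outlier(study_effect, study_se, pooled_effect, pooled_se):
    lo, hi = ci95(study_effect, study_se)
    pooled_lo, pooled_hi = ci95(pooled_effect, pooled_se)
    return hi < pooled_lo or lo > pooled_hi

# Pooled effect g = 0.50 (SE 0.05): an imprecise study at g = 0.90
# (SE 0.25) still overlaps, but a precise one at g = 1.60 (SE 0.20) does not.
print(is_outlier(0.90, 0.25, 0.50, 0.05))  # False
print(is_outlier(1.60, 0.20, 0.50, 0.05))  # True
```

Note that under this rule a very imprecise study is rarely flagged, since its wide CI tends to overlap the pooled CI.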

7. ^

We are concerned that these methods are not made with the assumption of a meta-regression and might react excessively to the follow-up data (i.e., effect sizes other than the earliest effect size collected in a study), which are generally smaller effects (because of decay) with smaller sample sizes (because of attrition).

8. ^

Choosing a fixed or random effects model is another important and controversial question in modelling meta-analysis and we wanted to test whether the publication bias corrections were particularly sensitive to it. However, it seems like our data is not suitable to the assumptions of a fixed effects model – and this isn’t uncommon. As Harrer et al., (2021) say: “In many fields, including medicine and the social sciences, it is therefore conventional to always use a random-effects model, since some degree of between-study heterogeneity can virtually always be anticipated. A fixed-effect model may only be used when we could not detect any between-study heterogeneity (we will discuss how this is done in Chapter 5) and when we have very good reasons to assume that the true effect is fixed. This may be the case when, for example, only exact replications of a study are considered, or when we meta-analyze subsets of one big study. Needless to say, this is seldom the case, and applications of the fixed-effect model “in the wild'' are rather rare.”

9. ^

The only tests that are different from the random effects ones are 32 and 38 because the list of outliers were different for fixed effects and random effects.

10. ^

I expect to assign relatively lower weight to the StrongMinds-specific evidence. I was leaning in this direction since the summer, but these conversations -- particularly the push from Simon -- hardened my views. This change would decrease the cost-effectiveness of StrongMinds. Ideally, I’d like to approach the aggregation of the StrongMinds-specific and general evidence of lay-group psychotherapy in LMICs in a more formally Bayesian manner, but this would come with many technical difficulties. I will also look into the counterfactual impact of their scaling strategy, where they instruct other groups in how to provide group psychotherapy.

Comment by JoelMcGuire on Pain relief: a shallow cause exploration · 2023-01-30T20:18:45.857Z · EA · GW

Hi Jon, I'm also concerned that subjective wellbeing measures may lose interpersonal comparability when someone experiences extreme suffering.

But it's not clear to me from the Lancet report  or your description of the alternatives how we'd measure / construct them. Would it be like DALYs or QALYs? Would we ask the public to make time tradeoffs? Would we ask people with painful conditions to make tradeoffs or evaluate their experience?

I'm more open to alternatives here than I usually am, but I'd be very surprised if the best measure didn't ask the people with extreme suffering (or barring that, those close to them) about their experiences. This is because extreme suffering seems almost by definition something someone doesn't understand without having the misfortune to live with/despite it.

(One minor point: it wasn't clear to me why going from 0 to 10 on a pain scale represents an 11-point change.)

Oops! You're right. An 11-point scale (0 to 10) can only afford at most a 10 unit change.

Comment by JoelMcGuire on Pain relief: a shallow cause exploration · 2023-01-30T19:46:41.602Z · EA · GW

Nick,

Do you have any ideas for how one would most cost-effectively treat arthritis pain in a low income country?

On access to NSAIDS, I don't think access is that bad in lower income countries. NSAIDs are in every drug shop in every village here, but the issue is with HOW they are given.

Okay, this is good to know. Ditto with the bit about gastritis and peptic ulcers.

I also don't really know what you mean by "palliative care centers" in low income countries

We weren't sure what type of facilities had the biggest supply problems that could be ameliorated. Good to know that palliative care units may have relatively more access problems.

Migraines are not super common, difficult to diagnostically separate from other conditions, and specific drugs  for them are very expensive. My fairly strong instinct is that  can't imagine this ever being a high impact area of intervention.

Could you explain a bit more? This doesn't seem entirely right, but you have more expertise here.

• It seems like they're relatively common. In Sharma et al., (2020) they say: "We examine two headache disorders: migraines and cluster headaches. The former are common, affecting around one in six people... Disability-Adjusted Life Years lost in 2019, according to the Global Burden of Disease, were 42 million for migraines, 46 million for malaria, and 47 million for depression (source)."
• I don't know anything about diagnosing migraines, so I'll trust you there.
• I get that specific migraine drugs are very expensive, but we specifically mentioned using NSAIDs, which seem relatively effective "Common NSAIDs can eliminate most pain from migraines in half of all cases: aspirin, 52% (Kirthi et al., 2013) or ibuprofen, 57% (Rabbie et al., 2013)."

I know these are only shallow investigations, but if you don't have access to experts on pain in LMIC countries (which probably exist, especially in the palliatve care field) I think you could potentially even save time early in your investigation process by talking to a few medical professionals in lower income countries to get more ideas and screen out a couple of these interventions.

Fair. Reaching out and talking to experts is not something we emphasized in these reports. The short calendar window for completing these reports made this difficult. Could we reach out to you if / when we look further into these topics?

Comment by JoelMcGuire on Pain relief: a shallow cause exploration · 2023-01-30T16:18:21.826Z · EA · GW

Sam and I tracked a total of 143 hours for the research and writing of this report. We probably spent a bit more time copy-editing and preparing for publication.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-26T19:07:12.325Z · EA · GW

Fair point! I'll try to summarize things from my perspective once things have settled a bit more.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-25T20:25:19.807Z · EA · GW

Hi Gregory, I wanted to respond quickly on a few points. A longer response about what I see as the biggest issue (is our analysis overestimating the effects of psychotherapy and StrongMinds by <= 2x?) may take a bit longer as I think about this and run some analyses as wifi permits (I'm currently climbing in Mexico).

This is really useful stuff, and I think I understand where you're coming from.

I'd take this episode as a qualified defence of the 'old fashioned way of doing things'.

FWIW, as I think I've expressed elsewhere, I think I went too far trying to build a newer better wheel for this analysis, and we've planned on doing a traditional systematic review and meta-analysis of psychotherapy in LMICs since the fall.

• It is also odd to have an extensive discussion of publication bias (up to and including ones own attempt to make a rubric to correct for it) without doing the normal funnel plot +/- tests for small study effects.

I get it, and while I could do some more self-flagellation over my former hubris at pursuing this rubric, I'll temporarily refrain and point out that small study effects were incorporated as a discount against psychotherapy -- they just didn't end up being very big.

• Even if you didn't look for it, metareg in R will confront you with heterogeneity estimates for all your models in its output (cf.). One should naturally expect curiosity (or alarm) on finding >90% heterogeneity, which I suspect stays around or >90% even with the most expansive meta-regressions. Not only are these not reported in the write-up, but in the R outputs provided (e.g. here) these parts of the results have been cropped out.

But it doesn't do that if you 1. aren't using metareg or 2. are using multi-level models. Here's the full output from the metafor::rma.mv() call I was hiding.

It contains a Q test for heterogeneity, which flags statistically significant heterogeneity. What does this mean? I'll quote from the text we've referenced:

Cochran’s Q increases both when the number of studies increases, and when the precision (i.e. the sample size of a study) increases.

Therefore Q, and whether it is significant, highly depends on the size of your meta-analysis, and thus its statistical power. We should therefore not only rely on Q, and particularly the Q-test, when assessing between-study heterogeneity.

It also reports sigma^2 which should be equivalent to the tau^2 / tau statistic which "quantifies the variance of the true effect sizes underlying our data." We can use it to create a 95% CI for the true effect of the intercept, which is:

> 0.58 - (1.96 * 0.3996) =  -0.203216

> 0.58 + (1.96 * 0.3996) = 1.363216

This is similar to what we find when we calculate the prediction intervals (-0.2692, 1.4225). Quoting the text again regarding prediction intervals:

Prediction intervals give us a range into which we can expect the effects of future studies to fall based on present evidence.

Say that our prediction interval lies completely on the “positive” side favoring the intervention. This means that, despite varying effects, the intervention is expected to be beneficial in the future across the contexts we studied. If the prediction interval includes zero, we can be less sure about this, although it should be noted that broad prediction intervals are quite common.

Commenting on the emphasized section, the key thing I've tried to keep in mind is "how does the psychotherapy evidence base / meta-analysis compare to the cash transfer evidence base / meta-analysis / CEA?" So while the prediction interval for psychotherapy contains negative values, which is typically seen as a sign of high heterogeneity, so did the one in the cash transfers meta-analysis. So I'm not quite sure what to make of the magnitude or qualitative difference in heterogeneity, which I've assumed is the relevant feature.

I guess a general point is that calculating and assessing heterogeneity is not straightforward, especially for multi-level models. Now, while one could argue we used multi-level models as part of our nefarious plan to pull the wool over folks' eyes, that's just not the case. It just seems like the appropriate way to account for the dependency introduced by including multiple timepoints in a study, which seems necessary to avoid basing our estimates of how long the effects last on guesswork.
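To spell out the interval arithmetic from a few paragraphs up: with an intercept of 0.58 and a between-study SD (sigma, here treated as tau) of roughly 0.3996, the rough interval for the true effects is:

```python
# Rough 95% interval for the true effects, using the intercept and the
# between-study standard deviation reported above.
intercept = 0.58
tau = 0.3996

lower = intercept - 1.96 * tau
upper = intercept + 1.96 * tau
print(round(lower, 6), round(upper, 6))  # -0.203216 1.363216
```

The full prediction interval quoted above (-0.2692, 1.4225) is slightly wider because it also folds in the standard error of the intercept itself, not just tau.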

• That something is up (i.e. huge hetereogeneity, huge small study effects) with the data can be seen on the forest plot (and definitely in the funnel plot). It is odd to skip these figures and basic assessment before launching into a much more elaborate multi-level metaregression.

Understandable, but for a bit of context -- we also didn't get into the meta-analytic diagnostics in our CEA of cash transfers. While my co-authors and I did this stuff in the meta-analysis that CEA was based on, I didn't feel like I had time to put everything in both CEAs, explain it, and finish them before 2021 ended (which we saw as important for continuing to exist) -- especially after wasting precious time on my quest to be clever (see the bias rubric in Appendix C). Doing the full meta-analysis for cash transfers took up the better part of a year, and we couldn't afford to do that again. So I thought that broadly mirroring the CEA I did for cash transfers was a way to "cut to the chase". I saw the meta-analysis as a way to get an input to the CEA, and I was trying to do the 20% that delivers most of the value (a meta-analysis in ~3 months rather than a year). I'm not saying that this absolves me, but it's certainly context for the tunnel vision.

• Mentioning prior sensitivity analyses which didn't make the cut for the write-up invites wondering what else got left in the file-drawer.

Fair point! This is an omission I hope to remedy in due course. In the mean time, I'll try and respond with some more detailed comments about correcting for publication bias -- which I expect is also not as straightforward as it may sound.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-24T04:05:27.527Z · EA · GW

Interesting poll, Ryan! I'm not sure how much to take away because I think "epistemic / evidentiary standards" is pretty fuzzy in the minds of most readers. But still, point taken that people probably expect high standards.

It's also rough to see that if we project the Egger regression line back to the origin then the predicted effect when the SE is zero is basically zero.

I'm not sure about that. Here's the output of the Egger test. If I'm interpreting it correctly, then that's smaller, but not zero. I'll try to figure out what the p-curve-suggested correction says.

Edit: I'm also not sure how much to trust the Egger test to tell me what the corrected effect size should be, so this wasn't an endorsement that I think the real effect size should be halved. It seems different ways of making this correction give very different answers. I'll add a further comment with more details.
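For readers unfamiliar with the mechanics being debated: Egger-style regression fits effect sizes against their standard errors (weighted by inverse variance), and the intercept is the effect predicted for a hypothetical infinitely precise (SE = 0) study -- also the idea behind PET. A self-contained sketch with made-up data:

```python
# Egger/PET-style regression on made-up data: regress effect sizes on
# their standard errors, weighting by inverse variance. The intercept
# estimates the effect of a hypothetical zero-SE study.

effects = [0.30, 0.35, 0.40, 0.60, 0.70, 0.90]
ses = [0.10, 0.12, 0.15, 0.25, 0.30, 0.40]
w = [1 / se**2 for se in ses]

# Closed-form weighted least squares for: effect = a + b * SE.
sw = sum(w)
swx = sum(wi * x for wi, x in zip(w, ses))
swy = sum(wi * y for wi, y in zip(w, effects))
swxx = sum(wi * x * x for wi, x in zip(w, ses))
swxy = sum(wi * x * y for wi, x, y in zip(w, ses, effects))

slope = (sw * swxy - swx * swy) / (sw * swxx - swx**2)
intercept = (swy - slope * swx) / sw

# A large positive slope = small-study effects; the intercept is the
# "corrected" estimate, here far below the raw effects.
print(round(intercept, 2), round(slope, 2))
```

In this toy data the small studies report much larger effects, so the SE = 0 extrapolation lands well below the naive average -- the pattern at issue in the comment above.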

I do think going forward it would be worth taking seriously community expectations about what underlies charity recommendations, and if something is tentative or rough then I hope that it gets clearly communicated as such, both originally and in downstream uses.

Seems reasonable.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-24T03:08:20.336Z · EA · GW

There are two separate topics here. The one I was discussing in the quoted text was whether an intra-RCT comparison of two interventions is necessary or whether two meta-analyses of two interventions would be sufficient. The references to GiveWell were not about the control groups they accept, but about their willingness to use meta-analyses instead of RCTs with arms comparing the different interventions they suggest.

Another topic is the appropriate control group to compare psychotherapy against. But, I think you make a decent argument that placebo quality could matter. It's given me some things to think about, thank you.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-24T00:19:06.111Z · EA · GW

To be clear, this isn't the bar HLI uses. As I said in section 2:

At HLI, we think the relevant factors for recommending a charity are:

(1) cost-effectiveness is substantially better than our chosen benchmark (GiveDirectly cash transfers); and

(2) strong evidence of effectiveness.

To elaborate, we interpret "substantially" to mean "around as good as the best charity we've found so far" which is currently 9x GiveDirectly, but I assume the specific number will change over time.

I was trying to propose a possible set of conditions where we could agree that it was reasonable for a charity to be recommended by someone in the EA community. I was aiming for inclusivity here, and to leave room for the possibility that Founders Pledge may have good reasons I'm not privy to for using GiveDirectly as a bar.

I'm also unsure that GiveWell's bar will generalise to other types of analyses. i.e., I think it's very plausible that other evaluators find that cash transfers are much better than GiveWell does.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-23T23:26:19.538Z · EA · GW

Hi Ryan,

Our preferred model uses a meta-regression with the follow-up time as a moderator, not the typical "average everything" meta-analysis. Because of my experience presenting the cash transfers meta-analysis, I wanted to avoid people fixating on the forest plot and getting confused, since its pooled estimate is not the takeaway result. But in hindsight, I think it probably would have been helpful to include the forest plot somewhere.

I don't have a good excuse for the publication bias analysis. Instead of making a funnel plot I embarked on a quest to try and find a more general system for adjusting for biases between intervention literatures. This was, perhaps unsurprisingly, an incomplete work that failed to achieve many of its aims (see Appendix C) -- but it did lead to a discount of psychotherapy's effects relative to cash transfers. In hindsight, I see the time spent on that mini project as a distraction. In the future I think we will spend more time focusing on using extant ways to adjust for publication bias quantitatively.

Part of the reasoning was because we weren't trying to do a systematic meta-analysis, but trying to do a quicker version on a convenience sample of studies. As we said on page 8 "These studies are not exhaustive (footnote: There are at least 24 studies, with an estimated total sample size of 2,310, we did not extract. Additionally, there appear to be several protocols registered to run trials studying the effectiveness and cost of non-specialist-delivered mental health interventions.). We stopped collecting new studies due to time constraints and the perception of diminishing returns."

I wasn't sure if a funnel plot was appropriate when applied to a non-systematically selected sample of studies. As I've said elsewhere, I think we could have made the depth (or shallowness) of our analysis more clear.

so I do think there was enough time to check a funnel plot for publication bias or odd heterogeneity

While it's technically true that there was enough time, it certainly doesn't feel like it! HLI is a very small research organization (from 2020 through 2021 I was pretty much the lone HLI empirical researcher), and we have to constantly balance between exploring new cause areas / searching for interventions and updating / improving previous analyses. It feels like I hit publish on this yesterday. I concede that I could have done better, and I plan on doing so in the future, but this balancing act is an art. It sometimes takes conversations like this to put items on our agenda.

FWIW, here some quick plots I cooked up with the cleaner data. Some obvious remarks:

• The StrongMinds relevant studies (Bolton et al., 2003; Bass et al., 2006) appear to be unusually effective (outliers?).
• There appears more evidence of publication bias than was the case with our cash transfers meta-analysis (see last plot).
• I also added a p-curve. What you don't want to see is a larger number of studies at the 0.05 significance level than at the 0.04 level, but that's what you see here.
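To unpack that last bullet: the p-curve check bins the significant p-values, and a pile-up just under .05, relative to the bin before it, is the warning sign. A toy version with hypothetical p-values:

```python
# Hypothetical p-values; a real p-curve bins the significant (p < .05)
# results from the studies in the meta-analysis.
p_values = [0.001, 0.003, 0.012, 0.022, 0.031, 0.042, 0.044, 0.047, 0.049]

last_bin = sum(0.04 < p < 0.05 for p in p_values)   # just under .05
prior_bin = sum(0.03 < p <= 0.04 for p in p_values)

print(last_bin, prior_bin)   # 4 1
print(last_bin > prior_bin)  # True: the warning sign described above
```

A genuine effect tends to produce the opposite, right-skewed shape, with most p-values well below .01.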

Here are the cash transfer plots for reference:

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-23T22:40:24.498Z · EA · GW

Hi Gregory,

The data we use is from the tab “Before 23.02.2022 Edits Data”. The “LayOrGroup Cleaner” is another tab that we used to do specific exploratory tests. So the selection of studies changes a bit.

1. We also clean the data in our code so the effects are set to positive in our analysis (i.e., all of the studies find reductions in depression / increases in wellbeing). The exception is Haushofer et al., which is the only study finding a decline in wellbeing.

2. We attempt to control for this problem by using a multi-level model (with random intercepts clustered at the level of the authors), but this type of meta-analysis is not super common.

3. We are using random effects. We are planning on exploring how best to set the model in our next analysis, and how using different models changes our analysis. Our aim is to do something more in the spirit of a multiverse analysis than our present analysis.

Perhaps what is going on is the overall calculated impact is much more sensitive to the regression coefficient for time decay than the pooled effect size, so the lone study with longer follow-up exerts a lot of weight dragging this upwards.

Yes, Baranov et al. has an especially strong effect on the time decay coefficient, not the pooled effect size. I'm less concerned this was a fluke now that Bhat et al. (2021) has been published, which also found very durable effects of lay-delivered psychotherapy primarily delivered to women.

4. I think you raise some fair challenges regarding the permissiveness of inclusion. Ideally, we'd include many studies that are at least somewhat relevant, and then weight each study by its precision and relevance. But there isn't a clear way to find out which characteristics of a study may drive differences in its effect without including a wide evidence base and running a lot of moderator analyses. I think many meta-analyses throw the baby out with the bathwater because of the strictness of their PICOs, and miss answering some very important questions because of it, e.g., "how do the effects decay over time?"

5. As we say in the report:

We prefer an exponential model because it fits our data better (it has a higher R^2) and it matches the pattern found in other studies of psychotherapy’s trajectory. (footnote: The only two studies we have found that have tracked the trajectory of psychotherapy with sufficient time granularity also find that the effects decay at a diminishing rate (Ali et al., 2017; Bastiaansen et al., 2020)).

So R^2 wasn't the only reason, but yes it was very low. I agree that it would be a good idea to report more statistics including the residual heterogeneity in future reports.

6. I think this is fair, and that more robustness checks are warranted in the next version of the analysis.

7. We plan on quantitatively comparing the publication bias / small study effects between psychotherapy and cash transfers, as psychotherapy does appear to have more risk, as you pointed out.

8. At the risk of sounding like a broken record, we plan on doing many more robustness checks in the flavor of a multiverse analysis when we update the analysis. If we find that our previous analyses appeared to have been unusually optimistic, we will adjust it until we think it's sensible.

These are good points, and I think they make me realize we could have framed our analysis differently. I saw this meta-analysis as:

•  An attempt to push EA analyses away from using one or a couple of studies and towards using larger bodies of evidence.
• To try to point out that how the effects change over time is an important parameter, and we should try to estimate it.
•  A way to form a prior on the size and persistence of the effects of psychotherapy in low income countries.
• That the quality of this analysis was an attempt to be more rigorous than most shallow EA analyses, but definitely less rigorous than a quality peer-reviewed academic paper.

I think this last point is not something we clearly communicated.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-19T22:29:59.533Z · EA · GW

Yes, the pooled results are mostly done with meta-regressions where studies are weighted by the inverse of the standard error (so more imprecisely estimated effect sizes are weighted less).
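As a minimal illustration of the weighting principle (the sketch below uses fixed-effect inverse-variance weights, 1/SE^2; random-effects weights in metafor also fold in tau^2, but the principle that imprecise studies count for less is the same):

```python
# Inverse-variance pooling with made-up effects and standard errors:
# the imprecise study (SE 0.40) barely moves the pooled estimate.
effects = [0.80, 0.50, 0.20]
ses = [0.40, 0.20, 0.10]

weights = [1 / se**2 for se in ses]  # ~6.25, 25.0, 100.0
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
print(round(pooled, 3))  # 0.286, pulled toward the most precise study
```

The unweighted mean of these three effects would be 0.5, so the weighting roughly halves the pooled estimate here.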

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-19T20:38:37.633Z · EA · GW

Hi Rina! I appreciate the nice words.

Could I add on to Nick's comment and ask for clarification about including "Any form of face-to-face psychotherapy delivered to groups or by non-specialists deployed in LMICs." -- it seems in your Appendix B that the studies incorporated in the meta-regression include a lot of individually delivered interventions; do you still use them, and if so, how / any differently?

Yes, we still use individually delivered interventions as general evidence of psychotherapy's efficacy in low- and middle-income countries. We assigned this general evidence 46% of the weight in the StrongMinds analysis (see Table 2 in the StrongMinds report).

While we found that group-delivered psychotherapy is more impactful, I'm not entirely clear what the causal mechanism for this would be, so I thought it'd be appropriately conservative to leave in that evidence. We showed and discussed this topic in Table 2 of our psychotherapy report (page 16).

This type of generally recruited and potentially partially biased sample seems a little different than a sample that includes women survivors of torture/ violence/ SA/ in post-conflict settings of which you have a number of RCTs.

I discussed the potential issues with the differences in samples and the way I try to address them in my response to Henry, so I won't repeat myself unless you have a further concern there.

Regarding the risk of bias due to mistaken beliefs about receiving material benefits -- this is honestly new to me since Nick Laing brought it up a couple of months ago. Insofar as this bias exists, I assume that for StrongMinds it will have to go down over time, as word travels that they do not, in fact, do much other than mental health treatments.

And to reiterate the crux here: for this to affect our comparison to, say, cash transfers, we need to believe that this bias leads to people over-reporting their benefits more (or less) than it would for the people who receive cash transfers who hope that if they give positive responses, they'll get even more cash transfers.

I'm not trying to dismiss this concern out of hand, but I'd prefer to collect more evidence before I change my analysis. If I can do so cost-effectively, I will try to bring that evidence into being (just as I try to push for the creation of evidence to inform other questions we're uncertain about) -- but in my position, resources are limited.

Are there baseline mental health scores for all these samples that you could look at?

There are in many cases, but that's not data we recorded from the samples. I think for most studies, the sample was selected for having psychological distress above some clinical threshold. That may be worth looking into.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-19T19:55:07.852Z · EA · GW

Hi Henry,

I addressed the variance in the primacy of psychotherapy in the studies in response to Nick's comment, so I'll respond to your other issues.

Some of the studies deal with quite specific groups of people eg. survivors of violence, pregnant women, HIV-affected women with young children. Generalising from psychotherapy's effects in these groups  to psychotherapy in the general population seems unreasonable.

I agree this would be a problem if we only had evidence from one quite specific group. But when we have evidence from multiple groups, and we don't have strong reasons for thinking that psychotherapy will affect these groups differently than the general population -- I think it's better to include rather than exclude them.

I didn't show enough robustness checks like this, which is a mistake I'll remedy in the next version. I categorised the population of every study as involving "conflict or violence", "general" or "HIV".  Running these trial characteristics as moderating factors suggests that, if anything, adding these additional populations underestimates the efficacy. But this is a point worth returning to.

Similarly, the therapies applied between studies seem highly variable including "Antenatal Emotional Self-Management Training", group therapy, one-on-one peer mentors. Lumping these together and drawing conclusions about "psychotherapy" generally seems unreasonable.

I'm less concerned with variation in the type of therapy not generalising because, as I say in the report (page 5), "...different forms of psychotherapy share many of the same strategies. We do not focus on a particular form of psychotherapy. Previous meta-analyses find mixed evidence supporting the superiority of any one form of psychotherapy for treating depression (Cuijpers et al., 2019)."

Because most types of psychotherapy seem about as effective, and expertise doesn't seem to be of first-order importance, I formed the view that if you regularly get someone to talk to about their problems in a semi-structured way, it'll probably be pretty good for them. This isn't a view I'd defend to the death, but I held it strongly enough to justify (at least to myself and the team) doing the simpler version of the analysis I performed.

With the difficulty of blinding patients to psychotherapy, there seems to be room for the Hawthorne effect to be skewing the results of each of the 39 studies: with patients who are aware that they've received therapy feeling obliged to say that it helped.

Right, but this is the case with most interventions (e.g., cash transfers). So long as the Hawthorne effect is balanced across interventions (which I'm not implying is assured), then we should still be able to compare their cost-effectiveness using self-reports.

Furthermore, only 8 of the trials had waitlist or do nothing controls. The rest of the trials received some form of "care as usual" or a placebo like "HIV education".  Presumably these more active controls could also elicit a Hawthorne effect or response bias?

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-19T18:45:27.870Z · EA · GW

"Any form of face-to-face psychotherapy delivered to groups or by non-specialists deployed in LMICs." These three studies below you included don't have psychotherapy as the intervention, unless I'm missing something.

Ah yes, I admit this looks a bit odd. But I'll try to explain. As I said in the psychotherapy CEA on page 9 (I didn't try to hide this too hard!):

Similarly, most studies make high use of psychotherapy. We classified a study as making high (low) use of psychological elements if it appeared that psychotherapy was (not) the primary means of relieving distress, or if relieving distress was not the primary aim of the intervention. For instance, we assigned Tripathy et al., (2010) as making low use of psychotherapy because their intervention was primarily targeted at reducing maternal and child mortality through group discussions of general health problems but still contained elements of talk therapy. We classified “use of psychotherapy” as medium if an intervention was primarily but not exclusively psychotherapy.

I also tried to show the relative proportion of papers falling in each category in the first figure:

The complete list of studies with low or medium use of psychotherapy elements is:

One concern I anticipate is: "were you sneaking in these studies to inflate your effect?" and that's certainly not the case. In a model from an earlier draft of the analysis that didn't make the final cut, I regressed the effects on whether a trial made high, medium, or low use of psychotherapy. I found that, if anything, the trials without "high" use of psychotherapy elements have smaller effects.

In lieu of a reference, I'll post the R output where the outcome is in SD changes in mental health measures.
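The R output itself isn't reproduced here. Purely as an illustration of the kind of moderator regression described (the effect sizes and groupings below are invented, not the actual data), a dummy-coded version can be sketched as:

```python
# Sketch of a moderator regression: effect sizes regressed on dummy
# variables for "medium" and "low" use of psychotherapy, with "high"
# as the baseline. Illustrative numbers only, not the real dataset.
import numpy as np

effects = np.array([0.6, 0.5, 0.7, 0.3, 0.35, 0.2])  # SD changes, invented
use = ["high", "high", "high", "medium", "medium", "low"]

X = np.column_stack([
    np.ones(len(effects)),                        # intercept = "high" group mean
    [1.0 if u == "medium" else 0.0 for u in use],
    [1.0 if u == "low" else 0.0 for u in use],
])
coefs, *_ = np.linalg.lstsq(X, effects, rcond=None)
# Negative coefficients mean medium/low-use trials show smaller effects.
print(coefs[1] < 0 and coefs[2] < 0)
```

With dummy coding, the non-intercept coefficients are just the differences between each group's mean effect and the "high" group's mean, which is the comparison the comment is making.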

I plan on being stricter with the studies I include in the next version of the analysis. When I first did this meta-analysis, I thought quantity was more important than quality, and my views have changed since then. I don't think including these studies less relevant to psychotherapy affects the results much, other than moving them towards the "psychosocial" intervention prior.

I also recognize that this is probably confusing, and we didn't explain this well. These are things I will return to when we return to this analysis and give it an upgrade in rigour and clarity.

Comment by JoelMcGuire on Evaluating StrongMinds: how strong is the evidence? · 2023-01-19T16:23:03.045Z · EA · GW

Nick, thank you for your comment. I always appreciate your friendly tone.

but I also believe that when an org gets to the scale that StrongMinds have now reached, they should have an RCT vs cash at least in the works.

I agree this would be great ... but it also seems like a really strict requirement, and I'm not sure it's necessary. GiveWell seems unfazed by not having an RCT comparing deworming to cash transfers, or malaria prevention to vitamin-A supplementation, and I'm inclined to believe this is something they've thought about. That's not meant to be a knockdown argument against the idea, but if an RCT like this is needed for comparing psychotherapy to cash transfers -- why not for every other intervention currently recommended in the EA space?

More directly, if we have two high-quality trials run separately about two different interventions but measuring similar outcomes -- how much better is this than running an RCT with two arms? It certainly reduces differences in confounders (particularly unobserved) between trials. But I think it's possible it could also have weaknesses.

• It seems plausibly more expensive to coordinate two high-quality interventions in a single trial than to let them be run separately. For instance, IIRC, in Haushofer et al., 2020 the comparison between psychotherapy and GiveDirectly was a bit apples to oranges. For psychotherapy, they hired a local NGO to start a new programme from scratch, which they compared to GiveDirectly, which by that time was a well-oiled cash-slinging machine. Getting two organizations that already know how to deploy their interventions well to collaborate on a single RCT seems difficult and expensive.
• It also may have limited generalizability. Running separate trials of charity interventions makes it likelier that the results reflect the circumstances that the charity operates in -- unless they have an area of overlap -- which is possible, but finding that seems like another reason this could be difficult.

Lastly, regarding making the dream RCT happen: HLI is currently rather resource-constrained, so in our work we have to make do with the existing literature, and we're only just now exploring "research advocacy" as an option. Running an RCT would probably cost a multiple of our annual budget. StrongMinds has more resources, but not much more. If another RCT with a psychotherapy arm and a cash arm is desired, I wonder if the GiveDirectly RCT machine may be the most promising way to get that evidence.

You also say that there are 39 studies analysed, but it looks like there are a lot less studies than that, with individual studies broken into different groups (like a,b,c,d).

I think what may be happening here is that I use "studies" as synonymous with "trials". So in my usage, one paper can analyse multiple studies (or trials). However, on reflection, I realise I sometimes refer to papers as studies -- which is unhelpful. I therefore think it'd be clearer if I referred to each separate intervention experiment as a "trial". Another thing that may be confusing is that sometimes authors publish multiple papers in the same year. I distinguish these papers by adding an "a" or "b" etc. to the end of the reference.

But if you count all of the different unique "trials", it does come out to 39.

Also have you thought about publishing your Meta-analysis in a peer reviewed journal?

We're keen to do this, but the existing meta-analysis is probably 65% of the rigour necessary for an academic paper. This year we are trying to redo the analysis with an academic collaborator so that the search will be systematic, the data will be double-screened, and we will have many more robustness tests.

(I'll answer the selection criterion question separately)

Comment by JoelMcGuire on StrongMinds should not be a top-rated charity (yet) · 2023-01-18T23:59:35.826Z · EA · GW

I think this was a valuable contribution to the discussion around charity evaluation. We agree that StrongMinds’ figures about their effect on depression are overly optimistic. We erred by not pointing this out in our previous work and not pushing StrongMinds to cite more sensible figures. We have raised this issue with StrongMinds and asked them to clarify which claims are supported by causal evidence.

There are some other issues that Simon raises, like social desirability bias, that I think are potential concerns. The literature we reviewed in our StrongMinds CEA (page 26) doesn’t suggest it’s a large issue, but I only found one study that directly addresses this in a low-income country (Haushofer et al., 2020), so the evidence appears very limited here (but let me know if I’m wrong). I wouldn’t be surprised if more work changed my mind on the extent of this bias. However, I would be very surprised if this alone changed the conclusion of our analysis. As is typically the case, more research is needed.

Having said that, I have a few issues with the post and see it as more of a conversation starter than the end of the conversation. I respond to a series of quotes from the original post below.

I’m going to leave aside discussing HLI here. Whilst I think they have some of the deepest analysis of StrongMinds, I am still confused by some of their methodology, it’s not clear to me what their relationship to StrongMinds is.

If there's confusion about our methodology, that’s fair, and I’ve tried to be helpful in that regard. Regarding our relationship with StrongMinds, we’re completely independent.

“The key thing to understand about the HLI methodology is that it follows the same structure as the Founders Pledge analysis and so all the problems I mention above regarding data apply just as much to them as FP.”

This is false. As we've explained before, our evaluation of StrongMinds is primarily based on a meta-analysis of psychological interventions in LMICs -- a distinction from Founders Pledge's work that means many of the problems mentioned apply less to ours.

I also have some issues with the claims this post makes. I’ll focus on Simon’s summary of his argument:

“I think the strongest statement I can make (which I doubt StrongMinds would disagree with) is: ‘StrongMinds have made limited effort to be quantitative in their self-evaluation, haven’t continued monitoring impact after intervention, haven’t done the research they once claimed they would. They have not been vetted sufficiently to be considered a top charity, and only one independent group has done the work to look into them.”

Next, I remark on the problem with each line.

“I think the strongest statement I can make (which I doubt StrongMinds would disagree with) is – ”

I think StrongMinds would disagree with this argument. This strikes me as overconfident.

“StrongMinds have made limited effort to be quantitative in their self-evaluation, haven’t continued monitoring impact after intervention…”

If quantitative means “RCTs”, then sure, but until very recently, they surveyed the depression score before and after treatment for every participant (which in 2019 meant an n = 28,294, unpublished data shared with me during their evaluation). StrongMinds also followed up 18 months after their initial trial and in 2019 they followed up with 300 participants six months after they received treatment (again, unpublished data). I take that as at least a sign they’re trying to quantitatively evaluate their impact – even if they could do much better (which I agree they could).

“[StrongMinds] haven’t done the research they once claimed they would.”

I'm a bit confused by this point. It sounds more like the appropriate claim is, “they didn’t do the research they once claimed they would do fast enough.” As Simon pointed out, there’s an RCT whose results should be released soon by Baird et al. From conversations we’ve had with StrongMinds, they’re also planning on starting another RCT in 2023. I also know that they completed a controlled trial in 2020 (maybe randomised, still unsure) with a six-month and year follow-up. However, I agree that StrongMinds could and should invest in collecting more causal data. I just don’t think the situation is as bleak as it has been made out to be, as running an RCT can be an enormous undertaking.

“They have not been vetted sufficiently to be considered a top charity, and only one independent group has done the work to look into them.”

This either means (a) only Founders Pledge has evaluated StrongMinds, which is wrong, or (b) HLI doesn’t count because we are not independent, which would be both wrong and uncharitable.

“Based on Phase I and II surveys, it seems to me that a much more cost-effective intervention would be to go around surveying people. I’m not exactly sure what’s going on with the Phase I / Phase II data, but the best I can tell is in Phase I we had a ~7.5 vs ~5.1 PHQ-9 reduction from “being surveyed” vs “being part of the group” and in Phase II we had ~5.1 vs ~4.5 PHQ-9 reduction from “being surveyed” vs “being part of the group”. For what it’s worth, I don’t believe this is likely the case, I think it’s just a strong sign that the survey mechanism being used is inadequate to determine what is going on.”

I think this could have a pretty simple explanation. StrongMinds used a linear model to estimate: depression reduction = group + sessions. This will lead to a non-zero intercept if the relationship between sessions and depression reduction is non-linear, which we see in the graphs provided in the post.
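The proposed explanation can be illustrated with invented numbers (not StrongMinds' data): fitting a straight line to a concave sessions-response relationship that passes through zero yields a positive intercept, which can be misread as a benefit from zero sessions, i.e. from merely being surveyed.

```python
# If depression reduction grows with sessions but with diminishing returns
# (concave, zero reduction at zero sessions), a linear fit picks up a
# positive intercept -- an apparent "effect of just being surveyed".
import numpy as np

sessions = np.array([1, 2, 4, 6, 8, 12], dtype=float)
reduction = 2.0 * np.sqrt(sessions)  # concave 'true' relationship, 0 at 0

slope, intercept = np.polyfit(sessions, reduction, 1)
print(intercept > 0)  # True: the fitted line misses the origin
```

Even though the true relationship gives exactly zero reduction at zero sessions, the linear model attributes a chunk of the effect to the intercept, which matches the pattern Simon found puzzling.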

Comment by JoelMcGuire on StrongMinds should not be a top-rated charity (yet) · 2023-01-17T20:58:17.759Z · EA · GW

FWIW, I'm not unsympathetic to comparing everything to GiveDirectly CTs, and this is probably something we will (continue to) discuss internally at HLI.

Comment by JoelMcGuire on Moral Weights according to EA Orgs · 2023-01-13T17:14:54.457Z · EA · GW

Sure,

See section 2.2

The other factor is where to locate the neutral point, the place at which someone has overall zero wellbeing, on a 0-10 life satisfaction scale; we assess that as being at each location between 0/10 and 5/10.

Or

We might suppose, then, that the neutral point on the life satisfaction scale is somewhere between 0 and 5.

Or you could also note that we estimate the lower bound of the value of saving a life as assuming a neutral point of 5.

Comment by JoelMcGuire on Moral Weights according to EA Orgs · 2023-01-13T06:46:28.730Z · EA · GW

Hopefully it does show how counter-intuitive their model can be

Is this because we argued that it's plausible that a life can have negative wellbeing?

Comment by JoelMcGuire on Moral Weights according to EA Orgs · 2023-01-13T06:43:38.734Z · EA · GW

Hi Simon,

I think it's valuable to see all of this in one place, and I appreciate the digging required to piece this together.