This is a short post on an idea from statistics and its relevance to EA. The initial post highlighted the fact that expectations cannot always be sensibly thought of as representative values from distributions.
Probability served three ways
Suppose an event X is reported to have some probability p. We’re all aware at this point that, in practice, p comes from some fitted model, even if that means fitted inside someone’s head. This means it comes with uncertainty. However, it can be difficult to visualize what uncertainty in a probability means.
Luckily, we can also model probabilities directly. A sample from the Beta distribution can be used as the parameter of a Bernoulli coin toss. The following three Beta distributions all have the same expectation: 1/2.
The interpretation here is:
1. The probability is either very high or very low - we don’t know which.
2. The probability is uniformly distributed - it could be anywhere.
3. We’re fairly sure the probability is right in the middle.
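A minimal sketch of the "Beta sample as Bernoulli parameter" idea, using only the Python standard library. The parameters are assumptions: Beta(1, 1) is the uniform case by definition, while Beta(0.1, 0.1) and Beta(20, 20) are inferred from figures quoted elsewhere in this thread rather than stated in the post. The names `FLAVOURS` and `heads_rate` are made up for illustration.

```python
import random

# Assumed parameters: Beta(1, 1) is the uniform case; the other two are
# inferred from figures quoted in this thread, not stated in the post itself.
FLAVOURS = {
    "1. very high or very low": (0.1, 0.1),
    "2. uniform": (1.0, 1.0),
    "3. right in the middle": (20.0, 20.0),
}

def beta_mean(a, b):
    """Expectation of Beta(a, b)."""
    return a / (a + b)

def beta_variance(a, b):
    """Variance of Beta(a, b)."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def heads_rate(a, b, n, rng):
    """Draw a fresh p ~ Beta(a, b) for every toss, then toss a Bernoulli(p) coin."""
    heads = 0
    for _ in range(n):
        p = rng.betavariate(a, b)
        heads += rng.random() < p
    return heads / n

rng = random.Random(0)  # fixed seed so reruns give the same numbers
for label, (a, b) in FLAVOURS.items():
    print(f"{label}: E[p]={beta_mean(a, b):.2f}, "
          f"Var[p]={beta_variance(a, b):.4f}, "
          f"simulated heads rate={heads_rate(a, b, 50_000, rng):.3f}")
```

All three flavours should report an expectation of 0.50 and a simulated heads rate close to one half. Only the variance column tells them apart, which is the information a bare point estimate throws away.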
Suppose we encounter in some discussion a point estimate of a probability. For example E[p]=1/2. Or perhaps the idea of expectation might not even be stated explicitly - but no other uncertainty information is given. It is natural to wonder: which flavour of p are we talking about?
Implication for planning
Suppose a highly transmissible new disease infallibly kills some subset of humans. Or malevolent aliens. Or whatever is salient for the reader. Interpret the p in our example as the probability an arbitrary human is in the affected group a year from now.
Under distribution 1, there is a roughly 32% probability that more than 99% of people are affected.
Under distribution 3, there is a roughly 47% probability that the proportion of the population affected is between 45% and 55%.
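The two figures above can be sanity-checked by simulation. This is a rough Monte Carlo sketch, assuming distribution 1 is Beta(0.1, 0.1) and distribution 3 is Beta(20, 20) (parameters taken from the comments, not from the post). `prob_in_range` is a hypothetical helper name.

```python
import random

def prob_in_range(a, b, low, high, n=100_000, seed=1):
    """Monte Carlo estimate of P(low < p <= high) for p ~ Beta(a, b).

    The upper bound is inclusive because draws from a very U-shaped Beta
    can round to exactly 1.0 in floating point.
    """
    rng = random.Random(seed)
    hits = sum(low < rng.betavariate(a, b) <= high for _ in range(n))
    return hits / n

# Assumed parameters for distributions 1 and 3.
p_extreme = prob_in_range(0.1, 0.1, 0.99, 1.0)    # more than 99% affected
p_middle = prob_in_range(20, 20, 0.45, 0.55)      # between 45% and 55% affected

print(f"P(p > 0.99)        under Beta(0.1, 0.1): {p_extreme:.3f}")
print(f"P(0.45 < p < 0.55) under Beta(20, 20):   {p_middle:.3f}")
```

The estimates carry sampling noise of a few tenths of a percentage point at this sample size, so they should only be read as agreeing with the "roughly" figures quoted above.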
I’m going to baldly assert that knowing which distribution we face should alter our response to it. Despite the coincidence of expectations. Which distribution represents the worst x-risk? Which would it be easiest to persuade people to take action on?
Nice post! Here's an illustrative example in which the distribution of p matters for expected utility.
Say you and your friend are deciding whether to meet up but there's a risk that you have a nasty, transmissible disease. For each of you, there's the same probability p that you have the disease. Assume that whether you have the disease is independent of whether your friend has it. You're not sure if p has a beta(0.1,0.1) distribution or a beta(20,20) distribution, but you know that the expected value of p is 0.5.
If you meet up, you get +1 utility. If you meet up and one of you has the disease, you'll transmit it to the other person, and you get -3 utility. (If you both have the disease, then there's no counterfactual transmission, so meeting up is just worth +1.) If you don't meet up, you get 0 utility.
It makes a difference which distribution p has. Here's an intuitive explanation. In the first case, it's really unlikely that one of you has it but not the other. Most likely, either (i) you both have it, so meeting up will do no additional harm or (ii) neither of you has it, so meeting up is harmless. In the second case, it's relatively likely that one of you has the disease but not the other, so you're more likely to end up with the bad outcome.
If you crunch the numbers, you can see that it's worth meeting up in the first case, but not in the second. For this to be true, we have to assume conditional independence: that you and your friend having the disease are independent events, conditional on the probability of an arbitrary person having the disease being p. It doesn't work if we assume unconditional independence but I think conditional independence makes more sense.
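The difference between the two independence assumptions can be seen in a small simulation. This sketch draws p once per trial and lets both people share it (conditional independence given p), then compares against the unconditional case where each person simply has the disease with probability E[p]=0.5. The function name `p_exactly_one` is made up for illustration.

```python
import random

def p_exactly_one(a, b, n=100_000, seed=2):
    """Share of trials in which exactly one of two people has the disease.

    p is drawn once per trial from Beta(a, b) and shared by both people,
    who are then independent given p.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        p = rng.betavariate(a, b)
        you = rng.random() < p
        friend = rng.random() < p
        count += you != friend
    return count / n

conditional_extreme = p_exactly_one(0.1, 0.1)   # beta(0.1,0.1) case
conditional_middle = p_exactly_one(20, 20)      # beta(20,20) case

# Unconditional independence: each person is infected with probability 0.5,
# so the distribution of p drops out entirely.
unconditional = 2 * 0.5 * (1 - 0.5)

print(f"conditional, beta(0.1,0.1): {conditional_extreme:.3f}")
print(f"conditional, beta(20,20):   {conditional_middle:.3f}")
print(f"unconditional, either case: {unconditional:.3f}")
```

Under conditional independence the "exactly one of you has it" outcome is rare in the first case and close to a coin flip in the second. Under unconditional independence it is 0.5 either way, so the two distributions can no longer lead to different decisions.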
The calculation is a bit long-winded to write up here, but I'm happy to if anyone is interested in seeing/checking it. The gist is to write the probability of a state obtaining as the integral wrt p of the probability of that state obtaining, conditional on p, multiplied by the pdf of p (i.e. P(s1,s2)=∫P(s1,s2|p)f(p)dp). Separate the states via conditional independence (i.e. P(s1,s2|p)=P(s1|p)P(s2|p)), plug in values (e.g. P(you have it|p)=p) and integrate. For example, the probability you both have it, assuming the beta(0.1,0.1) distribution, is ∫p²f(p)dp=E[p²]=Var(p)+E[p]²=0.01/0.048+0.25≈0.458. Then calculate the expected utility of meeting up as normal, with the utilities above and the probabilities calculated in this way. If I haven't messed up, you should find that the expected utility is positive in the beta(0.1,0.1) case (i.e. better to meet up) and negative in the beta(20,20) case (i.e. better not to meet up).
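For anyone who wants to check the numbers, here is a minimal closed-form sketch. Because each state probability is an integral of a polynomial in p against the Beta pdf, it reduces to the first two moments of the Beta distribution, so no numerical integration is needed. One assumption to flag: the -3 is read as the total utility of meeting when exactly one person has the disease, not as a penalty added on top of the +1. The function names are hypothetical.

```python
def beta_second_moment(a, b):
    """E[p^2] for p ~ Beta(a, b)."""
    return a * (a + 1) / ((a + b) * (a + b + 1))

def expected_utility_of_meeting(a, b, u_safe=1.0, u_transmit=-3.0):
    """Expected utility of meeting up when p ~ Beta(a, b).

    Assumes conditional independence given p, and that u_transmit is the
    total utility when exactly one person has the disease.
    """
    m1 = a / (a + b)                 # E[p]
    m2 = beta_second_moment(a, b)    # E[p^2]
    p_both = m2                      # E[p * p]
    p_neither = 1 - 2 * m1 + m2      # E[(1 - p)^2]
    p_one = 2 * (m1 - m2)            # E[2 p (1 - p)]
    return u_safe * (p_both + p_neither) + u_transmit * p_one

for a, b in [(0.1, 0.1), (20, 20)]:
    eu = expected_utility_of_meeting(a, b)
    print(f"beta({a},{b}): P(both)={beta_second_moment(a, b):.3f}, "
          f"EU(meet)={eu:+.3f}, EU(stay home)=+0.000")
```

Under these assumptions the expected utility of meeting comes out at about +0.67 for beta(0.1,0.1) and about -0.95 for beta(20,20), against 0 for staying home, so the sign flips exactly as described.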
Reflecting on this example and your x-risk questions, this highlights the fact that in the beta(0.1,0.1) case, we're either very likely fine or really screwed, whereas in the beta(20,20) case, it's similar to a fair coin toss. So it feels easier to me to get motivated to work on mitigating the second one. I don't think that says much about which is higher priority to work on though because reducing the risk in the first case could be super valuable. The value of information narrowing uncertainty in the first case seems much higher though.
I share the intuition that the second case would be easier to get people motivated for, as it represents more of a confirmed loss.
However, as your example shows, the first case could actually lead to an 'in it together' effect on co-ordination, assuming the information is taken seriously. That is hard because, in advance, this kind of situation could encourage a 'roll the dice' mentality.
This brings to mind the assumption of normal distributions when using frequentist parametric statistical tests (t-test, ANOVA, etc.). If plots 1-3 represented random samples from three groups, an ANOVA would indicate there was no significant difference between the mean values of any group, which would usually be reported as there being no significant difference between the groups (even though there is clearly a difference between them). In practice, this can come up when comparing a treatment that has a population of non-responders and strong responders vs. a treatment where the whole population has an intermediate response. This can be easily overlooked in a paper if the data is just shown as mean and standard deviation, and although better statistical practices are starting to address this now, my experience is that even experienced biomedical researchers often don't notice this problem. I suspect that there are many studies which have failed to identify that a group is composed of multiple subgroups that respond differently by averaging them out in this way.
The usual approach to dealing with non-normal distributions is to test for normality (e.g. the Shapiro-Wilk test) in the data from each group and move to a non-parametric test if that fails for one or more groups (e.g. the Mann-Whitney, Kruskal-Wallis or Friedman tests), but even that is just comparing medians so I think it would probably still indicate no significant difference between (the median values of) these plots. Testing for a difference between distributions is possible (e.g. the Kolmogorov–Smirnov test), but my experience is that this seems to be over-powered and will almost always report a significant difference between two moderately sized (~50+ samples) groups, and the result is just that there is a significant difference in distributions, not what that actually represents (e.g. differing means, standard deviations, kurtosis, skewness, tail behaviour, complete non-normality, etc.).
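As a rough illustration of the point about comparing whole distributions rather than centres, here is a hand-rolled two-sample Kolmogorov-Smirnov statistic in plain Python. It is a sketch, not a substitute for a statistics package: the samples are simulated from Beta(0.1, 0.1) and Beta(20, 20) (assumed stand-ins for a responder/non-responder group and an intermediate-response group), and the critical value uses the standard large-sample approximation rather than an exact p-value.

```python
import math
import random
from bisect import bisect_right
from statistics import mean

def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in xs + ys:  # the largest gap is always attained at a data point
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap

def ks_critical(n, m, alpha=0.05):
    """Large-sample approximation to the two-sample KS critical value."""
    return math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))

rng = random.Random(3)
n = 500
bimodal = [rng.betavariate(0.1, 0.1) for _ in range(n)]   # assumed group A
central = [rng.betavariate(20, 20) for _ in range(n)]     # assumed group B

mean_gap = abs(mean(bimodal) - mean(central))
d = ks_statistic(bimodal, central)
print(f"difference in means: {mean_gap:.3f}")
print(f"KS statistic: {d:.3f} (approx. 5% critical value {ks_critical(n, n):.3f})")
```

The means of the two groups should be close while the KS statistic sits far above the critical value. As noted above, the statistic only says that the distributions differ, not how.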
One of the topics I hope to return to here is the importance of histograms. They're not a universal solvent. However they are easily accessible without background knowledge. And as a summary of results, they require fewer parametric assumptions.
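A quick text histogram makes the point without any plotting library. This is a sketch on simulated data, with Beta(0.1, 0.1) and Beta(20, 20) assumed as the two contrasting shapes discussed in this thread. The helpers `histogram` and `render` are hypothetical names.

```python
import random

def histogram(samples, bins=10):
    """Count samples from [0, 1] into equal-width bins."""
    counts = [0] * bins
    for x in samples:
        counts[min(int(x * bins), bins - 1)] += 1  # x == 1.0 goes in the last bin
    return counts

def render(counts, width=40):
    """Draw one row of '#' characters per bin, scaled to the tallest bin."""
    peak = max(counts)
    step = 1 / len(counts)
    rows = []
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        rows.append(f"{i * step:.1f}-{(i + 1) * step:.1f} | {bar}")
    return "\n".join(rows)

rng = random.Random(4)
n = 2_000
extreme_counts = histogram([rng.betavariate(0.1, 0.1) for _ in range(n)])
middle_counts = histogram([rng.betavariate(20, 20) for _ in range(n)])

print("Beta(0.1, 0.1)")
print(render(extreme_counts))
print("Beta(20, 20)")
print(render(middle_counts))
```

Both samples have a mean near 0.5, but the first histogram piles up in the outermost bins and the second in the central ones, which a mean and standard deviation alone would not make obvious.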
I very much agree about the reporting of means and standard deviations, and how much a paper can sweep under the rug by that method.
One answer is that there is no difference between 'orders' of random variables in Bayesian statistics. You've either observed something or you haven't. If you haven't, then you figure out what distribution the variable has.
The relationship between that distribution and the real world is a matter of your assiduousness to scientific method in constructing the model.
Lack of a reported distribution on a probability, e.g. p=0.42, isn't the same as a lack of one. It could be taken as the assertion that the distribution on the probability is a delta function at 0.42. Which is to say the reporter is claiming to be perfectly certain what the probability is.
There is no end to how meta we could go, but the utility of going one order up here is to see that it can actually flip our preferences.