# Nines of safety: Terence Tao’s proposed unit of measurement of risk

post by anson.ho · 2021-12-12T18:01:22.514Z · EA · GW · 24 comments

## Contents

The proposal
Why?
Relation to EA


I recently came across Terence Tao’s post proposing a risk measure called “nines of safety”. I think it’s a very interesting proposal, and given that many EA forum users think a lot about risks and probabilities, I’m curious to hear what opinions other people have.

Below I’ll briefly summarise my understanding of the idea, and ask some specific questions about how this might be related to EA. For more details, I highly recommend reading the original post.

# The proposal

While we often use percentages to describe probabilities and proportions, it can be hard to tell whether a given percentage is “good” or “bad”. For instance, having a 60% chance of success seems risky for a medical operation, but a result of 60% to 40% would be a landslide victory in a two-party election.

Part of the difficulty is that percentages can be used in a multitude of ways, with different interpretations in different scenarios. Tao proposes a unit that can be used to measure both the risk and safety of some event that has a really bad outcome (e.g. a global pandemic). The trouble with percentages in this scenario is that the probability of getting a good outcome needs to be really high in order for us to be comfortable. For instance, 90% odds of successfully completing a potentially life-threatening medical operation might seem a bit risky. 99% odds would probably feel quite a bit better, and at 99.9% odds of success we might start feeling reasonably safe (in general, this depends on how bad the negative outcomes are, the counterfactual of not doing the operation, etc.).

Writing out all of these 9s seems a bit clumsy, and the measure that Tao proposes addresses this - it’s called “nines of safety”, and informally measures how many consecutive 9s there are in the probability of success (the nines of risk would be the same as the nines of safety, but applied to the probability of failure). So in the previous example:

• 90% success = 1 nine of safety
• 99% success = 2 nines of safety
• 99.9% success = 3 nines of safety

We can formalise this in terms of the base-10 logarithm:

$$\text{nines of safety} = -\log_{10}\left(1 - \text{probability of success}\right)$$

which allows us to extend “nines of safety” so that it isn’t just a whole number. We can write a table to convert between the probabilities of success, failure, and the nines of safety:

Note that the nines of safety are rounded to 1 decimal place, because in practice probability estimates are likely to be quite uncertain, and extra decimal places may not be particularly significant.
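As a rough sketch of how such a conversion table could be generated, the snippet below applies the base-10 logarithm definition to a handful of success probabilities and rounds to 1 decimal place. The function names and the chosen probabilities are illustrative assumptions, not taken from Tao's post.

```python
import math


def nines_of_safety(p_success: float) -> float:
    """Return -log10(1 - p_success), the (possibly fractional) nines of safety."""
    if not 0.0 <= p_success < 1.0:
        raise ValueError("p_success must be in [0, 1)")
    return -math.log10(1.0 - p_success)


def conversion_table(probabilities):
    """Build rows of (P(success), P(failure), nines of safety rounded to 1 d.p.)."""
    return [(p, 1.0 - p, round(nines_of_safety(p), 1)) for p in probabilities]


if __name__ == "__main__":
    # Illustrative probabilities only; any values in [0, 1) would work.
    print("success   failure   nines")
    for p, q, n in conversion_table([0.5, 0.9, 0.95, 0.99, 0.999]):
        print(f"{p:8.3%}  {q:8.3%}  {n:.1f}")
```

With these inputs, 90%, 99% and 99.9% come out as 1.0, 2.0 and 3.0 nines, matching the whole-number examples above, while 50% and 95% give the fractional values 0.3 and 1.3.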

In general (as in the aforementioned example of the medical operation), the number of nines of safety depends on several factors, such as the number of people exposed and the duration of exposure. We might also need to consider repeated exposures, which can be quite complicated – depending on the task, individual exposures may not necessarily be independent from each other.

# Why?

This potentially has several benefits:

• Easier mental arithmetic:
  • Due to the properties of logarithms, adding “nines of safety” is the same as multiplying probabilities of failure, which makes mental calculation easier (this is especially true if we’re dealing with relative risks, e.g. “What are the odds of catching COVID in a vaccinated group relative to an unvaccinated control group?”)
  • It also makes it easier to convert from individual risk to group risk (see this comment)
• “Apples-to-apples comparisons”: since percentages are interpreted very differently depending on context, having a measure that is uniquely devoted to measuring risks can be quite helpful
• Finer characterisation of high odds of success: e.g. a small change in percentage from 99% to 99.9% odds of success leads to an addition of 1 nine of safety
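The arithmetic claims in the list above can be sketched as follows. This is a minimal illustration under two assumptions that are not in the original post: that safeguards fail independently of each other, and that each person in a group is exposed independently. The function names are hypothetical.

```python
import math


def nines_from_failure(p_failure: float) -> float:
    """Nines of safety corresponding to a given probability of failure."""
    return -math.log10(p_failure)


def combined_nines(*layer_nines: float) -> float:
    """Nines of safety when failure requires every independent layer to fail.

    Failure probabilities multiply, so nines simply add.
    """
    return sum(layer_nines)


def group_nines(individual_nines: float, group_size: int) -> float:
    """Nines of safety that nobody in a group of independent people is affected."""
    q = 10.0 ** (-individual_nines)  # individual failure probability
    # P(at least one failure) = 1 - (1 - q)^N, computed in a numerically stable way.
    p_any = -math.expm1(group_size * math.log1p(-q))
    return -math.log10(p_any)


# Two independent safeguards with 1 and 2 nines give 3 nines together,
# the same as multiplying the failure probabilities 0.1 and 0.01.
print(combined_nines(1, 2), nines_from_failure(0.1 * 0.01))

# 5 nines per person across 1,000 people leaves roughly 5 - log10(1000) = 2 nines.
print(group_nines(5, 1000))
```

For small individual risks, the group figure is close to the individual nines minus the base-10 logarithm of the group size, which is why the conversion can be done in one's head.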

I think it’s also worth mentioning that a similar idea is already used in some fields, like reliability engineering and assessing the purity of substances. In the post, Tao summarises how nines of safety would be used quite nicely:

“In summary, when debating the value of a given risk mitigation measure, the correct question to ask is not quite “Is it certain to work” or “Can it fail?”, but rather “How many extra nines of safety does it add?”.”

One possible objection to this would be that expected value calculations already account for “low-probability, high-risk” scenarios. A counterargument to this is that expected value requires estimating both the probability and the impact, which leads to greater uncertainty than just considering the nines of safety (which only depends on probability). Overall though, I’m unsure about how useful nines of safety might be compared to expected value in different cause areas.

# Relation to EA

I guess some obvious questions would be something like, “how many nines of safety are there for different problems in your field?” As an example, if we convert the risks from The Precipice, we get:

If you think that these numbers are severely underestimating the X-risks, then perhaps you have a case against nines of safety being a particularly useful measure.
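For readers who want to do this kind of conversion themselves, here is a small sketch that turns a “1 in N” risk estimate into nines of safety. The only figures used are ones quoted elsewhere on this page (1 in 6, 1 in 10 and 1 in 50); the function name is an illustrative assumption.

```python
import math


def nines_from_one_in(n: float) -> float:
    """Nines of safety for a risk of '1 in n': -log10(1/n) = log10(n)."""
    if n <= 1:
        raise ValueError("n must be greater than 1")
    return math.log10(n)


for n in (6, 10, 50):
    print(f"1 in {n}: {nines_from_one_in(n):.1f} nines of safety")
```

A risk of 1 in 10 corresponds to exactly 1 nine, and 1 in 50 rounds to 1.7 nines.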

In general, I’m curious about several things:

• What are some examples of things in EA where using nines of safety might be a good idea? I’m especially interested in examples where there is disagreement about how likely an intervention is to work (e.g. researcher A believes intervention X is better than intervention Y, but researcher B believes otherwise).
• Does “nines of safety” seem useful relative to techniques we already use, like expected values?
• What are some arguments against nines of safety?

comment by AllAmericanBreakfast · 2021-12-12T19:44:35.262Z · EA(p) · GW(p)

I like that it frames safety as a noun, not just an adjective. “We’re 99% safe” vs. “we have two nines of safety.” For some reason, it hits you different if safety feels like a tangible product you can make, or buy, rather than an intangible perception or description.

My worry is that this proposal is meant to address lay people’s innumeracy. But it requires a lengthy explanation. I suspect there are many people who don’t understand that an earthquake generating a 6 on the Richter scale is a 10x more powerful earthquake than a 5.

Another alternative is just to say “this intervention would make us 10x safer,” rather than “this intervention gives us an extra nine of safety.”

So this proposal seems to me to have a tradeoff between the psychological impact of nounification and the potential confusion of a logarithmic scale. I don’t see any risk of harm, but I think that it is probably best used in contexts where you can expect the audience to know and feel comfortable with the log scale.

Replies from: AllAmericanBreakfast, Samuel Shadrach, anson.ho
comment by AllAmericanBreakfast · 2021-12-12T19:57:09.693Z · EA(p) · GW(p)

As a follow up, the more common proposal for this issue is to switch to ratios. For example, rather than saying you have a 99.99999% chance (7 nines) of not dying from a lightning strike this year, say that only 1 in 10,000,000 people die from lightning strikes per year.

I think this is harder when we’re discussing global risks and unprecedented risks. It’s hard to conceptualize humanity going extinct in 1 in 6 earths-this-century (Toby Ord’s guess). Easier to think of a 17% chance. Maybe percentages work best for one-off risks, and ratios work better when we have a base rate to work with?

comment by acylhalide (Samuel Shadrach) · 2021-12-16T14:22:48.604Z · EA(p) · GW(p)

people who don’t understand that an earthquake generating a 6 on the Richter scale is a 10x more powerful earthquake than a 5.

Imo the point of Richter scale is that most people don't care about physical variables like power or intensity, nor do they care how those are defined. Instead you want to directly describe the felt experience - so the felt experience of a Richter scale 6 earthquake is worse than that of Richter scale 5 but it's not 10x worse.

Replies from: AllAmericanBreakfast
comment by AllAmericanBreakfast · 2021-12-16T15:32:15.443Z · EA(p) · GW(p)

This is not historically accurate. You can find a pretty good account of the development of the Richter scale on Wikipedia. It was developed to replace a felt-experience-based assessment, such as the Rossi-Forel, which had some nice vivid descriptions.

Felt experience still was used to choose the zero point.

If we think about property damage, many buildings are going to be built to withstand an earthquake up to a certain magnitude. Below this, damage, measured in dollars or in lost lives, may be less than 10x per point on the Richter scale. Above this, damage may suddenly jump by much more. It’s all about the relationship between historical trends in earthquake magnitude and investment in engineering to resist future earthquakes.

In my experience, log scales are convenient for scientists, because we’re less prone to error. If I need a solution to be at pH 1, that’s easy to remember. If I had to convert that to absolute H+ concentration, I’d be prone to dropping a zero somewhere. But if you don’t understand log scales, or are using them as a subjective guide rather than a scientific instrument, I think they’re less helpful - as evidenced by the fact that they don’t get used for, say, measuring wealth or population levels, which are the areas where lay audiences routinely encounter large numbers.

For another interesting history of the move from subjective assessments to more uniform observations, check out the development of the Beaufort Scale!

comment by acylhalide (Samuel Shadrach) · 2021-12-16T15:55:34.968Z · EA(p) · GW(p)

Just saw Wikipedia. You're right, it replaced a previous method which was even more felt-experience-based. But the wiki doesn't say why a log scale was used. I'd be keen to know about this.

I think the difference when it comes to wealth and population levels is that 2x as much wealth actually buys you 2x more chocolates and 2x more population actually needs 2x more houses to live, where number of chocolates and houses are concrete things you can experience. Versus say amplitude of needle movement (richter) or amplitude of sound waves (decibel) - these seem like things that only exist in a physics theory and there needs to be a map to what is actually experienced.

So I meant a person experiencing a Richter scale 7 earthquake is likely not going to say his personal experience of it was 100x as bad as a Richter scale 5 earthquake, if he was on an open field.

Damage to buildings is something that can be experienced, are you saying the scale is based on that?

Agreed that log scales can be convenient for scientists but that too depends - for instance it's useful if the variable exists in a theory that follows a multiplicative or exponential law (like law of mass action for concentrations) rather than an additive law.

Replies from: AllAmericanBreakfast, AllAmericanBreakfast
comment by AllAmericanBreakfast · 2021-12-16T16:12:28.431Z · EA(p) · GW(p)

I think what you’re reaching for are the Weber-Fechner laws, which point out that human perception seems to operate on a log scale. The Wikipedia article on the topic illustrates.

However, my read on the Richter scale is that even if you’re right that people routinely think a 1-point jump on the RS feels like a less-than-10x jump in perception of shaking, this is an effect, not a cause, of the choice of scale. But I don’t concede that - as I say, I think it’s likely to be more complex.

comment by acylhalide (Samuel Shadrach) · 2021-12-16T16:20:10.290Z · EA(p) · GW(p)

Thanks for the reference, Weber-Fechner seems interesting.

comment by AllAmericanBreakfast · 2021-12-16T16:07:14.756Z · EA(p) · GW(p)

It does say why a log scale was chosen.

“First, to span the wide range of possible values, Richter adopted Gutenberg's suggestion of a logarithmic scale, where each step represents a tenfold increase of magnitude, similar to the magnitude scale used by astronomers for star brightness.”

comment by acylhalide (Samuel Shadrach) · 2021-12-16T16:21:23.514Z · EA(p) · GW(p)

I saw this but I'm not sure what "wide range of possible values" means, or if earthquake intensities have wider ranges than any other variable. Keen to know what you feel.

Replies from: AllAmericanBreakfast
comment by AllAmericanBreakfast · 2021-12-16T17:43:27.569Z · EA(p) · GW(p)

This means the same thing it means for any other use of a log scale. It means that the value can be a very small number or very, very big number.

The Richter scale is measuring the amplitude of waves recorded by seismographs. The amplitude of these waves for human-perceptible earthquakes has crossed 9.6 orders of magnitude since we started using seismographs.

Given this, the Richter scale does not have the widest range of any log scale, if that's what you mean. The pH scale, for example, has a range of 15 orders of magnitude.

If you are communicating with a lay audience, and want to give them a sense of the expected damage from an earthquake, or how it'll physically feel, probably the best way to do that is with a verbal description.

If you are communicating with a scientific audience, you want to use a measured seismograph value. You could write out all those zeroes to express a high wave amplitude, but it's convenient to use the log scale.

comment by acylhalide (Samuel Shadrach) · 2021-12-17T05:58:12.915Z · EA(p) · GW(p)

It means that the value can be a very small number or very, very big number.

This is true for any variable though. Distance goes from picometres to light years. Weight goes from micrograms to solar masses. Temperature goes from 0.001 K to millions of degrees.

The main difference with Richter that I can see is that the felt experience of even very small amplitudes is not proportionately small, so it makes sense to talk about them, instead of considering them a rounding error.

Scientific audience can use scientific notation or invent new units (like light years or metric tonnes etc), at least for variables that follow additive laws. Cause you'll typically be adding light years to light years and metric tonnes to metric tonnes, and ignoring anything smaller in that specific discussion.

Replies from: AllAmericanBreakfast
comment by AllAmericanBreakfast · 2021-12-17T07:04:47.696Z · EA(p) · GW(p)

The way scientists and engineers deal with these issues of scale, when not using a log scale, is with unit choice. In our lab, we talk about “microns” when discussing the micro scale, and “nanometers” when discussing the nanoscale. This lets us keep our numbers conveniently sized for discussion. It has nothing to do with the felt size of a nanometer versus a micrometer. It has everything to do with the convenience and precision of technical discussion among colleagues.

Log scales are designed by and for scientists for similar purposes.

When we communicate that a substance is dangerously acidic, we typically do that with big red warning letters and pictures indicating the danger. When we indicate that a vinegar or a citrus fruit is tart (also a function of acidity), we do it by comparing with a familiar taste, or use a vivid verbal description. Log scales are nowhere to be found.

comment by acylhalide (Samuel Shadrach) · 2021-12-17T11:39:07.408Z · EA(p) · GW(p)

Ugh okay you didn't get me again. Let me summarise my views.

Physical law = relation between physical variables

If additive physical law, use a unit of choice depending on which scale the discussion is going in. Anything not in the same scale is too small to matter for the physical law. Unit of choice makes calculation easier. Examples: mass (conservation of mass), distance (distance is additive), time (time is additive)

If multiplicative or exponential physical law, can use a log scale. Small things also matter for the physical law. Log scale makes calculation easier. Examples: Concentration and pH (law of mass action)

If and only if physical variable is not proportional to phenomenal experience, define a phenomenal variable. Create a phenomenal law, usually "Phenomenal variable = log (physical variable)". Phenomenal variable does not make calculation easier, it only exists to relate physical variables to phenomenal experiences. Examples: Sound wave amplitude and decibel (no physical law), seismograph needle amplitude and Richter (no physical law)

I'm not sure which part you disagree with.

Replies from: AllAmericanBreakfast
comment by AllAmericanBreakfast · 2021-12-17T15:28:31.850Z · EA(p) · GW(p)

For starters, you haven’t given any examples of laws here. You’ve only given examples of units. And from your wording, I’m not sure if you understand the difference.

For example, when you say “distance is additive,” I’m not sure what you mean. Distance is a scalar, not a law, and laws involving distance may use all kinds of arithmetic transformations of distance. For example, Newton’s law of universal gravitation related the force of attraction between two bodies to the inverse square of their distance.

Not only have you not been clear enough with your language, you have also not supplied evidence for your claims. By evidence, I mean either historical examples about how various unit types were developed, hypotheticals about our ability to distinguish levels of magnitude with our senses, or real-world examples of how we communicate expectations about sensory experience to the lay public. I’ve given you all of these forms of evidence, and you haven’t responded to them.

For these reasons, I don’t understand you, and am no longer interested in talking with you about this subject.

comment by anson.ho · 2021-12-16T01:10:18.094Z · EA(p) · GW(p)

I think these are all good points, thanks for sharing!

To push back on the point about lay people's innumeracy a bit, doesn't expected value also need a somewhat lengthy explanation? In addition, I think a common mistake is to conflate EV and averages, so should we have similar concerns about EV as well?

Maybe a counterargument to this would be that "nines of safety" has obvious alternatives (e.g. ratios, as you point out), but perhaps it's harder to do this for EV?

Replies from: AllAmericanBreakfast
comment by AllAmericanBreakfast · 2021-12-16T01:43:22.638Z · EA(p) · GW(p)

In general, it's best to use knowledge that's common to your audience when possible. If that's not possible, then you have to find the right balance of precision, brevity, and familiarity. The appropriate balance will heavily depend on the audience and topic.

My practice, when writing informally, is to notice when I'm about to use a jargon term, and then search my knowledge of colloquial speech to see if there's a common term or phrase that captures this jargon term. If so, I tend to use it.

Here are two examples of sentences from the EA forum containing the phrase "expected value," and how I might rephrase them in more colloquial speech. I won't link to the source, because that would be a little tedious, but credit for the sentences goes to the authors, and you can find the source by searching for the sentence itself.

1.

"Here, the option with the greatest expected value is donating to the speculative research (at least on certain theories of value - more on those in a moment)."

->

"Here, speculative research is the best option because of its massive upside potential, at least depending on what we care about..."

2.

"My previous model, in which I took expected value estimates and adjusted them based on my intuition, was clearly inadequate."

->

"Before, I estimated the costs and benefits and then adjusted those estimates intuitively, which definitely wasn't good enough."

comment by Bogdan Ionut Cirstea · 2021-12-13T09:46:30.827Z · EA(p) · GW(p)

If I remember correctly (from 'The Precipice') 'Unaligned AI ~1 in 50 1.7' should actually be 'Unaligned AI ~1 in 10 1'.

Replies from: anson.ho
comment by anson.ho · 2021-12-16T00:56:08.767Z · EA(p) · GW(p)

Thanks for pointing this out! Should be fixed now

It's in use in electric grid availability

Replies from: robirahman
comment by Robi Rahman (robirahman) · 2021-12-12T20:24:18.627Z · EA(p) · GW(p)

I've heard this for e.g. server uptime as well.

comment by acylhalide (Samuel Shadrach) · 2021-12-13T07:02:50.542Z · EA(p) · GW(p)

Some counterpoints:

- This only makes sense for risks smaller than say 0.1% (or whatever your mental model considers too small). One misconception that affects advocacy for x-risk reduction is that all x-risks are tiny and hence Pascal's wager-like thinking is required to be in favour of reducing x-risks. Speaking in terms of concrete probabilities - like yes we think there's a 10% chance we all die by the end of the century and this is not small and we should be working on this - seems better than adding another layer of abstraction.

- For people who do understand math, this is additional mental burden not less. Also this does not preserve the ability to do calculations mentally, like adding the probabilities up or anything more complex.

- Stuff like Richter scale or decibel scale make sense to do on logarithmic scale because the underlying variable exists only in a physical theory, that does not match with the "felt experience" of the variable. Imo risks greater than 0.1% or at least 1% can be felt by people in concrete terms.

comment by Harrison Durland (Harrison D) · 2021-12-12T20:20:40.572Z · EA(p) · GW(p)

"Writing out all of these 9s seems a bit clumsy"
Personally, I don't see it as more clumsy/awkward/inconvenient than trying to learn and accurately use terms like "three nines." And then you get to situations where it's a non-integer number of nines (e.g., 0.83 nines): trying to convert that to percentages seems like a pain/intuition block, especially given that most people aren't familiar with this system. On that point, I would strongly echo the points of AllAmericanBreakfast--whose example of the Richter scale seems like a great example: my impression has been that most (lay) people do not accurately understand these numbers in terms of logarithms, and so it seems like they are less likely to understand this nines system.

Ultimately, I imagine there probably are some mathematically-oriented justifications for using this system, but I think that the key deficiency here is about lay and intuitive understanding, and my impression is that this system does the opposite of helping with that--or at least that it would be far more effective to use better language with existing systems (e.g., saying that the risk has tripled from 0.1% to 0.3% instead of saying the safety has decreased from 99.9% to 99.7%) and/or teach people to better understand the existing systems, rather than introducing a new system.
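To make the conversions discussed in this comment concrete, here is a minimal sketch that maps a non-integer number of nines back to a percentage and compares the two phrasings of the 0.1% to 0.3% example. The function names are illustrative assumptions, and the formula is the base-10 logarithm definition from the post.

```python
import math


def success_from_nines(nines: float) -> float:
    """Invert the definition: P(success) = 1 - 10^(-nines)."""
    return 1.0 - 10.0 ** (-nines)


def nines_from_success(p_success: float) -> float:
    """Nines of safety: -log10(1 - P(success))."""
    return -math.log10(1.0 - p_success)


# 0.83 nines corresponds to roughly an 85% chance of success.
print(f"{success_from_nines(0.83):.1%}")

# Tripling the risk from 0.1% to 0.3% removes about log10(3) = 0.48 nines.
lost = nines_from_success(0.999) - nines_from_success(0.997)
print(f"{lost:.2f} nines lost")
```

Any fixed multiple of the risk costs the same number of nines regardless of the starting point, whereas the same change expressed as 99.9% to 99.7% looks negligible.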

comment by Jaime Sevilla (Jsevillamol) · 2021-12-12T19:57:32.253Z · EA(p) · GW(p)

Yes please. This is a great idea and I would want us to move towards a culture where this is more common. Even better if we can use logarithmic odds instead [EA · GW], but I understand that is a harder sell.
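As a hedged illustration of how logarithmic odds relate to nines of safety, the sketch below computes both in base 10. The choice of base 10 for the log odds and the function names are assumptions made here so that the two scales are directly comparable.

```python
import math


def log10_odds(p_success: float) -> float:
    """Base-10 log odds of success: log10(p / (1 - p))."""
    return math.log10(p_success / (1.0 - p_success))


def nines_of_safety(p_success: float) -> float:
    """Nines of safety: -log10(1 - p)."""
    return -math.log10(1.0 - p_success)


for p in (0.5, 0.9, 0.999):
    print(f"p={p}: log10 odds={log10_odds(p):.3f}, nines={nines_of_safety(p):.3f}")
```

The two measures differ by log10(p), so they nearly coincide when the probability of success is close to 1, but log odds are symmetric around 50% while nines of safety are not.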

Talking about probabilities makes sense for repeated events where we care about the proportion of outcomes. This is not the case for existential risk.

Also I am going to be pedantic and point out that Tao's example about the election is misleading. The percentage is not the chance of winning the election! Instead, it is the polling result. The implicit probability being discussed is the chance of the election outcome given the polling, which is a far more extreme probability, depending on how representative the poll is and how close the results are.

comment by Donald Hobson · 2021-12-20T01:30:11.600Z · EA(p) · GW(p)

Nines of unsafety, for the pessimists. So 2 9's of unsafety is a 99% chance of doom.