Second Edition

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

```
In [1]:
```# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
!pip install empiricaldist

```
In [2]:
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Pmf

Here's a question that appeared on Reddit recently:

I am trying to approximate a distribution based on the 68-95-99.7 rule--but in reverse. I would like to find an approximation for standard deviation, given:

Assumed normal dist.

Mean = 22.6

n = 100

19 of n scored less than 10.0

Min = 0, Max = 37

My intuition tells me that it is possible to solve this problem but I don't think I learned how to do it in school and don't know the right words to use to look it up. Thanks for any assistance!!!

A user named efrique responded:

You have 19% less than 10

If the sample size were large enough (100 isn't very large, so this will have a fair bit of noise in it), you could just look at the 19th percentile of a normal. That's

`-0.8779`

standard deviations below the mean, implying`22.6-10 = 12.6`

is 0.8779 of a standard deviation.

First, let's check his math. I'll compute the 19th percentile of the standard normal distribution:

```
In [3]:
```from scipy.stats import norm
norm.ppf(0.19)

```
Out[3]:
```

```
In [4]:
```sigma = 12.6 / 0.8779
sigma

```
Out[4]:
```

Let's see what Bayes has to say about it.

If we knew that the standard deviation was 14, for example, we could compute the probability of a score less than or equal to 10:

```
In [5]:
```sigma = 14
dist = norm(22.6, sigma)
ple10 = dist.cdf(10)
ple10

```
Out[5]:
```

```
In [6]:
```from scipy.stats import binom
binom(100, ple10).pmf(19)

```
Out[6]:
```

But we don't know the standard deviation. So I'll make up a range of possible values.

```
In [7]:
```hypos = np.linspace(1, 41, 101)

Now we can compute the probability of a score less than or equal to 10 under each hypothesis.

```
In [8]:
```ple10s = norm(22.6, hypos).cdf(10)
ple10s.shape

```
Out[8]:
```

And the probability that 19 out of 100 would be less than or equal to 10.

```
In [9]:
```likelihood1 = binom(100, ple10s).pmf(19)
likelihood1.shape

```
Out[9]:
```

Here's what it looks like.

```
In [10]:
```plt.plot(hypos, likelihood1)
plt.xlabel('Standard deviation')
plt.ylabel('Likelihood');

```
```

If we have no other information about sigma, we could use a uniform prior.

```
In [11]:
```prior = Pmf(1, hypos)
posterior = prior * likelihood1
posterior.normalize()

```
Out[11]:
```

```
In [12]:
```posterior.plot()
plt.xlabel('Standard deviation')
plt.ylabel('PMF');

```
```

```
In [13]:
```posterior.max_prob()

```
Out[13]:
```

The posterior mean is a little higher.

```
In [14]:
```posterior.mean()

```
Out[14]:
```

And the credible interval is pretty wide.

```
In [15]:
```posterior.credible_interval(0.9)

```
Out[15]:
```

```
In [16]:
```sigma = 14
dist = norm(22.6, sigma)
plt0 = dist.cdf(0)
plt0

```
Out[16]:
```

And the probability that at least one person gets a score less than or equal to 0.

I'm using `sf`

, which computes the survival function, also known as the complementary CDF, or `1 - cdf(x)`

.

```
In [17]:
```binom(100, plt0).sf(0)

```
Out[17]:
```

With mean 22.6 and standard deviation 14, it is likely that someone would get a 0.

If the standard deviation were lower, it would be less likely, so this data provides some evidence, but not much.

Nevertheless can do the same computation for the range of possible sigmas.

```
In [18]:
```plt0s = norm(22.6, hypos).cdf(0)

And compute the likelihood that at least one person gets a 37.

```
In [19]:
```likelihood2 = binom(100, plt0s).sf(0)

Here's what it looks like.

```
In [20]:
```plt.plot(hypos, likelihood2)
plt.xlabel('Standard deviation')
plt.ylabel('Likelihood');

```
```

```
In [21]:
```prior = Pmf(1, hypos)
prior.normalize()
posterior2 = prior * likelihood1 * likelihood2
posterior2.normalize()

```
Out[21]:
```

Here's what the posteriors look like:

```
In [22]:
```posterior.plot()
posterior2.plot()
plt.xlabel('Standard deviation')
plt.ylabel('PMF');

```
```

```
In [23]:
```posterior.mean(), posterior2.mean()

```
Out[23]:
```

We have one more piece of information to work with. We also know that the high score was 37, which is the maximum score possible.

As an exercise, compute the likelihood of this data under each of the hypothetical standard deviations in `hypos`

and use the result to update the posterior.

You should find that it contains almost no additional evidence.

```
In [24]:
```# Solution
pgt37s = norm(22.6, hypos).sf(37)

```
In [25]:
```# Solution
likelihood3 = binom(100, pgt37s).sf(0)

```
In [26]:
```# Solution
plt.plot(hypos, likelihood2)
plt.xlabel('Standard deviation')
plt.ylabel('Likelihood');

```
```

```
In [27]:
```# Solution
prior = Pmf(1, hypos)
prior.normalize()
posterior3 = prior * likelihood1 * likelihood2 * likelihood3
posterior3.normalize()

```
Out[27]:
```

Here's what the posteriors look like:

```
In [28]:
```# Solution
posterior.plot()
posterior2.plot()
posterior3.plot()
plt.xlabel('Standard deviation')
plt.ylabel('PMF');

```
```

```
In [29]:
```# Solution
posterior.mean(), posterior2.mean(), posterior3.mean()

```
Out[29]:
```

```
In [ ]:
```