One simplification we can make right away is that our distribution for $\mu$ is centered at $\bar{x}$ and is symmetric:
$$P( \bar{x} - y < \mu< \bar{x} + y) = 0.95$$
where $y$ is some number we need to find. We can further rewrite this as:
$$P( - y < \mu - \bar{x}< + y) = 0.95$$
where we know that $\mu - \bar{x}$ follows a $t$-distribution. Note that these are probabilities, which are integrals of the probability distribution.
Here's a visual to understand what we're after. Note that I'm actually answering this problem to make the graph, so wait until later to try to understand the code!
In [1]:
import scipy.stats as ss
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#make some points for plot
N = 5
x = np.linspace(-5,5, 1000)
T = ss.t.ppf(0.975, df=N-1)
y = ss.t.pdf(x, df=N-1)
plt.plot(x,y)
plt.fill_between(x, y, where= np.abs(x) < T)
plt.text(0,np.max(y) / 3, 'Area=0.95', fontdict={'size':14}, horizontalalignment='center')
plt.axvline(T, linestyle='--', color='orange')
plt.axvline(-T, linestyle='--', color='orange')
plt.xticks([-T, T], ['-y', 'y'])
plt.yticks([])
plt.ylabel(r'$p(\mu - \bar{x})$')
plt.xlabel(r'$\mu - \bar{x}$')
plt.show()
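The shaded area in the plot can be checked numerically. Since a probability is an integral of the probability distribution, the area between $-y$ and $y$ is a difference of two CDF values. This is only a minimal sketch: the names t_crit, area and area_quad are mine, and scipy.integrate.quad is used purely as an independent check on the CDF result.

```python
import scipy.stats as ss
from scipy.integrate import quad

N = 5  # same number of samples as in the plot above
t_crit = ss.t.ppf(0.975, df=N - 1)

# Area between -t_crit and t_crit as a difference of CDF values
area = ss.t.cdf(t_crit, df=N - 1) - ss.t.cdf(-t_crit, df=N - 1)

# The same area by numerically integrating the PDF
area_quad, _ = quad(ss.t.pdf, -t_crit, t_crit, args=(N - 1,))

print(area, area_quad)
```

Both numbers should come out at 0.95, up to numerical error.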
This is very similar to the prediction intervals we had before. We know that $p(\mu - \bar{x})$ follows a $T(0, \sigma_x /\sqrt{N}, N - 1)$ distribution, and we can use the same $Z$-score idea we used for prediction intervals.
The 'mean' of our error in the population mean is 0, because that error is always centered around 0.
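The $Z$-score idea carries over because a $T(0, \sigma_x /\sqrt{N}, N - 1)$ distribution is a standard $t$-distribution stretched by the scale $\sigma_x /\sqrt{N}$. Here is a small sketch that checks this with the loc and scale arguments of scipy.stats.t. The values of N and sigma_x are placeholders I picked for illustration; they are not from any problem in these notes.

```python
import numpy as np
import scipy.stats as ss

N = 10         # placeholder number of samples
sigma_x = 2.0  # placeholder sample standard deviation
scale = sigma_x / np.sqrt(N)

# 97.5% quantile taken directly from the shifted and scaled distribution
y_direct = ss.t.ppf(0.975, df=N - 1, loc=0, scale=scale)

# Same quantile from the standard t-distribution, then rescaled
y_scaled = ss.t.ppf(0.975, df=N - 1) * scale

print(y_direct, y_scaled)
```

The two values agree, which is why we can look up a $T$-value once and multiply it by $\sigma_x /\sqrt{N}$.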
After taking 5 samples, we've found that the sample mean is 45 and the sample standard deviation, $\sigma_x$ is 3. What is the 95% confidence interval for the true mean, $\mu$?
We can write this as:
$$P(- y < \mu - \bar{x} < +y) = 0.95$$
Our interval will go from 2.5% to 97.5% (95% of probability), so let's find the $T$-values that bound the lower tail ($-\infty$ to 2.5%) and the upper tail (97.5% to $\infty$). Remember that the $T$-value depends on the degrees of freedom, $N-1$.
In [2]:
import scipy.stats
#The lower and upper T-values. YOU MUST GIVE THE DEGREES OF FREEDOM (N - 1)
print(scipy.stats.t.ppf(0.025, 4))
print(scipy.stats.t.ppf(0.975, 4))
In [3]:
#Half-width y: T-value times sample standard deviation (3) over sqrt(N), with N = 5
print(-scipy.stats.t.ppf(0.025, 4) * 3 / np.sqrt(5))
The final answer is $P(45 - 3.72 < \mu < 45 + 3.72) = 0.95$ or $\mu = 45\pm 3.72$
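As a cross-check on the arithmetic above, scipy.stats.t.interval returns both endpoints in one call. This is just a sketch of an alternative route, not the method used in these notes; the confidence level and degrees of freedom are passed positionally, and the variable names are mine.

```python
import numpy as np
import scipy.stats

N = 5              # number of samples
sample_mean = 45.0
sample_std = 3.0   # sample standard deviation, sigma_x

# Endpoints of the central 95% interval of T(sample_mean, sigma_x / sqrt(N), N - 1)
lower, upper = scipy.stats.t.interval(
    0.95, N - 1, loc=sample_mean, scale=sample_std / np.sqrt(N)
)
half_width = (upper - lower) / 2

print(lower, upper, half_width)
```

The half-width should match the $3.72$ found above.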
Use the scipy.stats.t.ppf or scipy.stats.norm.ppf function to find them.
In [4]:
# DO NOT COPY, JUST GENERATING DATA FOR EXAMPLE
data = scipy.stats.norm.rvs(size=100, scale=15, loc=50)
In [5]:
#Check if sample size is big enough.
#This code will cause an error if it's not
assert len(data) > 25
CI = 0.95
sample_mean = np.mean(data)
#The second argument specifies what the denominator should be (N - x),
#where x is 1 in this case
sample_var = np.var(data, ddof=1)
Z = scipy.stats.norm.ppf((1 - CI) / 2)
y = -Z * np.sqrt(sample_var / len(data))
print('{} +/- {}'.format(sample_mean, y))
Is that low? Well, remember that the error in the mean scales as the standard deviation divided by the square root of the number of samples.
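To get a feel for that scaling, here is a quick sketch of the standard error for a few sample sizes. I use a standard deviation of 15 because that is the scale the example data was generated with; the sample sizes are arbitrary choices of mine, and the sample standard deviation of any particular data set will differ somewhat.

```python
import numpy as np

sigma = 15.0  # scale used to generate the example data
sample_sizes = np.array([4, 25, 100, 400])

# Standard error of the mean: sigma / sqrt(N)
standard_errors = sigma / np.sqrt(sample_sizes)

for n, se in zip(sample_sizes, standard_errors):
    print('N = {:3d}: standard error = {:.2f}'.format(n, se))
```

Quadrupling the number of samples only halves the error in the mean.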
In [6]:
# DO NOT COPY, THIS JUST GENERATES DATA FOR EXAMPLE
data = scipy.stats.norm.rvs(size=4, scale=15, loc=50)
In [7]:
CI = 0.95
sample_mean = np.mean(data)
sample_var = np.var(data, ddof=1)
T = scipy.stats.t.ppf((1 - CI) / 2, df=len(data)-1)
y = -T * np.sqrt(sample_var / len(data))
print('{} +/- {}'.format(sample_mean, y))
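With only 4 samples we had to switch from the normal distribution to the $t$-distribution. The sketch below compares the two multipliers at 95% confidence for a few sample sizes, to show how much wider the $t$-based interval is when $N$ is small. The sample sizes in the list are my own picks for illustration.

```python
import scipy.stats

CI = 0.95
upper_tail = 1 - (1 - CI) / 2  # 0.975 for a two-sided 95% interval

# The Z-value does not depend on the number of samples
z_value = scipy.stats.norm.ppf(upper_tail)

for n in [4, 10, 25, 100]:
    # The T-value depends on the degrees of freedom, n - 1
    t_value = scipy.stats.t.ppf(upper_tail, df=n - 1)
    print('N = {:3d}: T = {:.3f}, Z = {:.3f}'.format(n, t_value, z_value))

t_small = scipy.stats.t.ppf(upper_tail, df=4 - 1)
```

The $T$-value approaches the $Z$-value as $N$ grows, which is why the normal distribution is acceptable for large samples.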
This is a prediction interval, so we're computing an interval on the distribution itself and we know everything about it.
A randomly chosen slab will have a thickness of $3.4 \pm 1.40$ 95% of the time.
We know that $p(\bar{x} - \mu)$ is normally distributed with ${\cal N}(0, \sigma / \sqrt{N})$. We want to find
$$ P(-y < \bar{x} - \mu < +y) = 0.95$$
At a 95% confidence level, the true mean is $3.38 \pm 0.248$.
Again we know that $p(\bar{x} - \mu)$ is normally distributed with ${\cal N}(0, \sigma / \sqrt{N})$. We want to find
$$ P(-y < \bar{x} - \mu < +y) = 0.99$$
We know that $p(\bar{x} - \mu)$ is a $t$-distribution because $N$ is small. It is distributed as $T(0, \sigma_x / \sqrt{N}, N - 1)$. We want to find
$$ P(-y < \bar{x} - \mu < +y) = 0.90$$
In [8]:
#Notice we use 0.95, because the interval goes from
#5% to 95%, containing 90% of probability
T = scipy.stats.t.ppf(0.95, df=6-1)
print(T)
The population mean of the slabs is $3.65 \pm 1.028$ with 90% confidence.
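It is easy to mix up 0.90 and 0.95 here, so this sketch confirms that the $T$-values at 5% and 95% are symmetric and enclose 90% of the probability. It uses the same 5 degrees of freedom as the cell above; the variable names are mine.

```python
import scipy.stats

df = 6 - 1  # degrees of freedom, N - 1

t_upper = scipy.stats.t.ppf(0.95, df=df)
t_lower = scipy.stats.t.ppf(0.05, df=df)

# Probability enclosed between the two T-values
enclosed = scipy.stats.t.cdf(t_upper, df=df) - scipy.stats.t.cdf(t_lower, df=df)

print(t_lower, t_upper, enclosed)
```

Passing 0.90 to the ppf function instead would give an interval that holds only 80% of the probability.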
We know, just like the last example, that $p(\bar{x} - \mu)$ is a normal distribution because $N$ is large enough for the central limit theorem to apply. It is distributed as ${\cal N}(0, \sigma_x / \sqrt{N})$. We want to find
$$ P(-y < \bar{x} - \mu < +y) = 0.90$$
In [9]:
#make some points for plot
N = 5
x = np.linspace(-5,5, 1000)
T = ss.t.ppf(0.10, df=N-1)
y = ss.t.pdf(x, df=N-1)
plt.plot(x,y)
plt.fill_between(x, y, where= x > T)
plt.text(0,np.max(y) / 3, 'Area=0.90', fontdict={'size':14}, horizontalalignment='center')
plt.axvline(T, linestyle='--', color='orange')
plt.xticks([T], ['lower-bound'])
plt.yticks([])
plt.ylabel(r'$p(\mu - \bar{x})$')
plt.xlabel(r'$\mu - \bar{x}$')
plt.show()
In [10]:
#make some points for plot
N = 5
x = np.linspace(-5,5, 1000)
T = ss.t.ppf(0.90, df=N-1)
y = ss.t.pdf(x, df=N-1)
plt.plot(x,y)
plt.fill_between(x, y, where= x < T)
plt.text(0,np.max(y) / 3, 'Area=0.90', fontdict={'size':14}, horizontalalignment='center')
plt.axvline(T, linestyle='--', color='orange')
plt.xticks([T], ['upper-bound'])
plt.yticks([])
plt.ylabel(r'$p(\mu - \bar{x})$')
plt.xlabel(r'$\mu - \bar{x}$')
plt.show()
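The two plots show single-sided intervals: all 10% of the excluded probability sits in one tail instead of being split between two. Here is a rough sketch of turning those $T$-values into actual bounds on $\mu$. To have numbers to work with, I am assuming the sample mean of 45 and sample standard deviation of 3 from the earlier 5-sample example; those inputs are an assumption and are not tied to the plots themselves.

```python
import numpy as np
import scipy.stats as ss

N = 5
sample_mean = 45.0  # assumed, borrowed from the earlier 5-sample example
sample_std = 3.0    # assumed, borrowed from the earlier 5-sample example
std_err = sample_std / np.sqrt(N)

# Lower bound: 90% of probability lies above the 10% quantile (negative T-value)
t_lower = ss.t.ppf(0.10, df=N - 1)
lower_bound = sample_mean + t_lower * std_err

# Upper bound: 90% of probability lies below the 90% quantile
t_upper = ss.t.ppf(0.90, df=N - 1)
upper_bound = sample_mean + t_upper * std_err

print('mu > {:.2f} with 90% confidence'.format(lower_bound))
print('mu < {:.2f} with 90% confidence'.format(upper_bound))
```

These are two separate 90% statements. Taken together, the range between them only holds 80% of the probability.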