In [17]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, ttest_1samp, poisson
%matplotlib inline
In any data analysis, it's important to visualize distributions and quantify statistics about those distributions. In this tutorial we'll cover a few ways of doing this.
First, we'll generate two random datasets. They'll both be normal distributions with the following parameters.
In [2]:
# The data
# N = 25
N = 75     # number of samples in each dataset
mu1 = 7    # mean of the first dataset
mu2 = 10   # mean of the second dataset
sd = 5     # standard deviation shared by both datasets
In [3]:
# Generate random data
data1 = sd * np.random.randn(N) + mu1
data2 = sd * np.random.randn(N) + mu2
As a first step, let's plot the histogram of each dataset. This gives a good high-level picture of whether the two distributions differ.
In [4]:
# Histograms
fig, ax = plt.subplots()
ax.hist(data1)
ax.hist(data2)
Out[4]:
We'll run an old-fashioned t-test to get a p-value for whether these distributions differ. This test takes the difference between the two means and scales it by the variability (standard deviation) of each group.
In [5]:
result = ttest_ind(data1, data2)
print(result.pvalue)
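To make that description concrete, here is a minimal sketch (not part of the original analysis) of the pooled two-sample t-statistic computed by hand; it should reproduce the statistic and p-value that ttest_ind returns with its default equal-variance setting.
In [ ]:
# Sketch: compute the pooled two-sample t-statistic by hand
from scipy import stats
n1, n2 = len(data1), len(data2)
v1, v2 = np.var(data1, ddof=1), np.var(data2, ddof=1)
# Pooled variance of the two groups
vp = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_manual = (np.mean(data1) - np.mean(data2)) / np.sqrt(vp * (1.0 / n1 + 1.0 / n2))
p_manual = 2 * stats.t.sf(np.abs(t_manual), df=n1 + n2 - 2)
print(t_manual, p_manual)  # should match result.statistic and result.pvalue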
Rather than visualizing the raw histograms, it's common to look at a bar plot of the data along with some metric of the "variability" of each group (sketched below). If you could only look at this bar plot (in a paper figure, for instance), would you be able to tell whether the difference was significant?
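A minimal sketch of such a bar plot (not part of the original notebook), using each group's standard deviation as the error bar:
In [ ]:
# Sketch: bar plot of the group means with standard-deviation error bars
fig, ax = plt.subplots()
bar_means = [np.mean(data1), np.mean(data2)]
bar_stds = [np.std(data1), np.std(data2)]
ax.bar([0, 1], bar_means, yerr=bar_stds, color=['b', 'g'], tick_label=['data1', 'data2'])
ax.set_ylabel('Mean +/- standard deviation')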
Now we'll repeat the plot using the standard error. Compare and contrast the information provided by each metric. As you might notice, the standard error is much smaller. Try making the same plot with a larger N: the standard error will shrink, while the standard deviation stays roughly the same. This is because the standard error describes our uncertainty about the mean of the distribution, whereas the standard deviation describes the spread of the distribution itself.
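As a quick check of that claim (a small addition, not in the original notebook), we can draw samples of increasing size and print both quantities; the standard deviation hovers around sd, while the standard error keeps shrinking.
In [ ]:
# Sketch: standard deviation vs. standard error as the sample size grows
for n in [25, 75, 500, 5000]:
    sample = sd * np.random.randn(n) + mu1
    print(n, sample.std(), sample.std() / np.sqrt(n))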
In [8]:
# Means and standard deviations of each dataset
mn1 = np.mean(data1)
mn2 = np.mean(data2)
std1 = np.std(data1)
std2 = np.std(data2)
# Now standard error
ste1 = std1 / np.sqrt(N)
ste2 = std2 / np.sqrt(N)
In [9]:
fig, axs = plt.subplots(2, 1, sharex=True, sharey=True)
for ax, err1, err2, nm in zip(axs, [std1, ste1], [std2, ste2], ['std', 'ste']):
    # Draw a vertical bar spanning mean +/- the error for each group
    ax.vlines([0, 1], [mn1 - err1, mn2 - err2], [mn1 + err1, mn2 + err2])
    _ = plt.setp(ax, xlim=[-1, 2], title='Error: {}'.format(nm))
plt.tight_layout()
In [16]:
fig, ax = plt.subplots()
ax.hist(data1)
ax.hist(data2)
ymax = ax.get_ylim()[-1]
# Draw a horizontal bar spanning mean +/- standard error above each histogram
for ii, (imn, iste, c) in enumerate([(mn1, ste1, 'b'), (mn2, ste2, 'g')]):
    ax.hlines(ymax + 1 + ii, imn - iste, imn + iste, lw=6, color=c)
ax.set_ylim([None, ax.get_ylim()[-1] + 2])
Out[16]:
The standard error that we've shown above depends, in some sense, on assumptions we're making about the data. By calculating a standard error, we're assuming a couple of things at least: that the mean is a sensible summary of the data, and that our uncertainty about it is symmetric (roughly normal) around that mean.
But what if our distribution looked something like this:
In [43]:
# Draw samples from a (skewed) Poisson distribution
data = np.random.poisson(lam=2, size=5000)
mn = data.mean()
ste = data.std() / np.sqrt(data.shape[0])
fig, ax = plt.subplots()
ax.hist(data, bins=np.arange(0, 10, 1))
ax.hlines(ax.get_ylim()[-1] + 10, mn - ste, mn + ste, lw=10)
Out[43]:
As you can see above, this is not a symmetric distribution, so we can't just calculate its mean / standard error and call it a day.
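To put a number on that asymmetry (a small addition, not in the original notebook), we can compute the sample skewness, which comes out well above zero for these Poisson draws; a symmetric distribution would be near zero.
In [ ]:
# Sketch: quantify the asymmetry of the Poisson samples
from scipy.stats import skew
print(skew(data))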
Luckily, with modern computational power we can simulate the uncertainty in the mean of this distribution. One way of doing this is called the bootstrap.
Briefly, the bootstrap follows these steps:
1. Draw a random sample from the data, with replacement, the same size as the original dataset.
2. Compute the statistic of interest (here, the mean) on that sample.
3. Repeat this many times, collecting the statistic each time.
4. Use the distribution of these resampled statistics to describe the uncertainty in the estimate (for example, with percentiles).
Here's what it looks like in practice:
In [30]:
n_boots = 1000
means = np.zeros(n_boots)
for ii in range(n_boots):
    # Resample the data with replacement, same size as the original dataset
    ixs_sample = np.random.randint(0, data.shape[0], size=data.shape[0])
    sample = data[ixs_sample]
    means[ii] = sample.mean()
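As a cross-check, recent SciPy versions (1.7 and later, newer than what this notebook otherwise needs) ship a scipy.stats.bootstrap helper; its percentile method should give an interval close to the one we compute below.
In [ ]:
# Optional cross-check with SciPy's built-in bootstrap (requires SciPy >= 1.7)
from scipy.stats import bootstrap
res = bootstrap((data,), np.mean, n_resamples=1000, method='percentile')
print(res.confidence_interval)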
Now we can plot the distribution of the bootstrapped means, as well as the 2.5th and 97.5th percentiles of this distribution (corresponding to a 95% confidence interval).
In [45]:
# Calculate confidence intervals
clo, chi = np.percentile(means, [2.5, 97.5])
fig, ax = plt.subplots()
ax.hist(means, histtype='step', color='r')
# Red bar: bootstrapped 95% confidence interval; green bar: mean +/- standard error
ax.hlines(ax.get_ylim()[-1] + 10, clo, chi, lw=10, color='r')
ax.hlines(ax.get_ylim()[-1] + 20, mn - ste, mn + ste, lw=10, color='g')
Out[45]:
Note the difference between the green and red horizontal bars. The green bar is narrower, suggesting more confidence about the "true" mean of the distribution than is warranted. Because the distribution is not symmetric, the simple mean +/- standard error summary is misleading, and the bootstrapped confidence interval gives a better description of our uncertainty.
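To put numbers on the two bars (a small addition, not in the original notebook), we can print both intervals side by side:
In [ ]:
# Compare the two uncertainty estimates numerically
print('mean +/- standard error:', mn - ste, mn + ste)
print('bootstrap 95% interval:', clo, chi)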