There are many specialized packages for dealing with data analysis and statistical programming. One very important language that you will see in MATH1024, Introduction to Probability and Statistics, is R. A Python package for performing similar analysis of large data sets is pandas. However, simple statistical tasks on simple data sets can be tackled using numpy and scipy.
A data file containing the monthly rainfall for Southampton, taken from the Met Office data, can be downloaded from this link. We will save that file locally and then look at the data.
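If you want to fetch the file programmatically rather than downloading it by hand, one option is the standard library's urllib. This is only a sketch: the URL below is a placeholder and should be replaced with the actual address of the link above.
import urllib.request
# Placeholder URL: substitute the address of the data file linked above.
DATA_URL = "https://example.org/southampton_precip.txt"
urllib.request.urlretrieve(DATA_URL, "southampton_precip.txt")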
The first few lines of the file are:
In [1]:
!head southampton_precip.txt
We can use numpy to load this data into a variable, where we can manipulate it. This is not ideal: it will lose the information in the header, including the fact that the first column corresponds to years. However, it is simple to use.
In [2]:
import numpy
In [3]:
data = numpy.loadtxt('southampton_precip.txt')
In [4]:
data
Out[4]:
We see that the first column - the year - has been converted to a floating point number, which is not helpful. However, we can now split the data using standard numpy operations:
In [5]:
years = data[:, 0]
rainfall = data[:, 1:]
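Since the years were read in as floating point numbers, one small optional cleanup (a minimal sketch, not needed for what follows) is to convert that column back to integers:
# Convert the year column from float back to int for nicer labels
years = data[:, 0].astype(int)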
We can now plot, for example, the rainfall in January for all years:
In [6]:
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize']=(12,9)
In [7]:
from matplotlib import pyplot
In [8]:
pyplot.plot(years, rainfall[:,0])
pyplot.xlabel('Year')
pyplot.ylabel('Rainfall in January');
numpy contains a number of basic statistical functions, such as min, max and mean. These will act on entire arrays to give the "all time" minimum, maximum, and average rainfall:
In [9]:
print("Minimum rainfall: {}".format(rainfall.min()))
print("Maximum rainfall: {}".format(rainfall.max()))
print("Mean rainfall: {}".format(rainfall.mean()))
Of more interest would be either the mean rainfall in a given year, or the mean rainfall in a given month.
So the mean rainfall in the first year, 1855, would be
In [10]:
print ("Mean rainfall in 1855: {}".format(rainfall[0, :].mean()))
Whilst the mean rainfall in January, averaging over all years, would be
In [11]:
print ("Mean rainfall in January: {}".format(rainfall[:, 0].mean()))
If we wanted to plot the mean rainfall per year, across all years, this would be tedious - there are 145 years of data in the file. Even computing the mean rainfall in each month, across all years, would be awkward with 12 months. We could write a loop. However, numpy allows us to apply a function along an axis of the array, which does this in one operation:
In [12]:
mean_rainfall_in_month = rainfall.mean(axis=0)
mean_rainfall_per_year = rainfall.mean(axis=1)
The axis argument gives the direction along which the operation is applied - that is, the axis that gets collapsed. For this data set, each row contains a year and each column a month. To find the mean in a given month we average over the years, i.e. down the rows (axis 0), leaving one value per month. To find the mean in a given year we average over the months, i.e. across the columns (axis 1), leaving one value per year.
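A quick way to confirm which axis has been collapsed is to compare the array shapes (a small sanity check, assuming the file does contain 12 monthly columns):
print(rainfall.shape)                  # (number of years, 12)
print(mean_rainfall_in_month.shape)    # (12,)  - one value per month, axis 0 collapsed
print(mean_rainfall_per_year.shape)    # (number of years,)  - one value per year, axis 1 collapsed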
We can now plot how the mean varies with each year.
In [13]:
pyplot.plot(years, mean_rainfall_per_year)
pyplot.xlabel('Year')
pyplot.ylabel('Mean rainfall');
We can also compute the standard deviation:
In [14]:
std_rainfall_per_year = rainfall.std(axis=1)
We can then add error bars of one standard deviation to the plot:
In [15]:
pyplot.errorbar(years, mean_rainfall_per_year, yerr = std_rainfall_per_year)
pyplot.xlabel('Year')
pyplot.ylabel('Mean rainfall');
This isn't particularly pretty or clear: a nicer example would use better packages, but a quick fix uses an alternative matplotlib approach:
In [16]:
pyplot.plot(years, mean_rainfall_per_year)
pyplot.fill_between(years, mean_rainfall_per_year - std_rainfall_per_year,
                    mean_rainfall_per_year + std_rainfall_per_year,
                    alpha=0.25, color=None)
pyplot.xlabel('Year')
pyplot.ylabel('Mean rainfall');
Looking at the means by month, it would be better to give them names rather than numbers. We will also summarize the available information using a boxplot:
In [17]:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
pyplot.boxplot(rainfall, labels=months)
pyplot.xlabel('Month')
pyplot.ylabel('Rainfall');
Much better ways of working with categorical data are available through more specialized packages.
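As a brief illustration (assuming pandas is installed), the same data can be wrapped in a pandas DataFrame, which keeps the month names attached to the columns; this is only a sketch of what a more specialized package offers:
import pandas
# Label the columns with month names and the rows with the year
df = pandas.DataFrame(rainfall, index=years.astype(int), columns=months)
df.boxplot()
pyplot.xlabel('Month')
pyplot.ylabel('Rainfall');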
We can go beyond the basic statistical functions in numpy and look at other standard tasks. For example, we can look for simple trends in our data with a linear regression. There is a function in scipy to compute the linear regression that we can use. We will use this to see if there is a trend in the mean yearly rainfall:
In [18]:
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(years, mean_rainfall_per_year)
pyplot.plot(years, mean_rainfall_per_year, 'b-', label='Data')
pyplot.plot(years, intercept + slope*years, 'k-', label='Linear Regression')
pyplot.xlabel('Year')
pyplot.ylabel('Mean rainfall')
pyplot.legend();
In [19]:
print("The change in rainfall (the slope) is {}.".format(slope))
print("However, the error estimate is {}.".format(std_err))
print("The correlation coefficient between rainfall and year"
" is {}.".format(r_value))
print("The probability that the slope is zero is {}.".format(p_value))
It looks like there's a good chance that the slight decrease in mean rainfall with time is a real effect.
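As a rough check on that conclusion, we can form an approximate 95% confidence interval for the slope from the standard error. This is a back-of-the-envelope sketch using slope ± 2 × standard error, not a formal test:
# Approximate 95% confidence interval for the slope
lower = slope - 2 * std_err
upper = slope + 2 * std_err
print("Slope is roughly in [{}, {}].".format(lower, upper))
print("Interval excludes zero: {}".format(lower > 0 or upper < 0))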
Random processes and random variables may be at the heart of probability and statistics, but computers cannot generate anything "truly" random. Instead they can generate pseudo-random numbers using random number generators (RNGs). Constructing a random number generator is a hard problem and wherever possible you should use a well-tested RNG rather than attempting to write your own.
Python has many ways of generating random numbers. Perhaps the most useful are given by the numpy.random module, which can generate a numpy array filled with random numbers from various distributions. For example:
In [20]:
from numpy import random
uniform = random.rand(10000)
normal = random.randn(10000)
fig = pyplot.figure()
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.hist(uniform, 20)
ax1.set_title('Uniform data')
ax2.hist(normal, 20)
ax2.set_title('Normal data')
fig.tight_layout()
fig.show();
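One practical point about pseudo-random generators is reproducibility: if we fix the seed, the same "random" numbers come out on every run, which is useful when debugging or sharing results. A minimal sketch (the seed value 42 is arbitrary):
random.seed(42)           # fix the generator's state
print(random.rand(3))     # the same three numbers on every run with this seed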
Whilst the standard distributions are given by the convenience functions above, the full documentation of numpy.random shows many other distributions available. For example, we can draw $10,000$ samples from the Beta distribution using the parameters $\alpha = 1/2 = \beta$ as
In [21]:
beta_samples = random.beta(0.5, 0.5, 10000)
pyplot.hist(beta_samples, 20)
pyplot.title('Beta data')
pyplot.show();
We can do this $5,000$ times and compute the mean of each set of samples:
In [22]:
n_trials = 5000
beta_means = numpy.zeros((n_trials,))
for trial in range(n_trials):
    beta_samples = random.beta(0.5, 0.5, 10000)
    beta_means[trial] = numpy.mean(beta_samples)
pyplot.hist(beta_means, 20)
pyplot.title('Mean of Beta trials')
pyplot.show();
Here we see the Central Limit Theorem in action: the distribution of the means appears to be normal, despite the distribution of any individual trial coming from the Beta distribution, which looks very different.
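As a sanity check on this argument, we can compare the spread of the sample means to the Central Limit Theorem prediction $\sigma/\sqrt{n}$, where $\sigma^2 = \alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$ is the variance of the Beta distribution. A quick numerical check, assuming the beta_means array above is still in scope:
a = b = 0.5
n_samples = 10000
beta_variance = a * b / ((a + b)**2 * (a + b + 1))   # variance of Beta(1/2, 1/2)
print("Predicted standard deviation of the mean: {}".format(numpy.sqrt(beta_variance / n_samples)))
print("Observed standard deviation of the means: {}".format(numpy.std(beta_means)))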
Four separate datasets are given:
x | y | x | y | x | y | x | y |
---|---|---|---|---|---|---|---|
10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
Plot each dataset. Add the best fit line. Then look at the description of Anscombe's quartet, and consider in what order the operations in this exercise should have been done.
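As a starting point (a sketch only, so as not to spoil the exercise), here is how the first dataset from the table could be entered and fitted; the other three follow the same pattern.
x1 = numpy.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y1 = numpy.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
slope, intercept, r_value, p_value, std_err = stats.linregress(x1, y1)
pyplot.plot(x1, y1, 'bo', label='Dataset 1')
pyplot.plot(x1, intercept + slope * x1, 'k-', label='Best fit line')
pyplot.xlabel('x')
pyplot.ylabel('y')
pyplot.legend();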