Classical Statistics Exercises

Here are some quick descriptions of how to do basic classical statistics in Python.

Simple Representations of Data

On the simplest level, any data set can be summarised with just a few numbers: the mean, median and variance are among the most popular ones, but the "best" representation depends on the problem at hand. Also popular are quantiles: the $q$-quantile is the point at which the cumulative distribution function reaches the value $q$ (equivalently, the point up to which the probability density function integrates to $q$).

Most of these are really easy to compute in Python:


In [5]:
import numpy as np
from pylab import *
### First, make some data
s = 10 ## size of the sample

## create s random normally distributed numbers with a mean of 2 and a standard deviation of 2
sample = np.random.normal(2,2, size=s) 
#sample = np.random.chisquare(4, size=s)
#sample = np.random.poisson(20, size=s)
mn = np.mean(sample)
md = np.median(sample)
vr = np.var(sample)
st = np.std(sample)

## import quantiles from scipy
from scipy.stats.mstats import mquantiles as quantiles

## a popular choice are the 0.25, 0.5 and 0.75 quantiles
qn = quantiles(sample, [0.25, 0.5, 0.75])

## print results
print("The sample mean is is: %.4f"%mn)
print("The sample median should be similar (for a symmetric distribution): %.4f"%md)
print("The sample variance is: %.4f"%vr)
print("The sample standard deviation should be sqrt(variance): %.4f"%st)
print("The 0.25 quantile is : %.4f"%qn[0])
print("The 0.5 quantile should be the same as the median: %.4f"%qn[1])
print("The 0.25 quantile is : %.4f"%qn[0])


The sample mean is: 2.3873
The sample median should be similar (for a symmetric distribution): 2.3434
The sample variance is: 1.4208
The sample standard deviation should be sqrt(variance): 1.1920
The 0.25 quantile is : 1.3823
The 0.5 quantile should be the same as the median: 2.3434

I invite you to play around with the sample size $s$ and see how the values of the various statistics (and their accuracy) change. Plotting the distributions can be useful; see the sketch below. Also, I've included (commented out) two alternative distributions for you to play around with: a chi-square distribution and a Poisson distribution (note that the Poisson distribution is discrete!).
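For example, here's a minimal sketch of how you might plot a histogram of the sample together with the mean and median (the number of bins and the output filename are arbitrary choices):


In [ ]:
## histogram of the sample, with the mean and median marked
fig = figure()
hist(sample, bins=20, color="lightgrey")
axvline(mn, color="black", label="mean")
axvline(md, color="red", linestyle="--", label="median")
legend()
savefig("sample_histogram.pdf", format="pdf")
close()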

Statistical Distributions in Numpy and Scipy

Both numpy and scipy have many of the most useful standard distributions already in their code base. You've already seen above how to quickly and easily make random numbers in numpy. If that's all you want (and this can be incredibly powerful for producing simulated data), you're good to go.

However, scipy can do more. In scipy.stats, there are classes for many distributions, with methods that can do a whole lot more than just produce random numbers for you! For example, producing an instance of a normal distribution with "frozen" parameters (here a mean of 2 and a standard deviation of 4) looks like this:


In [3]:
import scipy.stats
n = scipy.stats.norm(2, 4)

You can now use this distribution in various ways. For example, to plot the pdf and cdf, you can do this:


In [6]:
## define some x coordinates
x = np.linspace(-10, 20, 2000)

## make a figure
fig = figure(figsize=(12,6))
ax = fig.add_subplot(121)
plot(x, n.pdf(x))
ax.set_title("PDF")

ax2 = fig.add_subplot(122)
plot(x, n.cdf(x))
ax.set_title("CDF")

savefig("normaldist_pdfcdf.pdf", format="pdf")
close()



Similarly, the frozen distribution object lets you compute the mean, median, standard deviation and other moments directly.
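For example, here's a quick sketch using the frozen distribution n from above (which quantities to print is an arbitrary choice):


In [ ]:
## summary statistics of the frozen distribution
print("mean: %.4f"%n.mean())
print("median: %.4f"%n.median())
print("standard deviation: %.4f"%n.std())
print("variance: %.4f"%n.var())
print("third (raw) moment: %.4f"%n.moment(3))

## it can also produce random numbers, just like numpy
print(n.rvs(5))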

Something else that might be useful is the capability of fitting a distribution to a data sample:


In [7]:
## let's make another sample
sample = np.random.normal(2,2,size=1000)

## now fit a normal distribution to the sample with the 
## fit method of the norm class
f = scipy.stats.norm.fit(sample)

print("The fit parameter for the mean of the distribution is: %.4f"%f[0])
print("The fit parameter for the standard deviation of the distribution is: %.4f"%f[1])


The fit parameter for the mean of the distribution is: 2.0241
The fit parameter for the standard deviation of the distribution is: 1.9819

There are many more distributions in scipy.stats. Have a look!
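For example, here's a quick sketch of a discrete distribution: the Poisson distribution works much like the normal distribution above, except that it has a pmf (probability mass function) method instead of a pdf (the rate parameter of 5 is an arbitrary choice):


In [ ]:
## a frozen Poisson distribution with rate parameter 5
p = scipy.stats.poisson(5)

print("P(X = 3): %.4f"%p.pmf(3))
print("P(X <= 3): %.4f"%p.cdf(3))

## random numbers from the same distribution
print(p.rvs(10))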

Likelihood Functions

Estimating the best-fit parameters of a distribution is one of the standard problems in astronomy. Numpy and scipy already have many different tools for this. In the following, we're going to play around with a Poisson likelihood for a while. The Poisson distribution is useful wherever you have discrete, rare events: for example, particles in a detector, car accidents, or similar data sets.

Because likelihood values can become very large or very small, we'll work with the log-likelihood instead; this loses none of our ability to draw inferences from the data.

First, we define the log-likelihood. Then we make some Poisson sample data with a rate parameter of 5. This could be, for example, the count rate in a photon detector that detects on the order of 5 photons per second.
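For reference, for $N$ data points $x_i$ and a rate parameter $\lambda$, the Poisson log-likelihood is

$$\log L(\lambda) = \sum_{i=1}^{N}\left(x_i \log\lambda - \lambda - \log(x_i!)\right) = -N\lambda + \log\lambda\sum_{i=1}^{N}x_i - \sum_{i=1}^{N}\log\Gamma(x_i+1),$$

where the second form uses $x! = \Gamma(x+1)$; the log of the Gamma function is available as scipy.special.gammaln.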


In [15]:
## gammaln (the log of the Gamma function) lives in scipy.special
import scipy.special

def poisson_loglike(rate, data):
    ## make rate parameter into array of same length as data:
    r = np.ones(len(data))*rate
    ## now we can compute the log-likelihood:
    llike = -np.sum(r) + np.sum(data*np.log(r))-np.sum(scipy.special.gammaln(data + 1))
    return llike


sample = np.random.poisson(5, size=10000)

I'll leave it to you to verify that the expression in the function definition above is indeed the Poisson log-likelihood for $N$ samples.

Now, let's have a look at the shape of the likelihood, computed for subsets of our sample of increasing size:


In [16]:
## rate parameter guesses to evaluate the log-likelihood at
## (we start at 1 to avoid taking the log of zero)
guesses = np.arange(1, 100)

## return the log-likelihood of sample s for each guess
def guess_like(s):
    return [poisson_loglike(g, s) for g in guesses]

for n in [1, 10, 100, 10000]:
    fig = figure()
    plot(guesses, guess_like(sample[:n]), color="black", label="%i samples"%n)
    savefig("likelihoodtest_%isamples.pdf"%n, format="pdf")
    close()



As more data points are included, the likelihood peaks more and more sharply around the value from which the data were drawn (here, a rate parameter of 5).

Usually, our models are more complex than a simple distribution with one or two parameters. Say you have a spectrum and want to fit a spectral model. You have pretty good data, so you can approximate the measurement uncertainties as Gaussian. scipy already gives you the tools you need to do this: scipy.optimize allows you to find the minimum of a given function, and for (weighted) least squares, scipy.optimize.curve_fit will do the trick:


In [17]:
import scipy.optimize

## define a straight line function:
def straight(x, a, b):
    return x*a + b

## x-coordinate
x = np.linspace(0, 10, 1000)

## some data
a = -2
b = 3
y = straight(x, a, b)

## add some random Gaussian noise:
data = np.random.normal(0, 2, size=len(x))+y

## fit straight line to data:
popt, pcov = scipy.optimize.curve_fit(straight, x, data)

print("Fit straight line data with a=-2, b=3")
print("Fitted parameter a is %.3f"%popt[0])
print("Fitted parameter b is %.3f"%popt[1])


Fit straight line data with a=-2, b=3
Fitted parameter a is -2.007
Fitted parameter b is 3.002

curve_fit also gives you the covariance matrix, which is really handy because it allows you to estimate the errors on the parameters you've just derived: taking the square root of the diagonal of the covariance matrix gives you the estimated standard deviation of each parameter.


In [18]:
errors = np.sqrt(np.diag(pcov))
print("Fitted parameter a with error is %.3f +/- %.3f"%(popt[0], errors[0]))
print("Fitted parameter b with error is %.3f +/- %.3f"%(popt[1], errors[1]))


Fitted parameter a with error is -2.007 +/- 0.023
Fitted parameter b with error is 3.002 +/- 0.133
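If your data points come with measurement uncertainties, you can pass them to curve_fit via its sigma keyword to make the least squares properly weighted. Here's a quick sketch; in this toy example, the uncertainties are simply the known standard deviation of the noise we added:


In [ ]:
## per-point uncertainties; here we know the noise has a standard deviation of 2
yerr = np.ones(len(x))*2.0

## weighted least-squares fit
popt, pcov = scipy.optimize.curve_fit(straight, x, data, sigma=yerr)
errors = np.sqrt(np.diag(pcov))

print("Weighted fit parameter a is %.3f +/- %.3f"%(popt[0], errors[0]))
print("Weighted fit parameter b is %.3f +/- %.3f"%(popt[1], errors[1]))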

However, it is also possible to use likelihood functions other than Gaussian ones. In this case, define the log-likelihood explicitly and use one of scipy.optimize's many minimisation routines. Be sure to minimise the negative log-likelihood!

Recall the Poisson likelihood from earlier. Instead of plotting the log-likelihood and finding the maximum on the plot, we can estimate it by minimising the negative log-likelihood:


In [ ]:
sample = np.random.poisson(5, size=10000)

## define the -log-likelihood
neg_poisson = lambda r: -poisson_loglike(r, sample)

## minimise using Powell's algorithm; the starting value for the rate parameter is 10
popt = scipy.optimize.fmin_powell(neg_poisson, 10, full_output=True)

print("The fit-parameter for the Poisson rate parameter: %.3f "%popt[0])

There is quite a large range of optimisation routines available in scipy.optimize, and all of them have certain strengths and weaknesses. I invite you to have a look at the documentation and to play around with some of the algorithms to see which works best for a given problem.
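As a starting point, here's a quick sketch that tries a few of them on the negative Poisson log-likelihood from above, using the general scipy.optimize.minimize interface (the list of methods and the starting value of 10 are arbitrary choices):


In [ ]:
## try a few different minimisation algorithms on the same problem
for method in ["Nelder-Mead", "Powell", "BFGS"]:
    res = scipy.optimize.minimize(neg_poisson, 10.0, method=method)
    ## res.x holds the best-fit rate parameter
    print("%s: rate parameter = %.3f"%(method, np.ravel(res.x)[0]))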