Numerical Summaries of Data

Congratulations, you have some data. Now what? Well, ideally, you'll have a research question to match. In practice, that's not always true or even possible. So, in the following tutorial, we're going to present some methods to explore your data, using an example data set from SDSS. Most of these methods focus on summarizing the data in some way, that is, compress the information in your data set in such a way that it will be more useful to you than the raw data. There are many ways to achieve this sort of compression; we'll only showcase a few of them. In general, one can summarize data numerically, that is, compute a set of numbers that describe the data in a meaningful way that trades loss of information with descriptiveness. One can also summarize data sets visually, in the form of graphs. Here, we will explore the former category (although we won't get away without making any graphs here, either!). The following tutorial will explore visual summaries of data in more detail.

Many of these methods may seem familiar, but we'll try to teach you something new, too! At the same time, you'll also get to know some of the most useful packages for numerical computation in python: numpy and scipy.

Before we start, let's load some astronomy data!


In [1]:
import os
import requests 

# get some CSV data from the SDSS SQL server
URL = "http://skyserver.sdss.org/dr12/en/tools/search/x_sql.aspx"

cmd = """
SELECT TOP 10000
    p.u, p.g, p.r, p.i, p.z, s.class, s.z, s.zerr
FROM
    PhotoObj AS p
JOIN
    SpecObj AS s ON s.bestobjid = p.objid
WHERE
    p.u BETWEEN 0 AND 19.6 AND
    p.g BETWEEN 0 AND 20 AND
    (s.class = 'STAR' OR s.class = 'GALAXY' OR s.class = 'QSO')
"""
if not os.path.exists('all_colors.csv'):
    cmd = ' '.join(map(lambda x: x.strip(), cmd.split('\n')))
    response = requests.get(URL, params={'cmd': cmd, 'format':'csv'})
    with open('all_colors.csv', 'w') as f:
        f.write(response.text)

In [2]:
import pandas as pd
df = pd.read_csv("all_colors.csv",skiprows=1)

In [3]:
df


Out[3]:
u g r i z class z1 zerr
0 19.16978 17.08930 16.04692 15.55223 15.17684 GALAXY 0.076246 0.000019
1 18.88655 17.26482 16.53105 16.23936 16.05161 STAR 0.000122 0.000009
2 18.94358 17.47576 16.79267 16.49504 16.30348 STAR 0.000161 0.000011
3 18.42756 16.98290 16.79660 16.78195 16.72049 STAR 0.000123 0.000010
4 18.76420 16.92113 16.04104 15.65526 15.39297 STAR 0.000241 0.000008
5 17.20783 15.75872 15.63107 15.59838 15.56201 STAR 0.000142 0.000005
6 18.28920 16.84220 16.68952 16.64136 16.58683 STAR 0.000222 0.000008
7 17.69090 16.22015 15.54465 15.20622 15.05557 STAR 0.000150 0.000009
8 19.10833 17.49509 16.73375 16.40439 16.16703 STAR 0.000188 0.000009
9 18.18435 16.81769 16.30527 16.06046 15.91306 STAR 0.000210 0.000012
10 18.98582 17.43833 16.70121 16.40430 16.22956 STAR 0.000339 0.000010
11 17.78138 16.38120 16.31779 16.31096 16.27825 STAR 0.000247 0.000006
12 16.12573 15.05371 15.04288 15.09971 15.12204 STAR 0.000225 0.000004
13 17.88314 16.59722 16.52256 16.50586 16.57543 STAR 0.000207 0.000007
14 16.50660 15.11010 15.02940 15.05553 15.01521 STAR 0.000166 0.000004
15 17.54357 16.15225 16.02519 16.01606 15.98047 STAR 0.000212 0.000006
16 17.46517 15.97317 15.81342 15.78044 15.71757 STAR 0.000230 0.000006
17 18.55322 16.99053 16.78870 16.72765 16.73230 STAR 0.000225 0.000009
18 17.65196 16.34665 16.21154 16.19021 16.13198 STAR 0.000207 0.000007
19 18.94093 17.30824 16.68718 16.37980 16.19246 STAR 0.000109 0.000016
20 19.42305 17.70929 16.88225 16.48898 16.24508 STAR 0.000103 0.000012
21 17.49978 15.94900 15.74521 15.66466 15.56982 STAR 0.000208 0.000008
22 17.28362 15.78144 15.60855 15.56985 15.50644 STAR 0.000233 0.000005
23 18.48349 16.99278 16.83139 16.76890 16.71789 STAR 0.000178 0.000011
24 17.90772 16.59890 16.51875 16.52415 16.49093 STAR 0.000183 0.000008
25 18.05621 16.50874 16.21932 16.08952 16.00218 STAR 0.000201 0.000008
26 17.80663 16.30726 16.14573 16.10564 16.07638 STAR 0.000068 0.000006
27 19.48872 17.48968 16.54476 16.17460 15.91617 STAR -0.000018 0.000008
28 18.55945 17.00299 16.78813 16.69907 16.62170 STAR 0.000206 0.000009
29 17.05843 15.67654 15.53414 15.51704 15.48902 STAR 0.000182 0.000005
... ... ... ... ... ... ... ... ...
9970 18.66580 18.66566 18.89007 19.06359 19.27806 QSO 1.096592 0.000229
9971 18.55215 16.84621 15.91145 15.46771 15.12429 GALAXY 0.122022 0.000034
9972 17.00455 15.53162 14.82834 14.41894 14.12202 GALAXY 0.037770 0.000008
9973 18.91488 16.83624 15.89665 15.45328 15.09546 GALAXY 0.059557 0.000014
9974 18.85091 17.52745 16.87109 16.44857 16.19461 GALAXY 0.081072 0.000009
9975 19.27532 18.08082 17.56893 17.21910 17.06681 GALAXY 0.125263 0.000008
9976 18.91428 17.95433 17.71039 17.56416 17.54964 STAR -0.000035 0.000030
9977 17.98449 17.07576 16.77341 16.65288 16.64050 STAR -0.000041 0.000007
9978 18.88045 16.92197 16.11383 15.71679 15.40778 GALAXY 0.054414 0.000015
9979 17.52261 16.09159 15.28299 14.85429 14.51337 GALAXY 0.037375 0.000012
9980 18.35714 17.27558 17.09887 17.07650 17.08409 STAR -0.000111 0.000014
9981 18.48715 17.19322 16.63678 16.31632 16.11982 GALAXY 0.069888 0.000016
9982 18.68424 17.68931 17.35985 17.13924 16.98174 GALAXY 0.037802 0.000006
9983 18.68946 17.18591 16.56903 16.32432 16.17589 STAR -0.000136 0.000011
9984 19.03388 16.73407 15.74353 15.37458 15.14844 STAR -0.000015 0.000008
9985 18.10310 16.75151 16.09899 15.76913 15.51989 GALAXY 0.016488 0.000017
9986 17.32072 15.74650 15.11705 14.88004 14.77378 STAR 0.000011 0.000007
9987 18.87338 16.83988 15.79945 15.32242 14.90727 GALAXY 0.086303 0.000019
9988 18.53954 18.10493 17.81214 17.44697 17.50096 GALAXY 0.146669 0.000004
9989 18.82698 17.70184 17.23304 17.06105 16.98843 STAR -0.000465 0.000010
9990 18.66400 17.08254 16.47720 16.24797 16.14481 STAR 0.000139 0.000009
9991 19.01188 17.14509 16.36780 15.98576 15.69298 GALAXY 0.017033 0.000024
9992 19.48624 18.17650 17.69476 17.41129 17.31583 GALAXY 0.033752 0.000008
9993 19.26799 17.34957 16.48178 16.10234 15.86800 STAR 0.000081 0.000008
9994 17.52086 16.05171 15.38772 15.03875 14.83801 STAR -0.000807 0.000011
9995 19.24875 18.19267 17.74338 17.52076 17.44764 STAR -0.000265 0.000010
9996 18.45863 17.21609 16.71649 16.53343 16.42949 STAR -0.000106 0.000010
9997 19.22442 16.56367 15.32245 14.84301 14.56661 STAR -0.000055 0.000008
9998 19.17015 17.75889 17.12812 16.68459 16.40116 GALAXY 0.066407 0.000011
9999 19.54867 17.26397 16.29182 15.88584 15.65525 STAR -0.000101 0.000007

10000 rows × 8 columns

Summarizing Data

There are a few very simple, but often used ways to summarize data: the arithmetic mean and median, along the standard deviation, variance or standard error. For a first look at what your data looks like, these can be incredibly useful tools, but be aware of their limitations, too!

The arithmetic mean of a (univariate) sample $x = \{x_1, x_2, ..., x_n\}$ with n elements is defined as

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i}$ .

In python, you can of course define your own function to do this, but much faster is the implementation in numpy:


In [15]:
import numpy as np

galaxies = df[df["class"]=="GALAXY"]
x = np.array(galaxies["r"])

x_mean = np.mean(x)

print("The mean of the r-band magnitude of the galaxies in the sample is %.3f"%x_mean)


The mean of the r-band magnitude of the galaxies in the sample is 16.651

Note that if you have multivariate data (i.e. several dimensions), you can use the axis keyword to specify which dimension to average over:


In [16]:
x_multi = np.array(galaxies[["u", "g", "r"]])
print(x.shape)


(4299,)

In [17]:
## global average
print(np.mean(x_multi))

## average over the sample for each colour
print(np.mean(x_multi, axis=0))

## average over all colours for each galaxy in the sample
print(np.mean(x_multi, axis=1))


17.6026387927
[ 18.80438917  17.35221503  16.65131218]
[ 17.43533333  16.05213667  18.12191667 ...,  17.50825667  18.4525
  18.01905333]

Note: which dimension you want to average over depends on the shape of your array, so be careful (and check the shape of the output, if in doubt!).

The average is nice to have, since it's a measure for the "center of gravity" of your sample, however, it is also prone to be strongly affected by outliers! In some cases, the median, the middle of the sample, can be a better choice. For a sample of length $n$, the median is either the $(n+1)/2$-th value, if $n$ is odd, or the mean of the middle two values $n/2$ and $(n+1)/2$, if $n$ is even. Again, numpy allows for easy and quick computation:


In [21]:
x_med = np.median(x)
print("The median of the r-band magnitude of the sample of galaxies is %.3f"%x_med)


The median of the r-band magnitude of the sample of galaxies is 16.756

Both the median and mean are useful, but be aware that unless the underlying distribution that created your sample is symmetric:


In [118]:
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(16,6))
xcoord = np.linspace(0,2,100)
ax1.plot(xcoord, scipy.stats.norm(1.0, 0.3).pdf(xcoord))
ax1.set_title("Symmetric distribution")
ax2.plot(xcoord, scipy.stats.lognorm(1.0, 0.1).pdf(xcoord))
ax2.set_title("Asymmetric distribution")


Out[118]:
<matplotlib.text.Text at 0x118bfac50>

For a (large enough) sample drawn from the distribution the mean and median will not be the same, and more importantly, in this case, they will also not be equal to the most probable value (the top of the distribution). The latter is also called the mode. There's no intrinsic nifty way to compute the mode of a distribution (or sample). In practice, for a univariate sample, you might make a histogram of the sample (that is, define finite bins over the width of your sample, and count all elements that fall into each bin), and then find the bin with the most counts. Note that for creating a histogram, you need to pick the number of your bins (or, equivalently, the width of each bin). This can be tricky, as we'll discuss in more detail in the visualization part of this tutorial.


In [37]:
h, edges = np.histogram(x, bins=30,range=[16.,19.6], normed=False)

In [45]:
## find the index where the binned counts have their maximum
h_max = np.where(h == np.max(h))[0]

## these are the maximum binned counts
max_counts = h[h_max]

## find the middle of the bin with the maximum counts
edge_min = edges[h_max]
edge_max = edges[h_max+1]
edge_mean = np.mean([edge_min, edge_max])

print("The mode of the distribution is located at %.4f"%edge_mean)


The mode of the distribution is located at 17.0200

Mean, median and mode all tell us something about the center of the sample. However, it tells us nothing about the spread of the sample: are most values close to the mean (high precision) or are they scattered very far (low precision)?

For this, we can use the variance, the squared deviations from the mean:

$s^2_x = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2}$

or its square root, the standard deviation: $s_x = \sqrt{s^2_x}$.

Similarly to the mean, there are functions in numpy for the mean and the standard deviation, and again, it is possible to specify the axis along which to compute either:


In [47]:
x_var = np.var(x)
x_std = np.std(x)

print("The variance of the r-band magnitude for the galaxies in the sample is %.4f"%x_var)
print("The standard deviation of the r-band magnitude for the galaxies in the sample is %.4f"%x_std)

x_var_multi = np.var(x_multi, axis=0)
x_std_multi = np.std(x_multi, axis=0)

print("The variance of the for all three bands for the galaxies in the sample is "+str(x_var_multi))
print("The standard deviation for all three bands for the galaxies in the sample is " + str(x_std_multi))


The variance of the r-band magnitude for the galaxies in the sample is 0.7825
The standard deviation of the r-band magnitude for the galaxies in the sample is 0.8846
The variance of the for all three bands for the galaxies in the sample is [ 0.52926058  0.63001774  0.78250212]
The standard deviation for all three bands for the galaxies in the sample is [ 0.72750298  0.79373657  0.8845915 ]

Note: If your data set contains NaN values, you can use the functions nanmean, nanvar and nanstd instead, which will compute the mean, variance and standard deviation, respectively, while ignoring any NaN values in your data.

What is the error on the mean? That depends on the size of your sample! Imagine you pick many samples from the population of possible outcomes. If each sample has a large number of elements, then you are more likely to find similar means for each of your samples than if the sample size is small.

Given that we might not have many samples from the population (but perhaps only one SDSS data set!), we'd like to quantify how well we can specify the mean. We do this by dividing the variance by the number of data points and taking the square root to get the standard error:


In [51]:
se = np.sqrt(np.var(x)/x.shape[0])
print("The standard error on the mean for the r-band galaxy magnitudes is %.4f"%se)


The standard error on the mean for the r-band galaxy magnitudes is 0.0135

Mean and variance give us general information about the center and the spread of the sample, but tell us nothing about the shape.

If we want yet more information about how the data are distributed, we'll have to introduce yet another concept: the quantile the $\alpha$-quantile of a sample is the point below which a fraction $\alpha$ of the data occur. For example, the $0.25$-quantile is the point below which $25\%$ of the sample is located. Note that the 0.5-quantile is the median.

The 0.25, 0.5 and 0.75 quantiles are also called the first, second and third quantile, respectively. The difference between the 0.25 and 0.75 quantile are called the interquartile range (remember that! We'll come back to it!).

This time, there's no nifty numpy function (that I know of) that we can use. Instead, we'll turn to scipy instead, which contains much functionality for statistics:


In [52]:
from scipy.stats.mstats import mquantiles

q = mquantiles(x, prob=[0.25, 0.5, 0.75])
print("The 0.25, 0.5 and 0.75 of the r-band galaxy magnitudes are " + str(q))


The 0.25, 0.5 and 0.75 of the r-band galaxy magnitudes are [ 16.182876  16.75585   17.29843 ]

Using this, we can now compute the Tukey five-number summary: a collection of five numbers that contains first quartil, median, second quartile as well as the minimum and maximum values in the sample. Together, they give a reasonably good first impression of how the data are distributed:


In [53]:
def tukey_five_number(x):
    x_min = np.min(x)
    x_max = np.max(x)
    q = mquantiles(x, prob=[0.25, 0.5, 0.75])
    return np.hstack([x_min, q, x_max])

In [54]:
print("The Tukey five-number summary of the galaxy r-band magnitudes is: " + str(tukey_five_number(x)))


The Tukey five-number summary of the galaxy r-band magnitudes is: [ 11.62901   16.182876  16.75585   17.29843   24.80204 ]

Finally, if you're working with a pandas DataFrame, you can use the method describe to print some statistical summaries as well:


In [112]:
df.describe()


Out[112]:
u g r i z z1 zerr
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 18.544234 17.250815 16.713258 16.466774 16.309103 0.124569 0.060707
std 0.907272 1.032828 1.152263 1.212772 1.286174 0.396333 4.647630
min 11.960040 11.688460 11.271150 11.048570 10.629170 -0.002965 -4.000000
25% 18.080027 16.666845 16.043048 15.749308 15.533058 0.000080 0.000008
50% 18.792405 17.414145 16.785750 16.507530 16.331530 0.001224 0.000010
75% 19.239483 17.955283 17.462840 17.208642 17.096488 0.083216 0.000016
max 19.599920 19.920830 24.802040 28.182330 22.826920 5.854882 429.914900

Let's think more about the mean of a sample. Imagine we actually have a reasonably good idea of what the mean should be. We might then want to ask whether the sample we collected is consistent with that predetermined theoretical value. To do that, we can take the difference between the sample mean and the theoretical value, but we'll need to normalize it by the standard error. After all, the sample mean could be far from the theoretical prediction, but if the spread in the sample is large, that might not mean much.

$t = \frac{\bar{x} - \mu}{\sqrt{(s_x^2/n)}}$

This number is also called the Student t-statistic for univariate data.

This is, of course, not hard to write down in python with the functions for mean and variance we've learned above, but as above, we can make use of functionality in scipy to achieve the same result:


In [64]:
import scipy.stats

## some theoretical prediction
mu = 16.8

## compute the t-statistic.
t = scipy.stats.ttest_1samp(x, mu)

t_statistic = t.statistic
p_value = t.pvalue

print("The t-statistic for the galaxy r-band magnitude is %.4f"%t_statistic)
print("The p-value for that t-statistic is " + str(p_value))


The t-statistic for the galaxy r-band magnitude is -11.0196
The p-value for that t-statistic is 7.23412760149e-28

need to write some explanation here

Statistical Distributions in scipy

Aside from the two examples above, scipy.stats has a large amount of functionality that can be helpful when exploring data sets.

We will not go into the theory of probabilities and probability distributions here. Some of that, you will learn later this week. But before we start, some definitions:

  • random variable: technically, a function that maps sample space (e.g. "red", "green", "blue") of some process onto real numbers (e.g. 1, 2, 3)
  • we will distinguish between continuous random variables (which can take all real values in the support of the distribution) and discrete random variables (which may only take certain values, e.g. integers).
  • the probability mass function (PMF) for discrete variables maps the probability of a certain outcome to that outcome
  • in analogy, the probability density function (PDF) for continuous random variables is the probability mass in an interval, divided by the size of that interval (in the limit of the interval size going to zero)
  • the cumulative distribution function (CDF) at a point x of a distribution is the probability of a random variable X being smaller than x: it translates to the integral (sum) over the PDF (PMF) from negative infinity up to that point x for a continuous (discrete) distribution.

scipy.stats defines a large number of both discrete and continuous probability distributions that will likely satisfy most of your requirements. For example, let's define a standard normal distribution.


In [67]:
## continuous (normal) distribution

loc = 2.0
scale = 0.4 

dist = scipy.stats.norm(loc=loc, scale=scale)
dist


Out[67]:
<scipy.stats._distn_infrastructure.rv_frozen at 0x117f7be90>

We've now created an object that defines a normal (Gaussian) distribution with a scale parameter of 0.4 and a center of 2. What can we do with this?

For example, draw a random sample:


In [68]:
s = dist.rvs(100)
print(s)


[ 2.50431444  2.11058364  2.24071986  2.47497436  1.75850039  1.62600395
  2.20289435  1.89238838  1.91996427  2.00952353  2.14367686  2.23121868
  1.62096689  2.74025872  2.19697373  0.85651104  2.01703714  1.51215581
  1.49519117  1.51461253  2.57479157  1.98103806  2.41766122  2.52023994
  1.44844469  1.65138438  1.65253577  2.71651984  1.81187593  1.23089199
  1.47922214  2.17067098  1.38393825  1.26531693  2.7052431   2.49835938
  1.59987787  2.0119105   2.1934772   1.47614296  2.9129221   2.15115603
  1.53630225  3.00673904  1.58606667  2.09797873  1.66217778  1.10926118
  2.23990586  1.34947789  2.43955822  1.37295593  1.79612009  1.78488342
  2.12709361  1.90490522  2.00577896  1.83835018  2.11482897  1.8851669
  1.97353032  1.86727465  1.79220657  1.19724902  1.64829546  1.84016529
  2.02792293  2.53090503  2.14080624  2.21234171  1.83794103  1.86191577
  1.35946016  2.28622588  2.01345457  1.70204355  2.14510779  1.98139534
  2.0577144   2.05795734  2.47645635  2.10432697  2.13939156  2.57930661
  2.06077678  2.44659289  1.75413505  1.76590278  2.68072807  1.67090906
  2.40336699  1.87528165  1.89928568  1.12027431  2.28240783  1.70494533
  0.65313723  2.42197543  2.05932398  1.44786458]

We have sampled 100 numbers from that distribution!

We can also compute the PDF in a given range:


In [69]:
xcoord = np.linspace(0,4,500)
pdf = dist.pdf(xcoord)

plt.plot(xcoord, pdf)


Out[69]:
[<matplotlib.lines.Line2D at 0x1182476d0>]

or the CDF:


In [71]:
cdf = dist.cdf(xcoord)

plt.plot(xcoord, cdf)


Out[71]:
[<matplotlib.lines.Line2D at 0x1182f5f10>]

Mean, median, var and std work as methods for the distribution, too:


In [72]:
print("distribution mean %.4f"%dist.mean())
print("distribution median %.4f"%dist.median())
print("distribution standard deviation %.4f"%dist.std())
print("distribution variance %.4f"%dist.var())


distribution mean 2.0000
distribution median 2.0000
distribution standard deviation 0.4000
distribution variance 0.1600

Easier than that, we can get it to just give us the moments of the distribution: the mean, the variance, the skewness and the kurtosis (below, replace 'mvsk' by any combination of those four letters to get it to print any of the four):


In [76]:
dist.stats(moments='mvsk')


Out[76]:
(array(2.0), array(0.16000000000000003), array(0.0), array(0.0))

For computing the $n$th-order non-central moment, use the moment method:


In [81]:
dist.moment(1)


Out[81]:
2.0

For more advanced purposes, we can also compute the survival function (1-CDF) and the percent point function (1/CDF) of the distribution:


In [82]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(16,6))
ax1.plot(xcoord, dist.sf(xcoord))
ax1.set_title("Survival function")

ax2.plot(xcoord, dist.ppf(xcoord))
ax2.set_title("Percent point function")


Out[82]:
<matplotlib.text.Text at 0x118471ad0>

To compute a confidence interval around the median of the distribution, use the method interval:


In [83]:
dist.interval(0.6)


Out[83]:
(1.6633515065708342, 2.3366484934291658)

Finally, if you have some data that you think was drawn from the distribution, you may use the fit method to fit the distribution's parameters to the sample:


In [86]:
normal_data = np.random.normal(4.0, 0.2, size=1000)

data_loc, data_scale = scipy.stats.norm.fit(normal_data)
print("the location and scale parameters of the fit distribution are %.3f and %.3f, respectively."%(data_loc, data_scale))


the location and scale parameters of the fit distribution are 4.017 and 0.204, respectively.

Continuous distributions function in exactly the same way, except that instead of a pdf method, they will have a pmf method:


In [91]:
## set the distribution
dist = scipy.stats.poisson(10)

## make an x-coordinate: Poisson distribution only defined
## for positive integer numbers!
xcoord = np.arange(0,50,1.0)

## plot the results
plt.scatter(xcoord, dist.pmf(xcoord))


Out[91]:
<matplotlib.collections.PathCollection at 0x11892ca50>

Exercise: Take the galaxy data we've used above and fit a distribution of your choice (see http://docs.scipy.org/doc/scipy/reference/stats.html for a list of all of them) and compare the parameters of your distribution to the sample mean and variance (if they're comparable.

Multivariate data

Of course, most data sets aren't univariate. As we've seen above, we can use the same functions that we've used to compute mean, median, variance and standard deviation for multi-variate data sets in the same way as for single-valued data, except that we need to specify the axis along which to compute the operation.

However, these functions will compute the mean, variance etc. for each dimension in the data set independently. This unfortunately tells us nothing about whether the data vary with each other, that is, whether they are correlated in any way. One way to look at whether data vary with each other is by computing the covariance. Let's take our slightly expanded galaxy data set (with three colours from above, and compute the covariance between the three magnitude bands.

Because of the way I've set up the array above, we need to take the transpose of it in order to get the covariance between the bands (and not between the samples!).


In [97]:
cov = np.cov(x_multi.T)
print(cov)


[[ 0.52938372  0.51149323  0.48624376]
 [ 0.51149323  0.63016432  0.67565774]
 [ 0.48624376  0.67565774  0.78268418]]

Note that the matrix is symmetric about the diagonal. Also, the values on the diagonal are just the variance:


In [98]:
x_var = np.var(x_multi, axis=0)
print(x_var)
x_var_cov = np.diag(cov)
print(x_var_cov)


[ 0.52926058  0.63001774  0.78250212]
[ 0.52938372  0.63016432  0.78268418]

To compute the actual correlation between two samples, compute the correlation coefficient, which is the ratio of sample covariance and the product of the individual standard deviations:

$s_{xy} = \frac{1}{n-1}\sum{(x_i - \bar{x})(y_i - \bar{y})}$ ;

$r = \frac{s_{xy}}{s_x s_y}$


In [104]:
r = np.corrcoef(x_multi.T)
print(r)


[[ 1.          0.88557979  0.75539737]
 [ 0.88557979  1.          0.96206977]
 [ 0.75539737  0.96206977  1.        ]]

The correlation coefficient can range between -1 and 1. If two samples are only offset by a scaling factor (perfect correlation), then $r = 1$. $r = 0$ implies that there is no correlation.

Another way to compare two samples is to compare the means of the two samples. To do this, we can use a generalization of the Student's t-test above to higher dimensions for two samples $x$ and $y$:

$t = \frac{\bar{x} - \bar{y}}{\sqrt{\left(\frac{s_x^2}{n} + \frac{s_y^2}{n}\right)}}$.

In python:


In [101]:
t = scipy.stats.ttest_ind(x_multi[:,0], x_multi[:,1])
t_stat = t.statistic
t_pvalue = t.pvalue

print("The t-statistic for the two bands is %.4f"%t_stat)
print("The p-value for that t-statistic is " + str(t_pvalue))


The t-statistic for the two bands is 88.4215
The p-value for that t-statistic is 0.0

Note that the t-statistic tests whether the means of the two samples are the same. We can also test whether a sample is likely to be produced by a reference distribution (single-sample Kolmogorov-Smirnov test) or whether two samples are produced by the same, underlying (unknown) distribution (2-sample Kolmogorov-Smirnov test).

The one-sample KS-test (test sample against a reference distribution) can take, for example, the cdf method of any of the distributions defined in scipy.stats.


In [109]:
ks = scipy.stats.kstest(x, scipy.stats.norm(np.mean(x), np.var(x)).cdf)
print("The KS statistic for the sample and a normal distribution is " + str(ks.statistic))
print("The corresponding p-value is " + str(ks.pvalue))


The KS statistic for the sample and a normal distribution is 0.0689309216324
The corresponding p-value is 3.62017717266e-18

Analogously, for the 2-sample KS-test:


In [111]:
ks = scipy.stats.ks_2samp(x_multi[:,0], x_multi[:,1])
print("The KS statistic for the u and g-band magnitudes is " + str(ks.statistic))
print("The corresponding p-value is " + str(ks.pvalue))


The KS statistic for the u and g-band magnitudes is 0.76366596883
The corresponding p-value is 0.0

There are many more statistical tests to compare distributions and samples, too many to showcase them all here. Some of them are implemented in scipy.stats, so I invite you to look there for your favourite test!

Linear Regression

Perhaps you've found a high correlation coefficient in your data. Perhaps you already expect some kind of functional dependence of your (bivariate) data set. In this case, linear regression comes in handy. Here, we'll only give a very brief overview of how to do it (and introduce you to scipy.optimize). Later in the week, you'll learn how to do parameter estimation properly.

Doing simple linear regression in python requires two steps: one, you need a linear (in the coefficients) model, and a way to optimize the parameters with respect to the data. Here's where scipy.optimize comes in really handy. But first, let's build a model:


In [113]:
def straight(x, a, b):
    return a*x+b

Now we'll take two variables, in this case the g- and the r-band magnitudes, and use curve_fit from scipy.optimize to do linear regression:


In [114]:
import scipy.optimize 

x = np.array(galaxies["r"])
y = np.array(galaxies["g"])

popt, pcov = scipy.optimize.curve_fit(straight, x, y, p0=[1.,2.])

curve_fit returns first the best-fit parameters, and also the covariance matrix of the parameters:


In [116]:
print("The best fit slope and intercept are " + str(popt))
print("The covariance matrix of slope and intercept is " + str(pcov))


The best fit slope and intercept are [ 0.86325717  2.97785033]
The covariance matrix of slope and intercept is [[  1.39444634e-05  -2.32193615e-04]
 [ -2.32193615e-04   3.87723998e-03]]

Let's plot this to see how well our model's done!


In [131]:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111)
ax.scatter(x,y, color="orange", alpha=0.7)
ax.plot(x, straight(x, *popt), color="black")
ax.set_xlabel("r-band magnitude")
ax.set_ylabel("g-band magnitude")


Out[131]:
<matplotlib.text.Text at 0x1199fa150>

Note that standard linear regression assumes a model that's strictly linear in the coefficients (that is, for example a power law of the type $f(x) = ax^b$ wouldn't be), as well as errors on the data that are independent and identically distributed (they are homoscedastic), curve_fit allows for non-linear models as well as heteroscedastic errors:


In [128]:
## make up some heteroscedastic errors:
yerr = np.random.normal(size=y.shape[0])*y/10.

popt, pcov = scipy.optimize.curve_fit(straight, x, y, sigma=yerr)

In [132]:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111)
ax.errorbar(x,y, fmt="o", yerr=yerr, color="orange", alpha=0.7)
ax.plot(x, straight(x, *popt), color="black")
ax.set_xlabel("r-band magnitude")
ax.set_ylabel("g-band magnitude")


Out[132]:
<matplotlib.text.Text at 0x119a5d750>

The errors that I made up for the example above depend on the value of each data point. Therefore, for a higher r-band magnitude, the error will be larger, too.

Note that curve_fit still assumes your measurement errors are Gaussian, which will allow you to use (generalized) least-squares to do the optimization. If this is not true, you will need to define your own likelihood. The likelihood is the probability of obtaining a data set under the assumption of a model and some model parameters. What that means in detail we'll worry about more on Thursday and Friday. Today, I'll just give you an example of how to set up a likelihood (not the only one, mind you) and use other methods in scipy.optimize to do the minimization:


In [170]:
logmin = -10000000000.0
class PoissonLikelihood(object):
    def __init__(self, x, y, func):
        """ Set up the model"""
        self.x = x
        self.y = y
        self.func = func
    
    def loglikelihood(self, pars, neg=True):
        """Set up a simple Poisson likelihood"""
        m = self.func(self.x, *pars)
        logl = np.sum(-m + self.y*np.log(m) - scipy.special.gammaln(self.y + 1.))
        
        ## catch infinite values and NaNs
        if np.isinf(logl) or np.isnan(logl):
            logl = logmin
        ## can choose whether to return log(L) or -log(L)
        if neg:
            return -logl
        else:
            return logl
    
    def __call__(self, pars, neg=True):
        return self.loglikelihood(pars, neg)

In [171]:
loglike = PoissonLikelihood(x, y, straight)
ll = loglike([1., 3.], neg=True)
print("The log-likelihood for our trial parameters is " + str(ll))


The log-likelihood for our trial parameters is 10710.7336839

In practice, we want to maximize the likelihood, but optimization routines always minimize a function. Therefore, we actually want to minimize the log-likelihood.

With the class we've set up above, we can now put this into scipy.optimize.minimize. This function provides a combined interface to the numerous optimization routines available in scipy.optimize. Some of these allow for constrained optimization (that is, optimization where the possible parameter range is restricted), others don't. Each optimization routine may have its own keyword parameters. I invite you to look at the documentation here http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html for more details.


In [173]:
res = scipy.optimize.minimize(loglike, [1, 3], method="L-BFGS-B")

The output will be an object that contains the necessary information, for example the best-fit parameters, the covariance matrix (for some, not all methods), a status message describing whether the optimization was successful, and the value of the likelihood function at the end of the optimization:


In [175]:
res


Out[175]:
   status: 0
  success: True
     nfev: 30
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      fun: 10108.5945972429
        x: array([ 0.86242749,  2.99169114])
  message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
      jac: array([-0.29031071,  0.1204171 ])
      nit: 5

In [178]:
print("Was the optimization successful? " + str(res.success))
print("The best-fit parameters are: " + str(res.x))
print("The likelihood value at the best-fit parameters is: " + str(res.fun))


Was the optimization successful? True
The best-fit parameters are: [ 0.86242749  2.99169114]
The likelihood value at the best-fit parameters is: 10108.5945972