# Numerical Summaries of Data

Congratulations, you have some data. Now what? Well, ideally, you'll have a research question to match. In practice, that's not always true or even possible. So, in the following tutorial, we're going to present some methods to explore your data, using an example data set from SDSS. Most of these methods focus on summarizing the data in some way, that is, compress the information in your data set in such a way that it will be more useful to you than the raw data. There are many ways to achieve this sort of compression; we'll only showcase a few of them. In general, one can summarize data numerically, that is, compute a set of numbers that describe the data in a meaningful way that trades loss of information with descriptiveness. One can also summarize data sets visually, in the form of graphs. Here, we will explore the former category (although we won't get away without making any graphs here, either!). The following tutorial will explore visual summaries of data in more detail.

Many of these methods may seem familiar, but we'll try to teach you something new, too! At the same time, you'll also get to know some of the most useful packages for numerical computation in python: numpy and scipy.

Before we start, let's load some astronomy data!



In [1]:

import os
import requests

# get some CSV data from the SDSS SQL server
URL = "http://skyserver.sdss.org/dr12/en/tools/search/x_sql.aspx"

cmd = """
SELECT TOP 10000
p.u, p.g, p.r, p.i, p.z, s.class, s.z, s.zerr
FROM
PhotoObj AS p
JOIN
SpecObj AS s ON s.bestobjid = p.objid
WHERE
p.u BETWEEN 0 AND 19.6 AND
p.g BETWEEN 0 AND 20 AND
(s.class = 'STAR' OR s.class = 'GALAXY' OR s.class = 'QSO')
"""
if not os.path.exists('all_colors.csv'):
cmd = ' '.join(map(lambda x: x.strip(), cmd.split('\n')))
response = requests.get(URL, params={'cmd': cmd, 'format':'csv'})
with open('all_colors.csv', 'w') as f:
f.write(response.text)




In [2]:

import pandas as pd




In [3]:

df




Out[3]:

u
g
r
i
z
class
z1
zerr

0
19.16978
17.08930
16.04692
15.55223
15.17684
GALAXY
0.076246
0.000019

1
18.88655
17.26482
16.53105
16.23936
16.05161
STAR
0.000122
0.000009

2
18.94358
17.47576
16.79267
16.49504
16.30348
STAR
0.000161
0.000011

3
18.42756
16.98290
16.79660
16.78195
16.72049
STAR
0.000123
0.000010

4
18.76420
16.92113
16.04104
15.65526
15.39297
STAR
0.000241
0.000008

5
17.20783
15.75872
15.63107
15.59838
15.56201
STAR
0.000142
0.000005

6
18.28920
16.84220
16.68952
16.64136
16.58683
STAR
0.000222
0.000008

7
17.69090
16.22015
15.54465
15.20622
15.05557
STAR
0.000150
0.000009

8
19.10833
17.49509
16.73375
16.40439
16.16703
STAR
0.000188
0.000009

9
18.18435
16.81769
16.30527
16.06046
15.91306
STAR
0.000210
0.000012

10
18.98582
17.43833
16.70121
16.40430
16.22956
STAR
0.000339
0.000010

11
17.78138
16.38120
16.31779
16.31096
16.27825
STAR
0.000247
0.000006

12
16.12573
15.05371
15.04288
15.09971
15.12204
STAR
0.000225
0.000004

13
17.88314
16.59722
16.52256
16.50586
16.57543
STAR
0.000207
0.000007

14
16.50660
15.11010
15.02940
15.05553
15.01521
STAR
0.000166
0.000004

15
17.54357
16.15225
16.02519
16.01606
15.98047
STAR
0.000212
0.000006

16
17.46517
15.97317
15.81342
15.78044
15.71757
STAR
0.000230
0.000006

17
18.55322
16.99053
16.78870
16.72765
16.73230
STAR
0.000225
0.000009

18
17.65196
16.34665
16.21154
16.19021
16.13198
STAR
0.000207
0.000007

19
18.94093
17.30824
16.68718
16.37980
16.19246
STAR
0.000109
0.000016

20
19.42305
17.70929
16.88225
16.48898
16.24508
STAR
0.000103
0.000012

21
17.49978
15.94900
15.74521
15.66466
15.56982
STAR
0.000208
0.000008

22
17.28362
15.78144
15.60855
15.56985
15.50644
STAR
0.000233
0.000005

23
18.48349
16.99278
16.83139
16.76890
16.71789
STAR
0.000178
0.000011

24
17.90772
16.59890
16.51875
16.52415
16.49093
STAR
0.000183
0.000008

25
18.05621
16.50874
16.21932
16.08952
16.00218
STAR
0.000201
0.000008

26
17.80663
16.30726
16.14573
16.10564
16.07638
STAR
0.000068
0.000006

27
19.48872
17.48968
16.54476
16.17460
15.91617
STAR
-0.000018
0.000008

28
18.55945
17.00299
16.78813
16.69907
16.62170
STAR
0.000206
0.000009

29
17.05843
15.67654
15.53414
15.51704
15.48902
STAR
0.000182
0.000005

...
...
...
...
...
...
...
...
...

9970
18.66580
18.66566
18.89007
19.06359
19.27806
QSO
1.096592
0.000229

9971
18.55215
16.84621
15.91145
15.46771
15.12429
GALAXY
0.122022
0.000034

9972
17.00455
15.53162
14.82834
14.41894
14.12202
GALAXY
0.037770
0.000008

9973
18.91488
16.83624
15.89665
15.45328
15.09546
GALAXY
0.059557
0.000014

9974
18.85091
17.52745
16.87109
16.44857
16.19461
GALAXY
0.081072
0.000009

9975
19.27532
18.08082
17.56893
17.21910
17.06681
GALAXY
0.125263
0.000008

9976
18.91428
17.95433
17.71039
17.56416
17.54964
STAR
-0.000035
0.000030

9977
17.98449
17.07576
16.77341
16.65288
16.64050
STAR
-0.000041
0.000007

9978
18.88045
16.92197
16.11383
15.71679
15.40778
GALAXY
0.054414
0.000015

9979
17.52261
16.09159
15.28299
14.85429
14.51337
GALAXY
0.037375
0.000012

9980
18.35714
17.27558
17.09887
17.07650
17.08409
STAR
-0.000111
0.000014

9981
18.48715
17.19322
16.63678
16.31632
16.11982
GALAXY
0.069888
0.000016

9982
18.68424
17.68931
17.35985
17.13924
16.98174
GALAXY
0.037802
0.000006

9983
18.68946
17.18591
16.56903
16.32432
16.17589
STAR
-0.000136
0.000011

9984
19.03388
16.73407
15.74353
15.37458
15.14844
STAR
-0.000015
0.000008

9985
18.10310
16.75151
16.09899
15.76913
15.51989
GALAXY
0.016488
0.000017

9986
17.32072
15.74650
15.11705
14.88004
14.77378
STAR
0.000011
0.000007

9987
18.87338
16.83988
15.79945
15.32242
14.90727
GALAXY
0.086303
0.000019

9988
18.53954
18.10493
17.81214
17.44697
17.50096
GALAXY
0.146669
0.000004

9989
18.82698
17.70184
17.23304
17.06105
16.98843
STAR
-0.000465
0.000010

9990
18.66400
17.08254
16.47720
16.24797
16.14481
STAR
0.000139
0.000009

9991
19.01188
17.14509
16.36780
15.98576
15.69298
GALAXY
0.017033
0.000024

9992
19.48624
18.17650
17.69476
17.41129
17.31583
GALAXY
0.033752
0.000008

9993
19.26799
17.34957
16.48178
16.10234
15.86800
STAR
0.000081
0.000008

9994
17.52086
16.05171
15.38772
15.03875
14.83801
STAR
-0.000807
0.000011

9995
19.24875
18.19267
17.74338
17.52076
17.44764
STAR
-0.000265
0.000010

9996
18.45863
17.21609
16.71649
16.53343
16.42949
STAR
-0.000106
0.000010

9997
19.22442
16.56367
15.32245
14.84301
14.56661
STAR
-0.000055
0.000008

9998
19.17015
17.75889
17.12812
16.68459
16.40116
GALAXY
0.066407
0.000011

9999
19.54867
17.26397
16.29182
15.88584
15.65525
STAR
-0.000101
0.000007

10000 rows × 8 columns



## Summarizing Data

There are a few very simple, but often used ways to summarize data: the arithmetic mean and median, along the standard deviation, variance or standard error. For a first look at what your data looks like, these can be incredibly useful tools, but be aware of their limitations, too!

The arithmetic mean of a (univariate) sample $x = \{x_1, x_2, ..., x_n\}$ with n elements is defined as

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i}$ .

In python, you can of course define your own function to do this, but much faster is the implementation in numpy:



In [15]:

import numpy as np

galaxies = df[df["class"]=="GALAXY"]
x = np.array(galaxies["r"])

x_mean = np.mean(x)

print("The mean of the r-band magnitude of the galaxies in the sample is %.3f"%x_mean)




The mean of the r-band magnitude of the galaxies in the sample is 16.651



Note that if you have multivariate data (i.e. several dimensions), you can use the axis keyword to specify which dimension to average over:



In [16]:

x_multi = np.array(galaxies[["u", "g", "r"]])
print(x.shape)




(4299,)




In [17]:

## global average
print(np.mean(x_multi))

## average over the sample for each colour
print(np.mean(x_multi, axis=0))

## average over all colours for each galaxy in the sample
print(np.mean(x_multi, axis=1))




17.6026387927
[ 18.80438917  17.35221503  16.65131218]
[ 17.43533333  16.05213667  18.12191667 ...,  17.50825667  18.4525
18.01905333]



Note: which dimension you want to average over depends on the shape of your array, so be careful (and check the shape of the output, if in doubt!).

The average is nice to have, since it's a measure for the "center of gravity" of your sample, however, it is also prone to be strongly affected by outliers! In some cases, the median, the middle of the sample, can be a better choice. For a sample of length $n$, the median is either the $(n+1)/2$-th value, if $n$ is odd, or the mean of the middle two values $n/2$ and $(n+1)/2$, if $n$ is even. Again, numpy allows for easy and quick computation:



In [21]:

x_med = np.median(x)
print("The median of the r-band magnitude of the sample of galaxies is %.3f"%x_med)




The median of the r-band magnitude of the sample of galaxies is 16.756



Both the median and mean are useful, but be aware that unless the underlying distribution that created your sample is symmetric:



In [118]:

%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(16,6))
xcoord = np.linspace(0,2,100)
ax1.plot(xcoord, scipy.stats.norm(1.0, 0.3).pdf(xcoord))
ax1.set_title("Symmetric distribution")
ax2.plot(xcoord, scipy.stats.lognorm(1.0, 0.1).pdf(xcoord))
ax2.set_title("Asymmetric distribution")




Out[118]:

<matplotlib.text.Text at 0x118bfac50>



For a (large enough) sample drawn from the distribution the mean and median will not be the same, and more importantly, in this case, they will also not be equal to the most probable value (the top of the distribution). The latter is also called the mode. There's no intrinsic nifty way to compute the mode of a distribution (or sample). In practice, for a univariate sample, you might make a histogram of the sample (that is, define finite bins over the width of your sample, and count all elements that fall into each bin), and then find the bin with the most counts. Note that for creating a histogram, you need to pick the number of your bins (or, equivalently, the width of each bin). This can be tricky, as we'll discuss in more detail in the visualization part of this tutorial.



In [37]:

h, edges = np.histogram(x, bins=30,range=[16.,19.6], normed=False)




In [45]:

## find the index where the binned counts have their maximum
h_max = np.where(h == np.max(h))[0]

## these are the maximum binned counts
max_counts = h[h_max]

## find the middle of the bin with the maximum counts
edge_min = edges[h_max]
edge_max = edges[h_max+1]
edge_mean = np.mean([edge_min, edge_max])

print("The mode of the distribution is located at %.4f"%edge_mean)




The mode of the distribution is located at 17.0200



Mean, median and mode all tell us something about the center of the sample. However, it tells us nothing about the spread of the sample: are most values close to the mean (high precision) or are they scattered very far (low precision)?

For this, we can use the variance, the squared deviations from the mean:

$s^2_x = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2}$

or its square root, the standard deviation: $s_x = \sqrt{s^2_x}$.

Similarly to the mean, there are functions in numpy for the mean and the standard deviation, and again, it is possible to specify the axis along which to compute either:



In [47]:

x_var = np.var(x)
x_std = np.std(x)

print("The variance of the r-band magnitude for the galaxies in the sample is %.4f"%x_var)
print("The standard deviation of the r-band magnitude for the galaxies in the sample is %.4f"%x_std)

x_var_multi = np.var(x_multi, axis=0)
x_std_multi = np.std(x_multi, axis=0)

print("The variance of the for all three bands for the galaxies in the sample is "+str(x_var_multi))
print("The standard deviation for all three bands for the galaxies in the sample is " + str(x_std_multi))




The variance of the r-band magnitude for the galaxies in the sample is 0.7825
The standard deviation of the r-band magnitude for the galaxies in the sample is 0.8846
The variance of the for all three bands for the galaxies in the sample is [ 0.52926058  0.63001774  0.78250212]
The standard deviation for all three bands for the galaxies in the sample is [ 0.72750298  0.79373657  0.8845915 ]



Note: If your data set contains NaN values, you can use the functions nanmean, nanvar and nanstd instead, which will compute the mean, variance and standard deviation, respectively, while ignoring any NaN values in your data.

What is the error on the mean? That depends on the size of your sample! Imagine you pick many samples from the population of possible outcomes. If each sample has a large number of elements, then you are more likely to find similar means for each of your samples than if the sample size is small.

Given that we might not have many samples from the population (but perhaps only one SDSS data set!), we'd like to quantify how well we can specify the mean. We do this by dividing the variance by the number of data points and taking the square root to get the standard error:



In [51]:

se = np.sqrt(np.var(x)/x.shape[0])
print("The standard error on the mean for the r-band galaxy magnitudes is %.4f"%se)




The standard error on the mean for the r-band galaxy magnitudes is 0.0135



Mean and variance give us general information about the center and the spread of the sample, but tell us nothing about the shape.

If we want yet more information about how the data are distributed, we'll have to introduce yet another concept: the quantile the $\alpha$-quantile of a sample is the point below which a fraction $\alpha$ of the data occur. For example, the $0.25$-quantile is the point below which $25\%$ of the sample is located. Note that the 0.5-quantile is the median.

The 0.25, 0.5 and 0.75 quantiles are also called the first, second and third quantile, respectively. The difference between the 0.25 and 0.75 quantile are called the interquartile range (remember that! We'll come back to it!).

This time, there's no nifty numpy function (that I know of) that we can use. Instead, we'll turn to scipy instead, which contains much functionality for statistics:



In [52]:

from scipy.stats.mstats import mquantiles

q = mquantiles(x, prob=[0.25, 0.5, 0.75])
print("The 0.25, 0.5 and 0.75 of the r-band galaxy magnitudes are " + str(q))




The 0.25, 0.5 and 0.75 of the r-band galaxy magnitudes are [ 16.182876  16.75585   17.29843 ]



Using this, we can now compute the Tukey five-number summary: a collection of five numbers that contains first quartil, median, second quartile as well as the minimum and maximum values in the sample. Together, they give a reasonably good first impression of how the data are distributed:



In [53]:

def tukey_five_number(x):
x_min = np.min(x)
x_max = np.max(x)
q = mquantiles(x, prob=[0.25, 0.5, 0.75])
return np.hstack([x_min, q, x_max])




In [54]:

print("The Tukey five-number summary of the galaxy r-band magnitudes is: " + str(tukey_five_number(x)))




The Tukey five-number summary of the galaxy r-band magnitudes is: [ 11.62901   16.182876  16.75585   17.29843   24.80204 ]



Finally, if you're working with a pandas DataFrame, you can use the method describe to print some statistical summaries as well:



In [112]:

df.describe()




Out[112]:

u
g
r
i
z
z1
zerr

count
10000.000000
10000.000000
10000.000000
10000.000000
10000.000000
10000.000000
10000.000000

mean
18.544234
17.250815
16.713258
16.466774
16.309103
0.124569
0.060707

std
0.907272
1.032828
1.152263
1.212772
1.286174
0.396333
4.647630

min
11.960040
11.688460
11.271150
11.048570
10.629170
-0.002965
-4.000000

25%
18.080027
16.666845
16.043048
15.749308
15.533058
0.000080
0.000008

50%
18.792405
17.414145
16.785750
16.507530
16.331530
0.001224
0.000010

75%
19.239483
17.955283
17.462840
17.208642
17.096488
0.083216
0.000016

max
19.599920
19.920830
24.802040
28.182330
22.826920
5.854882
429.914900



Let's think more about the mean of a sample. Imagine we actually have a reasonably good idea of what the mean should be. We might then want to ask whether the sample we collected is consistent with that predetermined theoretical value. To do that, we can take the difference between the sample mean and the theoretical value, but we'll need to normalize it by the standard error. After all, the sample mean could be far from the theoretical prediction, but if the spread in the sample is large, that might not mean much.

$t = \frac{\bar{x} - \mu}{\sqrt{(s_x^2/n)}}$

This number is also called the Student t-statistic for univariate data.

This is, of course, not hard to write down in python with the functions for mean and variance we've learned above, but as above, we can make use of functionality in scipy to achieve the same result:



In [64]:

import scipy.stats

## some theoretical prediction
mu = 16.8

## compute the t-statistic.
t = scipy.stats.ttest_1samp(x, mu)

t_statistic = t.statistic
p_value = t.pvalue

print("The t-statistic for the galaxy r-band magnitude is %.4f"%t_statistic)
print("The p-value for that t-statistic is " + str(p_value))




The t-statistic for the galaxy r-band magnitude is -11.0196
The p-value for that t-statistic is 7.23412760149e-28



need to write some explanation here

## Statistical Distributions in scipy

Aside from the two examples above, scipy.stats has a large amount of functionality that can be helpful when exploring data sets.

We will not go into the theory of probabilities and probability distributions here. Some of that, you will learn later this week. But before we start, some definitions:

• random variable: technically, a function that maps sample space (e.g. "red", "green", "blue") of some process onto real numbers (e.g. 1, 2, 3)
• we will distinguish between continuous random variables (which can take all real values in the support of the distribution) and discrete random variables (which may only take certain values, e.g. integers).
• the probability mass function (PMF) for discrete variables maps the probability of a certain outcome to that outcome
• in analogy, the probability density function (PDF) for continuous random variables is the probability mass in an interval, divided by the size of that interval (in the limit of the interval size going to zero)
• the cumulative distribution function (CDF) at a point x of a distribution is the probability of a random variable X being smaller than x: it translates to the integral (sum) over the PDF (PMF) from negative infinity up to that point x for a continuous (discrete) distribution.

scipy.stats defines a large number of both discrete and continuous probability distributions that will likely satisfy most of your requirements. For example, let's define a standard normal distribution.



In [67]:

## continuous (normal) distribution

loc = 2.0
scale = 0.4

dist = scipy.stats.norm(loc=loc, scale=scale)
dist




Out[67]:

<scipy.stats._distn_infrastructure.rv_frozen at 0x117f7be90>



We've now created an object that defines a normal (Gaussian) distribution with a scale parameter of 0.4 and a center of 2. What can we do with this?

For example, draw a random sample:



In [68]:

s = dist.rvs(100)
print(s)




[ 2.50431444  2.11058364  2.24071986  2.47497436  1.75850039  1.62600395
2.20289435  1.89238838  1.91996427  2.00952353  2.14367686  2.23121868
1.62096689  2.74025872  2.19697373  0.85651104  2.01703714  1.51215581
1.49519117  1.51461253  2.57479157  1.98103806  2.41766122  2.52023994
1.44844469  1.65138438  1.65253577  2.71651984  1.81187593  1.23089199
1.47922214  2.17067098  1.38393825  1.26531693  2.7052431   2.49835938
1.59987787  2.0119105   2.1934772   1.47614296  2.9129221   2.15115603
1.53630225  3.00673904  1.58606667  2.09797873  1.66217778  1.10926118
2.23990586  1.34947789  2.43955822  1.37295593  1.79612009  1.78488342
2.12709361  1.90490522  2.00577896  1.83835018  2.11482897  1.8851669
1.97353032  1.86727465  1.79220657  1.19724902  1.64829546  1.84016529
2.02792293  2.53090503  2.14080624  2.21234171  1.83794103  1.86191577
1.35946016  2.28622588  2.01345457  1.70204355  2.14510779  1.98139534
2.0577144   2.05795734  2.47645635  2.10432697  2.13939156  2.57930661
2.06077678  2.44659289  1.75413505  1.76590278  2.68072807  1.67090906
2.40336699  1.87528165  1.89928568  1.12027431  2.28240783  1.70494533
0.65313723  2.42197543  2.05932398  1.44786458]



We have sampled 100 numbers from that distribution!

We can also compute the PDF in a given range:



In [69]:

xcoord = np.linspace(0,4,500)
pdf = dist.pdf(xcoord)

plt.plot(xcoord, pdf)




Out[69]:

[<matplotlib.lines.Line2D at 0x1182476d0>]



or the CDF:



In [71]:

cdf = dist.cdf(xcoord)

plt.plot(xcoord, cdf)




Out[71]:

[<matplotlib.lines.Line2D at 0x1182f5f10>]



Mean, median, var and std work as methods for the distribution, too:



In [72]:

print("distribution mean %.4f"%dist.mean())
print("distribution median %.4f"%dist.median())
print("distribution standard deviation %.4f"%dist.std())
print("distribution variance %.4f"%dist.var())




distribution mean 2.0000
distribution median 2.0000
distribution standard deviation 0.4000
distribution variance 0.1600



Easier than that, we can get it to just give us the moments of the distribution: the mean, the variance, the skewness and the kurtosis (below, replace 'mvsk' by any combination of those four letters to get it to print any of the four):



In [76]:

dist.stats(moments='mvsk')




Out[76]:

(array(2.0), array(0.16000000000000003), array(0.0), array(0.0))



For computing the $n$th-order non-central moment, use the moment method:



In [81]:

dist.moment(1)




Out[81]:

2.0



For more advanced purposes, we can also compute the survival function (1-CDF) and the percent point function (1/CDF) of the distribution:



In [82]:

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(16,6))
ax1.plot(xcoord, dist.sf(xcoord))
ax1.set_title("Survival function")

ax2.plot(xcoord, dist.ppf(xcoord))
ax2.set_title("Percent point function")




Out[82]:



To compute a confidence interval around the median of the distribution, use the method interval:



In [83]:

dist.interval(0.6)




Out[83]:

(1.6633515065708342, 2.3366484934291658)



Finally, if you have some data that you think was drawn from the distribution, you may use the fit method to fit the distribution's parameters to the sample:



In [86]:

normal_data = np.random.normal(4.0, 0.2, size=1000)

data_loc, data_scale = scipy.stats.norm.fit(normal_data)
print("the location and scale parameters of the fit distribution are %.3f and %.3f, respectively."%(data_loc, data_scale))




the location and scale parameters of the fit distribution are 4.017 and 0.204, respectively.



Continuous distributions function in exactly the same way, except that instead of a pdf method, they will have a pmf method:



In [91]:

## set the distribution
dist = scipy.stats.poisson(10)

## make an x-coordinate: Poisson distribution only defined
## for positive integer numbers!
xcoord = np.arange(0,50,1.0)

## plot the results
plt.scatter(xcoord, dist.pmf(xcoord))




Out[91]:

<matplotlib.collections.PathCollection at 0x11892ca50>



Exercise: Take the galaxy data we've used above and fit a distribution of your choice (see http://docs.scipy.org/doc/scipy/reference/stats.html for a list of all of them) and compare the parameters of your distribution to the sample mean and variance (if they're comparable.

### Multivariate data

Of course, most data sets aren't univariate. As we've seen above, we can use the same functions that we've used to compute mean, median, variance and standard deviation for multi-variate data sets in the same way as for single-valued data, except that we need to specify the axis along which to compute the operation.

However, these functions will compute the mean, variance etc. for each dimension in the data set independently. This unfortunately tells us nothing about whether the data vary with each other, that is, whether they are correlated in any way. One way to look at whether data vary with each other is by computing the covariance. Let's take our slightly expanded galaxy data set (with three colours from above, and compute the covariance between the three magnitude bands.

Because of the way I've set up the array above, we need to take the transpose of it in order to get the covariance between the bands (and not between the samples!).



In [97]:

cov = np.cov(x_multi.T)
print(cov)




[[ 0.52938372  0.51149323  0.48624376]
[ 0.51149323  0.63016432  0.67565774]
[ 0.48624376  0.67565774  0.78268418]]



Note that the matrix is symmetric about the diagonal. Also, the values on the diagonal are just the variance:



In [98]:

x_var = np.var(x_multi, axis=0)
print(x_var)
x_var_cov = np.diag(cov)
print(x_var_cov)




[ 0.52926058  0.63001774  0.78250212]
[ 0.52938372  0.63016432  0.78268418]



To compute the actual correlation between two samples, compute the correlation coefficient, which is the ratio of sample covariance and the product of the individual standard deviations:

$s_{xy} = \frac{1}{n-1}\sum{(x_i - \bar{x})(y_i - \bar{y})}$ ;

$r = \frac{s_{xy}}{s_x s_y}$



In [104]:

r = np.corrcoef(x_multi.T)
print(r)




[[ 1.          0.88557979  0.75539737]
[ 0.88557979  1.          0.96206977]
[ 0.75539737  0.96206977  1.        ]]



The correlation coefficient can range between -1 and 1. If two samples are only offset by a scaling factor (perfect correlation), then $r = 1$. $r = 0$ implies that there is no correlation.

Another way to compare two samples is to compare the means of the two samples. To do this, we can use a generalization of the Student's t-test above to higher dimensions for two samples $x$ and $y$:

$t = \frac{\bar{x} - \bar{y}}{\sqrt{\left(\frac{s_x^2}{n} + \frac{s_y^2}{n}\right)}}$.

In python:



In [101]:

t = scipy.stats.ttest_ind(x_multi[:,0], x_multi[:,1])
t_stat = t.statistic
t_pvalue = t.pvalue

print("The t-statistic for the two bands is %.4f"%t_stat)
print("The p-value for that t-statistic is " + str(t_pvalue))




The t-statistic for the two bands is 88.4215
The p-value for that t-statistic is 0.0



Note that the t-statistic tests whether the means of the two samples are the same. We can also test whether a sample is likely to be produced by a reference distribution (single-sample Kolmogorov-Smirnov test) or whether two samples are produced by the same, underlying (unknown) distribution (2-sample Kolmogorov-Smirnov test).

The one-sample KS-test (test sample against a reference distribution) can take, for example, the cdf method of any of the distributions defined in scipy.stats.



In [109]:

ks = scipy.stats.kstest(x, scipy.stats.norm(np.mean(x), np.var(x)).cdf)
print("The KS statistic for the sample and a normal distribution is " + str(ks.statistic))
print("The corresponding p-value is " + str(ks.pvalue))




The KS statistic for the sample and a normal distribution is 0.0689309216324
The corresponding p-value is 3.62017717266e-18



Analogously, for the 2-sample KS-test:



In [111]:

ks = scipy.stats.ks_2samp(x_multi[:,0], x_multi[:,1])
print("The KS statistic for the u and g-band magnitudes is " + str(ks.statistic))
print("The corresponding p-value is " + str(ks.pvalue))




The KS statistic for the u and g-band magnitudes is 0.76366596883
The corresponding p-value is 0.0



There are many more statistical tests to compare distributions and samples, too many to showcase them all here. Some of them are implemented in scipy.stats, so I invite you to look there for your favourite test!

### Linear Regression

Perhaps you've found a high correlation coefficient in your data. Perhaps you already expect some kind of functional dependence of your (bivariate) data set. In this case, linear regression comes in handy. Here, we'll only give a very brief overview of how to do it (and introduce you to scipy.optimize). Later in the week, you'll learn how to do parameter estimation properly.

Doing simple linear regression in python requires two steps: one, you need a linear (in the coefficients) model, and a way to optimize the parameters with respect to the data. Here's where scipy.optimize comes in really handy. But first, let's build a model:



In [113]:

def straight(x, a, b):
return a*x+b



Now we'll take two variables, in this case the g- and the r-band magnitudes, and use curve_fit from scipy.optimize to do linear regression:



In [114]:

import scipy.optimize

x = np.array(galaxies["r"])
y = np.array(galaxies["g"])

popt, pcov = scipy.optimize.curve_fit(straight, x, y, p0=[1.,2.])



curve_fit returns first the best-fit parameters, and also the covariance matrix of the parameters:



In [116]:

print("The best fit slope and intercept are " + str(popt))
print("The covariance matrix of slope and intercept is " + str(pcov))




The best fit slope and intercept are [ 0.86325717  2.97785033]
The covariance matrix of slope and intercept is [[  1.39444634e-05  -2.32193615e-04]
[ -2.32193615e-04   3.87723998e-03]]



Let's plot this to see how well our model's done!



In [131]:

fig = plt.figure(figsize=(10,6))
ax.scatter(x,y, color="orange", alpha=0.7)
ax.plot(x, straight(x, *popt), color="black")
ax.set_xlabel("r-band magnitude")
ax.set_ylabel("g-band magnitude")




Out[131]:

<matplotlib.text.Text at 0x1199fa150>



Note that standard linear regression assumes a model that's strictly linear in the coefficients (that is, for example a power law of the type $f(x) = ax^b$ wouldn't be), as well as errors on the data that are independent and identically distributed (they are homoscedastic), curve_fit allows for non-linear models as well as heteroscedastic errors:



In [128]:

## make up some heteroscedastic errors:
yerr = np.random.normal(size=y.shape[0])*y/10.

popt, pcov = scipy.optimize.curve_fit(straight, x, y, sigma=yerr)




In [132]:

fig = plt.figure(figsize=(10,6))
ax.errorbar(x,y, fmt="o", yerr=yerr, color="orange", alpha=0.7)
ax.plot(x, straight(x, *popt), color="black")
ax.set_xlabel("r-band magnitude")
ax.set_ylabel("g-band magnitude")




Out[132]:

<matplotlib.text.Text at 0x119a5d750>



The errors that I made up for the example above depend on the value of each data point. Therefore, for a higher r-band magnitude, the error will be larger, too.

Note that curve_fit still assumes your measurement errors are Gaussian, which will allow you to use (generalized) least-squares to do the optimization. If this is not true, you will need to define your own likelihood. The likelihood is the probability of obtaining a data set under the assumption of a model and some model parameters. What that means in detail we'll worry about more on Thursday and Friday. Today, I'll just give you an example of how to set up a likelihood (not the only one, mind you) and use other methods in scipy.optimize to do the minimization:



In [170]:

logmin = -10000000000.0
class PoissonLikelihood(object):
def __init__(self, x, y, func):
""" Set up the model"""
self.x = x
self.y = y
self.func = func

def loglikelihood(self, pars, neg=True):
"""Set up a simple Poisson likelihood"""
m = self.func(self.x, *pars)
logl = np.sum(-m + self.y*np.log(m) - scipy.special.gammaln(self.y + 1.))

## catch infinite values and NaNs
if np.isinf(logl) or np.isnan(logl):
logl = logmin
## can choose whether to return log(L) or -log(L)
if neg:
return -logl
else:
return logl

def __call__(self, pars, neg=True):
return self.loglikelihood(pars, neg)




In [171]:

loglike = PoissonLikelihood(x, y, straight)
ll = loglike([1., 3.], neg=True)
print("The log-likelihood for our trial parameters is " + str(ll))




The log-likelihood for our trial parameters is 10710.7336839



In practice, we want to maximize the likelihood, but optimization routines always minimize a function. Therefore, we actually want to minimize the log-likelihood.

With the class we've set up above, we can now put this into scipy.optimize.minimize. This function provides a combined interface to the numerous optimization routines available in scipy.optimize. Some of these allow for constrained optimization (that is, optimization where the possible parameter range is restricted), others don't. Each optimization routine may have its own keyword parameters. I invite you to look at the documentation here http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html for more details.



In [173]:

res = scipy.optimize.minimize(loglike, [1, 3], method="L-BFGS-B")



The output will be an object that contains the necessary information, for example the best-fit parameters, the covariance matrix (for some, not all methods), a status message describing whether the optimization was successful, and the value of the likelihood function at the end of the optimization:



In [175]:

res




Out[175]:

status: 0
success: True
nfev: 30
hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
fun: 10108.5945972429
x: array([ 0.86242749,  2.99169114])
message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
jac: array([-0.29031071,  0.1204171 ])
nit: 5




In [178]:

print("Was the optimization successful? " + str(res.success))
print("The best-fit parameters are: " + str(res.x))
print("The likelihood value at the best-fit parameters is: " + str(res.fun))




Was the optimization successful? True
The best-fit parameters are: [ 0.86242749  2.99169114]
The likelihood value at the best-fit parameters is: 10108.5945972