Congratulations, you have some data. Now what? Well, ideally, you'll have a research question to match. In practice, that's not always true or even possible.
So, in the following tutorial, we're going to present some methods to explore your data, using an example data set from SDSS. Most of these methods focus on *summarizing* the data in some way, that is, compressing the information in your data set in such a way that it will be more useful to you than the raw data. There are many ways to achieve this sort of compression; we'll only showcase a few of them. In general, one can summarize data numerically, that is, compute a set of numbers that describe the data in a meaningful way, trading off loss of information against descriptiveness. One can also summarize data sets visually, in the form of graphs. Here, we will explore the former category (although we won't get away without making any graphs, either!). A follow-up tutorial will explore visual summaries of data in more detail.

Many of these methods may seem familiar, but we'll try to teach you something new, too! At the same time, you'll also get to know some of the most useful packages for numerical computation in python: numpy and scipy.

Before we start, let's load some astronomy data!

In [1]:
```python
import os
import requests

# get some CSV data from the SDSS SQL server
URL = "http://skyserver.sdss.org/dr12/en/tools/search/x_sql.aspx"

cmd = """
SELECT TOP 10000
    p.u, p.g, p.r, p.i, p.z, s.class, s.z, s.zerr
FROM
    PhotoObj AS p
JOIN
    SpecObj AS s ON s.bestobjid = p.objid
WHERE
    p.u BETWEEN 0 AND 19.6 AND
    p.g BETWEEN 0 AND 20 AND
    (s.class = 'STAR' OR s.class = 'GALAXY' OR s.class = 'QSO')
"""

if not os.path.exists('all_colors.csv'):
    cmd = ' '.join(map(lambda x: x.strip(), cmd.split('\n')))
    response = requests.get(URL, params={'cmd': cmd, 'format': 'csv'})
    with open('all_colors.csv', 'w') as f:
        f.write(response.text)
```

In [2]:
```python
import pandas as pd

df = pd.read_csv("all_colors.csv", skiprows=1)
```

In [3]:
```python
df
```

There are a few very simple, but often used ways to summarize data: the arithmetic mean and median, along with the standard deviation, variance or standard error. For a first look at your data, these can be incredibly useful tools, but be aware of their limitations, too!

The arithmetic mean of a (univariate) sample $x = \{x_1, x_2, ..., x_n\}$ with n elements is defined as

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i}$ .

In python, you can of course define your own function to do this, but much faster is the implementation in numpy:

In [15]:
```python
import numpy as np

galaxies = df[df["class"] == "GALAXY"]
x = np.array(galaxies["r"])
x_mean = np.mean(x)
print("The mean of the r-band magnitude of the galaxies in the sample is %.3f" % x_mean)
```

For multi-dimensional data, `np.mean` also takes an `axis` keyword to specify which dimension to average over:

In [16]:
```python
x_multi = np.array(galaxies[["u", "g", "r"]])
print(x_multi.shape)
```

In [17]:
```python
## global average
print(np.mean(x_multi))
## average over the sample for each colour
print(np.mean(x_multi, axis=0))
## average over all colours for each galaxy in the sample
print(np.mean(x_multi, axis=1))
```

Note: which dimension you want to average over depends on the shape of your array, so be careful (and check the shape of the output, if in doubt!).

The average is nice to have, since it's a measure for the "center of gravity" of your sample; however, it is also prone to be strongly affected by outliers! In some cases, the **median**, the middle of the sample, can be a better choice. For a sorted sample of length $n$, the median is the $(n+1)/2$-th value if $n$ is odd, or the mean of the two middle values (the $n/2$-th and $(n/2+1)$-th) if $n$ is even.
Again, numpy allows for easy and quick computation:

In [21]:
```python
x_med = np.median(x)
print("The median of the r-band magnitude of the sample of galaxies is %.3f" % x_med)
```
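As a quick check of the odd/even rule above, here is a toy example (not the SDSS data) where we can work out the median by hand:

```python
import numpy as np

# odd-length sample: the median is the middle value of the sorted sample
odd = np.array([3.0, 1.0, 2.0])
print(np.median(odd))   # 2.0

# even-length sample: the median is the mean of the two middle values
even = np.array([4.0, 1.0, 3.0, 2.0])
print(np.median(even))  # (2 + 3) / 2 = 2.5
```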

In [118]:
```python
%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
xcoord = np.linspace(0, 2, 100)
ax1.plot(xcoord, scipy.stats.norm(1.0, 0.3).pdf(xcoord))
ax1.set_title("Symmetric distribution")
ax2.plot(xcoord, scipy.stats.lognorm(1.0, 0.1).pdf(xcoord))
ax2.set_title("Asymmetric distribution")
```

For a symmetric distribution, mean and median coincide with the peak of the distribution; for an asymmetric distribution, however, they need **not** be equal to the most probable value (the top of the distribution). The latter is also called the **mode**. There's no intrinsic nifty way to compute the mode of a distribution (or sample). In practice, for a univariate sample, you might make a **histogram** of the sample (that is, define finite bins over the width of your sample, and count all elements that fall into each bin), and then find the bin with the most counts. Note that for creating a histogram, you need to pick the number of your bins (or, equivalently, the width of each bin). This can be tricky, as we'll discuss in more detail in the visualization part of this tutorial.

In [37]:
```python
h, edges = np.histogram(x, bins=30, range=[16., 19.6], density=False)
```

In [45]:
```python
## find the index of the bin with the maximum counts
h_max = np.argmax(h)
## these are the maximum binned counts
max_counts = h[h_max]
## find the middle of the bin with the maximum counts
edge_min = edges[h_max]
edge_max = edges[h_max + 1]
edge_mean = np.mean([edge_min, edge_max])
print("The mode of the distribution is located at %.4f" % edge_mean)
```

Mean, median and mode all tell us something about the center of the sample. However, they tell us nothing about the spread of the sample: are most values close to the mean (small spread) or are they scattered very widely (large spread)?

For this, we can use the variance, the squared deviations from the mean:

$s^2_x = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2}$

or its square root, the standard deviation: $s_x = \sqrt{s^2_x}$.

Similarly to the mean, there are functions in numpy for the variance and the standard deviation, and again, it is possible to specify the axis along which to compute either:

In [47]:
```python
x_var = np.var(x)
x_std = np.std(x)
print("The variance of the r-band magnitude for the galaxies in the sample is %.4f" % x_var)
print("The standard deviation of the r-band magnitude for the galaxies in the sample is %.4f" % x_std)

x_var_multi = np.var(x_multi, axis=0)
x_std_multi = np.std(x_multi, axis=0)
print("The variance for all three bands for the galaxies in the sample is " + str(x_var_multi))
print("The standard deviation for all three bands for the galaxies in the sample is " + str(x_std_multi))
```
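One caveat worth knowing: `np.var` and `np.std` divide by $n$ by default (`ddof=0`), whereas the formula above divides by $n - 1$. Pass `ddof=1` to get the sample variance exactly as written above. A toy illustration:

```python
import numpy as np

toy = np.array([1.0, 2.0, 3.0, 4.0])

# numpy's default divides by n (ddof=0)
print(np.var(toy))          # 1.25

# the sample variance from the formula above divides by n - 1
print(np.var(toy, ddof=1))  # 5/3
```

For large samples the difference is negligible, but for small ones it matters.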

**Note**: If your data set contains NaN values, you can use the functions `nanmean`, `nanvar` and `nanstd` instead, which will compute the mean, variance and standard deviation, respectively, while ignoring any NaN values in your data.
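A minimal illustration of the NaN-aware functions, using toy data:

```python
import numpy as np

x_nan = np.array([1.0, 2.0, np.nan, 4.0])

print(np.mean(x_nan))     # nan: the NaN propagates
print(np.nanmean(x_nan))  # 7/3: the NaN is ignored
print(np.nanstd(x_nan))   # standard deviation over the non-NaN values
```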

What is the error on the mean? That depends on the size of your sample! Imagine you pick many samples from the population of possible outcomes. If each sample has a large number of elements, then you are more likely to find similar means for each of your samples than if the sample size is small.

Given that we might not have many samples from the population (but perhaps only one SDSS data set!), we'd like to quantify how well we can specify the mean. We do this by dividing the variance by the number of data points and taking the square root to get the **standard error**:

In [51]:
```python
se = np.sqrt(np.var(x) / x.shape[0])
print("The standard error on the mean for the r-band galaxy magnitudes is %.4f" % se)
```
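scipy also provides `scipy.stats.sem` for the standard error of the mean. Note that it uses the $n - 1$ sample variance by default, so we pass `ddof=0` here to match the by-hand computation above exactly (the synthetic sample is just for illustration):

```python
import numpy as np
import scipy.stats

demo = np.random.normal(0.0, 1.0, size=1000)

# by hand, as above (using numpy's default ddof=0 variance)
se_manual = np.sqrt(np.var(demo) / demo.shape[0])

# scipy's helper; ddof=0 makes it match the manual version exactly
se_scipy = scipy.stats.sem(demo, ddof=0)
print(se_manual, se_scipy)
```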

Mean and variance give us general information about the center and the spread of the sample, but tell us nothing about the shape.

If we want yet more information about how the data are distributed, we'll have to introduce another concept: the **quantile**. The $\alpha$-quantile of a sample is the point below which a fraction $\alpha$ of the data occurs.
For example, the $0.25$-quantile is the point below which $25\%$ of the sample is located. Note that the 0.5-quantile is the median.

The 0.25, 0.5 and 0.75 quantiles are also called the first, second and third **quartiles**, respectively. The difference between the 0.75 and 0.25 quantiles is called the **interquartile range** (remember that! We'll come back to it!).
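As a sketch of quartiles and the interquartile range, here's a toy example where the answer is easy to verify (using `np.percentile` and the convenience function `scipy.stats.iqr`):

```python
import numpy as np
import scipy.stats

vals = np.arange(1.0, 101.0)  # 1, 2, ..., 100

# note: np.percentile takes percentages, not fractions
q25, q75 = np.percentile(vals, [25, 75])
print(q25, q75, q75 - q25)

# scipy has a convenience function for the interquartile range
print(scipy.stats.iqr(vals))
```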

This time, we'll turn to scipy, which contains much functionality for statistics (numpy's `np.percentile` would also work):

In [52]:
```python
from scipy.stats.mstats import mquantiles

q = mquantiles(x, prob=[0.25, 0.5, 0.75])
print("The 0.25, 0.5 and 0.75 quantiles of the r-band galaxy magnitudes are " + str(q))
```

In [53]:
```python
def tukey_five_number(x):
    """Compute Tukey's five-number summary: minimum, quartiles, maximum."""
    x_min = np.min(x)
    x_max = np.max(x)
    q = mquantiles(x, prob=[0.25, 0.5, 0.75])
    return np.hstack([x_min, q, x_max])
```

In [54]:
```python
print("The Tukey five-number summary of the galaxy r-band magnitudes is: " + str(tukey_five_number(x)))
```

pandas `DataFrame` objects also have a `describe` method to print some statistical summaries as well:

In [112]:
```python
df.describe()
```

Let's think more about the mean of a sample. Imagine we actually have a reasonably good idea of what the mean should be. We might then want to ask whether the sample we collected is consistent with that predetermined theoretical value. To do that, we can take the difference between the sample mean and the theoretical value, but we'll need to normalize it by the standard error. After all, the sample mean could be far from the theoretical prediction, but if the spread in the sample is large, that might not mean much.

$t = \frac{\bar{x} - \mu}{\sqrt{(s_x^2/n)}}$

This number is also called the **Student t-statistic** for univariate data.

This is, of course, not hard to write down in python with the functions for mean and variance we've learned above, but as above, we can make use of functionality in scipy to achieve the same result:

In [64]:
```python
import scipy.stats

## some theoretical prediction
mu = 16.8

## compute the t-statistic
t = scipy.stats.ttest_1samp(x, mu)
t_statistic = t.statistic
p_value = t.pvalue
print("The t-statistic for the galaxy r-band magnitude is %.4f" % t_statistic)
print("The p-value for that t-statistic is " + str(p_value))
```
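Under the hood, `ttest_1samp` computes exactly the formula above, using the $n - 1$ sample variance. A quick sketch with synthetic data (the sample and the "theoretical" mean `mu0` are made up for illustration):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(42)
sample = rng.normal(5.0, 1.0, size=200)
mu0 = 5.0  # hypothetical theoretical prediction

# the t-statistic by hand, using the n-1 sample variance
t_manual = (np.mean(sample) - mu0) / np.sqrt(np.var(sample, ddof=1) / sample.shape[0])

# scipy's version
t_scipy = scipy.stats.ttest_1samp(sample, mu0).statistic
print(t_manual, t_scipy)
```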

`ttest_1samp` also returns a **p-value**: the probability, under the null hypothesis that the true mean equals $\mu$, of obtaining a t-statistic at least as extreme as the one observed. A small p-value suggests the sample mean is unlikely to have arisen by chance if the true mean really were $\mu$.

Aside from the two examples above, `scipy.stats` has a large amount of functionality that can be helpful when exploring data sets.

We will not go into the theory of probabilities and probability distributions here. Some of that, you will learn later this week. But before we start, some definitions:

- **random variable**: technically, a function that maps the sample space (e.g. "red", "green", "blue") of some process onto real numbers (e.g. 1, 2, 3). We will distinguish between **continuous** random variables (which can take all real values in the *support* of the distribution) and **discrete** random variables (which may only take certain values, e.g. integers).
- the **probability mass function (PMF)** for discrete variables maps the probability of a certain outcome to that outcome.
- in analogy, the **probability density function (PDF)** for continuous random variables is the probability mass in an interval, divided by the size of that interval (in the limit of the interval size going to zero).
- the **cumulative distribution function (CDF)** at a point x of a distribution is the probability of a random variable X being smaller than x: it translates to the integral (sum) over the PDF (PMF) from negative infinity up to that point x for a continuous (discrete) distribution.
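These definitions can be checked numerically: for a discrete distribution, the CDF at x is just the sum of the PMF over all outcomes up to x. A small sketch using a Poisson distribution (the rate of 3.0 is arbitrary):

```python
import numpy as np
import scipy.stats

pois = scipy.stats.poisson(3.0)

# PMF: probability of each individual (integer) outcome
k = np.arange(0, 10)
pmf = pois.pmf(k)

# the CDF at x is the sum of the PMF over all outcomes <= x
print(pois.cdf(4))
print(np.sum(pmf[k <= 4]))  # should agree with the line above
```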

`scipy.stats` defines a large number of both discrete and continuous probability distributions that will likely satisfy most of your requirements. For example, let's define a normal distribution.

In [67]:
```python
## continuous (normal) distribution
loc = 2.0
scale = 0.4
dist = scipy.stats.norm(loc=loc, scale=scale)
dist
```

We've now created an object that defines a normal (Gaussian) distribution with a scale parameter of 0.4 and a center of 2. What can we do with this?

For example, draw a random sample:

In [68]:
```python
s = dist.rvs(100)
print(s)
```

We have sampled 100 numbers from that distribution!

We can also compute the PDF in a given range:

In [69]:
```python
xcoord = np.linspace(0, 4, 500)
pdf = dist.pdf(xcoord)
plt.plot(xcoord, pdf)
```

or the CDF:

In [71]:
```python
cdf = dist.cdf(xcoord)
plt.plot(xcoord, cdf)
```

Mean, median, var and std work as methods for the distribution, too:

In [72]:
```python
print("distribution mean %.4f" % dist.mean())
print("distribution median %.4f" % dist.median())
print("distribution standard deviation %.4f" % dist.std())
print("distribution variance %.4f" % dist.var())
```

The `stats` method computes several moments at once; `moments='mvsk'` requests mean, variance, skew and kurtosis:

In [76]:
```python
dist.stats(moments='mvsk')
```

For computing the $n$th-order non-central moment, use the `moment` method:

In [81]:
```python
dist.moment(1)
```

In [82]:
```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
ax1.plot(xcoord, dist.sf(xcoord))
ax1.set_title("Survival function")
## the ppf takes a probability, so it is only defined on [0, 1]
pcoord = np.linspace(0, 1, 100)
ax2.plot(pcoord, dist.ppf(pcoord))
ax2.set_title("Percent point function")
```
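The two functions are closely related to the CDF: the survival function is one minus the CDF, and the percent point function is its inverse. A quick numerical sketch:

```python
import scipy.stats

ndist = scipy.stats.norm(loc=2.0, scale=0.4)

# the survival function is one minus the CDF
print(ndist.sf(2.5), 1.0 - ndist.cdf(2.5))

# the percent point function inverts the CDF: it maps a probability
# in [0, 1] back to the corresponding x value
print(ndist.ppf(ndist.cdf(2.5)))  # recovers 2.5
```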

To compute an interval containing a given fraction of the probability around the median of the distribution, use the method `interval`:

In [83]:
```python
dist.interval(0.6)
```
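Because `interval(0.6)` leaves equal probability (20% each) in the two tails, its endpoints should agree with the 0.2 and 0.8 quantiles from `ppf`. A small check:

```python
import scipy.stats

ndist = scipy.stats.norm(loc=2.0, scale=0.4)

lower, upper = ndist.interval(0.6)

# the endpoints match the 0.2 and 0.8 quantiles
print(lower, ndist.ppf(0.2))
print(upper, ndist.ppf(0.8))
```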

If you have a sample that you suspect follows a particular distribution, you can use the `fit` method to fit the distribution's parameters to the sample:

In [86]:
```python
normal_data = np.random.normal(4.0, 0.2, size=1000)
data_loc, data_scale = scipy.stats.norm.fit(normal_data)
print("the location and scale parameters of the fit distribution are %.3f and %.3f, respectively." % (data_loc, data_scale))
```

Discrete distributions work much the same way, except that instead of a `pdf` method, they will have a `pmf` method:

In [91]:
```python
## set the distribution
dist = scipy.stats.poisson(10)

## make an x-coordinate: the Poisson distribution is only defined
## for positive integer numbers!
xcoord = np.arange(0, 50, 1.0)

## plot the results
plt.scatter(xcoord, dist.pmf(xcoord))
```

**Exercise**: Take the galaxy data we've used above, fit a distribution of your choice (see http://docs.scipy.org/doc/scipy/reference/stats.html for a list of all of them) and compare the parameters of your distribution to the sample mean and variance (if they're comparable).

Of course, most data sets aren't univariate. As we've seen above, we can use the same functions that we've used to compute mean, median, variance and standard deviation for multi-variate data sets in the same way as for single-valued data, except that we need to specify the axis along which to compute the operation.

However, these functions will compute the mean, variance etc. for each dimension in the data set *independently*. This unfortunately tells us nothing about whether the data vary with each other, that is, whether they are correlated in any way. One way to look at whether data vary with each other is by computing the covariance. Let's take our slightly expanded galaxy data set (with the three colours from above) and compute the covariance between the three magnitude bands.

Because of the way I've set up the array above, we need to take the transpose of it in order to get the covariance between the bands (and not between the samples!).

In [97]:
```python
cov = np.cov(x_multi.T)
print(cov)
```

The diagonal of the covariance matrix contains the variance of each band. Note that `np.cov` normalizes by $n - 1$ by default, so we pass `ddof=1` to `np.var` to match:

In [98]:
```python
x_var = np.var(x_multi, axis=0, ddof=1)
print(x_var)

x_var_cov = np.diag(cov)
print(x_var_cov)
```

To compute the actual correlation between two samples, compute the correlation coefficient, which is the ratio of sample covariance and the product of the individual standard deviations:

$s_{xy} = \frac{1}{n-1}\sum{(x_i - \bar{x})(y_i - \bar{y})}$ ;

$r = \frac{s_{xy}}{s_x s_y}$

In [104]:
```python
r = np.corrcoef(x_multi.T)
print(r)
```

The correlation coefficient can range between -1 and 1. If two samples are related by an exact linear transformation with positive slope (perfect correlation), then $r = 1$; $r = -1$ indicates perfect anticorrelation, and $r = 0$ implies no linear correlation.
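These limiting cases are easy to demonstrate with synthetic data (the variable names and the linear relation below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)

b_scaled = 3.0 * a + 2.0        # exact linear relation to a
b_noise = rng.normal(size=500)  # drawn independently of a

r_perfect = np.corrcoef(a, b_scaled)[0, 1]
r_none = np.corrcoef(a, b_noise)[0, 1]
print(r_perfect)  # 1.0 (up to floating-point rounding)
print(r_none)     # close to 0
```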

Another way to compare two samples is to compare the *means* of the two samples. To do this, we can use a generalization of the Student's t-test above to two samples $x$ and $y$:

$t = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}}$.

In python:

In [101]:
```python
t = scipy.stats.ttest_ind(x_multi[:, 0], x_multi[:, 1])
t_stat = t.statistic
t_pvalue = t.pvalue
print("The t-statistic for the two bands is %.4f" % t_stat)
print("The p-value for that t-statistic is " + str(t_pvalue))
```

Note that the t-statistic tests whether the *means* of the two samples are the same. We can also test whether a sample is likely to be produced by a reference distribution (single-sample Kolmogorov-Smirnov test) or whether two samples are produced by the same, underlying (unknown) distribution (2-sample Kolmogorov-Smirnov test).

The one-sample KS-test (test sample against a reference distribution) can take, for example, the `cdf` method of any of the distributions defined in `scipy.stats`.

In [109]:
```python
## note: scipy.stats.norm takes the standard deviation, not the variance!
ks = scipy.stats.kstest(x, scipy.stats.norm(np.mean(x), np.std(x)).cdf)
print("The KS statistic for the sample and a normal distribution is " + str(ks.statistic))
print("The corresponding p-value is " + str(ks.pvalue))
```

Analogously, for the 2-sample KS-test:

In [111]:
```python
ks = scipy.stats.ks_2samp(x_multi[:, 0], x_multi[:, 1])
print("The KS statistic for the u and g-band magnitudes is " + str(ks.statistic))
print("The corresponding p-value is " + str(ks.pvalue))
```

There are *many* more statistical tests to compare distributions and samples, too many to showcase them all here. Some of them are implemented in scipy.stats, so I invite you to look there for your favourite test!

Perhaps you've found a high correlation coefficient in your data. Perhaps you already expect some kind of functional dependence of your (bivariate) data set.
In this case, linear regression comes in handy. Here, we'll only give a *very* brief overview of how to do it (and introduce you to scipy.optimize). Later in the week, you'll learn how to do parameter estimation properly.

Doing simple linear regression in python requires two steps: one, you need a linear (in the coefficients) model, and a way to optimize the parameters with respect to the data. Here's where scipy.optimize comes in really handy. But first, let's build a model:

In [113]:
```python
def straight(x, a, b):
    return a * x + b
```

Now we can use `curve_fit` from scipy.optimize to do linear regression:

In [114]:
```python
import scipy.optimize

x = np.array(galaxies["r"])
y = np.array(galaxies["g"])
popt, pcov = scipy.optimize.curve_fit(straight, x, y, p0=[1., 2.])
```

`curve_fit` returns both the best-fit parameters and the covariance matrix of the parameters:

In [116]:
```python
print("The best fit slope and intercept are " + str(popt))
print("The covariance matrix of slope and intercept is " + str(pcov))
```

Let's plot this to see how well our model's done!

In [131]:
```python
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)
ax.scatter(x, y, color="orange", alpha=0.7)
ax.plot(x, straight(x, *popt), color="black")
ax.set_xlabel("r-band magnitude")
ax.set_ylabel("g-band magnitude")
```

While the fit above implicitly assumed that all measurement errors are the same (*homoscedastic*), `curve_fit` allows for non-linear models as well as *heteroscedastic* errors:

In [128]:
```python
## make up some heteroscedastic errors:
yerr = np.random.normal(size=y.shape[0]) * y / 10.
popt, pcov = scipy.optimize.curve_fit(straight, x, y, sigma=yerr)
```

In [132]:
```python
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)
ax.errorbar(x, y, fmt="o", yerr=yerr, color="orange", alpha=0.7)
ax.plot(x, straight(x, *popt), color="black")
ax.set_xlabel("r-band magnitude")
ax.set_ylabel("g-band magnitude")
```

The errors that I made up for the example above depend on the value of each data point. Therefore, for a higher r-band magnitude, the error will be larger, too.

Note that `curve_fit` still assumes your measurement errors are Gaussian, which allows you to use (generalized) least-squares to do the optimization. If this is not true, you will need to define your own *likelihood*. The likelihood is the probability of obtaining a data set under the assumption of a model and some model parameters. What that means in detail we'll worry about more on Thursday and Friday. Today, I'll just give you an example of how to set up a likelihood (not the only one, mind you) and use other methods in `scipy.optimize` to do the minimization:

In [170]:
```python
import scipy.special

logmin = -10000000000.0

class PoissonLikelihood(object):

    def __init__(self, x, y, func):
        """Set up the model"""
        self.x = x
        self.y = y
        self.func = func

    def loglikelihood(self, pars, neg=True):
        """Set up a simple Poisson likelihood"""
        m = self.func(self.x, *pars)
        logl = np.sum(-m + self.y * np.log(m) - scipy.special.gammaln(self.y + 1.))
        ## catch infinite values and NaNs
        if np.isinf(logl) or np.isnan(logl):
            logl = logmin
        ## can choose whether to return log(L) or -log(L)
        if neg:
            return -logl
        else:
            return logl

    def __call__(self, pars, neg=True):
        return self.loglikelihood(pars, neg)
```
```
In [171]:
```loglike = PoissonLikelihood(x, y, straight)
ll = loglike([1., 3.], neg=True)
print("The log-likelihood for our trial parameters is " + str(ll))

```
```

In practice, we want to maximize the likelihood, but optimization routines always *minimize* a function. Therefore, we actually minimize the *negative* log-likelihood, which is why the class above returns `-logl` by default.

With the class we've set up above, we can now put this into `scipy.optimize.minimize`. This function provides a combined interface to the numerous optimization routines available in `scipy.optimize`. Some of these allow for constrained optimization (that is, optimization where the possible parameter range is restricted), others don't. Each optimization routine may have its own keyword parameters. I invite you to look at the documentation at http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html for more details.

In [173]:
```python
res = scipy.optimize.minimize(loglike, [1, 3], method="L-BFGS-B")
```

In [175]:
```python
res
```

In [178]:
```python
print("Was the optimization successful? " + str(res.success))
print("The best-fit parameters are: " + str(res.x))
print("The negative log-likelihood at the best-fit parameters is: " + str(res.fun))
```