We are going to spend much of the next 2 weeks going over some basics of statistics and not doing a whole lot of actual machine learning. So today is about giving you a flavor of the kinds of things that we'll be doing later in the course.
In [7]:
# Execute this cell
# This is just to get some things setup for later
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline
You have some one-dimensional ("univariate") data that you would like to try to understand. Where by "understand" we mean "know the distribution in the measured space", i.e., you want to know the probability distribution function (PDF). The easiest way to do that is to make a histogram. Simple, right?
Let's work through some examples to see what problems we encounter and how we might overcome them.
In [8]:
# Execute this cell to generate a univariate data array, x
# this is the same data used in Ivezic, Figure 6.5
np.random.seed(0)
N = 1000
mu_gamma_f = [(5, 1.0, 0.1),
(7, 0.5, 0.5),
(9, 0.1, 0.1),
(12, 0.5, 0.2),
(14, 1.0, 0.1)]
true_pdf = lambda x: sum([f * stats.cauchy(mu, gamma).pdf(x)
for (mu, gamma, f) in mu_gamma_f])
x = np.concatenate([stats.cauchy(mu, gamma).rvs(int(f * N))
for (mu, gamma, f) in mu_gamma_f])
np.random.shuffle(x)
x = x[x > -5]
x = x[x < 25]
Now let's take a first look at the data. Use plt.hist(), which plots a histogram of any univariate data array given as an argument. It takes lots of other arguments too; see http://matplotlib.org/api/pyplot_api.html?highlight=hist#matplotlib.pyplot.hist. You might start with histtype = "step" so that we can later add information to the plot and still see the original histogram. See what happens if you don't do this.
In [ ]:
plt.hist( # complete
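If you get stuck, a minimal completion might look like this (just a sketch; the bin settings are left at their defaults):

plt.hist(x, histtype="step")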
How would you characterize this distribution? Could we reasonably think of it as a normal distribution that we could characterize by some mean and standard deviation? Maybe, but even just by looking at this plot we see that it wouldn't be a particularly good description of the data.
Now do what we would typically do in astronomy: try re-plotting the histogram a couple of different ways to see if it reveals anything more about the data. Try using only 5 bins (bins=5), 20 bins, and left-aligned bins (align = "left" instead of the default align = "mid").
In [ ]:
plt.hist( # complete
plt.hist( # complete
plt.hist( # complete
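One possible set of calls (the parameter values are just the suggestions above):

plt.hist(x, bins=5, histtype="step")
plt.hist(x, bins=20, histtype="step")
plt.hist(x, bins=20, align="left", histtype="step")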
Note that the heights of these histograms are all different! That's because the y-axis shows raw counts, which depend on the number (and width) of the bins. Try using normed = True (called density = True in matplotlib 3.1 and later) so that the histograms are all normalized to unit area.
In [ ]:
plt.hist( # complete
plt.hist( # complete
plt.hist( # complete
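For reference, a sketch of the normalized version (density=True replaces normed=True in newer matplotlib):

plt.hist(x, bins=5, density=True, histtype="step")
plt.hist(x, bins=20, density=True, histtype="step")
plt.hist(x, bins=20, align="left", density=True, histtype="step")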
We find that small changes in parameters to the histogram function significantly change the PDF. That's bad, because the underlying data clearly have not changed.
One of the problems with histograms is that some bins end up with little (or no) data. We can fix this by making variable-width bin sizes that have the same number of objects in each bin. How can we do this?
In [10]:
#Execute this cell
a = np.linspace(1, 42, num=42)
print(a)
print(a[::2])
print(a[::3])
If you are familiar with slicing, then you know that [::2] and [::3] say to count by 2 and by 3. But that isn't quite what they do: they take every other index of the array, or every 3rd index of the array. So, if your array is sorted (like a is), then you could use this to instead fix the number of values in each bin. That is, for any given value of M,
bins = np.append(np.sort(x)[::M], np.max(x))
would give bins with M objects in each bin. Note that you need to append the maximum value to set the right edge of the last bin. Try it for M=100, 50, and 25.
In [95]:
bins = # complete
In [ ]:
plt.hist( # complete
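As a concrete sketch (here with M = 100, and normalizing with density=True so that the unequal bin widths give a proper density; adjust as you like):

M = 100
bins = np.append(np.sort(x)[::M], np.max(x))  # append the maximum to close the last bin
plt.hist(x, bins=bins, density=True, histtype="step")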
Again, this can look pretty different depending on the number of objects you choose for each bin, and it can look quite different from the plots above.
So, what is the "right" way to set the bin size?
There is no "right" way, but we'll encounter some suggestions in Chapter 4. Let's take a quick look at them now.
"Scott's rule" suggests that the optimal bin width is $$\Delta_b = \frac{3.5\sigma}{N^{1/3}}.$$
That's great, but what if we don't know the standard deviation, $\sigma$ (e.g., if the distribution isn't really Gaussian)? We can then instead use the "Freedman-Diaconis rule":
$$\Delta_b = \frac{2(q_{75}-q_{25})}{N^{1/3}} = \frac{2.7\sigma_G}{N^{1/3}}.$$
Let's try that, where $\sigma_G$ is 0.7413 times the difference between the upper and lower quartiles, which you can determine with np.percentile().
In [ ]:
q25 = # complete
q75 = # complete
sigmaG = # complete
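For reference, one way to do this with np.percentile (and, assuming the sample standard deviation for $\sigma$, Scott's rule bin width for comparison):

q25, q75 = np.percentile(x, [25, 75])
sigmaG = 0.7413 * (q75 - q25)
print(sigmaG)
print(3.5 * np.std(x) / len(x)**(1./3.))  # Scott's rule bin width, for comparison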
Compare this to what you get using stats.sigmaG() from the astroML package. You'll have to import stats from astroML and give it a different name, since stats right now refers to scipy.stats.
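A possible comparison, assuming astroML is installed:

from astroML import stats as astroMLstats
print(astroMLstats.sigmaG(x))  # should agree with the value computed above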
Now set the bin size accordingly, using np.arange(), and plot. Make sure that you don't throw away the last object in the data set! How many bins do you get? How does that compare to what we were using above?
In [ ]:
binsize = # complete
bins = np.append(np.arange( # complete
In [ ]:
plt.hist( # complete
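One way to put this together, using the Freedman-Diaconis width (a sketch; the exact edges depend on where you start the arange):

binsize = 2.7 * sigmaG / len(x)**(1./3.)
bins = np.append(np.arange(x.min(), x.max(), binsize), x.max())  # append x.max() so the last object is kept
plt.hist(x, bins=bins, density=True, histtype="step")
print(len(bins) - 1)  # number of bins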
Did you find that tedious? Me too. Fortunately there is a shortcut! Try it.
In [ ]:
from astroML.plotting import hist as fancyhist
fancyhist(x, bins="scott", histtype="step")
fancyhist(x, bins="freedman", histtype="step")
Note that even those two don't yield quite the same results as each other! But we can do better.
An obvious thing to do is to simply show all of the data.
In [ ]:
# execute this cell
plt.hist(x,histtype="step")
plt.plot(x, 0*x, '|', color='k', markersize=25) #Note markersize is (annoyingly) in *points*
This is called a rug plot, and now we have a better idea of where most of the data are and where the gaps really are (as opposed to where the binning makes them appear to be). However, the markers are all piled up, so we have lost all sense of the relative numbers of objects. Are there ~10 at x=2.5, or could there be 100?
This is where Kernel Density Estimation (KDE) comes in. (As a side note, KDE is the core of the quasar classification work that I do, which is how I got into this machine learning business in the first place. Way before it was popular, I might add!) In short, the idea here is to represent each data point not as a delta function, but rather as a distribution (e.g., a Gaussian). Then those distributions ("kernels") are summed up to produce the PDF. One of the advantages of this is that it combines the best of 1) the histogram and 2) the rug plot: 1) tells us the relative height of the distribution, and 2) centers the data points at the actual location of the data instead of within some arbitrary bin.
Just about any distribution can be used as the kernel, but the most common are a Gaussian kernel and an Epanechnikov kernel. One downside of the Gaussian kernel is that its tails are technically infinite in extent, so each point has some finite probability of being everywhere. The Epanechnikov kernel has truncated wings.
One still has the problem of deciding the width of the kernel (e.g., for the Gaussian the "mean" is fixed at the value of the point, but how wide should you make the Gaussian?). For my work, we do this with a self-test of the data: specifically, what is the optimal width such that objects with a known classification are indeed given that classification by our machine learning algorithm? But that is a topic for another day. For now, we'll just play with the widths by hand to see what might work best. N.B. the width of the kernel distribution is referred to as the "bandwidth".
In [13]:
# execute this cell to load the KDE module
# No need to try to understand what is going on here now, we'll come back to this later.
# But see the note below
from sklearn.neighbors import KernelDensity
xplot = np.linspace(x.min(),x.max(),1000) # Use this instead of 'x' for plotting
def kde_sklearn(data, bandwidth=1.0, kernel="linear"):
    kde_skl = KernelDensity(bandwidth=bandwidth, kernel=kernel)
    kde_skl.fit(data[:, np.newaxis])
    log_pdf = kde_skl.score_samples(xplot[:, np.newaxis])  # sklearn returns log(density)
    return np.exp(log_pdf)
Before we try the Gaussian and Epanechnikov kernels, let's start with a tophat kernel using kernel = "tophat", which will produce a plot much like the rug plot. Start with bandwidth=0.01 and see what happens when you adjust it.
In [ ]:
PDFtophat = kde_sklearn( # complete
plt.plot( # complete
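For example (the bandwidth here is the suggested starting value):

PDFtophat = kde_sklearn(x, bandwidth=0.01, kernel="tophat")
plt.plot(xplot, PDFtophat)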
This gives a result that is essentially what you would get if you made a histogram with a very large number of bins.
Now let's compare what happens when we adjust the bandwidth (which is just the width of the kernel function). Try 0.1 and 0.5.
In [ ]:
PDFtophat1 = kde_sklearn( # complete
plt.plot( # complete
PDFtophat5 = kde_sklearn( # complete
plt.plot( # complete
plt.legend( # complete
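A possible completion (bandwidths of 0.1 and 0.5, as suggested above):

PDFtophat1 = kde_sklearn(x, bandwidth=0.1, kernel="tophat")
plt.plot(xplot, PDFtophat1, label="bandwidth = 0.1")
PDFtophat5 = kde_sklearn(x, bandwidth=0.5, kernel="tophat")
plt.plot(xplot, PDFtophat5, label="bandwidth = 0.5")
plt.legend()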
Now let's see what we get with the Gaussian and Epanechnikov kernels. Play with the bandwidths until you get something that looks reasonable (and roughly matches) for the two kernels. They need not be the same.
In [ ]:
PDFgaussian = kde_sklearn( # complete
PDFepanechnikov = kde_sklearn( # complete
plt.plot( # complete
plt.plot( # complete
plt.legend( # complete
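For example (these bandwidths are just starting points; adjust them until the two curves roughly agree):

PDFgaussian = kde_sklearn(x, bandwidth=0.1, kernel="gaussian")
PDFepanechnikov = kde_sklearn(x, bandwidth=0.2, kernel="epanechnikov")
plt.plot(xplot, PDFgaussian, label="gaussian")
plt.plot(xplot, PDFepanechnikov, label="epanechnikov")
plt.legend()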
This is pretty different from the histogram that we started out with, isn't it?
Lastly, we have used 1000 points, so you aren't seeing the kernel shape for the individual points. Try remaking $x$ with only 15 points and see what this looks like. Adjust the figure size (using figsize), the bandwidth, and the axis limits until you can see the differences between the two kernels. Play with the bandwidths to see what effect they have now that you can see the individual kernels.
In [ ]:
plt # Complete following the above solution
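One possible sketch, which subsamples the existing array rather than regenerating it with N = 15 (the bandwidths, figure size, and axis limits are just starting points):

x15 = np.random.choice(x, 15, replace=False)  # 15 points drawn from x
plt.figure(figsize=(12, 4))
PDFgaussian15 = kde_sklearn(x15, bandwidth=0.5, kernel="gaussian")
PDFepanechnikov15 = kde_sklearn(x15, bandwidth=0.5, kernel="epanechnikov")
plt.plot(xplot, PDFgaussian15, label="gaussian")
plt.plot(xplot, PDFepanechnikov15, label="epanechnikov")
plt.plot(x15, 0*x15, '|', color='k', markersize=25)  # rug marks at the data points
plt.xlim(0, 20)  # adjust until the individual kernels are visible
plt.legend()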