Classification

G. Richards (2016), based particularly on materials from Andy Connolly, also Ivezic.

Density estimation and clustering are unsupervised forms of classification. Let's now move on to supervised classification. That's where we actually know the "truth" for some of our objects and can use that information to help guide the classification of unknown objects.

Generative vs. Discriminative Classification

We will talk about two different types of classification, each with a slightly different approach. As an example, if you are trying to determine whether your neighbor is speaking Spanish or Portuguese, you could 1) learn both Spanish and Portuguese so that you'd know exactly what they are saying, or 2) learn the key rules that distinguish the two languages.

If we find ourselves asking which category is most likely to generate the observed result, then we are using density estimation for classification, and this is referred to as generative classification. Here we have a full model of the density for each class, or a model which describes how data could be generated from each class.

If, on the other hand, we don't care about the full distribution, then we are doing something more like clustering, where we don't need to map the distribution, we just need to define boundaries. Classification that finds the decision boundary that separates classes is called discriminative classification. For high-dimensional data, this may be a better choice.

For example, in the figure below, to classify a new object with $x=1$, it would suffice to know that either 1) model 1 is a better fit than model 2, or 2) that the decision boundary is at $x=1.4$.

In my work, we actually do both: we first do discriminative classification using a decision boundary based on KD-trees, and then we do generative classification using density estimation for the class of interest.

Scoring Your Results

The first question that we need to address is how we score our results (i.e., how we define the success of our classification).

In the simplest case, there are 2 types of errors:

  • a False Positive, where we have assigned a positive class label to an object that is really negative. This is called a "Type I error".
  • a False Negative, where we have assigned a negative class label to an object that is really positive. This is called a "Type II error".

All 4 possibilities are:

  • True Positive = correctly identified
  • False Negative = incorrectly rejected
  • True Negative = correctly rejected
  • False Positive = incorrectly identified

Based on these, we usually work with one of the following pairs of terms. Which pair is used is largely a matter of preference in different fields, but we'll see that there are some key differences.

$ {\rm completeness} = \frac{\rm true\ positives}{\rm true\ positives + false\ negatives}$

$ {\rm contamination} = \frac{\rm false\ positives}{\rm true\ positives + false\ positives} = {\rm false\ discovery\ rate}$

or

$ {\rm true\ positive\ rate} = \frac{\rm true\ positives} {\rm true\ positives + false\ negatives} $

$ {\rm false\ positive\ rate} = \frac{\rm false\ positives} {\rm true\ negatives + false\ positives} = {\rm Type\ I\ error\ rate} $

where completeness = true positive rate and this is also called sensitivity or recall.

Similarly

$ {\rm efficiency} = 1 - {\rm contamination} = {\rm precision}. $

Scikit-Learn also reports the F1 score, which is the harmonic mean of precision and recall (i.e., of efficiency and completeness).

Depending on your goals, you may want to maximize the completeness or the efficiency, or a combination of both.

For example, you might want to minimize voter fraud (contamination), but if doing so reduced voter participation (completeness) by a larger amount, then that wouldn't be such a good thing. So you need to decide what balance you want to strike.

To better understand the differences between these measures, let's take the needle in a haystack problem. You have 100,000 stars and 1000 quasars. If you correctly identify 900 quasars as such and mistake 1000 stars for quasars, then we have:

  • TP = 900
  • FN = 100
  • TN = 99,000
  • FP = 1000

Which gives

$ {\rm true\ positive\ rate} = \frac{900}{900 + 100} = 0.9 = {\rm completeness} $

$ {\rm false\ positive\ rate} = \frac{1000}{99000 + 1000} = 0.01 $

Not bad right? Well, sort of. The FPR isn't bad, but there are lots of stars, so the contamination rate isn't so great:

$ {\rm contamination} = \frac{1000}{900 + 1000} = 0.53 $
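As a quick sanity check, here is a minimal sketch that recomputes these quantities (plus the F1 score) directly from the four counts in the example above.


In [ ]:
# Recompute the metrics from the star/quasar example above
TP, FN, TN, FP = 900., 100., 99000., 1000.

completeness = TP / (TP + FN)        # = true positive rate = recall
false_positive_rate = FP / (TN + FP)
contamination = FP / (TP + FP)       # = false discovery rate
efficiency = 1 - contamination       # = precision
f1 = 2 * efficiency * completeness / (efficiency + completeness)

print("completeness", completeness)
print("false positive rate", false_positive_rate)
print("contamination", contamination)
print("F1", f1)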

Comparing the performance of classifiers

So, "best" performance is a bit of a subjective topic. We trade contamination as a function of completeness and this is science dependent.

Before we start talking about different classification algorithms, let's first talk about how we can quantify which of the methods is "best". (N.B. We have skipped ahead to Ivezic $\S$ 9.8).

The way that we will do this is with a Receiver Operating Characteristic (ROC) curve. (Apparently this is yet another example in statistics/machine learning where the name of something was deliberately chosen to scare people away from the field.) A ROC curve simply plots the true positive rate against the false positive rate.

One concern about ROC curves is that they are sensitive to the relative sample sizes: as we demonstrated above, if there are many more background events than source events, even a small false positive rate can produce contamination that dominates the signal. For these cases we can instead plot efficiency (1 - contamination) vs. completeness.

Indeed, I had never even heard of a ROC curve until I started preparing this class. I have always made "completeness-contamination" plots, which make a lot more sense to me (both in terms of what can be learned and the nomenclature).

Here is a comparison of the two types of plots:

Here we see that to get higher completeness, you could actually suffer significantly in terms of efficiency, but your FPR might not go up that much if there are lots of true negatives. I'll point this out again later when we do a specific example.

Generally speaking, you want to choose a decision boundary (see below) that maximizes the area under the curve.
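If you just want a single number, scikit-learn can compute that area directly with sklearn.metrics.roc_auc_score; here is a toy sketch with made-up labels and probabilities (not the RR Lyrae data used below).


In [ ]:
# Toy sketch: area under the ROC curve from labels and probabilities.
# The labels and "probabilities" below are made up for illustration only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(2, size=1000)                      # fake 0/1 labels
y_score = 0.5 * y_true + 0.7 * rng.random_sample(1000)  # overlapping fake scores

print("AUC =", roc_auc_score(y_true, y_score))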

Below is the code that makes these plots. We'll talk about the data that goes into it in a bit. For now, we'll concentrate on how to generate the ROC and completeness-contamination plots.

We'll be comparing 7 different classifiers (with a generic clf object), making training and test sets with split_samples, then using these tools to generate our plots:

Note that the sklearn.metrics algorithms take y_test, which are classes, and y_prob, which are not class predictions but rather probabilities, whereas the astroML algorithm wants y_pred (which we get by converting y_prob into discrete predictions as a function of a probability threshold).


In [ ]:
%matplotlib inline

# Ivezic, Figure 9.17
# Author: Jake VanderPlas
# License: BSD
import numpy as np
from matplotlib import pyplot as plt

from sklearn.naive_bayes import GaussianNB
#from sklearn.lda import LDA
#from sklearn.qda import QDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from astroML.classification import GMMBayes

from sklearn.metrics import precision_recall_curve, roc_curve

from astroML.utils import split_samples, completeness_contamination
from astroML.datasets import fetch_rrlyrae_combined

#----------------------------------------------------------------------
# get data and split into training & testing sets
X, y = fetch_rrlyrae_combined()
y = y.astype(int)
(X_train, X_test), (y_train, y_test) = split_samples(X, y, [0.75, 0.25], random_state=0)

#------------------------------------------------------------
# Fit all the models to the training data
def compute_models(*args):
    names = []
    probs = []
    for classifier, kwargs in args:
        clf = classifier(**kwargs)
        clf.fit(X_train, y_train)
        y_probs = clf.predict_proba(X_test)[:, 1]

        names.append(classifier.__name__)
        probs.append(y_probs)

    return names, probs

names, probs = compute_models((GaussianNB, {}),
                              (LinearDiscriminantAnalysis, {}),
                              (QuadraticDiscriminantAnalysis, {}),
                              (LogisticRegression,
                               dict(class_weight='balanced')),  # 'auto' in older sklearn
                              (KNeighborsClassifier,
                               dict(n_neighbors=10)),
                              (DecisionTreeClassifier,
                               dict(random_state=0, max_depth=12,
                                    criterion='entropy')),
                              (GMMBayes, dict(n_components=3, min_covar=1E-5,
                                              covariance_type='full')))

#------------------------------------------------------------
# Plot ROC curves and completeness/efficiency
fig = plt.figure(figsize=(10, 4))
fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15, top=0.9, wspace=0.25)

# ax1 will show roc curves
ax1 = plt.subplot(131)

# ax2 will show precision/recall
ax2 = plt.subplot(132)

# ax3 will show completeness/efficiency
ax3 = plt.subplot(133)

labels = dict(GaussianNB='GNB',
              LinearDiscriminantAnalysis='LDA',
              QuadraticDiscriminantAnalysis='QDA',
              KNeighborsClassifier='KNN',
              DecisionTreeClassifier='DT',
              GMMBayes='GMMB',
              LogisticRegression='LR')

thresholds = np.linspace(0, 1, 1001)[:-1]

# iterate through and show results
for name, y_prob in zip(names, probs):
    # Note that these take y_prob and not y_pred
    fpr, tpr, thresh = roc_curve(y_test, y_prob)
    precision, recall, thresh2 = precision_recall_curve(y_test, y_prob)

    # add (0, 0) as first point
    fpr = np.concatenate([[0], fpr])
    tpr = np.concatenate([[0], tpr])
    precision = np.concatenate([[0], precision])
    recall = np.concatenate([[1], recall])

    ax1.plot(fpr, tpr, label=labels[name])
    ax2.plot(precision, recall, label=labels[name])
    
    # Whereas this does take y_pred, which we need to compute
    # by looping through all possible probability thresholds
    comp = np.zeros_like(thresholds)
    cont = np.zeros_like(thresholds)
    for i, t in enumerate(thresholds):
        y_pred = (y_prob >= t)
        comp[i], cont[i] = completeness_contamination(y_pred, y_test)
    ax3.plot(1-cont, comp, label=labels[name])

ax1.set_xlim(0, 0.04)
ax1.set_ylim(0, 1.02)
ax1.xaxis.set_major_locator(plt.MaxNLocator(5))
ax1.set_xlabel('false positive rate')
ax1.set_ylabel('true positive rate')
ax1.legend(loc=4)

ax2.set_xlabel('precision')
ax2.set_ylabel('recall')
ax2.set_xlim(0, 1.0)
ax2.set_ylim(0.2, 1.02)

ax3.set_xlabel('efficiency')
ax3.set_ylabel('completeness')
ax3.set_xlim(0, 1.0)
ax3.set_ylim(0.2, 1.02)

plt.show()

Note that I've plotted both recall-precision and completeness-efficiency just to demonstrate that they are the same thing.

Generative Classification

We can use Bayes' theorem to relate the labels to the features in an $N\times D$ data set $X$. The $j$th feature of the $i$th point is $x_{ij}$ and there are $k$ classes giving discrete labels $y_k$. Then we have

$$p(y_k|x_i) = \frac{p(x_i|y_k)\,p(y_k)}{\sum_l p(x_i|y_l)\,p(y_l)},$$

where $x_i$ is assumed to be a vector with $j$ components.

$p(y=y_k)$ is the probability of any point having class $k$ (equivalent to the prior probability of the class $k$).

In generative classifiers we model class-conditional densities $p_k(x) = p(x|y=y_k)$ and our goal is to estimate the $p_k$'s.

Before we get into the generative classification algorithms, we'll first discuss 3 general concepts:

  • Discriminant Functions
  • Bayes Classifiers
  • Decision Boundaries

The Discriminant Function

We can relate classification to density estimation and regression.

$\hat{y} = f(y|x)$ represents the best guess of $y$ given $x$. So classification can be thought of as the analog of regression where $y$ is a discrete category rather than a continuous variable, for example $y=\{0,1\}$.

In classification we refer to $f(y|x)$ as the discriminant function.

For a simple 2-class example:

$$\begin{eqnarray} g(x) = f(y|x) & = & \int y \, p(y|x) \, dy \\ & = & 1 \cdot p(y=1 | x) + 0 \cdot p(y=0 | x) = p(y=1 | x). \end{eqnarray}$$

From Bayes rule:

$$g(x) = \frac{p(x|y=1) \, p(y=1)}{p(x|y=1) \, p(y=1) + p(x|y=0) \, p(y=0)}$$

Bayes Classifier

If the discriminant function gives a binary prediction, we call it a Bayes classifier, formulated as

$$\begin{eqnarray} \widehat{y} & = & \left\{ \begin{array}{cl} 1 & \mbox{if $g(x) > 1/2$}, \\ 0 & \mbox{otherwise,} \end{array} \right. \\ & = & \left\{ \begin{array}{cl} 1 & \mbox{if $p(y=1|x) > p(y=0|x)$}, \\ 0 & \mbox{otherwise.} \end{array} \right.\end{eqnarray}$$

This can be generalized to any number of classes $k$, not just two.

Decision Boundary

A decision boundary is just the set of $x$ values at which the classes are equally likely:

$$ p(x|y=1)\,p(y=1) = p(x|y=0)\,p(y=0), $$

i.e., where

$$ g_1(x) = g_2(x) \quad {\rm or} \quad g(x) = 1/2. $$

Below is an example of a decision boundary in 1-D. In short, we assign classifications according to which pdf is higher at every given $x$.
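To make this concrete, here is a minimal toy sketch (my own example, with made-up Gaussians and priors, not the book's figure code) that locates the 1-D decision boundary numerically and classifies a new point.


In [ ]:
# Toy sketch: 1-D decision boundary for two Gaussian classes.
# We assign class 1 wherever p(x|y=1) p(y=1) > p(x|y=0) p(y=0).
import numpy as np
from scipy.stats import norm

prior0, prior1 = 0.7, 0.3      # made-up class priors p(y=0), p(y=1)
pdf0 = norm(0.0, 1.0)          # made-up p(x|y=0)
pdf1 = norm(2.5, 0.8)          # made-up p(x|y=1)

xgrid = np.linspace(-4, 6, 10001)
g0 = pdf0.pdf(xgrid) * prior0
g1 = pdf1.pdf(xgrid) * prior1

# the decision boundary is (approximately) where the weighted pdfs cross
crossings = xgrid[:-1][np.diff(np.sign(g1 - g0)) != 0]
print("decision boundary near x =", crossings)

# classifying a new point uses the same comparison
xnew = 1.0
print("class =", int(pdf1.pdf(xnew) * prior1 > pdf0.pdf(xnew) * prior0))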

Simplest Classifier: Naive Bayes

In practice classification can be very complicated because the data are generally multi-dimensional (that is, we don't just have $x$, we have $x_0,x_1,x_2,\ldots,x_N$), so we want $p(x_0,x_1,x_2,\ldots,x_N|y)$.

However, if we assume that all attributes are conditionally independent (which is not always true, but is often close enough), then this simplifies to

$$ p(x_1,x_2|y_k) = p(x_1|y_k)p(x_2|y_k)$$

which can be written as

$$ p(x_0,x_1,x_2,\ldots,x_N|y_k) = \prod_j p(x_j|y_k).$$

From Bayes' rule and conditional independence we get

$$ p(y_k | {x_0,x_1,\ldots,x_N}) = \frac{\prod_j p(x_j|y_k) p(y_k)} {\sum_l \prod_j p(x_j|y_l) p(y_l)}. $$

We calculate the most likely value of $y$ by maximizing over $y_k$:

$$ \hat{y} = \arg \max_{y_k} \frac{\prod_j p(x_j|y_k) p(y_k)} {\sum_l \prod_j p(x_j|y_l) p(y_l)}. $$

From there the process is just estimating densities: $p(x|y=y_k)$ and $p(y=y_k)$ are learned from a set of training data, where

  • $p(y=y_k)$ is just the frequency of the class $k$ in the training set
  • $p(x|y=y_k)$ is just the density (probability) of an object with class $k$ having the attributes $x$

A catch is that if the training set does not cover the full parameter space, then $p(x_i|y=y_k)$ can be $0$ for some value of $y_k$ and $x_i$. The posterior probability is then $p(y_k|\{x_i\}) = 0/0$ which is a problem! A trick called Laplace smoothing can be implemented to fix it.
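Here is a minimal toy sketch of Laplace (additive) smoothing for a single discrete attribute; the data, the number of attribute values, and the smoothing constant $\alpha$ are all made up for illustration.


In [ ]:
# Toy sketch of Laplace (additive) smoothing for one discrete attribute.
import numpy as np

# made-up training data: attribute values (0, 1, or 2) and class labels (0 or 1)
x_train = np.array([0, 0, 1, 1, 1, 2, 0, 1])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

n_values, n_classes, alpha = 3, 2, 1.0

# p(y=k): just the class frequencies in the training set
prior = np.array([np.mean(y_train == k) for k in range(n_classes)])

# p(x=v|y=k) with smoothing: (count + alpha) / (N_k + alpha * n_values)
likelihood = np.zeros((n_classes, n_values))
for k in range(n_classes):
    for v in range(n_values):
        count = np.sum((y_train == k) & (x_train == v))
        likelihood[k, v] = (count + alpha) / (np.sum(y_train == k) + alpha * n_values)

# x=2 never appears in class 0, but smoothing keeps p(x=2|y=0) > 0,
# so the posterior remains well defined
post = likelihood[:, 2] * prior
print("p(y|x=2) =", post / post.sum())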

Gaussian Naive Bayes

It is not at all clear from the discussion in the book that the $x_i$ above are meant to be discrete measurements. However, one way to handle continuous values of $X$ is to model $p(x_i|y=y_k)$ as one-dimensional normal distributions, with means $\mu_{ik}$ and widths $\sigma_{ik}$. The naive Bayes estimator is then

$$\hat{y} = \arg\max_{y_k}\left[\ln p(y=y_k) - \frac{1}{2}\sum_{i=1}^N\left(\ln\left(2\pi(\sigma_{ik})^2\right) + \frac{(x_i - \mu_{ik})^2}{(\sigma_{ik})^2} \right) \right]$$

In Scikit-Learn Gaussian Naive Bayes classification is implemented as follows, with a simple example given in the next cell.


In [2]:
import numpy as np
from sklearn.naive_bayes import GaussianNB
X = np.random.random((100,2))              # 100 random points in the unit square
y = (X[:,0] + X[:,1] > 1).astype(int)      # label is 1 above the line x0 + x1 = 1
gnb = GaussianNB()
gnb.fit(X,y)                               # fit per-class 1-D Gaussians to each feature
y_pred = gnb.predict(X)                    # predicted class labels

In [ ]:
%matplotlib inline
# Ivezic, Figure 9.2
# Author: Jake VanderPlas
# License: BSD
#   The figure produced by this code is published in the textbook
#   "Statistics, Data Mining, and Machine Learning in Astronomy" (2013)
#   For more information, see http://astroML.github.com
#   To report a bug or issue, use the following forum:
#    https://groups.google.com/forum/#!forum/astroml-general
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors

from sklearn.naive_bayes import GaussianNB

#------------------------------------------------------------
# Simulate some data
np.random.seed(0)
mu1 = [1, 1]
cov1 = 0.3 * np.eye(2)

mu2 = [5, 3]
cov2 = np.eye(2) * np.array([0.4, 0.1])

X = np.concatenate([np.random.multivariate_normal(mu1, cov1, 100),
                    np.random.multivariate_normal(mu2, cov2, 100)])
y = np.zeros(200)
y[100:] = 1

#------------------------------------------------------------
# Fit the Naive Bayes classifier
clf = GaussianNB()
clf.fit(X, y)

# predict the classification probabilities on a grid
xlim = (-1, 8)
ylim = (-1, 5)
xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 71), np.linspace(ylim[0], ylim[1], 81))
xystack = np.vstack([xx.ravel(),yy.ravel()])
Xgrid = xystack.T

Z = clf.predict_proba(Xgrid)
# Gives probability for both class 1 and class 2 for each grid point
# As these are degenerate, take just one and then
# re-shape it to the grid pattern needed for contour plotting
Z = Z[:, 1].reshape(xx.shape)

#------------------------------------------------------------
# Plot the results
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)

# Plot the points
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.binary)

# Add the decision boundary, which is just the contour where
# the probability exceeds some threshold, here 0.5.
ax.contour(xx, yy, Z, [0.5], colors='k')

ax.set_xlim(xlim)
ax.set_ylim(ylim)

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')

plt.show()

And now an example using real data. Here we have a 4-D $X$ and we are going to add the dimensions (colors) one at a time to see how much improvement comes from each new dimension of $X$.


In [ ]:
# Ivezic, Figure 9.3
# Author: Jake VanderPlas
# License: BSD
#   The figure produced by this code is published in the textbook
#   "Statistics, Data Mining, and Machine Learning in Astronomy" (2013)
#   For more information, see http://astroML.github.com
#   To report a bug or issue, use the following forum:
#    https://groups.google.com/forum/#!forum/astroml-general
import numpy as np
from matplotlib import pyplot as plt

from sklearn.naive_bayes import GaussianNB
from astroML.datasets import fetch_rrlyrae_combined
from astroML.utils import split_samples
from astroML.utils import completeness_contamination

#----------------------------------------------------------------------
# get data and split into training & testing sets
X, y = fetch_rrlyrae_combined() # X is a 4-D color-color-color-color space
X = X[:, [1, 0, 2, 3]]  # rearrange columns for better 1-color results

# Split the data into training and test sets
(X_train, X_test), (y_train, y_test) = split_samples(X, y, [0.75, 0.25], random_state=0)

N_tot = len(y)
N_stars = np.sum(y == 0)
N_rrlyrae = N_tot - N_stars
N_train = len(y_train)
N_test = len(y_test)
N_plot = 5000 + N_rrlyrae

#----------------------------------------------------------------------
# perform Naive Bayes

# Create blank arrays to hold the output
y_class = []
y_pred = []
y_probs = []

Ncolors = np.arange(1, X.shape[1] + 1)

order = np.array([1, 0, 2, 3])

for nc in Ncolors:
    clf = GaussianNB()
    clf.fit(X_train[:, :nc], y_train)
    
    y_pred.append(clf.predict(X_test[:, :nc]))
    y_class.append(clf)    

# Use astroML utils code to compute completeness and contamination
completeness, contamination = completeness_contamination(y_pred, y_test)

print "completeness", completeness
print "contamination", contamination


#------------------------------------------------------------
# Compute the decision boundary (for 2 colors)
clf = y_class[1]
xlim = (0.7, 1.35)
ylim = (-0.15, 0.4)

xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 81),
                     np.linspace(ylim[0], ylim[1], 71))

Z = clf.predict_proba(np.c_[yy.ravel(), xx.ravel()])
Z = Z[:, 1].reshape(xx.shape)

#----------------------------------------------------------------------
# plot the results
fig = plt.figure(figsize=(10, 5))
fig.subplots_adjust(bottom=0.15, top=0.95, hspace=0.0,
                    left=0.1, right=0.95, wspace=0.2)

# left plot: data and decision boundary
ax = fig.add_subplot(121)
im = ax.scatter(X[-N_plot:, 1], X[-N_plot:, 0], c=y[-N_plot:],
                s=4, lw=0, cmap=plt.cm.Oranges, zorder=2)
im.set_clim(-0.5, 1)

im = ax.imshow(Z, origin='lower', aspect='auto',
               cmap=plt.cm.binary, zorder=1,
               extent=xlim + ylim)
im.set_clim(0, 1.5)
ax.contour(xx, yy, Z, [0.5], colors='k')

ax.set_xlim(xlim)
ax.set_ylim(ylim)

ax.set_xlabel('$u-g$')
ax.set_ylabel('$g-r$')

# Plot completeness vs Ncolors
ax = plt.subplot(222)
ax.plot(Ncolors, completeness, 'o-k', ms=6)

ax.xaxis.set_major_locator(plt.MultipleLocator(1))
ax.yaxis.set_major_locator(plt.MultipleLocator(0.2))
ax.xaxis.set_major_formatter(plt.NullFormatter())

ax.set_ylabel('completeness')
ax.set_xlim(0.5, 4.5)
ax.set_ylim(-0.1, 1.1)
ax.grid(True)

# Plot contamination vs Ncolors
ax = plt.subplot(224)
ax.plot(Ncolors, contamination, 'o-k', ms=6)

ax.xaxis.set_major_locator(plt.MultipleLocator(1))
ax.yaxis.set_major_locator(plt.MultipleLocator(0.2))
ax.xaxis.set_major_formatter(plt.FormatStrFormatter('%i'))

ax.set_xlabel('N colors')
ax.set_ylabel('contamination')
ax.set_xlim(0.5, 4.5)
ax.set_ylim(-0.1, 1.1)
ax.grid(True)

plt.show()

If you shifted the decision boundary "up" by hand, what would happen to the completeness, contamination, and false positive rate?

What happens if you change the fraction of objects in the training set?

Extra Credit: Graph the false positive rate.

The "naive" refers to the fact that we are assuming that all of the variable are independent. If we relax that assumption and allow for covariances, then we have a Gaussian Bayes classifier. But note that this comes with a large jump in computational cost!

Linear Discriminant Analysis

In Linear Discriminant Analysis (LDA) we assume that the class distributions have identical covariances for all $k$ classes (all classes are a set of shifted Gaussians).

The class-dependent covariances that would normally give rise to a quadratic dependence on $X$ cancel out if they are assumed to be constant. The Bayes classifier is therefore linear with respect to $X$, and the discriminant boundary between classes is the line (a hyperplane in higher dimensions) that minimizes the overlap between the Gaussians.

If we relax the requirement that the covariances of the Gaussians are constant, the discriminant function becomes quadratic in $X$.

This is sometimes known as Quadratic Discriminant Analysis (QDA).

LDA and QDA are implemented in Scikit-Learn as follows and an example using the same data as above is given below.


In [5]:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA     # sklearn.lda in older versions
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA  # sklearn.qda in older versions

X = np.random.random((100,2))
y = (X[:,0] + X[:,1] > 1).astype(int)
lda = LDA()
lda.fit(X,y)
y_pred = lda.predict(X)

qda = QDA()
qda.fit(X,y)
y_pred = qda.predict(X)

In [ ]:
# Ivezic, Figures 9.4 and 9.5
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA     # sklearn.lda in older versions
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA  # sklearn.qda in older versions

from astroML.datasets import fetch_rrlyrae_combined
from astroML.utils import split_samples
from astroML.utils import completeness_contamination

#----------------------------------------------------------------------
# get data and split into training & testing sets
X, y = fetch_rrlyrae_combined()
X = X[:, [1, 0, 2, 3]]  # rearrange columns for better 1-color results
(X_train, X_test), (y_train, y_test) = split_samples(X, y, [0.75, 0.25], random_state=0)

N_tot = len(y)
N_stars = np.sum(y == 0)
N_rrlyrae = N_tot - N_stars
N_train = len(y_train)
N_test = len(y_test)
N_plot = 5000 + N_rrlyrae

#----------------------------------------------------------------------
# perform LDA
lda = LDA()
lda.fit(X_train[:, :2], y_train)
y_predLDA = lda.predict(X_test[:, :2])

# perform QDA
qda = QDA()
qda.fit(X_train[:, :2], y_train)
y_predQDA = qda.predict(X_test[:, :2])
    
completenessLDA, contaminationLDA = completeness_contamination(y_predLDA, y_test)
completenessQDA, contaminationQDA = completeness_contamination(y_predQDA, y_test)

print "completeness", completenessLDA, completenessQDA
print "contamination", contaminationLDA, contaminationQDA

#------------------------------------------------------------
# Compute the decision boundary
xlim = (0.7, 1.35)
ylim = (-0.15, 0.4)

xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 71),
                     np.linspace(ylim[0], ylim[1], 81))

Z_LDA = lda.predict_proba(np.c_[yy.ravel(), xx.ravel()])
Z_LDA = Z_LDA[:, 1].reshape(xx.shape)
Z_QDA = qda.predict_proba(np.c_[yy.ravel(), xx.ravel()])
Z_QDA = Z_QDA[:, 1].reshape(xx.shape)

#----------------------------------------------------------------------
# plot the results
fig = plt.figure(figsize=(10, 5))
fig.subplots_adjust(bottom=0.15, top=0.95, hspace=0.0,
                    left=0.1, right=0.95, wspace=0.2)

# left plot: data and decision boundary
ax = fig.add_subplot(121)
im = ax.scatter(X[-N_plot:, 1], X[-N_plot:, 0], c=y[-N_plot:],
                s=4, lw=0, cmap=plt.cm.Oranges, zorder=2)
im.set_clim(-0.5, 1)

im = ax.imshow(Z_LDA, origin='lower', aspect='auto',
               cmap=plt.cm.binary, zorder=1,
               extent=xlim + ylim)
im.set_clim(0, 1.5)

ax.contour(xx, yy, Z_LDA, [0.5], linewidths=2., colors='k')

ax.set_xlim(xlim)
ax.set_ylim(ylim)

ax.set_xlabel('$u-g$')
ax.set_ylabel('$g-r$')

# right plot: qda
ax = fig.add_subplot(122)
im = ax.scatter(X[-N_plot:, 1], X[-N_plot:, 0], c=y[-N_plot:],
                s=4, lw=0, cmap=plt.cm.Oranges, zorder=2)
im.set_clim(-0.5, 1)

im = ax.imshow(Z_QDA, origin='lower', aspect='auto',
               cmap=plt.cm.binary, zorder=1,
               extent=xlim + ylim)
im.set_clim(0, 1.5)

ax.contour(xx, yy, Z_QDA, [0.5], linewidths=2., colors='k')

ax.set_xlim(xlim)
ax.set_ylim(ylim)

ax.set_xlabel('$u-g$')
ax.set_ylabel('$g-r$')

plt.show()

If it is obvious from looking at your data that a linear or quadratic boundary will work well, then great. But what if that is not the case?

GMM and Bayes Classification

Our classifications so far have made some restrictive assumptions: either that of conditional independence or that the distributions are Gaussian. However, a more flexible model might improve the completeness and efficiency of the classification. For that we can look to the techniques from Chapter 6.

The natural extension of the Gaussian assumption is to use GMMs to model the density distribution of each class, i.e., a GMM Bayes classifier.

Note that the number of Gaussian components, $K$, must be chosen for each class, $k$, independently.

astroML implements GMM Bayes classification as:


In [7]:
import numpy as np
from astroML.classification import GMMBayes
X = np.random.random((100,2))
y = (X[:,0] + X[:,1] > 1).astype(int)
gmmb = GMMBayes(3) # 3 clusters per class
gmmb.fit(X,y)
y_pred = gmmb.predict(X)
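One reasonable way to choose the number of components for each class separately is to scan the BIC of a plain Gaussian mixture fit to that class's training points. The sketch below uses made-up data and sklearn.mixture.GaussianMixture rather than GMMBayes itself; treat it as one possible recipe, not the only one.


In [ ]:
# Sketch: choose the number of GMM components per class by minimizing the BIC.
# Uses sklearn's GaussianMixture on made-up data; GMMBayes itself is not involved.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(3, 0.5, (100, 2))])
y = np.concatenate([np.zeros(200), np.ones(100)])

for k in (0, 1):
    Xk = X[y == k]
    bics = [GaussianMixture(n_components=n, random_state=0).fit(Xk).bic(Xk)
            for n in range(1, 6)]
    print("class", k, "-> best n_components =", 1 + int(np.argmin(bics)))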

We now apply the GMM Bayes classifier to the real data from above. With just one component, we get results that are similar to those from Naive Bayes. But with 5 components (and all 4 attributes), we do pretty well.


In [ ]:
# Ivezic, Figure 9.6
import numpy as np
from matplotlib import pyplot as plt

from astroML.classification import GMMBayes
#from astroML.decorators import pickle_results
from astroML.datasets import fetch_rrlyrae_combined
from astroML.utils import split_samples
from astroML.utils import completeness_contamination

#----------------------------------------------------------------------
# get data and split into training & testing sets
X, y = fetch_rrlyrae_combined()
X = X[:, [1, 0, 2, 3]]  # rearrange columns for better 1-color results

# GMM-bayes takes several minutes to run, and is order[N^2]
#  truncating the dataset can be useful for experimentation.
#X = X[::10]
#y = y[::10]

(X_train, X_test), (y_train, y_test) = split_samples(X, y, [0.75, 0.25], random_state=0)
N_tot = len(y)
N_stars = np.sum(y == 0)
N_rrlyrae = N_tot - N_stars
N_train = len(y_train)
N_test = len(y_test)
N_plot = 5000 + N_rrlyrae

#----------------------------------------------------------------------
# perform GMM Bayes
Ncolors = np.arange(1, X.shape[1] + 1)
Ncomp = [1, 5]

def compute_GMMbayes(Ncolors, Ncomp):
    classifiers = []
    predictions = []

    for ncm in Ncomp:
        classifiers.append([])
        predictions.append([])
        for nc in Ncolors:
            clf = GMMBayes(ncm, min_covar=1E-5, covariance_type='full')
            clf.fit(X_train[:, :nc], y_train)
            y_pred = clf.predict(X_test[:, :nc])

            classifiers[-1].append(clf)
            predictions[-1].append(y_pred)

    return classifiers, predictions

classifiers, predictions = compute_GMMbayes(Ncolors, Ncomp)
completeness, contamination = completeness_contamination(predictions, y_test)

print "completeness", completeness[0]
print "contamination", contamination[0]

#------------------------------------------------------------
# Compute the decision boundary
clf = classifiers[1][1]
xlim = (0.7, 1.35)
ylim = (-0.15, 0.4)

xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 71),
                     np.linspace(ylim[0], ylim[1], 81))

Z = clf.predict_proba(np.c_[yy.ravel(), xx.ravel()])
Z = Z[:, 1].reshape(xx.shape)

#----------------------------------------------------------------------
# plot the results
fig = plt.figure(figsize=(8, 4))
fig.subplots_adjust(bottom=0.15, top=0.95, hspace=0.0,
                    left=0.1, right=0.95, wspace=0.2)

# left plot: data and decision boundary
ax = fig.add_subplot(121)
im = ax.scatter(X[-N_plot:, 1], X[-N_plot:, 0], c=y[-N_plot:],
                s=4, lw=0, cmap=plt.cm.binary, zorder=2)
im.set_clim(-0.5, 1)

im = ax.imshow(Z, origin='lower', aspect='auto',
               cmap=plt.cm.Oranges, zorder=1,
               extent=xlim + ylim)
im.set_clim(0, 1.5)

ax.contour(xx, yy, Z, [0.5], linewidths=2., colors='k')

ax.set_xlim(xlim)
ax.set_ylim(ylim)

ax.set_xlabel('$u-g$')
ax.set_ylabel('$g-r$')

# plot completeness vs Ncolors
ax = fig.add_subplot(222)
ax.plot(Ncolors, completeness[0], '^--k', c='k', label='N=%i' % Ncomp[0])
ax.plot(Ncolors, completeness[1], 'o-k', c='k', label='N=%i' % Ncomp[1])

ax.xaxis.set_major_locator(plt.MultipleLocator(1))
ax.yaxis.set_major_locator(plt.MultipleLocator(0.2))
ax.xaxis.set_major_formatter(plt.NullFormatter())

ax.set_ylabel('completeness')
ax.set_xlim(0.5, 4.5)
ax.set_ylim(-0.1, 1.1)
ax.grid(True)

# plot contamination vs Ncolors
ax = fig.add_subplot(224)
ax.plot(Ncolors, contamination[0], '^--k', c='k', label='N=%i' % Ncomp[0])
ax.plot(Ncolors, contamination[1], 'o-k', c='k', label='N=%i' % Ncomp[1])
ax.legend(prop=dict(size=12),
          loc='lower right',
          bbox_to_anchor=(1.0, 0.78))

ax.xaxis.set_major_locator(plt.MultipleLocator(1))
ax.yaxis.set_major_locator(plt.MultipleLocator(0.2))
ax.xaxis.set_major_formatter(plt.FormatStrFormatter('%i'))

ax.set_xlabel('N colors')
ax.set_ylabel('contamination')
ax.set_xlim(0.5, 4.5)
ax.set_ylim(-0.1, 1.1)
ax.grid(True)

plt.show()

We can take this to the extreme by having one mixture component at each training point. We also don't have to restrict ourselves to a Gaussian kernel; we can use any kernel that we like. The resulting non-parametric Bayes classifier is referred to as Kernel Discriminant Analysis (KDA). It seems like this would be a lot more computationally intensive, but at least we no longer have to optimize the locations of the components; we just need to determine the bandwidth of the kernel. In the end, it can result in better classification.
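Neither the book's code nor scikit-learn provides a ready-made KDA classifier, but the idea is easy to sketch with one kernel density estimate per class; the toy example below uses sklearn.neighbors.KernelDensity with a hand-picked bandwidth and made-up data.


In [ ]:
# Sketch of kernel discriminant analysis: one KDE per class, then classify by
# the larger of p(x|y=k) p(y=k). Bandwidth is hand-picked here, not optimized.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(1)
X0 = rng.normal(0, 1, (300, 2))
X1 = rng.normal(2, 1, (100, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(300), np.ones(100)])

kdes, log_priors = [], []
for k in (0, 1):
    Xk = X[y == k]
    kdes.append(KernelDensity(kernel='gaussian', bandwidth=0.5).fit(Xk))
    log_priors.append(np.log(1.0 * len(Xk) / len(X)))

def predict(Xnew):
    # score_samples returns log p(x|y=k); add the log prior and take the max
    log_post = np.array([kde.score_samples(Xnew) + lp
                         for kde, lp in zip(kdes, log_priors)])
    return np.argmax(log_post, axis=0)

print("training accuracy:", np.mean(predict(X) == y))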

One of the tricks to speed things up is that we don't need to know the actual class probabilities, we just need to know which one is larger. This is explained in more detail in Riegel, Gray, & Richards 2008, and it is implemented in a series of papers starting with Richards et al. 2004.

If you follow those and you are so inclined, you could apply for jobs at Skytree or Wise.io which were started by my collaborator, Alex Gray, and astronomer, Josh Bloom, respectively.

It is worth noting that this illustrates one of the downsides of the book, astroML, and Scikit-learn: they teach you the basics of the algorithms, but if you want to work with a truly big data set you really need the next level, such as KDA, which is merely described here, not implemented.

K-Nearest Neighbor Classifier

If we did KDA with a variable bandwidth that depended only on the distance of the nearest neighbor, then we'd have what we call a Nearest-Neighbor classifier. Here if $x$ is close to $x'$, then $p(y|x) \approx p(y|x')$. Note that we have not assumed anything about the conditional density distribution, so it is completely non-parametric.

The number of neighbors, $K$, regulates the complexity of the classification, where a larger $K$ decreases the variance in the classification but leads to an increase in the bias. (N.B., the 3rd different use of $K$ or $k$ in this notebook!)

The distance measure is usually N-D Euclidean. However, if the attributes have very different properties, then normalization, weighting, etc. may be needed.
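For example, if one attribute spans a much larger range than the others, it will dominate the Euclidean distance. A common fix (sketched below with made-up data) is to standardize each feature, e.g. with scikit-learn's StandardScaler, before fitting the KNN classifier.


In [ ]:
# Sketch: standardize features (zero mean, unit variance) before a KNN fit,
# so that no single attribute dominates the Euclidean distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.random_sample((200, 2))
X[:, 1] *= 1000.0                         # give one feature a huge dynamic range
y = (X[:, 0] + X[:, 1] / 1000.0 > 1).astype(int)

scaler = StandardScaler().fit(X)          # learn per-feature mean and scale
knc = KNeighborsClassifier(n_neighbors=5)
knc.fit(scaler.transform(X), y)           # fit in the scaled space

X_new = rng.random_sample((5, 2))
X_new[:, 1] *= 1000.0
print(knc.predict(scaler.transform(X_new)))   # same scaling applied to new data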

Scikit-learn implements K-Nearest Neighbors classification as


In [ ]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
X = np.random.random((100,2))
y = (X[:,0] + X[:,1] > 1).astype(int)
knc = KNeighborsClassifier(5) # use 5 nearest neighbors
knc.fit(X,y)
y_pred = knc.predict(X)

Implementing it for the same example as above shows that it isn't all that great for this particular case. See below. We probably need more training data to reduce the variance for it to work better.


In [ ]:
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors

from sklearn.neighbors import KNeighborsClassifier
from astroML.datasets import fetch_rrlyrae_combined
from astroML.utils import split_samples
from astroML.utils import completeness_contamination

#----------------------------------------------------------------------
# get data and split into training & testing sets
X, y = fetch_rrlyrae_combined()
X = X[:, [1, 0, 2, 3]]  # rearrange columns for better 1-color results
(X_train, X_test), (y_train, y_test) = split_samples(X, y, [0.75, 0.25], random_state=0)

N_tot = len(y)
N_st = np.sum(y == 0)
N_rr = N_tot - N_st
N_train = len(y_train)
N_test = len(y_test)
N_plot = 5000 + N_rr

#----------------------------------------------------------------------
# perform Classification
classifiers = []
predictions = []
Ncolors = np.arange(1, X.shape[1] + 1)
kvals = [1, 5]

from sklearn import metrics
    
for k in kvals:
    classifiers.append([])
    predictions.append([])
    for nc in Ncolors:
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_train[:, :nc], y_train)
        y_pred = clf.predict(X_test[:, :nc])
        
        classifiers[-1].append(clf)
        predictions[-1].append(y_pred)

completeness, contamination = completeness_contamination(predictions, y_test)

print "completeness (as a fn of neighbors and colors)", completeness
print "contamination", contamination

#------------------------------------------------------------
# Compute the decision boundary
clf = classifiers[1][1]
xlim = (0.7, 1.35)
ylim = (-0.15, 0.4)

xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 71),
                     np.linspace(ylim[0], ylim[1], 81))

Z = clf.predict(np.c_[yy.ravel(), xx.ravel()])
Z = Z.reshape(xx.shape)

#----------------------------------------------------------------------
# plot the results
fig = plt.figure(figsize=(10, 5))
fig.subplots_adjust(bottom=0.15, top=0.95, hspace=0.0,
                    left=0.1, right=0.95, wspace=0.2)

# left plot: data and decision boundary
ax = fig.add_subplot(121)
im = ax.scatter(X[-N_plot:, 1], X[-N_plot:, 0], c=y[-N_plot:],
                s=4, lw=0, cmap=plt.cm.Oranges, zorder=2)
im.set_clim(-0.5, 1)

im = ax.imshow(Z, origin='lower', aspect='auto',
               cmap=plt.cm.binary, zorder=1,
               extent=xlim + ylim)
im.set_clim(0, 2)

ax.contour(xx, yy, Z, [0.5], linewidths=2., colors='k')

ax.set_xlim(xlim)
ax.set_ylim(ylim)

ax.set_xlabel('$u-g$')
ax.set_ylabel('$g-r$')

ax.text(0.02, 0.02, "k = %i" % kvals[1],
        transform=ax.transAxes)

# plot completeness vs Ncolors
ax = fig.add_subplot(222)

ax.plot(Ncolors, completeness[0], 'o-k', label='k=%i' % kvals[0])
ax.plot(Ncolors, completeness[1], '^--k', label='k=%i' % kvals[1])

ax.xaxis.set_major_locator(plt.MultipleLocator(1))
ax.yaxis.set_major_locator(plt.MultipleLocator(0.2))
ax.xaxis.set_major_formatter(plt.NullFormatter())

ax.set_ylabel('completeness')
ax.set_xlim(0.5, 4.5)
ax.set_ylim(-0.1, 1.1)
ax.grid(True)

# plot contamination vs Ncolors
ax = fig.add_subplot(224)
ax.plot(Ncolors, contamination[0], 'o-k', label='k=%i' % kvals[0])
ax.plot(Ncolors, contamination[1], '^--k', label='k=%i' % kvals[1])
ax.legend(prop=dict(size=12),
          loc='lower right',
          bbox_to_anchor=(1.0, 0.79))

ax.xaxis.set_major_locator(plt.MultipleLocator(1))
ax.yaxis.set_major_locator(plt.MultipleLocator(0.2))
ax.xaxis.set_major_formatter(plt.FormatStrFormatter('%i'))
ax.set_xlabel('N colors')
ax.set_ylabel('contamination')
ax.set_xlim(0.5, 4.5)
ax.set_ylim(-0.1, 1.1)
ax.grid(True)

plt.show()

Regardless of whether this is the best algorithm or not, we can choose $K$ to minimize the classification error rate by using cross-validation. Give it a try:


In [26]:
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors

from sklearn.neighbors import KNeighborsClassifier
from astroML.datasets import fetch_rrlyrae_combined
from astroML.utils import completeness_contamination

# New
from sklearn.model_selection import cross_val_predict  # sklearn.cross_validation in older versions
from sklearn.metrics import accuracy_score

#----------------------------------------------------------------------
# get data and split into training & testing sets
X, y = fetch_rrlyrae_combined()
X = X[:, [1, 0, 2, 3]]  # rearrange columns for better 1-color results

#----------------------------------------------------------------------
# perform Classification
scores = []
kvals = np.arange(1,20)
for k in kvals:
    clf = # Complete
    CVpredk = cross_val_predict( # Complete
    scores.append(accuracy_score( # Complete

In [ ]:
print("max score is for k={:d}".format(# Complete
# Plot number of neighbors vs score
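If you get stuck, one possible completion is sketched below; it reuses X, y, kvals, and KNeighborsClassifier from the cell above, and the choice of 5-fold cross-validation is mine, not necessarily the intended solution.


In [ ]:
# One possible completion of the exercise above (5-fold CV is my choice)
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

scores = []
for k in kvals:
    clf = KNeighborsClassifier(n_neighbors=k)
    CVpredk = cross_val_predict(clf, X, y, cv=5)   # cross-validated predictions
    scores.append(accuracy_score(y, CVpredk))      # fraction classified correctly

print("max score is for k={:d}".format(kvals[np.argmax(scores)]))

plt.plot(kvals, scores, 'o-k')
plt.xlabel('number of neighbors, k')
plt.ylabel('cross-validated accuracy')
plt.show()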

Below is an example of another way to implement cross-validation.


In [ ]:
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

scores2 = []
for k in kvals:
    # Let's do a 2-fold cross-validation of the KNN estimator
    scores2.append(cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=2, scoring='precision'))

# Plot these too

We can also use the metrics module in sklearn to compute some statistics for us. Try inserting the code below into the appropriate place in our nearest neighbors classifier above.


In [ ]:
from sklearn import metrics 
    
        print(k, nc)
        print("accuracy:", metrics.accuracy_score(y_test, y_pred))
        print("precision:", metrics.precision_score(y_test, y_pred))
        print("recall:", metrics.recall_score(y_test, y_pred))