Classification Example

This notebook assumes that the current working directory is in the scikit-learn tutorial directory where the notebook is stored. In the folder

 ../data/sdss_colors

there is a script fetch_data.py which will download the colors of over 700,000 stars and quasars from the Sloan Digital Sky Survey. 500,000 of them are training data, spectroscopically identified as stars or quasars. The remaining 200,000 have been classified based on their photometric colors.

If you're using a different directory structure, then the DATA_HOME variable in the following script should be set accordingly.


In [ ]:
import os
DATA_HOME = os.path.abspath('../data/sdss_colors/')

Here we will use a Naive Bayes estimator to classify the objects. First, we will construct our training data and test data arrays:


In [ ]:
import numpy as np

train_data = np.load(os.path.join(DATA_HOME, 'sdssdr6_colors_class_train.npy'))
test_data = np.load(os.path.join(DATA_HOME, 'sdssdr6_colors_class.200000.npy'))

The data is stored as a record array, which is a convenient format for collections of labeled data:


In [ ]:
print train_data.dtype.names

In [ ]:
print train_data['u-g'].shape

Now we must put these into arrays of shape (n_samples, n_features) in order to pass them to routines in scikit-learn. Training samples with zero-redshift are stars, while samples with positive redshift are quasars:


In [ ]:
X_train = np.vstack([train_data['u-g'],
                     train_data['g-r'],
                     train_data['r-i'],
                     train_data['i-z']]).T
y_train = (train_data['redshift'] > 0).astype(int)

X_test = np.vstack([test_data['u-g'],
                    test_data['g-r'],
                    test_data['r-i'],
                    test_data['i-z']]).T
y_test = (test_data['label'] == 0).astype(int)

print "training data:", X_train.shape
print "test data:    ", X_test.shape

Notice that we’ve set this up so that quasars have y = 1, and stars have y = 0. Now we’ll set up a Naive Bayes classifier. This will fit a four-dimensional uncorrelated gaussian to each distribution, and from these gaussians quickly predict the label for a test point:


In [ ]:
from sklearn import naive_bayes
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

Let’s check our accuracy. This is the fraction of labels that are correct:


In [ ]:
accuracy = float(np.sum(y_test == y_pred)) / len(y_test)
print accuracy

We have 61% accuracy. Not very good. But we must be careful here: the accuracy does not always tell the whole story. In our data, there are many more stars than quasars


In [ ]:
print np.sum(y_test == 0)

In [ ]:
print np.sum(y_test == 1)

Stars outnumber Quasars by a factor of 14 to 1. In cases like this, it is much more useful to evaluate the fit based on precision and recall. Because there are many fewer quasars than stars, we’ll call a quasar a positive label and a star a negative label. The precision asks what fraction of positively labeled points are correctly labeled:

$\mathrm{precision = \frac{True\ Positives}{True\ Positives + False\ Positives}}$

The recall asks what fraction of positive samples are correctly identified:

$\mathrm{recall = \frac{True\ Positives}{True\ Positives + False\ Negatives}}$

We can calculate this for our results as follows:


In [ ]:
TP = np.sum((y_pred == 1) & (y_test == 1))  # true positives
FP = np.sum((y_pred == 1) & (y_test == 0))  # false positives
FN = np.sum((y_pred == 0) & (y_test == 1))  # false negatives
print "precision:", TP / float(TP + FP)
print "recall:   ", TP / float(TP + FN)

For convenience, these can be computed using the tools in the metrics sub-package of scikit-learn:


In [ ]:
from sklearn import metrics
print "precision:", metrics.precision_score(y_test, y_pred)
print "recall:   ", metrics.recall_score(y_test, y_pred)

Another useful metric is the F1 score, which gives a single score based on the precision and recall for the class:

$\mathrm{F1 = 2\frac{precision * recall}{precision + recall}}$

In a perfect classification, the precision, recall, and F1 score are all equal to 1.


In [ ]:
print "F1 score:", metrics.f1_score(y_test, y_pred)

For convenience, sklearn.metrics provides a function that computes all of these scores, and returns a nicely formatted string. For example:


In [ ]:
print metrics.classification_report(y_test, y_pred, target_names=['Stars', 'QSOs'])

We see that for Gaussian Naive Bayes, our QSO recall is fairly good: we are correctly identifying 95% of all quasars. The precision, on the other hand, is much worse. Of the points we label quasars, only 14% of them are correctly labeled. This low precision leads to an F1-score of only 0.25. This is not an optimal classification of our data. Apparently Naive Bayes is a bit too naive for this problem.

Later, in Exercise #1, we will apply a more sophisticated learning method to this task, which will potentially improve on these results.