Classification Example

You'll need to modify the DATA_HOME variable to the location of the datasets.

In this tutorial we'll use the colors of 700,000 stars and quasars from the Sloan Digital Sky Survey: 500,000 of them form the training set and have been spectroscopically identified as stars or quasars, while the remaining 200,000 form the test set and have been classified based on their photometric colors.


In [ ]:
import os
DATA_HOME = os.path.abspath('C:/temp/AstroML/data/sdss_colors/')

Here we will use a Naive Bayes estimator to classify the objects. First, we will construct our training data and test data arrays:


In [ ]:
import numpy as np

train_data = np.load(os.path.join(DATA_HOME, 'sdssdr6_colors_class_train.npy'))
test_data = np.load(os.path.join(DATA_HOME, 'sdssdr6_colors_class.200000.npy'))

The data is stored as a record array, which is a convenient format for collections of labeled data:


In [ ]:
print(train_data.dtype.names)

In [ ]:
print(train_data['u-g'].shape)

Now we must put these into arrays of shape (n_samples, n_features) in order to pass them to routines in scikit-learn. Training samples with zero redshift are stars, while samples with positive redshift are quasars:


In [ ]:
# stack the four colors into an (n_samples, n_features) array
X_train = np.vstack([train_data['u-g'],
                     train_data['g-r'],
                     train_data['r-i'],
                     train_data['i-z']]).T
# stars have zero redshift; quasars have positive redshift
y_train = (train_data['redshift'] > 0).astype(int)

X_test = np.vstack([test_data['u-g'],
                    test_data['g-r'],
                    test_data['r-i'],
                    test_data['i-z']]).T
# in the test set, a label of 0 marks a quasar
y_test = (test_data['label'] == 0).astype(int)

print("training data:", X_train.shape)
print("test data:", X_test.shape)

Notice that we've set this up so that quasars have y = 1, and stars have y = 0. Now we'll set up a Naive Bayes classifier. This will fit a four-dimensional uncorrelated Gaussian to each class distribution, and use these Gaussians to quickly predict the label for a test point:


In [ ]:
from sklearn import naive_bayes
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

Let’s check our accuracy. This is the fraction of labels that are correct:


In [ ]:
accuracy = float(np.sum(y_test == y_pred)) / len(y_test)
print(accuracy)
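
For reference, scikit-learn's metrics sub-package provides accuracy_score, which computes the same fraction:


In [ ]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))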

We have 61% accuracy. Not very good. But we must be careful here: the accuracy does not always tell the whole story. In our data, there are many more stars than quasars:


In [ ]:
print(np.sum(y_test == 0))

In [ ]:
print(np.sum(y_test == 1))

Stars outnumber quasars by a factor of 14 to 1. In cases like this, it is much more useful to evaluate the fit based on precision and recall. Because there are many fewer quasars than stars, we'll call a quasar a positive label and a star a negative label. The precision asks what fraction of positively labeled points are correctly labeled:

$\mathrm{precision = \frac{True\ Positives}{True\ Positives + False\ Positives}}$

The recall asks what fraction of positive samples are correctly identified:

$\mathrm{recall = \frac{True\ Positives}{True\ Positives + False\ Negatives}}$

We can calculate this for our results as follows:


In [ ]:
TP = np.sum((y_pred == 1) & (y_test == 1))  # true positives
FP = np.sum((y_pred == 1) & (y_test == 0))  # false positives
FN = np.sum((y_pred == 0) & (y_test == 1))  # false negatives
print("precision:") 
print(TP / float(TP + FP))
print("recall:   ") 
print(TP / float(TP + FN))

For convenience, these can be computed using the tools in the metrics sub-package of scikit-learn:


In [ ]:
from sklearn import metrics
print("precision:") 
print(metrics.precision_score(y_test, y_pred))
print("recall:   ") 
print(metrics.recall_score(y_test, y_pred))

Precision and recall tell different stories about the performance of the classifier. Ideally one would try to create a classifier with both high precision and high recall, but this is not always possible, and sometimes raising the precision will decrease the recall or vice versa (why?).

Think about situations in which you'd want a high-precision classifier even if the recall is poor, and vice versa.
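
To see this trade-off concretely, we can vary the probability threshold used to decide that an object is a quasar. Below is a minimal sketch using GaussianNB's predict_proba; the default predict corresponds to a threshold of 0.5, and the exact numbers depend on the fit above.


In [ ]:
# probability that each test point is a quasar (class 1)
proba = gnb.predict_proba(X_test)[:, 1]

# raising the threshold typically trades recall for precision
for threshold in [0.1, 0.5, 0.9]:
    y_thresh = (proba > threshold).astype(int)
    print("threshold:", threshold)
    print(metrics.precision_score(y_test, y_thresh))
    print(metrics.recall_score(y_test, y_thresh))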

Another useful metric is the F1 score, which combines the precision and recall for a class into a single score:

$\mathrm{F1 = 2\,\frac{precision \cdot recall}{precision + recall}}$

In a perfect classification, the precision, recall, and F1 score are all equal to 1.


In [ ]:
print("F1 score:") 
print(metrics.f1_score(y_test, y_pred))

For convenience, sklearn.metrics provides a function that computes all of these scores, and returns a nicely formatted string. For example:


In [ ]:
print(metrics.classification_report(y_test, y_pred, target_names=['Stars', 'QSOs']))

We see that for Gaussian Naive Bayes, our QSO recall is fairly good: we are correctly identifying 95% of all quasars. The precision, on the other hand, is much worse: of the points we label quasars, only 14% are correctly labeled. This low precision leads to an F1 score of only 0.25. This is not an optimal classification of our data. Apparently Naive Bayes is a bit too naive for this problem, so let's check some other classifiers. Remember that the documentation is at http://scikit-learn.org/stable/supervised_learning.html

Replace the classifier with a DecisionTreeClassifier and compare the results with those of the Naive Bayes. Which metric would you choose for the comparison?


In [ ]:
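
One possible starting point, as a minimal sketch using default parameters and reusing the report from above:


In [ ]:
from sklearn.tree import DecisionTreeClassifier

# fit a decision tree with default parameters as a baseline
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred_tree = dtc.predict(X_test)
print(metrics.classification_report(y_test, y_pred_tree,
                                    target_names=['Stars', 'QSOs']))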

Now let's try a Random Forest, which is an ensemble of Decision Trees. Replace the classifier with a RandomForestClassifier and compare the results. Which parameters can be adjusted in this classifier? Experiment with the most important parameters and try to create a better classifier.


In [ ]:
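
A minimal sketch of this variant follows; n_estimators (the number of trees) and max_depth are among the most commonly tuned parameters, and the values used here are only illustrative starting points, not tuned choices.


In [ ]:
from sklearn.ensemble import RandomForestClassifier

# n_estimators and max_depth are illustrative starting values
rfc = RandomForestClassifier(n_estimators=100, max_depth=None)
rfc.fit(X_train, y_train)
y_pred_forest = rfc.predict(X_test)
print(metrics.classification_report(y_test, y_pred_forest,
                                    target_names=['Stars', 'QSOs']))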