Reminder!

After pulling down the tutorial notebook, immediately make a copy. Then do not modify the original. Do your work in the copy. This will prevent the possibility of git conflicts should the version-controlled file change at any point in the future. (The same exhortation applies to homeworks.)

Week 9 Tutorial

Introduction to Machine Learning

In this 3-part notebook you will start to explore the scikit-learn machine learning Python package, and see how it supports a range of machine learning models with a uniform terminology and API, with an emphasis on model evaluation by cross-validation.

Credit: some of the material in this tutorial is based on Andy Mueller's scikit-learn tutorial from the 2015 edition of "Astro Hack Week". The SDSS examples are based on Josh Bloom's SDSS example notebook from the 2014 edition of "Astro Hack Week".

Further reading: Ivezic Chapter 8 (regression) and Chapter 9 (classification).

Requirements

You will need to pip install scikit-learn, and check that the installed version is v0.18 or higher.
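
One quick way to check the version from within the notebook:

In [ ]:
import sklearn
print(sklearn.__version__)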

1. Simple Example: The Digits Dataset

  • Let's take a look at one of the SciKit-Learn example datasets, digits

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()

In [ ]:
digits.images.shape

In [ ]:
print(digits.images[0])

In [ ]:
plt.matshow(digits.images[23], cmap=plt.cm.Greys)

In [ ]:
digits.data.shape

In [ ]:
digits.target.shape

In [ ]:
digits.target[23]
  • In SciKit-Learn, data contains the design matrix $X$, and is a numpy array of shape $(N, P)$
  • target contains the response variables $y$, and is a numpy array of shape $(N,)$

In [ ]:
print(digits.DESCR)

Splitting the data


In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(digits.data, digits.target, test_size=0.25)

In [ ]:
X_train.shape, y_train.shape

In [ ]:
X_test.shape, y_test.shape

Other Example Datasets

SciKit-Learn provides 5 "toy" datasets for tutorial purposes, all load-able in the same way:

Name       Description
boston     Boston house prices, with 13 associated measurements (R)
iris       Fisher's iris classifications, based on 4 characteristics (C)
diabetes   Diabetes disease progression, with 10 baseline measurements (R)
digits     Hand-written digits: 8x8 images with classifications (C)
linnerud   Linnerud: 3 exercise and 3 physiological variables (R)
  • "R" and "C" indicate that the problem to be solved is either a regression or a classification, respectively.

Looking for Structure

  • A model's ability to make predictions depends on there being structure in the data

  • If structure is present, the data are informative, and vice versa.

  • Feature design takes thought; thinking is aided by data visualization


In [ ]:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)

In [ ]:
# Visualizing the Boston house price data:

import corner

X = boston.data
y = boston.target

data = np.concatenate((X, np.atleast_2d(y).T), axis=1)
labels = np.append(boston.feature_names, 'MEDV')

corner.corner(data, labels=labels);

2. Fitting a Straight Line with SciKit-Learn

For a first application, let's see the straight line fitting problem solved via the machine learning approach. In the process, we'll look at how model (prediction) accuracy is quantified, and how a generalized accuracy is estimated via cross-validation.

Further Reading

Ivezic Sections 8.1 and 8.2 (linear regression), and Section 8.11 for cross-validation

Linear Regression

Straight line fitting is a linear regression problem - and an example of predictive learning.

A predictive model can be said to have been "fitted" to the data when an assumed loss function has been minimized. A popular choice of minimized loss function is the following, corresponding to the method of "ordinary least squares":

$$ \min_{w, b} \sum_i || w^\mathsf{T}x_i + b - y_i||^2 $$

While this loss function is derivable from statistical principles, the machine only needs it to be encoded.
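
To make this concrete, here is a minimal NumPy sketch (on simulated data; this is not scikit-learn's internal implementation) showing the loss above being minimized by linear least squares:

In [ ]:
import numpy as np

# Simulate noisy straight-line data:
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=50)
y_sim = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=50)

# Minimize sum_i || w x_i + b - y_i ||^2, absorbing the intercept b
# into the design matrix via a column of ones:
A = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(A, y_sim, rcond=None)[0]
print("slope, intercept:", w, b)   # should be close to (2, 1)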

If we fit a straight line to a subset of the data (the training set), the accuracy of the linear model's predictions in the remainder of the data (the test set) can be checked.

Let's fit some test data with a straight line using the SciKit-Learn library, and see how accurately we can make our predictions.


In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn import datasets, linear_model

In [ ]:
# Code source: Jaques Grobler
# License: BSD 3 clause

# Load the boston dataset, and focus on just one attribute: 
# LSTAT (attribute 12)
boston = datasets.load_boston()

# Package into design matrix X and target vector y:
X = np.atleast_2d(boston.data[:,12]).T
y = np.atleast_2d(boston.target).T

# Make a training/test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.25)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

In [ ]:
plt.scatter(X_test, y_test, color='black')
plt.xlabel('LSTAT')
plt.ylabel('MEDV');

The Linear Model

scikit-learn machine learning models have a common API:

  • a fit method that optimizes the model's internal parameters given training data features and target values

  • a predict method that returns model-predicted target values given test data features

  • various scores for quantifying performance


In [ ]:
# Create linear regression model object:
model = linear_model.LinearRegression()

# Train the model using the training set:
model.fit(X_train, y_train)

# The coefficients:
print("Coefficients:", model.coef_)

Scoring

  • The "mean squared error" between the model predictions and the truth is a useful metric: minimizing MSE corresponds to minimizing the "empirical risk," defined as the mean value loss function averaged over the available data samples, where the loss function is quadratic:

$\;\;\;\;\;{\rm MSE} = \mathcal{E} \left[ (\hat{y} - y^{\rm true})^2 \right] = \mathcal{E} \left[ (\hat{y} - \bar{y} + \bar{y} - y^{\rm true})^2 \right] = \mathcal{E} \left[ (\hat{y} - \bar{y})^2 \right] + (\bar{y} - y^{\rm true})^2$

$\;\;\;\;\;\;\;\;\;\;\;\;\; = {\rm var}(\hat{y}) + {\rm bias}^2(\hat{y})$

  • In general, different models reach different balances between the variance and bias of their predictions

  • A particular choice of loss function leads to a corresponding minimized risk
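
As a quick numerical check of the decomposition above (a toy sketch with made-up numbers): for a set of noisy, biased predictions of a single true value, the MSE equals the prediction variance plus the squared bias.

In [ ]:
import numpy as np

rng = np.random.RandomState(1)
y_true_value = 3.0
y_hat = y_true_value + 0.5 + rng.normal(0, 1.0, size=100000)  # predictions with bias 0.5 and unit scatter

MSE = np.mean((y_hat - y_true_value)**2)
variance = np.var(y_hat)
bias_squared = (np.mean(y_hat) - y_true_value)**2

print(MSE, variance + bias_squared)  # the two numbers agree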


In [ ]:
# The mean square prediction error:
print("Training data: MSE = %.2f"
      % np.mean((model.predict(X_train) - y_train) ** 2))
print("Test data: MSE = %.2f"
      % np.mean((model.predict(X_test) - y_test) ** 2))

# The "explained variance" R2 score: 1 is perfect prediction:
print('Training data: R^2 score = %.2f' % model.score(X_train, y_train))
print('Test data: R^2 score = %.2f' % model.score(X_test, y_test))

R2 scores

  • The "explained variance" $R^2$ "score" is defined as $(1 - u/v)$, where $u$ is the regression sum of squares $\sum (y_{\rm true} - y_{\rm pred})^2$ and $v$ is the residual sum of squares $\sum (y_{\rm true} - \overline{y_{\rm true}})^2)$.
  • The best possible $R^2$ score is 1.0. A model that has mean squared error equal to the variance in the data gets a score of zero. Models that do systematically worse than this have negative $R^2$ scores.
  • In general we expect the training score to be higher than the test score.
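
Here is a quick check of that definition against scikit-learn's r2_score, using made-up numbers:

In [ ]:
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])

u = np.sum((y_true - y_pred)**2)           # residual sum of squares
v = np.sum((y_true - y_true.mean())**2)    # total sum of squares

print(1 - u/v, r2_score(y_true, y_pred))   # the two values match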

In [ ]:
# Plot outputs:
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, model.predict(X_test), color='blue', linewidth=3)
plt.xlabel('LSTAT')
plt.ylabel('MEDV');

Questions:

  • How is this procedure different from previous occasions we have fitted a straight line?

  • Is it "Frequentist" or "Bayesian"? Why?

  • Does it follow the likelihood principle?

Optimizing Model Prediction Accuracy

  • In supervised machine learning the usual goal is to make the most accurate predictions we can - which means neither over-fitting nor under-fitting the data

  • Above, we made one training/test split, and computed the (mean squared) prediction error. The model that minimizes the generalized prediction error can be found (approximately) with cross validation.

  • In cross validation we consider multiple training/test splits, and look at the mean score across all of these "folds."


In [ ]:
from sklearn.model_selection import cross_val_score

model = linear_model.LinearRegression()

cross_val_score(model, X, y, cv=5, scoring='r2')

Cross Validation Fold Design

  • How we design the folds matters: we want each subset of the data to be a fair sample of the whole.
  • In this problem, we want to select the LSTAT values randomly (rather than sequentially), and so we make a ShuffleSplit

In [ ]:
from sklearn.model_selection import ShuffleSplit

shuffle_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
cross_val_score(model, X, y, cv=shuffle_split)

Generalized Scoring

With our 10-fold shuffle splits, we can calculate generalized accuracy scores that could be used in a cross-validation model comparison.


In [ ]:
MSE = cross_val_score(model, X, y, cv=shuffle_split, scoring='neg_mean_squared_error')
GE, errGE = -np.mean(MSE), np.std(MSE)/np.sqrt(len(MSE))
print("Generalization error:", GE, "+/-", errGE)

In [ ]:
R2 = cross_val_score(model, X, y, cv=shuffle_split, scoring='r2')
meanR2, errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Generalized R2 score:", meanR2, "+/-", errR2)

Model Expansion

  • Let's expand our linear model to include some higher order terms (quadratic, cubic etc). This can be done by adding additional feature columns to the design matrix $X$. We are still just predicting $y$, but now we'll be asking for more coefficients.

In [ ]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=4)
XX = poly.fit_transform(X)
XX

In [ ]:
polymodel = linear_model.LinearRegression()

poly_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

R2 = cross_val_score(polymodel, XX, y, cv=poly_split, scoring='r2')

poly_meanR2, poly_errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Polynomial: generalized R^2 score:", np.round(poly_meanR2, 2), "+/-", np.round(poly_errR2, 2))
print("Straight line: generalized R^2 score:", np.round(meanR2, 2), "+/-", np.round(errR2, 2))

Model Checking

As usual, it's a good idea to check the model's performance by visualizing its predictions in data space.

We can make one cross-validated prediction for each data point, using a model trained on the folds that do not contain that point, and plot all of these predictions together.


In [ ]:
from sklearn.model_selection import cross_val_predict

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:

y_straightline = cross_val_predict(model, X, y, cv=10)
y_polynomial = cross_val_predict(polymodel, XX, y, cv=10)

In [ ]:
# Plot outputs:
plt.scatter(X, y,  color='black', alpha=0.1)
plt.plot(X, y_straightline, color='blue', linewidth=2, alpha=0.4)
plt.plot(X, y_polynomial, color='green', linewidth=2, alpha=0.4)
plt.xlabel('LSTAT')
plt.ylabel('MEDV');
  • In this example, the polynomial degree is a control parameter that needs to be set: we can search this parameter space for the value that gives the highest average cross validation score (or lowest generalization error).
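
For example, here is one way to carry out that search (a sketch re-using X, y and the imports from the cells above; the range of degrees tried is arbitrary):

In [ ]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score, ShuffleSplit

degree_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

for degree in range(1, 7):
    XXd = PolynomialFeatures(degree=degree).fit_transform(X)
    scores = cross_val_score(linear_model.LinearRegression(), XXd, y,
                             cv=degree_split, scoring='r2')
    print("degree", degree, ": mean R2 score =", np.round(np.mean(scores), 3))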

Multiple Linear Regression

  • The Boston dataset has 13 attributes, more than one of which might contain information about house prices in the city. Let's train a linear model on all these attributes, and see if we can improve our score.

In [ ]:
# Define a linear model:
supermodel = linear_model.LinearRegression()

# Use all the data, and set up a 10-fold cross validation run:
super_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# Carry out the cross-validation of the model, training, testing and reporting:
R2 = cross_val_score(supermodel, boston.data, boston.target, cv=super_split, scoring='r2')
                           
# Compute our model prediction accuracy score, for comparison with other models:
hpmeanR2, hperrR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Hyperplane: generalized R^2 score:", np.round(hpmeanR2, 2), "+/-", np.round(hperrR2, 2))
print("Straight line: generalized R^2 score:", np.round(meanR2, 2), "+/-", np.round(errR2, 2))

What just happened?

  • We just went from a simple hypothesis (median house price MEDV depends on LSTAT) to a much more complex one (house price could depend on all 13 of our measured attributes) in one step. The data analysis is automated, in the sense that we simply fed our machine new inputs and it processed them.
  • Using all our data, we are now better at predicting house prices - but we have gained no new understanding of how the Boston housing market works.

3. Machine Learning with the SDSS

Now let's look at a couple of astronomical applications of machine learning, an example regression and classification analysis of the SDSS catalog dataset.

In the process, we'll understand some more "score" metrics and diagnostic visualizations, and carry out model selection by cross validation.

Photometric Redshift Estimation and Quasar Classification

Let's investigate the SDSS photometric object catalog, and look for machine learning solutions to the following two problems:

  1. Estimating the redshifts of quasars from their photometry (regression)

  2. Selecting quasars from a background of stars and galaxies (classification)

Data Acquisition

From the SDSS Sky Server we've downloaded two types of photometry (dereddened model magnitudes and extinction-corrected PSF magnitudes) for a number of sources with spectroscopic redshifts. Here's the SQL for an example query that gets us 10000 example quasars:

SELECT *,
       dered_u - mag_u AS diff_u, dered_g - mag_g AS diff_g,
       dered_r - mag_r AS diff_g, dered_i - mag_i AS diff_i,
       dered_z - mag_z AS diff_z
FROM (
    SELECT TOP 10000
           objid, ra, dec,
           dered_u, dered_g, dered_r, dered_i, dered_z,
           psfmag_u - extinction_u AS mag_u, psfmag_g - extinction_g AS mag_g,
           psfmag_r - extinction_r AS mag_r, psfmag_i - extinction_i AS mag_i,
           psfmag_z - extinction_z AS mag_z,
           z AS spec_z,
           dered_u - dered_g AS u_g_color, dered_g - dered_r AS g_r_color,
           dered_r - dered_i AS r_i_color, dered_i - dered_z AS i_z_color,
           class
    FROM SpecPhoto
    WHERE (class = 'QSO')
) AS sp
 

We can do the same for 'STAR's and 'GALAXY's as well. These data are all provided in the data folder.


In [ ]:
from __future__ import print_function
import pandas as pd
pd.set_option('display.max_columns', None)
%pylab inline
import seaborn as sns
sns.set()
import copy

3.1 Photometric Redshift Estimation

  • This is a regression problem, to be able to predict the redshift response variable given a number of photometric measurement "features"
  • The SDSS spectroscopic quasar sample provides a training set for the prediction of photometric redshifts
  • Let's read in our SDSS quasars, and munge them into machine learning inputs

In [ ]:
qsos = pd.read_csv("../../data/qso10000.csv",index_col=0,usecols=["objid","dered_r","spec_z","u_g_color",\
                                                "g_r_color","r_i_color","i_z_color","diff_u",\
                                                "diff_g1","diff_i","diff_z"])

# Clean out extreme colors and bad magnitudes:
qsos = qsos[(qsos["dered_r"] > -9999) & (qsos["g_r_color"] > -10) & (qsos["g_r_color"] < 10)]


# Response variables: redshift
qso_redshifts = qsos["spec_z"]

# Features or attributes: photometric measurements
qso_features = copy.copy(qsos)
del qso_features["spec_z"]
qso_features.head()

Visual Exploration

  • Over what redshift range do we have quasar spectra?

  • What structure is there in the photometric feature space that might help us predict redshift?

Let's plot all the features, colored by the target redshift, to look for structure.


In [ ]:
bins = hist(qso_redshifts.values, bins=100); xlabel("redshift"); ylabel("N")

In [ ]:
import matplotlib as mpl
import matplotlib.cm as cm

# Truncate the color at z=2.5 just to keep some contrast.
norm = mpl.colors.Normalize(vmin=min(qso_redshifts.values), vmax=2.5)
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)

# Plot everything against everything else:
rez = pd.plotting.scatter_matrix(qso_features[0:1000], alpha=0.2, figsize=[15,15], c=m.to_rgba(qso_redshifts.values))

Feature Data and Target Values

Now we have our machine learning inputs (photometric features: colors and magnitudes) and outputs (target redshift values):


In [ ]:
X = qso_features.values  # Data: 9-d feature space
y = qso_redshifts.values # Target: redshifts

In [ ]:
print("Design matrix shape =", X.shape)
print("Response variable vector shape =", y.shape)

Linear Regression

Let's follow the same procedure as in the previous chunk (from the SciKit-Learn Linear Regression tutorial):


In [ ]:
# Split the data into a training set and a test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Instantiate a "linear model"
from sklearn import linear_model
linear = linear_model.LinearRegression()

# Fit the model, using all the attributes:
linear.fit(X_train, y_train)

# Do the prediction on the test data:
y_lr_pred = linear.predict(X_test)

# How well did we do? Compute MSE
from sklearn.metrics import mean_squared_error
mse_linear = mean_squared_error(y_test, y_lr_pred)
r2_linear = linear.score(X_test, y_test)
print("Linear regression: MSE = ",mse_linear)
print("R2 score =",r2_linear)

In [ ]:
plot(y_test, y_lr_pred, 'o', alpha=0.2)
title("Linear Regression - MSE = %.2f" % mse_linear, fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric Redshift", fontsize=18)
plot([0,7],[0,7], color='red')
ylim(0,5)

Just how bad is this? Here's the MSE from guessing the average redshift of the training set for all new objects:


In [ ]:
print("Naive MSE", ((1./len(y_train))*(y_train - y_train.mean())**2).sum())
print("Linear regression: MSE = ",mse_linear)

k-Nearest Neighbor (KNN) Regression

Now let's try a non-parametric model:

"Regression based on k-nearest neighbors. The target is predicted by local interpolation of the targets associated of the nearest neighbors in the training set."

Question:

What underlying model is implied by the KNN algorithm? How many hidden parameters does it (effectively) have? What choices do we get to make?
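
Here is a minimal from-scratch sketch (on simulated 1-d data, independent of scikit-learn) of what k-nearest neighbor regression with uniform weights does: find the k training points closest to the query point, and average their targets.

In [ ]:
import numpy as np

rng = np.random.RandomState(42)
x_demo = rng.uniform(0, 10, size=50)
y_demo = np.sin(x_demo) + rng.normal(0, 0.1, size=50)

def knn_predict(x0, k=5):
    nearest = np.argsort(np.abs(x_demo - x0))[:k]  # indices of the k nearest training points
    return y_demo[nearest].mean()                  # average their targets

print(knn_predict(3.0), np.sin(3.0))  # prediction vs. the noiseless truth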


In [ ]:
from sklearn import neighbors
from sklearn import preprocessing

X_scaled = preprocessing.scale(X) # Many methods work better on standardized (re-scaled) X.

KNN = neighbors.KNeighborsRegressor(n_neighbors=5)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

KNN.fit(X_train, y_train)

In [ ]:
y_knn_pred = KNN.predict(X_test)
mse_knn = mean_squared_error(y_test,y_knn_pred)
r2_knn = KNN.score(X_test, y_test)
print("MSE (KNN) =", mse_knn)
print("R2 score (KNN) =",r2_knn)
print("cf.")
print("MSE (linear regression) = ",mse_linear)
print("R2 score (linear regression) =",r2_linear)

In [ ]:
plot(y_test, y_knn_pred,'o',alpha=0.2)
title("k-NN Residuals - MSE = %.2f" % mse_knn, fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric Redshift", fontsize=18)
plot([0,7], [0,7], color='red')
ylim(0,5);

Tuning the KNN Model

  • Let's vary the control parameters of the KNN model, to see how good we can make our predictions.

  • We can see our options in the model repr:

KNeighborsRegressor(algorithm='auto',
    leaf_size=30, metric='minkowski',      
    metric_params=None, n_neighbors=5, 
    p=2, weights='uniform')
  • Let's first make a "validation curve" to investigate one parameter: the number of nearest neighbors averaged over.

In [ ]:
# We'll vary the number of neighbors used:
param_name = "n_neighbors"
param_range = np.array([1,2,4,8,16,32,64])

# Compute our cv scores for the above range of the no. of neighbors:
from sklearn.model_selection import validation_curve
training_scores, validation_scores = validation_curve(KNN, X_scaled, y,
                                                      param_name=param_name,
                                                      param_range=param_range, 
                                                      cv=3, scoring='r2')

In [ ]:
def plot_validation_curve(param_name,parameter_values, training_scores, validation_scores):
    training_scores_mean = np.mean(training_scores, axis=1)
    training_scores_std = np.std(training_scores, axis=1)
    validation_scores_mean = np.mean(validation_scores, axis=1)
    validation_scores_std = np.std(validation_scores, axis=1)

    plt.fill_between(parameter_values, training_scores_mean - training_scores_std,
                     training_scores_mean + training_scores_std, alpha=0.1, color="r")
    plt.fill_between(parameter_values, validation_scores_mean - validation_scores_std,
                     validation_scores_mean + validation_scores_std, alpha=0.1, color="g")
    plt.plot(parameter_values, training_scores_mean, 'o-', color="r",
             label="Training R2 score")
    plt.plot(parameter_values, validation_scores_mean, 'o-', color="g",
             label="Cross-validation R2 score")
    plt.ylim(validation_scores_mean.min() - .1, training_scores_mean.max() + .1)
    plt.xlabel(param_name, fontsize=18)
    plt.legend(loc="best", fontsize=18)

In [ ]:
plot_validation_curve(param_name, param_range, training_scores, validation_scores)

Question:

Can you explain the shapes of these two curves? Talk to your neighbor for a few minutes, and be prepared to suggest reasons for a) the rise and fall of the cross validation score and b) the monotonic decrease in training score.

Model tuning with GridSearchCV

  • Now, let's see if we can do better by varying some other KNN options as well - in a grid search.

In [ ]:
param_grid = {'n_neighbors': np.array([1,2,4,8,16,32,64]),
                  'weights': ['uniform','distance'],
                       'p' : np.array([1,2])}

np.set_printoptions(suppress=True)
print(param_grid)

In [ ]:
from sklearn.model_selection import GridSearchCV
KNN_tuned = GridSearchCV(KNN, param_grid, verbose=3)

A GridSearchCV object behaves just like a model, except it carries out a cross-validation while fitting:

This fit will take a minute or so, as it loops over all the search positions in the hyper-parameter grid.


In [ ]:
KNN_tuned.fit(X_train, y_train)

In [ ]:
y_knn_tuned_pred = KNN_tuned.predict(X_test)

mse_knn_tuned = mean_squared_error(y_test,y_knn_tuned_pred)
r2_knn_tuned = KNN_tuned.score(X_test, y_test)

print("MSE (tuned KNN) =", mse_knn_tuned)
print("R2 score (tuned KNN) =", r2_knn_tuned)
print("cf.")
print("MSE (KNN) = ", mse_knn)
print("R2 score (KNN) =", r2_knn)

Which are the best KNN control parameters we found?


In [ ]:
KNN_tuned.best_params_

This value of n_neighbors is consistent with the peak in cross-validation score in the validation curve plot.

Generalization Error

Notice that all the above tuning happened while training on a single split (X_train and y_train).

It's possible that one particular fold prefers a slightly different set of model parameters than another, so to assess our generalization error we need a further level of cross-validation.

We can do this by passing a GridSearchCV model to the cross validation score calculator. This will take a few minutes, as the grid search is carried out for each CV fold...


In [ ]:
from sklearn.model_selection import cross_val_score

R2 = cross_val_score(KNN_tuned, X_scaled, y, cv=3, scoring='r2')

In [ ]:
meanR2, errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print('Mean score:', meanR2, '+/-', errR2)

Notes

  • Optimizing over control parameters (or hyper parameters) with grid search cross validation is a form of model selection.
  • When presented with new data samples (photometry), and asked to predict the target response variables (photometric redshifts), we'll need a trained machine that has not been over-fitted to the training data.
  • Minimizing and estimating the generalization error is a way to reduce the risk of getting this prediction wrong.

Let's finish off our photo-z machine learning tool by:

  • Choosing the best combination of hyper-parameters (from our cross-validation analysis)

  • Training it on the whole of the training dataset

  • Using it to predict the redshifts of the objects in the test set


In [ ]:
KNNz = KNN_tuned.best_estimator_

KNNz.fit(X_train, y_train)

In [ ]:
j = 571
one_new_quasar = X_test[j,:].reshape(1, -1)
zphoto = KNNz.predict(one_new_quasar)
zspec = y_test[j]
print("True redshift cf. KNN photo-z:", zspec, ' cf.', zphoto[0])

In [ ]:
zspec = y_test
zphoto = KNNz.predict(X_test)

plot(zspec, zphoto,'o',alpha=0.1)
title("KNNz performance", fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric redshift", fontsize=18)
plot([0,7], [0,7], color='red')
ylim(0,5);

3.2 Quasar Classification with Random Forest

  • Let's switch gears and do a 3-class classification problem: star, galaxy, or QSO.
  • A very good general-purpose classification (and regression!) algorithm is "Random Forest." See this yhat blog post for a nice high level introduction.

Random Forests

  • Decision trees encode sequences of "cuts" in the data, leading to the separation of the samples into groups to which labels are assigned.
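
For illustration (a sketch on the iris toy dataset, separate from the SDSS analysis below; export_text requires scikit-learn 0.21 or later), here is a single shallow decision tree and the cuts it learned:

In [ ]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the learned cuts as a set of nested rules:
print(export_text(tree, feature_names=list(iris.feature_names)))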

Forests of Random Decision Trees

  • Random decision trees are random in that they:

    • Consider only a random subset of the features when choosing each cut

    • Are trained on random bootstrap sub-samples of the data

  • Each individual tree over-fits its own sub-sample, but averaging the votes of many such de-correlated trees gives accurate and stable ensemble predictions

Quasar Classification with Random Forest

  • Let's read in equal numbers (1000) of all three types of data, clean them up, and set $y$ equal to the classification label.

In [ ]:
all_sources = pd.read_csv("../../data/qso10000.csv",index_col=0,usecols=["objid","dered_r","u_g_color",\
                                                "g_r_color","r_i_color","i_z_color","diff_u",\
                                                "diff_g1","diff_i","diff_z","class"])[:1000]

all_sources = all_sources.append(pd.read_csv("../../data/star1000.csv",index_col=0,usecols=["objid","dered_r","u_g_color",\
                                                "g_r_color","r_i_color","i_z_color","diff_u",\
                                                "diff_g1","diff_i","diff_z","class"]))

all_sources = all_sources.append(pd.read_csv("../../data/galaxy1000.csv",index_col=0,usecols=["objid","dered_r","u_g_color",\
                                                "g_r_color","r_i_color","i_z_color","diff_u",\
                                                "diff_g1","diff_i","diff_z","class"]))

all_sources = all_sources[(all_sources["dered_r"] > -9999) & (all_sources["g_r_color"] > -10) & (all_sources["g_r_color"] < 10)]

all_labels = all_sources["class"]

all_features = copy.copy(all_sources)
del all_features["class"]

X = copy.copy(all_features.values)
y = copy.copy(all_labels.values)

Feature Data and Target Labels

In classification problems the target values are categorical "labels":


In [ ]:
print("Feature vector shape =", X.shape)
print("Class label vector shape =", y.shape)

In [ ]:
print(y, X)

Visual Exploration

What structure can we see in the data? Let's plot all the features as before.


In [ ]:
yy = y.copy()
yy[yy=="QSO"] = 0.0    # Red
yy[yy=="STAR"] = 0.5   # Green
yy[yy=="GALAXY"] = 1.0 # Blue
yy = yy.astype(float)

norm = mpl.colors.Normalize(vmin=min(yy), vmax=max(yy))
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)
rez = pd.plotting.scatter_matrix(all_features, alpha=0.2, figsize=[15,15], c=m.to_rgba(yy))

Random Forest Classification


In [ ]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(X, y)

Random Forest feature importances

  • Each decision tree in the Random Forest focuses on a different combination of features

  • One output of the fitted model is an indication of which features are most important


In [ ]:
sorted(zip(all_features.columns.values, rf.feature_importances_), key=lambda q: q[1], reverse=True)

Classification Accuracy

  • The accuracy of a classifier is the fraction of predictions made that are correct.

  • Ensemble classifiers like Random Forest provide an oob_score, the mean "Out of Bag" prediction accuracy

  • Each decision tree in the ensemble is only working on a subset of the data, so it can track its accuracy with the data not in its own bag.


In [ ]:
# Report the "out of bag" accuracy score:
rf.oob_score_

This score looks good, but it was estimated using the same data we trained on (via each tree's out-of-bag samples), rather than an independent test set.

Question:

How should we estimate the generalized accuracy?
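
One possible answer (a sketch re-using rf, X and y from above): cross-validate the accuracy over several folds, just as we did for the regression models.

In [ ]:
from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(rf, X, y, cv=3, scoring='accuracy')
print(np.mean(accuracy), "+/-", np.std(accuracy)/np.sqrt(len(accuracy)))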

Classifier Tuning with GridSearchCV

As before, this will take a few minutes, as the model selection is carried out on each fold within the training set...


In [ ]:
# Parameter values to try:
parameters = {'n_estimators':(50,100,200), "max_features": ["auto",3],
              'criterion':["gini","entropy"], "min_samples_leaf": [1,2]}

# Training/test split:
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.25)

In [ ]:
# Do a grid search to find the highest 3-fold cross-validation score:
rf_tuned = GridSearchCV(rf, parameters, cv=3, verbose=1)
RFselector = rf_tuned.fit(X_train, y_train)

# Print the best score and estimator:
print('Best cross-validation score:', RFselector.best_score_)
print(RFselector.best_estimator_)

Question:

Would you be satisfied with a 95% successful classification fraction? Can you suggest some more useful scores to optimize? (Hint: imagine using a classifier to select a sample of follow-up targets.)

Confusion Matrices

One way of visualizing classification accuracy across a range of labels is via a confusion matrix:


In [ ]:
# Make predictions given the test data:
y_pred = RFselector.predict(X_test)

In [ ]:
# Compute confusion matrix:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

plt.matshow(cm)
plt.title('Confusion matrix', fontsize=18)
plt.colorbar()
plt.ylabel('True label', fontsize=18)
plt.xlabel('Predicted label', fontsize=18)

ROC Curves

Each Random Forest output label comes with a classification probability, computed from the results of the whole forest.

To select a sample of classified objects, one can choose a selection threshold in this class probability, and only keep objects with higher probability than this threshold.

The availability of a class probability leads to an important diagnostic: the "Receiver Operating Characteristic" or "ROC" curve.

The ROC curve shows a classifier's true positive rate (TPR) plotted against the false positive rate (FPR), as the selection threshold is varied.

$\;\;\;\;\;{\rm TPR} = \frac{TP}{TP + FN}$ the fraction of actual 1's that are classified as 1's.

$\;\;\;\;\;{\rm FPR} = \frac{FP}{FP + TN}$ the fraction of actual 0's that are classified as 1's.

TPR = "Recall", "Completeness". FPR = "Fall-out". In astronomy, we are also (very) interested in "Precision," $\frac{TP}{FP + TP}$, which is related to "Purity"

Typically, classifiers have control parameters that affect both the TPR and FPR (often improving one at the expense of the other), so the ROC curve is a good tool for investigating these parameters.

Likewise, ROC curves provide a very good way to compare different classifiers, via the "area under the curve" (and above the random classifier TPR=FPR line).

Let's use the SciKit-Learn utilities to plot an ROC curve for our RFselector, and quantify its performance via the AUC.


In [ ]:
# Classify the test data, and store the classification probabilities:
BestRFselector = RFselector.best_estimator_
y_prob = BestRFselector.fit(X_train, y_train).predict_proba(X_test)

In [ ]:
from sklearn.metrics import roc_curve, auc

# Compute ROC curve and area under curve (AUC) for each class:
labels = BestRFselector.classes_ # ['GALAXY', 'QSO', 'STAR'] - the order is decided by the machine!
fpr = dict()
tpr = dict()
roc_auc = dict()
for i,label in enumerate(labels):
    fpr[label], tpr[label], _ = roc_curve(y_test, y_prob[:, i], pos_label=label)
    roc_auc[label] = auc(fpr[label], tpr[label])

In [ ]:
lw = 2
colors = {'QSO':'red', 'STAR':'green', 'GALAXY':'blue'}
for label in labels:
    plot(fpr[label], tpr[label], color=colors[label],
         lw=lw, label='%s (AUC = %0.3f)' % (label, roc_auc[label]))
plot([0, 1], [0, 1], color='gray', lw=lw, linestyle='--')
xlim([0.0, 1.0])
ylim([0.0, 1.05])
xlabel('False Positive Rate', fontsize=18)
ylabel('True Positive Rate', fontsize=18)
title('RFselector ROC Curve', fontsize=18)
legend(loc="lower right", fontsize=14);

Endnotes

The scikit-learn package makes it easy to experiment with various machine learning algorithms, and make non-parametric models that (via cross-validation) have high generalized prediction accuracy.

Other frameworks exist: Google's "TensorFlow" is worth investigating, both for general ML, and also for its neural network support. PGMs, as well as the ML basics shown here, will reappear.

These models may not provide parameter uncertainties, or even much new understanding of the dataset:

  • This may not be your priority in a given application.
  • If it is, you'll have more work to do to enable you to draw scientific conclusions...