After pulling down the tutorial notebook, immediately make a copy and do your work in the copy, leaving the original unmodified. This prevents git conflicts should the version-controlled file change at any point in the future. (The same advice applies to the homeworks.)
In this 3-part notebook you will start to explore the scikit-learn ML Python package, and see how it supports a range of machine learning models with a uniform terminology and API, with an emphasis on model evaluation by cross-validation.
Credit: some of the material in this tutorial is based on Andy Mueller's scikit-learn tutorial from the 2015 edition of "Astro Hack Week". The SDSS examples are based on Josh Bloom's SDSS example notebook, from the 2014 edition of "Astro Hack Week".

Further reading: Ivezic Chapter 8 (regression) and Chapter 9 (classification).
You will need to `pip install scikit-learn` and check that the installed version is v0.18 or higher.
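A quick way to check the installed version (a minimal sketch; the `__version__` attribute is standard):

In [ ]:
import sklearn
print(sklearn.__version__)   # should be 0.18 or higher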
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
In [ ]:
digits.images.shape
In [ ]:
print(digits.images[0])
In [ ]:
plt.matshow(digits.images[23], cmap=plt.cm.Greys)
In [ ]:
digits.data.shape
In [ ]:
digits.target.shape
In [ ]:
digits.target[23]
In SciKit-Learn terminology:

* `data` contains the design matrix $X$, and is a numpy array of shape $(N, P)$
* `target` contains the response variables $y$, and is a numpy array of shape $(N,)$
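Since the images are 8x8, each row of `data` is just the corresponding image unraveled into 64 numbers; a quick check (a small sketch reusing the `digits` object from above):

In [ ]:
# Each 64-element row of digits.data is the corresponding 8x8 image, flattened:
print(np.allclose(digits.images.reshape(len(digits.images), -1), digits.data))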
In [ ]:
print(digits.DESCR)
In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
train_test_split(digits.data, digits.target, test_size=0.25)
In [ ]:
X_train.shape, y_train.shape
In [ ]:
X_test.shape, y_test.shape
SciKit-Learn provides 5 "toy" datasets for tutorial purposes, all loadable in the same way via the corresponding `load_*` function (see the sketch after the table):
Name | Description |
---|---|
`boston` | Boston house-prices, with 13 associated measurements (R) |
`iris` | Fisher's iris classifications (based on 4 characteristics) (C) |
`diabetes` | Diabetes (x vs y) (R) |
`digits` | Hand-written digits, 8x8 images with classifications (C) |
`linnerud` | Linnerud: 3 exercise and 3 physiological data (R) |
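Each of these can be loaded with exactly the same pattern; a minimal sketch that prints the design matrix and target shapes for all five:

In [ ]:
from sklearn.datasets import load_boston, load_iris, load_diabetes, load_digits, load_linnerud
for loader in (load_boston, load_iris, load_diabetes, load_digits, load_linnerud):
    dataset = loader()
    print(loader.__name__, dataset.data.shape, dataset.target.shape)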
In [ ]:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)
In [ ]:
# Visualizing the Boston house price data:
import corner
X = boston.data
y = boston.target
plot = np.concatenate((X, np.atleast_2d(y).T), axis=1)
labels = np.append(boston.feature_names,'MEDV')
corner.corner(plot, labels=labels);
For a first application of SciKit-Learn, let's see the straight line fitting problem solved via the machine learning approach. In the process, we'll look at how model (prediction) accuracy is quantified, and then generalized via cross-validation.

Further reading: Ivezic Sections 8.1 and 8.2 (linear regression), and Section 8.11 (cross-validation).
Straight line fitting is a linear regression problem - and an example of predictive learning.

A predictive model can be said to have been "fitted" to the data when an assumed loss function has been minimized. A popular choice of loss function to minimize is the following, corresponding to the method of "ordinary least squares":

$$ \min_{w, b} \sum_i || w^\mathsf{T}x_i + b - y_i ||^2 $$

While this loss function is derivable from statistical principles, the machine only needs it to be encoded.
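To make the connection concrete, here is a minimal sketch (on synthetic data, using plain numpy rather than scikit-learn) of minimizing exactly this loss with ordinary least squares:

In [ ]:
# Ordinary least squares on synthetic data y = 2x + 1 + noise:
import numpy as np
rng = np.random.RandomState(42)
x = rng.uniform(0, 10, size=50)
y_synth = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)
# Design matrix with a column of ones, so the intercept b is fitted too:
A = np.vstack([x, np.ones_like(x)]).T
(w, b), loss, _, _ = np.linalg.lstsq(A, y_synth, rcond=None)
print("w =", w, " b =", b, " minimized loss =", loss)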
If we fit a straight line to a subset of the data (the training set), the accuracy of the linear model's predictions in the remainder of the data (the test set) can be checked.
Let's fit some test data with a straight line using the SciKit-Learn
library, and see how accurately we can make our predictions.
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn import datasets, linear_model
In [ ]:
# Code source: Jaques Grobler
# License: BSD 3 clause
# Load the boston dataset, and focus on just one attribute:
# LSTAT (attribute 12)
boston = datasets.load_boston()
# Package into design matrix X and target vector y:
X = np.atleast_2d(boston.data[:,12]).T
y = np.atleast_2d(boston.target).T
# Make a training/test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
In [ ]:
plt.scatter(X_test, y_test, color='black')
plt.xlabel('LSTAT')
plt.ylabel('MEDV');
Linear Models

All scikit-learn machine learning models have a common API:

* a `fit` method that optimizes the model's internal parameters, given training data features and target values
* a `predict` method that returns model-predicted target values, given test data features
* various `score` methods for quantifying performance
In [ ]:
# Create linear regression model object:
model = linear_model.LinearRegression()
# Train the model using the training set:
model.fit(X_train, y_train)
# The coefficients:
print("Coefficients:", model.coef_)
The mean squared error of a model's predictions decomposes into a variance term and a bias term. Writing $\bar{y} = \mathcal{E}[\hat{y}]$ for the mean prediction (so that the cross term vanishes):

$$ {\rm MSE} = \mathcal{E}\left[ (\hat{y} - y^{\rm true})^2 \right] = \mathcal{E}\left[ (\hat{y} - \bar{y} + \bar{y} - y^{\rm true})^2 \right] = \mathcal{E}\left[ (\hat{y} - \bar{y})^2 \right] + (\bar{y} - y^{\rm true})^2 = {\rm var}(\hat{y}) + {\rm bias}^2(\hat{y}) $$

* In general, different models reach different balances between the variance and bias of their predictions.
* A particular choice of loss function leads to a corresponding minimized risk.
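A tiny numerical check of this decomposition, with some made-up repeated predictions (a sketch, not part of the Boston analysis):

In [ ]:
# Verify MSE = variance + bias^2 for a set of hypothetical repeated predictions:
y_true = 3.0
y_hat = np.array([2.5, 3.5, 2.8, 3.9, 2.2])
mse = np.mean((y_hat - y_true)**2)
var = np.var(y_hat)
bias2 = (np.mean(y_hat) - y_true)**2
print(mse, var + bias2)   # the two numbers agree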
In [ ]:
# The mean square prediction error:
print("Training data: MSE = %.2f"
% np.mean((model.predict(X_train) - y_train) ** 2))
print("Test data: MSE = %.2f"
% np.mean((model.predict(X_test) - y_test) ** 2))
# The "explained variance" R2 score: 1 is perfect prediction:
print('Training data: R^2 score = %.2f' % model.score(X_train, y_train))
print('Test data: R^2 score = %.2f' % model.score(X_test, y_test))
In [ ]:
# Plot outputs:
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, model.predict(X_test), color='blue', linewidth=3)
plt.xlabel('LSTAT')
plt.ylabel('MEDV');
In supervised machine learning the usual goal is to make the most accurate predictions we can - which means neither over-fitting nor under-fitting the data
Above, we made one training/test split, and computed the (mean squared) prediction error. The model that minimizes the generalized prediction error can be found (approximately) with cross validation.
In cross validation we consider multiple training/test splits, and look at the mean score across all of these "folds."
In [ ]:
from sklearn.model_selection import cross_val_score
model = linear_model.LinearRegression()
cross_val_score(model, X, y, cv=5, scoring='r2')
In [ ]:
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
cross_val_score(model, X, y, cv=shuffle_split)
In [ ]:
MSE = cross_val_score(model, X, y, cv=shuffle_split, scoring='neg_mean_squared_error')
GE, errGE = -np.mean(MSE), np.std(MSE)/np.sqrt(len(MSE))
print("Generalization error:", GE, "+/-", errGE)
In [ ]:
R2 = cross_val_score(model, X, y, cv=shuffle_split, scoring='r2')
meanR2, errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Generalized R2 score:", meanR2, "+/-", errR2)
In [ ]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
XX = poly.fit_transform(X)
XX
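To see which polynomial terms were generated, the `powers_` attribute of the fitted transformer records the exponent of each input feature in each output column (a quick check):

In [ ]:
# One input feature, degree 4: the columns of XX are x^0, x^1, x^2, x^3, x^4
print(XX.shape)
print(poly.powers_)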
In [ ]:
polymodel = linear_model.LinearRegression()
poly_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
R2 = cross_val_score(polymodel, XX, y, cv=poly_split, scoring='r2')
poly_meanR2, poly_errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Polynomial: generalized R^2 score:", np.round(poly_meanR2, 2), "+/-", np.round(poly_errR2, 2))
print("Straight line: generalized R^2 score:", np.round(meanR2, 2), "+/-", np.round(errR2, 2))
In [ ]:
from sklearn.model_selection import cross_val_predict
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
y_straightline = cross_val_predict(model, X, y, cv=10)
y_polynomial = cross_val_predict(polymodel, XX, y, cv=10)
In [ ]:
# Plot outputs:
plt.scatter(X, y, color='black', alpha=0.1)
plt.plot(X, y_straightline, color='blue', linewidth=2, alpha=0.4)
plt.plot(X, y_polynomial, color='green', linewidth=2, alpha=0.4)
plt.xlabel('LSTAT')
plt.ylabel('MEDV');
In [ ]:
# Define a linear model:
supermodel = linear_model.LinearRegression()
# Use all the data, and set up a 10-fold cross validation run:
super_split = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
# Carry out the cross-validation of the model, training, testing and reporting:
R2 = cross_val_score(supermodel, boston.data, boston.target, cv=super_split, scoring='r2')
# Compute our model prediction accuracy score, for comparison with other models:
hpmeanR2, hperrR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print("Hyperplane: generalized R^2 score:", np.round(hpmeanR2, 2), "+/-", np.round(hperrR2, 2))
print("Straight line: generalized R^2 score:", np.round(meanR2, 2), "+/-", np.round(errR2, 2))
Notice that we just went from a simple hypothesis (that MEDV depends only on LSTAT) to a very much more complex one (house price could depend on all 13 of our measured attributes) in one step. The data analysis is automated, in the sense that we simply fed our machine new inputs and it processed them.

Now let's look at a couple of astronomical applications of machine learning: an example regression and an example classification analysis of the SDSS catalog dataset.
In the process, we'll understand some more "score" metrics and diagnostic visualizations, and carry out model selection by cross validation.
Let's investigate the SDSS photometric object catalog, and look for machine learning solutions to the following two problems:
Estimating the redshifts of quasars from their photometry (regression)
Selecting quasars from a background of stars and galaxies (classification)
From the SDSS Sky Server we've downloaded two types of photometry (aperture and petrosian), corrected for extinction, for a number of sources with redshifts. Here's the SQL for an example query that gets us 10000 example quasars:
SELECT *,
  dered_u - mag_u AS diff_u,
  dered_g - mag_g AS diff_g,
  dered_r - mag_r AS diff_g,
  dered_i - mag_i AS diff_i,
  dered_z - mag_z AS diff_z
FROM (
  SELECT TOP 10000
    objid, ra, dec,
    dered_u, dered_g, dered_r, dered_i, dered_z,
    psfmag_u - extinction_u AS mag_u,
    psfmag_g - extinction_g AS mag_g,
    psfmag_r - extinction_r AS mag_r,
    psfmag_i - extinction_i AS mag_i,
    psfmag_z - extinction_z AS mag_z,
    z AS spec_z,
    dered_u - dered_g AS u_g_color,
    dered_g - dered_r AS g_r_color,
    dered_r - dered_i AS r_i_color,
    dered_i - dered_z AS i_z_color,
    class
  FROM SpecPhoto
  WHERE (class = 'QSO')
) AS sp
We can do the same for 'STAR's and 'GALAXY's as well. These data are all provided in the data
folder.
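If you want to regenerate files like these yourself, one option is to submit the query programmatically. Here is an untested sketch, assuming you have the astroquery package installed and that its `SDSS.query_sql` interface accepts a query like the one above (the query string below is a shortened, hypothetical example):

In [ ]:
# Hypothetical sketch: run a SkyServer SQL query via astroquery and save the result to CSV.
from astroquery.sdss import SDSS
query = "SELECT TOP 10 objid, ra, dec, z AS spec_z, class FROM SpecPhoto WHERE class = 'QSO'"
table = SDSS.query_sql(query)              # returns an astropy Table
table.to_pandas().to_csv("qso_sample.csv", index=False)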
In [ ]:
import pandas as pd
pd.set_option('display.max_columns', None)
%pylab inline
import seaborn as sns
sns.set()
import copy
from __future__ import print_function
In [ ]:
qsos = pd.read_csv("../../data/qso10000.csv",index_col=0,usecols=["objid","dered_r","spec_z","u_g_color",\
"g_r_color","r_i_color","i_z_color","diff_u",\
"diff_g1","diff_i","diff_z"])
# Clean out extreme colors and bad magnitudes:
qsos = qsos[(qsos["dered_r"] > -9999) & (qsos["g_r_color"] > -10) & (qsos["g_r_color"] < 10)]
# Response variables: redshift
qso_redshifts = qsos["spec_z"]
# Features or attributes: photometric measurements
qso_features = copy.copy(qsos)
del qso_features["spec_z"]
qso_features.head()
In [ ]:
bins = hist(qso_redshifts.values,bins=100) ; xlabel("redshift") ; ylabel("N")
In [ ]:
import matplotlib as mpl
import matplotlib.cm as cm
# Truncate the color at z=2.5 just to keep some contrast.
norm = mpl.colors.Normalize(vmin=min(qso_redshifts.values), vmax=2.5)
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)
# Plot everything against everything else:
rez = pd.plotting.scatter_matrix(qso_features[0:1000], alpha=0.2, figsize=[15,15], c=m.to_rgba(qso_redshifts.values[0:1000]))
In [ ]:
X = qso_features.values # Data: 9-d feature space
y = qso_redshifts.values # Target: redshifts
In [ ]:
print("Design matrix shape =", X.shape)
print("Response variable vector shape =", y.shape)
Let's follow the same procedure as in the previous chunk (from the SciKit-Learn
Linear Regression tutorial):
In [ ]:
# Split the data into a training set and a test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Instantiate a "linear model"
from sklearn import linear_model
linear = linear_model.LinearRegression()
# Fit the model, using all the attributes:
linear.fit(X_train, y_train)
# Do the prediction on the test data:
y_lr_pred = linear.predict(X_test)
# How well did we do? Compute the mean squared prediction error and R2 score on the test set:
from sklearn.metrics import mean_squared_error
mse_linear = mean_squared_error(y_test, y_lr_pred)
r2_linear = linear.score(X_test, y_test)
print("Linear regression: MSE =", mse_linear)
print("R2 score =", r2_linear)
In [ ]:
plot(y_test, y_lr_pred, 'o', alpha=0.2)
title("Linear Regression - MSE = %.2f" % mse_linear, fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric Redshift", fontsize=18)
plot([0,7],[0,7], color='red')
ylim(0,5)
Just how bad is this? Here's the MSE from guessing the average redshift of the training set for all new objects:
In [ ]:
print("Naive MSE", ((1./len(y_train))*(y_train - y_train.mean())**2).sum())
print("Linear regression: MSE = ",mse_linear)
Now let's try a non-parametric model:
"Regression based on k-nearest neighbors. The target is predicted by local interpolation of the targets associated of the nearest neighbors in the training set."
What underlying model is implied by the KNN algorithm? How many hidden parameters does it (effectively) have? What choices do we get to make?
In [ ]:
from sklearn import neighbors
from sklearn import preprocessing
X_scaled = preprocessing.scale(X) # Many methods work better on re-scaled ("whitened") X.
KNN = neighbors.KNeighborsRegressor(5)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
KNN.fit(X_train, y_train)
In [ ]:
y_knn_pred = KNN.predict(X_test)
mse_knn = mean_squared_error(y_test,y_knn_pred)
r2_knn = KNN.score(X_test, y_test)
print("MSE (KNN) =", mse_knn)
print("R2 score (KNN) =",r2_knn)
print("cf.")
print("MSE (linear regression) = ",mse_linear)
print("R2 score (linear regression) =",r2_linear)
In [ ]:
plot(y_test, y_knn_pred,'o',alpha=0.2)
title("k-NN Residuals - MSE = %.2f" % mse_knn, fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric Redshift", fontsize=18)
plot([0,7], [0,7], color='red')
ylim(0,5);
Let's vary the control parameters of the KNN model, to see how good we can make our predictions.
We can see our options in the model repr
:
KNeighborsRegressor(algorithm='auto',
leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=5,
p=2, weights='uniform')
In [ ]:
# We'll vary the number of neighbors used:
param_name = "n_neighbors"
param_range = np.array([1,2,4,8,16,32,64])
# Compute our cv scores for the above range of the no. of neighbors:
from sklearn.model_selection import validation_curve
training_scores, validation_scores = validation_curve(KNN, X_scaled, y,
param_name=param_name,
param_range=param_range,
cv=3, scoring='r2')
In [ ]:
def plot_validation_curve(param_name, parameter_values, training_scores, validation_scores):
    training_scores_mean = np.mean(training_scores, axis=1)
    training_scores_std = np.std(training_scores, axis=1)
    validation_scores_mean = np.mean(validation_scores, axis=1)
    validation_scores_std = np.std(validation_scores, axis=1)
    plt.fill_between(parameter_values, training_scores_mean - training_scores_std,
                     training_scores_mean + training_scores_std, alpha=0.1, color="r")
    plt.fill_between(parameter_values, validation_scores_mean - validation_scores_std,
                     validation_scores_mean + validation_scores_std, alpha=0.1, color="g")
    plt.plot(parameter_values, training_scores_mean, 'o-', color="r",
             label="Training R2 score")
    plt.plot(parameter_values, validation_scores_mean, 'o-', color="g",
             label="Cross-validation R2 score")
    plt.ylim(validation_scores_mean.min() - .1, training_scores_mean.max() + .1)
    plt.xlabel(param_name, fontsize=18)
    plt.legend(loc="best", fontsize=18)
In [ ]:
plot_validation_curve(param_name, param_range, training_scores, validation_scores)
In [ ]:
param_grid = {'n_neighbors': np.array([1,2,4,8,16,32,64]),
'weights': ['uniform','distance'],
'p' : np.array([1,2])}
np.set_printoptions(suppress=True)
print(param_grid)
In [ ]:
from sklearn.model_selection import GridSearchCV
KNN_tuned = GridSearchCV(KNN, param_grid, verbose=3)
A GridSearchCV
object behaves just like a model, except it carries out a cross-validation while fitting:
This fit will take a minute or so, as it loops over all the search positions in the hyper-parameter grid.
In [ ]:
KNN_tuned.fit(X_train, y_train)
In [ ]:
y_knn_tuned_pred = KNN_tuned.predict(X_test)
mse_knn_tuned = mean_squared_error(y_test,y_knn_tuned_pred)
r2_knn_tuned = KNN_tuned.score(X_test, y_test)
print("MSE (tuned KNN) =", mse_knn_tuned)
print("R2 score (tuned KNN) =", r2_knn_tuned)
print("cf.")
print("MSE (KNN) = ", mse_knn)
print("R2 score (KNN) =", r2_knn)
Which are the best KNN control parameters we found?
In [ ]:
KNN_tuned.best_params_
This value of n_neighbors
is consistent with the peak in cross-validation score in the validation curve plot.
Notice that all the above tuning happened while training on a single split (X_train
and y_train
).
It's possible that one particular training/test split prefers a slightly different set of model parameters than another - so to assess our generalization error, we need a further level of cross-validation.
We can do this by passing a GridSearchCV
model to the cross validation score calculator. This will take a few minutes, as the grid search is carried out for each CV fold...
In [ ]:
from sklearn.model_selection import cross_val_score
R2 = cross_val_score(KNN_tuned, X_scaled, y, cv=3, scoring='r2')
In [ ]:
meanR2, errR2 = np.mean(R2), np.std(R2)/np.sqrt(len(R2))
print('Mean score:', meanR2, '+/-', errR2)
Let's finish off our photo-z machine learning tool, by:
Choosing the best combination of hyper-parameters (from our cross-validation analysis)
Training it on the whole of the training dataset
Using it to predict the redshifts of the objects in the test set
In [ ]:
KNNz = KNN_tuned.best_estimator_
KNNz.fit(X_train, y_train)
In [ ]:
j = 571
one_new_quasar = X_test[j,:].reshape(1, -1)
zphoto = KNNz.predict(one_new_quasar)
zspec = y_test[j]
print("True redshift cf. KNN photo-z:", zspec, ' cf.', zphoto[0])
In [ ]:
zspec = y_test
zphoto = KNNz.predict(X_test)
plot(zspec, zphoto,'o',alpha=0.1)
title("KNNz performance", fontsize=18)
xlabel("Spectroscopic Redshift", fontsize=18)
ylabel("Photometric redshift", fontsize=18)
plot([0,7], [0,7], color='red')
ylim(0,5);
From the Scikit-Learn docs: "A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset, and uses averaging to improve the predictive accuracy and control over-fitting."

Random decision trees are random in that they (see the sketch after this list):

* Are set up with random cut thresholds, and random feature-ordering
* Are given random bootstrap sub-samples of the data
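To see the "ensemble of decision trees" structure concretely, here is a small sketch on the iris toy data (not our SDSS sample) that fits a forest and peeks at its individual trees:

In [ ]:
# A small forest on the iris toy data, to peek at the ensemble structure:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
forest = RandomForestClassifier(n_estimators=10)
forest.fit(iris.data, iris.target)
# estimators_ holds the individual decision trees, each fitted to a bootstrap sub-sample:
print(len(forest.estimators_), "trees; the first is a", type(forest.estimators_[0]).__name__)
print("Forest prediction for the first flower:", forest.predict(iris.data[:1]))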
In [ ]:
all_sources = pd.read_csv("../../data/qso10000.csv",index_col=0,usecols=["objid","dered_r","u_g_color",\
"g_r_color","r_i_color","i_z_color","diff_u",\
"diff_g1","diff_i","diff_z","class"])[:1000]
all_sources = all_sources.append(pd.read_csv("../../data/star1000.csv",index_col=0,usecols=["objid","dered_r","u_g_color",\
"g_r_color","r_i_color","i_z_color","diff_u",\
"diff_g1","diff_i","diff_z","class"]))
all_sources = all_sources.append(pd.read_csv("../../data/galaxy1000.csv",index_col=0,usecols=["objid","dered_r","u_g_color",\
"g_r_color","r_i_color","i_z_color","diff_u",\
"diff_g1","diff_i","diff_z","class"]))
all_sources = all_sources[(all_sources["dered_r"] > -9999) & (all_sources["g_r_color"] > -10) & (all_sources["g_r_color"] < 10)]
all_labels = all_sources["class"]
all_features = copy.copy(all_sources)
del all_features["class"]
X = copy.copy(all_features.values)
y = copy.copy(all_labels.values)
In [ ]:
print("Feature vector shape =", X.shape)
print("Class label vector shape =", y.shape)
In [ ]:
print(y, X)
In [ ]:
yy = y.copy()
yy[yy=="QSO"] = 0.0 # Red
yy[yy=="STAR"] = 0.5 # Green
yy[yy=="GALAXY"] = 1.0 # Blue
yy = yy.astype(float)
norm = mpl.colors.Normalize(vmin=min(yy), vmax=max(yy))
cmap = cm.jet_r
m = cm.ScalarMappable(norm=norm, cmap=cmap)
rez = pd.plotting.scatter_matrix(all_features, alpha=0.2, figsize=[15,15], c=m.to_rgba(yy))
In [ ]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(X, y)
In [ ]:
sorted(zip(all_features.columns.values, rf.feature_importances_), key=lambda q: q[1], reverse=True)
* The accuracy of a classifier is the fraction of predictions made that are correct.
* Ensemble classifiers like Random Forest provide an `oob_score_` (the mean "Out of Bag" prediction accuracy) when instantiated with `oob_score=True`.
* Each decision tree in the ensemble only works with a subset of the data, so it can track its accuracy using the data not in its own bag.
In [ ]:
# Report the "out of bag" accuracy score:
rf.oob_score_
In [ ]:
# Parameter values to try:
parameters = {'n_estimators':(50,100,200), "max_features": ["auto",3],
'criterion':["gini","entropy"], "min_samples_leaf": [1,2]}
# Training/test split:
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.25)
In [ ]:
# Do a grid search to find the highest 3-fold cross-validation score:
rf_tuned = GridSearchCV(rf, parameters, cv=3, verbose=1)
RFselector = rf_tuned.fit(X_train, y_train)
# Print the best cross-validation score and the corresponding estimator:
print('Best CV score:', RFselector.best_score_)
print(RFselector.best_estimator_)
In [ ]:
# Make predictions given the test data:
y_pred = RFselector.predict(X_test)
In [ ]:
# Compute confusion matrix:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
plt.matshow(cm)
plt.title('Confusion matrix', fontsize=18)
plt.colorbar()
plt.ylabel('True label', fontsize=18)
plt.xlabel('Predicted label', fontsize=18)
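For per-class numbers to go with this picture, scikit-learn's `classification_report` summarizes the precision, recall and F1 score of each class (a quick sketch, reusing `y_test` and `y_pred` from above):

In [ ]:
# Per-class precision, recall and F1 score for the tuned Random Forest:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))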
Each Random Forest output label comes with a classification probability, computed from the results of the whole forest.
To select a sample of classified objects, one can choose a selection threshold in this class probability, and only keep objects with higher probability than this threshold.
The availability of a class probability leads to an important diagnostic: the "Receiver Operating Characteristic" or "ROC" curve.
The ROC curve shows a classifier's true positive rate (TPR) plotted against the false positive rate (FPR), as the selection threshold is varied.
$\;\;\;\;\;{\rm TPR} = \frac{TP}{TP + FN}$ the fraction of actual 1's that are classified as 1's.
$\;\;\;\;\;{\rm FPR} = \frac{FP}{FP + TN}$ the fraction of actual 0's that are classified as 1's.
TPR = "Recall", "Completeness". FPR = "Fall-out". In astronomy, we are also (very) interested in "Precision," $\frac{TP}{FP + TP}$, which is related to "Purity"
Typically, classifiers have control parameters that affect both the TPR and FPR (often improving one at the expense of the other), so the ROC curve is a good tool for investigating these parameters.
Likewise, ROC curves provide a very good way to compare different classifiers, via the "area under the curve" (and above the random classifier TPR=FPR line).
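Before plotting the whole curve, here is a small sketch of what one point on it means: choose a threshold on the QSO class probability, select everything above it, and count up the rates by hand (reusing the fitted `RFselector` and test set from above; the 0.5 threshold is arbitrary):

In [ ]:
# One point on the QSO ROC curve, computed by hand at a single probability threshold:
best_rf = RFselector.best_estimator_
qso_index = list(best_rf.classes_).index('QSO')
p_qso = best_rf.predict_proba(X_test)[:, qso_index]
threshold = 0.5
selected = p_qso > threshold
TP = np.sum(selected & (y_test == 'QSO'))
FP = np.sum(selected & (y_test != 'QSO'))
FN = np.sum(~selected & (y_test == 'QSO'))
TN = np.sum(~selected & (y_test != 'QSO'))
print("TPR =", float(TP)/(TP + FN), " FPR =", float(FP)/(FP + TN), " Precision =", float(TP)/(TP + FP))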
Let's use the SciKit-Learn
utilities to plot an ROC curve for our RFselector, and quantify its performance via the AUC.
In [ ]:
# Classify the test data, and store the classification probabilities:
BestRFselector = RFselector.best_estimator_
y_prob = BestRFselector.fit(X_train, y_train).predict_proba(X_test)
In [ ]:
from sklearn.metrics import roc_curve, auc
# Compute ROC curve and area under curve (AUC) for each class:
labels = BestRFselector.classes_ # ['GALAXY', 'QSO', 'STAR'] - the order is decided by the machine!
fpr = dict()
tpr = dict()
roc_auc = dict()
for i, label in enumerate(labels):
    fpr[label], tpr[label], _ = roc_curve(y_test, y_prob[:, i], pos_label=label)
    roc_auc[label] = auc(fpr[label], tpr[label])
In [ ]:
lw = 2
colors = {'QSO':'red', 'STAR':'green', 'GALAXY':'blue'}
for label in labels:
    plot(fpr[label], tpr[label], color=colors[label],
         lw=lw, label='%s (AUC = %0.3f)' % (label, roc_auc[label]))
plot([0, 1], [0, 1], color='gray', lw=lw, linestyle='--')
xlim([0.0, 1.0])
ylim([0.0, 1.05])
xlabel('False Positive Rate', fontsize=18)
ylabel('True Positive Rate', fontsize=18)
title('RFselector ROC Curve', fontsize=18)
legend(loc="lower right", fontsize=14);
The scikit-learn package makes it easy to experiment with various machine learning algorithms, and to build non-parametric models that (via cross-validation) have high generalized prediction accuracy.

Other frameworks exist: Google's "TensorFlow" is worth investigating, both for general ML and also for its neural network support. PGMs, as well as the ML basics shown here, will reappear.

These models may not provide parameter uncertainties, or even much new understanding of the dataset.