In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
In previous weeks we have covered preprocessing our data, dimensionality reduction, clustering, regression and classification. This week we will be pulling these processes together into a complete project.
Most projects can be thought of as a series of discrete steps: gathering and preparing the data, selecting features, choosing and training an algorithm, optionally combining multiple models, and finally putting the resulting model to use.
If we are fortunate, our data may already be in a usable format, but more often extensive work is needed to generate something usable.
The number of features we have relative to our sample size will determine whether feature selection is needed. We may choose, in the first instance, not to use feature selection. If we observe that our performance on the validation dataset is substantially worse than on the training dataset, it is likely our model is overfitting and would benefit from limiting the number of features.
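A minimal sketch of what that check looks like in practice (the dataset, split, and model below are placeholders, not part of the original material): a large gap between the training and validation scores is the usual sign of overfitting.
In [ ]:
# Illustrative only: compare training and validation accuracy to spot overfitting.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
print('Training accuracy:  ', clf.score(X_train, y_train))
print('Validation accuracy:', clf.score(X_val, y_val))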
Even if the performance is comparable we may still consider using dimensionality reduction or feature selection.
Which algorithm to use will depend on the type of task and the size of the dataset. As with the preceding steps it can be difficult to predict the optimal approach and different options should be tried.
An additional step that can frequently boost performance is combining multiple different models. It is important to bear in mind that, although combining different models has advantages, it can make the result more difficult to interpret.
The models may be generated using different algorithms and/or different features.
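As a hedged sketch of one way to do this (not the only option), scikit-learn's VotingClassifier combines models built with different algorithms; the dataset and parameter choices below are placeholders.
In [ ]:
# Illustrative: combine two different algorithms with a soft (probability) vote.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
combined = VotingClassifier(
    estimators=[('logistic', LogisticRegression(max_iter=5000)),
                ('forest', RandomForestClassifier(n_estimators=100, random_state=0))],
    voting='soft')
print(cross_val_score(combined, X, y, cv=5).mean())
The individual models remain accessible inside the combined estimator but, as noted above, explaining why the combination made a particular prediction is harder than for a single model.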
Finally we need to be able to utilize the model we have generated. This typically takes the form of receiving a new sample and then performing all the steps used in training to make a prediction.
If we are generating a model only to understand the structure of the data we already have then the new samples may be the test dataset we set aside at the beginning.
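A pipeline makes this step straightforward because it replays every training transformation on the new samples; here is a small sketch with placeholder data and steps (not part of the original notebook).
In [ ]:
# Illustrative: the fitted pipeline reapplies scaling and PCA before predicting
# on samples it has never seen.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.3, random_state=0)

model = Pipeline([('scale', StandardScaler()),
                  ('pca', PCA(n_components=20)),
                  ('logistic', LogisticRegression(max_iter=5000))])
model.fit(X_train, y_train)
print(model.score(X_new, y_new))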
At each of these major steps there are a variety of options. It is often not clear which approach will give the best performance, so we should try several.
Being able to rapidly try different options helps us get to the best solution faster. It is tempting to make a change to our code, execute it, look at the performance, and then decide between sticking with the change or going back to the original version. Doing this, it is very easy to lose track of which changes were made, or to find we can no longer get back to an earlier version that performed better.
Using version control software is very useful for avoiding these issues.
I introduced the resources below in week 2:
The most popular version control software at the moment is Git.
There are a number of good tutorials available:
We have previously looked at approaches for choosing the optimal parameters for an algorithm. We also have choices earlier in the workflow that we should systematically explore - what features should we use, how should they be normalized, etc.
There are multiple ways we can change our code to explore different approaches: commenting alternatives in and out, wrapping the alternatives in functions, or building pipelines whose parameters can be searched automatically.
We will try several approaches on the digits dataset using comments and functions before moving to the more advanced options.
In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.metrics import accuracy_score
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
In [3]:
i = y_digits.shape[0]
split = np.random.random(i)
train = split < 0.7
test = split >= 0.7
X_digits_train = X_digits[train]
X_digits_test = X_digits[test]
y_digits_train = y_digits[train]
y_digits_test = y_digits[test]
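The random mask above works fine; as an aside (a sketch, not part of the original notebook), scikit-learn's train_test_split does the same job in one call and keeps X and y aligned.
In [ ]:
# Illustrative alternative to the manual mask; different variable names are used
# so the cells below are unaffected.
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X_digits, y_digits,
                                           test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)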
In [4]:
clf = linear_model.LogisticRegression()
clf.fit(X_digits_train, y_digits_train)
predictions = clf.predict(X_digits_test)
print(accuracy_score(y_digits_test, predictions))
In [5]:
pca = decomposition.PCA()
pca.fit(X_digits_train)
#pca.transform(X_digits_train)[:,:2].shape
z = 2
clf = linear_model.LogisticRegression()
clf.fit(pca.transform(X_digits_train)[:,:z], y_digits_train)
predictions = clf.predict(pca.transform(X_digits_test)[:,:z])
print(accuracy_score(y_digits_test, predictions))
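The cell above switches the number of components by editing z and commenting lines in and out. A sketch of the "functions" approach mentioned earlier (assuming the training and test splits from the cells above) makes it easier to compare several values in one go.
In [ ]:
# Illustrative: wrap the PCA + logistic regression workflow in a function so
# the number of retained components can be varied systematically.
from sklearn import decomposition, linear_model
from sklearn.metrics import accuracy_score

def evaluate_n_components(n_components):
    pca = decomposition.PCA()
    pca.fit(X_digits_train)
    clf = linear_model.LogisticRegression()
    clf.fit(pca.transform(X_digits_train)[:, :n_components], y_digits_train)
    predictions = clf.predict(pca.transform(X_digits_test)[:, :n_components])
    return accuracy_score(y_digits_test, predictions)

for n in (2, 10, 20, 40):
    print(n, evaluate_n_components(n))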
In [6]:
# http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#example-plot-digits-pipe-py
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
In [7]:
###############################################################################
# Plot the PCA spectrum
pca.fit(X_digits)
fig, ax = plt.subplots(1,1)
ax.plot(pca.explained_variance_, linewidth=2)
ax.set_xlabel('n_components')
ax.set_ylabel('explained_variance_')
Out[7]:
In [8]:
###############################################################################
# Prediction
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 10)
Cs
Out[8]:
In [9]:
# Parameters of pipelines can be set using '__' separated parameter names:
estimator = GridSearchCV(pipe,
dict(pca__n_components=n_components,
logistic__C=Cs))
estimator.fit(X_digits, y_digits)
print('# components:', estimator.best_estimator_.named_steps['pca'].n_components)
print('C:', estimator.best_estimator_.named_steps['logistic'].C)
print(estimator)
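A brief, illustrative follow-up: because refitting on the full data is enabled by default, the fitted grid search object can be used directly to make predictions, applying the best PCA transform and logistic regression together.
In [ ]:
# Illustrative: predict with the refitted best pipeline (here on a few of the
# training samples, purely to show the call).
print(estimator.predict(X_digits[:5]))
print(y_digits[:5])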
In [10]:
# http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better to do PCA:
pca = PCA(n_components=2)
# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
svm = SVC(kernel="linear")
# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
# numpy.arange, numpy.linspace, numpy.logspace, and range are all useful in creating options to evaluate
param_grid = dict(features__pca__n_components=[1,2,3],
features__univ_select__k=[1,2,3],
svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)
print('PCA components:', grid_search.best_estimator_.named_steps['features'].get_params()['pca'].n_components)
print('Original features used:', grid_search.best_estimator_.named_steps['features'].get_params()['univ_select'].k)
print('C:', grid_search.best_estimator_.named_steps['svm'].C)
print(grid_search.best_estimator_)
In [11]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
categories=['comp.graphics', 'sci.med'], shuffle=True, random_state=0)
print(twenty_train.target_names)
In [12]:
# Looking at an example
print(twenty_train.data[0])
In [13]:
# The first step is converting the text into numerical data we can work with
# We will use a bag-of-words approach - counting the occurrences of words
from sklearn.feature_extraction.text import (CountVectorizer,
TfidfTransformer)
# Using a pipeline
pipe = Pipeline([('counts', CountVectorizer()),
('tfidf', TfidfTransformer())])
output = pipe.fit(twenty_train.data).transform(twenty_train.data)
output
Out[13]:
In [14]:
# Adding a classifier
pipe = Pipeline([('counts', CountVectorizer()),
('tfidf', TfidfTransformer()),
('classifier', linear_model.LogisticRegression())])
# We can compare different parameters at each stage
param_grid = dict(counts__ngram_range=[(1,1), (1,2)],
counts__stop_words=[None, 'english'],
classifier__C=[0.1, 1, 10, 30, 100])
X = twenty_train.data[:]
y = twenty_train.target[:]
grid_search = GridSearchCV(pipe, param_grid=param_grid)
grid_search.fit(X, y)
Out[14]:
In [15]:
# Getting the best parameters
print('Ngram range:', grid_search.best_estimator_.named_steps['counts'].ngram_range)
print('Stop words:', grid_search.best_estimator_.named_steps['counts'].stop_words)
print('C:', grid_search.best_estimator_.named_steps['classifier'].C)
In this example we will look at an EEG dataset with the goal of detecting the onset of epileptic seizures.
This machine learning task and associated dataset was put together by UPenn and Mayo Clinic and hosted on Kaggle as a competition with financial backing from NINDS and the American Epilepsy Society.
There are two parts to the data:
There are a number of challenges in approaching this task.
Thankfully, since this is a Kaggle competition, the data has already been collected for us. It has been divided into multiple files, each containing one second of data. There are training examples for each of the subjects, and samples have already been chosen for the test set.
We will still want to perform validation but we can use Kaggle to measure our final performance on the test set.
The data available are the potentials observed by the different electrodes over the one-second window used for each sample. These values are unlikely to be predictive in this format, so we must use them to create the features our model will use.
This is the first real decision we need to make. What might our features be?
Applying normalization using standard approaches may be necessary depending on the algorithm we plan to use.
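As a rough, illustrative sketch (these particular feature choices are assumptions for demonstration, not the competition solution), summary statistics and frequency-band power are the kinds of features that can be computed from a one-second trace.
In [ ]:
# Illustrative per-channel features from a fake one-second trace.
import numpy as np

sampling_rate = 400                      # assumed sampling rate (samples per second)
trace = np.random.randn(sampling_rate)   # placeholder for one channel of EEG data

features = {'variance': trace.var(),
            'median': np.median(trace)}
# Power in 10 Hz bands up to 60 Hz, from the magnitude of the FFT
# (with a one-second window the FFT bins are spaced 1 Hz apart).
fft = np.abs(np.fft.rfft(trace))
for low in range(0, 60, 10):
    features['power_%d_%dHz' % (low, low + 10)] = fft[low:low + 10].sum()
print(features)
If the algorithm we settle on is sensitive to feature scale, a StandardScaler step can be added to the pipeline once the features are assembled.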
The dataset is reasonably large and we are likely to have a limited number of features. Our first solution might not need to select features prior to building the model.
Depending on how creative we get and how large the number of features grows, later solutions may benefit from feature selection, either to prevent overfitting or to speed up evaluation.
Evaluating different algorithms as we build a solution is probably the best strategy. While we are developing features, an ensemble or boosting algorithm will be the simplest choice.
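For example (illustrative only, with placeholder data), a random forest gives a workable baseline with very little tuning while the feature set is still in flux.
In [ ]:
# Illustrative baseline; on random data the AUC will hover around 0.5.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_fake = np.random.randn(200, 8)          # placeholder feature matrix
y_fake = np.random.randint(0, 2, 200)     # placeholder binary labels
baseline = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(baseline, X_fake, y_fake, cv=5, scoring='roc_auc').mean())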
There are multiple different levels at which we might develop machine learning models for this task. We could develop models for individual channels or all the channels combined. We could develop a single model for all subjects or separate models for each subject.
All of these different models could then be combined together.
For this task we know we will have a test dataset and our performance there will be evaluated.
NOTE: The dataset for this example is 43 GB uncompressed so I have not included it in the course material. If you want to run through the example it can be downloaded from the link above.
A github repository with all the code for this example is here.
NOTE #2: This example uses object-oriented programming and other relatively advanced approaches.
I do not expect you to follow this example fully at this stage in the course.
To my knowledge, this is the most complex publicly available example of pipelines in scikit-learn. I include this here as an example of how powerful this approach can be. During the competition, this model ranked in the top 15%.
In [16]:
import numpy as np
from sklearn.base import TransformerMixin
class ModelTransformer(TransformerMixin):
"""Wrap a classifier model so that it can be used in a pipeline"""
def __init__(self, model):
self.model = model
def fit(self, *args, **kwargs):
self.model.fit(*args, **kwargs)
return self
def transform(self, X, **transform_params):
return self.model.predict_proba(X)
def predict_proba(self, X, **transform_params):
return self.transform(X, **transform_params)
class VarTransformer(TransformerMixin):
"""Compute the variance"""
def transform(self, X, **transform_params):
var = X.var(axis=1)
return var.reshape((var.shape[0],1))
def fit(self, X, y=None, **fit_params):
return self
class MedianTransformer(TransformerMixin):
"""Compute the median"""
def transform(self, X, **transform_params):
median = np.median(X, axis=1)
return median.reshape((median.shape[0],1))
def fit(self, X, y=None, **fit_params):
return self
class ChannelExtractor(TransformerMixin):
"""Extract a single channel for downstream processing"""
def __init__(self, channel):
self.channel = channel
def transform(self, X, **transformer_params):
return X[:,:,self.channel]
def fit(self, X, y=None, **fit_params):
return self
class FFTTransformer(TransformerMixin):
"""Convert to the frequency domain and then sum over bins"""
def transform(self, X, **transformer_params):
fft = np.fft.rfft(X, axis=1)
fft = np.abs(fft)
fft = np.cumsum(fft, axis=1)
bin_size = 10
max_freq = 60
return np.column_stack([fft[:,i] - fft[:,i-bin_size]
for i in range(bin_size, max_freq, bin_size)])
def fit(self, X, y=None, **fit_params):
return self
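A quick, illustrative check of how these transformers behave, using random data shaped (samples, time points, channels):
In [ ]:
# Illustrative: 5 samples, 400 time points, 3 channels of random data.
X_fake = np.random.randn(5, 400, 3)

channel0 = ChannelExtractor(0).fit(X_fake).transform(X_fake)         # (5, 400)
print(VarTransformer().fit(channel0).transform(channel0).shape)      # (5, 1)
print(MedianTransformer().fit(channel0).transform(channel0).shape)   # (5, 1)
print(FFTTransformer().fit(channel0).transform(channel0).shape)      # (5, 5)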
In [17]:
"""This cell is not expected to run correctly. We don't have all the packages needed.
If you want to run this example download the repository and the source data."""
import numpy as np
import os
import pickle
from sklearn.model_selection import cross_val_score, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier
import get_traces
import transformers as trans
def build_pipeline(X):
"""Helper function to build the pipeline of feature transformations.
We do the same thing to each channel so rather than manually copying changes
for all channels this is automatically generated"""
channels = X.shape[2]
pipeline = Pipeline([
('features', FeatureUnion([
('select_%d_pipeline' % i,
Pipeline([('select_%d' % i, trans.ChannelExtractor(i)),
('channel features', FeatureUnion([
('var', trans.VarTransformer()),
('median', trans.MedianTransformer()),
('fft', trans.FFTTransformer()),
])),
])
) for i in range(channels)])),
('classifier', trans.ModelTransformer(RandomForestClassifier(
n_estimators=500,
max_depth=None,
min_samples_split=1,
random_state=0))),
])
return pipeline
def get_transformed_data(patient, func=get_traces.get_training_traces):
"""Load in all the data"""
X = []
channels = get_traces.get_num_traces(patient)
# Reading in 43 Gb of data . . .
for i in range(channels):
x, y = func(patient, i)
X.append(x)
return (np.dstack(X), y)
all_labels = []
all_predictions = np.array([])
folders = [i for i in os.listdir(get_traces.directory) if i[0] != '.']
folders.sort()
for folder in folders:
print('Starting %s' % folder)
print('getting data')
X, y = get_transformed_data(folder)
print(X.shape)
print('stratifiedshufflesplit')
    cv = StratifiedShuffleSplit(n_splits=5,
                                test_size=0.2,
                                random_state=0)
print('cross_val_score')
pipeline = build_pipeline(X)
# Putting this in a list is unnecessary for just one pipeline - use to compare multiple pipelines
scores = [
cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
]
print('displaying results')
for score, label in zip(scores, ['pipeline',]):
print("AUC: {:.2%} (+/- {:.2%}), {:}".format(score.mean(),
score.std(), label))
clf = pipeline
print('Fitting full model')
clf.fit(X, y)
print('Getting test data')
testing_data, files = get_transformed_data(folder,
get_traces.get_testing_traces)
print('Generating predictions')
predictions = clf.predict_proba(testing_data)
print(predictions.shape, len(files))
with open('%s_randomforest_predictions.pkl' % folder, 'wb') as f:
pickle.dump((files, predictions[:,1]), f)
The final step is using our trained model. This might be in a research article or as part of an ongoing service. In either case we want to be able to save a version of our model. We can use the saved model to address reviewers' comments, to move the model to a production server, etc.
Scikit-learn models can be saved using the pickle module, or with the joblib variant, which is more efficient at handling the numpy arrays models typically contain.
In [18]:
from sklearn import datasets
diabetes = datasets.load_diabetes()
# Description at http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
# Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements
# were obtained for each of n = 442 diabetes patients,
# as well as the response of interest, a quantitative measure of disease progression one year after baseline.
X = diabetes.data # independent variables
y = diabetes.target # dependent variable
print(X.shape)
print(y.shape)
import pandas as pd
data = pd.DataFrame(X, columns=['age', 'sex', 'bmi', 'map',
'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu'])
data.info()
In [19]:
from sklearn import linear_model
bmi = X[:, 2].reshape(-1, 1)
outcome = y
reg = linear_model.LinearRegression()
reg.fit(bmi, outcome)
predicted_outcome = reg.predict(bmi)
plt.plot(predicted_outcome, outcome, 'k.')
plt.xlabel("Predicted outcome")
plt.ylabel("Clinical outcome")
Out[19]:
In [20]:
print('Directly trained model predictions:', predicted_outcome[:10])
from sklearn.externals import joblib
joblib.dump(reg, 'diabetes_prediction_model.pkl')
reg2 = joblib.load('diabetes_prediction_model.pkl')
predicted_outcome2 = reg2.predict(bmi)
print('Saved model predictions:', predicted_outcome2[:10])
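One caveat, as an aside: in recent scikit-learn releases sklearn.externals.joblib has been removed, so the same save and load are done with the standalone joblib package instead (the calls are otherwise identical).
In [ ]:
# With newer scikit-learn versions, use the standalone joblib package instead
# of sklearn.externals.joblib; dump and load work the same way.
import joblib
joblib.dump(reg, 'diabetes_prediction_model.pkl')
reg2 = joblib.load('diabetes_prediction_model.pkl')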
In [ ]: