In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
In previous weeks we have covered preprocessing our data and dimensionality reduction, and last week we looked at supervised learning. This week we will pull these processes together into a complete project.
Most projects can be thought of as a series of discrete steps: getting the data into a usable format, deciding whether feature selection or dimensionality reduction is needed, choosing and training an algorithm, optionally combining multiple models, and finally using the model to make predictions on new samples.
If we are fortunate our data may already be in a usable format but more often extensive work is needed to generate something usable.
The number of features we have relative to our sample size will determine whether feature selection is needed. We may choose not to use feature selection in the first instance; if we then observe that performance on the validation dataset is substantially worse than on the training dataset, our model is likely overfitting and would benefit from limiting the number of features.
Even if the performance is comparable we may still consider using dimensionality reduction or feature selection.
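As a rough, hedged illustration of this check (using the digits dataset and arbitrary parameter choices rather than anything from this project), we can compare training and validation scores and, if the gap is large, limit the features with SelectKBest:
In [ ]:
# A rough sketch, not part of the original workflow: the dataset (digits), classifier,
# and value of k are arbitrary choices used only to illustrate checking for overfitting
# and then limiting the number of features.
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

digits = load_digits()
X_train, X_val, y_train, y_val = train_test_split(digits.data, digits.target,
                                                  test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
# A large gap between these two scores suggests overfitting
print('training score:', model.score(X_train, y_train))
print('validation score:', model.score(X_val, y_val))

# Keep only the 20 features with the strongest univariate association with the labels
selector = SelectKBest(f_classif, k=20).fit(X_train, y_train)
model.fit(selector.transform(X_train), y_train)
print('validation score with 20 features:', model.score(selector.transform(X_val), y_val))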
Which algorithm to use will depend on the type of task and the size of the dataset. As with the preceding steps it can be difficult to predict the optimal approach and different options should be tried.
A good rule of thumb is to try ensemble models first, and to train and evaluate more than one model!
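A minimal sketch of that advice, again on the digits dataset purely for illustration: cross-validate a couple of candidate models and compare their scores before committing to one.
In [ ]:
# An assumed example comparing two candidate models with cross-validation;
# the dataset and models are arbitrary choices for illustration.
from sklearn.datasets import load_digits
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

digits = load_digits()
candidates = [('random forest', RandomForestClassifier(n_estimators=100, random_state=0)),
              ('logistic regression', LogisticRegression())]
for name, model in candidates:
    scores = cross_val_score(model, digits.data, digits.target, cv=5)
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))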
An additional step that can frequently boost performance is combining multiple different models, although doing so can make the result more difficult to interpret.
The individual models may simply use different algorithms, or may additionally differ in the features they use.
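One simple way to combine models, sketched below with arbitrary choices of dataset and classifiers, is to average their predicted class probabilities; stacking and voting are common alternatives.
In [ ]:
# A hedged sketch of model combination by averaging predicted probabilities.
# The dataset and the two classifiers are arbitrary stand-ins.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
lr = LogisticRegression().fit(X_train, y_train)

# Average the class probabilities from the two models and take the most likely class
combined = (rf.predict_proba(X_test) + lr.predict_proba(X_test)) / 2
predictions = combined.argmax(axis=1)
print('combined accuracy:', (predictions == y_test).mean())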
Finally we need to be able to utilize the model we have generated. This typically takes the form of receiving a new sample and then performing all the steps used in training to make a prediction.
If we are generating a model only to understand the structure of the data we already have, the 'new samples' may simply be the test dataset we set aside at the beginning.
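A minimal sketch of this final step, assuming we simply pickle the fitted pipeline and load it wherever predictions are needed (the dataset and pipeline steps are arbitrary; joblib is another common choice for persistence):
In [ ]:
# Persist a fitted pipeline and reuse it on a new sample (assumed example).
import pickle
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

digits = load_digits()
pipeline = Pipeline([('scale', StandardScaler()),
                     ('logistic', LogisticRegression())])
pipeline.fit(digits.data, digits.target)

# Save the fitted pipeline so the serving code can load it later
with open('model.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

new_sample = digits.data[:1]  # stand-in for a sample arriving after training
# All the preprocessing steps used in training are applied automatically
print(model.predict(new_sample))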
At each of the major steps we need to take there are a variety of options. It is often not clear which approach will give us the best performance and so we should try several.
Being able to rapidly try different options helps us get to the best solution faster. It is tempting to make a change to our code, execute it, look at the performance, and then decide between sticking with the change or going back to the original version. It is very easy to lose track of which changes have been made and which version actually performed best.
Using version control software is very useful for avoiding these issues.
We have previously looked at approaches for choosing the optimal parameters for an algorithm. We also have choices earlier in the workflow that we should systematically explore: which features should we use, and how should they be normalized?
Scikit-learn includes functionality for easily exploring the impact of different parameters, not only in the machine learning algorithm we choose but at every stage of our solution.
In [4]:
# http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#example-plot-digits-pipe-py
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
###############################################################################
# Plot the PCA spectrum
pca.fit(X_digits)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
###############################################################################
# Prediction
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
# Parameters of pipelines can be set using '__' separated parameter names:
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)
plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
plt.legend(prop=dict(size=12))
plt.show()
print(estimator)
In [6]:
# http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA()
# Maybe some original features were good, too?
selection = SelectKBest()
# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
svm = SVC(kernel="linear")
# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
In this example we will look at an EEG dataset with the goal of detecting the onset of epileptic seizures.
This machine learning task and its associated dataset were put together by UPenn and the Mayo Clinic and hosted on Kaggle as a competition, with financial backing from NINDS and the American Epilepsy Society.
There are two parts to the data:
There are a number of challenges in approaching this task.
Thankfully, with this being a Kaggle competition, the data has already been collected for us. It has been divided into multiple files, each containing 1 second of data. There are training examples for each of the subjects, and samples have already been chosen for the test set.
We will still want to perform validation but we can use Kaggle to measure our final performance on the test set.
The data available are the potentials observed by the different electrodes over the one-second window used for each sample. These values are unlikely to be predictive in this raw format, so we must use them to create the features our model will use.
This is the first real decision we need to make. What might our features be?
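As a preview of the approach taken later in this notebook, the sketch below computes a few candidate features from a single fake channel: the variance, the median, and the summed FFT magnitude in 10 Hz bands. The 400 Hz sampling rate and the random data are assumptions made purely for illustration.
In [ ]:
# Candidate features for a 1 second window of raw potentials (illustrative sketch only;
# the window here is random noise standing in for one channel sampled at 400 Hz).
import numpy as np

np.random.seed(0)
window = np.random.randn(400)

features = {'variance': window.var(),
            'median': np.median(window)}

# For a 1 second window, FFT bin i corresponds to i Hz, so summing slices of the
# magnitude spectrum gives the signal content in 10 Hz bands up to 60 Hz.
spectrum = np.abs(np.fft.rfft(window))
for low in range(0, 60, 10):
    features['band_%d_%dHz' % (low, low + 10)] = spectrum[low:low + 10].sum()

print(features)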
Applying normalization using standard approaches may be necessary depending on the algorithm we plan to use.
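If normalization is needed, one option (sketched below with an arbitrary dataset and classifier) is to make it the first step of the pipeline, so that the scaling fitted on the training data is automatically applied to every sample passed in later.
In [ ]:
# Normalization as a pipeline step (assumed example with arbitrary data and model).
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
pipeline = Pipeline([('scale', StandardScaler()),  # zero mean, unit variance per feature
                     ('svm', SVC(kernel='linear'))])
pipeline.fit(digits.data, digits.target)
# Score on the data it was fit to - for illustration of the mechanics only
print(pipeline.score(digits.data, digits.target))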
The dataset is reasonably large and we are likely to have a limited number of features. Our first solution might not need to select features prior to building the model.
Depending on how creative we get and how large the number of features grows, later solutions may benefit from feature selection, either to prevent overfitting or to speed up evaluation.
Evaluating different algorithms as we build a solution is probably going to be the best approach. While we are developing features, an ensemble or boosting algorithm will be the simplest option.
There are multiple levels at which we might develop machine learning models for this task. We could develop models for individual channels or for all the channels combined. We could develop a single model for all subjects or separate models for each subject.
All of these different models could then be combined.
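As a very rough sketch of what combining these models could look like, suppose we already had a predicted seizure probability from a separate model for each channel; the simplest combination is to average them (the arrays below are random stand-ins, not real predictions).
In [ ]:
# Combining per-channel model outputs by averaging (random stand-in values).
import numpy as np

np.random.seed(0)
n_test_samples, n_channels = 100, 16  # assumed sizes for illustration
per_channel_probabilities = np.random.rand(n_channels, n_test_samples)

# Mean probability across the per-channel models gives one prediction per test sample
combined = per_channel_probabilities.mean(axis=0)
print(combined.shape)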
For this task we know we will have a test dataset and our performance there will be evaluated.
NOTE: The dataset for this example is 43 GB uncompressed, so I have not included it in the course material. If you want to run through the example, it can be downloaded from the link above.
A GitHub repository with all the code for this example is here.
In [ ]:
import numpy as np
from sklearn.base import TransformerMixin


class ModelTransformer(TransformerMixin):
    """Wrap a classifier model so that it can be used in a pipeline"""
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return self.model.predict_proba(X)

    def predict_proba(self, X, **transform_params):
        return self.transform(X, **transform_params)


class VarTransformer(TransformerMixin):
    """Compute the variance"""
    def transform(self, X, **transform_params):
        var = X.var(axis=1)
        return var.reshape((var.shape[0], 1))

    def fit(self, X, y=None, **fit_params):
        return self


class MedianTransformer(TransformerMixin):
    """Compute the median"""
    def transform(self, X, **transform_params):
        median = np.median(X, axis=1)
        return median.reshape((median.shape[0], 1))

    def fit(self, X, y=None, **fit_params):
        return self


class ChannelExtractor(TransformerMixin):
    """Extract a single channel for downstream processing"""
    def __init__(self, channel):
        self.channel = channel

    def transform(self, X, **transformer_params):
        return X[:, :, self.channel]

    def fit(self, X, y=None, **fit_params):
        return self


class FFTTransformer(TransformerMixin):
    """Convert to the frequency domain and then sum over bins"""
    def transform(self, X, **transformer_params):
        fft = np.fft.rfft(X, axis=1)
        fft = np.abs(fft)
        fft = np.cumsum(fft, axis=1)
        bin_size = 10
        max_freq = 60
        # Differencing the cumulative sums gives the total magnitude within each bin
        return np.column_stack([fft[:, i] - fft[:, i - bin_size]
                                for i in range(bin_size, max_freq, bin_size)])

    def fit(self, X, y=None, **fit_params):
        return self
In [ ]:
import numpy as np
import os
import pickle
from sklearn.cross_validation import cross_val_score, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier

import get_traces
import transformers as trans


def build_pipeline(X):
    """Helper function to build the pipeline of feature transformations.
    We do the same thing to each channel so rather than manually copying changes
    for all channels this is automatically generated"""
    channels = X.shape[2]
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('select_%d_pipeline' % i,
             Pipeline([('select_%d' % i, trans.ChannelExtractor(i)),
                       ('channel features', FeatureUnion([
                           ('var', trans.VarTransformer()),
                           ('median', trans.MedianTransformer()),
                           ('fft', trans.FFTTransformer()),
                       ])),
                       ])
             ) for i in range(channels)])),
        ('classifier', trans.ModelTransformer(RandomForestClassifier(
            n_estimators=500,
            max_depth=None,
            min_samples_split=1,
            random_state=0))),
    ])
    return pipeline


def get_transformed_data(patient, func=get_traces.get_training_traces):
    """Load in all the data"""
    X = []
    channels = get_traces.get_num_traces(patient)
    # Reading in 43 GB of data . . .
    for i in range(channels):
        x, y = func(patient, i)
        X.append(x)
    return (np.dstack(X), y)


all_labels = []
all_predictions = np.array([])

folders = [i for i in os.listdir(get_traces.directory) if i[0] != '.']
folders.sort()

for folder in folders:
    print('Starting %s' % folder)

    print('getting data')
    X, y = get_transformed_data(folder)
    print(X.shape)

    print('stratifiedshufflesplit')
    cv = StratifiedShuffleSplit(y,
                                n_iter=5,
                                test_size=0.2,
                                random_state=0)

    print('cross_val_score')
    pipeline = build_pipeline(X)
    # Putting this in a list is unnecessary for just one pipeline - use to compare multiple pipelines
    scores = [
        cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
    ]

    print('displaying results')
    for score, label in zip(scores, ['pipeline', ]):
        print("AUC: {:.2%} (+/- {:.2%}), {:}".format(score.mean(),
                                                     score.std(), label))

    clf = pipeline
    print('Fitting full model')
    clf.fit(X, y)

    print('Getting test data')
    testing_data, files = get_transformed_data(folder,
                                               get_traces.get_testing_traces)

    print('Generating predictions')
    predictions = clf.predict_proba(testing_data)
    print(predictions.shape, len(files))

    with open('%s_randomforest_predictions.pkl' % folder, 'wb') as f:
        pickle.dump((files, predictions[:, 1]), f)
In [19]:
from sklearn import grid_search
from sklearn import datasets
from sklearn import neighbors
from sklearn import metrics
from sklearn.pipeline import Pipeline
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
np.random.seed(0)
split = np.random.random(y.shape) > 0.3
X_train = X[split]
y_train = y[split]
X_test = X[np.logical_not(split)]
y_test = y[np.logical_not(split)]
print(X_train.shape, y_train.shape)
knn = neighbors.KNeighborsRegressor()
pipe = Pipeline(steps=[('knn', knn)])
parameters = [1,2,3,4,5,6,7,8,9,10]
grid = grid_search.GridSearchCV(pipe, dict(knn__n_neighbors=parameters))
grid.fit(X_train, y_train)
plt.plot(y_test, grid.predict(X_test), 'k.')
plt.show()
print(metrics.mean_squared_error(y_test, grid.predict(X_test)))
grid.get_params()
Out[19]:
In [22]:
best=grid.best_estimator_.named_steps['knn'].n_neighbors
print('optimal number of neighbors:', best)