Week 8 - The Machine Learning Workflow


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

In previous weeks we have covered preprocessing our data, dimensionality reduction, clustering, regression and classification. This week we will be pulling these processes together into a complete project.

Most projects can be thought of as a series of discrete steps:

  • Data acquisition / loading
  • Feature creation
  • Feature normalization
  • Feature selection
  • Machine learning model
  • Combining multiple models
  • Reporting / Utilization

Data acquisition

If we are fortunate, our data may already be in a usable format, but more often extensive work is needed to get it into shape - for example by combining and cleaning tables, as sketched after the list below.

  • What type of data do we have?
  • Do we need to combine data from multiple sources?
  • Is our data structured in such a way that it can be used directly?
  • Does our data need to be cleaned?
  • Does our data have issues with confounding?
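
As a minimal sketch of what combining and cleaning might look like, the cell below merges two made-up pandas dataframes on a shared identifier and drops incomplete rows (the data and column names are hypothetical):

In [ ]:
import numpy as np
import pandas as pd

# Hypothetical example: two sources sharing a sample identifier
clinical = pd.DataFrame({'sample_id': [1, 2, 3, 4],
                         'age': [54, 61, np.nan, 47]})
lab = pd.DataFrame({'sample_id': [1, 2, 4],
                    'glucose': [5.1, 6.3, 5.8]})

# Combine the two sources on the shared key
combined = pd.merge(clinical, lab, on='sample_id', how='inner')

# Basic cleaning: drop rows with missing values
cleaned = combined.dropna()
print(cleaned)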

Feature creation

  • Can our data be used directly?
  • What features have been used previously for similar tasks?

Feature normalization

  • Z-score normalization?
  • Min-max normalization?
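
Both options are available in scikit-learn. A minimal sketch on a small made-up array:

In [ ]:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_example = np.array([[1.0, 200.0],
                      [2.0, 300.0],
                      [3.0, 400.0]])

# Z-score normalization: zero mean and unit variance for each feature
print(StandardScaler().fit_transform(X_example))

# Min-max normalization: rescale each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X_example))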

Feature selection

The number of features we have compared with our sample size will determine whether feature selection is needed. We may choose in the first instance not to use feature selection. If we observe that our performance on the validation dataset is substantially worse than on the training dataset, it is likely our model is overfitting and would benefit from limiting the number of features.

Even if the performance is comparable, we may still consider using dimensionality reduction or feature selection.
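
One simple option is univariate selection with SelectKBest. The sketch below keeps the two features most strongly related to the target in the iris dataset (the choice of dataset and k here is purely illustrative):

In [ ]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Keep the k features with the strongest univariate relationship to the target
selector = SelectKBest(f_classif, k=2)
X_selected = selector.fit_transform(iris.data, iris.target)
print(iris.data.shape, '->', X_selected.shape)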

Machine learning model

Which algorithm to use will depend on the type of task and the size of the dataset. As with the preceding steps, it can be difficult to predict the optimal approach, so different options should be tried.
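
As a sketch of trying different options, the cell below compares a few candidate classifiers on the digits data with cross-validation (the particular algorithms and settings are just examples):

In [ ]:
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

digits = load_digits()
candidates = [('logistic regression', LogisticRegression()),
              ('random forest', RandomForestClassifier(n_estimators=100)),
              ('linear SVM', SVC(kernel='linear'))]

# Compare mean cross-validated accuracy for each candidate algorithm
for name, model in candidates:
    scores = cross_val_score(model, digits.data, digits.target, cv=3)
    print(name, scores.mean())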

Combining multiple models

An additional step that can frequently boost performance is combining multiple different models. It is important to remember that, although there are advantages, combining different models can make the result more difficult to interpret.

The models may be generated using different algorithms and/or different features.
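
A simple way to combine models in scikit-learn is a VotingClassifier, which averages the predictions of its member models. A minimal sketch (the member models and data are illustrative):

In [ ]:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

digits = load_digits()

# 'soft' voting averages the predicted class probabilities of the member models
ensemble = VotingClassifier(estimators=[('logistic', LogisticRegression()),
                                        ('forest', RandomForestClassifier(n_estimators=100))],
                            voting='soft')
print(cross_val_score(ensemble, digits.data, digits.target, cv=3).mean())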

Reporting / Utilization

Finally we need to be able to utilize the model we have generated. This typically takes the form of receiving a new sample and then performing all the steps used in training to make a prediction.

If we are generating a model only to understand the structure of the data we already have, then the new samples may simply be the test dataset we set aside at the beginning.
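
A minimal sketch of this idea: once the steps are packaged in a fitted pipeline, a new sample passes through exactly the same transformations used in training before a prediction is made (the pipeline below is illustrative):

In [ ]:
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()

# Train on all but the last sample
pipe = Pipeline([('scale', StandardScaler()),
                 ('logistic', LogisticRegression())])
pipe.fit(digits.data[:-1], digits.target[:-1])

# Treat the held-back sample as a "new" observation arriving after training
new_sample = digits.data[-1:]
print(pipe.predict(new_sample), digits.target[-1])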

Rapid experimentation

At each of the major steps there are a variety of options. It is often not clear which approach will give us the best performance, so we should try several.

Being able to rapidly try different options helps us get to the best solution faster. It is tempting to make a change to our code, execute it, look at the performance, and then decide whether to keep the change or go back to the original version. It is very easy to:

  • Lose track of which code generated which solution
  • Overwrite a working solution and be unable to repeat it

Using version control software is very useful for avoiding these issues.

I introduced these resources in week 2: the most popular version control software at the moment is Git, and there are a number of good tutorials available online.

Optimizing the entire workflow

We have previously looked at approaches for choosing the optimal parameters for an algorithm. We also have choices earlier in the workflow that we should systematically explore - what features should we use, how should they be normalized, etc.

There are multiple ways we can change our code to explore different approaches:

  • Commenting out sections
  • Wrapping functions around steps
  • Scikit-learn pipelines
  • Experimentation packages such as sacred

We will try several approaches on the digits dataset using comments and functions before moving on to the more advanced options.


In [2]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

In [3]:
# Randomly assign roughly 70% of the samples to training and the rest to testing
i = y_digits.shape[0]
split = np.random.random(i)
train = split < 0.7
test = split >= 0.7
X_digits_train = X_digits[train]
X_digits_test = X_digits[test]
y_digits_train = y_digits[train]
y_digits_test = y_digits[test]

In [4]:
clf = linear_model.LogisticRegression()
clf.fit(X_digits_train, y_digits_train)
predictions = clf.predict(X_digits_test)
print(accuracy_score(y_digits_test, predictions))


0.958181818182

In [5]:
pca = decomposition.PCA()
pca.fit(X_digits_train)
#pca.transform(X_digits_train)[:,:2].shape

z = 2
clf = linear_model.LogisticRegression()
clf.fit(pca.transform(X_digits_train)[:,:z], y_digits_train)
predictions = clf.predict(pca.transform(X_digits_test)[:,:z])
print(accuracy_score(y_digits_test, predictions))


0.554545454545
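
The same comparison can be wrapped in a function so that each choice becomes an argument rather than a line to comment in or out. This is a rough sketch of the "wrapping functions around steps" idea (note that fitting PCA on the full dataset before cross-validation leaks a little information between folds; the pipelines introduced next handle this more cleanly):

In [ ]:
from sklearn import datasets, decomposition, linear_model
from sklearn.model_selection import cross_val_score


def evaluate(X, y, use_pca=False, n_components=2):
    """Return the mean cross-validated accuracy for one configuration."""
    if use_pca:
        # Fitting PCA on all of X before cross-validation is a small leak;
        # the Pipeline approach below avoids this
        X = decomposition.PCA(n_components=n_components).fit_transform(X)
    clf = linear_model.LogisticRegression()
    return cross_val_score(clf, X, y, cv=3).mean()


digits = datasets.load_digits()
print(evaluate(digits.data, digits.target))
print(evaluate(digits.data, digits.target, use_pca=True, n_components=20))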

In [ ]:

Scikit-learn includes functionality for structuring our code and easily exploring the impact of different parameters, not only in the machine learning algorithm we choose but at every stage of our solution.

Pipeline


In [6]:
# http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#example-plot-digits-pipe-py

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

In [7]:
###############################################################################
# Plot the PCA spectrum
pca.fit(X_digits)


fig, ax = plt.subplots(1,1)
ax.plot(pca.explained_variance_, linewidth=2)
ax.set_xlabel('n_components')
ax.set_ylabel('explained_variance_')


Out[7]:
<matplotlib.text.Text at 0x186d893d320>

In [8]:
###############################################################################
# Prediction

n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 10)
Cs


Out[8]:
array([  1.00000000e-04,   7.74263683e-04,   5.99484250e-03,
         4.64158883e-02,   3.59381366e-01,   2.78255940e+00,
         2.15443469e+01,   1.66810054e+02,   1.29154967e+03,
         1.00000000e+04])

In [9]:
#Parameters of pipelines can be set using ‘__’ separated parameter names:

estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)

print('# components:', estimator.best_estimator_.named_steps['pca'].n_components)
print('C:', estimator.best_estimator_.named_steps['logistic'].C)

print(estimator)


# components: 40
C: 2.78255940221
GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logistic', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'logistic__C': array([  1.00000e-04,   7.74264e-04,   5.99484e-03,   4.64159e-02,
         3.59381e-01,   2.78256e+00,   2.15443e+01,   1.66810e+02,
         1.29155e+03,   1.00000e+04]), 'pca__n_components': [20, 40, 64]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

FeatureUnion


In [10]:
# http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py

# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:

pipeline = Pipeline([("features", combined_features), ("svm", svm)])

# numpy.arange, numpy.linspace, numpy.logspace, and range are all useful in creating options to evaluate
param_grid = dict(features__pca__n_components=[1,2,3],
                  features__univ_select__k=[1,2,3],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)
print('PCA components:', grid_search.best_estimator_.named_steps['features'].get_params()['pca'].n_components)
print('Original features used:', grid_search.best_estimator_.named_steps['features'].get_params()['univ_select'].k)
print('C:', grid_search.best_estimator_.named_steps['svm'].C)
print(grid_search.best_estimator_)


PCA components: 2
Original features used: 3
C: 1
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('univ_select', SelectKBest(k=3, score_func=<function f_classif at 0x00000186D83F4840>))],
       transfo...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

Text classification

Working through an example.


In [11]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
    categories=['comp.graphics', 'sci.med'], shuffle=True, random_state=0)

print(twenty_train.target_names)


['comp.graphics', 'sci.med']

In [12]:
# Looking at an example
print(twenty_train.data[0])


From: harti@mikro.ee.tu-berlin.de (Stefan Hartmann (Behse))
Subject: Genoa graphics board Drivers FTP site!
Article-I.D.: mailgzrz.1qpf1r$9ti
Organization: TUBerlin/ZRZ
Lines: 29
NNTP-Posting-Host: mikro.ee.tu-berlin.de

Hi,

well I have opened up a FTP site for getting the latest software drivers
for Genoa graphics cards.

Here is how to access it:

ftp 192.109.42.11
login:ftp
password:ftp
cd pub/genoa
ls -l
binary
prompt
hash

(now if you wanna have the latest drivers for the 7900 board)

cd 7000series
mget *

quit

This is the sequence to get the drivers.

If you have any further question, please email me.

Best regards, Stefan Hartmann
email to: harti@mikro.ee.tu-berlin.de


In [13]:
# The first step is converting the text into numerical data we can work with
# We will use a bag-of-words approach - counting the occurrence of words
from sklearn.feature_extraction.text import (CountVectorizer, 
                                             TfidfTransformer)

# Using a pipeline 
pipe = Pipeline([('counts', CountVectorizer()), 
                 ('tfidf', TfidfTransformer())])

output = pipe.fit(twenty_train.data).transform(twenty_train.data)
output


Out[13]:
<1178x24614 sparse matrix of type '<class 'numpy.float64'>'
	with 170800 stored elements in Compressed Sparse Row format>

In [14]:
# Adding a classifier

pipe = Pipeline([('counts', CountVectorizer()), 
                 ('tfidf', TfidfTransformer()),
                 ('classifier', linear_model.LogisticRegression())])

# We can compare different parameters at each stage
param_grid = dict(counts__ngram_range=[(1,1), (1,2)],
                  counts__stop_words=[None, 'english'],
                  classifier__C=[0.1, 1, 10, 30, 100])

X = twenty_train.data[:]
y = twenty_train.target[:]

grid_search = GridSearchCV(pipe, param_grid=param_grid)
grid_search.fit(X, y)


Out[14]:
GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('counts', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        str...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'classifier__C': [0.1, 1, 10, 30, 100], 'counts__stop_words': [None, 'english'], 'counts__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [15]:
# Getting the best parameters

print('# of words:', grid_search.best_estimator_.named_steps['counts'].ngram_range)
print('Stop words:', grid_search.best_estimator_.named_steps['counts'].stop_words)
print('C:', grid_search.best_estimator_.named_steps['classifier'].C)


# of words: (1, 1)
Stop words: english
C: 10

Advanced Pipeline: Seizure Detection

In this example we will look at an EEG dataset with the goal of detecting the onset of epileptic seizures.

This machine learning task and associated dataset were put together by UPenn and the Mayo Clinic and hosted on Kaggle as a competition, with financial backing from NINDS and the American Epilepsy Society.

There are two parts to the data:

  • 16 electrode EEG at 400 Hz from dogs with naturally occurring epilepsy
  • Patient EEG recordings at 500 Hz and 5000 Hz with a variable number of electrodes

There are a number of challenges in approaching this task.

  • Patient / animal specific differences
  • Different number of electrodes
  • Different sampling rates
  • Different electrode positions
  • Lack of existing features
  • . . . and likely several others

Data acquisition

Thankfully, with this being a Kaggle competition, the data has already been collected for us. The data has been divided into multiple files, each containing 1 second of data. There are training examples for each of the subjects, and samples have already been chosen for the test set.

We will still want to perform validation but we can use Kaggle to measure our final performance on the test set.

Feature creation

The data available are the potentials observed by the different electrodes over the one-second window used for each sample. These values are unlikely to be predictive in this format, so we must use them to create the features our model will use.

This is the first real decision we need to make. What might our features be?
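
As one hypothetical starting point, the sketch below computes a few simple per-channel summaries (variance, median, and coarse frequency content) from a simulated one-second, 16-channel window at 400 Hz; the transformer classes later in this example build features of the same kind:

In [ ]:
import numpy as np

# Simulated stand-in for one 1-second sample: 16 channels at 400 Hz
rng = np.random.RandomState(0)
window = rng.randn(16, 400)

variance = window.var(axis=1)                  # per-channel variance
median = np.median(window, axis=1)             # per-channel median
spectrum = np.abs(np.fft.rfft(window, axis=1))
low_freq_power = spectrum[:, :60].sum(axis=1)  # crude low-frequency power per channel

# One flat feature vector per sample
features = np.concatenate([variance, median, low_freq_power])
print(features.shape)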

Feature normalization

Applying normalization using standard approaches may be necessary depending on the algorithm we plan to use.

Feature selection

The dataset is reasonably large and we are likely to have a limited number of features. Our first solution might not need to select features prior to building the model.

Depending on how creative we get and how large our feature set grows, later solutions may benefit from feature selection, either to prevent overfitting or to speed up evaluation.

Machine learning model

Evaluating different algorithms as we build a solution is probably the best approach. While we are developing features, using an ensemble or boosting algorithm will be the simplest option.

Combining multiple models

There are multiple different levels at which we might develop machine learning models for this task. We could develop models for individual channels or all the channels combined. We could develop a single model for all subjects or separate models for each subject.

All of these different models could then be combined together.
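
As a very rough sketch of the combination step, per-channel (or per-subject) models could each produce a seizure probability and those probabilities could then be averaged; the numbers below are simulated placeholders:

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
n_channels, n_samples = 16, 5

# Pretend each channel-level model has produced a seizure probability per sample
channel_probabilities = rng.rand(n_channels, n_samples)

# Simple combination: average the probabilities across the channel-level models
combined = channel_probabilities.mean(axis=0)
print(combined)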

Reporting / Utilization

For this task we know we will have a test dataset and our performance there will be evaluated.

NOTE: The dataset for this example is 43 GB uncompressed, so I have not included it in the course material. If you want to run through the example, it can be downloaded from the link above.

A GitHub repository with all the code for this example is here.

NOTE #2: This example uses object-oriented programming and other relatively advanced approaches.

I do not expect you to follow this example fully at this stage in the course.

To my knowledge, this is the most complex publicly available example of pipelines in scikit-learn. I include this here as an example of how powerful this approach can be. During the competition, this model ranked in the top 15%.


In [16]:
import numpy as np
from sklearn.base import TransformerMixin


class ModelTransformer(TransformerMixin):
    """Wrap a classifier model so that it can be used in a pipeline"""
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return self.model.predict_proba(X)

    def predict_proba(self, X, **transform_params):
        return self.transform(X, **transform_params)


class VarTransformer(TransformerMixin):
    """Compute the variance"""
    def transform(self, X, **transform_params):
        var = X.var(axis=1)
        return var.reshape((var.shape[0],1))

    def fit(self, X, y=None, **fit_params):
        return self


class MedianTransformer(TransformerMixin):
    """Compute the median"""
    def transform(self, X, **transform_params):
        median = np.median(X, axis=1)
        return median.reshape((median.shape[0],1))

    def fit(self, X, y=None, **fit_params):
        return self

class ChannelExtractor(TransformerMixin):
    """Extract a single channel for downstream processing"""
    def __init__(self, channel):
        self.channel = channel

    def transform(self, X, **transformer_params):
        return X[:,:,self.channel]

    def fit(self, X, y=None, **fit_params):
        return self


class FFTTransformer(TransformerMixin):
    """Convert to the frequency domain and then sum over bins"""
    def transform(self, X, **transformer_params):
        fft = np.fft.rfft(X, axis=1)
        fft = np.abs(fft)
        fft = np.cumsum(fft, axis=1)
        bin_size = 10
        max_freq = 60
        return np.column_stack([fft[:,i] - fft[:,i-bin_size] 
            for i in range(bin_size, max_freq, bin_size)])

    def fit(self, X, y=None, **fit_params):
        return self

In [17]:
"""This cell is not expected to run correctly. We don't have all the packages needed.

If you want to run this example download the repository and the source data."""

import numpy as np
import os
import pickle
from sklearn.cross_validation import cross_val_score, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier

import get_traces
import transformers as trans


def build_pipeline(X):
    """Helper function to build the pipeline of feature transformations.
    We do the same thing to each channel so rather than manually copying changes
    for all channels this is automatically generated"""
    channels = X.shape[2]
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('select_%d_pipeline' % i, 
                Pipeline([('select_%d' % i, trans.ChannelExtractor(i)),
                ('channel features', FeatureUnion([
                    ('var', trans.VarTransformer()),
                    ('median', trans.MedianTransformer()),
                    ('fft', trans.FFTTransformer()),
                    ])),
                ])
            ) for i in range(channels)])),
        ('classifier', trans.ModelTransformer(RandomForestClassifier(
            n_estimators=500,
            max_depth=None,
            min_samples_split=1, 
            random_state=0))),
            ])
    return pipeline


def get_transformed_data(patient, func=get_traces.get_training_traces):
    """Load in all the data"""
    X = []
    channels = get_traces.get_num_traces(patient)
    # Reading in 43 Gb of data . . .
    for i in range(channels):
        x, y = func(patient, i)
        X.append(x)
    return (np.dstack(X), y)




all_labels = []
all_predictions = np.array([])
folders = [i for i in os.listdir(get_traces.directory) if i[0] != '.']
folders.sort()
for folder in folders:
    print('Starting %s' % folder)

    print('getting data')
    X, y = get_transformed_data(folder)
    print(X.shape)
    print('stratifiedshufflesplit')
    cv = StratifiedShuffleSplit(y,
        n_iter=5,
        test_size=0.2,
        random_state=0,)
    print('cross_val_score')
    
    pipeline = build_pipeline(X)
    
    # Putting this in a list is unnecessary for just one pipeline - use to compare multiple pipelines
    scores = [
        cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
        ]
    print('displaying results')
    for score, label in zip(scores, ['pipeline',]):
        print("AUC:  {:.2%} (+/- {:.2%}), {:}".format(score.mean(), 
            score.std(), label))
    
    clf = pipeline
    print('Fitting full model')
    clf.fit(X, y)
    print('Getting test data')
    testing_data, files = get_transformed_data(folder, 
            get_traces.get_testing_traces)
    print('Generating predictions')
    predictions = clf.predict_proba(testing_data)
    print(predictions.shape, len(files))
    with open('%s_randomforest_predictions.pkl' % folder, 'wb') as f:
        pickle.dump((files, predictions[:,1]), f)


C:\Users\stree\Anaconda3\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-17-b60e3bf3131e> in <module>()
     10 from sklearn.ensemble import RandomForestClassifier
     11 
---> 12 import get_traces
     13 import transformers as trans
     14 

ImportError: No module named 'get_traces'

Saving a model

The final step is using our trained model. This might be in a research article or as part of an ongoing service. In either case, we want to be able to save a version of our model. We can use the saved model to address reviewers' comments, to move the model to a production server, etc.

Scikit-learn models can be saved using the pickle package or the joblib variant, which is more efficient at handling numpy arrays.


In [18]:
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Description at http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
# Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements
# were obtained for each of n = 442 diabetes patients,
# as well as the response of interest, a quantitative measure of disease progression one year after baseline.

X = diabetes.data     # independent variables
y = diabetes.target   # dependent variable

print(X.shape)
print(y.shape)

import pandas as pd
data = pd.DataFrame(X, columns=['age', 'sex', 'bmi', 'map', 
                                'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu'])
data.info()


(442, 10)
(442,)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
age    442 non-null float64
sex    442 non-null float64
bmi    442 non-null float64
map    442 non-null float64
tc     442 non-null float64
ldl    442 non-null float64
hdl    442 non-null float64
tch    442 non-null float64
ltg    442 non-null float64
glu    442 non-null float64
dtypes: float64(10)
memory usage: 34.6 KB

In [19]:
from sklearn import linear_model

bmi = X[:, 2].reshape(-1, 1)
outcome = y

reg = linear_model.LinearRegression()
reg.fit(bmi, outcome)
predicted_outcome = reg.predict(bmi)

plt.plot(predicted_outcome, outcome, 'k.')
plt.xlabel("Predicted outcome")
plt.ylabel("Clinical outcome")


Out[19]:
<matplotlib.text.Text at 0x186df5f39e8>

In [20]:
print('Directly trained model predictions:', predicted_outcome[:10])

from sklearn.externals import joblib

joblib.dump(reg, 'diabetes_prediction_model.pkl')

reg2 = joblib.load('diabetes_prediction_model.pkl') 
predicted_outcome2 = reg2.predict(bmi)
print('Saved model predictions:', predicted_outcome2[:10])


Directly trained model predictions: [ 210.71003806  103.26219543  194.33703347  141.12476855  117.58857445
  113.4953233   107.35544658  150.33458363  210.71003806  189.22046954]
Saved model predictions: [ 210.71003806  103.26219543  194.33703347  141.12476855  117.58857445
  113.4953233   107.35544658  150.33458363  210.71003806  189.22046954]

Exercises

  1. The diabetes example above currently only uses BMI. Expand it to include the other variables.
  2. Convert the model over to a pipeline format. This pipeline will only have one element at this stage - you will build it up in the next exercises. Do you get the same predictions?
  3. Create a new pipeline applying PCA to the dataset before the classifier. What is the optimal number of dimensions? Try 10-20 options spanning the entire range of 1-64
  4. Build a pipeline using a different algorithm. For example, try k-nearest neighbors or a random forest regressor. What are the optimal parameters for your chosen model?
  5. Looking at both models, is the size of the errors (residuals) randomly distributed across the range of the outcome variable, or is there a pattern? For example, perhaps the models underestimate at high outcome values? These types of patterns are strong indicators that a model can be improved further by creating new features or combining different models.

In [ ]: