scikit-learn is the most widely used Python library for machine learning (though you might want to look into TensorFlow or Theano if you are interested in deep learning). It includes both supervised and unsupervised learning algorithms, clustering, dimensionality reduction (e.g. PCA), and several pre-processing utilities and performance metrics.
Despite the great number of algorithms implemented in the library, the interface is similar for all of them, making it easy to use them and plug them together. Advanced usage includes pipelines and advanced cross-validation strategies.
Key components of scikit-learn are the following:
- Imputer, which imputes missing values
- MinMaxScaler, which rescales each feature of the given dataset to a fixed range
- LinearRegression, which, as you might have guessed, fits a linear regression

All of these components (actually classes) have consistent characteristics:
- a fit method to train the model or learn some parameters
- a transform (or fit_transform) method to transform the data
- a predict method to carry out the predictions
- a score method to assess the goodness of the predictions
- learned attributes, whose names by convention end with an underscore (e.g. Ridge's coef_ attribute)

This consistency allows great flexibility: different machine-learning methods and algorithms can be swapped easily, and pipelines can be easily built, since each class can feed its result into the next.
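As a quick illustration of this shared interface, here is a minimal sketch on a tiny invented dataset (all values and variable names below are just for illustration):
In [ ]:
# Minimal sketch of the common fit / transform / predict pattern (toy data).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])     # shape (n_samples, n_features)
y = np.array([2.0, 4.1, 5.9, 8.0])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)             # fit + transform in one call
reg = LinearRegression().fit(X_scaled, y)      # fit learns the parameters
print(reg.predict(scaler.transform([[5.0]])))  # predict on new, scaled data
print(reg.coef_)                               # learned attributes end with "_"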
Example: a predictor of breast cancer malignancy
Datasets in scikit-learn follow the scheme (n_samples, n_features), using numpy.ndarrays. If you want greater flexibility, it is advisable to use pandas dataframes.
This dataset is available inside scikit-learn as a test dataset. Playing with it will provide a good introduction to the main concepts of this package.
In [ ]:
# Plotting imports
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
In [ ]:
import numpy as np
import pandas as pd
from sklearn import datasets
In [ ]:
d = datasets.load_breast_cancer()
In [ ]:
print(d['DESCR'])
In [ ]:
d.keys()
In [ ]:
d['data']
In [ ]:
d['target']
In [ ]:
d['target_names']
In [ ]:
df = pd.DataFrame(d['data'],
columns=d['feature_names'])
In [ ]:
target = pd.Series(d['target'],
name='target')
In [ ]:
df.info()
All the samples seem to have all the features, so we will not need to impute or drop any of them.
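Had there been missing values, imputation would have looked roughly like the sketch below (in recent scikit-learn versions the imputer lives in sklearn.impute as SimpleImputer; this cell is only illustrative, since our dataset is complete):
In [ ]:
# Illustrative only: impute missing values with the column median.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df_imputed = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns)
df_imputed.info()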
Even though it is not strictly part of scikit-learn, getting a broad overview of the input dataset is a mandatory step in any machine learning project. We can use either pandas' plotting capabilities or matplotlib/seaborn.
In [ ]:
df.hist(bins=50, figsize=(20,15));
Our features are not on the same scale, despite all following similar distributions. This could be a problem for some algorithms. We will therefore need to perform feature scaling, of which there are two flavours:
- rescaling each feature between a min and a max value (i.e. through the MinMaxScaler transformer; see the short sketch below)
- standardisation, i.e. removing the mean and scaling to unit variance (through the StandardScaler transformer, used later on)

Another routine operation is to verify the relationship between features, so that highly correlated ones are either combined or selected. Again, we can get a quick sense using either seaborn or pandas.
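As a quick aside, this toy cell sketches how the two scaling flavours mentioned above behave (illustrative values only):
In [ ]:
# Toy illustration of the two scaling flavours.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
toy = np.array([[1.0], [5.0], [10.0]])
print(MinMaxScaler().fit_transform(toy).ravel())    # rescaled into [0, 1]
print(StandardScaler().fit_transform(toy).ravel())  # zero mean, unit variance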
In [ ]:
sns.pairplot(data=df[df.columns[:5]], height=2);
In [ ]:
from pandas.plotting import scatter_matrix
scatter_matrix(df[df.columns[:5]], figsize=(10, 10));
We can leverage seaborn's clustermap method to visualize the correlation matrix of all features.
In [ ]:
mclust = sns.clustermap(df.corr(),
figsize=(10, 10),
cmap='RdBu')
mclust.ax_heatmap.set_yticklabels(mclust.ax_heatmap.get_yticklabels(),
rotation=0);
We can conclude that it might be a good idea to either drop some features or create new ones (feature engineering) to improve our predictor. For instance, we could highlight those features that have a correlation above 0.7 and randomly pick a representative.
In [ ]:
feat_corr = df.corr()
feat_corr.stack().head(5)
In [ ]:
feat_corr = df.corr()
# ignore the diagonal, obviously
np.fill_diagonal(feat_corr.values, np.nan)
feat_corr = feat_corr.stack()
In [ ]:
feat_corr[feat_corr > 0.7].head(5)
In [ ]:
# can you think of a smarter way to perform this operation?
high_corr = feat_corr[feat_corr > 0.7]
discarded = set()
saved = set()
for feat1 in {x[0] for x in high_corr.index}:
    if feat1 in discarded:
        continue
    saved.add(feat1)
    for feat2 in high_corr.loc[feat1].index:
        discarded.add(feat2)
In [ ]:
saved
In [ ]:
discarded
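As for the comment above asking for a smarter way: one possible vectorised alternative (a sketch, which may keep a different but equally valid set of representatives) is to look only at the upper triangle of the absolute correlation matrix:
In [ ]:
# Sketch of an alternative: drop any column whose absolute correlation with an
# earlier column is above 0.7 (the result may differ from the loop above).
corr_abs = df.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
{col for col in upper.columns if (upper[col] > 0.7).any()}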
In [ ]:
sns.pairplot(data=df.loc[:, sorted(set(df.columns) - discarded)], height=2);
In [ ]:
df = df.loc[:, sorted(set(df.columns) - discarded)]
To avoid the risk of confirmation bias, and most importantly overfitting, it is also good practice to split our dataset early into two: a training set and a test set.
You should avoid at all costs the temptation to use the test set until the very end of the project, and only to choose which model gives the best trade-offs.
In [ ]:
from sklearn import model_selection
In [ ]:
df_train, df_test = model_selection.train_test_split(df,
test_size=0.2)
df_train.head(5)
In [ ]:
target_train, target_test = target.iloc[df_train.index], target.iloc[df_test.index]
target_train.head(5)
In [ ]:
target_train[target_train == 0].shape[0] / target_train[target_train == 1].shape[0]
In [ ]:
target_test[target_test == 0].shape[0] / target_test[target_test == 1].shape[0]
We might have a problem here: the proportion of benign and malignant tumours is uneven between the train and test sets! This might introduce a bias in our predictor, which would be reflected in poorer performance on our final predictions.
In [ ]:
cv = model_selection.StratifiedShuffleSplit(n_splits=1,
test_size=0.2,
random_state=42)
The StratifiedShuffleSplit class is a more advanced way to generate a test set, such that the proportions of each target class are roughly the same in the train and test sets.
We are defining a random_state so that the random number generator seed is the same each time we run this notebook. This ensures that we will always get the same train/test sets.
In [ ]:
cv.split(df, target)
In [ ]:
train_idx, test_idx = next(cv.split(df, target))
train_idx
In [ ]:
df_train, target_train = df.iloc[train_idx], target.iloc[train_idx]
df_test, target_test = df.iloc[test_idx], target.iloc[test_idx]
df_train.head(5)
In [ ]:
target_train[target_train == 0].shape[0] / target_train[target_train == 1].shape[0]
In [ ]:
target_test[target_test == 0].shape[0] / target_test[target_test == 1].shape[0]
We are now going to focus exclusively on the train set. If we end up applying any transformation to the data, we'll make sure to apply the same transformation to the test set, using the parameters learned on the train set.
In [ ]:
from sklearn import preprocessing
In [ ]:
std_scaler = preprocessing.StandardScaler()
std_scaler.fit(df_train)
As you can see, whenever we call the fit method, the returned value is the same instance of the transformer we used. To make things easier, we can use the fit_transform method, which fits the scaler on the provided data and then applies the transformation to that same input data.
In [ ]:
std_scaler.fit_transform(df_train)
In [ ]:
# let's bring it back to a dataframe (keeping the original index so the rows
# stay aligned with target_train)
df_scaled = pd.DataFrame(std_scaler.fit_transform(df_train),
                         columns=df_train.columns,
                         index=df_train.index)
In [ ]:
df_scaled.hist(bins=50, figsize=(20,15));
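For later reference, this is what re-using the parameters learned on the train set would look like on the test set: we call transform (not fit_transform) with the already fitted scaler. We will not actually use this result until the very end.
In [ ]:
# Sketch: scale the test set with the statistics learned on the train set.
df_test_scaled = pd.DataFrame(std_scaler.transform(df_test),
                              columns=df_test.columns,
                              index=df_test.index)
df_test_scaled.head()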
Based on the cheat-sheet above, we are facing a classification problem, meaning that we want a predictor able to take our features as input and output a class label. This particular dataset is easy in this sense, because we are going to work with a binary classifier. The cheat-sheet seems to indicate that, given the limited amount of data points, the best classifier might be a support vector machine with a linear kernel (LinearSVC).
The above scheme highlights the strengths of the different classifiers available, which is also a powerful reminder that it's always a good idea to visualize your data: different structures in the data will dictate different optimal strategies and therefore different algorithms.
In [ ]:
from sklearn import svm
In [ ]:
svm.LinearSVC()
In [ ]:
clf = svm.LinearSVC()
clf = clf.fit(df_scaled, target_train)
This time we are passing both the features and the labels to the fit method. The predict method will apply the learned decision boundaries to the input features to produce the predicted labels.
In [ ]:
clf.predict(df_scaled)
In [ ]:
predict_train = pd.Series(clf.predict(df_scaled),
index=target_train.index,
name='prediction')
In [ ]:
combined = target_train.to_frame().join(predict_train.to_frame())
combined.head(10)
In [ ]:
combined[combined['target'] == combined['prediction']].shape[0] / combined.shape[0]
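The fraction computed above is simply the accuracy, which scikit-learn can also compute directly:
In [ ]:
# Same value as the manual computation above.
from sklearn import metrics
metrics.accuracy_score(combined['target'], combined['prediction'])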
Let's first compute some more evaluation metrics, such as a ROC curve and a precision-recall curve.
In [ ]:
from sklearn import metrics
In [ ]:
plt.figure(figsize=(6, 3))
plt.subplot(121)
fpr, tpr, thresholds = metrics.roc_curve(combined['target'],
combined['prediction'])
plt.plot(fpr, tpr,
'r-')
plt.plot([0, 1],
[0, 1],
'--',
color='grey')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.subplot(122)
prec, rec, thresholds = metrics.precision_recall_curve(combined['target'],
combined['prediction'])
plt.plot(rec, prec,
'r-')
plt.xlabel('recall')
plt.ylabel('precision')
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.tight_layout()
metrics.roc_auc_score(combined['target'],
combined['prediction'])
98% accuracy and 0.98 ROC AUC! Well done...
...but what if we are actually overfitting the input data? This might be especially true when many features are used, leading to the curse of dimensionality. We are also using the default parameters of our classifier, which might contribute to overfitting as well. We might therefore need to use cross-validation on the train set, further splitting it into a train and a validation set. Luckily scikit-learn offers a utility function to do this easily.
In [ ]:
model_selection.cross_val_score(clf, df_scaled, target_train,
cv=10,
scoring=metrics.make_scorer(metrics.roc_auc_score))
So it seems that we are not actually overfitting our data, at least judging from this simple analysis.
But what about picking the best hyperparameters? We could keep it simple and test a range of values for the C and loss parameters, using another useful class of scikit-learn: GridSearchCV, which will explore all combinations of the parameters, using a scoring function to assess their impact on the predictions.
In [ ]:
param_grid = {'C': np.linspace(0.01, 10),
'loss': ['hinge', 'squared_hinge']}
clf = svm.LinearSVC()
grid_search = model_selection.GridSearchCV(clf,
param_grid=param_grid,
cv=10,
scoring=metrics.make_scorer(metrics.roc_auc_score))
grid_search.fit(df_scaled, target_train)
In [ ]:
grid_search.cv_results_.keys()
In [ ]:
plt.figure(figsize=(7, 3))
for mean_score, params in zip(grid_search.cv_results_["mean_test_score"],
                              grid_search.cv_results_["params"]):
    if params['loss'] == 'hinge':
        plt.plot(params['C'],
                 mean_score,
                 'bo')
    else:
        plt.plot(params['C'],
                 mean_score,
                 'ro')
plt.xlabel('C')
plt.ylabel('ROC AUC');
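GridSearchCV also stores the best parameter combination and its cross-validated score directly:
In [ ]:
# Best combination found by the grid search and its mean cross-validated score.
grid_search.best_params_, grid_search.best_score_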
It seems that the combination of loss = hinge and C = 2 produces the best results. There are other hyperparameters that we could tune and better metrics we could use, but we'll stop here and test our trained model on the test dataset.
We'll do it in one go using another very useful feature of scikit-learn: pipelines. Pipelines allow the user to chain several transformers (every step of the pipeline but the last has to implement the fit_transform method) with a final estimator. This way we can keep the process both modular and concise, and we can pass to the final estimator (or to any extra transformer we might have) the hyperparameters we selected above.
In [ ]:
from sklearn import pipeline
In [ ]:
breast_pipeline = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
('classifier', svm.LinearSVC(C=2., loss='hinge'))])
In [ ]:
breast_pipeline = breast_pipeline.fit(df_train, target_train)
breast_pipeline.predict(df_test)
In [ ]:
predict_test = pd.Series(breast_pipeline.predict(df_test),
index=target_test.index,
name='prediction')
combined = target_test.to_frame().join(predict_test.to_frame())
combined.head(10)
In [ ]:
plt.figure(figsize=(6, 3))
plt.subplot(121)
fpr, tpr, thresholds = metrics.roc_curve(combined['target'],
combined['prediction'])
plt.plot(fpr, tpr,
'r-')
plt.plot([0, 1],
[0, 1],
'--',
color='grey')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.subplot(122)
prec, rec, thresholds = metrics.precision_recall_curve(combined['target'],
combined['prediction'])
plt.plot(rec, prec,
'r-')
plt.xlabel('recall')
plt.ylabel('precision')
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.tight_layout()
metrics.roc_auc_score(combined['target'],
combined['prediction'])
Our model doesn't perform quite as well as on the training set, which is a good sign that we have not produced an overfitted model. Our predictions are still very good!
K-means clustering example
In [ ]:
iris = datasets.load_iris()
In [ ]:
# rows are observations, columns are features
iris.data
In [ ]:
# true label for each observation
iris.target
In [ ]:
# label names
iris.target_names
In [ ]:
# the simplest preprocessing is to standardize the data
std_scaler = preprocessing.StandardScaler()
iris.data = std_scaler.fit_transform(iris.data)
In [ ]:
from sklearn.cluster import KMeans
In [ ]:
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(iris.data)
In [ ]:
kmeans.labels_
In [ ]:
# once the model has been fitted, we can take a new observation and predict
# which cluster it belongs to; since the model was fitted on standardised
# data, the new observation has to be scaled the same way
kmeans.predict(std_scaler.transform([[5.8, 2.7, 4.0, 1.25]]))
We are going to plot the dataset (using only two features at a time for simplicity), together with the true labels (glyphs) and the clusters (colours).
In [ ]:
plt.figure(figsize=(15, 7))
plt.subplot(121)
for label, glyph in zip(set(iris.target), ('o', 'D', '^')):
    for cluster, color in zip(set(kmeans.labels_), ('b', 'r', 'g')):
        plt.plot(iris.data[(iris.target == label) & (kmeans.labels_ == cluster)][:, 0],
                 iris.data[(iris.target == label) & (kmeans.labels_ == cluster)][:, 1],
                 marker=glyph,
                 linestyle='',
                 color=color,
                 label='{0} - {1}'.format(iris.target_names[label],
                                          cluster))
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.subplot(122)
for label, glyph in zip(set(iris.target), ('o', 'D', '^')):
    for cluster, color in zip(set(kmeans.labels_), ('b', 'r', 'g')):
        plt.plot(iris.data[(iris.target == label) & (kmeans.labels_ == cluster)][:, 2],
                 iris.data[(iris.target == label) & (kmeans.labels_ == cluster)][:, 3],
                 marker=glyph,
                 linestyle='',
                 color=color,
                 label='{0} - {1}'.format(iris.target_names[label],
                                          cluster))
plt.xlabel('feature 2')
plt.ylabel('feature 3')
plt.legend(loc=(1, 0));
The clustering separates the three categories with decent discriminative power; we can use some of the metrics implemented in scikit-learn to be more precise. We are going to use the homogeneity score to measure how "pure" each cluster is.
In [ ]:
metrics.homogeneity_score(iris.target, kmeans.labels_)
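Homogeneity is often read alongside other clustering scores; purely as an optional extra check, here are two of them (also available in sklearn.metrics):
In [ ]:
# Optional complementary metrics: completeness and the adjusted Rand index.
(metrics.completeness_score(iris.target, kmeans.labels_),
 metrics.adjusted_rand_score(iris.target, kmeans.labels_))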
Linear model classifier example
In a real-world scenario we would need to divide our dataset into a training and a test set; for that purpose the sklearn.model_selection module (already used above) should be used, as sketched below.
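A minimal sketch of what that split would look like (not used in the cells below, where for simplicity the model is fitted and evaluated on the full dataset):
In [ ]:
# Sketch only: stratified train/test split of the iris data.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target,
    test_size=0.2,
    random_state=42,
    stratify=iris.target)
X_train.shape, X_test.shape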
In [ ]:
from sklearn.linear_model import RidgeClassifier
In [ ]:
ridge = RidgeClassifier(alpha=1.0)
ridge.fit(iris.data, iris.target)
In [ ]:
predictions = ridge.predict(iris.data)
In [ ]:
metrics.f1_score(iris.target, predictions, average=None)
With this alpha parameter we are probably not overfitting the model, as we are not even predicting the training data perfectly!
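If we wanted to check how sensitive the result is to alpha, a quick and purely illustrative cross-validated scan (over arbitrary values) could look like this:
In [ ]:
# Illustrative scan over a few arbitrary alpha values.
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    scores = model_selection.cross_val_score(RidgeClassifier(alpha=alpha),
                                             iris.data, iris.target,
                                             cv=5)
    print(alpha, round(scores.mean(), 3))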
In [ ]:
plt.figure(figsize=(15, 7))
plt.subplot(121)
plt.plot(iris.data[iris.target == predictions][:, 0],
iris.data[iris.target == predictions][:, 1],
'k.',
label='correct predictions')
plt.plot(iris.data[iris.target != predictions][:, 0],
iris.data[iris.target != predictions][:, 1],
'ro',
label='incorrect predictions')
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.subplot(122)
plt.plot(iris.data[iris.target == predictions][:, 2],
iris.data[iris.target == predictions][:, 3],
'k.',
label='correct predictions')
plt.plot(iris.data[iris.target != predictions][:, 2],
iris.data[iris.target != predictions][:, 3],
'ro',
label='incorrect predictions')
plt.xlabel('feature 2')
plt.ylabel('feature 3')
plt.legend(loc=(1, 0));