In [1]:
%load_ext load_style
%load_style talk.css
from IPython.display import Image
from talktools import website
Broadly speaking, Machine Learning is a field, related to Artificial Intelligence (AI), concerned with developing algorithms that 'learn from data', i.e. automatically adjust their performance from exposure to information encoded in data. This learning is achieved via tunable parameters that are automatically adjusted according to performance criteria.
There are two major classes of ML (three, counting reinforcement learning):
Supervised learning: Algorithms which learn from a training set of labeled examples in order to generalize to the set of all possible inputs.
There are two classes of supervised learning algorithms:
classification: when the label is encoded as a discrete, categorical variable (a class, a label). An example of a classification algorithm is the Support Vector Machine.
regression: when the label is encoded as a continuous variable. An example of a regression algorithm is simple linear regression.
Unsupervised learning: Algorithms which learn from a training set of unlabeled examples, using the features of the inputs to group inputs together according to some statistical criteria.
One can also divide unsupervised learning algorithms into two broad families: clustering (grouping similar samples together, e.g. K-means) and dimensionality reduction (finding a more compact representation of the features, e.g. PCA). A small clustering sketch follows.
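As a tiny illustration of the clustering family, here is a minimal sketch using scikit-learn's KMeans on made-up 2D points (the data and variable names are purely illustrative):
In [ ]:
import numpy as np
from sklearn.cluster import KMeans

# four made-up 2D points forming two obvious groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])

km = KMeans(n_clusters=2).fit(X_toy)
print(km.labels_)  # cluster assignment for each sample, e.g. [0 0 1 1]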
In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import Image, HTML
%matplotlib inline
In [3]:
Image(url='http://scikit-learn.org/stable/_static/ml_map.png', width=900)
Out[3]:
The scikit-learn package is an open-source library that provides a robust set of machine learning algorithms for Python. It is built upon the core Python scientific stack (i.e. NumPy, SciPy, Cython), and has a simple, consistent API, making it useful for a wide range of statistical learning applications.
In [4]:
#website('http://scikit-learn.org/stable/', width=1000)
scikit-learn
Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays or, in some cases, scipy.sparse matrices.
The size of the array is expected to be [n_samples, n_features].
The number of features must be fixed in advance. However, it can be very high dimensional (e.g. millions of features), with most of them being zeros for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy arrays.
In this layout there are $N$ (= n_samples) samples and $D$ (= n_features) features.
Several example datasets are available in the sklearn.datasets module.
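For instance, a quick check of the [n_samples, n_features] convention using the bundled iris dataset:
In [ ]:
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 samples, 4 features
print(iris.target.shape)  # (150,): one label per sample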
scikit-learn interface
A great feature of scikit-learn is its consistent API (Application Programming Interface): once you have learned how one particular algorithm is implemented in scikit-learn, using another algorithm will look very familiar.
All objects within scikit-learn share a uniform common basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions and a transformer interface for converting data.
The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a fit method for learning a model from training data. All supervised and unsupervised learning algorithms (e.g., for classification, regression or clustering) are offered as objects implementing this interface. Machine learning tasks like feature extraction, feature selection or dimensionality reduction are also provided as estimators.
Scikit-learn strives to have a uniform interface across all methods; given a scikit-learn estimator object named model, the following methods are available:
model.fit(): fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
model.predict(): given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
model.predict_proba(): for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
model.score(): for classification or regression problems, most (all?) estimators implement a score method. Scores are between 0 and 1, with a larger score indicating a better fit.
model.transform(): given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
model.fit_transform(): some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
The predictor interface extends the notion of an estimator by adding a predict method that takes an array X_test and produces predictions based on the learned parameters of the estimator. In the case of supervised learning estimators, this method typically returns the predicted labels (for classification) or values (for regression) computed by the model. Some unsupervised learning estimators may also implement the predict interface, such as k-means, where the predicted values are the cluster labels.
Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a transformer interface which defines a transform method. It takes as input some new data X_test and yields as output a transformed version. Preprocessing, feature selection, feature extraction and dimensionality reduction algorithms are all provided as transformers within the library.
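As a quick illustration of the estimator and transformer interfaces together, here is a minimal sketch using StandardScaler (which we will use again on the wine data at the end of this notebook); the demo array is made up:
In [ ]:
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X_demo)                   # estimator interface: learn the mean and std of each column
X_scaled = scaler.transform(X_demo)  # transformer interface: apply them to the data
# shortcut doing both at once: X_scaled = StandardScaler().fit_transform(X_demo)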
What we're going to do during this session is give an example of supervised learning; more specifically, we're going to see how to solve a classification problem in scikit-learn, with a focus on how one evaluates the performance of a model.
We're going to use a dataset that comes with scikit-learn, which consists of representations of hand-written digits (8 x 8 pixel normalized images) with the associated label (the correct digit).
This example is treated in a more comprehensive manner by Olivier Grisel (see his notebooks here)
In [5]:
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.DESCR)
In [6]:
X, y = digits.data, digits.target
print("data shape: %r, target shape: %r" % (X.shape, y.shape))
print("labels: %r" % list(np.unique(y)))
In [7]:
def plot_gallery(data, labels, shape, interpolation='nearest'):
    f, ax = plt.subplots(1, 5, figsize=(16, 5))
    for i in range(data.shape[0]):
        ax[i].imshow(data[i].reshape(shape), interpolation=interpolation, cmap=plt.cm.gray_r)
        ax[i].set_title(labels[i])
        ax[i].set_xticks(())
        ax[i].set_yticks(())
In [8]:
subsample = np.random.permutation(X.shape[0])[:5]
images = X[subsample]
labels = ['True label: %d' % l for l in y[subsample]]
plot_gallery(images, labels, shape=(8, 8))
We import SVC (the Support Vector Classifier class) from scikit-learn's svm module.
In [9]:
from sklearn.svm import SVC
In [10]:
svc = SVC()
In [11]:
svc.fit(X, y)
Out[11]:
In [12]:
svc.score(X,y)
Out[12]:
In [13]:
y_hat = svc.predict(X)
In [14]:
np.alltrue(y_hat == y)
Out[14]:
Here we are making an important methodological mistake: we are using all the available instances to train the model, and then using those same instances to evaluate the model in terms of accuracy. This tells us (almost) nothing about the actual performance of the model in production, just how well it can reproduce the data it has been exposed to ...
A way to work around that is to train the model on a subset of the available instances (the training set), calculate the train score, and test the model (i.e. calculate the test score) on the remaining instances (the test set).
Cross-validation consists of repeating this operation several times using successive splits of the original dataset into training and test sets, and calculating a summary statistic of the train and test scores over the iterations (usually the average).
Several split strategies can be used, e.g. KFold, StratifiedKFold, ShuffleSplit or LeaveOneOut (all available in sklearn.cross_validation).
In [15]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, \
test_size=0.25, random_state=1)
print("train data shape: %r, train target shape: %r"
% (X_train.shape, y_train.shape))
print("test data shape: %r, test target shape: %r"
% (X_test.shape, y_test.shape))
In [16]:
svc = SVC().fit(X_train, y_train)
train_score = svc.score(X_train, y_train)
train_score
Out[16]:
In [17]:
test_score = svc.score(X_test, y_test)
test_score
Out[17]:
Ok, that seems more like a 'normal' result ...
if the test score is not as good as the train score, the model is overfitting
if the train score is not close to 100% accuracy, the model is underfitting
Ideally we want to neither overfit nor underfit: test_score ~= train_score ~= 1.0.
When setting up a Support Vector Machine classifier, one needs to set two parameters (hyper-parameters) which are NOT tuned at the fitting stage (they are NOT learned). These are C and $\gamma$ (see the relevant section in the Wikipedia article). What we did before was to instantiate the SVC class without specifying these parameters, which means that the defaults are used. Let's try something else.
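For the RBF kernel used by default, the model compares samples through $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \, \|\mathbf{x} - \mathbf{x}'\|^2)$: $\gamma$ sets how quickly the similarity between two samples decays with their distance, while C sets the penalty paid for misclassified training samples (a large C pushes towards a harder margin).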
In [18]:
svc_2 = SVC(C=100, gamma=0.001).fit(X_train, y_train)
svc_2
Out[18]:
In [19]:
svc_2.score(X_train, y_train)
Out[19]:
In [20]:
svc_2.score(X_test, y_test)
Out[20]:
In [21]:
sum(svc_2.predict(X_test) == y_test) / float(len(y_test))
Out[21]:
It could be luck (we only used one train / test split here). Now we're going to use cross-validation to repeat the train / test split several times, so as to get a more accurate estimate of the real test score by averaging the values found over the individual runs.
scikit-learn provides a very convenient interface to do that: sklearn.cross_validation
In [22]:
from sklearn import cross_validation
In [23]:
# type cross_validation.<TAB> in the notebook to list the available splitters and helpers
In [24]:
cross_validation.ShuffleSplit?
In [26]:
cv = cross_validation.ShuffleSplit(len(X), n_iter=3, test_size=0.2,
                                   random_state=0)

for cv_index, (train, test) in enumerate(cv):
    print("# Cross Validation Iteration #%d" % cv_index)
    print("train indices: {0}...".format(train[:10]))
    print("test indices: {0}...".format(test[:10]))
    svc = SVC(C=100, gamma=0.001).fit(X[train], y[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        svc.score(X[train], y[train]), svc.score(X[test], y[test])))
There's a wrapper for estimating cross-validated scores directly; you just have to pass it the cross-validation method instantiated before.
In [27]:
from sklearn.cross_validation import cross_val_score
svc = SVC(C=100, gamma=0.001)
cv = cross_validation.ShuffleSplit(len(X), n_iter=10, test_size=0.2,
random_state=0)
test_scores = cross_val_score(svc, X, y, cv=cv, n_jobs=4) # n_jobs = 4 if you have a quad-core machine ...
test_scores
Out[27]:
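Averaging these scores (and looking at their spread) gives a more robust performance estimate than the single split above:
In [ ]:
print("mean test score: %.3f +/- %.3f" % (test_scores.mean(), test_scores.std()))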
Cross-validation can be used to estimate the best hyperparameters for a model.
Let's see what happens when we fix C but vary $\gamma$:
In [28]:
n_iter = 5  # the number of iterations should be more than that ...
gammas = np.logspace(-7, -1, 10)  # should be more fine-grained ...

cv = cross_validation.ShuffleSplit(len(X), n_iter=n_iter, test_size=0.2)

train_scores = np.zeros((len(gammas), n_iter))
test_scores = np.zeros((len(gammas), n_iter))

for i, gamma in enumerate(gammas):
    for j, (train, test) in enumerate(cv):
        C = 1
        clf = SVC(C=C, gamma=gamma).fit(X[train], y[train])
        train_scores[i, j] = clf.score(X[train], y[train])
        test_scores[i, j] = clf.score(X[test], y[test])
In [31]:
f, ax = plt.subplots(figsize=(12, 8))

#for i in range(n_iter):
#    ax.semilogx(gammas, train_scores[:, i], alpha=0.2, lw=2, c='b')
#    ax.semilogx(gammas, test_scores[:, i], alpha=0.2, lw=2, c='g')

ax.semilogx(gammas, test_scores.mean(1), lw=4, c='g', label='test score')
ax.semilogx(gammas, train_scores.mean(1), lw=4, c='b', label='train score')

ax.fill_between(gammas, train_scores.min(1), train_scores.max(1), color='b', alpha=0.2)
ax.fill_between(gammas, test_scores.min(1), test_scores.max(1), color='g', alpha=0.2)

ax.set_ylabel(r"score for SVC(C=%4.2f, $\gamma=\gamma$)" % C, fontsize=16)
ax.set_xlabel(r"$\gamma$", fontsize=16)

best_gamma = gammas[np.argmax(test_scores.mean(1))]
best_score = test_scores.mean(1).max()
ax.text(best_gamma, best_score + 0.05,
        r"$\gamma$ = %6.4f | score=%6.4f" % (best_gamma, best_score),
        fontsize=15, bbox=dict(facecolor='w', alpha=0.5))

[x.set_fontsize(16) for x in ax.xaxis.get_ticklabels()]
[x.set_fontsize(16) for x in ax.yaxis.get_ticklabels()]

ax.legend(fontsize=16, loc=0)
ax.set_ylim(0, 1.1)
ax.grid()
You can search the hyperparameter space and find the best combination automatically using grid search in scikit-learn:
In [ ]:
from sklearn.grid_search import GridSearchCV
In [ ]:
svc_params = {
    'C': np.logspace(-1, 2, 4),
    'gamma': np.logspace(-4, 0, 5),
}
In [ ]:
gs_svc = GridSearchCV(SVC(), svc_params, cv=3, n_jobs=4)
In [ ]:
gs_svc.fit(X, y)
In [ ]:
gs_svc.best_params_, gs_svc.best_score_
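With the default refit=True, GridSearchCV also keeps the best model refitted on the whole dataset, so you can use it directly for predictions:
In [ ]:
best_svc = gs_svc.best_estimator_
y_best = best_svc.predict(X)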
Two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
This dataset is available from the UC Irvine Machine Learning Repo http://archive.ics.uci.edu/ml/datasets/Wine+Quality
You can try several classification approaches for the quality variable (discrete classes), or you can try regression approaches (using either statsmodels or sklearn): e.g. predicting the alcohol content given the other measurements (or a subset thereof). A regression sketch is given below.
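As a starting point for the regression route, here is a minimal sketch (assuming the same file path used below, keeping all remaining columns, including quality, as predictors; LinearRegression is just one possible choice):
In [ ]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split

wine_reg = pd.read_csv('./data/winequality-red.csv', sep=';')
y_alc = wine_reg.pop('alcohol').values  # target: alcohol content
X_reg = wine_reg.values                 # predictors: all remaining columns

X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_alc, test_size=0.25, random_state=1)
lr = LinearRegression().fit(X_tr, y_tr)
print("R^2 on the test set: %.3f" % lr.score(X_te, y_te))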
In [51]:
wine = pd.read_csv('./data/winequality-red.csv', sep=';')
In [52]:
wine.head()
Out[52]:
Below is an example of classification (using the same SVC classifier); you need to add the cross-validation step yourself (a sketch is given at the end of this notebook).
In [53]:
quality = wine.pop('quality')
In [54]:
y = quality.values
In [55]:
X = wine.values
In [56]:
from sklearn.preprocessing import StandardScaler
In [57]:
scaler = StandardScaler()
In [58]:
scaler.fit(X)
Out[58]:
In [59]:
Xscaled = scaler.transform(X)
In [60]:
from sklearn.svm import SVC
In [61]:
svc = SVC()
In [62]:
svc.fit(Xscaled, y)
Out[62]:
In [63]:
y_hat = svc.predict(Xscaled)
In [64]:
y_hat
Out[64]:
In [65]:
y
Out[65]:
In [66]:
svc.score(Xscaled, y)  # score on the scaled data the model was trained on
Out[66]:
In [67]:
from sklearn.metrics import confusion_matrix
In [68]:
confusion_matrix(y, y_hat)
Out[68]:
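Finally, here is a sketch of the missing cross-validation step for this wine classifier (same ShuffleSplit approach as with the digits):
In [ ]:
from sklearn.cross_validation import cross_val_score, ShuffleSplit

svc = SVC()
cv = ShuffleSplit(len(Xscaled), n_iter=10, test_size=0.2, random_state=0)
cv_scores = cross_val_score(svc, Xscaled, y, cv=cv)
print("cross-validated accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))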