An introduction to machine learning with scikit-learn

Learning problems fall into a few large categories:

  • supervised learning
    • classification: samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data.
    • regression: the desired output consists of one or more continuous variables, for example predicting the length of a salmon as a function of its age and weight.
  • unsupervised learning: the training data consists of a set of input vectors x without any corresponding target values.
    • clustering: discover groups of similar examples within the data (a minimal sketch contrasting this with classification follows this list).
    • density estimation: determine the distribution of data within the input space.
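
A minimal sketch (not part of the original notebook) contrasting the two settings on the iris dataset; the estimator choices (LogisticRegression, KMeans) are illustrative rather than prescribed by the text:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Supervised: fit on (X, y) pairs, then predict labels for new samples.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.predict(X[:2]))      # predicted class labels

# Unsupervised: fit on X alone and discover groups of similar samples.
km = KMeans(n_clusters=3, n_init=10)
km.fit(X)
print(km.labels_[:2])          # cluster assignments; no target values used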

A dataset is a dictionary-like object that holds all the data and some metadata about it. The data is stored in the .data member, an array of shape (n_samples, n_features). For example, digits.data gives access to the features that can be used to classify the digit samples:


In [14]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
print(digits.data)


[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
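
As a quick check of the (n_samples, n_features) layout described above, the shape of the array can be inspected (a small addition, not part of the original session):

print(digits.data.shape)   # (1797, 64): 1797 samples, 64 features each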

digits.target gives the ground truth for the digits dataset, that is, the number corresponding to each digit image that we are trying to learn:


In [15]:
digits.target


Out[15]:
array([0, 1, 2, ..., 8, 9, 8])
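
There is exactly one target value per sample, drawn from the ten digit classes; a quick check (not part of the original session, assuming NumPy is imported as np):

import numpy as np
print(digits.target.shape)       # (1797,): one label per image
print(np.unique(digits.target))  # [0 1 2 3 4 5 6 7 8 9]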

Each original sample is an image of shape (8,8) and can be accessed using:


In [16]:
digits.images[0]


Out[16]:
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
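
Each 64-feature row in digits.data is just the corresponding (8, 8) image flattened; a small check of that relationship (not part of the original session):

import numpy as np
# ravel() flattens the 8x8 image into the 64-element feature vector.
print(np.array_equal(digits.images[0].ravel(), digits.data[0]))  # True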

Learning and predicting

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T). Here we use a support vector classifier and fit it on all but the last sample of the digits dataset:


In [17]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])


Out[17]:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

Now we can predict new values; in particular, we can ask the classifier what the digit of the last image is, which we did not use to train the classifier:


In [18]:
clf.predict(digits.data[-1:])


Out[18]:
array([8])
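
Training on all samples but one and testing on that single image is only a toy check. A more realistic sketch (not part of the original notebook) holds out a proper test set and measures accuracy; train_test_split and accuracy_score are standard scikit-learn utilities, and the split parameters here are illustrative:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
# Hold out 25% of the data so evaluation is not done on training samples.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))  # accuracy on held-out data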

Model persistence

It is possible to save a model in scikit-learn by using Python's built-in persistence module, pickle:


In [19]:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X,y)


Out[19]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [20]:
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(X[0:1])


Out[20]:
array([0])

In [21]:
y[0]


Out[21]:
0

In the specific case of scikit-learn, it may be more efficient to use joblib (joblib.dump and joblib.load) instead of pickle, since it handles models carrying large NumPy arrays better, although it can only persist to disk and not to a string.
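
A minimal sketch of the joblib route (the file name model.joblib is just an illustrative placeholder):

import joblib
# Persist the fitted classifier to disk and load it back later.
joblib.dump(clf, 'model.joblib')
clf3 = joblib.load('model.joblib')
print(clf3.predict(X[0:1]))   # array([0]), same result as the pickled copy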

