An introduction to machine learning with scikit-learn

Learning problems fall into a few large categories:

  • supervised learning
    • classification: samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data.
    • regression: the desired output consists of one or more continuous variables, for example predicting the length of a salmon as a function of its age and weight.
  • unsupervised learning: the training data consists of a set of input vectors x without any corresponding target values.
    • clustering: discover groups of similar examples within the data (a minimal sketch contrasting this with classification follows this list).
    • density estimation: determine the distribution of data within the input space.
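
A minimal sketch (not part of the original notebook) contrasting the two settings on the iris dataset; the estimator choices (LogisticRegression, KMeans) are illustrative rather than prescribed by the text:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Supervised: fit on (X, y) pairs, then predict labels for new samples.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.predict(X[:2]))      # predicted class labels

# Unsupervised: fit on X alone and discover groups of similar samples.
km = KMeans(n_clusters=3, n_init=10)
km.fit(X)
print(km.labels_[:2])          # cluster assignments; no target values used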

A dataset is a dictionary-like object that holds all the data and some metadata about it. The data is stored in the .data member, an array of shape (n_samples, n_features). For example, digits.data gives access to the features that can be used to classify the digit samples:


In [14]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
print(digits.data)


[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
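
As a quick check of the (n_samples, n_features) layout described above, the shape of the array can be inspected (a small addition, not part of the original session):

print(digits.data.shape)   # (1797, 64): 1797 samples, 64 features each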

digits.target gives the ground truth for the digits dataset, that is, the number corresponding to each digit image that we are trying to learn:


In [15]:
digits.target


Out[15]:
array([0, 1, 2, ..., 8, 9, 8])
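
There is exactly one target value per sample, drawn from the ten digit classes; a quick check (not part of the original session, assuming NumPy is imported as np):

import numpy as np
print(digits.target.shape)       # (1797,): one label per image
print(np.unique(digits.target))  # [0 1 2 3 4 5 6 7 8 9]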

Each original sample is an image of shape (8,8) and can be accessed using:


In [16]:
digits.images[0]


Out[16]:
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
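
Each 64-feature row in digits.data is just the corresponding (8, 8) image flattened; a small check of that relationship (not part of the original session):

import numpy as np
# ravel() flattens the 8x8 image into the 64-element feature vector.
print(np.array_equal(digits.images[0].ravel(), digits.data[0]))  # True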

Learning and predicting

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T). Here we use a support vector classifier and fit it on all but the last sample of the digits dataset:


In [17]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])


Out[17]:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

Now we can predict new values; in particular, we can ask the classifier what the digit of the last image is, which we did not use to train the classifier:


In [18]:
clf.predict(digits.data[-1:])


Out[18]:
array([8])
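
Training on all samples but one and testing on that single image is only a toy check. A more realistic sketch (not part of the original notebook) holds out a proper test set and measures accuracy; train_test_split and accuracy_score are standard scikit-learn utilities, and the split parameters here are illustrative:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
# Hold out 25% of the data so evaluation is not done on training samples.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))  # accuracy on held-out data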

Model persistence

It is possible to save a model in scikit-learn by using Python's built-in persistence module, pickle:


In [19]:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X,y)


Out[19]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [20]:
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(X[0:1])


Out[20]:
array([0])

In [21]:
y[0]


Out[21]:
0

In the specific case of scikit-learn, it may be more efficient to use joblib (joblib.dump and joblib.load) instead of pickle, since it handles models carrying large NumPy arrays better, although it can only persist to disk and not to a string.
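
A minimal sketch of the joblib route (the file name model.joblib is just an illustrative placeholder):

import joblib
# Persist the fitted classifier to disk and load it back later.
joblib.dump(clf, 'model.joblib')
clf3 = joblib.load('model.joblib')
print(clf3.predict(X[0:1]))   # array([0]), same result as the pickled copy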

