scikit-learn - Quick Start

Loading, learning and predicting: the digits dataset

Playing around with the digits dataset, one of scikit-learn's standard datasets. It provides a feature matrix that can be used to classify the digit samples.


In [4]:
from sklearn import datasets
digits = datasets.load_digits()
print(digits.data)


[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]

In [5]:
digits.target


Out[5]:
array([0, 1, 2, ..., 8, 9, 8])
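Before fitting anything, it can help to check the dataset's shape. A quick sketch (assuming the same `load_digits()` call as above): each sample is a flattened 8x8 image, i.e. 64 features.

```python
from sklearn import datasets

digits = datasets.load_digits()
print(digits.data.shape)    # feature matrix: 1797 samples x 64 features
print(digits.target.shape)  # one label per sample
print(digits.images.shape)  # the same data kept as 8x8 images
```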

Try a support vector classification (SVC) estimator.


In [7]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])


Out[7]:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

Predict the digit in the last image; the expected answer is 8.


In [12]:
clf.predict(digits.data[-1:])


Out[12]:
array([8])
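Predicting a single held-out sample is only a smoke test. A minimal sketch of scoring the same classifier on a proper held-out split, using `train_test_split` and `accuracy_score` (these helpers are not used in the original notebook):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()

# Hold out 25% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

# Score on samples the model has never seen.
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

With these hyperparameters the RBF-kernel SVC separates the digits very well, so the accuracy on the test split is high.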

Model persistence: pickle and joblib

Saving a model using pickle, Python's built-in persistence module.


In [13]:
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(digits.data[-1:])


Out[13]:
array([8])
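`pickle.dumps`/`loads` round-trips the model through a byte string in memory; pickle can just as well persist to a file on disk, which is the use case joblib targets below. A small sketch (the filename `clf.pkl` is illustrative):

```python
import pickle

from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

# Write the fitted model to disk, then load it back.
with open('clf.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('clf.pkl', 'rb') as f:
    clf2 = pickle.load(f)

# The reloaded model predicts exactly like the original.
print(clf2.predict(digits.data[-1:]))
```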

scikit-learn also provides joblib, which can be used as a replacement for pickle (joblib.dump & joblib.load). It can only pickle to disk, not to a string, but it is more efficient on large data. (In recent scikit-learn versions, sklearn.externals.joblib has been removed in favor of the standalone joblib package.)


In [14]:
from sklearn.externals import joblib
joblib.dump(clf, 'clfdump.pkl')


Out[14]:
['clfdump.pkl',
 'clfdump.pkl_01.npy',
 'clfdump.pkl_02.npy',
 'clfdump.pkl_03.npy',
 'clfdump.pkl_04.npy',
 'clfdump.pkl_05.npy',
 'clfdump.pkl_06.npy',
 'clfdump.pkl_07.npy',
 'clfdump.pkl_08.npy',
 'clfdump.pkl_09.npy',
 'clfdump.pkl_10.npy',
 'clfdump.pkl_11.npy']
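To use the persisted model later, load it back with joblib.load. A sketch assuming the standalone joblib package (recent joblib versions write a single file rather than the sidecar `.npy` files shown above):

```python
import joblib

from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

# Dump to disk and reload in place of the original object.
joblib.dump(clf, 'clfdump.pkl')
clf2 = joblib.load('clfdump.pkl')

# The reloaded model gives the same prediction as the original.
print(clf2.predict(digits.data[-1:]))
```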