scikit-learn - Quick Start

Loading, learning and predicting: the digits dataset

Playing around with the digits dataset, one of scikit-learn's standard datasets. It provides a feature matrix that can be used to classify the digit samples.


In [4]:
from sklearn import datasets
digits = datasets.load_digits()
print(digits.data)


[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]

In [5]:
digits.target


Out[5]:
array([0, 1, 2, ..., 8, 9, 8])
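Before fitting anything, it can help to check the dataset's shape. A quick sketch (assuming the same `load_digits()` call as above): each sample is a flattened 8x8 image, i.e. 64 features.

```python
from sklearn import datasets

digits = datasets.load_digits()
print(digits.data.shape)    # feature matrix: 1797 samples x 64 features
print(digits.target.shape)  # one label per sample
print(digits.images.shape)  # the same data kept as 8x8 images
```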

Try a support vector classification (SVC) estimator.


In [7]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])


Out[7]:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

Predict the digit in the last image; the expected answer is 8.


In [12]:
clf.predict(digits.data[-1:])


Out[12]:
array([8])
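Predicting a single held-out sample is only a smoke test. A minimal sketch of scoring the same classifier on a proper held-out split, using `train_test_split` and `accuracy_score` (these helpers are not used in the original notebook):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()

# Hold out 25% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

# Score on samples the model has never seen.
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

With these hyperparameters the RBF-kernel SVC separates the digits very well, so the accuracy on the test split is high.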

Model persistence: pickle and joblib

Saving a model using pickle, Python's built-in persistence module.


In [13]:
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(digits.data[-1:])


Out[13]:
array([8])
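`pickle.dumps`/`loads` round-trips the model through a byte string in memory; pickle can just as well persist to a file on disk, which is the use case joblib targets below. A small sketch (the filename `clf.pkl` is illustrative):

```python
import pickle

from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

# Write the fitted model to disk, then load it back.
with open('clf.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('clf.pkl', 'rb') as f:
    clf2 = pickle.load(f)

# The reloaded model predicts exactly like the original.
print(clf2.predict(digits.data[-1:]))
```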

scikit-learn also provides joblib, which can be used as a replacement for pickle (joblib.dump & joblib.load). It can only pickle to disk, not to a string, but it is more efficient on large data. (In recent scikit-learn versions, sklearn.externals.joblib has been removed in favor of the standalone joblib package.)


In [14]:
from sklearn.externals import joblib
joblib.dump(clf, 'clfdump.pkl')


Out[14]:
['clfdump.pkl',
 'clfdump.pkl_01.npy',
 'clfdump.pkl_02.npy',
 'clfdump.pkl_03.npy',
 'clfdump.pkl_04.npy',
 'clfdump.pkl_05.npy',
 'clfdump.pkl_06.npy',
 'clfdump.pkl_07.npy',
 'clfdump.pkl_08.npy',
 'clfdump.pkl_09.npy',
 'clfdump.pkl_10.npy',
 'clfdump.pkl_11.npy']
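To use the persisted model later, load it back with joblib.load. A sketch assuming the standalone joblib package (recent joblib versions write a single file rather than the sidecar `.npy` files shown above):

```python
import joblib

from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

# Dump to disk and reload in place of the original object.
joblib.dump(clf, 'clfdump.pkl')
clf2 = joblib.load('clfdump.pkl')

# The reloaded model gives the same prediction as the original.
print(clf2.predict(digits.data[-1:]))
```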