Scikit-learn introduction


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Loading data

The sklearn.datasets module contains well-known datasets that you can load, download, or generate

http://scikit-learn.org/stable/datasets/

Functions:

  • load_* - load a dataset
  • fetch_* - download and load a dataset
  • make_* - generate a dataset
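
For instance, a minimal sketch that generates a synthetic classification dataset with make_classification (the parameter values here are arbitrary):

from sklearn.datasets import make_classification

# 100 samples, 4 features, 2 classes by default
X, y = make_classification(n_samples=100, n_features=4, random_state=0)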

In [2]:
import sklearn
from sklearn.datasets import load_digits

Classification data


In [3]:
digits = load_digits() # Bunch object
type(digits)


Out[3]:
sklearn.datasets.base.Bunch

In [4]:
digits.keys()


Out[4]:
dict_keys(['DESCR', 'data', 'target', 'images', 'target_names'])
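
The DESCR key holds a plain-text description of the dataset; the other keys hold the data itself. For example:

print(digits.DESCR[:200])  # first 200 characters of the description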

In [5]:
digits.target_names


Out[5]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [6]:
digits.data.shape


Out[6]:
(1797, 64)

In [7]:
digits.data[0]


Out[7]:
array([  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
        15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
         8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
         5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
         1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
         0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.])
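
Each row of digits.data is a flattened 8x8 grayscale image; digits.images holds the same pixels in 2-D. A quick consistency check (numpy is already imported by %pylab):

first = digits.data[0].reshape(8, 8)
print(np.array_equal(first, digits.images[0]))  # True: data is images, flattened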

In [8]:
plt.imshow(digits.images[0], cmap=plt.cm.binary, interpolation="none")


Out[8]:
<matplotlib.image.AxesImage at 0x10dd6d320>
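
To get a feel for the data, a small sketch that plots the first few digits next to their labels:

fig, axes = plt.subplots(1, 5)
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap=plt.cm.binary, interpolation="none")
    ax.set_title(str(label))
    ax.axis("off")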

In [9]:
housing = sklearn.datasets.fetch_california_housing()

In [10]:
housing.keys()


Out[10]:
dict_keys(['DESCR', 'data', 'target', 'feature_names'])

In [11]:
housing.feature_names


Out[11]:
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

How does each feature affect housing prices?


In [12]:
x = housing.data[:, 2]  # AveRooms
y = housing.data[:, 3]  # AveBedrms

In [13]:
vals = housing.target  # median house value, used to scale marker size
plt.scatter(x, y, s=vals)


Out[13]:
<matplotlib.collections.PathCollection at 0x10dd9acc0>
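
Labeling the axes with the corresponding feature names makes the plot easier to read:

plt.scatter(x, y, s=vals)
plt.xlabel(housing.feature_names[2])  # AveRooms
plt.ylabel(housing.feature_names[3])  # AveBedrms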

Classify digits

Split the dataset into training and test sets


In [14]:
from sklearn.model_selection import train_test_split
Xtrain_d, Xtest_d, ytrain_d, ytest_d = train_test_split(digits.data, digits.target, test_size=0.1)

In [15]:
len(Xtrain_d), len(Xtest_d)


Out[15]:
(1617, 180)
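
The split is random, so the scores below will vary slightly between runs (the 1617/180 sizes themselves are fixed by test_size). Passing random_state makes the split reproducible, e.g.:

Xtrain_d, Xtest_d, ytrain_d, ytest_d = train_test_split(
    digits.data, digits.target, test_size=0.1, random_state=0)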

Let's use a kNN classifier!


In [16]:
from sklearn.neighbors import KNeighborsClassifier as kNN

In [17]:
knn_model = kNN(n_neighbors=3)  # classify by majority vote of the 3 nearest neighbors

In [18]:
knn_model.fit(Xtrain_d, ytrain_d)


Out[18]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
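
Under the hood, kNN classifies a point by majority vote among its k closest training samples, using Euclidean distance by default (Minkowski with p=2). A minimal numpy sketch of the idea for a single query point:

query = Xtest_d[0]
dists = np.sqrt(((Xtrain_d - query) ** 2).sum(axis=1))  # Euclidean distance to every training sample
nearest = np.argsort(dists)[:3]                         # indices of the 3 closest
print(ytrain_d[nearest])  # the majority label among these is the prediction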

In [19]:
# Let's compare the predicted value with the actual one
X0 = [digits.data[0]]
y0 = digits.target[0]
out0 = knn_model.predict(X0)[0]
print("Equals: ", out0, y0)


Equals:  0 0

In [20]:
knn_model.predict_proba(X0) # The model is pretty sure about its prediction


Out[20]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
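
With uniform weights these probabilities are simply vote fractions among the 3 neighbors, so they are always multiples of 1/3. The neighbors behind a prediction can be inspected directly:

dist, idx = knn_model.kneighbors(X0)  # distances and indices of the 3 nearest training samples
print(ytrain_d[idx])                  # their labels; here all three vote for 0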

Evaluate the model


In [21]:
knn_model.score(Xtest_d, ytest_d) # Check the accuracy on the test data


Out[21]:
0.98333333333333328

In [22]:
from sklearn.metrics import confusion_matrix, f1_score
ypred_d = knn_model.predict(Xtest_d)
confusion_matrix(ytest_d, ypred_d)  # rows: true labels, columns: predicted labels


Out[22]:
array([[17,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 16,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0, 15,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 19,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0, 18,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0, 17,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0, 23,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 15,  0,  0],
       [ 0,  1,  0,  1,  0,  0,  0,  0, 22,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  1, 15]])

In [23]:
f1_score(ytest_d, ypred_d, average="macro")


Out[23]:
0.98479680923057733
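
average="macro" takes the unweighted mean of the per-class F1 scores, so every digit counts equally regardless of how often it appears in the test set. Equivalently:

per_class = f1_score(ytest_d, ypred_d, average=None)  # one F1 score per digit
print(per_class.mean())                               # same as average="macro"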

...with cross-validation


In [25]:
from sklearn.model_selection import cross_val_score
res = cross_val_score(knn_model, digits.data, digits.target, cv=10)
print(res)
print("Average:", np.average(res))


[ 0.94054054  1.          0.98895028  0.98888889  0.96089385  0.98324022
  0.98324022  0.98314607  0.97740113  0.97159091]
Average: 0.97778921138
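
With an integer cv, cross_val_score uses stratified 10-fold splitting for classifiers. The folds can also be constructed explicitly when you need control over shuffling, e.g. with StratifiedKFold:

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
res = cross_val_score(knn_model, digits.data, digits.target, cv=cv)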

Estimating housing prices


In [26]:
from sklearn.linear_model import LinearRegression

In [27]:
lin_model = LinearRegression()

In [28]:
housing.data.shape


Out[28]:
(20640, 8)

In [29]:
sep = 15000  # use the first 15000 samples for training, the rest for testing
X_h = housing.data[:sep, 2:4]  # AveRooms and AveBedrms only
y_h = housing.target[:sep]
lin_model.fit(X_h, y_h)


Out[29]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
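
The fitted model is just a plane: prediction = coef_ @ x + intercept_. The learned parameters are available as attributes:

print(lin_model.coef_)       # one weight per feature (AveRooms, AveBedrms)
print(lin_model.intercept_)  # bias term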

In [30]:
lin_model.predict([housing.data[sep][2:4]]), housing.target[sep]  # predicted vs. actual value for the first held-out sample


Out[30]:
(array([ 2.02234329]), 1.407)

In [31]:
lin_model.score(housing.data[sep:, 2:4], housing.target[sep:])  # R^2 on the held-out samples


Out[31]:
0.073359798102767604
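
score returns the coefficient of determination R^2, so these two features alone explain only about 7% of the variance in house prices. As a sanity check, a model fit on all eight features should do noticeably better (a sketch; the exact score depends on this non-random split):

full_model = LinearRegression()
full_model.fit(housing.data[:sep], housing.target[:sep])
print(full_model.score(housing.data[sep:], housing.target[sep:]))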

Exercise