Scikit-learn introduction


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Loading data

The sklearn.datasets module contains well-known datasets that you can load, download, or generate

http://scikit-learn.org/stable/datasets/

Functions:

  • load_* - load a dataset
  • fetch_* - download and load a dataset
  • make_* - generate a dataset
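
For instance, a minimal sketch that generates a synthetic classification dataset with make_classification (the parameter values here are arbitrary):

from sklearn.datasets import make_classification

# 100 samples, 4 features, 2 classes by default
X, y = make_classification(n_samples=100, n_features=4, random_state=0)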

In [2]:
import sklearn
from sklearn.datasets import load_digits

Classification data


In [3]:
digits = load_digits() # Bunch object
type(digits)


Out[3]:
sklearn.datasets.base.Bunch

In [4]:
digits.keys()


Out[4]:
dict_keys(['DESCR', 'data', 'target', 'images', 'target_names'])
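
The DESCR key holds a plain-text description of the dataset; the other keys hold the data itself. For example:

print(digits.DESCR[:200])  # first 200 characters of the description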

In [5]:
digits.target_names


Out[5]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [6]:
digits.data.shape


Out[6]:
(1797, 64)

In [7]:
digits.data[0]


Out[7]:
array([  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
        15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
         8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
         5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
         1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
         0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.])
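
Each row of digits.data is a flattened 8x8 grayscale image; digits.images holds the same pixels in 2-D. A quick consistency check (numpy is already imported by %pylab):

first = digits.data[0].reshape(8, 8)
print(np.array_equal(first, digits.images[0]))  # True: data is images, flattened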

In [8]:
plt.imshow(digits.images[0], cmap=plt.cm.binary, interpolation="none")


Out[8]:
<matplotlib.image.AxesImage at 0x10dd6d320>
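
To get a feel for the data, a small sketch that plots the first few digits next to their labels:

fig, axes = plt.subplots(1, 5)
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap=plt.cm.binary, interpolation="none")
    ax.set_title(str(label))
    ax.axis("off")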

In [9]:
housing = sklearn.datasets.fetch_california_housing()

In [10]:
housing.keys()


Out[10]:
dict_keys(['DESCR', 'data', 'target', 'feature_names'])

In [11]:
housing.feature_names


Out[11]:
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

How does each feature affect housing prices?


In [12]:
x = housing.data[:, 2]  # AveRooms
y = housing.data[:, 3]  # AveBedrms

In [13]:
vals = housing.target  # median house value, used to scale marker size
plt.scatter(x, y, s=vals)


Out[13]:
<matplotlib.collections.PathCollection at 0x10dd9acc0>
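
Labeling the axes with the corresponding feature names makes the plot easier to read:

plt.scatter(x, y, s=vals)
plt.xlabel(housing.feature_names[2])  # AveRooms
plt.ylabel(housing.feature_names[3])  # AveBedrms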

Classify digits

Split the dataset into training and test sets


In [14]:
from sklearn.model_selection import train_test_split
Xtrain_d, Xtest_d, ytrain_d, ytest_d = train_test_split(digits.data, digits.target, test_size=0.1)

In [15]:
len(Xtrain_d), len(Xtest_d)


Out[15]:
(1617, 180)
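
The split is random, so the scores below will vary slightly between runs (the 1617/180 sizes themselves are fixed by test_size). Passing random_state makes the split reproducible, e.g.:

Xtrain_d, Xtest_d, ytrain_d, ytest_d = train_test_split(
    digits.data, digits.target, test_size=0.1, random_state=0)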

Let's use a kNN classifier!


In [16]:
from sklearn.neighbors import KNeighborsClassifier as kNN

In [17]:
knn_model = kNN(n_neighbors=3)  # classify by majority vote of the 3 nearest neighbors

In [18]:
knn_model.fit(Xtrain_d, ytrain_d)


Out[18]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
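
Under the hood, kNN classifies a point by majority vote among its k closest training samples, using Euclidean distance by default (Minkowski with p=2). A minimal numpy sketch of the idea for a single query point:

query = Xtest_d[0]
dists = np.sqrt(((Xtrain_d - query) ** 2).sum(axis=1))  # Euclidean distance to every training sample
nearest = np.argsort(dists)[:3]                         # indices of the 3 closest
print(ytrain_d[nearest])  # the majority label among these is the prediction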

In [19]:
# Let's compare the predicted value with the actual one
X0 = [digits.data[0]]
y0 = digits.target[0]
out0 = knn_model.predict(X0)[0]
print("Equals: ", out0, y0)


Equals:  0 0

In [20]:
knn_model.predict_proba(X0) # The model is pretty sure about its prediction


Out[20]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
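
With uniform weights these probabilities are simply vote fractions among the 3 neighbors, so they are always multiples of 1/3. The neighbors behind a prediction can be inspected directly:

dist, idx = knn_model.kneighbors(X0)  # distances and indices of the 3 nearest training samples
print(ytrain_d[idx])                  # their labels; here all three vote for 0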

Evaluate the model


In [21]:
knn_model.score(Xtest_d, ytest_d) # Check the accuracy on the test data


Out[21]:
0.98333333333333328

In [22]:
from sklearn.metrics import confusion_matrix, f1_score
ypred_d = knn_model.predict(Xtest_d)
confusion_matrix(ytest_d, ypred_d)  # rows: true labels, columns: predicted labels


Out[22]:
array([[17,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0, 16,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0, 15,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 19,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0, 18,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0, 17,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0, 23,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 15,  0,  0],
       [ 0,  1,  0,  1,  0,  0,  0,  0, 22,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  1, 15]])

In [23]:
f1_score(ytest_d, ypred_d, average="macro")


Out[23]:
0.98479680923057733
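
average="macro" takes the unweighted mean of the per-class F1 scores, so every digit counts equally regardless of how often it appears in the test set. Equivalently:

per_class = f1_score(ytest_d, ypred_d, average=None)  # one F1 score per digit
print(per_class.mean())                               # same as average="macro"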

...with cross-validation


In [25]:
from sklearn.model_selection import cross_val_score
res = cross_val_score(knn_model, digits.data, digits.target, cv=10)
print(res)
print("Average:", np.average(res))


[ 0.94054054  1.          0.98895028  0.98888889  0.96089385  0.98324022
  0.98324022  0.98314607  0.97740113  0.97159091]
Average: 0.97778921138
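
With an integer cv, cross_val_score uses stratified 10-fold splitting for classifiers. The folds can also be constructed explicitly when you need control over shuffling, e.g. with StratifiedKFold:

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
res = cross_val_score(knn_model, digits.data, digits.target, cv=cv)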

Estimating housing prices


In [26]:
from sklearn.linear_model import LinearRegression

In [27]:
lin_model = LinearRegression()

In [28]:
housing.data.shape


Out[28]:
(20640, 8)

In [29]:
sep = 15000  # use the first 15000 samples for training, the rest for testing
X_h = housing.data[:sep, 2:4]  # AveRooms and AveBedrms only
y_h = housing.target[:sep]
lin_model.fit(X_h, y_h)


Out[29]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
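
The fitted model is just a plane: prediction = coef_ @ x + intercept_. The learned parameters are available as attributes:

print(lin_model.coef_)       # one weight per feature (AveRooms, AveBedrms)
print(lin_model.intercept_)  # bias term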

In [30]:
lin_model.predict([housing.data[sep][2:4]]), housing.target[sep]  # predicted vs. actual value for the first held-out sample


Out[30]:
(array([ 2.02234329]), 1.407)

In [31]:
lin_model.score(housing.data[sep:, 2:4], housing.target[sep:])  # R^2 on the held-out samples


Out[31]:
0.073359798102767604
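
score returns the coefficient of determination R^2, so these two features alone explain only about 7% of the variance in house prices. As a sanity check, a model fit on all eight features should do noticeably better (a sketch; the exact score depends on this non-random split):

full_model = LinearRegression()
full_model.fit(housing.data[:sep], housing.target[:sep])
print(full_model.score(housing.data[sep:], housing.target[sep:]))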

Exercise