Introduction


In [1]:
# import numpy, pandas, matplotlib and the random module
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import random
%matplotlib inline

What is Data?

  • A dataset consists of multiple data rows.
  • Each row describes an item with its features.
  • Think of features as properties of a sample. E.g. an apple has color, size, kind, ...

In [2]:
# Load Iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
iris.feature_names


Out[3]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
print(iris.DESCR)


Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

  • We separate the dataset into a features matrix and a labels vector (or matrix).
  • Features can be a vector (each row has only 1 feature), a matrix (each row has multiple features) or a tensor (each feature is itself a matrix or tensor).
  • Labels can likewise be a vector, a matrix or a tensor.
  • Usually the features form a matrix, denoted X, and the labels form a vector, denoted y.
  • The dataset size is the number of data items (rows) in it.

In [5]:
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

In [6]:
print('X.shape =', X.shape)
print('y.shape =', y.shape)


X.shape = (150, 4)
y.shape = (150,)

In [7]:
X.head()


Out[7]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

In [8]:
y.head()


Out[8]:
0    0
1    0
2    0
3    0
4    0
dtype: int64

In [9]:
plt.figure(figsize=(6, 6));
plt.scatter(X.values[:, 0], X.values[:, 1], c=y, cmap=plt.cm.rainbow);



In [10]:
# Correlation plots are a good way to understand the dataset
pd.plotting.scatter_matrix(X, figsize=(10, 10));



Supervised Learning

TODO

Regression

TODO

Prepare the data

  • First we gather the data and preprocess it.
  • Standardization, normalization and imputation are common preprocessing steps (a minimal standardization sketch follows this list).
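As a minimal sketch (not part of the original notebook), standardization rescales every feature column to zero mean and unit variance; scikit-learn's StandardScaler does this in one call, illustrated here on the Iris features loaded above:

# Sketch: standardize the Iris features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(iris.data)
print(X_std.mean(axis=0).round(2))  # roughly 0 for every column
print(X_std.std(axis=0).round(2))   # roughly 1 for every column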

In [11]:
from sklearn.datasets import load_boston
boston = load_boston()

In [12]:
boston.feature_names


Out[12]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'],
      dtype='<U7')

In [13]:
print(boston.DESCR)


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


In [14]:
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target)

In [15]:
print('X.shape =', X.shape)
print('y.shape =', y.shape)


X.shape = (506, 13)
y.shape = (506,)

In [16]:
X.head()


Out[16]:
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

In [17]:
y.head()


Out[17]:
0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
dtype: float64

In [18]:
plt.scatter(X.RM, y, marker='.');
plt.xlabel('# Rooms');
plt.ylabel('House Price');



In [19]:
plt.scatter(X.LSTAT, y, marker='.');
plt.xlabel('LSTAT');
plt.ylabel('House Price');



In [20]:
# keep only the average number of rooms (RM) as a single feature for simple regression
X = X.RM

Train/Test Split

  • To estimate how well the model generalizes to unseen data, we split the dataset into a training set and a test set (a short sketch of the idea follows this list).
  • The training and test sets have the same columns (features); they differ only in the number of rows.
  • Usually the training set is the larger part of the dataset (e.g. 80%) and the rest is the test set (e.g. 20%).
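Conceptually, the split just shuffles the row indices and cuts them at the desired ratio. A rough sketch of that idea (the actual split in the next cell uses scikit-learn's train_test_split):

# Sketch of a manual 80/20 split: shuffle the row indices and cut them
import numpy as np

n = X.shape[0]
idx = np.random.permutation(n)        # random order of the rows
cut = int(0.8 * n)                    # 80% of the rows go to training
train_idx, test_idx = idx[:cut], idx[cut:]
print(len(train_idx), len(test_idx))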

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values.reshape(X.shape[0], 1),
                                                    y.values.reshape(y.shape[0], 1),
                                                    test_size=.3)

In [22]:
print('X_train.shape =', X_train.shape)
print('X_test.shape =', X_test.shape)
print('y_train.shape =', y_train.shape)
print('y_test.shape =', y_test.shape)


X_train.shape = (354, 1)
X_test.shape = (152, 1)
y_train.shape = (354, 1)
y_test.shape = (152, 1)

Train the model on the training set


In [23]:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)

Evaluate on the test set


In [24]:
y_pred = model.predict(X_test)

In [25]:
from sklearn.metrics import mean_squared_error
print('Mean squared error =', mean_squared_error(y_test, y_pred))


Mean squared error = 46.4310404601
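As a quick sanity check (a sketch, not from the original notebook), the mean squared error is simply the average of the squared residuals, so computing it by hand should match scikit-learn's value up to floating point:

# MSE computed by hand
import numpy as np
print('Manual MSE =', np.mean((y_test - y_pred) ** 2))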

In [26]:
print('Linear Regression score = %.2f%%' % (model.score(X_test, y_test) * 100))


Linear Regression score = 48.50%
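For a regressor, model.score returns the coefficient of determination R². A short sketch of the same computation:

# Sketch: R^2 = 1 - SS_res / SS_tot
import numpy as np
ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
print('R^2 = %.4f' % (1 - ss_res / ss_tot))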

In [27]:
plt.scatter(X_test, y_test, marker='.');
plt.plot(X_test, y_pred, color='red');


The model is now trained and can be used to make predictions on new data
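For instance (an illustrative sketch; RM = 6.0 is an arbitrary value, not taken from the dataset):

# Predict the price of a hypothetical house with 6 rooms
print(model.predict([[6.0]]))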


Classification

Prepare the data

  • Just like regression.

In [28]:
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')

In [29]:
X = pd.DataFrame(mnist.data)
y = pd.Series(mnist.target)

In [30]:
f, axes = plt.subplots(3, 4, figsize=(8, 8));
for i in range(3):
    for j in range(4):
        axes[i, j].axis('off')
        if i == 2 and j >= 2:
            continue
        num = i * 4 + j
        axes[i, j].set_title(num)
        axes[i, j].matshow(X.values[y == num][0].reshape(28, 28))


Train/Test Split


In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000, shuffle=False)

In [32]:
print('X_train.shape =', X_train.shape)
print('y_train.shape =', y_train.shape)
print('X_test.shape =', X_test.shape)
print('y_test.shape =', y_test.shape)


X_train.shape = (60000, 784)
y_train.shape = (60000,)
X_test.shape = (10000, 784)
y_test.shape = (10000,)

Train the model on the training set


In [33]:
from sklearn.linear_model import SGDClassifier
model = SGDClassifier().fit(X_train, y_train)


/usr/local/lib/python3.5/dist-packages/sklearn/linear_model/stochastic_gradient.py:84: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)

Evaluate on the test set


In [34]:
print('SGD Classifier score = %.2f%%' % (model.score(X_test, y_test) * 100))


SGD Classifier score = 89.13%
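For a classifier, model.score is plain accuracy, i.e. the fraction of correctly predicted samples. A short sketch of the same computation:

# Sketch: accuracy computed by hand
import numpy as np
print('Accuracy =', np.mean(model.predict(X_test) == y_test.values))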

In [35]:
f, axes = plt.subplots(3, 4, figsize=(8, 8));
for i in range(3):
    for j in range(4):
        axes[i, j].axis('off')
        if i == 2 and j >= 2:
            continue
        axes[i, j].set_title(i * 4 + j)
        axes[i, j].matshow(model.coef_[i * 4 + j].reshape(28, 28))


Correctly Classified Digits


In [36]:
y_pred = model.predict(X_test)
f, axes = plt.subplots(1, 5, figsize=(12, 4))
samples = random.sample(list(y_test[y_test == y_pred].iteritems()), 5)
for i_zero, (i, p) in enumerate(samples):
    axes[i_zero].axis('off')
    axes[i_zero].matshow(X_test.loc[i].values.reshape(28, 28))
    # shuffle=False, so the test-set positional index is the original index minus 60000
    axes[i_zero].set_title('pred = %d' % y_pred[i - 60000], color='green')


Misclassified Digits


In [37]:
y_pred = model.predict(X_test)
f, axes = plt.subplots(1, 5, figsize=(12, 4))
samples = random.sample(list(y_test[y_test != y_pred].iteritems()), 5)
for i_zero, (i, p) in enumerate(samples):
    axes[i_zero].axis('off')
    axes[i_zero].matshow(X_test.loc[i].values.reshape(28, 28))
    # shuffle=False, so the test-set positional index is the original index minus 60000
    axes[i_zero].set_title('pred = %d' % y_pred[i - 60000], color='red')



Unsupervised Learning

Clustering

Prepare the data

  • As in supervised learning, we first gather and preprocess the data.
  • The difference is that there is no labels vector in clustering. Only features (X).
  • A train/test split also doesn't make sense, since there are no true labels to compute a score against.
  • Fortunately, there are evaluation metrics that need no true labels, but they work on the whole dataset (one such metric, the silhouette score, is sketched after the DBSCAN example below).
  • Other evaluation metrics can be used when the true labels are provided.

In [38]:
from sklearn.datasets import make_moons
X, y = make_moons(1000, noise=.05)

In [39]:
plt.scatter(X[:, 0], X[:, 1], s=10);


Train the model on the whole dataset


In [40]:
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=.2).fit(X)

In [41]:
plt.scatter(X[:, 0], X[:, 1], s=10, c=model.labels_, cmap=plt.cm.rainbow);



In [42]:
from sklearn.metrics import adjusted_rand_score
print('Adjusted rand index =', adjusted_rand_score(y, model.labels_))


Adjusted rand index = 1.0
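As a sketch of the label-free metric mentioned earlier, the silhouette score rates how well separated the discovered clusters are (values closer to 1 are better) using only the features and the predicted cluster labels:

# Sketch: the silhouette score needs no ground-truth labels
from sklearn.metrics import silhouette_score
print('Silhouette score =', silhouette_score(X, model.labels_))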

PCA

TODO


In [43]:
from sklearn.datasets import fetch_olivetti_faces
faces = fetch_olivetti_faces()

In [44]:
plt.subplots(2, 3)
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.axis('off')
    plt.imshow(faces.images[10 * i].reshape(64, 64), cmap=plt.cm.gray)



In [45]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [46]:
faces.data = StandardScaler().fit_transform(faces.data)
pca = PCA().fit(faces.data)

In [47]:
plt.subplots(2, 3)
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.axis('off')
    plt.imshow(pca.components_[i].reshape(64, 64), cmap=plt.cm.gray_r)
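A useful complement (a sketch, not in the original notebook) is to check how much of the total variance the leading components capture, using PCA's explained_variance_ratio_ attribute:

# Sketch: variance explained by the first few principal components
print(pca.explained_variance_ratio_[:6])
print('First 6 components explain %.1f%% of the variance'
      % (100 * pca.explained_variance_ratio_[:6].sum()))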