Fit and predict your data: Introduction to Scikit-learn


In [90]:
from IPython.display import Image
Image(filename='images/phd053104s.png')


Out[90]:

What is machine learning?

A learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
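
For example, a dataset of n samples with d features is usually stored as an (n, d) array. A minimal sketch with made-up numbers:

import numpy as np

# 3 samples with 2 features each (made-up values)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])
X.shape  # (3, 2) -> n_samples x n_features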

We can separate learning problems into a few large categories (a minimal fit/predict sketch follows the list):

  • Supervised learning, in which the data comes with additional attributes that we want to predict.
    • classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data.
    • regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
  • Unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
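
Whatever the category, scikit-learn estimators share the same basic interface: fit on training data, then predict on new data. A minimal sketch, assuming scikit-learn is installed and using a toy nearest-neighbour classifier on made-up data:

from sklearn.neighbors import KNeighborsClassifier

# made-up labeled data: 4 samples, 2 features, 2 classes
X_toy = [[0, 0], [1, 1], [9, 9], [10, 10]]
y_toy = [0, 0, 1, 1]

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_toy, y_toy)                      # learn from the labeled samples
clf.predict([[0.5, 0.5], [9.5, 9.5]])      # -> array([0, 1])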

What do we have to install?

$ pip install numpy scipy pandas scikit-learn matplotlib


In [41]:
from IPython.display import Image
Image(filename='images/instalacion.png')


Out[41]:

Before starting with scikit-learn, let's take a look at NumPy, pandas and matplotlib

NumPy


In [1]:
import numpy as np

In [2]:
a = np.array(range(64), dtype=np.int32)
a


Out[2]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], dtype=int32)

In [3]:
a = a.reshape((8, 8))
a


Out[3]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55],
       [56, 57, 58, 59, 60, 61, 62, 63]], dtype=int32)

In [4]:
b = np.random.rand(5, 5)
b.shape


Out[4]:
(5, 5)

In [5]:
a.mean(), a.std(), a.cumsum()


Out[5]:
(31.5,
 18.472953201911167,
 array([   0,    1,    3,    6,   10,   15,   21,   28,   36,   45,   55,
          66,   78,   91,  105,  120,  136,  153,  171,  190,  210,  231,
         253,  276,  300,  325,  351,  378,  406,  435,  465,  496,  528,
         561,  595,  630,  666,  703,  741,  780,  820,  861,  903,  946,
         990, 1035, 1081, 1128, 1176, 1225, 1275, 1326, 1378, 1431, 1485,
        1540, 1596, 1653, 1711, 1770, 1830, 1891, 1953, 2016]))

In [6]:
a.cumsum() * 2


Out[6]:
array([   0,    2,    6,   12,   20,   30,   42,   56,   72,   90,  110,
        132,  156,  182,  210,  240,  272,  306,  342,  380,  420,  462,
        506,  552,  600,  650,  702,  756,  812,  870,  930,  992, 1056,
       1122, 1190, 1260, 1332, 1406, 1482, 1560, 1640, 1722, 1806, 1892,
       1980, 2070, 2162, 2256, 2352, 2450, 2550, 2652, 2756, 2862, 2970,
       3080, 3192, 3306, 3422, 3540, 3660, 3782, 3906, 4032])

In [8]:
c = _  # in IPython, _ refers to the previous cell's output
c


Out[8]:
array([   0,    2,    6,   12,   20,   30,   42,   56,   72,   90,  110,
        132,  156,  182,  210,  240,  272,  306,  342,  380,  420,  462,
        506,  552,  600,  650,  702,  756,  812,  870,  930,  992, 1056,
       1122, 1190, 1260, 1332, 1406, 1482, 1560, 1640, 1722, 1806, 1892,
       1980, 2070, 2162, 2256, 2352, 2450, 2550, 2652, 2756, 2862, 2970,
       3080, 3192, 3306, 3422, 3540, 3660, 3782, 3906, 4032])

In [9]:
c[:5]


Out[9]:
array([ 0,  2,  6, 12, 20])

In [11]:
c[-1]


Out[11]:
4032

In [56]:
from IPython.display import Image
Image(filename='images/numpy.jpg', width=300)


Out[56]:

Matplotlib


In [12]:
from IPython.display import IFrame
IFrame('http://matplotlib.org/', width=900, height=350)


Out[12]:

In [58]:
from IPython.display import Image
Image(filename='images/convincing.png')


Out[58]:

In [2]:
%matplotlib inline

import matplotlib.pyplot as plt

In [15]:
print plt.style.available
plt.style.use(plt.style.available[0])


[u'seaborn-darkgrid', u'seaborn-notebook', u'classic', u'seaborn-ticks', u'grayscale', u'bmh', u'seaborn-talk', u'dark_background', u'ggplot', u'fivethirtyeight', u'seaborn-colorblind', u'seaborn-deep', u'seaborn-whitegrid', u'seaborn-bright', u'seaborn-poster', u'seaborn-muted', u'seaborn-paper', u'seaborn-white', u'seaborn-pastel', u'seaborn-dark', u'seaborn-dark-palette']

In [16]:
x = np.linspace(0, 2, 10)
x


Out[16]:
array([ 0.        ,  0.22222222,  0.44444444,  0.66666667,  0.88888889,
        1.11111111,  1.33333333,  1.55555556,  1.77777778,  2.        ])

In [18]:
plt.plot(x, x, 'o--', label='linear')
plt.plot(x, x ** 2, 'x-', label='quadratic')
plt.legend(loc='best')
plt.title('Linear vs Quadratic progression')
plt.xlabel('Input')
plt.ylabel('Output')


Out[18]:
<matplotlib.text.Text at 0x7f320fdbaf50>

In [19]:
plt.scatter(np.random.rand(100,), np.random.rand(100,))


Out[19]:
<matplotlib.collections.PathCollection at 0x7f320fc863d0>

Pandas


In [20]:
from IPython.display import IFrame
IFrame('http://pandas.pydata.org/', width=900, height=350)


Out[20]:

In [4]:
import pandas as pd

In [23]:
pd.read_csv?

In [5]:
bandas = pd.read_csv('data/rec 1a validacion Vanesa B 1 2 3 4 5 7  S_.txt',index_col=None, header=4, sep=';')
bandas


Out[5]:
banda 1 banda 2 banda 3 banda 4 banda 5 banda 7 salida
0 0.839 0.918 0.925 0.643 0.808 0.902 0
1 0.937 0.918 0.902 0.663 0.843 0.902 0
2 0.906 0.835 0.902 0.631 1.000 1.000 0
3 0.969 0.835 0.925 0.631 1.000 1.000 0
4 0.969 0.835 0.925 0.631 1.000 1.000 0
5 0.937 0.835 0.925 0.631 1.000 1.000 0
6 0.839 0.835 0.925 0.631 1.000 1.000 0
7 0.937 0.792 0.953 0.608 1.000 1.000 0
8 0.937 0.961 0.925 0.608 1.000 1.000 0
9 1.000 0.875 0.953 0.608 1.000 1.000 0
10 0.969 0.835 0.925 0.608 1.000 1.000 0
11 1.000 0.875 0.976 0.608 1.000 0.969 0
12 1.000 0.875 0.925 0.620 1.000 1.000 0
13 0.937 0.875 0.925 0.620 1.000 0.984 0
14 1.000 0.918 0.925 0.631 1.000 1.000 0
15 0.937 0.918 0.953 0.631 1.000 1.000 0
16 0.969 0.875 0.953 0.631 1.000 1.000 0
17 1.000 0.918 0.925 0.631 1.000 1.000 0
18 0.937 0.875 0.953 0.643 0.984 1.000 0
19 0.937 0.875 0.902 0.631 0.945 1.000 0
20 0.906 0.835 0.902 0.663 0.933 0.937 0
21 0.937 1.000 0.953 0.706 0.851 0.855 0
22 1.000 0.961 0.976 0.663 0.761 0.820 0
23 0.839 0.875 0.902 0.643 0.741 0.804 0
24 0.906 0.792 0.827 0.643 0.741 0.788 0
25 0.710 0.792 0.749 0.631 0.761 0.773 0
26 0.678 0.749 0.749 0.608 0.741 0.773 0
27 0.678 0.749 0.749 0.608 0.741 0.773 0
28 0.741 0.835 0.851 0.576 0.718 0.788 0
29 0.741 0.835 0.851 0.576 0.718 0.788 0
... ... ... ... ... ... ... ...
999967 0.741 0.624 0.675 0.553 0.741 0.690 0
999968 0.647 0.667 0.651 0.588 0.741 0.706 0
999969 0.678 0.624 0.702 0.565 0.733 0.690 0
999970 0.647 0.584 0.651 0.565 0.718 0.675 0
999971 0.549 0.541 0.675 0.553 0.733 0.690 0
999972 0.580 0.584 0.600 0.565 0.718 0.706 0
999973 0.580 0.541 0.576 0.553 0.725 0.655 0
999974 0.482 0.624 0.600 0.553 0.710 0.675 0
999975 0.612 0.667 0.651 0.565 0.800 0.737 0
999976 0.647 0.667 0.651 0.553 0.808 0.706 0
999977 0.518 0.624 0.600 0.553 0.741 0.675 0
999978 0.710 0.710 0.749 0.588 0.776 0.737 0
999979 0.741 0.710 0.725 0.576 0.741 0.722 0
999980 0.741 0.710 0.702 0.533 0.733 0.690 0
999981 0.710 0.667 0.675 0.545 0.710 0.706 0
999982 0.647 0.541 0.624 0.522 0.733 0.690 0
999983 0.482 0.416 0.424 0.498 0.667 0.557 0
999984 0.388 0.459 0.498 0.498 0.643 0.639 0
999985 0.612 0.624 0.624 0.510 0.710 0.690 0
999986 0.678 0.792 0.749 0.533 0.725 0.788 0
999987 0.776 0.710 0.749 0.553 0.761 0.737 0
999988 0.647 0.624 0.624 0.522 0.694 0.655 0
999989 0.647 0.624 0.624 0.510 0.710 0.675 0
999990 0.678 0.624 0.675 0.565 0.733 0.706 0
999991 0.678 0.624 0.624 0.533 0.725 0.675 0
999992 0.776 0.624 0.675 0.545 0.718 0.690 0
999993 0.580 0.584 0.651 0.553 0.675 0.675 0
999994 0.580 0.584 0.651 0.510 0.694 0.690 0
999995 0.580 0.584 0.624 0.510 0.694 0.655 0
999996 0.549 0.584 0.576 0.490 0.694 0.655 0

999997 rows × 7 columns


In [6]:
bandas.describe()


Out[6]:
banda 1 banda 2 banda 3 banda 4 banda 5 banda 7 salida
count 999997.000000 999997.000000 999997.000000 999997.000000 999997.000000 999997.000000 999997.000000
mean 0.736513 0.778513 0.751244 0.466247 0.580058 0.548622 0.297029
std 0.156879 0.125312 0.127666 0.248788 0.380072 0.346731 0.456950
min 0.000000 0.000000 0.024000 0.000000 0.000000 0.000000 0.000000
25% 0.647000 0.710000 0.675000 0.129000 0.024000 0.063000 0.000000
50% 0.776000 0.792000 0.749000 0.576000 0.761000 0.690000 0.000000
75% 0.839000 0.875000 0.827000 0.643000 0.875000 0.820000 1.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

In [7]:
bandas.corr()


Out[7]:
banda 1 banda 2 banda 3 banda 4 banda 5 banda 7 salida
banda 1 1.000000 0.900330 0.761178 -0.384812 -0.318211 -0.196691 0.379487
banda 2 0.900330 1.000000 0.820724 -0.307433 -0.269831 -0.162319 0.349697
banda 3 0.761178 0.820724 1.000000 0.104466 0.194268 0.311338 -0.107332
banda 4 -0.384812 -0.307433 0.104466 1.000000 0.942624 0.899421 -0.942380
banda 5 -0.318211 -0.269831 0.194268 0.942624 1.000000 0.975522 -0.958861
banda 7 -0.196691 -0.162319 0.311338 0.899421 0.975522 1.000000 -0.942074
salida 0.379487 0.349697 -0.107332 -0.942380 -0.958861 -0.942074 1.000000

Let's start with scikit-learn


In [27]:
from IPython.display import Image
Image(url='http://1.bp.blogspot.com/-ME24ePzpzIM/UQLWTwurfXI/AAAAAAAAANw/W3EETIroA80/s1600/drop_shadows_background.png',
      width=1000, height=1000)


Out[27]:

In [28]:
from IPython.display import IFrame
IFrame('http://scikit-learn.org/stable/index.html', width=900, height=350)


Out[28]:

In [8]:
bandas_cut = pd.read_csv('data/rec 1a validacion Vanesa B 1 2 3 4 5 7  S_.txt',index_col=None, header=4, sep=';', nrows=10000)
bandas_cut


Out[8]:
banda 1 banda 2 banda 3 banda 4 banda 5 banda 7 salida
0 0.839 0.918 0.925 0.643 0.808 0.902 0
1 0.937 0.918 0.902 0.663 0.843 0.902 0
2 0.906 0.835 0.902 0.631 1.000 1.000 0
3 0.969 0.835 0.925 0.631 1.000 1.000 0
4 0.969 0.835 0.925 0.631 1.000 1.000 0
5 0.937 0.835 0.925 0.631 1.000 1.000 0
6 0.839 0.835 0.925 0.631 1.000 1.000 0
7 0.937 0.792 0.953 0.608 1.000 1.000 0
8 0.937 0.961 0.925 0.608 1.000 1.000 0
9 1.000 0.875 0.953 0.608 1.000 1.000 0
10 0.969 0.835 0.925 0.608 1.000 1.000 0
11 1.000 0.875 0.976 0.608 1.000 0.969 0
12 1.000 0.875 0.925 0.620 1.000 1.000 0
13 0.937 0.875 0.925 0.620 1.000 0.984 0
14 1.000 0.918 0.925 0.631 1.000 1.000 0
15 0.937 0.918 0.953 0.631 1.000 1.000 0
16 0.969 0.875 0.953 0.631 1.000 1.000 0
17 1.000 0.918 0.925 0.631 1.000 1.000 0
18 0.937 0.875 0.953 0.643 0.984 1.000 0
19 0.937 0.875 0.902 0.631 0.945 1.000 0
20 0.906 0.835 0.902 0.663 0.933 0.937 0
21 0.937 1.000 0.953 0.706 0.851 0.855 0
22 1.000 0.961 0.976 0.663 0.761 0.820 0
23 0.839 0.875 0.902 0.643 0.741 0.804 0
24 0.906 0.792 0.827 0.643 0.741 0.788 0
25 0.710 0.792 0.749 0.631 0.761 0.773 0
26 0.678 0.749 0.749 0.608 0.741 0.773 0
27 0.678 0.749 0.749 0.608 0.741 0.773 0
28 0.741 0.835 0.851 0.576 0.718 0.788 0
29 0.741 0.835 0.851 0.576 0.718 0.788 0
... ... ... ... ... ... ... ...
9970 0.612 0.498 0.525 0.424 0.557 0.490 0
9971 0.549 0.498 0.525 0.400 0.541 0.510 0
9972 0.518 0.498 0.498 0.412 0.533 0.541 0
9973 0.518 0.459 0.475 0.424 0.549 0.510 0
9974 0.482 0.624 0.600 0.467 0.518 0.510 0
9975 1.000 1.000 1.000 0.533 0.659 0.757 0
9976 1.000 1.000 1.000 0.478 0.616 0.757 0
9977 1.000 1.000 1.000 0.478 0.584 0.690 0
9978 1.000 1.000 1.000 0.467 0.600 0.706 0
9979 1.000 0.918 0.875 0.467 0.576 0.592 0
9980 0.647 0.624 0.624 0.435 0.569 0.573 0
9981 0.678 0.624 0.600 0.435 0.533 0.541 0
9982 0.776 0.835 0.800 0.467 0.584 0.655 0
9983 1.000 1.000 1.000 0.498 0.600 0.675 0
9984 1.000 1.000 1.000 0.467 0.533 0.573 0
9985 1.000 1.000 1.000 0.467 0.533 0.573 0
9986 1.000 1.000 0.976 0.467 0.533 0.573 0
9987 1.000 1.000 1.000 0.455 0.510 0.592 0
9988 1.000 1.000 1.000 0.478 0.518 0.541 0
9989 1.000 1.000 1.000 0.467 0.525 0.541 0
9990 1.000 1.000 0.953 0.447 0.482 0.490 0
9991 1.000 1.000 1.000 0.467 0.482 0.541 0
9992 1.000 1.000 1.000 0.455 0.490 0.541 0
9993 1.000 1.000 1.000 0.467 0.498 0.592 0
9994 1.000 1.000 1.000 0.522 0.549 0.639 0
9995 1.000 1.000 1.000 0.510 0.592 0.675 0
9996 1.000 1.000 1.000 0.447 0.475 0.510 0
9997 1.000 1.000 1.000 0.490 0.510 0.541 0
9998 1.000 1.000 1.000 0.490 0.518 0.557 0
9999 1.000 1.000 1.000 0.455 0.510 0.541 0

10000 rows × 7 columns


In [9]:
bandas_cut.columns


Out[9]:
Index([u'banda 1', u'banda 2', u'banda 3', u'banda 4', u'banda 5', u'banda 7',
       u'salida'],
      dtype='object')

Obtain the data as numpy arrays


In [10]:
X = bandas_cut[bandas_cut.columns[:-1]].values
y = bandas_cut[bandas_cut.columns[-1]].values

Dimensionality reduction

We project the data from a high-dimensional space down to two dimensions for visualization.


In [13]:
import numpy as np

In [11]:
from sklearn.decomposition import PCA
Xp = PCA(n_components=2).fit_transform(X)
Xp


Out[11]:
array([[ 0.40756617,  0.09962935],
       [ 0.43585663,  0.14630343],
       [ 0.58785802,  0.07958301],
       ..., 
       [-0.06815429,  0.30465406],
       [-0.05282888,  0.30513639],
       [-0.08216879,  0.30858593]])

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt

# get the product class 
product_class = np.unique(y)

colors = plt.get_cmap("hsv")

plt.figure(figsize=(10, 4))
for i, p in enumerate(product_class):
    mask = (y == p)
    plt.scatter(Xp[mask, 0], Xp[mask, 1], 
                c=colors(1. * i / 11), label=p, alpha=0.2)
    
plt.legend(loc="best")
plt.xlabel('PC 1')
plt.ylabel('PC 2')


Out[14]:
<matplotlib.text.Text at 0x7f9c2b3485d0>

Clustering

Mean Shift

Mean shift clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm that works by iteratively updating candidate centroids to be the mean of the points within a given region.
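
To make the update rule concrete, here is a minimal sketch of a single flat-kernel mean-shift step (an illustration only, not the scikit-learn implementation; the helper name mean_shift_step is made up):

import numpy as np

def mean_shift_step(center, points, bandwidth):
    # points is an (n_samples, n_features) array; move the centroid to the
    # mean of the points that fall within `bandwidth` of it
    mask = np.linalg.norm(points - center, axis=1) < bandwidth
    return points[mask].mean(axis=0)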


In [16]:
from sklearn.cluster import MeanShift, estimate_bandwidth

bandwidth = estimate_bandwidth(Xp, n_samples=1000)

In [17]:
# Create Model
ms = MeanShift(bandwidth=bandwidth)

In [18]:
# Train the model without y data!
ms.fit(Xp)

labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)


number of estimated clusters : 2

In [19]:
import matplotlib.pyplot as plt
from itertools import cycle

plt.figure()
plt.clf()

colors = cycle('rc')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(Xp[my_members, 0], Xp[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()


Some classification examples!

Before doing any classification, we first need to split our data so we can hold out unseen samples to evaluate generalization.


In [20]:
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print """X_train shape : {}, y_train shape : {}
X_test shape : {}, y_test shape : {}""".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape)


X_train shape : (7000, 6), y_train shape : (7000,)
X_test shape : (3000, 6), y_test shape : (3000,)

Define a confusion-matrix plot so we can visualize the performance of the classifiers


In [21]:
def plot_matrix(clf, X_test, y_test):
    plt.clf()
    plt.imshow(confusion_matrix(clf.predict(X_test), y_test),
               interpolation='nearest', cmap=plt.cm.Blues)
    plt.colorbar()
    plt.xlabel("true label")
    plt.ylabel("predicted label")
    plt.show()

Let's define a Support Vector Machine in sklearn!

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

Define the model

In [48]:
from sklearn.svm import SVC

sv = SVC(kernel='linear', cache_size=1000, probability=True)

Train the model using known labels

In [49]:
sv.fit(X_train, y_train)


Out[49]:
SVC(C=1.0, cache_size=1000, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=True, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Apply the model

In [50]:
y_pred = sv.predict(X_test)

In [26]:
y_pred, X_test


Out[26]:
(array([0, 0, 0, ..., 1, 0, 0]),
 array([[ 1.   ,  1.   ,  1.   ,  0.478,  0.525,  0.573],
        [ 0.482,  0.584,  0.549,  0.533,  0.659,  0.573],
        [ 0.612,  0.71 ,  0.675,  0.651,  0.886,  0.675],
        ..., 
        [ 0.871,  0.792,  0.6  ,  0.051,  0.   ,  0.031],
        [ 0.776,  0.835,  0.851,  0.643,  1.   ,  0.886],
        [ 0.937,  1.   ,  1.   ,  0.686,  1.   ,  0.984]]))

Evaluate the model

In [29]:
confusion_matrix(sv.predict(X_test), y_test)


Out[29]:
array([[2179,    3],
       [   0,  818]])

In [27]:
print classification_report( y_pred, y_test)
print sv.score(X_test, y_test)
plot_matrix(sv, X_test, y_test)


             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2182
          1       1.00      1.00      1.00       818

avg / total       1.00      1.00      1.00      3000

0.999

  • Precision is the ability of the classifier not to label as positive a sample that is negative.
  • Recall is the ability of the classifier to find all the positive samples.
  • The F1 score is the harmonic mean of precision and recall (see the sketch after this list).
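
As a sanity check, these numbers can be recomputed by hand from the confusion matrix above, treating class 1 as the positive class (rows are predicted labels and columns are true labels, because the predictions were passed as the first argument):

tp = 818.0  # predicted 1, truly 1
fp = 0.0    # predicted 1, truly 0
fn = 3.0    # predicted 0, truly 1

precision = tp / (tp + fp)                          # 1.0
recall = tp / (tp + fn)                             # ~0.996, rounds to 1.00 in the report
f1 = 2 * precision * recall / (precision + recall)  # ~0.998
print precision, recall, f1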

In [45]:
confusion_matrix(y_pred, y_test)


Out[45]:
array([[2179,    3],
       [   0,  818]])

Extremely Randomized Trees

It essentially consists of randomizing, partially or totally, both the attribute and the cut-point choice while splitting a tree node. In the extreme case, it builds totally randomized trees whose structures are independent of the output values of the learning sample. The strength of the randomization can be tuned to the problem by the appropriate choice of a parameter.
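
A minimal sketch of the randomized split idea (an illustration only, not the scikit-learn implementation; the helper name random_split is made up):

import numpy as np

def random_split(X, n_candidate_features=2, rng=np.random):
    # pick a few features at random and, for each, a cut-point drawn uniformly
    # between that feature's minimum and maximum; the tree then keeps the
    # best-scoring candidate (e.g. by Gini impurity)
    features = rng.choice(X.shape[1], size=n_candidate_features, replace=False)
    return [(f, rng.uniform(X[:, f].min(), X[:, f].max())) for f in features]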


In [47]:
from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(n_estimators=200,
                           max_features=0.2, 
                           n_jobs=2,
                           max_depth=None,
                           min_samples_split=1,
                           random_state=1).fit(X_train, y_train)
print classification_report(clf.predict(X_test), y_test)
print "Score over Testing Data {}".format(clf.score(X_test, y_test))
print "Score over Training Data {}".format(clf.score(X_train, y_train))
plot_matrix(clf, X_test, y_test)


             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2180
          1       1.00      1.00      1.00       820

avg / total       1.00      1.00      1.00      3000

Score over Testing Data 0.999666666667
Score over Training Data 1.0

In [31]:
importances = clf.feature_importances_

text = map(lambda i: bandas.columns[:-1][i], range(6))
plt.figure(figsize=(20, 6))
print importances[::-1].shape
plt.bar(range(6),height=importances,  width=1.)
plt.xticks(np.arange(0.5, 6, 1.), text, rotation=90)
plt.xlim((0, 6))
plt.show()

indices = np.argsort(importances)[::-1]
for i in range(3):
    print importances[indices[i]], bandas_cut.columns[:-1][indices[i]]


(6,)
0.371724125925 banda 5
0.289452793666 banda 7
0.23476407926 banda 4

A simple regression problem


In [32]:
from sklearn.linear_model import LinearRegression

Create some artificial data

In [33]:
X_reg = np.random.random(size=(200, 1))
y_reg = 3 * X_reg[:, 0] + 2 + np.random.normal(size=200)

Declare the model and train it!

In [35]:
model = LinearRegression()
model.fit(X_reg, y_reg)
print("Model coefficient: %.5f, and intercept: %.5f"% (model.coef_, model.intercept_))


Model coefficient: 2.98216, and intercept: 1.97019

In [36]:
# Plot the data and the model prediction
X_test_reg = np.linspace(0, 1, 100)[:, np.newaxis]
y_test_reg = model.predict(X_test_reg)

plt.plot(X_reg[:, 0], y_reg, 'o')
plt.plot(X_test_reg[:, 0], y_test_reg)
plt.title('Linear regression with a single input variable');


Grid Search: Searching for estimator parameters


In [37]:
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier

parameter_grid = {
    'n_estimators': [100, 200],
    'max_features': [0.2, 0.5],
    #'max_depth': [5., None]
}

grid_search = GridSearchCV(ExtraTreesClassifier(n_jobs=4), parameter_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] max_features=0.2, n_estimators=100 ..............................
[CV] ..... max_features=0.2, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.2, n_estimators=100 ..............................
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.3s
[CV] ..... max_features=0.2, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.2, n_estimators=100 ..............................
[CV] ..... max_features=0.2, n_estimators=100, score=0.999286 -   0.3s
[CV] max_features=0.2, n_estimators=100 ..............................
[CV] ..... max_features=0.2, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.2, n_estimators=100 ..............................
[CV] ..... max_features=0.2, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.2, n_estimators=200 ..............................
[CV] ..... max_features=0.2, n_estimators=200, score=1.000000 -   0.4s
[CV] max_features=0.2, n_estimators=200 ..............................
[CV] ..... max_features=0.2, n_estimators=200, score=1.000000 -   0.3s
[CV] max_features=0.2, n_estimators=200 ..............................
[CV] ..... max_features=0.2, n_estimators=200, score=0.999286 -   0.3s
[CV] max_features=0.2, n_estimators=200 ..............................
[CV] ..... max_features=0.2, n_estimators=200, score=1.000000 -   0.3s
[CV] max_features=0.2, n_estimators=200 ..............................
[CV] ..... max_features=0.2, n_estimators=200, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=100 ..............................
[CV] ..... max_features=0.5, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=100 ..............................
[CV] ..... max_features=0.5, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=100 ..............................
[CV] ..... max_features=0.5, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=100 ..............................
[CV] ..... max_features=0.5, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=100 ..............................
[CV] ..... max_features=0.5, n_estimators=100, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=200 ..............................
[CV] ..... max_features=0.5, n_estimators=200, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=200 ..............................
[CV] ..... max_features=0.5, n_estimators=200, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=200 ..............................
[CV] ..... max_features=0.5, n_estimators=200, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=200 ..............................
[CV] ..... max_features=0.5, n_estimators=200, score=1.000000 -   0.3s
[CV] max_features=0.5, n_estimators=200 ..............................
[CV] ..... max_features=0.5, n_estimators=200, score=1.000000 -   0.3s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    5.7s finished
Out[37]:
GridSearchCV(cv=5, error_score='raise',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=4,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': [100, 200], 'max_features': [0.2, 0.5]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=3)

In [39]:
grid_search.best_params_, grid_search.best_estimator_


Out[39]:
({'max_features': 0.5, 'n_estimators': 100},
 ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.5, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
            oob_score=False, random_state=None, verbose=0, warm_start=False))

Model performance evaluation

Scikit-learn gives a sort of baseline for classifiers

In [40]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='most_frequent',random_state=0).fit(X, y)
#print clf.score(X_test, y_test)
#plot_matrix(clf, X_test, y_test)
clf.score(X, y)


Out[40]:
0.72589999999999999

ROC curve

The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies; the area under the curve (AUC) summarizes the classifier in a single number.


In [43]:
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

def plot_roc_curve(target_test, target_predicted_proba):
    fpr, tpr, thresholds = roc_curve(target_test, target_predicted_proba[:, 1])
    
    roc_auc = auc(fpr, tpr)
    # Plot ROC curve
    plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate or (1 - Specificity)')
    plt.ylabel('True Positive Rate or (Sensitivity)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")

In [51]:
plot_roc_curve(y_test, sv.predict_proba(X_test))


Cross Validation

Cross-validation is a procedure that repeats the train / test split several times so as to get a more accurate estimate of the real test score by averaging the values found in the individual runs.


In [52]:
from sklearn.cross_validation import cross_val_score, ShuffleSplit

cv = ShuffleSplit(X.shape[0], n_iter=10, test_size=0.1, random_state=0)

test_scores = cross_val_score(sv, X, y, cv=cv, n_jobs=2)
print "scores: {}  mean: {}  std: {}".format(str(test_scores), np.mean(test_scores), np.std(test_scores))


scores: [ 1.     1.     1.     1.     1.     0.999  1.     0.999  0.999  1.   ]  mean: 0.9997  std: 0.000458257569496


A quick look at Lasagne


In [57]:
import theano
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import BatchIterator
from nolearn.lasagne import NeuralNet
from lasagne.layers import InputLayer, Conv2DLayer, DropoutLayer,\
    MaxPool2DLayer, DenseLayer
from lasagne.nonlinearities import softmax
from sklearn.preprocessing import MinMaxScaler, label_binarize

In [82]:
import lasagne
lasagne.__version__


Out[82]:
'0.2.dev1'

In [58]:
X = bandas_cut[bandas_cut.columns[:-1]].values
y = bandas_cut[bandas_cut.columns[-1]].values

In [59]:
X_net = X.astype(np.float32)
y_net = y.astype(np.int32)

Using the preprocessing module from sklearn


In [60]:
X_scaler = MinMaxScaler()
X_net = X_scaler.fit_transform(X_net)

In [61]:
X_train, X_test, y_train, y_test = train_test_split(X_net, y_net,
                                                    test_size=0.3,
                                                    random_state=42)

print "X_train.shape -> {}, X_test.shape -> {} ".format(X_train.shape,
                                                        X_test.shape)
print "y_train.shape -> {}, y_test.shape -> {} ".format(y_train.shape,
                                                        y_test.shape)
print X.min(), X.max()
print y.min(), y.max()


X_train.shape -> (7000, 6), X_test.shape -> (3000, 6) 
y_train.shape -> (7000,), y_test.shape -> (3000,) 
0.0 1.0
0 1

Set up the layers of our neural net


In [62]:
layers_0 = [
                (InputLayer, {'shape': (None, 6)}),
                (DenseLayer, {'num_units': 100}),
                (DropoutLayer, {}),
                (DenseLayer, {'num_units': 100}),
                (DenseLayer, {'num_units': 2, 'nonlinearity': softmax}),
           ]

Build the network


In [63]:
def create_network(npochs=50, batch_size=200):
    return NeuralNet(
        layers=layers_0,
        update=nesterov_momentum,
        update_learning_rate=theano.shared(np.float32(0.01)),
        update_momentum=theano.shared(np.float32(0.9)),

        regression=False,
        batch_iterator_train=BatchIterator(batch_size=batch_size),
        max_epochs=npochs,
        verbose=1)
net0 = create_network()

In [64]:
net0.fit(X_train, y_train)


/usr/lib/python2.7/site-packages/nolearn/lasagne/base.py:416: UserWarning: The Param class is deprecated. Replace Param(default=N) by theano.In(value=N)
  for input_layer in input_layers]
/usr/lib/python2.7/site-packages/nolearn/lasagne/base.py:417: UserWarning: The Param class is deprecated. Replace Param(default=N) by theano.In(value=N)
  inputs = X_inputs + [theano.Param(y_batch, name="y")]
# Neural Network with 11002 learnable parameters

## Layer information

  #  name        size
---  --------  ------
  0  input0         6
  1  dense1       100
  2  dropout2     100
  3  dense3       100
  4  dense4         2

  epoch    train loss    valid loss    train/val    valid acc  dur
-------  ------------  ------------  -----------  -----------  -----
      1       0.41956       0.23279      1.80231      0.99645  0.05s
      2       0.16697       0.06027      2.77030      0.99929  0.05s
      3       0.06781       0.02431      2.78948      0.99929  0.06s
      4       0.03751       0.01439      2.60698      0.99929  0.06s
      5       0.02537       0.01005      2.52442      0.99929  0.06s
      6       0.02011       0.00780      2.57981      0.99929  0.05s
      7       0.01634       0.00655      2.49598      0.99929  0.07s
      8       0.01405       0.00590      2.38167      0.99929  0.06s
      9       0.01386       0.00512      2.70485      0.99929  0.06s
     10       0.01009       0.00466      2.16410      0.99929  0.06s
     11       0.00886       0.00419      2.11425      0.99929  0.06s
     12       0.00923       0.00408      2.26532      0.99929  0.06s
     13       0.00997       0.00403      2.47311      0.99929  0.06s
     14       0.00718       0.00386      1.85831      0.99929  0.06s
     15       0.00751       0.00356      2.10848      0.99929  0.05s
     16       0.00693       0.00361      1.92185      0.99929  0.06s
     17       0.00624       0.00349      1.78572      0.99929  0.05s
     18       0.00673       0.00361      1.86321      0.99929  0.06s
     19       0.00723       0.00343      2.10603      0.99929  0.05s
     20       0.00690       0.00335      2.06139      0.99929  0.05s
     21       0.00636       0.00323      1.97199      0.99929  0.05s
     22       0.00633       0.00332      1.90884      0.99929  0.05s
     23       0.00533       0.00323      1.65261      0.99929  0.05s
     24       0.00556       0.00308      1.80490      0.99929  0.05s
     25       0.00594       0.00320      1.85832      0.99929  0.05s
     26       0.00467       0.00305      1.53003      0.99929  0.05s
     27       0.00534       0.00293      1.82419      0.99929  0.05s
     28       0.00521       0.00291      1.79342      0.99929  0.05s
     29       0.00537       0.00311      1.72578      0.99929  0.05s
     30       0.00495       0.00297      1.66524      0.99929  0.05s
     31       0.00542       0.00299      1.81119      0.99929  0.06s
     32       0.00412       0.00293      1.40736      0.99929  0.06s
     33       0.00566       0.00268      2.11225      0.99929  0.06s
     34       0.00446       0.00279      1.59789      0.99929  0.06s
     35       0.00506       0.00279      1.81019      0.99929  0.05s
     36       0.00486       0.00270      1.80049      0.99929  0.06s
     37       0.00359       0.00263      1.36422      0.99929  0.07s
     38       0.00484       0.00263      1.83788      0.99929  0.06s
     39       0.00409       0.00264      1.54830      0.99929  0.06s
     40       0.00496       0.00263      1.88389      0.99929  0.06s
     41       0.00314       0.00252      1.24664      0.99929  0.07s
     42       0.00303       0.00255      1.19015      0.99929  0.07s
     43       0.00389       0.00247      1.57088      0.99929  0.06s
     44       0.00324       0.00247      1.30840      0.99929  0.06s
     45       0.00400       0.00247      1.62296      0.99929  0.05s
     46       0.00409       0.00236      1.73093      0.99929  0.06s
     47       0.00457       0.00239      1.90722      0.99929  0.05s
     48       0.00372       0.00256      1.45192      0.99929  0.05s
     49       0.00332       0.00240      1.38350      0.99929  0.05s
     50       0.00357       0.00240      1.48744      0.99929  0.05s
Out[64]:
NeuralNet(X_tensor_type=None,
     batch_iterator_test=<nolearn.lasagne.base.BatchIterator object at 0x7f9c35da4850>,
     batch_iterator_train=<nolearn.lasagne.base.BatchIterator object at 0x7f9c241f96d0>,
     custom_score=None,
     layers=[(<class 'lasagne.layers.input.InputLayer'>, {'shape': (None, 6)}), (<class 'lasagne.layers.dense.DenseLayer'>, {'num_units': 100}), (<class 'lasagne.layers.noise.DropoutLayer'>, {}), (<class 'lasagne.layers.dense.DenseLayer'>, {'num_units': 100}), (<class 'lasagne.layers.dense.DenseLayer'>, {'num_units': 2, 'nonlinearity': <function softmax at 0x7f9c3760e578>})],
     loss=None, max_epochs=50, more_params={},
     objective=<function objective at 0x7f9c35da8848>,
     objective_loss_function=<function categorical_crossentropy at 0x7f9c36f647d0>,
     on_epoch_finished=[<nolearn.lasagne.handlers.PrintLog instance at 0x7f9c244a7290>],
     on_training_finished=[],
     on_training_started=[<nolearn.lasagne.handlers.PrintLayerInfo instance at 0x7f9c244a7f80>],
     regression=False,
     train_split=<nolearn.lasagne.base.TrainSplit object at 0x7f9c35da4890>,
     update=<function nesterov_momentum at 0x7f9c36f6e140>,
     update_learning_rate=<TensorType(float32, scalar)>,
     update_momentum=<TensorType(float32, scalar)>,
     use_label_encoder=False, verbose=1,
     y_tensor_type=TensorType(int32, vector))

How to save and load your network?


In [66]:
import cPickle as pickle

with open('data/aguatierra_simpleNN.pickle', 'wb') as f:
    pickle.dump(net0, f, -1)

In [67]:
import cPickle as pickle

net0 = None

fnames_nets = ['data/aguatierra_simpleNN.pickle']
nets = [net0]
for n, fnames in enumerate(fnames_nets):
    with open(fnames, 'rb') as f:
        nets[n] = pickle.load(f)

In [68]:
from nolearn.lasagne import PrintLayerInfo
layer_info = PrintLayerInfo()

nets[0].verbose = 3
nets[0].initialize()
layer_info(nets[0])


# Neural Network with 11002 learnable parameters

## Layer information

  #  name        size
---  --------  ------
  0  input0         6
  1  dense1       100
  2  dropout2     100
  3  dense3       100
  4  dense4         2

Loss plot


In [69]:
%matplotlib inline

plt.clf()
plt.figure(figsize=(15,5))
for net, net_name, color in zip(nets, ['Arch0'], ['m']):
    train_loss = np.array([i["train_loss"] for i in net.train_history_])
    valid_loss = np.array([i["valid_loss"] for i in net.train_history_])
    plt.plot(train_loss, '--{}'.format(color), linewidth=2, label="{} train".format(net_name))
    plt.plot(valid_loss, '-{}'.format(color), linewidth=2, label="{} valid".format(net_name))

plt.grid()
plt.legend()
plt.xlabel("epoch")
plt.ylabel("loss")
plt.yscale("log")
plt.show()


<matplotlib.figure.Figure at 0x7f9bf8211450>


In [70]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn import cross_validation
for net in nets:
    y_pred = net.predict(X_net)
    print classification_report(y_net, y_pred)
    print "[Test dataset] Score: %.5f" % net.score(X_test, y_test)


             precision    recall  f1-score   support

          0       1.00      1.00      1.00      7259
          1       1.00      1.00      1.00      2741

avg / total       1.00      1.00      1.00     10000

[Test dataset] Score: 0.99867