A Tour of Scikit-Learn

When we talk about Data Science and the Data Science Pipeline, we are typically talking about the management of data flows for a specific purpose: the modeling of some hypothesis. The models we construct can then be used in Data Products as an engine to create more data and actionable results. Machine learning is the art of training a model on existing data, using statistical methods to produce a parametric representation that fits the data. That's kind of a mouthful, but it essentially means that a machine learning algorithm uses statistical processes to learn from examples, then applies what it has learned to future inputs to predict an outcome.

Machine learning can classically be summarized with two methodologies: supervised and unsupervised learning. In supervised learning, the “correct answers” are annotated ahead of time and the algorithm tries to fit a decision space based on those answers. In unsupervised learning, algorithms try to group like examples together, inferring similarities via distance metrics. Machine learning allows us to handle new data in a meaningful way, predicting where new data will fit into our models.

Scikit-Learn is a powerful machine learning library implemented in Python on top of the numeric and scientific computing powerhouses NumPy, SciPy, and matplotlib, enabling extremely fast analysis of small- to medium-sized datasets. It is open source, commercially usable, and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a Data Scientist's toolkit for machine learning on incoming datasets.
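Nearly all of these algorithms are exposed through the same estimator interface: construct a model object, call fit on training data, then call predict (or transform) on new data. A minimal sketch of that pattern (the particular estimator and dataset are only illustrative):

In [ ]:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Any estimator follows the same construct / fit / predict pattern
iris  = load_iris()
model = KNeighborsClassifier(n_neighbors=3)
model.fit(iris.data, iris.target)          # learn from labeled examples
print model.predict(iris.data[:5])         # predict labels for new observations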

The purpose of this notebook is to serve as an introduction to Machine Learning with Scikit-Learn. We will explore several clustering, classification, and regression algorithms. In particular, we will structure our machine learning models as though we were producing a data product: an actionable model that can be used in larger programs or algorithms, rather than simply a research or investigation methodology. For more on Scikit-Learn see: Six Reasons why I recommend Scikit-Learn (O'Reilly Radar).


In [2]:
%matplotlib inline

# Things we'll need later
import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
from sklearn.metrics import classification_report
from sklearn import cross_validation as cv

# Load the example datasets
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_digits
from sklearn.datasets import load_linnerud

# Boston house prices dataset (reals, regression)
boston = load_boston()
print "Boston: %i samples %i features" % boston.data.shape

# Iris flower dataset (reals, multi-class classification)
iris   = load_iris()
print "Iris: %i samples %i features" % iris.data.shape

# Diabetes dataset (reals, regression)
diabetes = load_diabetes()
print "Diabetes: %i samples %i features" % diabetes.data.shape

# Hand-written digit dataset (multi-class classification)
digits = load_digits()
print "Digits: %i samples %i features" % digits.data.shape

# Linnerud physiological and exercise dataset (multivariate regression)
linnerud = load_linnerud()
print "Linnerud: %i samples %i features" % linnerud.data.shape


Boston: 506 samples 13 features
Iris: 150 samples 4 features
Diabetes: 442 samples 10 features
Digits: 1797 samples 64 features
Linnerud: 20 samples 3 features

The datasets that come with Scikit-Learn demonstrate the properties of classification and regression algorithms, as well as how the data should fit. They are also small, so it is easy to train models that work. As such they are ideal for pedagogical purposes. The datasets module also contains functions for fetching data from the mldata.org repository as well as for generating random data.
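The remote and synthetic loaders follow the same pattern. A minimal sketch (note that fetch_mldata downloads MNIST to a local cache on first use, and the make_regression parameters are purely illustrative):

In [ ]:
from sklearn.datasets import fetch_mldata
from sklearn.datasets import make_regression

# Fetch a dataset from the mldata.org repository (cached locally after the first download)
mnist = fetch_mldata('MNIST original')
print "MNIST: %i samples %i features" % mnist.data.shape

# Generate a random regression problem with known structure
X, y = make_regression(n_samples=200, n_features=10, noise=0.5)
print "Synthetic: %i samples %i features" % X.shape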


In [3]:
import pandas as pd
from pandas.tools.plotting import scatter_matrix, radviz

df = pd.DataFrame(iris.data)
df.columns = iris.feature_names

# radviz expects the class labels to be a named column of the frame
df['species'] = iris.target

cmap = {0:'r', 1:'b', 2:'g'}
colors = [cmap[t] for t in iris.target]
# fig = scatter_matrix(df, alpha=0.4, figsize=(16, 10), diagonal='kde', color=colors)
fig = radviz(df, 'species')



In [4]:
df = pd.DataFrame(diabetes.data)
fig = scatter_matrix(df, alpha=0.2, figsize=(16, 10), diagonal='kde')



In [5]:
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()



Regressions

Regression is a type of supervised learning problem where, given continuous input data, the objective is to fit a function that can predict a continuous target value from the input features.

Linear Regression

Linear regression fits a linear model (a line in two dimensions) to the data.


In [12]:
from sklearn.linear_model import LinearRegression

# Fit regression to diabetes dataset
model = LinearRegression()
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)


Mean Squared Error: 2859.690
Coefficient of Determination: 0.518

Perceptron

A primitive neural network that learns a weight for each input feature and passes the weighted sum through a threshold to make a prediction. Note that Scikit-Learn's Perceptron is a classifier, so here it treats each distinct target value as its own class; the poor fit below is expected.


In [13]:
from sklearn.linear_model import Perceptron

model = Perceptron()
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)


Mean Squared Error: 8815.502
Coefficient of Determination: -0.487

k-Nearest Neighbor Regression

Makes predictions by locating the most similar cases and averaging their target values.


In [14]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor()
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)


Mean Squared Error: 2342.556
Coefficient of Determination: 0.605

Classification and Regression Trees (CART)

Recursively splits the data on the feature values that best separate it with respect to the target being predicted. Because we evaluate on the same data we trained on, a fully grown tree can memorize the training set, which is why the scores below are perfect.


In [15]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)


Mean Squared Error: 0.000
Coefficient of Determination: 1.000
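A perfect score on the training data says nothing about generalization; a quick way to get an honest estimate is to cross-validate instead. A minimal sketch using the cross_validation module already imported above as cv:

In [ ]:
# Hold data out via 5-fold cross validation for an honest R^2 estimate
scores = cv.cross_val_score(DecisionTreeRegressor(), diabetes.data, diabetes.target, cv=5, scoring='r2')
print "Cross-validated R^2: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std())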

Random Forest

Random forest is an ensemble method that creates a number of decision trees using the CART algorithm, each on a different subset of the data. The general approach to creating the ensemble is bootstrap aggregation of the decision trees (bagging).


In [ ]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)

AdaBoost

Adaptive Boosting (AdaBoost) is an ensemble method that combines the predictions made by multiple decision trees. Additional models are added and trained with more weight on the instances that previous models predicted poorly (boosting).


In [19]:
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor()
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)


Mean Squared Error: 2288.682
Coefficient of Determination: 0.614

Support Vector Machines

Support Vector Regression (SVR) uses the SVM algorithm (transforming the problem space into higher dimensions in order to use kernel methods) to fit a linear function in that transformed space.


In [18]:
from sklearn.svm import SVR

model = SVR()
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)


Mean Squared Error: 6024.465
Coefficient of Determination: -0.016
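SVR's default parameters (an RBF kernel with C=1.0 and epsilon=0.1) are rarely a good match out of the box, particularly when the target spans a large range, so the kernel and regularization parameters usually need tuning. A minimal sketch with an illustrative, untuned choice of C:

In [ ]:
model = SVR(kernel='rbf', C=100.0)   # C chosen purely for illustration
model.fit(diabetes.data, diabetes.target)
predicted = model.predict(diabetes.data)

print "Coefficient of Determination: %0.3f" % r2_score(diabetes.target, predicted)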

Regularization

Regularization methods reduce the overfitting of a model by penalizing its complexity. They are usually demonstrated on regression algorithms, which is why they are included in this section.

Ridge Regression

Also known as Tikhonov regularization, ridge regression penalizes a least squares regression model on the squared magnitude of the coefficients (the L2 norm).
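Roughly (up to Scikit-Learn's scaling conventions), ridge regression minimizes:

$$\min_w \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

where larger values of alpha shrink the coefficients more aggressively.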


In [16]:
from sklearn.linear_model import Ridge

model = Ridge(alpha=0.1)
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)


Mean Squared Error: 2890.445
Coefficient of Determination: 0.513

LASSO

Least Absolute Shrinkage and Selection Operator (LASSO) penalizes the least squares regression on the absolute magnitude of the coefficients (the L1 norm).
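The corresponding objective (with Scikit-Learn's 1/(2n) scaling of the squared error) swaps the squared L2 penalty for an L1 penalty, which is what drives some coefficients exactly to zero:

$$\min_w \; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1$$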


In [17]:
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(diabetes.data, diabetes.target)

expected  = diabetes.target
predicted = model.predict(diabetes.data)

# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)


Mean Squared Error: 2912.522
Coefficient of Determination: 0.509
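The alpha=0.1 used above was chosen arbitrarily; Scikit-Learn also ships cross-validated variants that pick the penalty strength for you. A minimal sketch (the alpha grid is just an illustrative choice):

In [ ]:
from sklearn.linear_model import RidgeCV, LassoCV

alphas = np.logspace(-3, 1, 20)   # illustrative grid of penalty strengths

ridge = RidgeCV(alphas=alphas).fit(diabetes.data, diabetes.target)
lasso = LassoCV(alphas=alphas).fit(diabetes.data, diabetes.target)

print "Ridge alpha: %0.4f" % ridge.alpha_
print "Lasso alpha: %0.4f" % lasso.alpha_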

Classification

Classification is a supervised machine learning problem where, given labeled input data (with two or more labels), the task is to fit a function that can predict the discrete class of input data.

Logistic Regression

Fits a logistic model to the data and predicts the probability of a categorical outcome as a value between 0 and 1. A single logistic regression only separates two classes, so to classify multiple classes a one-vs-all scheme is used (one model per class, winner takes all).


In [20]:
from sklearn.linear_model import LogisticRegression

splits     = cv.train_test_split(iris.data, iris.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits

model      = LogisticRegression()
model.fit(X_train, y_train)

expected   = y_test
predicted  = model.predict(X_test)

print classification_report(expected, predicted)


             precision    recall  f1-score   support

          0       1.00      1.00      1.00        12
          1       0.75      0.75      0.75         4
          2       0.93      0.93      0.93        14

avg / total       0.93      0.93      0.93        30

LDA

Linear Discriminant Analysis (LDA) fits a class-conditional probability density (a Gaussian) to the attributes of each class. The resulting discrimination function is linear.


In [21]:
from sklearn.lda import LDA

splits     = cv.train_test_split(digits.data, digits.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits

model      = LDA()
model.fit(X_train, y_train)

expected   = y_test
predicted  = model.predict(X_test)

print classification_report(expected, predicted)


             precision    recall  f1-score   support

          0       1.00      0.97      0.99        37
          1       0.94      0.96      0.95        47
          2       0.97      1.00      0.98        29
          3       0.97      0.95      0.96        37
          4       0.98      0.96      0.97        49
          5       0.96      0.93      0.95        28
          6       0.97      0.97      0.97        30
          7       0.93      0.93      0.93        29
          8       0.89      0.91      0.90        43
          9       0.84      0.87      0.86        31

avg / total       0.95      0.94      0.94       360

/usr/local/lib/python2.7/site-packages/sklearn/lda.py:161: UserWarning: Variables are collinear
  warnings.warn("Variables are collinear")

Naive Bayes

Uses Bayes' theorem (with a naive independence assumption between attributes) to model the conditional relationship of each attribute to the class.
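Concretely, under the naive independence assumption the class posterior factors as

$$P(c \mid x_1, \ldots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

and the predicted class is the one that maximizes this product; GaussianNB models each $P(x_i \mid c)$ as a Gaussian.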


In [23]:
from sklearn.naive_bayes import GaussianNB

splits     = cv.train_test_split(iris.data, iris.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits

model      = GaussianNB()
model.fit(X_train, y_train)

expected   = y_test
predicted  = model.predict(X_test)

print classification_report(expected, predicted)


             precision    recall  f1-score   support

          0       1.00      1.00      1.00         7
          1       0.93      1.00      0.96        13
          2       1.00      0.90      0.95        10

avg / total       0.97      0.97      0.97        30

k-Nearest Neighbor

Makes predictions by locating similar instances via a distance or similarity function and taking a majority vote among the most similar.


In [29]:
from sklearn.neighbors import KNeighborsClassifier

splits     = cv.train_test_split(digits.data, digits.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits

model      = KNeighborsClassifier(n_neighbors=20)
model.fit(X_train, y_train)

expected   = y_test
predicted  = model.predict(X_test)

print classification_report(expected, predicted)



Decision Trees

Decision trees use the CART algorithm to make predictions, recursively splitting on the feature values that best separate the classes.


In [31]:
from sklearn.tree import DecisionTreeClassifier

splits     = cv.train_test_split(iris.data, iris.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits

model      = DecisionTreeClassifier()
model.fit(X_train, y_train)

expected   = y_test
predicted  = model.predict(X_test)

print classification_report(expected, predicted)



SVMs

Support Vector Machines (SVMs) find the points (support vectors) in a transformed problem space that best separate the classes into groups.


In [ ]:
from sklearn.svm import SVC

kernels = ['linear', 'poly', 'rbf']

splits     = cv.train_test_split(digits.data, digits.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits

for kernel in kernels:
    if kernel != 'poly':
        model      = SVC(kernel=kernel)
    else:
        model      = SVC(kernel=kernel, degree=3)
        
    model.fit(X_train, y_train)
    expected   = y_test
    predicted  = model.predict(X_test)

    print classification_report(expected, predicted)

Random Forest

Random Forest is an ensemble of decision trees on different subsets of the dataset. The ensemble is created by bootstrap aggregation (bagging).


In [ ]:
from sklearn.ensemble import RandomForestClassifier

splits     = cv.train_test_split(digits.data, digits.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits

model      = RandomForestClassifier()
model.fit(X_train, y_train)

expected   = y_test
predicted  = model.predict(X_test)

print classification_report(expected, predicted)

Clustering

Clustering algorithms attempt to find patterns in unlabeled data. They are usually grouped into two main categories: centroidal (find the centers of clusters) and hierarchical (find clusters of clusters).

In order to explore clustering, we'll have to generate some fake datasets to use.


In [6]:
from sklearn.datasets import make_circles
from sklearn.datasets import make_moons
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

N = 1000 # Number of samples in each cluster

# Some colors for later
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)

circles = make_circles(n_samples=N, factor=.5, noise=.05)
moons   = make_moons(n_samples=N, noise=.08)
blobs   = make_blobs(n_samples=N, random_state=9)
noise   = np.random.rand(N, 2), None

# Let's see what the data looks like!
fig, axe = plt.subplots(figsize=(18, 4))
for idx, dataset in enumerate((circles, moons, blobs, noise)):
    X, y = dataset
    X = StandardScaler().fit_transform(X)
    
    plt.subplot(1,4,idx+1)
    plt.scatter(X[:,0], X[:,1], marker='.')

    plt.xticks(())
    plt.yticks(())
    plt.ylabel('$x_1$')
    plt.xlabel('$x_0$')

plt.show()




K-Means Clustering

Partitions N samples into k clusters, where each sample belongs to the cluster with the nearest mean (the centroid). The exact problem is NP-hard, but there are good approximation algorithms.


In [8]:
from sklearn.cluster import MiniBatchKMeans

fig, axe = plt.subplots(figsize=(18, 4))
for idx, dataset in enumerate((circles, moons, blobs, noise)):
    X, y = dataset
    X = StandardScaler().fit_transform(X)
    
    # Fit the model with our algorithm
    model = MiniBatchKMeans(n_clusters=5)
    model.fit(X)
    
    # Make Predictions
    predictions = model.predict(X)
    
    # Select this dataset's panel and plot the points colored by predicted cluster
    plt.subplot(1,4,idx+1)
    plt.scatter(X[:, 0], X[:, 1], color=colors[predictions].tolist(), s=10)

    # Overlay the cluster centers
    centers = model.cluster_centers_
    center_colors = colors[:len(centers)]
    plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)

    plt.xticks(())
    plt.yticks(())
    plt.ylabel('$x_1$')
    plt.xlabel('$x_0$')

plt.show()


Affinity Propagation

Clustering based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Like k-medoids, it finds "exemplars": members of the input set that are representative of the clusters.


In [14]:
from sklearn.cluster import AffinityPropagation


fig, axe = plt.subplots(figsize=(18, 4))
for idx, dataset in enumerate((circles, moons, blobs, noise)):
    X, y = dataset
    X = StandardScaler().fit_transform(X)
    
    # Fit the model with our algorithm
    model = AffinityPropagation(damping=.92, preference=-200)
    model.fit(X)
    
    # Make Predictions
    predictions = model.predict(X)
    
    # Select this dataset's panel and plot the points colored by predicted cluster
    plt.subplot(1,4,idx+1)
    plt.scatter(X[:, 0], X[:, 1], color=colors[predictions].tolist(), s=10)

    # Overlay the exemplars found by affinity propagation
    centers = model.cluster_centers_
    center_colors = colors[:len(centers)]
    plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)

    plt.xticks(())
    plt.yticks(())
    plt.ylabel('$x_1$')
    plt.xlabel('$x_0$')

plt.show()