First, the libraries


In [6]:
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import sklearn.datasets as datasets
import seaborn as sns
%matplotlib inline

Let's create some toy data

Create several "blobs"

Remember the scikit-learn function datasets.make_blobs(). Also try:

centers = [[1, 1], [-1, -1], [1, -1]]
X,Y = datasets.make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

In [71]:
centers = [[1, 1], [-1, -1], [1, -1]]
X,Y = datasets.make_blobs(n_samples=1000, centers=centers, cluster_std=0.6)
plt.scatter(X[:,0],X[:,1],c=Y)
plt.jet()


Now let's create a tree model

We can use DecisionTreeClassifier as the classifier


In [21]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

What parameters and functions does the classifier have?

Hint: use help(thing)!


In [22]:
help(clf)


Help on DecisionTreeClassifier in module sklearn.tree.tree object:

class DecisionTreeClassifier(BaseDecisionTree, sklearn.base.ClassifierMixin)
 |  A decision tree classifier.
 |  
 |  Parameters
 |  ----------
 |  criterion : string, optional (default="gini")
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "entropy" for the information gain.
 |  
 |  splitter : string, optional (default="best")
 |      The strategy used to choose the split at each node. Supported
 |      strategies are "best" to choose the best split and "random" to choose
 |      the best random split.
 |  
 |  max_features : int, float, string or None, optional (default=None)
 |      The number of features to consider when looking for the best split:
 |        - If int, then consider `max_features` features at each split.
 |        - If float, then `max_features` is a percentage and
 |          `int(max_features * n_features)` features are considered at each
 |          split.
 |        - If "auto", then `max_features=sqrt(n_features)`.
 |        - If "sqrt", then `max_features=sqrt(n_features)`.
 |        - If "log2", then `max_features=log2(n_features)`.
 |        - If None, then `max_features=n_features`.
 |  
 |      Note: the search for a split does not stop until at least one
 |      valid partition of the node samples is found, even if it requires to
 |      effectively inspect more than ``max_features`` features.
 |  
 |  max_depth : int or None, optional (default=None)
 |      The maximum depth of the tree. If None, then nodes are expanded until
 |      all leaves are pure or until all leaves contain less than
 |      min_samples_split samples.
 |      Ignored if ``max_leaf_nodes`` is not None.
 |  
 |  min_samples_split : int, optional (default=2)
 |      The minimum number of samples required to split an internal node.
 |  
 |  min_samples_leaf : int, optional (default=1)
 |      The minimum number of samples required to be at a leaf node.
 |  
 |  min_weight_fraction_leaf : float, optional (default=0.)
 |      The minimum weighted fraction of the input samples required to be at a
 |      leaf node.
 |  
 |  max_leaf_nodes : int or None, optional (default=None)
 |      Grow a tree with ``max_leaf_nodes`` in best-first fashion.
 |      Best nodes are defined as relative reduction in impurity.
 |      If None then unlimited number of leaf nodes.
 |      If not None then ``max_depth`` will be ignored.
 |  
 |  class_weight : dict, list of dicts, "auto" or None, optional (default=None)
 |      Weights associated with classes in the form ``{class_label: weight}``.
 |      If not given, all classes are supposed to have weight one. For
 |      multi-output problems, a list of dicts can be provided in the same
 |      order as the columns of y.
 |  
 |      The "auto" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data.
 |  
 |      For multi-output, the weights of each column of y will be multiplied.
 |  
 |      Note that these weights will be multiplied with sample_weight (passed
 |      through the fit method) if sample_weight is specified.
 |  
 |  random_state : int, RandomState instance or None, optional (default=None)
 |      If int, random_state is the seed used by the random number generator;
 |      If RandomState instance, random_state is the random number generator;
 |      If None, the random number generator is the RandomState instance used
 |      by `np.random`.
 |  
 |  Attributes
 |  ----------
 |  tree_ : Tree object
 |      The underlying Tree object.
 |  
 |  max_features_ : int,
 |      The inferred value of max_features.
 |  
 |  classes_ : array of shape = [n_classes] or a list of such arrays
 |      The classes labels (single output problem),
 |      or a list of arrays of class labels (multi-output problem).
 |  
 |  n_classes_ : int or list
 |      The number of classes (for single output problems),
 |      or a list containing the number of classes for each
 |      output (for multi-output problems).
 |  
 |  feature_importances_ : array of shape = [n_features]
 |      The feature importances. The higher, the more important the
 |      feature. The importance of a feature is computed as the (normalized)
 |      total reduction of the criterion brought by that feature.  It is also
 |      known as the Gini importance [4]_.
 |  
 |  See also
 |  --------
 |  DecisionTreeRegressor
 |  
 |  References
 |  ----------
 |  
 |  .. [1] http://en.wikipedia.org/wiki/Decision_tree_learning
 |  
 |  .. [2] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification
 |         and Regression Trees", Wadsworth, Belmont, CA, 1984.
 |  
 |  .. [3] T. Hastie, R. Tibshirani and J. Friedman. "Elements of Statistical
 |         Learning", Springer, 2009.
 |  
 |  .. [4] L. Breiman, and A. Cutler, "Random Forests",
 |         http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
 |  
 |  Examples
 |  --------
 |  >>> from sklearn.datasets import load_iris
 |  >>> from sklearn.cross_validation import cross_val_score
 |  >>> from sklearn.tree import DecisionTreeClassifier
 |  >>> clf = DecisionTreeClassifier(random_state=0)
 |  >>> iris = load_iris()
 |  >>> cross_val_score(clf, iris.data, iris.target, cv=10)
 |  ...                             # doctest: +SKIP
 |  ...
 |  array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
 |          0.93...,  0.93...,  1.     ,  0.93...,  1.      ])
 |  
 |  Method resolution order:
 |      DecisionTreeClassifier
 |      BaseDecisionTree
 |      abc.NewBase
 |      sklearn.base.BaseEstimator
 |      sklearn.feature_selection.from_model._LearntSelectorMixin
 |      sklearn.base.TransformerMixin
 |      sklearn.base.ClassifierMixin
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, class_weight=None)
 |  
 |  predict_log_proba(self, X)
 |      Predict class log-probabilities of the input samples X.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix of shape = [n_samples, n_features]
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      Returns
 |      -------
 |      p : array of shape = [n_samples, n_classes], or a list of n_outputs
 |          such arrays if n_outputs > 1.
 |          The class log-probabilities of the input samples. The order of the
 |          classes corresponds to that in the attribute `classes_`.
 |  
 |  predict_proba(self, X, check_input=True)
 |      Predict class probabilities of the input samples X.
 |      
 |      The predicted class probability is the fraction of samples of the same
 |      class in a leaf.
 |      
 |      check_input : boolean, (default=True)
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you do.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix of shape = [n_samples, n_features]
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      Returns
 |      -------
 |      p : array of shape = [n_samples, n_classes], or a list of n_outputs
 |          such arrays if n_outputs > 1.
 |          The class probabilities of the input samples. The order of the
 |          classes corresponds to that in the attribute `classes_`.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset([])
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from BaseDecisionTree:
 |  
 |  fit(self, X, y, sample_weight=None, check_input=True)
 |      Build a decision tree from the training set (X, y).
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix, shape = [n_samples, n_features]
 |          The training input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csc_matrix``.
 |      
 |      y : array-like, shape = [n_samples] or [n_samples, n_outputs]
 |          The target values (class labels in classification, real numbers in
 |          regression). In the regression case, use ``dtype=np.float64`` and
 |          ``order='C'`` for maximum efficiency.
 |      
 |      sample_weight : array-like, shape = [n_samples] or None
 |          Sample weights. If None, then samples are equally weighted. Splits
 |          that would create child nodes with net zero or negative weight are
 |          ignored while searching for a split in each node. In the case of
 |          classification, splits are also ignored if they would result in any
 |          single class carrying a negative weight in either child node.
 |      
 |      check_input : boolean, (default=True)
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you do.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Returns self.
 |  
 |  predict(self, X, check_input=True)
 |      Predict class or regression value for X.
 |      
 |      For a classification model, the predicted class for each sample in X is
 |      returned. For a regression model, the predicted value based on X is
 |      returned.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix of shape = [n_samples, n_features]
 |          The input samples. Internally, it will be converted to
 |          ``dtype=np.float32`` and if a sparse matrix is provided
 |          to a sparse ``csr_matrix``.
 |      
 |      check_input : boolean, (default=True)
 |          Allow to bypass several input checking.
 |          Don't use this parameter unless you know what you do.
 |      
 |      Returns
 |      -------
 |      y : array of shape = [n_samples] or [n_samples, n_outputs]
 |          The predicted classes, or the predict values.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from BaseDecisionTree:
 |  
 |  feature_importances_
 |      Return the feature importances.
 |      
 |      The importance of a feature is computed as the (normalized) total
 |      reduction of the criterion brought by that feature.
 |      It is also known as the Gini importance.
 |      
 |      Returns
 |      -------
 |      feature_importances_ : array, shape = [n_features]
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __repr__(self)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep: boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The former have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.feature_selection.from_model._LearntSelectorMixin:
 |  
 |  transform(self, X, threshold=None)
 |      Reduce X to its most important features.
 |      
 |      Uses ``coef_`` or ``feature_importances_`` to determine the most
 |      important features.  For models with a ``coef_`` for each class, the
 |      absolute sum over the classes is used.
 |      
 |      Parameters
 |      ----------
 |      X : array or scipy sparse matrix of shape [n_samples, n_features]
 |          The input samples.
 |      
 |      threshold : string, float or None, optional (default=None)
 |          The threshold value to use for feature selection. Features whose
 |          importance is greater or equal are kept while the others are
 |          discarded. If "median" (resp. "mean"), then the threshold value is
 |          the median (resp. the mean) of the feature importances. A scaling
 |          factor (e.g., "1.25*mean") may also be used. If None and if
 |          available, the object attribute ``threshold`` is used. Otherwise,
 |          "mean" is used by default.
 |      
 |      Returns
 |      -------
 |      X_r : array of shape [n_samples, n_selected_features]
 |          The input samples with only the selected features.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.TransformerMixin:
 |  
 |  fit_transform(self, X, y=None, **fit_params)
 |      Fit to data, then transform it.
 |      
 |      Fits transformer to X and y with optional parameters fit_params
 |      and returns a transformed version of X.
 |      
 |      Parameters
 |      ----------
 |      X : numpy array of shape [n_samples, n_features]
 |          Training set.
 |      
 |      y : numpy array of shape [n_samples]
 |          Target values.
 |      
 |      Returns
 |      -------
 |      X_new : numpy array of shape [n_samples, n_features_new]
 |          Transformed array.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.

Let's fit our model with fit and get its score with score


In [26]:
clf.fit(X,Y)
clf.score(X,Y)


Out[26]:
1.0

In [28]:
clf = DecisionTreeClassifier()
clf.fit(X[:1000],Y[:1000])
clf.score(X,Y)*100


Out[28]:
90.450000000000003

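Why don't we want 100%?

This problem is called "overfitting". A perfect score on the same data the model was trained on usually means the tree has memorized those samples, noise included, so it scores lower on data it has not seen.

Steps for a typical ML algorithm:

  • Create a model
  • Partition your data into different pieces (10% for training and 90% for testing)
  • Train your model on each piece of the data
  • Pick the best model or the average of the models
  • Predict!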
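To make the gap concrete, here is a minimal sketch that compares the score on the training data with the score on held-out data. The names X_demo, Y_demo, deep_tree, the 50/50 split and random_state=0 are assumptions added for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same kind of toy data as above; random_state is an added assumption for repeatability.
centers = [[1, 1], [-1, -1], [1, -1]]
X_demo, Y_demo = make_blobs(n_samples=1000, centers=centers,
                            cluster_std=0.6, random_state=0)

# Hold out half of the samples so the tree never sees them while fitting.
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo,
                                          test_size=0.5, random_state=0)

# An unrestricted tree keeps splitting until every leaf is pure.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, Y_tr)

train_score = deep_tree.score(X_tr, Y_tr)
test_score = deep_tree.score(X_te, Y_te)
print("accuracy on training data:", train_score)
print("accuracy on held-out data:", test_score)
```

The training score is the one that looks perfect; the held-out score is the honest estimate of how the model behaves on new points.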

First we'll partition the data using train_test_split


In [33]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
X_train,X_test, Y_train, Y_test= train_test_split(X,Y,test_size=0.90)

What are the sizes of these new datasets?
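One way to answer the question is to print the shape of each piece. This is a small sketch on freshly generated blobs; the names ending in _demo and random_state=0 are assumptions added so the example runs on its own.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

centers = [[1, 1], [-1, -1], [1, -1]]
X_demo, Y_demo = make_blobs(n_samples=1000, centers=centers,
                            cluster_std=0.6, random_state=0)

# test_size=0.90 keeps 10% of the rows for training and 90% for testing.
X_train_demo, X_test_demo, Y_train_demo, Y_test_demo = train_test_split(
    X_demo, Y_demo, test_size=0.90, random_state=0)

for name, arr in [("X_train", X_train_demo), ("X_test", X_test_demo),
                  ("Y_train", Y_train_demo), ("Y_test", Y_test_demo)]:
    print(name, arr.shape)
```

With 1000 samples this split leaves 100 rows for training and 900 for testing.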


In [36]:
plt.scatter(X_test[:,0],X_test[:,1],c=Y_test)


Out[36]:
<matplotlib.collections.PathCollection at 0x7f7644432810>

And now we train our model and check the error


In [49]:
clf = DecisionTreeClassifier()
clf.fit(X_train,Y_train)
clf.score(X_test,Y_test)*100


Out[49]:
89.900000000000006

What does our model look like?

What was most important for making a decision?

How can we improve and control how we split our data?


In [39]:
clf.feature_importances_


Out[39]:
array([ 0.46621362,  0.53378638])
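To look at the model itself, one option is to print the learned rules as text. The sketch below assumes a scikit-learn version that provides sklearn.tree.export_text (0.21 or later); the feature names "x0" and "x1", the max_depth=2 limit and the _demo variables are assumptions chosen to keep the printout short.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

centers = [[1, 1], [-1, -1], [1, -1]]
X_demo, Y_demo = make_blobs(n_samples=1000, centers=centers,
                            cluster_std=0.6, random_state=0)

# max_depth caps how many consecutive splits the tree may make,
# which is one way to control how the data gets divided.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(X_demo, Y_demo)

# Each line of the text is a threshold test on one feature or a leaf with its class.
rules = export_text(small_tree, feature_names=["x0", "x1"])
print(rules)

# Importances are normalized, so they add up to 1 across the features.
print(small_tree.feature_importances_)
```

Parameters such as max_depth, min_samples_split and min_samples_leaf from the help text above all limit how far the tree keeps dividing the data.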

Cross-validation and K-fold

And the best part is that we can do it all in one go with scikit-learn!

We need to use cross_val_score


In [50]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score

resultados = cross_val_score(clf,X,Y, cv=10)

In [51]:
np.mean(resultados)


Out[51]:
1.0
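For reference, what cross_val_score does with cv=10 can be sketched by hand. For a classifier, an integer cv is split with StratifiedKFold, so the loop below uses that class; the _demo data, the variable names and random_state=0 are assumptions added so the sketch is repeatable.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

centers = [[1, 1], [-1, -1], [1, -1]]
X_demo, Y_demo = make_blobs(n_samples=1000, centers=centers,
                            cluster_std=0.6, random_state=0)

model = DecisionTreeClassifier(random_state=0)
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, Y_demo):
    # Fit a fresh copy on 9 folds, then score it on the fold left out.
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], Y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], Y_demo[test_idx]))

# The library call runs the same loop internally.
library_scores = cross_val_score(model, X_demo, Y_demo, cv=folds)

print("manual mean: ", np.mean(manual_scores))
print("library mean:", np.mean(library_scores))
```

Every sample is used for testing exactly once, so the mean of the ten scores is a steadier estimate than a single train/test split.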


And how can we improve a decision tree?

RandomForestClassifier(n_estimators=n_estimators) to the rescue!


In [66]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
print(clf)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Let's try it!


In [61]:
resultados = cross_val_score(clf,X,Y, cv=10)

Did it improve?


In [62]:
resultados.mean()


Out[62]:
0.9247007642372912

But now we have a new parameter: how many trees do we want to use?


How about we try a for loop? And check the error as a function of the number of trees?

Activity!

We need to:

  • Define the range of tree counts to try in an array
  • Write a for loop over this array
    • For each element, train a forest and compute its score
    • Save the score in a list
    • Plot it!

In [74]:
ks=[2,3,5,8,10,12,15,18,20,25,30,35,40,45,50]
scores=[]
for i in ks:
    clf = RandomForestClassifier(n_estimators=i)
    resultados = cross_val_score(clf,X,Y, cv=10)
    scores.append( np.mean(resultados) )

plt.plot(ks,scores)


Out[74]:
[<matplotlib.lines.Line2D at 0x7f763fa67ad0>]

The Iris dataset

A multi-dimensional model


In [ ]:
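# PairGrid needs a DataFrame with a "species" column; seaborn's copy of Iris
# provides one (load_dataset fetches the data on first use).
iris_df = sns.load_dataset("iris")
g = sns.PairGrid(iris_df, hue="species")
g = g.map(plt.scatter)
g = g.add_legend()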

Activity:

Goal: train a tree to predict the species of the plant

  • Look at the plots: which variables might be most important?
  • Grab the data: what are its dimensions?
  • Break it into pieces and train your models
  • What scores do you get? What turned out to be important?

In [ ]:
iris = datasets.load_iris()
X = iris.data
Y = iris.target
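One possible way to work through the activity is sketched below; it is not the only solution. The 30% test split, stratify, random_state=0 and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_data = load_iris()
X_iris, Y_iris = iris_data.data, iris_data.target

# Dimensions of the data: (number of flowers, number of measurements).
print(X_iris.shape)

# Hold out part of the data; stratify keeps the three species balanced.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_iris, Y_iris, test_size=0.3, random_state=0, stratify=Y_iris)

iris_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, Y_tr)
test_accuracy = iris_tree.score(X_te, Y_te)
print("accuracy on held-out flowers:", test_accuracy)

# Pair each measurement with the importance the tree gave it.
for name, importance in zip(iris_data.feature_names,
                            iris_tree.feature_importances_):
    print(name, round(importance, 3))
```

Comparing the printed importances with the pair plots above is a quick check on whether the variables that looked most separable are the ones the tree actually used.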
