In [1]:
%pylab inline  # pylab mode pulls the numpy/matplotlib names used below (np, meshgrid, c_, cm, figsize, ...) into the namespace
from sklearn.datasets import make_moons, make_circles, make_classification

# A linearly separable dataset with some uniform noise added on top
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)

# Three 2-D binary datasets: two interleaving moons, concentric circles,
# and the noisy linearly separable dataset built above
datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            (X, y)]

figsize(14, 5)

In [2]:
def plot_classification(name, clf, X, y, cmap):
    score = clf.score(X, y)

    h = 0.2  # step size of the mesh grid
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = meshgrid(arange(x_min, x_max, h), arange(y_min, y_max, h))    
    
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    if hasattr(clf, "decision_function"):
        Z = clf.decision_function(c_[xx.ravel(), yy.ravel()])
    else:
        Z = clf.predict_proba(c_[xx.ravel(), yy.ravel()])[:, 1]    

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    contourf(xx, yy, Z, cmap=cmap, alpha=.8)

    scatter(X[:, 0], X[:, 1], c=y, cmap=cm.Greys)
    xlim(xx.min(), xx.max())
    ylim(yy.min(), yy.max())
    title(name + " - Score %.2f" % score)

In [3]:
def plot_multi_class(name, clf, X, y, cmap=cm.PRGn):
    score = clf.score(X, y)

    h = 0.2  # step size of the mesh grid
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = meshgrid(arange(x_min, x_max, h), arange(y_min, y_max, h))    
    
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = clf.predict(c_[xx.ravel(), yy.ravel()])
    
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    contourf(xx, yy, Z, cmap=cmap, alpha=.8)

    scatter(X[:, 0], X[:, 1], c=y, cmap=cm.Greys)
    xlim(xx.min(), xx.max())
    ylim(yy.min(), yy.max())
    title(name + " - Score %.2f" % score)

Classification

Problem

Consider the following data


In [4]:
figsize(14, 5)
for i, (X, y) in enumerate(datasets):
    subplot(1,3,i+1)
    scatter(X[:, 0], X[:, 1], c=y, cmap=cm.Greys)


We would like to build a classifier that properly separates the two classes and correctly classifies new inputs.

The $K$ nearest neighbors


In [5]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

In [6]:
X, y = datasets[0]
knn.fit(X, y) 
y_e = knn.predict(X)

In [7]:
figsize(8,8)
plot_classification('K Neighbors', knn, X, y, cm.PRGn)



In [8]:
figsize(15, 5)
for dataset_idx, (X, y) in enumerate(datasets):
    subplot(1, 3, dataset_idx+1)
    knn.fit(X, y) 
    plot_classification('K Neighbors', knn, X, y, cm.PRGn)
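
The scores above are computed on the same points used for fitting. As a minimal sketch (not part of the original notebook), the `n_neighbors` parameter (default 5) controls how smooth the boundary is: small values hug individual training points, larger values average over more neighbors.

In [ ]:
# Hypothetical follow-up cell: vary n_neighbors on the moons dataset,
# reusing the plot_classification helper defined above.
X, y = datasets[0]
for i, k in enumerate([1, 5, 15]):
    subplot(1, 3, i + 1)
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X, y)
    plot_classification('K Neighbors (k=%d)' % k, knn_k, X, y, cm.PRGn)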


Support vector machines (SVM)


In [9]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')
X, y = datasets[0]
svc.fit(X, y)


Out[9]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [10]:
figsize(8,8)
plot_classification('SVC linear', svc, X, y, cm.PRGn)
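
As an aside (a sketch, not in the original notebook), a fitted `SVC` stores the training points that define the margin in its `support_vectors_` attribute; overlaying them shows that only a subset of the data determines the linear boundary.

In [ ]:
# Hypothetical cell: circle the support vectors of the linear SVC fitted above.
plot_classification('SVC linear', svc, X, y, cm.PRGn)
scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1],
        s=100, facecolors='none', edgecolors='r')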



In [11]:
figsize(15, 5)
for dataset_idx, (X, y) in enumerate(datasets):
    subplot(1, 3, dataset_idx+1)
    svc.fit(X, y)
    plot_classification('SVC linear', svc, X, y, cm.PRGn)



In [12]:
svc = SVC(kernel='poly', degree=3)

for dataset_idx, (X, y) in enumerate(datasets):
    subplot(1, 3, dataset_idx+1)
    svc.fit(X, y) 
    plot_classification('SVC Polynomial', svc, X, y, cm.PRGn)



In [13]:
svc = SVC(kernel='rbf')

for dataset_idx, (X, y) in enumerate(datasets):
    subplot(1, 3, dataset_idx+1)
    svc.fit(X, y) 
    plot_classification('SVC RBF', svc, X, y, cm.PRGn)
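
The RBF kernel has two main hyperparameters: `gamma` (the kernel width) and `C` (the regularization strength). As a minimal sketch, not part of the original notebook, larger `gamma` values make the boundary wrap more tightly around individual training points.

In [ ]:
# Hypothetical cell: same moons dataset, increasing gamma from left to right.
X, y = datasets[0]
for i, g in enumerate([0.1, 1.0, 10.0]):
    subplot(1, 3, i + 1)
    svc_g = SVC(kernel='rbf', gamma=g, C=1.0)
    svc_g.fit(X, y)
    plot_classification('SVC RBF (gamma=%g)' % g, svc_g, X, y, cm.PRGn)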


Exercise - Iris


In [14]:
from sklearn.datasets import load_iris
iris = load_iris()

In [15]:
print(iris.DESCR)


Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:
    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================
    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...


In [16]:
X = iris.data[:, 2:]  # keep only petal length and petal width, so the classes can be plotted in 2-D
y = iris.target

In [17]:
figsize(8,8)
scatter(X[:,0], X[:,1], c=y)


Out[17]:
<matplotlib.collections.PathCollection at 0x7fb0109ea910>

Build a classifier able to separate the 3 plant classes.

Solution


In [20]:
svc = SVC(kernel='rbf')
svc.fit(X, y)


Out[20]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [21]:
figsize(8,8)
plot_multi_class('SVC - RBF', svc, X, y)
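
The score reported by `plot_multi_class` is measured on the training data itself. As a sketch on top of the exercise (an assumption, not part of the original notebook), a held-out split gives a more honest estimate of how the classifier handles new inputs; newer scikit-learn versions expose `train_test_split` under `sklearn.model_selection` instead of `sklearn.cross_validation`.

In [ ]:
# Hypothetical cell: evaluate the RBF SVC on data it has not seen during fitting.
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
svc = SVC(kernel='rbf')
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))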