Classification with Scikit-learn

First we use pandas to read in the CSV file and separate the Y (the target class, the final column in the CSV) from the X, the predictor variables.


In [ ]:
import pandas as pd

data = pd.read_csv("../data/iris.data")

# convert to NumPy arrays because they are the easiest to handle in sklearn
# (as_matrix() was removed in pandas 1.0; to_numpy() is the replacement)
variables = data.drop(["class"], axis=1).to_numpy()
classes = data[["class"]].to_numpy().reshape(-1)

In [ ]:
# import the train/test splitting helper and KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

train_X, test_X, train_Y, test_Y = train_test_split(variables, classes)

# initialize classifier object
classifier = KNeighborsClassifier()

# fit the object using training data and sample labels
classifier.fit(train_X, train_Y)

# evaluate the results for held-out test sample
classifier.score(test_X, test_Y)
# value is the mean accuracy
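The score is the fraction of test samples classified correctly, so it can be verified by comparing the predictions to the true labels by hand. A self-contained sketch using scikit-learn's built-in iris loader (so it runs without the CSV file):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
train_X, test_X, train_Y, test_Y = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier().fit(train_X, train_Y)

# score() is the mean accuracy: the fraction of correct predictions
acc = clf.score(test_X, test_Y)
manual = np.mean(clf.predict(test_X) == test_Y)
```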

In [ ]:
# if we wanted to predict values for unseen data, we would use the predict()-method

classifier.predict(test_X) # note no known Y-values passed

Exercise

  • Import the classifier object `sklearn.svm.SVC`
  • initialize it
  • fit it with the training data (no need to split a second time)
  • evaluate the quality of the created classifier using score()

In [ ]:
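One possible solution, sketched with scikit-learn's built-in iris loader and a fresh split so it runs standalone (in the notebook you would reuse `train_X`, `test_X`, `train_Y`, `test_Y` from above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
train_X, test_X, train_Y, test_Y = train_test_split(X, y, random_state=0)

svc = SVC()                            # initialize with default parameters
svc.fit(train_X, train_Y)              # fit on the training portion
accuracy = svc.score(test_X, test_Y)   # mean accuracy on the held-out part
```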

Pipelining and cross-validation

It's common to want to preprocess the data or, more generally, to chain several processing steps. This can be easily done with the Pipeline class.

The steps typically have parameters, and you usually want to select the best-performing combination.


In [ ]:
from sklearn.decomposition import PCA # PCA is a subspace method that projects the data into a lower-dimensional space

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


pca = PCA(n_components=2)
knn = KNeighborsClassifier(n_neighbors=3)

from sklearn.pipeline import Pipeline

pipeline = Pipeline([("pca", pca), ("kneighbors", knn)])

parameters_grid = dict(
    pca__n_components=[1,2,3,4],
    kneighbors__n_neighbors=[1,2,3,4,5,6]
    )
grid_search = GridSearchCV(pipeline, parameters_grid)
grid_search.fit(train_X, train_Y)
grid_search.best_estimator_

In [ ]:
# you can now test against the held-out part
grid_search.best_estimator_.score(test_X, test_Y)
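The winning parameter combination and its cross-validated score can also be inspected directly. A sketch, again using the built-in iris data so it runs standalone:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
train_X, test_X, train_Y, test_Y = train_test_split(X, y, random_state=0)

pipeline = Pipeline([("pca", PCA()), ("kneighbors", KNeighborsClassifier())])
parameters_grid = dict(
    pca__n_components=[1, 2, 3, 4],
    kneighbors__n_neighbors=[1, 2, 3, 4, 5, 6]
    )

search = GridSearchCV(pipeline, parameters_grid)
search.fit(train_X, train_Y)

# best_params_ holds the winning combination,
# best_score_ its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```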

Exercise

There is another dataset, "breast-cancer-wisconsin.data". For a description, see [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/).

It contains samples with a patient ID (which you should remove), measurements, and, as the last column, the doctor's judgment of the biopsy: malignant or benign.

Read in the file and create a classifier.

You can either just split the input and use some classifier, or do a grid-search cross-validation over a larger space of potential parameters.


In [ ]:
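A possible solution sketch, using scikit-learn's built-in breast-cancer loader as a stand-in for reading the file (the file version additionally contains the patient-ID column, which should be dropped, and "?" markers for missing values that need handling):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# stand-in for pd.read_csv("../data/breast-cancer-wisconsin.data")
# followed by dropping the patient-ID column
X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_Y, test_Y = train_test_split(X, y, random_state=0)

# grid-search over the number of neighbors, then score on the held-out part
search = GridSearchCV(KNeighborsClassifier(),
                      dict(n_neighbors=[1, 3, 5, 7, 9]))
search.fit(train_X, train_Y)
accuracy = search.score(test_X, test_Y)
```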