Scikit Learn

http://scikit-learn.org/

A mature library for machine learning, built using tools like numpy and scipy.

  • Classification
  • Regression
  • Clustering
  • Dimensionality reduction
  • Model Selection

An excellent class from PyCon in 2013, is available here:

https://github.com/jakevdp/sklearn_pycon2013

A fantastic slide to help identify what approach to use:

http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html


Classification of flowers.


In [3]:
from IPython.display import Image
Image("http://blog.kaggle.com/wp-content/uploads/2015/04/iris_petal_sepal.png")


Out[3]:

In [4]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_iris

iris = load_iris()

Form the feature matrix, which is the collection of measurements or observations that are associated with a particular state or label.


In [5]:
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X.head()


Out[5]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Form the set of labels, the classification of the state based on the measurements.


In [6]:
Y = pd.DataFrame(np.vstack([iris.target, np.choose(iris.target, 
                                                   iris.target_names)]).T,
                 columns=["label id", "label name"])
y = Y["label id"]

Y.head()


Out[6]:
label id label name
0 0 setosa
1 0 setosa
2 0 setosa
3 0 setosa
4 0 setosa

Q. How many measurements and how many classes of flower?


In [7]:
print("There are {} measurements and {} classes of flower.".format(X.size, Y["label name"].unique().size))


There are 600 measurements and 3 classes of flower.

Best practice suggests that we should always partition our data into train, test and validation sets. It is very common to do this using randomized partitioning.

http://scikit-learn.org/stable/modules/cross_validation.html


In [8]:
from sklearn import cross_validation

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)

Using a simple linear classifier, lets train the model and evaluate the performance on the test set.


In [9]:
from sklearn.svm import LinearSVC

clf = LinearSVC()

clf.fit(X_train, y_train)


Out[9]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0)

We can look at how the classifier performed on the test set using the classifiers "score" method. This reports the mean accuracy of the classifier.


In [10]:
clf.score(X_test, y_test)


Out[10]:
0.91666666666666663

Measuring accuracy alone ignores other important features of a classifier:

http://scikit-learn.org/stable/modules/model_evaluation.html


In [11]:
from sklearn.metrics import classification_report

print(classification_report(y_test, clf.predict(X_test), target_names=iris.target_names))


             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        16
 versicolor       0.91      0.87      0.89        23
  virginica       0.86      0.90      0.88        21

avg / total       0.92      0.92      0.92        60


Continue on to the scikit-learn gallery and look at some of the examples.



In [ ]: