Scikit Learn

A mature library for machine learning, built using tools like numpy and scipy.

Classification
Regression
Clustering
Dimensionality reduction
Model Selection

An excellent class from PyCon in 2013, is available here:

https://github.com/jakevdp/sklearn_pycon2013

A fantastic slide to help identify what approach to use:

http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Classification of flowers.



In [3]:

    
from IPython.display import Image
Image("http://blog.kaggle.com/wp-content/uploads/2015/04/iris_petal_sepal.png")









    Out[3]:



In [4]:

    
import pandas as pd
import numpy as np

from sklearn.datasets import load_iris

iris = load_iris()

Form the feature matrix, which is the collection of measurements or observations that are associated with a particular state or label.



In [5]:

    
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X.head()









    Out[5]:






  
    
      
      sepal length (cm)
      sepal width (cm)
      petal length (cm)
      petal width (cm)
    
  
  
    
      0
       5.1
       3.5
       1.4
       0.2
    
    
      1
       4.9
       3.0
       1.4
       0.2
    
    
      2
       4.7
       3.2
       1.3
       0.2
    
    
      3
       4.6
       3.1
       1.5
       0.2
    
    
      4
       5.0
       3.6
       1.4
       0.2

Form the set of labels, the classification of the state based on the measurements.



In [6]:

    
Y = pd.DataFrame(np.vstack([iris.target, np.choose(iris.target, 
                                                   iris.target_names)]).T,
                 columns=["label id", "label name"])
y = Y["label id"]

Y.head()

Q. How many measurements and how many classes of flower?



In [7]:

    
print("There are {} measurements and {} classes of flower.".format(X.size, Y["label name"].unique().size))









    



There are 600 measurements and 3 classes of flower.

Best practice suggests that we should always partition our data into train, test and validation sets. It is very common to do this using randomized partitioning.

http://scikit-learn.org/stable/modules/cross_validation.html



In [8]:

    
from sklearn import cross_validation

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)

Using a simple linear classifier, lets train the model and evaluate the performance on the test set.



In [9]:

    
from sklearn.svm import LinearSVC

clf = LinearSVC()

clf.fit(X_train, y_train)









    Out[9]:





LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.0001, verbose=0)

We can look at how the classifier performed on the test set using the classifiers "score" method. This reports the mean accuracy of the classifier.



In [10]:

    
clf.score(X_test, y_test)









    Out[10]:





0.91666666666666663

Measuring accuracy alone ignores other important features of a classifier:

http://scikit-learn.org/stable/modules/model_evaluation.html



In [11]:

    
from sklearn.metrics import classification_report

print(classification_report(y_test, clf.predict(X_test), target_names=iris.target_names))









    



             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        16
 versicolor       0.91      0.87      0.89        23
  virginica       0.86      0.90      0.88        21

avg / total       0.92      0.92      0.92        60

Continue on to the scikit-learn gallery and look at some of the examples.



In [ ]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2