scikit-learn is a mature library for machine learning, built on top of numpy and scipy.
An excellent tutorial from PyCon 2013 is available here:
https://github.com/jakevdp/sklearn_pycon2013
A fantastic flowchart to help identify which approach to use:
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
In [3]:
from IPython.display import Image
Image("http://blog.kaggle.com/wp-content/uploads/2015/04/iris_petal_sepal.png")
Out[3]:
In [4]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
Form the feature matrix: the collection of measurements (observations) associated with each sample's state or label.
In [5]:
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X.head()
Out[5]:
Form the set of labels, the classification of the state based on the measurements.
In [6]:
Y = pd.DataFrame(np.vstack([iris.target,
                            np.choose(iris.target, iris.target_names)]).T,
                 columns=["label id", "label name"])
y = Y["label id"]
Y.head()
Out[6]:
Q. How many measurements and how many classes of flower?
In [7]:
print("There are {} measurements and {} classes of flower.".format(X.size, Y["label name"].unique().size))
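Note that `X.size` counts every individual measurement value, not the number of samples. A minimal sketch (re-loading the iris data so it stands alone) shows the distinction between samples, features, and total measurements:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# X.shape separates samples from features; X.size multiplies them together.
n_samples, n_features = X.shape
print(n_samples, n_features, X.size)  # 150 samples x 4 features = 600 measurements
```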
Best practice suggests that we should always partition our data into training, validation, and test sets. It is very common to do this using randomized partitioning.
http://scikit-learn.org/stable/modules/cross_validation.html
In [8]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
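A single hold-out split uses part of the data only once. The cross-validation page linked above also describes k-fold cross-validation, which scores the model on every sample by rotating the held-out fold. A hedged sketch (assuming scikit-learn >= 0.18, where these helpers live in `sklearn.model_selection`):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

iris = load_iris()
clf = LinearSVC()

# Each of the 5 folds trains on 4/5 of the data and scores on the held-out 1/5.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores.mean(), scores.std())
```

Reporting the mean and standard deviation across folds gives a sense of how sensitive the score is to the particular split.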
Using a simple linear classifier, let's train the model and evaluate its performance on the test set.
In [9]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train, y_train)
Out[9]:
We can look at how the classifier performed on the test set using the classifier's `score` method, which reports the mean accuracy.
In [10]:
clf.score(X_test, y_test)
Out[10]:
Measuring accuracy alone ignores other important properties of a classifier, such as per-class precision and recall:
http://scikit-learn.org/stable/modules/model_evaluation.html
In [11]:
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(X_test), target_names=iris.target_names))
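A confusion matrix gives another per-class view of the same predictions: rows are true classes, columns are predicted classes, so off-diagonal entries count misclassifications. A minimal, self-contained sketch (rebuilding the same split as above):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

clf = LinearSVC().fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)  # a 3x3 matrix; diagonal entries are correct predictions
```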
Continue on to the scikit-learn gallery and look at some of the examples.
In [ ]: