scikit-learn
It's said in different ways, but I like the way Jake VanderPlas defines ML:
Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
He goes on to say:
Machine Learning can be considered a subfield of Artificial Intelligence since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow generalizing rather than just storing and retrieving data items like a database system would do.
(more here)
ML is much more than writing a program. ML experts write clever and robust algorithms which can generalize to answer different, but specific questions. There are still types of questions that a certain algorithm cannot or should not be used to answer. I say answer instead of solve, because even with an answer one should evaluate whether it is a good answer or a bad answer. Also, just as in statistics, one needs to be careful about the assumptions and limitations of an algorithm and the subsequent model that is built from it. Here's my hand-drawn diagram of the machine learning process.
Below, we are going to show a simple case of classification. In the figure we show a collection of 2D data, colored by their class labels (imagine one class is labeled "red" and the other "blue").
The fig_code module is credited to Jake VanderPlas and was cloned from his github repo here - also on our repo is his license file since he asked us to include that if we use his source code. :)
In [ ]:
# Plot settings for notebook
# so that plots show up in notebook
%matplotlib inline
# seaborn here is used for aesthetics.
# here, setting seaborn plot defaults (this can be safely commented out)
import seaborn; seaborn.set()
In [ ]:
# Import an example plot from the figures directory
from fig_code import plot_sgd_separator
plot_sgd_separator()
Above is the line which best separates the two classes, "red" and "blue", using a classification algorithm called Stochastic Gradient Descent (don't worry about the details yet). The dashed lines show the margin on either side of the separating line.
This demonstrates a very important aspect of ML, namely that the algorithm is generalizable: if we add some new data, a new point, the algorithm can predict whether it should be in the "red" or "blue" category.
ML TIP: ML can only answer 5 questions:
- How much/how many?
- Which category?
- Which group?
- Is it weird?
- Which action?
As far as algorithms for learning a model (i.e. running some training data through an algorithm), it's nice to think of them in two different ways (with the help of the machine learning wikipedia article). The first way of thinking about ML is by the type of information or input given to a system. Given those criteria there are three classical categories:
- Supervised learning: the system is given example inputs along with their desired outputs (labels)
- Unsupervised learning: the system is given inputs with no labels and must find structure on its own
- Reinforcement learning: the system learns by interacting with an environment and receiving rewards or penalties
Another way of categorizing ML approaches is to think of the desired output, e.g. classification, regression, clustering, density estimation or dimensionality reduction.
--> This second approach (by desired output) is how sklearn categorizes its ML algorithms.
Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.
All supervised estimators in sklearn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.
Common kinds of algorithms you will use to train a model and then predict the labels of unknown observations are classification and regression algorithms. There are many types of classification and regression (for examples check out the sklearn algorithm cheatsheet below).
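To make the fit/predict pattern concrete, here is a minimal sketch (using a k-nearest neighbors classifier purely as an illustration; any supervised estimator follows the same pattern):
In [ ]:
# a minimal sketch of the supervised fit/predict pattern
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X, y)               # learn from labeled observations
print(knn.predict(X[:2]))   # predict labels for (here, already seen) observations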
In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data.
Unsupervised models have a fit(), transform() and/or fit_transform() method in sklearn.
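For example, a minimal sketch of the fit/transform pattern using PCA (an unsupervised transformer we will come back to later):
In [ ]:
# a minimal sketch of the unsupervised fit/transform pattern
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data                      # note: no labels needed

pca = PCA(n_components = 2)
X_reduced = pca.fit_transform(X)   # fit the model, then transform the data
print(X_reduced.shape)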
Some examples are pattern matching (e.g. regex), group-by and data mining in general (discovery vs. prediction).
Here are some example questions; think about which of the five ML questions above (if any) each one maps to:
Looking back at previous years, by what percent did housing prices increase over each decade?
Looking back at previous years, and given the relationship between housing prices and mean income in my area, given my income how much will a house be in two years in my area?
A vacuum like roomba has to make a decision to vacuum the living room again or return to its base.
Is this image a cat or dog?
Are orange tabby cats more common than other breeds in Austin, Texas?
Using my SQL database on housing prices, group my housing prices by whether or not the house is under 10 miles from a school.
What is the weather going to be like tomorrow?
What is the purpose of life?
A brief introduction to scikit-learn (sklearn)
This module is not meant to be a comprehensive introduction to ML, but rather an introduction to the current de facto tool for ML in Python. As a gentle intro, it is helpful to think of the sklearn approach as having layers of abstraction. This famous quote certainly applies:
Easy reading is damn hard writing, and vice versa.
--Nathaniel Hawthorne
In sklearn, you'll find you have a common programming choice: to do things very explicitly, e.g. pre-process data one step at a time, perhaps do a transformation like PCA, split data into training and test sets, define a classifier or learner with the desired parameters, train the classifier, use the classifier to predict on a test set and then analyze how well it did.
A different approach, and something sklearn offers, is to combine some or all of the steps above into a pipeline, so to speak. For instance, one could define a pipeline which does all of these steps at one time and perhaps even pits multiple learners against one another or does some parameter tuning with a grid search (examples will be shown towards the end). This is what is meant here by layers of abstraction.
So, in this particular module, for the most part, we will try to be explicit regarding our process and give some useful tips on options for a more automated or pipelined approach. Just note, once you've mastered the explicit approaches you might want to explore sklearn's GridSearchCV and Pipeline classes.
Here is sklearn's algorithm diagram - (note, this is not an exhaustive list of model options offered in sklearn, but serves as a good algorithm guide).
In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
# Leave one value out of the training set - it will be used as a test point later on
X_train, y_train = iris.data[:-1,:], iris.target[:-1]
In [ ]:
from sklearn.linear_model import LogisticRegression
# our model - logistic regression, used here as a multiclass classifier
logistic = LogisticRegression()
# train on iris training set
logistic.fit(X_train, y_train)
X_test = iris.data[-1,:].reshape(1, -1)
y_predict = logistic.predict(X_test)
print('Predicted class %s, real class %s' % (
y_predict, iris.target[-1]))
print('Probabilities of membership in each class: %s' %
logistic.predict_proba(X_test))
QUESTION:
Term | Definition |
---|---|
Training set | set of data used to learn a model |
Test set | set of data used to test a model |
Feature | a variable (continuous, discrete, categorical, etc.) aka column |
Target | Label (associated with dependent variable, what we predict) |
Learner | Model or algorithm |
Fit, Train | learn a model with an ML algorithm using a training set |
Predict | w/ supervised learning, give a label to an unknown datum(data), w/ unsupervised decide if new data is weird, in which group, or what to do next with the new data |
Accuracy | percentage of correct predictions ((TP + TN) / total) |
Precision | percentage of correct positive predictions (TP / (FP + TP)) |
Recall | percentage of positive cases caught (TP / (FN + TP)) |
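As a quick illustration of those last three rows (accuracy, precision and recall), sklearn.metrics can compute them directly; the toy labels below are made up just for the example:
In [ ]:
# accuracy, precision and recall on a made-up binary example
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
print(precision_score(y_true, y_pred))  # TP / (FP + TP)
print(recall_score(y_true, y_pred))     # TP / (FN + TP)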
PRO TIP: Are you a statistician? Want to talk like a machine learning expert? Here you go (from the friendly people at SAS (here)):
A Statistician Would Say | A Machine Learning Expert Would Say |
---|---|
dependent variable | target |
variable | feature |
transformation | feature creation |
BREAK
NOTE: sklearn needs data/features (aka columns) in numpy ndarrays and the optional labels also as numpy ndarrays.
TIP: Commonly, machine learning algorithms will require your data to be standardized and preprocessed. In sklearn the data must also take on a certain structure.
In [ ]:
print(type(iris.data))
print(type(iris.target))
In [ ]:
import seaborn as sb
import pandas as pd
import numpy as np
#sb.set_context("notebook", font_scale=2.5)
%matplotlib inline
Features in the Iris dataset:
0 sepal length in cm
1 sepal width in cm
2 petal length in cm
3 petal width in cm
Target classes to predict:
0 Iris Setosa
1 Iris Versicolour
2 Iris Virginica
Shape and representation
In [ ]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
# How many data points (rows) x how many features (columns)
print(iris.data.shape)
print(iris.target.shape)
# What python object represents
print(type(iris.data))
print(type(iris.target))
Sneak a peek at the data (a reminder of your pandas dataframe methods)
In [ ]:
# convert to pandas df (adding real column names)
iris.df = pd.DataFrame(iris.data,
columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])
# first few rows
iris.df.head()
Describe the dataset with some summary statistics
In [ ]:
# summary stats
iris.df.describe()
We don't need to do much to the iris dataset. It has no missing values, it's already in numpy arrays and it has the correct shape for sklearn. However, we could try standardization and/or normalization. (Later, in the transforms section, we will show one-hot encoding, a preprocessing step.)
What you might have to do before using a learner in `sklearn` (often with the help of pandas): features should end up in a numpy.ndarray (hence numeric) and labels in a list.
Data options:
If you use your own data or "real-world" data you will likely have to do some data wrangling and need to leverage pandas for some data manipulation.
FYI: you'll commonly see the data or feature set (ML word for data without its labels) represented as a capital X and the targets or labels (if we have them) represented as a lowercase y. This is because the data is a 2D array or list of lists and the targets are a 1D array or simple list.
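For example, with the iris data we already loaded:
In [ ]:
# the conventional X (2D features) and y (1D labels) shapes
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
print(X.shape)  # (n_samples, n_features) -- a 2D array
print(y.shape)  # (n_samples,)            -- a 1D array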
In [ ]:
# Standardization aka scaling
from sklearn import preprocessing, datasets
# make sure we have iris loaded
iris = datasets.load_iris()
X, y = iris.data, iris.target
# standardize it (zero mean, unit variance)
X_scaled = preprocessing.scale(X)
# how does it look now
pd.DataFrame(X_scaled).head()
In [ ]:
# let's just confirm our standardization worked (mean is 0 w/ unit variance)
pd.DataFrame(X_scaled).describe()
# also could:
#print(X_scaled.mean(axis = 0))
#print(X_scaled.std(axis = 0))
PRO TIP: To save our standardization and reapply later (say to the test set or some new data), create a transformer object like so:
scaler = preprocessing.StandardScaler().fit(X_train) # apply to a new dataset (e.g. test set): scaler.transform(X_test)
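For instance, a sketch of that workflow, assuming you have already made a train/test split (as we do later with train_test_split):
In [ ]:
# fit the scaler on the training data only, then reuse it on new data
from sklearn import preprocessing, datasets
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size = 0.7)

scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # same shift/scale as learned from the training set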
In [ ]:
# Normalization (scaling each sample to unit norm)
from sklearn import preprocessing, datasets
# make sure we have iris loaded
iris = datasets.load_iris()
X, y = iris.data, iris.target
# normalize each sample (row) to unit L1 norm
X_norm = preprocessing.normalize(X, norm='l1')
# how does it look now
pd.DataFrame(X_norm).tail()
In [ ]:
# take a look at the normalized data (note: this is not standardization, so the means are not 0)
pd.DataFrame(X_norm).describe()
# cumulative sum of normalized and original data:
#print(pd.DataFrame(X_norm.cumsum().reshape(X.shape)).tail())
#print(pd.DataFrame(X).cumsum().tail())
# unit norm (convert to unit vectors) - all row sums should be 1 now
X_norm.sum(axis = 1)
PRO TIP: To save our normalization (like standardization above) and reapply later (say to the test set or some new data), create a transformer object like so:
normalizer = preprocessing.Normalizer().fit(X_train) # apply to a new dataset (e.g. test set): normalizer.transform(X_test)
BREAK
In [ ]:
# PCA for dimensionality reduction
from sklearn import decomposition
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
# perform principal component analysis
pca = decomposition.PCA(n_components = 3)
pca.fit(X)
X_t = pca.transform(X)
# X_t now holds the projected data; X_t[:, 0] is the first principal component
# import numpy and matplotlib for plotting (and set some stuff)
import numpy as np
np.set_printoptions(suppress=True)
import matplotlib.pyplot as plt
%matplotlib inline
# let's separate out data based on first two principal components
x1, x2 = X_t[:, 0], X_t[:, 1]
# please don't worry about details of the plotting below
# (will introduce in different module)
# (note: you can get the iris names below from iris.target_names, also in docs)
s1 = ['r' if v == 0 else 'b' if v == 1 else 'g' for v in y]
s2 = ['Setosa' if v == 0 else 'Versicolor' if v == 1 else 'Virginica' for v in y]
classes = s2
colors = s1
for (i, cla) in enumerate(set(classes)):
    xc = [p for (j, p) in enumerate(x1) if classes[j] == cla]
    yc = [p for (j, p) in enumerate(x2) if classes[j] == cla]
    cols = [c for (j, c) in enumerate(colors) if classes[j] == cla]
    plt.scatter(xc, yc, c = cols, label = cla)
plt.legend(loc = 4)
In [ ]:
# SelectKBest for selecting top-scoring features
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2
iris = datasets.load_iris()
X, y = iris.data, iris.target
print(X.shape)
# Do feature selection
# input is scoring function (here chi2) to get univariate p-values
# and number of top-scoring features (k) - here we get the top 2
X_t = SelectKBest(chi2, k = 2).fit_transform(X, y)
print(X_t.shape)
Note on scoring function selection in `SelectKBest` transformations: the scoring function should match the problem type. chi2 (used above) expects non-negative features and a classification target; f_classif is another common choice for classification, and f_regression is for regression targets.
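A sketch of swapping in a regression scoring function (the diabetes dataset here is just an arbitrary regression example):
In [ ]:
# SelectKBest with a regression scoring function
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression

diabetes = datasets.load_diabetes()        # a regression dataset
X_r, y_r = diabetes.data, diabetes.target
print(X_r.shape)

X_r_top = SelectKBest(f_regression, k = 3).fit_transform(X_r, y_r)
print(X_r_top.shape)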
In [ ]:
# OneHotEncoder for dummying variables
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import pandas as pd
data = pd.DataFrame({'index': range(1, 7),
'state': ['WA', 'NY', 'CO', 'NY', 'CA', 'WA']})
print(data)
# We encode both our categorical variable and its labels
enc = OneHotEncoder()
label_enc = LabelEncoder() # remember the labels here
# Encode labels (can use for discrete numerical values as well)
data_label_encoded = label_enc.fit_transform(data['state'])
data['state'] = data_label_encoded
# Encode and "dummy" variables
data_feature_one_hot_encoded = enc.fit_transform(data[['state']])
# Put into dataframe to look nicer and decode state dummy variables to original state values
# TRY: compare the original input data (look at row numbers) to one hot encoding results
# --> do they match??
pd.DataFrame(data_feature_one_hot_encoded.toarray(), columns = label_enc.inverse_transform(range(4)))
In [ ]:
# Encoded labels as dummy variables
print(data_label_encoded)
# Decoded
print(label_enc.inverse_transform(data_label_encoded))
EXERCISE: Use one hot encoding to "recode" the iris data's extra surprise column (we are going to add a categorical variable here to play with...)
In [ ]:
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
a = pd.DataFrame(X,
columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])
col5 = pd.DataFrame(np.random.randint(1, 4, size = len(y)))
X_plus = pd.concat([a, col5], axis = 1)
X_plus.head(20)
# ...now one-hot-encode...
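One possible approach (a sketch, not the only way to do the exercise) is to lean on pandas' get_dummies for the extra column created above:
In [ ]:
# one possible way to one-hot-encode the extra column (uses X_plus from the cell above)
X_plus.columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width', 'extra']
pd.get_dummies(X_plus, columns = ['extra']).head()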
BREAK
Reminder: All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y. (direct quote from sklearn docs)
"Often the hardest part of solving a machine learning problem can be finding the right estimator for the job."
"Different estimators are better suited for different types of data and different problems."
An estimator for recognizing a new iris from its measurements
Or, in machine learning parlance, we fit an estimator on known samples of the iris measurements to predict the class to which an unseen iris belongs.
Let's give it a try! (We are actually going to hold out a small percentage of the iris dataset and check our predictions against the labels)
In [ ]:
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn import tree
# Let's load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# split data into training and test sets using the handy train_test_split func
# in this split, we are "holding out" only one value and label (placed into X_test and y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1)
In [ ]:
# Let's try a decision tree classification method
clf = tree.DecisionTreeClassifier()
# train on the training split only, so the held-out sample stays unseen
clf = clf.fit(X_train, y_train)
In [ ]:
# Let's predict on our "held out" sample
y_pred = clf.predict(X_test)
In [ ]:
# What was the label associated with this test sample? ("held out" sample's original label)
# fill in the blank below
# how did our prediction do?
print("Prediction: %d, Original label: %d" % (y_pred[0], ___)) # <-- fill in blank
In [ ]:
In [ ]:
from IPython.display import Image
from sklearn.externals.six import StringIO
import pydot
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True,
special_characters=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
In [ ]:
from sklearn.tree import export_graphviz
import graphviz
export_graphviz(clf, out_file="mytree.dot",
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True,
special_characters=True)
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
In [45]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# put the data in a dataframe for easier inspection
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# an alternative way to mark a train/test split, by flagging roughly 75% of rows as training
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
train, test = df[df['is_train']==True], df[df['is_train']==False]
# fit a random forest on the training set and predict on the held-out test set
clf = RandomForestClassifier(n_jobs=2)
clf.fit(X_train, y_train)
preds = iris.target_names[clf.predict(X_test)]
df.head()
#pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])
We can be explicit and use the train_test_split method in scikit-learn (train_test_split) as in (and as shown above for iris data):
# Create some data by hand and place 70% into a training set and the rest into a test set
# Here we are using labeled features (X - feature data, y - labels) in our made-up data
import numpy as np
from sklearn import linear_model
from sklearn.cross_validation import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70)
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)
Or, be more concise and score with cross-validation directly:
import numpy as np
from sklearn import cross_validation, linear_model
X, y = np.arange(10).reshape((5, 2)), range(5)
clf = linear_model.LinearRegression()
score = cross_validation.cross_val_score(clf, X, y)
There is also a cross_val_predict method to create estimates rather than scores (cross_val_predict).
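A quick sketch of cross_val_predict on the same made-up data:
import numpy as np
from sklearn import cross_validation, linear_model

X, y = np.arange(10).reshape((5, 2)), range(5)
clf = linear_model.LinearRegression()
# returns one out-of-fold prediction per sample, rather than a score per fold
predicted = cross_validation.cross_val_predict(clf, X, y)
print(predicted)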
BREAK
Reminder: In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the training set given to the learner is unlabeled, there is no error or reward signal to evaluate a potential solution. Basically, we are just finding a way to represent the data and get as much information from it as we can.
HEY! Remember PCA from above? PCA is actually considered unsupervised learning. We just put it up there because it's a good way to visualize data at the beginning of the ML process.
We are going to continue to use the iris dataset (however we won't be needing the targets or labels).
In [ ]:
from sklearn import cluster, datasets
# data
iris = datasets.load_iris()
X, y = iris.data, iris.target
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X)
# how do our original labels fit into the clusters we found?
print(k_means.labels_[::10])
print(y[::10])
EXERCISE IDEA: Iterate over different numbers of clusters (the n_clusters param) in KMeans
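A sketch of one way to approach that exercise, using the model's inertia_ (within-cluster sum of squares) to compare cluster counts:
In [ ]:
# try several values of n_clusters and compare the within-cluster sum of squares
from sklearn import cluster, datasets

iris = datasets.load_iris()
X = iris.data

for k in range(2, 7):
    k_means = cluster.KMeans(n_clusters = k)
    k_means.fit(X)
    print(k, k_means.inertia_)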
BREAK
Here, we will process some data, classify it with SVM (see [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more info), and view the quality of the classification with a confusion matrix.
In [38]:
import numpy as np
# import model algorithm and data
from sklearn import svm, datasets
# import splitter
from sklearn.cross_validation import train_test_split
# import metrics
from sklearn.metrics import confusion_matrix
# feature data (X) and labels (y)
iris = datasets.load_iris()
X, y = iris.data, iris.target
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, random_state = 42)
In [39]:
# perform the classification step and run a prediction on test set from above
clf = svm.SVC(kernel = 'linear', C = 0.01)
y_pred = clf.fit(X_train, y_train).predict(X_test)
In [40]:
# Define a plotting function for confusion matrices
# (from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
import matplotlib.pyplot as plt
def plot_confusion_matrix(cm, target_names, title = 'The Confusion Matrix', cmap = plt.cm.YlOrRd):
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    # Add class labels to x and y axes
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
Numbers in the confusion matrix: rows correspond to the true labels and columns to the predicted labels (sklearn's convention), so the counts on the diagonal are the correct predictions for each class.
In [41]:
%matplotlib inline
cm = confusion_matrix(y_test, y_pred)
# see the actual counts
print(cm)
# visually inspect how the classifier did matching predictions to true labels
plot_confusion_matrix(cm, iris.target_names)
In [42]:
from sklearn.metrics import classification_report
# Using the test and prediction sets from above
print(classification_report(y_test, y_pred, target_names = iris.target_names))
In [43]:
# Another example with some toy data
y_test = ['cat', 'dog', 'mouse', 'mouse', 'cat', 'cat']
y_pred = ['mouse', 'dog', 'cat', 'mouse', 'cat', 'mouse']
# How did our predictor do?
print(classification_report(y_test, y_pred, target_names = ['cat', 'dog', 'mouse'])) # target_names must be the unique class names
GridSearchCV parameter tuning
In [ ]:
import numpy as np
from sklearn import cross_validation
# Let's create some toy data to fit models to and later predict on
# Here, sample points from a cosine function
X = np.sort(np.random.rand(20))
func = lambda x: np.cos(1.5 * np.pi * x)
y = np.array([func(x) for x in X])
In [ ]:
# A plotting function
import matplotlib.pyplot as plt
%matplotlib inline
def plot_fit(X_train, y_train, X_test, y_pred):
    plt.plot(X_test, y_pred, label = "Model")
    plt.plot(X_test, func(X_test), label = "Function")
    plt.scatter(X_train, y_train, label = "Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc = "best")
BREAK
Pipelining (as an aside to this section)
Pipeline(steps=[...])
- where steps is a list of (name, transform/estimator) tuples through which the data is put in order; the parameters of each step can later be referenced by its name
- For example, here we do a transformation (SelectKBest) and a classification (SVC) all at once in a pipeline we set up
# imports for this snippet
from sklearn import svm, datasets
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
X, y = iris.data, iris.target

# a feature selection instance
selection = SelectKBest(chi2, k = 2)
# classification instance
clf = svm.SVC(kernel = 'linear')
# make a pipeline
pipeline = Pipeline([("feature selection", selection), ("classification", clf)])
# train the model
pipeline.fit(X, y)
See a full example here
Note: If you wish to perform multiple transformations in your pipeline try FeatureUnion
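For instance, a minimal FeatureUnion sketch (combining PCA components with top-scoring features before a classifier; the particular choices here are just for illustration):
In [ ]:
# FeatureUnion: run several transformers and concatenate their outputs
from sklearn import svm, datasets
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline, FeatureUnion

iris = datasets.load_iris()
X, y = iris.data, iris.target

combined_features = FeatureUnion([("pca", PCA(n_components = 2)),
                                  ("kbest", SelectKBest(chi2, k = 1))])
pipeline = Pipeline([("features", combined_features),
                     ("classification", svm.SVC(kernel = 'linear'))])
pipeline.fit(X, y)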
In [ ]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree = 1, include_bias = False)
lm = LinearRegression()
In [ ]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([("polynomial_features", poly),
("linear_regression", lm)])
pipeline.fit(X[:, np.newaxis], y)
X_test = np.linspace(0, 1, 100)
y_pred = pipeline.predict(X_test[:, np.newaxis])
plot_fit(X, y, X_test, y_pred)
In [ ]:
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(include_bias = False)
lm = LinearRegression()
pipeline = Pipeline([("polynomial_features", poly),
("linear_regression", lm)])
param_grid = dict(polynomial_features__degree = list(range(1, 30, 2)),
linear_regression__normalize = [False, True])
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X[:, np.newaxis], y)
print(grid_search.best_params_)
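After the search you can also inspect, for example, the best cross-validated score and the refit best pipeline (a quick sketch using the X_test points defined earlier):
In [ ]:
# a couple of other things to inspect after a grid search
print(grid_search.best_score_)            # best mean cross-validated score found
best_model = grid_search.best_estimator_  # pipeline refit with the best parameters
y_best = best_model.predict(X_test[:, np.newaxis])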
BREAK
In [ ]: