scikit-learn
It's said in different ways, but I like the way Jake VanderPlas defines ML:
Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
He goes on to say:
Machine Learning can be considered a subfield of Artificial Intelligence since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow generalizing rather than just storing and retrieving data items like a database system would do.
(more here)
ML is much more than writing a program. ML experts write clever and robust algorithms which can generalize to answer different, but specific, questions. There are still types of questions that a certain algorithm cannot or should not be used to answer. I say answer instead of solve, because even with an answer one should evaluate whether it is a good answer or a bad answer. Also, just as in statistics, one needs to be careful about the assumptions and limitations of an algorithm and of the model that is subsequently built from it. Here's my hand-drawn diagram of the machine learning process.
As far as algorithms for learning a model go (i.e. running some training data through an algorithm), I like to think of a couple of ways to categorize machine learning approaches (with the help of the machine learning Wikipedia article). The first way of thinking about ML is by the type of information or input given to a system. Given that criterion, there are three classical categories:
- Supervised learning: the system is shown example inputs along with their desired outputs (labels)
- Unsupervised learning: the system is given unlabeled inputs and must find structure on its own
- Reinforcement learning: the system learns from rewards and penalties as it interacts with an environment
Another way of categorizing ML approaches is to think of the desired output, e.g. classification, regression, clustering, or dimensionality reduction.
This second approach (by desired output) is how sklearn categorizes its ML algorithms.
Term | Definition |
---|---|
Training set | set of data used to learn a model |
Test set | set of data used to test a model |
Feature | a variable (continuous, discrete, categorical, etc.) aka column |
Target | Label (associated with dependent variable, what we predict) |
Learner | Model or algorithm |
Fit, Train | learn a model with an ML algorithm using a training set |
Predict | with supervised learning, give a label to an unknown datum (or data); with unsupervised learning, decide whether new data is anomalous, which group it belongs to, or what to do next |
Accuracy | percentage of correct predictions ((TP + TN) / total) |
Precision | percentage of correct positive predictions (TP / (FP + TP)) |
Recall | percentage of positive cases caught (TP / (FN + TP)) |
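To make the last three rows concrete, here is a tiny made-up binary example run through sklearn.metrics (the TP/FP/TN/FN counts are noted in the comments):
from sklearn.metrics import accuracy_score, precision_score, recall_score
# made-up binary labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
# for this toy data: TP = 3, TN = 3, FP = 1, FN = 1
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75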
ML TIP: ML can only answer 5 questions:
- How much/how many?
- Which category?
- Which group?
- Is it weird?
- Which action?
Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.
All supervised estimators in sklearn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.
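As a minimal sketch of that pattern (using a k-nearest-neighbors classifier purely as an illustration; any supervised estimator works the same way):
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
X, y = iris.data, iris.target
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)              # learn from labeled data
print(knn.predict(X[:3]))  # predict labels for some (here, already seen) observations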
Common approaches you will use to train a model and then predict the labels of unknown observations are classification and regression. There are many types of classification and regression (for examples, check out the sklearn algorithm cheat sheet below).
In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data.
Unsupervised models in sklearn have a fit(), transform(), and/or fit_transform() method.
Some examples are pattern matching (e.g. regex), group-by and data mining in general (discovery vs. prediction).
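As a minimal sketch of that interface (using PCA, which we will meet again below, purely as an illustration):
from sklearn import datasets, decomposition
X = datasets.load_iris().data
pca = decomposition.PCA(n_components=2)
pca.fit(X)               # learn the projection from the data (no labels needed)
X_2d = pca.transform(X)  # apply the learned projection
# or do both steps at once:
X_2d = decomposition.PCA(n_components=2).fit_transform(X)
print(X_2d.shape)        # (150, 2)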
- Looking back at previous years, by what percent did housing prices increase over each decade?
- Looking back at previous years, and given the relationship between housing prices and mean income in my area, given my income how much will a house cost in two years in my area?
- A robot vacuum like a Roomba has to decide whether to vacuum the living room again or return to its base.
- Is this image a cat or a dog?
- Are orange tabby cats more common than other breeds in Austin, Texas?
- Using my SQL database on housing prices, group my housing prices by whether or not the house is under 10 miles from a school.
- What is the weather going to be like tomorrow?
- What is the purpose of life?
PRO TIP: Are you a statistician? Want to talk like a machine learning expert? Here you go (from the friendly people at SAS (here)):
A Statistician Would Say | A Machine Learning Expert Would Say |
---|---|
dependent variable | target |
variable | feature |
transformation | feature creation |
This module is not meant to be a comprehensive introduction to ML, but rather an introduction to the current de facto tool for ML in Python: sklearn. As a gentle intro, it is helpful to think of the sklearn approach as having layers of abstraction. This famous quote certainly applies:
Easy reading is damn hard writing, and vice versa.
--Nathaniel Hawthorne
In sklearn, you'll find you have a common programming choice: to do things very explicitly, e.g. pre-process the data one step at a time, perhaps do a transformation like PCA, split the data into training and test sets, define a classifier or learner with the desired parameters, train the classifier, use the classifier to predict on a test set, and then analyze how well it did.
A different approach, and something sklearn offers, is to combine some or all of the steps above into a pipeline, so to speak. For instance, one could define a pipeline which does all of these steps at one time and perhaps even pits multiple learners against one another or does some parameter tuning with a grid search (examples will be shown towards the end). This is what is meant here by layers of abstraction.
So, in this particular module, for the most part, we will try to be explicit regarding our process and give some useful tips on options for a more automated or pipelined approach. Just note, once you've mastered the explicit approaches you might want to explore sklearn's GridSearchCV and Pipeline classes.
Here is sklearn's algorithm diagram (note: this is not an exhaustive list of the model options offered in sklearn, but it serves as a good algorithm guide).
sklearn needs data and features (aka columns with optional labels) in numpy arrays (aka ndarrays)

Commonly, machine learning algorithms will require your data to be standardized and preprocessed. In `sklearn` the data must also take on a certain structure.
What you might have to do before using a learner in `sklearn`:
- Wrangle and transform any raw or non-numeric data (often with the help of pandas)
- Features should end up in a numpy.ndarray (hence numeric) and labels in a list
Data options:
- Use one of the sample datasets that ship with sklearn (e.g. iris, diabetes)
- Use your own or "real-world" data

If you use your own data or "real-world" data, you will likely have to do some data wrangling and will need to leverage pandas for some data manipulation.
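A rough sketch of what that wrangling might look like (the CSV file and column names here are hypothetical, just to show the shape of the workflow):
import pandas as pd
# hypothetical CSV with a numeric feature, a categorical feature, and a label column
df = pd.read_csv('my_housing_data.csv')
# make the categorical column numeric (simple label encoding via pandas categories)
df['near_school'] = df['near_school'].astype('category').cat.codes
# features as a numpy.ndarray, labels as a simple list
X = df[['price', 'near_school']].values
y = df['sold_within_30_days'].tolist()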
In [1]:
import seaborn as sb
import pandas as pd
import numpy as np
#sb.set_context("notebook", font_scale=2.5)
%matplotlib inline
Features in the Iris dataset:
0 sepal length in cm
1 sepal width in cm
2 petal length in cm
3 petal width in cm
Target classes to predict:
0 Iris Setosa
1 Iris Versicolour
2 Iris Virginica
Shape and representation
In [ ]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
# How many data points (rows) x how many features (columns)
print(iris.data.shape)
print(iris.target.shape)
# What Python objects represent the data and the target
print(type(iris.data))
print(type(iris.target))
Sneak a peek at the data (a reminder of your pandas dataframe methods)
In [123]:
# convert to pandas df (adding real column names)
iris.df = pd.DataFrame(iris.data,
columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])
# first few rows
iris.df.head()
Out[123]:
Describe the dataset with some summary statistics
In [ ]:
# summary stats
iris.df.describe()
Not much preprocessing is needed for the iris dataset. It has no missing values, it's already in numpy arrays, and it has the correct shape for sklearn. However, we could try standardization and/or normalization. (Later, in the transforms section, we will show one hot encoding, another preprocessing step.)

FYI: you'll commonly see the data or feature set (the ML term for data without its labels) represented as a capital X, and the targets or labels (if we have them) represented as a lowercase y. This is because the data is a 2D array (or list of lists) and the targets are a 1D array (or simple list).
In [ ]:
# Standardization aka scaling
from sklearn import preprocessing, datasets
# make sure we have iris loaded
iris = datasets.load_iris()
X, y = iris.data, iris.target
# standardize: scale each feature to zero mean and unit variance
X_scaled = preprocessing.scale(X)
# how does it look now
pd.DataFrame(X_scaled).head()
In [ ]:
# let's just confirm our standardization worked (mean is 0 w/ unit variance)
pd.DataFrame(X_scaled).describe()
# also could:
#print(X_scaled.mean(axis = 0))
#print(X_scaled.std(axis = 0))
PRO TIP: To save our standardization and reapply it later (say, to the test set or some new data), create a transformer object like so:
scaler = preprocessing.StandardScaler().fit(X_train)
# apply to a new dataset (e.g. the test set):
scaler.transform(X_test)
In [ ]:
# Normalization (scaling each sample, i.e. row, to unit norm)
from sklearn import preprocessing, datasets
# make sure we have iris loaded
iris = datasets.load_iris()
X, y = iris.data, iris.target
# normalize each sample (row) to unit length using the L1 norm
X_norm = preprocessing.normalize(X, norm='l1')
# how does it look now
pd.DataFrame(X_norm).tail()
In [ ]:
# let's just confirm our normalization worked (each row should now sum to 1)
pd.DataFrame(X_norm).describe()
# cumulative sum of normalized and original data:
#print(pd.DataFrame(X_norm.cumsum().reshape(X.shape)).tail())
#print(pd.DataFrame(X).cumsum().tail())
# unit norm (convert to unit vectors) - all row sums should be 1 now
X_norm.sum(axis = 1)
PRO TIP: To save our normalization (like the standardization above) and reapply it later (say, to the test set or some new data), create a transformer object like so:
normalizer = preprocessing.Normalizer().fit(X_train)
# apply to a new dataset (e.g. the test set):
normalizer.transform(X_test)
In [124]:
# PCA for dimensionality reduction
from sklearn import decomposition
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
# perform principal component analysis
pca = decomposition.PCA(n_components = 3)
pca.fit(X)
X_t = pca.transform(X)
# peek at the first principal component if you like:
# X_t[:, 0]
# import numpy and matplotlib for plotting (and set some stuff)
import numpy as np
np.set_printoptions(suppress=True)
import matplotlib.pyplot as plt
%matplotlib inline
# let's separate out the data based on the first two principal components
x1, x2 = X_t[:, 0], X_t[:, 1]
# please don't worry about details of the plotting below
# (will introduce in different module)
# (note: you can get the iris names below from iris.target_names, also in docs)
s1 = ['r' if v == 0 else 'b' if v == 1 else 'g' for v in y]
s2 = ['Setosa' if v == 0 else 'Versicolor' if v == 1 else 'Virginica' for v in y]
classes = s2
colors = s1
for (i, cla) in enumerate(set(classes)):
    xc = [p for (j, p) in enumerate(x1) if classes[j] == cla]
    yc = [p for (j, p) in enumerate(x2) if classes[j] == cla]
    cols = [c for (j, c) in enumerate(colors) if classes[j] == cla]
    plt.scatter(xc, yc, c = cols, label = cla)
plt.legend(loc = 4)
Out[124]:
EXERCISE IDEA: Normalize data then rerun PCA and plot. What changes?
SOLUTION
In [ ]:
# Solution to exercise idea
...code here...
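One possible solution sketch (not the only reasonable answer): normalize the features first, then redo the PCA projection and plot it, and compare with the scatter plot above.
from sklearn import datasets, decomposition, preprocessing
import matplotlib.pyplot as plt
iris = datasets.load_iris()
X, y = iris.data, iris.target
# normalize, then project onto the first two principal components
X_norm = preprocessing.normalize(X, norm='l1')
X_t = decomposition.PCA(n_components=2).fit_transform(X_norm)
# quick look at the new projection (colors = true classes, as before)
plt.scatter(X_t[:, 0], X_t[:, 1], c=y)
plt.xlabel('PC 1')
plt.ylabel('PC 2')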
In [ ]:
# SelectKBest for selecting top-scoring features
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2
iris = datasets.load_iris()
X, y = iris.data, iris.target
print(X.shape)
# Do feature selection
# input is scoring function (here chi2) to get univariate p-values
# and number of top-scoring features (k) - here we get the top 2
X_t = SelectKBest(chi2, k = 2).fit_transform(X, y)
print(X_t.shape)
Note on scoring function selection in `SelectKBest` transformations: choose a scoring function suited to your problem, e.g. chi2 or f_classif for classification and f_regression for regression.
In [ ]:
# OneHotEncoder for dummying variables
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import pandas as pd
data = pd.DataFrame({'index': range(1, 7),
'state': ['WA', 'NY', 'CO', 'NY', 'CA', 'WA']})
print(data)
# We encode both our categorical variable and its labels
enc = OneHotEncoder()
label_enc = LabelEncoder() # remember the labels here
# Encode labels (can use for discrete numerical values as well)
data_label_encoded = label_enc.fit_transform(data['state'])
data['state'] = data_label_encoded
# Encode and "dummy" variables
data_feature_one_hot_encoded = enc.fit_transform(data[['state']])
# Put into dataframe to look nicer and decode state dummy variables to original state values
# TRY: compare the original input data (look at row numbers) to one hot encoding results
# --> do they match??
pd.DataFrame(data_feature_one_hot_encoded.toarray(), columns = label_enc.inverse_transform(range(4)))
In [ ]:
# Encoded labels as dummy variables
print(data_label_encoded)
# Decoded
print(label_enc.inverse_transform(data_label_encoded))
EXERCISE IDEA: Use one hot encoding to "recode" the iris data's extra surprise column (we are going to add a categorical variable here to play with...)
In [ ]:
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
a = pd.DataFrame(X,
columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])
col5 = pd.DataFrame(np.random.randint(1, 4, size = len(y)))
X_plus = pd.concat([a, col5], axis = 1)
X_plus.head(20)
# ...now one-hot-encode...
Reminder: All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y. (direct quote from the sklearn docs)
"Often the hardest part of solving a machine learning problem can be finding the right estimator for the job."
"Different estimators are better suited for different types of data and different problems."
An estimator for recognizing a new iris from its measurements
Or, in machine learning parlance, we fit an estimator on known samples of the iris measurements to predict the class to which an unseen iris belongs.
Let's give it a try! (We are actually going to hold out a small percentage of the iris dataset and check our predictions against the labels.)
In [ ]:
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn import svm
# Let's load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# split data into training and test sets using the handy train_test_split func
# in this split, we are "holding out" only one value and label (placed into X_test and y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1)
# Define an estimator instance (here, support vector classification)
# this just means giving our instance a name and setting the parameters
clf = svm.SVC(gamma = 0.001, C = 100.)
# We can now fit and predict with this object instance
In [ ]:
# Let's fit the data to the SVC instance object
clf.fit(X_train, y_train)
In [ ]:
# Let's predict on our "held out" sample
y_pred = clf.predict(X_test)
In [ ]:
# What was the label associated with this test sample? ("held out" sample's original label)
# fill in the blank below
# how did our prediction do?
print("Prediction: %d, Original label: %d" % (res[0], ___))
We can be explicit and use the train_test_split method in scikit-learn, as in the following (and as shown above for the iris data):
# Create some data by hand and place 70% into a training set and the rest into a test set
# Here we are using labeled features (X - feature data, y - labels) in our made-up data
import numpy as np
from sklearn import linear_model
from sklearn.cross_validation import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70)
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)
OR

Be more concise and let cross_val_score handle the splitting and scoring for you:
import numpy as np
from sklearn import cross_validation, linear_model
X, y = np.arange(10).reshape((5, 2)), range(5)
clf = linear_model.LinearRegression()
score = cross_validation.cross_val_score(clf, X, y)
There is also a cross_val_predict method to create estimates (cross-validated predictions) rather than scores.
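For instance, a minimal sketch using the same toy data as above (each returned value is predicted by a model that never saw that sample during training):
import numpy as np
from sklearn import cross_validation, linear_model
X, y = np.arange(10).reshape((5, 2)), range(5)
clf = linear_model.LinearRegression()
# out-of-fold predictions: each value comes from a model trained without that sample
y_cv_pred = cross_validation.cross_val_predict(clf, X, y)
print(y_cv_pred)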
Reminder: In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the training set given to the learner is unlabeled, there is no error or reward signal to evaluate a potential solution. Basically, we are just finding a way to represent the data and get as much information from it as we can.
HEY! Remember PCA from above? PCA is actually considered unsupervised learning. We just put it up there because it's a good way to visualize data at the beginning of the ML process.
We are going to continue to use the iris dataset (however, we won't be needing the targets or labels).
In [2]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn import svm
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
# Generate train data
X = 0.3 * np.random.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
# fit the model
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size
# plot the line, the points, and the nearest vectors to the plane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.title("Novelty Detection")
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.Blues_r)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='red')
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='orange')
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([a.collections[0], b1, b2, c],
["learned frontier", "training observations",
"new regular observations", "new abnormal observations"],
loc="upper left",
prop=matplotlib.font_manager.FontProperties(size=11))
plt.xlabel(
"error train: %d/200 ; errors novel regular: %d/40 ; "
"errors novel abnormal: %d/40"
% (n_error_train, n_error_test, n_error_outliers))
Out[2]:
In [2]:
# KMeans clustering of the iris data: group the samples into 3 clusters using
# only the features (no labels), then compare the clusters to the true labels
from sklearn import cluster, datasets
# data
iris = datasets.load_iris()
X, y = iris.data, iris.target
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X)
# how do our original labels fit into the clusters we found?
print(k_means.labels_[::10])
print(y[::10])
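If you want a single number summarizing how well the clusters agree with the true species (remember, KMeans never saw the labels), one option is the adjusted Rand index; a quick sketch:
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score
iris = datasets.load_iris()
X, y = iris.data, iris.target
labels = cluster.KMeans(n_clusters=3).fit(X).labels_
# 1.0 means perfect agreement with the true labels, values near 0.0 mean random assignment
print(adjusted_rand_score(y, labels))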
EXERCISE IDEA: Iterate over different numbers of clusters (the n_clusters param) in KMeans
Here, we will process some data, classify it with SVM (see [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more info), and view the quality of the classification with a confusion matrix.
In [ ]:
import numpy as np
# import model algorithm and data
from sklearn import svm, datasets
# import splitter
from sklearn.cross_validation import train_test_split
# import metrics
from sklearn.metrics import confusion_matrix
# feature data (X) and labels (y)
iris = datasets.load_iris()
X, y = iris.data, iris.target
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, random_state = 42)
In [ ]:
# perform the classification step and run a prediction on test set from above
clf = svm.SVC(kernel = 'linear', C = 0.01)
y_pred = clf.fit(X_train, y_train).predict(X_test)
In [ ]:
# Define a plotting function confusion matrices
# (from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
import matplotlib.pyplot as plt
def plot_confusion_matrix(cm, target_names, title = 'The Confusion Matrix', cmap = plt.cm.YlOrRd):
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.tight_layout()
    # Add the class (target) labels to the x and y axes
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.colorbar()
Numbers in the confusion matrix: each row corresponds to a true label and each column to a predicted label, so the diagonal holds the correct predictions and the off-diagonal cells hold the mistakes.
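As a smaller yes/no illustration with made-up labels (1 = positive, 0 = negative), you can read the true positives, true negatives, false positives, and false negatives straight off the matrix:
from sklearn.metrics import confusion_matrix
# made-up binary example (1 = yes, 0 = no)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
# rows = true labels (0 then 1), columns = predicted labels (0 then 1):
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))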
In [ ]:
%matplotlib inline
cm = confusion_matrix(y_test, y_pred)
# see the actual counts
print(cm)
# visually inspect how the classifier did matching predictions to true labels
plot_confusion_matrix(cm, iris.target_names)
In [ ]:
from sklearn.metrics import classification_report
# Using the test and prediction sets from above
print(classification_report(y_test, y_pred, target_names = iris.target_names))
In [ ]:
# Another example with some toy data
y_test = ['cat', 'dog', 'mouse', 'mouse', 'cat', 'cat']
y_pred = ['mouse', 'dog', 'cat', 'mouse', 'cat', 'mouse']
# How did our predictor do?
# target_names must name the unique classes (here, in sorted label order)
print(classification_report(y_test, y_pred, target_names = ['cat', 'dog', 'mouse']))
EXERCISE IDEA: Normalize or standardize the data, reclassify, and show the confusion matrix.
GridSearchCV parameter tuning
In [ ]:
import numpy as np
from sklearn import cross_validation
# Let's run a prediction on some test data given a trained model
# First, create some data
X = np.sort(np.random.rand(20))
func = lambda x: np.cos(1.5 * np.pi * x)
y = np.array([func(x) for x in X])
In [ ]:
# A plotting function
import matplotlib.pyplot as plt
%matplotlib inline
def plot_fit(X_train, y_train, X_test, y_pred):
    plt.plot(X_test, y_pred, label = "Model")
    plt.plot(X_test, func(X_test), label = "Function")
    plt.scatter(X_train, y_train, label = "Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
Pipelining (as an aside to this section)
Pipeline(steps=[...])
- where steps can be a list of processes through which to put the data, or a dictionary which includes the parameters for each step as values
- For example, here we do a transformation (SelectKBest) and a classification (SVC) all at once in a pipeline we set up:
# imports needed for this snippet
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn import svm

# a feature selection instance
selection = SelectKBest(chi2, k = 2)
# classification instance
clf = svm.SVC(kernel = 'linear')
# make a pipeline
pipeline = Pipeline([("feature selection", selection), ("classification", clf)])
# train the model
pipeline.fit(X, y)
See a full example here
Note: If you wish to perform multiple transformations in your pipeline, try FeatureUnion.
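A small sketch of what that could look like, combining a PCA projection with univariate feature selection into a single feature-building step (the parameter values here are only illustrative):
from sklearn import datasets, svm
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline, FeatureUnion
iris = datasets.load_iris()
X, y = iris.data, iris.target
# two transformations run side by side, their outputs concatenated column-wise
combined = FeatureUnion([("pca", PCA(n_components=2)),
                         ("kbest", SelectKBest(chi2, k=1))])
pipeline = Pipeline([("features", combined),
                     ("svm", svm.SVC(kernel='linear'))])
pipeline.fit(X, y)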
In [ ]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree = 1, include_bias = False)
lm = LinearRegression()
In [ ]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([("polynomial_features", poly),
("linear_regression", lm)])
pipeline.fit(X[:, np.newaxis], y)
X_test = np.linspace(0, 1, 100)
y_pred = pipeline.predict(X_test[:, np.newaxis])
plot_fit(X, y, X_test, y_pred)
In [ ]:
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(include_bias = False)
lm = LinearRegression()
pipeline = Pipeline([("polynomial_features", poly),
("linear_regression", lm)])
param_grid = dict(polynomial_features__degree = list(range(1, 30, 2)),
linear_regression__normalize = [False, True])
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X[:, np.newaxis], y)
print(grid_search.best_params_)
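As a follow-up (assuming the cells above have been run), one way to inspect the winning model is to predict with the grid search's best estimator and reuse the plot_fit helper defined earlier:
# uses X, y, plot_fit, and grid_search from the cells above
X_test = np.linspace(0, 1, 100)
y_pred = grid_search.best_estimator_.predict(X_test[:, np.newaxis])
plot_fit(X, y, X_test, y_pred)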
In [ ]: