Intro to scikit-learn, machine learning in Python

Part of our team's ongoing Data Science Workshops

2014-04-04, Josh Montague (@jrmontag)

Overview

This is a simple introduction to the sklearn API, with two examples using built-in datasets:

  1. K-Nearest Neighbors (supervised classification)
  2. Linear regression (supervised regression)

This session doesn't go into the implementations of the estimator algorithms, nor how to choose them appropriately. However, the official docs are incredible, and have more information than you will probably ever have time to read. See also this fantastic flowchart on when to use which algorithm.

Import some libraries for data munging and inspection:

These aren't part of the sklearn library, but they're important for doing science in Python. If the imports below don't work, you'll need to
$ pip install numpy
and
$ pip install matplotlib

For this tutorial, I'm using numpy version 1.12.1 and matplotlib version 2.0.2 with Python version 3.6.1.

Later we'll import code from sklearn to do the machine learning.


In [ ]:
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

Get & inspect the data

First, we'll get some sample data from the built-in collection. Fisher's famous iris dataset is a great place to start. The datasets are of type bunch, which is dictionary-like.


In [ ]:
# import the 'iris' dataset from sklearn
from sklearn import datasets
iris = datasets.load_iris()
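
As a quick aside: if you're on a reasonably recent sklearn (0.18 or newer - an assumption, so check your version), the dataset loaders can also hand back the feature and label arrays directly, skipping the bunch object entirely.


In [ ]:
# hedged aside: return_X_y skips the bunch and returns (data, target) directly
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)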

Use some of the keys in the bunch to poke around the data and get a sense for what we have to work with. Generally, the sklearn built-in data has data and target keys whose values (arrays) we'll use for our machine learning.


In [ ]:
print("data dimensions (array): {} \n ".format(iris.data.shape))
print("bunch keys: {} \n".format(iris.keys()))
print("feature names: {} \n".format(iris.feature_names))

# the "description" is a giant text blob that will tell you more 
#    about the contents of the bunch
print(iris.DESCR)

Have a look at the actual data we'll be using. Note that the data array has four features for each sample (consistent with the "feature names" above), and the labels can only take on the values 0-2.

I'm still getting familiar with slicing, indexing, etc. of numpy arrays, so I find it helpful to have the docs open somewhere.


In [ ]:
# preview the first 'idx' rows of the data
idx = 6

print("example iris features: \n {} \n".format(iris.data[:idx]))
print("example iris labels: {} \n".format(iris.target[:idx]))
print("all possible labels: {} \n".format(np.unique(iris.target)))

It's always a good idea to put together some scatter plots (if the data is appropriate) to see the space our data covers. Since we have four features, we can just grab some pairs of them and make simple scatter plots.


In [ ]:
plt.figure(figsize = (16,4))

plt.subplot(131)
plt.scatter(iris.data[:, 0:1], iris.data[:, 1:2])
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.axis("tight")

plt.subplot(132)
plt.scatter(iris.data[:, 1:2], iris.data[:, 2:3])
plt.xlabel("sepal width (cm)")
plt.ylabel("petal length (cm)")
plt.axis("tight")

plt.subplot(133)
plt.scatter(iris.data[:, 0:1], iris.data[:, 2:3])
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
plt.axis("tight")

And, what the heck, this is a perfectly good opportunity to try out a 3D plot, too, right? We'll also add in the target from the dataset - that is, the actual labels we're trying to predict - as the coloring. Since we only have three dimensions to plot, we have to leave something out.


In [ ]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize = (10,8))
ax = plt.axes(projection='3d')
ax.view_init(15, 60)   # (elev, azim) : adjust these to change viewing angle!


x = iris.data[:, 0:1]
y = iris.data[:, 1:2]
z = iris.data[:, 2:3]

# the fourth feature could also be encoded as e.g. the color...
label1 = iris.data[:, 3:4]
# ...but here we color the points by the actual labels instead
label2 = iris.target

ax.scatter(x, y, z, c=label2)

plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
ax.set_zlabel("petal length (cm)")
#plt.axis("tight")

Since this dataset has a collection of "ground truth" labels (label2 in the previous graph), this is an example of supervised learning. We tell the algorithm the right answer a whole bunch of times, and look for it to figure out the best way to predict labels of future data samples.

Learning & predicting

In sklearn, model objects (implementing a particular algorithm) are called "estimators". Estimators implement the methods fit(X,y) and predict(X).

First things first, we want to get all the sample feature (data) and label (target) arrays. Then, we'll randomly split the data into training and test sets by shuffling the order and slicing off the last 'subset' samples to hold out for testing.


In [ ]:
iris_X = iris.data
iris_y = iris.target

r = random.randint(0,100)
np.random.seed(r)
idx = np.random.permutation(len(iris_X))

subset = 25

iris_X_train = iris_X[idx[:-subset]]  # all but the last 'subset' rows
iris_y_train = iris_y[idx[:-subset]]
iris_X_test  = iris_X[idx[-subset:]]  # the last 'subset' rows
iris_y_test  = iris_y[idx[-subset:]]
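
As a side note, sklearn also ships a helper that does this shuffle-and-split in one call. Here's a minimal sketch, assuming a version where sklearn.model_selection is available (older releases had a similar function under sklearn.cross_validation); it's equivalent to the manual permutation above.


In [ ]:
# shuffle-and-split in one call; test_size=25 holds out 25 samples for testing
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(iris_X, iris_y, test_size=25, random_state=42)
print(X_tr.shape, X_te.shape)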

To see how we've split the training and test samples, we can again plot them to see how they're distributed. If you re-run the cell above (which re-seeds the random permutation), it should choose a new random split of the data.


In [ ]:
plt.figure(figsize= (6,6))

plt.scatter(iris_X_train[:, 0:1]
        , iris_X_train[:, 1:2]
        , c="blue"
        , s=30
        , label="train data"
        )

plt.scatter(iris_X_test[:, 0:1]
        , iris_X_test[:, 1:2]
        , c="red"
        , s=30        
        , label="test data"        
        )
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend()
#_ = plt.axis("tight")

Now, we use the sklearn package and create a nearest-neighbor classification estimator (with all the default values). After instantiating the object, we use its fit() method and pass it the training data - both features and labels. Note that the __repr__() here tells you about the default values if you want to adjust them. Of course you can always just look at the documentation, too!


In [ ]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# fit the model using the training data and labels
knn.fit(iris_X_train, iris_y_train)
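
If you want something other than the defaults, the keyword arguments go straight into the constructor. A quick sketch - the particular values here are illustrative, not tuned:


In [ ]:
# same estimator, with a couple of parameters set explicitly (illustrative values)
knn5 = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn5.fit(iris_X_train, iris_y_train)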

At this point, the trained knn estimator object has, internally, the "best" mapping from input to output. It can now be used to predict new data via the predict() method. In this case, the prediction is which class the new samples' features best fit - a simple 1D array of labels.


In [ ]:
# predict the labels for the test data, using the trained model
iris_y_predict = knn.predict(iris_X_test)

# show the results (labels)
iris_y_predict

Since this data came labeled, we can compare against the correct answers. As the list grows in size, it gets trickier to spot the differences or similarities by eye. But, still, it looks like the model did a pretty good job.


In [ ]:
iris_y_test

Thankfully, sklearn also has many built-in ways to gauge the "accuracy" of our trained estimator. The simplest is just "what fraction of our classifications did we get right?" Clearly easier than inspecting by eye. Note: occasionally, this estimator actually reaches 100%. If you increase the 'subset' that's held out for testing, you're likely to see a decrease in the accuracy_score, which makes the misclassified points easier to spot in the plots below.


In [ ]:
from sklearn.metrics import accuracy_score

accuracy_score(iris_y_test, iris_y_predict)
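
For what it's worth, you can get the same number without the extra import: classifiers carry a score() method that computes mean accuracy on whatever data you hand it.


In [ ]:
# equivalent to accuracy_score: the classifier's own score() method (mean accuracy)
knn.score(iris_X_test, iris_y_test)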

So we know the successful prediction percentage, but we can also do a visual inspection of how the labels differ. Even for this small dataset, it can be tricky to spot the differences between the true values and predicted values; an accuracy_score in the 90% range means that only one or two samples will be incorrectly predicted.


In [ ]:
plt.figure(figsize = (12,6))

plt.subplot(221)
plt.scatter(iris_X_test[:, 0:1]     # real data
        , iris_X_test[:, 1:2]
        , c=iris_y_test         # real labels
        , s=100
        , alpha=0.6
        )
plt.ylabel("sepal width (cm)")
plt.title("real labels")

plt.subplot(223)
plt.scatter(iris_X_test[:, 0:1]
        , iris_X_test[:, 2:3]
        , c=iris_y_test
        , s=100
        , alpha=0.6
        )
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")


plt.subplot(222)
plt.scatter(iris_X_test[:, 0:1]     # real data
        , iris_X_test[:, 1:2]
        , c=iris_y_predict      # predicted labels
        , s=100
        , alpha=0.6
        )
plt.ylabel("sepal width (cm)")
plt.title("predicted labels")

plt.subplot(224)
plt.scatter(iris_X_test[:, 0:1]
        , iris_X_test[:, 2:3]
        , c=iris_y_predict
        , s=100
        , alpha=0.6
        )
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
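
Another compact way to see exactly where the predictions differ - without squinting at colors - is a confusion matrix, which sklearn.metrics also provides.


In [ ]:
# rows = true labels, columns = predicted labels; off-diagonal counts are the misses
from sklearn.metrics import confusion_matrix
confusion_matrix(iris_y_test, iris_y_predict)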

Alternatively, if you have more numpy and matplotlib skillz than I currently have, you can also visualize the decision surface of a similar estimator like so (source):


In [ ]:
from IPython.core.display import Image
Image(filename='./iris_knn.png', width=500)

Once more, but with model persistence

Now let's work through another supervised example - this time a regression - and add model persistence.

After some time with exploration like in the last example, we'll get a handle on our data, the features that will be helpful, and the general pipeline of analysis. In order to make the analysis more portable (and also when issues of scale become important), we may want to train our estimator once (as well as we possibly can, without overfitting), and then use that estimator for predictions of new data for quite a while. If fitting the estimator takes an hour (or a day!), we definitely don't want to repeat that process any more than necessary.

Enter pickle.

pickle is a way to serialize and de-serialize Python objects; that is, convert an in-memory structure to a byte stream that you can write to file, move around, and read back into memory as a full-fledged Python object.
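
Before applying it to an estimator, here's the whole round trip on a plain Python object (a throwaway dict, just for illustration) - there's nothing sklearn-specific about it.


In [ ]:
import pickle

# serialize a plain object to bytes, then bring it back
original = {'model': 'demo', 'coefs': [1.0, 2.5]}
round_tripped = pickle.loads(pickle.dumps(original))
print(round_tripped == original)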

Let's work with a slightly trimmed-down analysis pipeline this time. First, get some data (still built in) and have a look. This data is on house prices in Boston around the 1970s (hence the ridiculously low prices, relative to today!). The features are all kinds of interesting things, and the targets are the home prices.


In [ ]:
# boston home sales in the 1970s
boston = datasets.load_boston()

boston.keys()

In [ ]:
# get the full info
print(boston.DESCR)

The feature_names are less obvious in this dataset, but if you print out the DESCR above, there's a more detailed explanation of the features.


In [ ]:
print("dimensions: {}, {} \n".format(boston.data.shape, boston.target.shape))
print("features (defs in DESCR): {} \n".format(boston.feature_names))
print("first row: {}\n".format(boston.data[:1]))

While a better model would incorporate all of these features (with appropriate normalization), let's focus on just one for the moment. Column 5 (RM) is the average number of rooms per dwelling. It seems reasonable that this would be a decent predictor of price. Let's have a quick look at the data.


In [ ]:
rooms = boston.data[:, 5]

plt.figure(figsize = (6,6))
plt.scatter(rooms, boston.target, alpha=0.5)
plt.xlabel("room count")
plt.ylabel("cost ($1000s)")
plt.axis("tight")
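
To put a number next to the eyeball test, here's a quick Pearson correlation coefficient (plain numpy, nothing sklearn-specific):


In [ ]:
# correlation between room count and price; corrcoef returns a 2x2 matrix
np.corrcoef(rooms, boston.target)[0, 1]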

Ok, so we can work with this - there's definitely some correlation between these two variables. Let's imagine that we just knew we wanted to fit this immediately, without all the inspection. Furthermore, we wanted to build a model, fit the estimator, and then keep it around for use in an analysis later on (via pickle). Let's go back a step and start from loading the data:


In [ ]:
# here comes the data
boston = datasets.load_boston()

# a little goofiness to get a "column vector" (shape (n_samples, 1)) for the estimator
b_X_tmp = boston.data[:, np.newaxis]
b_X = b_X_tmp[:, :, 5]
b_y = boston.target

# split it out into train/test again
r = random.randint(0,100)
np.random.seed(r)
idx = np.random.permutation(len(b_X))

subset = 50

b_X_train = b_X[idx[:-subset]]  # all but last 'subset' rows
b_y_train = b_y[idx[:-subset]]
b_X_test  = b_X[idx[-subset:]]  # last 'subset' rows
b_y_test  = b_y[idx[-subset:]]

Fit the estimator (and this time let's print out the fitted model parameters)...


In [ ]:
# get our estimator & fit to the training data 
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

print(lr.fit(b_X_train, b_y_train))
print("Coefficients: {},{} \n".format(lr.coef_, lr.intercept_))

And now let's imagine we spent a ton of time building this model, so we want to save it to disk:


In [ ]:
import pickle

p = pickle.dumps(lr)
print(p) # looks super useful, right?

Write the pickle to disk, and then navigate to this location in another shell and cat the file. It's pretty much what you see above. Pretty unhelpful to the eye.


In [ ]:
# write this fitted estimator (python object) to a byte stream
pickle.dump(lr,open('./lin-reg.pkl', 'wb'))

In [ ]:
!cat ./lin-reg.pkl

But now imagine we had another process that had some new data, and we wanted to use this pre-existing model to predict some results. We just read in the file and deserialize it. You can even check the coefficients to see that it's "the same" model.


In [ ]:
# now, imagine you've previously created this file and stored it off somewhere... 
new_lr = pickle.load(open('./lin-reg.pkl', 'rb'))
    
print(new_lr)  
print("Coefficient (compare to previous output): {}, {} \n".format(new_lr.coef_, new_lr.intercept_))

And now we can use the un-pickled estimator (new_lr) to predict the target for our housing data (remember, we use the "test" data for measuring the success of our estimator).


In [ ]:
b_y_predict = new_lr.predict(b_X_test)

#b_y_predict  # you can have a look at the result if you want

Now we can have a look at how we did. Below, the left panel shows the best-fit line through all of the data (both "train" and "test"), and the right panel compares the predicted prices (test data) to the true prices.


In [ ]:
plt.figure(figsize= (12,5))

plt.subplot(121)
plt.scatter(b_X, b_y, c="red")
plt.plot(b_X, new_lr.predict(b_X), '-k')
plt.axis('tight')
plt.xlabel('room count')
plt.ylabel('price ($1000s)')
plt.title("fit to all data")

plt.subplot(122)
plt.scatter(b_y_test, b_y_predict, c="green")
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('true price ($1000s)')
plt.ylabel('predicted price ($1000s)')
plt.title("true- and predicted-value comparison (test data)")
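
And if you'd rather have a number than a picture, sklearn.metrics covers regression too - mean squared error and the R^2 score are the usual suspects.


In [ ]:
from sklearn.metrics import mean_squared_error, r2_score

print("MSE: {}".format(mean_squared_error(b_y_test, b_y_predict)))
print("R^2: {}".format(r2_score(b_y_test, b_y_predict)))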

Not so bad! We left out all sorts of things that might help - weighting, normalization, ... - but it's a solid result for how easy it was.


So, generally, the approach to using the sklearn package is to choose an estimator, split your data, fit on the training data, predict on the testing data, and then celebrate how easy scikit-learn has made this process:

from sklearn.some_module import SomeEstimator

e = SomeEstimator()
e.fit(train_samples [, train_labels])   # labels only for supervised estimators
predictions = e.predict(test_samples)

rejoice()
