Scikit-learn Demo

Josh Speagle (UCSC Python Coffee 12/08/2016)

This notebook goes through some basic examples and uses of Python's scikit-learn package, a user-friendly machine learning package with useful tutorials and excellent documentation, from which I've pulled most of the examples.

Before we begin, I'll first initialize the environment. I'm lazy and like to use Python like Matlab, so there's that. Additionally, I like my figures to have larger labels and tick marks, so I'll update those defaults as well.


In [1]:
%pylab
%matplotlib inline

# re-defining plotting defaults
from matplotlib.font_manager import FontProperties
from matplotlib import gridspec
rcParams.update({'xtick.major.pad': '7.0'})
rcParams.update({'xtick.major.size': '7.5'})
rcParams.update({'xtick.major.width': '1.5'})
rcParams.update({'xtick.minor.pad': '7.0'})
rcParams.update({'xtick.minor.size': '3.5'})
rcParams.update({'xtick.minor.width': '1.0'})
rcParams.update({'ytick.major.pad': '7.0'})
rcParams.update({'ytick.major.size': '7.5'})
rcParams.update({'ytick.major.width': '1.5'})
rcParams.update({'ytick.minor.pad': '7.0'})
rcParams.update({'ytick.minor.size': '3.5'})
rcParams.update({'ytick.minor.width': '1.0'})
rcParams.update({'xtick.color': 'k'})
rcParams.update({'ytick.color': 'k'})
rcParams.update({'font.size': 30})


Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib

Intro: Machine Learning from a Bayesian Perspective

It can be easy to get swept away in the magic of machine learning. Let's first consider some Bayesian motivation, and then move on to the more traditional definitions.

Let's say we have two objects: a training object $A \in \mathbf{A}$, whose properties we know, and an unknown object $B \in \mathbf{B}$, whose properties we want to predict. The Bayesian way to think about this is

\begin{equation} P(A|B)=\frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{\sum_{A\in\mathbf{A}} P(B|A)P(A)}, \end{equation}

where $P(A|B)$ is the posterior, $P(B|A)$ is the likelihood, $P(A)$ is the prior, and $P(B)$ is the evidence. If we are interested in using $A$ as a go-between for some other property we care about (let's call it $Y$), then we can write that as

\begin{equation} P(Y|B)= \sum_{A\in\mathbf{A}} P(Y|A)P(A|B). \end{equation}

To probe $P(Y|B)$, we thus need to sample from the posterior. This requires us to know the likelihood and prior for all of the objects in our training set.
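
As a toy numerical example (with completely made-up numbers, just to fix ideas), suppose the training set contains two classes of object and we observe some data $B$:

# toy Bayes example with made-up numbers (purely illustrative)
prior = array([0.7, 0.3])                   # P(A) for two training classes
like = array([0.2, 0.9])                    # P(B|A) for the observed B
post = like * prior / sum(like * prior)     # P(A|B) via Bayes' theorem
print 'P(A|B):', post
P_Y_given_A = array([0.1, 0.8])             # P(Y|A) for some property Y
print 'P(Y|B):', sum(P_Y_given_A * post)    # marginalize over A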

But what if we have no idea what the likelihood function is or what the priors should be? That's where machine learning comes to the rescue! Machine learning uses the training data to directly probe the posterior in a flexible way. I can't emphasize enough, however, that we don't get this for free - we have to make some educated guesses about the basic properties of the data as well as the approximate form the posterior will take in order to make this happen. And it can be difficult to interpret what's actually going on afterwards.

Overview

In general, what machine learning tries to do is make predictions from a set of training data to a set of unknown data.

These broadly break down into a few main categories:

  • Supervised learning, where we are trying to learn how a set of features/attributes (X) maps to a corresponding set of targets/labels (Y) in our training data. If these targets are discrete, this is referred to as classification. If the targets are continuous, this is referred to as regression.
  • Unsupervised learning, where we are simply trying to find new representations of our features. This can be broadly divided into clustering (finding groups of similar objects), density estimation (turning discrete samples into continuous densities), and manifold learning (creating lower-dimensional representations of higher-dimensional datasets).

We'll quickly run through each of these using some of the data from the sklearn package. Let's import that now.


In [2]:
import sklearn
from sklearn import datasets

Classification: Irises

The iris dataset contains 3 different types of irises (Setosa, Versicolour, and Virginica). We want to try to classify them based on the lengths and widths of their petals and sepals.


In [3]:
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
print iris['DESCR']


Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

To test things out, we need to split up the data into a training and a testing set. Evaluating on the same data you just fit rewards overfitting and gives an overly optimistic picture of performance!


In [4]:
indices = random.permutation(len(iris_X)) # random permutation

iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test  = iris_X[indices[-10:]]
iris_y_test  = iris_y[indices[-10:]]
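
As an aside, sklearn has a built-in helper that does the same thing; a quick sketch is below (not used afterwards, so all of the following results are based on the manual split above):

# equivalent split using sklearn's built-in helper (for reference only)
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(iris_X, iris_y, test_size=10,
                                          random_state=0)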

$k$NN

One of the simplest machine learning methods is the $k$ Nearest Neighbors ($k$NN) method, which literally just finds the $k$ closest objects in feature space (by some distance metric) and then performs some simple operation over their targets (for classification, the default is a majority vote, i.e. the mode). Doing this through sklearn takes 4 lines of code.


In [5]:
from sklearn.neighbors import KNeighborsClassifier # load kNN
knn=KNeighborsClassifier() # initialize with defaults
knn.fit(iris_X_train,iris_y_train) # train
yp=knn.predict(iris_X_test) # predict
print iris_y_test
print yp


[0 0 2 1 2 0 2 1 1 2]
[0 0 2 1 2 0 2 1 1 2]
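
To see how little is going on under the hood, here's the same prediction done by hand for the first test point, a sketch of the default behavior (Euclidean distances plus a majority vote):

# kNN "by hand" for the first test point: Euclidean distances + majority vote
dists = sqrt(((iris_X_train - iris_X_test[0])**2).sum(axis=1))
nearest = argsort(dists)[:5]                    # the 5 closest training points
print iris_y_train[nearest]                     # their labels
print bincount(iris_y_train[nearest]).argmax()  # majority vote
print knn.predict(iris_X_test[:1])              # sklearn's answer for comparison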

There are a bunch of hyperparameters at work here under the hood. We can see just how much tuning is possible by querying the estimator object.


In [6]:
knn


Out[6]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
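
For instance, we can quickly check how the number of neighbors affects performance on the held-out points (a rough sketch; the exact scores depend on the random split above):

# held-out accuracy as a function of the number of neighbors
for k in [1, 3, 5, 10, 20]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(iris_X_train, iris_y_train)
    print k, knn_k.score(iris_X_test, iris_y_test)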

That's pretty solid. Visualizing what the classifier is actually doing, so we can see the decision boundaries, takes a bit more code though.


In [7]:
### setting up plotting utils for later

from matplotlib.colors import ListedColormap

# create color maps
cmap_light=ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold=ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

X=iris_X_train[:,:2]
y=iris_y_train

# create mesh of input points
h=.02  # step size
x_min,x_max=X[:, 0].min()-1, X[:, 0].max()+1
y_min,y_max=X[:, 1].min()-1, X[:, 1].max()+1
xx,yy=meshgrid(arange(x_min,x_max,h),arange(y_min,y_max,h)) # (X,Y) mesh

In [8]:
clf=knn # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes

figure(figsize=(8,6))

# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)

# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (k=%i, weights='%s')" % (knn.n_neighbors,knn.weights),fontsize=24,y=1.02)


Out[8]:
<matplotlib.text.Text at 0x117f4b7d0>

Logistic Regression

A rival to $k$NN? Fitting a line to things! While linear regression is the method of choice for regression problems, the equivalent for classification is fitting a logistic function:

\begin{equation} y = \textrm{logistic}(X\beta - \alpha) + \epsilon = \frac{1}{1 + \textrm{exp}(- X\beta + \alpha)} + \epsilon. \end{equation}

An example is shown below.


In [9]:
plot(linspace(-1,1,100),1./(1+exp(-50*linspace(-1,1,100))),lw=3,color='navy')
xlabel('Feature')
ylabel('P(class)')


Out[9]:
<matplotlib.text.Text at 0x117ff2d50>

In [10]:
from sklearn import linear_model
logistic = linear_model.LogisticRegression(C=1e5)
logistic.fit(iris_X_train, iris_y_train)
yp=logistic.predict(iris_X_test) # predict
print iris_y_test
print yp


[0 0 2 1 2 0 2 1 1 2]
[0 0 2 1 2 0 2 1 1 2]
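
Unlike $k$NN's simple majority vote, logistic regression also hands us class probabilities directly, which we can inspect for the test set:

# predicted class probabilities for the test set (each row sums to 1)
print logistic.predict_proba(iris_X_test).round(3)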

In [11]:
logistic


Out[11]:
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [12]:
clf=logistic # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes

figure(figsize=(8,6))

# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)

# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (C=%i)" % (logistic.C),fontsize=24,y=1.02)


Out[12]:
<matplotlib.text.Text at 0x117404ad0>

SVM

Support Vector Machines (SVMs) are essentially a fancier version of logistic regression, except you no longer have to assume your decision boundaries are simple linear functions of your features. The complexity is instead set by the input kernel, which defines the inner product over the feature space.


In [13]:
from sklearn import svm
svc = svm.SVC(kernel='linear')
svc.fit(iris_X_train, iris_y_train)
yp=svc.predict(iris_X_test) # predict
print iris_y_test
print yp


[0 0 2 1 2 0 2 1 1 2]
[0 0 2 1 2 0 2 1 1 2]
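
Under the hood, the fitted model is defined entirely by a subset of the training points (the support vectors), which we can inspect directly:

# the decision boundaries are set by the support vectors
print svc.n_support_               # number of support vectors per class
print svc.support_vectors_.shape   # the support vectors themselves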

In [14]:
svc


Out[14]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
clf=svc # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes

figure(figsize=(8,6))

# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)

# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (Kernel='%s')" % (svc.kernel),fontsize=24,y=1.02)


Out[15]:
<matplotlib.text.Text at 0x11b409c90>

In [16]:
svc=svm.SVC(kernel='poly',degree=3)
clf=svc # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes

figure(figsize=(8,6))

# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)

# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (Kernel='%s', deg=%d)" % (svc.kernel,svc.degree),fontsize=24,y=1.02)


Out[16]:
<matplotlib.text.Text at 0x11b92f8d0>

In [17]:
svc=svm.SVC(kernel='rbf')
clf=svc # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes

figure(figsize=(8,6))

# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)

# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (Kernel='%s')" % (svc.kernel),fontsize=24,y=1.02)


Out[17]:
<matplotlib.text.Text at 0x11c214dd0>

Regression: Boston Housing Prices

We'll now move on to regression, where we get to use a bunch of variables to estimate housing prices in Boston.


In [18]:
boston=datasets.load_boston()
X=boston.data
y=boston.target
print boston['DESCR']


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


In [19]:
figure(figsize=(14,6))
subplot(1,2,1)
plot(X[:,4],X[:,5],'.')
xlabel(boston.feature_names[4])
ylabel(boston.feature_names[5])
tight_layout()

subplot(1,2,2)
h=hist(y,20,histtype='step',lw=3,color='red')
xlabel('Median Value (thousands)')
ylabel('dN')
tight_layout()


Linear Regression and Cross-Validation

We'll start by using simple linear regression so we can introduce two new concepts: cross-validation and pre-processing.

In essence, it's pretty inefficient to be like "yo, let's just check our performance with 20% of the data" when you have all those training objects available! While there are a couple versions of this, the simplest is $k$-fold cross-validation, where you split the sample into $k$ distinct subsamples, train on $k-1$ of them, and validate on the remaining one. Do this $k$ times and you've essentially generated predictions for the entire training set without cheating (i.e. overfitting).

It's also sometimes hard to determine which features are informative since they can all have different normalizations. We can correct for this by essentially normalizing everything to have $[\mu,\sigma]=[0,1]$. This is an example of pre-processing the data to make things easier later on. Other examples include trying to remove correlations, pruning outliers, adding combinations of features, and more!
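
As a quick sanity check (a sketch, not part of the original analysis), StandardScaler really does map each of the Boston features to zero mean and unit variance:

# sanity check: StandardScaler maps each feature to mean ~0, std ~1
from sklearn import preprocessing
X_check = preprocessing.StandardScaler().fit_transform(X)
print X.mean(axis=0)[:3], X.std(axis=0)[:3]              # raw features
print X_check.mean(axis=0)[:3], X_check.std(axis=0)[:3]  # standardized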


In [20]:
from sklearn.model_selection import cross_val_predict # import cross validation module
from sklearn import linear_model # import linear model
from sklearn import preprocessing # import pre-processing functions

X_norm=preprocessing.StandardScaler().fit_transform(X)

lr=linear_model.LinearRegression() 
yp=cross_val_predict(lr,X_norm,y,cv=10)

figure(figsize=(10,6))
plot(y,yp,'r.')
plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=3)
xlabel('Measured')
ylabel('Predicted')
plt.show()
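
For reference, cross_val_predict is doing something like the following loop under the hood (a sketch, not the exact implementation):

# roughly what cross_val_predict does: fit on k-1 folds, predict the held-out fold
from sklearn.model_selection import KFold
yp_manual = zeros_like(y)
for train_idx, test_idx in KFold(n_splits=10).split(X_norm):
    lr.fit(X_norm[train_idx], y[train_idx])
    yp_manual[test_idx] = lr.predict(X_norm[test_idx])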


We can gain some insight by looking at the coefficient attached to each variable (note that here the fit uses the raw, unstandardized features, so the coefficient magnitudes also reflect the units of each feature).


In [21]:
lr.fit(X,y)
idx=argsort(abs(lr.coef_))
print boston.feature_names[idx][::-1]
print lr.coef_[idx][::-1]


['NOX' 'RM' 'CHAS' 'DIS' 'PTRATIO' 'LSTAT' 'RAD' 'CRIM' 'ZN' 'INDUS' 'TAX'
 'B' 'AGE']
[ -1.77957587e+01   3.80475246e+00   2.68856140e+00  -1.47575880e+00
  -9.53463555e-01  -5.25466633e-01   3.05655038e-01  -1.07170557e-01
   4.63952195e-02   2.08602395e-02  -1.23293463e-02   9.39251272e-03
   7.51061703e-04]

Random Forests and Feature Importances

Decision trees are one of the most commonly used machine learning methods due to their speed, flexibility, and ease of use. Similar to $k$NN, decision trees seek to find a set of nearby objects to make predictions. Instead of computing distances in feature space, however, they instead recursively partition the space using a set of logical statements in a way that maximizes the information gain. While individual trees are prone to overfitting, ensemble estimators that aggregate many decision trees into a random forest tend to be more robust.


In [22]:
from sklearn import ensemble # import ensemble methods

rf=ensemble.RandomForestRegressor(n_estimators=100) # initialize 100 trees with default parameters
yp=cross_val_predict(rf,X_norm,y,cv=10)

In [23]:
figure(figsize=(10,6))
plot(y,yp,'r.')
plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=3)
xlabel('Measured')
ylabel('Predicted')
plt.show()
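
To see why the ensemble helps, we can cross-validate a single (fully grown) decision tree the same way; as a rough sketch (the exact numbers will vary from run to run), the lone tree generally scores noticeably worse than the forest:

# a single decision tree for comparison with the 100-tree forest
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
tree = DecisionTreeRegressor()
print 'Single tree:', median(cross_val_score(tree, X_norm, y, cv=10))
print 'Random forest:', median(cross_val_score(rf, X_norm, y, cv=10))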


One really slick thing about decision trees is that the splits themselves give you a way to rank feature importances.


In [24]:
rf.fit(X_norm,y)
feature_importances=rf.feature_importances_
feature_importances=100.0*(feature_importances/feature_importances.max())

figure(figsize=(12,6))
idx=argsort(feature_importances)[::-1]
h=bar(arange(len(idx)),feature_importances[idx],width=1,lw=3,color='darkviolet')
xticks(arange(len(idx))+0.5,boston.feature_names[idx],rotation='vertical')
ylabel('Importance')
tight_layout()


We can compare the two models using their cross-validated scores (the default scorer for regressors is $R^2$) to better quantify which one works better, and by how much.


In [25]:
from sklearn.model_selection import cross_val_score
print 'Linear Regression:',median(cross_val_score(lr,X_norm,y,cv=20))
print 'Random Forest:',median(cross_val_score(rf,X_norm,y,cv=20))


Linear Regression: 0.441728953823
Random Forest: 0.599176237186

Isotonic Regression

You don't have to choose between completely non-parametric methods and extremely linear ones: there's tons of cool stuff out there to try to fit your dataset. One example is isotonic regression, which attempts to fit a non-decreasing (monotonic) model to your data. This is useful if you have some reason to believe your underlying function should always be increasing, as is the case below.


In [26]:
# create mock data
n=500
x=linspace(0,100,n)
y=randint(-50,50,size=(n,))+50.*log(1+x)

figure(figsize=(10,5))
plot(x,y)


Out[26]:
[<matplotlib.lines.Line2D at 0x11d918910>]

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.isotonic import IsotonicRegression
ir=IsotonicRegression() # initialize isotonic regression object
yp=ir.fit_transform(x,y) # fit and transform results

lr = LinearRegression() # linear regression (for comparison)
lr.fit(x[:,None], y)
yp2=lr.predict(x[:,None])

figure(figsize=(10,5))
plot(x,y,alpha=0.3,color='black')
plot(x,yp2,color='blue',lw=2)
plot(x,yp,color='red',lw=3)


Out[27]:
[<matplotlib.lines.Line2D at 0x11cfba0d0>]

Kernel Ridge Regression and Gaussian Process Regression

Finally, we can look at more methods that take advantage of the "kernel trick" first demonstrated with SVMs to do all sorts of fun things. The first is Kernel Ridge Regression (KRR), which combines kernels with ridge regression (a more stable, regularized form of linear regression). The second involves Gaussian Processes (GPs), which are magical things that exploit some nice properties of multivariate Gaussians, with the covariance structure set by (you guessed it) a kernel.
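
To make "kernel" a bit more concrete: the kernel objects in sklearn.gaussian_process.kernels are callable and return the full kernel (Gram) matrix, i.e. the generalized inner products between all pairs of points (a minimal sketch):

# a kernel is just a function k(x, x'); calling it on a set of points
# returns the matrix of pairwise similarities (the Gram matrix)
from sklearn.gaussian_process.kernels import ExpSineSquared
kern = ExpSineSquared(length_scale=1.0, periodicity=5.0)
pts = linspace(0, 10, 5)[:, None]
print kern(pts).round(2)   # 5x5 matrix; the diagonal is k(x,x) = 1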


In [28]:
# Generate sample data
X=15*rand(100,1)
y=sin(X).ravel()+3*(0.5-rand(X.shape[0])) # sin w/ white noise

figure(figsize=(10,5))
plot(X,y,'ko')
plot(linspace(0,15,1000),sin(linspace(0,15,1000)),lw=2)

X_plot = linspace(0, 40, 1000)[:, None] # grid to evaluate predictions


As you can imagine, both of these methods now involve a fair number of hyperparameters that we either need to optimize or marginalize over. The simplest way to do this is to run a grid search over the hyperparameters, which we do below.


In [29]:
from sklearn.kernel_ridge import KernelRidge # kernel ridge regression
from sklearn.model_selection import GridSearchCV # grid search cross-validation (for hyper-parameters)
from sklearn.gaussian_process import GaussianProcessRegressor # GP
from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared # kernels

In [30]:
# find optimal hyperparameters for KRR using a gridsearch and fit data
param_grid={"alpha": [1e0,1e-1,1e-2,1e-3],
            "kernel": [ExpSineSquared(l, p)
                       for l in logspace(-2,2,10)
                       for p in logspace(0,2,10)]}
kr=GridSearchCV(KernelRidge(),cv=5,param_grid=param_grid)
kr.fit(X,y)
y_kr=kr.predict(X_plot)
kr.best_params_


Out[30]:
{'alpha': 0.1, 'kernel': ExpSineSquared(length_scale=1.67, periodicity=12.9)}

In [31]:
# fit GP with exponential sine squared kernel + white noise kernel
gp_kernel=ExpSineSquared(1.0,5.0,periodicity_bounds=(1e-2, 1e1))+WhiteKernel(1e-1)
gpr=GaussianProcessRegressor(kernel=gp_kernel,n_restarts_optimizer=50)
gpr.fit(X,y)
y_gpr,y_std=gpr.predict(X_plot,return_std=True)
print gpr.kernel_
print '3*std(unif):',3*sqrt(1/12.)


ExpSineSquared(length_scale=1.51, periodicity=6.29) + WhiteKernel(noise_level=0.703)
3*std(unif): 0.866025403784

In [32]:
# plot results
figure(figsize=(10,5))

plot(X,y,'ko',label='Data') # data
plot(X_plot,sin(X_plot),lw=2,color='black',label='True') # true underlying distribution
plot(X_plot,y_kr,color='red',lw=2,label='KRR')
plot(X_plot,y_gpr,color='blue',lw=2,label='GPR')
fill_between(X_plot[:,0],y_gpr-y_std,y_gpr+y_std,color='lightblue',alpha=0.3)
xlabel('Features')
ylabel('Target')
xlim([0,40])
ylim([-4,4])
legend(loc="best",scatterpoints=1,prop={'size': 8})


Out[32]:
<matplotlib.legend.Legend at 0x11dbd4790>

Unsupervised Learning

$k$-Means Clustering

The "bread and butter" of unsupervised learning is $k$-means clustering. This just involves partitioning the data into $k$ clusters, where we're trying to figure out where to put the cluster centers to partitions the data well. We can demonstrate this on the iris dataset.


In [33]:
from sklearn import cluster

k_means=cluster.KMeans(n_clusters=3)
k_means.fit(iris_X)

print(k_means.labels_[::10])
print(iris_y[::10])


[2 2 2 2 2 1 1 1 1 1 0 0 0 0 0]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
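
Note that the cluster labels themselves are arbitrary, so they only agree with the true classes up to a relabeling (which is exactly what we see above). A separate practical question is how to choose $k$ in the first place; one common quick check (a sketch) is to watch how the within-cluster sum of squares (the inertia_ attribute) drops as $k$ increases and look for an "elbow":

# within-cluster sum of squares as a function of k (the "elbow" heuristic)
for k in [1, 2, 3, 4, 5, 6]:
    km = cluster.KMeans(n_clusters=k).fit(iris_X)
    print k, km.inertia_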

In [34]:
X=iris_X[:,:2]
y=iris_y
clf=k_means # classification learning function (clf)
clf.fit(X) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes

figure(figsize=(8,6))

# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)

# Plot also the training points
scatter(k_means.cluster_centers_[:,0],k_means.cluster_centers_[:,1],color='black',cmap=cmap_bold,s=150,alpha=0.5)
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (k=%i)" % (k_means.n_clusters),fontsize=24,y=1.02)


Out[34]:
<matplotlib.text.Text at 0x11e7ae510>

PCA

The final thing to cover is the simplest form of dimensionality reduction, Principal Component Analysis (PCA). This involves finding linear combinations of the features that explain the maximum amount of variance. Often, this is the best way to remove linear-ish correlations among parameters, which can (1) help visualize things, (2) reduce the effective data size (by limiting yourself to the maximally-informative PCA components), and (3) reduce correlations among components (which helps when interpreting feature importances).


In [35]:
from sklearn.decomposition import PCA

pca=PCA()
X_r=pca.fit(iris_X).transform(iris_X) # project onto the principal components (all 4 kept here; we'll plot the first 2)

# Percentage of variance explained for each components
print('Explained Variance Ratio: %s' % str(pca.explained_variance_ratio_))

figure(figsize=(10,6))
colors=['#FF0000', '#00FF00', '#0000FF']

for color,i,target_name in zip(colors,[0,1,2],iris.target_names):
    scatter(X_r[y==i,0],X_r[y==i,1],color=color,lw=2,label=target_name)
xlabel('PCA Component 1')
ylabel('PCA Component 2')
legend(loc='best',scatterpoints=1,fontsize=14)
title('PCA of Iris dataset',y=1.02)


Explained Variance Ratio: [ 0.92461621  0.05301557  0.01718514  0.00518309]
Out[35]:
<matplotlib.text.Text at 0x11c911b90>
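
Based on the ratios above, the first two components already capture about 98% of the variance, so restricting to them loses very little information (a quick sketch):

# cumulative explained variance, plus an explicit 2-component projection
print cumsum(pca.explained_variance_ratio_)
pca2 = PCA(n_components=2)
X_r2 = pca2.fit_transform(iris_X)   # not used below; In [36] uses X_r from above
print X_r2.shape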

In [36]:
X=X_r[:,:2]

h=.02  # step size
x_min,x_max=X[:, 0].min()-1, X[:, 0].max()+1
y_min,y_max=X[:, 1].min()-1, X[:, 1].max()+1
xx,yy=meshgrid(arange(x_min,x_max,h),arange(y_min,y_max,h)) # (X,Y) mesh

clf=svc # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes

figure(figsize=(8,6))

# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)

# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (Kernel='%s')" % (svc.kernel),fontsize=24,y=1.02)


Out[36]:
<matplotlib.text.Text at 0x11dcbfcd0>
