This notebook walks through some basic examples and uses of Python's scikit-learn package, a user-friendly machine learning package with useful tutorials and excellent documentation, from which I've pulled most of the examples.
Before we begin, I'll first initialize the environment. I'm lazy and like to use Python like Matlab, so there's that. Additionally, I like my figures to have larger labels and axis markers, so I'll update those defaults as well.
In [1]:
%pylab
%matplotlib inline
# re-defining plotting defaults
from matplotlib.font_manager import FontProperties
from matplotlib import gridspec
rcParams.update({'xtick.major.pad': '7.0'})
rcParams.update({'xtick.major.size': '7.5'})
rcParams.update({'xtick.major.width': '1.5'})
rcParams.update({'xtick.minor.pad': '7.0'})
rcParams.update({'xtick.minor.size': '3.5'})
rcParams.update({'xtick.minor.width': '1.0'})
rcParams.update({'ytick.major.pad': '7.0'})
rcParams.update({'ytick.major.size': '7.5'})
rcParams.update({'ytick.major.width': '1.5'})
rcParams.update({'ytick.minor.pad': '7.0'})
rcParams.update({'ytick.minor.size': '3.5'})
rcParams.update({'ytick.minor.width': '1.0'})
rcParams.update({'xtick.color': 'k'})
rcParams.update({'ytick.color': 'k'})
rcParams.update({'font.size': 30})
It can be easy to get swept away in the magic of machine learning. Let's first consider some Bayesian motivation, and then move on to the more traditional definitions.
Let's say we have two objects: a training object $A \in \mathbf{A}$, whose properties we know, and an unknown object $B \in \mathbf{B}$, whose properties we want to predict. The Bayesian way to think about this is
\begin{equation} P(A|B)=\frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{\sum_{A\in\mathbf{A}} P(B|A)P(A)}, \end{equation}where $P(A|B)$ is the posterior, $P(B|A)$ is the likelihood, $P(A)$ is the prior, and $P(B)$ is the evidence. If we are interested in using $A$ as a go-between for some other property we care about (let's call it $Y$), then we can write that as
\begin{equation} P(Y|B)= \sum_{A\in\mathbf{A}} P(Y|A)P(A|B). \end{equation}To probe $P(Y|B)$, we thus need to sample from the posterior. This requires us to know the likelihood and prior for all of the objects in our training set.
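To make the marginalization concrete, here's a toy numerical sketch with made-up numbers: three possible training classes $A$, one observed object $B$, and a binary property $Y$ we care about. (Everything here is illustrative, not from a real dataset.)
In [ ]:
# toy sketch of the sum above (all numbers are made up for illustration)
import numpy as np
P_A = np.array([0.5, 0.3, 0.2])           # priors P(A) over three training classes
P_B_given_A = np.array([0.1, 0.4, 0.7])   # likelihoods P(B|A) of the observed object
P_A_given_B = P_B_given_A * P_A / np.sum(P_B_given_A * P_A)  # posterior via Bayes' rule
P_Y_given_A = np.array([0.9, 0.5, 0.2])   # P(Y|A) for the property we actually care about
P_Y_given_B = np.sum(P_Y_given_A * P_A_given_B)              # marginalize over A
print(P_A_given_B, P_Y_given_B)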
But what if we have no idea what the likelihood function is or what the priors should be? That's where machine learning comes to the rescue! Machine learning uses the training data to directly probe the posterior in a flexible way. I can't emphasize enough, however, that we don't get this for free - we have to make some educated guesses about the basic properties of the data as well as the approximate form the posterior will take in order to make this happen. And it can be difficult to interpret what's actually going on afterwards.
In general, what machine learning tries to do is make predictions about a set of unknown data based on a set of training data.
These methods broadly break down into a few main categories: classification and regression (supervised learning, where the training targets are known) and clustering and dimensionality reduction (unsupervised learning, where they aren't).
We'll quickly run through each of these using some of the data from the sklearn package. Let's import that now.
In [2]:
import sklearn
from sklearn import datasets
The iris dataset contains 3 different types of irises (Setosa, Versicolour, and Virginica). We want to try to classify them based on the lengths and widths of their petals and sepals.
In [3]:
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
print(iris['DESCR'])
To test things out, we need to split the data into a training set and a testing set. If you test on the same data you just fit, you won't catch overfitting and your performance estimates will be overly optimistic!
In [4]:
indices = random.permutation(len(iris_X)) # random permutation
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
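As an aside, sklearn also has a built-in helper that does this kind of split for you; here's a minimal sketch (the test_size and random_state values are just illustrative choices, not from the original split above).
In [ ]:
# equivalent split using sklearn's built-in helper
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_X, iris_y, test_size=10, random_state=0)  # hold out 10 objects, as above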
One of the simplest machine learning methods is the $k$ Nearest Neighbors ($k$NN) method, which simply finds the $k$ closest objects in feature space (by some distance metric) and then performs some simple operation over their targets (for classification, the default is a majority vote). Doing this through sklearn takes just 4 lines of code.
In [5]:
from sklearn.neighbors import KNeighborsClassifier # load kNN
knn=KNeighborsClassifier() # initialize with defaults
knn.fit(iris_X_train,iris_y_train) # train
yp=knn.predict(iris_X_test) # predict
print(iris_y_test)
print(yp)
There are a bunch of hyperparameters at work here under the hood. We can see just how much tuning is possible by querying the estimator object.
In [6]:
knn
Out[6]:
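The number of neighbors is the big one. Here's a quick, hedged sketch of how sensitive the held-out accuracy is to $k$ (the particular values of $k$ below are arbitrary choices).
In [ ]:
# vary the number of neighbors and check accuracy on the held-out set
for k in [1, 3, 5, 10, 20]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(iris_X_train, iris_y_train)
    print(k, knn_k.score(iris_X_test, iris_y_test))  # mean accuracy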
That's pretty solid. Visualizing how this works so that we can see the decision boundaries takes a bit more code, though.
In [7]:
### setting up plotting utils for later
from matplotlib.colors import ListedColormap
# create color maps
cmap_light=ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold=ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
X=iris_X_train[:,:2]
y=iris_y_train
# create mesh of input points
h=.02 # step size
x_min,x_max=X[:, 0].min()-1, X[:, 0].max()+1
y_min,y_max=X[:, 1].min()-1, X[:, 1].max()+1
xx,yy=meshgrid(arange(x_min,x_max,h),arange(y_min,y_max,h)) # (X,Y) mesh
In [8]:
clf=knn # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes
figure(figsize=(8,6))
# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)
# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (k=%i, weights='%s')" % (knn.n_neighbors,knn.weights),fontsize=24,y=1.02)
Out[8]:
A rival to $k$NN? Fitting a line to things! While linear regression is the method of choice for regression problems, the equivalent for classification is fitting a logistic function:
\begin{equation} y = \textrm{logistic}(X\beta - \alpha) + \epsilon = \frac{1}{1 + \textrm{exp}(- X\beta + \alpha)} + \epsilon. \end{equation}An example is shown below.
In [9]:
plot(linspace(-1,1,100),1./(1+exp(-50*linspace(-1,1,100))),lw=3,color='navy')
xlabel('Feature')
ylabel('P(class)')
Out[9]:
In [10]:
from sklearn import linear_model
logistic = linear_model.LogisticRegression(C=1e5)
logistic.fit(iris_X_train, iris_y_train)
yp=logistic.predict(iris_X_test) # predict
print(iris_y_test)
print(yp)
In [11]:
logistic
Out[11]:
In [12]:
clf=logistic # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes
figure(figsize=(8,6))
# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)
# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (C=%i)" % (logistic.C),fontsize=24,y=1.02)
Out[12]:
Support Vector Machines (SVMs) are essentially a fancier version of logistic regression, except you don't have to assume your decision boundaries are simple linear functions of your features. The complexity is instead set by the choice of kernel, which defines the inner product over the feature space.
In [13]:
from sklearn import svm
svc = svm.SVC(kernel='linear')
svc.fit(iris_X_train, iris_y_train)
yp=svc.predict(iris_X_test) # predict
print(iris_y_test)
print(yp)
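To make the "kernel = inner product" point concrete, SVC also accepts a user-defined kernel as a callable that returns the matrix of inner products between samples. Here's a minimal sketch (the function name my_linear_kernel is just illustrative); it should behave like kernel='linear' above.
In [ ]:
# a hand-rolled linear kernel: entry (i,j) is the inner product of sample i and sample j
def my_linear_kernel(A, B):
    return dot(A, B.T)
svc_custom = svm.SVC(kernel=my_linear_kernel)
svc_custom.fit(iris_X_train, iris_y_train)
print(svc_custom.score(iris_X_test, iris_y_test))  # accuracy on the held-out set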
In [14]:
svc
Out[14]:
In [15]:
clf=svc # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes
figure(figsize=(8,6))
# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)
# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (Kernel='%s')" % (svc.kernel),fontsize=24,y=1.02)
Out[15]:
In [16]:
svc=svm.SVC(kernel='poly',degree=3)
clf=svc # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes
figure(figsize=(8,6))
# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)
# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (Kernel='%s', deg=%d)" % (svc.kernel,svc.degree),fontsize=24,y=1.02)
Out[16]:
In [17]:
svc=svm.SVC(kernel='rbf')
clf=svc # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes
figure(figsize=(8,6))
# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)
# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (Kernel='%s')" % (svc.kernel),fontsize=24,y=1.02)
Out[17]:
We'll now move on to regression, where we get to use a bunch of variables to estimate housing prices in Boston.
In [18]:
boston=datasets.load_boston()
X=boston.data
y=boston.target
print(boston['DESCR'])
In [19]:
figure(figsize=(14,6))
subplot(1,2,1)
plot(X[:,4],X[:,5],'.')
xlabel(boston.feature_names[4])
ylabel(boston.feature_names[5])
tight_layout()
subplot(1,2,2)
h=hist(y,20,histtype='step',lw=3,color='red')
xlabel('Median Value (thousands)')
ylabel('dN')
tight_layout()
We'll start by using simple linear regression so we can introduce two new concepts: cross-validation and pre-processing.
In essence, it's pretty inefficient to say "let's just check our performance with 20% of the data" when you have all those training objects available! While there are a couple versions of this, the simplest is $k$-fold cross-validation, where you split the sample into $k$ distinct subsamples, train on $k-1$ of them, and validate on the remaining one. Do this $k$ times and you've essentially generated predictions for the entire training set without cheating (i.e. overfitting).
It's also sometimes hard to determine which features are informative since they can all have different normalizations. We can correct for this by essentially normalizing everything to have $[\mu,\sigma]=[0,1]$. This is an example of pre-processing the data to make things easier later on. Other examples include trying to remove correlations, pruning outliers, adding combinations of features, and more!
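Before using the built-in helpers, here's a minimal sketch of what $k$-fold cross-validation is doing under the hood, using sklearn's KFold splitter (the n_splits and random_state values are just illustrative); the cross_val_* functions below wrap this loop up for you.
In [ ]:
# manual k-fold splitting: each object ends up in the validation fold exactly once
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(X):
    print(len(train_idx), len(valid_idx))  # train on these, validate on those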
In [20]:
from sklearn.model_selection import cross_val_predict # import cross validation module
from sklearn import linear_model # import linear model
from sklearn import preprocessing # import pre-processing functions
X_norm=preprocessing.StandardScaler().fit_transform(X)
lr=linear_model.LinearRegression()
yp=cross_val_predict(lr,X_norm,y,cv=10)
figure(figsize=(10,6))
plot(y,yp,'r.')
plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=3)
xlabel('Measured')
ylabel('Predicted')
plt.show()
We can gain some insight by looking at the size of the fitted coefficient associated with each (standardized) feature.
In [21]:
lr.fit(X_norm,y) # fit on the standardized features so coefficient sizes are comparable
idx=argsort(abs(lr.coef_))
print(boston.feature_names[idx][::-1])
print(lr.coef_[idx][::-1])
Decision trees are one of the most commonly used machine learning methods due to their speed, flexibility, and ease of use. Similar to $k$NN, decision trees seek out a set of nearby objects to make predictions. Instead of computing distances in feature space, however, they recursively partition the space using a set of logical statements in a way that maximizes the information gain. While individual trees are prone to overfitting, ensemble estimators that aggregate many decision trees into a random forest tend to be more robust.
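To back up the overfitting claim, here's a hedged sketch that scores a single decision tree with a similar cross-validation scheme (the default $R^2$ score); you can compare it against the random forest results further down. The random_state value is just for reproducibility.
In [ ]:
# cross-validated score of a single decision tree (compare with the forest below)
from sklearn import tree
from sklearn.model_selection import cross_val_score
dt = tree.DecisionTreeRegressor(random_state=0)
print(median(cross_val_score(dt, X_norm, y, cv=10)))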
In [22]:
from sklearn import ensemble # import ensemble methods
rf=ensemble.RandomForestRegressor(n_estimators=100) # initialize 100 trees with default parameters
yp=cross_val_predict(rf,X_norm,y,cv=10)
In [23]:
figure(figsize=(10,6))
plot(y,yp,'r.')
plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=3)
xlabel('Measured')
ylabel('Predicted')
plt.show()
One really slick thing about decision trees is that the splits give you a natural way to rank feature importances.
In [24]:
rf.fit(X_norm,y)
feature_importances=rf.feature_importances_
feature_importances=100.0*(feature_importances/feature_importances.max())
figure(figsize=(12,6))
idx=argsort(feature_importances)[::-1]
h=bar(arange(len(idx)),feature_importances[idx],width=1,lw=3,color='darkviolet')
xticks(arange(len(idx))+0.5,boston.feature_names[idx],rotation='vertical')
ylabel('Importance')
tight_layout()
We can compare the two models more quantitatively using cross-validated scores to see which one works better (and by how much). Note that cross_val_score defaults to the $R^2$ score for regressors; an explicit mean squared error (MSE) comparison is sketched just after the next cell.
In [25]:
from sklearn.model_selection import cross_val_score
print('Linear Regression:',median(cross_val_score(lr,X_norm,y,cv=20)))
print('Random Forest:',median(cross_val_score(rf,X_norm,y,cv=20)))
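Since cross_val_score defaults to $R^2$ for regressors, here's a hedged sketch of the explicit MSE comparison promised above (sklearn reports it as a negative value so that higher is always better; we flip the sign back).
In [ ]:
# explicit mean squared error via the 'neg_mean_squared_error' scorer
mse_lr = -median(cross_val_score(lr, X_norm, y, cv=20, scoring='neg_mean_squared_error'))
mse_rf = -median(cross_val_score(rf, X_norm, y, cv=20, scoring='neg_mean_squared_error'))
print('Linear Regression MSE:', mse_lr)
print('Random Forest MSE:', mse_rf)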
You don't have to choose between completely non-parametric methods and strictly linear ones: there's tons of cool stuff out there to fit your dataset with. One example is isotonic regression, which attempts to fit a non-decreasing (monotonic) model to your data. This is useful if you have some reason to believe your underlying function should never decrease, as is the case below.
In [26]:
# create mock data
n=500
x=linspace(0,100,n)
y=randint(-50,50,size=(n,))+50.*log(1+x)
figure(figsize=(10,5))
plot(x,y)
Out[26]:
In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.isotonic import IsotonicRegression
ir=IsotonicRegression() # initialize isotonic regression object
yp=ir.fit_transform(x,y) # fit and transform results
lr = LinearRegression() # linear regression (for comparison)
lr.fit(x[:,None], y)
yp2=lr.predict(x[:,None])
figure(figsize=(10,5))
plot(x,y,alpha=0.3,color='black')
plot(x,yp2,color='blue',lw=2)
plot(x,yp,color='red',lw=3)
Out[27]:
Finally, we can look at other methods that take advantage of the "kernel trick" we first saw with SVMs to do all sorts of fun things. The first is Kernel Ridge Regression (KRR), which combines kernels with ridge regression (a regularized, more stable form of linear regression). The second involves Gaussian Processes (GPs), which are magical things that exploit some nice properties of multivariate Gaussians, coupled with assumptions about the covariance structure that are again encoded in a kernel.
In [28]:
# Generate sample data
X=15*rand(100,1)
y=sin(X).ravel()+3*(0.5-rand(X.shape[0])) # sin w/ white noise
figure(figsize=(10,5))
plot(X,y,'ko')
plot(linspace(0,15,1000),sin(linspace(0,15,1000)),lw=2)
X_plot = linspace(0, 40, 1000)[:, None] # grid to evaluate predictions
As you can imagine, both of these methods now involve a decent number of hyperparameters that we either need to optimize or marginalize over. The simplest way to do this is to run a grid search over the hyperparameters, which we do below.
In [29]:
from sklearn.kernel_ridge import KernelRidge # kernel ridge regression
from sklearn.model_selection import GridSearchCV # grid search cross-validation (for hyper-parameters)
from sklearn.gaussian_process import GaussianProcessRegressor # GP
from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared # kernels
In [30]:
# find optimal hyperparameters for KRR using a gridsearch and fit data
param_grid={"alpha": [1e0,1e-1,1e-2,1e-3],
"kernel": [ExpSineSquared(l, p)
for l in logspace(-2,2,10)
for p in logspace(0,2,10)]}
kr=GridSearchCV(KernelRidge(),cv=5,param_grid=param_grid)
kr.fit(X,y)
y_kr=kr.predict(X_plot)
kr.best_params_
Out[30]:
In [31]:
# fit GP with an exp-sine-squared kernel + a white-noise kernel
gp_kernel=ExpSineSquared(1.0,5.0,periodicity_bounds=(1e-2, 1e1))+WhiteKernel(1e-1)
gpr=GaussianProcessRegressor(kernel=gp_kernel,n_restarts_optimizer=50)
gpr.fit(X,y)
y_gpr,y_std=gpr.predict(X_plot,return_std=True)
print(gpr.kernel_)
print('3*std(unif):',3*sqrt(1/12.))
In [32]:
# plot results
figure(figsize=(10,5))
plot(X,y,'ko',label='Data') # data
plot(X_plot,sin(X_plot),lw=2,color='black',label='True') # true underlying distribution
plot(X_plot,y_kr,color='red',lw=2,label='KRR')
plot(X_plot,y_gpr,color='blue',lw=2,label='GPR')
fill_between(X_plot[:,0],y_gpr-y_std,y_gpr+y_std,color='lightblue',alpha=0.3)
xlabel('Features')
ylabel('Target')
xlim([0,40])
ylim([-4,4])
legend(loc="best",scatterpoints=1,prop={'size': 8})
Out[32]:
The "bread and butter" of unsupervised learning is $k$-means clustering. This just involves partitioning the data into $k$ clusters, where we're trying to figure out where to put the cluster centers to partitions the data well. We can demonstrate this on the iris dataset.
In [33]:
from sklearn import cluster
k_means=cluster.KMeans(n_clusters=3)
k_means.fit(iris_X)
print(k_means.labels_[::10])
print(iris_y[::10])
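Of course, we had to tell $k$-means how many clusters to look for. One common (hedged) sanity check is the "elbow" method: look at the within-cluster sum of squares (the inertia_ attribute) as a function of $k$ and see where it stops improving quickly. The range of $k$ below is just illustrative.
In [ ]:
# within-cluster sum of squares as a function of the number of clusters
for k in range(1, 7):
    km = cluster.KMeans(n_clusters=k)
    km.fit(iris_X)
    print(k, km.inertia_)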
In [34]:
X=iris_X[:,:2]
y=iris_y
clf=k_means # classification learning function (clf)
clf.fit(X) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes
figure(figsize=(8,6))
# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)
# Plot also the training points
scatter(k_means.cluster_centers_[:,0],k_means.cluster_centers_[:,1],color='black',s=150,alpha=0.5) # cluster centers
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (k=%i)" % (k_means.n_clusters),fontsize=24,y=1.02)
Out[34]:
The final thing to cover is the simplest form of dimensionality reduction known as Principal Component Analysis (PCA). This involves finding linear combinations of the features that explain the maximum amount of variance. Often, this is the best way to remove linear-ish correlations among parameters, which can (1) help visualize things, (2) help reduce the effective data size (by limiting yourself to the maximally-informative PCA components), and (3) reduce correlations among components (which helps interpret feature importances).
In [35]:
from sklearn.decomposition import PCA
pca=PCA()
X_r=pca.fit(iris_X).transform(iris_X) # fit PCA and project the data (all components kept; we plot the top 2 below)
# Percentage of variance explained for each components
print('Explained Variance Ratio: %s' % str(pca.explained_variance_ratio_))
figure(figsize=(10,6))
colors=['#FF0000', '#00FF00', '#0000FF']
for color,i,target_name in zip(colors,[0,1,2],iris.target_names):
    scatter(X_r[y==i,0],X_r[y==i,1],color=color,lw=2,label=target_name)
xlabel('PCA Component 1')
ylabel('PCA Component 2')
legend(loc='best',scatterpoints=1,fontsize=14)
title('PCA of Iris dataset',y=1.02)
Out[35]:
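If we wanted to use PCA for compression rather than just visualization, a quick (hedged) way to decide how many components to keep is the cumulative explained variance; passing a float to n_components asks sklearn to keep just enough components to reach that fraction of the variance. The 95% threshold below is an arbitrary illustrative choice.
In [ ]:
# cumulative explained variance, and the number of components needed to reach ~95% of it
print(cumsum(pca.explained_variance_ratio_))
pca95 = PCA(n_components=0.95)
pca95.fit(iris_X)
print(pca95.n_components_)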
In [36]:
X=X_r[:,:2]
h=.02 # step size
x_min,x_max=X[:, 0].min()-1, X[:, 0].max()+1
y_min,y_max=X[:, 1].min()-1, X[:, 1].max()+1
xx,yy=meshgrid(arange(x_min,x_max,h),arange(y_min,y_max,h)) # (X,Y) mesh
clf=svc # classification learning function (clf)
clf.fit(X,y) # fit data
Z=clf.predict(c_[xx.ravel(), yy.ravel()]) # predicted classes
figure(figsize=(8,6))
# plot mesh
Z=Z.reshape(xx.shape)
pcolormesh(xx,yy,Z,cmap=cmap_light)
# Plot also the training points
scatter(X[:,0],X[:,1],c=y,cmap=cmap_bold)
xlim(xx.min(),xx.max())
xlabel('Feature 1')
ylim(yy.min(),yy.max())
ylabel('Feature 2')
title("Iris Classification (Kernel='%s')" % (svc.kernel),fontsize=24,y=1.02)
Out[36]: