In [ ]:
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline

Machine Learning, Basis Expansion, Kernels(1) and Kernels(2)

2015-February-13 Scott Hendrickson

An attempt to draw a conceptual line connecting some machine learning techniques we have been using and talking about recently, plus a short discussion about kernel choices for SVMs and finding non-linear relationships in data...

Logistic Regression and SVM classifiers are linear

These methods formulate an optimization problem that locates separating hyperplanes.

Logistic Regression

[Following DrJ from last week] Cost function:

$ C = - \left( y \: \text{log}(h_{\theta}(x)) + (1-y) \: \text{log}(1-h_{\theta}(x)) \right) $

The hypothesis function, $h$:

$ h_{\theta}(x) = \frac{1}{1 + e^{-z}}$, where $ \theta^T x = z $.

This function takes an arbitrary input vector $x$, forms a linear combination with the model vector $ \theta $, and produces an output on $[0,1]$.


In [ ]:
def sigmoid(z, z0=0, a=1):
    return 1./(1. + np.exp(-a*(z-z0)))

z = np.linspace(-8,8,50)
plt.plot(z,sigmoid(z), color="magenta")
plt.show()

The common cost function formulation for Logistic Regression in terms of z:

$ C = -y \: \text{log} \frac{1}{1+e^{-z}} - (1-y) \: \text{log} \left( 1-\frac{1}{1+e^{-z}} \right) $
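
A minimal sketch of this cost in code (assuming X holds one feature vector per row and y holds 0/1 targets):


In [ ]:
def logistic_cost(theta, X, y):
    z = X.dot(theta)                # z = theta^T x for each row of X
    h = 1. / (1. + np.exp(-z))      # the hypothesis h_theta(x)
    # average the per-example cost C over the training set
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))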

Five points to notice about the above formulation:

  1. Optimizing $ C $ determines our model parameters $ \theta $ from features $ x $ and targets $ y $.

  2. $z$ is a scalar: $ z = \theta^T x $.

  3. $ \theta \perp x $ implies $ z = 0 $, which means $ h_{\theta}(x) = \frac{1}{1 + e^{0}} = \frac{1}{2} $. This condition defines the direction of the separating hyperplane.

  4. The hyperplane passes through the origin.

  5. $ z = \theta^T x $ is linear in the features $x$.

Let's see it in action.


In [ ]:
def f(n=100, sigma=1, mu=1):
    # random normal vector
    return sigma * np.random.randn(n) + mu
    
plt.scatter(f(),f())

Separate 2 noisy groups


In [ ]:
# make some 2d data
x = f(200)
y = np.append(f(100, mu=3), f(100, mu=0))
t = np.append(np.ones(100), np.zeros(100))
# plot it
plt.scatter(x,y, s=30, c=t, cmap=plt.cm.Paired)
X = np.array([x,y]).transpose()

# meshgrid for shading and contours
xx, yy = np.meshgrid(np.linspace(-4, 4, 500),
                     np.linspace(-4, 6, 500))

# make a model
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
m.fit(X,t)
# get the decision boundaries
Z = m.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
           origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
                       linestyles='--')

# Prediction on the training set
p = m.predict(X)
#print(p)
print("The hyperplane predicts {:.2%} of the training data.".format(m.score(X,t)))

What about when a plane can't do the job?


In [ ]:
# meshgrid for contours and shading
xx, yy = np.meshgrid(np.linspace(-15, 15, 700),
                     np.linspace(-15, 15, 700))

# make some data
x = np.append(f(100), f(100, sigma=5))
y = np.append(f(100, mu=0), f(100, sigma=5))
t = np.append(np.ones(100), np.zeros(100))
plt.scatter(x,y, s=30, c=t, cmap=plt.cm.Paired)
X = np.array([x,y]).transpose()
# model
m = LogisticRegression()
m.fit(X,t)

Z = m.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
           origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
                       linestyles='--')
# Prediction score
p = m.predict(X)
#print(p)
print("The hyperplane predicts {:.2%} of the training data.".format(m.score(X,t)))

Maybe relax (3) and (4) a bit...

We can make this learning method a little more flexible by relaxing (3) and (4). As long as we don't change conditions (1), (2), and (5), the Logistic Regression strategy still works (why?).

Allow an arbitrary translation from the origin by a constant, and determine the constant as part of the model:

$ z = \theta^T x $

$ \theta = \left( \begin{array}{c} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \\ \theta_{\text{translate}} \end{array} \right)$

$ x = \left( \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \\ 1 \end{array} \right)$

$ z = \theta^T x = \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n + \theta_{\text{translate}} $

Or factoring out $\theta_1$,

$ z = \theta^T x = \theta_1 (x_1 + \theta_2^\prime x_2 + \ldots + \theta_n^\prime x_n + \theta_{\text{translate}}^\prime) $

where $ \theta_2^\prime = \frac{\theta_2}{\theta_1} $ etc.

Now, the scaling of the transition region, the hyperplane location and the hyperplane direction are determined by the model.


In [ ]:
# plot it in 1 dim
plt.plot(z,sigmoid(z,2,3), color="cyan", label="compress, shift")
plt.plot(z,sigmoid(z,-2), color="yellow", label="shift")
plt.plot(z,sigmoid(z,0,0.5), color="green", label="expand")
plt.legend(loc=2)

A second perspective...

The strategy above for relaxing (4) can be seen as finding $ z(x) $ by embedding the original feature vector, $x$, in a space with 1 additional dimension. If the original feature vector has dimension $n$, then the new feature space has dimension $n+1$.

We might instead have formulated the problem as a translation $x \rightarrow x - x^0$ and an expansion $ x - x^0 \rightarrow \alpha (x - x^0) $, giving the slightly less general formulation,

$ z = \theta^T \alpha (x - x^0) = \theta_1 \alpha_1 (x_1 - x^0_1) + \theta_2 \alpha_2 (x_2 - x^0_2) + \ldots + \theta_n \alpha_n (x_n - x^0_n) $

Because we are going to let the optimizer find the model parameters $\theta$, we usually absorb the $\alpha$s into the $\theta$s. In this limited formulation, $ z(x) $ is only a function of the distance to the hyperplane.
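
A minimal sketch of that $n+1$ embedding on the data above: append a constant column of ones and let the optimizer find $\theta_{\text{translate}}$ itself (this leans on scikit-learn's fit_intercept flag, which otherwise adds the same constant feature for us behind the scenes):


In [ ]:
# append a constant 1 to each feature vector so the model learns its own
# translation term; turn off the intercept LogisticRegression would add
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
m_aug = LogisticRegression(fit_intercept=False)
m_aug.fit(X_aug, t)
print(m_aug.coef_)   # the last coefficient plays the role of theta_translate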

This discussion suggests that one strategy for expanding the capabilities of our classifier is to think up favorable coordinate transformations. When we imagine a useful coordinate transformation, we calculate $z$,

$ z = \theta^T \mathbf{M} x = \theta^T x^\prime $.

I propose to think up embedding/coordinate transformations that turn a problem that is essentially a non-linear relationship in the original feature space into a linear relationship in some transformed feature space.

Then our Logistic Regression will work to build models of non-linear relationships. Hold that thought.

Might inspire some creative geometric thinking, but...

The question is "In what coordinate system is this essentially non-linear relationship a linear relationship?" Thinking up the right coordinate transformations from this geometric point of view might be practically impossible!

How about relaxing (5) directly?

The "absorb the $\alpha$s" comment above might push us to different interesting idea. What if we could find a useful representation of $z$ that:

  1. Is non-linear in input features $x$
  2. Lets the optimizer sort out the relative strengths and types of the non-linearity at learning time

From ESL,

The core idea of [basis expansion] is to augment/replace the vectors of inputs X with additional variables, which are transformations of X, and then use linear models in the new space of derived input features.  - ESL pg 139

This functional approach changes our strategy toward finding $z$ as a functional expansion in $x$.

Basis Expansion

Maybe we can try a polynomial expansion of $z$? Polynomials are easy to write down, and we are pretty confident they can represent any form of $z$ we need.

$ z(x) = \sum^\infty_{m=1}\beta_m x^m $

Done. We can now represent a huge range of non-linear relationships between $z$ and $x$ in a way that allows the Logistic Regression method to work by finding a hyperplane. However, this presents some challenges for practical machine learning: we have to determine an infinite number of $\beta$s in training!

We need a finite expansion.

One idea is to assume the higher-order terms are less important and truncate the series after a couple of terms, absorbing the constants into the $\theta$s.

Try:

$ z(x) = \theta_1 x^2 + \theta_2 x + \theta_3 $


In [ ]:
# meshgrid for contours and shading
xx, yy = np.meshgrid(np.linspace(-15, 15, 500),
                     np.linspace(-15, 15, 500))

plt.scatter(x,y, s=30, c=t, cmap=plt.cm.Paired)
# quadratic terms, including cross terms between the dimensions x and y
x1 = x*x
y1 = y*y
xy = x*y
ext_X = np.array([x, y, x1, y1, xy]).transpose()
ext_m = LogisticRegression()
ext_m.fit(ext_X,t)

gx, gy = xx.ravel(), yy.ravel()

Z = ext_m.decision_function(np.array([gx, gy, gx*gx, gy*gy, gx*gy]).transpose())
Z = Z.reshape(xx.shape)
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
           origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
                       linestyles='--')

p = ext_m.predict(ext_X)
#print(p)
print(ext_m.score(ext_X,t))

Seems promising: we just found a boundary defined by a conic section. Maybe this is enough. If not, can we generalize the idea?
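
(As an aside, those quadratic columns don't have to be built by hand; here is a sketch of the same expansion using scikit-learn's PolynomialFeatures, an extra tool rather than part of the walk-through above.)


In [ ]:
# let sklearn generate the quadratic columns: x, y, x^2, x*y, y^2
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
ext_X2 = poly.fit_transform(X)
ext_m2 = LogisticRegression().fit(ext_X2, t)
print(ext_m2.score(ext_X2, t))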

Yes, bring on the basis functions.

Rewrite $z$,

$ z(x) = \sum^M_{m=1}\beta_m h_m(x) $

Before we apply this to machine learning, let's see how the function expansion might look.

Let's choose a localized basis function with parameters that define its location and extent. One important example is the radial basis function (RBF).


In [ ]:
# radial basis function
x = np.linspace(-3,9,60)
def b1(x, pars):
    # pars[0] is the center, pars[1] is the width of the basis function
    return np.exp(-(x-pars[0])*(x-pars[0])/(pars[1]*pars[1]))

plt.plot(x, b1(x, (0,1)), color="green")

In [ ]:
# target function
def target(x):
    return 0.1*np.power(x,3) - np.power(x,2) + 2*x + 5

# point-by-point squared difference between the target and a scaled basis function
def c(x, pars):
    return np.power(target(x) - pars[2]*b1(x, pars[:2]), 2)

# the cost function is the sum of the squared differences
def cost(pars):
    return np.sum(c(x, pars))

plt.plot(x,target(x)        , color="red")
#plt.plot(x,  10*b1(x, (1,1)), color="green")

Best fit if we use only one basis function.


In [ ]:
from scipy.optimize import minimize
a = minimize(cost,[1,1,3])
print(a.x)
plt.plot(x,target(x), color="red")
plt.plot(x, a.x[2]*b1(x, a.x), color="green")

Best fit if we use more than one basis function.


In [ ]:
# a fun python functional thing will be useful in a minute
from functools import partial
basetwo = partial(int, base=2)
basetwo.__doc__ = 'Convert base 2 string to an int.'
basetwo('10010')

To see how additional basis functions improve the fit, change the value of n in the line gx = partial(g, n=4) below.


In [ ]:
def g(x, pars, n=1):
    sum = 0.0
    for i in range(n):
        sum += pars[i*3 + 2]*b1(x, pars[i*3:i*3+2])
    return sum

from functools import partial
plt.plot(x,  target(x), color="red")

gx = partial(g, n=4)

# redefine the point-by-point cost so it compares the target with the
# n-term expansion gx; cost() defined above picks up this new c at call time
def c(x, pars):
    return np.power(target(x) - gx(x, pars), 2)


a = minimize(cost,[ 1,1,3,
                   -1,1,1,
                   -1,2,1,
                    1,2,1])
print(a.x)

plt.scatter(x, gx(x, a.x), color="green")

OK, n=4 gives a nice smooth fit. Now we are ready to try a machine learning problem in 2d.


In [ ]:
def b2(xx, pars):
    # xx is a 2-d array of points; pars = (center_x, center_y, width)
    return np.array([np.exp(-(np.power(x[0]-pars[0],2) +
                    np.power(x[1]-pars[1],2))
                        /(pars[2]*pars[2])) for x in xx])

In [ ]:
# meshgrid
xx, yy = np.meshgrid(np.linspace(-6, 6, 500),
                     np.linspace(-6, 6, 500))
# some data
np.random.seed(0)
X = np.random.randn(300, 2)
t = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

plt.scatter(X[:, 0], X[:, 1], s=30, c=t, cmap=plt.cm.Paired)

m = LogisticRegression()
m.fit(X,t)

Z = m.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
           origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
                       linestyles='--')

plt.show()

p = m.predict(X)
#print(p)
print("The hyperplane predicts {:.2%} of the training data.".format(m.score(X,t)))
#

XOR requires a non-linear model

Let's try using the RBF.

Two problems, one fundamental, one practical:

  1. How many basis functions do I need?
  2. We have non-multiplicative parameters (the basis function locations and widths do not enter linearly)

The first is much like our problem with the infinite series. Remember, we will always use regularization in practical applications, so having a few too many basis functions might still work fine, while having too few will underperform. A parameter search across the number of basis functions might prove useful; one way to set that up is sketched below.
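
Here is a sketch of that search, assuming a recent scikit-learn (Nystroem builds RBF features from sampled landmark points, which is close in spirit to, though not identical to, the hand-placed basis functions used below):


In [ ]:
# grid search over the number of RBF basis functions (landmarks)
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("rbf", Nystroem(kernel="rbf", gamma=1.0)),
                 ("lr", LogisticRegression())])
search = GridSearchCV(pipe, {"rbf__n_components": [2, 4, 8, 16]}, cv=5)
search.fit(X, t)
print(search.best_params_, search.best_score_)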

The second problem is a practical one with common libraries. Nothing in theory prevents us from minimizing the cost function over the parameters of the basis functions as well, but additional care for stability, convergence, etc. would be required.

Propose a set of basis functions located in each of the quadrants.

Train on the XOR data using our basis functions.


In [ ]:
# Make some basis functions
v11 = partial(b2, pars=(1,1,1))
v_11 = partial(b2, pars=(-1,1,1))
v1_1 = partial(b2, pars=(1,-1,1))
v_1_1 = partial(b2, pars=(-1,-1,1))
v22 = partial(b2, pars=(2,2,1))
v_22 = partial(b2, pars=(-2,2,1))
v2_2 = partial(b2, pars=(2,-2,1))
v_2_2 = partial(b2, pars=(-2,-2,1))

In [ ]:
xx, yy = np.meshgrid(np.linspace(-6, 6, 500),
                     np.linspace(-6, 6, 500))
#
plt.scatter(X[:, 0], X[:, 1], s=30, c=t, cmap=plt.cm.Paired)

ext_X = np.array([v11(X), v_11(X), v1_1(X), v_1_1(X),
                  v22(X), v_22(X), v2_2(X), v_2_2(X)]).transpose()
ext_m = LogisticRegression()
ext_m.fit(ext_X,t)

gr = np.c_[xx.ravel(), yy.ravel()]

ext_Z = ext_m.decision_function(np.array([v11(gr)
                                , v_11(gr)
                                , v1_1(gr)
                                , v_1_1(gr)
                                , v22(gr)
                                , v_22(gr)
                                , v2_2(gr)
                                , v_2_2(gr)]).transpose())
ext_Z = ext_Z.reshape(xx.shape)
plt.imshow(ext_Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
           origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, ext_Z, levels=[0], linewidths=2,
                       linestyles='--')

p = ext_m.predict(ext_X)
#print(p)
print("The hyperplane predicts {:.2%} of the training data.".format(ext_m.score(ext_X,t)))

Given this particular set of basis functions, we can also reach into the estimator above and find out what the relative weights were in fitting the model to this data:


In [ ]:
# nb: must be same order as the X array that went into the model fit (ext_X)
feature_names = ['v11', 'v_11', 'v1_1', 'v_1_1', 'v22', 'v_22', 'v2_2', 'v_2_2']

x = np.arange(len(feature_names))
plt.bar(x, ext_m.coef_.ravel() )
_ = plt.xticks(x + 0.5, feature_names, rotation=30)
_ = plt.ylabel("model coefficients")

Kernel Smoothing

Turn here for a language-driven diversion.

Kernel smoothing is a class of regression techniques that achieve flexibility in estimating the regression function $f(x)$ by fitting a different but simple model separately at each query point $x_0$. This is done by using only the observations close to the target point $x_0$ to fit the simple model, and in such a way that the resulting estimated function $f(x)$ is smooth. - ESL, Ch. 6


In [ ]:
%run kernel.py
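
Since kernel.py isn't reproduced in this notebook, here is a minimal sketch of the idea it is meant to illustrate: a Nadaraya-Watson kernel-weighted average with a Gaussian kernel (illustrative only, and not necessarily what kernel.py actually does):


In [ ]:
# at each query point x0, average the observations, weighted by a
# Gaussian kernel of the distance from x0
xs = np.linspace(0, 10, 100)
ys = np.sin(xs) + 0.3 * np.random.randn(len(xs))

def kernel_smooth(x0, xs, ys, bandwidth=0.5):
    w = np.exp(-0.5 * ((xs - x0) / bandwidth) ** 2)
    return np.sum(w * ys) / np.sum(w)

plt.scatter(xs, ys, s=10, color="gray")
plt.plot(xs, [kernel_smooth(x0, xs, ys) for x0 in xs], color="red")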

Kernel Methods

Above, we looked at coordinate transformations as a way to find a space where the problem becomes linear. For Logistic Regression, we noted a simple way to formulate it in terms of only distances to the separating hyperplane. That strategy didn't look very promising because guessing the right coordinate transformation seemed impossible.

If we had a machine learning technique that is more naturally compatible with a formulation of the cost function based only on inner products, we could use a different trick to solve the problem.

Remember last week when Josh talked about Support Vector Machines? The cost function is,

$ \text{min}_{\theta} C \sum \limits_{i=1}^{m} \lbrack y_i \text{cost}_1(\theta^T f) + (1-y_i) \text{cost}_0(\theta^T f) \rbrack + \frac{1}{2} \sum \limits_{j=1}^{n} \theta_j^2 $

with $f$ defined in terms of the landmark points $l^i$ by,

$ f = \text{exp} \left( - \frac{\Vert x - l^i \Vert ^2}{2 \sigma^2} \right) $

This function depends on the square of the distance between feature vectors,

$ \Vert x - x^\prime \Vert^2 $

which can be written entirely in terms of inner products of the feature vectors,

$ \Vert x - x^\prime \Vert^2 = x \cdot x - 2 \, x \cdot x^\prime + x^\prime \cdot x^\prime $
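
A quick numerical check of that expansion (a sketch, separate from the SVM itself):


In [ ]:
# verify ||u - w||^2 = u.u - 2 u.w + w.w for a random pair of vectors
u = np.random.randn(5)
w = np.random.randn(5)
print(np.sum((u - w)**2), u.dot(u) - 2*u.dot(w) + w.dot(w))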

What's the trick?

  1. Data items are embedded into a vector space called the feature space.
  2. Linear relations are sought among the images of the data items in the feature space.
  3. The algorithms (e.g. SVM) are implemented in such a way that the coordinates of the embedded points are not needed, only the pairwise inner products.
  4. The pairwise inner products can be computed efficiently directly from the original data items using a kernel function.

A kernel function is a function $\kappa$ that for all $x, z \in X$ satisfies,

$ \kappa(x,z) = \langle \phi(x), \phi(z) \rangle $

where $\phi$ is a mapping from $X$ to an inner product feature space $F$,

$\phi:x \rightarrow \phi(x) \in F$

Since the inner product is expressed in the transformed space by the kernel function, we might make some headway by looking at the generic properties of the kernel functions and giving up on actually finding the transformation, $\phi(x)$.
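
To make the definition concrete, here is a small illustrative example where the mapping $\phi$ is actually known (a standard textbook case, not something used elsewhere in this notebook): for 2-d inputs, the kernel $\kappa(x,z) = (x \cdot z)^2$ corresponds to $\phi(x) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2)$.


In [ ]:
# check that kappa(u, w) = (u . w)^2 equals <phi(u), phi(w)>
def phi(v):
    return np.array([v[0]**2, v[1]**2, np.sqrt(2)*v[0]*v[1]])

u = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])
print((u.dot(w))**2, phi(u).dot(phi(w)))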

Inner product spaces with the additional properties of being separable and complete are Hilbert spaces. When $\kappa$ satisfies the finitely positive semi-definite property, the associated feature space is a reproducing kernel Hilbert space (RKHS).

It can be shown that once we have kernels $\kappa_1$ and $\kappa_2$ we can construct new kernels by:

  1. $\kappa(x,z) = \kappa_1(x,z) + \kappa_2(x,z)$
  2. $\kappa(x,z) = \alpha \kappa_1(x,z)$, for $\alpha > 0$
  3. $\kappa(x,z) = \kappa_1(x,z) \kappa_2(x,z)$
  4. $\kappa(x,z) = f(x) f(z)$, for any real-valued function $f$
  5. $\kappa(x,z) = \kappa_3(\phi(x),\phi(z))$, where $\kappa_3$ is a kernel over the range of $\phi$
  6. $\kappa(x,z) = x^T B z$, where $B$ is a symmetric positive semi-definite matrix

Now we just put our efforts into kernel construction.
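
A quick numerical sanity check of rules (1) and (3) (a sketch; the real content is the closure theorem, not the code): build two valid kernel (Gram) matrices and confirm that their sum and their elementwise product remain positive semi-definite.


In [ ]:
pts = np.random.randn(20, 2)
sq_d = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
K1 = np.exp(-sq_d)                 # RBF kernel Gram matrix
K2 = (pts.dot(pts.T) + 1.0)**2     # degree-2 polynomial kernel Gram matrix
for K in (K1 + K2, K1 * K2):       # rules (1) and (3)
    print(np.linalg.eigvalsh(K).min() >= -1e-8)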

RBF and SVM

It turns out that a useful kernel is the RBF (it is the default kernel for scikit-learn's SVC and NuSVC used below).


In [ ]:
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-3, 3, 500),
                     np.linspace(-3, 3, 500))
np.random.seed(0)
X = np.random.randn(300, 2)
Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

# fit the model
clf = svm.NuSVC()
clf.fit(X, Y)

# plot the decision function for each datapoint on the grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
           origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
                       linestyles='--')
plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired)
plt.show()
p = clf.predict(X)
#print(p)
print("The hyperplane predicts {:.2%} of the training data.".format(clf.score(X,t)))

Last note: ESL points out that it is possible to write the solution based on an expansion in RBF basis functions and the solution based on the kernel smoothing properties sketched above in a way that shows them to be isomorphic. So the kernel(1) and kernel(2) concepts are not as disconnected as they might at first appear. (See Chapters 5 & 6.)