In [1]:
!date
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')


Sun Feb  8 15:02:57 PST 2015

In [2]:
# set random seed, for reproducibility
np.random.seed(12345)

Simulate data that is easy to plot

This might not be the best example, but at least we can look at it.


In [3]:
x_true = np.linspace(0,15,1000)
y_true = np.cos(x_true)

sigma_true = .3  # scale of the observation noise
x_train = np.random.choice(x_true, size=100)  # sample 100 x values (with replacement)
y_train = np.random.laplace(np.cos(x_train), sigma_true)  # Laplace noise around cos(x), not Gaussian

In [4]:
# remember what this looks like?
plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.legend()


Out[4]:
<matplotlib.legend.Legend at 0x2b63908e01d0>

How do you think Linear Regression will perform in this setting?


In [5]:
import sklearn.linear_model

In [6]:
clf = sklearn.linear_model.LinearRegression()

In [7]:
X_train = x_train[:,None]
clf.fit(X_train, y_train)


Out[7]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [8]:
X_true = x_true[:,None]
y_pred = clf.predict(X_true)

In [9]:
plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[9]:
<matplotlib.legend.Legend at 0x2b6393b97410>

Not so good, huh?

At this point in your life, do you have a number of tricks for improving this?


In [10]:
import sklearn.preprocessing

In [11]:
poly = sklearn.preprocessing.PolynomialFeatures(degree=2)

In [12]:
poly.fit_transform(X_train)


Out[12]:
array([[  1.00000000e+00,   7.23723724e+00,   5.23776028e+01],
       [  1.00000000e+00,   7.28228228e+00,   5.30316352e+01],
       [  1.00000000e+00,   4.27927928e+00,   1.83122312e+01],
       [  1.00000000e+00,   1.93693694e+00,   3.75172470e+00],
       [  1.00000000e+00,   6.30630631e+00,   3.97694992e+01],
       [  1.00000000e+00,   6.38138138e+00,   4.07220283e+01],
       [  1.00000000e+00,   5.73573574e+00,   3.28986644e+01],
       [  1.00000000e+00,   5.36036036e+00,   2.87334632e+01],
       [  1.00000000e+00,   8.19819820e+00,   6.72104537e+01],
       [  1.00000000e+00,   8.12312312e+00,   6.59851293e+01],
       [  1.00000000e+00,   1.15465465e+01,   1.33322737e+02],
       [  1.00000000e+00,   1.77177177e+00,   3.13917521e+00],
       [  1.00000000e+00,   5.54054054e+00,   3.06975895e+01],
       [  1.00000000e+00,   4.72972973e+00,   2.23703433e+01],
       [  1.00000000e+00,   9.81981982e+00,   9.64288613e+01],
       [  1.00000000e+00,   1.57657658e+00,   2.48559370e+00],
       [  1.00000000e+00,   1.36636637e+00,   1.86695705e+00],
       [  1.00000000e+00,   1.13963964e+01,   1.29877851e+02],
       [  1.00000000e+00,   5.61561562e+00,   3.15351387e+01],
       [  1.00000000e+00,   3.12312312e+00,   9.75389804e+00],
       [  1.00000000e+00,   1.32432432e+01,   1.75383492e+02],
       [  1.00000000e+00,   1.07057057e+01,   1.14612135e+02],
       [  1.00000000e+00,   4.00900901e+00,   1.60721532e+01],
       [  1.00000000e+00,   1.15615616e+00,   1.33669706e+00],
       [  1.00000000e+00,   9.75975976e+00,   9.52529106e+01],
       [  1.00000000e+00,   1.21621622e+00,   1.47918188e+00],
       [  1.00000000e+00,   1.08408408e+01,   1.17523830e+02],
       [  1.00000000e+00,   2.49249249e+00,   6.21251883e+00],
       [  1.00000000e+00,   3.94894895e+00,   1.55941978e+01],
       [  1.00000000e+00,   6.45645646e-01,   4.16858300e-01],
       [  1.00000000e+00,   3.46846847e+00,   1.20302735e+01],
       [  1.00000000e+00,   3.45345345e-01,   1.19263408e-01],
       [  1.00000000e+00,   4.35435435e-01,   1.89604018e-01],
       [  1.00000000e+00,   1.42942943e+01,   2.04326849e+02],
       [  1.00000000e+00,   6.59159159e+00,   4.34490797e+01],
       [  1.00000000e+00,   4.30930931e+00,   1.85701467e+01],
       [  1.00000000e+00,   1.05255255e+01,   1.10786688e+02],
       [  1.00000000e+00,   5.30030030e+00,   2.80931833e+01],
       [  1.00000000e+00,   1.28978979e+01,   1.66355770e+02],
       [  1.00000000e+00,   1.28378378e+01,   1.64810080e+02],
       [  1.00000000e+00,   5.40540541e-01,   2.92184076e-01],
       [  1.00000000e+00,   8.64864865e+00,   7.47991234e+01],
       [  1.00000000e+00,   5.81081081e+00,   3.37655223e+01],
       [  1.00000000e+00,   1.14714715e+01,   1.31594658e+02],
       [  1.00000000e+00,   1.86186186e+00,   3.46652959e+00],
       [  1.00000000e+00,   9.68468468e+00,   9.37931174e+01],
       [  1.00000000e+00,   2.14714715e+00,   4.61024087e+00],
       [  1.00000000e+00,   9.47447447e+00,   8.97656666e+01],
       [  1.00000000e+00,   1.26576577e+01,   1.60216297e+02],
       [  1.00000000e+00,   1.36636637e+01,   1.86695705e+02],
       [  1.00000000e+00,   5.57057057e+00,   3.10312565e+01],
       [  1.00000000e+00,   5.06006006e+00,   2.56042078e+01],
       [  1.00000000e+00,   1.60660661e+00,   2.58118479e+00],
       [  1.00000000e+00,   5.84084084e+00,   3.41154217e+01],
       [  1.00000000e+00,   5.10510511e-01,   2.60620981e-01],
       [  1.00000000e+00,   4.63963964e+00,   2.15262560e+01],
       [  1.00000000e+00,   5.57057057e+00,   3.10312565e+01],
       [  1.00000000e+00,   1.08108108e+00,   1.16873630e+00],
       [  1.00000000e+00,   7.50750751e-02,   5.63626690e-03],
       [  1.00000000e+00,   3.39339339e+00,   1.15151187e+01],
       [  1.00000000e+00,   1.62162162e+00,   2.62965668e+00],
       [  1.00000000e+00,   3.91891892e+00,   1.53579255e+01],
       [  1.00000000e+00,   8.85885886e-01,   7.84793803e-01],
       [  1.00000000e+00,   1.24624625e+00,   1.55312971e+00],
       [  1.00000000e+00,   7.86786787e+00,   6.19033448e+01],
       [  1.00000000e+00,   5.97597598e+00,   3.57122889e+01],
       [  1.00000000e+00,   1.14564565e+01,   1.31250395e+02],
       [  1.00000000e+00,   8.88888889e+00,   7.90123457e+01],
       [  1.00000000e+00,   2.32732733e+00,   5.41645249e+00],
       [  1.00000000e+00,   4.41441441e+00,   1.94870546e+01],
       [  1.00000000e+00,   5.16516517e+00,   2.66789312e+01],
       [  1.00000000e+00,   1.48948949e+01,   2.21857894e+02],
       [  1.00000000e+00,   1.12162162e+01,   1.25803506e+02],
       [  1.00000000e+00,   7.50750751e-02,   5.63626690e-03],
       [  1.00000000e+00,   8.79879880e+00,   7.74188603e+01],
       [  1.00000000e+00,   1.14114114e+00,   1.30220310e+00],
       [  1.00000000e+00,   1.18618619e+01,   1.40703767e+02],
       [  1.00000000e+00,   7.86786787e+00,   6.19033448e+01],
       [  1.00000000e+00,   1.14114114e+01,   1.30220310e+02],
       [  1.00000000e+00,   8.61861862e+00,   7.42805869e+01],
       [  1.00000000e+00,   4.45945946e+00,   1.98867787e+01],
       [  1.00000000e+00,   2.19219219e+00,   4.80570661e+00],
       [  1.00000000e+00,   1.03603604e+01,   1.07337067e+02],
       [  1.00000000e+00,   8.91891892e+00,   7.95471147e+01],
       [  1.00000000e+00,   2.77777778e+00,   7.71604938e+00],
       [  1.00000000e+00,   1.22222222e+01,   1.49382716e+02],
       [  1.00000000e+00,   8.70870871e-01,   7.58416074e-01],
       [  1.00000000e+00,   4.18918919e+00,   1.75493061e+01],
       [  1.00000000e+00,   6.08108108e+00,   3.69795471e+01],
       [  1.00000000e+00,   1.16366366e+01,   1.35411312e+02],
       [  1.00000000e+00,   1.27477477e+01,   1.62505073e+02],
       [  1.00000000e+00,   9.45945946e-01,   8.94813733e-01],
       [  1.00000000e+00,   1.33783784e+01,   1.78981008e+02],
       [  1.00000000e+00,   7.68768769e+00,   5.91005420e+01],
       [  1.00000000e+00,   1.09609610e+01,   1.20142665e+02],
       [  1.00000000e+00,   1.02702703e+01,   1.05478451e+02],
       [  1.00000000e+00,   6.62162162e+00,   4.38458729e+01],
       [  1.00000000e+00,   5.97597598e+00,   3.57122889e+01],
       [  1.00000000e+00,   6.15615616e+00,   3.78982586e+01],
       [  1.00000000e+00,   1.02552553e+01,   1.05170260e+02]])

In [13]:
clf = sklearn.linear_model.LinearRegression()
clf.fit(poly.transform(X_train), y_train)


Out[13]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [14]:
X_true = x_true[:,None]
y_pred = clf.predict(poly.transform(X_true))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[14]:
<matplotlib.legend.Legend at 0x2b6393c8b250>

Adding squared terms doesn't do much, but how about a higher-degree polynomial?


In [15]:
poly = sklearn.preprocessing.PolynomialFeatures(degree=10)

clf = sklearn.linear_model.LinearRegression()
clf.fit(poly.fit_transform(X_train), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(poly.transform(X_true))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[15]:
<matplotlib.legend.Legend at 0x2b6393d6f5d0>

Not so good if you have to extrapolate, though...


In [16]:
x_true = np.linspace(-2,17,1000)
y_true = np.cos(x_true)

X_true = x_true[:,None]
y_pred = clf.predict(poly.transform(X_true))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[16]:
<matplotlib.legend.Legend at 0x2b63a0011690>

How about if you know that this is some sort of periodic function?


In [17]:
def my_transform(X, phases=[0,.5], freqs=[1., 2.]):
    """ Return sine waves of a number of different phases and frequencies"""
    n,p = X.shape
    assert p == 1, "designed for univariate feature space only"
    X = X.reshape(n)
    
    X_new = np.empty((n,len(phases)*len(freqs)))
    
    col = 0
    for a in phases:
        for b in freqs:
            X_new[:,col] = np.cos(a + b*X)
            col += 1
            
    assert np.all(np.isfinite(X_new))
    return X_new

If you really know the phase and frequency, you can get it:


In [18]:
phases=[0, np.pi]
freqs=[1, 2]

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[18]:
<matplotlib.legend.Legend at 0x2b63a00e8c90>

But suppose that you do not have it so narrowed down:


In [19]:
phases=[0,1,2]
freqs=np.linspace(0,6,7)

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[19]:
<matplotlib.legend.Legend at 0x2b63a01c7150>

And there is that whole small-$n$, large-$p$ thing you need to worry about:


In [20]:
phases=np.linspace(0,6,10)
freqs=np.linspace(0,6,10)

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[20]:
<matplotlib.legend.Legend at 0x2b63a029b550>

This can break down completely if $n$ is too small and $p$ is too large.


In [21]:
phases=np.linspace(0,6,25)
freqs=np.linspace(0,6,25)

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[21]:
<matplotlib.legend.Legend at 0x2b63a0377b10>

Enter penalty functions, e.g. sklearn.linear_model.Ridge:
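
For orientation (this is sklearn's documented objective, not something derived in this notebook), Ridge adds a squared-$\ell_2$ penalty to the least-squares fit, roughly

$$\min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2,$$

so a larger alpha shrinks all of the coefficients toward zero, though rarely exactly to zero.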


In [22]:
phases=np.linspace(0,6,25)
freqs=np.linspace(0,6,25)

clf = sklearn.linear_model.Ridge(alpha=1)
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[22]:
<matplotlib.legend.Legend at 0x2b63a0452350>

Or for this sort of application, sklearn.linear_model.Lasso:
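
The Lasso swaps in an $\ell_1$ penalty; following sklearn's documentation, the objective is roughly

$$\min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1,$$

and the $\ell_1$ penalty tends to drive many coefficients exactly to zero, which is just what we want when only a few of the sinusoid columns should matter.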


In [23]:
phases=np.linspace(0,6,25)
freqs=np.linspace(0,6,25)

clf = sklearn.linear_model.Lasso(alpha=.05)
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[23]:
<matplotlib.legend.Legend at 0x2b63a052fad0>

Amazing, in my humble opinion!


In [24]:
phases=np.linspace(0,6,100)
freqs=np.linspace(0,6,100)

clf = sklearn.linear_model.Lasso(alpha=.05)
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


Out[24]:
<matplotlib.legend.Legend at 0x2b63a0601e50>

In [25]:
# One thing that is so great about this?  You can understand the model:
clf.coef_.shape


Out[25]:
(10000,)

In [26]:
clf.coef_


Out[26]:
array([ 0.,  0.,  0., ..., -0., -0., -0.])

In [27]:
np.where(clf.coef_>.01)


Out[27]:
(array([316]),)

In [28]:
print phases[2:5]   # the one surviving coefficient is column 316 = 3*100 + 16, i.e. phases[3]...
print freqs[15:18]  # ...and freqs[16], close to the true signal: phase 0, frequency 1


[ 0.12121212  0.18181818  0.24242424]
[ 0.90909091  0.96969697  1.03030303]

One more penalty function, because I love the name: Elastic Nets
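
The Elastic Net mixes the two penalties; in sklearn's parameterization (as documented, roughly) the objective is

$$\min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2,$$

where $\rho$ is the l1_ratio parameter used below.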


In [29]:
phases=np.linspace(0,6,100)
freqs=np.linspace(0,6,100)

clf = sklearn.linear_model.ElasticNet(alpha=.1, l1_ratio=.25)
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()


/homes/abie/anaconda/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py:490: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations
  ConvergenceWarning)
Out[29]:
<matplotlib.legend.Legend at 0x2b63a0802690>

In [30]:
np.where(clf.coef_>.01)


Out[30]:
(array([  16,   17,  116,  117,  216,  217,  316,  317,  416,  516,  616,
        9717, 9817, 9917]),)

How should you find appropriate values of alpha (and l1_ratio, in the Elastic Net case)? Cross-validation, of course. There are some special tricks for making this fast, and sklearn has implementations you can use if this is relevant to your projects; a sketch follows below.
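
Here is a minimal sketch of what that could look like (not code that was run in this notebook): LassoCV fits a whole path of alpha values efficiently and picks one by K-fold cross-validation, and ElasticNetCV does the same over alpha and l1_ratio.

phases = np.linspace(0, 6, 100)
freqs = np.linspace(0, 6, 100)

# LassoCV fits the full regularization path and selects alpha by cross-validation
cv_clf = sklearn.linear_model.LassoCV(cv=5)
cv_clf.fit(my_transform(X_train, phases, freqs), y_train)
print cv_clf.alpha_  # the alpha chosen by 5-fold cross-validation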

Perhaps this all feels a little irrelevant

And it is a bit of a stylized example. In its defense, it looks pretty. Also, I think I will be able to interest the astronomers in it. But let us switch to a classification version of this:


In [31]:
# set random seed for reproducibility
np.random.seed(12345)

x_true = np.linspace(0,15,1000)
y_true_numeric = np.cos(x_true)
y_true = (np.random.uniform(low=-1, high=1, size=1000) <= y_true_numeric)

train = np.random.choice(range(1000), size=100)
x_train = x_true[train]
y_train = y_true[train]

Not as pretty...


In [32]:
plt.plot(x_true, y_true_numeric, '-', label='Propensity')
plt.plot(x_train, np.random.normal(0,.01,size=100)+y_train, 's', label='Jittered Train')
plt.legend()


Out[32]:
<matplotlib.legend.Legend at 0x2b63a08cdd10>

That would look better if the class labels were $\pm1$:


In [33]:
y_true = 2*y_true - 1
y_train = y_true[train]

In [34]:
plt.plot(x_true, y_true_numeric, '-', label='Propensity')
plt.plot(x_train, np.random.normal(0,.01,size=100)+y_train, 's', label='Jittered Train')
plt.legend()


Out[34]:
<matplotlib.legend.Legend at 0x2b63a099f910>

In [35]:
clf = sklearn.linear_model.LinearRegression()
X_train = x_train[:,None]
clf.fit(X_train, y_train)


Out[35]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [36]:
y_pred = (clf.predict(X_train) >= 0)

In [37]:
np.mean(y_pred == y_train)


Out[37]:
0.050000000000000003

There must be something wrong with that, right???


In [38]:
y_pred


Out[38]:
array([False,  True, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True,  True, False,
        True,  True,  True, False, False, False, False,  True, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False, False], dtype=bool)

In [39]:
y_train


Out[39]:
array([-1,  1,  1, -1, -1, -1,  1, -1,  1,  1, -1, -1,  1,  1, -1,  1, -1,
       -1,  1, -1, -1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
       -1, -1,  1, -1,  1, -1, -1, -1, -1, -1, -1, -1,  1,  1, -1, -1, -1,
        1, -1,  1,  1, -1,  1, -1, -1, -1,  1,  1, -1, -1, -1, -1, -1,  1,
        1,  1,  1, -1,  1,  1, -1, -1,  1,  1, -1,  1, -1, -1,  1,  1,  1,
        1,  1, -1, -1,  1,  1, -1,  1, -1, -1, -1,  1, -1, -1, -1])

In [40]:
y_pred = 2*y_pred -1

In [41]:
np.mean(y_pred == y_train)


Out[41]:
0.5

Not very good, and that is in-sample fit!

But we know how to fix it:


In [42]:
phases=np.linspace(0,6,100)
freqs=np.linspace(0,6,100)

clf = sklearn.linear_model.ElasticNet(alpha=.1, l1_ratio=.25)
clf.fit(my_transform(X_train, phases, freqs), y_train)

y_pred = clf.predict(my_transform(X_train, phases, freqs))
y_pred = (y_pred >= 0)
y_pred = 2*y_pred -1

In [43]:
np.mean(y_pred == y_train)


Out[43]:
0.89000000000000001

You should not be impressed by this until you test it out-of-sample... what do you think will happen in that case?
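
One rough way to check, sketched here but not run above: treat the simulated points that never landed in train as the held-out set.

# indices of simulated points that did not end up in the training sample
test = np.array(sorted(set(range(1000)) - set(train)))
X_test = x_true[test][:, None]
y_test = y_true[test]

# same thresholding as above, applied out of sample
y_pred_test = 2*(clf.predict(my_transform(X_test, phases, freqs)) >= 0) - 1
print np.mean(y_pred_test == y_test)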

Using linear regression for this is a little silly, though. We should consider a loss function that is more appropriate to the task at hand:


In [44]:
clf = sklearn.linear_model.LogisticRegression(penalty='l1', C=.05)

clf.fit(my_transform(X_train, phases, freqs), y_train)

y_pred = clf.predict(my_transform(X_train, phases, freqs))
# don't need to mess around with the results this time...

np.mean(y_pred == y_train)


Out[44]:
0.81999999999999995

Surprising that it is worse, not better. Worth testing this out of sample, too, huh?

The hinge loss may be less familiar:
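
For reference, on an example $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ and decision function $f$, the hinge loss is

$$\max\big(0, \; 1 - y_i f(x_i)\big),$$

which is zero once the example is on the correct side of the margin and grows linearly with the size of the violation otherwise.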


In [45]:
import sklearn.svm

In [46]:
clf = sklearn.svm.SVC(kernel='linear')
clf.fit(my_transform(X_train, phases, freqs), y_train)

y_pred = clf.predict(my_transform(X_train, phases, freqs))
# don't need to mess around with the results this time...

np.mean(y_pred == y_train)


Out[46]:
0.90000000000000002

The other cool thing about the SVM is the kernel trick. The kernel trick can be used in a lot of settings, but it is particularly central to the SVM. Let us first go back to the raw feature, without any of the transformed covariate columns we added:


In [47]:
clf = sklearn.svm.SVC(kernel='linear', C=1.)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)
np.mean(y_pred == y_train)


Out[47]:
0.53000000000000003

Not good at all. But with the kernel trick:


In [48]:
clf = sklearn.svm.SVC(kernel='poly', C=1., degree=6)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)
np.mean(y_pred == y_train)


Out[48]:
0.42999999999999999

In [49]:
clf = sklearn.svm.SVC(kernel='rbf', C=1., gamma=0.1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)
np.mean(y_pred == y_train)


Out[49]:
0.81999999999999995
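
For reference (these are the standard definitions, matching sklearn's parameterization as far as I know, not something computed above), the kernels tried here are

$$k_{\mathrm{poly}}(x, x') = \big(\gamma \, x \cdot x' + r\big)^{d}, \qquad k_{\mathrm{rbf}}(x, x') = \exp\big(-\gamma \lVert x - x' \rVert^2\big),$$

where $d$ is degree, $\gamma$ is gamma, and $r$ is coef0 in sklearn.svm.SVC. The kernel lets the SVM act as if we had expanded the feature space, without ever building the expansion explicitly.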

Homework: use out-of-sample cross-validation to find the best values for C, gamma, and degree. Use a train/validation/test split to estimate the accuracy of the best method. Bonus extra-hard theory-and-application problem: figure out the best possible accuracy for this toy problem.
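
One possible starting point, sketched with arbitrary grid values (not a worked solution): sklearn's GridSearchCV automates the cross-validated search; at the time of this notebook it lives in sklearn.grid_search, and in newer releases in sklearn.model_selection.

import sklearn.grid_search

# cross-validated search over C and gamma for the RBF kernel;
# degree would be handled the same way for kernel='poly'
param_grid = {'C': [.01, .1, 1., 10., 100.], 'gamma': [.01, .1, 1., 10.]}
cv_clf = sklearn.grid_search.GridSearchCV(sklearn.svm.SVC(kernel='rbf'),
                                          param_grid, cv=5)
cv_clf.fit(X_train, y_train)
print cv_clf.best_params_, cv_clf.best_score_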

