In [ ]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

In [ ]:
# set random seed, for reproducibility
np.random.seed(12345)

Simulate data that is easy to plot

This might not be the best example, but at least we can look at it.


In [ ]:
x_true = np.linspace(0,15,1000)
y_true = np.cos(x_true)

sigma_true = .3
x_train = np.random.choice(x_true, size=100)
y_train = np.random.laplace(np.cos(x_train), sigma_true)

In [ ]:
# remember what this looks like?
plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.legend()

How do you think Linear Regression will perform in this setting?


In [ ]:
import sklearn.linear_model

In [ ]:
clf = sklearn.linear_model.____________________

In [ ]:
X_train = x_train[:,None]
clf.fit(_____________________, _____________________)

In [ ]:
X_true = x_true[:,None]
y_pred = clf.predict(_____________________)
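
One possible completion of the blanks above, using plain ordinary least squares on the arrays already defined, is the sketch below (the choice of estimator is the only assumption here):


In [ ]:
# possible completion: ordinary least squares on the raw feature
clf = sklearn.linear_model.LinearRegression()
clf.fit(X_train, y_train)      # X_train is the (100, 1) matrix built above
y_pred = clf.predict(X_true)   # X_true is the (1000, 1) grid built above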

In [ ]:
plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()

Not so good, huh?

At this point in your life, do you have a number of tricks for improving this?


In [ ]:
import sklearn.preprocessing

In [ ]:
poly = sklearn.preprocessing.PolynomialFeatures(degree=_____________________)

In [ ]:
poly.fit_transform(X_train)  # What does fit_transform seem to do? Is poly.fit different? What about poly.transform?
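
As a hint, a degree-2 expansion is a natural first try for the blank above; fit_transform is simply fit followed by transform on the same data:


In [ ]:
# a sketch assuming degree=2 (any small integer is a reasonable guess)
poly = sklearn.preprocessing.PolynomialFeatures(degree=2)
poly.fit(X_train)                   # learns how many input features there are
X_poly = poly.transform(X_train)    # builds the columns [1, x, x**2]
X_poly.shape                        # fit_transform(X_train) does both steps at once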

In [ ]:
clf = sklearn.linear_model.LinearRegression()
clf.fit(poly.transform(X_train), y_train)

In [ ]:
X_true = x_true[:,None]
y_pred = clf.predict(poly.transform(X_true))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()

Adding squared terms doesn't do much, but how about a higher-degree polynomial?


In [ ]:
poly = sklearn.preprocessing.PolynomialFeatures(degree=_____________________)

clf = sklearn.linear_model.LinearRegression()
clf.fit(poly._____________________(X_train), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(poly._____________________(X_true))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()
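
One possible completion: the first blank method needs to fit the transformer as well (fit_transform), the second only needs transform; the sketch below does the fit step explicitly, which is equivalent, and the degree of 10 is just an assumed value:


In [ ]:
# possible completion: a degree-10 polynomial expansion (10 is an assumed value)
poly = sklearn.preprocessing.PolynomialFeatures(degree=10)
poly.fit(X_train)

clf = sklearn.linear_model.LinearRegression()
clf.fit(poly.transform(X_train), y_train)
y_pred = clf.predict(poly.transform(X_true))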

Not so good if you have to extrapolate, though...


In [ ]:
x_true = np.linspace(-2,17,1000)
y_true = np.cos(x_true)

X_true = x_true[:,None]
y_pred = clf.predict(poly.transform(X_true))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()

How about if you know that this is some sort of periodic function?


In [ ]:
def my_transform(X, phases=[0,.5], freqs=[1., 2.]):
    """ Return sine waves of a number of different phases and frequencies
    
    Parameters
    ----------
    X : _____________________
    phases : _____________________
    freqs : _____________________
    
    Returns
    -------
    Returns _____________________
    """
    
    n,p = X.shape
    assert p == 1, "designed for univariate feature space only"
    X = X.reshape(n)
    
    X_new = np.empty((n,len(phases)*len(freqs)))
    
    col = 0
    for a in phases:
        for b in freqs:
            X_new[:,col] = np.cos(a + b*X)
            col += 1
            
    assert np.all(np.isfinite(X_new))
    return X_new
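
A quick sanity check: with the default two phases and two frequencies, transforming the 100-point training matrix should give four columns.


In [ ]:
# sanity check of the output shape: (100, 2 phases * 2 freqs) = (100, 4)
my_transform(X_train).shape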

If you really know the phase and frequency, you can get it:


In [ ]:
phases = ________________________
freqs = ________________________

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()
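
Since the truth here is $\cos(x) = \cos(0 + 1 \cdot x)$, one exact completion of the blanks is a single phase of 0 and a single frequency of 1:


In [ ]:
# the true curve is cos(x) = cos(0 + 1*x), so one phase/frequency pair is enough
phases = [0.]
freqs = [1.]

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)
y_pred = clf.predict(my_transform(X_true, phases, freqs))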

But suppose that you do not have it so narrowed down:


In [ ]:
phases=[0,1,2]
freqs=np.linspace(0,6,7)

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()

And there is that whole small $n$ large $p$ thing you need to worry about:


In [ ]:
phases=np.linspace(0,6,10)
freqs=np.linspace(0,6,10)

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()

This can break down completely if $n$ is too small and $p$ is too large.


In [ ]:
phases=np.linspace(0,6,25)
freqs=np.linspace(0,6,25)

clf = sklearn.linear_model.LinearRegression()
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()

Enter penalty functions, e.g. sklearn.linear_model.Ridge:


In [ ]:
phases=np.linspace(0,6,25)
freqs=np.linspace(0,6,25)

clf = sklearn.linear_model.Ridge(alpha=________________________)
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()
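
The penalty strength alpha is a tuning parameter; scikit-learn's default of 1.0 is a reasonable first guess (an assumption, not a recommendation):


In [ ]:
# possible completion: ridge regression with an assumed penalty strength
clf = sklearn.linear_model.Ridge(alpha=1.0)
clf.fit(my_transform(X_train, phases, freqs), y_train)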

Or for this sort of application, sklearn.linear_model.Lasso:


In [ ]:
phases=np.linspace(0,6,25)
freqs=np.linspace(0,6,25)

clf = sklearn.linear_model.Lasso(alpha=________________________)
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()

Amazing, in my humble opinion!

One thing that is so great about this? You can understand the model:


In [ ]:
phases=np.linspace(0,6,100)
freqs=np.linspace(0,6,100)

clf = sklearn.linear_model.Lasso(alpha=.05)
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()

In [ ]:
clf.______________________

In [ ]:
clf.coef_.________________________

In [ ]:
np.where(clf.coef_>.01)

In [ ]:
print(phases[2:5])
print(freqs[15:18])
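
The attributes hinted at above are presumably the fitted coefficient vector and its shape; with the L1 penalty, most entries are exactly zero:


In [ ]:
# possible completion: inspect clf.coef_ and its shape
print(clf.coef_.shape)          # one coefficient per (phase, freq) column: 100 * 100 = 10000
print(np.sum(clf.coef_ != 0))   # the Lasso zeroes out almost all of them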

What is the phase and frequency of the sine wave with a coefficient weight greater than 0.01?


In [ ]:
________________________, ________________________
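
One way to answer this: my_transform fills columns with phases in the outer loop and freqs in the inner loop, so a flat column index col corresponds to phases[col // len(freqs)] and freqs[col % len(freqs)]:


In [ ]:
# possible completion: map the surviving column indices back to (phase, frequency)
cols = np.where(clf.coef_ > .01)[0]
print(phases[cols // len(freqs)])
print(freqs[cols % len(freqs)])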

One more penalty function, because I love the name: Elastic Nets


In [ ]:
phases=np.linspace(0,6,100)
freqs=np.linspace(0,6,100)

clf = sklearn.linear_model.ElasticNet(alpha=________________________, l1_ratio=________________________)
clf.fit(my_transform(X_train, phases, freqs), y_train)

X_true = x_true[:,None]
y_pred = clf.predict(my_transform(X_true, phases, freqs))

plt.plot(x_true, y_true, '-', label='Truth')
plt.plot(x_train, y_train, 's', label='Train')
plt.plot(x_true, y_pred, '-', label='Predicted')
plt.legend()
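
Assumed values just to make the cell run: a small alpha with an even mix of the L1 and L2 penalties:


In [ ]:
# possible completion (assumed values): small alpha, even L1/L2 mix
clf = sklearn.linear_model.ElasticNet(alpha=.05, l1_ratio=.5)
clf.fit(my_transform(X_train, phases, freqs), y_train)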

In [ ]:
np.where(clf.coef_>.01)

How should you find the appropriate values of alpha (and l1_ratio in the Elastic Net case)? Cross-validation, of course. There are some special tricks for making this fast, and sklearn has implementations you can use if this is relevant to your projects.
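
For example, scikit-learn ships cross-validated variants (RidgeCV, LassoCV, ElasticNetCV) that choose the penalty along a regularization path; a minimal sketch with arbitrary settings:


In [ ]:
# minimal sketch: let cross-validation pick alpha and l1_ratio (settings are arbitrary)
cv_clf = sklearn.linear_model.ElasticNetCV(l1_ratio=[.1, .5, .9, 1.], cv=5)
cv_clf.fit(my_transform(X_train, phases, freqs), y_train)
print(cv_clf.alpha_, cv_clf.l1_ratio_)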

Perhaps this all feels a little irrelevant

And it is a bit of a stylized example. In its defense, it looks pretty. Also, I think I will be able to interest the astronomers in it. But let us switch to a classification version of this:


In [ ]:
# set random seed for reproducibility
np.random.seed(12345)

x_true = np.linspace(0,15,1000)
y_true_numeric = np.cos(x_true)
y_true = (np.random.uniform(low=-1, high=1, size=1000) <= y_true_numeric)

train = np.random.choice(range(1000), size=100)
x_train = x_true[train]
y_train = y_true[train]

Not as pretty...


In [ ]:
plt.plot(x_true, y_true_numeric, '-', label='Propensity')
plt.plot(x_train, np.random.normal(0,________________________,size=100)+y_train, 's', label='Jittered Train')
plt.legend()

That would look better if the class labels were $\pm1$:


In [ ]:
y_true = 2*y_true - 1
y_train = y_true[train]

In [ ]:
plt.plot(x_true, y_true_numeric, '-', label='Propensity')
plt.plot(x_train, np.random.normal(0,.01,size=100)+y_train, 's', label='Jittered Train')
plt.legend()

In [ ]:
# fit the linear model
clf = sklearn.linear_model.LinearRegression()
X_train = x_train[:,None]
clf.fit(X_train, y_train)

In [ ]:
# make a categorical prediction with a linear regression
y_pred = (clf.predict(X_train) >= ________________________)

In [ ]:
# accuracy
np.mean(________________________ == ________________________)

There must be something wrong with that, right???


In [ ]:
y_pred = ________________________

In [ ]:
np.mean(y_pred == y_train)
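
A possible completion of the blanks above: threshold the regression output at zero, and remember that y_train is now coded as $\pm1$, so the boolean predictions have to be mapped onto $\pm1$ before comparing:


In [ ]:
# possible completion: threshold at 0 and recode the booleans as +/-1 labels
y_pred = 2*(clf.predict(X_train) >= 0) - 1
np.mean(y_pred == y_train)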

Not very good, and that is in-sample fit!

But we know how to fix it:


In [ ]:
phases=np.linspace(0,6,100)
freqs=np.linspace(0,6,100)

clf = sklearn.linear_model.ElasticNet(alpha=.1, l1_ratio=________________________)
clf.fit(my_transform(X_train, phases, freqs), y_train)

y_pred = clf.predict(my_transform(X_train, phases, freqs))
y_pred = (y_pred >= 0)
y_pred = 2*y_pred -1

In [ ]:
np.mean(y_pred == y_train)

You should not be impressed by this until you test it out-of-sample... what do you think will happen in that case?

Using linear regression for this is a little silly, though. We should consider a loss function that is more appropriate to the task at hand:


In [ ]:
clf = sklearn.linear_model.LogisticRegression(penalty=________________________, C=________________________)

clf.fit(my_transform(X_train, phases, freqs), y_train)

y_pred = clf.predict(my_transform(X_train, phases, freqs))
# don't need to mess around with the results this time...

np.mean(y_pred == y_train)
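
One possible completion, keeping the sparsity idea from the Lasso; note that in recent scikit-learn versions an L1 penalty requires solver='liblinear' or 'saga', and C=1. is just an assumed value:


In [ ]:
# possible completion: L1-penalized logistic regression (C=1. is an assumed value)
clf = sklearn.linear_model.LogisticRegression(penalty='l1', C=1., solver='liblinear')
clf.fit(my_transform(X_train, phases, freqs), y_train)
np.mean(clf.predict(my_transform(X_train, phases, freqs)) == y_train)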

Surprising that it is worse, not better. Worth testing this out of sample, too, huh?

The hinge loss may be less familiar:


In [ ]:
import sklearn.svm

In [ ]:
clf = sklearn.svm.SVC(kernel='linear')
clf.fit(my_transform(X_train, phases, freqs), y_train)

y_pred = clf.predict(my_transform(X_train, phases, freqs))
# don't need to mess around with the results this time...

np.mean(y_pred == y_train)

The other cool thing about the SVM is the kernel trick. It is possible to use the kernel trick in a lot of settings, but it is particularly central to SVM. Let us first remove all of the transformed covariate columns we added:


In [ ]:
clf = sklearn.svm.SVC(kernel='linear', C=________________________)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)
np.mean(y_pred == y_train)

Not good at all. But with the kernel trick:


In [ ]:
clf = sklearn.svm.SVC(kernel='poly', C=_____________________, degree=_____________________)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)
np.mean(y_pred == y_train)

In [ ]:
clf = sklearn.svm.SVC(kernel='rbf', C=_____________________, gamma=_____________________)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)
np.mean(y_pred == y_train)
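
Assumed starting values, just so the cell runs; choosing them well is exactly what the homework below asks for:


In [ ]:
# assumed starting values; gamma='scale' is the default in recent scikit-learn versions
clf = sklearn.svm.SVC(kernel='rbf', C=1., gamma='scale')
clf.fit(X_train, y_train)
np.mean(clf.predict(X_train) == y_train)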

Homework: use out-of-sample cross-validation to find the best values for C, gamma, and degree. Use a train/validation/test split to estimate the accuracy of the best method. Bonus extra-hard theory-and-application problem: figure out what the best possible accuracy is for this toy problem.
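
A minimal sketch of one way to set this up, using sklearn.model_selection; the parameter grid and split sizes are arbitrary assumptions:


In [ ]:
# minimal sketch: hold out a test set, then grid-search C, gamma, and degree by cross-validation
import sklearn.model_selection

X_tr, X_test, y_tr, y_test = sklearn.model_selection.train_test_split(
    X_train, y_train, test_size=.25, random_state=0)

param_grid = {'kernel': ['poly', 'rbf'],
              'C': [.1, 1., 10.],
              'gamma': [.01, .1, 1.],
              'degree': [2, 3, 5]}
search = sklearn.model_selection.GridSearchCV(sklearn.svm.SVC(), param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_test, y_test))   # out-of-sample accuracy of the best settings found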