Exercise 2: Predictive accuracy

(Reproducing ISL Figures 2.9 and 2.17)


In [ ]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

In [ ]:
# set random seed, for reproducibility
np.random.seed(12345)

# funny little placeholder var, so cells with blanks to fill in still run
__________________ = None

In [ ]:
# true x and y points
x_true = np.linspace(0,10,500)
y_true = np.cos(x_true)

# noisy observed x and y points
x_train = np.random.choice(x_true, 25, replace=False)
dy = 0.5
y_train = np.random.normal(np.cos(x_train), dy)

In [ ]:
plt.plot(x_true, y_true, label='true')
plt.plot(x_train, y_train, 's', label='observed')
plt.legend()

In [ ]:
# note: the GaussianProcess class (theta0/nugget API) is from scikit-learn < 0.18
from sklearn.gaussian_process import GaussianProcess

In [ ]:
gp = GaussianProcess(theta0=100, nugget=.5)
gp.fit(x_train[:, None], y_train)
y_pred = gp.predict(x_true[:, None])

In [ ]:
# What does x_train[:,None] do?
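One way to check, on a toy array (illustrative, separate from the `x_train` above):

```python
import numpy as np

# try it on a small array and inspect the shapes
a = np.array([1.0, 2.0, 3.0])
print(a.shape)      # (3,)

# indexing with None (an alias for np.newaxis) inserts a new axis,
# turning the 1-D array into an n x 1 column: the (n_samples, n_features)
# shape that sklearn's fit/predict methods expect
col = a[:, None]
print(col.shape)    # (3, 1)
```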

In [ ]:
# What does the GPR prediction look like?
plt.plot(x_true, y_true, label='true')
plt.plot(x_train, y_train, 's', label='observed')
plt.plot(x_true, y_pred, label='predicted')
plt.legend()

We can measure the training error (RMSE):
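For reference, RMSE is the square root of the mean of the squared residuals; a quick check on made-up numbers:

```python
import numpy as np

# hypothetical observations and predictions
obs = np.array([1.0, 2.0, 3.0])
pred = np.array([1.0, 2.0, 5.0])

# root-mean-squared error: sqrt of the mean squared residual
rmse_demo = np.sqrt(np.mean((obs - pred) ** 2))
print(rmse_demo)    # sqrt(4/3), about 1.155
```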


In [ ]:
y_pred = gp.predict(x_train[:, None])
rmse = __________________
print('(in-sample) rmse =', rmse)

To measure the "test" error, we need different data. Since this is a simulation, we can have as much as we want:


In [ ]:
# noisy observed x and y points
x_test = np.random.uniform(0, 10, size=1000)
y_test = np.random.normal(np.cos(x_test), dy)

# out-of-sample prediction
y_pred = __________________
rmse = __________________
print('(out-of-sample) rmse =', rmse)

Refactor this into a function:


In [ ]:
def test_train_error(theta0):
    """ calculate the RMSE for the test and train data
    using GPR with the specified theta0 value"""

    gp = GaussianProcess(theta0=theta0, nugget=.5)
    __________________
    __________________
    __________________
    
    return dict(__________________=__________________,)

test_train_error(100)

I didn't tell you about dict last time! This is one of my favorite Python things. It is like an array, but you can use (almost) whatever you want as the index: any hashable object, such as a string, a number, or a tuple. In CS lingo, it is an associative array. You can also create one with curly brackets:

d1 = {}
d1['abie'] = 'forgetful'

d2 = {'yes':1, 'no':0}
d2['yes']  # what value will this produce?
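A couple more dict moves worth knowing (plain Python, nothing notebook-specific):

```python
# keys can be any hashable object: strings, numbers, even tuples
counts = {('orange', 'train'): 25, ('purple', 'train'): 25}

# .get returns a default instead of raising KeyError on a missing key
n_missing = counts.get(('orange', 'test'), 0)
print(n_missing)    # 0

# iterate over key/value pairs
for key, value in counts.items():
    print(key, value)
```

A list of dicts is also exactly what `pd.DataFrame(...)` consumes below: each dict becomes a row, and the keys become column names.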

Search over theta values to see how test and train error vary:


In [ ]:
results = []

for theta0 in np.exp(np.linspace(-3, 5, 50)):
    results.append(
        test_train_error(theta0))
    
results = pd.DataFrame(results)

In [ ]:
plt.semilogx(results.theta0, results.err_train, label='Train Error')
plt.plot(results.theta0, results.err_test, label='Test Error')
plt.legend()
plt.xlabel(r'$\theta_0$')  # raw string, so \t is not read as a tab
plt.ylabel('RMSE')

So the best value is around $\theta_0 = $ __________________:


In [ ]:
gp = GaussianProcess(theta0=__________________, nugget=.5)
gp.fit(x_train[:, None], y_train)
y_pred = gp.predict(x_true[:, None])

In [ ]:
plt.plot(x_true, y_true, label='true')
plt.plot(x_train, y_train, 's', label='observed')
plt.plot(x_true, y_pred, label='predicted')
plt.legend()

What does the prediction look like when $\theta_0$ is too small?

(Is it time to refactor this into a function?)


In [ ]:
gp = GaussianProcess(theta0=__________________, nugget=.5)
__________________
__________________
__________________

Same game for $k$-NN, but now the task is classification, not regression.


In [ ]:
df = pd.read_csv('ISL_Fig_2_9_data.csv', index_col=0)

In [ ]:
# what is in df?

In [ ]:
# fix the color names (data cleaning!)
df['color'] = df['Y'].map({'red': 'orange', 'blue': 'purple'})

In [ ]:
# have a look at this data
plt.figure(figsize=(8,8))
for g, dfg in df.groupby('color'):
    plt.plot(dfg['X.1'], dfg['X.2'], 's', color=g, ms=10, alpha=.6)

In [ ]:
# load the k-NN library
import sklearn.neighbors

In [ ]:
# create the training data

# array of feature vectors
X_train = np.array(df.filter(['X.1', 'X.2']))

# corresponding labels
y_train = np.array(df.color.map({'orange': 0, 'purple': 1}))

In [ ]:
# fit a k-NN classifier
n_neighbors=20
weights='uniform'
clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
clf.fit(X_train, y_train)

In [ ]:
# explore the decision boundary. For that, we will predict each
# point in the mesh [-3, 4]x[-3, 4].
xx, yy = np.meshgrid(np.linspace(-3, 4, 500),
                     np.linspace(-3, 4, 500))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
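If np.c_ and ravel are unfamiliar, here is what they do on a tiny mesh (toy values, kept separate from the xx/yy above):

```python
import numpy as np

# a 2x2 mesh over two x values and two y values
mx, my = np.meshgrid([0, 1], [10, 20])

# ravel flattens each grid to 1-D; np.c_ stacks them as columns,
# giving one (x, y) row per mesh point
grid = np.c_[mx.ravel(), my.ravel()]
print(grid.shape)   # (4, 2)
print(grid)
```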

In [ ]:
plt.figure(figsize=(8,8))
for g, dfg in df.groupby('color'):
    plt.plot(dfg['X.1'], dfg['X.2'], 's', color=g, ms=10, alpha=.6)
plt.contour(xx, yy, Z, levels=[.5], colors='k', zorder=10)

In-sample accuracy (concordance):
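Concordance here just means the fraction of predictions that agree with the true labels; on made-up labels:

```python
import numpy as np

# hypothetical true labels and predictions
labels = np.array([0, 0, 1, 1])
preds = np.array([0, 1, 1, 1])

# comparing elementwise gives booleans; their mean is the
# fraction of correct predictions
concordance = np.mean(labels == preds)
print(concordance)  # 0.75
```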


In [ ]:
y_pred = clf.predict(X_train)

c = __________________
print('(in-sample) concordance =', c)

To measure out-of-sample accuracy, we need to know the distribution, or approximate it somehow. One way (to which we will return) is the train/test split. In this simulation environment, however, we can just generate more data:
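For reference, the train/test split we will return to can be sketched with scikit-learn's model_selection module (available in recent versions; the toy X and y here are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy feature matrix (10 rows) and binary labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# hold out 30% of the rows as a test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)   # (7, 2) (3, 2)
```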


In [ ]:
orange_means = pd.read_csv('ISL_Fig_2_9_orange_mean.csv', index_col=0)
purple_means = pd.read_csv('ISL_Fig_2_9_purple_mean.csv', index_col=0)

To simulate data of either color, first choose a mean uniformly at random, and then draw from a normal centered at that mean, with standard deviation 0.5:


In [ ]:
def rmixture(means):
    i = np.random.randint(len(means.index))
    return np.random.normal(loc=means.iloc[i], scale=.5)

rmixture(orange_means), rmixture(purple_means)

In [ ]:
n_test = 1000

X_test = [rmixture(orange_means) for i in range(n_test)] + \
           [rmixture(purple_means) for i in range(n_test)]
X_test = np.array(X_test)

y_test = [0 for i in range(n_test)] + [1 for i in range(n_test)]
y_test = np.array(y_test)

Have a look at this newly simulated data:


In [ ]:
plt.figure(figsize=(8,8))
plt.plot(X_test[:n_test,0], X_test[:n_test,1], 's', color='orange', ms=10, alpha=.6)
plt.plot(X_test[n_test:,0], X_test[n_test:,1], 's', color='purple', ms=10, alpha=.6)
plt.contour(xx, yy, Z, levels=[.5], colors='k', zorder=10)

In [ ]:
y_pred = __________________

In [ ]:
c = __________________
print('(out-of-sample) concordance =', c)

Now refactor the test-/train-error measurement into a function that takes the $k$ in $k$-NN as a parameter, then sweep over a range of $k$ values to find the best $k$ and reproduce ISL Figure 2.17.


In [ ]:
def test_train_error(k):
    """ calculate the concordance for the test and train data
    using k-NN with the specified k value"""
    __________________
    __________________
    __________________
    
    return dict(__________________=__________________,)

test_train_error(1)

In [ ]:
results = []

__________________
__________________
__________________

__________________

In [ ]:
# plot results, replication of figure 2.17 from ISL