In [3]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Statistical Learning

Statistical learning differs from machine learning in that we are not merely interested in how well a model fits the data: we are also interested in how to interpret the model and derive meaning from it.

  • How to relate the covariates $X = \{ x_1, x_2, \ldots, x_p \}$ to the response $y$.
  • Our model of the data is $y = f(x) + \epsilon$
    • $f(x)$ is not necessarily linear,
    • Error terms need not be normal.
  • Goal: to develop an estimate of $f$, $\hat{f}$
    • Two reasons to estimate $f$ with $\hat{f}$:
      1. Make predictions (not necessarily informed by mechanisms, relationships among covariates),
        • Want $\hat{y}$ to be close to $y$; $\hat{y} = \hat{f}(x)$
        • Minimize Mean Squared Error:

$E(y-\hat{y})^2 = E[f(x) + \epsilon - \hat{f}(x)]^2$

$E(y-\hat{y})^2 = [f(x) - \hat{f}(x)]^2 + Var(\epsilon)$

The first term is the reducible error (it shrinks as $\hat{f}$ gets closer to $f$); $Var(\epsilon)$ is the irreducible error, which no choice of $\hat{f}$ can remove.


In [11]:
import numpy as np
from scipy.stats import uniform

In [14]:
f = lambda x: np.log(x)
x = np.linspace(0.1, 5.1, 100)
y = f(x)
Eps = uniform.rvs(-1., 2., size=(100,))  # noise drawn uniformly from [-1, 1] (loc=-1, scale=2)

In [20]:
plt.plot(x, y, label='$f(x)$', lw=3)
plt.scatter(x, y + Eps, label='y')
plt.xlabel('x')
plt.legend(loc='best')
plt.show()


  • Goal: to develop an estimate of $f$, $\hat{f}$
    • Two reasons to estimate $f$ with $\hat{f}$ (continued):
      2. Make inferences; we want to know how the covariates $X$ affect $y$.

In [58]:
models = ['Subset selection, lasso', 'least squares', 'generalized additive models, trees', 
          'bagging, boosting', 'support vector machines']
pos = [(0, 1), (0.2, 0.8), (0.4, 0.6), (0.6, 0.1), (0.7, 0.3)]
xlabels = ['Restrictive', 'Flexible']
ylabels = ['Low', 'High']

plt.figure(figsize=(10, 7))
for m, p in zip(models, pos):
    plt.text(p[0]+ 0.02, p[1]-0.05, m, size=16)
    
plt.xticks([0.07, 0.95], xlabels, size=16)
plt.yticks([0, 1], ylabels, size=16)
plt.ylabel('Interpretability', size=20)
plt.xlabel('Flexibility', size=20)

plt.show()


How do we estimate $f$, i.e. construct $\hat{f}$?

Parametric vs non-parametric methods

Parametric methods

  • Assume some form for the relationship between X and y. For example:

$y = \beta_0 + \beta_1x + \epsilon$

$y = X\beta + \epsilon$

$logit(y) = X\beta + \epsilon$

  • And fit the data by tuning a few ($p \ll n$) $\beta$ terms (many fewer parameters than the number of observations); see the sketch below.
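
As a minimal sketch of the parametric approach (not from the original notebook; it reuses x, y and Eps from the cells above), we can fit $y = \beta_0 + \beta_1 x + \epsilon$ by ordinary least squares:

In [ ]:
# Hypothetical sketch (not from the original notes): ordinary least squares fit of
# y = b0 + b1*x to the noisy log data (x, y + Eps) generated above.
X_design = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
beta_hat, *_ = np.linalg.lstsq(X_design, y + Eps, rcond=None)
print('beta_0 = %.3f, beta_1 = %.3f' % tuple(beta_hat))

plt.plot(x, y, lw=3, label='$f(x)$')
plt.scatter(x, y + Eps, label='y')
plt.plot(x, X_design @ beta_hat, 'r--', label='linear fit $\\hat{f}(x)$')
plt.legend(loc='best')
plt.show()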

Non-parametric methods

  • Assume no form for $f$,
  • or the form has $p \simeq n$

In [65]:
x = np.linspace(0., 1.2, 5)
plt.scatter(x[0:4], [0.1, 0.6, 0.25, 0.7])      # observed data
plt.plot(x, [0.1, 0.6, 0.25, 0.7, 1.2])         # flexible fit through every observed point
plt.plot(x, x/1.5)                              # rigid (linear) fit
plt.scatter(1.2, 0., c='red')                   # new data point (x_0, y_0)
plt.show()


We can fit these points perfectly with a cubic model, but only by assuming that this form is correct.

What happens when we get a new data point, $(x_0, y_0)$?

For non-parametric methods we need some way to penalize "wiggliness".

Wiggliness (definition): the cumulative change in the second derivative, $f''$.
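
One concrete way to measure this (an assumed roughness penalty, $\int f''(x)^2\,dx$; the notes only name the idea) can be approximated with finite differences:

In [ ]:
# Hypothetical sketch: approximate the roughness penalty, the integral of f''(x)^2,
# by numerical differentiation; the wigglier curve gets the larger penalty.
def roughness(xs, ys):
    d2 = np.gradient(np.gradient(ys, xs), xs)    # numerical second derivative
    return np.trapz(d2**2, xs)                   # integrate (f'')^2 over x

xs = np.linspace(0., 1., 200)
print(roughness(xs, xs / 1.5))                   # straight line: essentially zero
print(roughness(xs, np.sin(8 * np.pi * xs)))     # oscillatory curve: large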

Pros & cons:

  • Parametric:
    • Pros:
      • More interpretable
      • Requires fewer data
    • Cons:
      • More rigid
      • More assumptions to make
  • Non-parametric
    • Pros:
      • More flexible
      • Fewer assumptions
    • Cons:
      • Need more data
      • Harder to interpret

Supervised vs. unsupervised algorithms

  • In the supervised case we have a response variable, $y$;
  • in the unsupervised case there is no response variable.
  • The response variable, $y$, supervises our selection of important covariates, $X$.

Examples:

  • Regression -- supervised
  • NMDS/PCA -- unsupervised
  • Diabetes risk -- supervised

In [81]:
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.title('Supervised')
plt.scatter([.0, .2, .1, .3], [.2, .1, .3, .4], c='red', label='nondiabetic')
plt.scatter([.6, .8, .9, .7], [.55, .74, .5, .8], c='blue', label='diabetic')
plt.ylabel('Weekly sugar intake')
plt.xlabel('BMI')
plt.legend(loc=2)

plt.subplot(122)
plt.title('Unsupervised')
plt.scatter([.6, .8, .9, .7]+[.0, .2, .1, .3], [.55, .74, .5, .8]+[.2, .1, .3, .4], c='black')
plt.ylabel('Weekly sugar intake')
plt.xlabel('BMI')

plt.tight_layout()


In the unsupervised case, we don't know the patient groups.

Classification & regression

Regression: response is continuous (either continuous or categorical covariates)

Classification: response is categorical

Regression

Assessing model accuracy


In [93]:
x = np.linspace(0., 1., 50)
y = x + np.random.random(size=50) - 0.5


plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.title('Model A')
plt.scatter(x, y)
plt.plot(x, x)

plt.subplot(122)
plt.title('Model B')
plt.scatter(x, y)
plt.plot(x, [0.42]*50)

plt.tight_layout()
plt.show()


Model A is better because the $Ave(y-\hat{y})^2$ (Mean Squared Error) is smaller.

Consider a model with $n$ parameters (e.g. a degree-$(n-1)$ polynomial). It can pass through every data point: zero training MSE!

If the model is too flexible (and we overfit the data), then we tend to do a bad job at predicting a new data point that was not used in tuning the model.

Test data & training data

Take our data and split into two groups:

  1. Training data: data used to tune the model(s) of interest
  2. Test data: data used to assess the accuracy of each model (typically use MSE)

In general, $MSE_{training} \leq MSE_{test}$

Want to look at the impact of model complexity on both $MSE_{training}$ and $MSE_{test}$.


In [106]:
plt.figure(figsize=(7, 5))

x = np.linspace(1, 10, 99)
plt.plot(x, 1./x**0.5 - 0.1, label='$MSE_{training}$', lw=3)

plt.plot(np.linspace(1, 10, 7), [0.9, 0.6, 0.5, 0.45, 0.55, 0.7, 0.9], label='$MSE_{test}$', lw=3)
plt.ylabel('$MSE$')
plt.xlabel('flexibility')

plt.legend()
plt.show()


$MSE_{test}$ should bottom out when the model's flexibility matches the true $f$, and it should never drop below the irreducible error, $Var(\epsilon)$. The goal is to minimize $MSE_{test}$.
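
A hypothetical illustration of this pattern (simulated data, not from the notebook): fit polynomials of increasing degree to a training set and compare training MSE with test MSE on held-out points.

In [ ]:
# Hypothetical sketch: training MSE keeps falling as flexibility grows,
# while test MSE eventually rises again (overfitting).
rng = np.random.RandomState(0)
x_all = np.linspace(0., 1., 60)
y_all = np.sin(2 * np.pi * x_all) + rng.normal(scale=0.3, size=60)

train = rng.rand(60) < 0.5                       # random train/test split
x_tr, y_tr = x_all[train], y_all[train]
x_te, y_te = x_all[~train], y_all[~train]

for degree in (1, 3, 9, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr))**2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te))**2)
    print('degree %2d: train MSE %.3f, test MSE %.3f' % (degree, mse_tr, mse_te))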

Bias/Variance trade-off

  • It can be shown that for $y = f(x) + \epsilon$,

$E[y_0 - \hat{f}(x_0)]^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$

  • $E[y_0 - \hat{f}(x_0)]^2$ -- Expected test set MSE
  • $Var(\hat{f}(x_0))$ -- Measure of how much the $\hat{f}$ function would change if we got new training data. If the model is well fit, this should be small.
  • $Bias(\hat{f}) = E[f(x_0) - \hat{f}(x_0)]$ -- How much am I going to be wrong because my $\hat{f}$ is too restrictive. Want a model that is flexible enough that this bias is small.

$(x_0, y_0)$ is a previously unseen test observation (not part of the training data).
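
A hypothetical simulation of this decomposition (not from the notebook): repeatedly draw training sets from $y = f(x) + \epsilon$, fit a rigid and a flexible model, and look at the variance and bias of $\hat{f}(x_0)$ at a fixed test point.

In [ ]:
# Hypothetical sketch: estimate Var(f_hat(x0)) and Bias(f_hat(x0))^2 by simulation
# for a rigid (degree-1) and a flexible (degree-9) polynomial fit.
rng = np.random.RandomState(1)
f_true = np.sin                                  # "true" f
x0, n_train, n_sims = 2.0, 30, 500

for degree in (1, 9):
    preds = []
    for _ in range(n_sims):
        x_tr = rng.uniform(0., np.pi, n_train)
        y_tr = f_true(x_tr) + rng.normal(scale=0.3, size=n_train)
        preds.append(np.polyval(np.polyfit(x_tr, y_tr, degree), x0))
    preds = np.array(preds)
    bias = preds.mean() - f_true(x0)
    print('degree %d: Var %.4f, Bias^2 %.4f' % (degree, preds.var(), bias**2))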

Classification

Assessing accuracy

  • $\hat{y}$ will be categorical (as is $y$)
  • Measure will be % of cases mis-classified

Training error rate: $ER = \frac{1}{n}\sum{I(y_i \neq \hat{y}_i)}$

$I(u) = 1$ if $u$ is true, $0$ otherwise.
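
A small sketch of this calculation (hypothetical labels, not from the notebook):

In [ ]:
# Hypothetical sketch: the training error rate is the fraction of misclassified cases.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(np.mean(y_true != y_hat))                  # (1/n) * sum of I(y_i != y_hat_i) = 0.25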


In [108]:
x = np.linspace(0., 1., 20)
y = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

plt.scatter(x, y)
plt.ylabel('Cougar occupied')
plt.xlabel('# of dogs')



$\hat{y} = 1$ (occupied) if $\hat{p}(x_0) > 0.5$; $\hat{y} = 0$ (unoccupied) if $\hat{p}(x_0) \leq 0.5$.

Thresholding the predicted probability in this way turns a logistic regression into a classifier.
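
A minimal sketch of that idea (it assumes scikit-learn, which these notes don't otherwise use, and reuses x and y from the cougar cell above):

In [ ]:
# Hypothetical sketch: logistic regression turned into a classifier via p_hat > 0.5.
from sklearn.linear_model import LogisticRegression

X_dogs = np.asarray(x).reshape(-1, 1)            # covariate: # of dogs
clf = LogisticRegression().fit(X_dogs, y)

p_hat = clf.predict_proba(X_dogs)[:, 1]          # estimated P(occupied | x)
y_hat = (p_hat > 0.5).astype(int)                # 1 = occupied, 0 = unoccupied
print(np.mean(np.asarray(y) != y_hat))           # training error rate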

