In [3]:
%pylab inline
Statistical learning differs from machine learning in that we are not merely interested in how well a model fits; we are also interested in how to interpret the model and derive meaning from it.
$E(y-\hat{y})^2 = E[f(x) + \epsilon - \hat{f}(x)]^2$
$E(y-\hat{y})^2 = [f(x) - \hat{f}(x)]^2 + Var(\epsilon)$
The first term, $[f(x) - \hat{f}(x)]^2$, is the reducible error (it shrinks as $\hat{f}$ improves); $Var(\epsilon)$ is the irreducible error.
In [11]:
import numpy as np
from scipy.stats import uniform
In [14]:
f = lambda x: np.log(x)                   # the "true" function f
x = np.linspace(0.1, 5.1, 100)
y = f(x)                                  # noiseless response f(x)
Eps = uniform.rvs(-1., 2., size=(100,))   # noise eps ~ Uniform(-1, 1), so E[eps] = 0
In [20]:
plt.plot(x, y, label='$f(x)$', lw=3)
plt.scatter(x, y + Eps, label='y')
plt.xlabel('x')
plt.legend(loc='best')
plt.show()
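A quick numerical check of the decomposition above. This is a sketch: it reuses x, y, and Eps from the cells above, uses a straight line as a stand-in for $\hat{f}$, and relies on the noise being Uniform(-1, 1), so $Var(\epsilon) = 1/3$.
In [ ]:
# Pretend our fitted model is a deliberately simple straight line f-hat,
# then compare the pieces of E(y - y_hat)^2.
import numpy as np

y_obs = y + Eps                            # observed responses
b1, b0 = np.polyfit(x, y_obs, 1)           # least-squares line as f-hat
f_hat = b0 + b1 * x

reducible = np.mean((y - f_hat)**2)        # [f(x) - f_hat(x)]^2, averaged over x
irreducible = np.var(Eps)                  # estimate of Var(eps); ~1/3 for Uniform(-1, 1)
total = np.mean((y_obs - f_hat)**2)        # Ave(y - y_hat)^2

print(reducible, irreducible, total)       # total is approximately reducible + irreducible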
In [58]:
models = ['Subset selection, lasso', 'Least squares', 'Generalized additive models, trees',
          'Bagging, boosting', 'Support vector machines']
pos = [(0, 1), (0.2, 0.8), (0.4, 0.6), (0.6, 0.1), (0.7, 0.3)]
xlabels = ['Restrictive', 'Flexible']
ylabels = ['Low', 'High']
plt.figure(figsize=(10, 7))
for m, p in zip(models, pos):
    plt.text(p[0] + 0.02, p[1] - 0.05, m, size=16)
plt.xticks([0.07, 0.95], xlabels, size=16)
plt.yticks([0, 1], ylabels, size=16)
plt.ylabel('Interpretability', size=20)
plt.xlabel('Flexibility', size=20)
plt.show()
Parametric methods
$y = \beta_0 + \beta_1x + \epsilon$
$y = X\beta + \epsilon$
$logit(p) = X\beta$, where $p = P(y = 1 \mid X)$ (logistic regression)
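A minimal parametric example, as a sketch: assume the straight-line form $y = \beta_0 + \beta_1 x$ and estimate the coefficients by least squares on the simulated log data from above (reuses x, y, and Eps; the linear form is knowingly wrong for $\log(x)$, which is the point of a parametric assumption).
In [ ]:
# Parametric: pick a functional form (here b0 + b1*x), then estimate its parameters.
import numpy as np
import matplotlib.pyplot as plt

y_obs = y + Eps                                      # noisy observations from the earlier cell
X = np.column_stack([np.ones_like(x), x])            # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y_obs, rcond=None)     # least-squares estimate of (b0, b1)

plt.scatter(x, y_obs, label='y')
plt.plot(x, X @ beta, lw=3, label=r'$\hat{\beta}_0 + \hat{\beta}_1 x$')
plt.legend(loc='best')
plt.show()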
Non-parametric methods
In [65]:
x = np.linspace(0., 1.2, 5)
plt.scatter(x[0:4], [0.1, 0.6, 0.25, 0.7])    # the four observed points
plt.plot(x, [0.1, 0.6, 0.25, 0.7, 1.2])       # flexible fit that passes through every point
plt.plot(x, x/1.5)                            # a simple linear fit
plt.scatter(1.2, 0., c='red')                 # a new data point (x_0, y_0)
plt.show()
We can fit these four points perfectly with a cubic model, but only by assuming that the cubic form is correct.
What happens when we get a new data point $(x_0, y_0)$, like the red point above?
For non-parametric methods we need some way to penalize "wiggliness".
Wiggliness (one definition): the cumulative change in the second derivative, $f''$, e.g. a roughness penalty such as $\int f''(x)^2\,dx$.
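To see a penalty of this kind in action, here is a sketch using scipy's UnivariateSpline (not part of these notes); its smoothing factor s plays the role of the wiggliness penalty: a small s lets the curve chase every point, a large s forces it to stay smooth.
In [ ]:
# Non-parametric fits with an adjustable roughness penalty (illustrative only).
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

xs = np.linspace(0.1, 5.1, 100)
ys = np.log(xs) + np.random.uniform(-1, 1, size=100)

for s, name in [(2, 'wiggly (s=2)'), (50, 'smooth (s=50)')]:
    spl = UnivariateSpline(xs, ys, s=s)    # larger s = heavier penalty on wiggliness
    plt.plot(xs, spl(xs), lw=2, label=name)

plt.scatter(xs, ys, alpha=0.4, label='data')
plt.legend(loc='best')
plt.show()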
Pros & cons:
Examples:
In [81]:
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.title('Supervised')
plt.scatter([.0, .2, .1, .3], [.2, .1, .3, .4], c='red', label='nondiabetic')
plt.scatter([.6, .8, .9, .7], [.55, .74, .5, .8], c='blue', label='diabetic')
plt.ylabel('Weekly sugar intake')
plt.xlabel('BMI')
plt.legend(loc=2)
plt.subplot(122)
plt.title('Unsupervised')
plt.scatter([.6, .8, .9, .7]+[.0, .2, .1, .3], [.55, .74, .5, .8]+[.2, .1, .3, .4], c='black')  # same points, class labels withheld
plt.ylabel('Weekly sugar intake')
plt.xlabel('BMI')
plt.tight_layout()
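For the unsupervised panel, a clustering sketch (k-means via scipy's kmeans2 is my choice here, not something from the notes): with the diabetic/nondiabetic labels hidden, the algorithm still recovers two groups from the BMI and sugar-intake values alone.
In [ ]:
# Unsupervised: no labels, just look for structure (here, 2 clusters).
import numpy as np
from scipy.cluster.vq import kmeans2

bmi = np.array([.6, .8, .9, .7, .0, .2, .1, .3])
sugar = np.array([.55, .74, .5, .8, .2, .1, .3, .4])
data = np.column_stack([bmi, sugar])

np.random.seed(0)                          # kmeans2 initializes from randomly chosen points
centroids, labels = kmeans2(data, 2, minit='points')
print(labels)                              # two groups recovered without using the class labels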
In [93]:
x = np.linspace(0., 1., 50)
y = x + np.random.random(size=50) - 0.5
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.title('Model A')
plt.scatter(x, y)
plt.plot(x, x)
plt.subplot(122)
plt.title('Model B')
plt.scatter(x, y)
plt.plot(x, [0.42]*50)
plt.tight_layout()
plt.show()
Model A is better because its mean squared error, $MSE = Ave(y-\hat{y})^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, is smaller.
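A quick check of that claim (a sketch that reuses x and y from the cell above):
In [ ]:
# Compare the mean squared errors of the two candidate models.
import numpy as np

mse_a = np.mean((y - x)**2)        # Model A: y_hat = x (the diagonal line)
mse_b = np.mean((y - 0.42)**2)     # Model B: y_hat = 0.42 (the flat line)
print(mse_a, mse_b)                # mse_a should come out smaller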
Consider a model with $n$ parameters, e.g. a degree-$(n-1)$ polynomial fit to $n$ data points: it can pass through every point, so the training MSE is zero!
But if the model is too flexible (i.e. we overfit the data), it tends to do a bad job of predicting new data points that were not used to fit the model.
Take our data and split it into two groups: a training set, used to fit the model, and a test set, used to evaluate it.
In general, $MSE_{training} \leq MSE_{test}$
Want to look at the impact of model complexity on both $MSE_{training}$ and $MSE_{test}$.
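Here is a sketch of that comparison: simulate data, split it into training and test halves, fit polynomials of increasing degree (increasing flexibility), and track both MSEs. The data-generating function and the 50/50 split are choices made for illustration.
In [ ]:
# Training MSE keeps falling as flexibility grows; test MSE eventually turns back up.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x_all = np.linspace(0.1, 5.1, 100)
y_all = np.log(x_all) + rng.uniform(-1, 1, size=100)

idx = rng.permutation(100)
train, test = idx[:50], idx[50:]

degrees = range(1, 10)
mse_train, mse_test = [], []
for d in degrees:
    coeffs = np.polyfit(x_all[train], y_all[train], d)
    mse_train.append(np.mean((y_all[train] - np.polyval(coeffs, x_all[train]))**2))
    mse_test.append(np.mean((y_all[test] - np.polyval(coeffs, x_all[test]))**2))

plt.plot(degrees, mse_train, lw=3, label='$MSE_{training}$')
plt.plot(degrees, mse_test, lw=3, label='$MSE_{test}$')
plt.xlabel('polynomial degree (flexibility)')
plt.ylabel('$MSE$')
plt.legend()
plt.show()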
In [106]:
plt.figure(figsize=(7, 5))
x = np.linspace(1, 10, 99)
plt.plot(x, 1./x**0.5 - 0.1, label='$MSE_{training}$', lw=3)
plt.plot(np.linspace(1, 10, 7), [0.9, 0.6, 0.5, 0.45, 0.55, 0.7, 0.9], label='$MSE_{test}$', lw=3)
plt.ylabel('$MSE$')
plt.xlabel('flexibility')
plt.legend()
plt.show()
$MSE_{test}$ should bottom out around the "true" level of noise: the expected test MSE can never drop below $Var(\epsilon)$, the irreducible error. The goal is to minimize $MSE_{test}$.
$E[y_0 - \hat{f}(x_0)]^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$
$(x_0, y_0)$ is a test observation, i.e. a point not used to fit $\hat{f}$.
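A Monte Carlo sketch of this decomposition at a single test point $x_0$. Everything here (the true $f = \log$, Uniform(-1, 1) noise, and a straight-line $\hat{f}$) is an assumption chosen for illustration.
In [ ]:
# Refit a straight line to many fresh training sets and examine its behaviour at x0.
import numpy as np

rng = np.random.RandomState(1)
f_true = np.log
x0, n_sims = 3.0, 2000
var_eps = 4. / 12.                         # Var of Uniform(-1, 1)

preds = np.empty(n_sims)
for i in range(n_sims):
    x_tr = np.linspace(0.1, 5.1, 100)
    y_tr = f_true(x_tr) + rng.uniform(-1, 1, size=100)
    b1, b0 = np.polyfit(x_tr, y_tr, 1)     # straight-line f-hat for this training set
    preds[i] = b0 + b1 * x0                # its prediction at x0

variance = preds.var()                     # Var(f-hat(x0)) across training sets
bias_sq = (preds.mean() - f_true(x0))**2   # [Bias(f-hat(x0))]^2
print(variance, bias_sq, var_eps, variance + bias_sq + var_eps)   # right-hand side of the identity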
In [108]:
x = np.linspace(0., 1., 20)
y = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
plt.scatter(x, y)
plt.ylabel('Cougar occupied')
plt.xlabel('# of dogs')
$\hat{y} = 1$ (occupied) if $\hat{p}(x_0) > 0.5$; $\hat{y} = 0$ (unoccupied) if $\hat{p}(x_0) \leq 0.5$.
Thresholding the predicted probability at 0.5 turns the logistic regression into a classifier.
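A sketch of that classifier on the cougar/dog data above, fitting the logistic regression by minimizing the negative log-likelihood with scipy.optimize.minimize (the optimizer choice is mine, not part of the notes), then applying the 0.5 threshold.
In [ ]:
# Fit p(x) = 1 / (1 + exp(-(b0 + b1*x))) by maximum likelihood, then threshold at 0.5.
import numpy as np
from scipy.optimize import minimize

x = np.linspace(0., 1., 20)
y = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

def neg_log_lik(beta):
    p = 1. / (1. + np.exp(-(beta[0] + beta[1] * x)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_log_lik, x0=[0., 0.])   # maximum-likelihood estimates of (b0, b1)
b0, b1 = res.x

p_hat = 1. / (1. + np.exp(-(b0 + b1 * x))) # predicted probabilities
y_hat = (p_hat > 0.5).astype(int)          # the 0.5 threshold turns probabilities into classes
print(y_hat)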