A computer program is said to learn from experience, E, with respect to some class of tasks, T, and performance measure, P, if its performance at tasks in T, as measured by P, improves with experience E. For example, below T is predicting wine quality, P is squared error, and E is the wine dataset.
Examples of these categories are:

X
: $n \times p$ design matrix / features

Y
: $n$ label vector
In [38]:
## I will be using Python 3, for install instructions see
## http://anson.ucdavis.edu/~jsharpna/DSBook/unit1/intro.html#installation-and-workflow
## The following packages are numpy (linear algebra), pandas (data munging),
## sklearn (machine learning), matplotlib (graphics), statsmodels (statistical models)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
In [39]:
## Lines that start with ! run a bash command
!ls -l ../../data/winequality-red.csv
In [40]:
!head ../../data/winequality-red.csv
Wine dataset description
Some Python basics:
In [41]:
datapath = "../../data/"
with open(datapath + 'winequality-red.csv','r') as winefile:
    header = winefile.readline()
    wine_list = [line.strip().split(';') for line in winefile]
In [42]:
wine_ar = np.array(wine_list,dtype=np.float64)
In [43]:
names = [name.strip('"') for name in header.strip().split(';')]
print(names)
In [44]:
#Subselect the predictor X and response y
y = wine_ar[:,-1]
X = wine_ar[:,:-1]
n,p = X.shape
In [45]:
y.shape, X.shape #just checking
Out[45]:
In [46]:
import statsmodels.api as sm
X = np.hstack((np.ones((n,1)),X)) #add intercept
wine_ols = sm.OLS(y,X) #Initialize the OLS
wine_res = wine_ols.fit()
In [47]:
wine_res.summary()
Out[47]:
A learning machine takes $p$-dimensional data $x_i = (x_{i,1}, \ldots, x_{i,p})$ and predicts $y_i \in \mathcal Y$.
In [48]:
## The following uses pandas!
datapath = "../../data/"
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
# Load and prepare GDP per capita data
# Download data from http://goo.gl/j1MSKe (=> imf.org)
gdp_per_capita = pd.read_csv(datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")
gdp_per_capita = gdp_per_capita.rename(columns={"2015": "GDP per capita"})
gdp_per_capita = gdp_per_capita.set_index("Country")
full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita, left_index=True, right_index=True)
full_country_stats = full_country_stats.sort_values(by="GDP per capita")
In [49]:
full_country_stats.head()
Out[49]:
In [50]:
_ = full_country_stats.plot("GDP per capita",'Life satisfaction',kind='scatter')
plt.title('Life Satisfaction Index')
Out[50]:
In [51]:
keepvars = full_country_stats.dtypes[full_country_stats.dtypes == float].index.values # float-valued columns
keepvars = keepvars[:-1]                # drop the last float column
country = full_country_stats[keepvars]  # subselect those columns
In [52]:
Y = np.array(country['Life satisfaction'])
del country['Life satisfaction']
X_vars = country.columns.values
X = np.array(country)
In [53]:
def loss(yhat,y):
    """sqr error loss"""
    return (yhat - y)**2

def fit(X,Y):
    """fit the OLS from training w/ intercept"""
    lin1 = LinearRegression(fit_intercept=True) # OLS from sklearn
    lin1.fit(X,Y) # fit OLS
    return np.append(lin1.intercept_,lin1.coef_) # return betahat

def predict(x, betahat):
    """predict for point x"""
    return betahat[0] + x @ betahat[1:]
Given a loss $\ell(\theta; X,Y)$, for parameters $\theta$, the risk is $$ R(\theta) = \mathbb E \ell(\theta; X,Y). $$
Given training data $\{x_i,y_i\}_{i=1}^{n}$ (drawn iid from the same distribution as $X,Y$), the empirical risk is $$ R_n(\theta) = \frac 1n \sum_{i=1}^n \ell(\theta; x_i, y_i). $$ Notice that $\mathbb E R_n(\theta) = R(\theta)$ for fixed $\theta$.
For a class of parameters $\Theta$, the empirical risk minimizer (ERM) is $$ \hat \theta = \arg \min_{\theta \in \Theta} R_n(\theta) $$ (which may not be unique).
For a fixed $\theta$ we know by the Law of Large Numbers (as long as expectations exist and data is iid), $$ R_n(\theta) = \frac 1n \sum_{i=1}^n \ell(\theta; x_i, y_i) \rightarrow \mathbb E \ell(\theta; X,Y) = R(\theta), $$ where convergence is in probability (or almost surely). We want to minimize $R(\theta)$ so $R_n(\theta)$ is a pretty good surrogate.
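To see this at work, here is a minimal sketch on simulated data (the data-generating model and the fixed $\theta$ below are invented purely for illustration and are not part of the wine or OECD analyses): for a misspecified slope $\theta = 2$, the empirical risk settles near the true risk $R(\theta) = 2$ as $n$ grows.

In [ ]:
## Minimal sketch (simulated data): R_n(theta) approaches R(theta) for a fixed theta
rng = np.random.default_rng(0)
theta = 2.0                                      # fixed (not fitted) slope parameter

def sq_loss(theta, x, y):
    """squared error loss for the predictor x -> theta * x"""
    return (y - theta*x)**2

for n_sim in [10, 100, 10000]:
    x = rng.normal(size=n_sim)
    y = 3.*x + rng.normal(size=n_sim)            # true slope is 3, so theta = 2 is misspecified
    print(n_sim, np.mean(sq_loss(theta, x, y)))  # approaches R(theta) = (3-2)^2 + 1 = 2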
A mortgage insurer pays the mortgage company if the insuree defaults on the loan. To determine how much to charge, we want to predict whether a borrower will default (1) or not (0).
An actuary (from the 19th century) says that people who are young (less than 30) are irresponsible, and he refuses to insure them. Let $x$ be the age in years, $y = 1$ if they default, and $\theta = 30$.
$$ g_\theta(x) = \left\{ \begin{array}{ll} 1, &x < \theta\\ 0, &x \ge \theta \end{array} \right. $$

The 0-1 loss is $$ \ell_{0-1}(\theta; X,Y) = \mathbf 1 \{g_\theta(X)\ne Y\}. $$ The risk is $$ R(\theta) = \mathbb E \mathbf 1 \{g_\theta(X)\ne Y\} = \mathbb P \{ g_\theta(X) \ne Y \}. $$ How well will he do?
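As a minimal sketch, the rule $g_\theta$ and its empirical 0-1 risk could be computed as below; the ages and default outcomes are made up purely for illustration.

In [ ]:
## Minimal sketch (made-up data): the actuary's threshold rule and its empirical 0-1 risk
def g(x, theta=30):
    """predict default (1) if age x is below the threshold theta, else 0"""
    return 1 if x < theta else 0

ages     = np.array([22, 27, 31, 45, 29, 60, 35, 24])  # hypothetical ages
defaults = np.array([ 1,  0,  0,  0,  1,  0,  1,  0])  # hypothetical outcomes
zero_one_losses = [g(x) != y for x, y in zip(ages, defaults)]
print("empirical 0-1 risk:", np.mean(zero_one_losses))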
We want to summarize/compress/learn the distribution of $X$. Clustering, for example, is the problem of assigning each datum to a cluster.
Image from https://rpubs.com/cyobero/k-means
Clustering, for example, is the problem of assigning each datum to a cluster in the index set $[C] = \{1,\ldots,C\}$ with cluster centers $z_k$, $$ \theta = \left\{ \textrm{cluster centers, } \{ z_k \}_{k=1}^C \subset \mathbb R^p, \textrm{ cluster assignments, } \sigma:[n] \to [C] \right\} $$ The loss is $$ \ell(\theta;x_i) = \| x_i - z_{\sigma(i)} \|^2 = \sum_{j=1}^p (x_{i,j} - z_{\sigma(i),j})^2. $$ Loss, risk, and empirical risk can still be defined, but many concepts are not the same (such as the bias-variance tradeoff).
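For concreteness, here is a minimal sketch of this loss; the points, cluster centers, and assignment map are invented purely for illustration.

In [ ]:
## Minimal sketch (made-up data): clustering loss for given centers and assignments
x = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])  # n = 4 points in R^2
z = np.array([[0., 0.5], [5.5, 5.]])                    # C = 2 cluster centers
sigma = np.array([0, 0, 1, 1])                          # cluster assignment for each point

losses = np.sum((x - z[sigma])**2, axis=1)  # ||x_i - z_{sigma(i)}||^2 for each i
print("per-point losses:", losses)
print("empirical risk:", losses.mean())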
Let $\hat \theta$ be the ERM; then the training error is $$ R_n(\hat \theta) = \min_{\theta \in \Theta} R_n(\theta), $$ which is NOT an unbiased estimate of the risk because $$ \mathbb E \min_\theta R_n(\theta) \ne \min_{\theta} \mathbb E R_n(\theta) = \min_\theta R(\theta). $$
Split the data randomly into training and test sets:
Because the test data are independent of $\hat \theta$, we can think of the training process as fixed, and the test error is then unbiased for the risk of $\hat \theta$.
In [54]:
## randomly shuffle data and split
n,p = X.shape
Ind = np.arange(n)
np.random.shuffle(Ind)
train_size = 2 * n // 3 +1 # set training set size
X_tr, X_te = X[Ind[:train_size],:], X[Ind[train_size:],:]
Y_tr, Y_te = Y[Ind[:train_size]], Y[Ind[train_size:]]
In [55]:
## compute losses on test set
betahat = fit(X_tr,Y_tr)
Y_hat_te = [predict(x,betahat) for x in X_te]
test_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_te,Y_te)]
## compute losses on train set
Y_hat_tr = [predict(x,betahat) for x in X_tr]
train_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_tr,Y_tr)]
In [56]:
train_losses
Out[56]:
In [57]:
test_losses
Out[57]:
In [58]:
print("train avg loss: {}\ntest avg loss: {}".format(np.mean(train_losses), np.mean(test_losses)))
print("n p :",n,p)
In [59]:
def train_test_split(X,Y,split_pr = 0.5):
"""train-test split"""
n,p = X.shape
Ind = np.arange(n)
np.random.shuffle(Ind)
train_size = int(split_pr * n) # set training set size
X_tr, X_te = X[Ind[:train_size],:], X[Ind[train_size:],:]
Y_tr, Y_te = Y[Ind[:train_size]], Y[Ind[train_size:]]
return (X_tr,Y_tr), (X_te, Y_te)
In [60]:
Y = wine_ar[:,-1]
X = wine_ar[:,:-1]
(X_tr,Y_tr), (X_te, Y_te) = train_test_split(X,Y)
In [61]:
## compute losses on test set
betahat = fit(X_tr,Y_tr)
Y_hat_te = [predict(x,betahat) for x in X_te]
test_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_te,Y_te)]
## compute losses on train set
Y_hat_tr = [predict(x,betahat) for x in X_tr]
train_losses = [loss(yhat,y) for yhat,y in zip(Y_hat_tr,Y_tr)]
In [62]:
print("train avg loss: {}\ntest avg loss: {}".format(np.mean(train_losses), np.mean(test_losses)))