The for loop that computes $$\sum_{i=1}^n i$$ runs in $O(n)$ time, while evaluating the closed form $$n (n + 1)/2$$ takes $O(1)$ time.
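A minimal sketch contrasting the two (function names are just illustrative):

## O(n): the loop visits every integer from 1 to n
def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

## O(1): a single closed-form evaluation
def sum_formula(n):
    return n * (n + 1) // 2

sum_loop(1000) == sum_formula(1000)  # True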
Given $x_{n+1}$, computing $$ \hat f(x_{n+1}) = x_{n+1}^\top \hat \beta = \hat \beta_0 + \sum_{j=1}^p \hat \beta_j x_{n+1,j} $$ takes $O(p)$ time.
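A minimal sketch of that prediction as a single pass over the $p$ coefficients (the numbers below are made up for illustration):

## O(p): one multiply-add per coefficient
beta0 = 0.1                     ## hypothetical intercept
beta = [0.5, -0.2, 0.3]         ## hypothetical coefficients, p = 3
x_new = [1.2, 0.7, 3.1]         ## new observation x_{n+1}
y_hat = beta0 + sum(b * xj for b, xj in zip(beta, x_new))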
In [119]:
## Nonlinear example
x = np.arange(.05,1,.01)
y = np.sin(1/x) + np.random.normal(0,.25,len(x))
plt.plot(x,y,'.')
Out[119]:
Given a metric $d$, the K-nearest neighbors of $x$ among $x_1,\ldots,x_n$ are the points $x_{j_1},\ldots,x_{j_K}$ obtained by sorting the distances, $$ d(x,x_{j_1}) \le d(x,x_{j_2}) \le \ldots \le d(x,x_{j_n}), $$ and keeping the first $K$.
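A from-scratch sketch of K-NN regression at a single query point (illustrative only; the next cell uses sklearn's KNeighborsRegressor):

import numpy as np

def knn_predict(x0, x_train, y_train, K=5):
    ## distances d(x0, x_i) -- absolute difference in one dimension
    dists = np.abs(x_train - x0)
    ## indices j_1, ..., j_K of the K closest training points
    nearest = np.argsort(dists)[:K]
    ## average the neighbors' responses
    return y_train[nearest].mean()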
In [130]:
## Fit K neighbors regression
knn = neighbors.KNeighborsRegressor(n_neighbors=5)
X = x.reshape(-1,1)
knn.fit(X,y)
y_pred = knn.predict(X)
In [131]:
## 5-nearest neighbors regression
plt.plot(x,y,'.')
plt.plot(x,y_pred)
Out[131]:
In [39]:
import pandas as pd
import numpy as np
from sklearn import model_selection, linear_model, neighbors, preprocessing, metrics
import matplotlib.pyplot as plt
import seaborn as sns
In [80]:
## Reading in data - using pandas!
datapath = "../../data/"
filename = datapath + 'winequality-red.csv'
wine = pd.read_csv(filename,delimiter=';')
In [81]:
wine.head()
Out[81]:
In [82]:
wine.describe()
Out[82]:
In [151]:
## Preparing training/test data
y = wine['quality'].values
X = wine.drop(['quality'],axis=1).values
test_size = .33
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(X,y,test_size=test_size)
In [152]:
X_tr.shape, X_te.shape # checking
Out[152]:
In [153]:
## TRAINING - DON'T TOUCH Y_te (better not to touch X_te)
## Preprocessing
stand = preprocessing.StandardScaler()
stand.fit(X_tr)
X_tr = stand.transform(X_tr)
In [154]:
## Fitting OLS and KNN
lr = linear_model.LinearRegression()
lr.fit(X_tr,y_tr)
knns = []
neighbors_K = np.arange(1,241,2)
M = len(neighbors_K)
for k in neighbors_K:
    knn = neighbors.KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_tr,y_tr)
    knns.append(knn)
models = [lr] + knns
In [155]:
SEL_COLS = [1,10] # volatile acidity, alcohol
class WinePredictor:
    """
    Custom predictor that selects variables 1 and 10 (volatile acidity, alcohol) and performs KNN
    """
    def __init__(self,n_neighbors=5):
        self.n_neighbors = n_neighbors
        self.knn = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors)
    def fit(self, X_tr, y_tr):
        X_sub = X_tr[:,SEL_COLS]
        self.knn.fit(X_sub,y_tr)
    def predict(self,X_te):
        X_sub = X_te[:,SEL_COLS]
        return self.knn.predict(X_sub)
In [156]:
## Fitting wine predictor
wine_knns = []
for k in neighbors_K:
    knn = WinePredictor(n_neighbors=k)
    knn.fit(X_tr,y_tr)
    wine_knns.append(knn)
models = models + wine_knns
In [157]:
## TESTING - DON'T TOUCH MODELS
## transformations
X_te = stand.transform(X_te)
## prediction
MSEs = []
for m in models:
    y_pred = m.predict(X_te)
    MSEs.append(metrics.mean_squared_error(y_te,y_pred))
In [158]:
plt.plot(neighbors_K,MSEs[1:M+1],label='knn')
plt.plot(neighbors_K,MSEs[M+1:],label='wine pred')
plt.hlines(MSEs[0],0,neighbors_K[-1],label='OLS')
plt.xlabel('k (neighbors)')
plt.ylabel('MSE')
plt.legend()
Out[158]:
Suppose that $\hat{f}$ is random because it is computed from the (random) training data.
\begin{align*} \mathbb{E}(\hat{f}(X)-\eta(X))^2 &= \mathbb{E}(\hat{f}(X)-\mathbb{E}[\hat{f}(X)|X])^2+\mathbb{E}(\mathbb{E}[\hat{f}(X)|X]-\eta(X))^2+ (\text{cross-term}=0)\\ &= \mathbb{E}[\text{Var}(\hat{f}(X)|X)]+\mathbb{E}[\text{Bias}^2(\hat{f}(X)|X)], \end{align*} where $\text{Bias}(\hat{f}(X)|X) = \mathbb{E}[\hat{f}(X)|X] -\eta(X)$.
$$ R(\hat{f}) = R^* + \mathbb{E}[\text{Bias}^2(\hat{f}(X)|X)+\text{Var}(\hat{f}(X)|X)]. $$
Risk is irreducible error + squared bias + variance.
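To see why the cross-term above vanishes, condition on $X$: the factor $\mathbb{E}[\hat{f}(X)|X]-\eta(X)$ is then fixed, and the remaining factor has conditional mean zero:
\begin{align*} \mathbb{E}\big[(\hat{f}(X)-\mathbb{E}[\hat{f}(X)|X])(\mathbb{E}[\hat{f}(X)|X]-\eta(X))\big] &= \mathbb{E}\Big[(\mathbb{E}[\hat{f}(X)|X]-\eta(X))\,\mathbb{E}\big[\hat{f}(X)-\mathbb{E}[\hat{f}(X)|X]\,\big|\,X\big]\Big]\\ &= \mathbb{E}\big[(\mathbb{E}[\hat{f}(X)|X]-\eta(X))\cdot 0\big] = 0. \end{align*}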
In [167]:
## Fit K neighbors regression
x = np.arange(.05,1,.01)
y = np.sin(1/x) + np.random.normal(0,.25,len(x))
knn = neighbors.KNeighborsRegressor(n_neighbors=2)
X = x.reshape(-1,1)
knn.fit(X,y)
y_pred = knn.predict(X)
In [168]:
## 2-nearest neighbors regression
plt.plot(x,y,'.')
plt.plot(x,y_pred)
Out[168]:
In [169]:
knn = neighbors.KNeighborsRegressor(n_neighbors=21)
X = x.reshape(-1,1)
knn.fit(X,y)
y_pred = knn.predict(X)
In [170]:
## 21-nearest neighbors regression
plt.plot(x,y,'.')
plt.plot(x,y_pred)
Out[170]:
In [180]:
plt.plot(neighbors_K,MSEs[1:M+1],label='knn')
plt.xlabel(r"""k (neighbors)
low bias $\rightarrow$ high bias
high var $\rightarrow$ low var""")
plt.ylabel('MSE')
Out[180]: