We can truncate the new set of vectors to exclude components with small eigenvalues (features/dimensions that contribute little variance)
Principal component regression:
PCR has an advantage over ridge regression when the data contain many collinear (highly correlated) variables, i.e. high covariance between features
In this case the regression coefficients $\theta_{i}$ have high variance and their least-squares solutions can become unstable; regressing on the truncated set of principal components removes the nearly degenerate directions, as sketched below
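As a concrete sketch of this truncation step (not part of the original notebook), here is a minimal principal component regression in numpy: project the centred design matrix onto its leading components and regress on those scores only. The function name pcr_fit and the toy data are illustrative.
In [ ]:
import numpy as np

def pcr_fit(X, y, n_components):
    """Illustrative PCR: regress y on the leading principal components of X."""
    Xc = X - X.mean(axis=0)                      # centre the features
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                      # keep directions with the largest variance
    Z = Xc @ V                                   # scores: data in the truncated PC basis
    theta_z, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    return V @ theta_z                           # coefficients mapped back to the original features

# Toy example with two nearly collinear features (cf. the demo below)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.05*rng.normal(size=500)              # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.3*rng.normal(size=500)
print(pcr_fit(X, y, n_components=1))             # stable coefficients close to [1, 1]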
In [1]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import random as ra
from scipy.stats import multivariate_normal
import scipy.interpolate
In [2]:
for covar in [0, 0.95]:
    # Generate a bivariate normal distribution with correlation `covar`
    var = multivariate_normal(mean=[0, 0], cov=[[1, covar], [covar, 1]])
    # Randomly draw (x1, x2) values from the pdf by rejection sampling
    x1 = []
    x2 = []
    for i in range(30000):
        x1rand = 10.*ra.random() - 5
        x2rand = 10.*ra.random() - 5
        norm_chance = var.pdf([x1rand, x2rand])
        const_chance = ra.random()
        if const_chance <= norm_chance:
            x1.append(x1rand)
            x2.append(x2rand)
    # Plot the (x1, x2; y) data
    alpha = 1
    beta = 0.3
    l = 0.2
    y = [alpha*x1[i] + alpha*x2[i] + beta*(ra.random() - .5) for i in range(len(x1))]
    plt.scatter(x1, x2, c=y, cmap='RdYlBu')
    plt.colorbar()
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.show()
    # Generate the cost on a grid of (theta1, theta2)
    # (a coarse 50x50 grid keeps the RBF interpolation below tractable)
    th1, th2 = np.linspace(0, 2, 50), np.linspace(0, 2, 50)
    cost = np.zeros((len(th1), len(th2)))
    for i in range(len(th1)):
        for j in range(len(th2)):
            summ = 0
            for k in range(len(x1)):
                summ += (th1[i]*x1[k] + th2[j]*x2[k] - y[k])**2.0
            # ridge penalty added to the summed squared residuals
            cost[i, j] = summ + l*(th1[i]**2.0 + th2[j]**2.0)
    th1, th2 = np.meshgrid(th1, th2)
    # Interpolate
    rbf = scipy.interpolate.Rbf(th1, th2, cost, function='linear')
    cost_inter = rbf(th1, th2)
    plt.imshow(cost_inter, vmin=cost_inter.min(), vmax=cost_inter.max(),
               origin='lower', extent=[0, 2, 0, 2])
    plt.colorbar()
    plt.xlabel('Theta 1')
    plt.ylabel('Theta 2')
    plt.title('Ridge regression cost')
    plt.show()
Define a kernel $K\left(x_{i}, x\right)$ local to each data point, with the amplitude of the kernel depending on the distance from that point to all other points in the sample
e.g. top-hat or Gaussian kernel
Nadaraya-Watson estimate of the regression function: $$f\left(x|K\right)=\frac{\sum_{i=1}^{N}K\left(\frac{||x_{i}-x||}{h}\right)y_{i}}{\sum_{i=1}^{N}K\left(\frac{||x_{i}-x||}{h}\right)}$$
i.e., the predicted value of the function at $x$ is a weighted average of the y-values of all the points, with the individual weights given by the value of the kernel evaluated at each point's distance from $x$
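A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (illustrative only; nw_estimate and the toy data are not from the original notebook):
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

def nw_estimate(x_eval, x_data, y_data, h):
    """Nadaraya-Watson estimate at points x_eval with a Gaussian kernel of bandwidth h."""
    d = np.abs(x_eval[:, None] - x_data[None, :])   # pairwise distances ||x_i - x||
    K = np.exp(-0.5 * (d / h)**2)                   # Gaussian kernel weights
    return (K @ y_data) / K.sum(axis=1)             # weighted average of the y-values

# Noisy 1D example
rng = np.random.default_rng(1)
x_data = np.sort(rng.uniform(0, 10, 200))
y_data = np.sin(x_data) + 0.3*rng.normal(size=200)
x_grid = np.linspace(0, 10, 500)
y_fit = nw_estimate(x_grid, x_data, y_data, h=0.5)
plt.plot(x_data, y_data, '.', alpha=0.3)
plt.plot(x_grid, y_fit)
plt.show()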
Rule of thumb: the choice of bandwidth $h$ matters more than the shape of the kernel used
Optimal bandwidth can be found by minimizing the cost with respect to the bandwidth on a cross-validation set $$CV_{L_{2}}\left(h\right)=\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-f\left(x_{i}|K\left(\frac{||x_{i}-x_{j}||}{h}\right)\right)\right)^{2}$$
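A hedged sketch of the bandwidth scan implied by the formula above, reusing nw_estimate and the toy x_data, y_data from the previous sketch; the even/odd split into training and cross-validation sets is an arbitrary illustrative choice.
In [ ]:
# Scan a range of bandwidths and pick the one minimizing the CV cost
x_train, y_train = x_data[::2], y_data[::2]      # even-indexed points: training set
x_cv, y_cv = x_data[1::2], y_data[1::2]          # odd-indexed points: cross-validation set
bandwidths = np.linspace(0.05, 2.0, 40)
cv_cost = [np.mean((y_cv - nw_estimate(x_cv, x_train, y_train, h))**2)
           for h in bandwidths]
h_opt = bandwidths[np.argmin(cv_cost)]
plt.plot(bandwidths, cv_cost)
plt.axvline(h_opt, ls='--')
plt.xlabel('h')
plt.ylabel(r'$CV_{L_2}(h)$')
plt.show()
print('optimal bandwidth:', h_opt)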