Assume that synaptic density (synapses/unmasked), $Y$, follows some conditional distribution $F_{Y \mid X}$, where $X$ is the set of data points; each $x_i \in X$ is a vector in $\mathbb{R}^3$ whose elements correspond, respectively, to the x, y, z coordinates given by the data.
Let the true density values form the set $Y$, and let the distribution be parameterized by $\theta$, so that for each $x_i \in X$ and $y_i \in Y$, $F(x_i;\theta)=y_i$.
We want to find parameters $\hat \theta$ that minimize a loss function $l(\hat y, y)$, where $\hat y = F(x;\hat \theta)$.
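As a quick illustration of this setup (a minimal sketch, assuming a squared-error loss $l(\hat y, y) = (\hat y - y)^2$, a linear form for $F$, and made-up data, none of which are fixed by the analysis below):
In [ ]:
# Minimal illustrative sketch: theta_hat minimizing summed squared-error loss
# for a linear F(x; theta) = x . theta, via ordinary least squares on toy data.
import numpy as np
rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(100, 3))                 # 100 toy points in R^3
theta_true = np.array([1.0, -2.0, 0.5])            # illustrative parameters
y_toy = X_toy.dot(theta_true) + rng.normal(scale=0.1, size=100)
theta_hat, _, _, _ = np.linalg.lstsq(X_toy, y_toy)  # argmin of squared-error loss
y_hat = X_toy.dot(theta_hat)
print theta_hat, np.mean((y_hat - y_toy) ** 2)      # estimated theta and mean loss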
In [1]:
from __future__ import division
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import urllib2
np.random.seed(1)
url = ('https://raw.githubusercontent.com/Upward-Spiral-Science'
'/data/master/syn-density/output.csv')
data = urllib2.urlopen(url)
csv = np.genfromtxt(data, delimiter=",")[1:] # don't want first row (labels)
# chopping data based on thresholds on x and y coordinates
x_bounds = (409, 3529)
y_bounds = (1564, 3124)
def check_in_bounds(row, x_bounds, y_bounds):
    if row[0] < x_bounds[0] or row[0] > x_bounds[1]:
        return False
    if row[1] < y_bounds[0] or row[1] > y_bounds[1]:
        return False
    if row[3] == 0:  # drop rows with unmasked == 0 to avoid division by zero
        return False
    return True

indices_in_bound, = np.where(np.apply_along_axis(check_in_bounds, 1, csv,
                                                 x_bounds, y_bounds))
data_thresholded = csv[indices_in_bound]
n = data_thresholded.shape[0]
data = data_thresholded
# density = synapses/unmasked; store it in place of unmasked, then drop the raw synapse column
data[:, -2] = data[:, -1]/data[:, -2]
data = data[:, :-1]
print data[:, -1]
# standardize density (zero mean, unit variance) so the values aren't tiny
data[:, -1] -= np.average(data[:, -1])
data[:, -1] /= np.std(data[:, -1])
print data[:, -1]
In [2]:
mins = [np.min(csv[:,i]) for i in xrange(4)]
maxs = [np.max(csv[:,i]) for i in xrange(4)]
domains = zip(mins, maxs)
# sample sizes
S = np.logspace(2.0, 4.0, num=20, base=10.0, dtype='int')
# under the null model, Y is sampled independently of X
null_X = np.array([[np.random.randint(*domains[i]) for i in xrange(3)]
                   for k in xrange(S[-1])])
null_Y = np.random.uniform(*domains[-1], size=S[-1])
print null_X.shape, null_Y.shape
In [13]:
# load our regressions
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor as KNN
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.preprocessing import PolynomialFeatures as PF
from sklearn.pipeline import Pipeline
from sklearn import cross_validation
names = ['Linear Regression', 'SVR', 'KNN Regression',
         'Random Forest Regression', 'Polynomial Regression']
regressions = [LinearRegression(),
               LinearSVR(C=1.0),
               KNN(n_neighbors=10, algorithm='auto'),
               RF(max_depth=5, max_features=1),
               Pipeline([('poly', PF(degree=2)),
                         ('linear', LinearRegression(fit_intercept=False))])]
In [3]:
r2 = np.zeros((len(S), len(regressions), 2), dtype=np.dtype('float64'))
#iterate over sample sizes and regression algos
for i, N in enumerate(S):
    # randomly sample from the synthetic data with sample size N
    a = np.random.permutation(np.arange(S[-1]))[:N]
    X = null_X[a]
    Y = null_Y[a]
    Y = np.ravel(Y)
    print "Sample size = ", N
    for k, reg in enumerate(regressions):
        scores = cross_validation.cross_val_score(reg, X, Y, scoring='r2', cv=10)
        r2[i, k, :] = [scores.mean(), scores.std()]
        print("R^2 of %s: %0.2f (+/- %0.2f)" % (names[k], scores.mean(), scores.std() * 2))
Now graphing this data:
In [4]:
plt.errorbar(S, r2[:,0,0], yerr = r2[:,0,1], hold=True, label=names[0])
plt.errorbar(S, r2[:,1,0], yerr = r2[:,1,1], color='green', hold=True, label=names[1])
plt.errorbar(S, r2[:,2,0], yerr = r2[:,2,1], color='red', hold=True, label=names[2])
plt.errorbar(S, r2[:,3,0], yerr = r2[:,3,1], color='black', hold=True, label=names[3])
plt.errorbar(S, r2[:,4,0], yerr = r2[:,4,1], color='brown', hold=True, label=names[4])
plt.xscale('log')
plt.axhline(1, color='red', linestyle='--')
plt.xlabel('Sample size')
plt.ylabel('R^2 Score')
plt.title('Regression results on simulated data under the null')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
Here we assume a conditional dependence. Keep x, y, z uniformly distributed across the sample space, but let density $y_i$ be the sum of a deterministic function $f:\mathbb{R}^3 \rightarrow \mathbb{R}$ evaluated at $x_i$ and Gaussian noise $\epsilon_i$ with mean 0 and a small standard deviation. Let $f(x,y,z)=(x+y+z)/3$, standardized to zero mean and unit variance over the sample, and let the standard deviation of $\epsilon$ be .01, so that $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, .01^2)$.
In [5]:
# X under the alt same as under the null
alt_X = null_X
f_X = np.apply_along_axis(lambda r: reduce(lambda x,y:x+y, r)/3, 1, alt_X)
f_X -= np.average(f_X)
f_X /= np.std(f_X)
alt_Y = np.random.normal(0, .01, size=f_X.shape)+f_X
print alt_Y.shape
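Since f_X is standardized to unit variance and the noise standard deviation is only .01, the best achievable $R^2$ under this model should be roughly $1 - \frac{.01^2}{1 + .01^2} \approx 0.9999$. A quick sanity check on the simulated data (assuming the construction above):
In [ ]:
# Illustrative check: oracle R^2 = 1 - Var(eps)/Var(Y), using the true f_X as the prediction;
# with unit-variance signal and noise std .01 this is essentially 1.
print 1 - np.var(alt_Y - f_X)/np.var(alt_Y)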
In [6]:
r2 = np.zeros((len(S), len(regressions), 2), dtype=np.dtype('float64'))
#iterate over sample sizes and regression algos
for i, N in enumerate(S):
    # randomly sample from the synthetic data with sample size N
    a = np.random.permutation(np.arange(S[-1]))[:N]
    X = alt_X[a]
    Y = alt_Y[a]
    Y = np.ravel(Y)
    print "Sample size = ", N
    for k, reg in enumerate(regressions):
        scores = cross_validation.cross_val_score(reg, X, Y, scoring='r2', cv=10)
        r2[i, k, :] = [scores.mean(), scores.std()]
        print("R^2 of %s: %0.2f (+/- %0.2f)" % (names[k], scores.mean(), scores.std() * 2))
Now graphing it:
In [7]:
plt.errorbar(S, r2[:,0,0], yerr = r2[:,0,1], hold=True, label=names[0])
plt.errorbar(S, r2[:,1,0], yerr = r2[:,1,1], color='green', hold=True, label=names[1])
plt.errorbar(S, r2[:,2,0], yerr = r2[:,2,1], color='red', hold=True, label=names[2])
plt.errorbar(S, r2[:,3,0], yerr = r2[:,3,1], color='black', hold=True, label=names[3])
plt.errorbar(S, r2[:,4,0], yerr = r2[:,4,1], color='brown', hold=True, label=names[4])
plt.xscale('log')
plt.axhline(1, color='red', linestyle='--')
plt.xlabel('Sample size')
plt.ylabel('R^2 Score')
plt.title('Regression results on simulated data under the alternate')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
In [8]:
X = data[:, (0, 1, 2)]
Y = data[:, -1]
for i, reg in enumerate(regressions):
    scores = cross_validation.cross_val_score(reg, X, Y, scoring='r2', cv=10)
    print("R^2 of %s: %0.2f (+/- %0.2f)" % (names[i], scores.mean(), scores.std() * 2))
In [9]:
n_neighbors = np.arange(1, 50)
r2 = []
for n in n_neighbors:
    reg = KNN(n_neighbors=n, algorithm='auto')
    scores = cross_validation.cross_val_score(reg, X, Y, scoring='r2', cv=10)
    r2.append(np.array([scores.mean(), scores.std()]))
r2 = np.array(r2)
plt.errorbar(n_neighbors, r2[:,0], yerr = r2[:,1])
plt.title("Number of neighbors against R^2 for KNN Regression")
plt.xlabel("number of neighbors")
plt.ylabel("R^2")
plt.show()
print "mean r^2 maximized at: ", np.argmax(r2[:,0])+1
print "variance minimized at: ", np.argmin(r2[:,1])+1
In [10]:
depth = np.arange(1, 20)
r2 = []
for d in depth:
    reg = RF(max_depth=d, max_features=1)
    scores = cross_validation.cross_val_score(reg, X, Y, scoring='r2', cv=10)
    r2.append(np.array([scores.mean(), scores.std()]))
r2 = np.array(r2)
plt.errorbar(depth, r2[:,0], yerr = r2[:,1])
plt.title("Max depth against R^2 for RandomForestRegression")
plt.xlabel("Max depth")
plt.ylabel("R^2")
plt.show()
print "mean r^2 maximized at: ", np.argmax(r2[:,0])+1
print "variance minimized at: ", np.argmin(r2[:,1])+1
In [11]:
features = np.arange(1, 4)
r2 = []
for f in features:
    reg = RF(max_depth=10, max_features=f)
    scores = cross_validation.cross_val_score(reg, X, Y, scoring='r2', cv=10)
    r2.append(np.array([scores.mean(), scores.std()]))
    print("R^2 of RF with max_features=%d: %0.2f (+/- %0.2f)" % (f, scores.mean(), scores.std() * 2))
r2 = np.array(r2)
In [12]:
# boost number of neighbors for KNN and max depth for random forest
regressions = [KNN(n_neighbors=29, algorithm='auto'),
               RF(max_depth=10, max_features=1)]
names = ['KNN Regression', 'Random Forest Regression']
for i, reg in enumerate(regressions):
    scores = cross_validation.cross_val_score(reg, X, Y, scoring='r2', cv=10)
    print("R^2 of %s: %0.2f (+/- %0.2f)" % (names[i], scores.mean(), scores.std() * 2))