Classification and Regression

By: Adam Li

  1. State Assumptions
  2. Formally define a classification and regression problem
  3. Provide an algorithm to solve classification/regression
  4. Sample simulated data
  5. Compute accuracy
  6. Plot accuracy vs. N

Step 1: State Assumptions

$D = \{d_1, ..., d_{96}\}$ are our data columns, representing features f0 through f3. We assume all variables/metrics come from a continuous distribution.

In addition, $L = \{x, y, z\}$, our location data, is also continuous.

Step 2: Formally Define Classification/Regression Problem

Since our data is unlabeled, we run simple regressions on each column of data to see whether that column can be predicted from the other 95 columns.

$$H_0: d_i \text{ and } d_j \text{ are independent} \ \ \forall \ i \neq j, \quad d_i, d_j \in D$$

$$H_a: \exists \ i \neq j \ \text{ s.t. } d_i \text{ depends on } d_j$$

Here, $D$ is our data matrix and $i$, $j$ index its columns.
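A minimal sketch of what this dependence test looks like in practice: hold out one column, regress it on the rest, and read the in-sample $R^2$ as a dependence score. The toy data, column count, and threshold below are illustrative, not the project's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# toy data: column 0 depends on column 1; the remaining columns are independent noise
D = rng.normal(size=(500, 5))
D[:, 0] = 2.0 * D[:, 1] + rng.normal(scale=0.1, size=500)

def dependence_score(D, i):
    """R^2 of predicting column i from all other columns.

    Near 0 supports H0 (independence); near 1 supports Ha (dependence)."""
    others = [j for j in range(D.shape[1]) if j != i]
    model = LinearRegression().fit(D[:, others], D[:, i])
    return model.score(D[:, others], D[:, i])

print(dependence_score(D, 0))  # high: column 0 is predictable from column 1
print(dependence_score(D, 2))  # near zero: column 2 is independent of the rest
```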

Step 3: Algorithm For Solving This Regression Problem

Candidate regressors: linear regression, support vector regression, k-nearest-neighbor regression, random forest regression, polynomial regression.

Our regression model:

Class Notes

Bayes optimal classifier:

$$g^* = \arg\min_g E[\text{loss}(g(x), y)]$$

$$g^*(x) = \arg\max_y F(X = x \mid Y = y) \, F(Y = y)$$

Approaches covered: leave-one-out analysis, LDA, SVM, logistic regression.

Linear Discriminant Analysis

LDA assumes $F(x \mid y) = N(\mu_y, \sigma^2)$ and $F(y) = \mathrm{Bernoulli}(\pi)$, with the variance assumed equal across classes.

  • very interpretable
  • fast
  • linear decision boundary

Quadratic Discriminant Analysis

Under similar assumptions, except the variances are allowed to differ across classes.

  • interpretable
  • fast

Both LDA and QDA converge to the Bayes classifier under their respective assumptions.

K-Nearest Neighbors

  • less interpretable than LDA/QDA
  • slower
  • consistent
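A sketch of the LDA/QDA trade-off on simulated Gaussian classes with unequal covariances (where QDA's assumption holds and LDA's does not). This uses the modern `sklearn.discriminant_analysis` import path, which the notebook only references in commented-out imports, so treat the exact API as an assumption about the scikit-learn version.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.RandomState(0)

# two Gaussian classes with *different* covariances, so QDA's assumption holds
n = 500
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], n)
X1 = rng.multivariate_normal([2, 2], [[3.0, 0.0], [0.0, 0.5]], n)
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# with unequal class covariances, QDA's quadratic boundary can track the
# true class densities more closely than LDA's shared-covariance linear one
print(lda.score(X, y), qda.score(X, y))
```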

In [1]:
# Import Necessary Libraries
import numpy as np
import os, csv

from matplotlib import pyplot as plt

import scipy

# Regression
from sklearn import cross_validation
from sklearn.cross_validation import LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn import datasets

# Classification
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.svm import SVC
# from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# pretty charting
import seaborn as sns
sns.set_palette('muted')
sns.set_style('darkgrid')

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Steps 4 & 5: Sample data from a setting similar to our data and record regression error


In [2]:
np.random.seed(12345678)  # for reproducibility, set random seed
r = 20  # define number of rois
N = 100 # number of samples at each iteration
p0 = 0.10
p1 = 0.15
# define number of subjects per class
S = np.array((8, 16, 20, 32, 40, 64, 80, 100, 120, 200, 320,
              400, 600))
S = np.array((200,300))

names = ["Linear Regression", 
         "Support Vector Regression", 
         "Nearest Neighbors", 
         "Random Forest"] #, "Polynomial Regression"]

regressors = [
    LinearRegression(),
    SVR(kernel="linear", C=0.5, epsilon=0.01),
    KNeighborsRegressor(6, weights="distance"),
    RandomForestRegressor(max_depth=5, n_estimators=10, max_features=1)
    ]

In [27]:
errors = np.zeros((len(S), len(regressors), 2), dtype=np.dtype('float64'))

# sample data accordingly for each # of simulations
for idx1, s in enumerate(S):
    # simulate a linear relationship y = a + b*x + epsilon
    X, y, coef = datasets.make_regression(n_samples=s, n_features=1,
                                          n_informative=1, noise=10,
                                          coef=True, random_state=0)

    # reshape so the regressors receive a 2-D design matrix
    X = X.reshape(-1, 1)
    
    for idx2, regr in enumerate(regressors):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
    
        # Train the model using the training sets
        reg = regr.fit(X_train, y_train)
        
        # leave one out analysis
#         loo = LeaveOneOut(len(X))
        loo = cross_validation.KFold(n=len(X), n_folds=10, shuffle=False, random_state=None)

        # compute scores for running this regressor
        scores = cross_validation.cross_val_score(reg, X, y, scoring='mean_squared_error', cv=loo)

        errors[idx1, idx2,] = [scores.mean(), scores.std()]
        print("MSE of %s: %f (+/- %0.5f)" % (names[idx2], scores.mean(), scores.std() * 2))
# print accuracy


MSE of Linear Regression: -98.030525 (+/- 67.16011)
MSE of Support Vector Regression: -818.494604 (+/- 663.41018)
MSE of Nearest Neighbors: -173.809540 (+/- 367.25922)
MSE of Random Forest: -152.351897 (+/- 245.77380)
MSE of Linear Regression: -97.929016 (+/- 30.06343)
MSE of Support Vector Regression: -137.314648 (+/- 56.80810)
MSE of Nearest Neighbors: -139.766465 (+/- 61.11190)
MSE of Random Forest: -122.361538 (+/- 43.24464)
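Note that `sklearn.cross_validation` was deprecated in scikit-learn 0.18 and later removed; under newer versions the K-fold MSE scoring above would look roughly like the sketch below, where the `neg_mean_squared_error` scorer replaces `mean_squared_error` and is likewise sign-flipped (hence the negative values printed above).

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# same style of simulated linear data as in the cell above
X, y = datasets.make_regression(n_samples=200, n_features=1,
                                n_informative=1, noise=10, random_state=0)

kf = KFold(n_splits=10, shuffle=False)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=kf)

# scores are negated MSEs, so they are <= 0, matching the output above
print(scores.mean(), scores.std())
```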

In [28]:
print errors
print errors.shape


[[[ -98.0305249    33.58005539]
  [-818.4946042   331.70509207]
  [-173.80954049  183.62961076]
  [-152.35189737  122.88689903]]

 [[ -97.92901607   15.03171328]
  [-137.31464786   28.40405126]
  [-139.76646452   30.55595091]
  [-122.36153778   21.6223224 ]]]
(2, 4, 2)

STEP 6: PLOTTING MSE VS. N FOR EACH REGRESSOR


In [30]:
plt.errorbar(S, errors[:,0,0], yerr=errors[:,0,1], label=names[0])
plt.errorbar(S, errors[:,1,0], yerr=errors[:,1,1], color='green', label=names[1])
plt.errorbar(S, errors[:,2,0], yerr=errors[:,2,1], color='red', label=names[2])
plt.errorbar(S, errors[:,3,0], yerr=errors[:,3,1], color='black', label=names[3])
plt.xscale('log')
plt.xlabel('number of samples')
plt.ylabel('MSE')
plt.title('MSE of Regressions under simulated data')
plt.axhline(1, color='red', linestyle='--')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()



In [ ]:
# null regression y = a + b*x + epsilon
x = np.random.normal(0, 1, s)
epsilon = np.random.normal(0, 0.05, s)
a = np.random.rand(s,) 
b = np.random.rand(s,)
y = a + b*x + epsilon
y = np.reshape(y, (s,1))
X = np.reshape(x, (s,1))

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)

# Train the model using the training sets
# reg = SVR(kernel="linear", C=1.).fit(X_train, y_train)
reg = KNeighborsRegressor(5).fit(X_train, y_train)

print reg
print X.shape
print epsilon.shape
print y.shape

STEP 7: APPLYING REGRESSIONS TO COLUMNS OF FEATURES


In [3]:
#### RUN AT BEGINNING AND TRY NOT TO RUN AGAIN - TAKES WAY TOO LONG ####
csvfile = "data_normalized/shortenedFeatures_normalized.txt"

# load in the feature data
list_of_features = []
with open(csvfile) as f:
    for line in f:
        inner_list = [float(elt.strip()) for elt in line.split(',')]
        
        # create list of features
        list_of_features.append(inner_list)

# convert to a numpy matrix
list_of_features = np.array(list_of_features)
print list_of_features.shape


(1119299, 96)
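The row-by-row parse above can also be done in a single call with NumPy. A small self-contained sketch (the in-memory buffer stands in for the feature file, and `np.loadtxt` assumes a clean comma-delimited numeric file with no header):

```python
import io
import numpy as np

# stand-in for the feature file: comma-separated floats, one row per line
buf = io.StringIO(u"1.0,2.0,3.0\n4.0,5.0,6.0\n")

features = np.loadtxt(buf, delimiter=',')
print(features.shape)  # (2, 3)
```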

In [4]:
sub_features = list_of_features[:,1:]
print sub_features.shape

# randomly select 10,000 rows of list_of_features
X = list_of_features[np.random.choice(range(list_of_features.shape[0]), size=10000, replace=False), :]
print X.shape


(1119299, 95)
(10000, 96)

In [5]:
## Run regressions on each column of the data
errors = np.zeros((len(regressors), 2)) # mean and std of MSE for each regressor

# y = list_of_features[:,0]
# sub_features = list_of_features[:,1:]
num_cols = list_of_features.shape[1]

errors_cols = {}
for i in range(0, num_cols):
    y = X[:,i]
    
    indices = [p for p in range(num_cols) if p != i]
    sub_features = X[:,indices]

    for idx, regr in enumerate(regressors):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(sub_features, y, test_size=0.4, random_state=0)

        # create regression and fit
        reg = regr.fit(X_train, y_train)

        # leave one out & compute cross-validation scores with MSE
        loo = cross_validation.KFold(n=len(sub_features), n_folds=10, shuffle=False, random_state=None)
    #     loo = LeaveOneOut(len(sub_features))
        scores = cross_validation.cross_val_score(reg, sub_features, y, scoring='mean_squared_error', cv=loo)

        # get error scores and print 
        errors[idx,] = [scores.mean(), scores.std()]
        print("MSE's of %s: %f (+/- %0.2f)" % (names[idx], scores.mean(), scores.std())) 
    
    errors_cols[str(i)] = errors.copy()  # copy so each column keeps its own scores
# # print accuracy


MSE's of Linear Regression: -0.000097 (+/- 0.00)
MSE's of Support Vector Regression: -0.000098 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001490 (+/- 0.00)
MSE's of Random Forest: -0.002404 (+/- 0.00)
MSE's of Linear Regression: -0.000066 (+/- 0.00)
MSE's of Support Vector Regression: -0.000068 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001049 (+/- 0.00)
MSE's of Random Forest: -0.001569 (+/- 0.00)
MSE's of Linear Regression: -0.000175 (+/- 0.00)
MSE's of Support Vector Regression: -0.000176 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001996 (+/- 0.00)
MSE's of Random Forest: -0.003663 (+/- 0.00)
MSE's of Linear Regression: -0.000330 (+/- 0.00)
MSE's of Support Vector Regression: -0.000332 (+/- 0.00)
MSE's of Nearest Neighbors: -0.004153 (+/- 0.00)
MSE's of Random Forest: -0.006514 (+/- 0.00)
MSE's of Linear Regression: -0.002056 (+/- 0.00)
MSE's of Support Vector Regression: -0.002212 (+/- 0.00)
MSE's of Nearest Neighbors: -0.005663 (+/- 0.00)
MSE's of Random Forest: -0.006944 (+/- 0.00)
MSE's of Linear Regression: -0.087525 (+/- 0.00)
MSE's of Support Vector Regression: -0.089671 (+/- 0.00)
MSE's of Nearest Neighbors: -0.145185 (+/- 0.00)
MSE's of Random Forest: -0.150689 (+/- 0.01)
MSE's of Linear Regression: -0.000103 (+/- 0.00)
MSE's of Support Vector Regression: -0.000104 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001406 (+/- 0.00)
MSE's of Random Forest: -0.002117 (+/- 0.00)
MSE's of Linear Regression: -0.000102 (+/- 0.00)
MSE's of Support Vector Regression: -0.000103 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001434 (+/- 0.00)
MSE's of Random Forest: -0.002028 (+/- 0.00)
MSE's of Linear Regression: -0.000153 (+/- 0.00)
MSE's of Support Vector Regression: -0.000153 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001521 (+/- 0.00)
MSE's of Random Forest: -0.002300 (+/- 0.00)
MSE's of Linear Regression: -0.000263 (+/- 0.00)
MSE's of Support Vector Regression: -0.000264 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002588 (+/- 0.00)
MSE's of Random Forest: -0.003822 (+/- 0.00)
MSE's of Linear Regression: -0.001427 (+/- 0.00)
MSE's of Support Vector Regression: -0.001464 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003665 (+/- 0.00)
MSE's of Random Forest: -0.004527 (+/- 0.00)
MSE's of Linear Regression: -0.042295 (+/- 0.00)
MSE's of Support Vector Regression: -0.042717 (+/- 0.00)
MSE's of Nearest Neighbors: -0.070754 (+/- 0.00)
MSE's of Random Forest: -0.069627 (+/- 0.00)
MSE's of Linear Regression: -0.000151 (+/- 0.00)
MSE's of Support Vector Regression: -0.000163 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002472 (+/- 0.00)
MSE's of Random Forest: -0.003973 (+/- 0.00)
MSE's of Linear Regression: -0.000128 (+/- 0.00)
MSE's of Support Vector Regression: -0.000134 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002251 (+/- 0.00)
MSE's of Random Forest: -0.003356 (+/- 0.00)
MSE's of Linear Regression: -0.000121 (+/- 0.00)
MSE's of Support Vector Regression: -0.000131 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001295 (+/- 0.00)
MSE's of Random Forest: -0.002046 (+/- 0.00)
MSE's of Linear Regression: -0.000198 (+/- 0.00)
MSE's of Support Vector Regression: -0.000203 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002217 (+/- 0.00)
MSE's of Random Forest: -0.003462 (+/- 0.00)
MSE's of Linear Regression: -0.002008 (+/- 0.00)
MSE's of Support Vector Regression: -0.002042 (+/- 0.00)
MSE's of Nearest Neighbors: -0.004993 (+/- 0.00)
MSE's of Random Forest: -0.006229 (+/- 0.00)
MSE's of Linear Regression: -0.040804 (+/- 0.00)
MSE's of Support Vector Regression: -0.041811 (+/- 0.00)
MSE's of Nearest Neighbors: -0.063524 (+/- 0.00)
MSE's of Random Forest: -0.064333 (+/- 0.00)
MSE's of Linear Regression: -0.000162 (+/- 0.00)
MSE's of Support Vector Regression: -0.000163 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002157 (+/- 0.00)
MSE's of Random Forest: -0.003321 (+/- 0.00)
MSE's of Linear Regression: -0.000149 (+/- 0.00)
MSE's of Support Vector Regression: -0.000150 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002007 (+/- 0.00)
MSE's of Random Forest: -0.002848 (+/- 0.00)
MSE's of Linear Regression: -0.000114 (+/- 0.00)
MSE's of Support Vector Regression: -0.000115 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001223 (+/- 0.00)
MSE's of Random Forest: -0.001882 (+/- 0.00)
MSE's of Linear Regression: -0.000181 (+/- 0.00)
MSE's of Support Vector Regression: -0.000181 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002088 (+/- 0.00)
MSE's of Random Forest: -0.003061 (+/- 0.00)
MSE's of Linear Regression: -0.002754 (+/- 0.00)
MSE's of Support Vector Regression: -0.002794 (+/- 0.00)
MSE's of Nearest Neighbors: -0.006098 (+/- 0.00)
MSE's of Random Forest: -0.007353 (+/- 0.00)
MSE's of Linear Regression: -0.039213 (+/- 0.00)
MSE's of Support Vector Regression: -0.039923 (+/- 0.00)
MSE's of Nearest Neighbors: -0.061765 (+/- 0.00)
MSE's of Random Forest: -0.060274 (+/- 0.00)
MSE's of Linear Regression: -0.000026 (+/- 0.00)
MSE's of Support Vector Regression: -0.000035 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000441 (+/- 0.00)
MSE's of Random Forest: -0.000515 (+/- 0.00)
MSE's of Linear Regression: -0.000027 (+/- 0.00)
MSE's of Support Vector Regression: -0.000034 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000514 (+/- 0.00)
MSE's of Random Forest: -0.000574 (+/- 0.00)
MSE's of Linear Regression: -0.000159 (+/- 0.00)
MSE's of Support Vector Regression: -0.000160 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001557 (+/- 0.00)
MSE's of Random Forest: -0.002426 (+/- 0.00)
MSE's of Linear Regression: -0.000286 (+/- 0.00)
MSE's of Support Vector Regression: -0.000287 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002690 (+/- 0.00)
MSE's of Random Forest: -0.004295 (+/- 0.00)
MSE's of Linear Regression: -0.000373 (+/- 0.00)
MSE's of Support Vector Regression: -0.000363 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001156 (+/- 0.00)
MSE's of Random Forest: -0.001077 (+/- 0.00)
MSE's of Linear Regression: -0.021814 (+/- 0.00)
MSE's of Support Vector Regression: -0.022090 (+/- 0.00)
MSE's of Nearest Neighbors: -0.041289 (+/- 0.00)
MSE's of Random Forest: -0.041229 (+/- 0.00)
MSE's of Linear Regression: -0.000002 (+/- 0.00)
MSE's of Support Vector Regression: -0.000016 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000027 (+/- 0.00)
MSE's of Random Forest: -0.000026 (+/- 0.00)
MSE's of Linear Regression: -0.000002 (+/- 0.00)
MSE's of Support Vector Regression: -0.000017 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000023 (+/- 0.00)
MSE's of Random Forest: -0.000021 (+/- 0.00)
MSE's of Linear Regression: -0.000097 (+/- 0.00)
MSE's of Support Vector Regression: -0.000098 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000864 (+/- 0.00)
MSE's of Random Forest: -0.000915 (+/- 0.00)
MSE's of Linear Regression: -0.000164 (+/- 0.00)
MSE's of Support Vector Regression: -0.000164 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001514 (+/- 0.00)
MSE's of Random Forest: -0.001580 (+/- 0.00)
MSE's of Linear Regression: -0.000357 (+/- 0.00)
MSE's of Support Vector Regression: -0.000374 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001034 (+/- 0.00)
MSE's of Random Forest: -0.000870 (+/- 0.00)
MSE's of Linear Regression: -0.025738 (+/- 0.00)
MSE's of Support Vector Regression: -0.027134 (+/- 0.00)
MSE's of Nearest Neighbors: -0.039655 (+/- 0.00)
MSE's of Random Forest: -0.033851 (+/- 0.00)
MSE's of Linear Regression: -0.000197 (+/- 0.00)
MSE's of Support Vector Regression: -0.000201 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002143 (+/- 0.00)
MSE's of Random Forest: -0.002714 (+/- 0.00)
MSE's of Linear Regression: -0.000207 (+/- 0.00)
MSE's of Support Vector Regression: -0.000209 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002168 (+/- 0.00)
MSE's of Random Forest: -0.002690 (+/- 0.00)
MSE's of Linear Regression: -0.000190 (+/- 0.00)
MSE's of Support Vector Regression: -0.000188 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001601 (+/- 0.00)
MSE's of Random Forest: -0.002358 (+/- 0.00)
MSE's of Linear Regression: -0.000326 (+/- 0.00)
MSE's of Support Vector Regression: -0.000326 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002674 (+/- 0.00)
MSE's of Random Forest: -0.003918 (+/- 0.00)
MSE's of Linear Regression: -0.002827 (+/- 0.00)
MSE's of Support Vector Regression: -0.002906 (+/- 0.00)
MSE's of Nearest Neighbors: -0.007122 (+/- 0.00)
MSE's of Random Forest: -0.006940 (+/- 0.00)
MSE's of Linear Regression: -0.028689 (+/- 0.00)
MSE's of Support Vector Regression: -0.028841 (+/- 0.00)
MSE's of Nearest Neighbors: -0.050625 (+/- 0.00)
MSE's of Random Forest: -0.047194 (+/- 0.00)
MSE's of Linear Regression: -0.000292 (+/- 0.00)
MSE's of Support Vector Regression: -0.000319 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002373 (+/- 0.00)
MSE's of Random Forest: -0.003317 (+/- 0.00)
MSE's of Linear Regression: -0.000319 (+/- 0.00)
MSE's of Support Vector Regression: -0.000333 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002868 (+/- 0.00)
MSE's of Random Forest: -0.003531 (+/- 0.00)
MSE's of Linear Regression: -0.000230 (+/- 0.00)
MSE's of Support Vector Regression: -0.000230 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002210 (+/- 0.00)
MSE's of Random Forest: -0.003521 (+/- 0.00)
MSE's of Linear Regression: -0.000406 (+/- 0.00)
MSE's of Support Vector Regression: -0.000408 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003889 (+/- 0.00)
MSE's of Random Forest: -0.006279 (+/- 0.00)
MSE's of Linear Regression: -0.003171 (+/- 0.00)
MSE's of Support Vector Regression: -0.003307 (+/- 0.00)
MSE's of Nearest Neighbors: -0.009082 (+/- 0.00)
MSE's of Random Forest: -0.008791 (+/- 0.00)
MSE's of Linear Regression: -0.022437 (+/- 0.00)
MSE's of Support Vector Regression: -0.022675 (+/- 0.00)
MSE's of Nearest Neighbors: -0.039889 (+/- 0.00)
MSE's of Random Forest: -0.043397 (+/- 0.00)
MSE's of Linear Regression: -0.000138 (+/- 0.00)
MSE's of Support Vector Regression: -0.000140 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001559 (+/- 0.00)
MSE's of Random Forest: -0.001725 (+/- 0.00)
MSE's of Linear Regression: -0.000142 (+/- 0.00)
MSE's of Support Vector Regression: -0.000148 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001532 (+/- 0.00)
MSE's of Random Forest: -0.001495 (+/- 0.00)
MSE's of Linear Regression: -0.000193 (+/- 0.00)
MSE's of Support Vector Regression: -0.000193 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001531 (+/- 0.00)
MSE's of Random Forest: -0.002029 (+/- 0.00)
MSE's of Linear Regression: -0.000326 (+/- 0.00)
MSE's of Support Vector Regression: -0.000327 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002583 (+/- 0.00)
MSE's of Random Forest: -0.003574 (+/- 0.00)
MSE's of Linear Regression: -0.002287 (+/- 0.00)
MSE's of Support Vector Regression: -0.002358 (+/- 0.00)
MSE's of Nearest Neighbors: -0.006124 (+/- 0.00)
MSE's of Random Forest: -0.005345 (+/- 0.00)
MSE's of Linear Regression: -0.023277 (+/- 0.00)
MSE's of Support Vector Regression: -0.023380 (+/- 0.00)
MSE's of Nearest Neighbors: -0.044452 (+/- 0.00)
MSE's of Random Forest: -0.040805 (+/- 0.00)
MSE's of Linear Regression: -0.000094 (+/- 0.00)
MSE's of Support Vector Regression: -0.000098 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001009 (+/- 0.00)
MSE's of Random Forest: -0.001096 (+/- 0.00)
MSE's of Linear Regression: -0.000121 (+/- 0.00)
MSE's of Support Vector Regression: -0.000125 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001337 (+/- 0.00)
MSE's of Random Forest: -0.001331 (+/- 0.00)
MSE's of Linear Regression: -0.000248 (+/- 0.00)
MSE's of Support Vector Regression: -0.000261 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001909 (+/- 0.00)
MSE's of Random Forest: -0.002691 (+/- 0.00)
MSE's of Linear Regression: -0.000400 (+/- 0.00)
MSE's of Support Vector Regression: -0.000404 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003152 (+/- 0.00)
MSE's of Random Forest: -0.004419 (+/- 0.00)
MSE's of Linear Regression: -0.001867 (+/- 0.00)
MSE's of Support Vector Regression: -0.001932 (+/- 0.00)
MSE's of Nearest Neighbors: -0.005800 (+/- 0.00)
MSE's of Random Forest: -0.005071 (+/- 0.00)
MSE's of Linear Regression: -0.024622 (+/- 0.00)
MSE's of Support Vector Regression: -0.025200 (+/- 0.00)
MSE's of Nearest Neighbors: -0.044594 (+/- 0.00)
MSE's of Random Forest: -0.040682 (+/- 0.00)
MSE's of Linear Regression: -0.000102 (+/- 0.00)
MSE's of Support Vector Regression: -0.000106 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001079 (+/- 0.00)
MSE's of Random Forest: -0.002076 (+/- 0.00)
MSE's of Linear Regression: -0.000086 (+/- 0.00)
MSE's of Support Vector Regression: -0.000088 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001266 (+/- 0.00)
MSE's of Random Forest: -0.002141 (+/- 0.00)
MSE's of Linear Regression: -0.000268 (+/- 0.00)
MSE's of Support Vector Regression: -0.000269 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001964 (+/- 0.00)
MSE's of Random Forest: -0.002511 (+/- 0.00)
MSE's of Linear Regression: -0.000459 (+/- 0.00)
MSE's of Support Vector Regression: -0.000461 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003586 (+/- 0.00)
MSE's of Random Forest: -0.004617 (+/- 0.00)
MSE's of Linear Regression: -0.002831 (+/- 0.00)
MSE's of Support Vector Regression: -0.003036 (+/- 0.00)
MSE's of Nearest Neighbors: -0.009144 (+/- 0.00)
MSE's of Random Forest: -0.009095 (+/- 0.00)
MSE's of Linear Regression: -0.021679 (+/- 0.00)
MSE's of Support Vector Regression: -0.022520 (+/- 0.00)
MSE's of Nearest Neighbors: -0.035799 (+/- 0.00)
MSE's of Random Forest: -0.034830 (+/- 0.00)
MSE's of Linear Regression: -0.000128 (+/- 0.00)
MSE's of Support Vector Regression: -0.000133 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001196 (+/- 0.00)
MSE's of Random Forest: -0.002173 (+/- 0.00)
MSE's of Linear Regression: -0.000117 (+/- 0.00)
MSE's of Support Vector Regression: -0.000120 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001417 (+/- 0.00)
MSE's of Random Forest: -0.002261 (+/- 0.00)
MSE's of Linear Regression: -0.000267 (+/- 0.00)
MSE's of Support Vector Regression: -0.000266 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001886 (+/- 0.00)
MSE's of Random Forest: -0.002644 (+/- 0.00)
MSE's of Linear Regression: -0.000452 (+/- 0.00)
MSE's of Support Vector Regression: -0.000453 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003319 (+/- 0.00)
MSE's of Random Forest: -0.004755 (+/- 0.00)
MSE's of Linear Regression: -0.001970 (+/- 0.00)
MSE's of Support Vector Regression: -0.002033 (+/- 0.00)
MSE's of Nearest Neighbors: -0.005376 (+/- 0.00)
MSE's of Random Forest: -0.005813 (+/- 0.00)
MSE's of Linear Regression: -0.020872 (+/- 0.00)
MSE's of Support Vector Regression: -0.021262 (+/- 0.00)
MSE's of Nearest Neighbors: -0.037028 (+/- 0.00)
MSE's of Random Forest: -0.035538 (+/- 0.00)
MSE's of Linear Regression: -0.000032 (+/- 0.00)
MSE's of Support Vector Regression: -0.000036 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000594 (+/- 0.00)
MSE's of Random Forest: -0.000567 (+/- 0.00)
MSE's of Linear Regression: -0.000033 (+/- 0.00)
MSE's of Support Vector Regression: -0.000037 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000616 (+/- 0.00)
MSE's of Random Forest: -0.000578 (+/- 0.00)
MSE's of Linear Regression: -0.000317 (+/- 0.00)
MSE's of Support Vector Regression: -0.000319 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002072 (+/- 0.00)
MSE's of Random Forest: -0.002675 (+/- 0.00)
MSE's of Linear Regression: -0.000573 (+/- 0.00)
MSE's of Support Vector Regression: -0.000576 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003786 (+/- 0.00)
MSE's of Random Forest: -0.004678 (+/- 0.00)
MSE's of Linear Regression: -0.000573 (+/- 0.00)
MSE's of Support Vector Regression: -0.000583 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001569 (+/- 0.00)
MSE's of Random Forest: -0.001370 (+/- 0.00)
MSE's of Linear Regression: -0.020265 (+/- 0.00)
MSE's of Support Vector Regression: -0.020562 (+/- 0.00)
MSE's of Nearest Neighbors: -0.035902 (+/- 0.00)
MSE's of Random Forest: -0.032717 (+/- 0.00)
MSE's of Linear Regression: -0.000077 (+/- 0.00)
MSE's of Support Vector Regression: -0.000082 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000545 (+/- 0.00)
MSE's of Random Forest: -0.000687 (+/- 0.00)
MSE's of Linear Regression: -0.000042 (+/- 0.00)
MSE's of Support Vector Regression: -0.000047 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000395 (+/- 0.00)
MSE's of Random Forest: -0.000431 (+/- 0.00)
MSE's of Linear Regression: -0.000415 (+/- 0.00)
MSE's of Support Vector Regression: -0.000418 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002657 (+/- 0.00)
MSE's of Random Forest: -0.004095 (+/- 0.00)
MSE's of Linear Regression: -0.000682 (+/- 0.00)
MSE's of Support Vector Regression: -0.000688 (+/- 0.00)
MSE's of Nearest Neighbors: -0.004712 (+/- 0.00)
MSE's of Random Forest: -0.007278 (+/- 0.00)
MSE's of Linear Regression: -0.000913 (+/- 0.00)
MSE's of Support Vector Regression: -0.000970 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003427 (+/- 0.00)
MSE's of Random Forest: -0.003063 (+/- 0.00)
MSE's of Linear Regression: -0.014905 (+/- 0.00)
MSE's of Support Vector Regression: -0.014967 (+/- 0.00)
MSE's of Nearest Neighbors: -0.031157 (+/- 0.00)
MSE's of Random Forest: -0.032549 (+/- 0.00)
MSE's of Linear Regression: -0.000065 (+/- 0.00)
MSE's of Support Vector Regression: -0.000067 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000603 (+/- 0.00)
MSE's of Random Forest: -0.000649 (+/- 0.00)
MSE's of Linear Regression: -0.000044 (+/- 0.00)
MSE's of Support Vector Regression: -0.000048 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000438 (+/- 0.00)
MSE's of Random Forest: -0.000416 (+/- 0.00)
MSE's of Linear Regression: -0.000312 (+/- 0.00)
MSE's of Support Vector Regression: -0.000314 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002136 (+/- 0.00)
MSE's of Random Forest: -0.002931 (+/- 0.00)
MSE's of Linear Regression: -0.000541 (+/- 0.00)
MSE's of Support Vector Regression: -0.000544 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003727 (+/- 0.00)
MSE's of Random Forest: -0.005384 (+/- 0.00)
MSE's of Linear Regression: -0.001075 (+/- 0.00)
MSE's of Support Vector Regression: -0.001102 (+/- 0.00)
MSE's of Nearest Neighbors: -0.003899 (+/- 0.00)
MSE's of Random Forest: -0.003232 (+/- 0.00)
MSE's of Linear Regression: -0.016380 (+/- 0.00)
MSE's of Support Vector Regression: -0.016796 (+/- 0.00)
MSE's of Nearest Neighbors: -0.034742 (+/- 0.00)
MSE's of Random Forest: -0.032548 (+/- 0.00)
MSE's of Linear Regression: -0.000019 (+/- 0.00)
MSE's of Support Vector Regression: -0.000026 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000293 (+/- 0.00)
MSE's of Random Forest: -0.000293 (+/- 0.00)
MSE's of Linear Regression: -0.000011 (+/- 0.00)
MSE's of Support Vector Regression: -0.000018 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000169 (+/- 0.00)
MSE's of Random Forest: -0.000153 (+/- 0.00)
MSE's of Linear Regression: -0.000164 (+/- 0.00)
MSE's of Support Vector Regression: -0.000175 (+/- 0.00)
MSE's of Nearest Neighbors: -0.001386 (+/- 0.00)
MSE's of Random Forest: -0.001955 (+/- 0.00)
MSE's of Linear Regression: -0.000277 (+/- 0.00)
MSE's of Support Vector Regression: -0.000279 (+/- 0.00)
MSE's of Nearest Neighbors: -0.002275 (+/- 0.00)
MSE's of Random Forest: -0.003051 (+/- 0.00)
MSE's of Linear Regression: -0.000317 (+/- 0.00)
MSE's of Support Vector Regression: -0.000323 (+/- 0.00)
MSE's of Nearest Neighbors: -0.000971 (+/- 0.00)
MSE's of Random Forest: -0.000795 (+/- 0.00)
MSE's of Linear Regression: -0.028376 (+/- 0.00)
MSE's of Support Vector Regression: -0.029250 (+/- 0.00)
MSE's of Nearest Neighbors: -0.049142 (+/- 0.00)
MSE's of Random Forest: -0.044023 (+/- 0.00)

STEP 8: DISCUSSION


In [11]:
X, y, coef = datasets.make_regression(n_samples=300, n_features=1,
                                  n_informative=1, noise=1,
                                  coef=True, random_state=0)

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
    
# create regression and fit
reg = regr.fit(X_train, y_train)

# leave one out & compute cross-validation scores with MSE
# loo = LeaveOneOut(len(X))
loo = cross_validation.KFold(n=len(X), n_folds=10, shuffle=False, random_state=None)
scores = cross_validation.cross_val_score(reg, X, y, scoring='mean_squared_error', cv=loo)

# get error scores and print 
errors[0,] = [scores.mean(), scores.std()]
print("MSE's of %s: %0.2f (+/- %0.2f)" % ('linear reg', scores.mean(), scores.std()))  
# reg = LinearRegression()

# regr = reg.fit(X,y)

plt.plot(y)


MSE's of linear reg: -6.13 (+/- 4.81)
Out[11]:
[<matplotlib.lines.Line2D at 0x17417f950>]

In [16]:
errors_mse = {}
for key in errors_cols.keys():
    errors = errors_cols[key]
    errors_mse[key] = np.mean(errors,axis=0)

In [28]:
# write new list_of_features to new txt file
csvfile = "data_normalized/mse_regressions_features.txt"

# write the mean/std MSE for each column to file
with open(csvfile, "w") as output:
    # write to new file the data
    writer = csv.writer(output, lineterminator='\n')
    for key in errors_mse.keys():
        try:
            writer.writerow(errors_mse[key])
        except:
            print key


24

In [26]:
errors_mse['24'] = None

In [ ]: