Regression: Linear and Non-Linear

1. Create new sets

a) A training set of the team data containing only the columns listed below and year in interval [2005, 2010]. Create a scatter matrix (pair plot) and heat map of this data and answer the following:

  • Why might ha (hitting average) appear to be negatively correlated with w (wins)?
  • If we wanted to predict w (wins), what might be the best three predictors to model first? Why?

columns:

['bb', 'bpf', 'cs', 'double', 'dp', 'e', 'h', 'ha', 'hr', 'hra', 'l', 'ppf', 'r', 'ra', 'sb', 'sf', 'so', 'soa', 'triple', 'w', 'year']

b) A merge of player and batting data containing only the columns listed below and year >= 2005

columns:

['ab', 'r', 'h', 'double', 'triple', 'hr', 'rbi', 'sb', 'cs', 'bb']


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
team = pd.read_csv('team.csv')

In [4]:
team_cols = ['bb', 'bpf', 'cs', 'double', 'dp', 'e', 'h', 'ha', 'hr', 'hra', 'l',
             'ppf', 'r', 'ra', 'sb', 'sf', 'so', 'soa', 'triple', 'w', 'year']

team_train = team.loc[team.year.isin(np.arange(2005, 2011)), team_cols].reset_index(drop=True)

In [5]:
team_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 21 columns):
bb        180 non-null int64
bpf       180 non-null int64
cs        180 non-null float64
double    180 non-null int64
dp        180 non-null float64
e         180 non-null int64
h         180 non-null int64
ha        180 non-null int64
hr        180 non-null int64
hra       180 non-null int64
l         180 non-null int64
ppf       180 non-null int64
r         180 non-null int64
ra        180 non-null int64
sb        180 non-null float64
sf        180 non-null float64
so        180 non-null float64
soa       180 non-null int64
triple    180 non-null int64
w         180 non-null int64
year      180 non-null int64
dtypes: float64(5), int64(16)
memory usage: 29.6 KB

In [6]:
team_train.head()


Out[6]:
bb bpf cs double dp e h ha hr hra ... ppf r ra sb sf so soa triple w year
0 606 103 26.0 291 159.0 94 1419 1580 191 193 ... 105 696 856 67.0 45.0 1094.0 1038 27 77 2005
1 534 101 32.0 308 170.0 86 1453 1487 184 145 ... 100 769 674 92.0 46.0 1084.0 929 37 90 2005
2 447 99 37.0 296 154.0 107 1492 1458 189 180 ... 99 729 800 83.0 42.0 902.0 1052 27 74 2005
3 653 104 12.0 339 135.0 109 1579 1550 199 164 ... 104 910 805 45.0 63.0 1044.0 959 21 95 2005
4 435 103 67.0 253 166.0 94 1450 1392 200 167 ... 103 741 645 137.0 49.0 1002.0 1040 23 99 2005

5 rows × 21 columns


In [15]:
# solutions:
from pandas.plotting import scatter_matrix  # pandas.tools.plotting is deprecated/removed in modern pandas
scatter_matrix(team_train, figsize=(24, 24));



In [6]:
sns.pairplot(team_train);


According to the documentation, ha is not hitting average but hits allowed, i.e. a measure of how well the opposing batters did against the team's pitching.

It's only natural that it is negatively correlated with wins.


In [7]:
fig, ax = plt.subplots(figsize=(16, 10)) 
ax = sns.heatmap(team_train.corr(), annot=True);


Obviously losses (l) has a correlation of -1 with wins, but I'm assuming it can't be used to model wins. So I'm going with runs (r), runs allowed (ra), and a third feature, which could be either errors (e) or hits allowed (ha); of those two I think errors is the better choice because hits allowed is highly correlated with ra.

Correlation doesn't imply causation, though.
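A quick sanity check on this choice (a minimal sketch using the team_train frame built above):

In [ ]:
# correlations of the candidate predictors with wins
print(team_train[['r', 'ra', 'e', 'ha']].corrwith(team_train.w))
# ha and ra carry overlapping information
print(team_train[['ha', 'ra']].corr())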


In [19]:
# A merge of player and batting data containing only the columns listed below and year >= 2005
player_batting_cols = ['ab', 'r', 'h', 'double', 'triple', 'hr', 'rbi', 'sb', 'cs', 'bb']

player = pd.read_csv('player.csv')
batting = pd.read_csv('batting.csv')

player_batting = pd.merge(left=player, right=batting, on='player_id')
# note: this filter starts at 2004, one year earlier than the prompt's year >= 2005
player_batting = player_batting.loc[player_batting.year >= 2004, ['name_given', 'name_last', 'year'] + player_batting_cols]

In [20]:
player_batting.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 16695 entries, 0 to 101331
Data columns (total 13 columns):
name_given    16695 non-null object
name_last     16695 non-null object
year          16695 non-null int64
ab            16695 non-null float64
r             16695 non-null float64
h             16695 non-null float64
double        16695 non-null float64
triple        16695 non-null float64
hr            16695 non-null float64
rbi           16695 non-null float64
sb            16695 non-null float64
cs            16695 non-null float64
bb            16695 non-null float64
dtypes: float64(10), int64(1), object(2)
memory usage: 1.8+ MB

In [21]:
player_batting.head()


Out[21]:
name_given name_last year ab r h double triple hr rbi sb cs bb
0 David Allan Aardsma 2004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 David Allan Aardsma 2006 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 David Allan Aardsma 2007 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 David Allan Aardsma 2008 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 David Allan Aardsma 2009 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2. Using the OLS LinearRegression in the scikit-learn library

  • Regress r onto double using the train set (create a linear model to predict r from double)
  • What is the equation for the model? Interpret the slope.
  • What is the $R^2$ value and what does it tell us?

In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [23]:
# creating the model
OLS_1var = LinearRegression()
# creating training sets
X_1var = team_train.double.values.reshape(-1, 1)
y_1var = team_train.r.values.reshape(-1, 1)
# fitting the model
OLS_1var.fit(X_1var, y_1var)


Out[23]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [27]:
# making predictions
y_1var_pred = OLS_1var.predict(X_1var)
print('Model equation is: y = {:.5f}x + {:.5f}'.format(float(OLS_1var.coef_), float(OLS_1var.intercept_)))


Model equation is: y = 1.55752x + 290.75819

The slope is the expected change in r per unit change in double: each additional team double is associated with roughly 1.56 additional runs.


In [28]:
r2_score(y_1var, y_1var_pred)


Out[28]:
0.31120469036164045

The $R^2$ value tells us what proportion of the variance in $y$ is explained by the model; here, only about 31% of the variance in runs is accounted for by doubles.
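For reference, $R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}$; a minimal sketch verifying r2_score by hand, using the arrays from above:

In [ ]:
ss_res = np.sum((y_1var - y_1var_pred)**2)    # residual sum of squares
ss_tot = np.sum((y_1var - y_1var.mean())**2)  # total sum of squares
print(1 - ss_res/ss_tot)  # should match r2_score above (~0.311)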

3. Plot the data points and the line of best fit. Create a residual plot for residuals vs the independent variable (double) in the model you just created and interpret it.

  • Is the distribution of residuals random?
  • Is there any heteroscedasticity?
  • Would you say the data used in this model is well-suited to a linear model?

In [26]:
# data for the line plot
X_1var_plot = np.linspace(X_1var.min()-10, X_1var.max()+10, 1000)
y_1var_plot = OLS_1var.predict(X_1var_plot.reshape(-1, 1))

fig, ax = plt.subplots(figsize=(16, 10))
ax = plt.scatter(X_1var, y_1var)
ax = plt.plot(X_1var_plot, y_1var_plot, 'r')
# solution:
# ax = plt.plot(X_1var, y_1var_pred, 'r')



In [29]:
fig, ax = plt.subplots(figsize=(16, 10))
# residuals (y - yhat) against the independent variable, as the question asks
ax = plt.scatter(X_1var, y_1var - y_1var_pred)
ax = plt.hlines(y=0, xmin=X_1var.min(), xmax=X_1var.max(), lw=2, color='red')


The residuals seem randomly distributed and there is no recognizable pattern, so there is no obvious heteroscedasticity.

The data shows a mild linear trend but, as the $R^2$ score tells us, this is not a very strong model.

Solution: it seems to work well with a linear model.

4. Create one function that returns the standard error of the estimated parameters in a bivariate model, and create another that returns 95% confidence intervals for these parameters. Compute and interpret these intervals for the model above, then compute the t-stat and p-value for $\hat \beta_1$.

Solution:

$Var(\hat \beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \overline x)^2}$

$Var(\hat \beta_0) = \frac{\sigma^2 n^{-1} \sum_{i=1}^{n}x_i^2}{\sum_{i=1}^{n}(x_i - \overline x)^2}$
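The standard errors are the square roots of these variances, with the error variance estimated as $\hat \sigma^2 = \frac{1}{n-k}\sum_{i=1}^{n} e_i^2$ for $k$ estimated parameters ($k = 2$ here). A 95% confidence interval is then $\hat \beta_j \pm z_{0.975}\,SE(\hat \beta_j)$, and the t-stat for the slope is $t = \hat \beta_1 / SE(\hat \beta_1)$.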


In [30]:
from scipy import stats

In [21]:
# just a reminder for me:
# this is the interval for 0.95 confidence and 5 df
t = stats.t.interval(alpha=0.95, df=5)
print(t)
# this is the probability of being under the t-value with 5 df
print(stats.t.cdf(2.57058, df=5))
# this is the inverse of the above: what is the t-value you have probability x of being under with 5 df?
print(stats.t.ppf(0.975, df=5))
# two-tailed p-value: probability of a value at least as extreme as |t|
stats.t.sf(x=np.abs(t), df=5)*2


(-2.5705818366147395, 2.5705818366147395)
0.974999944311
2.57058183661
Out[21]:
array([ 0.05,  0.05])

In [31]:
# solution:
def se_beta(X, y, yhat):
    e = y - yhat
    n = len(X)
    # degrees of freedom
    # + 1 is for the intercept
    k = X.shape[1] + 1
    # compute estimated variance of the regression (sigma_hat^2)
    sig2 = np.sum(e**2) / (n - k)
    # compute variance of b0 and b1
    denom = np.sum((X - X.mean())**2)
    vb1 = sig2 / denom
    vb0 = (sig2/n)*np.sum(X**2) / denom
    
    return np.sqrt(vb0), np.sqrt(vb1)

def interval_beta(b0, b1, se_b0, se_b1):
    # two-sided 95% interval -> use the 97.5th percentile of the normal
    z = stats.norm.ppf(0.975)
    # create intervals [b -/+ z*se]
    itv_b0 = b0 + z*np.array([-se_b0, se_b0])
    itv_b1 = b1 + z*np.array([-se_b1, se_b1])
    
    return itv_b0, itv_b1

In [34]:
interval_beta(float(OLS_1var.intercept_), float(OLS_1var.coef_), *se_beta(X_1var, y_1var, y_1var_pred))


Out[34]:
(array([ 189.3297,  392.1867]), array([ 1.2171,  1.8979]))

In [36]:
# get se_b1
_, se_b1 = se_beta(X_1var, y_1var, y_1var_pred)

# t-stat
t = float(OLS_1var.coef_) / se_b1

# one-tailed p-value using the normal cdf (double it for a two-tailed test)
p = stats.norm.cdf(-abs(t))

print(t, p)


8.96784260929 1.51189360457e-19

In [22]:
def standard_error(model, X, y):
    # note: X has no intercept column here, which understates the SE (see the sketch below)
    MSE = ((model.predict(X) - y)**2).mean()
    return np.sqrt(MSE * np.linalg.inv(X.T.dot(X)))

def conf_intervals(model, X, y, alpha=0.05):
    se = standard_error(model, X, y)
    n = len(X)
    k = len(model.coef_)
    # t-value for half of 1 minus the desired alpha (two-tailed interval)
    t_val = stats.t.ppf(1-alpha/2, df=n-(k+1))
    return model.coef_ - se * t_val, model.coef_ + se * t_val

# 95% confidence interval for previous model
# I'm missing the intercept... (see the sketch after this cell's output)
print(conf_intervals(OLS_1var, X_1var, y_1var))

# t-stat and p-value for previous model slope
t_stat = OLS_1var.coef_ / standard_error(OLS_1var, X_1var, y_1var)
p_val = stats.t.sf(x=t_stat, df=178)

print('t-stat: {:.5f}\np-value: {:.5f}'.format(float(t_stat), float(p_val)))


(array([[ 1.52798695]]), array([[ 1.58705744]]))
t-stat: 104.06497
p-value: 0.00000
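The huge t-stat here is exactly the missing-intercept problem flagged in the comment: without a column of ones in X, the $(X^TX)^{-1}$ term understates the standard error. A minimal sketch of the fix, reusing the arrays from question 2; the printed SEs should match the se_beta values above:

In [ ]:
# prepend a column of ones so the intercept is part of the design matrix
X_aug = np.hstack([np.ones((len(X_1var), 1)), X_1var])
resid = y_1var - OLS_1var.predict(X_1var)
sig2 = (resid**2).sum() / (len(X_aug) - X_aug.shape[1])  # unbiased sigma_hat^2
cov = sig2 * np.linalg.inv(X_aug.T.dot(X_aug))           # var-cov matrix of [b0, b1]
print(np.sqrt(np.diag(cov)))                             # SEs for intercept and slope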

5. Create a RANSAC model for the same data used in question 2, and compare its R-squared value to the first model. Be sure to get the indices of inliers.


In [45]:
from sklearn.linear_model import RANSACRegressor
# create the RANSAC model
ransac = RANSACRegressor(
                        LinearRegression(),
#                         max_trials=100,
#                         min_samples=50,
#                         loss='absolute_loss',
#                         residual_threshold=5.0,
                        random_state=1234)

# fit the model
ransac.fit(X_1var, y_1var)
# make predictions
ransac_pred = ransac.predict(X_1var)

print('Slope: {:.5f}'.format(float(ransac.estimator_.coef_)))
print('Intercept: {:.5f}'.format(float(ransac.estimator_.intercept_)))
print('R2 score: {:.5f}'.format(r2_score(y_1var, ransac_pred)))

# get the inliers mask
inliers = ransac.inlier_mask_
print(X_1var[inliers][:5])


Slope: 1.86835
Intercept: 183.82008
R2 score: 0.25739
[[291]
 [308]
 [296]
 [335]
 [337]]

In [46]:
# solution
ransac.score(X_1var[inliers], y_1var[inliers])


Out[46]:
0.76198265732964365

In [48]:
# solution
yhat_ransac = ransac.predict(X_1var[inliers])

plt.scatter(X_1var[inliers], y_1var[inliers])
plt.scatter(X_1var[~inliers], y_1var[~inliers], edgecolors='k', facecolors='none')

plt.plot(X_1var, y_1var_pred, 'k', alpha=0.5)
plt.plot(X_1var[inliers], yhat_ransac, 'r')


Out[48]:
[<matplotlib.lines.Line2D at 0x1f824b84c50>]

In [50]:
# solution:
# residuals vs. yhat
resid = y_1var[inliers] - yhat_ransac
plt.scatter(yhat_ransac, resid)
# horizontal line at y = 0
plt.axhline(0, c='r')


Out[50]:
<matplotlib.lines.Line2D at 0x1f824b75c18>

6. Create a linear model to predict r using both double and h as predictors, and do the following

  • Interpret the parameters and give the $R^2$ value.
  • Is there any problem with the two predictors together in this model?
  • Create a 3D plot that shows the data points for both actual and predicted values for r.

In [51]:
X_2var = team_train[['double', 'h']]
y_2var = team_train.r.values.reshape(-1, 1)

OLS_2var = LinearRegression()
OLS_2var.fit(X_2var, y_2var)

print('Coefficients: {}'.format(OLS_2var.coef_))
print('Intercept: {:.5f}'.format(float(OLS_2var.intercept_)))

print('Correlation between predictors:\n{}'.format(X_2var.corr()))

y_2var_pred = OLS_2var.predict(X_2var)

print('R2 score: {:.5f}'.format(r2_score(y_2var, y_2var_pred)))


Coefficients: [[ 0.71934416  0.62245511]]
Intercept: -373.68216
Correlation between predictors:
        double       h
double  1.0000  0.4739
h       0.4739  1.0000
R2 score: 0.62239

In [52]:
sns.pairplot(X_2var);


The two predictors show a mild linear relationship (correlation ≈ 0.47), so some collinearity may be inflating the variance of the coefficient estimates.
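To put a number on it, the variance inflation factor for a two-predictor model reduces to $1/(1-\rho^2)$; a quick sketch:

In [ ]:
rho = X_2var.corr().iloc[0, 1]  # correlation between double and h (~0.47)
print(1 / (1 - rho**2))         # VIF ~1.29, i.e. only mild collinearity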


In [53]:
from mpl_toolkits import mplot3d

In [54]:
# plotting the points
fig = plt.figure(figsize=(16, 8))
ax = plt.axes(projection='3d')
ax.scatter3D(X_2var.double, X_2var.h, y_2var, c=y_2var)
ax.scatter3D(X_2var.double, X_2var.h, y_2var_pred, c=y_2var_pred, cmap='viridis');


7. Create a multiple regression model to predict r using the variables listed below as predictors, and determine the $R^2$ value. Print all parameter values along with their corresponding variable names.

['bb','bpf','double','dp','e','h','ha','hra','ppf','ra','sb','sf','triple']


In [55]:
# model features
mult_regr_cols = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training sets
X_multvar = team_train[mult_regr_cols] # in solution uses .values
y_multvar = team_train.r.values.reshape(-1, 1)
# fit model
OLS_multvar = LinearRegression()
OLS_multvar.fit(X_multvar, y_multvar)
# print parameters and intercept
for i, col in enumerate(mult_regr_cols):
    print('Coefficient for {} is: {}'.format(col, OLS_multvar.coef_[0][i]))

print('Intercept: {:.5f}'.format(float(OLS_multvar.intercept_)))

# predict and print R2
y_multvar_pred = OLS_multvar.predict(X_multvar)
print('R2 score: {:.5f}'.format(r2_score(y_multvar, y_multvar_pred)))


Coefficient for bb is: 0.32532424691943296
Coefficient for bpf is: 18.846529117352258
Coefficient for double is: 0.4397304479295745
Coefficient for dp is: 0.17870078320113036
Coefficient for e is: 0.03940524716282212
Coefficient for h is: 0.5194200382940695
Coefficient for ha is: 0.11292488964728964
Coefficient for hra is: 0.3305064938706632
Coefficient for ppf is: -17.039926128845746
Coefficient for ra is: -0.05576480937751427
Coefficient for sb is: 0.14319056188560886
Coefficient for sf is: -0.18307152600088283
Coefficient for triple is: 0.1927186886122314
Intercept: -715.03548
R2 score: 0.82876

Adjusted $R^2$ would be worth reporting here: we have added a lot of features, and it penalizes the score for each extra predictor, which plain $R^2$ does not.
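A minimal sketch of that adjustment, $\bar R^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, using the fit above:

In [ ]:
n, k = X_multvar.shape  # 180 rows, 13 predictors
r2 = r2_score(y_multvar, y_multvar_pred)
print(1 - (1 - r2) * (n - 1) / (n - k - 1))  # a bit below the raw 0.82876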

8. Create a test set that includes all years >= 2011 and create a set of predictions from this data. Compute the MSE and RMSE for the model above using scikit-learn. Create your own function to compute these scores and compare your outputs to scikit's, and interpret the RMSE.


In [56]:
# create the test dataframe
team_test = team.loc[team.year >= 2011, team_cols].reset_index(drop=True)

In [57]:
# create test sets
X_test = team_test[mult_regr_cols]
y_test = team_test.r.values.reshape(-1, 1)
# make predictions
y_test_pred = OLS_multvar.predict(X_test)

In [58]:
from sklearn.metrics import mean_squared_error
# using scikit-learn
print('MSE: {:.5f}'.format(mean_squared_error(y_test, y_test_pred)))
print('RMSE: {:.5f}'.format(np.sqrt(mean_squared_error(y_test, y_test_pred))))


MSE: 1554.25499
RMSE: 39.42404

In [59]:
def MSE(y, pred):
    return ((y - pred)**2).mean()

def RMSE(y, pred):
    return np.sqrt(((y - pred)**2).mean())
# using my functions
print('MSE: {:.5f}'.format(MSE(y_test, y_test_pred)))
print('RMSE: {:.5f}'.format(RMSE(y_test, y_test_pred)))


MSE: 1554.25499
RMSE: 39.42404

The RMSE tells us, roughly, how far the predictions are from the real values on average, expressed in the same units as the target.

9. The RMSE computed in the last problem doesn't actually tell us much by itself. Is +/- 39.4 runs a small error, or a really large error? Come up with a solution to this problem and report/interpret the new result.

We could use a relative error, obtained by dividing the RMSE by the mean value of y (the RMSE is in the same units as the values being predicted); this gives an error of about 6%.


In [60]:
RMSE(y_test, y_test_pred) / y_test.mean() # from solution: this is called Normalized RMSE


Out[60]:
0.057700473432857162

Optional Exercises

10. Create and run a function to select the best subset from the current data using k-fold CV, and based on RMSE on test sets. Do not use any functions from scikit-learn. The function need only return the list of best features and the best score, and should use only simple forward selection.


In [68]:
# solution:
def partition(n, k):
    groups = []
    g = []
    # iterate through
    for i in range(n):
        # if group full and not last index
        if (i%int(n/k) == 0) & (i > 0) & (i != n-1):
            groups.append(g)
            g = []
            g.append(i)
            continue
        # if last index
        if i == n-1:
            g.append(i)
            groups.append(g)
            continue
        g.append(i)
    return groups


def best_subset(X, y, attrs, k):
    from sklearn.linear_model import LinearRegression
    
    # init score and list for subset
    current_score = 1e9
    best_score = 1e9
    best_subset = []
    
    # partition indices into k-folds
    n = len(y)
    folds = partition(n, k)
    
    # iterate through attributes and folds to train/test
    first = True # first attribute tested
    # keep track of the current best attr
    best_attr = None
    # init list of current attrs -- attributes to be tested
    current_attrs = attrs[:] # create copy
    
    def loops():
        # outer-scope vars that we need to modify
        nonlocal current_score, best_score, best_subset, first, best_attr, current_attrs
        
        for a in current_attrs:
            # keep score for folds
            scores = []
            # cross-val
            for i,f in enumerate(folds):
                # get all indices not in the left-out fold
                G = [folds[j] for j in range(k) if not j == i]
                # combine the groups
                mask = [i for g in G for i in g]
                
                # create numerical mask for kept and currently tested attrs
                if first:
                    attr_mask = [attrs.index(a)]
                else:
                    attr_mask = [attrs.index(x) for x in best_subset] + [attrs.index(a)]
                
                # train group
                Xtr = X[mask][:, attr_mask]
                ytr = y[mask]
                # test group
                Xts = X[folds[i]][:, attr_mask]
                yts = y[folds[i]]
                
                # reshape if only one attr
                if len(attr_mask) == 1:
                    Xtr = Xtr.reshape(-1, 1)
                    Xts = Xts.reshape(-1, 1)
                
                # fit model
                mod = LinearRegression()
                mod.fit(Xtr, ytr)
                
                # score test set below
                # get predictions
                yhat = mod.predict(Xts)
                # use the RMSE function we wrote earlier
                rmse = RMSE(yts, yhat)
                # append rmse to scores
                scores.append(rmse)
            # compute mean of scores after last fold
            score = np.mean(scores)
            if first:
                current_score = score
                best_attr = a
                first = False
            elif score < current_score:
                current_score = score
                best_attr = a
        
        # append attr that provided model with best score in this loop
        if current_score < best_score:
            best_score = current_score
            best_subset.append(best_attr)
            # update current attrs
            current_attrs = [c for c in current_attrs if c not in best_subset]
    
    
    for n in range(len(attrs)):
        loops()
    
    return best_score, best_subset

In [74]:
# solution continues:
cols_forw = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_forw_sol = team_train[cols_forw].values
y_forw_sol = team_train.r.values.reshape(-1, 1)

score, best = best_subset(X_forw_sol, y_forw_sol, cols_forw, 10)
print(score, best)


31.5395437302 ['h', 'bb', 'bpf', 'ppf', 'ha', 'double', 'hra', 'sb', 'dp']

In [77]:
# solution continues:
# train regression on all the data
OLS_forw_sol = LinearRegression()
X_forw_sol = team_train[best].values
OLS_forw_sol.fit(X_forw_sol, y_forw_sol)
# and test it on test data to see the RMSE
X_forw_test_sol = team_test[best].values
y_forw_test_sol = team_test.r.values.reshape(-1, 1)
RMSE(y_forw_test_sol, OLS_forw_sol.predict(X_forw_test_sol))


Out[77]:
39.234716811605196

In [70]:
def k_fold(model, X, y, k):
    '''calculate the mean RMSE for a model using cross-validation.
    Note: as written, this builds len(X) // k contiguous folds of k rows each,
    so k is the fold size rather than the number of folds.'''
    # number of folds (each of size k)
    n = len(X) // k
    # transforming the dataframe into an array
    X_arr = np.array(X.values)
    score = []
    # for every fold: split the data into train and test sets, fit the model and evaluate
    for i in range(n):
        # everything except the i-th fold
        Xtrain = np.vstack((X_arr[0:i*k, :], X_arr[(i+1)*k:, :]))
        # the i-th fold
        Xtest = X_arr[i*k:(i+1)*k, :]
        ytrain = np.vstack((y[0:i*k, :], y[(i+1)*k:, :]))
        ytest = y[i*k:(i+1)*k, :]
        model.fit(Xtrain, ytrain)
        curr_score = RMSE(ytest, model.predict(Xtest))
        score.append(curr_score)
    # returning the mean score
    return np.array(score).mean()

All of this is just to verify that all this kfold CV stuff makes sense (and that my functions are somewhat correct):


In [35]:
# choose some columns
cols_1 = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_kfold_1 = team_train[cols_1]
y_kfold_1 = team_train.r.values.reshape(-1, 1)
# perform kfold CV on it and see the result
OLS_kfold = LinearRegression()
k_fold(OLS_kfold, X_kfold_1, y_kfold_1, 5)


Out[35]:
30.961743820507667
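As a cross-check (the exercise's no-scikit-learn rule is for the selection function, not for verifying it), scikit-learn should reproduce this mean RMSE given the same contiguous folds; a sketch assuming scikit-learn >= 0.22 for the RMSE scorer:

In [ ]:
from sklearn.model_selection import KFold, cross_val_score

# 180 rows // 5 = 36 contiguous folds of 5 rows, mirroring k_fold(..., k=5) above
cv = KFold(n_splits=len(X_kfold_1) // 5, shuffle=False)
scores = cross_val_score(LinearRegression(), X_kfold_1, y_kfold_1.ravel(),
                         cv=cv, scoring='neg_root_mean_squared_error')
print(-scores.mean())  # should reproduce the ~30.96 above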

In [36]:
# train regression on all the data
OLS_kfold.fit(X_kfold_1, y_kfold_1)
# and test it on test data to see the RMSE
X_kfold_1_test = team_test[cols_1]
y_kfold_1_test = team_test.r.values.reshape(-1, 1)
RMSE(y_kfold_1_test, OLS_kfold.predict(X_kfold_1_test))


Out[36]:
39.424040807911098

In [37]:
# again on other columns
cols_2 = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf']
# training set and kfold CV
X_kfold_2 = team_train[cols_2]
y_kfold_2 = team_train.r.values.reshape(-1, 1)
k_fold(OLS_kfold, X_kfold_2, y_kfold_2, 5)


Out[37]:
30.63794636543934

This has a better CV score, so it should perform better on the test set:


In [38]:
# again training on all data
OLS_kfold.fit(X_kfold_2, y_kfold_2)
# and testing it on test set
X_kfold_2_test = team_test[cols_2]
y_kfold_2_test = team_test.r.values.reshape(-1, 1)
RMSE(y_kfold_2_test, OLS_kfold.predict(X_kfold_2_test))


Out[38]:
39.390007566525419

And indeed, the RMSE is slightly better on unseen data!

Now all I have to do is forward selection:


In [71]:
def add_one_feature(model, X, y, selected_features, additional_features, k=5):
    '''given a model, the currently selected features and the remaining ones,
    returns the best score and feature set obtained by adding one feature.
    It uses k-fold CV to determine the best set'''
    best_score = -1
    best_features = []
    for feature in additional_features:
        # candidate set: the selected features plus one of the remaining ones
        curr_features = selected_features + [feature]
        # selecting the data and performing k-fold CV
        curr_X = X[curr_features]
        curr_score = k_fold(model, curr_X, y, k)
        # update the best score and features on the first pass or when improved
        if best_score == -1 or best_score >= curr_score:
            best_score = curr_score
            best_features = curr_features
    return best_score, best_features

def forward_sel(model, X, y, features, k=5):
    '''performs forward selection on the list of features, determining the best set using k-fold CV'''
    best_score = -1
    best_features = []
    # keep track of the best set of size i so that forward selection keeps growing it
    last_i_best_features = []
    for i in range(len(features)):
        # the features yet to be tried: all features minus the current best set
        additional_features = list(set(features) - set(last_i_best_features))
        # best score and best feature set of size i+1
        curr_score, curr_features = add_one_feature(model, X, y, last_i_best_features, additional_features, k)
        # grow the working set; update the overall best on the first pass or when improved
        last_i_best_features = curr_features
        if best_score == -1 or best_score >= curr_score:
            best_features = curr_features
            best_score = curr_score
    return best_score, best_features

In [72]:
# choose some columns
cols_forw = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_forw = team_train[cols_forw]
y_forw = team_train.r.values.reshape(-1, 1)
# perform forward selection on it and see the result
OLS_forward_sel = LinearRegression()
best_score, best_features = forward_sel(OLS_forward_sel, X_forw, y_forw, cols_forw)
print(best_score, best_features)


30.4125952229 ['h', 'bb', 'double', 'bpf', 'ppf', 'ha', 'hra', 'sb']

And it does slightly better than the ones above... Wow, seems like it's working pretty well!


In [78]:
# train regression on all the data
X_forw = team_train[best_features]
OLS_forward_sel.fit(X_forw, y_forw)
# and test it on test data to see the RMSE
X_forw_test = team_test[best_features]
y_forw_test = team_test.r.values.reshape(-1, 1)
RMSE(y_forw_test, OLS_forward_sel.predict(X_forw_test))


Out[78]:
39.312801031266886

11. Model the same data as in (7), only this time compare the previous results to a Lasso model using the Lasso class from scikit-learn. Then use the LassoCV with 10-fold CV to determine alpha and compare results again.


In [79]:
from sklearn.linear_model import Lasso, LassoCV
# create and train model
# lasso = Lasso(alpha=0.1, normalize=True, max_iter=1000, random_state=42)
lasso = Lasso()
lasso.fit(X_multvar, y_multvar)
# predict and calculate R2 score
lasso_pred = lasso.predict(X_multvar)
r2_score(y_multvar, lasso_pred)
# previous r2 score: 0.8287621915536203


Out[79]:
0.82648255292413808

In [80]:
lasso.coef_


Out[80]:
array([  3.51806173e-01,   1.29514181e+01,   4.56074106e-01,
         1.59588033e-01,   8.95301628e-03,   5.43105862e-01,
         1.06576556e-01,   3.30622847e-01,  -1.10980596e+01,
        -8.62103423e-02,   1.57063313e-01,  -1.96090861e-01,
         1.58526788e-01])

In [81]:
lasso_pred_test = lasso.predict(X_test)
RMSE(y_test, lasso_pred_test)

# previous RMSE: 39.4240408079


Out[81]:
93.604129815930108

The train $R^2$ is nearly the same, but the test RMSE is far worse; something isn't right...
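One plausible culprit: Lasso penalizes the coefficients on the raw feature scales, so features with large coefficients (like the park factors) get shrunk very differently from the counting stats. A hedged sketch of the usual remedy, standardizing before penalizing (the alpha here is an arbitrary choice):

In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# standardize so the L1 penalty treats every feature on the same scale
lasso_scaled = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso_scaled.fit(X_multvar, y_multvar.ravel())
print(RMSE(y_test.ravel(), lasso_scaled.predict(X_test)))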


In [85]:
# lassocv = LassoCV(alphas=None, cv=10, max_iter=1000, normalize=True, random_state=42)
lassocv = LassoCV(cv=10)
lassocv.fit(X_multvar, y_multvar.ravel())
print(lassocv.alpha_)
print(lassocv.score(X_multvar, y_multvar))


14.0327489078
0.806531918212

Also, the alpha determined by CV (≈14.03) is much larger than the arbitrary default, yet the test RMSE barely improves...


In [84]:
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_multvar, y_multvar)
RMSE(y_test, lasso.predict(X_test))


Out[84]:
92.180240393931655

In [88]:
# solution:
yhat = lassocv.predict(X_test)
RMSE(y_test, yhat)
# in the video they get 35.11 but the rest of the results are identical to the solution, why???


Out[88]:
92.180240393931655

12. Create a scatter matrix of the merged set "data". Take note of the curvature of the r - h pair. Model r as a polynomial function of h, and compare this to a linear model using the Standard Error of the Regression (S).

We should not use $R^2$ to compare non-linear models: adding polynomial terms can only push it up on the training data!


In [47]:
sns.pairplot(player_batting);


Here we take a better look at the relationship between h and r. The relationship looks at least quadratic, so maybe a third- or fourth-degree polynomial will fit.


In [48]:
sns.pairplot(player_batting[['r', 'h']]);



In [89]:
from sklearn.preprocessing import PolynomialFeatures

# creating the model
OLS_poly = LinearRegression()
# creating training sets
X_poly = player_batting.h.values.reshape(-1, 1)
y_poly = player_batting.r.values.reshape(-1, 1)

# create polynomial features
fourth = PolynomialFeatures(degree=4)
X_fourth = fourth.fit_transform(X_poly)

# linear model comparison training, predicting and scoring
OLS_poly_comp = LinearRegression()
OLS_poly_comp.fit(X_poly, y_poly)
y_poly_test_pred = OLS_poly_comp.predict(X_poly)
print(r2_score(y_poly, y_poly_test_pred))

# fit features and predict
OLS_poly.fit(X_fourth, y_poly)
y_fourth_pred = OLS_poly.predict(X_fourth)
print(r2_score(y_poly, y_fourth_pred))


0.955145788814
0.955889108784

In [92]:
# S is essentially the RMSE (it divides by n - k - 1 instead of n), so I'll reuse my previous function
print(RMSE(y_poly, y_poly_test_pred))
print(RMSE(y_poly, y_fourth_pred))
print(RMSE(y_poly, y_fourth_pred)/RMSE(y_poly, y_poly_test_pred))


5.56276050523
5.51647516054
0.99167942883

$S$ is essentially the RMSE, and here both metrics agree: the fourth-degree polynomial fits slightly better than the linear model, though only marginally (the ratio of the two $S$ values is about 0.992).
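Strictly, $S$ carries a degrees-of-freedom correction that plain RMSE does not, $S = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}}$ for $k$ predictors; with $n \approx 16{,}700$ rows the correction is negligible, which is why reusing the RMSE function is a reasonable shortcut here.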