Regression: Linear and Non-Linear

1. Create new sets

a) A training set of the team data containing only the columns listed below and year in interval [2005, 2010]. Create a scatter matrix (pair plot) and heat map of this data and answer the following:

  • Why might ha (hitting average) appear to be negatively correlated with w (wins)?
  • If we wanted to predict w (wins), what might be the best three predictors to model first? Why?

columns:

['bb', 'bpf', 'cs', 'double', 'dp', 'e', 'h', 'ha', 'hr', 'hra', 'l', 'ppf', 'r', 'ra', 'sb', 'sf', 'so', 'soa', 'triple', 'w', 'year']

b) A merge of player and batting data containing only the columns listed below and year >= 2005

columns:

['ab', 'r', 'h', 'double', 'triple', 'hr', 'rbi', 'sb', 'cs', 'bb']


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
team = pd.read_csv('team.csv')

In [4]:
team_cols = ['bb', 'bpf', 'cs', 'double', 'dp', 'e', 'h', 'ha', 'hr', 'hra', 'l',
             'ppf', 'r', 'ra', 'sb', 'sf', 'so', 'soa', 'triple', 'w', 'year']

team_train = team.loc[team.year.isin(np.arange(2005, 2011)), team_cols].reset_index(drop=True)

In [5]:
team_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 21 columns):
bb        180 non-null int64
bpf       180 non-null int64
cs        180 non-null float64
double    180 non-null int64
dp        180 non-null float64
e         180 non-null int64
h         180 non-null int64
ha        180 non-null int64
hr        180 non-null int64
hra       180 non-null int64
l         180 non-null int64
ppf       180 non-null int64
r         180 non-null int64
ra        180 non-null int64
sb        180 non-null float64
sf        180 non-null float64
so        180 non-null float64
soa       180 non-null int64
triple    180 non-null int64
w         180 non-null int64
year      180 non-null int64
dtypes: float64(5), int64(16)
memory usage: 29.6 KB

In [6]:
team_train.head()


Out[6]:
bb bpf cs double dp e h ha hr hra ... ppf r ra sb sf so soa triple w year
0 606 103 26.0 291 159.0 94 1419 1580 191 193 ... 105 696 856 67.0 45.0 1094.0 1038 27 77 2005
1 534 101 32.0 308 170.0 86 1453 1487 184 145 ... 100 769 674 92.0 46.0 1084.0 929 37 90 2005
2 447 99 37.0 296 154.0 107 1492 1458 189 180 ... 99 729 800 83.0 42.0 902.0 1052 27 74 2005
3 653 104 12.0 339 135.0 109 1579 1550 199 164 ... 104 910 805 45.0 63.0 1044.0 959 21 95 2005
4 435 103 67.0 253 166.0 94 1450 1392 200 167 ... 103 741 645 137.0 49.0 1002.0 1040 23 99 2005

5 rows × 21 columns


In [15]:
# solutions:
from pandas.plotting import scatter_matrix  # pandas.tools.plotting is deprecated/removed in modern pandas
scatter_matrix(team_train, figsize=(24, 24));



In [6]:
sns.pairplot(team_train);


According to the documentation, ha is not hitting average but hits allowed, i.e. a measure of how well the opposing batters did against the team's pitching.

It's only natural that it is negatively correlated with wins.


In [7]:
fig, ax = plt.subplots(figsize=(16, 10)) 
ax = sns.heatmap(team_train.corr(), annot=True);


Obviously losses (l) has a correlation of -1 with wins, but I'm assuming it can't be used to model wins. So I'm going with runs (r), runs allowed (ra), and a third feature, which could be either errors (e) or hits allowed (ha); of those two I think errors is the better choice because hits allowed is highly correlated with ra.

Correlation doesn't imply causation, though.
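A quick sanity check on this choice (a minimal sketch using the team_train frame built above):

In [ ]:
# correlations of the candidate predictors with wins
print(team_train[['r', 'ra', 'e', 'ha']].corrwith(team_train.w))
# ha and ra carry overlapping information
print(team_train[['ha', 'ra']].corr())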


In [19]:
# A merge of player and batting data containing only the columns listed below and year >= 2005
player_batting_cols = ['ab', 'r', 'h', 'double', 'triple', 'hr', 'rbi', 'sb', 'cs', 'bb']

player = pd.read_csv('player.csv')
batting = pd.read_csv('batting.csv')

player_batting = pd.merge(left=player, right=batting, on='player_id')
# note: this filter starts at 2004, one year earlier than the prompt's year >= 2005
player_batting = player_batting.loc[player_batting.year >= 2004, ['name_given', 'name_last', 'year'] + player_batting_cols]

In [20]:
player_batting.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 16695 entries, 0 to 101331
Data columns (total 13 columns):
name_given    16695 non-null object
name_last     16695 non-null object
year          16695 non-null int64
ab            16695 non-null float64
r             16695 non-null float64
h             16695 non-null float64
double        16695 non-null float64
triple        16695 non-null float64
hr            16695 non-null float64
rbi           16695 non-null float64
sb            16695 non-null float64
cs            16695 non-null float64
bb            16695 non-null float64
dtypes: float64(10), int64(1), object(2)
memory usage: 1.8+ MB

In [21]:
player_batting.head()


Out[21]:
name_given name_last year ab r h double triple hr rbi sb cs bb
0 David Allan Aardsma 2004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 David Allan Aardsma 2006 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 David Allan Aardsma 2007 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 David Allan Aardsma 2008 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 David Allan Aardsma 2009 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2. Using the OLS LinearRegression in the scikit-learn library

  • Regress r onto double using the train set (create a linear model to predict r from double)
  • What is the equation for the model? Interpret the slope.
  • What is the $R^2$ value and what does it tell us?

In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [23]:
# creating the model
OLS_1var = LinearRegression()
# creating training sets
X_1var = team_train.double.values.reshape(-1, 1)
y_1var = team_train.r.values.reshape(-1, 1)
# fitting the model
OLS_1var.fit(X_1var, y_1var)


Out[23]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [27]:
# making predictions
y_1var_pred = OLS_1var.predict(X_1var)
print('Model equation is: y = {:.5f}x + {:.5f}'.format(float(OLS_1var.coef_), float(OLS_1var.intercept_)))


Model equation is: y = 1.55752x + 290.75819

The slope is the expected change in r per unit change in double: each additional team double is associated with roughly 1.56 additional runs.


In [28]:
r2_score(y_1var, y_1var_pred)


Out[28]:
0.31120469036164045

The $R^2$ value tells us what proportion of the variance in $y$ is explained by the model; here, only about 31% of the variance in runs is accounted for by doubles.
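For reference, $R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}$; a minimal sketch verifying r2_score by hand, using the arrays from above:

In [ ]:
ss_res = np.sum((y_1var - y_1var_pred)**2)    # residual sum of squares
ss_tot = np.sum((y_1var - y_1var.mean())**2)  # total sum of squares
print(1 - ss_res/ss_tot)  # should match r2_score above (~0.311)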

3. Plot the data points and the line of best fit. Create a residual plot for residuals vs the independent variable (double) in the model you just created and interpret it.

  • Is the distribution of residuals random?
  • Is there any heteroscedasticity?
  • Would you say the data used in this model is well-suited to a linear model?

In [26]:
# data for the line plot
X_1var_plot = np.linspace(X_1var.min()-10, X_1var.max()+10, 1000)
y_1var_plot = OLS_1var.predict(X_1var_plot.reshape(-1, 1))

fig, ax = plt.subplots(figsize=(16, 10))
ax = plt.scatter(X_1var, y_1var)
ax = plt.plot(X_1var_plot, y_1var_plot, 'r')
# solution:
# ax = plt.plot(X_1var, y_1var_pred, 'r')



In [29]:
fig, ax = plt.subplots(figsize=(16, 10))
# residuals (y - yhat) against the independent variable, as the question asks
ax = plt.scatter(X_1var, y_1var - y_1var_pred)
ax = plt.hlines(y=0, xmin=X_1var.min(), xmax=X_1var.max(), lw=2, color='red')


The residuals seem randomly distributed and there is no recognizable pattern, so there is no obvious heteroscedasticity.

The data shows a mild linear trend but, as the $R^2$ score tells us, this is not a very strong model.

Solution: it seems to work well with a linear model.

4. Create one function that returns the standard error of the estimated parameters in a bivariate model, and create another that returns 95% confidence intervals for these parameters. Compute and interpret these intervals for the model above, then compute the t-stat and p-value for $\hat \beta_1$.

Solution:

$Var(\hat \beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \overline x)^2}$

$Var(\hat \beta_0) = \frac{\sigma^2 n^{-1} \sum_{i=1}^{n}x_i^2}{\sum_{i=1}^{n}(x_i - \overline x)^2}$
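The standard errors are the square roots of these variances, with the error variance estimated as $\hat \sigma^2 = \frac{1}{n-k}\sum_{i=1}^{n} e_i^2$ for $k$ estimated parameters ($k = 2$ here). A 95% confidence interval is then $\hat \beta_j \pm z_{0.975}\,SE(\hat \beta_j)$, and the t-stat for the slope is $t = \hat \beta_1 / SE(\hat \beta_1)$.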


In [30]:
from scipy import stats

In [21]:
# just a reminder for me:
# this is the interval for 0.95 confidence and 5 df
t = stats.t.interval(alpha=0.95, df=5)
print(t)
# this is the probability of being under the t-value with 5 df
print(stats.t.cdf(2.57058, df=5))
# this is the inverse of the above: what is the t-value you have probability x of being under with 5 df?
print(stats.t.ppf(0.975, df=5))
# two-tailed p-value: probability of a value at least as extreme as |t|
stats.t.sf(x=np.abs(t), df=5)*2


(-2.5705818366147395, 2.5705818366147395)
0.974999944311
2.57058183661
Out[21]:
array([ 0.05,  0.05])

In [31]:
# solution:
def se_beta(X, y, yhat):
    e = y - yhat
    n = len(X)
    # degrees of freedom
    # + 1 is for the intercept
    k = X.shape[1] + 1
    # compute estimated variance of the regression (sigma_hat^2)
    sig2 = np.sum(e**2) / (n - k)
    # compute variance of b0 and b1
    denom = np.sum((X - X.mean())**2)
    vb1 = sig2 / denom
    vb0 = (sig2/n)*np.sum(X**2) / denom
    
    return np.sqrt(vb0), np.sqrt(vb1)

def interval_beta(b0, b1, se_b0, se_b1):
    # two-sided 95% interval -> use the 97.5th percentile of the normal
    z = stats.norm.ppf(0.975)
    # create intervals [b -/+ z*se]
    itv_b0 = b0 + z*np.array([-se_b0, se_b0])
    itv_b1 = b1 + z*np.array([-se_b1, se_b1])
    
    return itv_b0, itv_b1

In [34]:
interval_beta(float(OLS_1var.intercept_), float(OLS_1var.coef_), *se_beta(X_1var, y_1var, y_1var_pred))


Out[34]:
(array([ 189.3297,  392.1867]), array([ 1.2171,  1.8979]))

In [36]:
# get se_b1
_, se_b1 = se_beta(X_1var, y_1var, y_1var_pred)

# t-stat
t = float(OLS_1var.coef_) / se_b1

# one-tailed p-value using the normal cdf (double it for a two-tailed test)
p = stats.norm.cdf(-abs(t))

print(t, p)


8.96784260929 1.51189360457e-19

In [22]:
def standard_error(model, X, y):
    # note: X has no intercept column here, which understates the SE (see the sketch below)
    MSE = ((model.predict(X) - y)**2).mean()
    return np.sqrt(MSE * np.linalg.inv(X.T.dot(X)))

def conf_intervals(model, X, y, alpha=0.05):
    se = standard_error(model, X, y)
    n = len(X)
    k = len(model.coef_)
    # t-value for half of 1 minus the desired alpha (two-tailed interval)
    t_val = stats.t.ppf(1-alpha/2, df=n-(k+1))
    return model.coef_ - se * t_val, model.coef_ + se * t_val

# 95% confidence interval for previous model
# I'm missing the intercept... (see the sketch after this cell's output)
print(conf_intervals(OLS_1var, X_1var, y_1var))

# t-stat and p-value for previous model slope
t_stat = OLS_1var.coef_ / standard_error(OLS_1var, X_1var, y_1var)
p_val = stats.t.sf(x=t_stat, df=178)

print('t-stat: {:.5f}\np-value: {:.5f}'.format(float(t_stat), float(p_val)))


(array([[ 1.52798695]]), array([[ 1.58705744]]))
t-stat: 104.06497
p-value: 0.00000
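The huge t-stat here is exactly the missing-intercept problem flagged in the comment: without a column of ones in X, the $(X^TX)^{-1}$ term understates the standard error. A minimal sketch of the fix, reusing the arrays from question 2; the printed SEs should match the se_beta values above:

In [ ]:
# prepend a column of ones so the intercept is part of the design matrix
X_aug = np.hstack([np.ones((len(X_1var), 1)), X_1var])
resid = y_1var - OLS_1var.predict(X_1var)
sig2 = (resid**2).sum() / (len(X_aug) - X_aug.shape[1])  # unbiased sigma_hat^2
cov = sig2 * np.linalg.inv(X_aug.T.dot(X_aug))           # var-cov matrix of [b0, b1]
print(np.sqrt(np.diag(cov)))                             # SEs for intercept and slope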

5. Create a RANSAC model for the same data used in question 2, and compare its R-squared value to the first model. Be sure to get the indices of inliers.


In [45]:
from sklearn.linear_model import RANSACRegressor
# create the RANSAC model
ransac = RANSACRegressor(
                        LinearRegression(),
#                         max_trials=100,
#                         min_samples=50,
#                         loss='absolute_loss',
#                         residual_threshold=5.0,
                        random_state=1234)

# fit the model
ransac.fit(X_1var, y_1var)
# make predictions
ransac_pred = ransac.predict(X_1var)

print('Slope: {:.5f}'.format(float(ransac.estimator_.coef_)))
print('Intercept: {:.5f}'.format(float(ransac.estimator_.intercept_)))
print('R2 score: {:.5f}'.format(r2_score(y_1var, ransac_pred)))

# get the inliers mask
inliers = ransac.inlier_mask_
print(X_1var[inliers][:5])


Slope: 1.86835
Intercept: 183.82008
R2 score: 0.25739
[[291]
 [308]
 [296]
 [335]
 [337]]

In [46]:
# solution
ransac.score(X_1var[inliers], y_1var[inliers])


Out[46]:
0.76198265732964365

In [48]:
# solution
yhat_ransac = ransac.predict(X_1var[inliers])

plt.scatter(X_1var[inliers], y_1var[inliers])
plt.scatter(X_1var[~inliers], y_1var[~inliers], edgecolors='k', facecolors='none')

plt.plot(X_1var, y_1var_pred, 'k', alpha=0.5)
plt.plot(X_1var[inliers], yhat_ransac, 'r')


Out[48]:
[<matplotlib.lines.Line2D at 0x1f824b84c50>]

In [50]:
# solution:
# residuals vs. yhat
resid = y_1var[inliers] - yhat_ransac
plt.scatter(yhat_ransac, resid)
# horizontal line at y = 0
plt.axhline(0, c='r')


Out[50]:
<matplotlib.lines.Line2D at 0x1f824b75c18>

6. Create a linear model to predict r using both double and h as predictors, and do the following

  • Interpret the parameters and give the $R^2$ value.
  • Is there any problem with the two predictors together in this model?
  • Create a 3D plot that shows the data points for both actual and predicted values for r.

In [51]:
X_2var = team_train[['double', 'h']]
y_2var = team_train.r.values.reshape(-1, 1)

OLS_2var = LinearRegression()
OLS_2var.fit(X_2var, y_2var)

print('Coefficients: {}'.format(OLS_2var.coef_))
print('Intercept: {:.5f}'.format(float(OLS_2var.intercept_)))

print('Correlation between predictors:\n{}'.format(X_2var.corr()))

y_2var_pred = OLS_2var.predict(X_2var)

print('R2 score: {:.5f}'.format(r2_score(y_2var, y_2var_pred)))


Coefficients: [[ 0.71934416  0.62245511]]
Intercept: -373.68216
Correlation between predictors:
        double       h
double  1.0000  0.4739
h       0.4739  1.0000
R2 score: 0.62239

In [52]:
sns.pairplot(X_2var);


The two predictors show a mild linear relationship (correlation ≈ 0.47), so some collinearity may be inflating the variance of the coefficient estimates.
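To put a number on it, the variance inflation factor for a two-predictor model reduces to $1/(1-\rho^2)$; a quick sketch:

In [ ]:
rho = X_2var.corr().iloc[0, 1]  # correlation between double and h (~0.47)
print(1 / (1 - rho**2))         # VIF ~1.29, i.e. only mild collinearity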


In [53]:
from mpl_toolkits import mplot3d

In [54]:
# plotting the points
fig = plt.figure(figsize=(16, 8))
ax = plt.axes(projection='3d')
ax.scatter3D(X_2var.double, X_2var.h, y_2var, c=y_2var)
ax.scatter3D(X_2var.double, X_2var.h, y_2var_pred, c=y_2var_pred, cmap='viridis');


7. Create a multiple regression model to predict r using the variables listed below as predictors, and determine the $R^2$ value. Print all parameter values along with their corresponding variable names.

['bb','bpf','double','dp','e','h','ha','hra','ppf','ra','sb','sf','triple']


In [55]:
# model features
mult_regr_cols = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training sets
X_multvar = team_train[mult_regr_cols] # in solution uses .values
y_multvar = team_train.r.values.reshape(-1, 1)
# fit model
OLS_multvar = LinearRegression()
OLS_multvar.fit(X_multvar, y_multvar)
# print parameters and intercept
for i, col in enumerate(mult_regr_cols):
    print('Coefficient for {} is: {}'.format(col, OLS_multvar.coef_[0][i]))

print('Intercept: {:.5f}'.format(float(OLS_multvar.intercept_)))

# predict and print R2
y_multvar_pred = OLS_multvar.predict(X_multvar)
print('R2 score: {:.5f}'.format(r2_score(y_multvar, y_multvar_pred)))


Coefficient for bb is: 0.32532424691943296
Coefficient for bpf is: 18.846529117352258
Coefficient for double is: 0.4397304479295745
Coefficient for dp is: 0.17870078320113036
Coefficient for e is: 0.03940524716282212
Coefficient for h is: 0.5194200382940695
Coefficient for ha is: 0.11292488964728964
Coefficient for hra is: 0.3305064938706632
Coefficient for ppf is: -17.039926128845746
Coefficient for ra is: -0.05576480937751427
Coefficient for sb is: 0.14319056188560886
Coefficient for sf is: -0.18307152600088283
Coefficient for triple is: 0.1927186886122314
Intercept: -715.03548
R2 score: 0.82876

Adjusted $R^2$ would be worth reporting here: we have added a lot of features, and it penalizes the score for each extra predictor, which plain $R^2$ does not.
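A minimal sketch of that adjustment, $\bar R^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, using the fit above:

In [ ]:
n, k = X_multvar.shape  # 180 rows, 13 predictors
r2 = r2_score(y_multvar, y_multvar_pred)
print(1 - (1 - r2) * (n - 1) / (n - k - 1))  # a bit below the raw 0.82876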

8. Create a test set that includes all years >= 2011 and create a set of predictions from this data. Compute the MSE and RMSE for the model above using scikit-learn. Create your own function to compute these scores and compare your outputs to scikit's, and interpret the RMSE.


In [56]:
# create the test dataframe
team_test = team.loc[team.year >= 2011, team_cols].reset_index(drop=True)

In [57]:
# create test sets
X_test = team_test[mult_regr_cols]
y_test = team_test.r.values.reshape(-1, 1)
# make predictions
y_test_pred = OLS_multvar.predict(X_test)

In [58]:
from sklearn.metrics import mean_squared_error
# using scikit-learn
print('MSE: {:.5f}'.format(mean_squared_error(y_test, y_test_pred)))
print('RMSE: {:.5f}'.format(np.sqrt(mean_squared_error(y_test, y_test_pred))))


MSE: 1554.25499
RMSE: 39.42404

In [59]:
def MSE(y, pred):
    return ((y - pred)**2).mean()

def RMSE(y, pred):
    return np.sqrt(((y - pred)**2).mean())
# using my functions
print('MSE: {:.5f}'.format(MSE(y_test, y_test_pred)))
print('RMSE: {:.5f}'.format(RMSE(y_test, y_test_pred)))


MSE: 1554.25499
RMSE: 39.42404

The RMSE tells us, roughly, how far the predictions are from the real values on average, expressed in the same units as the target.

9. The RMSE computed in the last problem doesn't actually tell us much by itself. Is +/- 39.4 runs a small error, or a really large error? Come up with a solution to this problem and report/interpret the new result.

We could use a relative error, obtained by dividing the RMSE by the mean value of y (the RMSE is in the same units as the values being predicted); this gives an error of about 6%.


In [60]:
RMSE(y_test, y_test_pred) / y_test.mean() # from solution: this is called Normalized RMSE


Out[60]:
0.057700473432857162

Optional Exercises

10. Create and run a function to select the best subset from the current data using k-fold CV, and based on RMSE on test sets. Do not use any functions from scikit-learn. The function need only return the list of best features and the best score, and should use only simple forward selection.


In [68]:
# solution:
def partition(n, k):
    groups = []
    g = []
    # iterate through
    for i in range(n):
        # if group full and not last index
        if (i%int(n/k) == 0) & (i > 0) & (i != n-1):
            groups.append(g)
            g = []
            g.append(i)
            continue
        # if last index
        if i == n-1:
            g.append(i)
            groups.append(g)
            continue
        g.append(i)
    return groups


def best_subset(X, y, attrs, k):
    from sklearn.linear_model import LinearRegression
    
    # init score and list for subset
    current_score = 1e9
    best_score = 1e9
    best_subset = []
    
    # partition indices into k-folds
    n = len(y)
    folds = partition(n, k)
    
    # iterate through attributes and folds to train/test
    first = True # first attribute tested
    # keep track of the current best attr
    best_attr = None
    # init list of current attrs -- attributes to be tested
    current_attrs = attrs[:] # create copy
    
    def loops():
        # outer-scope vars that we need to modify
        nonlocal current_score, best_score, best_subset, first, best_attr, current_attrs
        
        for a in current_attrs:
            # keep score for folds
            scores = []
            # cross-val
            for i,f in enumerate(folds):
                # get all indices not in the left-out fold
                G = [folds[j] for j in range(k) if not j == i]
                # combine the groups
                mask = [i for g in G for i in g]
                
                # create numerical mask for kept and currently tested attrs
                if first:
                    attr_mask = [attrs.index(a)]
                else:
                    attr_mask = [attrs.index(x) for x in best_subset] + [attrs.index(a)]
                
                # train group
                Xtr = X[mask][:, attr_mask]
                ytr = y[mask]
                # test group
                Xts = X[folds[i]][:, attr_mask]
                yts = y[folds[i]]
                
                # reshape if only one attr
                if len(attr_mask) == 1:
                    Xtr = Xtr.reshape(-1, 1)
                    Xts = Xts.reshape(-1, 1)
                
                # fit model
                mod = LinearRegression()
                mod.fit(Xtr, ytr)
                
                # score test set below
                # get predictions
                yhat = mod.predict(Xts)
                # use the RMSE function we wrote earlier
                rmse = RMSE(yts, yhat)
                # append rmse to scores
                scores.append(rmse)
            # compute mean of scores after last fold
            score = np.mean(scores)
            if first:
                current_score = score
                best_attr = a
                first = False
            elif score < current_score:
                current_score = score
                best_attr = a
        
        # append attr that provided model with best score in this loop
        if current_score < best_score:
            best_score = current_score
            best_subset.append(best_attr)
            # update current attrs
            current_attrs = [c for c in current_attrs if c not in best_subset]
    
    
    for n in range(len(attrs)):
        loops()
    
    return best_score, best_subset

In [74]:
# solution continues:
cols_forw = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_forw_sol = team_train[cols_forw].values
y_forw_sol = team_train.r.values.reshape(-1, 1)

score, best = best_subset(X_forw_sol, y_forw_sol, cols_forw, 10)
print(score, best)


31.5395437302 ['h', 'bb', 'bpf', 'ppf', 'ha', 'double', 'hra', 'sb', 'dp']

In [77]:
# solution continues:
# train regression on all the data
OLS_forw_sol = LinearRegression()
X_forw_sol = team_train[best].values
OLS_forw_sol.fit(X_forw_sol, y_forw_sol)
# and test it on test data to see the RMSE
X_forw_test_sol = team_test[best].values
y_forw_test_sol = team_test.r.values.reshape(-1, 1)
RMSE(y_forw_test_sol, OLS_forw_sol.predict(X_forw_test_sol))


Out[77]:
39.234716811605196

In [70]:
def k_fold(model, X, y, k):
    '''calculate the mean RMSE for a model using cross-validation.
    Note: as written, this builds len(X) // k contiguous folds of k rows each,
    so k is the fold size rather than the number of folds.'''
    # number of folds (each of size k)
    n = len(X) // k
    # transforming the dataframe into an array
    X_arr = np.array(X.values)
    score = []
    # for every fold: split the data into train and test sets, fit the model and evaluate
    for i in range(n):
        # everything except the i-th fold
        Xtrain = np.vstack((X_arr[0:i*k, :], X_arr[(i+1)*k:, :]))
        # the i-th fold
        Xtest = X_arr[i*k:(i+1)*k, :]
        ytrain = np.vstack((y[0:i*k, :], y[(i+1)*k:, :]))
        ytest = y[i*k:(i+1)*k, :]
        model.fit(Xtrain, ytrain)
        curr_score = RMSE(ytest, model.predict(Xtest))
        score.append(curr_score)
    # returning the mean score
    return np.array(score).mean()

All of this is just to verify that all this kfold CV stuff makes sense (and that my functions are somewhat correct):


In [35]:
# choose some columns
cols_1 = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_kfold_1 = team_train[cols_1]
y_kfold_1 = team_train.r.values.reshape(-1, 1)
# perform kfold CV on it and see the result
OLS_kfold = LinearRegression()
k_fold(OLS_kfold, X_kfold_1, y_kfold_1, 5)


Out[35]:
30.961743820507667
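As a cross-check (the exercise's no-scikit-learn rule is for the selection function, not for verifying it), scikit-learn should reproduce this mean RMSE given the same contiguous folds; a sketch assuming scikit-learn >= 0.22 for the RMSE scorer:

In [ ]:
from sklearn.model_selection import KFold, cross_val_score

# 180 rows // 5 = 36 contiguous folds of 5 rows, mirroring k_fold(..., k=5) above
cv = KFold(n_splits=len(X_kfold_1) // 5, shuffle=False)
scores = cross_val_score(LinearRegression(), X_kfold_1, y_kfold_1.ravel(),
                         cv=cv, scoring='neg_root_mean_squared_error')
print(-scores.mean())  # should reproduce the ~30.96 above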

In [36]:
# train regression on all the data
OLS_kfold.fit(X_kfold_1, y_kfold_1)
# and test it on test data to see the RMSE
X_kfold_1_test = team_test[cols_1]
y_kfold_1_test = team_test.r.values.reshape(-1, 1)
RMSE(y_kfold_1_test, OLS_kfold.predict(X_kfold_1_test))


Out[36]:
39.424040807911098

In [37]:
# again on other columns
cols_2 = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf']
# training set and kfold CV
X_kfold_2 = team_train[cols_2]
y_kfold_2 = team_train.r.values.reshape(-1, 1)
k_fold(OLS_kfold, X_kfold_2, y_kfold_2, 5)


Out[37]:
30.63794636543934

This has a better CV score, so it should perform better on the test set:


In [38]:
# again training on all data
OLS_kfold.fit(X_kfold_2, y_kfold_2)
# and testing it on test set
X_kfold_2_test = team_test[cols_2]
y_kfold_2_test = team_test.r.values.reshape(-1, 1)
RMSE(y_kfold_2_test, OLS_kfold.predict(X_kfold_2_test))


Out[38]:
39.390007566525419

And indeed, the RMSE is slightly better on unseen data!

Now all I have to do is forward selection:


In [71]:
def add_one_feature(model, X, y, selected_features, additional_features, k=5):
    '''given a model, the currently selected features and the remaining ones,
    returns the best score and feature set obtained by adding one feature.
    It uses k-fold CV to determine the best set'''
    best_score = -1
    best_features = []
    for feature in additional_features:
        # candidate set: the selected features plus one of the remaining ones
        curr_features = selected_features + [feature]
        # selecting the data and performing k-fold CV
        curr_X = X[curr_features]
        curr_score = k_fold(model, curr_X, y, k)
        # update the best score and features on the first pass or when improved
        if best_score == -1 or best_score >= curr_score:
            best_score = curr_score
            best_features = curr_features
    return best_score, best_features

def forward_sel(model, X, y, features, k=5):
    '''performs forward selection on the list of features, determining the best set using k-fold CV'''
    best_score = -1
    best_features = []
    # keep track of the best set of size i so that forward selection keeps growing it
    last_i_best_features = []
    for i in range(len(features)):
        # the features yet to be tried: all features minus the current best set
        additional_features = list(set(features) - set(last_i_best_features))
        # best score and best feature set of size i+1
        curr_score, curr_features = add_one_feature(model, X, y, last_i_best_features, additional_features, k)
        # grow the working set; update the overall best on the first pass or when improved
        last_i_best_features = curr_features
        if best_score == -1 or best_score >= curr_score:
            best_features = curr_features
            best_score = curr_score
    return best_score, best_features

In [72]:
# choose some columns
cols_forw = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_forw = team_train[cols_forw]
y_forw = team_train.r.values.reshape(-1, 1)
# perform forward selection on it and see the result
OLS_forward_sel = LinearRegression()
best_score, best_features = forward_sel(OLS_forward_sel, X_forw, y_forw, cols_forw)
print(best_score, best_features)


30.4125952229 ['h', 'bb', 'double', 'bpf', 'ppf', 'ha', 'hra', 'sb']

And it does slightly better than the ones above... Wow, seems like it's working pretty well!


In [78]:
# train regression on all the data
X_forw = team_train[best_features]
OLS_forward_sel.fit(X_forw, y_forw)
# and test it on test data to see the RMSE
X_forw_test = team_test[best_features]
y_forw_test = team_test.r.values.reshape(-1, 1)
RMSE(y_forw_test, OLS_forward_sel.predict(X_forw_test))


Out[78]:
39.312801031266886

11. Model the same data as in (7), only this time compare the previous results to a Lasso model using the Lasso class from scikit-learn. Then use the LassoCV with 10-fold CV to determine alpha and compare results again.


In [79]:
from sklearn.linear_model import Lasso, LassoCV
# create and train model
# lasso = Lasso(alpha=0.1, normalize=True, max_iter=1000, random_state=42)
lasso = Lasso()
lasso.fit(X_multvar, y_multvar)
# predict and calculate R2 score
lasso_pred = lasso.predict(X_multvar)
r2_score(y_multvar, lasso_pred)
# previous r2 score: 0.8287621915536203


Out[79]:
0.82648255292413808

In [80]:
lasso.coef_


Out[80]:
array([  3.51806173e-01,   1.29514181e+01,   4.56074106e-01,
         1.59588033e-01,   8.95301628e-03,   5.43105862e-01,
         1.06576556e-01,   3.30622847e-01,  -1.10980596e+01,
        -8.62103423e-02,   1.57063313e-01,  -1.96090861e-01,
         1.58526788e-01])

In [81]:
lasso_pred_test = lasso.predict(X_test)
RMSE(y_test, lasso_pred_test)

# previous RMSE: 39.4240408079


Out[81]:
93.604129815930108

The train $R^2$ is nearly the same, but the test RMSE is far worse; something isn't right...
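One plausible culprit: Lasso penalizes the coefficients on the raw feature scales, so features with large coefficients (like the park factors) get shrunk very differently from the counting stats. A hedged sketch of the usual remedy, standardizing before penalizing (the alpha here is an arbitrary choice):

In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# standardize so the L1 penalty treats every feature on the same scale
lasso_scaled = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso_scaled.fit(X_multvar, y_multvar.ravel())
print(RMSE(y_test.ravel(), lasso_scaled.predict(X_test)))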


In [85]:
# lassocv = LassoCV(alphas=None, cv=10, max_iter=1000, normalize=True, random_state=42)
lassocv = LassoCV(cv=10)
lassocv.fit(X_multvar, y_multvar.ravel())
print(lassocv.alpha_)
print(lassocv.score(X_multvar, y_multvar))


14.0327489078
0.806531918212

Also, the alpha determined by CV (≈14.03) is much larger than the arbitrary default, yet the test RMSE barely improves...


In [84]:
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_multvar, y_multvar)
RMSE(y_test, lasso.predict(X_test))


Out[84]:
92.180240393931655

In [88]:
# solution:
yhat = lassocv.predict(X_test)
RMSE(y_test, yhat)
# in the video they get 35.11 but the rest of the results are identical to the solution, why???


Out[88]:
92.180240393931655

12. Create a scatter matrix of the merged set "data". Take note of the curvature of the r - h pair. Model r as a polynomial function of h, and compare this to a linear model using the Standard Error of the Regression (S).

We should not use $R^2$ to compare non-linear models: adding polynomial terms can only push it up on the training data!


In [47]:
sns.pairplot(player_batting);


Here we take a better look at the relationship between h and r. The relationship looks at least quadratic, so maybe a third- or fourth-degree polynomial will fit.


In [48]:
sns.pairplot(player_batting[['r', 'h']]);



In [89]:
from sklearn.preprocessing import PolynomialFeatures

# creating the model
OLS_poly = LinearRegression()
# creating training sets
X_poly = player_batting.h.values.reshape(-1, 1)
y_poly = player_batting.r.values.reshape(-1, 1)

# create polynomial features
fourth = PolynomialFeatures(degree=4)
X_fourth = fourth.fit_transform(X_poly)

# linear model comparison training, predicting and scoring
OLS_poly_comp = LinearRegression()
OLS_poly_comp.fit(X_poly, y_poly)
y_poly_test_pred = OLS_poly_comp.predict(X_poly)
print(r2_score(y_poly, y_poly_test_pred))

# fit features and predict
OLS_poly.fit(X_fourth, y_poly)
y_fourth_pred = OLS_poly.predict(X_fourth)
print(r2_score(y_poly, y_fourth_pred))


0.955145788814
0.955889108784

In [92]:
# S is essentially the RMSE (it divides by n - k - 1 instead of n), so I'll reuse my previous function
print(RMSE(y_poly, y_poly_test_pred))
print(RMSE(y_poly, y_fourth_pred))
print(RMSE(y_poly, y_fourth_pred)/RMSE(y_poly, y_poly_test_pred))


5.56276050523
5.51647516054
0.99167942883

$S$ is essentially the RMSE, and here both metrics agree: the fourth-degree polynomial fits slightly better than the linear model, though only marginally (the ratio of the two $S$ values is about 0.992).
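Strictly, $S$ carries a degrees-of-freedom correction that plain RMSE does not, $S = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}}$ for $k$ predictors; with $n \approx 16{,}700$ rows the correction is negligible, which is why reusing the RMSE function is a reasonable shortcut here.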