ha
(hitting average) appear to be negatively correlated with w
(wins)?w
(wins), what might be the best three predictors to model first? Why?columns:
['bb',
'bpf',
'cs',
'double',
'dp',
'e',
'h',
'ha',
'hr',
'hra',
'l',
'ppf',
'r',
'ra',
'sb',
'sf',
'so',
'soa',
'triple',
'w',
'year']
columns
['ab',
'r',
'h',
'double',
'triple',
'hr',
'rbi',
'sb',
'cs',
'bb']
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
team = pd.read_csv('team.csv')
In [4]:
team_cols = ['bb', 'bpf', 'cs', 'double', 'dp', 'e', 'h', 'ha', 'hr', 'hra', 'l',
'ppf', 'r', 'ra', 'sb', 'sf', 'so', 'soa', 'triple', 'w', 'year']
team_train = team.loc[team.year.isin(np.arange(2005, 2011)), team_cols].reset_index(drop=True)
In [5]:
team_train.info()
In [6]:
team_train.head()
Out[6]:
In [15]:
# solutions:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(team_train, figsize=(24, 24));
In [6]:
sns.pairplot(team_train);
According to the documentation ha
is not hitting average but hits allowed, so it's a measure of how well the other team played.
It's only natural that it is negatively correlated with wins.
In [7]:
fig, ax = plt.subplots(figsize=(16, 10))
ax = sns.heatmap(team_train.corr(), annot=True);
Obviously losses has a correlation of -1 with wins, but I'm assuming that it can't be used to model wins. So I'm going with runs (r
), runs allowed (ra
) and a third feature which could be either errors (e
) or hits allowed (ha
); among the latter two I think errors is the best choice because the other is highly correlated to ra
.
Correlation doesn't imply causation
In [19]:
# A merge of player and batting data containing only the columns listed below and year >= 2005
player_batting_cols = ['ab', 'r', 'h', 'double', 'triple', 'hr', 'rbi', 'sb', 'cs', 'bb']
player = pd.read_csv('player.csv')
batting = pd.read_csv('batting.csv')
player_batting = pd.merge(left=player, right=batting, left_on='player_id', right_on='player_id')
player_batting = player_batting.loc[player_batting.year >= 2004, ['name_given', 'name_last', 'year'] + player_batting_cols]
In [20]:
player_batting.info()
In [21]:
player_batting.head()
Out[21]:
In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
In [23]:
# creating the model
OLS_1var = LinearRegression()
# creating training sets
X_1var = team_train.double.values.reshape(-1, 1)
y_1var = team_train.r.values.reshape(-1, 1)
# fitting the model
OLS_1var.fit(X_1var, y_1var)
Out[23]:
In [27]:
# making predictions
y_1var_pred = OLS_1var.predict(X_1var)
print('Model equation is: y = {:.5f}x + {:.5f}'.format(float(OLS_1var.coef_), float(OLS_1var.intercept_)))
The slope of the equation represents the rate of change of r
with respect to double
.
In [28]:
r2_score(y_1var, y_1var_pred)
Out[28]:
The $R^2$ value tells us what proportion of the variance in $y$ is predictable from $X$.
double
) in the model you just created and interpret it.
In [26]:
# data for the line plot
X_1var_plot = np.linspace(X_1var.min()-10, X_1var.max()+10, 1000)
y_1var_plot = OLS_1var.predict(X_1var_plot.reshape(-1, 1))
fig, ax = plt.subplots(figsize=(16, 10))
ax = plt.scatter(X_1var, y_1var)
ax = plt.plot(X_1var_plot, y_1var_plot, 'r')
# solution:
# ax = plt.plot(X_1var, y_1var_pred, 'r')
In [29]:
fig, ax = plt.subplots(figsize=(16, 10))
ax = plt.scatter(y_1var_pred, y_1var_pred - y_1var)
ax = plt.hlines(y=0, xmin=y_1var_pred.min(), xmax=y_1var_pred.max(), lw=2, color='red')
The residuals seems randomly distributed and there doesn't seem to be any recognizable pattern, so there is no heteroscedasticity.
The data presents a slight linearity but, as the $R^2$ score tells us, this is not a very good model.
Solution: it seems to work well with a linear model.
Solution:
$Var(\hat \beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \overline x)^2}$
$Var(\hat \beta_0) = \frac{\sigma^2 n^{-1} \sum_{i=1}^{n}x_i^2}{\sum_{i=1}^{n}(x_i - \overline x)^2}$
In [30]:
from scipy import stats
In [21]:
# just a reminder for me:
# this is the interval for 0.95 confidence and 5 df
t = stats.t.interval(alpha=0.95, df=5)
print(t)
# this is the probability of being under the t-value with 5 df
print(stats.t.cdf(2.57058, df=5))
# this is the inverse of the above: what is the t-value you have probability x of being under with 5 df?
print(stats.t.ppf(0.975, df=5))
# I don't really remember what this is...
stats.t.sf(x=np.abs(t), df=5)*2
Out[21]:
In [31]:
# solution:
def se_beta(X, y, yhat):
e = y - yhat
n = len(X)
# degrees of freedom
# + 1 is for the intercept
k = X.shape[1] + 1
# compute estimated variance of the regression (sigma_hat^2)
sig2 = np.sum(e**2) / (n - k)
# compute variance of b0 and b1
denom = np.sum((X - X.mean())**2)
vb1 = sig2 / denom
vb0 = (sig2/n)*np.sum(X**2) / denom
return np.sqrt(vb0), np.sqrt(vb1)
def interval_beta(b0, b1, se_b0, se_b1):
# compute the z-score for 95%
z = stats.norm.ppf(0.95)
# create intervals [b -/+ z*se]
itv_b0 = b0 + z*np.array([-se_b0, se_b0])
itv_b1 = b1 + z*np.array([-se_b1, se_b1])
return itv_b0, itv_b1
In [34]:
interval_beta(float(OLS_1var.intercept_), float(OLS_1var.coef_), *se_beta(X_1var, y_1var, y_1var_pred))
Out[34]:
In [36]:
# get se_b1
_, se_b1 = se_beta(X_1var, y_1var, y_1var_pred)
# t-stat
t = float(OLS_1var.coef_) / se_b1
# p-val using normal cdf
p = stats.norm.cdf(-abs(t))
print(t, p)
In [22]:
def standard_error(model, X, y):
MSE = ((model.predict(X) - y)**2).mean()
return np.sqrt(MSE * np.linalg.inv(X.T.dot(X)))
def conf_intervals(model, X, y, alpha=0.05):
se = standard_error(model, X, y)
n = len(X)
k = len(model.coef_)
# t-value for half of 1 minus the desired alpha (two-tailed interval)
t_val = stats.t.ppf(1-alpha/2, df=n-(k+1))
return model.coef_ - se * t_val, model.coef_ + se * t_val
# 95% confidence interval for previous model
# I'm missing the intercept...
print(conf_intervals(OLS_1var, X_1var, y_1var))
# t-stat and p-value for previous model slope
t_stat = OLS_1var.coef_ / standard_error(OLS_1var, X_1var, y_1var)
p_val = stats.t.sf(x=t_stat, df=178)
print('t-stat: {:.5f}\np-value: {:.5f}'.format(float(t_stat), float(p_val)))
In [45]:
from sklearn.linear_model import RANSACRegressor
# create the RANSAC model
ransac = RANSACRegressor(
LinearRegression(),
# max_trials=100,
# min_samples=50,
# loss='absolute_loss',
# residual_threshold=5.0,
random_state=1234)
# fit the model
ransac.fit(X_1var, y_1var)
# make predictions
ransac_pred = ransac.predict(X_1var)
print('Slope: {:.5f}'.format(float(ransac.estimator_.coef_)))
print('Intercept: {:.5f}'.format(float(ransac.estimator_.intercept_)))
print('R2 score: {:.5f}'.format(r2_score(y_1var, ransac_pred)))
# get the inliers mask
inliers = ransac.inlier_mask_
print(X_1var[inliers][:5])
In [46]:
# solution
ransac.score(X_1var[inliers], y_1var[inliers])
Out[46]:
In [48]:
# solution
yhat_ransac = ransac.predict(X_1var[inliers])
plt.scatter(X_1var[inliers], y_1var[inliers])
plt.scatter(X_1var[~inliers], y_1var[~inliers], edgecolors='k', facecolors='none')
plt.plot(X_1var, y_1var_pred, 'k', alpha=0.5)
plt.plot(X_1var[inliers], yhat_ransac, 'r')
Out[48]:
In [50]:
# solution:
# residuals vs. yhat
resid = y_1var[inliers] - yhat_ransac
plt.scatter(yhat_ransac, resid)
# horizontal line at y = 0
plt.axhline(0, c='r')
Out[50]:
r
using both double
and h
as predictors, and do the followingr
.
In [51]:
X_2var = team_train[['double', 'h']]
y_2var = team_train.r.values.reshape(-1, 1)
OLS_2var = LinearRegression()
OLS_2var.fit(X_2var, y_2var)
print('Coefficients: {}'.format(OLS_2var.coef_))
print('Intercept: {:.5f}'.format(float(OLS_2var.intercept_)))
print('Correlation between coefficients:\n{}'.format(X_2var.corr()))
y_2var_pred = OLS_2var.predict(X_2var)
print('R2 score: {:.5f}'.format(r2_score(y_2var, y_2var_pred)))
In [52]:
sns.pairplot(X_2var);
The two parameters presents a small linear relationship, so there may be some correlation in the features causing problems in the model.
In [53]:
from mpl_toolkits import mplot3d
In [54]:
# plotting the points
fig = plt.figure(figsize=(16, 8))
ax = plt.axes(projection='3d')
ax.scatter3D(X_2var.double, X_2var.h, y_2var, c=y_2var)
ax.scatter3D(X_2var.double, X_2var.h, y_2var_pred, c=y_2var_pred, cmap='viridis');
In [55]:
# model features
mult_regr_cols = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training sets
X_multvar = team_train[mult_regr_cols] # in solution uses .values
y_multvar = team_train.r.values.reshape(-1, 1)
# fit model
OLS_multvar = LinearRegression()
OLS_multvar.fit(X_multvar, y_multvar)
# print parameters and intercept
for i, col in enumerate(mult_regr_cols):
print('Coefficient for {} is: {}'.format(col, OLS_multvar.coef_[0][i]))
print('Intercept: {:.5f}'.format(float(OLS_multvar.intercept_)))
# predict and print R2
y_multvar_pred = OLS_multvar.predict(X_multvar)
print('R2 score: {:.5f}'.format(r2_score(y_multvar, y_multvar_pred)))
Adjusted $R^2$ can be used in this case because we have added a lot of features and it can be good to account for that.
In [56]:
# create the test dataframe
team_test = team.loc[team.year >= 2011, team_cols].reset_index(drop=True)
In [57]:
# create test sets
X_test = team_test[mult_regr_cols]
y_test = team_test.r.values.reshape(-1, 1)
# make predictions
y_test_pred = OLS_multvar.predict(X_test)
In [58]:
from sklearn.metrics import mean_squared_error
# using scikit-learn
print('MSE: {:.5f}'.format(mean_squared_error(y_test, y_test_pred)))
print('RMSE: {:.5f}'.format(np.sqrt(mean_squared_error(y_test, y_test_pred))))
In [59]:
def MSE(y, pred):
return ((y - pred)**2).mean()
def RMSE(y, pred):
return np.sqrt(((y - pred)**2).mean())
# using my functions
print('MSE: {:.5f}'.format(MSE(y_test, y_test_pred)))
print('RMSE: {:.5f}'.format(RMSE(y_test, y_test_pred)))
The RMSE tells us how much on average the predictions differ from the real values.
We could use the error percentage, obtained dividing the RMSE by the mean value of y (since the RMSE is in the same units as the values being predicted); this gives us an error of about 6%.
In [60]:
RMSE(y_test, y_test_pred) / y_test.mean() # from solution: this is called Normalized RMSE
Out[60]:
In [68]:
# solution:
def partition(n, k):
groups = []
g = []
# iterate through
for i in range(n):
# if group full and not last index
if (i%int(n/k) == 0) & (i > 0) & (i != n-1):
groups.append(g)
g = []
g.append(i)
continue
# if last index
if i == n-1:
g.append(i)
groups.append(g)
continue
g.append(i)
return groups
def best_subset(X, y, attrs, k):
from sklearn.linear_model import LinearRegression
# init score and list for subset
current_score = 1e9
best_score = 1e9
best_subset = []
# partition indices into k-folds
n = len(y)
folds = partition(n, k)
# iterate through attributes and folds to train/test
first = True # first attribute tested
# keep track of the current best attr
best_attr = None
# init list of current attrs -- attributes to be tested
current_attrs = attrs[:] # create copy
def loops():
# outer-scope vars that we need to modify
nonlocal current_score, best_score, best_subset, first, best_attr, current_attrs
for a in current_attrs:
# keep score for folds
scores = []
# cross-val
for i,f in enumerate(folds):
# get all indices non in left out fold
G = [folds[j] for j in range(k) if not j == i]
# combine the groups
mask = [i for g in G for i in g]
# create numerical mask for kept and currently tested attrs
if first:
attr_mask = [attrs.index(a)]
else:
attr_mask = [attrs.index(x) for x in best_subset] + [attrs.index(a)]
# train group
Xtr = X[mask][:, attr_mask]
ytr = y[mask]
# test group
Xts = X[folds[i]][:, attr_mask]
yts = y[folds[i]]
# reshape if only one attr
if len(attr_mask) == 1:
Xtr = Xtr.reshape(-1, 1)
Xts = Xts.reshape(-1, 1)
# fit model
mod = LinearRegression()
mod.fit(Xtr, ytr)
# score test set below
# get predictions
yhat = mod.predict(Xts)
# use mse_rmse func we wrote
rmse = RMSE(yts, yhat)
# append rmse to scores
scores.append(rmse)
# compute mean of scores after last fold
score = np.mean(scores)
if first:
current_score = score
best_attr = a
first = False
elif score < current_score:
current_score = score
best_attr = a
# append attr that provided model with best score in this loop
if current_score < best_score:
best_score = current_score
best_subset.append(best_attr)
# update current attrs
current_attrs = [c for c in current_attrs if c not in best_subset]
for n in range(len(attrs)):
loops()
return best_score, best_subset
In [74]:
# solution continues:
cols_forw = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_forw_sol = team_train[cols_forw].values
y_forw_sol = team_train.r.values.reshape(-1, 1)
score, best = best_subset(X_forw_sol, y_forw_sol, cols_forw, 10)
print(score, best)
In [77]:
# solution continues:
# train regression on all the data
OLS_forw_sol = LinearRegression()
X_forw_sol = team_train[best].values
OLS_forw_sol.fit(X_forw_sol, y_forw_sol)
# and test it on test data to see the RMSE
X_forw_test_sol = team_test[best].values
y_forw_test_sol = team_test.r.values.reshape(-1, 1)
RMSE(y_forw_test_sol, OLS_forw_sol.predict(X_forw_test_sol))
Out[77]:
In [70]:
def k_fold(model, X, y, k):
'''calculate the meand RMSE for a model using k-fold'''
# lenght of a single partition
n = len(X) // k
# transforming the dataframe in an array
X_arr = np.array(X.values)
score = []
# for every partition we split the data in train and test sets, fit the model and evaluate
for i in range(n):
# the first (i-1)-th partions and the last n-(i+1)-th
Xtrain = np.vstack((X_arr[0:i*k, :], X_arr[(i+1)*k:, :]))
# the i-th partition
Xtest = X_arr[i*k:(i+1)*k, :]
ytrain = np.vstack((y[0:i*k, :], y[(i+1)*k:, :]))
ytest = y[i*k:(i+1)*k, :]
model.fit(Xtrain, ytrain)
curr_score = RMSE(ytest, model.predict(Xtest))
score.append(curr_score)
# returning the mean score
return np.array(score).mean()
All of this is just to verify that all this kfold CV stuff makes sense (and that my functions are somewhat correct):
In [35]:
# choose some columns
cols_1 = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_kfold_1 = team_train[cols_1]
y_kfold_1 = team_train.r.values.reshape(-1, 1)
# perform kfold CV on it and see the result
OLS_kfold = LinearRegression()
k_fold(OLS_kfold, X_kfold_1, y_kfold_1, 5)
Out[35]:
In [36]:
# train regression on all the data
OLS_kfold.fit(X_kfold_1, y_kfold_1)
# and test it on test data to see the RMSE
X_kfold_1_test = team_test[cols_1]
y_kfold_1_test = team_test.r.values.reshape(-1, 1)
RMSE(y_kfold_1_test, OLS_kfold.predict(X_kfold_1_test))
Out[36]:
In [37]:
# again on other columns
cols_2 = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf']
# training set and kfold CV
X_kfold_2 = team_train[cols_2]
y_kfold_2 = team_train.r.values.reshape(-1, 1)
k_fold(OLS_kfold, X_kfold_2, y_kfold_2, 5)
Out[37]:
This has a best CV score so it should perform better on the test set:
In [38]:
# again training on all data
OLS_kfold.fit(X_kfold_2, y_kfold_2)
# and testing it on test set
X_kfold_2_test = team_test[cols_2]
y_kfold_2_test = team_test.r.values.reshape(-1, 1)
RMSE(y_kfold_2_test, OLS_kfold.predict(X_kfold_2_test))
Out[38]:
Ditto, the RMSE is slightly better on unseen data!
Now all I have to do is forward selection:
In [71]:
def add_one_feature(model, X, y, selected_features, additional_features, k=5):
'''given a model, the current features and the reamining ones,
it returns the best score and best features obtained adding 1 feature.
It uses k-fold validation to determine the best set'''
best_score = -1
best_features = []
for feature in additional_features:
# initialize the features by adding one of the remaining ones
curr_features = selected_features + [feature]
# selecting data and performing k-fold CV
curr_X = X[curr_features]
curr_score = k_fold(model, curr_X, y, k)
# eventually updating the best scores and features
if best_score == -1:
best_score = curr_score
best_features = curr_features
elif best_score >= curr_score:
best_score = curr_score
best_features = curr_features
return best_score, best_features
def forward_sel(model, X, y, features, k=5):
'''performs forward selection on the list of features determining the best set using k-fold CV'''
best_score = -1
best_features = []
# we keep track of the best set of features for each i in order to keep forward selection going on
last_i_best_features = []
for i in range(len(features)):
# get the features yet to be processed by difference between all the features and the last best ones
additional_features = list(set(features) - set(last_i_best_features))
# calculating best score and i+1 fetures
curr_score, curr_features = add_one_feature(model, X, y, last_i_best_features, additional_features, k)
# updating last best set of features and eventually best score and features
last_i_best_features = curr_features
if best_score == -1:
best_features = curr_features
best_score = curr_score
elif best_score >= curr_score:
best_features = curr_features
best_score = curr_score
return best_score, best_features
In [72]:
# choose some columns
cols_forw = ['bb', 'bpf', 'double', 'dp', 'e', 'h', 'ha', 'hra', 'ppf', 'ra', 'sb', 'sf', 'triple']
# create training set
X_forw = team_train[cols_forw]
y_forw = team_train.r.values.reshape(-1, 1)
# perform forward selection on it and see the result
OLS_forward_sel = LinearRegression()
best_score, best_features = forward_sel(OLS_forward_sel, X_forw, y_forw, cols_forw)
print(best_score, best_features)
And it does slightly better than the ones above... Wow, seems like it's working pretty well!
In [78]:
# train regression on all the data
X_forw = team_train[best_features]
OLS_forward_sel.fit(X_forw, y_forw)
# and test it on test data to see the RMSE
X_forw_test = team_test[best_features]
y_forw_test = team_test.r.values.reshape(-1, 1)
RMSE(y_forw_test, OLS_forward_sel.predict(X_forw_test))
Out[78]:
In [79]:
from sklearn.linear_model import Lasso, LassoCV
# create and train model
# lasso = Lasso(alpha=0.1, normalize=True, max_iter=1000, random_state=42)
lasso = Lasso()
lasso.fit(X_multvar, y_multvar)
# predict and calculate R2 score
lasso_pred = lasso.predict(X_multvar)
r2_score(y_multvar, lasso_pred)
# previous r2 score: 0.8287621915536203
Out[79]:
In [80]:
lasso.coef_
Out[80]:
In [81]:
lasso_pred_test = lasso.predict(X_test)
RMSE(y_test, lasso_pred_test)
# previous RMSE: 39.4240408079
Out[81]:
$R^2$ is better but RMSE is not, something isn't right...
In [85]:
# lassocv = LassoCV(alphas=None, cv=10, max_iter=1000, normalize=True, random_state=42)
lassocv = LassoCV(cv=10)
lassocv.fit(X_multvar, y_multvar.ravel())
print(lassocv.alpha_)
print(lassocv.score(X_multvar, y_multvar))
Also, the RMSE for the alpha determined by CV is higher than the alpha randomly chosen...
In [84]:
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_multvar, y_multvar)
RMSE(y_test, lasso.predict(X_test))
Out[84]:
In [88]:
# solution:
yhat = lassocv.predict(X_test)
RMSE(y_test, yhat)
# in the video they get 35.11 but the rest of the results are identical to the solution, why???
Out[88]:
In [47]:
sns.pairplot(player_batting);
Here we take a better look at the relationship between h
and r
. There seems to be a quadratic relationship or more, maybe I can use a third or fourth grade polynomial.
In [48]:
sns.pairplot(player_batting[['r', 'h']]);
In [89]:
from sklearn.preprocessing import PolynomialFeatures
# creating the model
OLS_poly = LinearRegression()
# creating training sets
X_poly = player_batting.h.values.reshape(-1, 1)
y_poly = player_batting.r.values.reshape(-1, 1)
# create polynomial features
fourth = PolynomialFeatures(degree=4)
X_fourth = fourth.fit_transform(X_poly)
# linear model comparison training, predicting and scoring
OLS_poly_comp = LinearRegression()
OLS_poly_comp.fit(X_poly, y_poly)
y_poly_test_pred = OLS_poly_comp.predict(X_poly)
print(r2_score(y_poly, y_poly_test_pred))
# fit features and predict
OLS_poly.fit(X_fourth, y_poly)
y_fourth_pred = OLS_poly.predict(X_fourth)
print(r2_score(y_poly, y_fourth_pred))
In [92]:
# S is the same as RMSE, so I'll reuse my previous function
print(RMSE(y_poly, y_poly_test_pred))
print(RMSE(y_poly, y_fourth_pred))
print(RMSE(y_poly, y_fourth_pred)/RMSE(y_poly, y_poly_test_pred))
$S$ is another name for RMSE, and you can see that $R^2$ tells us that the linear model is better but $S$ gives us the opposite result.