Welcome to the practical section of module 4.3. Here we'll continue with the advertising-sales dataset to investigate the ideas of regularization and model evaluation. We'll pick up the multivariate regression model we built in the previous module, tune its regularization parameter to achieve the most accurate model, and evaluate that accuracy using better metrics than the MSE we've been using in the previous modules.
In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 10)
In the following you'll see the same code (without the visualization) we wrote in the previous module for the regression model using both TV and Newspaper data, so there's nothing new here except for the part where we prepare our data: we'll now be splitting the dataset into three parts instead of two:
In [3]:
def scale_features(X, scalar=None):
    # reshape a 1D array into a column vector so StandardScaler accepts it
    if len(X.shape) == 1:
        X = X.reshape(-1, 1)
    # if no fitted scaler was passed in, fit a new one on X
    if scalar is None:
        scalar = StandardScaler()
        scalar.fit(X)
    return scalar.transform(X), scalar
In [3]:
# get the advertising data set
dataset = pd.read_csv('../datasets/Advertising.csv')
dataset = dataset[["TV", "Radio", "Newspaper", "Sales"]] # filtering the Unnamed index column out of the dataset
In [7]:
dataset_size = len(dataset)
training_size = np.floor(dataset_size * 0.6).astype(int)
validation_size = np.floor(dataset_size * 0.2).astype(int)
# First we split the dataset into three parts: training, validation and test
X_training = dataset[["TV", "Newspaper"]][:training_size]
y_training = dataset["Sales"][:training_size]
X_validation = dataset[["TV", "Newspaper"]][training_size:training_size + validation_size]
y_validation = dataset["Sales"][training_size:training_size + validation_size]
X_test = dataset[["TV", "Newspaper"]][training_size + validation_size:]  # the test set is the remaining 20%
y_test = dataset["Sales"][training_size + validation_size:]
# Second we apply feature scaling on X_training, X_validation and X_test,
# using a scaler fitted on the training data only
X_training, training_scalar = scale_features(X_training)
X_validation, _ = scale_features(X_validation, scalar=training_scalar)
X_test, _ = scale_features(X_test, scalar=training_scalar)
In [40]:
model = SGDRegressor(loss='squared_loss')
model.fit(X_training, y_training)
w0 = model.intercept_
w1 = model.coef_[0] # Notice that model.coef_ is now an array, not a single number
w2 = model.coef_[1]
print "Trained model: y = %0.2f + %0.2fx₁ + %0.2fx₂" % (w0, w1, w2)
MSE = np.mean((y_test - model.predict(X_test)) ** 2)
print "The Test Data MSE is: %0.3f" % (MSE)
From the videos, we learned that regularization is introduced to prevent the model from overfitting to the data points by adding a penalty for large weight values. This penalty is expressed mathematically by the second term of the cost function:
$$ J(W) = \sum_{i=1}^{m} (h_w(X^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} w_j^2 $$This is called L2 Regularization and $\lambda$ is called the Regularization Parameter. How can we implement it with scikit-learn for our models?
Well, no worries, scikit-learn implements that for you and we have been using it all the time. The SGDRegressor constructor has two arguments that define the behavior of the penalty:
penalty: the type of regularization to use. It defaults to 'l2', which is exactly the one in the formula above.
alpha: the value of the regularization parameter $\lambda$. It defaults to 0.0001.
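To make the formula concrete, here's a minimal sketch (with a hypothetical helper name) of how the regularized cost could be computed with NumPy. Note that the formula's second sum starts at $j = 1$, so `weights` here excludes the intercept $w_0$:
In [ ]:
def l2_cost(X, y, w0, weights, lam):
    predictions = w0 + X.dot(weights)          # h_w(X) for every sample
    squared_errors = (predictions - y) ** 2    # first term of J(W)
    penalty = lam * np.sum(weights ** 2)       # second term: the L2 penalty
    return np.sum(squared_errors) + penalty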
Now let's play with the value of alpha and see how that affects our model's accuracy. Let's set alpha to a large number, say 1. In this case we give the values of the weights a very harsh penalty, so they'll end up smaller than they should be and the accuracy should be worse!
In [41]:
model = SGDRegressor(loss='squared_loss', alpha=1)
model.fit(X_training, y_training)
w0 = model.intercept_
w1 = model.coef_[0] # Notice that model.coef_ is now an array, not a single number
w2 = model.coef_[1]
print "Trained model: y = %0.2f + %0.2fx₁ + %0.2fx₂" % (w0, w1, w2)
MSE = np.mean((y_test - model.predict(X_test)) ** 2)
print "The Test Data MSE is: %0.3f" % (MSE)
The effect the value of the regularization parameter has on the model's accuracy makes it a very good candidate for tuning. We can use the validation dataset we created for that purpose: we create a list of possible values for the regularization parameter, train the model using each of these values, and evaluate each model on the validation set. The value with the best evaluation (the least MSE) is the best value for the regularization parameter.
In [61]:
alphas = [0.00025, 0.00005, 0.0001, 0.0002, 0.0004]
best_alpha = alphas[0]
least_mse = float("inf")  # initialized to infinity
for possible_alpha in alphas:
    model = SGDRegressor(loss='squared_loss', alpha=possible_alpha)
    model.fit(X_training, y_training)
    mse = np.mean((y_validation - model.predict(X_validation)) ** 2)
    if mse <= least_mse:
        least_mse = mse
        best_alpha = possible_alpha
print "The Best alpha is: %.4f" % (best_alpha)
best_model = SGDRegressor(loss='squared_loss', alpha=best_alpha)
best_model.fit(X_training, y_training)
MSE = np.mean((y_test - best_model.predict(X_test)) ** 2) # evaluating the best model on test data
print "The Test Data MSE is: %0.3f" % (MSE)
There's a better way to tune the regularization parameter, and possibly multiple parameters at the same time: scikit-learn's GridSearchCV. We'll not be working with it in depth here, but you're encouraged to read the documentation and user guides and try for yourself how it could be done. Once you've got the hang of it, you can try to tune the learning rate and the regularization parameter at the same time!
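To give you a head start, here's a minimal sketch of what that could look like. It assumes a scikit-learn version where GridSearchCV lives in sklearn.model_selection (older versions had it in sklearn.grid_search), and the grid values below are illustrative, not tuned. Note that GridSearchCV does its own cross-validation on the data you give it, so it replaces our manual validation loop:
In [ ]:
from sklearn.model_selection import GridSearchCV
# the parameter combinations to try; both names are SGDRegressor arguments
param_grid = {
    'alpha': [0.00005, 0.0001, 0.0002, 0.0004],  # regularization parameter
    'eta0': [0.005, 0.01, 0.02]                  # initial learning rate
}
# 5-fold cross-validated search over all combinations, scored by (negated) MSE
search = GridSearchCV(SGDRegressor(loss='squared_loss'), param_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(X_training, y_training)
print "The best parameters are: %s" % (search.best_params_)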
The last thing we have here is to see how we can evaluate our model using the $R^2$ metric. We learned in the videos that the $R^2$ metric measures how close the data points are to our regression line (or plane). We also learned that there's an adjusted version of that metric, denoted by $\overline{R^2}$, that penalizes for the extra features we add to the model that don't help it be more accurate. These metrics can be calculated using the following formulas:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - f_i)^2}{\sum_{i=1}^{n}(y_i - \overline{y})^2}$$where $f_i$ is our model's prediction for the $i$th sample and $\overline{y}$ is the mean of all $n$ values of $y_i$. And for the adjusted version:
$$\overline{R^2} = R^2 - \frac{k - 1}{n - k}(1 - R^2)$$where $k$ is the number of features and $n$ is the number of data samples. Both $R^2$ and $\overline{R^2}$ take a value less than or equal to 1. The closer it is to one, the better our model is.
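To see how the formulas translate to code, here's a minimal sketch (with a hypothetical helper name) of computing both metrics directly; `predictions` would come from something like model.predict(X_test):
In [ ]:
def r2_scores(y, predictions, k):
    ss_residual = np.sum((y - predictions) ** 2)  # sum of (y_i - f_i)^2
    ss_total = np.sum((y - np.mean(y)) ** 2)      # sum of (y_i - mean(y))^2
    r2 = 1 - ss_residual / ss_total
    n = len(y)
    # the adjusted version penalizes for the k features used by the model
    r2_adjusted = r2 - (float(k - 1) / (n - k)) * (1 - r2)
    return r2, r2_adjusted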
Fortunately, we don't have to do all these calculations by hand to use this metric with scikit-learn. The model's score method does most of that for us: it takes the test Xs and ys and spits out the value of $R^2$. Note that it gives the plain $R^2$, not the adjusted version, so to get $\overline{R^2}$ we apply the adjustment formula above to its result:
In [64]:
model = SGDRegressor(loss='squared_loss', eta0=0.02)
model.fit(X_training, y_training)
w0 = model.intercept_
w1 = model.coef_[0] # Notice that model.coef_ is now an array, not a single number
w2 = model.coef_[1]
print "Trained model: y = %0.2f + %0.2fx₁ + %0.2fx₂" % (w0, w1, w2)
R2 = model.score(X_test, y_test)  # score returns the plain R²
n, k = X_test.shape  # number of test samples and number of features
R2_adjusted = R2 - (float(k - 1) / (n - k)) * (1 - R2)
print "The Model's R² on Test Data is %0.2f" % (R2)
print "The Model's Adjusted R² on Test Data is %0.2f" % (R2_adjusted)
Apply the ideas of L2 Regularization and the $R^2$ metric to the exercises you did in the last two modules.
Download Kaggle's 2016 US Election Dataset and explore the data using what you learned in Linear Regression. Make assumptions about the data's correlations and dependences, and test your assumptions using what you learned. If you get interesting results, publish your code and your results to the Script's Repo and share them with the community.
In [ ]: