In this notebook, you will implement your own logistic regression classifier with L2 regularization: you will write the gradient ascent update with an L2 penalty, train models over a range of penalty strengths, and examine the effect of regularization on the learned coefficients and on classification accuracy.
In [5]:
import os
import zipfile
import string
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
%matplotlib inline
For this assignment, we will use a subset of the Amazon product review dataset. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted primarily of positive reviews.
In [6]:
# Put files in current directory into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]
In [7]:
# Filename of unzipped file
unzipped_file = 'amazon_baby_subset.csv'
In [8]:
# If unzipped file not in files_list, unzip the file
if unzipped_file not in files_list:
    zip_file = unzipped_file + '.zip'
    unzipping = zipfile.ZipFile(zip_file)
    unzipping.extractall()
    unzipping.close()
We will use a dataset consisting of baby product reviews on Amazon.com.
In [9]:
products = pd.read_csv("amazon_baby_subset.csv")
Now, let us see a preview of what the dataset looks like.
In [10]:
products.head()
Out[10]:
One column of this dataset is 'sentiment', corresponding to the class label with +1 indicating a review with positive sentiment and -1 indicating one with negative sentiment.
In [11]:
products['sentiment']
Out[11]:
Let us quickly explore more of this dataset. The 'name' column indicates the name of the product. Here we list the first 10 products in the dataset. We then count the number of positive and negative reviews.
In [12]:
products.head(10)['name']
Out[12]:
In [13]:
print '# of positive reviews =', len(products[products['sentiment']==1])
print '# of negative reviews =', len(products[products['sentiment']==-1])
In this section, we will perform some simple feature cleaning. The last assignment used all words when building bag-of-words features, but here we limit ourselves to 193 words (for simplicity). We compiled a list of the 193 most frequent words into a JSON file.
Now, we will load these words from this JSON file:
In [14]:
import json
with open('important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]
In [15]:
print important_words
Now, we will perform 2 simple data transformations:

1. Remove punctuation using Python's built-in string functionality.
2. Compute word counts (only for the words in important_words).

We start with Step 1, which can be done as follows. Before removing the punctuation from the strings in the review column, we will fill all NA values with the empty string.
In [16]:
products["review"] = products["review"].fillna("")
Below, we are removing all the punctuation from the strings in the review column and saving the result into a new column in the dataframe.
In [17]:
products["review_clean"] = products["review"].str.translate(None, string.punctuation)
Now we proceed with Step 2. For each word in important_words, we compute a count of the number of times the word occurs in the review, and store this count in a separate column (one for each word). The result of this feature processing is a single column for each word in important_words, holding a count of the number of times the respective word occurs in the review text. Note: There are several ways of doing this. In this assignment, we use the built-in count method for Python lists: each review string is first split into individual words, and the number of occurrences of a given word is counted.
In [18]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
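As a minimal sanity check of the `split().count()` approach, here it is applied to a single made-up review string:

```python
# Made-up example review; each important word would get its own count column
review = "this is a perfect product absolutely perfect"
words = review.split()
print(words.count("perfect"))  # -> 2
print(words.count("great"))    # -> 0
```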
The products DataFrame now contains one column for each of the 193 important_words. As an example, the column perfect contains a count of the number of times the word perfect occurs in each of the reviews.
In [19]:
products['perfect']
Out[19]:
We split the data into a train-validation split with 80% of the data in the training set and 20% of the data in the validation set.
Note: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters. Thus, this portion of data should be called a validation set. Recall that examining performance of various potential models (i.e. models with different parameters) should be on a validation set, while evaluation of selected model should always be on a test set.
Loading the JSON files with the indices for the training data and the validation data into lists.
In [20]:
with open('module-4-assignment-train-idx.json', 'r') as f:
    train_idx_lst = json.load(f)
train_idx_lst = [int(entry) for entry in train_idx_lst]
In [21]:
with open('module-4-assignment-validation-idx.json', 'r') as f:
    validation_idx_lst = json.load(f)
validation_idx_lst = [int(entry) for entry in validation_idx_lst]
Using the lists of training data indices and validation data indices to build a DataFrame with the training data and a DataFrame with the validation data.
In [22]:
train_data = products.ix[train_idx_lst]
validation_data = products.ix[validation_idx_lst]
In [23]:
print 'Training set : %d data points' % len(train_data)
print 'Validation set : %d data points' % len(validation_data)
Just like in the second assignment of the previous module, we provide you with a function that extracts columns from a DataFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels.
Note: The feature matrix includes an additional column 'intercept' filled with 1's to take account of the intercept term.
In [24]:
def get_numpy_data(data_frame, features, label):
    data_frame['intercept'] = 1
    features = ['intercept'] + features
    features_frame = data_frame[features]
    feature_matrix = data_frame.as_matrix(columns=features)
    label_array = data_frame[label]
    label_array = label_array.values
    return(feature_matrix, label_array)
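As a sketch of what `get_numpy_data` produces, here is a minimal variant run on a tiny made-up DataFrame (using `.values` in place of the older `as_matrix`, which newer pandas versions have removed):

```python
import numpy as np
import pandas as pd

def get_numpy_data_sketch(data_frame, features, label):
    data_frame['intercept'] = 1
    features = ['intercept'] + features
    feature_matrix = data_frame[features].values  # features as a 2-D array
    label_array = data_frame[label].values        # labels as a 1-D array
    return (feature_matrix, label_array)

# Toy frame with two reviews, two word-count columns, and a sentiment label
toy = pd.DataFrame({'perfect': [2, 0], 'great': [0, 1], 'sentiment': [1, -1]})
X, y = get_numpy_data_sketch(toy, ['perfect', 'great'], 'sentiment')
print(X.shape)     # -> (2, 3): intercept column plus the two word counts
print(y.tolist())  # -> [1, -1]
```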
We convert both the training and validation sets into NumPy arrays.
Warning: This may take a few minutes.
In [25]:
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment')
Let us now build on Module 3 assignment. Recall from lecture that the link function for logistic regression can be defined as:
$$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}, $$where the feature vector $h(\mathbf{x}_i)$ is given by the word counts of important_words in the review $\mathbf{x}_i$.
We will use the same code as in this past assignment to make probability predictions since this part is not affected by the L2 penalty. (Only the way in which the coefficients are learned is affected by the addition of a regularization term.)
In [26]:
def predict_probability(feature_matrix, coefficients):
    '''
    Produces probabilistic estimate for P(y_i = +1 | x_i, w).
    Estimate ranges between 0 and 1.
    '''
    # Take dot product of feature_matrix and coefficients
    arg_exp = np.dot(coefficients, feature_matrix.transpose())
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1.0/(1.0 + np.exp(-arg_exp))
    return predictions
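A quick sketch of the link function on a tiny made-up feature matrix (two data points: an intercept column plus one word count), restated here so it runs standalone:

```python
import numpy as np

def predict_probability(feature_matrix, coefficients):
    # Link function: P(y = +1 | x, w) = 1 / (1 + exp(-w^T h(x)))
    scores = np.dot(feature_matrix, coefficients)
    return 1.0 / (1.0 + np.exp(-scores))

feature_matrix = np.array([[1.0, 0.0],   # review with zero counts of the word
                           [1.0, 3.0]])  # review with three counts
coefficients = np.array([0.0, 1.0])      # made-up weights: intercept 0, word weight 1
probabilities = predict_probability(feature_matrix, coefficients)
print(probabilities)  # roughly [0.5, 0.953]
```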
Let us now work on extending logistic regression with L2 regularization. As discussed in the lectures, the L2 regularization is particularly useful in preventing overfitting. In this assignment, we will explore L2 regularization in detail.
Recall from lecture and the previous assignment that for logistic regression without an L2 penalty, the derivative of the log likelihood function is: $$ \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$
Adding L2 penalty to the derivative
It takes only a small modification to add an L2 penalty. All terms indicated in red refer to terms that were added due to the L2 penalty.
The per-coefficient derivative for logistic regression with an L2 penalty is as follows: $$ \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) \color{red}{-2\lambda w_j } $$ and for the intercept term, we have $$ \frac{\partial\ell}{\partial w_0} = \sum_{i=1}^N h_0(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$
Note: As we did in the Regression course, we do not apply the L2 penalty on the intercept. A large intercept does not necessarily indicate overfitting because the intercept is not associated with any particular feature.
Write a function that computes the derivative of the log likelihood with respect to a single coefficient $w_j$. Unlike its counterpart in the last assignment, the function accepts five arguments:

- errors: vector containing $(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w}))$ for all $i$
- feature: vector containing $h_j(\mathbf{x}_i)$ for all $i$
- coefficient: the current value of coefficient $w_j$
- l2_penalty: the L2 penalty constant $\lambda$
- feature_is_constant: a Boolean telling whether the $j$-th feature is constant or not
In [27]:
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant):
    # Compute the dot product of errors and feature
    derivative = np.dot(feature.transpose(), errors)
    # Add the L2 penalty term for any feature that isn't the intercept
    if not feature_is_constant:
        derivative = derivative - 2.0*l2_penalty*coefficient
    return derivative
Quiz question: In the code above, was the intercept term regularized?
No
To verify the correctness of the gradient ascent algorithm, we provide a function for computing the log likelihood (recall from the last assignment that this was covered in an advanced optional video; we use it here because it is numerically stable).
In [28]:
def compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores))) - l2_penalty*np.sum(coefficients[1:]**2)
    return lp
Quiz question: Does the term with L2 regularization increase or decrease $\ell\ell(\mathbf{w})$?
Decreases
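To see why: the regularization term subtracts $\lambda \sum_{j \geq 1} w_j^2$, which is nonnegative, so for any nonzero coefficients the penalized log likelihood can only go down as $\lambda$ grows. A numeric sketch with made-up coefficients:

```python
import numpy as np

coefficients = np.array([0.5, 1.0, -2.0])  # made-up weights, intercept first

def penalty_term(l2_penalty):
    # The intercept (index 0) is excluded from the penalty
    return l2_penalty * np.sum(coefficients[1:] ** 2)

print(penalty_term(0.0))  # -> 0.0
print(penalty_term(4.0))  # -> 20.0, i.e. 4 * (1^2 + (-2)^2)
```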
The logistic regression function looks almost like the one in the last assignment, with a minor modification to account for the L2 penalty. Fill in the code below to complete this modification.
In [29]:
def logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in xrange(max_iter):
        # Predict P(y_i = +1|x_i,w) using your predict_probability() function
        predictions = predict_probability(feature_matrix, coefficients)
        # Compute indicator value for (y_i = +1)
        indicator = (sentiment==+1)
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            is_intercept = (j == 0)
            # Recall that feature_matrix[:,j] is the feature column associated with coefficients[j].
            # Compute the derivative for coefficients[j] and save it in a variable called derivative
            derivative = feature_derivative_with_L2(errors, feature_matrix[:,j], coefficients[j], l2_penalty, is_intercept)
            # Add the step size times the derivative to the current coefficient
            coefficients[j] = coefficients[j] + step_size*derivative
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty)
            print 'iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients
Now that we have written up all the pieces needed for regularized logistic regression, let's explore the benefits of using L2 regularization in analyzing sentiment for product reviews. As iterations pass, the log likelihood should increase.
Below, we train models with increasing amounts of regularization, starting with no L2 penalty, which is equivalent to our previous logistic regression implementation.
In [30]:
# run with L2 = 0
coefficients_0_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
initial_coefficients=np.zeros(194),
step_size=5e-6, l2_penalty=0, max_iter=501)
In [31]:
# run with L2 = 4
coefficients_4_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
initial_coefficients=np.zeros(194),
step_size=5e-6, l2_penalty=4, max_iter=501)
In [32]:
# run with L2 = 10
coefficients_10_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
initial_coefficients=np.zeros(194),
step_size=5e-6, l2_penalty=10, max_iter=501)
In [33]:
# run with L2 = 1e2
coefficients_1e2_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
initial_coefficients=np.zeros(194),
step_size=5e-6, l2_penalty=1e2, max_iter=501)
In [34]:
# run with L2 = 1e3
coefficients_1e3_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
initial_coefficients=np.zeros(194),
step_size=5e-6, l2_penalty=1e3, max_iter=501)
In [35]:
# run with L2 = 1e5
coefficients_1e5_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
initial_coefficients=np.zeros(194),
step_size=5e-6, l2_penalty=1e5, max_iter=501)
In [36]:
def add_coefficients_to_table(coefficients, column_name):
    return pd.Series(coefficients, index = column_name)
Now, let's run the function add_coefficients_to_table for each of the L2 penalty strengths.
In [37]:
coeff_L2_0_table = add_coefficients_to_table(coefficients_0_penalty, ['intercept'] + important_words)
coeff_L2_4_table = add_coefficients_to_table(coefficients_4_penalty, ['intercept'] + important_words)
coeff_L2_10_table = add_coefficients_to_table(coefficients_10_penalty, ['intercept'] + important_words)
coeff_L2_1e2_table = add_coefficients_to_table(coefficients_1e2_penalty, ['intercept'] + important_words)
coeff_L2_1e3_table = add_coefficients_to_table(coefficients_1e3_penalty, ['intercept'] + important_words)
coeff_L2_1e5_table = add_coefficients_to_table(coefficients_1e5_penalty, ['intercept'] + important_words)
Using the coefficients trained with L2 penalty 0, find the 5 most positive words (with largest positive coefficients). Save them to positive_words. Similarly, find the 5 most negative words (with largest negative coefficients) and save them to negative_words.
In [38]:
positive_words = coeff_L2_0_table.sort_values(ascending=False)[0:5].index.tolist()
negative_words = coeff_L2_0_table.sort_values(ascending=True)[0:5].index.tolist()
Quiz Question. Which of the following is not listed in either positive_words or negative_words?
In [39]:
print "positive_words: ", positive_words
print "negative_words: ", negative_words
First, let's put the 6 L2 penalty values we considered in a list.
In [40]:
l2_pen_vals = [0.0, 4.0, 10.0, 1.0e2, 1.0e3, 1.0e5]
Next, let's put all the words we considered as features for the classification model, plus the intercept feature, into a list.
In [41]:
feature_words_lst = ['intercept'] + important_words
Now, we will fill in 2 dictionaries, one keyed by the 5 most positive words and the other keyed by the 5 most negative words. For each word, we store a list of its coefficient values under the 6 different L2 penalties we considered.
In [42]:
pos_word_coeff_dict = {}
for curr_word in positive_words:
    # Find the index of the word we are considering in feature_words_lst
    word_index = feature_words_lst.index(curr_word)
    # Fill in the list for this word with the coefficient values for the 6 L2 penalties we considered
    pos_word_coeff_dict[curr_word] = [coefficients_0_penalty[word_index], coefficients_4_penalty[word_index],
                                      coefficients_10_penalty[word_index], coefficients_1e2_penalty[word_index],
                                      coefficients_1e3_penalty[word_index], coefficients_1e5_penalty[word_index]]
In [43]:
neg_word_coeff_dict = {}
for curr_word in negative_words:
    # Find the index of the word we are considering in feature_words_lst
    word_index = feature_words_lst.index(curr_word)
    # Fill in the list for this word with the coefficient values for the 6 L2 penalties we considered
    neg_word_coeff_dict[curr_word] = [coefficients_0_penalty[word_index], coefficients_4_penalty[word_index],
                                      coefficients_10_penalty[word_index], coefficients_1e2_penalty[word_index],
                                      coefficients_1e3_penalty[word_index], coefficients_1e5_penalty[word_index]]
Plotting coefficient path for positive words
In [44]:
plt.figure(figsize=(10,6))
for pos_word in positive_words:
    plt.semilogx(l2_pen_vals, pos_word_coeff_dict[pos_word], linewidth=2, label=pos_word)
plt.plot(l2_pen_vals, [0,0,0,0,0,0], linewidth=2, linestyle='--', color="black")
plt.axis([4.0, 1.0e5, -0.5, 1.5])
plt.title("Positive Words Coefficient Path", fontsize=18)
plt.xlabel("L2 Penalty ($\lambda$)", fontsize=18)
plt.ylabel("Coefficient Value", fontsize=18)
plt.legend(loc="upper right", fontsize=18)
Out[44]:
Plotting coefficient path for negative words
In [45]:
plt.figure(figsize=(10,6))
for neg_word in negative_words:
    plt.semilogx(l2_pen_vals, neg_word_coeff_dict[neg_word], linewidth=2, label=neg_word)
plt.plot(l2_pen_vals, [0,0,0,0,0,0], linewidth=2, linestyle='--', color="black")
plt.axis([4.0, 1.0e5, -1.5, 0.5])
plt.title("Negative Words Coefficient Path", fontsize=18)
plt.xlabel("L2 Penalty ($\lambda$)", fontsize=18)
plt.ylabel("Coefficient Value", fontsize=18)
plt.legend(loc="lower right", fontsize=18)
Out[45]:
The following 2 questions relate to the 2 figures above.
Quiz Question: (True/False) All coefficients consistently get smaller in size as the L2 penalty is increased.
True
Quiz Question: (True/False) The relative order of coefficients is preserved as the L2 penalty is increased. (For example, if the coefficient for 'cat' was more positive than that for 'dog', this remains true as the L2 penalty increases.)
False
Now, let us compute the accuracy of the classifier model. Recall that the accuracy is given by
$$ \mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}} $$ Recall from lecture that the class prediction is calculated using $$ \hat{y}_i = \left\{ \begin{array}{ll} +1 & h(\mathbf{x}_i)^T\mathbf{w} > 0 \\ -1 & h(\mathbf{x}_i)^T\mathbf{w} \leq 0 \\ \end{array} \right. $$
Note: It is important to know that the model prediction code doesn't change even with the addition of an L2 penalty. The only thing that changes is the estimated coefficients used in this prediction.
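A small sketch of the decision rule above on made-up scores (ties at exactly zero are assigned to the negative class):

```python
import numpy as np

scores = np.array([2.5, -0.3, 0.0, 1.1])    # made-up values of h(x)^T w
labels = np.array([+1, -1, +1, +1])         # made-up true sentiment
predictions = np.where(scores > 0, +1, -1)  # score <= 0 predicts -1
accuracy = np.mean(predictions == labels)
print(predictions.tolist())  # -> [1, -1, -1, 1]
print(accuracy)              # -> 0.75
```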
Step 1: First compute the scores using feature_matrix and coefficients using a dot product. Do this for the training data and the validation data.
In [46]:
# Compute the scores as a dot product between feature_matrix and coefficients.
scores_l2_pen_0_train = np.dot(feature_matrix_train, coefficients_0_penalty)
scores_l2_pen_4_train = np.dot(feature_matrix_train, coefficients_4_penalty)
scores_l2_pen_10_train = np.dot(feature_matrix_train, coefficients_10_penalty)
scores_l2_pen_1e2_train = np.dot(feature_matrix_train, coefficients_1e2_penalty)
scores_l2_pen_1e3_train = np.dot(feature_matrix_train, coefficients_1e3_penalty)
scores_l2_pen_1e5_train = np.dot(feature_matrix_train, coefficients_1e5_penalty)
In [47]:
scores_l2_pen_0_valid = np.dot(feature_matrix_valid, coefficients_0_penalty)
scores_l2_pen_4_valid = np.dot(feature_matrix_valid, coefficients_4_penalty)
scores_l2_pen_10_valid = np.dot(feature_matrix_valid, coefficients_10_penalty)
scores_l2_pen_1e2_valid = np.dot(feature_matrix_valid, coefficients_1e2_penalty)
scores_l2_pen_1e3_valid = np.dot(feature_matrix_valid, coefficients_1e3_penalty)
scores_l2_pen_1e5_valid = np.dot(feature_matrix_valid, coefficients_1e5_penalty)
Step 2: Using the formula above, compute the class predictions from the scores.
First, we write a helper function that returns an array of predictions.
In [48]:
def get_pred_from_score(scores_array):
    # Predict +1 for scores > 0 and -1 for scores <= 0,
    # without modifying the scores array in place
    predictions = np.where(scores_array > 0, 1, -1)
    return predictions
Now, we get the predictions for the training data and the validation data for the 6 L2 penalties we considered.
In [49]:
pred_l2_pen_0_train = get_pred_from_score(scores_l2_pen_0_train)
pred_l2_pen_4_train = get_pred_from_score(scores_l2_pen_4_train)
pred_l2_pen_10_train = get_pred_from_score(scores_l2_pen_10_train)
pred_l2_pen_1e2_train = get_pred_from_score(scores_l2_pen_1e2_train)
pred_l2_pen_1e3_train = get_pred_from_score(scores_l2_pen_1e3_train)
pred_l2_pen_1e5_train = get_pred_from_score(scores_l2_pen_1e5_train)
In [50]:
pred_l2_pen_0_valid = get_pred_from_score(scores_l2_pen_0_valid)
pred_l2_pen_4_valid = get_pred_from_score(scores_l2_pen_4_valid)
pred_l2_pen_10_valid = get_pred_from_score(scores_l2_pen_10_valid)
pred_l2_pen_1e2_valid = get_pred_from_score(scores_l2_pen_1e2_valid)
pred_l2_pen_1e3_valid = get_pred_from_score(scores_l2_pen_1e3_valid)
pred_l2_pen_1e5_valid = get_pred_from_score(scores_l2_pen_1e5_valid)
Step 3: Getting the accuracy for the training set data and the validation set data
In [51]:
train_accuracy = {}
train_accuracy[0] = np.sum(pred_l2_pen_0_train==sentiment_train)/float(len(sentiment_train))
train_accuracy[4] = np.sum(pred_l2_pen_4_train==sentiment_train)/float(len(sentiment_train))
train_accuracy[10] = np.sum(pred_l2_pen_10_train==sentiment_train)/float(len(sentiment_train))
train_accuracy[1e2] = np.sum(pred_l2_pen_1e2_train==sentiment_train)/float(len(sentiment_train))
train_accuracy[1e3] = np.sum(pred_l2_pen_1e3_train==sentiment_train)/float(len(sentiment_train))
train_accuracy[1e5] = np.sum(pred_l2_pen_1e5_train==sentiment_train)/float(len(sentiment_train))
validation_accuracy = {}
validation_accuracy[0] = np.sum(pred_l2_pen_0_valid==sentiment_valid)/float(len(sentiment_valid))
validation_accuracy[4] = np.sum(pred_l2_pen_4_valid==sentiment_valid)/float(len(sentiment_valid))
validation_accuracy[10] = np.sum(pred_l2_pen_10_valid==sentiment_valid)/float(len(sentiment_valid))
validation_accuracy[1e2] = np.sum(pred_l2_pen_1e2_valid==sentiment_valid)/float(len(sentiment_valid))
validation_accuracy[1e3] = np.sum(pred_l2_pen_1e3_valid==sentiment_valid)/float(len(sentiment_valid))
validation_accuracy[1e5] = np.sum(pred_l2_pen_1e5_valid==sentiment_valid)/float(len(sentiment_valid))
In [52]:
# Build a simple report
for key in sorted(validation_accuracy.keys()):
    print "L2 penalty = %g" % key
    print "train accuracy = %s, validation_accuracy = %s" % (train_accuracy[key], validation_accuracy[key])
    print "--------------------------------------------------------------------------------"
Creating a list of tuples with the entries as (accuracy, l2_penalty) for the training set and the validation set.
In [53]:
accuracy_training_data = [(train_accuracy[0], 0), (train_accuracy[4], 4), (train_accuracy[10], 10),
(train_accuracy[1e2], 1e2), (train_accuracy[1e3], 1e3), (train_accuracy[1e5], 1e5)]
accuracy_validation_data = [(validation_accuracy[0], 0), (validation_accuracy[4], 4), (validation_accuracy[10], 10),
(validation_accuracy[1e2], 1e2), (validation_accuracy[1e3], 1e3), (validation_accuracy[1e5], 1e5)]
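Because Python compares tuples element-wise, taking `max` over (accuracy, l2_penalty) pairs returns the pair with the highest accuracy, and its second entry is the winning penalty. A sketch with made-up accuracies:

```python
# Made-up (accuracy, l2_penalty) pairs
accuracy_pairs = [(0.780, 0), (0.785, 4), (0.781, 10)]
best_accuracy, best_l2 = max(accuracy_pairs)
print(best_l2)  # -> 4
```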
Quiz question: Which model (L2 = 0, 4, 10, 100, 1e3, 1e5) has the highest accuracy on the training data?
In [54]:
max(accuracy_training_data)[1]
Out[54]:
Quiz question: Which model (L2 = 0, 4, 10, 100, 1e3, 1e5) has the highest accuracy on the validation data?
In [55]:
max(accuracy_validation_data)[1]
Out[55]:
Quiz question: Does the highest accuracy on the training data imply that the model is the best one?
No, this model probably suffers from overfitting