In [1]:
import pandas as pd

Load and process review dataset


In [108]:
products = pd.read_csv('../../data/amazon_baby_subset.csv')

In [5]:
products['sentiment']


Out[5]:
0        1
1        1
2        1
3        1
4        1
5        1
6        1
7        1
8        1
9        1
10       1
11       1
12       1
13       1
14       1
15       1
16       1
17       1
18       1
19       1
20       1
21       1
22       1
23       1
24       1
25       1
26       1
27       1
28       1
29       1
        ..
53042   -1
53043   -1
53044   -1
53045   -1
53046   -1
53047   -1
53048   -1
53049   -1
53050   -1
53051   -1
53052   -1
53053   -1
53054   -1
53055   -1
53056   -1
53057   -1
53058   -1
53059   -1
53060   -1
53061   -1
53062   -1
53063   -1
53064   -1
53065   -1
53066   -1
53067   -1
53068   -1
53069   -1
53070   -1
53071   -1
Name: sentiment, dtype: int64

In [109]:
products['sentiment'].size


Out[109]:
53072

In [8]:
products.head(10).name


Out[8]:
0    Stop Pacifier Sucking without tears with Thumb...
1      Nature's Lullabies Second Year Sticker Calendar
2      Nature's Lullabies Second Year Sticker Calendar
3                          Lamaze Peekaboo, I Love You
4    SoftPlay Peek-A-Boo Where's Elmo A Children's ...
5                            Our Baby Girl Memory Book
6    Hunnt® Falling Flowers and Birds Kids Nurs...
7    Blessed By Pope Benedict XVI Divine Mercy Full...
8    Cloth Diaper Pins Stainless Steel Traditional ...
9    Cloth Diaper Pins Stainless Steel Traditional ...
Name: name, dtype: object

In [14]:
print ('# of positive reviews =', len(products[products['sentiment']==1]))
print ('# of negative reviews =', len(products[products['sentiment']==-1]))


# of positive reviews = 26579
# of negative reviews = 26493

In [110]:
# Same feature processing as in the previous assignments
# ---------------------------------------------------------------
import json
with open('../../data/important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]


def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return str(text).translate(translator) 

# Remove punctuation.
products['review_clean'] = products['review'].apply(remove_punctuation)

# Split out the words into individual columns
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
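
To make the feature construction concrete, here is a small sketch on a made-up review. The sample text and the three-word vocabulary are invented for illustration and are not taken from the dataset or from important_words.

```python
import string

def remove_punctuation(text):
    # Delete every ASCII punctuation character from the text.
    translator = str.maketrans('', '', string.punctuation)
    return str(text).translate(translator)

# Hypothetical review and vocabulary, for illustration only.
sample_review = "Love it! My son loves it, and it is easy to use."
sample_words = ['love', 'loves', 'easy']

clean = remove_punctuation(sample_review)
# Count whole-token, case-sensitive occurrences of each word.
word_counts = {word: clean.split().count(word) for word in sample_words}
print(clean)
print(word_counts)
```

Note that the count is case-sensitive and matches whole tokens only, so 'Love' at the start of the review is not counted as 'love'.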

Train-Validation split

We split the data into a training set with 80% of the data and a validation set with the remaining 20%. The split was made with seed=2, and we load the resulting row indices from the JSON files below so that everyone gets the same result.

Note: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters, so it should be called a validation set. Recall that the performance of candidate models (i.e. models with different parameters) should be compared on a validation set, while the selected model should always be evaluated on a test set.
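
For readers without the index files, an 80/20 split with a fixed seed could be sketched as follows. The function name train_validation_split is hypothetical, and this sketch will not reproduce the exact rows in the provided JSON files; it only illustrates the idea of a reproducible split.

```python
import numpy as np

def train_validation_split(n_rows, train_fraction=0.8, seed=2):
    # Shuffle the row positions reproducibly, then cut at the requested fraction.
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(train_fraction * n_rows)
    return np.sort(shuffled[:n_train]), np.sort(shuffled[n_train:])

# 53072 is the number of rows reported for the review dataset above.
train_positions, validation_positions = train_validation_split(53072)
print(len(train_positions), len(validation_positions))
```

The two returned arrays are disjoint and together cover every row, so they could be passed to products.iloc[...] in the same way as the loaded index lists.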


In [111]:
with open('../../data/module-4-assignment-train-idx.json', 'r') as f:
    train_idx = json.load(f)                
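# DataFrame.ix was removed in pandas 1.0. products has the default integer index,
# so positional indexing with iloc selects the same rows.
train_data = products.iloc[train_idx].copy()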

In [112]:
with open ('../../data/module-4-assignment-validation-idx.json', 'r') as f:
    v_idx = json.load(f)
# DataFrame.ix was removed in pandas 1.0; iloc selects the same rows by position.
validation_data = products.iloc[v_idx].copy()

Convert Frame to NumPy array

Just like in the second assignment of the previous module, we provide you with a function that extracts columns from a pandas DataFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels.

Note: The feature matrix includes an additional column 'intercept' filled with 1's to take account of the intercept term.


In [113]:
import numpy as np

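def get_numpy_data(data_frame, features, label):
    # Add a constant column so that the first coefficient acts as the intercept.
    data_frame['intercept'] = 1
    features = ['intercept'] + features
    features_frame = data_frame[features]
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    feature_matrix = features_frame.to_numpy()
    label_array = data_frame[label]
    return (feature_matrix, label_array)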

In [114]:
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment')

Building on logistic regression with no L2 penalty assignment

Let us now build on the Module 3 assignment. Recall from lecture that the link function for logistic regression can be defined as:

$$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}, $$

where the feature vector $h(\mathbf{x}_i)$ is given by the word counts of important_words in the review $\mathbf{x}_i$.

We will use the same code as in that assignment to make probability predictions, since this part is not affected by the L2 penalty. (Only the way in which the coefficients are learned is affected by the addition of a regularization term.)


In [115]:
def prediction(score):
    # Sigmoid link function; works element-wise on NumPy arrays.
    return (1 / (1 + np.exp(-score)))

def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate for P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    # Take dot product of feature_matrix and coefficients  
    scores = np.dot(feature_matrix, coefficients)
    
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = prediction(scores)
    
    # return predictions
    return predictions
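
As a quick sanity check of the link function, the sketch below evaluates it on two made-up rows. The toy feature matrix and coefficients are assumptions chosen for illustration; the function is restated compactly so that the block runs on its own.

```python
import numpy as np

def predict_probability(feature_matrix, coefficients):
    # Score each row, then squash the scores through the sigmoid.
    scores = np.dot(feature_matrix, coefficients)
    return 1.0 / (1.0 + np.exp(-scores))

# Made-up data: the first column is the intercept, followed by two word counts.
toy_features = np.array([[1., 2., 0.],
                         [1., 0., 3.]])
toy_coefficients = np.array([0., 1., -1.])

# The scores are 2 and -3, so the first probability is above 0.5 and the second below.
toy_probabilities = predict_probability(toy_features, toy_coefficients)
print(toy_probabilities)
```

A score of exactly 0 maps to a probability of 0.5, which is why the sign of the score decides the predicted class.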

Adding L2 penalty

Let us now work on extending logistic regression with L2 regularization. As discussed in the lectures, L2 regularization is particularly useful in preventing overfitting. In this assignment, we will explore L2 regularization in detail.

Recall from lecture and the previous assignment that for logistic regression without an L2 penalty, the derivative of the log likelihood function is: $$ \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$

Adding L2 penalty to the derivative

It takes only a small modification to add an L2 penalty. Terms shown in red are those added by the L2 penalty.

  • Recall from the lecture that the link function is still the sigmoid: $$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}, $$
  • We add the L2 penalty term to the per-coefficient derivative of log likelihood: $$ \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) \color{red}{-2\lambda w_j } $$

The derivative above applies to every coefficient $w_j$ with $j \geq 1$. For the intercept term, which is not penalized, we have $$ \frac{\partial\ell}{\partial w_0} = \sum_{i=1}^N h_0(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$

Note: As we did in the Regression course, we do not apply the L2 penalty on the intercept. A large intercept does not necessarily indicate overfitting because the intercept is not associated with any particular feature.

Write a function that computes the derivative of log likelihood with respect to a single coefficient $w_j$. Unlike its counterpart in the last assignment, the function accepts five arguments:

  • errors vector containing $(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w}))$ for all $i$
  • feature vector containing $h_j(\mathbf{x}_i)$ for all $i$
  • coefficient containing the current value of coefficient $w_j$.
  • l2_penalty representing the L2 penalty constant $\lambda$
  • feature_is_constant telling whether the $j$-th feature is constant or not.

In [116]:
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant): 
    
    # Compute the dot product of errors and feature
    derivative = np.dot(feature, errors)

    # add L2 penalty term for any feature that isn't the intercept.
    if not feature_is_constant: 
        derivative = derivative - 2 * l2_penalty * coefficient
        
    return derivative

Quiz Question: In the code above, was the intercept term regularized?

To verify the correctness of the gradient ascent algorithm, we provide a function for computing the log likelihood. Recall from the last assignment that this form of the log likelihood was detailed in an advanced optional video; we use it here for its numerical stability.

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) \color{red}{-\lambda\|\mathbf{w}\|_2^2} $$
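
When the term $\ln(1 + \exp(-s))$ is evaluated directly with np.exp, it can overflow for very negative scores $s$. One possible guard, sketched below under that assumption, is NumPy's np.logaddexp, which computes $\ln(e^a + e^b)$ stably. The function name log_likelihood_with_L2_stable and the toy data are hypothetical, and this is a variant for illustration rather than the function provided in the assignment.

```python
import numpy as np

def log_likelihood_with_L2_stable(feature_matrix, sentiment, coefficients, l2_penalty):
    indicator = (np.asarray(sentiment) == +1)
    scores = np.dot(feature_matrix, coefficients)
    # np.logaddexp(0, -s) computes ln(1 + exp(-s)) without overflowing.
    data_term = np.sum((indicator - 1) * scores - np.logaddexp(0, -scores))
    # The intercept (index 0) is left out of the penalty.
    penalty_term = l2_penalty * np.sum(coefficients[1:] ** 2)
    return data_term - penalty_term

# Made-up extreme scores (+1000 and -1000) that would overflow np.exp.
toy_features = np.array([[1., 1000.], [1., -1000.]])
toy_sentiment = np.array([+1, -1])
toy_coefficients = np.array([0., 1.])
stable_value = log_likelihood_with_L2_stable(toy_features, toy_sentiment,
                                             toy_coefficients, l2_penalty=0.)
print(stable_value)
```

Both toy examples are classified correctly with very large margins, so the log likelihood stays finite and close to zero instead of turning into -inf.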

In [117]:
def compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    
    lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores))) - l2_penalty*np.sum(coefficients[1:]**2)
    
    return lp
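
One way to gain confidence in the derivative is to compare it against a central finite difference of the log likelihood. The sketch below does this on random toy data; the data, the helper names log_likelihood and feature_derivative, and the step eps are assumptions made for illustration, with the two functions restated compactly so that the block runs on its own.

```python
import numpy as np

def log_likelihood(X, y, w, l2_penalty):
    scores = X.dot(w)
    data_term = np.sum(((y == +1) - 1) * scores - np.log(1. + np.exp(-scores)))
    return data_term - l2_penalty * np.sum(w[1:] ** 2)

def feature_derivative(errors, feature, coefficient, l2_penalty, feature_is_constant):
    derivative = np.dot(feature, errors)
    if not feature_is_constant:
        derivative -= 2 * l2_penalty * coefficient
    return derivative

# Random toy problem: an intercept column plus two small count features.
rng = np.random.RandomState(0)
X = np.hstack([np.ones((20, 1)), rng.randint(0, 3, size=(20, 2)).astype(float)])
y = rng.choice([-1, +1], size=20)
w = np.array([0.1, -0.2, 0.3])
l2_penalty, eps = 5.0, 1e-6

errors = (y == +1) - 1. / (1. + np.exp(-X.dot(w)))
max_gap = 0.0
for j in range(len(w)):
    analytic = feature_derivative(errors, X[:, j], w[j], l2_penalty, j == 0)
    step = np.zeros_like(w)
    step[j] = eps
    # Central difference approximation of the partial derivative with respect to w[j].
    numeric = (log_likelihood(X, y, w + step, l2_penalty)
               - log_likelihood(X, y, w - step, l2_penalty)) / (2 * eps)
    max_gap = max(max_gap, abs(analytic - numeric))
print(max_gap)
```

A small max_gap indicates that the analytic derivative, including the unpenalized intercept case, agrees with the slope of the penalized log likelihood.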

Quiz Question: Does the term with L2 regularization increase or decrease $\ell\ell(\mathbf{w})$?

The logistic regression function looks almost like the one in the last assignment, with a minor modification to account for the L2 penalty. Fill in the code below to complete this modification.


In [118]:
def logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in range(max_iter):

        # Predict P(y_i = +1|x_i,w) using your predict_probability() function
        # YOUR CODE HERE
        predictions = predict_probability(feature_matrix, coefficients)
        
        # Compute indicator value for (y_i = +1)
        indicator = (sentiment==+1)
        
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in range(len(coefficients)): # loop over each coefficient
            
            # Recall that feature_matrix[:,j] is the feature column associated with coefficients[j].
            # Compute the derivative for coefficients[j]. Save it in a variable called derivative
            # YOUR CODE HERE
            derivative = feature_derivative_with_L2(errors, feature_matrix[:, j], coefficients[j], l2_penalty, j == 0)
            
            # add the step size times the derivative to the current coefficient
            coefficients[j] += (step_size * derivative)
        
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty)
            print ('iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp))
    return coefficients
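
Because the errors are computed once per iteration, before the loop over coefficients, the inner loop amounts to a single vector update. A minimal vectorized sketch of one such step is shown below; the function name gradient_step_with_L2 and the toy data are hypothetical, and this is an alternative formulation rather than the assignment's reference solution.

```python
import numpy as np

def gradient_step_with_L2(feature_matrix, sentiment, coefficients, step_size, l2_penalty):
    scores = feature_matrix.dot(coefficients)
    predictions = 1. / (1. + np.exp(-scores))
    errors = (np.asarray(sentiment) == +1) - predictions
    # All partial derivatives at once: one dot product per column of feature_matrix.
    gradient = feature_matrix.T.dot(errors)
    # Subtract the L2 term for every coefficient except the intercept (index 0).
    penalty = 2 * l2_penalty * coefficients
    penalty[0] = 0.
    return coefficients + step_size * (gradient - penalty)

# Made-up data: an intercept column followed by two word counts.
toy_X = np.array([[1., 2., 0.],
                  [1., 0., 1.],
                  [1., 1., 1.]])
toy_y = np.array([+1, -1, +1])
toy_w = np.zeros(3)

new_w = gradient_step_with_L2(toy_X, toy_y, toy_w, step_size=0.1, l2_penalty=1.0)
print(new_w)
```

Starting from all-zero coefficients, every prediction is 0.5 and the penalty term vanishes, so the first step depends only on the data.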

Explore effects of L2 regularization

Now that we have written up all the pieces needed for regularized logistic regression, let's explore the benefits of using L2 regularization in analyzing sentiment for product reviews. As iterations pass, the log likelihood should increase.

Below, we train models with increasing amounts of regularization, starting with no L2 penalty, which is equivalent to our previous logistic regression implementation.


In [119]:
# run with L2 = 0
coefficients_0_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=5e-6, l2_penalty=0, max_iter=501)


iteration   0: log likelihood of observed labels = -29179.39138303
iteration   1: log likelihood of observed labels = -29003.71259047
iteration   2: log likelihood of observed labels = -28834.66187288
iteration   3: log likelihood of observed labels = -28671.70781507
iteration   4: log likelihood of observed labels = -28514.43078198
iteration   5: log likelihood of observed labels = -28362.48344665
iteration   6: log likelihood of observed labels = -28215.56713122
iteration   7: log likelihood of observed labels = -28073.41743783
iteration   8: log likelihood of observed labels = -27935.79536396
iteration   9: log likelihood of observed labels = -27802.48168669
iteration  10: log likelihood of observed labels = -27673.27331484
iteration  11: log likelihood of observed labels = -27547.98083656
iteration  12: log likelihood of observed labels = -27426.42679977
iteration  13: log likelihood of observed labels = -27308.44444728
iteration  14: log likelihood of observed labels = -27193.87673876
iteration  15: log likelihood of observed labels = -27082.57555831
iteration  20: log likelihood of observed labels = -26570.43059938
iteration  30: log likelihood of observed labels = -25725.48742389
iteration  40: log likelihood of observed labels = -25055.53326910
iteration  50: log likelihood of observed labels = -24509.63590026
iteration  60: log likelihood of observed labels = -24054.97906083
iteration  70: log likelihood of observed labels = -23669.51640848
iteration  80: log likelihood of observed labels = -23337.89167628
iteration  90: log likelihood of observed labels = -23049.07066021
iteration 100: log likelihood of observed labels = -22794.90974921
iteration 200: log likelihood of observed labels = -21283.29527353
iteration 300: log likelihood of observed labels = -20570.97485473
iteration 400: log likelihood of observed labels = -20152.21466944
iteration 500: log likelihood of observed labels = -19876.62333410

In [120]:
# run with L2 = 4
coefficients_4_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                      initial_coefficients=np.zeros(194),
                                                      step_size=5e-6, l2_penalty=4, max_iter=501)


iteration   0: log likelihood of observed labels = -29179.39508175
iteration   1: log likelihood of observed labels = -29003.73417180
iteration   2: log likelihood of observed labels = -28834.71441858
iteration   3: log likelihood of observed labels = -28671.80345068
iteration   4: log likelihood of observed labels = -28514.58077956
iteration   5: log likelihood of observed labels = -28362.69830317
iteration   6: log likelihood of observed labels = -28215.85663259
iteration   7: log likelihood of observed labels = -28073.79071393
iteration   8: log likelihood of observed labels = -27936.26093762
iteration   9: log likelihood of observed labels = -27803.04751805
iteration  10: log likelihood of observed labels = -27673.94684207
iteration  11: log likelihood of observed labels = -27548.76901327
iteration  12: log likelihood of observed labels = -27427.33612958
iteration  13: log likelihood of observed labels = -27309.48101569
iteration  14: log likelihood of observed labels = -27195.04624253
iteration  15: log likelihood of observed labels = -27083.88333261
iteration  20: log likelihood of observed labels = -26572.49874392
iteration  30: log likelihood of observed labels = -25729.32604153
iteration  40: log likelihood of observed labels = -25061.34245801
iteration  50: log likelihood of observed labels = -24517.52091982
iteration  60: log likelihood of observed labels = -24064.99093939
iteration  70: log likelihood of observed labels = -23681.67373669
iteration  80: log likelihood of observed labels = -23352.19298741
iteration  90: log likelihood of observed labels = -23065.50180166
iteration 100: log likelihood of observed labels = -22813.44844580
iteration 200: log likelihood of observed labels = -21321.14164794
iteration 300: log likelihood of observed labels = -20624.98634439
iteration 400: log likelihood of observed labels = -20219.92048845
iteration 500: log likelihood of observed labels = -19956.11341776

In [121]:
# run with L2 = 10
coefficients_10_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                      initial_coefficients=np.zeros(194),
                                                      step_size=5e-6, l2_penalty=10, max_iter=501)


iteration   0: log likelihood of observed labels = -29179.40062984
iteration   1: log likelihood of observed labels = -29003.76654163
iteration   2: log likelihood of observed labels = -28834.79322654
iteration   3: log likelihood of observed labels = -28671.94687528
iteration   4: log likelihood of observed labels = -28514.80571589
iteration   5: log likelihood of observed labels = -28363.02048078
iteration   6: log likelihood of observed labels = -28216.29071186
iteration   7: log likelihood of observed labels = -28074.35036891
iteration   8: log likelihood of observed labels = -27936.95892966
iteration   9: log likelihood of observed labels = -27803.89576265
iteration  10: log likelihood of observed labels = -27674.95647005
iteration  11: log likelihood of observed labels = -27549.95042714
iteration  12: log likelihood of observed labels = -27428.69905549
iteration  13: log likelihood of observed labels = -27311.03455140
iteration  14: log likelihood of observed labels = -27196.79890162
iteration  15: log likelihood of observed labels = -27085.84308528
iteration  20: log likelihood of observed labels = -26575.59697506
iteration  30: log likelihood of observed labels = -25735.07304608
iteration  40: log likelihood of observed labels = -25070.03447306
iteration  50: log likelihood of observed labels = -24529.31188025
iteration  60: log likelihood of observed labels = -24079.95349572
iteration  70: log likelihood of observed labels = -23699.83199186
iteration  80: log likelihood of observed labels = -23373.54108747
iteration  90: log likelihood of observed labels = -23090.01500055
iteration 100: log likelihood of observed labels = -22841.08995135
iteration 200: log likelihood of observed labels = -21377.25595328
iteration 300: log likelihood of observed labels = -20704.63995428
iteration 400: log likelihood of observed labels = -20319.25685307
iteration 500: log likelihood of observed labels = -20072.16321721

In [122]:
# run with L2 = 1e2
coefficients_1e2_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                       initial_coefficients=np.zeros(194),
                                                       step_size=5e-6, l2_penalty=1e2, max_iter=501)


iteration   0: log likelihood of observed labels = -29179.48385119
iteration   1: log likelihood of observed labels = -29004.25177457
iteration   2: log likelihood of observed labels = -28835.97382190
iteration   3: log likelihood of observed labels = -28674.09410083
iteration   4: log likelihood of observed labels = -28518.17112932
iteration   5: log likelihood of observed labels = -28367.83774654
iteration   6: log likelihood of observed labels = -28222.77708940
iteration   7: log likelihood of observed labels = -28082.70799392
iteration   8: log likelihood of observed labels = -27947.37595368
iteration   9: log likelihood of observed labels = -27816.54738615
iteration  10: log likelihood of observed labels = -27690.00588850
iteration  11: log likelihood of observed labels = -27567.54970126
iteration  12: log likelihood of observed labels = -27448.98991327
iteration  13: log likelihood of observed labels = -27334.14912742
iteration  14: log likelihood of observed labels = -27222.86041863
iteration  15: log likelihood of observed labels = -27114.96648229
iteration  20: log likelihood of observed labels = -26621.50201299
iteration  30: log likelihood of observed labels = -25819.72803950
iteration  40: log likelihood of observed labels = -25197.34035501
iteration  50: log likelihood of observed labels = -24701.03698195
iteration  60: log likelihood of observed labels = -24296.66378580
iteration  70: log likelihood of observed labels = -23961.38842316
iteration  80: log likelihood of observed labels = -23679.38088853
iteration  90: log likelihood of observed labels = -23439.31824267
iteration 100: log likelihood of observed labels = -23232.88192018
iteration 200: log likelihood of observed labels = -22133.50726528
iteration 300: log likelihood of observed labels = -21730.03957488
iteration 400: log likelihood of observed labels = -21545.87572145
iteration 500: log likelihood of observed labels = -21451.95551390

In [123]:
# run with L2 = 1e3
coefficients_1e3_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                       initial_coefficients=np.zeros(194),
                                                       step_size=5e-6, l2_penalty=1e3, max_iter=501)


iteration   0: log likelihood of observed labels = -29180.31606471
iteration   1: log likelihood of observed labels = -29009.07176112
iteration   2: log likelihood of observed labels = -28847.62378912
iteration   3: log likelihood of observed labels = -28695.14439397
iteration   4: log likelihood of observed labels = -28550.95060743
iteration   5: log likelihood of observed labels = -28414.45771129
iteration   6: log likelihood of observed labels = -28285.15124375
iteration   7: log likelihood of observed labels = -28162.56976044
iteration   8: log likelihood of observed labels = -28046.29387744
iteration   9: log likelihood of observed labels = -27935.93902900
iteration  10: log likelihood of observed labels = -27831.15045502
iteration  11: log likelihood of observed labels = -27731.59955260
iteration  12: log likelihood of observed labels = -27636.98108219
iteration  13: log likelihood of observed labels = -27547.01092670
iteration  14: log likelihood of observed labels = -27461.42422295
iteration  15: log likelihood of observed labels = -27379.97375625
iteration  20: log likelihood of observed labels = -27027.18208317
iteration  30: log likelihood of observed labels = -26527.22737267
iteration  40: log likelihood of observed labels = -26206.59048765
iteration  50: log likelihood of observed labels = -25995.96903148
iteration  60: log likelihood of observed labels = -25854.95710284
iteration  70: log likelihood of observed labels = -25759.08109950
iteration  80: log likelihood of observed labels = -25693.05688014
iteration  90: log likelihood of observed labels = -25647.09929349
iteration 100: log likelihood of observed labels = -25614.81468705
iteration 200: log likelihood of observed labels = -25536.20998919
iteration 300: log likelihood of observed labels = -25532.57691220
iteration 400: log likelihood of observed labels = -25532.35543765
iteration 500: log likelihood of observed labels = -25532.33970049

In [124]:
# run with L2 = 1e5
coefficients_1e5_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                       initial_coefficients=np.zeros(194),
                                                       step_size=5e-6, l2_penalty=1e5, max_iter=501)


iteration   0: log likelihood of observed labels = -29271.85955115
iteration   1: log likelihood of observed labels = -29271.71006589
iteration   2: log likelihood of observed labels = -29271.65738833
iteration   3: log likelihood of observed labels = -29271.61189923
iteration   4: log likelihood of observed labels = -29271.57079975
iteration   5: log likelihood of observed labels = -29271.53358505
iteration   6: log likelihood of observed labels = -29271.49988440
iteration   7: log likelihood of observed labels = -29271.46936584
iteration   8: log likelihood of observed labels = -29271.44172890
iteration   9: log likelihood of observed labels = -29271.41670149
iteration  10: log likelihood of observed labels = -29271.39403722
iteration  11: log likelihood of observed labels = -29271.37351294
iteration  12: log likelihood of observed labels = -29271.35492661
iteration  13: log likelihood of observed labels = -29271.33809523
iteration  14: log likelihood of observed labels = -29271.32285309
iteration  15: log likelihood of observed labels = -29271.30905015
iteration  20: log likelihood of observed labels = -29271.25729150
iteration  30: log likelihood of observed labels = -29271.20657205
iteration  40: log likelihood of observed labels = -29271.18775997
iteration  50: log likelihood of observed labels = -29271.18078247
iteration  60: log likelihood of observed labels = -29271.17819447
iteration  70: log likelihood of observed labels = -29271.17723457
iteration  80: log likelihood of observed labels = -29271.17687853
iteration  90: log likelihood of observed labels = -29271.17674648
iteration 100: log likelihood of observed labels = -29271.17669750
iteration 200: log likelihood of observed labels = -29271.17666862
iteration 300: log likelihood of observed labels = -29271.17666862
iteration 400: log likelihood of observed labels = -29271.17666862
iteration 500: log likelihood of observed labels = -29271.17666862

Compare coefficients

We now compare the coefficients for each of the models that were trained above. We will create a table of features and learned coefficients associated with each of the different L2 penalty values.

Below is a simple helper function that will help us create this table.


In [125]:
important_words.insert(0, 'intercept')
data = np.array(important_words)
table = pd.DataFrame(columns = ['words'], data = data)
def add_coefficients_to_table(coefficients, column_name):
    table[column_name] = coefficients
    return table
important_words.remove('intercept')

In [126]:
add_coefficients_to_table(coefficients_0_penalty, 'coefficients [L2=0]')
add_coefficients_to_table(coefficients_4_penalty, 'coefficients [L2=4]')
add_coefficients_to_table(coefficients_10_penalty, 'coefficients [L2=10]')
add_coefficients_to_table(coefficients_1e2_penalty, 'coefficients [L2=1e2]')
add_coefficients_to_table(coefficients_1e3_penalty, 'coefficients [L2=1e3]')
add_coefficients_to_table(coefficients_1e5_penalty, 'coefficients [L2=1e5]')


Out[126]:
words coefficients [L2=0] coefficients [L2=4] coefficients [L2=10] coefficients [L2=1e2] coefficients [L2=1e3] coefficients [L2=1e5]
0 intercept -0.063742 -0.063143 -0.062256 -0.050438 0.000054 0.011362
1 baby 0.074073 0.073994 0.073877 0.072360 0.059752 0.001784
2 one 0.012753 0.012495 0.012115 0.007247 -0.008761 -0.001827
3 great 0.801625 0.796897 0.789935 0.701425 0.376012 0.008950
4 love 1.058554 1.050856 1.039529 0.896644 0.418354 0.009042
5 use -0.000104 0.000163 0.000556 0.005481 0.017326 0.000418
6 would -0.287021 -0.286027 -0.284564 -0.265993 -0.188662 -0.008127
7 like -0.003384 -0.003442 -0.003527 -0.004635 -0.007043 -0.000827
8 easy 0.984559 0.977600 0.967362 0.838245 0.401904 0.008808
9 little 0.524419 0.521385 0.516917 0.460235 0.251221 0.005941
10 seat -0.086968 -0.086125 -0.084883 -0.069109 -0.017718 0.000611
11 old 0.208912 0.207749 0.206037 0.184332 0.105074 0.002741
12 well 0.453866 0.450969 0.446700 0.392304 0.194926 0.003945
13 get -0.196835 -0.196100 -0.195017 -0.181251 -0.122728 -0.004578
14 also 0.158163 0.157246 0.155899 0.139153 0.080918 0.001929
15 really -0.017906 -0.017745 -0.017508 -0.014481 -0.004448 -0.000340
16 son 0.128396 0.127761 0.126828 0.115192 0.070411 0.001552
17 time -0.072429 -0.072281 -0.072065 -0.069480 -0.057581 -0.002805
18 bought -0.151817 -0.150917 -0.149594 -0.132884 -0.072431 -0.001985
19 product -0.263330 -0.262328 -0.260854 -0.242391 -0.167962 -0.006211
20 good 0.156507 0.155270 0.153445 0.129972 0.047879 0.000266
21 daughter 0.263418 0.261775 0.259357 0.228685 0.117158 0.002401
22 much -0.013247 -0.013295 -0.013366 -0.014326 -0.015219 -0.000839
23 loves 1.052484 1.043903 1.031265 0.870794 0.345870 0.006150
24 stroller -0.037533 -0.036988 -0.036186 -0.025990 0.005912 0.001326
25 put -0.000330 -0.000323 -0.000312 -0.000127 0.001529 -0.000097
26 months -0.067995 -0.067315 -0.066314 -0.053594 -0.013083 -0.000157
27 car 0.193364 0.191904 0.189754 0.162531 0.072719 0.001765
28 still 0.188508 0.187071 0.184955 0.158163 0.068491 0.000976
29 back -0.268954 -0.267419 -0.265161 -0.236730 -0.134671 -0.003988
... ... ... ... ... ... ... ...
164 started -0.153174 -0.151852 -0.149905 -0.125084 -0.045084 -0.000877
165 anything -0.186801 -0.185242 -0.182943 -0.153602 -0.057284 -0.001053
166 last -0.099469 -0.098692 -0.097547 -0.083001 -0.034797 -0.000775
167 company -0.276548 -0.274151 -0.270621 -0.225839 -0.084898 -0.001719
168 come -0.032009 -0.031804 -0.031502 -0.027685 -0.014185 -0.000426
169 returned -0.572707 -0.567518 -0.559870 -0.462056 -0.150021 -0.002225
170 maybe -0.224076 -0.222015 -0.218976 -0.180192 -0.058149 -0.000945
171 took -0.046445 -0.046199 -0.045838 -0.041422 -0.025566 -0.000772
172 broke -0.555195 -0.550209 -0.542861 -0.448989 -0.148726 -0.002182
173 makes -0.009023 -0.008764 -0.008382 -0.003467 0.008757 0.000255
174 stay -0.300563 -0.297920 -0.294024 -0.244247 -0.083709 -0.001310
175 instead -0.193123 -0.191418 -0.188907 -0.156863 -0.054125 -0.000925
176 idea -0.465370 -0.461130 -0.454879 -0.374890 -0.118469 -0.001627
177 head -0.110472 -0.109559 -0.108215 -0.090992 -0.032986 -0.000502
178 said -0.098049 -0.097331 -0.096274 -0.082875 -0.037594 -0.000947
179 less -0.136801 -0.135652 -0.133958 -0.112360 -0.042260 -0.000873
180 went -0.106836 -0.106003 -0.104776 -0.089294 -0.039417 -0.001006
181 working -0.320363 -0.317559 -0.313427 -0.260764 -0.092334 -0.001674
182 high 0.003326 0.003282 0.003217 0.002404 0.000236 -0.000062
183 unit -0.196121 -0.194516 -0.192153 -0.162210 -0.066568 -0.001567
184 seems 0.058308 0.057905 0.057312 0.049753 0.022875 0.000329
185 picture -0.196906 -0.195273 -0.192866 -0.162143 -0.061171 -0.001151
186 completely -0.277845 -0.275461 -0.271947 -0.227098 -0.081775 -0.001421
187 wish 0.173191 0.171640 0.169352 0.140022 0.044374 0.000468
188 buying -0.132197 -0.131083 -0.129441 -0.108471 -0.040331 -0.000792
189 babies 0.052494 0.052130 0.051594 0.044805 0.021026 0.000365
190 won 0.004960 0.004907 0.004830 0.003848 0.001084 0.000017
191 tub -0.166745 -0.165367 -0.163338 -0.137693 -0.054778 -0.000936
192 almost -0.031916 -0.031621 -0.031186 -0.025604 -0.007361 -0.000125
193 either -0.228852 -0.226793 -0.223758 -0.184986 -0.061138 -0.000980

194 rows × 7 columns

Using the coefficients trained with L2 penalty 0, find the 5 most positive words (with largest positive coefficients). Save them to positive_words. Similarly, find the 5 most negative words (with largest negative coefficients) and save them to negative_words.

Quiz Question. Which of the following is not listed in either positive_words or negative_words?


In [127]:
def make_tuple(column_name):
    word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip( table['words'], table[column_name])]
    return word_coefficient_tuples


positive_words = list(map(lambda x: x[0], sorted(make_tuple('coefficients [L2=0]'), key=lambda x:x[1], reverse=True)[:5]))
negative_words = list(map(lambda x: x[0], sorted(make_tuple('coefficients [L2=0]'), key=lambda x:x[1], reverse=False)[:5]))
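
The same selection can also be written with pandas' nlargest and nsmallest. The sketch below uses a made-up table, so its words and coefficients are illustrative assumptions and say nothing about the trained model.

```python
import pandas as pd

# Made-up words and coefficients, for illustration only.
toy_table = pd.DataFrame({
    'words': ['intercept', 'good', 'bad', 'fine', 'awful', 'superb'],
    'coefficients [L2=0]': [-0.1, 0.8, -0.7, 0.2, -1.1, 1.3],
})

# Rows with the two largest and the two smallest coefficients.
top_positive = toy_table.nlargest(2, 'coefficients [L2=0]')['words'].tolist()
top_negative = toy_table.nsmallest(2, 'coefficients [L2=0]')['words'].tolist()
print(top_positive, top_negative)
```

Both calls return rows already sorted by the chosen column, so no separate sort is needed.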

In [84]:
positive_words


Out[84]:
['love', 'loves', 'easy', 'perfect', 'great']

In [86]:
negative_words


Out[86]:
['disappointed', 'money', 'return', 'waste', 'returned']

Let us observe the effect of increasing L2 penalty on the 10 words just selected. We provide you with a utility function to plot the coefficient path.


In [104]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 6

def make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list):
    cmap_positive = plt.get_cmap('Reds')
    cmap_negative = plt.get_cmap('Blues')
    
    xx = l2_penalty_list
    plt.plot(xx, [0.]*len(xx), '--', lw=1, color='k')
    
    # Index rows by word so each curve is paired with its own label;
    # isin() would return rows in table order, not in word-list order.
    # DataFrame.as_matrix() was removed in pandas 1.0, so use .values instead.
    coefficients_by_word = table.set_index('words')
    
    for i, word in enumerate(positive_words):
        color = cmap_positive(0.8*((i+1)/(len(positive_words)*1.2)+0.15))
        plt.plot(xx, coefficients_by_word.loc[word].values.flatten(),
                 '-', label=word, linewidth=4.0, color=color)
        
    for i, word in enumerate(negative_words):
        color = cmap_negative(0.8*((i+1)/(len(negative_words)*1.2)+0.15))
        plt.plot(xx, coefficients_by_word.loc[word].values.flatten(),
                 '-', label=word, linewidth=4.0, color=color)
        
    plt.legend(loc='best', ncol=3, prop={'size':16}, columnspacing=0.5)
    plt.axis([1, 1e5, -1, 2])
    plt.title('Coefficient path')
    plt.xlabel(r'L2 penalty ($\lambda$)')
    plt.ylabel('Coefficient value')
    plt.xscale('log')
    plt.rcParams.update({'font.size': 18})
    plt.tight_layout()

In [105]:
make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list=[0, 4, 10, 1e2, 1e3, 1e5])


Quiz Question: (True/False) All coefficients consistently get smaller in size as the L2 penalty is increased.

Quiz Question: (True/False) The relative order of coefficients is preserved as the L2 penalty is increased. (For example, if the coefficient for 'cat' was more positive than that for 'dog', this remains true as the L2 penalty increases.)
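
Both questions can be checked numerically rather than by eye. The sketch below assumes a table shaped like the one above: a words column followed by one coefficient column per penalty, ordered by increasing penalty. The toy values are made up and say nothing about the coefficients trained in this assignment.

```python
import numpy as np
import pandas as pd

# Made-up coefficients for three words at three penalties (not assignment data).
toy = pd.DataFrame({
    'words': ['cat', 'dog', 'fish'],
    'coefficients [L2=0]': [1.0, 0.8, -0.5],
    'coefficients [L2=10]': [0.6, 0.7, -0.4],
    'coefficients [L2=100]': [0.1, 0.2, -0.1],
})
coefs = toy.set_index('words').values  # rows: words, columns: increasing penalty

# Magnitudes shrink if |w| never increases from one penalty to the next.
magnitudes_shrink = bool(np.all(np.diff(np.abs(coefs), axis=1) <= 0))

# Order is preserved if every penalty column ranks the words identically.
ranks = np.argsort(coefs, axis=0)
order_preserved = bool(np.all(ranks == ranks[:, [0]]))

print(magnitudes_shrink)  # True for this toy table
print(order_preserved)    # False: 'cat' and 'dog' swap places at L2=10
```

To run the same check on the real coefficients, replace toy with the assignment's table.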

Measuring accuracy

Now, let us compute the accuracy of the classifier model. Recall that the accuracy is given by

$$ \mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}} $$

Recall from lecture that the class prediction is calculated using $$ \hat{y}_i = \left\{ \begin{array}{ll} +1 & h(\mathbf{x}_i)^T\mathbf{w} > 0 \\ -1 & h(\mathbf{x}_i)^T\mathbf{w} \leq 0 \\ \end{array} \right. $$

Note: It is important to know that the model prediction code doesn't change even with the addition of an L2 penalty. The only thing that changes is the estimated coefficients used in this prediction.

Based on the above, we will use the same code that was used in the Module 3 assignment.


In [128]:
def get_classification_accuracy(feature_matrix, sentiment, coefficients):
    # Score each data point: h(x_i)^T w
    scores = np.dot(feature_matrix, coefficients)
    # Predict +1 for strictly positive scores, -1 otherwise (a score of 0 maps to -1)
    predictions = np.where(scores > 0, 1., -1.)
    
    # Accuracy = (# correctly classified data points) / (# total data points)
    num_correct = (predictions == sentiment).sum()
    accuracy = num_correct / len(feature_matrix)
    return accuracy
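
As a sanity check, the prediction rule and the accuracy formula can be worked through by hand on a tiny example. This is a sketch only: the function name classification_accuracy, the feature matrix, the labels, and the coefficients are all hypothetical, chosen so that one score lands exactly on 0.

```python
import numpy as np

def classification_accuracy(feature_matrix, sentiment, coefficients):
    # Hypothetical re-statement of the rule above: +1 if score > 0, else -1.
    scores = np.dot(feature_matrix, coefficients)
    predictions = np.where(scores > 0, 1, -1)
    return np.mean(predictions == sentiment)

# Made-up data: the first column plays the role of the intercept feature.
toy_features = np.array([[1., 2., 0.],
                         [1., 0., 3.],
                         [1., 1., 1.],
                         [1., 0., 0.]])
toy_sentiment = np.array([1, -1, 1, -1])
toy_coefficients = np.array([0., 1., -1.])

# Scores are [2, -3, 0, 0], so predictions are [+1, -1, -1, -1].
# The third point (score 0, true label +1) is the only mistake.
toy_accuracy = classification_accuracy(toy_features, toy_sentiment, toy_coefficients)
print(toy_accuracy)  # 0.75
```

The third row shows the tie-breaking convention: a score of exactly 0 is classified as -1.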

Below, we compare the accuracy on the training data and validation data for all the models that were trained in this assignment. We first calculate the accuracy values and then build a simple report summarizing the performance for the various models.


In [129]:
train_accuracy = {}
train_accuracy[0]   = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_0_penalty)
train_accuracy[4]   = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_4_penalty)
train_accuracy[10]  = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_10_penalty)
train_accuracy[1e2] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e2_penalty)
train_accuracy[1e3] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e3_penalty)
train_accuracy[1e5] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e5_penalty)

validation_accuracy = {}
validation_accuracy[0]   = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_0_penalty)
validation_accuracy[4]   = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_4_penalty)
validation_accuracy[10]  = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_10_penalty)
validation_accuracy[1e2] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e2_penalty)
validation_accuracy[1e3] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e3_penalty)
validation_accuracy[1e5] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e5_penalty)

In [131]:
# Build a simple report
for key in sorted(validation_accuracy.keys()):
    print("L2 penalty = %g" % key)
    print("train accuracy = %s, validation_accuracy = %s" % (train_accuracy[key], validation_accuracy[key]))
    print("--------------------------------------------------------------------------------")


L2 penalty = 0
train accuracy = 0.785156157787, validation_accuracy = 0.78143964149
--------------------------------------------------------------------------------
L2 penalty = 4
train accuracy = 0.785108944548, validation_accuracy = 0.781533003454
--------------------------------------------------------------------------------
L2 penalty = 10
train accuracy = 0.784990911452, validation_accuracy = 0.781719727383
--------------------------------------------------------------------------------
L2 penalty = 100
train accuracy = 0.783975826822, validation_accuracy = 0.781066193633
--------------------------------------------------------------------------------
L2 penalty = 1000
train accuracy = 0.775855149784, validation_accuracy = 0.771356549342
--------------------------------------------------------------------------------
L2 penalty = 100000
train accuracy = 0.680366374731, validation_accuracy = 0.667818130893
--------------------------------------------------------------------------------
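
One way to read the report is to pick the penalty with the highest validation accuracy. A minimal sketch, using the validation values copied from the report above; the names reported_validation_accuracy and best_l2_penalty are introduced here for illustration.

```python
# Validation accuracies copied from the report above, keyed by L2 penalty.
reported_validation_accuracy = {
    0: 0.78143964149,
    4: 0.781533003454,
    10: 0.781719727383,
    1e2: 0.781066193633,
    1e3: 0.771356549342,
    1e5: 0.667818130893,
}

# Choose the penalty whose model generalizes best to the validation set.
best_l2_penalty = max(reported_validation_accuracy,
                      key=reported_validation_accuracy.get)
print(best_l2_penalty, reported_validation_accuracy[best_l2_penalty])
```

Selecting on validation accuracy rather than training accuracy matters here: training accuracy is highest at penalty 0, while validation accuracy peaks at a small nonzero penalty.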

In [132]:
# Optional. Plot accuracy on training and validation sets over choice of L2 penalty.
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 6

sorted_list = sorted(train_accuracy.items(), key=lambda x:x[0])
plt.plot([p[0] for p in sorted_list], [p[1] for p in sorted_list], 'bo-', linewidth=4, label='Training accuracy')
sorted_list = sorted(validation_accuracy.items(), key=lambda x:x[0])
plt.plot([p[0] for p in sorted_list], [p[1] for p in sorted_list], 'ro-', linewidth=4, label='Validation accuracy')
plt.xscale('symlog')
plt.axis([0, 1e3, 0.78, 0.786])
plt.legend(loc='lower left')
plt.rcParams.update({'font.size': 18})
plt.tight_layout()
