Homework and bake-off: Stanford Sentiment Treebank


In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

Overview

This homework and associated bake-off are devoted to the Stanford Sentiment Treebank (SST). The homework questions ask you to implement some baseline systems, and the bake-off challenge is to define a system that does extremely well at the SST task.

We'll focus on the ternary task as defined by sst.ternary_class_func.

The SST test set will be used for the bake-off evaluation. This dataset is already publicly distributed, so we are counting on people not to cheat by developing their models on the test set. You must do all your development without using the test set at all, and then evaluate exactly once on the test set and turn in the results, with no further system tuning or additional runs. Much of the scientific integrity of our field depends on people adhering to this honor code.

Our only additional restriction is that you cannot make any use of the subtree labels. This corresponds to the 'Root' condition in the paper. As we discussed in class, the subtree labels are a really interesting feature of SST, but bringing them in results in a substantially different learning problem.

One of our goals for this homework and bake-off is to encourage you to engage in the basic development cycle for supervised models, in which you

  1. Write a new feature function. We recommend starting with something simple.
  2. Use sst.experiment to evaluate your new feature function, with at least fit_softmax_classifier. (A minimal sketch of steps 1 and 2 appears just after this list.)
  3. If you have time, compare your feature function with unigrams_phi using sst.compare_models or sst.compare_models_mcnemar. (For discussion, see this notebook section.)
  4. Return to step 1, or stop the cycle and conduct a more rigorous evaluation with hyperparameter tuning and assessment on the dev set.
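
To make the cycle concrete, here is a minimal sketch of steps 1 and 2. The feature function length_bucket_phi is purely illustrative, and the call assumes SST_HOME, fit_softmax_classifier, and the imports defined in the cells below:

def length_bucket_phi(tree):
    # Hypothetical feature function: unigram counts plus a coarse
    # sentence-length feature.
    feats = Counter(tree.leaves())
    feats["__len_bucket_{}".format(len(tree.leaves()) // 10)] += 1
    return feats

sst.experiment(
    SST_HOME,
    length_bucket_phi,
    fit_softmax_classifier,
    train_reader=sst.train_reader,
    assess_reader=sst.dev_reader,
    class_func=sst.ternary_class_func)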

Error analysis is one of the most important methods for steadily improving a system, as it facilitates a kind of human-powered hill-climbing on your ultimate objective. Often, it takes a careful human analyst just a few examples to spot a major pattern that can lead to a beneficial change to the feature representations.

Methodological note

You don't have to use the experimental framework defined below (based on sst). However, if you don't use sst.experiment as below, then make sure that you train only on the train set, evaluate only on the dev set, and report your results with

from sklearn.metrics import classification_report
classification_report(y_dev, predictions)

where y_dev = [y for tree, y in sst.dev_reader(class_func=sst.ternary_class_func)]. We'll focus on the value at macro avg under f1-score in these reports.
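
For example, a minimal end-to-end version of that protocol, using the unigrams featurization and softmax classifier defined below, might look like the following sketch (depending on your version of sst.py, the readers may also require SST_HOME as their first argument):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train = list(sst.train_reader(class_func=sst.ternary_class_func))
dev = list(sst.dev_reader(class_func=sst.ternary_class_func))

X_train = [unigrams_phi(tree) for tree, y in train]
y_train = [y for tree, y in train]
X_dev = [unigrams_phi(tree) for tree, y in dev]
y_dev = [y for tree, y in dev]

# Vectorize the count dictionaries, fitting the feature space on train only:
vec = DictVectorizer(sparse=True)
X_train = vec.fit_transform(X_train)
X_dev = vec.transform(X_dev)

mod = LogisticRegression(fit_intercept=True, solver='liblinear', multi_class='ovr')
mod.fit(X_train, y_train)
predictions = mod.predict(X_dev)

print(classification_report(y_dev, predictions, digits=3))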

Set-up

See the first notebook in this unit for set-up instructions.


In [2]:
from collections import Counter
from nltk.tree import Tree
import numpy as np
import os
import pandas as pd
import random
from sklearn.linear_model import LogisticRegression
import sst
import torch.nn as nn
from torch_rnn_classifier import TorchRNNClassifier
from torch_tree_nn import TorchTreeNN
import utils

In [3]:
SST_HOME = os.path.join('data', 'trees')

A softmax baseline

This example is here mainly as a reminder of how to use our experimental framework with linear models.


In [4]:
def unigrams_phi(tree):
    """The basis for a unigrams feature function.
    
    Parameters
    ----------
    tree : nltk.tree.Tree
        The tree to represent.
    
    Returns
    -------    
    Counter
        A map from strings to their counts in `tree`. (Counter maps a 
        list to a dict of counts of the elements in that list.)
    
    """
    return Counter(tree.leaves())
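
As a quick sanity check, here is what unigrams_phi returns for a toy tree (the tree string is borrowed from test_bigrams_phi later in this notebook):

t = Tree.fromstring("(4 (2 NLU) (4 (2 is) (4 amazing)))")
unigrams_phi(t)  # Counter({'NLU': 1, 'is': 1, 'amazing': 1})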

Thin wrapper around LogisticRegression for the sake of sst.experiment:


In [5]:
def fit_softmax_classifier(X, y):        
    mod = LogisticRegression(
        fit_intercept=True,
        solver='liblinear',
        multi_class='ovr')
    mod.fit(X, y)
    return mod

The experimental run with some notes:


In [6]:
softmax_experiment = sst.experiment(
    SST_HOME,
    unigrams_phi,                      # Free to write your own!
    fit_softmax_classifier,            # Free to write your own!
    train_reader=sst.train_reader,     # Fixed by the competition.
    assess_reader=sst.dev_reader,      # Fixed until the bake-off.
    class_func=sst.ternary_class_func) # Fixed by the bake-off rules.


              precision    recall  f1-score   support

    negative      0.628     0.689     0.657       428
     neutral      0.343     0.153     0.211       229
    positive      0.629     0.750     0.684       444

   micro avg      0.602     0.602     0.602      1101
   macro avg      0.533     0.531     0.518      1101
weighted avg      0.569     0.602     0.575      1101

softmax_experiment contains a lot of information that you can use for analysis; see this section below for starter code.

RNNClassifier wrapper

This section illustrates how to use sst.experiment with RNN and TreeNN models.

To featurize examples for an RNN, we just get the words in order, letting the model take care of mapping them into an embedding space.


In [7]:
def rnn_phi(tree):
    return tree.leaves()

The model wrapper gets the vocabulary using utils.get_vocab. If you want to use pretrained word representations here, then you can have fit_rnn_classifier build that space too; see this notebook section for details, as well as the sketch just after the next cell.


In [8]:
def fit_rnn_classifier(X, y):    
    sst_glove_vocab = utils.get_vocab(X, n_words=10000)     
    mod = TorchRNNClassifier(
        sst_glove_vocab, 
        eta=0.05,
        embedding=None,
        batch_size=1000,
        embed_dim=50,
        hidden_dim=50,
        max_iter=50,
        l2_strength=0.001,
        bidirectional=True,
        hidden_activation=nn.ReLU())
    mod.fit(X, y)
    return mod
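
If you do want GloVe initialization, here is a hedged sketch of a variant of the wrapper above. It assumes the 50d GloVe file lives at data/glove.6B/glove.6B.50d.txt and that utils.glove2dict and utils.randvec are available, as in the other course notebooks; any word missing from GloVe gets a random vector:

def fit_glove_rnn_classifier(X, y):
    glove_lookup = utils.glove2dict(
        os.path.join('data', 'glove.6B', 'glove.6B.50d.txt'))
    vocab = utils.get_vocab(X, n_words=10000)
    # Rows of the embedding matrix are aligned with `vocab`:
    embedding = np.array(
        [glove_lookup.get(w, utils.randvec(50)) for w in vocab])
    mod = TorchRNNClassifier(
        vocab,
        embedding=embedding,
        embed_dim=50,
        hidden_dim=50,
        max_iter=50,
        bidirectional=True)
    mod.fit(X, y)
    return mod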

In [9]:
rnn_experiment = sst.experiment(
    SST_HOME,
    rnn_phi,
    fit_rnn_classifier, 
    vectorize=False,  # For deep learning, use `vectorize=False`.
    assess_reader=sst.dev_reader)


Finished epoch 50 of 50; error is 2.6432904899120333
              precision    recall  f1-score   support

    negative      0.568     0.666     0.613       428
     neutral      0.290     0.175     0.218       229
    positive      0.640     0.664     0.652       444

   micro avg      0.563     0.563     0.563      1101
   macro avg      0.499     0.502     0.494      1101
weighted avg      0.539     0.563     0.547      1101

Error analysis

This section begins to build an error-analysis framework using the dicts returned by sst.experiment. These have the following structure:

'model': trained model
'train_dataset':
   'X': feature matrix
   'y': list of labels
   'vectorizer': DictVectorizer,
   'raw_examples': list of raw inputs, before featurizing   
'assess_dataset': same structure as the value of 'train_dataset'
'predictions': predictions on the assessment data
'metric': `score_func.__name__`, where `score_func` is an `sst.experiment` argument
'score': the `score_func` score on the assessment data
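
For example, given this structure, the softmax run above can be inspected directly (get_feature_names is the standard DictVectorizer method; in newer scikit-learn releases it is get_feature_names_out):

softmax_experiment['score']    # the score on the assessment data
softmax_experiment['metric']   # name of the score function used
vec = softmax_experiment['train_dataset']['vectorizer']
len(vec.get_feature_names())   # size of the unigram feature space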

The following function just finds mistakes, and returns a pd.DataFrame for easy subsequent processing:


In [10]:
def find_errors(experiment):
    """Find mistaken predictions.
    
    Parameters
    ----------
    experiment : dict
        As returned by `sst.experiment`.
        
    Returns
    -------
    pd.DataFrame
    
    """
    raw_examples = experiment['assess_dataset']['raw_examples']
    raw_examples = [" ".join(tree.leaves()) for tree in raw_examples]
    df = pd.DataFrame({
        'raw_examples': raw_examples,
        'predicted': experiment['predictions'],
        'gold': experiment['assess_dataset']['y']})
    df['correct'] = df['predicted'] == df['gold']
    return df

In [11]:
softmax_analysis = find_errors(softmax_experiment)

In [12]:
rnn_analysis = find_errors(rnn_experiment)

Here we merge the softmax and RNN analyses into a single DataFrame:


In [13]:
analysis = softmax_analysis.merge(
    rnn_analysis, left_on='raw_examples', right_on='raw_examples')

analysis = analysis.drop('gold_y', axis=1).rename(columns={'gold_x': 'gold'})

The following code collects a specific subset of examples; small modifications to its structure will give you different interesting subsets:


In [14]:
# Examples where the softmax model is correct, the RNN is not,
# and the gold label is 'positive'

error_group = analysis[
    (analysis['predicted_x'] == analysis['gold'])
    &
    (analysis['predicted_y'] != analysis['gold'])    
    &
    (analysis['gold'] == 'positive')
]

In [15]:
error_group.shape[0]


Out[15]:
58

In [16]:
for ex in error_group['raw_examples'].sample(5):
    print("="*70)
    print(ex)


======================================================================
An operatic , sprawling picture that 's entertainingly acted , magnificently shot and gripping enough to sustain most of its 170-minute length .
======================================================================
This is a good script , good dialogue , funny even for adults .
======================================================================
An intriguing cinematic omnibus and round-robin that occasionally is more interesting in concept than in execution .
======================================================================
A spellbinding African film about the modern condition of rootlessness , a state experienced by millions around the globe .
======================================================================
A woman 's pic directed with resonance by Ilya Chaiken .

Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

Reproducing Socher et al.'s NaiveBayes baselines [2 points]

Socher et al. compare against (among other models) a NaiveBayes baseline with bigram features. See how close you can come to reproducing the performance of that model on the binary, root-only problem (values in the rightmost column of their Table 1, rows 1 and 3).

Specific tasks:

  1. Write a bigrams feature function called bigrams_phi on the model of unigrams_phi. The included function test_bigrams_phi should help verify that you've done this correctly.
  2. Write a function fit_nb_classifier that serves as a wrapper for sklearn.naive_bayes.MultinomialNB, which you can use with all its default arguments or change them as you see fit.
  3. Use sst.experiment to run the experiments, assessing against sst.dev_reader.

Submit all the code you write for this, including any new import statements, and make sure your notebook embeds the output from running the code in step 3.

A note on performance: in our experience, the bigrams Naive Bayes model gets around 0.75. It's fine to submit answers with comparable numbers; the Socher et al. baselines are very strong. We're not evaluating how good your model is; we want to see your code, and we're interested to see what the range of F1 scores is across the whole class.


In [17]:
##### YOUR CODE HERE

In [18]:
def test_bigrams_phi(func):
    """`func` should be `bigrams_phi`."""
    tree = Tree.fromstring("""(4 (2 NLU) (4 (2 is) (4 amazing)))""")
    result = func(tree)
    expected = {('<S>', 'NLU'): 1, 
                ('NLU', 'is'): 1, 
                ('is', 'amazing'): 1, 
                ('amazing', '</S>'): 1}
    assert result == expected, \
        "Expected {}\nGot {}".format(expected, result)

Sentiment words alone [2 points]

NLTK includes an easy interface to Minqing Hu and Bing Liu's Opinion Lexicon, which consists of a list of positive words and a list of negative words. How much of the ternary SST story does this lexicon tell?

For this problem, submit code to do the following:

  1. Create a feature function op_unigrams on the model of unigrams_phi above, but filtering the vocabulary to just items that are members of the Opinion Lexicon. Submit this feature function.

  2. Evaluate your feature function with sst.experiment, with all the same parameters as were used to create softmax_experiment in A softmax baseline above, except of course for the feature function.

  3. Use utils.mcnemar to compare your feature function with the results in softmax_experiment. The information you need for this is in softmax_experiment and your own sst.experiment results. Submit your evaluation code. You can assume softmax_experiment is already in memory, but your code should create the other objects necessary for this comparison.


In [19]:
from nltk.corpus import opinion_lexicon

# Use set for fast membership checking:
positive = set(opinion_lexicon.positive())
negative = set(opinion_lexicon.negative())

##### YOUR CODE HERE

A more powerful vector-summing baseline [2 points]

In Distributed representations as features, we looked at a baseline for the ternary SST problem in which each example is modeled as the sum of its 50-dimensional GloVe representations. A LogisticRegression model was used for prediction. A neural network might do better with these representations, since there might be complex relationships between the input feature dimensions that a linear classifier can't learn.
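
As a reminder, the featurization from that notebook amounts to something like the following sketch (GLOVE_LOOKUP, the GloVe path, and glove_sum_phi are illustrative names; utils.glove2dict is assumed to be available as in the other course notebooks, and sst.experiment will need vectorize=False, since these features are already real-valued vectors rather than count dictionaries):

GLOVE_LOOKUP = utils.glove2dict(
    os.path.join('data', 'glove.6B', 'glove.6B.50d.txt'))

def glove_sum_phi(tree, lookup=GLOVE_LOOKUP, dim=50):
    # Sum the GloVe vectors for the leaves; words missing from the
    # lookup are simply skipped:
    vecs = [lookup[w] for w in tree.leaves() if w in lookup]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)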

To address this question, rerun the experiment with TorchShallowNeuralClassifier as the classifier. Specs:

  • Use sst.experiment to conduct the experiment.
  • Using 3-fold cross-validation, exhaustively explore this set of hyperparameter combinations:
    • The hidden dimensionality at 50, 100, and 200.
    • The hidden activation function as nn.Tanh or nn.ReLU.
  • (For all other parameters to TorchShallowNeuralClassifier, use the defaults.)

For this problem, submit code to do the following:

  1. Your model wrapper function around TorchShallowNeuralClassifier. This function should implement the requisite cross-validation; see this notebook section for examples.
  2. The classification report as printed by sst.experiment. (This will print out when you run sst.experiment. That print-out suffices.)
  3. The optimal hyperparameters chosen in your experiment. (This too will print out when you run sst.experiment. The print-out again suffices.)

We're not evaluating the quality of your model. (We've specified the protocols completely, but there will still be variation in the results.) However, the primary goal of this question is to get you thinking more about this strikingly good baseline feature representation scheme for SST, so we're sort of hoping you feel compelled to try out variations on your own.


In [20]:
##### YOUR CODE HERE

Your original system [3 points]

Your task is to develop an original model for the SST ternary problem, using only the root-level labels (again, you cannot make any use of the subtree labels). There are many options. If you spend more than a few hours on this homework problem, you should consider letting it grow into your final project! Here are some relatively manageable ideas that you might try:

  1. We didn't systematically evaluate the bidirectional option to the TorchRNNClassifier. Similarly, that model could be tweaked to allow multiple LSTM layers (at present there is only one), and you could try adding layers to the classifier portion of the model as well.

  2. We've already glimpsed the power of rich initial word representations, and later in the course we'll see that smart initialization usually leads to a performance gain in NLP, so you could perhaps achieve a winning entry with a simple model that starts in a great place.

  3. The practical introduction to contextual word representations (to be discussed later in the quarter) covers pretrained representations and interfaces that are likely to boost the performance of any system.

  4. The TreeNN and TorchTreeNN don't perform all that well, and this could be for the same reason that RNNs don't perform well: the gradient signal doesn't propagate reliably down inside very deep trees. Tai et al. 2015 sought to address this with TreeLSTMs, which are fairly easy to implement in PyTorch.

  5. In the distributed representations as features section, we just summed all of the leaf-node GloVe vectors to obtain a fixed-dimensional representation for all sentences. This ignores all of the tree structure. See if you can do better by paying attention to the binary tree structure: write a function glove_subtree_phi that obtains a vector representation for each subtree by combining the vectors of its daughters, with the leaf nodes again given by GloVe (any dimension you like) and the full representation of the sentence given by the final vector obtained by this recursive process. You can decide on how you combine the vectors.

  6. If you have a lot of computing resources, then you can fire off a large hyperparameter search over many parameter values. All the model classes for this course are compatible with the scikit-learn and scikit-optimize methods, because they define the required functions for getting and setting parameters.

We want to emphasize that this needs to be an original system. It doesn't suffice to download code from the Web, retrain, and submit. You can build on others' code, but you have to do something new and meaningful with it.

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.


In [21]:
# Enter your system description in this cell.
# Please do not remove this comment.

Bake-off [1 point]

As we said above, the bake-off evaluation data is the official SST test-set release. For this bake-off, you'll evaluate your original system from the above homework problem on the test set, using the ternary class problem. Rules:

  1. Only one evaluation is permitted.
  2. No additional system tuning is permitted once the bake-off has started.
  3. As noted above, you cannot make any use of the subtree labels.

The cells below this one constitute your bake-off entry.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

The announcement will include the details on where to submit your entry.


In [22]:
# Enter your bake-off assessment code in this cell. 
# Please do not remove this comment.

##### YOUR CODE HERE

In [23]:
# On an otherwise blank line in this cell, please enter
# your macro-average F1 value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.