In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"
We have a few inter-related goals for this unit:
- Provide a basic introduction to supervised learning in the context of a problem that has long been central to academic research and industry applications: sentiment analysis.
- Explore and evaluate a diverse array of methods for modeling sentiment.
- Begin discussing and implementing responsible methods for hyperparameter optimization and classifier assessment and comparison.
The unit is built around the Stanford Sentiment Treebank (SST), a widely-used resource for evaluating supervised NLU models, and one that provides rich linguistic representations.
If you're relatively new to supervised learning, we suggest studying the details of this notebook closely and following the links to additional resources.
If you're familiar with supervised learning, then you can focus right away on innovative feature representations and modeling.
As of this writing, the state-of-the-art for the SST seems to be around 88% accuracy for the binary problem and 48% accuracy for the five-class problem. Perhaps you can best these numbers!
Sentiment analysis seems simple at first but turns out to exhibit all of the complexity of full natural language understanding. To see this, consider how your intuitions about the sentiment of the following sentences can change depending on perspective, social relationships, tone of voice, and other aspects of the context of utterance:
SST mostly steers around these challenges by including only focused, evaluative texts (sentences from movie reviews), but you should have them in mind if you consider new domains and applications for the ideas.
Make sure your environment includes all the requirements for the cs224u repository.
If you haven't already, download the course data, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change SST_HOME below.)
In [2]:
from nltk.tree import Tree
import os
import pandas as pd
import sst
In [3]:
SST_HOME = os.path.join('data', 'trees')
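If you've put the data somewhere else, just point SST_HOME there. As a quick sanity check, you can verify that the expected files are present (a sketch; it assumes the standard SST distribution, whose trees directory ships train.txt, dev.txt, and test.txt):

for split in ('train.txt', 'dev.txt', 'test.txt'):
    # Assumed filenames from the standard SST 'trees' release;
    # adjust if your copy differs.
    assert os.path.exists(os.path.join(SST_HOME, split)), \
        "Missing {} -- check SST_HOME".format(split)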
The train/dev/test SST distribution contains files that are lists of trees where the part-of-speech tags have been replaced with sentiment scores 0...4:

- '0' and '1' are negative labels.
- '2' is a neutral label.
- '3' and '4' are positive labels.

Our readers are iterators that yield (tree, score) pairs, where tree is an NLTK Tree instance and score is a string.
In [4]:
tree, score = next(sst.train_reader(SST_HOME))
Here, score is one of the labels, and tree is an NLTK Tree instance. It should render pretty legibly in your browser:
In [5]:
tree
Out[5]:
This is what it actually looks like, of course (wrapping the tree in a tuple suppresses the graphical rendering, so the output shows the raw bracketed string):
In [6]:
(tree,)
Out[6]:
Here's a smaller example:
In [7]:
Tree.fromstring("""(4 (2 NLU) (4 (2 is) (4 enlightening)))""")
Out[7]:
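The standard NLTK Tree API applies to these objects. As a quick illustration using the small tree above (nothing here is specific to the sst module):

t = Tree.fromstring("""(4 (2 NLU) (4 (2 is) (4 enlightening)))""")

print(t.label())    # root sentiment score: '4'
print(t.leaves())   # tokens: ['NLU', 'is', 'enlightening']

# Every subtree carries its own sentiment label:
for subtree in t.subtrees():
    print(subtree.label(), subtree.leaves())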
We've deliberately ignored the test readers. We urge you not to use the test set until and unless you are running experiments for a final project or similar. Overuse of test sets corrupts them, since even subtle lessons learned from those runs can be incorporated back into model-building efforts.
We actually have mixed feelings about the overuse of dev that might result from working with these notebooks! We've tried to encourage using just splits of the training data for assessment most of the time, with only occasional use of dev. This will give you a clearer picture of how you will ultimately do on test; over-use of dev can lead to over-fitting on that particular dataset, with a resulting loss of performance on test.
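For concreteness, here's one way to carve a local assessment set out of the training data using scikit-learn's train_test_split (a sketch: the 80/20 ratio and the fixed random seed are illustrative choices, not part of the sst module):

from sklearn.model_selection import train_test_split

train_pairs = list(sst.train_reader(SST_HOME))

# Hold out 20% of train as a local assessment set, stratified by
# label so that both splits keep the original class distribution:
labels = [y for _, y in train_pairs]
train_split, assess_split = train_test_split(
    train_pairs, test_size=0.20, random_state=42, stratify=labels)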
Working with the SST involves making decisions about how to handle the raw SST labels. The interpretation of these labels is as follows (Socher et al. 2013, sec. 3):

- '0': very negative
- '1': negative
- '2': neutral
- '3': positive
- '4': very positive

The labels look like they could be treated as totally ordered, even continuous. However, conceptually, they do not form such an order. Rather, they consist of three separate classes, with the negative and positive classes being totally ordered in opposite directions:

- '0' > '1': negative
- '2': neutral
- '4' > '3': positive

Thus, in this notebook, we'll look mainly at binary (positive/negative) and ternary tasks.
A related note: the above shows that the fine-grained sentiment task for the SST is particularly punishing as usually formulated, since it ignores the partial-order structure in the categories completely. As a result, mistaking '0' for '1' is as bad as mistaking '0' for '4', though the first error is clearly less severe than the second.
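To make that concrete, plain accuracy assigns the same penalty to a near-miss as to a maximal error (a small illustration using scikit-learn; the toy labels are made up):

from sklearn.metrics import accuracy_score

gold = ['0', '0']
near_miss = ['1', '1']    # off by one step on the negative scale
worst_case = ['4', '4']   # very negative mistaken for very positive

print(accuracy_score(gold, near_miss))   # 0.0
print(accuracy_score(gold, worst_case))  # 0.0 -- the same penalty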
The functions sst.binary_class_func and sst.ternary_class_func will convert the labels for you; the recommended usage is to pass them as the class_func keyword argument to train_reader and dev_reader. Examples below.
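As a sketch of the conversion logic, the binary function plausibly behaves like the following (an assumption inferred from the label counts below, not a copy of the module's source):

def binary_class_func_sketch(y):
    """Hypothetical stand-in for sst.binary_class_func: collapse the
    five raw labels to two classes, dropping neutral examples."""
    if y in ("0", "1"):
        return "negative"
    elif y in ("3", "4"):
        return "positive"
    else:
        return None  # neutral '2' examples get filtered out

The ternary version would instead map '2' to its own "neutral" class, keeping all examples.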
Check that these numbers all match those reported in Socher et al. 2013, sec 5.1.
In [8]:
train_labels = [y for tree, y in sst.train_reader(SST_HOME)]
In [9]:
print("Total train examples: {:,}".format(len(train_labels)))
Distribution over the full label set:
In [10]:
pd.Series(train_labels).value_counts()
Out[10]:
Binary label conversion:
In [11]:
binary_train_labels = [
y for tree, y in sst.train_reader(SST_HOME, class_func=sst.binary_class_func)]
In [12]:
print("Total binary train examples: {:,}".format(len(binary_train_labels)))
In [13]:
pd.Series(binary_train_labels).value_counts()
Out[13]:
Ternary label conversion:
In [14]:
ternary_train_labels = [
y for tree, y in sst.train_reader(SST_HOME, class_func=sst.ternary_class_func)]
pd.Series(ternary_train_labels).value_counts()
Out[14]:
The same checks for the dev set: these numbers should also match those reported in Socher et al. 2013, sec 5.1.
In [15]:
dev_labels = [y for tree, y in sst.dev_reader(SST_HOME)]
In [16]:
print("Total dev examples: {:,}".format(len(dev_labels)))
In [17]:
pd.Series(dev_labels).value_counts()
Out[17]:
Binary label conversion:
In [18]:
binary_dev_labels = [
y for tree, y in sst.dev_reader(SST_HOME, class_func=sst.binary_class_func)]
In [19]:
print("Total binary dev examples: {:,}".format(len(binary_dev_labels)))
In [20]:
pd.Series(binary_dev_labels).value_counts()
Out[20]:
Ternary label conversion:
In [21]:
ternary_dev_labels = [
y for tree, y in sst.dev_reader(SST_HOME, class_func=sst.ternary_class_func)]
pd.Series(ternary_dev_labels).value_counts()
Out[21]:
Here are a few publicly available datasets and other resources; if you decide to work on sentiment analysis, get in touch with the teaching staff — we have a number of other resources that we can point you to.