In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"
Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences (and paragraphs, documents, ...). Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.
Dagan et al. (2006), one of the foundational papers on NLI (also called Recognizing Textual Entailment; RTE), make a case for the generality of this task in NLU:
It seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. For example, a QA system has to identify texts that entail a hypothesized answer. [...] Similarly, for certain Information Retrieval queries the combination of semantic concepts and relations denoted by the query should be entailed from relevant retrieved documents. [...] In multi-document summarization a redundant sentence, to be omitted from the summary, should be entailed from other sentences in the summary. And in MT evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations should entail each other. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition "engines" which may provide useful generic modules across applications.
Our NLI data will look like this:
| Premise | Relation | Hypothesis |
|---|---|---|
| turtle | contradicts | linguist |
| A turtle danced | entails | A turtle moved |
| Every reptile danced | entails | Every turtle moved |
| Some turtles walk | contradicts | No turtles move |
| James Byron Dean refused to move without blue jeans | entails | James Dean didn't dance without pants |
In the word-entailment bakeoff, we looked at a special case of this where the premise and hypothesis are single words. This notebook begins to introduce the problem of NLI more fully.
We're going to focus on three NLI corpora:

- The Stanford Natural Language Inference corpus (SNLI)
- The Multi-Genre NLI Corpus (MultiNLI)
- The Adversarial NLI Corpus (ANLI)
The first was collected by a group at Stanford, led by Sam Bowman, and the second was collected by a group at NYU, also led by Sam Bowman. Both have the same format and were crowdsourced using the same basic methods. However, SNLI is entirely focused on image captions, whereas MultiNLI includes a greater range of contexts.
The third corpus was collected by a group at Facebook AI and UNC Chapel Hill. The team's goal was to address the fact that datasets like SNLI and MultiNLI seem to be artificially easy – models trained on them can often surpass stated human performance levels but still fail on examples that are simple and intuitive for people. The dataset is "Adversarial" because the annotators were asked to try to construct examples that fooled strong models but still passed muster with other human readers.
This notebook presents tools for working with these corpora. The second notebook in the unit concerns models of NLI.
As usual, you need to be fully set up to work with the CS224u repository.
If you haven't already, download the course data, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change DATA_HOME below.)
In [2]:
import nli
import os
import pandas as pd
import random
In [3]:
DATA_HOME = os.path.join("data", "nlidata")
SNLI_HOME = os.path.join(DATA_HOME, "snli_1.0")
MULTINLI_HOME = os.path.join(DATA_HOME, "multinli_1.0")
ANNOTATIONS_HOME = os.path.join(DATA_HOME, "multinli_1.0_annotations")
ANLI_HOME = os.path.join(DATA_HOME, "anli_v0.1")
For SNLI (and MultiNLI), MTurk annotators were presented with premise sentences and asked to produce new sentences that entailed, contradicted, or were neutral with respect to the premise. A subset of the examples were then validated by an additional four MTurk annotators.
(Corpus statistics omitted here: mean sentence length in tokens and clause-type distributions for the premises and hypotheses.)
The following readers should make it easy to work with SNLI:
- nli.SNLITrainReader
- nli.SNLIDevReader

Writing a Test reader is easy and so is left to the user who decides that a test-set evaluation is appropriate. We omit that code as a subtle way of discouraging use of the test set during project development.
The base class, nli.NLIReader, is used by all the readers discussed here.
Because the datasets are so large, it is often useful to be able to randomly sample from them. All of the reader classes discussed here support this with their keyword argument samp_percentage. For example, the following samples approximately 10% of the examples from the SNLI training set:
In [4]:
nli.SNLITrainReader(SNLI_HOME, samp_percentage=0.10, random_state=42)
Out[4]:
The precise number of examples will vary somewhat because of the way the sampling is done. (Here, we trade precision in the number of cases returned for efficiency; see the implementation for details.)
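Here's a minimal sketch of the kind of per-example sampling involved (an assumption about the implementation, not the implementation itself; sample_stream is just an illustrative helper):

import random

def sample_stream(examples, samp_percentage, random_state=42):
    # Yield each example independently with probability samp_percentage.
    # This takes a single pass over the data, but the number of examples
    # returned varies from run to run, which is why the counts above are
    # only approximately 10% of the full training set.
    rng = random.Random(random_state)
    for ex in examples:
        if rng.random() <= samp_percentage:
            yield ex

# Roughly 10% of a toy "corpus" of 1,000 items; the exact count varies:
len(list(sample_stream(range(1000), 0.10)))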
All of the readers have a read method that yields NLIExample instances. For SNLI, these have the following attributes:

- annotator_labels: list of str
- captionID: str
- gold_label: str
- pairID: str
- sentence1: str
- sentence1_binary_parse: nltk.tree.Tree
- sentence1_parse: nltk.tree.Tree
- sentence2: str
- sentence2_binary_parse: nltk.tree.Tree
- sentence2_parse: nltk.tree.Tree

The following creates the label distribution for the training data:
In [5]:
snli_labels = pd.Series(
[ex.gold_label for ex in nli.SNLITrainReader(
SNLI_HOME, filter_unlabeled=False).read()])
snli_labels.value_counts()
Out[5]:
Use filter_unlabeled=True (the default) to silently drop the examples for which gold_label is -.
Let's look at a specific example in some detail:
In [6]:
snli_iterator = iter(nli.SNLITrainReader(SNLI_HOME).read())
In [7]:
snli_ex = next(snli_iterator)
In [8]:
print(snli_ex)
In [9]:
snli_ex
Out[9]:
As you can see from the above attribute list, there are three versions of the premise and hypothesis sentences:
In [10]:
snli_ex.sentence1
Out[10]:
The binary parses lack node labels; so that we can represent them with nltk.tree.Tree, the dummy label X is added to every node:
In [11]:
snli_ex.sentence1_binary_parse
Out[11]:
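For a sense of how such a relabeling can be done, here's a minimal sketch, assuming binary-parse strings in the SNLI format, where every bracket is space-separated (the reader does this for you; binary_parse_to_tree is just an illustrative helper):

from nltk.tree import Tree

def binary_parse_to_tree(s):
    # SNLI binary parses look like "( ( A turtle ) danced )". The brackets
    # carry no node labels, so we insert the dummy label X at each opening
    # bracket before handing the string to nltk:
    return Tree.fromstring(s.replace("( ", "(X "))

binary_parse_to_tree("( ( A turtle ) danced )")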
Here's the full parse tree with syntactic categories:
In [12]:
snli_ex.sentence1_parse
Out[12]:
The leaves of either tree give a tokenized version of the sentence:
In [13]:
snli_ex.sentence1_parse.leaves()
Out[13]:
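Since both parses are over the same sentence, their leaf sequences should coincide; here's a quick sanity check of that expectation:

# The binary parse and the full parse should yield the same token sequence:
snli_ex.sentence1_parse.leaves() == snli_ex.sentence1_binary_parse.leaves()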
The SNLI test-set labels are available via a Kaggle competition.
For MultiNLI, we have the following readers:
- nli.MultiNLITrainReader
- nli.MultiNLIMatchedDevReader
- nli.MultiNLIMismatchedDevReader

The MultiNLI test sets are available on Kaggle (matched version and mismatched version).
The interface to these is the same as for the SNLI readers:
In [14]:
nli.MultiNLITrainReader(MULTINLI_HOME, samp_percentage=0.10, random_state=42)
Out[14]:
The NLIExample instances for MultiNLI have the same attributes as those for SNLI. Here is the list repeated from above for convenience:
- annotator_labels: list of str
- captionID: str
- gold_label: str
- pairID: str
- sentence1: str
- sentence1_binary_parse: nltk.tree.Tree
- sentence1_parse: nltk.tree.Tree
- sentence2: str
- sentence2_binary_parse: nltk.tree.Tree
- sentence2_parse: nltk.tree.Tree

The full label distribution:
In [15]:
multinli_labels = pd.Series(
[ex.gold_label for ex in nli.MultiNLITrainReader(
MULTINLI_HOME, filter_unlabeled=False).read()])
multinli_labels.value_counts()
Out[15]:
No examples in the MultiNLI train set lack a gold label, so the value of the filter_unlabeled parameter has no effect here, but it does have an effect in the Dev versions.
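The following sketch checks that claim for the matched Dev set; it assumes the Dev readers accept filter_unlabeled just like the Train readers discussed above:

matched_dev_labels = pd.Series(
    [ex.gold_label for ex in nli.MultiNLIMatchedDevReader(
        MULTINLI_HOME, filter_unlabeled=False).read()])

matched_dev_labels.value_counts()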
In [16]:
matched_ann_filename = os.path.join(
ANNOTATIONS_HOME,
"multinli_1.0_matched_annotations.txt")
mismatched_ann_filename = os.path.join(
ANNOTATIONS_HOME,
"multinli_1.0_mismatched_annotations.txt")
In [17]:
def view_random_example(annotations, random_state=42):
random.seed(random_state)
ann_ex = random.choice(list(annotations.items()))
pairid, ann_ex = ann_ex
ex = ann_ex['example']
print("pairID: {}".format(pairid))
print(ann_ex['annotations'])
print(ex.sentence1)
print(ex.gold_label)
print(ex.sentence2)
In [18]:
matched_ann = nli.read_annotated_subset(matched_ann_filename, MULTINLI_HOME)
In [19]:
view_random_example(matched_ann)
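If you want an overview rather than single examples, one option is to tally the annotation tags across the whole annotated subset. Here's a minimal sketch; it assumes, as view_random_example does above, that each value in matched_ann has an 'annotations' list of tag strings:

from collections import Counter

# Count annotation tags across the matched annotated subset:
tag_counts = Counter(
    tag
    for ann in matched_ann.values()
    for tag in ann['annotations'])

tag_counts.most_common(10)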
The ANLI dataset was created in response to evidence that datasets like SNLI and MultiNLI are artificially easy for modern machine learning models to solve. The team sought to tackle this weakness head-on, by designing a crowdsourcing task in which annotators were explicitly trying to confuse state-of-the-art models. In broad outline, the task worked like this:
1. The crowdworker is presented with a premise (context) text and asked to construct a hypothesis sentence that entails, contradicts, or is neutral with respect to that premise. (The precise wording is more informal, along the lines of the SNLI/MultiNLI task.)
2. The crowdworker submits a hypothesis text.
3. The premise/hypothesis pair is fed to a trained model that makes a prediction about the correct NLI label.
4. If the model's prediction is correct, then the crowdworker loops back to step 2 to try again. If the model's prediction is incorrect, then the example is validated by different crowdworkers.
The dataset consists of three rounds, each involving a different model and a different set of sources for the premise texts:
| Round | Model | Training data | Context sources |
|---|---|---|---|
| 1 | BERT-large | SNLI + MultiNLI | Wikipedia |
| 2 | RoBERTa | SNLI + MultiNLI + NLI-FEVER + Round 1 | Wikipedia |
| 3 | RoBERTa | SNLI + MultiNLI + NLI-FEVER + Rounds 1 and 2 | Various |
Each round has train/dev/test splits. The sizes of these splits and their label distributions are calculated just below.
The project README seeks to establish some rules for how the rounds can be used for training and evaluation.
For ANLI, we have the following readers:
- nli.ANLITrainReader
- nli.ANLIDevReader

As with SNLI, we leave the writing of a Test version to the user, as a way of discouraging inadvertent use of the test set during project development.
Because ANLI is distributed in three rounds, and the rounds can be used independently or pooled, the interface has a rounds argument. The default is rounds=(1,2,3), but any subset of them can be specified. Here are some illustrations using the Train reader; the Dev interface is the same:
In [20]:
for rounds in ((1,), (2,), (3,), (1,2,3)):
count = len(list(nli.ANLITrainReader(ANLI_HOME, rounds=rounds).read()))
print("R{0:}: {1:,}".format(rounds, count))
The above figures correspond to those in Table 2 of the paper. I am not sure what accounts for the differences of 100 examples in round 2 (and, in turn, in the grand total).
ANLI uses a different set of attributes from SNLI/MultiNLI. Here is a summary of what NLIExample instances offer for this corpus:
- uid: corresponds to pairID in SNLI/MultiNLI
- context: corresponds to sentence1 in SNLI/MultiNLI
- hypothesis: corresponds to sentence2 in SNLI/MultiNLI
- label: corresponds to gold_label in SNLI/MultiNLI
- model_label: the label predicted by the model used in that round
- emturk: True if the annotator contributed only dev (test) examples, else False; in turn, it is False for all train examples.

All these attributes are str-valued except for emturk, which is bool-valued.
The labels in this dataset are conceptually the same as for SNLI/MultiNLI, but they are encoded differently:
In [21]:
anli_labels = pd.Series([ex.label for ex in nli.ANLITrainReader(ANLI_HOME).read()])
anli_labels.value_counts()
Out[21]:
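If you'd like these to align with the SNLI/MultiNLI label names, a simple mapping suffices; this assumes ANLI's single-letter encoding (check the distribution above to confirm the label set):

# Hypothetical mapping from ANLI's short labels to SNLI/MultiNLI-style names:
label_map = {"e": "entailment", "c": "contradiction", "n": "neutral"}

anli_labels.map(label_map).value_counts()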
For the dev set, the label and model_label values are always different, suggesting that these evaluations will be very challenging for present-day models:
In [22]:
pd.Series(
[ex.label == ex.model_label for ex in nli.ANLIDevReader(ANLI_HOME).read()]
).value_counts()
Out[22]:
In the train set, they do sometimes correspond, and you can track the changes in the rate of correct model predictions across the rounds:
In [23]:
for r in (1,2,3):
dist = pd.Series(
[ex.label == ex.model_label for ex in nli.ANLITrainReader(ANLI_HOME, rounds=(r,)).read()]
).value_counts()
dist = dist / dist.sum()
dist.name = "Round {}".format(r)
print(dist, end="\n\n")
This corresponds to Table 2, "Model error rate (Verified)", in the paper. (I am not sure what accounts for the slight differences in the percentages.)
- The FraCaS textual inference test suite is a smaller, hand-built dataset that is great for evaluating a model's ability to handle complex logical patterns.
- SemEval 2013 had a wide range of interesting datasets for NLI and related tasks.
- The SemEval 2014 semantic relatedness shared task used an NLI dataset called Sentences Involving Compositional Knowledge (SICK).
- MedNLI is specialized to the medical domain, using data derived from MIMIC III.
- XNLI is a multilingual NLI dataset derived from MultiNLI.
- Diverse Natural Language Inference Collection (DNC) transforms existing annotations from other tasks into NLI problems for a diverse range of reasoning challenges.
- SciTail is an NLI dataset derived from multiple-choice science exam questions and Web text.
- NLI Style FEVER is a version of the FEVER dataset put into a standard NLI format. It was used by the Adversarial NLI team to train models for their annotation round 2.
- Models for NLI might be adapted for use with the 30M Factoid Question-Answer Corpus.
- Models for NLI might be adapted for use with the Penn Paraphrase Database.