Natural language inference: task and datasets


In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

Overview

Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences, and larger units (paragraphs, documents, ...). Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, and so forth.

Dagan et al. (2006), one of the foundational papers on NLI (also called Recognizing Textual Entailment; RTE), make a case for the generality of this task in NLU:

It seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. For example, a QA system has to identify texts that entail a hypothesized answer. [...] Similarly, for certain Information Retrieval queries the combination of semantic concepts and relations denoted by the query should be entailed from relevant retrieved documents. [...] In multi-document summarization a redundant sentence, to be omitted from the summary, should be entailed from other sentences in the summary. And in MT evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations should entail each other. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition "engines" which may provide useful generic modules across applications.

Our version of the task

Our NLI data will look like this:

Premise | Relation | Hypothesis
turtle | contradiction | linguist
A turtle danced | entails | A turtle moved
Every reptile danced | entails | Every turtle moved
Some turtles walk | contradicts | No turtles move
James Byron Dean refused to move without blue jeans | entails | James Dean didn't dance without pants

In the word-entailment bakeoff, we looked at a special case of this where the premise and hypothesis are single words. This notebook begins to introduce the problem of NLI more fully.

Primary resources

We're going to focus on three NLI corpora:

  • The Stanford Natural Language Inference corpus (SNLI)
  • The Multi-Genre NLI Corpus (MultiNLI)
  • The Adversarial NLI Corpus (ANLI)

The first was collected by a group at Stanford led by Sam Bowman, and the second was collected by a group at NYU, also led by Sam Bowman. Both have the same format and were crowdsourced using the same basic methods. However, SNLI is entirely focused on image captions, whereas MultiNLI includes a greater range of contexts.

The third corpus was collected by a group at Facebook AI and UNC Chapel Hill. The team's goal was to address the fact that datasets like SNLI and MultiNLI seem to be artificially easy – models trained on them can often surpass stated human performance levels but still fail on examples that are simple and intuitive for people. The dataset is "Adversarial" because the annotators were asked to try to construct examples that fooled strong models but still passed muster with other human readers.

This notebook presents tools for working with these corpora. The second notebook in the unit concerns models of NLI.

Set-up

  • As usual, you need to be fully set up to work with the CS224u repository.

  • If you haven't already, download the course data, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change DATA_HOME below.)


In [2]:
import nli
import os
import pandas as pd
import random

In [3]:
DATA_HOME = os.path.join("data", "nlidata")

SNLI_HOME = os.path.join(DATA_HOME, "snli_1.0")

MULTINLI_HOME = os.path.join(DATA_HOME, "multinli_1.0")

ANNOTATIONS_HOME = os.path.join(DATA_HOME, "multinli_1.0_annotations")

ANLI_HOME = os.path.join(DATA_HOME, "anli_v0.1")
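As an optional sanity check (not part of the original setup code), you can verify that the expected folders are present; if this fails, adjust DATA_HOME above to point to your copy of the course data:

for d in (SNLI_HOME, MULTINLI_HOME, ANNOTATIONS_HOME, ANLI_HOME):
    # Each of these directories should exist once the data are unpacked.
    assert os.path.isdir(d), "Missing {}; check DATA_HOME".format(d)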

SNLI

SNLI properties

For SNLI (and MultiNLI), MTurk annotators were presented with premise sentences and asked to produce new sentences that entailed, contradicted, or were neutral with respect to the premise. A subset of the examples were then validated by an additional four MTurk annotators.

  • 550,152 train examples; 10K dev; 10K test
  • Mean length in tokens:

    • Premise: 14.1
    • Hypothesis: 8.3
  • Clause-types

    • Premise S-rooted: 74%
    • Hypothesis S-rooted: 88.9%
  • Vocab size: 37,026
  • 56,951 examples validated by four additional annotators
    • 58.3% examples with unanimous gold label
    • 91.2% of gold labels match the author's label
    • 0.70 overall Fleiss kappa
  • Top scores currently around 90%.

Working with SNLI

The following readers should make it easy to work with SNLI:

  • nli.SNLITrainReader
  • nli.SNLIDevReader

Writing a Test reader is easy and so is left to any user who decides that a test-set evaluation is appropriate. We omit that code as a subtle way of discouraging use of the test set during project development.

The base class, nli.NLIReader, is used by all the readers discussed here.

Because the datasets are so large, it is often useful to be able to randomly sample from them. All of the reader classes discussed here support this with their keyword argument samp_percentage. For example, the following samples approximately 10% of the examples from the SNLI training set:


In [4]:
nli.SNLITrainReader(SNLI_HOME, samp_percentage=0.10, random_state=42)


Out[4]:
"NLIReader({'src_filename': 'data/nlidata/snli_1.0/snli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42, 'gold_label_attr_name': 'gold_label'})

The precise number of examples will vary somewhat because of the way the sampling is done. (Here, we trade precision in the number of cases we return for efficiency; see the implementation for details.)
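To see why the count is only approximate, here is a minimal sketch of the kind of per-example sampling involved (illustrative only; the helper name below is hypothetical; see nli.py for the actual implementation):

import random

def sample_stream(examples, samp_percentage=0.10, random_state=42):
    """Keep each example independently with probability `samp_percentage`,
    so the exact number of examples returned varies from run to run."""
    random.seed(random_state)
    for ex in examples:
        if random.random() <= samp_percentage:
            yield ex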

All of the readers have a read method that yields NLIExample example instances. For SNLI, these have the following attributes:

  • annotator_labels: list of str
  • captionID: str
  • gold_label: str
  • pairID: str
  • sentence1: str
  • sentence1_binary_parse: nltk.tree.Tree
  • sentence1_parse: nltk.tree.Tree
  • sentence2: str
  • sentence2_binary_parse: nltk.tree.Tree
  • sentence2_parse: nltk.tree.Tree

The following creates the label distribution for the training data:


In [5]:
snli_labels = pd.Series(
    [ex.gold_label for ex in nli.SNLITrainReader(
        SNLI_HOME, filter_unlabeled=False).read()])

snli_labels.value_counts()


Out[5]:
entailment       183416
contradiction    183187
neutral          182764
-                   785
dtype: int64

Use filter_unlabeled=True (the default) to silently drop the examples for which gold_label is -.
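For example, rereading the training set with the default filtering in place should reproduce the distribution above minus the '-' row:

filtered_snli_labels = pd.Series(
    [ex.gold_label for ex in nli.SNLITrainReader(SNLI_HOME).read()])

filtered_snli_labels.value_counts()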

Let's look at a specific example in some detail:


In [6]:
snli_iterator = iter(nli.SNLITrainReader(SNLI_HOME).read())

In [7]:
snli_ex = next(snli_iterator)

In [8]:
print(snli_ex)


"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})

In [9]:
snli_ex


Out[9]:
"NLIExample({'annotator_labels': ['neutral'], 'captionID': '3416050480.jpg#4', 'gold_label': 'neutral', 'pairID': '3416050480.jpg#4r1n', 'sentence1': 'A person on a horse jumps over a broken down airplane.', 'sentence1_binary_parse': Tree('X', [Tree('X', [Tree('X', ['A', 'person']), Tree('X', ['on', Tree('X', ['a', 'horse'])])]), Tree('X', [Tree('X', ['jumps', Tree('X', ['over', Tree('X', ['a', Tree('X', ['broken', Tree('X', ['down', 'airplane'])])])])]), '.'])]), 'sentence1_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('PP', [Tree('IN', ['on']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['horse'])])])]), Tree('VP', [Tree('VBZ', ['jumps']), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['broken']), Tree('JJ', ['down']), Tree('NN', ['airplane'])])])]), Tree('.', ['.'])])]), 'sentence2': 'A person is training his horse for a competition.', 'sentence2_binary_parse': Tree('X', [Tree('X', ['A', 'person']), Tree('X', [Tree('X', ['is', Tree('X', [Tree('X', ['training', Tree('X', ['his', 'horse'])]), Tree('X', ['for', Tree('X', ['a', 'competition'])])])]), '.'])]), 'sentence2_parse': Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['A']), Tree('NN', ['person'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBG', ['training']), Tree('NP', [Tree('PRP$', ['his']), Tree('NN', ['horse'])]), Tree('PP', [Tree('IN', ['for']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['competition'])])])])]), Tree('.', ['.'])])])})

As you can see from the above attribute list, there are three versions of the premise and hypothesis sentences:

  1. Regular string representations of the data
  2. Unlabeled binary parses
  3. Labeled parses

In [10]:
snli_ex.sentence1


Out[10]:
'A person on a horse jumps over a broken down airplane.'

The binary parses lack node labels; so that we can represent them with nltk.tree.Tree, the label X is assigned to every node:


In [11]:
snli_ex.sentence1_binary_parse


Out[11]:

Here's the full parse tree with syntactic categories:


In [12]:
snli_ex.sentence1_parse


Out[12]:

The leaves of either tree give a tokenized version of the underlying sentence:


In [13]:
snli_ex.sentence1_parse.leaves()


Out[13]:
['A',
 'person',
 'on',
 'a',
 'horse',
 'jumps',
 'over',
 'a',
 'broken',
 'down',
 'airplane',
 '.']
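Since both parses are built over the same tokenization, their leaves should coincide:

# Quick check that the binary and labeled parses share the same leaves:
snli_ex.sentence1_binary_parse.leaves() == snli_ex.sentence1_parse.leaves()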

MultiNLI

MultiNLI properties

  • Train premises drawn from five genres:
    1. Fiction: works from 1912–2010 spanning many genres
    2. Government: reports, letters, speeches, etc., from government websites
    3. The Slate website
    4. Telephone: the Switchboard corpus
    5. Travel: Berlitz travel guides
  • Additional genres just for dev and test (the mismatched condition):
    1. The 9/11 report
    2. Face-to-face: The Charlotte Narrative and Conversation Collection
    3. Fundraising letters
    4. Non-fiction from Oxford University Press
    5. Verbatim articles about linguistics
  • 392,702 train examples; 20K dev; 20K test
  • 19,647 examples validated by four additional annotators
    • 58.2% examples with unanimous gold label
    • 92.6% of gold labels match the author's label
  • Test-set labels available as a Kaggle competition.

    • Top matched scores currently around 0.81.
    • Top mismatched scores currently around 0.83.

Working with MultiNLI

For MultiNLI, we have the following readers:

  • nli.MultiNLITrainReader
  • nli.MultiNLIMatchedDevReader
  • nli.MultiNLIMismatchedDevReader

The MultiNLI test sets are available on Kaggle (matched version and mismatched version).

The interface to these is the same as for the SNLI readers:


In [14]:
nli.MultiNLITrainReader(MULTINLI_HOME, samp_percentage=0.10, random_state=42)


Out[14]:
"NLIReader({'src_filename': 'data/nlidata/multinli_1.0/multinli_1.0_train.jsonl', 'filter_unlabeled': True, 'samp_percentage': 0.1, 'random_state': 42, 'gold_label_attr_name': 'gold_label'})

The NLIExample instances for MultiNLI have nearly the same attributes as those for SNLI. The differences are that captionID is replaced by promptID and each example additionally records its genre:

  • annotator_labels: list of str
  • genre: str
  • gold_label: str
  • pairID: str
  • promptID: str
  • sentence1: str
  • sentence1_binary_parse: nltk.tree.Tree
  • sentence1_parse: nltk.tree.Tree
  • sentence2: str
  • sentence2_binary_parse: nltk.tree.Tree
  • sentence2_parse: nltk.tree.Tree

The full label distribution:


In [15]:
multinli_labels = pd.Series(
    [ex.gold_label for ex in nli.MultiNLITrainReader(
        MULTINLI_HOME, filter_unlabeled=False).read()])

multinli_labels.value_counts()


Out[15]:
contradiction    130903
neutral          130900
entailment       130899
dtype: int64

No examples in the MultiNLI train set lack a gold label, so the value of the filter_unlabeled parameter has no effect here, but it does have an effect in the Dev versions.
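Because each MultiNLI example records its genre (see the attribute list above), it is also easy to see how the training premises are distributed across the five genres:

multinli_genres = pd.Series(
    [ex.genre for ex in nli.MultiNLITrainReader(MULTINLI_HOME).read()])

multinli_genres.value_counts()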

Annotated MultiNLI subsets

MultiNLI includes additional annotations for a subset of the dev examples. The goal is to help people understand how well their models are doing on crucial NLI-related linguistic phenomena.


In [16]:
matched_ann_filename = os.path.join(
    ANNOTATIONS_HOME,
    "multinli_1.0_matched_annotations.txt")

mismatched_ann_filename = os.path.join(
    ANNOTATIONS_HOME, 
    "multinli_1.0_mismatched_annotations.txt")

In [17]:
def view_random_example(annotations, random_state=42):
    random.seed(random_state)
    ann_ex = random.choice(list(annotations.items()))
    pairid, ann_ex = ann_ex
    ex = ann_ex['example']   
    print("pairID: {}".format(pairid))
    print(ann_ex['annotations'])
    print(ex.sentence1)
    print(ex.gold_label)
    print(ex.sentence2)

In [18]:
matched_ann = nli.read_annotated_subset(matched_ann_filename, MULTINLI_HOME)

In [19]:
view_random_example(matched_ann)


pairID: 63218c
[]
Recently, however, I have settled down and become decidedly less experimental.
contradiction
I am still as experimental as ever, and I am always on the move.
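As the helper above shows, each value in the annotations dict has an 'annotations' list of tag strings (empty for this example) and an 'example' holding the corresponding NLIExample. Given that structure, you can tally how often each annotation tag occurs:

from collections import Counter

# Count annotation tags across all annotated matched-dev examples:
tag_counts = Counter(
    tag for d in matched_ann.values() for tag in d['annotations'])

tag_counts.most_common(10)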

Adversarial NLI

Adversarial NLI properties

The ANLI dataset was created in response to evidence that datasets like SNLI and MultiNLI are artificially easy for modern machine learning models to solve. The team sought to tackle this weakness head-on, by designing a crowdsourcing task in which annotators were explicitly trying to confuse state-of-the-art models. In broad outline, the task worked like this:

  1. The crowdworker is presented with a premise (context) text and asked to construct a hypothesis sentence that entails, contradicts, or is neutral with respect to that premise. (The precise wording is more informal, along the lines of the SNLI/MultiNLI task.)

  2. The crowdworker submits a hypothesis text.

  3. The premise/hypothesis pair is fed to a trained model that makes a prediction about the correct NLI label.

  4. If the model's prediction is correct, then the crowdworker loops back to step 2 to try again. If the model's prediction is incorrect, then the example is validated by different crowdworkers.

The dataset consists of three rounds, each involving a different model and a different set of sources for the premise texts:

Round | Model | Training data | Context sources
1 | BERT-large | SNLI + MultiNLI | Wikipedia
2 | RoBERTa | SNLI + MultiNLI + NLI-FEVER + Round 1 | Wikipedia
3 | RoBERTa | SNLI + MultiNLI + NLI-FEVER + Round 1 | Various

Each round has train/dev/test splits. The sizes of these splits and their label distributions are calculated just below.

The project README seeks to establish some rules for how the rounds can be used for training and evaluation.

Working with Adversarial NLI

For ANLI, we have the following readers:

  • nli.ANLITrainReader
  • nli.ANLIDevReader

As with SNLI, we leave the writing of a Test version to the user, as a way of discouraging inadvertent use of the test set during project development.

Because ANLI is distributed in three rounds, and the rounds can be used independently or pooled, the interface has a rounds argument. The default is rounds=(1,2,3), but any subset of them can be specified. Here are some illustrations using the Train reader; the Dev interface is the same:


In [20]:
for rounds in ((1,), (2,), (3,), (1,2,3)):
    count = len(list(nli.ANLITrainReader(ANLI_HOME, rounds=rounds).read()))
    print("R{0:}: {1:,}".format(rounds, count))


R(1,): 16,946
R(2,): 45,460
R(3,): 100,459
R(1, 2, 3): 162,865

The above figures correspond to those in Table 2 of the paper. I am not sure what accounts for the differences of 100 examples in round 2 (and, in turn, in the grand total).
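As noted above, the Dev reader has the same interface; for instance, this counts the round 1 dev examples:

len(list(nli.ANLIDevReader(ANLI_HOME, rounds=(1,)).read()))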

ANLI uses a different set of attributes from SNLI/MultiNLI. Here is a summary of what NLIExample instances offer for this corpus:

  • uid: a unique identifier; akin to pairID in SNLI/MultiNLI
  • context: the premise; corresponds to sentence1 in SNLI/MultiNLI
  • hypothesis: the hypothesis; corresponds to sentence2 in SNLI/MultiNLI
  • label: the gold label; corresponds to gold_label in SNLI/MultiNLI
  • model_label: the label predicted by the model used in the current round
  • reason: a crowdworker's free-text hypothesis about why the model made an incorrect prediction for the current context/hypothesis pair
  • emturk: for dev (and test) examples, True if the annotator contributed only dev (test) examples, else False; it is False for all train examples.
  • genre: the source for the context text
  • tag: information about the round and train/dev/test classification

All these attributes are str-valued except for emturk, which is bool-valued.
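To make the correspondence with SNLI/MultiNLI concrete, here is a quick look at a single round 1 training example (output omitted here):

# Pull the first round 1 training example and inspect a few fields:
anli_ex = next(iter(nli.ANLITrainReader(ANLI_HOME, rounds=(1,)).read()))

print(anli_ex.uid)
print(anli_ex.context)
print(anli_ex.hypothesis)
print(anli_ex.label, anli_ex.model_label)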

The labels in this dataset are conceptually the same as for SNLI/MultiNLI, but they are encoded differently:


In [21]:
anli_labels = pd.Series([ex.label for ex in nli.ANLITrainReader(ANLI_HOME).read()])

anli_labels.value_counts()


Out[21]:
n    68789
e    52111
c    41965
dtype: int64
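If you want these labels to line up with the SNLI/MultiNLI encoding, a simple mapping suffices (the dict name here is just illustrative):

ANLI_TO_NLI_LABELS = {'e': 'entailment', 'c': 'contradiction', 'n': 'neutral'}

anli_labels.map(ANLI_TO_NLI_LABELS).value_counts()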

For the dev set, the label and model_label values are always different, suggesting that these evaluations will be very challenging for present-day models:


In [22]:
pd.Series(
    [ex.label == ex.model_label for ex in nli.ANLIDevReader(ANLI_HOME).read()]
).value_counts()


Out[22]:
False    3200
dtype: int64

In the train set, they do sometimes correspond, and you can track the changes in the rate of correct model predictions across the rounds:


In [23]:
for r in (1,2,3):
    dist = pd.Series(
        [ex.label == ex.model_label for ex in nli.ANLITrainReader(ANLI_HOME, rounds=(r,)).read()]
    ).value_counts()
    dist = dist / dist.sum()
    dist.name = "Round {}".format(r)
    print(dist, end="\n\n")


True     0.821197
False    0.178803
Name: Round 1, dtype: float64

True     0.932028
False    0.067972
Name: Round 2, dtype: float64

True     0.915916
False    0.084084
Name: Round 3, dtype: float64

This corresponds to Table 2, "Model error rate (Verified)", in the paper. (I am not sure what accounts for the slight differences in the percentages.)

Other NLI datasets