"Challenges in Data-to-Document Generation", Wiseman, Shieber, Rush, 2017

Abstract

"Recent neural models have shown significant progress on the problem of generating short descriptive texts conditioned on a small number of database records. In this work, we suggest a slightly more difficult data-to-text generation task, and investigate how effective current approaches are on this task. In particular, we introduce a new, large-scale corpus of data records paired with descriptive documents, proposse a series of extractive evaluation methods for analyzing performance, and obtain baseline results using current neural generation methods. Experiments show that these models produce fluent text, but fail to convincingly approximate human-generated documents. Moreover, even templated baselines exceed the performance of these neural models on some metrics, though copy- and reconstruction-based extensions lead to noticeable improvements."

1. Introduction

"Over the past several years, neural text generation systems have shown impressive performance on tasks such as machine translation and summarization. As neural systems begin to move toward generating longer outputs in response to longer and more complicated inputs, however, the generated texts begin to display reference errors, inter-sentence incoherence, and a lack of fidelity to the source material. The goal of this paper is to suggest a particular, long-form generation task in which these challenges may be fruitfully explored, to provide a publicly available dataset for this task, to suggest some automatic evaluation metrics, and finally to establish how current, neural text generation methods perform on this task."

Aside Ideas from me on what might be important in a metric:

grammatical/linguistic coherence => standard linguistic models, eg lstm, rnn?
contains the data that it's supposed to contain => the extractive bit?
concise (vs contains a bunch of trivial/useless bits/irrelevant bits/noise) => simply use minus the length as the score?

"A classic problem in natural-language generation (NLG) (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997) involves taking structured data, such as a table, as input, and producing text that adequately and fluently describes this data as output. Unlike machine translation, which aims for a complete transduction of the sentence to be translated, this form of NLG is typically taken to require addressing (at least) two separate challenges: what to say, the selection of an appropriate subset of the input data to discuss, and how to say it, the surface realization of a generation (Reiter and Dale, 1997; Jurafsky and Martin, 2014). Traditionally, these two challenges have been modularized and handled separately by generation systems. However, neural generation systems, which are typically trained end-to-end as conditional language models (Mikolov et al, 2010; Sutskever et al, 2011, 2014), blur this distinction.

"In this context, we believe the problem of generating multi-sentence summaries of tables or database records to be a reasonable next-problem for neural techniques to tackle as they begin to consider more difficult NLG tasks. In particular, we would like this generation task to have the following two properties: (1) it is relatively easy to obtain fairly clean summaries and their corresponding databases for dataset construction, and (2) the summaries should be primarily focused on conveying the information in the database. This latter property ensures that the task is somewhat congenial to a standard encoder-decoder approach, and more importantly, that it is reasonable to evaluate generations in terms of their fidelity to the database.

"One task that meets these criteria is that of generating summaries of sports games from associated box-score data, and there is indeed a long history of NLG work that generates sports games summaries (Robin, 1994; Tanaka-Ishii et al., 1998, Barzilay and Lapata, 2005). To this end, we make the following contributions:

We introduce a new large-scale corpus consisting of textual descriptions of basketball games paired with extensive statistical tables. This dataset is sufficiently large that fully data-driven approaches might be sufficient.
We introduce a series of extractive evalution models to automatically evaluate output generation performance, exploiting the fact that post-hoc information extraction is significantly easier than generation itself.
We apply a series of state-of-the-art neural methods, as well as a simple templated generation system, to our data-to-document generation task in order to establish baseliens and study their generations.

"Our experiments indicate that neural systems are quite good at producing fluent outputs and generally score well on standard word-match metrics, but perform quite poorly at content selection and at capturing long-term structure. While the use of copy-based models and additional reconstruction terms in the training can lead to improvements in BLEU and in our proposed extractive evalutions, current models are still quite far from producing human-level output, and are significantly worse than templated systems in terms of content selection and realization. Overall, we believe this problem of data-to-document generation highlights important remaining challenges in neural generation systems, and the use of extractive evaluation reveals significant issues hidden by standard automatic metrics."

2. Data-to-Text Datasets

"We consider the problem of generating descriptive text from database records. Following the notation in Liang et al. (2009), let $\mathbf{s} = \{r_j\}_{j=1}^J$ be a set of records, where for each $r \in \mathbf{s}$ we define $r.t \in \mathcal{T}$ to be the type of $r$, and we assume each $r$ to be a binarized relation, where $r.e$ and $r.m$ are a record's entity and value ('e' => 'entity', 'm' => 'mention'), respectively. For example, a database recording statistics for a basketball game might have a record $r$ such that $r.t = \mathtt{Points}$, $r.e = \mathtt{Russell\,Westbrook}$, and $r.m = 50$. In this case, $r.e$ gives the player in question, and $r.m$ gives the number of points the player scored. From these records, we are interested in generating descriptive text, $\hat{y}_{1:T}=\hat{y}_1,\dots,\hat{y}_T$ of $T$ words such that $\hat{y}_{1:T}$ is an adequate and fluent summary of $\mathbf{s}$. A dataset for training data-to-document systems typically consists of $(\mathbf{s}, y_{1:T})$ pairs, where $y_{1:T}$ is a document consisting of a gold (ie, human generated) summary for database $\mathbf{s}$.

"Several benchmark datasets have been used in recent years for the text generation task, the most popular of these being $\mathtt{WeatherGov}$ (Liang et al, 2009) and $\mathtt{Robocup}$ (Chen and Mooney, 2008). Recently, neural generation systems have shown strong results on these datasets, with the system of Mei et al. (2016) achieving BLEU scores in the 60s and 70s on $\mathtt{WeatherGov}$, and BLEU scores of almost 30 even on the smaller $\mathtt{Robocup}$ dataset. These results are quite promising, and suggest that neural models are a good fit for text generation. However, the statistics of these datasets, shown in Table 1, indicate that these datasets use relatively simple language and record structure. Furthermore, there is reason to believe that $\mathtt{WeatherGov}$ is at least partially machine-generated (Reiter, 2017). More recently, Lebret et al. (2016) introduced the $\mathtt{WikiBio}$ dataset, which is at least an order of magnitude larger in terms of number of tokens and record types. However, as shown in Table 1, this dataset too only contains short (single-sentence) generations, and relatively few records per generation. As such, we believe that early success on these datasets is not yet sufficient for testing the desired linguistic capabilities of text generation at a document-scale.

"With this challenge in mind, we introduce a new dataset for data-to-document text generation, available at https://github.com/harvardnlp/boxscore-data. The dataset is intended to be comparable to $\mathtt{WeatherGov}$ in terms of token count, but to have significantly longer target texts, a larger vocabulary space, and to require more difficult content selection.

"The dataset consists of two sources of articles summarizing NBA basketball games, paired with their corresponding box- and line-score tables. The data statistics of these two sources, $\mathtt{RotoWire}$ and $\mathtt{SBNation}$, are also shown in Table 1. Tjhe first dataset, $\mathtt{RotoWire}$, uses professionally written, medium length game summaries targeted at fantasy basketball fans. The writing is colloquial, but relatively well structured, and targets an audience primarily interested in game statistics. The second dataset, $\mathtt{SBNation}$, uses fan-written summaries targeted at other fans. This dataset is significantly larger, but also much more challenging, as the language is very informl, and often tangential to the statistics themselves. We show some sample text from $\mathtt{RotoWire}$ in Figure 1. Our primary focus will be on the $\mathtt{RotoWire}$ data."

3. Evaluating Document Generation

"We begin by discussing the evaluation of generated documents, since both the task we introduce and the evaluation methods we propose are motivated by some of the shortcomings of current approaches to evaluation. Text generation systems are typically evaluated using a combination of automatic measures, such as BLEU (Papineni et al., 2002), and human evaluation. While BLUE is perhaps a reasonably effective way of evaluating short-form text generation, we found it to be un-satisfactory for document generation. In particular, we note that it primarily rewards fluent text generation, rather than generations that capture the most important information in the database, or that report the information in a particularly coherent way. While human evaluation, on the other hand, is likely ultimately necessary for evaluating generations (Liu et al., 2016; Wu et al., 2016), it is much less convenient than using automatic metrics. Furthermore, we believe that current text generations are sufficiently bad in sufficiently obvious ways that automatic metrics can still be of use in evaluation, and we are not yet at the point of needing to rely solely on human evaluators."

3.1 Extractive Evaluation

"To address this evaluation challenge, we begin with the intuition that assessing document quality is easier than document generation. In particular, it is much easier to automatically extract information from documents than to generate documents that accurately convey desired information. As such, simple, high-precision information extraction models can serve as the basis for assessing and better understanding the quality of automatic generations. We emphasize that such an evaluation scheme is more appropriate when evaluating generations (such as basketball game summaries) that are primarily intended to summarize information. While many generation problems do not fall into this category, we believe this to be an interesting category, and one worth focusing on because it is amenable to this sort of evaluation.

"To see how a simple information extraction system might work, consider the document in Figure 1. We may first extract candidate entity (player, team and city) and value (number and certain string) pairs $r.e$, $r.m$ that appear in the text, and then predict the type $r.t$ (or none) of each candidate pair. For example, we might extract the entity-value pair ("Miami Heat", "95") from the first sentence in Figure 1, and then predict that the type of this pair is $\mathtt{Points}$, giving us an extracted record $r$ such that $(r.e, r.m, r.t) = (\mathtt{Miami\,Heat}, 95, \mathtt{Points})$. Indeed, many relation extract systems reduce relation extraction to multi-class classification precisely in this way (Zhang, 2004; Zhou et al., 2008; Zeng et al., 2014; dos Santos et al., 2015).

"More concretely, given a document $\hat{y}_{1:T}$, we consider all pairs of word-spans in each sentence that represent possible entities $e$ and values $m$. We then model $p(r.t \mid e, m; \mathbf{\theta})$ for each pair, using $r.t = \epsilon$ to indicate unrelated pairs. We use architectures similar to those discussed in Collobert et al. (2011) and dos Santos et al. (2015) to parameterize this probability; full details are given in the Appendix.

"Importantly, we note that the $(\mathbf{s}, y_{1:T})$ pairs typically used for training data-to-document systems are also sufficient for training the information extraction model presented above, since we can obtain (partial) supervision by simply checking whether a candidate record lexically matches a record in $\mathbf{s}$. (Alternative approaches explicitly align the document with the table for this task (Liang et al., 2009)) However, since there may be multiple records $r \in \mathbf{s}$ with the same $e$ and $m$ but with different types $r.t$ we will not always be able to determien the type of a given entity-value pair found in the text. We therefore train our classifier to minimize a latent-variable loss: for all document spans $e$ and $m$, with observed types $t(e,m) = \{r.t: r \in \mathbf{s}, r.e=e, r.m=m\}$ (possibly $\{\epsilon\}$, we minimize

$$ \mathcal{L}(\mathbf{\theta}) = =\sum_{e,m} \log \sum_{t' \in t(e,m)} p(r.t = t' \mid e, m; \mathbf{\theta}) $$

"We find that this simple system trained in this way is quite accurate at predicting relations. On the $\mathtt{Rotowire}$ data it achieves over 90% accuracy on held-out data, and recalls approximately 60% of the relations licensed by the records."

3.2 Comparing Generations

"With a sufficiently precise relation extraction system, we can begin to evaluate how well an automatic generation $\hat{y}_{1:T}$ has captured the information in a set of records $\mathbf{s}$. In particular, since the predictions of a precise information extraction system serve to align entity-mention pairs in the text with database records, this alignment can be used both to evaluate a generation's content selection ("what the generation says"), as well as content placement ("how the generation says it").

"We consider in particular three induced metrics:

Content Selection (CS): precision and recall of unique relations $r$ extracted from $\hat{y}_{1:T}$ that are also extracted from $y_{1:T}$. This measures how well the generated document matches the gold document in terms of selecting which records to generate. ie, compare the generated textual data with the gold standard textual data.
Relation Generation (RG): precision and number of unique relations $r$ extracted from $\hat{y}_{1:T}$ that also appear in $\mathbf{s}$. This measures how well the system is able to generate text containing factual (ie, correct) records. ie, compare the generated data with the database records.
Content Ordering (CO): normalized Damerau-Levenshtein Distance (Brill and Moore, 2000) between the sequences of records extracted from $y_{1:T}$ and that extracted from $\hat{y}_{1:T}$. This measures how well the system orders the records it chooses to discuss.

"We note that CS primarily targets the "what to say" aspect of evaluation, CO targets the "how to say it" aspect, and RG targets both.

"We conclude the section by contrasting the automatic evaluation we have proposed with recently proposed adversarial evaluation approaches, which also advocate automatic metrics backed by classification (Bowman et al, 2016; Kannan and Vinyals, 2016; Li et al., 2017). Unlike adversarial evaluation, which uses a black-box classifier to determine the quality of a generation, our metrics are defined with respect to the predictions of an information extraction system. Accordingly, our metrics are quite interpretable, since by construction it is always possible to determine which fact (ie, entity-mention pair) in the generation is determined by the extractor to not match the database or the gold generation."

4. Neural Data-to-Document Models

"In this section we briefly describe the neural generation methods we apply to the proposed task. As a base model we utilize the now standard attention-based encoder-decoder model (Sutskever et al., 2014; Cho et al., 2014; Bahdenau et al., 2015). We also experiment with several recent extensions to this model, including copy-based generation, and training with a source reconstruction term in the loss (in addition to the standard per-target-word loss).

"Base Model For our base model, we map each record $r \in \mathbf{s}$ into a vector $\mathbf{\tilde{r}}$ by first embedding $r.t$ (eg $\mathtt{Points}$), $r.e$ (eg $\mathtt{Russell\,Westbrook}$), and $r.m$ (eg 50), and then applying a 1-layer MLP (similar to Yang et al. (2016)). (and also an additional feature for whether the player is on the home- or away- team). Our source data-records are then represented as $\mathbf{\tilde{s}} = \{\mathbf{\tilde{r}}_j\}_{j=1}^J$. Given $\mathbf{\tilde{s}}$, we use an LSTM decoder with attention and input-feeding, in the style of Luong et al. (2015), to compute the probability of each target word, conditioned on the previous words and on $\mathbf{s}$. The model is trained end-to-end to minimize the negative log-likelihood of the words in the gold text $y_{1:T}$ given corresponding source matrial $\mathbf{s}$.

"Copying There has been a surge of recent work involving augmenting encoder-decoder models to copy words directly from the source material on which they condition (Gu et al., 2016; Gulcehre et al., 2016; Merity et al., 2016; Jia and Liang, 2016; Yang et al., 2016). These models typically introduce an additional binary variable $z_t$ into the per-timestep target word distribution, which indicates whether the target word $\hat{y}_t$ is copied from the source or generated:

$$ p(\hat{y}_t \mid \hat{y}_{1:t-1}, \mathbf{s}) = \sum_{z \in \{0, 1\}} p(\hat{y}_t, z_t = z \mid \hat{y}_{1:t-1}, \mathbf{s}) $$

"In our case, we assume that target words are copied from the value porition of a record $r$; that is, a copy implies $\hat{y}_t = r.m$ for some $r$ and $t$."

There are two forms of the copying model in common usage:

Joint Copy Model, and
Conditional Copy Model

"Joint Copy Model The models of Gu et al. (2016) and Yang et al. (2016) parameterize the joint distribution table over $\hat{y}_t$ and $z_t$ directly:

$$ \begin{align} p(\hat{y}_t, z_t & \mid \hat{y}_{1:t-1}, \mathbf{s}) \propto \\ & \begin{cases} \text{copy}(\hat{y}_t, \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 1, \hat{y}_t \in \mathbf{s} \\ 0 & z_t = 1, \hat{y}_t \notin \mathbf{s} \\ \text{gen}(\hat{y}_t, \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 0, \end{cases} \end{align} $$

"where copy and gen are functions parameterized in terms of the decoder RNN's hidden state that assign scores to words, and where the notation $\hat{y}_t \in \mathbf{s}$ indicates that $\hat{y}_t$ is equal to $r.m$ for some $r \in \mathbf{s}$.

"Conditional Copy Model Gulcehre et al. (2016), on the other hand, decompose the joint probability as:

$$ \begin{align} p(\hat{y}_t, z_t & \mid \hat{y}_{1:t-1}, \mathbf{s}) = \\ & \begin{cases} p_\text{copy}(\hat{y}_t \mid z_t, \hat{y}_{1:t-1}, \mathbf{s})\, p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 1 \\ p_\text{gen}(\hat{y}_t \mid z_t, \hat{y}_{1:t-1}, \mathbf{s}) \, p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 0 \\ \end{cases} \end{align} $$

where an MLP is used to model $p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s})$.

"Models with copy-decoders may be trained to minimize the negative log marginal probability, marginalizing out the latent-variable $z_t$ (Gu et al., 2016; Yang et al., 2016; Merity et al., 2016). However, if it is known which target words $y_t$ are copied , it is possible to train with a loss that does not marginalize out the latent $z_t$. Gulcehre et al. (2016), for instance, assume that any target word $y_t$ that also appears in the source is copied, and train to minimize the negative joint log-likelihood of the $y_t$ and $z_t$.

"In applying such a loss in our case, we again note that there may be multiple records $r$ such taht $r.m$ appears in $\hat{y}_{1:T}$. Accordingly, we slightly modify the $p_\text{copy}$ portion of the loss of Gulcehre et al. (2016) to sum over all matched records. In particular, we model the probability of relations $r \in \mathbf{s}$ such that $r.m = y_t$ and $r.e$ is in the same sentence as $r.m$. Letting $r(y_t) = \{r \in \mathbf{s} \colon r.m = y_t, \text{same-sentence}(r.e, r.m)\}$, we have:

$$ p\text{copy}(y_t \mid z_t, y_{1:t-1}, \mathbf{s})= \sum_{r \in r(y_t)} p(r \mid z_t, y_{1:t-1}, \mathbf{s}) $$

"We note here that the key distinction for our purpses between the Joint Copy model and the Conditional Copy model is that the latter conditions on whether there is a copy or not, and so in $p_\text{copy}$ the source records compete only with each other. In the Joint Copy model, however, the source records also compete with words that cannot be copied. As a result, training the Conditional Copy model with the supervised loss of Gulcehre et al. (2016) can be seen as training with a word-level reconstruction loss, where the decoder is trained to choose the record in $\mathbf{s}$ that gives rise to $y_t$.

"Reconstruction Losses Reconstruction-based techniques can also be applied at the document- or sentence-level during training. One simple approach to this problem is to utilize the hidden states of the decoder to try to reconstruct the database. A fully differentiable approach using the decoder hidden states has recently been successfully applied to neural machine translation by Tu et al. (2017). Unlike copying, this method is applied only at training, and attempts to learn decoder hidden states with broader coverage of the input data.

"In adopting this reconstruction approach we segment the decoder hidden states $\mathbf{h}_t$ into $\lceil\frac{T}{B}\rceil$ contiguous blocks of size at most $B$."

The text in the paper at this point is a bit ambiguous in my opinion. It states that $p(r.e, r.m \mid\mathbf{b}_i) = \text{softmax}(f(\mathbf{b}_i))$. However, in my mind, when I look at this, $r$ is some specific record, whereas $\text{softmax}$ is a distribution. After reading carefully, it looks to me like the $\text{softmax}$ is a distribution over all $r \in \mathbf{s}$. So, I think it would be more accurate to write something more like:

Let $z_

Observations from Hugh

doesnt present any evidence for the assertion "it [BLEU] primarily rewards fluent text generation, rather than generations that capture the most important information in the database"