"Reinforcement Learning for Mapping Instructions to Actions", Branavan, Chen, Zettlemoyer, Barzilay, 2009

http://people.csail.mit.edu/branavan/papers/acl2009.pdf

Abstract

"In this paper, we present a reinforcement learning approach for mapping natural language instructions to sequences of executable actions. We assume access to a reward function that defines the quality of the executed actions. During training, the learner repeatedly constructs action sequences for a set of documents, executes those actions, and observes the resulting reward. We use a policy gradient algorithm to estimate the parameters of a log-linear model to interpret instructions in two domains - Windows troubleshooting guides and game tutorials. Our results demonstrate that this technique can rival supervised learning techniques while requiring few or no annotated training examples."

Introduction

"The problem of interpreting instructions written in natural language has been widely studied since the early days of artificial intelligence (Winograd, 1972; Di Eugenio, 1992). Mapping instructions to a sequence of executable actions would enable the automation of tasks that currently require human participation. Examples include configuring software based on how-to guides and operating simulators using instruction manuals. In this paper, we present a reinforcement learning framework for inducing mappings from text to actions without the need for annotated training examples.

"For concreteness, consider instructions from a Windows troubleshooting guide on deleting temporary folders, shown in Figure 1. We aim to map this text to the corresponding low-level commands and parameters. For example, properly interpreting the third instruction requires clicking on a tab, finding the appropriate option in a tree control, and clearing its associated checkbox.

"In this and many other applications, the validity of a mapping can be verified by executing the induced actions in the corresponding environment and observing their effects. For instance, in the example above we can assess whether the goal described in the instructions is achieved, ie the folder is deleted. The key idea of our approach is to leverage the validation process as the main source of supervision to guide learning. This form of supervision allows us to learn interpretations of natural language instructions when standard supervised techniques are not applicable, due to the lack of human-created annotations.

"Reinforcement learning is a natural framework for building models using validation from an environment (Sutton and Barto, 1998). We assume that supervision is provided in the form of a reward function that defines the quality of executed actions. During training, the learner repeatedly constructs action sequences for a set of given documents, executes those actions, and observes the resulting reward. The learner's goal is to estimate a policy - a distribution over actions given instruction text and environment state - that maximizes future expected reward. Our policy is modeled in a log-linear fashion, allowing us to incorporate features of both the instruction text and the environment. We employ a policy gradient algorithm to estimate the parameters of this model.

"We evaluate our method on two distinct applications: Windows troubleshooting guides and puzzle game tutorials. The key findings of our experiments are twofold. First, models trained only with simple reward signals achieve surprisingly high results, coming within 11% of a fully supervised method in the Windows domain. Second, augmenting unlabeled documents with even a small fraction of annotated examples greatly reduces this performance gap, to within 4% in that domain. These results indicate the power of learning from this new form of automated supervision."

"Grounded Language Acquisition Our work fits into a broader class of approaches that aim to learn language from a situated context (Mooney, 2008a; Mooney, 2008b; Fleischman and Roy, 2005; Yu and Ballard, 2004; Siskind, 2001; Oates, 2001). Instances of such approaches include work on inferring the meaning of words from video data (Roy and Pentland, 2002; Barnard and Forsyth, 2001), and interpreting the commentary of a simulated soccer game (Chen and Mooney, 2008). Most of these approaches assume some form of parallel data, and learn perceptual co-occurrence patterns. In contrast, our emphasis is on learning language by proactively interactin with an external environment.

"Reinforcement Learning for Language Processing Reinforcement learning has been previously applied to the problem of dialogue management (Scheffler and Young, 2002; Roy et al., 2000; Litman et al., 2000; Singh et al., 1999). These systems converse with a human user by taking actions that emit natural language utterances. The reinforcement learning state space encodes information about the goals of the user and what they say at each time step. The learning problem is to find an optimal policy that maps states to actions, through a trial-and-error process of repeated interaction with the user.

"Reinforcement learning is applied very differently in dialogue systems compared to our setup. In some respects, our task is more easily amenable to reinforcement learning. For instance, we are not interacting with a human user, so the cost of interaction is lower. However, while the state space can be designed to be relatively small in the dialogue management task, our state space is determined by the underlying environment and is typically quite large. We address this complexity by developing a policy gradient algorithm that learns efficiently while exploring a small subset of the states.

3. Problem Formulation

"Our task is to learn a mapping between documents and the sequence of actions they express. Figure 2 shows how one example sentence is mapped to three actions.

"Mapping Text To Actions As input, we are given a document $d$, comprising as sequence of sentences $(u_1, \dots, u_l)$, where each $u_i$ is a sequence of words. Our goal is to map $d$ to a sequence of actions $\overrightarrow{a} = (a_0,\dots,a_{n-1})$. Actions are predicted and executed sequentially, meaning that action $a_i$ is executed before $a_{i+1}$ is predicted.

"An action $a = (c, R, W')$ encompasses:

  • a command $c$, eg click, or type_into
  • the command's parameters $R$, eg the concrete text to type, such as secpol.msc, and
  • the words $W'$ associated with $c$ and $R$, eg 'please type secpol.msc in the open box'

"Elements of $R$ refer to objects available in the environment state, as described below. Some parameters can also refer to words in document $d$. Additionally, to account for words that do not describe any actions, $c$ can be a null command.

"The Environment The environment state $\mathcal{E}$ specifies the set of objects available for interaction, and their properties. In Figure 2, $\mathcal{E}$ is shown on the right. The environment state $\mathcal{E}$ changes in response to the execution of command $c$ with parameters $R$ according to a transition distribution $p(\mathcal{E}' \mid \mathcal{E}, c, R)$. This distribution is a priori unknown to the learner. As we will see in Section 5, our approach avoids having to directly estimate this distribution.

"State To predict actions sequentially, we need to track the state of the document-to-actions mapping over time. A mapping state $s$ is a tuple $(\mathcal{E}, d, j, W)$, where $\mathcal{E}$ refers to the current environment state; $j$ is the index of the sentence currently being interpreted in document $d$; and $W$ contains words that were mapped by previous actions for the same sentence. The mapping state $s$ is observed after each action.

"The initial mapping state $s_0$ for document $d$ is $(\mathcal{E}_d, d, 0, \emptyset)$; $\mathcal{E}_d$ is the unique starting environment for $d$. Performing action $a$ in state $s = (\mathcal{E}, d, j, W)$ leads to a new state $s'$ according to distribution $p(s' \mid s, a)$, defined as follows: $\mathcal{E}$ transitions according to $p(\mathcal{E}' \mid \mathcal{E}, c, R)$, $W$ is updated with $a$'s selected words, and $j$ is incremented if all words of the sentence have been mapped. For the applications we consider in this work, environment state transitions, and consequently mapping state transitions, are deterministic.

"Training During training, we are provided with a set $D$ of documents, the ability to sample from the transition distribution, and a reward function $r(h)$. Here, $h = (s_0, a_0, \dots, s_{n-1}, a_{n-1}, s_n)$ is a history of states and actions visited while interpreting one document. $r(h)$ outputs a real-valued score that correlates with correct action selection. Note that in most reinforcement learning problems, the reward function is defined over state-action pairs, as $r(s,a)$ - in this case $r(h) = \sum_tr(s_t, a_t)$, and our formulation becomes a standard finite-horizon Markov decision process. Policy gradient approaches allow us to learn using the more general case of history-based reward.

"The goal of training is to estimate parameters $\theta$ of the action selection distribution $p(a \mid s, \theta)$, called the policy. Since the reward correlates with action sequence correctness, the $\theta$ that maximizes expected reward will yield the best actions."

4. A Log-Linear Model for Actions

"Our goal is a to predict a sequence of actions. We construct this sequence by repeatedly choosing an action given the current mapping state, and applying that action to advance to a new state.

"Given a state $s = (\mathcal{E}, d, j, W)$, the space of possible next actions is defined by enumerating subspans of unused words in the current sentence (ie subspans of the $j$th sentence of $d$ not in $W$), and the possible commands and parameters in environment state $\mathcal{E}$. We model the policy distribution $p(a \mid s; \theta)$ over this action space in a log-linear fashion (Della Pietra et al, 1997; Lafferty et al, 2001), giving us the flexibility to incorporate a diverse range of features. Under this representation, the policy distribution is:

$$ p(a \mid s; \theta) = \frac{\exp(\theta \cdot \phi(s,a))} {\sum_{a'} \exp(\theta \cdot \phi(s, a'))} \tag{1} $$

"where $\phi(s,a) \in \mathbb{R}^n$ is an $n$-dimensional feature representation. During test, actions are selected according to the mode of this distribution."

So, $\phi(s,a)$ is some fixed set of features, based on the environment, the document, the sentence index into the document, and the words used or not yet used in the current sentence. Meanwhile, $\theta$ parameterizes the policy distribution over these features. Note that $\phi(s,a)$ represents not just the state but also a candidate action that could be taken from it.
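A minimal numpy sketch of the policy in equation (1); `candidate_actions` and the feature function `phi` are assumed to be supplied by the application:

```python
# Log-linear policy p(a | s; theta): softmax over theta . phi(s, a).
import numpy as np

def policy(theta, s, candidate_actions, phi):
    """Return p(a | s; theta) for each candidate action as a numpy array."""
    scores = np.array([theta @ phi(s, a) for a in candidate_actions])
    scores -= scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def best_action(theta, s, candidate_actions, phi):
    """At test time, take the mode of the policy distribution."""
    p = policy(theta, s, candidate_actions, phi)
    return candidate_actions[int(np.argmax(p))]
```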

5. Reinforcement Learning

"During training, our goal is to find the optimal policy $p(a \mid s; \theta)$. Since reward correlates with correct action selection, a natural objective is to maximize expected future reward - that is, the reward we expect while acting according to that policy from state $s$. Formally, we maximize the value function:

$$ V_\theta(s) = \mathbb{E}_{p(h \mid \theta)} [r(h)] \tag{2} $$

"where the history $h$ is the sequence of states and actions encountered while interpreting a single document $d \in D$. This expectation is averaged over all documents in $D$. The distribution $p(h\mid \theta)$ returns the probability of seeing history $h$ when starting from state $s$ and acting according to a policy with parameters $\theta$. This distribution can be decomposed into a product over time steps:

$$ p(h \mid \theta) = \prod_{t=0}^{n-1} p(a_t \mid s_t; \theta)\, p(s_{t+1} \mid s_t, a_t) \tag{3} $$
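A quick Monte Carlo reading of equations (2) and (3): sample histories by rolling out the current policy on each document and average the observed rewards. `sample_history` and `reward` are hypothetical helpers standing in for $p(h \mid \theta)$ and $r(h)$.

```python
# Monte Carlo estimate of the value function V_theta, averaged over documents.
import numpy as np

def estimate_value(theta, documents, sample_history, reward, n_samples=10):
    returns = []
    for d in documents:
        for _ in range(n_samples):
            h = sample_history(theta, d)   # roll out p(a | s; theta) from s_0
            returns.append(reward(h))      # observe r(h)
    return float(np.mean(returns))
```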

5.1 A Policy Gradient Algorithm

"Our reinforcement learning problem is to find the parameters $\theta$ that maximize $V_\theta$ from equation 2. Although there is no closed form solution, policy gradient algorithms (Sutton et al, 2000) estimate the parameters $\theta$ by performing stochastic gradient ascent. The gradient of $V_\theta$ is approximated by interacting with the environment, and the resulting reward is used to update the estimate of $\theta$. Policy gradient algorithms optimi ze a non-convex objective and are only guaranteed to find a local optimum. However, as we will see, they scale to large state spaces and can perform well in practice.

"To find the parameters $\theta$ that maximize the objective, we first compute the derivative of $V_\theta$. We can expanding according to the product rule, https://en.wikipedia.org/wiki/Product_rule#Generalizations, and https://github.com/hughperkins/pub-prototyping/tree/master/maths

$$ V_\theta(s) = \mathbb{E}_{p(h\mid \theta)} [r(h)] \\ = \sum_h r(h)\,p(h \mid \theta) \\ = \sum_h \left( r(h) \prod_{t=0}^{n-1} p(a_t \mid s_t; \theta) \, p(s_{t+1} \mid s_t, a_t) \right) $$
$$ =\sum_h\left( r(h) \left( \prod_{t=0}^{n-1} p(s_{t+1} \mid s_t, a_t) \right) \left( \prod_{t=0}^{n-1} p(a_t \mid s_t; \theta) \right) \right) $$

So, using the product rule: $$ \frac{\partial} {\partial \theta} V_\theta(s) = \\ \sum_h \left( r(h) \left( \prod_{t=0}^{n-1} p(s_{t+1} \mid s_t, a_t) \right) \left( \prod_{t=0}^{n-1} p(a_t \mid s_t; \theta) \right) \left( \sum_{t=0}^{n-1} \frac{1}{p(a_t \mid s_t; \theta)} \frac{\partial}{\partial \theta} p(a_t \mid s_t; \theta) \right) \right) $$

And rewriting using $f'(x) / f(x) = \frac{d}{dx} \log f(x)$:

$$ = \sum_h \left( r(h) \left( \prod_{t=0}^{n-1} p(s_{t+1} \mid s_t, a_t) \right) \left( \prod_{t=0}^{n-1} p(a_t \mid s_t; \theta) \right) \left( \sum_{t=0}^{n-1} \frac{\partial}{\partial \theta} \log p(a_t \mid s_t; \theta) \right) \right) $$
$$ = \sum_h \left( r(h) p(h\mid \theta) \left( \sum_{t=0}^{n-1} \frac{\partial}{\partial \theta} \log p(a_t \mid s_t; \theta) \right) \right) $$
$$ = \mathbb{E}_{p(h \mid \theta)} \left[ r(h) \sum_{t=0}^{n-1} \frac{\partial}{\partial \theta} \log p(a_t \mid s_t; \theta) \right] \tag{4} $$

So, the expectation is a sum over all histories, weighted by their probability under the policy; for each history, we calculate:

  • the sum over timesteps of the partial derivative with respect to $\theta$ of the log probability of each action taken
  • multiplied by the reward for the entire history

Looking at the inner partial derivative we have:

$$ \frac{\partial}{\partial \theta} \log p(a \mid s; \theta) = \frac{\partial}{\partial \theta} \log \frac{\exp(\theta \cdot \phi(s, a))} {\sum_{a'} \exp(\theta \cdot \phi(s, a'))} $$
$$ \frac{\partial}{\partial \theta} \log p(a \mid s; \theta) = \frac{\partial}{\partial \theta} \left( \theta \cdot \phi(s,a) - \log \left( \sum_{a'} \exp(\theta \cdot \phi(s, a')) \right) \right) $$
$$ \frac{\partial}{\partial \theta} \log p(a \mid s; \theta) = \phi(s, a) - \frac{\partial}{\partial \theta} \left( \log \left( \sum_{a'} \exp(\theta \cdot \phi(s, a')) \right) \right) $$
$$ \frac{\partial}{\partial \theta} \log p(a \mid s; \theta) = \phi(s, a) - \frac{1} {\sum_{a'} \exp(\theta \cdot \phi(s, a'))} \frac{\partial}{\partial \theta} \left( \sum_{a'} \exp(\theta \cdot \phi(s, a')) \right) $$
$$ \frac{\partial}{\partial \theta} \log p(a \mid s; \theta) = \phi(s, a) - \frac{1} {\sum_{a'} \exp(\theta \cdot \phi(s, a'))} \sum_{a'} \frac{\partial}{\partial \theta} \exp(\theta \cdot \phi(s, a')) $$
$$ \frac{\partial}{\partial \theta} \log p(a \mid s; \theta) = \phi(s, a) - \frac{1} {\sum_{a'} \exp(\theta \cdot \phi(s, a'))} \sum_{a'} \phi(s, a') \exp(\theta \cdot \phi(s, a')) $$
$$ \frac{\partial}{\partial \theta} \log p(a \mid s; \theta) = \phi(s, a) - \sum_{a'} \phi(s, a') \frac{\exp(\theta \cdot \phi(s, a'))} {\sum_{a''} \exp(\theta \cdot \phi(s, a''))} $$
$$ \frac{\partial}{\partial \theta} \log p(a \mid s; \theta) = \phi(s, a) - \sum_{a'} \phi(s, a')\, p(a' \mid s; \theta) \tag{5} $$

"which is the derivative of a log-linear distribution.

"Equation 5 is easy to compute directly. However, the complete derivative of $V_\theta$ in equation 4 is intractable, because computing the expectation would require summing over all possible histories. Instead, policy gradient algorithms employ stochastic gradient ascent by computing a noisy estimate of the expectation using just a subset of the histories. Specifically, we draw samples from $p(h\mid \theta)$ by acting in the target environment, and use these samples to approximate the expectation in equation 4. In practice, it is often sufficient to sample a single history $h$ for this approximation.

"Algorithm 1 details the complete policy gradient algorithm. It performs $T$ iterations over the set of documents $D$. Step 3 samples a history that maps each document to actions. This is done by repeatedly selecting actions according to the current policy, and updating the state by executing the selected actions. Steps 4 and 5 compute the empirical gradient and update the parameters $\theta$.

Algorithm 1 A policy gradient algorithm

Input:

  • document set $D$
  • feature representation $\phi(s, a)$
  • reward function $r(h)$
  • number of iterations $T$

Initialization: Set $\theta$ to small random values

$$ \begin{align} \def\for{\mathbf{for\,}} \def\foreach{\mathbf{foreach\,}} \def\do{\mathbf{\,do}} \def\endfor{\mathbf{end}} \def\tab{\hspace{24px}} &\for i = 1\dots T \do \\ &\tab\foreach d \in D \do \\ &\tab \tab \text{Sample history }h \sim p(h \mid \theta)\text{ where}\\ &\tab \tab \tab h = (s_0, a_0, \dots, a_{n-1},s_n)\text{ as follows:}\\ &\tab \tab \for t = 0 \dots n-1 \do \\ &\tab \tab \tab \text{Sample action }a_t \sim p(a \mid s_t; \theta)\\ &\tab \tab \tab \text{Execute }a_t \text{ on state }s_t: s_{t+1} \sim p(s \mid s_t, a_t)\\ &\tab \tab \endfor\\ &\tab \tab \Delta \leftarrow \sum_t(\phi(s_t, a_t) - \sum_{a'}\phi(s_t, a')\,p(a' \mid s_t; \theta)) \\ &\tab \tab \theta \leftarrow \theta + r(h)\Delta \\ &\tab \endfor \\ &\endfor \\ \end{align} $$

Output: Estimate of parameters $\theta$
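Putting Algorithm 1 together as a Python sketch. The helper interfaces (`initial_state`, `candidate_actions`, `execute`, the feature map `phi`, and `reward`) are assumptions matching the formulation above; the explicit learning rate is my addition, since the pseudocode's update $\theta \leftarrow \theta + r(h)\Delta$ uses an implicit step size of 1.

```python
# Sketch of Algorithm 1: sample a history per document under the current
# policy, compute the empirical gradient, and take a gradient step.
# All helper interfaces here are assumptions, not the paper's code.
import numpy as np

def policy_probs(theta, s, actions, phi):
    scores = np.array([theta @ phi(s, a) for a in actions])
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

def train(documents, phi, reward, initial_state, candidate_actions, execute,
          n_features, T=100, learning_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.01 * rng.standard_normal(n_features)     # small random init
    for _ in range(T):
        for d in documents:
            # Sample a history h ~ p(h | theta) by rolling out the policy.
            s, history = initial_state(d), []
            while not s.done:                          # assumed end-of-document flag
                actions = candidate_actions(s)
                p = policy_probs(theta, s, actions, phi)
                a = actions[rng.choice(len(actions), p=p)]
                history.append((s, a))
                s = execute(s, a)                      # s_{t+1} from p(s | s_t, a_t)
            # Empirical gradient: sum_t (phi(s_t, a_t) - E_{a'}[phi(s_t, a')]).
            delta = np.zeros(n_features)
            for s_t, a_t in history:
                actions = candidate_actions(s_t)
                p = policy_probs(theta, s_t, actions, phi)
                expected = p @ np.array([phi(s_t, a_i) for a_i in actions])
                delta += phi(s_t, a_t) - expected
            theta += learning_rate * reward(history) * delta
    return theta
```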

"In many domains, interacting with the environment is expensive. Therefore, we use two techniques that allow us to take maximum advantage of each environment interaction. First, a history $h = (s_0, a_0, \dots, s_n)$ contains subsequences $(s_i, a_i, \dots, s_n)$ for $i=1$ to $n-1$, each with its own reward value given by the environment as a side effect of executing $h$. We apply the update from equation 5 for each subsequence. Second, for a sampled history $h$, we can propose alternative histories $h'$ that result in the same commands and parameters with different word spans. We can again apply equation 5 for each $h'$, weighted by its probability under the current policy, $\frac{p(h' \mid \theta)}{p(h \mid \theta)}$.

"The algorithm we have presented belongs to a family of policy gradient algorithms that have been successfully used for complex tasks such as robot control (Ng et al, 2003). Our formulation is unique in how it represents natural language in the reinforcement learning framework."

Table 1: Example features in the Windows domain. All features are binary, except for the normalized edit distance, which is real-valued.

Notation:

  • o: Parameter referring to an environment object
  • L: Set of class names (eg 'button')
  • V: Vocabulary
  • C: Set of commands

Features on $W$ and object $o$:

  • Test if $o$ is visible in $s$
  • Test if $o$ has input focus
  • Test if $o$ is in the foreground
  • Test if $o$ was previously interacted with
  • Test if $o$ came into existence since last action
  • Min. edit distance between $w \in W$ and object labels in $s$

Features on words in $W$, command $c$, and object $o$:

  • $\forall c' \in C, w \in V: \text{ test if } c' = c \text{ and } w \in W$
  • $\forall c' \in C, l \in L: \text{ test if } c' = c \text{ and } l \text{ is the class of }o$
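A sketch of some of the Table 1 features written as a feature function; the environment object attributes (`visible`, `has_focus`, etc.), the edit-distance stand-in, and the conjunction-feature naming scheme are all my own illustrative assumptions.

```python
# Illustrative subset of the Table 1 features; all attribute names are assumed.
from difflib import SequenceMatcher

def edit_distance(a: str, b: str) -> float:
    # Stand-in for a normalized edit distance (here: 1 - similarity ratio).
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def windows_features(s, a, vocabulary, class_names):
    """Return a dict of named feature values for mapping state s and action a."""
    obj = a.parameters[0]                 # o: the environment object a refers to
    feats = {
        "obj_visible": float(obj.visible),
        "obj_has_focus": float(obj.has_focus),
        "obj_in_foreground": float(obj.in_foreground),
        "obj_previously_used": float(obj in s.interaction_history),
        "obj_newly_created": float(obj.created_since_last_action),
        "min_edit_distance": min(
            (edit_distance(w, label) for w in a.words for label in obj.labels),
            default=1.0,
        ),
    }
    # Conjunction features on command c with words and with the object class;
    # features for other commands c' != c are implicitly zero.
    for w in vocabulary:
        feats[f"cmd={a.command},word={w}"] = float(w in a.words)
    for l in class_names:
        feats[f"cmd={a.command},class={l}"] = float(l == obj.class_name)
    return feats
```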

6.2 Crossblock

...

"For this domain, we use two sets of binary features on state-action pairs $(s, a)$. First, for each vocabulary word $w$, we define a feature that is one if $w$ is the last word of $a$'s consumed words $W'$. These feature help identify the proper text segmentation points between actions. Second, we introduce features for pairs of vocabulary word $w$ and attributes of action $a$, eg the line orientation and grid locations of the squares that $a$ would remove. This set of features enables us to match words (eg 'row') with objects in the environment (eg, a move that removes a horizontal series of squares). In total, there are 8094 features.