# 1. Purpose

This notebook is a testing ground for a human-robot chatbox interface. There are three primary elements to our intelligent chatbox:

• Suggestions using Bootstrap-3 Typeahead, so that the user has a better understanding of the language the robot can understand. This is shown at the end of the document.
• Template-matching to allow the robot to understand non-standard sentences. We're building off Tellex, Kollar and others' work on Spatial Description Clauses [1,2] to parse the natural language into our templates. This document develops the mathematical framework.
• Online correction, in which the operator tells the robot that the robot misunderstood. This work is ongoing.

## 1.1 Template-matching Block Diagram

In general, our goal is to find the template human sensor statement(s) that best matches some natural language input. To solve this, we have developed a workflow as shown below:

## 1.2 Problem Formulation

We identify the following probabilistic and deterministic variables to track:

• $D$, a input document.
• $T$, one of $\mathcal{T}$ tokenizations of an input document.
• $S$, a quality score associated with a tokenization. $\mathcal{S} = \{S_1,\dots,S_\mathcal{T}\}$
• $L$, the semantically tagged labels associated with a tokenization.
• $TDC$, one of $\mathcal{N}$ target description clauses associated with a tokenization.
• $H$, a human sensor template (one of $\mathcal{H}_T$ templates), comprised of some combination of $\{p, c, t, g, s, a, m\}$, where:
• $p$ is a positivity,
• $c$ is a certainty,
• $t$ is a target,
• $g$ is a grounding,
• $s$ is a spatial relation,
• $a$ is an action,
• $m$ is a modifier.
• $O$, observed elements in the environment.

Based on conditional independence assumptions, we can form the following Bayesian Network (BN):

We can simplify the joint probability distribution based on the BN: $$P(T,L,TDC,H \vert D, \mathcal{S}, O) = \prod_{i=1}^\mathcal{T}P(T_i \vert S_i, D) P(L_i \vert T_i) \prod_{j=1}^\mathcal{N} P(TDC_j \vert L_i) P(H_j \vert TDC, O)$$

Our goal is to find the set of template human sensor statements $\mathcal{H} = \{H_1,\dots,H_\mathcal{N}\}$ and tokenizations $\mathcal{T}$ that maximize this distribution, thus:

$$\DeclareMathOperator*{\argmax}{arg\,max} \argmax_{\mathcal{H},\mathcal{\tilde{T}}} \prod_{i=1}^\mathcal{\tilde{T}}P(T_i \vert S_i, D) P(L_i \vert T_i) \prod_{j=1}^\mathcal{N} P(TDC_j \vert L_i) P(H_j \vert TDC, O)$$

In practice, this likely means we'll need to limit the number of tokenizations considered, as there are $2^{(W-1)}$ tokenizations available for each input document consisting of $W$ words (see below for a full description).

Additionally, it will be important to remove 'garbage' sentences:

# Human Sensor Statement Matching

The following sections will describe how to compute each of the following terms:

• $P(T_i \vert S_i, D)$ using a tokenization and scoring algorithm;
• $P(L_i \vert T_i)$ using Conditional Random Fields;
• $P(TDC_j \vert L_i)$ using a greedy templating algorithm;
• $P(H_j \vert O)$ using sense2vec.

## 2.1 Multi-Word Tokenization

Tokenization is the process of splitting an input document into a sequence of tokens: individual portions of the input document. We'll then tag each token with a semantic label in the next step.

Given an input document, say, "Roy is behind the desk." we need to be able to figure out how to split the document the sentence into a set of tokens to be labeled. A first-cut approach would be to split the sentence based on just the words: ["Roy", "is", "behind", "the", "desk."]. But, this causes issues with words that should go together for one label (i.e., one label should be assigned to "the desk" instead of labeling "the" and "desk" separately).

We make the assumption that a document should be tokenized based on n-grams and delimiter punctuation. $n$-grams are arbitrarily grouped sequences of words, with $n$ representing the number of words grouped together (i.e. "the desk" is a bigram or $2$-gram). Delimiter punctuation, (',', ';', '.', '!', '?'), marks divisions between n-grams, such that no two words on either side of the punctuation can be combined into an $n$-gram. The following lists the tokenizations under these assumptions of "Roy is behind the desk.":

Token 1 Token 2 Token 3 Token 4 Token 5 Token 6 Schema
Tokenization 1 Roy is behind the desk . [1, 1, 1, 1, 1, 1]
Tokenization 2 Roy is behind the desk . [2, 1, 1, 1, 1]
Tokenization 3 Roy is behind the desk . [2, 2, 1, 1]
Tokenization 4 Roy is behind the desk . [2, 1, 2, 1]
Tokenization 5 Roy is behind the desk . [2, 3, 1]
Tokenization 6 Roy is behind the desk . [3, 1, 1, 1]
Tokenization 7 Roy is behind the desk . [3, 2, 1]
Tokenization 8 Roy is behind the desk . [4, 1, 1]
Tokenization 9 Roy is behind the desk . [5, 1]
Tokenization 10 Roy is behind the desk . [1, 2, 1, 1, 1]
Tokenization 11 Roy is behind the desk . [1, 2, 2, 1]
Tokenization 12 Roy is behind the desk . [1, 3, 1, 1]
Tokenization 13 Roy is behind the desk . [1, 4, 1]
Tokenization 14 Roy is behind the desk . [1, 1, 2, 1, 1]
Tokenization 15 Roy is behind the desk . [1, 1, 3, 1]
Tokenization 16 Roy is behind the desk . [1, 1, 1, 2, 1]

As clearly visible in the above example, a simple 5-word sentence can easily lead to a large number of possible tokenizations. Specifically, the upper bound for the total number of tokenizations is $2^{(W-1)}$, where $W$ is the number of words in the document. This is an upper bound because of the tokenization reduction effects of delimiter punctuation (i.e., "The woman, in blue, near me." has only $2^1 \times 2^1 \times 2^1 = 8$ tokenizations, as opposed to the $2^5 =32$ tokenizations of "The woman in blue near me.").

### 2.1.1 Probability Measure

We require the definition of $P(T_i \vert S_i, D)$ for each $T_i \in \mathcal{T}$. We assume that each input document $D$ has one correct (unknown) tokenization $T^*$, and that $P(T_i \vert S_i, D)$ represents the probability that the tokenization $T_i = T^*$. Clearly, the score metric chosen will influence this probability.

### 2.1.2 Association Measures

Given a corpus, we can define association measures between words that define how likely two words are combined to form one token. These can be applied recursively to determine multi-word tokenizations. We begin by defining a few terms (largely taken from Michelbacher 2013).

A corpus is analyzed for each word pair $(U,V)$, where $U$ is the pair's first component and $V$ is the pair's second component. $u$ and $v$ are specific realizations of the random variables $U$ and $V$. For example, $\{ (u,v) \vert U =u \land V\neq v\}$ could represent case of $(U,V) = (\text{the}, \text{cat})$ and $(u,v) = (\text{the}, \text{dog})$.

Observed frequencies for word pairs are:

\begin{align} O_{11} &= \vert \{ (u,v) \vert U =u \land V=v\} \vert \\ O_{12} &= \vert \{ (u,v) \vert U =u \land V\neq v\} \vert \\ O_{21} &= \vert \{ (u,v) \vert U \neq u \land V=v\} \vert \\ O_{22} &= \vert \{ (u,v) \vert U \neq u \land V\neq v\} \vert \end{align}

Marginal frequencies for individual words within word pairs are:

\begin{align} R_{1} &= \vert \{ (u,v) \vert U =u\} \vert \\ R_{2} &= \vert \{ (u,v) \vert U \neq u\} \vert \\ C_{1} &= \vert \{ (u,v) \vert V=v\} \vert \\ C_{2} &= \vert \{ (u,v) \vert V\neq v\} \vert \end{align}

Expected frequencies, then, are:

\begin{align} E_{11} &= \frac{R_1C_1}{N} \\ E_{12} &= \frac{R_1C_2}{N} \\ E_{21} &= \frac{R_2C_1}{N} \\ E_{22} &= \frac{R_2C_2}{N}\vert \end{align}

#### 2.1.2.1 t-score

Measure of how far (positively or negatively) the observed frequency deviates from expected frequency in multiples of sample variance: $$t =\frac{O_{11} - E_{11}}{\sqrt{O_{11}}}$$

#### 2.1.2.2 z-score

Measure of how far (positively or negatively) the observed frequency deviates from expected frequency in multiples of sample variance (using a different approximation for sample variance than the t-test): $$t =\frac{O_{11} - E_{11}}{\sqrt{E_{11}}}$$

#### 2.1.2.3 chi-square

Measure of independence between $u$ and $v$: $$\chi^2 = \sum_{i,j} \frac{O_{ij} - E_{i,j}^2}{E_{i,j}}$$

Where $\{i,j\} \in \{\{1,1\},\{1,2\},\{2,1\},\{2,2\}\}$.

#### 2.1.2.4 log-likelihood

Requires two hypotheses:

\begin{align} \mathbf{H_0} &: P((u,v)) = P((u,v^\prime)); v^\prime \neq v\\ \mathbf{H_1} &: P((u,v)) =p_1 \neq p_2 = P((u,v^\prime)); v^\prime \neq v\\ \end{align}

Where $\mathbf{H_0}$ is the hypothesis that the probability of $u$ occurring is the same regardless if the next word is $v$ or not, and $\mathbf{H_1}$ is the probability of $u$ occurring is different depending on whether $v$ or some other word $v^\prime$ preceeds it.

The log-likelihood tests

$$\lambda = \frac{\text{max}L(H_0)}{\text{max}L(H_1)} = \frac{L(O_{11},C_1,p)L(O_{12},C_2,p)}{L(O_{11},C_1,p_1)L(O_{12},C_2,p_2)}$$

Where $L(k,n,p) = p^k(1-p)^{(n-k)}$, and maximum likelihood estimates are $p = \frac{R_1}{N}$, $p_1 = \frac{O_{11}}{C_1}$, and $p_2 = \frac{O_{12}}{C_2}$. $-2 \log{\lambda}$ is asymptotically $\chi^2$.

A simplified expression is:

$$\text{log-likelihood} = 2 \sum_{i,j}O_{i,j} \log{\frac{O_{ij}}{E{ij}}}$$

#### 2.1.2.5 Dice coefficient

The Dice coefficient represent similarity between two sets. For sets $A$ and $B$, the coefficient is:

$$\frac{2 \vert A \cap B \vert}{\vert A \vert + \vert B \vert}$$

In our case:

$$\text{Dice} = \frac{2O_{11}}{R_1 + C_1}$$

#### 2.1.2.6 Pointwise Mutual Information

The idea is to measure for two random variables $X$ and $Y$ how much information about $Y$ is contained in $X$ or in other words the reduction in uncertainty about $Y$ when knowing $X$. Thus,

$$I(x,y) = \frac{p(x,y)}{p(x)p(y)} = \frac{O_{11}}{E_{11}}$$

#### 2.1.2.7 Symmetrical Conditional Probability

Recall: conditional probability is defined as,

$$P(A \vert B) = \frac{P(A \cap B)}{P(B)}$$

Silvia and Lopes (1999) proposed symmetrical conditional probability as,

$$P(A \vert B) P(B \vert A) = \frac{P(A \cap B)^2}{P(a)P(B)}$$

Which gives us:

$$\text{scp} = \frac{O_{11}^2}{R_1 C_1}$$

#### 2.1.2.8 Raw Frequency

Finally, we can simply consider frequency of the observed pairs:

$$\text{frequency} = O_{11}$$

### 2.1.3 Supervised Learning with Association Measures

How do we formulate this as a supervised learning problem for tokenization?

• Identify which association measures lead to the best scores to identify the true tokenizations
• Train a classifier to use a mixture of association measures?

### 2.1.4 Mikolov's Association Measures

If we wanted to accomodate, say, 20-word sentences, we'd need to maximize $\mathcal{H}$ over each of those $1,048,576$ individual possible tokenizations. Instead of doing that, what we'd like to do is find a measure for the $m$ most likely tokenizations and then maximize $\mathcal{H}$ for each of those. But, how do we find a measure for how likely each tokenization is?

To explore this, we use the following cherry-picked example corpus:

The green robot is behind the couch, heading north.
The green robot is moving slowly.
The green robot is moving quickly.
Deckard is next to the green fern.



Mikolov et al. [3] use is the following score metric:

$$\text{score}(w_i,w_j) = \frac{\text{count}(w_i,w_j) - \delta}{\text{count}(w_i)\text{count}(w_j)}$$

Where $\delta$ is some tuned discounting coefficient. This parses the entire corpus for all unigrams and bigrams, counts them, and then returns a higher score for words that are only frequently seen in context of one-another. To identify multi-word tokens greater than two, this process is applied recursively. However, this poses several problems and is only one of several possible metrics.

Most importantly, this score metric ignores possible training data – if we have proper tokenizations of words, we should use that information to modify the score.

[The green robot] [is] [behind] [the couch], [heading] [north] [.]
[The green robot] [is] [moving] [slowly] [.]
[The green robot] [is] [moving] [quickly] [.]
[Deckard] [is] [next to] [the green fern] [.]



Then we can use this as training data for our score function.

## 2.2 Semantic Label Tagging

Conditional Random Fields (CRFs) are a classification technique postulated by Lafferty et al. [4]. They are similar to Hidden Markov Models (HMMs) and stochastic grammars (SGs), but, unlike those two generative models (i.e. concerned with joint label-observation probability), are a discriminative model (i.e. concerned with label probability conditioned on observations). SGs and HMMs can be intractable because of their need to enumerate all possible observation sequences; this is not the case for CRFs.

Other discriminative models, such as Maximum Entropy Markov Models (MEMMs), however, suffer from the label-bias problem, in which the transition probabilities leaving any one state only interact with other transitions leaving that state, rather than the global transition set. This causes a bias towards states with fewer outgoing transitions (as illustrated in [4]). CRFs don't have this problem, as Lafferty et al. states: "The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence."

In essence, we want to learn $p(\mathbf{Y} \vert \mathbf{X})$, where $\mathbf{Y}$ is a sequence of labels given to the sequence of observed values $\mathbf{X}$. Once we've trained our model, we can find the $\mathbf{Y_*}$ that maximizes $p(\mathbf{Y_*} \vert \mathbf{X_*})$, where a $*$ signifies a new sequence.

By the Hammersley-Clifford theorem of random fields, the join distribution over $\mathbf{Y}$, given $\mathbf{X}$, is:

$$p_\theta(\mathbf{y} \vert \mathbf{x}) \propto \exp\left( \sum_{e \in E,k} \lambda_k f_k(e,\mathbf{y}\vert_e,\mathbf{x}) + \sum_{v \in V,k} \mu_k g_k(v,\mathbf{y}\vert_v,\mathbf{x}) \right)$$

where $f_k(\cdot)$ and $g_k(\cdot)$ are transition feature functions and state feature functions, respectively, that are defined a priori, and $\lambda_k$, $\mu_k$ are parameters to be learned from training data.

In our case, we have $P(L_i \vert T_i) \triangleq p_\theta(\mathbf{y} \vert \mathbf{x})$.

## 2.4 Similarity Matching



In [ ]:



# Notes

## General

• Overall process: Single-word tokenization -> multi-word tokenization -> Lemmatization/POS tagging (optional)
• Order is important! Should we check for sense matching before TDC classification, or vice-versa?
• Requires identification method for non-measurements: i.e. "I am blue." should not map to any human sensor statement:
• Threshhold on $P(T,L,TDC,H \vert D, \mathcal{S}, O)$, potentially
• Look at precision and recall as metrics for evaluating each step of the NLP pipeline:
• Full-sentence metrics:
• True positive: raw NL input matches correct human sensor statement
• False positive: raw NL input matches incorrect human sensor statement
• False negative: raw NL input incorrectly does not match any human sensor statement
• True negative: raw NL input correctly does not match any human sensor statement
• $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$ is a measure of how many NL inputs were correctly matched
• $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{TN}}$ is a measure of how many NL inputs were correctly transformed into HSSs
• Requires having ground truth for each stage, not just the end

## Development Process

1. Close loop ~√
1. Develop chat interface in Python √
2. Refactor human sensor code (i.e. centralize the Human Sensor class)
2. Identify end-to-end performance metric
1. Identify full human sensor codebook [dependent on 1.2]
2. Assign each input sentence to one codebook entry [Jeremy, Sierra]
3. Output proposed human sensor statement for each
4. Check precision/recall
3. Identify & create missing human sensor templates
1. Consider N/A cases [dependent on 2.2]
2. Create low-hanging human sensor templates
1. Camera as grounding
4. Identify component performance metrics
5. Introduce clean-up and augmentation components
6. Incorporate confidence to each step of fusion chain
7. Consider multiple hypothesis tracking (i.e. non-monolithic framework)

## Pre-processing

• Contractions: should we change "n't" to " not" instead? I.e. "don't" -> "do not"?
• Autocorrect: should we autocorrect transparently, behind the scenes, or neither?
• Case-correction: should we enforce proper case (i.e. remvoe all caps)?
• Punctuation: should we remove double punctuation, insert regular punctuation, etc.?

## Tokenization

• How deep do we want tokenization to go? Set a max limit? Brute-force tokenization, POS rule-based tokenization, or a variant of Michelbacher's classification strategy? Is our domain-specific terminology too dynamic/broad to prevent phrase look-ups?
• We do not care about non-compositionality (i.e. meaning cannot be inferred from components: "Kick the bucket" has nothing to do with footwork and pails); "The red robot" is a compositional phrase.
• We do not care about non-modifiability (i.e. "Kick the bucket" cannot be modified to "Kick the small bucket"); "The red robot" can be modified to "The fast red robot".
• We do not care about non-substitutability (i.e. "Kick the bucket" cannot be substituted to "Kick the pail"); "The red robot" can be substituted to to "The red robber".
• In tokenizatation, we seek to capture syntagmatic relations between words (the relation between two words is called syntagmatic if they occur in sequence, in that words in "the old man" are syntagmatically related, but "good" and "bad" are paradigmatically related -- a relationship we capture at the end, with Sense Matching)
• Can use a number of (symmetric!) association measures:
• T-score
• z-score
• chi-square
• log-likelihood
• dice coefficient
• pointwise mutual information
• symmetrical conditional probability
• raw frequency

## Tagging

• What tokens are used for CRF training? Include POS column? Lemma column? Fused-word column (e.g., "the_red_car")? Multiple word columns based on max tokenization (e.g., "the\tred\tcar")?

## Template-fitting

• How do we identify new templates?