This notebook is a testing ground for a human-robot chatbox interface. There are three primary elements to our intelligent chatbox:
In general, our goal is to find the template human sensor statement(s) that best matches some natural language input. To solve this, we have developed a workflow as shown below:
We identify the following probabilistic and deterministic variables to track:
Based on conditional independence assumptions, we can form the following Bayesian Network (BN):
We can simplify the joint probability distribution based on the BN: $$ P(T,L,TDC,H \vert D, \mathcal{S}, O) = \prod_{i=1}^\mathcal{T}P(T_i \vert S_i, D) P(L_i \vert T_i) \prod_{j=1}^\mathcal{N} P(TDC_j \vert L_i) P(H_j \vert TDC, O) $$
Our goal is to find the set of template human sensor statements $\mathcal{H} = \{H_1,\dots,H_\mathcal{N}\}$ and tokenizations $\mathcal{T}$ that maximize this distribution, thus:
$$ \DeclareMathOperator*{\argmax}{arg\,max} \argmax_{\mathcal{H},\mathcal{\tilde{T}}} \prod_{i=1}^\mathcal{\tilde{T}}P(T_i \vert S_i, D) P(L_i \vert T_i) \prod_{j=1}^\mathcal{N} P(TDC_j \vert L_i) P(H_j \vert TDC, O) $$In practice, this likely means we'll need to limit the number of tokenizations considered, as there are $2^{(W-1)}$ tokenizations available for each input document consisting of $W$ words (see below for a full description).
Additionally, it will be important to remove 'garbage' sentences:
The following sections will describe how to compute each of the following terms:
Tokenization is the process of splitting an input document into a sequence of tokens: individual portions of the input document. We'll then tag each token with a semantic label in the next step.
Given an input document, say, "Roy is behind the desk.", we need to figure out how to split the sentence into a set of tokens to be labeled. A first-cut approach would be to split the sentence on individual words: ["Roy", "is", "behind", "the", "desk."]. But this causes issues with words that should share one label (e.g., one label should be assigned to "the desk" rather than labeling "the" and "desk" separately).
We make the assumption that a document should be tokenized based on $n$-grams and delimiter punctuation. $n$-grams are arbitrarily grouped sequences of consecutive words, with $n$ representing the number of words grouped together (e.g., "the desk" is a bigram or $2$-gram). Delimiter punctuation (',', ';', '.', '!', '?') marks divisions between $n$-grams, such that no two words on opposite sides of the punctuation can be combined into an $n$-gram. The following table lists the tokenizations of "Roy is behind the desk." under these assumptions:
| | Token 1 | Token 2 | Token 3 | Token 4 | Token 5 | Token 6 | Schema |
|---|---|---|---|---|---|---|---|
| Tokenization 1 | Roy | is | behind | the | desk | . | [1, 1, 1, 1, 1, 1] |
| Tokenization 2 | Roy is | behind | the | desk | . | | [2, 1, 1, 1, 1] |
| Tokenization 3 | Roy is | behind the | desk | . | | | [2, 2, 1, 1] |
| Tokenization 4 | Roy is | behind | the desk | . | | | [2, 1, 2, 1] |
| Tokenization 5 | Roy is | behind the desk | . | | | | [2, 3, 1] |
| Tokenization 6 | Roy is behind | the | desk | . | | | [3, 1, 1, 1] |
| Tokenization 7 | Roy is behind | the desk | . | | | | [3, 2, 1] |
| Tokenization 8 | Roy is behind the | desk | . | | | | [4, 1, 1] |
| Tokenization 9 | Roy is behind the desk | . | | | | | [5, 1] |
| Tokenization 10 | Roy | is behind | the | desk | . | | [1, 2, 1, 1, 1] |
| Tokenization 11 | Roy | is behind | the desk | . | | | [1, 2, 2, 1] |
| Tokenization 12 | Roy | is behind the | desk | . | | | [1, 3, 1, 1] |
| Tokenization 13 | Roy | is behind the desk | . | | | | [1, 4, 1] |
| Tokenization 14 | Roy | is | behind the | desk | . | | [1, 1, 2, 1, 1] |
| Tokenization 15 | Roy | is | behind the desk | . | | | [1, 1, 3, 1] |
| Tokenization 16 | Roy | is | behind | the desk | . | | [1, 1, 1, 2, 1] |
As the example above shows, even a simple 5-word sentence leads to a large number of possible tokenizations. Specifically, the upper bound on the total number of tokenizations is $2^{(W-1)}$, where $W$ is the number of words in the document. This is only an upper bound because delimiter punctuation reduces the number of tokenizations (e.g., "The woman, in blue, near me." has only $2^1 \times 2^1 \times 2^1 = 8$ tokenizations, as opposed to the $2^5 = 32$ tokenizations of "The woman in blue near me.").
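To make the counting concrete, here is a minimal sketch (plain Python; the helper name is my own) that enumerates every schema for a run of words between delimiters by deciding, at each of the $W-1$ gaps, whether to merge or split:

```python
from itertools import product

def tokenization_schemas(words):
    """Enumerate all 2**(W-1) schemas for a run of W words with no
    delimiter punctuation inside it (the trailing '.' token is omitted)."""
    schemas = []
    # Each of the W-1 gaps between adjacent words is either a split or a merge.
    for gaps in product([True, False], repeat=len(words) - 1):
        schema, run = [], 1
        for split in gaps:
            if split:
                schema.append(run)
                run = 1
            else:
                run += 1
        schema.append(run)
        schemas.append(schema)
    return schemas

schemas = tokenization_schemas(["Roy", "is", "behind", "the", "desk"])
print(len(schemas))  # 2**(5-1) = 16, matching the table above
```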
We require the definition of $P(T_i \vert S_i, D)$ for each $T_i \in \mathcal{T}$. We assume that each input document $D$ has one correct (unknown) tokenization $T^*$, and that $P(T_i \vert S_i, D)$ represents the probability that the tokenization $T_i = T^*$. Clearly, the score metric chosen will influence this probability.
Given a corpus, we can define association measures between words that quantify how likely two words are to be combined into one token. These can be applied recursively to determine multi-word tokenizations. We begin by defining a few terms (largely taken from Michelbacher 2013).
A corpus is analyzed for each word pair $(U,V)$, where $U$ is the pair's first component and $V$ is the pair's second component. $u$ and $v$ are specific realizations of the random variables $U$ and $V$. For example, with the pair of interest $(u,v) = (\text{the}, \text{cat})$, an observed pair $(\text{the}, \text{dog})$ belongs to the set $\{ (u,v) \vert U =u \land V\neq v\}$.
Observed frequencies for word pairs are:
\begin{align} O_{11} &= \vert \{ (u,v) \vert U =u \land V=v\} \vert \\ O_{12} &= \vert \{ (u,v) \vert U =u \land V\neq v\} \vert \\ O_{21} &= \vert \{ (u,v) \vert U \neq u \land V=v\} \vert \\ O_{22} &= \vert \{ (u,v) \vert U \neq u \land V\neq v\} \vert \end{align}Marginal frequencies for individual words within word pairs are:
\begin{align} R_{1} &= \vert \{ (u,v) \vert U =u\} \vert \\ R_{2} &= \vert \{ (u,v) \vert U \neq u\} \vert \\ C_{1} &= \vert \{ (u,v) \vert V=v\} \vert \\ C_{2} &= \vert \{ (u,v) \vert V\neq v\} \vert \end{align}Expected frequencies, then, are:
\begin{align} E_{11} &= \frac{R_1C_1}{N} \\ E_{12} &= \frac{R_1C_2}{N} \\ E_{21} &= \frac{R_2C_1}{N} \\ E_{22} &= \frac{R_2C_2}{N} \end{align}Where $N = O_{11} + O_{12} + O_{21} + O_{22}$ is the total number of word pairs in the corpus. The t-score measures how far (positively or negatively) the observed frequency deviates from the expected frequency, in multiples of the estimated standard deviation: $$ t =\frac{O_{11} - E_{11}}{\sqrt{O_{11}}} $$
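As a concrete illustration, the following sketch (the helper name and toy sentence are my own; adjacent bigrams only, no smoothing) builds these counts for a single word pair and evaluates the t-score:

```python
def contingency(tokens, u, v):
    """Observed, marginal, and expected frequencies for the adjacent word pair (u, v)."""
    pairs = list(zip(tokens, tokens[1:]))   # all adjacent (first, second) word pairs
    N = len(pairs)
    O11 = sum(a == u and b == v for a, b in pairs)
    O12 = sum(a == u and b != v for a, b in pairs)
    O21 = sum(a != u and b == v for a, b in pairs)
    O22 = N - O11 - O12 - O21
    R1, C1 = O11 + O12, O11 + O21           # marginal frequencies for u and v
    E11 = R1 * C1 / N                        # expected frequency under independence
    return {"N": N, "O11": O11, "O12": O12, "O21": O21, "O22": O22,
            "R1": R1, "C1": C1, "E11": E11}

c = contingency("the green robot is behind the couch".split(), "the", "green")
t_score = (c["O11"] - c["E11"]) / c["O11"] ** 0.5   # t-score as defined above
print(c, t_score)
```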
The z-score measures the same deviation from the expected frequency, but uses a different approximation for the sample variance than the t-score: $$ z =\frac{O_{11} - E_{11}}{\sqrt{E_{11}}} $$
The $\chi^2$ statistic measures the independence of $u$ and $v$: $$ \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
Where $(i,j) \in \{(1,1),(1,2),(2,1),(2,2)\}$.
The likelihood ratio test requires two hypotheses:
\begin{align} \mathbf{H_0} &: P((u,v)) = P((u,v^\prime)); v^\prime \neq v\\ \mathbf{H_1} &: P((u,v)) =p_1 \neq p_2 = P((u,v^\prime)); v^\prime \neq v\\ \end{align}Where $\mathbf{H_0}$ is the hypothesis that the probability of $u$ occurring is the same regardless of whether the following word is $v$ or not, and $\mathbf{H_1}$ is the hypothesis that the probability of $u$ occurring differs depending on whether $v$ or some other word $v^\prime$ follows it.
The test statistic is the ratio of the maximized likelihoods under each hypothesis:
$$ \lambda = \frac{\max L(H_0)}{\max L(H_1)} = \frac{L(O_{11},C_1,p)L(O_{12},C_2,p)}{L(O_{11},C_1,p_1)L(O_{12},C_2,p_2)} $$Where $L(k,n,p) = p^k(1-p)^{(n-k)}$, and the maximum likelihood estimates are $p = \frac{R_1}{N}$, $p_1 = \frac{O_{11}}{C_1}$, and $p_2 = \frac{O_{12}}{C_2}$. $-2 \log{\lambda}$ is asymptotically $\chi^2$-distributed.
A simplified expression is:
$$ \text{log-likelihood} = 2 \sum_{i,j}O_{ij} \log{\frac{O_{ij}}{E_{ij}}} $$The Dice coefficient represents the similarity between two sets. For sets $A$ and $B$, the coefficient is:
$$ \frac{2 \vert A \cap B \vert}{\vert A \vert + \vert B \vert} $$In our case:
$$ \text{Dice} = \frac{2O_{11}}{R_1 + C_1} $$The idea is to measure, for two random variables $X$ and $Y$, how much information about $Y$ is contained in $X$; in other words, the reduction in uncertainty about $Y$ when $X$ is known. Thus,
$$ I(x,y) = \frac{p(x,y)}{p(x)p(y)} = \frac{O_{11}}{E_{11}} $$Recall: conditional probability is defined as,
$$ P(A \vert B) = \frac{P(A \cap B)}{P(B)} $$Silva and Lopes (1999) proposed the symmetrical conditional probability as,
$$ P(A \vert B) P(B \vert A) = \frac{P(A \cap B)^2}{P(A)P(B)} $$Which gives us:
$$ \text{scp} = \frac{O_{11}^2}{R_1 C_1} $$Finally, we can simply consider the frequency of the observed pair:
$$ \text{frequency} = O_{11} $$How do we formulate this as a supervised learning problem for tokenization?
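The measures above all derive from the same four observed cell counts. Before turning to that question, here is a minimal sketch (my own helper, with arbitrary example counts and no handling of zero cells) that collects them in one place:

```python
import math

def association_measures(O11, O12, O21, O22):
    """Compute the association measures above from the four observed cell counts."""
    N = O11 + O12 + O21 + O22
    R1, R2 = O11 + O12, O21 + O22
    C1, C2 = O11 + O21, O12 + O22
    O = {(1, 1): O11, (1, 2): O12, (2, 1): O21, (2, 2): O22}
    E = {(1, 1): R1 * C1 / N, (1, 2): R1 * C2 / N,
         (2, 1): R2 * C1 / N, (2, 2): R2 * C2 / N}
    return {
        "t": (O11 - E[1, 1]) / math.sqrt(O11),
        "z": (O11 - E[1, 1]) / math.sqrt(E[1, 1]),
        "chi2": sum((O[ij] - E[ij]) ** 2 / E[ij] for ij in O),
        # zero cells are skipped, since 0 * log(0) is taken as 0
        "log-likelihood": 2 * sum(O[ij] * math.log(O[ij] / E[ij]) for ij in O if O[ij] > 0),
        "Dice": 2 * O11 / (R1 + C1),
        "I": O11 / E[1, 1],
        "scp": O11 ** 2 / (R1 * C1),
        "frequency": O11,
    }

print(association_measures(O11=30, O12=70, O21=20, O22=880))
```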
If we wanted to accommodate, say, 20-word sentences, we'd need to maximize $\mathcal{H}$ over each of those $2^{19} = 524,288$ individual possible tokenizations. Instead of doing that, what we'd like to do is find the $m$ most likely tokenizations and then maximize $\mathcal{H}$ over each of those. But how do we measure how likely each tokenization is?
To explore this, we use the following cherry-picked example corpus:
The green robot is behind the couch, heading north.
The green robot is moving slowly.
The green robot is moving quickly.
Deckard is next to the green fern.
Mikolov et al. [3] use the following score metric:
$$ \text{score}(w_i,w_j) = \frac{\text{count}(w_i,w_j) - \delta}{\text{count}(w_i)\text{count}(w_j)} $$Where $\delta$ is some tuned discounting coefficient. This parses the entire corpus for all unigrams and bigrams, counts them, and then returns a higher score for words that are frequently seen only in the context of one another. To identify multi-word tokens longer than two words, the process is applied recursively. However, this poses several problems and is only one of several possible metrics.
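A quick sketch of this metric applied to the example corpus above (the function name and $\delta$ value are my own; punctuation is kept as pre-separated tokens):

```python
from collections import Counter

def phrase_scores(sentences, delta=1.0):
    """Mikolov-style phrase score for every adjacent word pair in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {(wi, wj): (count - delta) / (unigrams[wi] * unigrams[wj])
            for (wi, wj), count in bigrams.items()}

corpus = [
    "The green robot is behind the couch , heading north .",
    "The green robot is moving slowly .",
    "The green robot is moving quickly .",
    "Deckard is next to the green fern .",
]
scores = phrase_scores(corpus, delta=0.5)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])  # strongest candidate bigrams
```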
Most importantly, this score metric ignores possible training data – if we have proper tokenizations of words, we should use that information to modify the score.
If we have access to a manually tokenized corpus, such as:
[The green robot] [is] [behind] [the couch], [heading] [north] [.]
[The green robot] [is] [moving] [slowly] [.]
[The green robot] [is] [moving] [quickly] [.]
[Deckard] [is] [next to] [the green fern] [.]
Then we can use this as training data for our score function.
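One way to use it (my own convention, not prescribed by the corpus format) is to convert each bracketed line into per-word labels marking where tokens begin (B) and continue (I), which puts the problem in the sequence-labeling form used below:

```python
import re

def bracketed_to_labels(line):
    """Convert a bracketed tokenization into (word, label) pairs:
    B for the first word of a token, I for a continuation.
    Punctuation outside brackets (e.g., the comma) is dropped here."""
    pairs = []
    for token in re.findall(r"\[([^\]]+)\]", line):
        words = token.split()
        pairs.extend(zip(words, ["B"] + ["I"] * (len(words) - 1)))
    return pairs

print(bracketed_to_labels("[The green robot] [is] [behind] [the couch], [heading] [north] [.]"))
# [('The', 'B'), ('green', 'I'), ('robot', 'I'), ('is', 'B'), ('behind', 'B'), ...]
```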
Conditional Random Fields (CRFs) are a classification technique introduced by Lafferty et al. [4]. They are similar to Hidden Markov Models (HMMs) and stochastic grammars (SGs), but, unlike those two generative models (i.e., concerned with the joint label-observation probability), CRFs are a discriminative model (i.e., concerned with the label probability conditioned on observations). SGs and HMMs can be intractable because of their need to enumerate all possible observation sequences; this is not the case for CRFs.
Other discriminative models, such as Maximum Entropy Markov Models (MEMMs), however, suffer from the label-bias problem, in which the transition probabilities leaving any one state only interact with other transitions leaving that state, rather than the global transition set. This causes a bias towards states with fewer outgoing transitions (as illustrated in [4]). CRFs don't have this problem, as Lafferty et al. states: "The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence."
In essence, we want to learn $p(\mathbf{Y} \vert \mathbf{X})$, where $\mathbf{Y}$ is a sequence of labels given to the sequence of observed values $\mathbf{X}$. Once we've trained our model, we can find the $\mathbf{Y_*}$ that maximizes $p(\mathbf{Y_*} \vert \mathbf{X_*})$, where a $*$ signifies a new sequence.
By the Hammersley-Clifford theorem for random fields, the joint distribution over $\mathbf{Y}$, given $\mathbf{X}$, is:
$$ p_\theta(\mathbf{y} \vert \mathbf{x}) \propto \exp\left( \sum_{e \in E,k} \lambda_k f_k(e,\mathbf{y}\vert_e,\mathbf{x}) + \sum_{v \in V,k} \mu_k g_k(v,\mathbf{y}\vert_v,\mathbf{x}) \right) $$where $f_k(\cdot)$ and $g_k(\cdot)$ are transition feature functions and state feature functions, respectively, that are defined a priori, and $\lambda_k$, $\mu_k$ are parameters to be learned from training data.
In our case, we have $P(L_i \vert T_i) \triangleq p_\theta(\mathbf{y} \vert \mathbf{x})$.
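To make this concrete, here is a toy sketch (the feature functions, weights, and B/I labels are invented for illustration) that evaluates the unnormalized score of one label sequence under the linear-chain form above:

```python
import math

def crf_unnormalized_log_score(x, y, transition_feats, state_feats, lambdas, mus):
    """Weighted sum of transition features over edges and state features over
    vertices; exp of this is proportional to p_theta(y | x)."""
    score = 0.0
    for i in range(1, len(y)):                     # edges (y[i-1], y[i])
        score += sum(l * f(y[i - 1], y[i], x, i)
                     for l, f in zip(lambdas, transition_feats))
    for i in range(len(y)):                        # vertices y[i]
        score += sum(m * g(y[i], x, i)
                     for m, g in zip(mus, state_feats))
    return score

# Toy features for B/I (token-begin / token-continuation) labels.
f1 = lambda y_prev, y_cur, x, i: 1.0 if (y_prev, y_cur) == ("B", "I") else 0.0
g1 = lambda y_cur, x, i: 1.0 if y_cur == "I" and x[i].islower() else 0.0

x = ["The", "green", "robot", "is", "behind", "the", "desk"]
y = ["B", "I", "I", "B", "B", "B", "I"]
print(math.exp(crf_unnormalized_log_score(x, y, [f1], [g1], lambdas=[0.8], mus=[0.5])))
```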