In [ ]:
x = "The cat sat on the mat"
words = x.split()  # naive tokenization: split on whitespace
print(words)
In [ ]:
# assign an arbitrary integer ID to every distinct word
vocabulary = {w: i for i, w in enumerate(set(x.split()))}
print(vocabulary)
print("------------------")
print(words)
print([vocabulary[w] for w in words])
For now, these meaningless IDs are all we have to compute with; later you will learn better ways of representing words.
You | talk | the | talk | but | do | you | walk | the | walk | ? |
---|---|---|---|---|---|---|---|---|---|---|
pronoun | verb | determiner | noun | conjunction | verb | pronoun | verb | determiner | noun | punctuation |
POS tagging is the task of labeling every word in a given text with the appropriate POS tag. An algorithm for this will be our very first NLP task!
From Universal Dependencies, using the Universal POS tagset:
This | killing | of | a | respected | cleric | will | be | causing | us | trouble | for | years | to | come | . |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DET | NOUN | ADP | DET | ADJ | NOUN | AUX | AUX | VERB | PRON | NOUN | ADP | NOUN | PART | VERB | PUNCT |
Or a Hungarian one:
A | gazdaság | ilyen | mértékű | fejlődését | több | folyamat | gerjeszti | . |
---|---|---|---|---|---|---|---|---|
DET | NOUN | DET | ADJ | NOUN | DET | NOUN | VERB | PUNCT |
The Universal tagset is language-independent, apart from some language-specific features. For example, the words "cleric" and "gazdaság" ("economy") are both NOUNs. In English "a" and "the" are determiners; the Hungarian "a" and "az" have similar grammatical functions.
From the UMBC WebBase corpus, using the Penn Treebank tagset:
Well | , | let | me | just | say | there | is | n't | too | much | I | can | tell | you | right | at | the | moment | . |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RB | , | VB | PRP | RB | VBP | EX | VBZ | RB | RB | JJ | PRP | MD | VB | PRP | RB | IN | DT | NN | . |
The latter tagset is English-only; note the EX (existential there) tag, and the token "n't" after "is", tagged RB.
Note that these examples are already tokenized!
$word_1$ | $word_2$ | $word_3$ | $\ldots$ |
---|---|---|---|
$tag_1$ | $tag_2$ | $tag_3$ | $\ldots$ |
Definition (Token) The atoms of interest, in our case the words of a text.
Definition (Corpus) A list of tokens.
Definition (Tokenization) The task of splitting a list of characters (raw text) into tokens.
Definition (Tagset) A finite (usually small) set of symbols, which are linguistically defined properties of tokens.
Definition (Labeled corpus) A list of (token, tag) pairs, where the tags are from a given tagset.
Definition (Tagging) Given an (unlabeled) corpus and a tagset, tagging is the task of assigning a tag to each token.
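A minimal illustration of these definitions in Python (a hand-made toy example; the tags are Universal POS tags assigned by us):
In [ ]:
raw_text = "The cat sat on the mat ."
corpus = raw_text.split()  # tokenization: the raw text becomes a list of tokens
tagset = {"DET", "NOUN", "VERB", "ADP", "PUNCT"}
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
labeled_corpus = list(zip(corpus, tags))  # a labeled corpus: (token, tag) pairs
print(labeled_corpus)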
Sometimes a corpus is split at sentence boundaries, which means that sentences can be processed separately. Otherwise, a sentence boundary is just a special punctuation or end-of-sentence symbol.
There are cases where the POS tag of a word cannot be deduced within a sentence, only in the context of other sentences.
If the tags were algorithmically well-defined, then implementing that definition would yield a 100% correct tagger without any further ado. However, this is not the case with natural language.
There are rule-based approaches, where linguistic or grammatical rules are implemented, but every language has exceptions.
If one uses no such rules, only previously correctly labeled data, that is the statistical approach, because the resulting algorithm is conditioned on previously seen data.
In the statistical approach:
Training Setting up the algorithm for a given task, using previously correctly labeled data.
Train set Previously correctly labeled data. Your algorithm can use any information in the training set.
Prediction/inference Labeling unlabeled data with your algorithm. This is the main task itself.
Test/unseen data Also correctly labeled data, but its labels are not revealed to the algorithm.
Evaluation Comparing the predicted labels with the correct labels (done on the test data).
Annotating The human labor of creating the gold dataset: manually processing sentences and labeling them correctly.
Gold data An annotated corpus, usually annotated with serious effort and for a specific task.
Silver data Lower-quality or automatically labeled data.
Test on train Evaluating on the same data you trained on. Every sound algorithm is expected to perform well (if not perfectly) on its training data, so this is not a decent test of your algorithm.
Besides POS tagging, there are several other tasks.
Named entity recognition.
Uber isn't pulling out of Quebec just yet.
Bill Gates buys Apple. DJIA crashing!
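For instance, the first example sentence can be cast as labeling every token with 0 or 1 (the labels below are our own hand annotation):
In [ ]:
tokens = ["Uber", "is", "n't", "pulling", "out", "of", "Quebec", "just", "yet", "."]
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]  # 1 = part of a named entity, 0 = not
print(list(zip(tokens, labels)))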
In the example, the target labels are just $\{0,1\}$, marking whether a token is part of a named entity or not. There are more detailed NER tagsets which tell you what kind of entity it is (plus one tag for "not a named entity").
tagset | # of tags | language independent | example tags |
---|---|---|---|
CoNLL | 4 | yes | Person, Location, Organization, Misc |
Penn Treebank | 22 | no | Animal, Cardinal, Date, Disease,... |
Identifying these parts of the text helps us understand which real-world entities play a role in a given sentence, and hence helps retrieve information from the text.
It is a different task to match the various forms of the same entity, like: USA, United States of America, The States, 'merica.
He | reckons | the current account deficit | will narrow | to | only £1.8 billion | in | September | . |
---|---|---|---|---|---|---|---|---|
NP | | NP | | | NP | | NP | |
Finds noun phrases, which correspond to a single agent in the sentence: the one who (or what) takes part in the action.
It is often called shallow parsing because it finds some components of the sentence but not the whole structure.
Here we discuss some naive approaches and common mistakes.
Neither start with capital letters.
Counterexamples:
The beginning of a sentence is capitalized regardless of named entities.
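A minimal sketch of this naive capitalization heuristic (our own illustration) shows the problem:
In [ ]:
def naive_ner(tokens):
    # mark a token as a named entity iff it starts with a capital letter
    return [1 if token[:1].isupper() else 0 for token in tokens]

print(naive_ner("Bill Gates buys Apple .".split()))  # [1, 1, 0, 1, 0] -- fine here
print(naive_ner("The dog barks .".split()))          # [1, 0, 0, 0] -- wrongly marks "The"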
http://nlp.stanford.edu:8080/corenlp/
This software performs various NLP tasks, from POS tagging to grammatical analysis. It is written in Java, and you can use it via an API, a GUI, the command line, or a web service. It offers good quality and speed, and uses much more sophisticated methods than what we will learn.
The temperature is 26.1 degree Celsius outside. the Sun is shining.
220 Volts is considered a high voltage.
It supports several major languages other than English: Chinese, Spanish, Arabic, French, and German.
There is a trade-off between quality and quantity: human-annotated data is of higher quality but smaller in quantity.
This is a serious problem in machine translation, where you do not even have correct automated evaluation.
We are doing it by the Book!
A healthy language changes over time and acquires new words and expressions.
The above tasks are instances of a broader class of tasks. Namely, take any dataset consisting of training examples and their corresponding labels.
The labels are usually from a small set.
data | label |
---|---|
$x_1$ | $y_1$ |
$x_2$ | $y_2$ |
$\vdots$ | $\vdots$ |
$x_{n+1}$ | |
$x_{n+2}$ | |
$\vdots$ |
The task is similar to before: given the training set, try to predict the labels of the test set.
$x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $\ldots$ |
---|---|---|---|---|---|
$l_1$ | $l_2$ | $l_3$ | $l_4$ | $l_5$ | $\ldots$ |
data | label |
---|---|
$(\ , \ , x_1, x_2, x_3)$ | $l_1$ |
$(\ , x_1, x_2, x_3, x_4)$ | $l_2$ |
$(x_1, x_2, x_3, x_4, x_5)$ | $l_3$ |
$(x_2, x_3, x_4, x_5, x_6)$ | $l_4$ |
$\vdots$ | $ \vdots$ |
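One possible way to build such windows in Python (the padding with empty strings stands for the blank cells in the table above; the helper is our own):
In [ ]:
def windows(tokens, labels, k=2):
    # pad with empty strings so that every token has k neighbors on both sides
    padded = [""] * k + list(tokens) + [""] * k
    return [(tuple(padded[i:i + 2 * k + 1]), labels[i]) for i in range(len(tokens))]

toks = ["x1", "x2", "x3", "x4", "x5", "x6"]
labs = ["l1", "l2", "l3", "l4", "l5", "l6"]
for window, label in windows(toks, labs)[:4]:
    print(window, "->", label)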
Let's build a lookup table which records which POS tag follows a certain sequence of words.
In [ ]:
from collections import OrderedDict

# a toy labeled corpus: (token, tag) pairs
corpus = list(zip(["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."],
                  ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN", "PUNCT"]))
print(corpus)

n = 3  # window size: the word itself and up to n-1 preceding words
pos_lookup = OrderedDict([(tuple(corpus[j][0] for j in range(max(i - n + 1, 0), i + 1)), corpus[i][1])
                          for i in range(len(corpus))])
print()
pos_lookup
Using this table, you can determine the POS tag of a given word, provided that you have seen that word with the same preceding words in the corpus. We will call this a Markov model.
The assumption of the Markov model is that a long enough sequence of words determines, without doubt, the POS tag of the last word. For example:
work -> can be a noun or a verb
my work -> noun
I work -> verb
One can also use the words after the word of interest. Still, it may happen that the window around the word is not big enough, and the same word with the same surrounding words can have several possible POS tags.
In [ ]:
# map (context before, word, context after) to the set of tags seen in that context
pos_lookup2 = OrderedDict()
for i in range(len(corpus)):
    words_before = tuple(corpus[j][0] for j in range(max(i - 2, 0), i))
    word_of_interest = corpus[i][0]
    words_after = tuple(corpus[j][0] for j in range(i + 1, min(i + 3, len(corpus))))
    key = (words_before, word_of_interest, words_after)
    if key in pos_lookup2:
        pos_lookup2[key] |= {corpus[i][1]}   # add the tag to the already seen ones
    else:
        pos_lookup2[key] = {corpus[i][1]}
print()
dict(pos_lookup2)
The algorithm is as follows:
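A minimal sketch of such a lookup-based tagger, reusing the pos_lookup table built above (the UNKNOWN fallback is our own choice):
In [ ]:
def lookup_tag(sentence, lookup, n=3):
    tags = []
    for i in range(len(sentence)):
        window = tuple(sentence[j] for j in range(max(i - n + 1, 0), i + 1))
        tags.append(lookup.get(window, "UNKNOWN"))  # fall back on unseen windows
    return tags

print(lookup_tag(["The", "quick", "brown", "fox"], pos_lookup))
print(lookup_tag(["The", "lazy", "fox"], pos_lookup))  # unseen windows stay UNKNOWN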
This lookup fails in two cases:
If there is a word in the test set, but not in the train set.
If the target word is in the test set, but never with the particular surrounding words seen in training.
For example, if one takes the word "the", then it is almost certainly a determiner. However, take the following segment:
... Steve Jobs met the Dalai Lama ...
The word "the" is still unknown if there was no Dalai, Lama, Steve or Jobs mentioned in the training set.
These problems are not independent, and they lead to much greater problems in NLP. One cannot solve them completely, only mitigate them.
We suppose that there is an underlying sequence of POS tags which is not observed directly, but which governs the words in the text.
Think of the POS tags as the grammatical backbone of the sentences.
PRON | VERB | PRON | PREP | NOUN | PUNCT |
---|---|---|---|---|---|
I | saw | you | in | school | . |
You | saw | me | in | school | . |
You | met | me | in | school | . |
You | met | me | in | work | . |
We | met | him | at | work | . |
We will think of a sentence as the sequence of POS tags, for which the actual choice of words is rather arbitrary.
I | saw | you | in | school | . |
---|---|---|---|---|---|
PRON | VERB | PRON | PREP | NOUN | PUNCT |
(Illustration: the tag of $w_i$ is predicted from the window $w_{i-2}, w_{i-1}, w_i$ of the word and its preceding words.)
We also make the assumption that the joint probability of a word sequence and a tag sequence factorizes as: $$ \mathbb{P}(w_1, w_2, \ldots w_n, \ l_1, l_2 \ldots l_n) = \prod_{i=1}^n\mathbb{P}(l_i \ |\ l_{i-k+1},l_{i-k+2}\ldots l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i) $$
The term $\mathbb{P}(l_i \ |\ l_{i-k+1},l_{i-k+2}\ldots l_{i-1})$ is a Markovian probability: the probability of tag $l_i$, given the previous $k-1$ tags.
If an index is not positive, then we simply omit it: $$ \mathbb{P}(l_2|l_{-1},l_0, l_1) \approx \frac{\#\{l_1,l_2\}}{\#\{l_1\}} \text{ and } \mathbb{P}(l_1|l_{-1},l_0) \approx \frac{\#\{l_1\}}{\text{# of words}} $$
$$ \mathbb{P}(w_i|l_i) \approx \frac{\#\{\text{word }w_i \text{ with the tag }l_i\}}{\#\{l_i\}}$$
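These estimates are simple relative counts; here is a sketch of computing them from a labeled corpus for the bigram case $k=2$ (the function and variable names are ours):
In [ ]:
from collections import Counter

def estimate(labeled_corpus):
    tag_count, pair_count, emit_count = Counter(), Counter(), Counter()
    prev_tag = None
    for word, tag in labeled_corpus:
        tag_count[tag] += 1
        emit_count[(word, tag)] += 1
        if prev_tag is not None:
            pair_count[(prev_tag, tag)] += 1
        prev_tag = tag
    # transition probabilities: P(l_i | l_{i-1}) ~ #{l_{i-1} l_i} / #{l_{i-1}}
    trans = {(u, v): c / tag_count[u] for (u, v), c in pair_count.items()}
    # emission probabilities: P(w_i | l_i) ~ #{word w_i with tag l_i} / #{l_i}
    emit = {(w, t): c / tag_count[t] for (w, t), c in emit_count.items()}
    return trans, emit

trans, emit = estimate(corpus)  # the tagged fox sentence from above
print(trans[("DET", "ADJ")], emit[("fox", "NOUN")])  # 1.0 0.5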
We detail the case of trigrams, $k=3$. Let $w_1, w_2, \ldots w_n$ be a sentence; we are looking for the sequence of corresponding tags. Let $$\pi(n, u, v) = \max_{l_1, l_2 \ldots l_{n-2}\in L \ , \ l_{n-1} = u, l_n = v} \mathbb{P}(w_1, w_2, \ldots w_n, \ l_1, l_2 \ldots l_n)$$
This is the probability of the most probable tag sequence of length $n$ that ends with $u$ and $v$ and emits the token sequence $w_1, w_2, \ldots w_n$.
Note that the function $\pi$ satisfies the following property: every prefix of an optimal tag sequence must be optimal too.
One can use dynamic programming to compute $\pi(n, u, v)$ for all $u, v \in L$; the final two tags are the pair $(u, v)$ maximizing $\pi(n, u, v)$, and backtracking recovers the rest of the tag sequence.
set $\pi(0, u, v) = 1, \forall u, v\in L$
begin for $k = 1\ldots n$
$\quad$ begin for $u, v \in L$
$\quad \quad \pi(k,u,v) = \max_{t\in L} \pi(k-1,t,u)\cdot \mathbb{P}(v \ | \ t,u)\cdot \mathbb{P}(w_k \ | \ v) $
$\quad$end for
end for
return $\max_{u,v\in L} \pi(n,u,v)$
The complexity is $O(n\cdot |L|^k)$, in the example $k=3$.
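A sketch of this dynamic program in Python, simplified to the bigram case $k=2$ so that $\pi$ only has to remember one previous tag; it reuses the trans and emit tables from the estimation sketch above (unseen events simply get probability 0, no smoothing):
In [ ]:
def viterbi(sentence, tags, trans, emit, start):
    # pi[v]: probability of the best tag sequence for the prefix read so far, ending in tag v
    pi = {v: start.get(v, 0.0) * emit.get((sentence[0], v), 0.0) for v in tags}
    backpointers = []
    for word in sentence[1:]:
        new_pi, pointers = {}, {}
        for v in tags:
            # pick the best previous tag u for the current tag v
            u_best = max(tags, key=lambda u: pi[u] * trans.get((u, v), 0.0))
            new_pi[v] = pi[u_best] * trans.get((u_best, v), 0.0) * emit.get((word, v), 0.0)
            pointers[v] = u_best
        pi = new_pi
        backpointers.append(pointers)
    best = max(tags, key=lambda v: pi[v])  # best final tag
    path = [best]
    for pointers in reversed(backpointers):  # backtracking
        path.append(pointers[path[-1]])
    return list(reversed(path))

tags = {t for _, t in corpus}
start = {t: 1.0 for t in tags}  # uniform start probabilities, for simplicity
print(viterbi([w for w, _ in corpus], tags, trans, emit, start))  # recovers the gold tags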
See more: http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf
Let's suppose that one has a tagger and has performed the tagging on the test set.
token | $w_1$ | $w_2$ | $\ldots$ |
---|---|---|---|
gold labels | $l_1$ | $l_2$ | $\ldots$ |
predicted labels | $p_1$ | $p_2$ | $\ldots$ |
The performance of a tagger can be measured in several ways:
OOV means out-of-vocabulary: words that are seen at test time, but not at training time.
predicted\gold | 0 | 1 |
---|---|---|
0 | TN | FN |
1 | FP | TP |
TP: true positive hit, "should and did"
TN: true negative, "shouldn't and didn't"
FP: false positive, "shouldn't but did"
FN: false negative, "should but didn't"
And we can measure the following quantities:
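The standard quantities, with their usual definitions:
$$ \text{accuracy} = \frac{TP+TN}{TP+TN+FP+FN} \qquad \text{precision} = \frac{TP}{TP+FP} \qquad \text{recall} = \frac{TP}{TP+FN} $$
$$ F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} $$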
Consider an algorithm that outputs all $1$s and one correct $0$: its recall is (nearly) perfect, but its precision can be poor, so no single quantity is sufficient on its own.
The F1 score takes both precision and recall into account.
Note that evaluation in the NER example is a bit special, since a named entity can consist of several words, and recognizing only part of "Bill Gates" is not a hit.
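A token-level sketch of these measures (our own helper; it ignores the multi-word-entity subtlety just mentioned):
In [ ]:
def precision_recall_f1(gold, predicted):
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold      = [1, 1, 0, 1, 0]   # "Bill Gates buys Apple ."
predicted = [1, 0, 0, 1, 0]   # the tagger missed "Gates"
print(precision_recall_f1(gold, predicted))  # (1.0, 0.666..., 0.8)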