Introduction to NLP tasks

  • Written language is a sequence of characters
  • We will treat text as a list of words
    • this requires tokenization

In [ ]:
x = "The cat sat on the mat"
words = x.split()
print(words)
  • The computer does not know anything about words or any human language
    • we consider the words as abstract labels
  • Usually we put the words into a vocabulary and assign an integer ID to all of them

In [ ]:
vocabulary = {w:i for i, w in enumerate(set(x.split()))}
print(vocabulary)
print("------------------")
print(words)
print([vocabulary[w] for w in words])

For now, this is what we have to compute with, but later you will learn more about representing words (other than meaningless IDs).

POS

The part-of-speech tag is a property of words. More precisely, we put words into categories according to what role they play in a given sentence.

Example

The         dog   saw   a           cat   .
determiner  noun  verb  determiner  noun  punctuation

  • Interchangeable words should end up in the same category.
    • The category is the POS tag
  • The correct tag depends on the context, not only the word itself.
You      talk  the         talk  but          do    you      walk  the         walk  ?
pronoun  verb  determiner  noun  conjunction  verb  pronoun  verb  determiner  noun  punctuation
  • One can change the verbs to "give", which is another verb, but not the nouns.
  • Still, not every verb fits in place of "talk" or "walk": the sentence can lose its meaning, even though it stays grammatical.
    • The POS tag only tells the grammatical role: which words can take each other's place without compromising the grammar.

POS tagging

... is the task of labeling every word in a given text with the appropriate POS tag. An algorithm for this will be our very first NLP task!

Tagsets

  • The tags themselves are the result of linguistic/NLP consensus
    • there are several conventions for them.
  • From a computational point of view, there is no formal definition of POS tags
    • Benchmark datasets are the definition (gold standard) of what the correct tags are.

From Universal Dependencies using Universal POS tags:

This  killing  of   a    respected  cleric  will  be   causing  us    trouble  for  years  to    come  .
DET   NOUN     ADP  DET  ADJ        NOUN    AUX   AUX  VERB     PRON  NOUN     ADP  NOUN   PART  VERB  PUNCT

Or a Hungarian one:

A    gazdaság  ilyen  mértékű  fejlődését  több  folyamat  gerjeszti  .
DET  NOUN      DET    ADJ      NOUN        DET   NOUN      VERB       PUNCT

The Universal tagset is language independent, except for some language-specific features. For example, the words "cleric" and "gazdaság" are both NOUNs. In English "a" and "the" are determiners; in Hungarian "a" and "az" have similar grammatical functions.

From UMBC webbase corpus using Penn Treebank tagset:

Well  ,  let  me   just  say  there  is   n't  too  much  I    can  tell  you  right  at  the  moment  .
RB    ,  VB   PRP  RB    VBP  EX     VBZ  RB   RB   JJ    PRP  MD   VB    PRP  RB     IN  DT   NN      .

The latter is for English only; note the EX (existential there) tag and the token "n't" after "is" with the tag RB.

These are also tokenized!

Tagging in general

$word_1$ $word_2$ $word_3$ $\ldots$
$tag_1$ $tag_2$ $tag_3$ $\ldots$

Definition (Token) The atoms of interest, in our case the words of a text.

Definition (Corpus) A list of tokens

Definition (Tokenization) The task of splitting a list of characters (raw text) into tokens.

Definition (Tagset) A finite (usually small) set of symbols, which are linguistically defined properties of tokens.

Definition (Labeled corpus) A list of (token, tag) pairs, where the tags are from a given tagset.

Definition (Tagging) Given an (unlabeled) corpus and a tagset, tagging is the task of assigning a tag to every token.

Sometimes a corpus is split at sentence boundaries, which means that sentences can be processed separately. Otherwise, a sentence boundary is just a special punctuation or end-of-sentence symbol.

There are cases where the POS tag of a word cannot be deduced within a sentence, only in the context of other sentences.
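
To make the definitions above concrete, here is a minimal sketch of a labeled corpus stored as a list of (token, tag) pairs and split at sentence boundaries (the tokens, tags and the convention that "." closes a sentence are illustrative assumptions).

In [ ]:
# A tiny labeled corpus: a list of (token, tag) pairs.
labeled_corpus = [
    ("The", "DET"), ("dog", "NOUN"), ("saw", "VERB"),
    ("a", "DET"), ("cat", "NOUN"), (".", "PUNCT"),
    ("It", "PRON"), ("ran", "VERB"), (".", "PUNCT"),
]

# Split the corpus at sentence boundaries so sentences can be processed separately.
sentences, current = [], []
for token, tag in labeled_corpus:
    current.append((token, tag))
    if token == ".":          # "." acts as the end-of-sentence symbol here
        sentences.append(current)
        current = []
if current:                   # keep trailing tokens that lack a final "."
    sentences.append(current)

print(sentences)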

Terminology

If the tags were algorithmically well defined, then implementing that definition would result in a 100% correct tagger without any further ado. However, this is not the case with natural language.

There are rule-based approaches, where linguistic or grammatical rules are implemented, but every language has exceptions.

If one uses no such rules, only previously correctly labeled data, that is the statistical approach, because the resulting algorithm is conditioned on previously seen data.

In the statistical approach:

Training Setting up the algorithm for a given task. This uses previously correctly labeled data.

Train set Previously correctly labeled data. Your algorithm can use any information in the training set.

Prediction/inference Labeling unlabeled data with your algorithm. This is the main task itself.

Test/unseen data Previously correctly labeled data, but its labels are not shown to the algorithm.

Evaluation Comparing the predicted labels to the correct ones (done on the test data).

Annotating The human labor of producing the gold dataset: manually processing sentences and labeling them correctly.

Gold data Annotated corpus, usually annotated with serious effort and for a specific task.

Silver data Lower-quality or automatically labeled data.

Test on train Evaluating on the same data you trained on. Every sound algorithm is expected to perform well (if not perfectly) on the training data, so this is not a decent test of your algorithm.

Other tagging tasks

Beside POS tagging, there are several other tasks.

NER

Named entity recognition.

Uber isn't pulling out of Quebec just yet.
Bill Gates buys Apple. DJIA crashing!

  • Mark names of things in text
  • Find whether a word or sequence of words is a thing or not; mostly persons, places, companies
  • Grammatically they are replaceable with "someone" or "some remarkable place" or "some legal entity"
  • From the human point of view, they carry a lot of valuable information because of their very being.

In the example the target labels are just $\{0,1\}$, marking whether a token is part of a named entity or not. There are more detailed NER tagsets which tell you what kind of entity it is (plus one tag for "not a named entity").

tagset         # of tags  language independent  tags
CoNLL          4          yes                   Person, Location, Organization, Misc
Penn Treebank  22         no                    Animal, Cardinal, Date, Disease, ...

Identifying these parts of the text helps to understand which real-world entities play a role in a given sentence, and hence to retrieve information from the text.

It is a different task to match various forms of the same entity, like: USA, United States of America, The States, 'merica.
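
Returning to the binary labeling of the first example sentence, here is a minimal sketch (the tokenization and the 0/1 labels are illustrative assumptions):

In [ ]:
# Tokens of the first example with binary NER labels:
# 1 = part of a named entity, 0 = not.
tokens = ["Uber", "is", "n't", "pulling", "out", "of", "Quebec", "just", "yet", "."]
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]

for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")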

NP-chunking

[NP He ] reckons [NP the current account deficit ] will narrow to [NP only £1.8 billion ] in [NP September ] .

Finds noun-phrases which correspond to a single agent in the sentence.

The one who (or what):

  • does something
  • the action is performed on
  • involved in some action

It is often called shallow parsing because it finds some components of the sentence but not the whole structure.

Naive ways

Here we discuss some naive approaches and common mistakes.

The tag (label) of a word is not a function of the word itself.
  • It depends on the context, the surrounding words.
  • Most words can have several part-of-speech tags.
  • In English, words that can be both nouns and verbs are common: work, talk, walk.
Named entities are not always proper nouns

Nor do they always start with capital letters.

Counter examples:

  • the States
  • von Neumann

The beginning of a sentence is capitalized regardless of named entities.

There is no comprehensive list of all named entities
  • Let's suppose that one wants to collect every famous person, geographic place, company, trademark and title (real and fictional) in a list.
  • That list will never be comprehensive, since a well-known thing can be mentioned in an unusual or abbreviated form.
  • And the list would become obsolete very soon, since new movies, books, famous people and firms appear all the time.

Stanford CoreNLP

http://nlp.stanford.edu:8080/corenlp/

This software performs various NLP tasks from POS tagging to grammatical analysis. It is written in Java, and you can use it via an API, a GUI, the command line, or a web service. It offers good accuracy and speed, and uses much more sophisticated methods than what we will learn.

Examples

The temperature is 26.1 degree Celsius outside. the Sun is shining.
220 Volts is considered a high voltage.

It supports several major languages other than English: Chinese, Spanish, Arabic, French and German.

Some obstacles

Training data

  • Bad quality or insufficient training data.
  • Luckily, the aforementioned tasks have well-established gold standard datasets.
  • Also, every dataset has some mistakes in it.

There is a trade-off between quality and quantity: human-annotated data is of higher quality but smaller in quantity.

Evaluation

  • Without proper evaluation, any model is questionable.
  • Given a certain amount of gold data, you have to decide how to split it into training and test sets.
  • Insufficient test data can result in an unreliable algorithm.
  • You can inspect the output of your algorithm yourself, but that lacks statistical perspective.
  • Getting 10 of your favorite sentences right doesn't mean that your algorithm rocks!
  • This is called testing by glance; you are advised to avoid it.

This is a serious problem in machine translation, where there is not even a fully correct automated evaluation.

Gold is not gold

  • No matter which NLP task you take, a 100% correct algorithm cannot be reached.
  • There are always exceptions and never-before-seen structures and words.
  • Even gold data can be contradictory, or even undefined.

We are doing it by the Book!

  • Is "the Book" a named entity or just an expression of doing it right?
  • It can depend on the context, or who you talk to.

Linguistic change

Healthy languages change over time and acquire new words and expressions.

Supervised learning

The above tasks are instances of a broader class of tasks. Namely, take any dataset which consists of training examples and their corresponding labels.

The labels are usually from a small set.

  • German nouns and their article (der, die, das)
  • pairs of images and what is seen on them
  • movies and their IMDB ratings
data label
$x_1$ $y_1$
$x_2$ $y_2$
$\vdots$ $\vdots$
$x_{n+1}$
$x_{n+2}$
$\vdots$

The task is similar to before: given the training set, try to predict the labels of the test set.
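
A minimal sketch of such a dataset, using the German article example from the list above (the particular nouns and the train/test split are assumptions):

In [ ]:
# labeled training examples: (data, label) pairs
train_set = [("Hund", "der"), ("Katze", "die"), ("Haus", "das"),
             ("Tisch", "der"), ("Blume", "die")]

# unlabeled test examples whose labels we want to predict
test_set = ["Auto", "Baum"]

for x, y in train_set:
    print(x, "->", y)
print("to be predicted:", test_set)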

Sequential data

  • In the case of text, the sequential ordering of the data carries important information.
  • We usually need the order of the data as input information.
  • One can also take a window around the word-of-interest and treat the window as an unordered, fixed-size input.
$x_1$ $x_2$ $x_3$ $x_4$ $x_5$ $\ldots$
$l_1$ $l_2$ $l_3$ $l_4$ $l_5$ $\ldots$
data label
$(\ , \ , x_1, x_2, x_3)$ $l_1$
$(\ , x_1, x_2, x_3, x_4)$ $l_2$
$(x_1, x_2, x_3, x_4, x_5)$ $l_3$
$(x_2, x_3, x_4, x_5, x_6)$ $l_4$
$\vdots$ $ \vdots$
  • With this rearrangement (sketched below), one can transform any NLP task into a supervised learning task.
  • The problem is when a 5, 10 or 20 word wide window is not enough, because there may be long-distance relations between words.
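
Here is a minimal sketch of this rearrangement with a centered window of width 5 (the toy sentence, its tags, and the use of None as padding are assumptions):

In [ ]:
# Turn a tagged sequence into (window, label) pairs with a centered window.
tokens = ["The", "dog", "saw", "a", "cat", "."]
labels = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]

width = 5                     # window size, a model parameter
pad = width // 2              # positions on each side of the center
padded = [None] * pad + tokens + [None] * pad   # None marks "no word here"

examples = [(tuple(padded[i:i + width]), labels[i]) for i in range(len(tokens))]
for window, label in examples:
    print(window, "->", label)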

HMM

A simple POS tagger

  • Let's write a very simple tagger.
  • Given the corpus, establish a vocabulary $V$ with every word in it
    • and the set of labels $L$
  • The corpus is a list of pairs in $V\times L$.
  • Usually $|V|$ is between $10^5$ and $10^6$, and $|L|$ is never more than $100$

Let's build a lookup table which assigns a POS tag to a word based on the word itself and the few words preceding it.


In [ ]:
from collections import OrderedDict

# toy labeled corpus: a list of (word, POS tag) pairs
corpus = list(zip(["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."],
                  ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN", "PUNCT"]))
print(corpus)

n = 3  # window size: the word itself plus up to n-1 preceding words
# map each window of words to the POS tag of the window's last word
pos_lookup = OrderedDict(
    (tuple(corpus[j][0] for j in range(max(i - n + 1, 0), i + 1)), corpus[i][1])
    for i in range(len(corpus)))
print()
pos_lookup

Using this table, you can determine the POS tag of a given word, provided that you have seen that word and its preceding words in the corpus. We will call this a Markov model.

The assumption of the Markov model is that if you take a long enough sequence of words, then they determine, without doubt, the POS tag of the last word. For example:

work -> it can be noun or verb
my work -> noun
I work -> verb

One can also use the words after the word-of-interest. Even then, it may happen that the window around the word is not big enough, and the same word with the same surrounding words can have several possible POS tags.


In [ ]:
pos_lookup2 = OrderedDict()
for i in range(len(corpus)):
    # the window: up to two words before and after the word of interest
    words_before = tuple(corpus[j][0] for j in range(max(i - 2, 0), i))
    word_of_interest = corpus[i][0]
    words_after = tuple(corpus[j][0] for j in range(i + 1, min(i + 3, len(corpus))))
    key = (words_before, word_of_interest, words_after)
    # collect every tag seen in this exact context into a set
    pos_lookup2.setdefault(key, set()).add(corpus[i][1])
print()
dict(pos_lookup2)
  • You can use the words after the target word
    • but not the gold tags, not even the ones before the target word.
  • Temporarily, you can use previously predicted labels as if they were correct
    • but in the end, test prediction cannot use any gold labels.

The algorithm is as follows:

  • Look up a given word together with its surrounding words.
  • If the sequence is in the table, then use the corresponding tag (or any one of the stored tags).
  • If not, then make a guess (a small sketch of this tagger follows).
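
A minimal sketch of this lookup tagger, reusing corpus and pos_lookup from the cells above; the fallback guess (the most frequent tag in the training corpus) is an assumption, not the only reasonable choice:

In [ ]:
from collections import Counter

# fallback guess: the most common tag of the training corpus
most_common_tag = Counter(tag for _, tag in corpus).most_common(1)[0][0]

def tag_sentence(words, n=3):
    tags = []
    for i in range(len(words)):
        key = tuple(words[max(i - n + 1, 0):i + 1])   # the word and up to n-1 preceding words
        tags.append(pos_lookup.get(key, most_common_tag))
    return tags

print(tag_sentence(["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]))
print(tag_sentence(["The", "lazy", "cat", "."]))   # unseen contexts fall back to the guess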

Problems

  • A word appears in the test set, but not in the train set

    • This is called out-of-vocabulary or OOV.
  • The target word was seen during training, but never with these particular surrounding words

    • This is called data sparsity.

For example, if one takes the word "the", then it is almost certainly a determiner. However, take the following segment:

... Steve Jobs met the Dalai Lama ...

The word "the" is still unknown if there was no Dalai, Lama, Steve or Jobs mentioned in the training set.

  • One may need hundreds of millions of sentences to encounter Steve Jobs and the Dalai Lama in the same sentence.
  • Or even worse: no English text ever written in human history contains those four words in the same sentence (there is one now).

These problems are not independent, and they lead to much greater problems in NLP. One cannot solve these problems, only work around them.

Hidden Markov Model

  • In the following model we use only the preceding words (as in the Markov model).
  • We call the POS tag a hidden parameter.
  • We suppose that there is some underlying POS tag which is not seen directly, but governs the words in the text.

  • Think of the POS tags as the grammatical backbone of the sentences.

PRON  VERB  PRON  PREP  NOUN    PUNCT
I     saw   you   in    school  .
You   saw   me    in    school  .
You   met   me    in    school  .
You   met   me    in    work    .
We    met   him   at    work    .

We will think of a sentence as the sequence of POS tags, for which the actual choice of words is rather arbitrary.

Generative model

  • We suppose that there is some mechanism which puts the POS tags one after the other
  • For a given POS tag, the actual word is just random
    • according to some distribution
  • We are looking for a sequence of tags that can generate the given words with the highest probability.
  • More precisely, we have a probabilistic model ($n$ is the length of the sentence). $$ \mathbb{P}(w_1, w_2, \ldots w_n, l_1, l_2 \ldots l_n) $$ For example $$\mathbb{P}(\text{the}, \text{dog}, \text{saw}, \text{the}, \text{cat}, \text{DET}, \text{NOUN}, \text{VERB}, \text{DET}, \text{NOUN})$$
  • Look for the best explaining tag sequence: $$ {\arg\max}_{l_i\in L}\mathbb{P}(w_1, w_2, \ldots w_n \ | \ l_1, l_2 \ldots l_n) $$
  • In general, this would require $O(|L|^n)$ complexity.
  • In the practical model we use some restrictions
  • Use a given window to the past, the window size is a model parameter.
I     saw   you   in    school  .
PRON  VERB  PRON  PREP  NOUN    PUNCT

$w_{i-2}$  $w_{i-1}$  $w_i$
[        window          ]
  • And also make the assumption that the probability of a sequence is: $$ \mathbb{P}(w_1, w_2, \ldots w_n \ | \ l_1, l_2 \ldots l_n) = \prod_{i=1}^n\mathbb{P}(l_i \ |\ l_{i-k+1},l_{i-k+2}\ldots l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i) $$

  • The term $\mathbb{P}(l_i \ |\ l_{i-k+1},l_{i-k+2}\ldots l_{i-1})$ is a Markovian probability: the probability of tag $l_i$, given the previous $k-1$ tags.

  • The terms $\mathbb{P}(w_i|l_i)$ are emission probabilities: the probability of a word given its POS tag.

Estimation from the corpus

$$\mathbb{P}(l_n|l_1,l_2\ldots l_{n-1})\approx \frac{\#\{l_1,l_2\ldots l_{n-1},l_n\}}{\#\{l_1,l_2\ldots l_{n-1}\}}$$

If an index is not positive, then we simply omit it: $$ \mathbb{P}(l_2|l_{-1},l_0, l_1) \approx \frac{\#\{l_1,l_2\}}{\#\{l_1\}} \text{ and } \mathbb{P}(l_1|l_{-1},l_0) \approx \frac{\#\{l_1\}}{\text{# of words}} $$

$$ \mathbb{P}(w_i|l_i) \approx \frac{\#\{\text{word }w_i \text{ with the tag }l_i\}}{\#\{l_i\}}$$
  • The above formulas do not suffer from data sparsity that much.
  • The tagset is much smaller than the vocabulary, so the occurrences of tag sequences are not that sparse.
  • The set of possible tags of a given word saturates quickly.
  • After we have collected the estimates of the probabilities (training, see the counting sketch below),
  • we need an algorithm to find the argmax, given a sequence of words.
    • this is the inference/prediction step
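
A minimal sketch of this counting, reusing the toy corpus from above; for brevity it estimates bigram transitions ($k=2$) instead of longer tag histories, and it applies no smoothing:

In [ ]:
from collections import Counter

tags = [tag for _, tag in corpus]

# transition estimate: P(l_i | l_{i-1}) ~ #{l_{i-1}, l_i} / #{l_{i-1}}
tag_count    = Counter(tags)
bigram_count = Counter(zip(tags, tags[1:]))
def p_transition(tag, prev_tag):
    return bigram_count[(prev_tag, tag)] / tag_count[prev_tag]

# emission estimate: P(w_i | l_i) ~ #{word w_i with tag l_i} / #{l_i}
emission_count = Counter(corpus)        # counts (word, tag) pairs
def p_emission(word, tag):
    return emission_count[(word, tag)] / tag_count[tag]

print(p_transition("ADJ", "DET"))   # how often ADJ follows DET
print(p_emission("dog", "NOUN"))    # how often NOUN is realized as "dog"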

Viterbi algorithm

We detail the case of trigrams, $k=3$. Let $w_1, w_2, \ldots w_n$ be a sentence; we are looking for the corresponding sequence of tags.

Let $$\pi(i, u, v) = \max_{l_1, l_2 \ldots l_{i-2}\in L \ , \ l_{i-1} = u, l_i = v} \mathbb{P}(w_1, w_2, \ldots w_i \left| \ l_1, l_2 \ldots l_i\right.)$$

This is the probability of the most probable tag sequence for the first $i$ tokens $w_1, w_2, \ldots w_i$ that ends with the tags $u$ and $v$.

Note that the function $\pi$ satisfies the following property: every prefix of an optimal tag sequence must be optimal too.

One can use dynamic programming to compute $\pi(i, u, v)$ for every position, and backtracking to recover the corresponding tag sequence. The final two tags are the $u$ and $v$ maximizing $\pi(n, u, v)$.

set $\pi(0, u, v) = 1, \forall u, v\in L$
begin for $i = 1\ldots n$
$\quad$ begin for $u, v \in L$
$\quad \quad \pi(i,u,v) = \max_{t\in L} \pi(i-1,t,u)\cdot \mathbb{P}(v \ | \ t,u)\cdot \mathbb{P}(w_i \ | \ v) $
$\quad$end for
end for
return $\max_{u,v\in L} \pi(n,u,v)$

The complexity is $O(n\cdot |L|^k)$, in the example $k=3$.

See more: http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf
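
Below is a minimal sketch of the same dynamic program for the bigram case ($k=2$), reusing the p_transition and p_emission estimates from the sketch above; with no smoothing it assumes every needed transition and emission was seen in the toy corpus:

In [ ]:
# Bigram Viterbi: pi[(i, v)] is the probability of the best tag sequence
# for w_1..w_i ending with tag v; back[(i, v)] stores the best previous tag.
def viterbi(sentence, tagset):
    pi = {(0, v): 1.0 for v in tagset}
    back = {}
    for i, word in enumerate(sentence, start=1):
        for v in tagset:
            best_prev, best_p = None, 0.0
            for u in tagset:
                p = pi[(i - 1, u)] * p_transition(v, u) * p_emission(word, v)
                if p > best_p:
                    best_prev, best_p = u, p
            pi[(i, v)], back[(i, v)] = best_p, best_prev
    # pick the best final tag and follow the back-pointers
    n = len(sentence)
    last = max(tagset, key=lambda v: pi[(n, v)])
    tags = [last]
    for i in range(n, 1, -1):
        tags.append(back[(i, tags[-1])])
    return list(reversed(tags))

tagset = set(tag for _, tag in corpus)
print(viterbi(["The", "quick", "dog", "jumps", "over", "the", "lazy", "fox", "."], tagset))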

Evaluation

Let's suppose that one has a tagger and has performed the tagging on the test set.

token $w_1$ $w_2$ $\ldots$
gold labels $l_1$ $l_2$ $\ldots$
predicted labels $p_1$ $p_2$ $\ldots$
  • The predicted and gold labels are compared to each other.
  • The performance of a tagger can be measured in several ways (a small sketch follows this list):

    • per-token accuracy: $$\frac{\# \text{ correct labels}}{\# \text{ words}}$$
    • per-sentence accuracy: $$\frac{\# \text{ sentences with all correct labels}}{\# \text{ sentences}}$$
    • unknown word accuracy: $$\frac{\# \text{ correct labels of OOV words}}{\# \text{ OOV words}}$$
  • OOV means out-of-vocabulary: words that are seen at test time, but not at training time.

  • For the simple HMM model, we cannot deal with OOVs!
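
A minimal sketch of the first two measures (the gold and predicted tags are made-up values for illustration):

In [ ]:
# gold and predicted tags for two sentences (made-up values)
gold = [["DET", "NOUN", "VERB", "PUNCT"], ["PRON", "VERB", "NOUN", "PUNCT"]]
pred = [["DET", "NOUN", "NOUN", "PUNCT"], ["PRON", "VERB", "NOUN", "PUNCT"]]

n_tokens  = sum(len(s) for s in gold)
n_correct = sum(g == p for gs, ps in zip(gold, pred) for g, p in zip(gs, ps))
per_token_accuracy = n_correct / n_tokens                                        # 7/8 here

per_sentence_accuracy = sum(gs == ps for gs, ps in zip(gold, pred)) / len(gold)  # 1/2 here

print(per_token_accuracy, per_sentence_accuracy)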

Binary classification

  • If a tagset consists of only two labels, then a more detailed measurement is possible.
  • At every token, one of the following 4 possibilities occurs.
predicted \ gold   0    1
0                  TN   FN
1                  FP   TP
  • TN: true negative, correct rejection, "shouldn't and didn't"
  • FN: false negative, miss, "should but didn't"
  • FP: false positive, false alarm, "shouldn't but did"
  • TP: true positive, hit, "should and did"

  • And we can measure the following quantities (a small sketch follows the list):

    • Precision: $\frac{TP}{TP+FP}$
    • Accuracy: $\frac{TP+TN}{TP+TN+FP+FN}$
    • Recall: $\frac{TP}{TP+FN}$
    • F1 score (F-measure): harmonic mean of precision and recall $$ \frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision} +\text{Recall}} = \frac{2\cdot TP}{2\cdot TP + FN + FP}$$
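
A minimal sketch of these quantities computed from binary gold and predicted labels (made-up values):

In [ ]:
# binary gold and predicted labels for a sequence of tokens (made-up values)
gold = [0, 1, 1, 0, 0, 1, 0, 0]
pred = [0, 1, 0, 0, 1, 1, 0, 0]

tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))   # hits
tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))   # correct rejections
fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))   # false alarms
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))   # misses

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(precision, recall, accuracy, f1)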

Explanation

  • For example, if an algorithm produces a single $1$ and all other tags are $0$, and that $1$ happens to be right
    • then the precision is $100\%$, but the recall is low.
  • If an algorithm marks (almost) everything with $1$

    • then the recall is high, but the precision is low.
  • The F1 score takes both precision and recall into account.

  • Note that the NER case is a bit special, since a named entity can consist of several words ("Bill Gates"), and a partial match does not count as a hit.


In [ ]: