Introduction to NLP tasks

  • Written language is a sequence of characters
  • We will treat text as a list of words
    • this requires tokenization

In [ ]:
x = "The cat sat on the mat"
words = x.split()
print(words)
  • The computer does not know anything about words or any human language
    • we consider the words as abstract labels
  • Usually we put the words into a vocabulary and assign an integer ID to all of them

In [ ]:
vocabulary = {w:i for i, w in enumerate(set(x.split()))}
print(vocabulary)
print("------------------")
print(words)
print([vocabulary[w] for w in words])

For now, this is what we have to compute with, but later you will learn more about representing words (other than meaningless IDs).

POS

The part-of-speech tag is a property of words. More precisely, we put words into categories according to what role they play in a given sentence.

Example

The         dog   saw   a           cat   .
determiner  noun  verb  determiner  noun  punctuation

  • Interchangeable words should end up in the same category.
    • The category is the POS tag
  • The correct tag depends on the context, not only the word itself.
You      talk  the         talk  but          do    you      walk  the         walk  ?
pronoun  verb  determiner  noun  conjunction  verb  pronoun  verb  determiner  noun  punctuation
  • One can change the verbs to "give", which is another verb, but not the nouns.
  • Still, not every verb fits in place of "talk" or "walk": the sentence can lose its meaning, even though it stays grammatical.
    • The POS tag only tells the grammatical role: which words can take each other's place without compromising the grammar.

POS tagging

... is the task of labeling every word in a given text with the appropriate POS tag. An algorithm for this will be our very first NLP task!

Tagsets

  • The tags themselves are the result of linguistic/NLP consensus
    • there are several conventions for them.
  • From a computational point of view, there is no formal definition of POS tags
    • Benchmark datasets are the definition (gold standard) of what the correct tags are.

From Universal Dependencies using Universal POS tags:

This  killing  of   a    respected  cleric  will  be   causing  us    trouble  for  years  to    come  .
DET   NOUN     ADP  DET  ADJ        NOUN    AUX   AUX  VERB     PRON  NOUN     ADP  NOUN   PART  VERB  PUNCT

Or a Hungarian one:

A    gazdaság  ilyen  mértékű  fejlődését  több  folyamat  gerjeszti  .
DET  NOUN      DET    ADJ      NOUN        DET   NOUN      VERB       PUNCT

The Universal tagset is language independent, except for some language-specific features. For example, the words "cleric" and "gazdaság" are both NOUNs. In English "a" and "the" are determiners; in Hungarian "a" and "az" have similar grammatical functions.

From UMBC webbase corpus using Penn Treebank tagset:

Well  ,  let  me   just  say  there  is   n't  too  much  I    can  tell  you  right  at  the  moment  .
RB    ,  VB   PRP  RB    VBP  EX     VBZ  RB   RB   JJ    PRP  MD   VB    PRP  RB     IN  DT   NN      .

The latter is for English only; note the EX (existential there) tag and the token "n't" after "is" with the tag RB.

These are also tokenized!

Tagging in general

$word_1$ $word_2$ $word_3$ $\ldots$
$tag_1$ $tag_2$ $tag_3$ $\ldots$

Definition (Token) The atoms of interest, in our case the words of a text.

Definition (Corpus) A list of tokens

Definition (Tokenization) The task of splitting a list of characters (raw text) into tokens.

Definition (Tagset) A finite (usually small) set of symbols, which are linguistically defined properties of tokens.

Definition (Labeled corpus) A list of (token, tag) pairs, where the tags are from a given tagset.

Definition (Tagging) Given an (unlabeled) corpus and a tagset, tagging is the task of assigning a tag to every token.

Sometimes a corpus is split at sentence boundaries, which means that sentences can be processed separately. Otherwise, a sentence boundary is just a special punctuation or end-of-sentence symbol.

There are cases where the POS tag of a word cannot be deduced within a sentence, only in the context of other sentences.
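
To make the definitions above concrete, here is a minimal sketch of a labeled corpus stored as a list of (token, tag) pairs and split at sentence boundaries (the tokens, tags and the convention that "." closes a sentence are illustrative assumptions).

In [ ]:
# A tiny labeled corpus: a list of (token, tag) pairs.
labeled_corpus = [
    ("The", "DET"), ("dog", "NOUN"), ("saw", "VERB"),
    ("a", "DET"), ("cat", "NOUN"), (".", "PUNCT"),
    ("It", "PRON"), ("ran", "VERB"), (".", "PUNCT"),
]

# Split the corpus at sentence boundaries so sentences can be processed separately.
sentences, current = [], []
for token, tag in labeled_corpus:
    current.append((token, tag))
    if token == ".":          # "." acts as the end-of-sentence symbol here
        sentences.append(current)
        current = []
if current:                   # keep trailing tokens that lack a final "."
    sentences.append(current)

print(sentences)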

Terminology

If the tags were algorithmically well defined, then implementing that definition would result in a 100% correct tagger without any further ado. However, this is not the case with natural language.

There are rule-based approaches, where linguistic or grammatical rules are implemented, but every language has exceptions.

If one uses no such rules, only previously correctly labeled data, that is the statistical approach, because the resulting algorithm is conditioned on previously seen data.

In the statistical approach:

Training Setting up the algorithm for a given task. This uses previously correctly labeled data.

Train set Previously correctly labeled data. Your algorithm can use any information in the training set.

Prediction/inference Labeling unlabeled data with your algorithm. This is the main task itself.

Test/unseen data Previously correctly labeled data, but its labels are not shown to the algorithm.

Evaluation Comparing the predicted labels to the correct ones (done on the test data).

Annotating The human labor of producing the gold dataset: manually processing sentences and labeling them correctly.

Gold data Annotated corpus, usually annotated with serious effort and for a specific task.

Silver data Lower-quality or automatically labeled data.

Test on train Evaluating on the same data you trained on. Every sound algorithm is expected to perform well (if not perfectly) on the training data, so this is not a decent test of your algorithm.

Other tagging tasks

Beside POS tagging, there are several other tasks.

NER

Named entity recognition.

Uber isn't pulling out of Quebec just yet.
Bill Gates buys Apple. DJIA crashing!

  • Mark names of things in text
  • Find whether a word or sequence of words is a thing or not; mostly persons, places, companies
  • Grammatically they are replaceable with "someone" or "some remarkable place" or "some legal entity"
  • From the human point of view, they carry a lot of valuable information because of their very being.

In the example the target labels are just $\{0,1\}$, marking whether a token is part of a named entity or not. There are more detailed NER tagsets which tell you what kind of entity it is (plus one tag for "not a named entity").

tagset         # of tags  language independent  tags
CoNLL          4          yes                   Person, Location, Organization, Misc
Penn Treebank  22         no                    Animal, Cardinal, Date, Disease, ...

Identifying these parts of the text helps to understand which real-world entities play a role in a given sentence, and hence to retrieve information from the text.

It is a different task to match various forms of the same entity, like: USA, United States of America, The States, 'merica.
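
Returning to the binary labeling of the first example sentence, here is a minimal sketch (the tokenization and the 0/1 labels are illustrative assumptions):

In [ ]:
# Tokens of the first example with binary NER labels:
# 1 = part of a named entity, 0 = not.
tokens = ["Uber", "is", "n't", "pulling", "out", "of", "Quebec", "just", "yet", "."]
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]

for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")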

NP-chunking

[NP He ] reckons [NP the current account deficit ] will narrow to [NP only £1.8 billion ] in [NP September ] .

Finds noun-phrases which correspond to a single agent in the sentence.

The one who (or what):

  • does something
  • the action is performed on
  • involved in some action

It is often called shallow parsing because it finds some components of the sentence but not the whole structure.

Naive ways

Here we discuss some naive approaches and common mistakes.

The tag (label) of a word is not a function of the word itself.
  • It depends on the context, the surrounding words.
  • Most words can have several part-of-speech tags.
  • In English, words that can be both nouns and verbs are common: work, talk, walk.
Named entities are not always proper nouns

Nor do they always start with capital letters.

Counter examples:

  • the States
  • von Neumann

The beginning of a sentence is capitalized regardless of named entities.

There is no comprehensive list of all named entities
  • Let's suppose that one wants to collect every famous person, geographic place, company, trademark and title (real and fictional) in a list.
  • That list will never be comprehensive, since a well-known thing can be mentioned in an unusual or abbreviated form.
  • And the list would become obsolete very soon, since new movies, books, famous people and firms appear all the time.

Stanford CoreNLP

http://nlp.stanford.edu:8080/corenlp/

This software performs various NLP tasks from POS tagging to grammatical analysis. It is written in Java, and you can use it via an API, a GUI, the command line, or a web service. It offers good accuracy and speed, and uses much more sophisticated methods than what we will learn.

Examples

The temperature is 26.1 degree Celsius outside. the Sun is shining.
220 Volts is considered a high voltage.

It supports several major languages other than English: Chinese, Spanish, Arabic, French and German.

Some obstacles

Training data

  • Bad quality or insufficient training data.
  • Luckily, the aforementioned tasks have well-established gold standard datasets.
  • Also, every dataset has some mistakes in it.

There is a trade-off between quality and quantity: human-annotated data is of higher quality but smaller in quantity.

Evaluation

  • Without proper evaluation, any model is questionable.
  • Given a certain amount of gold data, you have to decide how to split it into training and test sets.
  • Insufficient test data can result in an unreliable algorithm.
  • You can inspect the output of your algorithm yourself, but that lacks statistical perspective.
  • Getting 10 of your favorite sentences right doesn't mean that your algorithm rocks!
  • This is called testing by glance; you are advised to avoid it.

This is a serious problem in machine translation, where there is not even a fully correct automated evaluation.

Gold is not gold

  • No matter which NLP task you take, a 100% correct algorithm cannot be reached.
  • There are always exceptions and never-before-seen structures and words.
  • Even gold data can be contradictory, or even undefined.

We are doing it by the Book!

  • Is "the Book" a named entity or just an expression of doing it right?
  • It can depend on the context, or who you talk to.

Linguistic change

Healthy languages change over time and acquire new words and expressions.

Supervised learning

The above tasks are instances of a broader class of tasks. Namely, take any dataset which consists of training examples and their corresponding labels.

The labels are usually from a small set.

  • German nouns and their article (der, die, das)
  • pairs of images and what is seen on them
  • movies and their IMDB ratings
data label
$x_1$ $y_1$
$x_2$ $y_2$
$\vdots$ $\vdots$
$x_{n+1}$
$x_{n+2}$
$\vdots$

The task is similar to before: given the training set, try to predict the labels of the test set.
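
A minimal sketch of such a dataset, using the German article example from the list above (the particular nouns and the train/test split are assumptions):

In [ ]:
# labeled training examples: (data, label) pairs
train_set = [("Hund", "der"), ("Katze", "die"), ("Haus", "das"),
             ("Tisch", "der"), ("Blume", "die")]

# unlabeled test examples whose labels we want to predict
test_set = ["Auto", "Baum"]

for x, y in train_set:
    print(x, "->", y)
print("to be predicted:", test_set)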

Sequential data

  • In the case of text, the sequential ordering of the data carries important information.
  • We usually need the order of the data as input information.
  • One can also take a window around the word-of-interest and treat the window as an unordered, fixed-size input.
$x_1$ $x_2$ $x_3$ $x_4$ $x_5$ $\ldots$
$l_1$ $l_2$ $l_3$ $l_4$ $l_5$ $\ldots$
data label
$(\ , \ , x_1, x_2, x_3)$ $l_1$
$(\ , x_1, x_2, x_3, x_4)$ $l_2$
$(x_1, x_2, x_3, x_4, x_5)$ $l_3$
$(x_2, x_3, x_4, x_5, x_6)$ $l_4$
$\vdots$ $ \vdots$
  • With this rearrangement (sketched below), one can transform any NLP task into a supervised learning task.
  • The problem is when a 5, 10 or 20 word wide window is not enough, because there may be long-distance relations between words.
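
Here is a minimal sketch of this rearrangement with a centered window of width 5 (the toy sentence, its tags, and the use of None as padding are assumptions):

In [ ]:
# Turn a tagged sequence into (window, label) pairs with a centered window.
tokens = ["The", "dog", "saw", "a", "cat", "."]
labels = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]

width = 5                     # window size, a model parameter
pad = width // 2              # positions on each side of the center
padded = [None] * pad + tokens + [None] * pad   # None marks "no word here"

examples = [(tuple(padded[i:i + width]), labels[i]) for i in range(len(tokens))]
for window, label in examples:
    print(window, "->", label)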

HMM

A simple POS tagger

  • Let's write a very simple tagger.
  • Given the corpus, establish a vocabulary $V$ with every word in it
    • and the set of labels $L$
  • The corpus is a list of pairs in $V\times L$.
  • Usually $|V|$ is between $10^5$ and $10^6$, and $|L|$ is never more than $100$

Let's build a lookup table which assigns a POS tag to a word based on the word itself and the few words preceding it.


In [ ]:
from collections import OrderedDict

# toy labeled corpus: a list of (word, POS tag) pairs
corpus = list(zip(["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."],
                  ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN", "PUNCT"]))
print(corpus)

n = 3  # window size: the word itself plus up to n-1 preceding words
# map each window of words to the POS tag of the window's last word
pos_lookup = OrderedDict(
    (tuple(corpus[j][0] for j in range(max(i - n + 1, 0), i + 1)), corpus[i][1])
    for i in range(len(corpus)))
print()
pos_lookup

Using this table, you can determine the POS tag of a given word, provided that you have seen that word and its preceding words in the corpus. We will call this a Markov model.

The assumption of the Markov model is that if you take a long enough sequence of words, then they determine, without doubt, the POS tag of the last word. For example:

work -> it can be noun or verb
my work -> noun
I work -> verb

One can also use the words after the word-of-interest. Even then, it may happen that the window around the word is not big enough, and the same word with the same surrounding words can have several possible POS tags.


In [ ]:
pos_lookup2 = OrderedDict()
for i in range(len(corpus)):
    # the window: up to two words before and after the word of interest
    words_before = tuple(corpus[j][0] for j in range(max(i - 2, 0), i))
    word_of_interest = corpus[i][0]
    words_after = tuple(corpus[j][0] for j in range(i + 1, min(i + 3, len(corpus))))
    key = (words_before, word_of_interest, words_after)
    # collect every tag seen in this exact context into a set
    pos_lookup2.setdefault(key, set()).add(corpus[i][1])
print()
dict(pos_lookup2)
  • You can use the words after the target word
    • but not the gold tags, not even the ones before the target word.
  • Temporarily, you can use previously predicted labels as if they were correct
    • but in the end, test prediction cannot use any gold labels.

The algorithm is as follows:

  • Look up a given word together with its surrounding words.
  • If the sequence is in the table, then use the corresponding tag (or any one of the stored tags).
  • If not, then make a guess (a small sketch of this tagger follows).
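
A minimal sketch of this lookup tagger, reusing corpus and pos_lookup from the cells above; the fallback guess (the most frequent tag in the training corpus) is an assumption, not the only reasonable choice:

In [ ]:
from collections import Counter

# fallback guess: the most common tag of the training corpus
most_common_tag = Counter(tag for _, tag in corpus).most_common(1)[0][0]

def tag_sentence(words, n=3):
    tags = []
    for i in range(len(words)):
        key = tuple(words[max(i - n + 1, 0):i + 1])   # the word and up to n-1 preceding words
        tags.append(pos_lookup.get(key, most_common_tag))
    return tags

print(tag_sentence(["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]))
print(tag_sentence(["The", "lazy", "cat", "."]))   # unseen contexts fall back to the guess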

Problems

  • A word appears in the test set, but not in the train set

    • This is called out-of-vocabulary or OOV.
  • The target word was seen during training, but never with these particular surrounding words

    • This is called data sparsity.

For example, if one takes the word "the", then it is almost certainly a determiner. However, take the following segment:

... Steve Jobs met the Dalai Lama ...

The word "the" is still unknown if there was no Dalai, Lama, Steve or Jobs mentioned in the training set.

  • One may need hundreds of millions of sentences to encounter Steve Jobs and the Dalai Lama in the same sentence.
  • Or even worse: no English text ever written in human history contains those four words in the same sentence (there is one now).

These problems are not independent, and they lead to much greater problems in NLP. One cannot solve these problems, only work around them.

Hidden Markov Model

  • In the following model we use only the preceding words (as in the Markov model).
  • We call the POS tag a hidden parameter.
  • We suppose that there is some underlying POS tag which is not seen directly, but governs the words in the text.

  • Think of the POS tags as the grammatical backbone of the sentences.

PRON  VERB  PRON  PREP  NOUN    PUNCT
I     saw   you   in    school  .
You   saw   me    in    school  .
You   met   me    in    school  .
You   met   me    in    work    .
We    met   him   at    work    .

We will think of a sentence as the sequence of POS tags, for which the actual choice of words is rather arbitrary.

Generative model

  • We suppose that there is some mechanism which puts the POS tags one after the other
  • For a given POS tag, the actual word is just random
    • according to some distribution
  • We are looking for a sequence of tags that can generate the given words with the highest probability.
  • More precisely, we have a probabilistic model ($n$ is the length of the sentence). $$ \mathbb{P}(w_1, w_2, \ldots w_n, l_1, l_2 \ldots l_n) $$ For example $$\mathbb{P}(\text{the}, \text{dog}, \text{saw}, \text{the}, \text{cat}, \text{DET}, \text{NOUN}, \text{VERB}, \text{DET}, \text{NOUN})$$
  • Look for the best explaining tag sequence: $$ {\arg\max}_{l_i\in L}\mathbb{P}(w_1, w_2, \ldots w_n \ | \ l_1, l_2 \ldots l_n) $$
  • In general, this would require $O(|L|^n)$ complexity.
  • In the practical model we use some restrictions
  • Use a given window to the past, the window size is a model parameter.
I     saw   you   in    school  .
PRON  VERB  PRON  PREP  NOUN    PUNCT

$w_{i-2}$  $w_{i-1}$  $w_i$
[        window          ]
  • And also make the assumption that the probability of a sequence is: $$ \mathbb{P}(w_1, w_2, \ldots w_n \ | \ l_1, l_2 \ldots l_n) = \prod_{i=1}^n\mathbb{P}(l_i \ |\ l_{i-k+1},l_{i-k+2}\ldots l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i) $$

  • The term $\mathbb{P}(l_i \ |\ l_{i-k+1},l_{i-k+2}\ldots l_{i-1})$ is a Markovian probability: the probability of tag $l_i$, given the previous $k-1$ tags.

  • The terms $\mathbb{P}(w_i|l_i)$ are emission probabilities: the probability of a word given its POS tag.

Estimation from the corpus

$$\mathbb{P}(l_n|l_1,l_2\ldots l_{n-1})\approx \frac{\#\{l_1,l_2\ldots l_{n-1},l_n\}}{\#\{l_1,l_2\ldots l_{n-1}\}}$$

If an index is not positive, then we simply omit it: $$ \mathbb{P}(l_2|l_{-1},l_0, l_1) \approx \frac{\#\{l_1,l_2\}}{\#\{l_1\}} \text{ and } \mathbb{P}(l_1|l_{-1},l_0) \approx \frac{\#\{l_1\}}{\text{# of words}} $$

$$ \mathbb{P}(w_i|l_i) \approx \frac{\#\{\text{word }w_i \text{ with the tag }l_i\}}{\#\{l_i\}}$$
  • The above formulas do not suffer from data sparsity that much.
  • The tagset is much smaller than the vocabulary, so the occurrences of tag sequences are not that sparse.
  • The set of possible tags of a given word saturates quickly.
  • After we have collected the estimates of the probabilities (training, see the counting sketch below),
  • we need an algorithm to find the argmax, given a sequence of words.
    • this is the inference/prediction step
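
A minimal sketch of this counting, reusing the toy corpus from above; for brevity it estimates bigram transitions ($k=2$) instead of longer tag histories, and it applies no smoothing:

In [ ]:
from collections import Counter

tags = [tag for _, tag in corpus]

# transition estimate: P(l_i | l_{i-1}) ~ #{l_{i-1}, l_i} / #{l_{i-1}}
tag_count    = Counter(tags)
bigram_count = Counter(zip(tags, tags[1:]))
def p_transition(tag, prev_tag):
    return bigram_count[(prev_tag, tag)] / tag_count[prev_tag]

# emission estimate: P(w_i | l_i) ~ #{word w_i with tag l_i} / #{l_i}
emission_count = Counter(corpus)        # counts (word, tag) pairs
def p_emission(word, tag):
    return emission_count[(word, tag)] / tag_count[tag]

print(p_transition("ADJ", "DET"))   # how often ADJ follows DET
print(p_emission("dog", "NOUN"))    # how often NOUN is realized as "dog"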

Viterbi algorithm

We detail the case of trigrams, $k=3$. Let $w_1, w_2, \ldots w_n$ be a sentence; we are looking for the corresponding sequence of tags.

Let $$\pi(i, u, v) = \max_{l_1, l_2 \ldots l_{i-2}\in L \ , \ l_{i-1} = u, l_i = v} \mathbb{P}(w_1, w_2, \ldots w_i \left| \ l_1, l_2 \ldots l_i\right.)$$

This is the probability of the most probable tag sequence for the first $i$ tokens $w_1, w_2, \ldots w_i$ that ends with the tags $u$ and $v$.

Note that the function $\pi$ satisfies the following property: every prefix of an optimal tag sequence must be optimal too.

One can use dynamic programming to compute $\pi(i, u, v)$ for every position, and backtracking to recover the corresponding tag sequence. The final two tags are the $u$ and $v$ maximizing $\pi(n, u, v)$.

set $\pi(0, u, v) = 1, \forall u, v\in L$
begin for $i = 1\ldots n$
$\quad$ begin for $u, v \in L$
$\quad \quad \pi(i,u,v) = \max_{t\in L} \pi(i-1,t,u)\cdot \mathbb{P}(v \ | \ t,u)\cdot \mathbb{P}(w_i \ | \ v) $
$\quad$end for
end for
return $\max_{u,v\in L} \pi(n,u,v)$

The complexity is $O(n\cdot |L|^k)$, in the example $k=3$.

See more: http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf
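
Below is a minimal sketch of the same dynamic program for the bigram case ($k=2$), reusing the p_transition and p_emission estimates from the sketch above; with no smoothing it assumes every needed transition and emission was seen in the toy corpus:

In [ ]:
# Bigram Viterbi: pi[(i, v)] is the probability of the best tag sequence
# for w_1..w_i ending with tag v; back[(i, v)] stores the best previous tag.
def viterbi(sentence, tagset):
    pi = {(0, v): 1.0 for v in tagset}
    back = {}
    for i, word in enumerate(sentence, start=1):
        for v in tagset:
            best_prev, best_p = None, 0.0
            for u in tagset:
                p = pi[(i - 1, u)] * p_transition(v, u) * p_emission(word, v)
                if p > best_p:
                    best_prev, best_p = u, p
            pi[(i, v)], back[(i, v)] = best_p, best_prev
    # pick the best final tag and follow the back-pointers
    n = len(sentence)
    last = max(tagset, key=lambda v: pi[(n, v)])
    tags = [last]
    for i in range(n, 1, -1):
        tags.append(back[(i, tags[-1])])
    return list(reversed(tags))

tagset = set(tag for _, tag in corpus)
print(viterbi(["The", "quick", "dog", "jumps", "over", "the", "lazy", "fox", "."], tagset))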

Evaluation

Let's suppose that one has a tagger and has performed the tagging on the test set.

token $w_1$ $w_2$ $\ldots$
gold labels $l_1$ $l_2$ $\ldots$
predicted labels $p_1$ $p_2$ $\ldots$
  • The predicted and gold labels are compared to each other.
  • The performance of a tagger can be measured in several ways (a small sketch follows this list):

    • per-token accuracy: $$\frac{\# \text{ correct labels}}{\# \text{ words}}$$
    • per-sentence accuracy: $$\frac{\# \text{ sentences with all correct labels}}{\# \text{ sentences}}$$
    • unknown word accuracy: $$\frac{\# \text{ correct labels of OOV words}}{\# \text{ OOV words}}$$
  • OOV means out-of-vocabulary: words that are seen at test time, but not at training time.

  • For the simple HMM model, we cannot deal with OOVs!
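
A minimal sketch of the first two measures (the gold and predicted tags are made-up values for illustration):

In [ ]:
# gold and predicted tags for two sentences (made-up values)
gold = [["DET", "NOUN", "VERB", "PUNCT"], ["PRON", "VERB", "NOUN", "PUNCT"]]
pred = [["DET", "NOUN", "NOUN", "PUNCT"], ["PRON", "VERB", "NOUN", "PUNCT"]]

n_tokens  = sum(len(s) for s in gold)
n_correct = sum(g == p for gs, ps in zip(gold, pred) for g, p in zip(gs, ps))
per_token_accuracy = n_correct / n_tokens                                        # 7/8 here

per_sentence_accuracy = sum(gs == ps for gs, ps in zip(gold, pred)) / len(gold)  # 1/2 here

print(per_token_accuracy, per_sentence_accuracy)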

Binary classification

  • If a tagset consists of only two labels, then a more detailed measurement is possible.
  • At every token, one of the following 4 possibilities occurs.
predicted \ gold   0    1
0                  TN   FN
1                  FP   TP
  • TN: true negative, correct rejection, "shouldn't and didn't"
  • FN: false negative, miss, "should but didn't"
  • FP: false positive, false alarm, "shouldn't but did"
  • TP: true positive, hit, "should and did"

  • And we can measure the following quantities (a small sketch follows the list):

    • Precision: $\frac{TP}{TP+FP}$
    • Accuracy: $\frac{TP+TN}{TP+TN+FP+FN}$
    • Recall: $\frac{TP}{TP+FN}$
    • F1 score (F-measure): harmonic mean of precision and recall $$ \frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision} +\text{Recall}} = \frac{2\cdot TP}{2\cdot TP + FN + FP}$$
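
A minimal sketch of these quantities computed from binary gold and predicted labels (made-up values):

In [ ]:
# binary gold and predicted labels for a sequence of tokens (made-up values)
gold = [0, 1, 1, 0, 0, 1, 0, 0]
pred = [0, 1, 0, 0, 1, 1, 0, 0]

tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))   # hits
tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))   # correct rejections
fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))   # false alarms
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))   # misses

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(precision, recall, accuracy, f1)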

Explanation

  • For example, if an algorithm produces a single $1$ and all other tags are $0$, and that $1$ happens to be right
    • then the precision is $100\%$, but the recall is low.
  • If an algorithm marks (almost) everything with $1$

    • then the recall is high, but the precision is low.
  • The F1 score takes both precision and recall into account.

  • Note that the NER case is a bit special, since a named entity can consist of several words ("Bill Gates"), and a partial match does not count as a hit.


In [ ]: