[Data, the Humanist's New Best Friend](index.ipynb)
*Class 13*

In this class you are expected to learn:

  • Regular expressions
  • Word inflection and lemmatization
  • Parsing
  • n-grams
  • Part-of-speech Tagging

Regular expressions

The basics of regular expressions were covered in class 7, but let's review the most important points.

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

  • a, X, 9 -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( )
  • . (a period) -- matches any single character except newline \n
  • \w (lowercase w) -- matches a word character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b -- boundary between word and non-word
  • \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r -- tab, newline, return
  • \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
  • ^ = start, $ = end -- match the start or end of the string
  • \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as @, you can put a backslash in front of it, \@, to make sure it is treated just as a character.

The basic rules of regular expression search for a pattern within a string are:

  • The search proceeds through the string from start to end, stopping at the first match found
  • All of the pattern must be matched, but not all of the string
  • If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text
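
To make these rules concrete, here is a minimal sketch (the sample string is just an invented example):

```python
import re

# The search scans from the start of the string and stops at the
# first match; only the pattern must be fully matched, not the string.
match = re.search(r'\d\d\d', 'Call 415-555-1212 now')
if match is not None:
    print(match.group())  # prints '415'
```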

Things get more interesting when you use + and * to specify repetition in the pattern:

  • + -- 1 or more occurrences of the pattern to its left, e.g. i+ = one or more i's
  • * -- 0 or more occurrences of the pattern to its left
  • ? -- match 0 or 1 occurrences of the pattern to its left

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").
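
A quick sketch of greediness, reusing the i+ example above:

```python
import re

# + is greedy: it consumes as many characters as it can.
print(re.search(r'i+', 'piiig').group())   # 'iii', not just 'i'
print(re.search(r'pi*g', 'pig').group())   # 'pig'
print(re.search(r'pi*g', 'pg').group())    # 'pg' -- * also matches zero i's
```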

Square brackets can be used to indicate a set of chars, so [abc] matches a or b or c. The codes \w, \s etc. work inside square brackets too, with the one exception that dot (.) just means a literal dot. For email addresses, square brackets are an easy way to add . and - to the set of chars that can appear around the @: the pattern r'[\w.-]+@[\w.-]+' matches a whole email address.

You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
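
Here is the email pattern from above in action (the sample string is an invented example):

```python
import re

# [\w.-] matches a word character, a literal dot, or a dash.
text = 'purple alice-b@google.com monkey dishwasher'
print(re.search(r'[\w.-]+@[\w.-]+', text).group())  # 'alice-b@google.com'
```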

The group feature of a regular expression allows you to pick out parts of the matching text. Suppose for the email problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.
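
Continuing the email example, a minimal sketch of groups:

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match is not None:
    print(match.group())   # 'alice-b@google.com' -- the whole match
    print(match.group(1))  # 'alice-b' -- the username
    print(match.group(2))  # 'google.com' -- the host
```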

If you have trouble visualizing the diagram that a regular expression describes, just go to regexper and play with some examples.

Part-of-Speech Tagging

There are some cases where regular expressions are not enough to recognize or extract information. Let's say that we have a text and we want to extract all uses of the verb "to be", in all its present inflections. Our regular expression would look something like:


In [1]:
import re
to_be = re.compile(r"(am|is|are)")
text = "All your base are belong to us"
to_be.findall(text)


Out[1]:
['are']

What about contractions?


In [2]:
to_be = re.compile(r"(am|is|are|'s|'re|'m)")
text = "All your base are belong to us. I'm not the jedi you're looking for."
to_be.findall(text)


Out[2]:
['are', "'m", "'re"]

And that might work sometimes, but in other cases, for example with possessives, it doesn't.


In [3]:
to_be = re.compile(r"(am|is|are|'s|'re|'m)")
text = "I'll stay at Ben's"
to_be.findall(text)


Out[3]:
["'s"]

The reason is that 's serves a different purpose in each of those examples. In other words, the same word can play different parts of speech. That is why we often need to tag parts of speech in order to remove ambiguity. Some useful applications include:

  • Information Retrieval
  • Text to Speech: object(N) vs. object(V), or discount(N) vs. discount(V)
  • Word sense disambiguation
  • As a preprocessing step of parsing
  • Assigning a unique tag to each word reduces the number of possible parses

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. A very commonly used set is the University of Pennsylvania (Penn) Treebank II tagset, which has 36 word tags (plus . , : ( ) for punctuation marks).

| Number | Tag | Description |
|--------|-----|-------------|
| 1 | CC | Coordinating conjunction |
| 2 | CD | Cardinal number |
| 3 | DT | Determiner |
| 4 | EX | Existential there |
| 5 | FW | Foreign word |
| 6 | IN | Preposition or subordinating conjunction |
| 7 | JJ | Adjective |
| 8 | JJR | Adjective, comparative |
| 9 | JJS | Adjective, superlative |
| 10 | LS | List item marker |
| 11 | MD | Modal |
| 12 | NN | Noun, singular or mass |
| 13 | NNS | Noun, plural |
| 14 | NNP | Proper noun, singular |
| 15 | NNPS | Proper noun, plural |
| 16 | PDT | Predeterminer |
| 17 | POS | Possessive ending |
| 18 | PRP | Personal pronoun |
| 19 | PRP\$ | Possessive pronoun |
| 20 | RB | Adverb |
| 21 | RBR | Adverb, comparative |
| 22 | RBS | Adverb, superlative |
| 23 | RP | Particle |
| 24 | SYM | Symbol |
| 25 | TO | to |
| 26 | UH | Interjection |
| 27 | VB | Verb, base form |
| 28 | VBD | Verb, past tense |
| 29 | VBG | Verb, gerund or present participle |
| 30 | VBN | Verb, past participle |
| 31 | VBP | Verb, non-3rd person singular present |
| 32 | VBZ | Verb, 3rd person singular present |
| 33 | WDT | Wh-determiner |
| 34 | WP | Wh-pronoun |
| 35 | WP\$ | Possessive wh-pronoun |
| 36 | WRB | Wh-adverb |

Another set often used is the so-called universal tagset, which NLTK also supports.

| Tag | Meaning | English Examples |
|-----|---------|------------------|
| ADJ | adjective | new, good, high, special, big, local |
| ADP | adposition | on, of, at, with, by, into, under |
| ADV | adverb | really, already, still, early, now |
| CONJ | conjunction | and, or, but, if, while, although |
| DET | determiner, article | the, a, some, most, every, no, which |
| NOUN | noun | year, home, costs, time, Africa |
| NUM | numeral | twenty-four, fourth, 1991, 14:24 |
| PRT | particle | at, on, out, over, per, that, up, with |
| PRON | pronoun | he, their, her, its, my, I, us |
| VERB | verb | is, say, told, given, playing, would |
| . | punctuation marks | . , ; ! |
| X | other | ersatz, esprit, dunno, gr8, univeristy |

There are different techniques for POS tagging; what we are looking for here is a way to tag text automatically and then exploit those tags.

For example, in our previous example "I'll stay at Ben's", we want to know whether 's is the possessive ending or the contraction of "is", the $3^{rd}$ person singular present indicative of "to be".

TextBlob's POS tagger and parser are built on top of NLTK and the pattern library, and they use the Penn Treebank encoding for the tags.


In [4]:
from textblob import TextBlob
TextBlob("I'll stay at Ben's").tags


Out[4]:
[('I', 'PRP'),
 ("'", 'POS'),
 ('ll', 'NN'),
 ('stay', 'VB'),
 ('at', 'IN'),
 ('Ben', 'NNP'),
 ("'", 'POS'),
 ('s', 'PRP')]

As we see, s is tagged with PRP, and from the table above PRP stands for personal pronoun. Therefore, in this case, we can say that 's is acting as a pronoun rather than a verb.


In [5]:
TextBlob("He's not leaving").tags


Out[5]:
[('He', 'PRP'), ("'", 'POS'), ('s', 'PRP'), ('not', 'RB'), ('leaving', 'VBG')]

In the last example, however, the tagger says that s is acting as a pronoun, when we know that it is just a verb. Let's see how the NLTK tagger behaves.


In [6]:
import nltk
tokens = nltk.word_tokenize("He's not leaving")
nltk.pos_tag(tokens)


Out[6]:
[('He', 'PRP'), ("'s", 'VBZ'), ('not', 'RB'), ('leaving', 'VBG')]

The NLTK tagger correctly says that 's is a verb; in fact, a 3rd person singular present (VBZ). The discrepancy happens because by default TextBlob uses a different tagger, PatternTagger, based on a library called pattern. However, TextBlob can be set to use the tagger included in NLTK, which usually performs a bit better.


In [7]:
from textblob.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob = TextBlob("He's not leaving", pos_tagger=nltk_tagger)
blob.pos_tags


Out[7]:
[('He', 'PRP'), ("'s", 'VBZ'), ('not', 'RB'), ('leaving', 'VBG')]

Activity

Find out which tagger works better for the following sentences. Explain why, and extract the nouns, verbs and adjectives; a starter sketch follows the list.

  • One morning I shot an elephant in my pyjamas
  • John saw the man on the mountain with a telescope
  • The word of the Lord came to Zechariah, son of Berekiah, son of Iddo, the prophet
  • The pet of the woman that had the parasol was brown
  • The dog brings me the newspaper every morning
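
To get started, here is a sketch that tags each sentence with both taggers, using the same API we saw above; comparing the outputs (and extracting the word classes) is left to you:

```python
from textblob import TextBlob
from textblob.taggers import NLTKTagger

sentences = [
    "One morning I shot an elephant in my pyjamas",
    "John saw the man on the mountain with a telescope",
    "The word of the Lord came to Zechariah, son of Berekiah, son of Iddo, the prophet",
    "The pet of the woman that had the parasol was brown",
    "The dog brings me the newspaper every morning",
]
for s in sentences:
    print(TextBlob(s).tags)                           # default (pattern-based) tagger
    print(TextBlob(s, pos_tagger=NLTKTagger()).tags)  # NLTK tagger
```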

Chunking

Another level of analysis is extracting the bigger chunks that tagged words belong to. It's the basic technique we will use for entity detection. Chunking segments and labels multi-token sequences. In the figure below, the smaller boxes show the word-level tokenization and part-of-speech tagging, while the large boxes show the higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.

*Figure: Segmentation and Labeling at both the Token and Chunk Levels*

We will begin by considering the task of noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases. For example, here is some Wall Street Journal text with NP-chunks marked using brackets:

[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

As we can see, NP-chunks are often smaller pieces than complete noun phrases. For example, the market for system-management software for Digital's hardware is a single noun phrase (containing two nested noun phrases), but it is captured in NP-chunks by the simpler chunk the market. One of the motivations for this difference is that NP-chunks are defined so as not to contain other NP-chunks. Consequently, any prepositional phrases or subordinate clauses that modify a nominal will not be included in the corresponding NP-chunk, since they almost certainly contain further noun phrases.

One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in our information extraction system. We demonstrate this approach using an example sentence that has been part-of-speech tagged. In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser, and test it on our example sentence. The result is a tree, which we can either print, or display graphically.


In [8]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

grammar = "NP: {<DT>?<JJ>*<NN>}"

cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)


(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

The rules that make up a chunk grammar use tag patterns to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. <DT>?<JJ>*<NN>. Tag patterns are similar to regular expression patterns. Now, consider the following noun phrases from the Wall Street Journal:

another/DT sharp/JJ dive/NN
trade/NN figures/NNS
any/DT new/JJ policy/NN measures/NNS
earlier/JJR stages/NNS
Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP
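
As an exercise in tag patterns, here is one pattern (a sketch, not the only possible answer) that covers all five phrases above:

```python
import nltk

# One possible pattern: an optional determiner, any number of adjectives
# (JJ, JJR, JJS), then one or more nouns of any kind (NN, NNS, NNP, NNPS).
grammar = r"NP: {<DT>?<JJ.*>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.parse([("another", "DT"), ("sharp", "JJ"), ("dive", "NN")]))
print(cp.parse([("earlier", "JJR"), ("stages", "NNS")]))
```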

To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.

In the next example, the first rule matches an optional determiner or possessive pronoun, zero or more adjectives, then a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked, and run the chunker on this input.


In [9]:
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

print(cp.parse(sentence))


(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink.


In [10]:
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))


(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

As befits their intermediate status between tagging and parsing, chunk structures can be represented using either tags or trees. The most widespread file representation uses IOB tags. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O.


*Figure: Tag Representation of Chunk Structures*

IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format.

We PRP B-NP
saw VBD O
the DT B-NP
yellow JJ I-NP
dog NN I-NP

In this representation there is one token per line, each with its part-of-speech tag and chunk tag. This format permits us to represent more than one chunk type, so long as the chunks do not overlap. As we saw earlier, chunk structures can also be represented using trees. These have the benefit that each chunk is a constituent that can be manipulated directly.
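
NLTK can convert between the two representations. For example, reusing the cp and sentence variables from the chinking example above, nltk.chunk.tree2conlltags turns a chunk tree into (word, POS tag, IOB tag) triples, and nltk.chunk.conlltags2tree converts back:

```python
import nltk

# Convert the chunk tree from the chinking example into IOB triples.
tree = cp.parse(sentence)
print(nltk.chunk.tree2conlltags(tree))
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('yellow', 'JJ', 'I-NP'),
#  ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O'), ('at', 'IN', 'O'),
#  ('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP')]
```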


*Figure: Tree Representation of Chunk Structures*

However much fun it can be to create our own chunkers and taggers, NLTK and TextBlob already do that for us. TextBlob even has a handy method to extract noun phrases.


In [11]:
b = TextBlob("Leela save me! And my Banjo! And Fry! And yourself I guess!")
print(b.parse())


Leela/NNP/B-NP/O save/VB/B-VP/O me/PRP/B-NP/O !/./O/O
And/CC/O/O my/PRP$/B-NP/O Banjo/NNP/I-NP/O !/./O/O
And/CC/O/O Fry/VB/B-VP/O !/./O/O
And/CC/O/O yourself/PRP/B-NP/O I/PRP/I-NP/O guess/VBP/B-VP/O !/./O/O

In [12]:
b.noun_phrases


Out[12]:
WordList(['leela', 'banjo', 'fry'])

Activity

`TextBlob` also has a sentence tokenizer, available by invoking `.sentences` on a `TextBlob` object. For the text in [this article](http://tmagazine.blogs.nytimes.com/2015/03/06/ordos-china-tourist-city/?hp&action=click&pgtype=Homepage&module=mini-moth&region=top-stories-below&WT.nav=top-stories-below&_r=0), split it into sentences and extract the noun phrases and verbs. Then plot two charts, one containing the 10 most frequent noun phrases, and the other the 10 most frequent verbs. A starter sketch follows.
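
A starter sketch (article_text is a placeholder; paste or fetch the article's text yourself):

```python
from collections import Counter
from textblob import TextBlob

# article_text is a placeholder -- replace it with the article's text.
article_text = "Ordos is a city in Inner Mongolia. The city was built very fast."
blob = TextBlob(article_text)
print(blob.sentences)                              # list of Sentence objects
print(Counter(blob.noun_phrases).most_common(10))  # most frequent noun phrases
```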

N-grams

Other useful methods of TextBlob include the extraction of n-grams, lemmatization, and word inflection. An n-gram is a contiguous sequence of n items from a given sequence of text. The items can be phonemes, syllables, letters, words or base pairs according to the application, although in our case they are just words.

An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.

Beyond computational linguistics applications, such as statistical natural language processing, n-gram models are now widely used in probability, communication theory, computational biology, and data compression. There are even methods for authorship attribution of anonymous texts based on n-gram analysis.


In [13]:
TextBlob("Now is better than never. Now is better than ever.").ngrams(n=3)


Out[13]:
[WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'never']),
 WordList(['than', 'never', 'Now']),
 WordList(['never', 'Now', 'is']),
 WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'ever'])]

Word inflection and lemmatization

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.


In [14]:
sentence = TextBlob('Use 4 spaces per indentation level.')
sentence.words


Out[14]:
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])

In [15]:
sentence.words[2].singularize()


Out[15]:
'space'

In [16]:
sentence.words[-1].pluralize()


Out[16]:
'levels'

Words can be lemmatized by calling the lemmatize method.


In [17]:
from textblob import Word
w = Word("octopi")
w.lemmatize()


Out[17]:
'octopus'

In [18]:
w = Word("went")
w.lemmatize("v")  # Pass in part of speech (verb)


Out[18]:
'go'

For the next class