In this class you are expected to learn:
The basics of regular expressions were covered in class 7, but let's recall the most important points.
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns, which match single chars:

- `a`, `X`, `9` -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: `. ^ $ * + ? { [ ] \ | ( )`
- `.` (a period) -- matches any single character except newline `\n`
- `\w` (lowercase w) -- matches a word character: a letter or digit or underbar `[a-zA-Z0-9_]`. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. `\W` (upper case W) matches any non-word character.
- `\b` -- boundary between word and non-word
- `\s` (lowercase s) -- matches a single whitespace character -- space, newline, return, tab, form feed `[ \n\r\t\f]`. `\S` (upper case S) matches any non-whitespace character.
- `\t`, `\n`, `\r` -- tab, newline, return
- `\d` -- decimal digit `[0-9]` (some older regex utilities do not support `\d`, but they all support `\w` and `\s`)
- `^` = start, `$` = end -- match the start or end of the string
- `\` -- inhibit the "specialness" of a character. So, for example, use `\.` to match a period or `\\` to match a backslash. If you are unsure whether a character has special meaning, such as `@`, you can put a backslash in front of it, `\@`, to make sure it is treated just as a character.

The basic rule of regular expression search for a pattern within a string is:
If `match = re.search(pat, str)` is successful, `match` is not `None` and in particular `match.group()` is the matching text.

Things get more interesting when you use `+` and `*` to specify repetition in the pattern:

- `+` -- 1 or more occurrences of the pattern to its left, e.g. `i+` = one or more i's
- `*` -- 0 or more occurrences of the pattern to its left
- `?` -- match 0 or 1 occurrences of the pattern to its left

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. `+` and `*` go as far as possible (the `+` and `*` are said to be "greedy").
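A minimal sketch (not part of the original notebook; the strings are illustrative) showing the leftmost and greedy behaviour:

```python
import re

# '+' is greedy: it matches as many 'i' characters as possible.
match = re.search(r'pi+', 'piiig')
print(match.group())   # -> 'piii'

# The search stops at the leftmost match, even if a longer run appears later.
match = re.search(r'i+', 'piigiiii')
print(match.group())   # -> 'ii'

# '*' allows zero or more of the preceding pattern (here: spaces between digits).
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')
print(match.group())   # -> '1 2   3'
```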
Square brackets can be used to indicate a set of chars, so `[abc]` matches `a` or `b` or `c`. The codes `\w`, `\s` etc. work inside square brackets too, with the one exception that dot (`.`) just means a literal dot. For the email addresses, the square brackets are an easy way to add `.` and `-` to the set of chars which can appear around the `@`, with the pattern `r'[\w.-]+@[\w.-]+'` to get the whole email address.

You can also use a dash to indicate a range, so `[a-z]` matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. `[abc-]`. An up-hat (`^`) at the start of a square-bracket set inverts it, so `[^ab]` means any char except `'a'` or `'b'`.
The group feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses `( )` around the username and host in the pattern, like this: `r'([\w.-]+)@([\w.-]+)'`. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, `match.group(1)` is the match text corresponding to the 1st left parenthesis, and `match.group(2)` is the text corresponding to the 2nd left parenthesis. The plain `match.group()` is still the whole match text as usual.
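A hedged sketch of the group feature on the same illustrative string:

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'

# Parentheses create groups for the username and host.
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())    # -> 'alice-b@google.com' (whole match)
    print(match.group(1))   # -> 'alice-b'             (username)
    print(match.group(2))   # -> 'google.com'          (host)
```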
If you still have trouble visualizing the diagram (the state machine) that a regular expression describes, just go to regexper and play with some examples.
In [1]:
import re
to_be = re.compile(r"(am|is|are)")
text = "All your base are belong to us"
to_be.findall(text)
Out[1]:
What about contractions?
In [2]:
to_be = re.compile(r"(am|is|are|'s|'re|'m)")
text = "All your base are belong to us. I'm not the jedi you're looking for."
to_be.findall(text)
Out[2]:
And that might work sometimes, but in other cases, for example when using possessives, it doesn't.
In [3]:
to_be = re.compile(r"(am|is|are|'s|'re|'m)")
text = "I'll stay at Ben's"
to_be.findall(text)
Out[3]:
And the reason is that in those examples `'s` has different purposes. In other words, the same word can play a different part of speech. That's why we often need to tag the part of speech in order to remove ambiguity. Some useful cases include: `object` (N) vs. `object` (V), or `discount` (N) vs. `discount` (V).
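As an illustrative sketch (the sentence is made up and the exact tags depend on the tagger), `nltk.pos_tag` will usually give the two occurrences of *object* different tags:

```python
import nltk

# 'object' appears once as a verb and once as a noun; the tagger should
# assign a different tag to each occurrence.
tokens = nltk.word_tokenize("I object to that object")
print(nltk.pos_tag(tokens))
# Expected something like:
# [('I', 'PRP'), ('object', 'VBP'), ('to', 'TO'), ('that', 'DT'), ('object', 'NN')]
```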
The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. A very commonly used set is the University of Pennsylvania TreeBank II tagset, and it has 36 word tags (plus .,:() for punctuation marks).
Number | Tag | Description |
---|---|---|
1. | CC | Coordinating conjunction |
2. | CD | Cardinal number |
3. | DT | Determiner |
4. | EX | Existential there |
5. | FW | Foreign word |
6. | IN | Preposition or subordinating conjunction |
7. | JJ | Adjective |
8. | JJR | Adjective, comparative |
9. | JJS | Adjective, superlative |
10. | LS | List item marker |
11. | MD | Modal |
12. | NN | Noun, singular or mass |
13. | NNS | Noun, plural |
14. | NNP | Proper noun, singular |
15. | NNPS | Proper noun, plural |
16. | PDT | Predeterminer |
17. | POS | Possessive ending |
18. | PRP | Personal pronoun |
19. | PRP\$ | Possessive pronoun |
20. | RB | Adverb |
21. | RBR | Adverb, comparative |
22. | RBS | Adverb, superlative |
23. | RP | Particle |
24. | SYM | Symbol |
25. | TO | to |
26. | UH | Interjection |
27. | VB | Verb, base form |
28. | VBD | Verb, past tense |
29. | VBG | Verb, gerund or present participle |
30. | VBN | Verb, past participle |
31. | VBP | Verb, non-3rd person singular present |
32. | VBZ | Verb, 3rd person singular present |
33. | WDT | Wh-determiner |
34. | WP | Wh-pronoun |
35. | WP\$ | Possessive wh-pronoun |
36. | WRB | Wh-adverb |
Another set often used is the so-called universal tagset, which NLTK also supports.
Tag | Meaning | English Examples |
---|---|---|
ADJ | adjective | new, good, high, special, big, local |
ADP | adposition | on, of, at, with, by, into, under |
ADV | adverb | really, already, still, early, now |
CONJ | conjunction | and, or, but, if, while, although |
DET | determiner, article | the, a, some, most, every, no, which |
NOUN | noun | year, home, costs, time, Africa |
NUM | numeral | twenty-four, fourth, 1991, 14:24 |
PRT | particle | at, on, out, over per, that, up, with |
PRON | pronoun | he, their, her, its, my, I, us |
VERB | verb | is, say, told, given, playing, would |
. | punctuation marks | . , ; ! |
X | other | ersatz, esprit, dunno, gr8, univeristy |
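As a quick sketch (the output shown is approximate), NLTK can map its Penn Treebank tags onto this coarser tagset by passing `tagset='universal'`:

```python
import nltk

tokens = nltk.word_tokenize("All your base are belong to us")
# tagset='universal' maps the Penn Treebank tags onto the universal tags above.
print(nltk.pos_tag(tokens, tagset='universal'))
# e.g. [('All', 'DET'), ('your', 'PRON'), ('base', 'NOUN'), ('are', 'VERB'),
#       ('belong', 'VERB'), ('to', 'PRT'), ('us', 'PRON')]
```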
There are different techniques for POS tagging; what we are interested in here is tagging text automatically and exploiting those tags.
For example, in our previous example `"I'll stay at Ben's"`, we want to know whether `'s` marks the possessive or the third person singular present of the verb *to be*.
The TextBlob POS tagger is based on NLTK and pattern, and it uses a specific encoding for the tags.
In [4]:
from textblob import TextBlob
TextBlob("I'll stay at Ben's").tags
Out[4]:
As we see, `'s` is tagged with `PRP`, and from the table above `PRP` stands for personal pronoun. Therefore, in this case, we can say that `'s` is acting as a pronoun rather than a verb.
In [5]:
TextBlob("He's not leaving").tags
Out[5]:
In the last example, however, the tagger is saying that `'s` is acting as a pronoun, when we know that it is just a verb. Let's see how the NLTK tagger behaves.
In [6]:
import nltk
tokens = nltk.word_tokenize("He's not leaving")
nltk.pos_tag(tokens)
Out[6]:
The NLTK tagger is correctly saying that `'s` is a verb; in fact, it is a third person singular present. That happens because by default `TextBlob` uses a different tagger, `PatternTagger`, based on a library called `pattern`. However, `TextBlob` can be set to use the tagger included in NLTK, which usually performs a bit better.
In [7]:
from textblob.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob = TextBlob("He's not leaving", pos_tagger=nltk_tagger)
blob.pos_tags
Out[7]:
Activity
Find out which tagger works better for the next sentences. Explain why and extract nouns, verbs and adjectives.
Another level of analysis is extracting the bigger chunks that tagged words are grouped into. It's the basic technique we will use for entity detection. Chunking segments and labels multi-token sequences: word-level tokens, together with their part-of-speech tags, are grouped into higher-level units called chunks. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.
We will begin by considering the task of noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases. For example, here is some Wall Street Journal text with NP-chunks marked using brackets:
[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.
As we can see, NP-chunks are often smaller pieces than complete noun phrases. For example, the market for system-management software for Digital's hardware
is a single noun phrase (containing two nested noun phrases), but it is captured in NP-chunks by the simpler chunk the market. One of the motivations for this difference is that NP-chunks are defined so as not to contain other NP-chunks. Consequently, any prepositional phrases or subordinate clauses that modify a nominal will not be included in the corresponding NP-chunk, since they almost certainly contain further noun phrases.
One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in our information extraction system. We demonstrate this approach using an example sentence that has been part-of-speech tagged. In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser, and test it on our example sentence. The result is a tree, which we can either print, or display graphically.
In [8]:
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
The rules that make up a chunk grammar use tag patterns to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. <DT>?<JJ>*<NN>
. Tag patterns are similar to regular expression patterns. Now, consider the following noun phrases from the Wall Street Journal:
another/DT sharp/JJ dive/NN
trade/NN figures/NNS
any/DT new/JJ policy/NN measures/NNS
earlier/JJR stages/NNS
Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP
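As a hedged sketch (this grammar is one possible solution, not the only one), a tag pattern that also allows comparative adjectives and sequences of nouns of any kind will chunk each of the phrases above as a single NP:

```python
import nltk

# <JJ.*> also matches JJR/JJS, and <NN.*>+ covers runs of NN, NNS, NNP, NNPS.
grammar = r"NP: {<DT>?<JJ.*>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)

sentence = [("another", "DT"), ("sharp", "JJ"), ("dive", "NN")]
print(cp.parse(sentence))
# (S (NP another/DT sharp/JJ dive/NN))
```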
To find the chunk structure for a given sentence, the RegexpParser
chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.
In the next example, the first rule matches an optional determiner or possessive pronoun, zero or more adjectives, then a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked, and run the chunker on this input.
In [9]:
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, adjectives and noun
{<NNP>+} # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))
Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink.
In [10]:
grammar = r"""
NP:
{<.*>+} # Chunk everything
}<VBD|IN>+{ # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
As befits their intermediate status between tagging and parsing, chunk structures can be represented using either tags or trees. The most widespread file representation uses IOB tags. In this scheme, each token is tagged with one of three special chunk tags, I
(inside), O
(outside), or B
(begin). A token is tagged as B
if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I
. All other tokens are tagged O
. The B
and I
tags are suffixed with the chunk type, e.g. B-NP
, I-NP
. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O
.
IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format.
We PRP B-NP
saw VBD O
the DT B-NP
yellow JJ I-NP
dog NN I-NP
In this representation there is one token per line, each with its part-of-speech tag and chunk tag. This format permits us to represent more than one chunk type, so long as the chunks do not overlap. As we saw earlier, chunk structures can also be represented using trees. These have the benefit that each chunk is a constituent that can be manipulated directly.
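NLTK can convert between the two representations. A small sketch (reusing the grammar and sentence from the earlier example) with `tree2conlltags` and `conlltags2tree`:

```python
import nltk
from nltk.chunk import tree2conlltags, conlltags2tree

grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]

tree = cp.parse(sentence)
# tree2conlltags flattens the chunk tree into (word, pos, IOB-tag) triples.
iob = tree2conlltags(tree)
print(iob)
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('yellow', 'JJ', 'I-NP'),
#  ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O'), ('at', 'IN', 'O'),
#  ('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP')]

# conlltags2tree goes the other way, rebuilding the tree from the IOB triples.
print(conlltags2tree(iob))
```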
However fun it can be to create our own chunkers and taggers, NLTK and TextBlob already do that for us. TextBlob even has a handy method to extract noun phrases.
In [11]:
b = TextBlob("Leela save me! And my Banjo! And Fry! And yourself I guess!")
print(b.parse())
In [12]:
b.noun_phrases
Out[12]:
Activity
`TextBlob` also has a sentence tokenizer, available by invoking `.sentences` on a `TextBlob` object. For the text in [this article](http://tmagazine.blogs.nytimes.com/2015/03/06/ordos-china-tourist-city/?hp&action=click&pgtype=Homepage&module=mini-moth&region=top-stories-below&WT.nav=top-stories-below&_r=0), split it into sentences and extract noun phrases and verbs. For each, plot two charts, one containing the 10 most frequent noun phrases, and the other the 10 most frequent verbs.
Other useful methods of `TextBlob` include extracting n-grams, lemmatization and word inflection. An n-gram is a contiguous sequence of n items from a given sequence of text. The items can be phonemes, syllables, letters, words or base pairs according to the application, although in our case they are just words.
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.
Beyond computational linguistics applications, such as statistical natural language processing, n-gram models are now widely used in probability, communication theory, computational biology, and data compression. There are even methods for authorship attribution of anonymous texts based on n-grams analysis.
In [13]:
TextBlob("Now is better than never. Now is better than ever.").ngrams(n=3)
Out[13]:
In [14]:
sentence = TextBlob('Use 4 spaces per indentation level.')
sentence.words
Out[14]:
In [15]:
sentence.words[2].singularize()
Out[15]:
In [16]:
sentence.words[-1].pluralize()
Out[16]:
Words can be lemmatized by calling the `lemmatize` method.
In [17]:
from textblob import Word
w = Word("octopi")
w.lemmatize()
Out[17]:
In [18]:
w = Word("went")
w.lemmatize("v") # Pass in part of speech (verb)
Out[18]: