NLTK Regular Expressions

Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/

This is a tutorial for simple text processing with Python using the NLTK library. For further reading I recommend the extensive online NLTK book. In this notebook we will

  • load text files from disk
  • find word patterns with regular expressions (and see where they fail)

It is assumed that you have some general knowledge of

  • basic python

Setup

If you have never used nltk before, you need to download the example corpora. Uncomment the nltk.download() line to do so. We also need the nltk library itself and the library for regular expressions (re).


In [1]:
import nltk, re
from nltk import word_tokenize
# NOTE if the data (corpora, example files) is not yet downloaded, this needs to be done first
# nltk.download()

Let's see which free resources are readily available, and then have a closer look at Shakespeare's Hamlet (to pretend we are literature freaks).


In [2]:
print(nltk.corpus.gutenberg.fileids())
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
print(len(hamlet))


['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
37360

Regular Expressions

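So Shakespeare's Hamlet consists of 37360 tokens. Note that this count includes punctuation marks, because the corpus reader returns each of them as a separate token. Let's investigate which patterns we find there.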

  • In which word does the character sequence "wre" occur?

In [3]:
[w for w in hamlet if re.search('wre', w)]


Out[3]:
['wretch',
 'wretch',
 'wretched',
 'powres',
 'Powres',
 'wretched',
 'wretched',
 'showres',
 'wretch',
 'wretched']
  • And which of them actually start with "wre"?

In [4]:
[w for w in hamlet if re.search('^wre', w)]


Out[4]:
['wretch', 'wretch', 'wretched', 'wretched', 'wretched', 'wretch', 'wretched']
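The effect of the "^" anchor can be checked on a small list without loading the corpus. The following is a minimal sketch; the list `sample` is a hand-picked subset of the words found above, not the full result. It also shows two alternatives that give the same result for a fixed prefix: re.match, which only tries to match at the beginning of the string, and the plain string method startswith.

```python
import re

# Hypothetical sample: a few of the words returned by the search above.
sample = ['wretch', 'powres', 'Powres', 'showres', 'wretched']

# re.search scans the whole string, so '^' is needed to pin the match to the start.
anchored = [w for w in sample if re.search(r'^wre', w)]

# re.match only tries to match at position 0, so no '^' is required.
matched = [w for w in sample if re.match(r'wre', w)]

# str.startswith is a regex-free alternative for a fixed prefix.
prefixed = [w for w in sample if w.startswith('wre')]

print(anchored)  # ['wretch', 'wretched']
```

For a fixed prefix like this, startswith is probably the simplest choice; the regular expression becomes useful once the prefix itself is a pattern.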
  • Find all words that start with "T" or "t", end with "r" and have exactly 5 other characters in the middle. To implement the "T" or "t" we use a character class, specified by brackets []: [Tt] matches either "T" or "t". For matching the characters in the middle we could use the character class [a-zA-Z], but the abbreviation \w is more convenient. Note that \w is slightly broader, since it also matches digits and the underscore. The quantifier {5,5} means "at least 5 and at most 5 repetitions" and can be shortened to {5}. Further predefined character classes are:
    • \w Matches any word character (letters, digits and the underscore)
    • \d Matches any decimal digit
    • \D Matches any non-digit character
    • \s Matches any whitespace character (this could be line endings, blanks or tabs). This is tricky, because some of them are not visible if you look at the text with a text editor.

In [5]:
[w for w in hamlet if re.search(r'^[Tt]\w{5,5}r$', w)]


Out[5]:
['Thunder',
 'truster',
 'thither',
 'Thunder',
 'Theater',
 'thicker',
 'thither',
 'thether']
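To see how the character classes differ, it helps to try them on a few strings where they disagree. The sketch below is an illustration only: the token list and the helper `middle_matches` are made up for this purpose and are not part of nltk or re.

```python
import re

# Hypothetical test tokens, chosen so that the three classes disagree.
tokens = ['abc', 'a1c', 'a_c', 'a-c', 'a c']

def middle_matches(char_class, token):
    """Return True if the token is 'a', one character from char_class, then 'c'."""
    return re.fullmatch('a' + char_class + 'c', token) is not None

letters = [t for t in tokens if middle_matches(r'[a-zA-Z]', t)]
word_chars = [t for t in tokens if middle_matches(r'\w', t)]
non_digits = [t for t in tokens if middle_matches(r'\D', t)]

print(letters)     # ['abc']
print(word_chars)  # ['abc', 'a1c', 'a_c']
print(non_digits)  # ['abc', 'a_c', 'a-c', 'a c']
```

The patterns are written as raw strings (r'...') so that Python passes the backslash through to the re module unchanged.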
  • Did Shakespeare use any numbers (written as digits)? For matching digits we could similarly use [0123456789] or [0-9], but the abbreviation \d is more convenient.

In [6]:
[w for w in hamlet if re.search(r'\d', w)]


Out[6]:
['1599', '1', '1', '1', '1']
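Searching for \d finds every token that contains at least one digit somewhere. If only tokens made up entirely of digits are wanted, re.fullmatch with \d+ is one option. The sketch below uses a made-up token list; only '1599' and '1' are taken from the output above, the other entries are hypothetical.

```python
import re

# Hypothetical tokens; only '1599' and '1' appear in the Hamlet output above.
tokens = ['1599', '1', 'Act1', '2nd', 'Hamlet']

# At least one digit anywhere in the token.
contains_digit = [t for t in tokens if re.search(r'\d', t)]

# The whole token consists of one or more digits.
only_digits = [t for t in tokens if re.fullmatch(r'\d+', t)]

print(contains_digit)  # ['1599', '1', 'Act1', '2nd']
print(only_digits)     # ['1599', '1']
```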
  • And is there something that starts with z and ends with g?

In [7]:
[w for w in hamlet if re.search('^z.*g$', w)]


Out[7]:
[]

In the last example we cannot be sure whether there really is no such word or whether we got the regular expression wrong. To find out which is the case, create strings that you know should match and test your expression on them.


In [8]:
[w for w in ["zarhhg","zhang","zg","42"] if re.search('^z.*g$', w)]


Out[8]:
['zarhhg', 'zhang', 'zg']
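The same idea can be taken one step further by also listing strings that should not match. The helper `check_pattern` below is a hypothetical convenience function written for this sketch, not something provided by re or nltk; it returns the strings on which the pattern behaves differently than expected.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return (missed, false_hits): strings on which the pattern behaves unexpectedly."""
    regex = re.compile(pattern)
    missed = [s for s in should_match if not regex.search(s)]
    false_hits = [s for s in should_not_match if regex.search(s)]
    return missed, false_hits

missed, false_hits = check_pattern(
    r'^z.*g$',
    should_match=['zarhhg', 'zhang', 'zg'],
    should_not_match=['42', 'zebra', 'bzg', 'Zhang'],
)

print(missed, false_hits)  # [] []
```

Two empty lists suggest that the expression does what we intended, at least on these examples, so the empty result on Hamlet most likely means there is no such word in the text.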

That's all for this short introduction. See the documentation of the re library for more examples of regular expressions.