Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/
This is a tutorial on simple text processing with Python using the NLTK library. For further reading I recommend the extensive NLTK book, which is available online. In this notebook we will
It is assumed that you have some general knowledge on
In [1]:
import nltk, re
from nltk import word_tokenize
# NOTE if the data (corpora, example files) is not yet downloaded, this needs to be done first
# nltk.download()
Let's see which free resources are readily available, and then take a closer look at Shakespeare's Hamlet (to pretend we are literature freaks).
In [2]:
print(nltk.corpus.gutenberg.fileids())
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
print(len(hamlet))
In [3]:
[w for w in hamlet if re.search('wre', w)]
Out[3]:
In [4]:
[w for w in hamlet if re.search('^wre', w)]
Out[4]:
[]
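The two cells above differ only in the anchor ^, which ties the match to the start of the word. The following is a minimal sketch of that difference on a small hand-made word list, so it runs without the NLTK data. The sample words are hypothetical and not taken from the Hamlet corpus.

```python
import re

# Hypothetical sample words, chosen only to illustrate anchoring.
words = ["wrestle", "unwrest", "shipwreck", "write"]

# Unanchored: 'wre' may occur anywhere inside the word.
anywhere = [w for w in words if re.search(r"wre", w)]

# Anchored with '^': 'wre' must be the first three characters of the word.
at_start = [w for w in words if re.search(r"^wre", w)]

print(anywhere)
print(at_start)
```

An empty result for the anchored pattern therefore only says that no token begins with lowercase wre. It says nothing about tokens that contain it further inside.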
A character class in square brackets matches any one of the listed characters: [Tt] matches either "T" or "t".
For matching any letter we could use the character class [a-zA-Z], but using the abbreviation \w is much more convenient (note that \w also matches digits and the underscore). Further predefined character classes are:
\d matches any decimal digit
\D matches any non-digit character
\s matches any whitespace character (this could be line endings, blanks or tabs). This is tricky, because some of them are not visible if you look at the text with a text editor.
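To make the predefined classes concrete, here is a small sketch that applies each of them to one sample string. The string is made up for illustration. re.findall returns all non-overlapping matches, and repr is used when printing so that the otherwise invisible whitespace characters show up.

```python
import re

# Made-up sample containing a blank, a tab and a line ending.
sample = "Act 3,\tScene 1\n"

digits = re.findall(r"\d", sample)       # decimal digits
non_digits = re.findall(r"\D", sample)   # everything that is not a digit
whitespace = re.findall(r"\s", sample)   # blanks, tabs, line endings
word_runs = re.findall(r"\w+", sample)   # runs of letters, digits, underscore

print(digits)
print(repr(whitespace))
print(word_runs)
```

Note that the comma is matched by \D but by none of the other three patterns.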
In [5]:
# 7-character words that start with T or t and end with r (raw string keeps \w intact)
[w for w in hamlet if re.search(r'^[Tt]\w{5}r$', w)]
Out[5]:
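The pattern in the cell above combines a character class, a counted repetition and both anchors. As a sketch of how the pieces interact, the same pattern is applied below to a few hypothetical candidate words (not corpus output). The quantifier {5} means exactly five repetitions and is equivalent to {5,5}.

```python
import re

# [Tt] : first character is T or t
# \w{5}: exactly five word characters
# r$   : last character is r
pattern = re.compile(r"^[Tt]\w{5}r$")

# Hypothetical candidates chosen to hit and miss the pattern.
candidates = ["Traitor", "together", "tutor", "theater", "Theater!"]
matches = [w for w in candidates if pattern.search(w)]

print(matches)
```

Because of the two anchors, only words of exactly seven characters can match. The candidate "together" is one character too long, "tutor" is too short, and the trailing "!" in "Theater!" breaks the end anchor.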
For matching any digit we could use the character class [0123456789] or [0-9], but using the abbreviation \d is much more convenient.
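One caveat is worth a short sketch: for Python 3 string patterns, \d matches any Unicode decimal digit, whereas [0-9] covers only the ASCII digits. Passing re.ASCII restricts \d to the same set as [0-9]. The tokens below are made up for illustration.

```python
import re

# Made-up tokens; the last one is ARABIC-INDIC DIGIT THREE (U+0663).
tokens = ["1603", "Act2", "Hamlet", "\u0663"]

with_class = [t for t in tokens if re.search(r"[0-9]", t)]
with_d = [t for t in tokens if re.search(r"\d", t)]
with_d_ascii = [t for t in tokens if re.search(r"\d", t, flags=re.ASCII)]

print(with_class)
print(with_d)
print(with_d_ascii)
```

For plain English text the two notations behave the same, so the shorter \d is usually the more convenient choice.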
In [6]:
# tokens that contain at least one digit (raw string keeps \d intact)
[w for w in hamlet if re.search(r'\d', w)]
Out[6]:
In [7]:
[w for w in hamlet if re.search('^z.*g$', w)]
Out[7]:
In the last example we cannot be sure whether there really is no matching word or whether we got the regular expression wrong. To find out which is the case, create strings you know should match and test your expression on them.
In [8]:
[w for w in ["zarhhg","zhang","zg","42"] if re.search('^z.*g$', w)]
Out[8]:
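The same sanity check can be made a little more systematic by also testing strings that must not match. The helper below is a minimal sketch. The name check_pattern and the extra negative examples are assumptions introduced here for illustration, not part of the original notebook.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the test strings on which the pattern behaves unexpectedly."""
    # Strings we expected to match but did not.
    missed = [s for s in should_match if not re.search(pattern, s)]
    # Strings we expected to be rejected but that matched anyway.
    false_hits = [s for s in should_not_match if re.search(pattern, s)]
    return missed, false_hits

missed, false_hits = check_pattern(
    r"^z.*g$",
    should_match=["zarhhg", "zhang", "zg"],
    should_not_match=["42", "zebra", "bag"],
)

print(missed, false_hits)
```

If both lists come back empty, the expression behaves as intended on the test strings, and an empty result on the corpus most likely means that there is no such word in the text.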
That's all for the short introduction. See the documentation of the re library for more examples on regular expressions.