A regular expression (RegEx) is a sequence of characters that expresses a pattern to be searched for within a longer piece of text. re is a Python library for regular expressions, which has several useful functions for working with strings. The list of frequently used characters is available on Moodle and the course page on GitHub.
Sometimes people use RegEx for scraping the web, but this is discouraged, as better and safer alternatives exist. Once text data has been scraped, however, RegEx is an important tool for cleaning and tidying up the dataset.
Let's now go to the Project Gutenberg page and find a book to download. As an example, I use The Financier by Theodore Dreiser below. The book can either be downloaded and read from local storage or read directly from the URL. We will go for the "download and read" option.
In [3]:
import re
In [4]:
# Read the downloaded book into a list of lines
with open("financier.txt", "r") as f:
    financier = f.readlines()
In [5]:
print(financier[2:4])
In [6]:
type(financier)
Out[6]:
list
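Since readlines() returns a list, the cells that follow convert it with str() before searching. The sketch below uses two made-up lines (not taken from the book) to show what that conversion produces, and compares it with "".join(), which is one possible alternative when the original text with real line breaks is needed.

```python
# Hypothetical stand-in for the list returned by f.readlines()
lines = ["He paid $500\n", "and then $20 more.\n"]

# str() gives the printed form of the list: brackets, quotes, commas,
# and a literal backslash followed by "n" instead of a real newline
as_repr = str(lines)

# "".join() glues the lines back into the original text
as_text = "".join(lines)

print(as_repr)
print(as_text)
```

The difference matters for patterns that rely on whitespace such as \s, because the str() version contains no real newline characters.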
Let's see how many times Mr. Dreiser uses the $ sign in his book. For that purpose, the findall() function from the re library will be used. The function receives the expression to search for as its first argument (always inside quotes) and the string to search in as its second argument. Please note that $ is a special character in RegEx, so it has to be escaped with a backslash, and that financier is a list, so it is converted to a string with str() first.
In [11]:
# Raw string so that the backslash reaches the RegEx engine unchanged
output = re.findall(r"\$", str(financier))
print(output)
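To see why the backslash matters, here is a small sketch on a made-up sentence (the sample string is an assumption, not text from the book). It compares the escaped pattern with the bare $ and uses len() to turn the list of matches into a count.

```python
import re

# Hypothetical sample sentence, not taken from the book
sample = "He paid $500 and then $20 more."

# Escaped: matches literal dollar signs
escaped = re.findall(r"\$", sample)

# Unescaped: $ is an anchor, so it only produces a zero-width match
# at the end of the string
unescaped = re.findall(r"$", sample)

print(escaped, len(escaped))
print(unescaped)
```

The length of the returned list answers the "how many times" question directly.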
Let's see on what occasions he used it. More precisely, let's read the amounts of money cited in the book. The amount usually comes right after the sign, so we will look for all non-whitespace characters after the dollar sign that are followed by a whitespace character (that's where the amount ends). The parentheses indicate the component we want to receive as output.
In [12]:
# The parentheses capture the amount; the trailing \s is matched but not returned
output = re.findall(r"(\$\S*)\s", str(financier))
print(output)
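The "non-whitespace up to whitespace" rule is simple, but it also picks up trailing punctuation and misses an amount that ends the string. The sketch below shows this on a made-up sentence and compares it with a stricter pattern. That stricter pattern is only an assumption about how amounts are written (digits, optional thousands separators, optional decimals), not something taken from the book.

```python
import re

# Hypothetical sample sentence, not taken from the book
sample = "It cost $1,500.50, not $20."

# Pattern from the text: everything after $ up to the next whitespace
loose = re.findall(r"(\$\S*)\s", sample)

# Stricter sketch: digits, optional ",ddd" groups, optional decimal part
tight = re.findall(r"\$\d+(?:,\d{3})*(?:\.\d+)?", sample)

print(loose)
print(tight)
```

Here the loose pattern keeps the comma after the first amount and skips the second one, because no whitespace follows it.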
Let's use the | operator (i.e. or) to find out how many $ or @ signs Mr. Dreiser used.
In [13]:
# Match either an @ sign or a (escaped) dollar sign
output = re.findall(r"(@|\$)", str(financier))
print(output)
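As a small illustration on a made-up sentence (the sample string and addresses are assumptions), the sketch below runs the same alternation, shows that a character class is an equivalent way to match single characters, and uses collections.Counter to count each sign separately.

```python
import re
from collections import Counter

# Hypothetical sample sentence, not taken from the book
sample = "Send $5 to bob@example.com or $10 to ann@example.com"

# Alternation, as in the text
with_alternation = re.findall(r"(@|\$)", sample)

# Character class: inside [] the $ sign needs no escaping
with_char_class = re.findall(r"[@$]", sample)

# Count how often each sign occurs
counts = Counter(with_alternation)

print(with_alternation)
print(with_char_class)
print(counts)
```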
Let's see how many times the word euro is used. We do not know whether the author typed Euro with a capital letter or not, so we have to search for both. If we simply use (), Python will treat the text inside the parentheses as the only part we want returned. So we must state explicitly (using ?:) that the parentheses only group the alternatives for the OR operator and do not mark the part of the text we want to receive.
In [14]:
# (?:...) groups the alternatives without capturing them
output = re.findall(r"(?:E|e)uro", str(financier))
print(output)
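The effect of ?: is easiest to see side by side. The following sketch uses a made-up sentence (an assumption, not text from the book) and compares a capturing group, a non-capturing group, and a character class, which is a common shorthand for the same single-letter choice.

```python
import re

# Hypothetical sample sentence, not taken from the book
sample = "The Euro rose; one euro buys more."

# Capturing group: findall returns only what is inside the parentheses
capturing = re.findall(r"(E|e)uro", sample)

# Non-capturing group: findall returns the whole match
non_capturing = re.findall(r"(?:E|e)uro", sample)

# Character class: same result without any group
char_class = re.findall(r"[Ee]uro", sample)

print(capturing)
print(non_capturing)
print(char_class)
```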
Of course, there is an easier approach using flags:
In [15]:
# re.IGNORECASE makes the whole pattern case-insensitive
output = re.findall(r"euro", str(financier), re.IGNORECASE)
print(output)
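Note that the flag is broader than the (?:E|e) pattern: it ignores case for every letter, so spellings such as EURO match too. The sketch below illustrates this on a made-up string (an assumption, not text from the book), shows the equivalent inline form (?i), and adds \b word boundaries as one possible way to skip longer words such as "European".

```python
import re

# Hypothetical sample string, not taken from the book
sample = "Euro, euro, EURO and European"

# Flag passed as the third argument
with_flag = re.findall(r"euro", sample, re.IGNORECASE)

# The same flag written inside the pattern
inline = re.findall(r"(?i)euro", sample)

# Word boundaries keep only the standalone word
whole_word = re.findall(r"\beuro\b", sample, re.IGNORECASE)

print(with_flag)
print(inline)
print(whole_word)
```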
Now for substitution. If you want to find some text and substitute it with something else, the re.sub function comes in handy. Let's promote me to Harvard:
In [16]:
sample_text = "My email is hdavtyan@aua.am"
In [17]:
# Let's match the e-mail address first
output = re.findall(r'\S+@.+', sample_text)
print(output)
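This pattern works here because the address is the last thing in the string: .+ is greedy and runs to the end of the line. The sketch below uses a hypothetical longer sentence to show the difference, with \S+ after the @ sign as one possible way to stop at the next whitespace.

```python
import re

# Hypothetical variation of the sample text with words after the address
longer_text = "My email is hdavtyan@aua.am and I read it daily"

# .+ keeps matching until the end of the line
greedy = re.findall(r'\S+@.+', longer_text)

# \S+ stops at the first whitespace after the address
bounded = re.findall(r'\S+@\S+', longer_text)

print(greedy)
print(bounded)
```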
When parentheses are used in RegEx, they form a numbered group that can later be referenced by its position (e.g. the part of the string matched by the first pair of parentheses becomes group 1).
In [19]:
# Let's now promote me to Harvard
print(re.sub(r'(\S+@)(.+)', r'\1harvard.edu', sample_text))
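To make the group numbering concrete, the sketch below inspects the groups with re.search before substituting. The named-group variant at the end is an optional alternative, and the names user and domain are chosen here purely for illustration.

```python
import re

sample_text = "My email is hdavtyan@aua.am"

# Inspect what each numbered group matched
match = re.search(r'(\S+@)(.+)', sample_text)
whole = match.group(0)   # the full match
first = match.group(1)   # everything up to and including @
second = match.group(2)  # everything after @

# Keep group 1 and replace the rest
promoted = re.sub(r'(\S+@)(.+)', r'\1harvard.edu', sample_text)

# Same substitution with named groups (names are illustrative)
promoted_named = re.sub(r'(?P<user>\S+@)(?P<domain>.+)',
                        r'\g<user>harvard.edu', sample_text)

print(whole, first, second)
print(promoted)
print(promoted_named)
```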