Regular Expressions

A regular expression (RegEx) is a sequence of characters that expresses a pattern to be searched for within a longer piece of text. re is the regular expression module in Python's standard library, and it provides several useful functions for working with strings. The list of frequently used characters is available on Moodle and on the course page on GitHub.

Sometimes people use RegEx for scraping the web, but this is discouraged, as better and safer alternatives exist. Once text data has been scraped, however, RegEx is an important tool for cleaning and tidying up the dataset.

Let's now go to the Project Gutenberg page and find a book to download. As an example, I use The Financier by Theodore Dreiser below. The book can be downloaded and read from local storage or read directly from the URL. We will go for the "download and read" option.
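The two options (local file or URL) can be wrapped in one small helper. This is only a sketch, assuming Python 3 and a UTF-8 encoded text; read_lines is a hypothetical helper name, not part of any library, and the demonstration uses a tiny temporary file as a stand-in for the downloaded book, so no real address is needed.

```python
import os
import tempfile
from urllib.request import urlopen


def read_lines(source):
    """Return the text at `source` (URL or local path) as a list of lines."""
    if source.startswith(("http://", "https://")):
        # Read directly from the web; assumes the file is UTF-8 encoded.
        with urlopen(source) as response:
            text = response.read().decode("utf-8")
        return text.splitlines(keepends=True)
    # Otherwise treat the argument as a path to an already downloaded file.
    with open(source, "r", encoding="utf-8") as f:
        return f.readlines()


# Demonstration on a small temporary file standing in for the downloaded book.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\n")
    demo_path = tmp.name

demo_lines = read_lines(demo_path)
os.remove(demo_path)
print(demo_lines)  # ['first line\n', 'second line\n']
```

Either branch returns a list of lines that keep their trailing newline, which is the same shape that readlines() produces.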


In [3]:
import re

In [4]:
with open("financier.txt","r") as f:
    financier = f.readlines()

In [5]:
print(financier[2:4])


['This eBook is for the use of anyone anywhere at no cost and with\n', 'almost no restrictions whatsoever.  You may copy it, give it away or\n']

In [6]:
type(financier)


Out[6]:
list

Let's see how many times Mr. Dreiser uses the $ sign in his book. For that purpose, we will use the findall() function from the re library. The function receives the expression to search for as its first argument (always inside quotes) and the string to search in as its second argument. Please note:

  • as financier is a list, we convert it to a string with str() so that it can be passed as an argument to our function (str() yields the printed representation of the list, brackets and quotes included),
  • as the dollar sign is a special character in RegEx (it marks the end of a string), we put a backslash before it to indicate that in this case "$" is not used as a special character but is just plain text.
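The effect of the backslash is easiest to see on a short string. The sketch below uses a made-up sample sentence (not taken from the book) and compares the escaped and unescaped dollar sign; re.escape is shown as one way to let Python add the backslash for you.

```python
import re

# Made-up sample sentence, used only for illustration.
sample = "Lunch cost $12 and dinner cost $30."

# Escaped: the backslash makes "$" a literal character.
literal = re.findall(r"\$", sample)
print(literal)  # ['$', '$']

# Unescaped: "$" is an anchor that matches the empty position
# at the end of the string, so we get one empty match.
anchor = re.findall(r"$", sample)
print(anchor)  # ['']

# re.escape adds the backslash to special characters automatically.
escaped_pattern = re.escape("$")
print(escaped_pattern)  # \$
```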

In [11]:
# The r prefix makes a raw string, so the backslash reaches re unchanged
output = re.findall(r"\$", str(financier))
print(output)


['$', '$', '$']

Let's see on what occasions he used it. More precisely, let's read the amounts of money cited in the book. The amount usually comes right after the sign, so we will look for the dollar sign together with all non-whitespace characters that follow it, up to the next whitespace character (that's where the amount ends). The parentheses mark the part of the match that we want to receive as output.


In [12]:
output = re.findall(r"(\$\S*)\s", str(financier))
print(output)


['$250,000', '$1', '$5,000)']
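Note that the last match above keeps a closing parenthesis, because \S accepts any non-whitespace character. A stricter pattern that only accepts digits, thousands separators and an optional decimal part is one possible refinement. The sketch below runs on a made-up sentence, and the stricter pattern is an assumption about how amounts are written, not a general-purpose currency parser.

```python
import re

# Made-up sentence with punctuation right after the amounts.
sample = "He paid $250,000 (about $5,000) and $1.50, then left."

# Pattern from the text: everything up to the next whitespace.
loose = re.findall(r"(\$\S*)\s", sample)
print(loose)  # ['$250,000', '$5,000)', '$1.50,']

# Stricter sketch: digits, optional ",ddd" groups, optional decimals.
strict = re.findall(r"\$\d+(?:,\d{3})*(?:\.\d+)?", sample)
print(strict)  # ['$250,000', '$5,000', '$1.50']
```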

Let's use the | operator (i.e. or) to find out how many $ or @ signs Mr. Dreiser used.


In [13]:
output = re.findall(r"(@|\$)", str(financier))
print(output)


['$', '@', '@', '$', '$']
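Since the goal is to count the signs, the list returned by findall() can be handed to collections.Counter from the standard library. The sketch below uses a made-up sample string; it also shows that a character class such as [@$] is an alternative to | for single characters (inside square brackets the dollar sign needs no backslash).

```python
import re
from collections import Counter

# Made-up sample string, used only for illustration.
sample = "Send $5 to a@b.com and $7 to c@d.org, total $12."

signs = re.findall(r"@|\$", sample)
print(signs)  # ['$', '@', '$', '@', '$']

# Counter tallies how often each sign occurs.
counts = Counter(signs)
print(counts["$"], counts["@"])  # 3 2

# A character class gives the same matches for single characters.
same = re.findall(r"[@$]", sample)
print(same == signs)  # True
```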

Let's see how many times the word euro is used. We do not know whether the author typed Euro with a capital letter or not, so we have to search for both. If we simply put (E|e) in parentheses, Python treats them as a capturing group and returns only the text inside the group. So we must state explicitly (using ?:) that the parentheses only serve the OR operation and do not mark the part of the text we want to receive.


In [14]:
output = re.findall(r"(?:E|e)uro", str(financier))
print(output)


['Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro']
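The difference between plain parentheses and (?:...) is easiest to see side by side. The sketch below runs three variants on a made-up sample string; the character class [Ee] is a third way to express the same choice.

```python
import re

# Made-up sample string, used only for illustration.
sample = "Euro zone, euro coins"

# Capturing group: findall returns only what is inside the parentheses.
capturing = re.findall(r"(E|e)uro", sample)
print(capturing)  # ['E', 'e']

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"(?:E|e)uro", sample)
print(non_capturing)  # ['Euro', 'euro']

# Character class: same result without any group.
char_class = re.findall(r"[Ee]uro", sample)
print(char_class)  # ['Euro', 'euro']
```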

Of course, there is an easier approach using flags:


In [15]:
output = re.findall(r"euro", str(financier), re.IGNORECASE)
print(output)


['Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro', 'Euro']
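A plain euro pattern also matches inside longer words such as European. If only the standalone word is wanted, the word boundary \b can be combined with the flag. The sketch below uses a made-up sample string and also shows the inline form (?i), which is equivalent to passing re.IGNORECASE.

```python
import re

# Made-up sample string, used only for illustration.
sample = "The EURO, the euro and a European bank."

# Flag argument: case is ignored, but "European" also contributes a match.
with_flag = re.findall(r"euro", sample, re.IGNORECASE)
print(with_flag)  # ['EURO', 'euro', 'Euro']

# Inline flag at the start of the pattern: same behaviour.
inline_flag = re.findall(r"(?i)euro", sample)
print(inline_flag)  # ['EURO', 'euro', 'Euro']

# Word boundaries keep only the standalone word.
whole_word = re.findall(r"\beuro\b", sample, flags=re.IGNORECASE)
print(whole_word)  # ['EURO', 'euro']
```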

Now about substitution. If you want to find some text and substitute something else for it, the re.sub function may come in handy. Let's promote me to Harvard:


In [16]:
sample_text = "My email is hdavtyan@aua.am"

In [17]:
# Let's match the e-mail first
output = re.findall(r'\S+@.+', sample_text)
print(output)


['hdavtyan@aua.am']
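The pattern works here because the address is the last thing in the string: .+ is greedy and consumes everything up to the end of the line. When more text follows the address, \S+ on both sides of the @ is a safer choice. The sketch below uses a made-up sentence built around the same address; neither pattern is a full e-mail validator.

```python
import re

# Made-up sentence with text after the address.
longer_text = "Write to hdavtyan@aua.am for details"

# ".+" runs to the end of the line and takes the extra words with it.
greedy = re.findall(r"\S+@.+", longer_text)
print(greedy)  # ['hdavtyan@aua.am for details']

# "\S+" stops at the first whitespace after the address.
bounded = re.findall(r"\S+@\S+", longer_text)
print(bounded)  # ['hdavtyan@aua.am']
```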

When parentheses are used in RegEx, they form a numbered group that can later be referred to by its position (e.g. the part of the pattern inside the first pair of parentheses is group 1, referenced as \1 in the replacement).


In [19]:
# Let's now promote me to Harvard
print(re.sub(r'(\S+@)(.+)', r'\1harvard.edu', sample_text))


My email is hdavtyan@harvard.edu
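To see what each numbered group holds, the groups can be inspected with re.search before substituting. The sketch below also shows named groups as an optional alternative to numbers; the group names user and domain are chosen here for illustration only.

```python
import re

sample_text = "My email is hdavtyan@aua.am"

# Inspect the two numbered groups of the pattern used above.
match = re.search(r"(\S+@)(.+)", sample_text)
print(match.group(1))  # hdavtyan@
print(match.group(2))  # aua.am

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement.
promoted = re.sub(r"(?P<user>\S+@)(?P<domain>.+)",
                  r"\g<user>harvard.edu", sample_text)
print(promoted)  # My email is hdavtyan@harvard.edu
```

Names tend to be easier to read than numbers once a pattern has more than two or three groups.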