In [1]:
name = '2017-03-10-regex'
title = 'Regular expressions and how to use them'
tags = 'basics'
author = 'Maria Zamyatina'
In [2]:
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML, Image
html = connect_notebook_to_post(name, title, tags, author)
A regular expression (regex, RE) is a sequence of characters that defines a search pattern. Usually this pattern is used by string searching algorithms for "find" or "find and replace" operations on strings. For example, search engines use regular expressions to find matches to your query, as do various text editors when you, e.g., open a search and replace dialogue.
The re module provides regular expression matching operations in Python. It lets you check whether a particular string matches a given regular expression (or, which comes down to the same thing, whether a given regular expression matches a particular string).
In [3]:
import re
There are two types of characters in regular expressions, ordinary and special characters. Ordinary characters, like 'A', 'z', or '0', simply match themselves, while special characters, like '\' or '(', either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. In other words, special characters help you to specify how regular expressions work and what will be returned to you if you find a match.
Let us learn some special characters:
'.'
(Dot.) In the default mode, this matches any character except a newline.
'*'
(Asterisk) Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
To test how these special characters work we need to create two variables, one for a string and one for a regular expression that we will try to match with a specific pattern in a string.
In [4]:
string = 'Sic Parvis Magna'
pattern = r'.*' # any character as many times as possible
r in r'.*' indicates that we are using Python's raw string notation, which, in short, differs from ordinary Python strings by its interpretation of the backslash character.
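To make the difference concrete, here is a small sketch comparing an ordinary and a raw string. The variable names are only for illustration.

```python
# In an ordinary string, '\n' is a single newline character.
ordinary = '\n'
# In a raw string, the backslash is kept as a literal character.
raw = r'\n'

print(len(ordinary), len(raw))  # 1 2

# Both spellings below describe the same two-character regex for a digit,
# but the raw string avoids doubling the backslash.
same = ('\\d' == r'\d')
print(same)  # True
```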
To search for a pattern in a string we will use the re.search() function:
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern.
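As a minimal sketch of what this means in practice, the snippet below inspects a match and shows the None case. The name text and the word 'Minima' are made up for this illustration.

```python
import re

text = 'Sic Parvis Magna'

# A successful search returns a match object that we can inspect.
match = re.search(r'Parvis', text)
if match is not None:
    print(match.group())  # Parvis
    print(match.span())   # (4, 10)

# An unsuccessful search returns None.
no_match = re.search(r'Minima', text)
print(no_match)  # None
```

Because None has no .group() method, it is usually worth checking the result before using it.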
In [5]:
re.search(r'.*', string)
Out[5]:
What if we want to find only 'Magna'?
In [6]:
pattern = r'Magna'
re.search(pattern, string)
Out[6]:
What about 'magna'?
In [7]:
pattern = r'magna'
re.search(pattern, string)
Nothing was returned because no match was found.
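If the case of the letters should not matter, one option is the flags argument from the re.search() signature above. The sketch below assumes that re.IGNORECASE is the behaviour we want.

```python
import re

string = 'Sic Parvis Magna'

# By default matching is case-sensitive, so 'm' does not match 'M'.
strict = re.search(r'magna', string)
print(strict)  # None

# With re.IGNORECASE the same pattern matches regardless of case.
relaxed = re.search(r'magna', string, flags=re.IGNORECASE)
print(relaxed.group())  # Magna
```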
Let us change our string to something that contains numbers and assume that we need to find only those numbers.
In [8]:
string = 'Station : Boulder, CO \n Station Height : 1743 meters \n Latitude : 39.95'
\d
Matches any decimal digit; this is equivalent to the class [0-9].
'+'
Causes the resulting RE to match 1 or more repetitions of the preceding RE.
In [9]:
pattern = r'\d+' # one or more digits
re.search(pattern, string)
Out[9]:
Why did we find only 1743, and not 1743 and 39, or 1743 and 39.95?
Answer: re.search() scans through string looking for the first location where the regular expression pattern produces a match [...].
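A short sketch of this "first match only" behaviour: if we search again in the part of the string that follows the first match, we find the next number. Slicing the string by hand is done here only for illustration.

```python
import re

string = 'Station : Boulder, CO \n Station Height : 1743 meters \n Latitude : 39.95'
pattern = r'\d+'

# re.search() stops at the first location that matches.
first = re.search(pattern, string)
print(first.group())  # 1743

# Searching only the text after the first match finds the next run of digits.
rest = string[first.end():]
second = re.search(pattern, rest)
print(second.group())  # 39
```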
Let us now try to find 39.95 for latitude.
There is no special character for a float number, but we can combine existing special characters to produce a regular expression that will match only float numbers. In other words, we need to include the dot '.' character in our new regular expression. However, the dot has a special meaning in regular expressions (see above): it matches any character except a newline. To match a literal dot we need to put the backslash character '\' before it in order to avoid invoking its special meaning, i.e. quote or escape it.
In [10]:
re.search(r'\d+\.\d+', string) # float number
Out[10]:
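To see why the backslash matters, we can compare the escaped and the unescaped dot on the same string. This is a small sketch; the variable names are illustrative.

```python
import re

string = 'Station : Boulder, CO \n Station Height : 1743 meters \n Latitude : 39.95'

# Unescaped dot: '.' matches any character, so '174' + '3' already fits
# the pattern "digits, any character, digits".
unescaped = re.search(r'\d+.\d+', string).group()
print(unescaped)  # 1743

# Escaped dot: only a literal '.' is accepted between the digits.
escaped = re.search(r'\d+\.\d+', string).group()
print(escaped)  # 39.95
```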
But how do we find both numbers? For that we need to use the pipe character '|' and the re.findall() function, since we want to get more than one result in return.
'|'
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
In [11]:
re.findall(r'\d+\.\d+|\d+', string) # float or integer number
Out[11]:
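Note that the order of the alternatives matters, because the alternatives are tried from left to right and the first one that matches wins. The sketch below swaps the two alternatives to show the difference.

```python
import re

string = 'Station : Boulder, CO \n Station Height : 1743 meters \n Latitude : 39.95'

# Float first: '39.95' is consumed as a whole before the integer alternative is tried.
float_first = re.findall(r'\d+\.\d+|\d+', string)
print(float_first)  # ['1743', '39.95']

# Integer first: '39' matches immediately, so the float alternative is never reached.
int_first = re.findall(r'\d+|\d+\.\d+', string)
print(int_first)  # ['1743', '39', '95']
```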
Let us move on to a more science-related example. Assume that we have a list of chemical reaction equations and rate coefficients, and we want to separate the equations from the rate coefficients.
In [12]:
raw_data = 'O1D = OH + OH : 2.14e-10*H2O;\nOH + O3 = HO2 : 1.70e-12*EXP(-940/TEMP);'
raw_lines = raw_data.split('\n')
raw_lines
Out[12]:
When we apply the re.search() function to a line in raw_lines, we get a MatchObject in return. MatchObjects support various methods; .group() is among them.
group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument.
For example,
In [13]:
m = re.search(r'(.*) (\d)', 'The Witcher 3')
m.group(0) # entire match
Out[13]:
In [14]:
m.group(1) # first parenthesized subgroup
Out[14]:
In [15]:
m.group(2) # second parenthesized subgroup
Out[15]:
In [16]:
m.group(1, 2) # multiple arguments give us a tuple
Out[16]:
So let us indicate that we want to return two subgroups, one for an equation and one for a rate coefficient. If we put them simply one after another in the regular expression, we do not get what we want:
In [17]:
for l in raw_lines:
    line = re.search(r'(.*)(.*)', l).group(1, 2)
    print(line)
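The reason is that '.*' is greedy: the first subgroup takes as many characters as it can, which is the whole line, and nothing is left for the second subgroup. A minimal sketch on one of our lines:

```python
import re

line = 'O1D = OH + OH : 2.14e-10*H2O;'

first, second = re.search(r'(.*)(.*)', line).group(1, 2)

# The first group swallowed the entire line ...
print(repr(first))   # 'O1D = OH + OH : 2.14e-10*H2O;'
# ... so the second group can only match the empty string.
print(repr(second))  # ''
```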
The equation part is separated from the rate coefficient part by a colon ':' with a whitespace on either side, therefore we need to put those characters between the subgroups, as well as the semicolon ';' at the end if we do not want to see it in the resulting string.
\s
Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].
In [18]:
for l in raw_lines:
    line = re.search(r'(.*)\s:\s(.*);', l).group(1, 2)
    print(line)
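One caveat: if a line does not fit the pattern, re.search() returns None and calling .group() on it raises an AttributeError. A possible guard is sketched below; the comment line in the input is a made-up example, not part of the data above.

```python
import re

pattern = r'(.*)\s:\s(.*);'
lines = [
    'OH + O3 = HO2 : 1.70e-12*EXP(-940/TEMP);',
    '# a hypothetical comment line without a rate',
]

parsed, skipped = [], []
for line in lines:
    m = re.search(pattern, line)
    if m is None:
        # No match: remember the line instead of crashing on .group()
        skipped.append(line)
        continue
    parsed.append(m.group(1, 2))

print(parsed)   # [('OH + O3 = HO2', '1.70e-12*EXP(-940/TEMP)')]
print(skipped)  # ['# a hypothetical comment line without a rate']
```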
Now we want to separate chemical reactants from products and store them in lists of strings without any arithmetic signs. To do that let us use re.findall() and a regular expression that matches the letters and numbers that make up our chemical species names:
\w
Matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_].
'+'
Causes the resulting RE to match 1 or more repetitions of the preceding RE.
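As a quick check of what this combination does, here is a sketch applying r'\w+' to the two halves of one equation. The halves are written out by hand here.

```python
import re

# The two sides of 'O1D = OH + OH', as they look after splitting on '='
left, right = 'O1D ', ' OH + OH'

# Spaces and '+' are not word characters, so they act as separators.
reactants = re.findall(r'\w+', left)
products = re.findall(r'\w+', right)

print(reactants)  # ['O1D']
print(products)   # ['OH', 'OH']
```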
In [19]:
alphanum_pattern = r'\w+' # one or more letters, digits or underscores
In [20]:
for l in raw_lines:
    line = re.search(r'(.*)\s:\s(.*);', l).group(1, 2)
    # split equation into reactants and products parts using '=' as a separator
    subline_reac, subline_prod = line[0].split('=')
    print('Reactants: '+subline_reac, 'Products: '+subline_prod)
    reac = re.findall(alphanum_pattern, subline_reac)
    prod = re.findall(alphanum_pattern, subline_prod)
    print(reac, prod)
We finally have all the pieces of information we wanted about each chemical reaction: what the reactants and products are and what the corresponding rate coefficient is. A convenient way to store this information is to create a dictionary for each chemical reaction and append those dictionaries to a list.
In [21]:
eqs = []
for l in raw_lines:
    line = re.search(r'(.*)\s:\s(.*);', l).group(1, 2)
    subline_reac, subline_prod = line[0].split('=')
    reac = re.findall(alphanum_pattern, subline_reac)
    prod = re.findall(alphanum_pattern, subline_prod)
    eqs.append(dict(reac=reac, prod=prod, coef=line[1]))
print(eqs)
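For many lines it may be convenient to wrap these steps in a function and to compile the patterns once with re.compile(). The sketch below is one possible arrangement, not the only one; the names LINE_RE, SPECIES_RE and parse_reaction are made up for this illustration, and it assumes each equation contains exactly one '='.

```python
import re

# Compile the patterns once and reuse them for every line.
LINE_RE = re.compile(r'(.*)\s:\s(.*);')
SPECIES_RE = re.compile(r'\w+')

def parse_reaction(line):
    """Return a dict describing one reaction line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    equation, coef = m.group(1, 2)
    # Assumes a single '=' separating reactants from products
    reactants, products = equation.split('=')
    return dict(reac=SPECIES_RE.findall(reactants),
                prod=SPECIES_RE.findall(products),
                coef=coef)

raw_data = 'O1D = OH + OH : 2.14e-10*H2O;\nOH + O3 = HO2 : 1.70e-12*EXP(-940/TEMP);'
eqs = [parse_reaction(line) for line in raw_data.split('\n')]
print(eqs[0])  # {'reac': ['O1D'], 'prod': ['OH', 'OH'], 'coef': '2.14e-10*H2O'}
```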
This approach becomes pretty handy if you have thousands of reactions to work with (as I do), and there is still plenty of room for further use of the re module.
In [22]:
HTML(html)
Out[22]: