Regular Expressions

An experiment in Jupyter slides.

J. Roberts

$E = mc^2$

Regular expressions are built into Python via the re module. A very simple example:


In [7]:
import re
match = re.search(r'\d+', r'abc123def')  # note the "r" prefix
print(match.span())  # what do the numbers represent?


(3, 6)
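One way to explore the question in the comment above is to slice the original string with the span. This is a minimal sketch; the variable names text, start, and end are introduced here for illustration only.

```python
import re

text = 'abc123def'
match = re.search(r'\d+', text)

# span() returns (start, end): the match occupies text[start:end],
# so start is inclusive and end is exclusive.
start, end = match.span()

print(match.group())     # the matched substring
print(text[start:end])   # slicing with the span gives the same substring
```

Both print statements should show the same run of digits, which suggests the span is a half-open index range into the searched string.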

Special Sequences

Several special sequences serve as shorthand for commonly used character classes in the re module.

sequence  description
\d        any digit, i.e., [0-9]
\D        any non-digit, i.e., [^0-9]
\s        any whitespace, i.e., [ \t\n\r\f\v]
\S        any non-whitespace, i.e., [^ \t\n\r\f\v]
\w        alphanumeric or underscore, i.e., [a-zA-Z0-9_]
\W        non-alphanumeric, i.e., [^a-zA-Z0-9_]
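The sequences in the table can be tried out with re.findall, which returns every non-overlapping match as a list. The sample string below is made up for illustration; it is not part of the course data.

```python
import re

# Hypothetical sample text, chosen to contain digits, words, and punctuation.
sample = 'ME 701: room_12, 3pm'

digits = re.findall(r'\d+', sample)    # runs of digits
words = re.findall(r'\w+', sample)     # runs of letters, digits, or underscore
spaces = re.findall(r'\s+', sample)    # runs of whitespace
others = re.findall(r'\W+', sample)    # runs of anything \w does not match

print(digits)   # ['701', '12', '3']
print(words)    # ['ME', '701', 'room_12', '3pm']
print(spaces)   # [' ', ' ', ' ']
print(others)   # [' ', ': ', ', ']
```

Note that room_12 comes back as one word, since the underscore belongs to \w.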

Metacharacters

Several special "metacharacters" are used to build regular expressions with the re module.

name  description
.     any character but \n
^     match at beginning of the string, or complement a class when first inside []
$     match at end of the string
*     match 0 or more times
+     match 1 or more times
?     match 0 or 1 times
\     escape character
|     "or"
[]    defines a character class, e.g., [a-z]
{}    repetition qualifier, e.g., ab{2,3} (a followed by 2 or 3 b's)
()    for groups
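A few of the metacharacters above in action. This is only a sketch: the test strings and variable names are invented for illustration.

```python
import re

# . matches any character except a newline, so 'c\nt' is skipped
dot = re.findall(r'c.t', 'cat cot c\nt')

# {2,3} repeats the preceding b two or three times
braces = re.findall(r'ab{2,3}', 'ab abb abbb abbbb')

# | chooses between alternatives
either = re.findall(r'cat|dog', 'cat, dog, cow')

# ? makes the preceding u optional
optional = re.findall(r'colou?r', 'color colour')

# ^ as the first character inside [] complements the class
complement = re.findall(r'[^a-z ]', 'go cats!')

# () captures groups that can be pulled out separately
groups = re.search(r'(\d+)-(\d+)', 'pages 10-25').groups()

print(dot)         # ['cat', 'cot']
print(braces)      # ['abb', 'abbb', 'abbb']
print(either)      # ['cat', 'dog']
print(optional)    # ['color', 'colour']
print(complement)  # ['!']
print(groups)      # ('10', '25')
```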

Example 1

Consider the pattern ca*t. Does it match the following? If so, what is the match?

  • ct
  • cat
  • caaat
  • go cats!

In [8]:
pattern = r'ca*t'
print(re.match(pattern, r'ct').span())
print(re.match(pattern, r'cat').span())
print(re.match(pattern, r'caaat').span())
print(re.match(pattern, r'go cats!'))


(0, 2)
(0, 3)
(0, 5)
None
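The None for go cats! comes from re.match, which only tries the pattern at the very start of the string. As a small sketch of the difference, re.search scans the whole string instead (the names text and found are used here for illustration only):

```python
import re

pattern = r'ca*t'
text = 'go cats!'

# re.match anchors at position 0, where the string starts with 'g'
anchored = re.match(pattern, text)

# re.search looks for the first location where the pattern matches
found = re.search(pattern, text)

print(anchored)                      # None
print(found.span(), found.group())   # (3, 6) cat
```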

Example 2

How about this slight modification? Consider ca*[\w ]+t applied to the string "catenkerous cat!". Is it a match? If so, how much of the string is matched?


In [9]:
print(re.match(r'ca*[\w ]+t', r'catenkerous cat!').span())


(0, 15)

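This highlights the fact that quantifiers such as * and + are greedy. In other words, each grabs as large a match as possible: here [\w ]+ runs all the way to the last t in the string rather than stopping at the first one.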
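For contrast, Python's re module also offers non-greedy quantifiers, written by appending ? (for example +? or *?). The sketch below uses a simplified pattern without the a* part, an assumption made here so that the difference is easy to see; it is not the pattern from Example 2.

```python
import re

text = 'catenkerous cat!'

# Greedy: [\w ]+ takes as much as it can, then backs up to the last t
greedy = re.match(r'c[\w ]+t', text)

# Non-greedy: [\w ]+? takes as little as it can, stopping at the first t
lazy = re.match(r'c[\w ]+?t', text)

print(greedy.span(), greedy.group())   # (0, 15) catenkerous cat
print(lazy.span(), lazy.group())       # (0, 3) cat
```

Which form is appropriate depends on the data: greedy matching is the default, and the non-greedy form helps when the shortest match is wanted.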

Now, for the fun stuff. Do

  cd /path/to/ME701_examples
  git pull

You should now have a new folder re with some fun, real-world data to munge!