Regular Expressions

Whether you want a refresher, a primer, or simply more reps, regular expressions (or regex, as I will refer to them) are a way to work effectively with strings. It is quite possible that you did not even know you could do some of the things that are possible with regex.

Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python (or another language), and in Python they are made available through the re module. You can then ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. You can also use regex to modify a string or to split it apart in various ways. The official Python docs have a HOWTO that is a great starting place.

Did you know you could use the following regex to pull out all the email addresses in a string?


In [4]:
import re
my_str = "This (frodo@gmail.com) is what you are looking for, but here is my other address: samwise@minis-tirith.edu"
pattern = r"[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*" 
print(re.findall(pattern, my_str))


['frodo@gmail.com', 'samwise@minis-tirith.edu']

Searching, matching, substitution and more

We saw several key ingredients in the email example that we will go into in detail:

  1. a string that we are searching against
  2. a pattern that seems fairly complicated
  3. a call to the standard library package re

By the way, there is also a standard library option to parse emails. First, let's start with the re Python module. There are a number of ways to search for or match a pattern, but these are the most common:

  • re.search - Does the pattern exist anywhere in the string?
  • re.match - Does the pattern exist at the beginning of the string? (useless?)
  • re.split - Split the string on each occurrence of a pattern
  • re.findall - Return all the matches
  • re.finditer - Like findall, but returns an iterator
  • re.sub - Find a pattern and upon matching substitute it with another

Search and match are similar; there is a section in the docs on the difference. The functions shown above are the ones you will most commonly use to interface with the re module.
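As a rough sketch of how these functions differ, the snippet below runs each of them against a made-up string. The string and the variable names are illustrative assumptions, not part of the email example.

```python
import re

text = "one fish, two fish; red fish"

# search scans the whole string; match only tries at position 0
found_anywhere = re.search(r"two", text)
found_at_start = re.match(r"two", text)   # None, because text starts with "one"

# split on a comma or semicolon followed by optional whitespace
parts = re.split(r"[,;]\s*", text)

# findall returns a list of strings; finditer yields match objects
all_fish = re.findall(r"fish", text)
positions = [m.start() for m in re.finditer(r"fish", text)]

# sub replaces every occurrence of the pattern
replaced = re.sub(r"fish", "cat", text)

print(found_anywhere.group(0), found_at_start)
print(parts)
print(all_fish, positions)
print(replaced)
```

Note that finditer is handy when you need positions or groups for each match, since findall only hands back the matched text.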

EXERCISE: Create your own examples from the following code


In [6]:
s = "la la la ti dum da"
m = re.search("la",s)
print(m.group(0))
re.findall("la",s)


la
Out[6]:
['la', 'la', 'la']

And before we get too deep, it is worth noting that when there is a match, re.search and re.match return a match object (the other functions return lists, iterators, or strings). Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement. This means that we can check that a match exists before working with it.

match = re.search(pattern, string)
if match:
    process(match)
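A runnable version of the same idea might look like the sketch below. The helper name first_number and the test strings are hypothetical, chosen only to show both the match and the no-match branch.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"\d+", text)
    if match:                  # a match object is always truthy
        return match.group(0)
    return None                # search() returned None, so there was no match

print(first_number("Quote 42"))
print(first_number("no digits here"))
```

Calling m.group(0) on a None result raises an AttributeError, which is why the if test comes first.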

In [9]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print(m.group(0))    # The entire match
print(m.group(1))    # The first parenthesized subgroup.
print(m.group(2))    # The second parenthesized subgroup.
print(m.group(1, 2)) # Multiple arguments give us a tuple.
print(m.groups())


Isaac Newton
Isaac
Newton
('Isaac', 'Newton')
('Isaac', 'Newton')

Basic Regular Expression Syntax

  • \d Matches any decimal digit; this is equivalent to the class [0-9].
  • \D Matches any non-digit character; this is equivalent to the class [^0-9].
  • \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
  • \S Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
  • \w Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].
  • \W Matches any character that is not alphanumeric or an underscore; this is equivalent to the class [^a-zA-Z0-9_].
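To see the classes above in action, here is a small sketch on an invented sample string. One caveat worth knowing: for ordinary (str) patterns in Python 3, \d, \s and \w also match non-ASCII characters, so the bracketed equivalents hold exactly only with the re.ASCII flag. The Arabic-Indic digit below is just an assumed example to show that.

```python
import re

sample = "Room 101,\tfloor_3!"

digits = re.findall(r"\d", sample)       # single digits
whitespace = re.findall(r"\s", sample)   # a space and a tab
words = re.findall(r"\w+", sample)       # runs of word characters
non_words = re.findall(r"\W+", sample)   # runs of everything else

print(digits)
print(whitespace)
print(words)
print(non_words)

# For str patterns, \d also matches non-ASCII decimal digits unless re.ASCII is given
arabic_three = "\u0663"
unicode_digits = re.findall(r"\d", arabic_three)
ascii_digits = re.findall(r"\d", arabic_three, flags=re.ASCII)
print(len(unicode_digits), len(ascii_digits))
```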

EXERCISES Hint: add a * to the end of your pattern to see what happens.


In [24]:
s = "Today you are you! That is truer than true! "
s += "There is no one alive who is you-er than you! (Quote 42)"

## match all of the 'T's
print(re.findall("T", s))

## match all upper case letters
print(re.findall("[A-Z]", s))

## match only the numbers
print(re.findall(r"\d+", s))

## match words made of letters, digits and hyphens
print(re.findall("[A-Za-z0-9-]+", s))

## match any non-word characters
#print(re.findall(r"\W+", s))


['T', 'T', 'T']
['T', 'T', 'T', 'Q']
['42']
['Today', 'you', 'are', 'you', 'That', 'is', 'truer', 'than', 'true', 'There', 'is', 'no', 'one', 'alive', 'who', 'is', 'you-er', 'than', 'you', 'Quote', '42']

Regular expression special characters

Where the fun really begins

  • '.' - Matches any character except a newline.
  • '^' - Matches the start of the string
  • '$' - Matches the end of the string
  • '*' - Match 0 or more repetitions of the preceding RE
  • '+' - Match 1 or more repetitions of the preceding RE
  • '?' - Match 0 or 1 repetitions of the preceding RE
  • '*?', '+?', '??' - Non-greedy forms of the '*', '+', and '?' quantifiers
  • {m} - Match exactly m copies of the previous RE
  • {m,n} - Match from m to n repetitions of the preceding RE
  • [ ] - Used to indicate a set of characters.
  • '|' - A|B, match either A or B.
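The special characters above are easier to remember once you have seen each one fire. The sketch below uses short invented strings (none of them come from the exercises) to contrast greedy and non-greedy matching and to show anchors, optional characters, bounded repetition and alternation.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so there is a single long match
greedy = re.findall(r"<.*>", html)
# Non-greedy: .*? stops at the first '>' it can reach
lazy = re.findall(r"<.*?>", html)

# ^ and $ anchor the pattern to the start and end of the string
starts_with_tag = re.search(r"^<b>", html) is not None
ends_with_tag = re.search(r"</i>$", html) is not None

# ? makes the preceding character optional
spellings = re.findall(r"colou?r", "color or colour")

# {m} and {m,n} bound the number of repetitions
four_digits = re.findall(r"\d{4}", "1999, 2024, 42")
two_or_three_a = re.findall(r"a{2,3}", "a aa aaa aaaa")

# | matches either alternative
pets = re.findall(r"cat|dog", "cat, dog, bird")

print(greedy)
print(lazy)
print(starts_with_tag, ends_with_tag)
print(spellings, four_digits, two_or_three_a, pets)
```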

EXERCISES


In [44]:
s = "You have brains in your head. "
s += "You have feet in your shoes. "
s += "You can steer yourself any direction you choose.  --Dr. Seuss"

## match all words
print(re.findall(r"\w+", s))

## match only the word touching the beginning of the line
print(re.findall(r"^\w+", s))

## match the word touching the end of the line
print(re.findall(r"\w+$", s))

## match all words that come after a single whitespace character
print(re.findall(r"\s(\w+)", s))

## Can you create a pattern that will match "you anyword"?
print(re.findall(r"[Yy]ou\s(\w+)", s))


['You', 'have', 'brains', 'in', 'your', 'head', 'You', 'have', 'feet', 'in', 'your', 'shoes', 'You', 'can', 'steer', 'yourself', 'any', 'direction', 'you', 'choose', 'Dr', 'Seuss']
['You']
['Seuss']
['have', 'brains', 'in', 'your', 'head', 'You', 'have', 'feet', 'in', 'your', 'shoes', 'You', 'can', 'steer', 'yourself', 'any', 'direction', 'you', 'choose', 'Seuss']
['have', 'have', 'can', 'choose']

The Backslash Scourge

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. Raw string notation (r"...") avoids this doubling, which is why patterns are usually written as raw strings.
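A short sketch makes the doubling concrete. The Windows-style path is a hypothetical example; the point is only that the regular literal and the raw literal produce the same two-character pattern.

```python
import re

path = "C:\\temp"   # regular literal: the resulting string contains ONE backslash

# In a regular string literal the pattern needs four backslashes ...
regular_pattern = "\\\\"
# ... while a raw string needs only two for the same regex
raw_pattern = r"\\"

print(regular_pattern == raw_pattern)   # the two literals are the same string
print(len(raw_pattern))                 # two characters: backslash, backslash
print(re.findall(raw_pattern, path))    # finds the single literal backslash
print(re.split(raw_pattern, path))      # splits the path around it
```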

Here is a question that an instructor recently asked in a Slack channel: if I have the following use cases, can we parse the paths?

EXERCISE: Can you create a list where each directory is separated by commas?


In [3]:
import os

a = r'this_is/some directory/'
b = r"this_is/some\ directory"
c = r"this_is/some\ other\ directory/"
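One possible approach to the exercise, offered as a sketch rather than the intended answer: split each path on the forward slash, drop empty pieces, and turn each escaped space back into a plain space. The helper name split_path is hypothetical, and the sketch assumes that a backslash followed by a space is the only escape sequence that appears.

```python
import re

paths = [
    r'this_is/some directory/',
    r"this_is/some\ directory",
    r"this_is/some\ other\ directory/",
]

def split_path(path):
    """Hypothetical helper: return the directory names in path as a list."""
    # Split on '/', discarding the empty string left by a trailing slash
    parts = [p for p in re.split(r"/", path) if p]
    # Replace an escaped space (backslash + space) with a plain space
    return [re.sub(r"\\ ", " ", p) for p in parts]

for p in paths:
    print(split_path(p))
```

Try it yourself before reading the code closely; a single pattern passed to re.findall is another route worth exploring.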
