Regexes

Up until now, to search in text we have used string methods find, startswith, endswith, etc. But sometimes you need more power.

Regular expressions are their own little language that allows you to search through text and find matches with incredibly complex patterns.

A regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.

To use regular expressions you need to import Python's regex library, re: https://docs.python.org/2/library/re.html


In [ ]:
import re

In [ ]:
# To run the examples we are going to use some of the logs from the 
# django project, a web framework for python

django_logs = '''commit 722344ee59fb89ea2cd5b906d61b35f76579de4e
Author: Simon Charette <charette.s@gmail.com>
Date:   Thu May 19 09:31:49 2016 -0400

    Refs #24067 -- Fixed contenttypes rename tests failures on Oracle.

    Broke the initial migration in two to work around #25530 and added
    'django.contrib.auth' to the available_apps to make sure its tables are also
    flushed as Oracle doesn't implement cascade deletion in sql_flush().

    Thanks Tim for the report.

commit 9fed4ec418a4e391a3af8790137ab147efaf17c2
Author: Simon Charette <charette.s@gmail.com>
Date:   Sat May 21 13:18:22 2016 -0400

    Removed an obsolete comment about a fixed ticket.

commit 94486fb005e878d629595942679ba6d23401bc22
Author: Markus Holtermann <info@markusholtermann.eu>
Date:   Sat May 21 13:20:40 2016 +0200

    Revert "Disable patch coverage checks"

    Mistakenly pushed to django/django instead of another repo

    This reverts commit 6dde884c01156e36681aa51a5e0de4efa9575cfd.

commit 6dde884c01156e36681aa51a5e0de4efa9575cfd
Author: Markus Holtermann <info@markusholtermann.eu>
Date:   Sat May 21 13:18:18 2016 +0200

    Disable patch coverage checks

commit 46a38307c245ab7ed0b4d5d5ebbaf523a81e3b75
Author: Tim Graham <timograham@gmail.com>
Date:   Fri May 20 10:50:51 2016 -0400

    Removed versionadded/changed annotations for 1.9.

commit 1915a7e5c56d996b0e98decf8798c7f47ff04e76
Author: Tim Graham <timograham@gmail.com>
Date:   Fri May 20 09:18:55 2016 -0400

    Increased the default PBKDF2 iterations.

commit 97c3dfe12e095005dad9e6750ad5c5a54eee8721
Author: Tim Graham <timograham@gmail.com>
Date:   Thu May 19 22:28:24 2016 -0400

    Added stub 1.11 release notes.

commit 8df083a3ce21ca73ff77d3844a578f3da3ae78d7
Author: Tim Graham <timograham@gmail.com>
Date:   Thu May 19 22:20:21 2016 -0400

    Bumped version; master is now 1.11 pre-alpha.'''

Searching

The simplest thing you can do with regexes in Python is search through text to see if there is a match. To do this you use the functions search or match. match only checks for a match at the beginning of the string, while search checks the whole string.

re.match(pattern, string)  
re.search(pattern, string) 

In [ ]:
print(re.match('a', 'abcde'))
print(re.match('c', 'abcde'))

In [ ]:
print(re.search('a', 'abcde'))
print(re.search('c', 'abcde'))

In [ ]:
print(re.match('version', django_logs))
print(re.search('version', django_logs))

In [ ]:
if re.search('commit', django_logs):
    print("Someone has been doing work.")

TRY IT

Search for the word May in the django logs


In [ ]:

Special Characters

So far we can't do anything that you couldn't do with find, but don't worry. Regexes have many special characters that let you look for things like the beginning of a word, whitespace, or classes of characters.

You include the character in the pattern.

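  • ^ Matches the beginning of the string (or of each line with the re.MULTILINE flag)
  • $ Matches the end of the string (or of each line with the re.MULTILINE flag)
  • . Matches any character except a newline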
  • \s Matches whitespace
  • \S Matches any non-whitespace character
  • * Repeats a character zero or more times
  • *? Repeats a character zero or more times (non-greedy)
  • + Repeats a character one or more times
  • +? Repeats a character one or more times (non-greedy)
  • ? Repeats a character zero or one time
  • [aeiou] Matches a single character in the listed set
  • [^XYZ] Matches a single character not in the listed set
  • [a-z0-9] The set of characters can include a range
  • {10} Matches the preceding character(s) exactly {num} times
  • \d Matches any digit
  • \b Matches a word boundary

Hint: if you want to match a literal character (like $) as opposed to its special meaning, escape it with a \
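To make the hint concrete, here is a small sketch of escaping. The price_text string is made up for this example. It compares an unescaped $ with an escaped one, and also shows re.escape, which adds the backslashes for you.

```python
import re

# A made-up string for illustration only.
price_text = 'Total: $5.00 (was $7.50)'

# Unescaped, $ means "end of the string", so '$5' cannot match here.
print(re.search('$5', price_text))  # None

# Escaped with a backslash, \$ matches a literal dollar sign.
# The r prefix makes a raw string, so Python passes the backslash to re unchanged.
literal_match = re.search(r'\$5', price_text)
print(literal_match)

# re.escape escapes every special character in a piece of literal text.
escaped_match = re.search(re.escape('$7.50'), price_text)
print(escaped_match)
```

Raw strings (the r prefix) are a good habit for any pattern that contains a backslash.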


In [ ]:
# Start simple, match any character 2 times
print(re.search('..', django_logs))

# just to prove it works
print(re.search('..', 'aa'))
print(re.search('..', 'a'))
print(re.search('..', '^%'))

In [ ]:
# A commit hash is 40 characters, each a digit or a letter a-f.
# Without the {40}, the search would stop at the 'c' in 'commit'.
commit_pattern = '[0-9a-f]{40}'
print(re.search(commit_pattern, django_logs))

In [ ]:
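# Let's match the time syntax
# The r prefix (raw string) keeps the backslashes intact for re
time_pattern = r'\d\d:\d\d:\d\d'
# The same pattern, written with a repeat count
time_pattern = r'\d{2}:\d{2}:\d{2}'
print(re.search(time_pattern, django_logs))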
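The difference between the greedy and non-greedy repeats in the list of special characters is easiest to see side by side. This is a small sketch using a made-up string: the greedy version grabs as much as it can, while the non-greedy version stops at the first place the pattern is satisfied.

```python
import re

# A made-up string with two quoted words.
quoted = 'say "one" and "two"'

# Greedy: .* runs as far as possible, up to the LAST quote.
greedy = re.search('".*"', quoted)
print(greedy)

# Non-greedy: .*? stops at the FIRST closing quote it can.
lazy = re.search('".*?"', quoted)
print(lazy)
```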

TRY IT

Match anything between angled brackets < >


In [ ]:

Ignoring case

match and search both take an optional third argument that allows you to include flags. The most common flag is ignore case.

re.search(pattern, string, re.IGNORECASE)
re.match(pattern, string, re.IGNORECASE)

In [ ]:
print(re.search('markus holtermann', django_logs))
print(re.search('markus holtermann', django_logs, re.IGNORECASE))
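Flags can also change how special characters behave. As a rough sketch, using a shortened, made-up log string rather than the real django_logs: by default ^ only matches at the very start of the string, and the re.MULTILINE flag lets it match at the start of every line. Flags can be combined with |.

```python
import re

# A shortened, made-up sample in the same shape as the logs.
sample_log = 'commit abc123\nAuthor: Example Person\nDate:   Thu May 19'

# Without a flag, ^ only matches at the start of the whole string.
print(re.search('^Author', sample_log))  # None

# With re.MULTILINE, ^ also matches at the start of each line.
multiline_match = re.search('^Author', sample_log, re.MULTILINE)
print(multiline_match)

# Combine flags with | to use both at once.
combined_match = re.search('^author', sample_log, re.MULTILINE | re.IGNORECASE)
print(combined_match)
```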

TRY IT

Search for 'django' in 'Both Django and Flask are very useful python frameworks', ignoring case


In [ ]:

Extracting Matches

Finding is only half the battle. You can also extract what you match.

To get the string that your regex matched, store the match object in a variable and call its group method.

m = re.search(pattern, string)
print(m.group(0))

In [ ]:
# Let's match the time syntax
time_pattern = r'\d\d:\d\d:\d\d'
m = re.search(time_pattern, django_logs)
print(m.group(0))
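One thing to watch for: search returns None when nothing matches, and calling group on None raises an error. A minimal sketch of a safer habit, using a made-up sample line, is to check the match object first. It also shows start and end, which tell you where the match sits in the string.

```python
import re

# A made-up line in the same shape as the log dates.
sample = 'Date:   Thu May 19 09:31:49 2016 -0400'

m = re.search(r'\d\d:\d\d:\d\d', sample)
if m:
    print(m.group(0))          # the matched text
    print(m.start(), m.end())  # positions where the match begins and ends
else:
    print('no time found')

# When nothing matches, search returns None instead of a match object.
missing = re.search(r'\d\d:\d\d:\d\d', 'no time here')
print(missing)
```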

If you want to find all the matches, not just the first, you can use the findall function. It returns a list of all the matches.

re.findall(pattern, string)

In [ ]:
time_pattern = r'\d\d:\d\d:\d\d'
print(re.findall(time_pattern, django_logs))

If you want only part of the match returned to you in findall, you can use parentheses to mark a capture group.

pattern = 'sads (part to capture) asdjklajsd'
print(re.findall(pattern, string))  # prints a list of only the captured parts

In [ ]:
time_pattern = r'(\d\d):\d\d:\d\d'
hours = re.findall(time_pattern, django_logs)
print(sorted(hours))

In [ ]:
# you can capture more than one match
time_pattern = r'(\d\d):(\d\d):\d\d'
times = re.findall(time_pattern, django_logs)
print(times)

# Unpack each tuple into two variables in the for statement
for hours, mins in times:
    print("{} hr {} min".format(hours, mins))
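Capture groups work with search too, not just findall. As a small sketch on a made-up sample line: group(0) is still the whole match, group(1), group(2), and so on give the individual groups, and groups() returns them all as a tuple.

```python
import re

# A made-up line in the same shape as the log dates.
sample = 'Date:   Sat May 21 13:18:22 2016 -0400'

m = re.search(r'(\d\d):(\d\d):(\d\d)', sample)
if m:
    print(m.group(0))  # the whole match
    print(m.group(1))  # just the first group
    print(m.groups())  # every group, as a tuple

    # The tuple unpacks the same way the findall tuples do.
    hours, mins, secs = m.groups()
    print("{} hr {} min {} sec".format(hours, mins, secs))
```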

TRY IT

Capture the host of the email address (alphanumerics between @ and .com). Hint: remember to escape the . in .com


In [ ]:

Practice

There is a lot more that you can do, but it can feel overwhelming. The best way to learn is with practice. A great place to experiment is this website (http://www.regexr.com/). You can paste in a section of text and see which regexes match patterns in it. The site also has a cheatsheet for special characters.


In [ ]:
# Let's try some now

Project: Doc Clerk

Let's imagine you are working in a law office. You have millions of e-mails and other documents to go through to see what is relevant to the case. You are going to write a program to go through a file, check for key words (client's name, phone number, defendant's name) and print out the whole paragraph. It should not print any paragraphs with no relevant info. Paragraphs will be separated by an empty line.

Your program should match the following items:

  • Gold E. Locks (case insensitive, E. or E)
  • Three bears or 3 bears
  • 571 209-4000 (with parens, dashes, or no spaces)

  1. Import re
  2. Initialize a variable called paragraph to be an empty list and a variable called found_match to false.
  3. Create a list of patterns to match and store in variable called patterns
  4. Read in test file 'evidence.txt'.
  5. For line in evidence:
     a. check if it matches any of the patterns, if so set found_match to true
     b. append line to paragraph
     c. if line is empty (just a newline character) 
         - print paragraph if found_match is true **Hint** use the join method to print a string instead of a list
         - reset paragraph to empty list and found_match to false
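If the paragraph bookkeeping in step 5 is hard to picture, here is one possible shape for it. This is only a sketch under some assumptions: the patterns and sample lines are placeholders invented for the example (writing the real patterns is still your job), the lines come from a list instead of 'evidence.txt', and the function name relevant_paragraphs is made up. It collects the matching paragraphs in a list instead of printing them right away, which makes it easier to check.

```python
import re

# Placeholder patterns and text, invented for this sketch.
patterns = [r'apple', r'\d{3}-\d{4}']
sample_lines = [
    'I bought an Apple today.\n',
    'It was tasty.\n',
    '\n',
    'Nothing to see here.\n',
    '\n',
    'Call 555-1234 tomorrow.\n',
    '\n',
]


def relevant_paragraphs(lines, patterns):
    """Return the paragraphs that contain a match for any pattern."""
    results = []
    paragraph = []
    found_match = False
    for line in lines:
        # a. remember whether any pattern matched somewhere in this paragraph
        if any(re.search(p, line, re.IGNORECASE) for p in patterns):
            found_match = True
        # b. keep building up the current paragraph
        paragraph.append(line)
        # c. an empty line ends the paragraph
        if line == '\n':
            if found_match:
                results.append(''.join(paragraph))
            paragraph = []
            found_match = False
    # Keep the last paragraph if the text does not end with an empty line.
    if paragraph and found_match:
        results.append(''.join(paragraph))
    return results


for text in relevant_paragraphs(sample_lines, patterns):
    print(text)
```

Looping over an open file gives you one line at a time in the same way, so the same loop works once you read from the file.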

In [ ]: