Up until now, to search in text we have used string methods find, startswith, endswith, etc. But sometimes you need more power.
Regular expressions are their own little language that allows you to search through text and find matches with incredibly complex patterns.
A regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.
To use regular expressions you need to import Python's regex library, re.
https://docs.python.org/2/library/re.html
In [ ]:
import re
In [ ]:
# To run the examples we are going to use some of the logs from the
# django project, a web framework for python
django_logs = '''commit 722344ee59fb89ea2cd5b906d61b35f76579de4e
Author: Simon Charette <charette.s@gmail.com>
Date: Thu May 19 09:31:49 2016 -0400
Refs #24067 -- Fixed contenttypes rename tests failures on Oracle.
Broke the initial migration in two to work around #25530 and added
'django.contrib.auth' to the available_apps to make sure its tables are also
flushed as Oracle doesn't implement cascade deletion in sql_flush().
Thanks Tim for the report.
commit 9fed4ec418a4e391a3af8790137ab147efaf17c2
Author: Simon Charette <charette.s@gmail.com>
Date: Sat May 21 13:18:22 2016 -0400
Removed an obsolete comment about a fixed ticket.
commit 94486fb005e878d629595942679ba6d23401bc22
Author: Markus Holtermann <info@markusholtermann.eu>
Date: Sat May 21 13:20:40 2016 +0200
Revert "Disable patch coverage checks"
Mistakenly pushed to django/django instead of another repo
This reverts commit 6dde884c01156e36681aa51a5e0de4efa9575cfd.
commit 6dde884c01156e36681aa51a5e0de4efa9575cfd
Author: Markus Holtermann <info@markusholtermann.eu>
Date: Sat May 21 13:18:18 2016 +0200
Disable patch coverage checks
commit 46a38307c245ab7ed0b4d5d5ebbaf523a81e3b75
Author: Tim Graham <timograham@gmail.com>
Date: Fri May 20 10:50:51 2016 -0400
Removed versionadded/changed annotations for 1.9.
commit 1915a7e5c56d996b0e98decf8798c7f47ff04e76
Author: Tim Graham <timograham@gmail.com>
Date: Fri May 20 09:18:55 2016 -0400
Increased the default PBKDF2 iterations.
commit 97c3dfe12e095005dad9e6750ad5c5a54eee8721
Author: Tim Graham <timograham@gmail.com>
Date: Thu May 19 22:28:24 2016 -0400
Added stub 1.11 release notes.
commit 8df083a3ce21ca73ff77d3844a578f3da3ae78d7
Author: Tim Graham <timograham@gmail.com>
Date: Thu May 19 22:20:21 2016 -0400
Bumped version; master is now 1.11 pre-alpha.'''
The simplest thing you can do with regexes in Python is search through text to see if there is a match. To do this you use the functions search or match. match only checks for a match at the beginning of the string, while search checks the whole string.
re.match(pattern, string)
re.search(pattern, string)
In [ ]:
print(re.match('a', 'abcde'))
print(re.match('c', 'abcde'))
In [ ]:
print(re.search('a', 'abcde'))
print(re.search('c', 'abcde'))
In [ ]:
print(re.match('version', django_logs))
print(re.search('version', django_logs))
In [ ]:
if re.search('commit', django_logs):
    print("Someone has been doing work.")
In [ ]:
So far we can't do anything that you couldn't do with find, but don't worry. Regexes have many special characters that let you look for things like the beginning of a word, whitespace, or classes of characters.
You include the special character in the pattern.
Hint: if you want to match the literal character (like $) as opposed to its special meaning, escape it with a backslash (\).
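As a small illustration of the escaping hint, here is a sketch that matches dollar amounts. The sample string price_text is made up for this example and is not part of the Django logs. It also shows re.escape, a helper in the re module that escapes every special character in a piece of literal text for you.

```python
import re

# Hypothetical sample text, invented for this example
price_text = 'Total: $5.00 (was $6.50)'

# Unescaped, '$' means "end of string" and '.' means "any character",
# so both are escaped with a backslash to match them literally.
amounts = re.findall(r'\$\d+\.\d\d', price_text)
print(amounts)

# re.escape builds an escaped pattern from literal text
literal = re.escape('$5.00')
print(re.search(literal, price_text).group(0))
```

The r prefix makes the pattern a raw string, so Python passes the backslashes through to the regex engine unchanged.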
In [ ]:
# Start simple, match any character 2 times
print(re.search('..', django_logs))
# just to prove it works
print(re.search('..', 'aa'))
print(re.search('..', 'a'))
print(re.search('..', '^%'))
In [ ]:
# to match a commit hash (40 characters, each a digit or a letter a-f) we can use a regex
# ('[0-9a-f]+' alone would match the first 'c' of the word "commit")
commit_pattern = '[0-9a-f]{40}'
print(re.search(commit_pattern, django_logs))
In [ ]:
# Let's match the time syntax
# Both patterns are equivalent: {2} means "repeat twice"
time_pattern = r'\d\d:\d\d:\d\d'
time_pattern = r'\d{2}:\d{2}:\d{2}'
print(re.search(time_pattern, django_logs))
In [ ]:
In [ ]:
print(re.search('markus holtermann', django_logs))
print(re.search('markus holtermann', django_logs, re.IGNORECASE))
In [ ]:
In [ ]:
# search returns a match object; group(0) is the text that matched
time_pattern = r'\d\d:\d\d:\d\d'
m = re.search(time_pattern, django_logs)
print(m.group(0))
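Because search returns None when nothing matches, calling group on the result can fail. The sketch below guards against that and also shows span, which gives the position of the match. The variable log_line is just one Date line copied from the logs above so that the example stands alone.

```python
import re

# One line copied from django_logs so the example is self-contained
log_line = 'Date: Thu May 19 09:31:49 2016 -0400'

m = re.search(r'\d\d:\d\d:\d\d', log_line)
if m:  # search returns None when there is no match
    print(m.group(0))  # the matched text
    print(m.span())    # (start, end) positions of the match in log_line
else:
    print('no time found')
```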
If you want to find all the matches, not just the first, you can use the findall function. It returns a list of all the matches.
re.findall(pattern, string)
In [ ]:
time_pattern = r'\d\d:\d\d:\d\d'
print(re.findall(time_pattern, django_logs))
If you want only part of the match returned to you by findall, you can use parentheses to set a capture group.
pattern = 'sads (part to capture) asdjklajsd'
print(re.findall(pattern, string))  # prints a list of the captured parts
In [ ]:
time_pattern = r'(\d\d):\d\d:\d\d'
hours = re.findall(time_pattern, django_logs)
print(sorted(hours))
In [ ]:
# you can capture more than one group
time_pattern = r'(\d\d):(\d\d):\d\d'
times = re.findall(time_pattern, django_logs)
print(times)
# Each result is a tuple; unpack it in the first line of the loop
for hours, mins in times:
    print("{} hr {} min".format(hours, mins))
In [ ]:
There is a lot more that you can do, but it can feel overwhelming. The best way to learn is with practice. A great way to experiment is the website http://www.regexr.com/, where you can paste a section of text and see which regexes match patterns in it. The site also has a cheatsheet for special characters.
In [ ]:
# Let's try some now
Let's imagine you are working in a law office. You have millions of e-mails and other documents to go through to see what is relevant to the case. You are going to write a program to go through a file, check for keywords (client's name, phone number, defendant's name) and print out the whole paragraph. It should not print any paragraphs with no relevant info. Paragraphs will be separated by an empty line.
Your program should match the following items:
- Gold E. Locks (case insensitive, E. or E)
- Three bears or 3 bears
- 571 209-4000 (with parens, dashes, or no spaces)
For each line in the file:
a. check if it matches any of the patterns; if so, set found_match to true
b. append line to paragraph
c. if line is empty (just a newline character)
- print paragraph if found_match is true (**Hint:** use the join method to print a string instead of a list)
- reset paragraph to empty list and found_match to false
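The steps above can be sketched as follows. This is only one possible approach, not a reference solution: the three patterns, the function name relevant_paragraphs, and the sample lines are all assumptions made up for illustration. It collects paragraphs in a list instead of printing them directly so the result is easy to check, and it adds one step the outline does not mention, which is handling a final paragraph with no empty line after it.

```python
import re

# Hypothetical patterns for the three items; try writing your own first
PATTERNS = [
    re.compile(r'gold\s+e\.?\s+locks', re.IGNORECASE),
    re.compile(r'(?:three|3)\s+bears', re.IGNORECASE),
    re.compile(r'\(?571\)?[\s-]?209[\s-]?4000'),
]

def relevant_paragraphs(lines):
    """Return the paragraphs that contain at least one match."""
    results = []
    paragraph = []
    found_match = False
    for line in lines:
        # a. check the line against every pattern
        if any(p.search(line) for p in PATTERNS):
            found_match = True
        # b. collect the line
        paragraph.append(line)
        # c. an empty line ends the paragraph
        if line.strip() == '':
            if found_match:
                results.append(''.join(paragraph))
            paragraph = []
            found_match = False
    # The last paragraph may not be followed by an empty line
    if found_match:
        results.append(''.join(paragraph))
    return results

# Made-up sample input; a real program would read lines from a file
sample = [
    "Dear Sir,\n",
    "Ms. Gold E Locks entered the house.\n",
    "\n",
    "The weather was fine that day.\n",
    "\n",
    "Call (571) 209-4000 for details.\n",
]
found = relevant_paragraphs(sample)
for p in found:
    print(p)
```

To run it on a real document, pass an open file object as lines, since iterating over a file yields one line at a time.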
In [ ]: