In [5]:

    
import re # Python's regular expression module

def re_test(regex, query):
    """A helper function to test if a regex matches at the start of a query."""
    p = re.compile(regex)
    result = 'MATCH' if p.match(query) else 'NOT FOUND'
    print '"{}" with regex "{}": {}'.format(query, regex, result)

Regular Expressions

Daniel Rice

Introduction
- Definition
- Examples
- Exercise 1
Decomposing the syntax
- Character classes
- Metacharacters
  - Repetition
  - Capture groups
Regexes in Python
- match
- search

Introduction

Definition

A regular expression (also known as a RE, regex, regex pattern, or regexp) is a sequence of symbols and characters expressing a text pattern. A regular expression allows us to specify a string pattern that we can then search for within a body of text. The idea is to make a pattern template (regex), and then query some text to see if the template is present or not.

Example 1

Let's say we want to determine if a string begins with the word PASS. Our regular expression will simply be:



In [71]:

    
pass_regex = 'PASS'

This pattern will match an occurrence of PASS at the start of the query text. Now let's test it out:



In [72]:

    
re_test(pass_regex, 'PASS: Data good')









    



"PASS: Data good" with regex "PASS": MATCH



In [73]:

    
re_test(pass_regex, 'FAIL: Data bad')









    



"FAIL: Data bad" with regex "PASS": NOT FOUND

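Note that the helper relies on match, which only looks for the pattern at the beginning of the string. A minimal sketch of how that differs from search (the query string below is made up for illustration):

```python
import re

pass_regex = re.compile('PASS')

# Hypothetical query where PASS is not at the start of the string.
query = 'Data check: PASS'

at_start = pass_regex.match(query)    # match() only tries position 0
anywhere = pass_regex.search(query)   # search() scans the whole string

print('match:  {}'.format('MATCH' if at_start else 'NOT FOUND'))
print('search: {}'.format('MATCH' if anywhere else 'NOT FOUND'))
```

So match is the right tool for "begins with PASS", while search answers "contains PASS".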
Example 2

Let's say we have a text file that contains numerical readings that we need to perform some analysis on. Here are the first few lines from the file:



In [24]:

    
lines = \
"""
Device-initialized.
Version-19.23
12-12-2014
12
4353
3452
ERROR
498
34598734
345982398
23
ERROR
3434345798
"""

We don't want the header lines, and those ERROR lines are going to ruin our analysis! Let's filter these out with a regex. First we will create the pattern template (or regex) for what we want to find:

^\d+$

This regex can be split into four parts:

^ This indicates the start of the string.
\d This specifies we want to match decimal digits (the numbers 0-9).
+ This symbol means we want to find one or more of the previous symbol (which in this case is a decimal digit).
$ This indicates the end of the string.

Putting it all together, we want to find strings that consist of one or more (+) digits (\d) from start (^) to finish ($).
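To see what each part contributes, the sketch below changes one part at a time and tests the result with search, which does not anchor at the start by itself. The sample strings are made up for illustration.

```python
import re

samples = ['4353', 'Version-19', '12-12-2014', '']

patterns = [
    r'^\d+$',  # full pattern: digits only, from start to end
    r'\d+$',   # no ^: the string only has to finish with digits
    r'^\d+',   # no $: the string only has to begin with digits
    r'^\d*$',  # * instead of +: the empty string is accepted too
]

results = {}
for pattern in patterns:
    results[pattern] = [s for s in samples if re.search(pattern, s)]
    print('{:8} {}'.format(pattern, results[pattern]))
```

Only the full pattern keeps just the pure-integer line.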

Let's load the regex into Python's re module:



In [25]:

    
integer_regex = re.compile(r'^\d+$')

Now let's get our string of lines into a list of strings:



In [26]:

    
lines = lines.split()
print lines









    



['Device-initialized.', 'Version-19.23', '12-12-2014', '12', '4353', '3452', 'ERROR', '498', '34598734', '345982398', '23', 'ERROR', '3434345798']

Now we need to run through each of these lines and determine if it matches our regex. Converting to integer would be nice as well.



In [27]:

    
clean_data = [] # We will accumulate our filtered integers here
for line in lines:
    if integer_regex.match(line):
        clean_data.append(int(line))
print clean_data

# If you're into one liners you could also do one of these:
# clean_data = [int(line) for line in lines if integer_regex.match(line)]
# clean_data = map(int, filter(integer_regex.match, lines))









    



[12, 4353, 3452, 498, 34598734, 345982398, 23, 3434345798]

It worked like a dream. You may be arguing that there are other non-regex solutions to this problem, and indeed there are (for example, integer typecasting with a catch clause), but this example was given to show you the process of:

Creating a regex pattern for what you want to find.
Applying it to some text.
Extracting the positive hits.

There will be situations where regexes are the only viable solution, such as when you want to match some very complex strings.
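For comparison, the non-regex alternative mentioned above (integer typecasting with a catch clause) could look roughly like this. It is only a sketch on a shortened, made-up copy of the data:

```python
# A shortened, illustrative copy of the lines from Example 2.
raw_lines = ['Version-19.23', '12-12-2014', '12', 'ERROR', '498']

clean_data = []
for line in raw_lines:
    try:
        clean_data.append(int(line))
    except ValueError:
        # int() raises ValueError for anything it cannot parse as an integer
        continue

print(clean_data)
```

The two approaches are not strictly equivalent: int() also accepts strings such as '-5' or ' 7 ', which the regex from Example 2 would reject.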

Exercise 1

You have a file consisting of DNA bases which you want to perform analysis on:



In [36]:

    
lines = \
"""
Acme-DNA-Reader
ACTG
AA
-1
CCTC
TTTCG
C
TGCTA
-1
TCCCCCC
"""

The -1 lines represent reading errors and we want these removed. Using the preceding example as a guide, filter out the header and the reading errors.

Hint: The bases can be represented with the pattern [ACGT].



In [37]:

    
bases_regex = re.compile('[ACGT]+$')
lines = lines.split()
#print lines
clean_data = [] # We will accumulate our filtered base sequences here
for line in lines:
    print line
    if bases_regex.match(line):
        clean_data.append(line)
print clean_data









    



Acme-DNA-Reader
ACTG
AA
-1
CCTC
TTTCG
C
TGCTA
-1
TCCCCCC
['ACTG', 'AA', 'CCTC', 'TTTCG', 'C', 'TGCTA', 'TCCCCCC']

Decomposing the syntax

Regexps can appear cryptic but they can be decomposed into character classes and metacharacters.

Character classes

These allow us to concisely specify the types or classes of characters to match. In the example above \d is a character class that represents decimal digits. There are many such character classes and we will go through these below. The square brackets allow us to specify a set of characters to match. We have already seen this with [ACGT]. We can also use the hyphen - to specify ranges.
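As a small illustration of ranges inside square brackets, here is a sketch that accepts strings made only of hexadecimal digits. The hexadecimal use case and the candidate strings are assumptions chosen for the example, not part of the exercises.

```python
import re

# Three ranges combined in one set: 0-9, a-f and A-F.
hex_regex = re.compile(r'^[0-9a-fA-F]+$')

candidates = ['ff', '7F', 'xyz', 'c0ffee', '12g']
hex_like = [c for c in candidates if hex_regex.match(c)]
print(hex_like)

# A hyphen placed first (or last) in the set is a literal hyphen, not a range.
separators = re.findall(r'[-/]', '2012-12-12 1983/06/27')
print(separators)
```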

Character Class	Description	Match Examples
`\d`	Matches any decimal digit; this is equivalent to the class `[0-9]`.	`0`, `1`, `2`, ...
`\D`	Matches any non-digit character; this is equivalent to the class `[^0-9]`.	`a`, `@`, `;`
`\s`	Matches any whitespace character; this is equivalent to the class `[ \t\n\r\f\v]`.	space, tab, newline
`\S`	Matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`.	`1`, `A`, `&`
`\w`	Matches any alphanumeric character or underscore (word character); this is equivalent to the class `[a-zA-Z0-9_]`.	`x`, `Z`, `2`
`\W`	Matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`.	`£`, `(`, space
`.`	Matches anything (except newline).	`8`, `(`, `a`, space

This can look like a lot to remember, but there are some mnemonics here:

Character Class	Mnemonic
`\d`	decimal digit
`\D`	uppercase so not `\d`
`\s`	whitespace character
`\S`	uppercase so not `\s`
`\w`	word character
`\W`	uppercase so not `\w`
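A quick way to get a feel for these classes is to list which characters of a short string each one picks out. The sample string is made up for illustration.

```python
import re

sample = 'a1 b_2!'

found = {}
for pattern in [r'\d', r'\D', r'\s', r'\S', r'\w', r'\W']:
    # findall returns every single character matched by the class
    found[pattern] = re.findall(pattern, sample)
    print('{} -> {}'.format(pattern, found[pattern]))
```

Each uppercase class returns exactly the characters its lowercase partner leaves out.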

Metacharacters

Repetition

The character classes will match only a single character. How can we, say, match exactly 3 occurrences of Q? The metacharacters include different symbols to express repetition:

Repetition Metacharacter	Description
`*`	Matches zero or more occurrences of the previous character (class).
`+`	Matches one or more occurrences of the previous character (class).
`{m,n}`	With integers `m` and `n`, specifies at least `m` and at most `n` occurrences of the previous character (class). Do not put any space after the comma as this prevents the metacharacter from being recognized.
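To answer the question about exactly 3 occurrences of Q, we can set m and n to the same value. The sketch below also shows what happens when a space sneaks in after the comma: the braces are then read as literal text rather than as a repetition. The test strings are made up.

```python
import re

# m = n = 3 means exactly three occurrences.
exactly_three = re.compile(r'^Q{3,3}$')
three_q = [s for s in ['QQ', 'QQQ', 'QQQQ'] if exactly_three.match(s)]
print(three_q)

# With a space after the comma the braces are no longer a repetition.
spaced = re.compile(r'A{1, 3}')
matches_repeat = bool(spaced.match('AA'))
matches_literal = bool(spaced.match('A{1, 3}'))
print(matches_repeat)
print(matches_literal)
```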

Examples



In [38]:

    
re_test('A*', ' ')
re_test('A*', 'A')
re_test('A*', 'AA')
re_test('A*', 'Z12345')









    



" " with regex "A*": MATCH
"A" with regex "A*": MATCH
"AA" with regex "A*": MATCH
"Z12345" with regex "A*": MATCH

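The last result may look surprising. Since * also accepts zero occurrences, A* succeeds on 'Z12345' by matching an empty string at position 0. A small sketch that makes this visible by printing what was actually matched:

```python
import re

m_empty = re.match('A*', 'Z12345')
print(repr(m_empty.group()))   # the text that was matched
print(m_empty.span())          # where it was matched

m_full = re.match('A*', 'AA')
print(repr(m_full.group()))
```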


In [39]:

    
re_test('A+', ' ')
re_test('A+', 'A')
re_test('A+', 'ZZZZ')









    



" " with regex "A+": NOT FOUND
"A" with regex "A+": MATCH
"ZZZZ" with regex "A+": NOT FOUND



In [40]:

    
re_test('BA{1,3}B', 'BB')
re_test('BA{1,3}B', 'BAB')
re_test('BA{1,3}B', 'BAAAB')
re_test('BA{1,3}B', 'BAAAAAB')









    



"BB" with regex "BA{1,3}B": NOT FOUND
"BAB" with regex "BA{1,3}B": MATCH
"BAAAB" with regex "BA{1,3}B": MATCH
"BAAAAAB" with regex "BA{1,3}B": NOT FOUND



In [41]:

    
re_test('.*', 'AB12[]9023')
re_test('\d{1,3}B', '123B')
re_test('\w{1,3}\d+', 'aaa2')
re_test('\w{1,3}\d+', 'aaaa2')









    



"AB12[]9023" with regex ".*": MATCH
"123B" with regex "\d{1,3}B": MATCH
"aaa2" with regex "\w{1,3}\d+": MATCH
"aaaa2" with regex "\w{1,3}\d+": NOT FOUND



In [47]:

    
#http://path/ssh://dr9@farm3-login:/path
# [\w-]+ for the host, since \w alone does not match the hyphen in farm3-login
p = re.compile(r'http://(\w+)/ssh://(\w+)@([\w-]+):/(\w+)')



In [48]:

    
m = p.match(r'http://path/ssh://dr9@farm3-login:/path')



In [59]:

    
RE_SSH = re.compile(r'/ssh://(\w+)@(.+):(.+)/(?:chr)?([mxy0-9]{1,2}):(\d+)-(\d+)$', re.IGNORECASE)



In [65]:

    
RE_SSH = re.compile(r'/ssh://(\w+)@(.+)$', re.IGNORECASE)



In [69]:

    
t = '/ssh://dr9@farm3-login'



In [70]:

    
m = RE_SSH.match(t)
#user, server, path, lchr, lmin, lmax = m.groups()



In [72]:

    
for el in m.groups():
    print el









    



dr9
farm3-login

Exercise 2

Determine if a string contains "wazup" or "wazzup" or "wazzzup" where the number of z's must be greater than zero. Use the following list of strings:



In [73]:

    
L = [
'So I said wazzzzzzzup?',
'And she said wazup back to me',
'waup isn\'t a word',
'what is up',
'wazzzzzzzzzzzzzzzzzzzzzzzup']



In [74]:

    
wazup_regex = re.compile(r'.*waz+up.*')
matches = [el for el in L if wazup_regex.match(el)]
print matches









    



['So I said wazzzzzzzup?', 'And she said wazup back to me', 'wazzzzzzzzzzzzzzzzzzzzzzzup']
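Because the question is whether a string contains the word, search is a natural fit and removes the need for the leading and trailing .* in the pattern. A sketch of that variant on the same list:

```python
import re

L = [
    'So I said wazzzzzzzup?',
    'And she said wazup back to me',
    'waup isn\'t a word',
    'what is up',
    'wazzzzzzzzzzzzzzzzzzzzzzzup']

# z+ requires at least one z; search() looks anywhere in the string.
wazup_regex = re.compile(r'waz+up')
matches = [el for el in L if wazup_regex.search(el)]
print(matches)
```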

Example

We have a list of strings and some of these contain names that we want to extract. The names have the format

0123_FirstName_LastName

where the number of digits at the beginning of the string is variable (e.g. 1_Bob_Smith, 12_Bob_Smith and 123456_Bob_Smith are all valid).



In [82]:

    
L = [
'123_George_Washington',
'Blah blah',
'894542342_Winston_Churchill',
'More blah blah',
'String_without_numbers']

Don't worry if the following regex looks cryptic; it can be broken down piece by piece.



In [83]:

    
p = re.compile(r'\d+_([A-Za-z]+)_([A-Za-z]+)')



In [85]:

    
for el in L:
    m = p.match(el)
    if m:
        print m.groups()









    



('George', 'Washington')
('Winston', 'Churchill')

Exercise 3

Find all occurrences of AGT within a string of DNA, where contiguous repeated occurrences should be counted only once (e.g. AGTAGTAGT will be counted once and not three times).



In [75]:

    
dna = 'AGTAGTACTACAAGTAGTCCAGTCCTTGGGAGTAGTAGTAGTAAGGGCCT'



In [76]:

    
p = re.compile(r'(AGT)+')
m = p.finditer(dna)
for match in m:
    print '(start, stop): {}'.format(match.span())
    print 'matching string: {}'.format(match.group())









    



(start, stop): (0, 6)
matching string: AGTAGT
(start, stop): (12, 18)
matching string: AGTAGT
(start, stop): (20, 23)
matching string: AGT
(start, stop): (30, 42)
matching string: AGTAGTAGTAGT
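If only the number of runs is needed, the iterator can simply be counted. The sketch below also shows a detail that is easy to trip over: with a capture group in the pattern, findall returns the group (the last AGT of each run) rather than the whole run, whereas a non-capturing group (?:...) returns the full match.

```python
import re

dna = 'AGTAGTACTACAAGTAGTCCAGTCCTTGGGAGTAGTAGTAGTAAGGGCCT'

p = re.compile(r'(AGT)+')
count = sum(1 for _ in p.finditer(dna))
print(count)

# findall with a capture group returns only the captured text
captured = p.findall(dna)
print(captured)

# a non-capturing group makes findall return the whole run
runs = re.findall(r'(?:AGT)+', dna)
print(runs)
```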




Exercise 4

A text file contains some important information about a test that has been run. The individual who wrote this file is inconsistent with date formats.



In [80]:

    
L = [
'Test 1-2 commencing 2012-12-12 for multiple reads.',
'Date of birth of individual 803232435345345 is 1983/06/27.',
'Test 1-2 complete 20130420.']

Convert all dates to the format YYYYMMDD.

Hints:

Use groups ()
Use {m,n} where m=n=2 or m=n=4
Use ? for the bits between date components
You can use either search or match, though with the latter you will need to specify what happens before and after the date (.* maybe?)
The second element in the list will present you with issues as there is a number there that may accidentally be captured as a date. Use \D to make sure your date is not surrounded by decimal digits.



In [78]:

    
p = re.compile(r'\D+\d{4,4}[-/]?\d{2,2}[-/]?\d{2,2}\D')



In [85]:

    
date_regex = re.compile(r'\D(\d{4,4})[-/]?(\d{2,2})[-/]?(\d{2,2})\D')
standard_dates = []
for el in L:
    m = date_regex.search(el)
    if m:
        standard_dates.append(''.join(m.groups()))
print standard_dates









    



['20121212', '19830627', '20130420']
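One limitation of guarding the date with \D on both sides is that \D has to consume a character, so a date at the very start or end of a string would be missed. A possible variant, sketched here as an assumption rather than as part of the exercise, uses the lookaround assertions (?<!\d) and (?!\d), which only check that no digit is adjacent. The sample strings are made up.

```python
import re

date_regex = re.compile(r'(?<!\d)(\d{4,4})[-/]?(\d{2,2})[-/]?(\d{2,2})(?!\d)')

samples = [
    '20130420',                  # date fills the whole string
    'Born 1983/06/27',           # date at the very end
    'ID 803232435345345 only',   # long number, not a date
]

standard_dates = []
for s in samples:
    m = date_regex.search(s)
    if m:
        standard_dates.append(''.join(m.groups()))
print(standard_dates)
```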

Resources

Resource	Description
https://docs.python.org/2/howto/regex.html	A great in-depth tutorial from the official Python documentation.
https://www.regex101.com/#python	A useful online tool to quickly test regular expressions.
http://regexcrossword.com/	A nice way to practice your regular expression skills.



In [16]:

    
text = 'abcd \e'



In [17]:

    
print text









    



abcd \e



In [15]:

    
re.compile(r'\\')









    Out[15]:





re.compile(r'\\')


