In [5]:
import re # Python's regular expression module
def re_test(regex, query):
    """A helper function to test if a regex matches at the start of a query."""
    p = re.compile(regex)
    result = 'MATCH' if p.match(query) else 'NOT FOUND'
    print('"{}" with regex "{}": {}'.format(query, regex, result))
A regular expression (also known as a RE, regex, regex pattern, or regexp) is a sequence of symbols and characters expressing a text pattern. A regular expression allows us to specify a string pattern that we can then search for within a body of text. The idea is to make a pattern template (regex), and then query some text to see if the template is present or not.
Let's say we want to determine if a string begins with the word PASS. Our regular expression will simply be:
In [71]:
pass_regex = 'PASS'
This pattern will match an occurrence of PASS in the query text. Because re_test uses match(), the pattern has to appear at the start of the text. Now let's test it out:
In [72]:
re_test(pass_regex, 'PASS: Data good')
In [73]:
re_test(pass_regex, 'FAIL: Data bad')
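The helper above relies on match(), which only looks at the beginning of the string. As a minimal sketch of how that differs from search(), the following compares the two on a made-up query ('Result: PASS' is an illustrative string, not part of the original data):

```python
import re

pass_regex = re.compile('PASS')

# match() only succeeds if the pattern is found at the very start.
starts_with_pass = pass_regex.match('PASS: Data good') is not None

# Here PASS is present, but not at the start, so match() finds nothing.
match_mid = pass_regex.match('Result: PASS') is not None

# search() scans the whole string and finds it anywhere.
search_mid = pass_regex.search('Result: PASS') is not None

print(starts_with_pass, match_mid, search_mid)
```

Use match() when the question is "does the string begin with this pattern?" and search() when the pattern may appear anywhere.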
In [24]:
lines = \
"""
Device-initialized.
Version-19.23
12-12-2014
12
4353
3452
ERROR
498
34598734
345982398
23
ERROR
3434345798
"""
We don't want the header lines, and those ERROR lines are going to ruin our analysis! Let's filter these out with a regex. First we will create the pattern template (or regex) for what we want to find:
^\d+$
This regex can be split into four parts:
^
This indicates the start of the string.
\d
This specifies we want to match decimal digits (the numbers 0-9).
+
This symbol means we want to find one or more of the previous symbol (which in this case is a decimal digit).
$
This indicates the end of the string.

Putting it all together, we want to find patterns that are one or more (+) digits (\d) from start (^) to finish ($).
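To see what the anchors contribute, here is a small sketch that compares the anchored pattern with an unanchored \d+ on a few of the lines from the data above. The variable names are illustrative only.

```python
import re

anchored = re.compile(r'^\d+$')   # digits from start to finish
unanchored = re.compile(r'\d+')   # digits anywhere in the string

samples = ['4353', '12-12-2014', 'Version-19.23', 'ERROR']

for s in samples:
    full = bool(anchored.search(s))
    partial = bool(unanchored.search(s))
    print('{!r}: anchored={}, unanchored={}'.format(s, full, partial))

anchored_hits = [s for s in samples if anchored.search(s)]
unanchored_hits = [s for s in samples if unanchored.search(s)]
```

Without ^ and $, the date and version header lines would slip through because they contain digits somewhere.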
Let's load the regex into Python's re module:
In [25]:
integer_regex = re.compile(r'^\d+$')
Now let's get our string of lines into a list of strings:
In [26]:
lines = lines.split()
print(lines)
Now we need to run through each of these lines and determine if it matches our regex. Converting to integer would be nice as well.
In [27]:
clean_data = []  # We will accumulate our filtered integers here
for line in lines:
    if integer_regex.match(line):
        clean_data.append(int(line))
print(clean_data)
# If you're into one-liners you could also do one of these:
# clean_data = [int(line) for line in lines if integer_regex.match(line)]
# clean_data = list(map(int, filter(integer_regex.match, lines)))
It worked like a dream. You may be arguing that there are other non-regex solutions to this problem, and indeed there are (for example, integer typecasting with a catch clause), but this example was given to show you the process: create the pattern, compile it with the re module, and test each line against it.
There will be situations where regexes really are the only viable solution, such as when you want to match some super-complex strings.
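For comparison, the non-regex alternative mentioned above (integer typecasting with a catch clause) might look roughly like this. It is a sketch on a shortened copy of the data, and raw_lines is a name introduced here for illustration.

```python
# A shortened copy of the instrument output, already split into lines.
raw_lines = ['Device-initialized.', 'Version-19.23', '12-12-2014',
             '12', 'ERROR', '4353']

clean_data = []
for line in raw_lines:
    try:
        clean_data.append(int(line))
    except ValueError:
        # int() rejects anything that is not an integer literal,
        # so header and ERROR lines are skipped here.
        pass

print(clean_data)
```

Note that the two filters are not strictly equivalent: int() also accepts a leading sign and surrounding whitespace, which the regex for one or more digits would reject.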
In [36]:
lines = \
"""
Acme-DNA-Reader
ACTG
AA
-1
CCTC
TTTCG
C
TGCTA
-1
TCCCCCC
"""
The -1 entries represent reading errors and we want these removed. Using the preceding example as a guide, filter out the header and the reading errors.
Hint: The bases can be represented with the pattern [ACGT].
In [37]:
bases_regex = re.compile('[ACGT]+$')
lines = lines.split()
# print(lines)
clean_data = []  # We will accumulate our filtered sequences here
for line in lines:
    print(line)
    if bases_regex.match(line):
        clean_data.append(line)
print(clean_data)
Regexps can appear cryptic but they can be decomposed into character classes and metacharacters.
Character classes allow us to concisely specify the types or classes of characters to match. In the example above, \d is a character class that represents decimal digits. There are many such character classes, and we will go through these below.
The square brackets allow us to specify a set of characters to match. We have already seen this with [ACGT]. We can also use the hyphen - to specify ranges.
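As a quick illustration of ranges inside square brackets, the sketch below matches hexadecimal strings. The candidate strings are made up for the example.

```python
import re

# Three ranges in one set: digits, lowercase a-f and uppercase A-F.
hex_regex = re.compile(r'^[0-9a-fA-F]+$')

candidates = ['ff00', 'DEADbeef', '12G4', 'xyz']
hex_strings = [c for c in candidates if hex_regex.match(c)]
print(hex_strings)

# Ranges are simply written side by side. A comma inside the brackets
# is not a separator; it is one more character in the set.
comma_matches = bool(re.match(r'^[A-Z,a-z]+$', 'a,b'))
print(comma_matches)
```

So to match letters only, write [A-Za-z] rather than [A-Z,a-z].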
Character Class | Description | Match Examples |
---|---|---|
\d | Matches any decimal digit; this is equivalent to the class [0-9]. | 0, 1, 2, ... |
\D | Matches any non-digit character; this is equivalent to the class [^0-9]. | a, @, ; |
\s | Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]. | space, tab, newline |
\S | Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]. | 1, A, & |
\w | Matches any alphanumeric character (word character); this is equivalent to the class [a-zA-Z0-9_]. | x, Z, 2 |
\W | Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]. | £, (, space |
. | Matches anything (except newline). | 8, (, a, space |
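One way to get a feel for these classes is to run each of them over the same string with re.findall, which returns every non-overlapping match. The sample text below is invented for this sketch.

```python
import re

text = 'Run_7 took 42 ms!'

digits = re.findall(r'\d', text)     # single decimal digits
spaces = re.findall(r'\s', text)     # whitespace characters
non_word = re.findall(r'\W', text)   # anything that is not a word character
words = re.findall(r'\w+', text)     # runs of word characters

print(digits)
print(spaces)
print(non_word)
print(words)
```

Note that the underscore counts as a word character, so Run_7 stays in one piece.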
This can look like a lot to remember, but there are some mnemonics here:
Character Class | Mnemonic |
---|---|
\d | decimal digit |
\D | uppercase so not \d |
\s | whitespace character |
\S | uppercase so not \s |
\w | word character |
\W | uppercase so not \w |
The character classes will match only a single character. How can we match, say, exactly 3 occurrences of Q? The metacharacters include different symbols to express repetition:
Repetition Metacharacter | Description |
---|---|
* | Matches zero or more occurrences of the previous character (class). |
+ | Matches one or more occurrences of the previous character (class). |
{m,n} | With integers m and n, specifies at least m and at most n occurrences of the previous character (class). Do not put a space after the comma, as this prevents the metacharacter from being recognized. |
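To answer the question about exactly 3 occurrences of Q, here is a small sketch using {m,n} with m=n=3 (Python's re also accepts the shorthand Q{3}). It also shows what the space after the comma does, and why * can match even when the character is absent. The test strings are illustrative.

```python
import re

# Exactly three Qs from start to finish.
exactly_three = re.compile(r'^Q{3,3}$')
three_flags = [bool(exactly_three.match(s)) for s in ['QQ', 'QQQ', 'QQQQ']]
print(three_flags)

# With a space after the comma the braces lose their special meaning
# and are matched as literal text instead.
spaced = re.compile(r'BA{1, 3}B')
print(spaced.match('BAAB'))                 # no repetition happens
print(bool(spaced.match('BA{1, 3}B')))      # the literal text matches

# * allows zero occurrences, so A* matches an empty string
# at the start of a query that contains no A at all.
empty_match = re.match(r'A*', 'Z12345').group()
print(repr(empty_match))
```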
In [38]:
re_test('A*', ' ')
re_test('A*', 'A')
re_test('A*', 'AA')
re_test('A*', 'Z12345')
In [39]:
re_test('A+', ' ')
re_test('A+', 'A')
re_test('A+', 'ZZZZ')
In [40]:
re_test('BA{1,3}B', 'BB')
re_test('BA{1,3}B', 'BAB')
re_test('BA{1,3}B', 'BAAAB')
re_test('BA{1,3}B', 'BAAAAAB')
In [41]:
re_test('.*', 'AB12[]9023')
re_test(r'\d{1,3}B', '123B')
re_test(r'\w{1,3}\d+', 'aaa2')
re_test(r'\w{1,3}\d+', 'aaaa2')
In [47]:
#http://path/ssh://dr9@farm3-login:/path
p = re.compile(r'http://(\w+)/ssh://(\w+)@([\w-]+):/(\w+)')  # [\w-] so the hyphen in farm3-login can match
In [48]:
m = p.match(r'http://path/ssh://dr9@farm3-login:/path')
In [59]:
RE_SSH = re.compile(r'/ssh://(\w+)@(.+):(.+)/(?:chr)?([mxy0-9]{1,2}):(\d+)-(\d+)$', re.IGNORECASE)
In [65]:
RE_SSH = re.compile(r'/ssh://(\w+)@(.+)$', re.IGNORECASE)
In [69]:
t = '/ssh://dr9@farm3-login'
In [70]:
m = RE_SSH.match(t)
#user, server, path, lchr, lmin, lmax = m.groups()
In [72]:
for el in m.groups():
    print(el)
In [73]:
L = [
'So I said wazzzzzzzup?',
'And she said wazup back to me',
'waup isn\'t a word',
'what is up',
'wazzzzzzzzzzzzzzzzzzzzzzzup']
In [74]:
wazup_regex = re.compile(r'.*waz+up.*')
matches = [el for el in L if wazup_regex.match(el)]
print(matches)
In [82]:
L = [
'123_George_Washington',
'Blah blah',
'894542342_Winston_Churchill',
'More blah blah',
'String_without_numbers']
Don't worry if the following regex looks cryptic; it will soon be broken down.
In [83]:
p = re.compile(r'\d+_([A-Za-z]+)_([A-Za-z]+)')
In [85]:
for el in L:
    m = p.match(el)
    if m:
        print(m.groups())
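The pattern above reads left to right as: one or more digits, a literal underscore, a run of letters, another underscore, and another run of letters. Each pair of parentheses captures the part it encloses. The sketch below is a variation that also captures the leading number (an assumption made here for illustration; the original pattern leaves it uncaptured) and shows how to pull out individual groups.

```python
import re

# Three capturing groups: the number, the first name and the surname.
p = re.compile(r'(\d+)_([A-Za-z]+)_([A-Za-z]+)')
m = p.match('123_George_Washington')

whole = m.group(0)     # group 0 is always the entire match
number = m.group(1)    # groups are numbered by their opening parenthesis
first = m.group(2)
last = m.group(3)
everything = m.groups()  # all captured groups as a tuple

print(whole)
print(number, first, last)
print(everything)
```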
In [75]:
dna = 'AGTAGTACTACAAGTAGTCCAGTCCTTGGGAGTAGTAGTAGTAAGGGCCT'
In [76]:
p = re.compile(r'(AGT)+')
m = p.finditer(dna)
for match in m:
    print('(start, stop): {}'.format(match.span()))
    print('matching string: {}'.format(match.group()))
In [77]:
p.finditer?
In [80]:
L = [
'Test 1-2 commencing 2012-12-12 for multiple reads.',
'Date of birth of individual 803232435345345 is 1983/06/27.',
'Test 1-2 complete 20130420.']
Convert all dates to the format YYYYMMDD.
Hints:
() to capture the date components
{m,n} where m=n=2 or m=n=4
? for the bits between date components
.* (maybe)?
\D to make sure your date is not surrounded by decimal digits.
In [78]:
p = re.compile(r'\D+\d{4,4}[-/]?\d{2,2}[-/]?\d{2,2}\D')
In [85]:
date_regex = re.compile(r'\D(\d{4,4})[-/]?(\d{2,2})[-/]?(\d{2,2})\D')
standard_dates = []
for el in L:
    m = date_regex.search(el)
    if m:
        standard_dates.append(''.join(m.groups()))
print(standard_dates)
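If the goal is to rewrite the dates in place rather than extract them, one possible approach is re.sub with backreferences. This is a sketch under the same assumptions as the pattern above: the surrounding non-digit characters are captured as extra groups so they can be put back, and date_in_text and rewritten are names introduced here.

```python
import re

L = [
    'Test 1-2 commencing 2012-12-12 for multiple reads.',
    'Date of birth of individual 803232435345345 is 1983/06/27.',
    'Test 1-2 complete 20130420.']

# Groups 1 and 5 hold the non-digit neighbours; groups 2-4 hold the date.
date_in_text = re.compile(r'(\D)(\d{4})[-/]?(\d{2})[-/]?(\d{2})(\D)')

# \1 ... \5 in the replacement refer back to the captured groups,
# so the separators are dropped and everything else is kept.
rewritten = [date_in_text.sub(r'\1\2\3\4\5', el) for el in L]

for line in rewritten:
    print(line)
```

Because \D has to consume a character on each side, a date sitting at the very start or end of a string would not be matched by this pattern.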
Resource | Description |
---|---|
https://docs.python.org/2/howto/regex.html | A great in-depth tutorial from the official Python documentation. |
https://www.regex101.com/#python | A useful online tool to quickly test regular expressions. |
http://regexcrossword.com/ | A nice way to practice your regular expression skills. |
In [16]:
text = 'abcd \e'
In [17]:
print(text)
In [15]:
re.compile(r'\\')