Regular Expressions

Motivation

  • Regular expressions are an integral part of many biomedical informatics research projects.
  • Essentially, regular expressions are compact pieces of code that allow you to match, extract, or substitute complex textual patterns.
  • Some applications include:
    • Decision support: extracting pertinent medical information from text archives (Is the patient’s digoxin drug level > 3?)
    • Public Health: extracting names, locations, dates of death from obituaries
    • NLP: Negation of findings and diseases in reports if a phrase is within six characters of a negation term

Are Regular Expressions Fun?

  • Maybe...But

    If you've heard about regular expressions before, you probably know how powerful they are; if you haven't, prepare to be amazed.
    You should note, however, that mastering regular expressions may be a bit tricky at first. Okay, very tricky actually. (Beginning Python: From Novice to Professional)

Before We Get Started

  • Always build regular expressions from raw strings
    • What is a raw string?
      • "This is not a raw string"
      • r"This is a raw string"
    • Raw strings leave backslashes alone: escape sequences such as \n are not converted into special characters.
    • This matters because regular expressions rely heavily on backslashes and other special characters.
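
A minimal sketch of the difference between ordinary and raw strings. The variable names are illustrative and not part of the original notebook:

```python
# "\n" in an ordinary string is a single newline character;
# in a raw string it stays as two characters: a backslash and an "n".
plain = "a\nb"
raw = r"a\nb"

print(len(plain))      # 3
print(len(raw))        # 4
print(raw == "a\\nb")  # True: the raw string keeps the backslash
```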

Getting Started, Real Slowly

  • A very simple regular expression can be created to search for a fixed string (e.g. "Brian")

Steps

  • import re
  • compile the regular expression
  • Use the resulting regular expression object (r1) to find all matches in some string (nameString)

In [ ]:
import re
nameString = \
 """Wendy, Brian, Karen, Charlene, Jeff. 
     wendy, brian, Karen, charlene, jeff"""
r1 = re.compile(r"""Brian""")
print (r1.findall(nameString))
  • If I want, I can make the regular expression case-insensitive with a COMPILATION FLAG.

In [ ]:
r1 = re.compile(r"""Brian""")
print (r1.findall(nameString))
r2 = re.compile(r"""Brian""", re.IGNORECASE)
print (r2.findall(nameString))
r3 = re.compile(r"""Brian""",re.I) # alias for IGNORECASE
print (r3.findall(nameString))

The Real Power of Regular Expressions...

  • Comes from metacharacters
  • Here is the full list of metacharacters: . ^ $ * + ? { } [ ] \ | ( )

Metacharacters: []

  • Square brackets are used to specify a "character class, which is a set of characters that you wish to match."

    [abc] [a-c] [a-z] [a-zA-Z] [a-zA-Z0-9]


In [ ]:
test = """abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN
                       OPQRSTUVWXYZ0123456789"""

r4 = re.compile(r"""[abc]""")

print (r4.findall(test))

In [ ]:
r5 = re.compile(r"""[a-c]""")
print (r5.findall(test))

In [ ]:
r6 = re.compile(r"""[a-z]""")
print (r6.findall(test))

In [ ]:
r7 = re.compile(r"""[a-zA-Z0-9]""")
print (r7.findall(test))

Character Class Size

  • The number of times a character or class is repeated can be specified by metacharacters

    • * matches zero or more repetitions:
      • [0-9]* (matches zero or more digits)
    • + matches one or more repetitions
    • ? matches zero or one repetition
    • {m,n} matches at least m and at most n repetitions
  • QUESTION: How would you express *, +, and ? with {m,n} syntax?

  • QUESTION: What would the following regular expressions match?

In [ ]:
r8 = re.compile(r"""a[bcd]*b""",re.I)

r9 = re.compile(r"""a{1,3}b""")
print (r8.findall(test))
print (r9.findall(test))
print(r9.findall("""aabaaabab"""))
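
To see the repetition metacharacters side by side, here is a small sketch on a made-up string (the string and variable names are assumptions for illustration, not from the notebook). Each pattern is an "a" followed by a different number of "b" characters:

```python
import re

sample = "a ab abb abbbb"  # illustrative string

star = re.findall(r"ab*", sample)         # "a" then zero or more "b"
plus = re.findall(r"ab+", sample)         # "a" then one or more "b"
optional = re.findall(r"ab?", sample)     # "a" then zero or one "b"
bounded = re.findall(r"ab{2,3}", sample)  # "a" then two or three "b"

print(star)      # ['a', 'ab', 'abb', 'abbbb']
print(plus)      # ['ab', 'abb', 'abbbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']
print(bounded)   # ['abb', 'abbb']
```

Note that the last two patterns stop consuming "b" characters once their upper limit is reached, so longer runs are only partly matched.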

Metacharacters in Character Classes

  • Within character classes, most metacharacters lose their special meaning.
  • That is, they are treated just like any other character to be matched.
    • Exceptions: - between two characters denotes a range, \ still escapes, and ^ as the first character in the class denotes the complement.
    • A complemented class matches all characters except those listed in the class.

In [ ]:
r10 = re.compile(r"""[^A-Z]""")
print (r10.findall(test))
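
As a quick check of how metacharacters behave inside a class, the sketch below uses a made-up expression string (an assumption for illustration). Inside the brackets, ., *, and + are matched literally, and a leading ^ inverts the class:

```python
import re

expr = "a.b*c+d"  # illustrative string

literal_class = re.compile(r"[.*+]")      # ., *, + are ordinary characters here
complement_class = re.compile(r"[^.*+]")  # everything except ., *, +

print(literal_class.findall(expr))     # ['.', '*', '+']
print(complement_class.findall(expr))  # ['a', 'b', 'c', 'd']
```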

Metacharacters (\)

  • The backslash character is used to escape all the metacharacters, so to search for a * you would type \*.
  • Try the regular expression without the backslash.

In [ ]:
r11 = re.compile(r"\*")
print (r11.findall("""The range of the function f1* is [0,12] 
                  and the domain of f1* is [0,144]"""))
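
One way to explore the unescaped case is sketched below. An unescaped * has nothing to repeat, so compiling it raises re.error. The sketch also shows re.escape from the standard library, which adds the backslashes automatically; the sample string is made up for illustration:

```python
import re

try:
    re.compile(r"*")  # unescaped *: a quantifier with nothing to repeat
    compiled = True
except re.error as err:
    compiled = False
    print("Invalid pattern:", err)

escaped = re.escape("f1*")  # re.escape backslash-escapes the *
print(escaped)
print(re.findall(escaped, "f1* and f1* again"))  # ['f1*', 'f1*']
```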
  • \d : Matches any decimal digit;
    • this is equivalent to the class [0-9].

In [ ]:
r12 = re.compile(r"\d")
print (r12.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108."))
  • \D : Matches any non-digit character;
    • this is equivalent to the class [^0-9].

In [ ]:
r13 = re.compile(r"\D")
print (r13.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108."))
  • \s : Matches any whitespace character;
    • this is equivalent to the class [ \t\n\r\f\v].

In [ ]:
r14 = re.compile(r"\s")
result14= r14.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108.")
print (result14)
r15 = re.compile(r"[ \t\n\r\f\v]")
result15 = r15.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108.")
print (result15)
print( result14 == result15)
  • \S : Matches any non-whitespace character;
    • this is equivalent to the class [^ \t\n\r\f\v].

In [ ]:
r16 = re.compile(r"\S")
r17 = re.compile(r"[^ \t\n\r\f\v]")
result16 = r16.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108.")
result17 = r17.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108.")
print (result16)
print (result17)
print (result16 == result17)
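  • \w : Matches any alphanumeric character or underscore;
    • for ASCII text this is equivalent to the class [a-zA-Z0-9_] (in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set).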

In [ ]:
r18 = re.compile(r"\w")
r19 = re.compile(r"[a-zA-Z0-9_]")
result18 = r18.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108.")
result19 = r19.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108.")
print (result18)
print (result19)
print (result18 == result19)
  • \W : Matches any character that is not alphanumeric or an underscore;
    • this is equivalent to the class [^a-zA-Z0-9_].

In [ ]:
r18 = re.compile(r"\W")
r19 = re.compile(r"[^a-zA-Z0-9_]")
result18 = r18.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108.")
result19 = r19.findall("My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108.")
print (result18)
print (result19)
print (result18 == result19)
print (',' in result18)
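
The equivalences above hold for the ASCII address string used in these cells. As a hedged aside, the sketch below (using a made-up string with one accented letter) shows where \w and the explicit class can differ for Python 3 text strings, and how the re.ASCII flag restores the equivalence:

```python
import re

word = "caf\u00e9_1"  # illustrative string: "café_1" with a non-ASCII letter

unicode_matches = re.findall(r"\w", word)          # includes the accented letter
class_matches = re.findall(r"[a-zA-Z0-9_]", word)  # ASCII letters, digits, _
ascii_matches = re.findall(r"\w", word, re.ASCII)  # \w restricted to ASCII

print(unicode_matches)
print(class_matches)
print(class_matches == ascii_matches)  # True
```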

Putting Things Together

  • How could we use regular expressions to recognize zip codes?

In [ ]:
address = "My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108."

findZipcode = re.compile(r"""\d{5,5}""")
findZipcode2 = re.compile(r"""[0-9]{5,5}""")
print (findZipcode.findall(address))
print (findZipcode2.findall(address))
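
A small sketch of two points about this pattern, with illustrative variable names. First, {5} is shorthand for {5,5}. Second, the pattern only asks for five consecutive digits, so it also matches the first five digits of a longer number (the order-number string is a made-up example):

```python
import re

address = "My work address is 729 Arapeen Drive, Salt Lake City, UT, 84108."

short_form = re.compile(r"\d{5}")   # {5} means exactly five repetitions
long_form = re.compile(r"\d{5,5}")

print(short_form.findall(address))  # ['84108']
print(short_form.findall(address) == long_form.findall(address))  # True

# Caveat: five digits inside a longer number are matched too.
print(short_form.findall("Order 1234567"))  # ['12345']
```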
  • How about finding telephones?

In [ ]:
with open("../Resources/contact.html", "r") as f:  # the file is closed automatically
    txt = f.read()
findPhone = re.compile(r"""[0-9]{3,3}-\d{4,4}|[0-9]{3,3}-[0-9]{3,3}-\d{4,4}""")
print (findPhone.findall(txt))
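
If contact.html is not at hand, the same idea can be tried on an inline string. The sample text below is hypothetical and the pattern is the one above written with \d{3} and \d{4} throughout. It is a sketch of the alternation, not a complete phone-number matcher:

```python
import re

# Hypothetical sample text standing in for the contents of contact.html.
sample = "Front desk: 555-1234. Direct line: 801-555-1234."

# Either a 7-digit number (ddd-dddd) or a 10-digit number (ddd-ddd-dddd).
find_phone = re.compile(r"\d{3}-\d{4}|\d{3}-\d{3}-\d{4}")
phones = find_phone.findall(sample)
print(phones)  # ['555-1234', '801-555-1234']
```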
  • How about finding names?
  • Let's look at two formats
    • Brian Chapman
    • Chapman, Brian

In [ ]:
findName1 = re.compile(r"""[A-Z][a-z]+\s+[A-Z][a-z]+""")
# print(findName1.findall("Brian Chapman, Wendy Chapman, Jeremiah Chapman"))
names1 = findName1.findall(txt)
print (names1[:80])

In [ ]:
findName2 = re.compile(r"""[A-Z][a-z]*,\s+[A-Z][a-z]*""")
names2 = findName2.findall(txt)
print (names2[:20])
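
The two name patterns can also be tried on a short inline string, which makes their behavior easier to inspect than the full HTML file. This sketch reuses both patterns under illustrative variable names. Note that the "Last, First" pattern also fires on a comma-separated list of "First Last" names, so the two formats can be confused:

```python
import re

sample = "Brian Chapman, Wendy Chapman, Jeremiah Chapman"

first_last = re.compile(r"[A-Z][a-z]+\s+[A-Z][a-z]+")   # e.g. Brian Chapman
last_first = re.compile(r"[A-Z][a-z]*,\s+[A-Z][a-z]*")  # e.g. Chapman, Brian

found_first_last = first_last.findall(sample)
found_last_first = last_first.findall(sample)

print(found_first_last)
# ['Brian Chapman', 'Wendy Chapman', 'Jeremiah Chapman']
print(found_last_first)
# ['Chapman, Wendy', 'Chapman, Jeremiah']
```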

In [ ]:
testString = """Brian has a nephew named Ben. Br. Chapman died yesterday. Brian Chapman Brian E. Chapman Brian Earl Chapman Wendy Webber Chapman Clare 1234 4321.1234
python python.org http://python.org www.python.org jython zython Brad Bob cpython brian http://www.python.org perl Perl PERL"""
  1. Find all strings that end in "ython"
  2. Find all instances of "Brian"
  3. Find all jython or python instances
  4. Find all python or perl instances
  5. Find all names that start with B

In [ ]:
rEx1 = re.compile(r"""[a-zA-Z]*ython""")
print (rEx1.findall(testString))

In [ ]:
rEx2 = re.compile(r"""Brian""",re.I)
print (rEx2.findall(testString))

In [ ]:
rEx3 = re.compile(r"""jython|python|ziggy|zoom""",re.I) # this illustrates an OR
print (rEx3.findall(testString))

In [ ]:
rEx4 = re.compile(r"""python|perl""") # this illustrates an OR
rEx4b = re.compile(r"""python|perl""",re.IGNORECASE) # this illustrates an OR

print (rEx4.findall(testString))
print (rEx4b.findall(testString))

In [ ]:
rEx5 = re.compile(r"""B[a-z]*""") # this illustrates an AND: "B" followed by zero or more lowercase letters
print (rEx5.findall(testString))

In-class Exercises

  1. Write a regular expression to extract the sequence ID from a fasta file.
  2. Write a regular expression to extract the sequence ID from a fastq file.
  3. Write a regular expression to extract date of death from obits.txt
  4. Write a regular expression to extract names from obits.txt
  5. Write a regular expression to extract place of residence from obits.txt
  6. Write regular expressions to extract %stenosis from us.txt
  7. Write a regular expression to identify dates in us.txt

In [ ]: