Text Processing: Regular Expression Syntax

A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, used mainly for pattern matching in strings, for example in "find and replace" operations. Consider asking a computer to find all email addresses in a document. How would you go about this problem? Perhaps you would break an email address into its elements: some characters that are not spaces, followed by @, followed by some more characters that are not spaces. How would you tell a computer to look for this? It is for these types of problems that the regular expression language began to be developed in the 1950s.


In [1]:
# standard library
import re

The re module has many functions, such as re.search() and re.findall(). The regular expression language has many special metacharacters, such as *, + and ?. We will use re.findall() and * as a first example.

  • re.findall(pattern, string) Returns all non-overlapping matches of pattern in string, in the order in which they were found.
  • * Matches 0 or more repetitions of the preceding character, as many as possible

In [3]:
re.findall("fun", "fuuun")


Out[3]:
[]

In [5]:
re.findall("fu*n", "fuuun")


Out[5]:
['fuuun']

In [6]:
re.findall("fu*n", "fn")


Out[6]:
['fn']

In [8]:
re.findall("fu*n", "fn fun fuun fuuun fuuunnn")


Out[8]:
['fn', 'fun', 'fuun', 'fuuun', 'fuuun']
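The last element of the output above is 'fuuun' rather than 'fuuunnn' because the pattern ends at the first n it reaches; the trailing 'nn' is simply not part of the match. A minimal sketch that makes this visible, assuming re.finditer() (not used elsewhere in this notebook) to report where each match starts and ends:

```python
import re

text = "fn fun fuun fuuun fuuunnn"

# re.finditer yields match objects, so the position of each match is available
spans = [(m.start(), m.end(), m.group(0)) for m in re.finditer("fu*n", text)]

# the part of the string left over after the last match
leftover = text[spans[-1][1]:]
```

Here spans[-1] is (18, 23, 'fuuun') and leftover is 'nn'.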

More Metacharacters in Regular Expressions:

  • . Matches any character except a new line
  • * Matches 0 or more repetitions of the preceding character, as many as possible
  • + Matches 1 or more repetitions of the preceding character, as many as possible
  • ? Matches 0 or 1 of the preceding character
  • {m} Matches m occurrences of the preceding character
  • {m,n} Matches m to n occurrences of the preceding character, as many as possible
  • {m,n}? Matches m to n occurrences of the preceding character, as few as possible
  • \ Escape character (e.g. \. matches a literal period rather than any character)
  • A|B Matches expression A or B
  • [ ] Used to indicate a set. In a set:
    • You can list characters individually: [amk] will match a, m or k
    • You can specify ranges of characters with a dash: [a-z] will match lowercase letters, [A-Z] uppercase, [0-9] the digits.
    • Most special characters lose their special meaning in sets.
    • A leading ^ negates the set. For example, [^\n] would match any character other than a new-line character.
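A few of the metacharacters above can be checked on small strings. The snippet below is only an illustrative sketch; the test strings are made up, and raw strings (r"...") are used so that backslashes reach the regex engine unchanged.

```python
import re

text = "aaaa"

exact = re.findall(r"a{2}", text)      # exactly two a's at a time
greedy = re.findall(r"a{2,3}", text)   # two to three a's, as many as possible
lazy = re.findall(r"a{2,3}?", text)    # two to three a's, as few as possible

either = re.findall(r"cat|dog", "cat, dog, cow")  # alternation with |
dots = re.findall(r"\.", "a.b.c")                 # escaped period matches only '.'
vowels = re.findall(r"[aeiou]", "regex")          # a set of individual characters
```

With these inputs, exact and lazy are both ['aa', 'aa'], greedy is ['aaa'], either is ['cat', 'dog'], dots is ['.', '.'] and vowels is ['e', 'e'].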

In [13]:
# Exercise
# In the word Mississippi, find:
# 1. Groups of one or more 's'. This should return ['ss','ss']
word = 'Mississippi'
re.findall('s+',word)


Out[13]:
['ss', 'ss']

In [14]:
# 2. Groups of 'i' followed by 0 or more 's'. This should return ['iss', 'iss', 'i', 'i']
re.findall('is*',word)


Out[14]:
['iss', 'iss', 'i', 'i']

In [15]:
# 3. Groups of 'i' followed by 0 or 1 's'. This should return ['is','is','i','i']
re.findall('is?',word)


Out[15]:
['is', 'is', 'i', 'i']

In [16]:
# 4. An s followed by one or more non-linebreak characters followed by a p. This should return ['ssissipp']
re.findall('s[^\n]+p',word)


Out[16]:
['ssissipp']

In [17]:
# 5. Groups of one or more characters in the set [is]. This should return ['ississi','i']
re.findall('[is]+',word)


Out[17]:
['ississi', 'i']
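With sets and repetition in hand, the email question from the introduction can be sketched. This is a deliberately naive pattern that follows the informal description (non-space characters, then @, then more non-space characters); it additionally excludes @ from both sides, which is an assumption made here, and it is not a full email validator. The sample text and addresses are made up.

```python
import re

# hypothetical sample text, for illustration only
text = "Contact alice@example.com or bob@example.org for details."

# one or more characters that are neither whitespace nor '@',
# then '@', then one or more such characters again
emails = re.findall(r"[^\s@]+@[^\s@]+", text)
```

For this text, emails is ['alice@example.com', 'bob@example.org']. An address followed directly by punctuation would pick up that punctuation, which is one reason real-world patterns are stricter.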

Regular Expression Methods

  • re.findall(pattern, string) Return all non-overlapping matches of pattern in string, in the order in which they were found
  • re.split(pattern, string) Split string at each occurrence of pattern and return the resulting list of pieces.
  • string.join(list) Concatenate items of a list with string between each of them.
  • re.sub(pattern, repl, string) Return string, with all instances of pattern replaced by repl.
  • re.search(pattern, string) Scan through string for the first location where pattern produces a match. Return a match object, or None if there is no match.

In [21]:
# Split "This is a sentence" into a list of words.
sen = "This is a sentence"
l_words = re.split(' ',sen)

In [22]:
# Join the list of words into one string, with spaces separating the words.
' '.join(l_words)


Out[22]:
'This is a sentence'
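The method list also mentions re.sub() and re.search(). A short sketch on the same sentence, with a replacement word and search patterns chosen only for illustration:

```python
import re

sen = "This is a sentence"

# replace every occurrence of the pattern with the replacement string
replaced = re.sub(r"sentence", "string", sen)

# re.search returns a match object for the first match, or None
match = re.search(r"is", sen)
if match is not None:
    # the first "is" sits inside "This"
    found = (match.group(0), match.start(), match.end())

missing = re.search(r"xyz", sen)  # no match, so this is None
```

Here replaced is 'This is a string', found is ('is', 2, 4), and missing is None. Checking for None before calling .group() avoids an AttributeError when nothing matches.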

In [59]:
# grouping
document = """
CLIENT: PHILIPP EISENHAUER
GENDER: M
"""
rst = re.search('(?P<garbage>\nCLIENT: )(?P<name>[^\n]+)',document)
rst.group('name')


Out[59]:
'PHILIPP EISENHAUER'
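Named groups become more useful when several fields are captured at once. The sketch below extends the same document to pull out both fields; the group name gender and the use of groupdict() are additions made here for illustration.

```python
import re

document = """
CLIENT: PHILIPP EISENHAUER
GENDER: M
"""

# two named groups, one per line of the record
rst = re.search(r"CLIENT: (?P<name>[^\n]+)\nGENDER: (?P<gender>[^\n]+)", document)

# groupdict() returns all named groups as a dictionary
fields = rst.groupdict()
```

For this document, fields is {'name': 'PHILIPP EISENHAUER', 'gender': 'M'}.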

In [62]:
# X(?=Y)  positive lookahead: match X only if it is immediately followed by Y
tx = "The fat content of fried chicken has decreased 11 percent from 9g to 8g per bite"
re.findall('[0-9]+(?=g)',tx)
# X(?!Y)  negative lookahead: match X only if it is not immediately followed by Y
re.findall('[0-9]+(?!g)',tx)
# (?<=Y)X positive lookbehind: match X only if it is immediately preceded by Y
# (?<!Y)X negative lookbehind: match X only if it is not immediately preceded by Y


Out[62]:
['11']
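Only the result of the last expression in a cell is displayed, so the positive lookahead result is not visible above, and the two lookbehind forms are listed without an example. A small sketch on the same sentence, where the lookbehind string 'from ' is chosen here purely for illustration:

```python
import re

tx = "The fat content of fried chicken has decreased 11 percent from 9g to 8g per bite"

followed_by_g = re.findall(r"[0-9]+(?=g)", tx)      # digits followed by 'g'
not_followed_by_g = re.findall(r"[0-9]+(?!g)", tx)  # digits not followed by 'g'

after_from = re.findall(r"(?<=from )[0-9]+", tx)      # digits preceded by 'from '
not_after_from = re.findall(r"(?<!from )[0-9]+", tx)  # digits not preceded by 'from '
```

With this sentence, followed_by_g is ['9', '8'], not_followed_by_g is ['11'], after_from is ['9'] and not_after_from is ['11', '8']. The lookaround text itself is never included in the returned match.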

Working with the Operating System


In [63]:
import os
# get the current working directory
os.getcwd()


Out[63]:
'/home/vagrant/Documents/Bootcamp_Notebooks'

In [65]:
# list all the items in the current directory
os.listdir(os.getcwd())


Out[65]:
['.ipynb_checkpoints',
 'Lecture3_Numpy.ipynb',
 'Untitled1.ipynb',
 'Lecture4_SciPy.ipynb',
 'Untitled0.ipynb']
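Directory listings combine naturally with regular expressions. The sketch below filters the file names shown in the output above; the list is copied in as a literal so the example does not depend on any particular directory, and the pattern is one plausible choice rather than the only one.

```python
import re

# file names copied from the listing above
names = ['.ipynb_checkpoints',
         'Lecture3_Numpy.ipynb',
         'Untitled1.ipynb',
         'Lecture4_SciPy.ipynb',
         'Untitled0.ipynb']

# keep only names that contain "Lecture", one or more digits, and an underscore
lectures = [n for n in names if re.search(r"Lecture[0-9]+_", n)]

# pull the lecture number out of each remaining name
numbers = [re.findall(r"[0-9]+", n)[0] for n in lectures]
```

Here lectures is ['Lecture3_Numpy.ipynb', 'Lecture4_SciPy.ipynb'] and numbers is ['3', '4'].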

In [74]:
# change working directory
os.chdir('/home')
os.chdir('/home/vagrant/Documents/Bootcamp_Notebooks')
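Hard-coding the path to return to works only on one machine. A more portable sketch remembers the starting directory and restores it afterwards; the system temporary directory is used here only as a stand-in for any existing directory.

```python
import os
import tempfile

original = os.getcwd()          # remember where we started
target = tempfile.gettempdir()  # assumption: any existing directory would do

try:
    os.chdir(target)
    inside = os.getcwd()        # the working directory is now the target
finally:
    os.chdir(original)          # always return, even if an error occurred
```

The try/finally block ensures that the working directory is restored even when the code in between raises an exception.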

Working with Files


In [120]:
# open a file in the current directory in write-only mode
file = open('file.txt','w')
file.write("Hello World!")
file.write("\n1\n2")
file.close()

In [121]:
# open a file in the current directory in read-only mode
file = open('file.txt','r')
content = file.read()
file.close()  # release the file handle once the contents are read
content


Out[121]:
'Hello World!\n1\n2'

In [122]:
# open a file in the current directory in append mode
file = open('file.txt','a')
file.write("\nappend")
file.close()

In [119]:
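# delete the file with a shell command (Unix only; the return value is the exit status, 0 on success)
# os.remove('file.txt') is a portable alternative
os.system("rm file.txt")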


Out[119]:
0
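The write, append, read and delete steps above can also be written with the with statement, which closes the file automatically at the end of each block. This is a sketch under the assumption that a scratch directory from tempfile is acceptable; the notebook itself writes to the current directory.

```python
import os
import tempfile

# assumption: work in a fresh temporary directory instead of the current one
path = os.path.join(tempfile.mkdtemp(), "file.txt")

with open(path, "w") as f:   # write-only mode, creates or truncates the file
    f.write("Hello World!")
    f.write("\n1\n2")

with open(path, "a") as f:   # append mode, adds to the end of the file
    f.write("\nappend")

with open(path, "r") as f:   # read-only mode
    content = f.read()

lines = content.split("\n")

os.remove(path)                  # portable way to delete the file
os.rmdir(os.path.dirname(path))  # remove the now-empty scratch directory
```

After this runs, content is 'Hello World!\n1\n2\nappend' and lines is ['Hello World!', '1', '2', 'append'].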
