Text Processing: Regular Expression Syntax

A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, used mainly for pattern matching in strings, for example in "find and replace" operations. Consider asking a computer to find all email addresses in a document. How would you go about this problem? Perhaps you would break an email address into its elements: some characters that are not spaces, followed by @, followed by some other characters that are not spaces. How would you tell a computer to look for this? It is for problems of this kind that the regular expression language was developed, beginning in the 1950s.
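
The description above translates almost word for word into a pattern. The following is a minimal sketch, assuming a made-up sample string (`text` and the addresses in it are hypothetical) and taking "not a space" literally with the set `[^ ]`. Real email validation is considerably more involved than this.

```python
import re

# Hypothetical sample text, for illustration only
text = "Contact alice@example.com or bob@example.org for details"

# One or more non-space characters, then @, then one or more non-space characters
emails = re.findall("[^ ]+@[^ ]+", text)
print(emails)  # ['alice@example.com', 'bob@example.org']
```

Note that this pattern would also pick up trailing punctuation (such as a comma directly after an address), because a comma is not a space.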



In [1]:

    
# standard library
import re

The re module provides many functions, such as re.search() and re.findall(). The regular expression language has many special metacharacters, such as $*$, $+$ and $?$. We start with re.findall() and $*$ as an example.

re.findall(pattern, string) Returns all non-overlapping matches of pattern in string, in the order in which they were found.
$*$ Matches 0 or more repetitions of the preceding character, as many as possible
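
Two phrases in these definitions are easy to skim past: "non-overlapping" and "as many as possible". A small sketch of both, using made-up input strings:

```python
import re

# "aa" occurs at positions 0, 1 and 2 of "aaaa", but matches may not overlap:
# after the match at positions 0-1, the scan resumes at position 2
pairs = re.findall("aa", "aaaa")
print(pairs)   # ['aa', 'aa']

# * is greedy: it consumes every available 'u' rather than stopping early
greedy = re.findall("fu*", "fuuun")
print(greedy)  # ['fuuu']
```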



In [3]:

    
re.findall("fun", "fuuun")









    Out[3]:





[]



In [5]:

    
re.findall("fu*n", "fuuun")









    Out[5]:





['fuuun']



In [6]:

    
re.findall("fu*n", "fn")









    Out[6]:





['fn']



In [8]:

    
re.findall("fu*n", "fn fun fuun fuuun fuuunnn")









    Out[8]:





['fn', 'fun', 'fuun', 'fuuun', 'fuuun']

More Metacharacters in Regular Expressions:

$.$ Matches any character except a new line
$*$ Matches 0 or more repetitions of the preceding character, as many as possible
$+$ Matches 1 or more repetitions of the preceding character, as many as possible
$?$ Matches 0 or 1 of the preceding character
$\{m\}$ Matches exactly m occurrences of the preceding character
$\{m,n\}$ Matches m to n occurrences of the preceding character, as many as possible
$\{m,n\}?$ Matches m to n occurrences of the preceding character, as few as possible
\ Escape character (e.g. \. matches a literal period rather than any character)
A|B Matches expression A or B
[ ] Used to indicate a set. In a set:
- You can list characters individually: [amk] will match a, m or k
- You can specify a range of characters with a dash: [a-z] will match lowercase letters, [A-Z] uppercase letters, [0-9] the digits.
- Special characters lose their special meanings in sets.
- A ^ as the first character negates the set. For example, [^\n] would match any character other than a new-line character.
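
Several of the metacharacters in this list are not exercised elsewhere in the notebook. The sketch below tries a few of them on short, made-up strings. The raw-string prefix `r"..."` is an addition here: it keeps Python from interpreting the backslash before the regex engine sees it.

```python
import re

word = "Mississippi"

# {m}: exactly two 's' in a row
exact = re.findall("s{2}", word)
print(exact)      # ['ss', 'ss']

# {m,n} takes as many as possible, {m,n}? as few as possible
most = re.findall("s{1,2}", word)
fewest = re.findall("s{1,2}?", word)
print(most)       # ['ss', 'ss']
print(fewest)     # ['s', 's', 's', 's']

# A|B: either alternative
either = re.findall("ss|pp", word)
print(either)     # ['ss', 'ss', 'pp']

# . matches any character except a newline; \. matches only a literal period
any_char = re.findall(r"a.c", "abc a.c")
literal = re.findall(r"a\.c", "abc a.c")
print(any_char)   # ['abc', 'a.c']
print(literal)    # ['a.c']

# Inside a set, special characters are literal: [.+] matches '.' or '+'
in_set = re.findall("[.+]", "1.5+2")
print(in_set)     # ['.', '+']
```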



In [13]:

    
# Exercise
# In the word Mississippi, find:
# 1. Groups of one or more 's'. This should return ['ss','ss']
word = 'Mississippi'
re.findall('s+',word)









    Out[13]:





['ss', 'ss']



In [14]:

    
# 2. Groups of 'i' followed by 0 or more 's'. This should return ['iss', 'iss', 'i', 'i']
re.findall('is*',word)









    Out[14]:





['iss', 'iss', 'i', 'i']



In [15]:

    
# 3. Groups of 'i' followed by 0 or one 's'. This should return ['is','is','i','i']
re.findall('is?',word)









    Out[15]:





['is', 'is', 'i', 'i']



In [16]:

    
# 4. An s followed by one or more non-linebreak characters followed by a p. This should return ['ssissipp']
re.findall('s[^\n]+p',word)









    Out[16]:





['ssissipp']



In [17]:

    
# 5. Groups of one or more characters in the set [is]. This should return ['ississi','i']
re.findall('[is]+',word)









    Out[17]:





['ississi', 'i']

Regular Expression Methods

re.findall(pattern, string) Return all non-overlapping matches of pattern in string, in the order in which they were found.
re.split(pattern, string) Split string at every match of pattern and return the resulting list of pieces.
string.join(list) Concatenate the items of list, with string placed between each of them. (This is a string method, not part of re; it is the natural inverse of re.split.)
re.sub(pattern, repl, string) Return string, with all non-overlapping matches of pattern replaced by repl.
re.search(pattern, string) Scan through string looking for the first location where pattern produces a match. Return a match object, or None if there is no match.
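
re.sub and the return value of re.search are described above without a small standalone example, so here is a minimal sketch on a made-up string (`tx` is hypothetical). Since re.search returns None when nothing matches, it is safest to check the result before calling methods on it.

```python
import re

tx = "9g of fat, 8g of sugar"

# re.sub: replace every run of digits with '#'
masked = re.sub("[0-9]+", "#", tx)
print(masked)  # #g of fat, #g of sugar

# re.search: a match object for the first match only
m = re.search("[0-9]+g", tx)
if m is not None:
    print(m.group(), m.start(), m.end())  # 9g 0 2

# No match: the result is None, so calling .group() on it would raise an error
no_match = re.search("[0-9]+kg", tx)
print(no_match)  # None
```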



In [21]:

    
# Split "This is a sentence" into a list of words.
sen = "This is a sentence"
l_words = re.split(' ',sen)



In [22]:

    
# Join the list of words into one string, with spaces separating the words.
' '.join(l_words)









    Out[22]:





'This is a sentence'
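
Splitting on a single literal space works for tidy input, but the first argument of re.split is a full pattern. A small sketch, assuming a hypothetical input with irregular spacing, shows why that matters:

```python
import re

messy = "This  is   a sentence"  # irregular spacing (made-up input)

# Splitting on exactly one space leaves empty strings where spaces repeat
naive = re.split(" ", messy)
print(naive)   # ['This', '', 'is', '', '', 'a', 'sentence']

# Splitting on one or more spaces gives clean words
words = re.split(" +", messy)
print(words)   # ['This', 'is', 'a', 'sentence']

# Joining with a single space normalizes the sentence
clean = " ".join(words)
print(clean)   # This is a sentence
```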



In [59]:

    
# grouping
document = """
CLIENT: PHILIPP EISENHAUER
GENDER: M
"""
rst = re.search('(?P<garbage>\nCLIENT: )(?P<name>[^\n]+)',document)
rst.group('name')









    Out[59]:





'PHILIPP EISENHAUER'
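
The same idea extends to several fields at once. The sketch below is one possible variation, not the only way to write it: it leaves the "CLIENT: " prefix outside any group (so no throwaway group is needed), adds a second named group `gender`, and uses groupdict() to collect all named groups in a dictionary.

```python
import re

document = """
CLIENT: PHILIPP EISENHAUER
GENDER: M
"""

# Text outside the parentheses must match but is not captured
pattern = "CLIENT: (?P<name>[^\n]+)\nGENDER: (?P<gender>[^\n]+)"
rst = re.search(pattern, document)

if rst is not None:
    print(rst.group("name"))    # PHILIPP EISENHAUER
    print(rst.group("gender"))  # M
    fields = rst.groupdict()    # all named groups as a dict
    print(fields)
```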



In [62]:

    
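# re(?=str) matches re only where it is immediately followed by str (str is not part of the match)
tx = "The fat content of fried chicken has decreased 11 percent from 9g to 8g per bite"
re.findall('[0-9]+(?=g)',tx)  # gives ['9', '8'], but only the last expression in a cell is displayed
# re(?!str) matches re only where it is not immediately followed by str
re.findall('[0-9]+(?!g)',tx)
# (?<=str)re matches re only where it is immediately preceded by str
# (?<!str)re matches re only where it is not immediately preceded by str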









    Out[62]:





['11']
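
The two lookbehind forms are listed in the comments but not run. A short sketch on the same sentence, printing each result explicitly so that none is hidden by the notebook's last-expression rule:

```python
import re

tx = "The fat content of fried chicken has decreased 11 percent from 9g to 8g per bite"

# Lookahead: digits immediately followed by 'g'
followed_by_g = re.findall("[0-9]+(?=g)", tx)
print(followed_by_g)    # ['9', '8']

# Lookbehind: digits immediately preceded by 'from '
after_from = re.findall("(?<=from )[0-9]+", tx)
print(after_from)       # ['9']

# Negative lookbehind: digits not immediately preceded by 'from '
not_after_from = re.findall("(?<!from )[0-9]+", tx)
print(not_after_from)   # ['11', '8']
```

In Python's re module a lookbehind must have a fixed width, so `(?<=from )` is allowed but something like `(?<=from +)` is not.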

Working with the Operating System



In [63]:

    
import os
# get the current working directory
os.getcwd()









    Out[63]:





'/home/vagrant/Documents/Bootcamp_Notebooks'



In [65]:

    
# list all the items in the current directory
os.listdir(os.getcwd())









    Out[65]:





['.ipynb_checkpoints',
 'Lecture3_Numpy.ipynb',
 'Untitled1.ipynb',
 'Lecture4_SciPy.ipynb',
 'Untitled0.ipynb']



In [74]:

    
# change working directory
os.chdir('/home')
os.chdir('/home/vagrant/Documents/Bootcamp_Notebooks')
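
The paths above are specific to one machine. A more portable pattern is to remember the starting directory and always return to it. The sketch below assumes a scratch directory created with tempfile and a hypothetical file name `notes.txt`; neither is part of the original notebook.

```python
import os
import shutil
import tempfile

start = os.getcwd()        # remember where we started
tmp = tempfile.mkdtemp()   # scratch directory, created just for this example

try:
    os.chdir(tmp)
    open("notes.txt", "w").close()   # create an empty file (hypothetical name)
    items = os.listdir(os.getcwd())
    print(items)                     # ['notes.txt']
finally:
    os.chdir(start)                  # always return to the original directory
    shutil.rmtree(tmp)               # remove the scratch directory and its contents
```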

Working with Files



In [120]:

    
# open a file in the current directory with write-only mode
file = open('file.txt','w')
file.write("Hello World!")
file.write("\n1\n2")
file.close()



In [121]:

    
# open a file in the current directory with read-only mode
file = open('file.txt','r')
file.read()









    Out[121]:





'Hello World!\n1\n2'



In [122]:

    
# open a file in the current directory with append mode
file = open('file.txt','a')
file.write("\nappend")
file.close()
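
The cells above open and close the file by hand. A common alternative is the `with` statement, which closes the file automatically even if an error occurs inside the block. The sketch below repeats the same write, append and read steps; the temporary directory and the name `f` are choices made for this example only.

```python
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()                 # scratch directory for the example
path = os.path.join(tmp, "file.txt")

with open(path, "w") as f:               # write mode: creates or truncates
    f.write("Hello World!")
    f.write("\n1\n2")

with open(path, "a") as f:               # append mode: adds to the end
    f.write("\nappend")

with open(path, "r") as f:               # read mode
    lines = f.read().split("\n")

print(lines)   # ['Hello World!', '1', '2', 'append']
print(f.closed)  # True: the file was closed on leaving the with block

shutil.rmtree(tmp)                       # clean up
```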



In [119]:

    
# delete the file with a shell command (rm is Unix-only); the returned 0 is the command's exit status
os.system("rm file.txt")









    Out[119]:





0
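
Calling `rm` through the shell works only where that command exists. As a hedged alternative, os.remove deletes a file from Python directly and raises an exception on failure instead of returning a status code. The temporary file below is created only for the demonstration.

```python
import os
import tempfile

# Create a throwaway file to delete
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

existed_before = os.path.exists(path)
os.remove(path)                      # portable replacement for "rm"
exists_after = os.path.exists(path)

print(existed_before, exists_after)  # True False
```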
