A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, used mainly for matching patterns in strings, for example in "find and replace" operations. Consider asking a computer to find all email addresses in a document. How would you approach this problem? You might break an email address into its parts: some characters that are not spaces, followed by @, followed by more characters that are not spaces. How would you tell a computer to look for this? The regular expression language was developed for problems of this kind, beginning in the 1950s.
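The email description above translates almost word for word into a pattern. The following is a minimal sketch, not a complete email validator; the sample text and addresses are made up for illustration.

```python
import re

# Hypothetical sample text; the addresses are invented for this example.
text = "Contact alice@example.com or bob@example.org for details."

# \S+ matches one or more non-space characters, so \S+@\S+ reads:
# non-space characters, then "@", then more non-space characters.
emails = re.findall(r"\S+@\S+", text)
print(emails)
```

Because \S matches any non-space character, punctuation that directly follows an address (such as a full stop at the end of a sentence) would be included in the match. Real-world email matching usually needs a stricter pattern.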
In [1]:
# standard library
import re
The re module has many functions, such as re.search() and re.findall(). The regular expression language has many special metacharacters, such as *, +, and ?. We will start with re.findall() and *, which means zero or more repetitions of the preceding character.
In [3]:
re.findall("fun", "fuuun")
Out[3]:
[]
In [5]:
re.findall("fu*n", "fuuun")
Out[5]:
['fuuun']
In [6]:
re.findall("fu*n", "fn")
Out[6]:
['fn']
In [8]:
re.findall("fu*n", "fn fun fuun fuuun fuuunnn")
Out[8]:
['fn', 'fun', 'fuun', 'fuuun', 'fuuun']
More Metacharacters in Regular Expressions:
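As a quick reference for the metacharacters used in the rest of this section, here is a small sketch that applies several of them to one string. The sample string and variable names are chosen only for illustration.

```python
import re

s = "cat cot coat ct"

star = re.findall(r"co*t", s)       # * : zero or more 'o'
plus = re.findall(r"co+t", s)       # + : one or more 'o'
optional = re.findall(r"co?t", s)   # ? : zero or one 'o'
dot = re.findall(r"c.t", s)         # . : any single character except a newline
in_set = re.findall(r"c[ao]t", s)   # [ao] : one character from the set
not_in_set = re.findall(r"c[^a]t", s)  # [^a] : one character not in the set

for name, result in [("*", star), ("+", plus), ("?", optional),
                     (".", dot), ("[ao]", in_set), ("[^a]", not_in_set)]:
    print(name, result)
```

Note that "coat" never matches here: each pattern allows at most one kind of character between "c" and "t", while "coat" has both "o" and "a" in between.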
In [13]:
# Exercise
# In the word Mississippi, find:
# 1. Groups of one or more 's'. This should return ['ss','ss']
word = 'Mississippi'
re.findall('s+',word)
Out[13]:
['ss', 'ss']
In [14]:
# 2. Groups of 'i' followed by 0 or more 's'. This should return ['iss', 'iss', 'i', 'i']
re.findall('is*',word)
Out[14]:
['iss', 'iss', 'i', 'i']
In [15]:
# 3. Groups of 'i' followed by 0 or one 's'. This should return ['is','is','i','i']
re.findall('is?',word)
Out[15]:
['is', 'is', 'i', 'i']
In [16]:
# 4. An s followed by one or more non-linebreak characters followed by a p. This should return ['ssissipp']
re.findall('s[^\n]+p',word)
Out[16]:
['ssissipp']
In [17]:
# 5. Groups of one or more characters in the set [is]. This should return ['ississi','i']
re.findall('[is]+',word)
Out[17]:
['ississi', 'i']
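Exercise 4 returns one long match because + is greedy: it consumes as many characters as it can and then backs up to the last possible 'p'. As a sketch of the alternative, appending ? to a quantifier makes it lazy, so it stops at the first 'p' it can reach.

```python
import re

word = "Mississippi"

# Greedy: + takes as much as possible, then backtracks to the last 'p'.
greedy = re.findall(r"s[^\n]+p", word)

# Lazy: +? takes as little as possible, stopping at the first 'p'.
lazy = re.findall(r"s[^\n]+?p", word)

print(greedy)
print(lazy)
```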
Regular Expression Methods
In [21]:
# Split "This is a sentence" into a list of words.
sen = "This is a sentence"
l_words = re.split(' ',sen)
In [22]:
# Join the list of words into one string, with spaces separating the words.
' '.join(l_words)
Out[22]:
'This is a sentence'
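Splitting on a single literal space works for tidy input, but re.split() also accepts a full pattern as the delimiter. The sketch below uses a made-up string with irregular whitespace to show the difference.

```python
import re

# Hypothetical input with double spaces and a tab between words.
messy = "This  is\ta   sentence"

# A single-space delimiter leaves empty strings and misses the tab.
naive = re.split(" ", messy)

# \s+ treats any run of whitespace as one delimiter.
words = re.split(r"\s+", messy)

print(naive)
print(" ".join(words))
```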
In [59]:
# grouping: (?P<name>...) defines a named group that can be retrieved by name
document = """
CLIENT: PHILIPP EISENHAUER
GENDER: M
"""
rst = re.search('(?P<garbage>\nCLIENT: )(?P<name>[^\n]+)',document)
rst.group('name')
Out[59]:
'PHILIPP EISENHAUER'
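Named groups become more useful when a pattern captures several fields at once. The following sketch extends the example above to capture the gender as well; the pattern is one possible way to write this, and it assumes the two fields always appear on consecutive lines in this order.

```python
import re

document = """
CLIENT: PHILIPP EISENHAUER
GENDER: M
"""

# Two named groups; the literal labels are matched but not captured.
pattern = r"CLIENT: (?P<name>[^\n]+)\nGENDER: (?P<gender>[^\n]+)"
match = re.search(pattern, document)

# re.search returns None when nothing matches, so check before using it.
if match is not None:
    print(match.group("name"))
    print(match.groupdict())
```

Calling groupdict() returns all named groups as a dictionary, which is convenient when each record should become one row of data.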
In [62]:
# Lookahead: re(?=str) matches re only if it is immediately followed by str
tx = "The fat content of fried chicken has decreased 11 percent from 9g to 8g per bite"
re.findall('[0-9]+(?=g)',tx)   # ['9', '8']
# Negative lookahead: re(?!str) matches re only if it is not immediately followed by str
re.findall('[0-9]+(?!g)',tx)   # ['11']
# Lookbehind: (?<=str)re matches re only if it is immediately preceded by str
# Negative lookbehind: (?<!str)re matches re only if it is not immediately preceded by str
Out[62]:
['11']
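The two lookbehind forms are only described in comments above, so here is a short sketch of both on the same sentence. The choice of "from " as the preceding text is just for illustration.

```python
import re

tx = "The fat content of fried chicken has decreased 11 percent from 9g to 8g per bite"

# Positive lookbehind: digits that come directly after "from ".
after_from = re.findall(r"(?<=from )[0-9]+", tx)

# Negative lookbehind: digits that do not come directly after "from ".
not_after_from = re.findall(r"(?<!from )[0-9]+", tx)

print(after_from)
print(not_after_from)
```

Lookarounds check the surrounding text without including it in the match. One caveat: with multi-digit numbers a negative lookaround can still match part of a number (for example the second digit onward), so these patterns are reliable here mainly because the excluded numbers have a single digit.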
In [63]:
import os
# get the current working directory
os.getcwd()
Out[63]:
In [65]:
# list all the items in the current directory
os.listdir(os.getcwd())
Out[65]:
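Directory listings combine naturally with the regular expressions from earlier in this section. The sketch below filters a list of file names by extension; the names are hypothetical and stand in for whatever os.listdir() returns on your machine.

```python
import re

# Hypothetical file names, standing in for the result of os.listdir().
names = ["notes.txt", "data_2015.csv", "data_2016.csv", "README"]

# \. matches a literal dot and $ anchors the match at the end of the name.
csv_files = [n for n in names if re.search(r"\.csv$", n)]
print(csv_files)
```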
In [74]:
# change the working directory (these paths are specific to the original machine; substitute your own)
os.chdir('/home')
os.chdir('/home/vagrant/Documents/Bootcamp_Notebooks')
In [120]:
# open a file in the current directory in write-only mode
file = open('file.txt','w')
file.write("Hello World!")
file.write("\n1\n2")
file.close()
In [121]:
# open a file in the current directory in read-only mode
file = open('file.txt','r')
content = file.read()
file.close()  # close the handle once the contents have been read
content
Out[121]:
'Hello World!\n1\n2'
In [122]:
# open a file in the current directory in append mode
file = open('file.txt','a')
file.write("\nappend")
file.close()
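The write, read, and append steps above can also be written with the with statement, which closes the file automatically even if an error occurs. This is a sketch of the same sequence; it works in a temporary directory (an assumption made here so that no existing file is touched) rather than the current directory.

```python
import os
import tempfile

# Work in a temporary directory so no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")

    # "w" creates the file, or truncates it if it already exists.
    with open(path, "w") as f:
        f.write("Hello World!")
        f.write("\n1\n2")

    # "a" adds to the end without removing existing content.
    with open(path, "a") as f:
        f.write("\nappend")

    # "r" opens the file for reading only.
    with open(path, "r") as f:
        content = f.read()

print(content)
```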
In [119]:
# delete the file (rm is a Unix shell command; os.remove('file.txt') is the portable equivalent)
os.system("rm file.txt")
Out[119]:
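Shelling out to rm only works where that command exists. As a sketch of a portable alternative, os.remove() deletes a file directly from Python. The example creates its own temporary file so that it can be run safely anywhere.

```python
import os
import tempfile

# Create a throwaway file to delete; mkstemp returns an open descriptor and a path.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

existed_before = os.path.exists(path)

# os.remove raises FileNotFoundError if the file is missing, so check first.
if os.path.exists(path):
    os.remove(path)

existed_after = os.path.exists(path)
print(existed_before, existed_after)
```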