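| ^ | Matches the start of a line | '^From:' |
| $ | Matches the end of a line | |
| . | Wildcard: matches any single character (except a newline, by default) | |
| * | Repeats the preceding character 0 or more times (greedy) | '\s*' or '.*' |
| *? | Repeats the preceding character 0 or more times (non-greedy) | |
| + | Repeats the preceding character 1 or more times (greedy) | '[0-9]+' |
| +? | Repeats the preceding character 1 or more times (non-greedy) | |
| \s | Matches a whitespace character | |
| \S | Matches any non-whitespace character | |
| [list] | Matches a single character in the list | |
| [^list] | Matches a single character not in the list | |
| [a-z0-9] | Matches a single character in the ranges a to z or 0 to 9 | |
| ( ) | Marks the part of the match to extract | '^From: (\S+@\S+)' |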
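
The symbols in the table can be tried out on short strings. The snippet below is a minimal sketch; the sample strings are hypothetical and were chosen only to exercise one symbol each.

```python
import re

# Hypothetical sample strings, chosen only to exercise each symbol
vowels = re.findall(r'[aeiou]', 'regex')        # characters in the list
non_vowels = re.findall(r'[^aeiou]', 'regex')   # characters not in the list
runs = re.findall(r'[a-z0-9]+', 'Room B12')     # runs of lowercase letters or digits
at_end = re.search(r'[0-9]+$', 'Invoice 2024')  # digits anchored to the end of the line
padded = re.search(r'^\s*\S+', '   indented')   # optional leading whitespace, then a word

print(vowels, non_vowels, runs)
print(at_end.group(), repr(padded.group()))
```
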
When more than one overlapping match is possible:
Greedy expressions return the longest possible match.
Non-greedy expressions (those with a trailing ?) return the shortest match that satisfies the expression.
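
The difference between greedy and non-greedy matching is easiest to see when the same string is searched both ways. This is a minimal sketch; the sample line is hypothetical.

```python
import re

text = 'From: alice@example.com: hello'  # hypothetical sample line

# Greedy: '.+' extends as far as it can, so the match ends at the last ':'
greedy = re.findall(r'^F.+:', text)

# Non-greedy: '.+?' stops at the first ':' that satisfies the pattern
non_greedy = re.findall(r'^F.+?:', text)

print(greedy)
print(non_greedy)
```
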
To match on a larger pattern but extract only part of it, wrap the part to extract in parentheses:
Example: '^From: (\S+@\S+)'
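
As a rough illustration of matching a larger pattern while extracting only part of it, the sketch below uses re.search() and compares the whole match with the parenthesized group. The sample line is hypothetical.

```python
import re

line = 'From: alice@example.com Sat Jan 5 09:14:16 2008'  # hypothetical header line

m = re.search(r'^From: (\S+@\S+)', line)

whole = m.group(0)    # everything the pattern matched, including 'From: '
address = m.group(1)  # only the part inside the parentheses

print(whole)
print(address)
```
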
Regular expression functions such as re.search() are provided by Python's built-in re module:
import re
Enron email dataset: https://www.cs.cmu.edu/~./enron/
Python regular expression functions:
re.search() checks whether the pattern matches anywhere in a string
re.findall() returns all the matches as a list
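
The two functions return different kinds of values, which the following sketch tries to make visible. The sample strings are hypothetical.

```python
import re

text = 'Order 12 shipped, order 345 pending'  # hypothetical sample text

# re.search() returns a match object for the first match, or None if there is none
found = re.search(r'[0-9]+', text)
missing = re.search(r'[0-9]+', 'no digits here')

# re.findall() returns a list of every match, or an empty list if there is none
all_numbers = re.findall(r'[0-9]+', text)
no_numbers = re.findall(r'[0-9]+', 'no digits here')

print(found.group(), missing)
print(all_numbers, no_numbers)
```

Because a failed re.search() returns None, its result can be used directly in an if statement.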
In [21]:
import re
with open('enron-email-dataset.txt') as emaildata:
    for line in emaildata:
        line = line.rstrip()
        # Print only the lines that begin with 'From:'
        if re.search('^From:', line):
            print(line)
In [3]:
x = 'Team A beat team B 38-7. That was the greatest record for team A since 1987.'
y = re.findall('[0-9]+', x)
y
Out[3]:
['38', '7', '1987']
In [8]:
x = 'My work email address is example@work.com and \
my personal email is example@personal.com.'
re.findall(r'\S+@\S+', x)
Out[8]:
['example@work.com', 'example@personal.com.']
In [9]:
x = 'From: me@you.com My work email address is example@work.com and \
my personal email is example@personal.com.'
re.findall(r'^From: (\S+@\S+)', x)
Out[9]:
['me@you.com']
In [10]:
x = 'My work email address is example@work.com and \
my personal email is example@personal.com.'
re.findall(r'\S+@(\S+)', x)
Out[10]:
['work.com', 'personal.com.']
In [13]:
re.findall('@([^ ]+)', x.rstrip())
Out[13]:
['work.com', 'personal.com.']
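
Patterns such as '\S+' and '[^ ]+' capture every non-space character after the @, so a period that ends the sentence becomes part of the extracted domain. One possible way around this, sketched below, is to require the capture to end in a word character. The character class used here is an assumption about what a domain may contain, not a full email grammar.

```python
import re

x = 'My work email address is example@work.com and \
my personal email is example@personal.com.'

# Loose pattern: anything up to the next space, including the final period
loose = re.findall(r'@([^ ]+)', x)

# Tighter sketch: letters, digits, dots and hyphens, ending in a word character
tight = re.findall(r'@([\w.-]+\w)', x)

print(loose)
print(tight)
```
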
In [20]:
with open('enron-email-dataset.txt') as emaildata:
    for line in emaildata:
        line = line.rstrip()
        # Capture everything after 'X-To: ' through the last address-like token
        res = re.findall(r'^X-To: (.*@\S+)', line)
        if res:
            print(res)
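
The same loop can be tried without the dataset file. The sketch below runs the X-To pattern over a few hypothetical header lines held in a list, which also shows that the greedy '.*' carries the capture through to the last address on the line.

```python
import re

# Hypothetical header lines standing in for the dataset file
lines = [
    'From: carol@example.com',
    'X-To: alice@example.com, bob@example.com',
    'X-To: Undisclosed recipients',
    'Subject: meeting',
]

recipients = []
for line in lines:
    res = re.findall(r'^X-To: (.*@\S+)', line.rstrip())
    if res:  # an empty list means the line did not match
        recipients.extend(res)

print(recipients)
```

Only the second line matches: the first does not start with 'X-To:', and the third contains no @.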
In [24]:
x = "It's a big weekend sale! 70% Everything. \
You can get jeans for $9.99 or get 2 for only $14.99"
re.findall(r'\$([0-9.]+)', x)
Out[24]:
['9.99', '14.99']
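
The class '[0-9.]+' accepts any mix of digits and dots, so a price followed by a sentence-ending period would keep that period. A stricter pattern is sketched below, assuming prices have either no decimals or exactly two; the sample text is hypothetical.

```python
import re

text = 'Jeans are $9.99. Two pairs cost $14.99, socks $3'  # hypothetical sample text

# Loose pattern: the period after 9.99 is captured too
loose = re.findall(r'\$([0-9.]+)', text)

# Stricter sketch: digits, optionally followed by a dot and exactly two digits
strict = re.findall(r'\$([0-9]+(?:\.[0-9]{2})?)', text)

# The strict captures convert cleanly to numbers
prices = [float(p) for p in strict]

print(loose)
print(strict)
print(prices)
```

The inner group is written as (?: ) so that it is not captured and re.findall() still returns a flat list of strings.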