Regular Expressions


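^	Matches the start of a line	'^From:'
$	Matches the end of a line
.	Wildcard: matches any single character (except a newline)
*	Repeats the preceding character 0 or more times (greedy)	'\s*' or '.*'
*?	Repeats the preceding character 0 or more times (non-greedy)
+	Repeats the preceding character 1 or more times (greedy)	'[0-9]+'
+?	Repeats the preceding character 1 or more times (non-greedy)
\s	Matches a whitespace character
\S	Matches a non-whitespace character
[list]	Matches a single character in the list
[^list]	Matches a single character not in the list
[a-z0-9]	A list may contain ranges: here the letters a to z and the digits 0 to 9
( )	Marks the part of the match to extract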
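As a quick check of the table above, the following sketch applies a few of the metacharacters to a made-up header line. The sample strings and variable names are assumptions chosen for illustration, not part of the dataset.

```python
import re

# Hypothetical sample line, not taken from the dataset
sample = 'From: alice@example.com on 2001-05-14'

starts_with_from = re.search(r'^From:', sample) is not None   # ^ anchors at the start
ends_with_digits = re.search(r'[0-9]+$', sample) is not None  # $ anchors at the end
digits = re.findall(r'[0-9]+', sample)         # runs of one or more digits
words = re.findall(r'\S+', sample)             # runs of non-whitespace characters
no_vowels = re.findall(r'[^aeiou\s]', 'a bc')  # single characters outside the list

print(starts_with_from, ends_with_digits)
print(digits)
print(words)
print(no_vowels)
```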

When a pattern can match the same text in more than one way:
Greedy expressions (*, +) return the longest possible match.
Non-greedy expressions (*?, +?) return the shortest match that satisfies the pattern.
To match a larger pattern but extract only part of it, wrap that part in parentheses:
Example: '^From: (\S+@\S+)'
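The difference between greedy and non-greedy matching described above can be sketched as follows. The sample string is an assumption made up for this illustration.

```python
import re

# Hypothetical sample string containing two ':' characters
text = 'From: Using the : character'

# Greedy: .+ expands as far as possible, so the match ends at the last ':'
greedy = re.findall(r'^F.+:', text)

# Non-greedy: .+? expands as little as possible, so the match ends at the first ':'
non_greedy = re.findall(r'^F.+?:', text)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```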

Python provides regular expressions through the built-in re module:

import re

Enron email dataset: https://www.cs.cmu.edu/~./enron/

Python regular expression functions:
re.search() checks whether the pattern matches anywhere in a string
re.findall() returns all matches of the pattern as a list
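A minimal sketch of how the two functions differ in what they return, using a made-up sample line (the string and variable names are assumptions): re.search() gives back a match object for the first match, or None when nothing matches, while re.findall() gives back a list of strings.

```python
import re

# Hypothetical sample line
line = 'Order 66 shipped 3 items'

first = re.search(r'[0-9]+', line)            # match object for the first match
all_matches = re.findall(r'[0-9]+', line)     # list of every non-overlapping match
missing = re.search(r'[0-9]+', 'no digits')   # None when there is no match

print(first.group())   # 66
print(all_matches)     # ['66', '3']
print(missing)         # None
```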



In [21]:

    
import re

# Print every line that starts with 'From:'
with open('enron-email-dataset.txt') as emaildata:
    for line in emaildata:
        line = line.rstrip()
        if re.search(r'^From:', line):
            print(line)

From: heather.dunton@enron.com
From: 	Allen, Phillip K.
From: 	Dunton, Heather
From: 	Allen, Phillip K.
From: 	Dunton, Heather
From: brad.jones@enron.com
From: david.port@enron.com
From: 	Hayden, Frank
From: 	Jones, Brad
From: c..gossett@enron.com
From: steven.matthews@ubspw.com
From: louise.kitchen@enron.com
From: gthorse@about-cis.com
From: software@mail02.unitedmarketingstrategies.com
From: unsubscribe-i@networkpromotion.com



In [3]:

    
x = 'Team A beat team B 38-7. That was the greatest record for team A since 1987.'

y = re.findall('[0-9]+', x)

y

    Out[3]:

['38', '7', '1987']
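Note that re.findall() returns strings, not numbers. If the values are needed for arithmetic, one possible follow-up step (a sketch, with the variable name numbers chosen here for illustration) is to convert each match with int():

```python
import re

x = 'Team A beat team B 38-7. That was the greatest record for team A since 1987.'

# Convert each matched digit string to an integer
numbers = [int(n) for n in re.findall(r'[0-9]+', x)]

print(numbers)  # [38, 7, 1987]
```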

Extracting email addresses from text



In [8]:

    
x = 'My work email address is example@work.com and \
my personal email is example@personal.com.'

re.findall(r'\S+@\S+', x)

    Out[8]:

['example@work.com', 'example@personal.com.']
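Because \S+ matches any non-whitespace character, the sentence-ending period is captured as part of the last address. One simple way to deal with that, sketched here under the assumption that addresses never legitimately end in punctuation, is to strip trailing punctuation from each match:

```python
import re

x = ('My work email address is example@work.com and '
     'my personal email is example@personal.com.')

# Remove trailing punctuation that \S+ picked up from the surrounding sentence
addresses = [a.rstrip('.,;:') for a in re.findall(r'\S+@\S+', x)]

print(addresses)  # ['example@work.com', 'example@personal.com']
```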



In [9]:

    
x = 'From: me@you.com  My work email address is example@work.com and \
my personal email is example@personal.com.'

re.findall(r'^From: (\S+@\S+)', x)

    Out[9]:

['me@you.com']
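The same extraction can also be written with re.search(), which makes the difference between the whole match and the parenthesized part visible. This is a sketch using a shortened version of the string above; the variable name m is an assumption.

```python
import re

x = 'From: me@you.com  My work email address is example@work.com'

m = re.search(r'^From: (\S+@\S+)', x)
if m:
    whole = m.group(0)      # everything the pattern matched
    extracted = m.group(1)  # only the part inside the parentheses
    print(whole)       # From: me@you.com
    print(extracted)   # me@you.com
```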

Extracting the domain name in email addresses



In [10]:

    
x = 'My work email address is example@work.com and \
my personal email is example@personal.com.'

re.findall(r'\S+@(\S+)', x)

    Out[10]:

['work.com', 'personal.com.']



In [13]:

    
re.findall('@([^ ]+)', x.rstrip())

    Out[13]:

['work.com', 'personal.com.']
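The two patterns give the same result here, but they are not equivalent: [^ ] excludes only the space character, while \S excludes every kind of whitespace. Note also that rstrip() removes trailing whitespace only, so the final period remains in both outputs. A small sketch with a made-up tab-separated string (an assumption for illustration) shows the difference between the two patterns:

```python
import re

# Hypothetical sample where a tab, not a space, follows the address
x = 'example@work.com\tprimary'

not_space = re.findall(r'@([^ ]+)', x)   # the tab is not a space, so matching continues
not_blank = re.findall(r'@(\S+)', x)     # the tab is whitespace, so matching stops

print(not_space)  # ['work.com\tprimary']
print(not_blank)  # ['work.com']
```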



In [20]:

    
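# Print the captured recipient text from every X-To: line containing an address
with open('enron-email-dataset.txt') as emaildata:
    for line in emaildata:
        line = line.rstrip()
        # The greedy .* extends the match through the last '@' on the line
        res = re.findall(r'^X-To: (.*@\S+)', line)
        if res:
            print(res)

["'daniel.mcdonagh@chase.com', Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>, 'pallen70@hotmail.com'"]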
["'steven.l.allen@chase.com', 'daniel.mcdonagh@chase.com'"]
['pallen70@hotmail.com']
['pallen@enron.com']
['PALLEN@ENRON.COM']
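Each result above is a single string because the greedy .* runs through the last @ on the line. To get the addresses one by one, a simplified pattern such as [\w.]+@[\w.]+ could be used instead. This is a sketch, not a full email validator, and the header line below is reconstructed from the second result above, so its exact raw form is an assumption.

```python
import re

# Header line reconstructed from the second result above (assumed raw form)
line = "X-To: 'steven.l.allen@chase.com', 'daniel.mcdonagh@chase.com'"

# Greedy capture: one string covering everything up to the last address
whole = re.findall(r'^X-To: (.*@\S+)', line)

# Simplified address pattern: letters, digits, underscores and dots around '@'
each = re.findall(r'[\w.]+@[\w.]+', line)

print(whole)
print(each)  # ['steven.l.allen@chase.com', 'daniel.mcdonagh@chase.com']
```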

Extracting prices in text



In [24]:

    
x = "It's a big weekend sale! 70% Everything. \
You can get jeans for $9.99 or get 2 for only $14.99"

re.findall(r'\$([0-9.]+)', x)

    Out[24]:

['9.99', '14.99']
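The character set [0-9.] accepts any mix of digits and periods, so a price at the end of a sentence would keep its trailing period. A stricter variant is sketched below; it assumes every price has a decimal part (a whole-dollar amount such as $5 would not match), and the sample sentence is made up for illustration. The matches are then converted to floats.

```python
import re

# Hypothetical sentence that ends right after a price
x = 'Jeans are $9.99 each, or get 2 for only $14.99.'

loose = re.findall(r'\$([0-9.]+)', x)          # keeps the sentence-ending period
strict = re.findall(r'\$([0-9]+\.[0-9]+)', x)  # digits, one period, digits

prices = [float(p) for p in strict]

print(loose)   # ['9.99', '14.99.']
print(strict)  # ['9.99', '14.99']
print(prices)  # [9.99, 14.99]
```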


