Regular Expressions

^ the start of a line '^From:'
$ the end of a line
. wildcard for any character (except a newline)
* Repeating a character 0 or more times '\s*' or '.*'
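*? Repeating a character 0 or more times (non-greedy)
+ Repeating a character 1 or more times '[0-9]+'
+? Repeating a character 1 or more times (non-greedy)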
\s white space
\S non-white space (any non-blank character)
[list] matching a single character in the list
[^list] matching any character not in the list
[a-z0-9] range of characters a to z, and digits 0-9
( ) String extraction: marks where the extracted string starts and ends

  • If a pattern can match in more than one way from the same starting point:
    Greedy (*, +): returns the longest possible match
    Non-greedy (*?, +?): returns the shortest possible match

  • To search for a bigger match, but extract a subset of the match,
    wrap the part to extract in parentheses:
    Example: '^From: (\S+@\S+)'
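
A small sketch of the greedy, non-greedy, and parentheses rules above. The sample strings are made up for illustration and are not taken from the Enron data.

```python
import re

# Made-up sample line with two colons, so greedy and non-greedy differ.
text = 'From: alice@example.com: hello'

# Greedy: .+ extends to the last colon it can reach.
greedy = re.findall(r'^F.+:', text)
# Non-greedy: .+? stops at the first colon.
lazy = re.findall(r'^F.+?:', text)

print(greedy)  # ['From: alice@example.com:']
print(lazy)    # ['From:']

# Parentheses: the whole pattern must match, but only the part
# inside ( ) is returned.
header = 'From: alice@example.com Sat Jan 5'
extracted = re.findall(r'^From: (\S+@\S+)', header)
print(extracted)  # ['alice@example.com']
```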

To use regular expressions in Python, import the re module:

import re

Enron email dataset: https://www.cs.cmu.edu/~./enron/

Python regular expression functions:
re.search() to see if there is any pattern match
re.findall() to extract all the matches in a list
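
To make the difference between the two functions concrete, here is a minimal sketch on a made-up line (not from the dataset). re.search() returns a match object for the first match, or None if there is none, while re.findall() returns a list of every match.

```python
import re

# Made-up sample line for illustration.
line = 'From: alice@example.com cc: bob@example.com'

# re.search() returns None when the pattern is not found.
no_match = re.search(r'^To:', line)
print(no_match)  # None

# When it does match, .group() gives the first matched text.
match = re.search(r'\S+@\S+', line)
first = match.group() if match else None
print(first)  # alice@example.com

# re.findall() returns all non-overlapping matches as a list.
all_matches = re.findall(r'\S+@\S+', line)
print(all_matches)  # ['alice@example.com', 'bob@example.com']
```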


In [21]:
import re

# Print every line that starts with 'From:'
with open('enron-email-dataset.txt') as emaildata:
    for line in emaildata:
        line = line.rstrip()
        if re.search(r'^From:', line):
            print(line)


From: heather.dunton@enron.com
From: 	Allen, Phillip K.
From: 	Dunton, Heather
From: 	Allen, Phillip K.
From: 	Dunton, Heather
From: brad.jones@enron.com
From: david.port@enron.com
From: 	Hayden, Frank
From: 	Jones, Brad
From: c..gossett@enron.com
From: steven.matthews@ubspw.com
From: louise.kitchen@enron.com
From: gthorse@about-cis.com
From: software@mail02.unitedmarketingstrategies.com
From: unsubscribe-i@networkpromotion.com

In [3]:
x = 'Team A beat team B 38-7. That was the greatest record for team A since 1987.'

y = re.findall('[0-9]+', x)

y


Out[3]:
['38', '7', '1987']

Extracting email addresses from text


In [8]:
x = 'My work email address is example@work.com and \
my personal email is example@personal.com.'

re.findall(r'\S+@\S+', x)


Out[8]:
['example@work.com', 'example@personal.com.']

In [9]:
x = 'From: me@you.com  My work email address is example@work.com and \
my personal email is example@personal.com.'

re.findall(r'^From: (\S+@\S+)', x)


Out[9]:
['me@you.com']

Extracting the domain name in email addresses


In [10]:
x = 'My work email address is example@work.com and \
my personal email is example@personal.com.'

re.findall(r'\S+@(\S+)', x)


Out[10]:
['work.com', 'personal.com.']

In [13]:
re.findall('@([^ ]+)', x.rstrip())


Out[13]:
['work.com', 'personal.com.']
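
Both patterns above keep the sentence-ending period in 'personal.com.' because \S and [^ ] accept any non-blank character. One possible tightening, assuming domains contain only letters, digits, dots, and hyphens and end in a letter or digit, is sketched below. The character classes are an illustrative choice, not a full email grammar.

```python
import re

x = 'My work email address is example@work.com and \
my personal email is example@personal.com.'

# Allow letters, digits, dots and hyphens in the domain, but require
# the last captured character to be a letter or digit, so trailing
# punctuation is left out of the match.
domains = re.findall(r'@([A-Za-z0-9.-]*[A-Za-z0-9])', x)
print(domains)  # ['work.com', 'personal.com']
```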

In [20]:
# Greedy .* captures everything after 'X-To: ' through the last @-token on the line
with open('enron-email-dataset.txt') as emaildata:
    for line in emaildata:
        line = line.rstrip()
        res = re.findall(r'^X-To: (.*@\S+)', line)
        if res:
            print(res)


["'daniel.mcdonagh@chase.com', Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>, 'pallen70@hotmail.com'"]
["'steven.l.allen@chase.com', 'daniel.mcdonagh@chase.com'"]
['pallen70@hotmail.com']
['pallen@enron.com']
['PALLEN@ENRON.COM']
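
Because .* is greedy, the first result above is one long string holding several recipients. A possible follow-up, sketched on a single line rebuilt from that output, is to pull out the individual addresses with a narrower character class. The class used here is an assumption for illustration and does not cover every valid address.

```python
import re

# Sample header line rebuilt from the first result shown above.
line = ("X-To: 'daniel.mcdonagh@chase.com', Allen, Phillip K. "
        "</O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>, 'pallen70@hotmail.com'")

addresses = []
if line.startswith('X-To:'):
    # Match only characters that commonly appear in addresses, so the
    # surrounding quotes and commas are not included.
    addresses = re.findall(r'[A-Za-z0-9._-]+@[A-Za-z0-9.-]+', line)

print(addresses)  # ['daniel.mcdonagh@chase.com', 'pallen70@hotmail.com']
```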

Extracting prices in text


In [24]:
x = "It's a big weekend sale! 70% Everything. \
You can get jeans for $9.99 or get 2 for only $14.99"

re.findall(r'\$([0-9.]+)', x)


Out[24]:
['9.99', '14.99']
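
The class [0-9.] also accepts a stray period, for example at the end of a sentence. A stricter variant, assuming prices are written as whole dollars with an optional two-digit cents part, is sketched below together with a conversion to numbers. The second test sentence is made up for illustration.

```python
import re

x = "It's a big weekend sale! 70% Everything. \
You can get jeans for $9.99 or get 2 for only $14.99"

# Digits, optionally followed by a period and exactly two digits.
strict = r'\$(\d+(?:\.\d{2})?)'

prices = [float(p) for p in re.findall(strict, x)]
print(prices)  # [9.99, 14.99]

# Made-up sentence where the price is followed by a period.
loose_result = re.findall(r'\$([0-9.]+)', 'It costs $5.')
strict_result = re.findall(strict, 'It costs $5.')
print(loose_result)   # ['5.']
print(strict_result)  # ['5']
```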
