There was a lot of interesting discussion on Piazza about extracting email addresses from an email archive using regular expressions (see post @9). As of the time of this writing, the collective student solution was the following:
In [1]:
# Students' solution, as of Tue Aug 25 01:00:03 EDT 2015.
# See: https://piazza.com/class/idap9v1ktp94u9?cid=9
import re
import string

def reademails(myfile):
    inbox = open(myfile, 'r')  # 'r' = read mode; use 'w' for writing
    assert inbox  # Makes sure it opened OK
    emails = []
    lastline = inbox.readline()  # init lastline
    for line in inbox:
        if lastline.endswith('=\n'):
            # Undo the '=' line continuation, dropping the trailing newline:
            lastline = string.replace(lastline, '=', '').replace('\n', '')
        line1 = (lastline + line).lower()  # make emails lowercase
        emails.extend(re.findall("[\w.-]+@[\w.-]+[.][A-Za-z]{2,4}", line1))  # gets list of emails in the line
        lastline = line
    print emails[0:9]  # sample print
    emails = set(emails)  # gets unique items
    inbox.close()
    return emails

print len(reademails('skilling-j.inbox'))
This solution is really great: it works hard to extract email addresses robustly from a file with a lot of funny formatting, while retaining the memory efficiency of line-by-line file parsing. (For instance, in some messages each line of the body is trimmed at a fixed column and ends with an = character; the above code handles this case by considering pairs of lines and concatenating them.)
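For example, a line of a message body might wrap an address like this (a made-up illustration, not an actual excerpt from the archive):

    ... please forward the schedule to jeff.skilling@=
    enron.com before Friday ...

A one-line-at-a-time regex scan would miss this address entirely; concatenating the continued lines first recovers it.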
But in a sense, we actually made the problem harder for you than necessary!
For instance, the email archive we provided is actually a subset of the full Enron email archive, which is available here. (Warning: The full archive is $\approx$ 423 MiB compressed.) In the full archive, every email is in its own file, which means it is more conceivable to assume you can read one whole message into main memory for subsequent processing.
Moreover, we didn't say what the real application might be. For example, suppose what you are mining is who corresponds with whom most frequently. In that case, you don't need to parse the whole message body; you just need to inspect the From: ... and To: ... headers in each email message.
So, for the sake of variety, let's consider a different solution that exploits the above simplifications, i.e., one email per file and From:/To: headers. This scenario will allow us to introduce a couple of other useful Python modules: the os module, which has some portable file and directory manipulation utilities, and the email module, which helps you parse email messages. We'll just show you some code below; you should read it and compare it against the online documentation for these modules.
Note: To run this code, you will need to download the Enron email database. For testing purposes, you can use a smaller subset that we have prepared: http://cse6040.gatech.edu/fa15/enron-maildir-subset.zip
In [10]:
import os
from email.parser import Parser
import re

def getAllFilePaths(root_dirname):
    """Returns a list of all file paths from a given root directory."""
    AllPaths = []
    for base, Dirs, Files in os.walk(root_dirname):
        for filename in Files:
            filepath = os.path.join(base, filename)
            AllPaths.append(filepath)
    return AllPaths

# A regex object to recognize an email address:
EMAIL_PATTERN = re.compile(r'[\w.-]+@[\w.-]+')

def getCommunicators(email_msg):
    """Given an email message object, returns the sender and recipients."""
    Comms = []
    Comms.extend(EMAIL_PATTERN.findall(email_msg['From']))
    if 'To' in email_msg:
        Comms.extend(EMAIL_PATTERN.findall(email_msg['To']))
    return Comms

# Main loop: walk the archive, parsing each message and recording
# all unique sender/recipient addresses.
EmailFilenames = getAllFilePaths('./enron-maildir-subset')
UniqueAddresses = set()  # Stores unique sender/recipient email addresses
for email_filename in EmailFilenames:
    email_file = open(email_filename)
    msg = Parser().parse(email_file)
    email_file.close()
    Addresses = getCommunicators(msg)
    UniqueAddresses.update(Addresses)

print(len(UniqueAddresses))
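As an aside, if the application really is to see who corresponds with whom most frequently, you could go one step further and tally (sender, recipient) pairs instead of just collecting unique addresses. Here is a minimal sketch of that idea, reusing EmailFilenames, Parser, and EMAIL_PATTERN from above; the use of collections.Counter and the guards for missing headers are our choices, not part of any prescribed solution.

In [ ]:
from collections import Counter

PairCounts = Counter()  # maps (sender, recipient) -> number of messages
for email_filename in EmailFilenames:
    email_file = open(email_filename)
    msg = Parser().parse(email_file)
    email_file.close()
    # 'msg[header]' is None if the header is absent, hence the 'or' guards:
    Senders = EMAIL_PATTERN.findall(msg['From'] or '')
    Recipients = EMAIL_PATTERN.findall(msg['To'] or '')
    for sender in Senders:
        for recipient in Recipients:
            PairCounts[(sender, recipient)] += 1

print(PairCounts.most_common(10))  # the ten most frequent correspondent pairs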
Suppose you are a retailer (e.g., Publix, Amazon) who sells items, and you want to discover whether customers buy certain pairs of items together frequently. The data you have are baskets: a basket is the list of items that some customer bought during a given visit. We will refer to this problem as the pairwise association mining problem.
The more general form, where you are interested in subsets of items, rather than only pairs, is the association mining problem.
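To make the terminology concrete, here is a tiny collection of baskets; the items and purchases are invented purely for illustration.

In [ ]:
# A toy collection of baskets, one list of items per customer visit:
baskets = [['milk', 'bread', 'eggs'],
           ['milk', 'bread'],
           ['bread', 'butter'],
           ['milk', 'bread', 'butter'],
           ['eggs']]
# In this example, the pair (milk, bread) co-occurs in 3 of the 5 baskets.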
With your nearest neighbor(s), briefly discuss the following questions.
Q: How might you use the information about co-occurring pairs?
Q: Give another example of a data analysis problem that, abstractly, “looks like” this pairwise association mining problem.
Q: How would you approach the problem of getting this information from the data?
(Enter your responses to the above questions here)
Let’s consider some specific algorithms for discovering pairwise association rules.
Let $n$ be the number of items, represented by the integers $\{0, 1, 2, \ldots, n-1\}$, and let $m$ be the number of baskets, numbered $\{0, 1, 2, \ldots, m-1\}$.
Denote the $i$-th basket by $b_i^T \in \{0, 1\}^n$, a binary (row) vector of length $n$. Furthermore, let $b_{i,j}$ be the $j$-th component of $b_i^T$, which has the value $1$ if the $i$-th transaction included a purchase of item $j$, and $0$ otherwise.
Put differently, the matrix
$$B \equiv \left(\begin{array}{c} b_0^T \\ b_1^T \\ \vdots \\ b_{m-1}^T \end{array}\right) = \left(\begin{array}{ccc} b_{0,0} & \cdots & b_{0,n-1} \\ \vdots & & \vdots \\ b_{m-1,0} & \cdots & b_{m-1,n-1} \end{array}\right)$$

is the matrix of all transactions.
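As a sanity check on this notation, here is a small sketch that assembles $B$ as a dense 0/1 NumPy array from a list of baskets, where each basket is given as a list of item indices in $\{0, \ldots, n-1\}$; the function name and input format are our own choices for illustration.

In [ ]:
import numpy as np

def baskets_to_matrix(baskets, n):
    """Returns the m-by-n 0/1 transaction matrix B for the given baskets."""
    m = len(baskets)
    B = np.zeros((m, n), dtype=int)
    for i, basket in enumerate(baskets):
        for j in basket:
            B[i, j] = 1  # transaction i included a purchase of item j
    return B

# Example: m=3 baskets over n=4 items
print(baskets_to_matrix([[0, 1], [1, 2, 3], [0, 3]], 4))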
A first simple algorithm might be the following. The algorithm maintains an $n \times n$ table $T \in \mathcal{Z}_{*}^{n \times n}$, i.e., a table of nonnegative integer counts $t_{i,j} \geq 0$, one for each possible pair of items $(i, j)$; initially, $T = 0$. (To save some space, you could store just the upper or lower triangle.) The algorithm would then scan all baskets and, for each basket, increment the counts for all pairs of items in that basket. After reading all the baskets, the algorithm could scan the table and pull out the top occurrences. The "top occurrences" might be all those pairs that appear more than $s$ times, where $s$ is some threshold. (A concrete sketch of this baseline appears after the questions below.)
Q: Computationally, what are the positive or negative aspects of this approach?
Q: How should you store the table, $T$? (What sort of data structure would you try to use?)
(Enter your responses to the above questions here.)
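As a point of reference, here is a minimal sketch of the baseline algorithm described above. Rather than a dense $n \times n$ array, it stores only the nonzero entries of the upper triangle of $T$, keyed by pairs $(i, j)$ with $i < j$; the input format (each basket a list of item indices) and the use of collections.Counter are assumptions for illustration, and just one of several reasonable choices.

In [ ]:
from collections import Counter
from itertools import combinations

def count_pairs(baskets):
    """Scans all baskets, returning co-occurrence counts keyed by (i, j), i < j."""
    T = Counter()
    for basket in baskets:
        # Increment the count of every pair of distinct items in this basket:
        for (i, j) in combinations(sorted(set(basket)), 2):
            T[(i, j)] += 1
    return T

def frequent_pairs(T, s):
    """Returns all pairs (with counts) appearing more than s times."""
    return [(pair, count) for (pair, count) in T.items() if count > s]

T = count_pairs([[0, 1, 2], [0, 1], [1, 2, 3], [0, 1, 3]])
print(frequent_pairs(T, 2))  # expect: [((0, 1), 3)]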
An alternative to the above algorithm is the a-priori algorithm. The key idea is to exploit monotonicity, which is the following natural property: if the pair of items, $(i, j)$, appears at least $s$ times, then items $i$ and $j$ must also appear at least $s$ times.
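To see monotonicity in action, here is a sketch of what the first of two passes over the data might look like: a single scan that tallies how many baskets each individual item appears in, from which you can prune candidate pairs. How to organize the second pass within the stated storage bound is the point of the exercise below.

In [ ]:
from collections import Counter

def frequent_items(baskets, s):
    """First pass: returns the set of items appearing in more than s baskets."""
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))  # count each item at most once per basket
    return set(i for (i, c) in item_counts.items() if c > s)

# By monotonicity, a pair (i, j) can appear more than s times only if
# both i and j are in this set.
print(frequent_items([[0, 1, 2], [0, 1], [1, 2, 3], [0, 1, 3]], 2))  # expect: {0, 1}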
Q: Based on this observation, devise a scheme that can identify frequent pairs by reading the entire data set only twice, using at most $O(n + k_s^2)$ storage, where $n$ is the number of items and $k_s$ is the number of items that appear more than $s$ times.
(Enter your response here.)