In our association mining example, recall that we wanted to maintain a sparse table of the number of occurrences of pairs of items. The focus of today's class is on some Python constructs that will enable us to write compact code to build and maintain such a table, specifically for the problem of mining an email corpus for "co-occurring correspondents."
Specific topics in today's notebook are:
If all goes well, you will by the end of this notebook implement and test an A-Priori algorithm from Lab 3 on a subset of the Enron corpus, available here: http://cse6040.gatech.edu/fa15/enron-maildir-subset.zip
In [ ]:
sample_list = ['The', 'Quick', 'Brown', 'Fox']
lower_list = []
for x in sample_list:
lower_list.append (x.lower ())
upper_list = []
for x in sample_list:
upper_list.append (x.upper ())
print ("sample_list = %s" % str (sample_list))
print ("lower_list = %s" % str (lower_list))
print ("upper_list = %s" % str (upper_list))
Alternative idiom: List comprehensions. The idiom of creating a list by transforming another list is very common. As such, there is a handy compact notation for it, called a list comprehension.
Inspect this code and check that it produces the expected results.
In [ ]:
lower_list2 = [x.lower () for x in sample_list] # A list comprehension.
upper_list2 = [x.upper () for x in sample_list] # Another one!
print ("lower_list2 = %s" % str (lower_list2))
print ("upper_list2 = %s" % str (upper_list2))
There is an analogous concept for constructing a set, which is called a set comprehension. A set comprehension constructs a set from an input list or set.
In [ ]:
list1 = """
how much wood could a woodchuck chuck
if a woodchuck could chuck wood
""".split ()
set_from_list = {x for x in list1}
list_from_set = [x for x in set_from_list]
print (list1)
print (set_from_list)
print (list_from_set)
You've undoubtedly noticed the common use of for ... in ...:
constructs in Python programs. The in
part is quite flexible, allowing you to iterate in a compact and readable way over many kinds of collections.
# Characters in a string
text = 'The quick brown fox jumps over the lazy dog'
for letter in text:
...
# Lists
x = ['a', 'b', 'c']
for y in x:
...
# Dictionaries
x = {'a': 1, 'b': 2, 'c': 3}
for key, value in x.items (): # also possible: x.keys(), x.values()
...
# Range objects
for i in range (0, 10):
...
# Files (line-by-line)
for line in open ('filename.txt', 'r'):
...
The last two examples---range objects and files---are especially interesting. They are in fact examples of a special kind of custom iteration that you can design. Here, you'll see an example of one technique called a generator.
Generators. A generator is a special kind of "interruptible" function that yields zero or more values or objects. During its execution, a generator may produce an object or value, temporarily transfer control back to the caller with that object or value, and then resume control from the same place when called again. These control transfer points are marked by yield
statements.
For example, suppose you are given a dictionary, and you wish to find all keys whose values match or exceed some threshold. Here is how you might use a generator to do the job.
In [ ]:
def keys_geq_threshold (Dict, threshold):
"""
(Generator) Given a dictionary, yields the keys whose values
are at or above (greater than or equal to) a given threshold.
"""
for key, value in Dict.items ():
if value >= threshold:
yield key
This code is a loop over key-value pairs. However, when a matching key is detected, the function yields control back to the caller. If the caller calls this function again, it will resume after the yield
statement.
You can call such a generator function as part of the in
statement of a for
loop to get a sequence of keys. The range()
and open()
functions are themselves generators!
Take a look at the code and run it to verify that it produces the expected result.
In [ ]:
inventory = {'apples': 6, 'bananas': 3, 'milk': 5, 'peanuts': 10}
# Apply the generator:
overstock = [key for key in keys_geq_threshold (inventory, 5)]
print ("=== Overstock, via list comprehensions ===")
print (overstock)
# Reminder: Here is a different way to implement the above
overstock2 = []
for key in keys_geq_threshold (inventory, 5):
overstock2.append (key)
print ("\n=== Overstock, via explicit loops ===")
print (overstock2)
To see how generators can be useful, let's return to the problem of mining an email archive.
(Review) Parsing an email. First, recall that Python has an email
module, which you can use to parse plaintext emails.
As a minor technical aside, these emails should be formatted according to the RFC 822 standard.
Given a string formatted in such a way, you can create a structured email object. You can then query the object to extract fields of the email, like the sender, the recipient, the date, the subject, or the body, among others.
Here is some sample code to look for email addresses in certain fields of the email; run it to see that it produces the expected result.
In [2]:
email_msg_text = """Message-ID: aslkj42t90wdk23o5uxc@jinx
Date: Tue Aug 25 23:44:06 EDT 2015
From: Richard (Rich) Vuduc <richie@cc.gatech.edu>
To: Jebediah Springfield <jebby@mindsprang.com>
Reply-To: junk@vuduc.org
Subject: Spam -- delete me
Please read the subject, or reply me at prez@whitehouse.gov or junk@vuduc.org
Thanks,
-- Rich"""
import re
import email.parser
import re
# This line was missing in class
EMAIL_PATTERN = re.compile (r'[\w+.]+@[\w.]+')
# Parses email text into an object
email_msg_obj = email.parser.Parser ().parsestr (email_msg_text)
# Poke around for email addresses in the From, To, Reply-To, and body:
addrs = set (EMAIL_PATTERN.findall (email_msg_obj['From']))
addrs.update (EMAIL_PATTERN.findall (email_msg_obj['Reply-To']))
addrs.update (EMAIL_PATTERN.findall (email_msg_obj['To']))
addrs.update (EMAIL_PATTERN.findall (email_msg_obj.get_payload ()))
print ("Email addresses found: %s" % str (addrs))
Notice that the code above calls,
... email.parser.Parser ().parsestr (email_msg_text)
There is another handy function, parse (email_file_obj)
, that does the same thing except that it reads the email contents from a file instead of a string.
Email directories (maildir
). Next, let's introduce the concept of an email directory, or a maildir
for short. A maildir
is any directory that contains any number of nested subdirectories and files, where each file is an email message.
Here is a generator to produce email objects for every message in a
maildir
. Take a look and make sure you understand how it works.
In [ ]:
def messages (maildir_root):
"""
(Generator) Given a mailbox directory name, yields an
email object for each message therein.
"""
for base, dirs, files in os.walk (maildir_root):
for filename in files:
filepath = os.path.join (base, filename)
email_file = open (filepath)
msg = email.parser.Parser ().parse (email_file)
email_file.close ()
yield msg
Exercise. Now it's your turn!
As part of this exercise, you should download and unpack the sample maildir
available here, which is a subset of the full Enron corpus: http://cse6040.gatech.edu/fa15/enron-maildir-subset.zip
Given the
messages()
generator, complete the function,count_messages ()
, so that it returns the number of messages in amaildir
.
In [ ]:
# Your task: Complete this function!
def count_messages (maildir_root):
"""Returns the # of messages stored in a given mailbox directory."""
return sum ([1 for msg in messages (maildir_root)])
return len (msgs)
count = 0
for msg in messages (maildir_root):
count += 1
return count
# Our testing code:
MAILDIR = './enron-maildir-subset/skilling-j' # Change path if needed
ANSWER = 4139
num_msgs = count_messages (MAILDIR)
print ("Found %d messages." % num_msgs)
assert num_msgs == ANSWER # The answer for the above maildir
collections.defaultdict
The next new concept you will explore is a twist on the usual dictionary object, which is a default dictionary.
To motivate it, consider the following common pattern.
Suppose you are given a string and you wish to count the number of occurrences of alphabetic characters. A first and natural way might be to scan the string letter by letter, using a dictionary to store the counts. This approach would then involve a common idiom: for each letter, check whether we created a dictionary entry already; if so, then update the entry; otherwise, create a new entry.
Inspect this example and try it.
In [ ]:
def alpha_chars (text):
"""
(Generator) Yields each of the alphabetic characters in a string.
"""
for letter in text:
if letter.isalpha ():
yield letter
# Example: Count letter frequencies (case-insensitive), take 1
text = """How much wood could a woodchuck chuck
if a woodchuck could chuck wood?"""
freqs1 = {} # Frequency table for method 1
for letter in alpha_chars (text.lower ()):
freqs1[letter] += 1
print (freqs1)
# Quick check
assert freqs1['o'] == 11 and freqs1['h'] == 6
print ("\n(Passed a quick, partial test.)")
As it happens, we can express the same pattern in a slightly more compact notation using a special form of a dictionary called a default dictionary, which is defined in Python's collections
module.
Here is an example of what it would look like. Notice that it obviates the explicit conditional to test for the presence of an existing key!
Q: Inspect this code. Notice that the
defaultdict()
call takes an argument. Why might such an argument be needed?
In [ ]:
from collections import defaultdict
# Frequency table, take 2
freqs2 = defaultdict (int) # take note of argument, `int`
for letter in alpha_chars (text.lower ()):
freqs2[letter] += 1
print (freqs2)
print ("===")
print (int ())
# Check answers against the first method
for key, value in freqs1.items (): assert freqs2[key] == value
for key, value in freqs2.items (): assert freqs1[key] == value
print ("\n(Passed: Method 2 gives the same answer as method 1.)")
Exercise: Sparse matrices. This exercise is a kind of test to see whether you understand how default dictionaries work.
First, let's "package up" the above example into an abstract data type, which is a sparse (integer) vector.
Inspect and try this example, to confirm it gives the same results as the preceding examples.
In [ ]:
def sparse_vector ():
return defaultdict (int)
def print_sparse_vector (x):
for key, value in x.items ():
print ("%s: %d" % (key, value))
letter_freqs = sparse_vector ()
for letter in alpha_chars (text.lower ()):
letter_freqs[letter] += 1
print_sparse_vector (letter_freqs)
for key, value in freqs2.items (): assert letter_freqs[key] == value
for key, value in letter_freqs.items (): assert freqs2[key] == value
print ("\n(Passed check against method 2.)")
Suppose that we instead want to compute how frequently pairs of letters occur within words. Instead of a sparse vector, you might instead maintain a table, or sparse matrix, such that the $(i, j)$ entry of the matrix counts the number of times the letters $i$ and $j$ co-occur within a word.
Complete the code below to implement a sparse matrix that counts the number of times that a pair of letters co-occurs in a word. In particular, fill in the code for
sparse_matrix()
andalpha_chars_pairs()
.
In [ ]:
# === COMPLETE THIS FUNCTION ===
# Hint: See the definition of `print_sparse_matrix()`
# to see the interface to which your sparse matrix object
# should conform.
def sparse_matrix ():
"""
Returns an empty sparse matrix that can hold integer counts
of pairs of elements.
"""
pass
def print_sparse_matrix (x):
for i, row_i in x.items ():
for j, value in row_i.items ():
print ("[%s, %s]: %d" % (i, j, value))
# === COMPLETE THIS FUNCTION ===
# Hint: Look at how this function is used, below.
def alpha_chars_pairs (text):
"""
(Generator) Yields every one of the 4-choose-2 pairs of
'positionally distinct' alphabetic characters in a string.
That is, each position of the string is regarded as distinct,
but the pair of characters coming from positions (i, j),
where i != j, are considered the "same" as the paired
positions (j, i). Non-alphabetic characters should be
ignored.
For instance, `alpha_chars_pairs ("te3x_t")` should produce
has just 4 positionally distinct characters, so this routine
should return the 4 choose 2 == 6 pairs:
('t', 'e') <-- from positions (0, 1)
('t', 'x') <-- from positions (0, 3)
('t', 't') <-- from positions (0, 5)
('e', 'x') <-- from positions (1, 3)
('e', 't') <-- from positions (1, 5)
('x', 't') <-- from positions (3, 5)
"""
pass
# === Testing code follows ===
# Compute frequency of pairs of positionally distinct,
# case-insensitive alphabetic characters in a word.
letter_pair_counts = sparse_matrix ()
words = text.split ()
for word in words:
for w_i, w_j in alpha_chars_pairs (word.lower ()):
# Enforce convention: w_i < w_j
w_i, w_j = min (w_i, w_j), max (w_i, w_j)
letter_pair_counts[w_i][w_j] += 1
print ("Text: '%s'" % text)
print ("\n==> Frequencies:")
print_sparse_matrix (letter_pair_counts)
assert letter_pair_counts['c']['c'] == 4
assert letter_pair_counts['h']['o'] == 5
Using all of the preceding material, implement the A-Priori algorithm from the previous Lab 3 notebook to detect frequent email correspondents.
You may make the following simplifying assumptions, which may or may not be valid depending on what the true analysis end-goal is.
bob@gatech.edu
sending to kate@aol.com
as the same pair as kate@aol.com
sending to bob@gatech.edu
.For Jeffrey Skilling's
maildir
and a threshold of 65 or more co-occurrences, our solution finds 10 frequently corresponding pairs. For the full data set, it finds 140 pairs.
In [ ]:
# Specify maildir location; you may need to update these paths.
MAILDIR = 'enron-maildir-subset/skilling-j' # Skilling's mail only
#MAILDIR = 'enron-maildir-subset' # Full subset
# Specify the minimum number of occurrences to be considered "frequent"
THRESHOLD = 65
# === FILL-IN YOUR IMPLEMENTATION AND TEST CODE BELOW ==
pass
In [ ]: