The first part of today's class is the following notebook, which continues the last two exercises from Lab 4.
In [1]:
# === Sparse vector: definition ===
from collections import defaultdict
def sparse_vector ():
return defaultdict (int)
def print_sparse_vector (x):
for key, value in x.items ():
print ("%s: %d" % (key, value))
In [2]:
# === Sparse vector demo ===
def alpha_chars (text):
"""
(Generator) Yields each of the alphabetic characters in a string.
"""
for letter in text:
if letter.isalpha ():
yield letter
text = """How much wood could a woodchuck chuck
if a woodchuck could chuck wood?"""
letter_freqs = sparse_vector ()
for letter in alpha_chars (text.lower ()):
letter_freqs[letter] += 1
# If you really wanted an list of the letters
letter_freqs2 = [a for a in alpha_chars (text.lower ())]
print_sparse_vector (letter_freqs)
assert letter_freqs['o'] == 11 and letter_freqs['h'] == 6
print ("\n(Passed partial test.)")
Exercise: Sparse matrices. Suppose that we instead want to compute how frequently pairs of letters occur within words.
Instead of a sparse vector, you might instead maintain a table, or sparse matrix, such that the $(i, j)$ entry of the matrix counts the number of times the letters $i$ and $j$ co-occur within a word.
Complete the code below to implement a sparse matrix that counts the number of times that a pair of letters co-occurs in a word. In particular, fill in the code for
sparse_matrix()
andalpha_chars_pairs()
.
In [5]:
# === COMPLETE THIS FUNCTION ===
# Hint: See the definition of `print_sparse_matrix()`
# to see the interface to which your sparse matrix object
# should conform.
def sparse_matrix ():
"""
Returns an empty sparse matrix that can hold integer counts
of pairs of elements.
"""
return defaultdict (sparse_vector)
def print_sparse_matrix (x):
for i, row_i in x.items ():
for j, value in row_i.items ():
print ("[%s, %s]: %d" % (i, j, value))
# === COMPLETE THIS FUNCTION ===
# Hint: Look at how this function is used, below.
import itertools
def alpha_chars_pairs (text):
"""
(Generator) Yields every one of the 4-choose-2 pairs of
'positionally distinct' alphabetic characters in a string.
Assume 'text' is a single word.
That is, each position of the string is regarded as distinct,
but the pair of characters coming from positions (i, j),
where i != j, are considered the "same" as the paired
positions (j, i). Non-alphabetic characters should be
ignored.
For instance, `alpha_chars_pairs ("te3x_t")` should produce
has just 4 positionally distinct characters, so this routine
should return the 4 choose 2 == 6 pairs:
('t', 'e') <-- from positions (0, 1)
('t', 'x') <-- from positions (0, 3)
('t', 't') <-- from positions (0, 5)
('e', 'x') <-- from positions (1, 3)
('e', 't') <-- from positions (1, 5)
('x', 't') <-- from positions (3, 5)
"""
# Shang's neat solution!
return itertools.combinations (alpha_chars (text.lower ()), 2)
""" # Rich's original solution, which is less neat:
alpha_text = list (alpha_chars (text.lower ()))
for i in range (0, len (alpha_text)):
for j in range (i+1, len (alpha_text)):
yield (alpha_text[i], alpha_text[j])
"""
# === Testing code follows ===
# Compute frequency of pairs of positionally distinct,
# case-insensitive alphabetic characters in a word.
letter_pair_counts = sparse_matrix ()
words = text.split ()
for word in words:
for w_i, w_j in alpha_chars_pairs (word.lower ()):
# Enforce convention: w_i < w_j
w_i, w_j = min (w_i, w_j), max (w_i, w_j)
letter_pair_counts[w_i][w_j] += 1
print ("Text: '%s'" % text)
print ("\n==> Frequencies:")
print_sparse_matrix (letter_pair_counts)
assert letter_pair_counts['c']['c'] == 4
assert letter_pair_counts['h']['o'] == 5
print ("\n(Passed partial test.)")
Using all of the preceding material, implement the A-Priori algorithm from the previous Lab 3 notebook to detect frequent email correspondents.
But first, here's a little bit of helper code from last time, which you'll find useful.
In [ ]:
# This code box contains some helper routines needed to implement
# the A-Priori algorithm.
import re
import email.parser
import os
EMAIL_PATTERN = re.compile (r'[\w+.]+@[\w.]+')
def messages (maildir_root):
"""
(Generator) Given a mailbox directory name, yields an
email object for each message therein.
"""
for base, dirs, files in os.walk (maildir_root):
for filename in files:
filepath = os.path.join (base, filename)
email_file = open (filepath)
msg = email.parser.Parser ().parse (email_file)
email_file.close ()
if len (msg) > 0: # Patch for non-email files?
yield msg
Exercise: The A-Priori algorithm applied to email. Your task is to implement the a-priori algorithm to generate a list of commonly co-occurring correspondents.
You may make the following simplifying assumptions, which may or may not be valid depending on what the true analysis end-goal is.
bob@gatech.edu
sending to kate@aol.com
as the same pair as kate@aol.com
sending to bob@gatech.edu
.For Jeffrey Skilling's
maildir
and a threshold of 65 or more co-occurrences, our solution finds 10 frequently corresponding pairs. For the full data set, it finds 140 pairs.
In [ ]:
# Specify maildir location; you may need to update these paths.
MAILDIR = 'enron-maildir-subset/skilling-j' # Skilling's mail only
#MAILDIR = 'enron-maildir-subset' # Full subset
# Specify the minimum number of occurrences to be considered "frequent"
THRESHOLD = 65
# === FILL-IN YOUR IMPLEMENTATION AND TEST CODE BELOW ==
pass
In [ ]: