CSE 6040, Fall 2015 [05, Part A]: A-priori algorithm wrap-up

The first part of today's class is the following notebook, which continues the last two exercises from Lab 4.

Sparse vectors and matrices

When we ended Lab 3 in class, we asked you to complete the following exercise, which is a test to see whether you understand how default dictionaries work.

First, recall the notion of a sparse (integer) vector.


In [1]:
# === Sparse vector: definition ===
from collections import defaultdict

def sparse_vector ():
    return defaultdict (int)

def print_sparse_vector (x):
    for key, value in x.items ():
        print ("%s: %d" % (key, value))

In [2]:
# === Sparse vector demo ===
def alpha_chars (text):
    """
    (Generator) Yields each of the alphabetic characters in a string.
    """
    for letter in text:
        if letter.isalpha ():
            yield letter

text = """How much wood could a woodchuck chuck
if a woodchuck could chuck wood?"""
letter_freqs = sparse_vector ()
for letter in alpha_chars (text.lower ()):
    letter_freqs[letter] += 1
    
# If you really wanted an list of the letters
letter_freqs2 = [a for a in alpha_chars (text.lower ())]
    
print_sparse_vector (letter_freqs)

assert letter_freqs['o'] == 11 and letter_freqs['h'] == 6
print ("\n(Passed partial test.)")


a: 2
c: 11
d: 6
f: 1
i: 1
h: 6
k: 4
m: 1
l: 2
o: 11
u: 7
w: 5

(Passed partial test.)

Exercise: Sparse matrices. Suppose that we instead want to compute how frequently pairs of letters occur within words.

Instead of a sparse vector, you might instead maintain a table, or sparse matrix, such that the $(i, j)$ entry of the matrix counts the number of times the letters $i$ and $j$ co-occur within a word.

Complete the code below to implement a sparse matrix that counts the number of times that a pair of letters co-occurs in a word. In particular, fill in the code for sparse_matrix() and alpha_chars_pairs().


In [5]:
# === COMPLETE THIS FUNCTION ===
# Hint: See the definition of `print_sparse_matrix()`
# to see the interface to which your sparse matrix object
# should conform.
def sparse_matrix ():
    """
    Returns an empty sparse matrix that can hold integer counts
    of pairs of elements.
    """
    return defaultdict (sparse_vector)


def print_sparse_matrix (x):
    for i, row_i in x.items ():
        for j, value in row_i.items ():
            print ("[%s, %s]: %d" % (i, j, value))
            
            
# === COMPLETE THIS FUNCTION ===
# Hint: Look at how this function is used, below.

import itertools

def alpha_chars_pairs (text):
    """
    (Generator) Yields every one of the 4-choose-2 pairs of
    'positionally distinct' alphabetic characters in a string.
    
    Assume 'text' is a single word.
    
    That is, each position of the string is regarded as distinct,
    but the pair of characters coming from positions (i, j),
    where i != j, are considered the "same" as the paired
    positions (j, i). Non-alphabetic characters should be
    ignored.
    
    For instance, `alpha_chars_pairs ("te3x_t")` should produce
    has just 4 positionally distinct characters, so this routine
    should return the 4 choose 2 == 6 pairs:
      ('t', 'e')    <-- from positions (0, 1)
      ('t', 'x')    <-- from positions (0, 3)
      ('t', 't')    <-- from positions (0, 5)
      ('e', 'x')    <-- from positions (1, 3)
      ('e', 't')    <-- from positions (1, 5)
      ('x', 't')    <-- from positions (3, 5)
    """
    # Shang's neat solution!
    return itertools.combinations (alpha_chars (text.lower ()), 2)

""" # Rich's original solution, which is less neat:
    alpha_text = list (alpha_chars (text.lower ()))
    for i in range (0, len (alpha_text)):
        for j in range (i+1, len (alpha_text)):
            yield (alpha_text[i], alpha_text[j])
"""

 

# === Testing code follows ===

# Compute frequency of pairs of positionally distinct,
# case-insensitive alphabetic characters in a word.
letter_pair_counts = sparse_matrix ()
words = text.split ()
for word in words:
    for w_i, w_j in alpha_chars_pairs (word.lower ()):
        # Enforce convention: w_i < w_j
        w_i, w_j = min (w_i, w_j), max (w_i, w_j)
        letter_pair_counts[w_i][w_j] += 1

print ("Text: '%s'" % text)
print ("\n==> Frequencies:")
print_sparse_matrix (letter_pair_counts)

assert letter_pair_counts['c']['c'] == 4
assert letter_pair_counts['h']['o'] == 5
print ("\n(Passed partial test.)")


Text: 'How much wood could a woodchuck chuck
if a woodchuck could chuck wood?'

==> Frequencies:
[c, c]: 4
[c, d]: 6
[c, h]: 9
[c, k]: 8
[c, m]: 1
[c, l]: 2
[c, o]: 10
[c, u]: 11
[c, w]: 4
[d, h]: 2
[d, k]: 2
[d, l]: 2
[d, o]: 10
[d, u]: 4
[d, w]: 4
[f, i]: 1
[h, u]: 5
[h, k]: 4
[h, m]: 1
[h, w]: 3
[h, o]: 5
[k, u]: 4
[k, o]: 4
[k, w]: 2
[m, u]: 1
[l, u]: 2
[l, o]: 2
[o, u]: 6
[o, o]: 4
[o, w]: 9
[u, w]: 2

(Passed partial test.)

Putting it all together: The A-Priori algorithm

Using all of the preceding material, implement the A-Priori algorithm from the previous Lab 3 notebook to detect frequent email correspondents.

But first, here's a little bit of helper code from last time, which you'll find useful.


In [ ]:
# This code box contains some helper routines needed to implement
# the A-Priori algorithm.

import re
import email.parser
import os

EMAIL_PATTERN = re.compile (r'[\w+.]+@[\w.]+')

def messages (maildir_root):
    """
    (Generator) Given a mailbox directory name, yields an
    email object for each message therein.
    """
    for base, dirs, files in os.walk (maildir_root):
        for filename in files:
            filepath = os.path.join (base, filename)
            email_file = open (filepath)
            msg = email.parser.Parser ().parse (email_file)
            email_file.close ()
            if len (msg) > 0: # Patch for non-email files?
                yield msg

Exercise: The A-Priori algorithm applied to email. Your task is to implement the a-priori algorithm to generate a list of commonly co-occurring correspondents.

You may make the following simplifying assumptions, which may or may not be valid depending on what the true analysis end-goal is.

  • You need only examine the 'From:' and 'To:' fields of an email message. Ignore all other fields.
  • You should only "count" an email address if both the 'From:' and 'To:' fields are set. Otherwise, you cannot tell from whom the message was sent or who is the recipient, and may therefore ignore the interaction.
  • Consider pairs that consist of a sender and a recipient. In other words, do not match multiple recipients of a single message as a "pair."
  • Ignore the direction of the exchange. That is, regard bob@gatech.edu sending to kate@aol.com as the same pair as kate@aol.com sending to bob@gatech.edu.

For Jeffrey Skilling's maildir and a threshold of 65 or more co-occurrences, our solution finds 10 frequently corresponding pairs. For the full data set, it finds 140 pairs.


In [ ]:
# Specify maildir location; you may need to update these paths.
MAILDIR = 'enron-maildir-subset/skilling-j' # Skilling's mail only
#MAILDIR = 'enron-maildir-subset' # Full subset

# Specify the minimum number of occurrences to be considered "frequent"
THRESHOLD = 65

# === FILL-IN YOUR IMPLEMENTATION AND TEST CODE BELOW ==
pass

In [ ]: