The Task:

Train a classifier that can predict whether an email is "important" or not.

We'll approach this using our own Gmail data--from building and evaluating our classifier's performance to deploying it using a combination of Rackspace's Mailgun and Amazon's EC2 services.

Document classification

Common classification task--we can take a peak at the cheatsheet when it comes to what model we should use:

The data

But what's the tradeoff between the size of our dataset and classifier performance?

How much data do we need to make a "decent" classifier?

Source

Let's obtain our data

First, grab your gmail data: https://www.google.com/settings/takeout

You can be conservative, and only fetch your inbox data as that's all we really need:

This'll take a bit to wait for--we'll continue on, using my pre-fetched personal Gmail data.

We'll get a zip file, containing .mbox files, one for each folder in your Gmail account. mbox is a file format for storing emails--it's simply a plain-text file of all your emails concatenated together, we can take a peak at one:


In [2]:
# unzip 'em
!unzip /Users/max/Downloads/max.mautner@gmail.com-20131218T185235Z-Mail.zip -d ./data/


Archive:  /Users/max/Downloads/max.mautner@gmail.com-20131218T185235Z-Mail.zip
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/meetup.com.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Chat.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Important.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Sent Messages.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Unread.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Archived.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Spam.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/chipy.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/[Imap]/Sent.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/comcast.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Trash.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Notes.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/OS-Dev/Django-Dev.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/OS-Dev/SciKit learn.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Drafts.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/[Imap]/Trash.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/amazon.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Receipts.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Tracked Email.mbox  
  inflating: ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Starred.mbox  

In [6]:
!ls -l ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/


total 4024984
-rw-r--r--@ 1 max  staff  303432822 Dec 18 13:10 Archived.mbox
-rw-r--r--@ 1 max  staff   33416651 Dec 18 13:07 Chat.mbox
-rw-r--r--@ 1 max  staff       2632 Dec 18 13:10 Drafts.mbox
-rw-r--r--@ 1 max  staff  545667670 Dec 18 13:07 Important.mbox
-rw-r--r--@ 1 max  staff  833349412 Dec 18 13:08 Inbox.mbox
-rw-r--r--@ 1 max  staff       5953 Dec 18 13:10 Notes.mbox
drwxr-xr-x@ 4 max  staff        136 Feb 13 04:28 OS-Dev
-rw-r--r--@ 1 max  staff     193226 Dec 18 13:10 Receipts.mbox
-rw-r--r--@ 1 max  staff        929 Dec 18 13:08 Sent Messages.mbox
-rw-r--r--@ 1 max  staff    1639936 Dec 18 13:10 Spam.mbox
-rw-r--r--@ 1 max  staff      29869 Dec 18 13:10 Starred.mbox
-rw-r--r--@ 1 max  staff      10119 Dec 18 13:10 Tracked Email.mbox
-rw-r--r--@ 1 max  staff     780533 Dec 18 13:10 Trash.mbox
-rw-r--r--@ 1 max  staff  196938301 Dec 18 13:09 Unread.mbox
drwxr-xr-x@ 4 max  staff        136 Feb 13 04:28 [Imap]
-rw-r--r--@ 1 max  staff   29579153 Dec 18 13:10 amazon.mbox
-rw-r--r--@ 1 max  staff   48258540 Dec 18 13:10 chipy.mbox
-rw-r--r--@ 1 max  staff     455999 Dec 18 13:10 comcast.mbox
-rw-r--r--@ 1 max  staff   66992914 Dec 18 13:07 meetup.com.mbox

In [7]:
from glob import glob
mailboxes = glob('./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/*')
mailboxes


Out[7]:
['./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/[Imap]',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/amazon.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Archived.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Chat.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/chipy.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/comcast.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Drafts.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Important.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/meetup.com.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Notes.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/OS-Dev',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Receipts.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Sent Messages.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Spam.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Starred.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Tracked Email.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Trash.mbox',
 './data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Unread.mbox']

In [8]:
!head ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox











The pipeline

Let's enumerate our tasks for building our "email importance" classifier:

  1. define, extract our predictive features
  2. fit our model to the training data
  3. score our model against held-back test data

If it's good enough, we can go ahead and apply the model in production.

It sucks? Either:

  • Go back to step 1 and consider more features
  • Get more data

For now, we'll just assume we've got enough data, and that bag-of-words feature extraction will be adequate.

Feature extraction

Let's start w/ feature extraction--and let's only consider a subset of our folders:


In [13]:
interesting_mboxes = ['./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox']

Let's import some things to do some basic pre-processing:


In [14]:
import mailbox
import email
from nltk import clean_html
import sys

In [15]:
%time len(mailbox.mbox(interesting_mboxes[0]).items())


CPU times: user 1min 23s, sys: 2.38 s, total: 1min 25s
Wall time: 1min 26s
Out[15]:
9687

In [16]:
corpus = []
labels = []

def create_corpus():
    for fname in interesting_mboxes:
        print fname
        sys.stdout.flush() # make sure to flush to output
        category = fname.split('/')[-1].split('.')[0].lower()
        mbox = mailbox.mbox(fname)
        for msg_id, email_obj in mbox.items():
            if 'Sent' not in email_obj['X-Gmail-Labels']:
                category = 1 if 'Important' in email_obj['X-Gmail-Labels'].split(',') else 0
            else:
                continue

            body = ''
            for part in email_obj.walk():
                if part.get_content_type() == 'text/html':
                    body = clean_html(part.get_payload())
                    break
                elif part.get_content_type() == 'text/plain':
                    body = part.get_payload()
            else:
                continue
            body += ' ' + ' '.join(email_obj.keys())
                
            corpus.append(body)
            labels.append(category)

In [17]:
%time create_corpus()


./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox
CPU times: user 1min 32s, sys: 2.79 s, total: 1min 35s
Wall time: 1min 38s

What do these look like?


In [18]:
print labels[0]
print corpus[0]


1
Hello Max Mautner, 
 
The expiration date of your Hubway membership is 2013-10-28. To easily rene=
w your membership, please visit our website and log into your Member Profile. 
 
Don=E2=80=99t miss a day on Hubway! Sign up for auto-renew and forget about=
 having to remember to manually renew your membership next year. More infor=
mation about Hubway=E2=80=99s auto-renewing feature is available on your Membe=
r Profile page.
Thank you from the Hubway Team! 
 
---------------------------------------------------------------------------=
-------------- 
Check us out online at www.thehubway.com or at Facebook.com/Hubway . 
 
Contact us directly: 
Phone: 855-4HUBWAY (448-2929) 
Email: customerservice@theh=
ubway.com X-GM-THRID X-Gmail-Labels Delivered-To Received X-Received Return-Path Received Received-SPF Authentication-Results Received Received Date From To Message-ID Subject MIME-Version Content-Type Precedence X-Spam-Score X-Spam-Level X-Spam-Report

How large is our corpus (in # of documents)?


In [19]:
len(corpus), len(labels)


Out[19]:
(4721, 4721)

How many emails do we have of each label?


In [20]:
import pandas as pd
d = pd.DataFrame(labels, columns=['labels'])
print d.labels.value_counts()/float(d.shape[0])


0    0.679729
1    0.320271
dtype: float64
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytz/__init__.py:29: UserWarning: Module argparse was already imported from /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/argparse.pyc, but /usr/local/lib/python2.7/site-packages is being added to sys.path
  from pkg_resources import resource_stream

Extract those features

Bag of words


In [21]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize,
                             stop_words='english',
                             max_features=6000,
                             ngram_range=(1,1))
#vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1) # bigrams
#vectorizer = TfidfTransformer() # tf-idf

In [23]:
vectorizer


Out[23]:
CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=6000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=<function word_tokenize at 0x10c4b8488>, vocabulary=None)

In [24]:
CountVectorizer?

In [22]:
%time vectors = vectorizer.fit_transform(corpus)


CPU times: user 19.7 s, sys: 387 ms, total: 20 s
Wall time: 19.8 s

Let's feed our bag-of-words model our extracted features and cross-validate its performance:


In [25]:
import numpy as np
from sklearn.cross_validation import ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from collections import defaultdict

X = vectors
y = np.array(labels)

In [26]:
label_train_scores = defaultdict(list)
label_test_scores = defaultdict(list)
train_scores = []
test_scores = []

from sklearn import metrics

cv = ShuffleSplit(len(corpus), n_iter=10, test_size=0.1, random_state=0)

for cv_index, (train, test) in enumerate(cv):
    print cv_index
    sys.stdout.flush()
    
    gnb = MultinomialNB().fit(X[train], y[train])
    
    for label in d.labels.unique():
        train_special = [a for a in d.index[d.labels == label] if a in train]
        test_special = [a for a in d.index[d.labels == label] if a in test]
        
        label_train_scores[label].append(gnb.score(X[train_special], y[train_special]))
        label_test_scores[label].append(gnb.score(X[test_special], y[test_special]))
                
    train_scores.append(gnb.score(X[train], y[train]))
    test_scores.append(gnb.score(X[test], y[test]))


0
1
2
3
4
5
6
7
8
9

In [39]:
from pprint import pprint
for l in d.labels.unique():
    print l
    print "Training:\t %.1f%%" % (np.multiply(np.average(label_train_scores[l]), 100))
    print "Test:\t\t*%.1f%%*" % (np.multiply(np.average(label_test_scores[l]), 100))


1
Training:	 95.8%
Test:		*93.4%*
0
Training:	 76.0%
Test:		*74.0%*

Are we done?

There are lots of improvements to be made to this model besides gathering more data.

One common technique is stemming or lemmatizing the words after tokenization:


In [80]:
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

LEMMATIZER = WordNetLemmatizer()
STOP_SET = set(stopwords.words('english'))
words = 'run runs running ran'
for word in words.split(' '):
    print LEMMATIZER.lemmatize(word.lower())


run
run
running
ran

Deploying

Now that we have a model that's "adequately" fit to our data, we can go ahead and place it on a server for it to make predictions on our inbound emails. For this, we'll handle emails forwarded from our gmail account.

We must first setup an EC2 instance for handling inbound emails from Mailgun--we'll be relaying emails from Gmail to Mailgun (SMTP) and from Mailgun to EC2 (HTTP). Here's an EC2 AMI (Amazon Machine Image) that is an ubuntu server w/ all of the requisite python libraries (namely scikit-learn) that we need to run our model:

NOTE: Handling our emails via HTTP is a lot easier for a multitude of reasons (namely that setting up an email server is arduous and requires more domain knowledge than is worth your or my time).

W/ the server up, we have to serialize our model so that it can be loaded on the server:


In [40]:
# train on entire dataset:
gnb = MultinomialNB().fit(X, y)

In [41]:
vectorizer.transform?

In [42]:
from sklearn.externals import joblib
joblib.dump(gnb, 'email_importance.pkl', compress=9)
joblib.dump(vectorizer, 'vectorizer.pkl', compress=9)


Out[42]:
['vectorizer.pkl']

In [44]:
!du -hc *.pkl


 40K	email_importance.pkl
1.6M	vectorizer.pkl
1.7M	total

In [45]:
model_clone = joblib.load('email_importance.pkl')
vectorizer_clone = joblib.load('vectorizer.pkl')

In [46]:
type(model_clone), type(vectorizer_clone)


Out[46]:
(sklearn.naive_bayes.MultinomialNB,
 sklearn.feature_extraction.text.CountVectorizer)

In [47]:
model_clone.predict(X[0]), y[0]


Out[47]:
(array([1]), 1)

In [137]:
!scp -i /Users/max/.ssh/keys/ivendorz.pem email_importance.pkl ubuntu@ec2-54-202-114-193.us-west-2.compute.amazonaws.com:~/
!scp -i /Users/max/.ssh/keys/ivendorz.pem vectorizer.pkl ubuntu@ec2-54-202-114-193.us-west-2.compute.amazonaws.com:~/


email_importance.pkl                          100%   39KB  39.1KB/s   00:00    
vectorizer.pkl                                100% 1674KB   1.6MB/s   00:00    

Now we're set to receive our emails and classify them =)

This requires having a server w/ a simple web app to handle HTTP POSTs of emails from Mailgun (remember, we wanted to avoid setting up an SMTP server). I'll demo this live, but you can take a peak at the git repo for this talk for an example webapp.

Pending your email-receiving plumbing working, you can go to your Gmail account's "Settings" => "Forwarding and POP/IMAP" where you can add an email address to forward to ("whatever@sandbox12345.mailgun.org").

This will trigger a confirmation email to your Mailgun address containing a URL you must navigate to in order to "opt-in" to receiving all forwarded emails from your Gmail account.

Setting up email-forwarding can be done w/ any of your other accounts (e.g. with Yahoo, Hotmail), allowing you to use your now generalizable "importance" model to screen all your inbound emails.

Takeaways:

  1. This stuff is not scary.

  2. There's a lot of no-brainer optimizations.

  3. Data is really valuable--particularly behavioral data.

If you've got feedback or want to chat about this sort of thing, feel free to drop me an email: max.mautner[at]gmail.com

Questions?