Train a classifier that can predict whether an email is "important" or not.
We'll approach this using our own Gmail data--from building and evaluating our classifier's performance to deploying it using a combination of Rackspace's Mailgun and Amazon's EC2 services.
This is a common classification task--we can take a peek at the cheatsheet to decide which model to use:
But what's the tradeoff between the size of our dataset and classifier performance?
How much data do we need to make a "decent" classifier?
First, grab your Gmail data: https://www.google.com/settings/takeout
You can be conservative, and only fetch your inbox data as that's all we really need:
The export takes a while to prepare--we'll continue on, using my pre-fetched personal Gmail data.
We'll get a zip file containing .mbox files, one for each folder in your Gmail account. mbox is a file format for storing emails--it's simply a plain-text file of all your emails concatenated together. We can take a peek at one:
In [2]:
# unzip 'em
!unzip /Users/max/Downloads/max.mautner@gmail.com-20131218T185235Z-Mail.zip -d ./data/
In [6]:
!ls -l ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/
In [7]:
from glob import glob
mailboxes = glob('./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/*')
mailboxes
Out[7]:
In [8]:
!head ./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox
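To make the mbox layout concrete, here is a minimal sketch using only Python's standard `mailbox` and `email` modules. It writes two made-up messages to a temporary file (the addresses, subjects, and file name are invented for illustration) and shows that each message in the file starts with a `From ` separator line.

```python
import mailbox
import os
import tempfile
from email.message import Message

# Build a throwaway mbox file; every address and subject here is made up.
path = os.path.join(tempfile.mkdtemp(), 'demo.mbox')
box = mailbox.mbox(path)
for i in range(2):
    msg = Message()
    msg['From'] = 'sender%d@example.com' % i
    msg['To'] = 'me@example.com'
    msg['Subject'] = 'Demo message %d' % i
    msg.set_payload('Body of message %d' % i)
    box.add(msg)
box.flush()
box.close()

# The file is plain text: each message starts with a "From " separator line
# (note the trailing space, which distinguishes it from the "From:" header).
with open(path) as f:
    raw = f.read()
separator_count = sum(1 for line in raw.splitlines() if line.startswith('From '))

# Reading the file back yields one message object per email.
subjects = [m['Subject'] for m in mailbox.mbox(path)]
print(separator_count)
print(subjects)
```

Iterating over a `mailbox.mbox` object this way is the same mechanism the corpus-building code relies on when it loops over `mbox.items()`.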
Let's enumerate our tasks for building our "email importance" classifier:
If it's good enough, we can go ahead and apply the model in production.
It sucks? Either:
For now, we'll just assume we've got enough data, and that bag-of-words feature extraction will be adequate.
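As a rough illustration of what bag-of-words extraction produces, the sketch below counts word occurrences in two invented snippets using only the standard library. The crude tokenizer and the toy texts are assumptions for illustration; the notebook itself uses scikit-learn's CountVectorizer with NLTK's tokenizer.

```python
import re
from collections import Counter

docs = [
    'Meeting moved to Friday, please confirm',   # made-up example text
    'Huge sale this Friday! Sale ends soon',
]

def tokenize(text):
    # Crude tokenizer: lowercase, keep runs of letters only.
    return re.findall(r'[a-z]+', text.lower())

# Vocabulary: every distinct token, in a fixed (sorted) order.
vocabulary = sorted(set(tok for doc in docs for tok in tokenize(doc)))

def to_vector(text):
    # One count per vocabulary word; word order in the text is discarded.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Each email becomes a fixed-length row of counts, so "sale" appearing twice in the second snippet shows up as a 2 in that word's column.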
In [13]:
interesting_mboxes = ['./data/max.mautner@gmail.com-20131218T185235Z-Mail/Mail/Inbox.mbox']
Let's import some things to do some basic pre-processing:
In [14]:
import mailbox
import email
from nltk import clean_html
import sys
In [15]:
%time len(mailbox.mbox(interesting_mboxes[0]).items())
Out[15]:
In [16]:
corpus = []
labels = []
def create_corpus():
    for fname in interesting_mboxes:
        print fname
        sys.stdout.flush() # make sure to flush to output
        category = fname.split('/')[-1].split('.')[0].lower()
        mbox = mailbox.mbox(fname)
        for msg_id, email_obj in mbox.items():
            if 'Sent' not in email_obj['X-Gmail-Labels']:
                category = 1 if 'Important' in email_obj['X-Gmail-Labels'].split(',') else 0
            else:
                continue
            body = ''
            for part in email_obj.walk():
                if part.get_content_type() == 'text/html':
                    body = clean_html(part.get_payload())
                    break
                elif part.get_content_type() == 'text/plain':
                    body = part.get_payload()
                else:
                    continue
            body += ' ' + ' '.join(email_obj.keys())
            corpus.append(body)
            labels.append(category)
In [17]:
%time create_corpus()
What do these look like?
In [18]:
print labels[0]
print corpus[0]
How large is our corpus (in # of documents)?
In [19]:
len(corpus), len(labels)
Out[19]:
How many emails do we have of each label?
In [20]:
import pandas as pd
d = pd.DataFrame(labels, columns=['labels'])
print d.labels.value_counts()/float(d.shape[0])
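The label proportions matter when reading accuracy figures: a classifier that always predicts the more common label already scores that label's share of the data. A minimal sketch with invented labels (not the counts from my inbox):

```python
from collections import Counter

# Hypothetical labels: 1 = important, 0 = not important.
toy_labels = [0] * 80 + [1] * 20

counts = Counter(toy_labels)
total = float(len(toy_labels))
proportions = dict((label, n / total) for label, n in counts.items())

# Accuracy of always guessing the most common label.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(proportions)
print(baseline_accuracy)
```

Any model we train should beat this baseline before we call it "decent".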
In [21]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize,
                             stop_words='english',
                             max_features=6000,
                             ngram_range=(1,1))
#vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1) # bigrams
#vectorizer = TfidfTransformer() # tf-idf
In [23]:
vectorizer
Out[23]:
In [24]:
CountVectorizer?
In [22]:
%time vectors = vectorizer.fit_transform(corpus)
Let's feed our extracted bag-of-words features to a model and cross-validate its performance:
In [25]:
import numpy as np
from sklearn.cross_validation import ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from collections import defaultdict
X = vectors
y = np.array(labels)
In [26]:
label_train_scores = defaultdict(list)
label_test_scores = defaultdict(list)
train_scores = []
test_scores = []
from sklearn import metrics
cv = ShuffleSplit(len(corpus), n_iter=10, test_size=0.1, random_state=0)
for cv_index, (train, test) in enumerate(cv):
    print cv_index
    sys.stdout.flush()
    gnb = MultinomialNB().fit(X[train], y[train])
    for label in d.labels.unique():
        train_special = [a for a in d.index[d.labels == label] if a in train]
        test_special = [a for a in d.index[d.labels == label] if a in test]
        label_train_scores[label].append(gnb.score(X[train_special], y[train_special]))
        label_test_scores[label].append(gnb.score(X[test_special], y[test_special]))
    train_scores.append(gnb.score(X[train], y[train]))
    test_scores.append(gnb.score(X[test], y[test]))
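For intuition about what ShuffleSplit does on each iteration, here is a minimal standard-library sketch: shuffle the document indices, hold out a fraction for testing, and train on the rest. The function name `shuffle_split` and its parameters are made up for this illustration; this is not scikit-learn's implementation.

```python
import random

def shuffle_split(n_samples, n_iter=10, test_size=0.1, seed=0):
    """Yield (train, test) index lists, reshuffling on every iteration."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices are held out; the rest train.
        yield indices[n_test:], indices[:n_test]

splits = list(shuffle_split(50, n_iter=3, test_size=0.1))
for train, test in splits:
    print('%d train / %d test' % (len(train), len(test)))
```

Because every iteration reshuffles independently, the same email can land in the test set more than once across iterations, unlike in k-fold cross-validation.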
In [39]:
from pprint import pprint
for l in d.labels.unique():
    print l
    print "Training:\t %.1f%%" % (np.multiply(np.average(label_train_scores[l]), 100))
    print "Test:\t\t*%.1f%%*" % (np.multiply(np.average(label_test_scores[l]), 100))
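Scoring each label's emails separately, as the cells above do, amounts to computing per-class recall: of the emails that truly carry a label, what fraction did the model get right? A small sketch with invented predictions, where `per_label_accuracy` is a hypothetical helper:

```python
def per_label_accuracy(y_true, y_pred):
    """Fraction of correctly predicted examples within each true label."""
    result = {}
    for label in set(y_true):
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
        correct = sum(1 for t, p in pairs if t == p)
        result[label] = correct / float(len(pairs))
    return result

# Made-up ground truth and predictions (1 = important, 0 = not important).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

overall = sum(1 for t, p in zip(y_true, y_pred) if t == p) / float(len(y_true))
scores = per_label_accuracy(y_true, y_pred)
print(overall)
print(scores)
```

In this toy case the overall accuracy looks healthy while half of the "important" emails are missed, which is why it's worth breaking scores out per label when the classes are unbalanced.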
There are lots of improvements to be made to this model besides gathering more data.
One common technique is stemming or lemmatizing the words after tokenization:
In [80]:
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
LEMMATIZER = WordNetLemmatizer()
STOP_SET = set(stopwords.words('english'))
words = 'run runs running ran'
for word in words.split(' '):
    print LEMMATIZER.lemmatize(word.lower())
Now that we have a model that's "adequately" fit to our data, we can go ahead and place it on a server for it to make predictions on our inbound emails. For this, we'll handle emails forwarded from our Gmail account.
We must first set up an EC2 instance for handling inbound emails from Mailgun--we'll be relaying emails from Gmail to Mailgun (SMTP) and from Mailgun to EC2 (HTTP). Here's an EC2 AMI (Amazon Machine Image) that is an Ubuntu server w/ all of the requisite Python libraries (namely scikit-learn) that we need to run our model:
NOTE: Handling our emails via HTTP is a lot easier for a multitude of reasons (namely that setting up an email server is arduous and requires more domain knowledge than is worth your or my time).
W/ the server up, we have to serialize our model so that it can be loaded on the server:
In [40]:
# train on entire dataset:
gnb = MultinomialNB().fit(X, y)
In [41]:
vectorizer.transform?
In [42]:
from sklearn.externals import joblib
joblib.dump(gnb, 'email_importance.pkl', compress=9)
joblib.dump(vectorizer, 'vectorizer.pkl', compress=9)
Out[42]:
In [44]:
!du -hc *.pkl
In [45]:
model_clone = joblib.load('email_importance.pkl')
vectorizer_clone = joblib.load('vectorizer.pkl')
In [46]:
type(model_clone), type(vectorizer_clone)
Out[46]:
In [47]:
model_clone.predict(X[0]), y[0]
Out[47]:
In [137]:
!scp -i /Users/max/.ssh/keys/ivendorz.pem email_importance.pkl ubuntu@ec2-54-202-114-193.us-west-2.compute.amazonaws.com:~/
!scp -i /Users/max/.ssh/keys/ivendorz.pem vectorizer.pkl ubuntu@ec2-54-202-114-193.us-west-2.compute.amazonaws.com:~/
Now we're set to receive our emails and classify them =)
This requires having a server w/ a simple web app to handle HTTP POSTs of emails from Mailgun (remember, we wanted to avoid setting up an SMTP server). I'll demo this live, but you can take a peek at the git repo for this talk for an example webapp.
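For a sense of how small such a web app can be, here is a hedged sketch written for Python 3's standard library; it is not the webapp from the talk's repo. It assumes Mailgun delivers the email as a form-encoded POST with `subject` and `body-plain` fields (check Mailgun's docs for the exact field names and encodings), and `classify` is a hypothetical placeholder for loading the pickled vectorizer and model and calling `predict`.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode

def classify(text):
    # Placeholder: on the real server this would run
    # vectorizer.transform(...) followed by model.predict(...).
    return 1 if 'urgent' in text.lower() else 0

class InboundEmailHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        fields = parse_qs(self.rfile.read(length).decode('utf-8'))
        # 'subject' and 'body-plain' are assumed field names.
        text = ' '.join(fields.get('subject', []) + fields.get('body-plain', []))
        reply = str(classify(text)).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Local smoke test: bind to a free port and POST one fake email to ourselves.
server = HTTPServer(('127.0.0.1', 0), InboundEmailHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
payload = urlencode({'subject': 'URGENT: server down', 'body-plain': 'Please call me'})
conn = HTTPConnection('127.0.0.1', server.server_port, timeout=5)
conn.request('POST', '/', body=payload,
             headers={'Content-Type': 'application/x-www-form-urlencoded'})
response = conn.getresponse().read().decode('utf-8')
conn.close()
server.shutdown()
server.server_close()
print(response)
```

In practice you'd likely reach for a small web framework and act on the prediction (e.g. forward or flag the email) rather than just returning it, but the shape is the same: parse the POSTed fields, build the text, predict.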
Once your email-receiving plumbing is working, you can go to your Gmail account's "Settings" => "Forwarding and POP/IMAP" where you can add an email address to forward to ("whatever@sandbox12345.mailgun.org").
This will trigger a confirmation email to your Mailgun address containing a URL you must navigate to in order to "opt-in" to receiving all forwarded emails from your Gmail account.
Setting up email-forwarding can be done w/ any of your other accounts (e.g. with Yahoo, Hotmail), allowing you to use your now generalizable "importance" model to screen all your inbound emails.