This notebook shows how BigBang can help you analyze the senders in a particular mailing list archive.

First, use this IPython magic to tell the notebook to display matplotlib graphics inline. This is a nice way to display results.


In [1]:
%matplotlib inline



Import the BigBang modules as needed. These should be in your Python environment if you've installed BigBang correctly.


In [2]:
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
reload(process)


Out[2]:
<module 'bigbang.process' from '/Users/nick/code/mailing-list-analysis/bigbang/bigbang/process.pyc'>

Also, let's import a number of other dependencies we'll use later.


In [3]:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os

plt.style.use('ggplot') # stands in for the deprecated pd.options.display.mpl_style = 'default'

Now let's load the data for analysis.


In [4]:
urls = ["http://www.ietf.org/mail-archive/text/ietf-privacy/",
        "http://lists.w3.org/Archives/Public/public-privacy/"]
mlists = [mailman.open_list_archives(url,"../archives") for url in urls]
activities = [Archive(ml).get_activity() for ml in mlists]


Opening 36 archive files
Opening 17 archive files

Now, let's see: who are the authors of the most messages to one particular list?


In [5]:
a = activities[1] # the second mailing list in urls (public-privacy)
ta = a.sum(0) # total messages per sender (sums down each column)
ta = ta.sort_values() # ascending, so the most active senders come last
ta[-10:].plot(kind='barh', width=1)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1050c5bd0>

This might be useful for seeing the distribution (does the top message sender dominate?) or for identifying key participants to talk to.


Many mailing lists will have some duplicate senders: individuals who use multiple email addresses or are recorded as different senders when using the same email address. We want to identify those potential duplicates in order to get a more accurate representation of the distribution of senders.

To begin with, let's calculate the similarity of the From strings, based on the Levenshtein distance.
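
For reference, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. The following is a minimal pure-Python sketch of that textbook algorithm. The function name `levenshtein` and the sample strings are illustrative, and this is not necessarily how `process.sorted_matrix` computes its values.

```python
def levenshtein(s, t):
    """Return the edit distance between strings s and t (textbook dynamic programming)."""
    # previous[j] is the distance between the characters of s processed so far
    # (before the current row) and the first j characters of t.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # turning i characters of s into an empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            substitution = previous[j - 1] + (cs != ct)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # -> 3
```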


In [11]:
levdf = process.sorted_matrix(a) # creates a slightly more nuanced edit distance matrix
                                 # and sorts by rows/columns that have the best candidates
levdf_corner = levdf.iloc[:25,:25] # just take the top 25

In [12]:
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levdf_corner)
plt.yticks(np.arange(0.5, len(levdf_corner.index), 1), levdf_corner.index)
plt.xticks(np.arange(0.5, len(levdf_corner.columns), 1), levdf_corner.columns, rotation='vertical')
plt.colorbar()
plt.show()


For this still-naive measure (edit distance on a normalized string), many of the pairs with a distance below 10 appear to be genuine duplicates. Above that threshold the measure becomes unreliable: short email addresses at common domain names can score as close matches even when they belong to different people.
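
The effect of short addresses at common domains can be illustrated with made-up addresses. When two addresses share a domain, only the parts before the `@` contribute to the edit distance, so unrelated short addresses look nearly identical. The sketch below assumes a plain Levenshtein distance on the raw address (the notebook's matrix works on normalized From strings, so real values will differ). The helpers `is_candidate` and `local_part_ratio` are hypothetical, and comparing local parts is only one possible refinement, not something BigBang is claimed to do.

```python
def levenshtein(s, t):
    """Plain edit distance between two strings."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j - 1] + (cs != ct),
                               current[j - 1] + 1,
                               previous[j] + 1))
        previous = current
    return previous[-1]


def is_candidate(a, b, threshold=10):
    """Mimic the notebook's rule: flag a pair whose raw distance is below the threshold."""
    return levenshtein(a, b) < threshold


def local_part_ratio(a, b):
    """Edit distance of the parts before '@', scaled to [0, 1] by the longer part."""
    local_a, local_b = a.split("@")[0], b.split("@")[0]
    longest = max(len(local_a), len(local_b)) or 1
    return levenshtein(local_a, local_b) / float(longest)


# Made-up addresses: two different people with short names at a common domain.
pair = ("al@gmail.com", "bo@gmail.com")
print(levenshtein(*pair))       # the shared domain adds nothing to the distance
print(is_candidate(*pair))      # so the pair is flagged by a "< 10" rule
print(local_part_ratio(*pair))  # yet the local parts have no characters in common
```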


In [13]:
consolidates = []

# gather pairs of names which have a distance of less than 10
for col in levdf.columns:
    for index, value in levdf.loc[levdf[col] < 10, col].iteritems():
        if index != col: # the name shouldn't be a pair for itself
            consolidates.append((col, index))

print str(len(consolidates)) + ' candidates for consolidation.'


64 candidates for consolidation.

In [14]:
c = process.consolidate_senders_activity(a, consolidates)
print 'We removed: ' + str(len(a.columns) - len(c.columns)) + ' columns.'


We removed: 25 columns.
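
Conceptually, consolidation merges the activity of every group of senders linked by a candidate pair. The sketch below illustrates that idea in plain Python with a small union-find over per-sender totals. It is an assumption about the general technique, not BigBang's actual `consolidate_senders_activity` implementation, and the sender names and counts are made up.

```python
from collections import defaultdict


def consolidate_counts(counts, pairs):
    """Merge message counts of senders linked, directly or transitively, by pairs."""
    parent = {name: name for name in counts}

    def find(name):
        # Follow parent links to the group's representative, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # attach b's group to a's representative

    merged = defaultdict(int)
    for name, n in counts.items():
        merged[find(name)] += n
    return dict(merged)


# Made-up totals: one person recorded under three From strings.
counts = {"Jane Doe <jane@example.org>": 40,
          "Jane Doe <jdoe@example.org>": 15,
          "jane doe <jane@example.org>": 5,
          "Sam Roe <sam@example.net>": 30}
pairs = [("Jane Doe <jane@example.org>", "Jane Doe <jdoe@example.org>"),
         ("Jane Doe <jdoe@example.org>", "jane doe <jane@example.org>")]

merged = consolidate_counts(counts, pairs)
print(merged)
print(len(counts) - len(merged))  # number of sender entries removed
```

Because groups are merged transitively, it does not matter that the candidate list contains each pair in both orders.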

We can create the same color plot with the consolidated dataframe to see how the distribution has changed.


In [15]:
lev_c = process.sorted_matrix(c)
levc_corner = lev_c.iloc[:25,:25]
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levc_corner)
plt.yticks(np.arange(0.5, len(levc_corner.index), 1), levc_corner.index)
plt.xticks(np.arange(0.5, len(levc_corner.columns), 1), levc_corner.columns, rotation='vertical')
plt.colorbar()
plt.show()


Of course, there are still some duplicates, mostly people who are using the same name, but with a different email address at an unrelated domain name.

How does our consolidation affect the graph of distribution of senders?


In [16]:
fig, axes = plt.subplots(nrows=2, figsize=(15, 12))

ta = a.sum(0) # total messages per sender
ta = ta.sort_values()
ta[-20:].plot(kind='barh',ax=axes[0], width=1, title='Before consolidation')
tc = c.sum(0)
tc = tc.sort_values()
tc[-20:].plot(kind='barh',ax=axes[1], width=1, title='After consolidation')
plt.show()


The difference is not dramatic, but consolidation makes the head of the distribution heavier. More people sit close to the high end, which suggests a stronger core group rather than a power-law-style distribution that falls off smoothly from one or two dominant senders.
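
One way to put a number on this shift is to compare the share of all messages sent by the top k senders before and after consolidation. The helper below is a hypothetical sketch (`top_share` is not a BigBang function), and the counts are made-up illustrative values, not results from these lists.

```python
def top_share(counts, k):
    """Return the fraction of all messages sent by the k most active senders."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / float(total) if total else 0.0


before = [40, 30, 15, 10, 5]  # made-up per-sender totals; one person appears twice (40 and 15)
after = [55, 30, 10, 5]       # the same messages once those two entries are merged

print(top_share(before, 2))  # 70 of 100 messages
print(top_share(after, 2))   # 85 of 100 messages
```

With the notebook's data, something like `top_share(ta.tolist(), 10)` and `top_share(tc.tolist(), 10)` should give comparable figures for the two Series.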

We could also use sender email addresses as a naive inference for affiliation, especially for mailing lists where corporate/organizational email addresses are typically used.

Pandas lets us group by the results of a keying function, which we can use to group participants sending from email addresses with the same domain.
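
As a small self-contained illustration of grouping by a keying function: when a function is passed to `groupby`, pandas calls it on each index label and groups rows by the returned value. The `domain_of` helper is a hypothetical stand-in for `process.domain_name_from_email` (whose exact behavior is not shown here), and the addresses and totals are made up.

```python
import pandas as pd


def domain_of(sender):
    """Hypothetical keying function: the lower-cased text after the last '@'."""
    return sender.rsplit("@", 1)[-1].lower()


# Made-up message totals indexed by sender address.
totals = pd.Series({"ann@example.org": 12,
                    "bob@example.org": 3,
                    "cat@example.com": 7})

# The function is applied to each index label to decide which group it belongs to.
grouped = totals.groupby(domain_of)
print(grouped.size())  # number of participants per domain
print(grouped.sum())   # number of messages per domain
```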


In [19]:
grouped = tc.groupby(process.domain_name_from_email)
domain_groups = grouped.size()
domain_groups = domain_groups.sort_values(ascending=True)
domain_groups[-20:].plot(kind='barh', width=1, title="Number of participants at domain")


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1071fcbd0>

We can also aggregate the number of messages that come from addresses at each domain.


In [21]:
domain_messages_sum = grouped.sum()
domain_messages_sum = domain_messages_sum.sort_values(ascending=True)
domain_messages_sum[-20:].plot(kind='barh', width=1, title="Number of messages from domain")


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x107839090>

These per-domain message totals tell a different story from both the participant counts and the list of top individual contributors. For example, although many participants use @gmail.com addresses, they do not send as many messages. Microsoft, Google and Mozilla (major browser vendors) each send many messages to the list as a domain, even though no individual from those organizations is among the top senders.

