This notebook shows how BigBang can help you explore a mailing list archive.

First, use this IPython magic to tell the notebook to display matplotlib graphics inline. This is a nice way to display results.

In [1]:
%matplotlib inline

Import the BigBang modules as needed. These should be in your Python environment if you've installed BigBang correctly.

In [2]:
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
#from bigbang.functions import *
from bigbang.archive import Archive

Also, let's import a number of other dependencies we'll use later.

In [4]:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os

pd.options.display.mpl_style = 'default' # pandas has a set of preferred graph formatting options

Now let's load the data for analysis.

In [4]:
urls = ["",

archives = [Archive(url,archive_dir="../archives") for url in urls]

activities = [arx.get_activity() for arx in archives]

/home/sb/projects/bigbang/bigbang/ SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  mdf2['Date'] = mdf['Date'].apply(lambda x: x.toordinal())

This variable sets the number of days in the window used to compute rolling averages.

In [5]:
window = 100
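To make the role of `window` concrete, here is a minimal pure-Python sketch of a trailing rolling mean. It is only an illustration of the idea, not the pandas implementation; `rolling_mean` and `daily_counts` are hypothetical names, and the sample counts are made up.

```python
def rolling_mean(values, window):
    """Trailing mean over `window` consecutive values.

    Positions before the first full window are None, similar in spirit
    to the NaN values a rolling mean leaves at the start of a series.
    """
    result = []
    running_total = 0.0
    for i, value in enumerate(values):
        running_total += value
        if i >= window:
            # drop the value that just fell out of the window
            running_total -= values[i - window]
        if i >= window - 1:
            result.append(running_total / window)
        else:
            result.append(None)
    return result


# made-up daily message counts, smoothed over a 3-day window
daily_counts = [2, 4, 6, 8, 10]
smoothed = rolling_mean(daily_counts, 3)
```

A larger window gives a smoother curve but hides short bursts of activity, and it leaves more undefined points at the start of the series.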

For each of the mailing lists we are looking at, plot the rolling average of number of emails sent per day.

In [6]:
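plt.figure(figsize=(12.5, 7.5))

colors = 'rgbkm' # one color per mailing list

for i, activity in enumerate(activities):

    ta = activity.sum(1) # total messages per day, across all senders
    rmta = pd.rolling_mean(ta,window)
    rmtadna = rmta.dropna() # drop the days before the first full window
    plt.plot_date(rmtadna.index,
                  rmtadna.values,
                  colors[i],
                  label=mailman.get_list_name(urls[i]) + ' activity',xdate=True)

plt.legend()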


/home/sb/anaconda/envs/bigbang/lib/python2.7/site-packages/matplotlib/ UserWarning: findfont: Font family ['monospace'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Now, let's see: who are the authors of the most messages to one particular list?

In [7]:
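a  = activities[0] # for the first mailing list
ta = a.sum(0) # sum along the first axis: total messages per sender
ta = ta.sort_values() # ascending, so the most active senders come last
ta[-20:].plot(kind='barh') # the 20 most active senders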

<matplotlib.axes.AxesSubplot at 0x7f898c861ed0>

This might be useful for seeing the distribution (does the top message sender dominate?) or for identifying key participants to talk to.
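One rough way to put a number on whether the top sender dominates is the share of all messages written by the `k` most active senders. The sketch below is an illustration with made-up counts; `top_share` is a hypothetical helper, not part of BigBang.

```python
def top_share(message_counts, k):
    """Fraction of all messages sent by the k most active senders."""
    ordered = sorted(message_counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / float(total)


# made-up per-sender totals for a small list
example_counts = [50, 30, 10, 10]
share_of_top_sender = top_share(example_counts, 1)
share_of_top_two = top_share(example_counts, 2)
```

With real data, the counts would come from the per-sender totals computed above.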

Many mailing lists will have some duplicate senders: individuals who use multiple email addresses or are recorded as different senders when using the same email address. We want to identify those potential duplicates in order to get a more accurate representation of the distribution of senders.

To begin with, let's do a naive calculation of the similarity of the From strings, based on the Levenshtein distance.

This can take a long time for a large matrix, so we will truncate it for purposes of demonstration.

In [9]:
import Levenshtein
distancedf = process.matricize(a.columns[:100], lambda a,b: Levenshtein.distance(a,b)) # calculate the edit distance between the two From titles
df = distancedf.astype(int) # specify that the values in the matrix are integers
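For readers unfamiliar with edit distance, the following pure-Python sketch shows the standard dynamic-programming calculation and a simple all-pairs matrix built from it. It is a stand-in for illustration only: `edit_distance` and `pairwise_matrix` are hypothetical names, not the `Levenshtein` package or BigBang's `process.matricize`, and the sample names are made up.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, s_char in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, t_char in enumerate(t, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def pairwise_matrix(items, func):
    """Apply func to every ordered pair of items, returning a list of rows."""
    return [[func(first, second) for second in items] for first in items]


names = ["ann", "anne", "bob"]
matrix = pairwise_matrix(names, edit_distance)
```

The all-pairs loop grows quadratically with the number of senders, which is why the cell above truncates the list to the first 100 columns.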

In [10]:
fig = plt.figure(figsize=(18, 18))
plt.pcolor(df) # draw the distance matrix as a color plot
#plt.yticks(np.arange(0.5, len(df.index), 1), df.index) # these lines would show labels, but that gets messy
#plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)

The dark blue diagonal compares each entry to itself (the distance is zero in that case), but a few other dark blue patches suggest there are duplicates even by this most naive measure.

Below is a variant of the visualization for inspecting the particular apparent duplicates.

In [11]:
levdf = process.sorted_lev(a) # creates a slightly more nuanced edit distance matrix
                              # and sorts by rows/columns that have the best candidates
levdf_corner = levdf.iloc[:25,:25] # just take the top 25

TypeError                                 Traceback (most recent call last)
<ipython-input-11-883567802061> in <module>()
----> 1 levdf = process.sorted_lev(a) # creates a slightly more nuanced edit distance matrix
      2                               # and sorts by rows/columns that have the best candidates
      3 levdf_corner = levdf.iloc[:25,:25] # just take the top 25

/home/sb/projects/bigbang/bigbang/process.pyc in sorted_lev(from_dataframe)
     77 def sorted_lev(from_dataframe):
---> 78     distancedf = matricize(from_dataframe.columns, lev_distance_normalized)
     79     # specify that the values in the matrix are integers
     80     df = distancedf.astype(int)

/home/sb/projects/bigbang/bigbang/process.pyc in matricize(series, func)
     50     for index, element in enumerate(series):
     51         for second_index, second_element in enumerate(series):
---> 52             matrix.iloc[index, second_index] = func(element, second_element)
     54     return matrix

/home/sb/projects/bigbang/bigbang/process.pyc in lev_distance_normalized(a, b)
     70     stop_characters = unicode('"<>')
     71     stop_characters_map = dict((ord(char), None) for char in stop_characters)
---> 72     a_normal = a.lower().translate(stop_characters_map)
     73     b_normal = b.lower().translate(stop_characters_map)
     74     return Levenshtein.distance(a_normal, b_normal)

TypeError: expected a character buffer object

In [12]:
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levdf_corner) # color plot of the 25 best duplicate candidates
plt.yticks(np.arange(0.5, len(levdf_corner.index), 1), levdf_corner.index)
plt.xticks(np.arange(0.5, len(levdf_corner.columns), 1), levdf_corner.columns, rotation='vertical')

For this still-naive measure (edit distance on a normalized string), many of the pairs with a distance below 10 appear to be genuine duplicates. Above that threshold, unrelated short email addresses at common domain names start to look similar too, so false matches can take over.
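The traceback earlier in this notebook shows that `lev_distance_normalized` lowercases each From string and strips the characters `"`, `<` and `>` before measuring the distance. Below is a minimal Python 3 sketch of that normalization step alone. `normalize_sender` is a hypothetical name, not a BigBang function, and the addresses are made up.

```python
STOP_CHARACTERS = '"<>'
# str.translate deletes any character whose code point maps to None
STOP_MAP = {ord(char): None for char in STOP_CHARACTERS}


def normalize_sender(from_string):
    """Lowercase a From string and drop quotes and angle brackets."""
    return from_string.lower().translate(STOP_MAP)


quoted = normalize_sender('"Jane Doe" <JANE@example.org>')
unquoted = normalize_sender('Jane Doe <jane@example.org>')
```

Both variants normalize to the same string, so their edit distance drops to zero even though the raw From strings differ.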

In [13]:
consolidates = []

# gather pairs of names which have a distance of less than 10
for col in levdf.columns:
    for index, value in levdf.loc[levdf[col] < 10, col].iteritems():
        if index != col: # the name shouldn't be a pair for itself
            consolidates.append((col, index))
print str(len(consolidates)) + ' candidates for consolidation.'

34 candidates for consolidation.

In [14]:
c = process.consolidate_senders_activity(a, consolidates)
print 'We removed: ' + str(len(a.columns) - len(c.columns)) + ' columns.'

We removed: 10 columns.
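The number of removed columns is smaller than the number of candidate pairs because each pair is found in both directions and several pairs can point to the same person. One way to picture the merge is to group linked names and add up their counts. This is a hypothetical sketch using a small union-find structure with made-up data. It is not the implementation of `process.consolidate_senders_activity`, whose internals are not shown in this notebook.

```python
def consolidate_counts(counts, pairs):
    """Merge the counts of names linked by pairs.

    counts maps sender name -> message count.
    pairs is a list of (name, name) tuples judged to be the same sender.
    """
    parent = {name: name for name in counts}

    def find(name):
        # follow parent links to the group representative
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    merged = {}
    for name, count in counts.items():
        root = find(name)
        merged[root] = merged.get(root, 0) + count
    return merged


example_counts = {
    'jane doe <jane@a.example>': 5,
    'jane doe <jane@b.example>': 3,
    'bob <bob@c.example>': 2,
}
# the same pair appears in both directions, as in the candidate list above
example_pairs = [
    ('jane doe <jane@a.example>', 'jane doe <jane@b.example>'),
    ('jane doe <jane@b.example>', 'jane doe <jane@a.example>'),
]
merged_counts = consolidate_counts(example_counts, example_pairs)
removed = len(example_counts) - len(merged_counts)
```

Here two candidate pairs remove only one entry, which is the same effect seen in the cell output above.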

We can create the same color plot with the consolidated dataframe to see how the distribution has changed.

In [15]:
lev_c = process.sorted_lev(c)
levc_corner = lev_c.iloc[:25,:25]
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levc_corner) # color plot of the 25 best remaining candidates
plt.yticks(np.arange(0.5, len(levc_corner.index), 1), levc_corner.index)
plt.xticks(np.arange(0.5, len(levc_corner.columns), 1), levc_corner.columns, rotation='vertical')

Of course, there are still some duplicates, mostly people who are using the same name, but with a different email address at an unrelated domain name.

How does our consolidation affect the graph of distribution of senders?

In [17]:
fig, axes = plt.subplots(nrows=2, figsize=(15, 12))

ta = a.sum(0).sort_values() # messages per sender, sorted ascending
ta[-20:].plot(kind='barh',ax=axes[0], title='Before consolidation')
tc = c.sum(0).sort_values() # same totals after merging duplicates
tc[-20:].plot(kind='barh',ax=axes[1], title='After consolidation')

Okay, not dramatically different, but the consolidation makes the head heavier. More people sit close to the high end, which suggests a stronger core group rather than a power-law distribution falling off smoothly from one or two people.