Import the BigBang modules as needed. These should be in your Python environment if you've installed BigBang correctly.
In :
import importlib

import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive

# imp is deprecated and removed in Python 3.12; importlib.reload is the replacement.
importlib.reload(process)
/home/sb/projects/bigbang-multi/bigbang/config/config.py:8: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dictionary = yaml.load(stream)
Out:<module 'bigbang.process' from '/home/sb/projects/bigbang-multi/bigbang/bigbang/process.py'>
Also, let's import a number of other dependencies we'll use later.
In :
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
Now let's load the data for analysis.
In :
urls = [
    "http://www.ietf.org/mail-archive/text/ietf-privacy/",
    "http://lists.w3.org/Archives/Public/public-privacy/",
]
mlists = [mailman.open_list_archives(url) for url in urls]
activities = [Archive.get_activity(Archive(ml)) for ml in mlists]
This variable is for the range of days used in computing rolling averages.
Now, let's see: who are the authors of the most messages to one particular list?
In :
a = activities[0]  # for the first mailing list
ta = a.sum(0)  # sum along the first axis
ta.sort_values()[-10:].plot(kind='barh', width=1)
Out:<matplotlib.axes._subplots.AxesSubplot at 0x7fb84fa8e780>
This might be useful for seeing the distribution (does the top message sender dominate?) or for identifying key participants to talk to.
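As a minimal sketch of what the cell above computes, assume the activity data is a DataFrame with one row per date and one column per sender. The sender names and counts below are made up for illustration. Summing along axis 0 collapses the dates and leaves one total per sender:

```python
import pandas as pd

# Hypothetical activity table: rows are dates, columns are senders,
# values are the number of messages sent on that date.
activity = pd.DataFrame(
    {
        "alice@example.org": [3, 0, 2],
        "bob@example.com": [1, 1, 0],
        "carol@example.net": [0, 4, 4],
    },
    index=pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
)

totals = activity.sum(0)         # sum over dates: one total per sender
top = totals.sort_values()[-2:]  # the two most active senders, ascending
print(top)
```

Sorting in ascending order and slicing from the end keeps the largest bar at the top of a horizontal bar chart.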
Many mailing lists will have some duplicate senders: individuals who use multiple email addresses or are recorded as different senders when using the same email address. We want to identify those potential duplicates in order to get a more accurate representation of the distribution of senders.
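To illustrate the second kind of duplicate, here is a small sketch using only the standard library. It is not how BigBang handles senders; the headers are invented, and lowercasing the whole address is an assumption that suits most real-world mail systems but is not guaranteed by the email standards.

```python
from email.utils import parseaddr

# Hypothetical From headers that all belong to one person.
from_headers = [
    "Jane Doe <jane@example.org>",
    "jane@example.org (Jane Doe)",
    '"Doe, Jane" <JANE@example.org>',
]


def address_key(from_header):
    """Return the lowercased address part of a From header."""
    _display_name, address = parseaddr(from_header)
    return address.lower()


# Three different From strings collapse to a single address.
unique_addresses = {address_key(header) for header in from_headers}
print(unique_addresses)
```

This only catches senders who share an address. People who post from several addresses need a fuzzier comparison.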
To begin with, let's calculate the similarity of the From strings, based on the Levenshtein distance.
In :
# creates a slightly more nuanced edit distance matrix
# and sorts by rows/columns that have the best candidates
levdf = process.sorted_matrix(a)
levdf_corner = levdf.iloc[:25, :25]  # just take the top 25
In :
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levdf_corner)
plt.yticks(np.arange(0.5, len(levdf_corner.index), 1), levdf_corner.index)
plt.xticks(np.arange(0.5, len(levdf_corner.columns), 1), levdf_corner.columns, rotation='vertical')
plt.colorbar()
plt.show()
This measure is still naive (edit distance on a normalized string). Many of the pairs with a distance below 10 appear to be real duplicates, but above that threshold the matches are increasingly dominated by short email addresses at common domain names, which are close in edit distance without belonging to the same person.
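For readers unfamiliar with the measure, here is a plain-Python sketch of the classic Levenshtein distance. It is not the code behind `process.sorted_matrix`, which is described above only as a slightly more nuanced variant, and the sender strings are invented.

```python
def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


# Same person, two spellings of the name: a small distance.
same_person = levenshtein("jane doe <jane@example.org>",
                          "jane m doe <jane@example.org>")
# Different people with short addresses at a common domain: also small.
different_people = levenshtein("amy@gmail.com", "ann@gmail.com")
print(same_person, different_people)
```

Both pairs come out at a distance of 2, which is why a raw threshold cannot separate real duplicates from short, similar addresses at a shared domain.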
In :
consolidates = []
# gather pairs of names which have a distance of less than 10
for col in levdf.columns:
    for index, value in levdf.loc[levdf[col] < 10, col].items():
        if index != col:  # the name shouldn't be a pair for itself
            consolidates.append((col, index))
print(str(len(consolidates)) + ' candidates for consolidation.')
132 candidates for consolidation.
In :
c = process.consolidate_senders_activity(a, consolidates)
print('We removed: ' + str(len(a.columns) - len(c.columns)) + ' columns.')
We removed: 51 columns.
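One way such a consolidation could work is sketched below. This is an illustration under stated assumptions, not necessarily what `process.consolidate_senders_activity` does: the helper `merge_duplicate_columns` and the toy data are hypothetical. Candidate pairs are treated as links between senders, linked senders are grouped transitively, and each group's columns are summed.

```python
import pandas as pd


def merge_duplicate_columns(activity, pairs):
    """Sum the columns of `activity` that are linked, directly or
    transitively, by the (name, name) tuples in `pairs`."""
    parent = {name: name for name in activity.columns}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for left, right in pairs:
        parent[find(right)] = find(left)

    representative = {name: find(name) for name in activity.columns}
    # Group columns by representative: transpose, group the rows, transpose back.
    return activity.T.groupby(representative).sum().T


# Hypothetical activity table with one sender recorded under two names.
activity = pd.DataFrame({
    "Jane Doe <jane@example.org>": [1, 0],
    "Jane M Doe <jane@example.org>": [0, 2],
    "Bob <bob@example.com>": [3, 3],
})
# The same pair listed in both orders, as a symmetric distance matrix may produce.
pairs = [
    ("Jane Doe <jane@example.org>", "Jane M Doe <jane@example.org>"),
    ("Jane M Doe <jane@example.org>", "Jane Doe <jane@example.org>"),
]
merged = merge_duplicate_columns(activity, pairs)
print(len(activity.columns) - len(merged.columns), "column removed")
```

Because groups are merged transitively, the number of removed columns can be much smaller than the number of candidate pairs.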
We can create the same color plot with the consolidated dataframe to see how the distribution has changed.
In :
lev_c = process.sorted_matrix(c)
levc_corner = lev_c.iloc[:25, :25]
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levc_corner)
plt.yticks(np.arange(0.5, len(levc_corner.index), 1), levc_corner.index)
plt.xticks(np.arange(0.5, len(levc_corner.columns), 1), levc_corner.columns, rotation='vertical')
plt.colorbar()
plt.show()
Of course, there are still some duplicates, mostly people who are using the same name, but with a different email address at an unrelated domain name.
How does our consolidation affect the graph of distribution of senders?
In :
fig, axes = plt.subplots(nrows=2, figsize=(15, 12))
ta = a.sum(0)  # sum along the first axis
ta.sort_values()[-20:].plot(kind='barh', ax=axes[0], width=1, title='Before consolidation')
tc = c.sum(0)
tc.sort_values()[-20:].plot(kind='barh', ax=axes[1], width=1, title='After consolidation')
plt.show()
Okay, not dramatically different, but the consolidation makes the head of the distribution heavier. More people sit close to the high end, which suggests a stronger core group and less of a power-law-like distribution falling off smoothly from one or two people.
We could also use sender email addresses as a naive inference for affiliation, especially for mailing lists where corporate/organizational email addresses are typically used.
Pandas lets us group by the results of a keying function, which we can use to group participants sending from email addresses with the same domain.
In :
grouped = tc.groupby(process.domain_name_from_email)
domain_groups = grouped.size()
domain_groups.sort_values(ascending=True)[-20:].plot(kind='barh', width=1, title="Number of participants at domain")
Out:<matplotlib.axes._subplots.AxesSubplot at 0x7fb84c9a0a20>
We can also aggregate the number of messages that come from addresses at each domain.
In :
domain_messages_sum = grouped.sum()
domain_messages_sum.sort_values(ascending=True)[-20:].plot(kind='barh', width=1, title="Number of messages from domain")
Out:<matplotlib.axes._subplots.AxesSubplot at 0x7fb849fff518>
The message counts per domain differ both from the participant counts per domain and from the ranking of top individual contributors. For example, while there are many @gmail.com addresses among the participants, they don't send as many messages. Microsoft, Google and Mozilla (major browser vendors) send many messages to the list as a domain even though no individual from those organizations is among the top senders.
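The difference between counting participants and summing messages per domain can be seen on a toy Series. In this sketch, `domain_of` is a hypothetical stand-in for `process.domain_name_from_email`, and the addresses and totals are invented. When a function is passed to `groupby`, pandas calls it on each index label and groups by the returned value.

```python
import pandas as pd


def domain_of(address):
    """Hypothetical keying function: the part after the last '@', lowercased."""
    return address.rsplit("@", 1)[-1].lower()


# Hypothetical per-sender message totals, indexed by email address.
totals = pd.Series({
    "a@gmail.com": 2,
    "b@gmail.com": 1,
    "c@gmail.com": 3,
    "d@example.com": 20,
    "e@example.com": 15,
})

grouped = totals.groupby(domain_of)  # the function is applied to each index label
participants = grouped.size()        # number of addresses per domain
messages = grouped.sum()             # number of messages per domain
print(participants)
print(messages)
```

Here the gmail.com group has more participants but far fewer messages, the same pattern described above for the real lists.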