This notebook shows how BigBang can help you analyze the senders in a particular mailing list archive.
First, use this IPython magic to tell the notebook to display matplotlib graphics inline. This is a nice way to display results.
In [4]:
%matplotlib inline
Import the BigBang modules as needed. These should be in your Python environment if you've installed BigBang correctly.
In [5]:
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
import importlib
importlib.reload(process)  # reload process in case the module was edited during this session
Out[5]:
Also, let's import a number of other dependencies we'll use later.
In [6]:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
plt.style.use('ggplot')  # roughly replaces the removed pandas option mpl_style = 'default', which applied a ggplot-like plotting style
Now let's load the data for analysis.
In [7]:
from bigbang.archive import load as load_archive
urls = ["http://mm.icann.org/pipermail/wp4/"]
try:
    arch_paths = []
    for url in urls:
        arch_paths.append('../archives/' + url[:-1].replace('://', '_/') + '.csv')
    archives = [load_archive(arch_path).data for arch_path in arch_paths]
except Exception:
    # fall back to an alternate on-disk naming scheme for the archive CSVs
    arch_paths = []
    for url in urls:
        arch_paths.append('../archives/' + url[:-1].replace('//', '/') + '.csv')
    archives = [load_archive(arch_path).data for arch_path in arch_paths]
archives = pd.concat(archives)
activities = Archive(archives).get_activity()
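Before going further, it can help to sanity-check what was loaded. The activity matrix returned by `get_activity()` should have dates as rows and senders as columns, with each cell counting messages sent that day. A minimal check (nothing here is specific to this list):
In [ ]:
# quick sanity check on the loaded activity matrix (rows: dates, columns: senders)
print(activities.shape)
activities.head()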
In [21]:
icann_path = "../archives/http:/mm.icann.org/pipermail"
ncuc_path = "../archives/http:/lists.ncuc.org/pipermail"
'''
paths = [os.path.join(icann_path,"ipc-gnso.csv"),
os.path.join(icann_path,"wp4.csv"),
os.path.join(icann_path,"alac.csv"),
os.path.join(icann_path,"gnso-rds-pdp-wg.csv"),
os.path.join(icann_path,"accountability-cross-community.csv"),
os.path.join(icann_path,"cc-humanrights.csv"),
os.path.join(ncuc_path,"ncuc-discuss.csv")]
'''
paths = [os.path.join(ncuc_path, "ncuc-discuss.csv")]
datas = [load_archive(path).data for path in paths]
arx = Archive(pd.concat(datas))
activities = arx.get_activity()
In [22]:
a = activities
ta = a.sum(0)  # sum over the date axis to get total messages per sender
ta = ta.sort_values()  # Series.sort() was removed in newer pandas; sort_values() returns a sorted copy
In [23]:
levdf = process.sorted_matrix(a) # creates a slightly more nuanced edit distance matrix
# and sorts by rows/columns that have the best candidates
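The resulting `levdf` is a matrix of pairwise edit distances between sender strings, so small values indicate likely duplicates. Assuming the sorting puts the best candidates first, as the comment above says, a quick way to eyeball them:
In [ ]:
# peek at the corner of the sorted distance matrix, where the most
# similar sender pairs should cluster
levdf.iloc[:5, :5]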
In [37]:
import re

# sender strings look like "user@example.org (Full Name)" or "user at example.org (Full Name)"
ren = r"([\w\+\.\-]+(\@| at )[\w+\.\-]*) \((.*)\)"
matches = levdf < 4  # boolean matrix: True where two sender strings are within edit distance 4

def name_match(row):
    match = row[0]  # the sender string for this row
    matched = [item[0] for item in row[1].items() if item[1]]  # the senders it matched against
    name = re.match(ren, matched[0]).groups()[2]  # the human-readable name from the first match
    return (name, match)

m = pd.Series({nm[1]: nm[0]
               for nm in
               [name_match(row) for row in matches.iterrows()]})
m.to_csv("entity_matches.csv")
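Before relying on the CSV, it's worth spot-checking a few of the extracted matches directly:
In [ ]:
# spot-check a few (sender string -> extracted name) matches
m.head()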
In [ ]:
consolidates = []
# gather pairs of names which have a distance of less than 10
for col in levdf.columns:
    for index, value in levdf.loc[levdf[col] < 10, col].items():
        if index != col:  # the name shouldn't be a pair for itself
            consolidates.append((col, index))
print(str(len(consolidates)) + ' candidates for consolidation.')
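It also helps to eyeball a few of the candidate pairs before merging them:
In [ ]:
# show a handful of the sender pairs flagged for consolidation
consolidates[:5]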
In [ ]:
c = process.consolidate_senders_activity(a, consolidates)
print('We removed: ' + str(len(a.columns) - len(c.columns)) + ' columns.')
c
In [ ]:
c[1]["From"]
We can create the same plots with the consolidated dataframe to see how the distribution has changed.
Of course, there are still some duplicates, mostly people who are using the same name but with a different email address at an unrelated domain name. But there are fewer of them!
In [15]:
fig, axes = plt.subplots(nrows=2, figsize=(15, 12))

ta = a.sum(0)  # total messages per sender, before consolidation
ta = ta.sort_values()
ta[-20:].plot(kind='barh', ax=axes[0], width=1, title='Before consolidation')

tc = c.sum(0)  # total messages per sender, after consolidation
tc = tc.sort_values()
tc[-20:].plot(kind='barh', ax=axes[1], width=1, title='After consolidation')
plt.show()
print(tc)
Okay, not dramatically different, but the consolidation does make the head heavier. More people sit close to the high end, suggesting a stronger core group rather than a distribution that tails off smoothly from one or two dominant senders.
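One way to put a rough number on that "heavier head" is the share of all messages sent by the top senders, before and after consolidation. A quick sketch using the sorted totals `ta` and `tc` computed above (the cutoff of 10 is arbitrary):
In [ ]:
# share of all messages sent by the top 10 senders, before and after consolidation
print(ta[-10:].sum() / ta.sum())
print(tc[-10:].sum() / tc.sum())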
We could also use sender email addresses as a naive proxy for affiliation, especially for mailing lists where corporate/organizational email addresses are typically used.
Pandas lets us group by the results of a keying function, which we can use to group participants sending from email addresses with the same domain.
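BigBang's `process.domain_name_from_email` provides such a keying function. If you wanted to roll your own, a minimal sketch might look like the following (the helper below is hypothetical, and assumes sender strings of the form `user@example.org (Full Name)` or `user at example.org (Full Name)`, as in the regular expression used earlier):
In [ ]:
# hypothetical stand-in for process.domain_name_from_email;
# assumes sender strings like "user@example.org (Full Name)" or "user at example.org (Full Name)"
def domain_from_sender(sender):
    parts = re.split(r"@| at ", sender, maxsplit=1)
    if len(parts) < 2:
        return "unknown"
    return parts[1].split()[0].strip("()")

tc.groupby(domain_from_sender).size().sort_values()[-5:]  # five most common domains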
In [13]:
print(tc)
grouped = tc.groupby(process.domain_name_from_email)  # key each sender by the domain of their email address
domain_groups = grouped.size()
domain_groups = domain_groups.sort_values(ascending=True)  # Series.sort() was removed in newer pandas
domain_groups[-20:].plot(kind='barh', width=1, title="Number of participants at domain")
Out[13]:
We can also aggregate the number of messages that come from addresses at each domain.
In [14]:
domain_messages_sum = grouped.sum()
domain_messages_sum = domain_messages_sum.sort_values(ascending=True)
domain_messages_sum[-20:].plot(kind='barh', width=1, title="Number of messages from domain")
Out[14]:
These results are distinct from both the per-domain participant counts and the top individual contributors. For example, while there are many @gmail.com addresses among the participants, they don't send as many messages. Microsoft, Google and Mozilla (major browser vendors) send many messages to the list as domains even though no individual from those organizations is among the top senders.
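As a rough follow-up, dividing messages by participants gives a messages-per-participant figure for each domain, which makes that pattern easier to see. A quick sketch using the two series computed above:
In [ ]:
# rough messages-per-participant ratio by domain (both series share the domain index)
per_participant = (domain_messages_sum / domain_groups).sort_values()
per_participant[-20:].plot(kind='barh', width=1, title="Messages per participant at domain")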
In [ ]: