This notebook shows how BigBang can help you explore a mailing list archive.

First, use this IPython magic to tell the notebook to display matplotlib graphics inline, directly below the cell that produces them.


In [1]:
%matplotlib inline

Import the BigBang modules as needed. These should be in your Python environment if you've installed BigBang correctly.


In [2]:
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
#from bigbang.functions import *
from bigbang.archive import Archive


Couldn't import dot_parser, loading of dot files will not be possible.

Also, let's import a number of other dependencies we'll use later.


In [3]:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os

plt.style.use('ggplot') # replaces pd.options.display.mpl_style = 'default', which newer pandas releases no longer provide

Now let's load the data for analysis.


In [4]:
urls = ["ipython-dev",
        "ipython-user"]

archives = [Archive(url,archive_dir="../archives",mbox=True) for url in urls]

activities = [arx.get_activity() for arx in archives]


/home/sb/projects/bigbang/bigbang/mailman.py:105: UserWarning: No mailing list name found at ipython-dev
  warnings.warn("No mailing list name found at %s" % url)
/home/sb/projects/bigbang/bigbang/mailman.py:105: UserWarning: No mailing list name found at ipython-user
  warnings.warn("No mailing list name found at %s" % url)
Opening 143 archive files
Opening 143 archive files
/home/sb/projects/bigbang/bigbang/archive.py:92: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  mdf2['Date'] = mdf['Date'].apply(lambda x: x.toordinal())
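
As a rough illustration of the kind of table the activity data is assumed to be (one row per day, one column per sender, each cell a message count), here is a minimal sketch in plain pandas. It is not BigBang's implementation; the addresses, the messages frame, and the Day label are made up for the example.

```python
import pandas as pd

# Toy stand-in for an archive's message table (hypothetical data).
messages = pd.DataFrame(
    {
        "From": [
            "alice@example.org",
            "bob@example.org",
            "alice@example.org",
            "alice@example.org",
        ],
        "Date": pd.to_datetime(
            [
                "2003-04-16 23:03:00",
                "2003-04-17 05:50:12",
                "2003-04-17 14:32:56",
                "2003-04-17 15:01:30",
            ]
        ),
    }
)

# One row per calendar day, one column per sender, cells hold message counts.
day = messages["Date"].dt.normalize().rename("Day")
activity = pd.crosstab(day, messages["From"])

# Summing across columns (axis=1) gives the total number of emails per day.
emails_per_day = activity.sum(axis=1)
print(emails_per_day)
```

Summing along axis 1 collapses the per-sender columns into a single daily total, which is the quantity a rolling average can then smooth.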

In [5]:
archives[0].data


Out[5]:
From Subject Date In-Reply-To References Body
Message-ID
<3E9DE124.8080309@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] Mailing lists indexed at gmane 2003-04-16 23:03:00 None None Hi all,\n\nafter a suggestion by Jacek Generow...
<3E9DE124.8080309@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] Mailing lists indexed at gmane 2003-04-16 23:03:00 None None Hi all,\n\nafter a suggestion by Jacek Generow...
<3E9E4094.7030802@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] Re: Refactoring of bdist_wininst... 2003-04-17 05:50:12 <003d01c28a9a$3dcb8560$e301340a@cyberhigh.fcoe... <003d01c28a9a$3dcb8560$e301340a@cyberhigh.fcoe... Hi Cory,\n\n> Done. install command will now ...
<3E9E4094.7030802@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] Re: Refactoring of bdist_wininst... 2003-04-17 05:50:12 <003d01c28a9a$3dcb8560$e301340a@cyberhigh.fcoe... <003d01c28a9a$3dcb8560$e301340a@cyberhigh.fcoe... Hi Cory,\n\n> Done. install command will now ...
<000c01c304ee$3cb79e60$e901340a@cyberhigh.fcoe.k12.ca.us> cdodt@fcoe.k12.ca.us (Cory Dodt) [IPython-dev] RE: Refactoring of bdist_wininst... 2003-04-17 14:32:56 <3E9E4094.7030802@colorado.edu> None Distutils 1.0.3 is not included with Python 2....
<000c01c304ee$3cb79e60$e901340a@cyberhigh.fcoe.k12.ca.us> cdodt at fcoe.k12.ca.us (Cory Dodt) [IPython-dev] RE: Refactoring of bdist_wininst... 2003-04-17 14:32:56 <3E9E4094.7030802@colorado.edu> None Distutils 1.0.3 is not included with Python 2....
<3E9EC1CA.3060800@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] RE: Refactoring of bdist_wininst... 2003-04-17 15:01:30 <000c01c304ee$3cb79e60$e901340a@cyberhigh.fcoe... <000c01c304ee$3cb79e60$e901340a@cyberhigh.fcoe... Cory Dodt wrote:\n> Distutils 1.0.3 is not inc...
<3E9EC1CA.3060800@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] RE: Refactoring of bdist_wininst... 2003-04-17 15:01:30 <000c01c304ee$3cb79e60$e901340a@cyberhigh.fcoe... <000c01c304ee$3cb79e60$e901340a@cyberhigh.fcoe... Cory Dodt wrote:\n> Distutils 1.0.3 is not inc...
<3E9EF5E3.8080100@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] [Fwd: [ANN] A new IPython is out... 2003-04-17 18:43:47 None None Hi all,\n\nI've just put out a new pre-release...
<3E9EF5E3.8080100@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] [Fwd: [ANN] A new IPython is out... 2003-04-17 18:43:47 None None Hi all,\n\nI've just put out a new pre-release...
<3E9EFC95.7040309@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] ToDo for 0.4.0 2003-04-17 19:12:21 None None Hi all,\n\nI'd like to put out a list of thing...
<3E9EFC95.7040309@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] ToDo for 0.4.0 2003-04-17 19:12:21 None None Hi all,\n\nI'd like to put out a list of thing...
<3E9F3B79.7070005@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] New bug tracker for IPython 2003-04-17 23:40:41 None None Hi all,\n\nI just wanted to let you know that,...
<3E9F3B79.7070005@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] New bug tracker for IPython 2003-04-17 23:40:41 None None Hi all,\n\nI just wanted to let you know that,...
<3E9F3D9B.8040807@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] Re: iPython on Windows 2003-04-17 23:49:47 <GCEDKONBLEFPPADDJCOECEOIIPAA.whisper@oz.net> <GCEDKONBLEFPPADDJCOECEOIIPAA.whisper@oz.net> Hi David,\n\nmy apologies for the long delay i...
<3E9F3D9B.8040807@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] Re: iPython on Windows 2003-04-17 23:49:47 <GCEDKONBLEFPPADDJCOECEOIIPAA.whisper@oz.net> <GCEDKONBLEFPPADDJCOECEOIIPAA.whisper@oz.net> Hi David,\n\nmy apologies for the long delay i...
<200304291817.05898.Kasper.Souren@ircam.fr> Kasper.Souren@ircam.fr (Kasper Souren) [IPython-dev] possible feature request: auto-run 2003-04-29 18:17:05 None None Hi!\n\nI just had a little idea for a new IPyt...
<200304291817.05898.Kasper.Souren@ircam.fr> Kasper.Souren at ircam.fr (Kasper Souren) [IPython-dev] possible feature request: auto-run 2003-04-29 18:17:05 None None Hi!\n\nI just had a little idea for a new IPyt...
<3EAEF194.5030709@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] possible feature request: auto-run 2003-04-29 21:41:40 <200304291817.05898.Kasper.Souren@ircam.fr> <200304291817.05898.Kasper.Souren@ircam.fr> Kasper Souren wrote:\n> Hi!\n> \n> I just had ...
<3EAEF194.5030709@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] possible feature request: auto-run 2003-04-29 21:41:40 <200304291817.05898.Kasper.Souren@ircam.fr> <200304291817.05898.Kasper.Souren@ircam.fr> Kasper Souren wrote:\n> Hi!\n> \n> I just had ...
<200304292248.10994.Kasper.Souren@ircam.fr> Kasper.Souren at ircam.fr (Kasper Souren) [IPython-dev] possible feature request: auto-run 2003-04-29 22:48:10 <3EAEF194.5030709@colorado.edu> <200304291817.05898.Kasper.Souren@ircam.fr> <3... > It's rather complicated to get it right, and...
<200304292248.10994.Kasper.Souren@ircam.fr> Kasper.Souren@ircam.fr (Kasper Souren) [IPython-dev] possible feature request: auto-run 2003-04-29 22:48:10 <3EAEF194.5030709@colorado.edu> <200304291817.05898.Kasper.Souren@ircam.fr> <3... > It's rather complicated to get it right, and...
<CB0365D517B7D611B5E100508B9498B6022A9B50@erlh904a.med.siemens.de> christopher.drexler@siemens.com (Drexler Chris... [IPython-dev] RE: [Fwd: [IPython-user] re: Fwd... 2003-05-12 07:28:55 None None Dear List,\n\nI'm working with IPython since a...
<CB0365D517B7D611B5E100508B9498B6022A9B50@erlh904a.med.siemens.de> christopher.drexler at siemens.com (Drexler Ch... [IPython-dev] RE: [Fwd: [IPython-user] re: Fwd... 2003-05-12 07:28:55 None None Dear List,\n\nI'm working with IPython since a...
<200305121234.h4CCYmXo027167@wren.cs.unc.edu> gb@cs.unc.edu (Gary Bishop) [IPython-dev] RE: [Fwd: [IPython-user] re: Fwd... 2003-05-12 08:34:48 None None Thanks Chris,\n\nWith that hint and some googl...
<200305121234.h4CCYmXo027167@wren.cs.unc.edu> gb at cs.unc.edu (Gary Bishop) [IPython-dev] RE: [Fwd: [IPython-user] re: Fwd... 2003-05-12 08:34:48 None None Thanks Chris,\n\nWith that hint and some googl...
<3EC143D7.8050907@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-dev] Re: IPython Crash Report 2003-05-13 19:13:27 <200305131849.h4DInjXo018909@wren.cs.unc.edu> <200305131849.h4DInjXo018909@wren.cs.unc.edu> Hi Gary,\n\n> The idea is simple. I assume tha...
<3EC143D7.8050907@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-dev] Re: IPython Crash Report 2003-05-13 19:13:27 <200305131849.h4DInjXo018909@wren.cs.unc.edu> <200305131849.h4DInjXo018909@wren.cs.unc.edu> Hi Gary,\n\n> The idea is simple. I assume tha...
<200305171149.h4HBneXo024735@wren.cs.unc.edu> gb at cs.unc.edu (Gary Bishop) [IPython-dev] re: 0.4.0 ready for Monday 2003-05-17 07:49:39 None None It still says it is 0.2.15.pre5, I guess that ...
<200305171149.h4HBneXo024735@wren.cs.unc.edu> gb@cs.unc.edu (Gary Bishop) [IPython-dev] re: 0.4.0 ready for Monday 2003-05-17 07:49:39 None None It still says it is 0.2.15.pre5, I guess that ...
... ... ... ... ... ... ...
<CAHAreOqzMqg+LqQ7EY+t2efWJWB1dSCK8c9QR=VHF_5nWoB+Cg@mail.gmail.com> fperez.net@gmail.... (Fernando Perez) [IPython-dev] ITorch on IPython 3 -- problems? 2015-02-17 07:03:53 <CABbuyuVPn0gVVmupbxRpfR2psMo7sVHAs=fU4vZRktKD... <CABbuyuVPn0gVVmupbxRpfR2psMo7sVHAs=fU4vZRktKD... On Mon, Feb 16, 2015 at 5:19 PM, Andrew Payne ...
<CAKZ-Uq28owHAvZi-Bt_VQmCnzDh2iMSO04d1yNT=m5vP25wg7g@mail.gmail.com> mwaskom@stanford.... (Michael Waskom) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 18:57:31 <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Hi Min,\n\nIt looks like the scaling of figure...
<940C0CF6-2714-439E-B57A-D9D00946D9BC@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 19:09:35 <CAKZ-Uq28owHAvZi-Bt_VQmCnzDh2iMSO04d1yNT=m5vP... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... No, it does not seem to be on purpose. \n\nIt ...
<CAH4pYpRccFSKyGgwz_mTEqWQn_+yLGBe-G6MO6L0ypvv8TbGow@mail.gmail.com> ellisonbg@gmail.... (Brian Granger) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 19:10:38 <CAKZ-Uq28owHAvZi-Bt_VQmCnzDh2iMSO04d1yNT=m5vP... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Thanks for the report, we have made some chang...
<CAKZ-Uq1xmAX-+bur0dhVN2gS1TKNzxAvGF-YPLFtjNr23p2_-A@mail.gmail.com> mwaskom@stanford.... (Michael Waskom) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 20:43:38 <CAH4pYpRccFSKyGgwz_mTEqWQn_+yLGBe-G6MO6L0ypvv... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Awesome, thanks guys!\n\nOne other thing I hav...
<54E3B1E0.3030902@gmx.de> max_linke@gmx... (Max Linke) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 21:25:52 <F8353BED-03C9-4BB3-8BFF-62F94E916597@gmail.com> <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Thanks for the help I found a way now that wor...
<54E3B613.2020301@gmx.de> max_linke@gmx... (Max Linke) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 21:43:47 <F8353BED-03C9-4BB3-8BFF-62F94E916597@gmail.com> <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Isn't there a guarantee about which functions ...
<CAOvn4qjBwNkB0xvKTPO9cZ19DiDxSdyBN3LA_Wk3xvZyrvQkAw@mail.gmail.com> takowl@gmail.... (Thomas Kluyver) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 21:53:22 <54E3B613.2020301@gmx.de> <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... On 17 February 2015 at 13:43, Max Linke <max_l...
<20150217224440.GL13270@janus.cbl.uh.edu> zaki.mughal@gmail.... (Zakariyya Mughal) [IPython-dev] IPython messaging spec for warni... 2015-02-17 22:44:40 None None Hello,\n\nI'm working on the IPerl language ke...
<CAKZ-Uq1nnDmhe1+y_o1=PyaouCbWekUjXTfS-ufApaVKp1gEOg@mail.gmail.com> mwaskom@stanford.... (Michael Waskom) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 22:45:41 <CAOvn4qjBwNkB0xvKTPO9cZ19DiDxSdyBN3LA_Wk3xvZy... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Ooop one more thing I've noticed. ctrl-j/k no ...
<AE9FFF31-DCAF-4508-8744-960B26E8D776@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-dev] IPython messaging spec for warni... 2015-02-17 22:55:48 <20150217224440.GL13270@janus.cbl.uh.edu> <20150217224440.GL13270@janus.cbl.uh.edu> Le 17 f?vr. 2015 ? 14:44, Zakariyya Mughal <za...
<F3531DC1-F0D4-48D9-A678-3F3AFB635BEA@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-17 22:56:34 <CAKZ-Uq1nnDmhe1+y_o1=PyaouCbWekUjXTfS-ufApaVK... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Le 17 f?vr. 2015 ? 14:45, Michael Waskom <mwas...
<CAOvn4qj=hm17ZqAEzh3SgB4H9XAJtdmyYu3pz9h5PTdTdJB6bA@mail.gmail.com> takowl@gmail.... (Thomas Kluyver) [IPython-dev] IPython messaging spec for warni... 2015-02-17 23:09:13 <AE9FFF31-DCAF-4508-8744-960B26E8D776@gmail.com> <20150217224440.GL13270@janus.cbl.uh.edu>\n\t<... On 17 February 2015 at 14:55, Matthias Bussonn...
<CAHNn8BVc3NgqTj5LevyOQ3dNO=rCYhrhGqKBQ4v2VR+3ktKQNA@mail.gmail.com> benjaminrk@gmail.... (MinRK) [IPython-dev] ITorch on IPython 3 -- problems? 2015-02-18 07:20:10 <CAHAreOqzMqg+LqQ7EY+t2efWJWB1dSCK8c9QR=VHF_5n... <CABbuyuVPn0gVVmupbxRpfR2psMo7sVHAs=fU4vZRktKD... I didn?t play with it too much, but I submitte...
<CAF-LYKKQfRBpNSLP_3XTzVyaVK3SqbBELhZWDFXFRYGw4SjXaA@mail.gmail.com> andrew.gibiansky@gmail.... (Andrew Gibiansky) [IPython-dev] IPython messaging spec for warni... 2015-02-18 07:27:51 <CAOvn4qj=hm17ZqAEzh3SgB4H9XAJtdmyYu3pz9h5PTdT... <20150217224440.GL13270@janus.cbl.uh.edu>\n\t<... IHaskell currently just publishes a display_da...
<CAHNn8BUMO+9BpPCVDCERcFv71hAD0arc8mS4Eiit1BWyNe5EMg@mail.gmail.com> benjaminrk@gmail.... (MinRK) [IPython-dev] IPython messaging spec for warni... 2015-02-18 07:44:44 <CAF-LYKKQfRBpNSLP_3XTzVyaVK3SqbBELhZWDFXFRYGw... <20150217224440.GL13270@janus.cbl.uh.edu>\n\t<... On Tue, Feb 17, 2015 at 11:27 PM, Andrew Gibia...
<CAHAreOqg+mnt921x3++-Q9QvR+ibfuJwgGonKoAAohrzCXC6Sw@mail.gmail.com> fperez.net@gmail.... (Fernando Perez) [IPython-dev] Google Summer of Code and NumFOCUS 2015-02-19 02:45:23 <20150219024934.GC11191@pupunha> <20150219024934.GC11191@pupunha> Hi Rainiere,\n\nright now we don't have the ne...
<20150219024934.GC11191@pupunha> ra092767@ime.unicamp... (Raniere Silva) [IPython-dev] Google Summer of Code and NumFOCUS 2015-02-19 02:49:35 None None Hi,\n\nNumFOCUS has promotes and supports the ...
<CAKZ-Uq1U78wHkCq_q=EjEk5nV1Jh-5kBXVArw8A8LrAhpFckXg@mail.gmail.com> mwaskom@stanford.... (Michael Waskom) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-19 16:35:59 <F3531DC1-F0D4-48D9-A678-3F3AFB635BEA@gmail.com> <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Hi, sorry to be a nuisance, but I'm worried th...
<CA+-1RQS5YOpXYgF2XVdf+53S1QFe=U0x3EWBhJcLGGbL59WN_w@mail.gmail.com> cyrille.rossant@gmail.... (Cyrille Rossant) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-19 16:57:24 <CAKZ-Uq1U78wHkCq_q=EjEk5nV1Jh-5kBXVArw8A8LrAh... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... 2015-02-19 17:35 GMT+01:00 Michael Waskom <mwa...
<CAOvn4qiMpyWYQA8vJ8513H_Yi=qnE_h+pi4tH5omTKL30rGdqQ@mail.gmail.com> takowl@gmail.... (Thomas Kluyver) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-19 17:54:44 <CAKZ-Uq1U78wHkCq_q=EjEk5nV1Jh-5kBXVArw8A8LrAh... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Sorry, I did see that part of the message befo...
<CAKOFcwqJkrsW86kWei_TxymryFMBif4o6Jm3aFWaCW6yF=harQ@mail.gmail.com> john@omernik.... (John Omernik) [IPython-dev] Hide input 2015-02-19 19:27:30 None None So I see this has been discussed before (Git H...
<A86636A8-26C8-49B2-9812-1BDD01134404@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-19 19:38:45 <CAOvn4qiMpyWYQA8vJ8513H_Yi=qnE_h+pi4tH5omTKL3... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Le 19 f?vr. 2015 ? 09:54, Thomas Kluyver <tako...
<CAHNn8BW40OHyXVV+pPYCdzR+o77Zg1WwACpok6O55VmSEWc8pw@mail.gmail.com> benjaminrk@gmail.... (MinRK) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-19 20:13:21 <A86636A8-26C8-49B2-9812-1BDD01134404@gmail.com> <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... On Thu, Feb 19, 2015 at 8:38 PM, Matthias Buss...
<CAHNn8BVePns+eF57RpAXeR5iupKbnRN2UxZ1qVb2zUiiV3-k-A@mail.gmail.com> benjaminrk@gmail.... (MinRK) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-19 20:17:55 <CAHNn8BW40OHyXVV+pPYCdzR+o77Zg1WwACpok6O55VmS... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... On Thu, Feb 19, 2015 at 9:13 PM, MinRK <benjam...
<6633F332-453F-4C19-8C5F-E545436A06AE@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-19 22:23:05 <CAHNn8BW40OHyXVV+pPYCdzR+o77Zg1WwACpok6O55VmS... <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... Le 19 f?vr. 2015 ? 12:13, MinRK <benjaminrk@gm...
<ADCC9CFA-919B-4ED2-8F66-630A11396E0F@gmail.com> benjaminrk@gmail.... (Min RK) [IPython-dev] [ANN] IPython 3.0.0rc1 2015-02-20 13:56:45 <6633F332-453F-4C19-8C5F-E545436A06AE@gmail.com> <CAHNn8BVP-0RpaZBF0zgSnex1Ase8pOkhkK-gPn+k1a6E... > On Feb 19, 2015, at 23:23, Matthias Bussonni...
<CADT3MEDawwwV-dyDh3C7YgyD8HTviisAhnZBPRuC3iJzu2J_6g@mail.gmail.com> pmhobson@gmail.... (Paul Hobson) [IPython-dev] Hide input 2015-02-20 17:42:04 <CAKOFcwqJkrsW86kWei_TxymryFMBif4o6Jm3aFWaCW6y... <CAKOFcwqJkrsW86kWei_TxymryFMBif4o6Jm3aFWaCW6y... I can't speak for the devs here, but I recall ...
<20150220203309.GN12853@pupunha> ra092767@ime.unicamp... (Raniere Silva) [IPython-dev] =?utf-8?q?ANN=3A_SciPy_Latin_Am=... 2015-02-20 20:33:09 None None *Call for Proposals*\n\n*SciPy Latin Am?rica 2...
<54E7A089.1090007@gmx.de> max_linke@gmx... (Max Linke) [IPython-dev] Hide input 2015-02-20 21:00:57 <CADT3MEDawwwV-dyDh3C7YgyD8HTviisAhnZBPRuC3iJz... <CAKOFcwqJkrsW86kWei_TxymryFMBif4o6Jm3aFWaCW6y... You can use the codefolding extension to hide ...

15220 rows × 6 columns

This variable sets the number of days in the window used to compute rolling averages.


In [6]:
window = 100
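
To make the effect of the window concrete, here is a small sketch on made-up daily counts. It uses a window of 3 instead of 100 so the result is easy to check by hand; the series and its values are illustrative only.

```python
import pandas as pd

# Five days of made-up message counts.
daily_counts = pd.Series(
    [2, 4, 6, 8, 10],
    index=pd.date_range("2003-04-16", periods=5, freq="D"),
)

window = 3  # small window so the arithmetic is easy to follow

# Each value is the mean of the current day and the two days before it.
rolling = daily_counts.rolling(window).mean()

# The first window - 1 entries have no full window and come out as NaN.
trimmed = rolling.dropna()
print(trimmed)
```

With a window of 100, the first 99 days of each list's history are dropped in the same way, and short bursts of activity are smoothed out considerably.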

For each of the mailing lists we are looking at, plot the rolling average of the number of emails sent per day.


In [7]:
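plt.figure(figsize=(12.5, 7.5))

colors = 'rgbkm'  # one matplotlib color code per mailing list

for i, activity in enumerate(activities):

    ta = activity.sum(axis=1)         # sum across columns: total emails per day
    rmta = ta.rolling(window).mean()  # replaces pd.rolling_mean(ta, window), removed from newer pandas
    rmtadna = rmta.dropna()           # drop the leading days that lack a full window
    plt.plot_date(rmtadna.index,
                  rmtadna.values,
                  colors[i],
                  label=mailman.get_list_name(urls[i]) + ' activity',
                  xdate=True)

plt.legend()  # draw the legend once, after every list has been plotted

plt.savefig("activites-marked.png")
plt.show()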


/home/sb/.virtualenvs/bigbang3/local/lib/python2.7/site-packages/matplotlib/font_manager.py:1282: UserWarning: findfont: Font family [u'monospace'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

In [8]:
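# arx only exists here because Python 2 leaks list-comprehension variables;
# index the list explicitly to get the ipython-user archive
archives[1].data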


Out[8]:
From Subject Date In-Reply-To References Body
Message-ID
<3271DBB88437ED41A0AB239E6C2554A401117873@ussunm001.palmsource.com> Robin.Siebler at palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 20:27:08 None None I installed IPython-0.2.15pre3, played with it...
<3271DBB88437ED41A0AB239E6C2554A401117873@ussunm001.palmsource.com> Robin.Siebler@palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 20:27:08 None None I installed IPython-0.2.15pre3, played with it...
<3E8364F0.2000107@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 20:54:08 <3271DBB88437ED41A0AB239E6C2554A401117873@ussu... <3271DBB88437ED41A0AB239E6C2554A401117873@ussu... Robin Siebler wrote:\n> I installed IPython-0....
<3E8364F0.2000107@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 20:54:08 <3271DBB88437ED41A0AB239E6C2554A401117873@ussu... <3271DBB88437ED41A0AB239E6C2554A401117873@ussu... Robin Siebler wrote:\n> I installed IPython-0....
<1048798697.25990.6.camel@localhost.localdomain> jives at gorge.net (Jason Ives) [IPython-user] IPython under Jython? 2003-03-27 20:58:15 None None Hi,\n\nI'm wondering if anyone's had success r...
<1048798697.25990.6.camel@localhost.localdomain> jives@gorge.net (Jason Ives) [IPython-user] IPython under Jython? 2003-03-27 20:58:15 None None Hi,\n\nI'm wondering if anyone's had success r...
<3E836641.8000008@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-user] IPython under Jython? 2003-03-27 20:59:45 <1048798697.25990.6.camel@localhost.localdomain> <1048798697.25990.6.camel@localhost.localdomain> Jason Ives wrote:\n\n> I'm wondering if anyone...
<3E836641.8000008@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-user] IPython under Jython? 2003-03-27 20:59:45 <1048798697.25990.6.camel@localhost.localdomain> <1048798697.25990.6.camel@localhost.localdomain> Jason Ives wrote:\n\n> I'm wondering if anyone...
<3271DBB88437ED41A0AB239E6C2554A401117875@ussunm001.palmsource.com> Robin.Siebler@palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 21:13:13 None None I searched but couldn't find any such file.\n\...
<3271DBB88437ED41A0AB239E6C2554A401117875@ussunm001.palmsource.com> Robin.Siebler at palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 21:13:13 None None I searched but couldn't find any such file.\n\...
<3E836B5E.4020702@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 21:21:34 <3271DBB88437ED41A0AB239E6C2554A401117875@ussu... <3271DBB88437ED41A0AB239E6C2554A401117875@ussu... Robin Siebler wrote:\n> I searched but couldn'...
<3E836B5E.4020702@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 21:21:34 <3271DBB88437ED41A0AB239E6C2554A401117875@ussu... <3271DBB88437ED41A0AB239E6C2554A401117875@ussu... Robin Siebler wrote:\n> I searched but couldn'...
<1048802417.25990.16.camel@localhost.localdomain> jives at gorge.net (Jason Ives) [IPython-user] IPython under Jython? 2003-03-27 22:00:14 <3E836641.8000008@colorado.edu> <1048798697.25990.6.camel@localhost.localdomai... Hi,\n\nFernando Perez wrote:\n\n Insofar as...
<1048802417.25990.16.camel@localhost.localdomain> jives@gorge.net (Jason Ives) [IPython-user] IPython under Jython? 2003-03-27 22:00:14 <3E836641.8000008@colorado.edu> <1048798697.25990.6.camel@localhost.localdomai... Hi,\n\nFernando Perez wrote:\n\n Insofar as...
<3271DBB88437ED41A0AB239E6C2554A401117878@ussunm001.palmsource.com> Robin.Siebler@palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 22:01:57 None None I just searched for 'ipyt*'. I didn't get a hi...
<3271DBB88437ED41A0AB239E6C2554A401117878@ussunm001.palmsource.com> Robin.Siebler at palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 22:01:57 None None I just searched for 'ipyt*'. I didn't get a hi...
<3E8376BF.5060203@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 22:10:07 <3271DBB88437ED41A0AB239E6C2554A401117878@ussu... <3271DBB88437ED41A0AB239E6C2554A401117878@ussu... Robin Siebler wrote:\n> I just searched for 'i...
<3E8376BF.5060203@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 22:10:07 <3271DBB88437ED41A0AB239E6C2554A401117878@ussu... <3271DBB88437ED41A0AB239E6C2554A401117878@ussu... Robin Siebler wrote:\n> I just searched for 'i...
<3E837EE7.5050302@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 22:44:55 <3271DBB88437ED41A0AB239E6C2554A43F5C1F@ussunm... <3271DBB88437ED41A0AB239E6C2554A43F5C1F@ussunm... Ah! I see now, the problem is with the curses...
<3E837EE7.5050302@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 22:44:55 <3271DBB88437ED41A0AB239E6C2554A43F5C1F@ussunm... <3271DBB88437ED41A0AB239E6C2554A43F5C1F@ussunm... Ah! I see now, the problem is with the curses...
<3271DBB88437ED41A0AB239E6C2554A40111787D@ussunm001.palmsource.com> Robin.Siebler at palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 22:49:25 None None There might be one, but I don't have it instal...
<3271DBB88437ED41A0AB239E6C2554A40111787D@ussunm001.palmsource.com> Robin.Siebler@palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 22:49:25 None None There might be one, but I don't have it instal...
<3E83832A.4020506@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 23:03:06 <3271DBB88437ED41A0AB239E6C2554A40111787D@ussu... <3271DBB88437ED41A0AB239E6C2554A40111787D@ussu... Ok, there's something seriously strange here g...
<3E83832A.4020506@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 23:03:06 <3271DBB88437ED41A0AB239E6C2554A40111787D@ussu... <3271DBB88437ED41A0AB239E6C2554A40111787D@ussu... Ok, there's something seriously strange here g...
<3271DBB88437ED41A0AB239E6C2554A401117882@ussunm001.palmsource.com> Robin.Siebler at palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 23:14:45 None None >My python2.2 installation doesn't even have t...
<3271DBB88437ED41A0AB239E6C2554A401117882@ussunm001.palmsource.com> Robin.Siebler@palmsource.com (Robin Siebler) [IPython-user] Crash 2003-03-27 23:14:45 None None >My python2.2 installation doesn't even have t...
<3E838774.9020006@colorado.edu> fperez at colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 23:21:24 <3271DBB88437ED41A0AB239E6C2554A401117882@ussu... <3271DBB88437ED41A0AB239E6C2554A401117882@ussu... Robin Siebler wrote:\n>>My python2.2 installat...
<3E838774.9020006@colorado.edu> fperez@colorado.edu (Fernando Perez) [IPython-user] Crash 2003-03-27 23:21:24 <3271DBB88437ED41A0AB239E6C2554A401117882@ussu... <3271DBB88437ED41A0AB239E6C2554A401117882@ussu... Robin Siebler wrote:\n>>My python2.2 installat...
<20030329055915.GL21370@i.cantcode.com> jack@xiph.org (Jack Moffitt) [IPython-user] ipython -p numeric problem 2003-03-29 05:59:15 None None I'm sure I'm missing something, but it's not o...
<20030329055915.GL21370@i.cantcode.com> jack at xiph.org (Jack Moffitt) [IPython-user] ipython -p numeric problem 2003-03-29 05:59:15 None None I'm sure I'm missing something, but it's not o...
... ... ... ... ... ... ...
<CACpqBg2asgMihubuDcT3hd49L2NdKBGGNSKTN6c+BZoAg-CJnA@mail.gmail.com> jakevdp@cs.washington.... (Jacob Vanderplas) [IPython-User] [IPython-dev] Standard cells at... 2014-12-01 15:04:10 <547C6363.9030104@tenner.nl> <CAAipwu9_23hXRz_hKVSiYHyuM67hsgP6X1bvo1PfZY+6... Hi,\nAdrian Price-Whelan has a macrocell exten...
<CAB2ViTbivOpUg1_aimBoj-S+90JmaNkpj+KEm84=ufMeAVfwQQ@mail.gmail.com> phillip.m.feldman@gmail.... (Phillip Feldman) [IPython-User] in ipython notebook,\n possible... 2014-12-01 17:53:47 <547C0D31.5010808@gmail.com> <CAB2ViTYFjc9kMN86MnrjfWBiZd798zDneiCTVC9yzEXH... Hello Zoltan,\n\nMy html/css (see below) now u...
<CABbuyuW6rJaUBUtKxB0n_RCS5o2jqRNStYtyB0BE5KE63-BZgA@mail.gmail.com> andy@payne.... (Andrew Payne) [IPython-User] in ipython notebook,\n possible... 2014-12-01 20:33:16 <CAB2ViTbivOpUg1_aimBoj-S+90JmaNkpj+KEm84=ufMe... <CAB2ViTYFjc9kMN86MnrjfWBiZd798zDneiCTVC9yzEXH... > My html/css (see below) now uses "border: no...
<CAB2ViTY4wL+Sp3O0b2P4muajDYoucbh5iHoAyUXyvB1qFSFxQw@mail.gmail.com> phillip.m.feldman@gmail.... (Phillip Feldman) [IPython-User] in ipython notebook,\n possible... 2014-12-01 20:46:57 <CABbuyuW6rJaUBUtKxB0n_RCS5o2jqRNStYtyB0BE5KE6... <CAB2ViTYFjc9kMN86MnrjfWBiZd798zDneiCTVC9yzEXH... That's great. Thanks!\n\nOn Mon, Dec 1, 2014 ...
<547CF05A.8030708@relativita.com> emanuele@relativita.... (Emanuele Olivetti) [IPython-User] Qt issue in IPython 0.12: worka... 2014-12-01 22:48:58 <547A50D5.8090008@relativita.com> <547898E1.9000709@relativita.com> <547A50D5.80... On 11/30/2014 12:03 AM, Emanuele Olivetti wrot...
<CAGz2ECZu14LCsdEp_eOrn+FU9b5D0huo8_7Bb8kmHgcL2LPpVg@mail.gmail.com> jonnojohnson@gmail.... (Jonno) [IPython-User] Custom css through nbviewer 2014-12-03 22:07:04 None None I'm trying to figure out if it's yet possible ...
<CABRXM4nKUggpZZoD3JUoJTveibCYPPFp+4tj87w+6QLOqK2CPQ@mail.gmail.com> cappy2112@gmail.... (Tony Cappellini) [IPython-User] Peculiar problem with requests ... 2014-12-06 05:37:10 None None On OSX 10.9.5, Python 2.7.8 (32 Bit), iPython ...
<BA6A2F24-8D14-49F9-A74D-4E0D30862C0E@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-User] Peculiar problem with requests ... 2014-12-06 10:57:57 <CABRXM4nKUggpZZoD3JUoJTveibCYPPFp+4tj87w+6QLO... <CABRXM4nKUggpZZoD3JUoJTveibCYPPFp+4tj87w+6QLO... Hi, \n\nHere is what I did:\n\nTry in console ...
<CABRXM4=VmD5t0QmYqRBsMf4UvGawrAFUEXWMdEUFoh5n=qBh6Q@mail.gmail.com> cappy2112@gmail.... (Tony Cappellini) [IPython-User] Peculiar problem with requests ... 2014-12-07 02:47:51 None None Message: 2\nDate: Sat, 6 Dec 2014 11:57:57 +01...
<FAA513CF-1660-40CD-B1B9-5FEA4048E080@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-User] Peculiar problem with requests ... 2014-12-07 09:48:32 <CABRXM4=VmD5t0QmYqRBsMf4UvGawrAFUEXWMdEUFoh5n... <CABRXM4=VmD5t0QmYqRBsMf4UvGawrAFUEXWMdEUFoh5n... Le 7 d?c. 2014 ? 03:47, Tony Cappellini <cappy...
<EA039EC4-17F1-4D11-9DE4-01841B73521E@gmail.com> jzuhone@gmail.... (John ZuHone) [IPython-User] Trouble with configuration file... 2014-12-11 04:18:56 None None Hello, \n\nI used \n\nipython profile create \...
<2C08D6F1-8A0C-4818-8840-7455675208C1@gmail.com> jzuhone@gmail.... (John ZuHone) [IPython-User] Trouble with configuration file... 2014-12-11 04:43:12 <EA039EC4-17F1-4D11-9DE4-01841B73521E@gmail.com> <EA039EC4-17F1-4D11-9DE4-01841B73521E@gmail.com> Check that, it turns out that it is only the l...
<2107085751.5438082.1418729306654.JavaMail.zimbra@phimeca.com> schueller@phimeca.... (Julien Schueller | PHIM... [IPython-User] graph displayed twice while ove... 2014-12-16 11:28:26 <750580697.5430991.1418728868883.JavaMail.zimb... None Hello,\n\nI'm having trouble overloading _repr...
<CAB2ViTagZsF3F8XV7v6cynnD-FiGrMcOM0sDvDCvMLgbzpsLVw@mail.gmail.com> phillip.m.feldman@gmail.... (Phillip Feldman) [IPython-User] hiding code in ipython notebooks 2014-12-24 05:45:32 None None There are situations where I would like to hid...
<50E70FD4-8EB8-441E-87C3-7034AC20C2C0@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-User] hiding code in ipython notebooks 2014-12-24 09:06:14 <CAB2ViTagZsF3F8XV7v6cynnD-FiGrMcOM0sDvDCvMLgb... <CAB2ViTagZsF3F8XV7v6cynnD-FiGrMcOM0sDvDCvMLgb... Le 24 d?c. 2014 ? 06:45, Phillip Feldman <phil...
<CAK6O52moioRUXPyKm2n6D5dmiQ7EFg952_-4iSm=HSxQ=K5RQg@mail.gmail.com> dsdale24@gmail.... (Darren Dale) [IPython-User] embedding ipython, namespace qu... 2014-12-27 16:07:23 None None Hello,\n\nI'm working on embedding an ipython ...
<27C60D72-6E87-4858-8C6A-6BC98584A9C7@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-User] embedding ipython, namespace qu... 2014-12-27 19:49:54 <CAK6O52moioRUXPyKm2n6D5dmiQ7EFg952_-4iSm=HSxQ... <CAK6O52moioRUXPyKm2n6D5dmiQ7EFg952_-4iSm=HSxQ... Hi Darren, \n\nIPython-user is sunsetting, so ...
<CAOdVj+Po44CE4ZJubU=qmKHhRYOkuzCxPDHgtoBb2KKPvSSPZg@mail.gmail.com> reabow@gmail.... (Aaron Reabow) [IPython-User] ipython notebook kernel shows b... 2015-01-04 06:55:48 None None I am having a problem whereby ipython notebook...
<77F267E6-A724-4EA2-9579-2E46FF38C24B@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-User] ipython notebook kernel shows b... 2015-01-04 17:17:23 <CAOdVj+Po44CE4ZJubU=qmKHhRYOkuzCxPDHgtoBb2KKP... <CAOdVj+Po44CE4ZJubU=qmKHhRYOkuzCxPDHgtoBb2KKP... Hi Aron,\n\nPlease try on Ipython-dev@scipy.o....
<CAGCtaoPWJoykOD7oN_D11WcaPSHkndMAy2CQ91Jy3G__m179HA@mail.gmail.com> catherine.devlin@gmail.... (Catherine Devlin) [IPython-User] trapping error thrown by line m... 2015-01-06 16:42:49 None None I really enjoy being able to build informal IP...
<CAHAreOrs00CW4W8VwwksKLbrTjqtHa=YKrU=ainBQLtcjPZjpQ@mail.gmail.com> fperez.net@gmail.... (Fernando Perez) [IPython-User] trapping error thrown by line m... 2015-01-09 04:43:18 <CAGCtaoPWJoykOD7oN_D11WcaPSHkndMAy2CQ91Jy3G__... <CAGCtaoPWJoykOD7oN_D11WcaPSHkndMAy2CQ91Jy3G__... Dear Catherine,\n\ncertain magics swallow thei...
<54B7C468.8060708@gmail.com> a.h.jaffe@gmail.... (Andrew Jaffe) [IPython-User] InlineBackend.figure_formats 2015-01-15 13:45:12 None None Hi all,\n\nI see that you can set the figure f...
<CAOvn4qhi1M7tefPLA=n418dGoDhm0z10i7izYTwi3Y+CNTyqEA@mail.gmail.com> takowl@gmail.... (Thomas Kluyver) [IPython-User] InlineBackend.figure_formats 2015-01-15 17:31:29 <54B7C468.8060708@gmail.com> <54B7C468.8060708@gmail.com> On 15 January 2015 at 05:45, Andrew Jaffe <a.h...
<CAHNn8BUQUnB72E5sNZOrvmceW+sRTVDe73ZmpZ1sxEZRKiAbGQ@mail.gmail.com> benjaminrk@gmail.... (MinRK) [IPython-User] InlineBackend.figure_formats 2015-01-15 17:54:36 <CAOvn4qhi1M7tefPLA=n418dGoDhm0z10i7izYTwi3Y+C... <54B7C468.8060708@gmail.com>\n\t<CAOvn4qhi1M7t... A possible use case being:\n\n- I want png or ...
<54B8C4A8.30807@gmail.com> a.h.jaffe@gmail.... (Andrew Jaffe) [IPython-User] InlineBackend.figure_formats 2015-01-16 07:58:32 <CAHNn8BUQUnB72E5sNZOrvmceW+sRTVDe73ZmpZ1sxEZR... <54B7C468.8060708@gmail.com>\t<CAOvn4qhi1M7tef... Dear All,\n\n>>> On 15 January 2015 at 05:...
<C8B6208B-795B-44E8-92B3-31872178026A@gmail.com> ellisonbg@gmail.... (Brian Granger) [IPython-User] InlineBackend.figure_formats 2015-01-16 16:30:29 <54B8C4A8.30807@gmail.com> <54B7C468.8060708@gmail.com>\n\t<CAOvn4qhi1M7t... Nbconvert latex output will already use the od...
<B708C8E2-A5E3-4630-846A-8B1EAE089383@gmail.com> benjaminrk@gmail.... (Min RK) [IPython-User] InlineBackend.figure_formats 2015-01-16 19:27:14 <54B8C4A8.30807@gmail.com> <54B7C468.8060708@gmail.com>\n\t<CAOvn4qhi1M7t... retina is 2x png, so it will be displayed in t...
<CALxxJLREXf+KYmGF9wH44A=50Z9uz19mgV0W93UkqNM-O67+iA@mail.gmail.com> denis.akhiyarov@gmail.... (Denis Akhiyarov) [IPython-User] survey/poll using ipython noteb... 2015-01-29 22:28:07 None None Is it possible to create a survey/poll in ipyt...
<CALAe=OLFhnyzXH0svqkX9kov_vbT-7QDnhJt7+RziNf6AqRWyg@mail.gmail.com> alexgarciac@gmail.... (Alexander Garcia Castro) [IPython-User] ipython in scholarly communication 2015-01-30 03:07:13 None None Dear all, Sepublica is particularly interested...
<F07B42A7-7577-4509-946C-EFAF32424E1D@gmail.com> bussonniermatthias@gmail.... (Matthias Bussonn... [IPython-User] survey/poll using ipython noteb... 2015-01-30 10:39:18 <CALxxJLREXf+KYmGF9wH44A=50Z9uz19mgV0W93UkqNM-... <CALxxJLREXf+KYmGF9wH44A=50Z9uz19mgV0W93UkqNM-... Hi, \n\n\nFirst I would suggest posting to IPy...

13145 rows × 6 columns

Now, let's see who has sent the most messages to one particular list.


In [9]:
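a  = activities[0] # activity table for the first mailing list
ta = a.sum(0) # total messages per sender (sum over axis 0)
ta.sort() # sort ascending, in place (newer pandas: ta = ta.sort_values())
ta[-10:].plot(kind='barh') # plot the ten most active senders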


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7b8ccf2ed0>

This might be useful for seeing the distribution (does the top message sender dominate?) or for identifying key participants to talk to.
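
To put a number on whether the top sender dominates, one option is to compute that sender's share of all messages. The sketch below uses a small made-up table in place of the real activity data; the names `activity`, `totals` and `top_share` are illustrative only, and it assumes a pandas version that provides `sort_values`.

```python
import pandas as pd

# Toy stand-in for an activity table: rows are days, columns are senders,
# values are message counts. The names and numbers are made up.
activity = pd.DataFrame(
    {"alice": [5, 3, 4], "bob": [1, 0, 2], "carol": [0, 1, 0]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
)

totals = activity.sum(axis=0).sort_values()  # messages per sender, ascending
top_share = totals.iloc[-1] / float(totals.sum())  # fraction sent by the top sender

print(totals.index[-1], round(top_share, 2))
```

A share close to 1 would indicate a list dominated by a single voice, while a small share points to a broader group of contributors.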


Many mailing lists will have some duplicate senders: individuals who use multiple email addresses or are recorded as different senders when using the same email address. We want to identify those potential duplicates in order to get a more accurate representation of the distribution of senders.

To begin with, let's do a naive calculation of the similarity of the From strings, based on the Levenshtein distance.

This can take a long time for a large matrix, so we will truncate it for purposes of demonstration.
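
For readers unfamiliar with the measure, here is a minimal pure-Python sketch of the Levenshtein distance and of a pairwise matrix built from it. It is not the `Levenshtein` package or BigBang's `matricize`; the function `edit_distance` and the sample `senders` are made up for illustration. The nested comprehension makes the cost visible: n senders require n × n distance computations.

```python
def edit_distance(s, t):
    """Levenshtein distance using the classic two-row dynamic programme."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            substitution = previous[j - 1] + (cs != ct)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]

# Hypothetical From strings
senders = [
    "alice@example.org (Alice)",
    "alice@example.com (Alice)",
    "bob@example.org (Bob)",
]

# One distance per ordered pair: len(senders) ** 2 calls in total
matrix = [[edit_distance(left, right) for right in senders] for left in senders]
```

Because the matrix is symmetric with a zero diagonal, computing only the upper triangle would roughly halve the work, but the growth is still quadratic in the number of senders.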


In [10]:
import Levenshtein
# calculate the edit distance between each pair of the first 100 From strings
distancedf = process.matricize(a.columns[:100], Levenshtein.distance)
df = distancedf.astype(int) # specify that the values in the matrix are integers

In [11]:
fig = plt.figure(figsize=(18, 18))
plt.pcolor(df)
#plt.yticks(np.arange(0.5, len(df.index), 1), df.index) # these lines would show labels, but that gets messy
#plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.show()


The dark blue diagonal comes from comparing each entry to itself, where the distance is zero. A few other dark blue patches away from the diagonal suggest that there are duplicates, even using this most naive measure.
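
Rather than reading the dark patches off the plot, the low-distance pairs can also be listed directly. The following sketch assumes a square, symmetric distance matrix held in a pandas DataFrame; the labels, values and `threshold` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up symmetric distance matrix for three hypothetical senders
names = ["A", "B", "C"]
dist = pd.DataFrame([[0, 2, 9],
                     [2, 0, 8],
                     [9, 8, 0]], index=names, columns=names)

threshold = 3  # illustrative cut-off, not a recommended value
values = dist.values
rows, cols = np.triu_indices(len(dist), k=1)  # upper triangle, diagonal excluded

close_pairs = [
    (dist.index[i], dist.columns[j], int(values[i, j]))
    for i, j in zip(rows, cols)
    if values[i, j] < threshold
]
print(close_pairs)
```

Restricting the search to the upper triangle skips the zero diagonal and reports each pair only once.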

Below is a variant of the visualization for inspecting the particular apparent duplicates.


In [12]:
levdf = process.sorted_lev(a) # creates a slightly more nuanced edit distance matrix
                              # and sorts by rows/columns that have the best candidates
levdf_corner = levdf.iloc[:25,:25] # just take the top 25


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-883567802061> in <module>()
----> 1 levdf = process.sorted_lev(a) # creates a slightly more nuanced edit distance matrix
      2                               # and sorts by rows/columns that have the best candidates
      3 levdf_corner = levdf.iloc[:25,:25] # just take the top 25

/home/sb/projects/bigbang/bigbang/process.pyc in sorted_lev(from_dataframe)
     76 
     77 def sorted_lev(from_dataframe):
---> 78     distancedf = matricize(from_dataframe.columns, lev_distance_normalized)
     79     # specify that the values in the matrix are integers
     80     df = distancedf.astype(int)

/home/sb/projects/bigbang/bigbang/process.pyc in matricize(series, func)
     50     for index, element in enumerate(series):
     51         for second_index, second_element in enumerate(series):
---> 52             matrix.iloc[index, second_index] = func(element, second_element)
     53 
     54     return matrix

/home/sb/projects/bigbang/bigbang/process.pyc in lev_distance_normalized(a, b)
     70     stop_characters = unicode('"<>')
     71     stop_characters_map = dict((ord(char), None) for char in stop_characters)
---> 72     a_normal = a.lower().translate(stop_characters_map)
     73     b_normal = b.lower().translate(stop_characters_map)
     74     return Levenshtein.distance(a_normal, b_normal)

TypeError: expected a character buffer object

In [12]:
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levdf_corner)
plt.yticks(np.arange(0.5, len(levdf_corner.index), 1), levdf_corner.index)
plt.xticks(np.arange(0.5, len(levdf_corner.columns), 1), levdf_corner.columns, rotation='vertical')
plt.colorbar()
plt.show()


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-452e5119904a> in <module>()
      1 fig = plt.figure(figsize=(15, 12))
----> 2 plt.pcolor(levdf_corner)
      3 plt.yticks(np.arange(0.5, len(levdf_corner.index), 1), levdf_corner.index)
      4 plt.xticks(np.arange(0.5, len(levdf_corner.columns), 1), levdf_corner.columns, rotation='vertical')
      5 plt.colorbar()

NameError: name 'levdf_corner' is not defined
<matplotlib.figure.Figure at 0x7f4ec19b0a10>

With this still naive measure (edit distance on a normalized string), many pairs with a distance below 10 appear to be genuine duplicates. Above that range, the results are increasingly dominated by short email addresses at common domain names, which are close in edit distance without belonging to the same person.
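
The sketch below illustrates both halves of that observation. `normalize` mimics the normalization visible in the traceback above (lower-casing and dropping the characters `"`, `<` and `>`), written for Python 3 with `str.maketrans`. `distance_upper_bound` is a hypothetical helper, not part of BigBang: two strings that share a suffix can differ by at most the length of the longer remaining prefix, so short addresses at the same domain always fall under a fixed threshold.

```python
import os.path

def normalize(sender):
    """Lower-case a From string and drop the characters ", < and >."""
    return sender.lower().translate(str.maketrans("", "", '"<>'))

def distance_upper_bound(a, b):
    """Upper bound on the edit distance: rewrite everything before the shared suffix."""
    shared_suffix = os.path.commonprefix([a[::-1], b[::-1]])
    return max(len(a), len(b)) - len(shared_suffix)

# Case and quoting differences disappear after normalization
print(normalize('"Jane Doe" <JDoe@Example.org>'))

# Two unrelated (made-up) short addresses at a common domain are still "close"
print(distance_upper_bound("ab@gmail.com", "xyz@gmail.com"))
```

One possible response, offered only as a suggestion, is to scale the threshold by string length instead of using a fixed cut-off of 10.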


In [15]:
consolidates = []

# gather pairs of names which have a distance of less than 10
for col in levdf.columns:
    for index, value in levdf.loc[levdf[col] < 10, col].iteritems():
        if index != col: # the name shouldn't be a pair for itself
            consolidates.append((col, index))

print str(len(consolidates)) + ' candidates for consolidation.'


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-147a3ecc1516> in <module>()
      2 
      3 # gather pairs of names which have a distance of less than 10
----> 4 for col in levdf.columns:
      5   for index, value in levdf.loc[levdf[col] < 10, col].iteritems():
      6         if index != col: # the name shouldn't be a pair for itself

NameError: name 'levdf' is not defined

In [14]:
c = process.consolidate_senders_activity(a, consolidates)
print 'We removed: ' + str(len(a.columns) - len(c.columns)) + ' columns.'


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-98a80ae98515> in <module>()
----> 1 c = process.consolidate_senders_activity(a, consolidates)
      2 print 'We removed: ' + str(len(a.columns) - len(c.columns)) + ' columns.'

NameError: name 'consolidates' is not defined

We can create the same color plot with the consolidated dataframe to see how the distribution has changed.


In [13]:
lev_c = process.sorted_lev(c)
levc_corner = lev_c.iloc[:25,:25]
fig = plt.figure(figsize=(15, 12))
plt.pcolor(levc_corner)
plt.yticks(np.arange(0.5, len(levc_corner.index), 1), levc_corner.index)
plt.xticks(np.arange(0.5, len(levc_corner.columns), 1), levc_corner.columns, rotation='vertical')
plt.colorbar()
plt.show()


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-345ed6bc31cd> in <module>()
----> 1 lev_c = process.sorted_lev(c)
      2 levc_corner = lev_c.iloc[:25,:25]
      3 fig = plt.figure(figsize=(15, 12))
      4 plt.pcolor(levc_corner)
      5 plt.yticks(np.arange(0.5, len(levc_corner.index), 1), levc_corner.index)

NameError: name 'c' is not defined

Of course, some duplicates remain, mostly people who use the same name with a different email address at an unrelated domain name.
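
One way to catch these remaining cases would be to compare display names instead of whole From strings. The sketch below assumes From strings of the form `address (Name)`; the helper `display_name` and the sample data are hypothetical, and this is not something BigBang is known to do.

```python
import re
from collections import defaultdict

def display_name(sender):
    """Return the lower-cased text inside the trailing parentheses, or None."""
    match = re.search(r"\(([^()]*)\)\s*$", sender)
    return match.group(1).strip().lower() if match else None

# Made-up From strings
senders = [
    "jdoe@example.org (Jane Doe)",
    "jane.doe@example.com (Jane Doe)",
    "bob@example.net (Bob Roe)",
]

by_name = defaultdict(list)
for sender in senders:
    name = display_name(sender)
    if name:
        by_name[name].append(sender)

# Keep only names that appear with more than one From string
same_name = {name: group for name, group in by_name.items() if len(group) > 1}
print(same_name)
```

Matching on names alone would also merge different people who happen to share a name, so the resulting groups are best treated as candidates to review by hand.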

How does our consolidation affect the graph of distribution of senders?


In [17]:
fig, axes = plt.subplots(nrows=2, figsize=(15, 12))

ta = a.sum(0) # total messages per sender, before consolidation
ta.sort() # sort ascending, in place
ta[-20:].plot(kind='barh',ax=axes[0], title='Before consolidation')
tc = c.sum(0) # total messages per sender, after consolidation
tc.sort()
tc[-20:].plot(kind='barh',ax=axes[1], title='After consolidation')
plt.show()


The result is not dramatically different, but the consolidation makes the head of the distribution heavier. More people sit close to the high end, which suggests a stronger core group rather than a power-law-style distribution that falls off smoothly from one or two people.
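
The effect on the head of the distribution can also be expressed as a number. The following sketch merges one duplicate pair in a made-up activity table and compares the share of messages sent by the top two senders before and after. It is a simplified illustration, not the implementation of `process.consolidate_senders_activity`; the data, the `pairs` list and the helper `top_share` are assumptions.

```python
import pandas as pd

# Made-up activity table: rows are days, columns are From strings
activity = pd.DataFrame({
    "ann@example.org (Ann)": [4, 2],
    "ann@example.com (Ann)": [3, 3],
    "bob@example.net (Bob)": [6, 5],
    "cy@example.net (Cy)": [1, 0],
})
pairs = [("ann@example.org (Ann)", "ann@example.com (Ann)")]

# Fold the second column of each pair into the first, then drop it
consolidated = activity.copy()
for keep, drop in pairs:
    consolidated[keep] = consolidated[keep] + consolidated[drop]
    consolidated = consolidated.drop(columns=[drop])

def top_share(frame, k):
    """Fraction of all messages sent by the k most active senders."""
    totals = frame.sum(axis=0).sort_values(ascending=False)
    return totals.iloc[:k].sum() / float(totals.sum())

print(top_share(activity, 2), top_share(consolidated, 2))
```

In this toy example the total number of messages is unchanged, but the top two senders account for a larger share after the merge, which is the "heavier head" described above.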