In [1]:
%matplotlib inline

In [2]:
from bigbang.archive import Archive
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

One interesting question for open source communities is whether they are growing. Often the founding members of a community would like to see new participants join and become active in the community. This is important for community longevity; ultimately, new members must take on leadership roles if a project is to sustain itself over time.

The data available for community participation is very granular, as it can include the exact traces of the messages sent by participants over a long history. One way of summarizing this information to get a sense of overall community growth is a cohort visualization.

In this notebook, we will produce a visualization of changing participation over time.


In [3]:
url = "http://mail.scipy.org/pipermail/numpy-discussion/"
arx = Archive(url,archive_dir="../archives")


No data found at http://mail.scipy.org/pipermail/numpy-discussion/. Attempting to collect data from URL.
This could take a while.
'Getting archive page for numpy-discussion'
['2014-November.txt.gz',
 '2014-October.txt.gz',
 '2014-September.txt.gz',
 '2014-August.txt.gz',
 '2014-July.txt.gz',
 '2014-June.txt.gz',
 '2014-May.txt.gz',
 '2014-April.txt.gz',
 '2014-March.txt.gz',
 '2014-February.txt.gz',
 '2014-January.txt.gz',
 '2013-December.txt.gz',
 '2013-November.txt.gz',
 '2013-October.txt.gz',
 '2013-September.txt.gz',
 '2013-August.txt.gz',
 '2013-July.txt.gz',
 '2013-June.txt.gz',
 '2013-May.txt.gz',
 '2013-April.txt.gz',
 '2013-March.txt.gz',
 '2013-February.txt.gz',
 '2013-January.txt.gz',
 '2012-December.txt.gz',
 '2012-November.txt.gz',
 '2012-October.txt.gz',
 '2012-September.txt.gz',
 '2012-August.txt.gz',
 '2012-July.txt.gz',
 '2012-June.txt.gz',
 '2012-May.txt.gz',
 '2012-April.txt.gz',
 '2012-March.txt.gz',
 '2012-February.txt.gz',
 '2012-January.txt.gz',
 '2011-December.txt.gz',
 '2011-November.txt.gz',
 '2011-October.txt.gz',
 '2011-September.txt.gz',
 '2011-August.txt.gz',
 '2011-July.txt.gz',
 '2011-June.txt.gz',
 '2011-May.txt.gz',
 '2011-April.txt.gz',
 '2011-March.txt.gz',
 '2011-February.txt.gz',
 '2011-January.txt.gz',
 '2010-December.txt.gz',
 '2010-November.txt.gz',
 '2010-October.txt.gz',
 '2010-September.txt.gz',
 '2010-August.txt.gz',
 '2010-July.txt.gz',
 '2010-June.txt.gz',
 '2010-May.txt.gz',
 '2010-April.txt.gz',
 '2010-March.txt.gz',
 '2010-February.txt.gz',
 '2010-January.txt.gz',
 '2009-December.txt.gz',
 '2009-November.txt.gz',
 '2009-October.txt.gz',
 '2009-September.txt.gz',
 '2009-August.txt.gz',
 '2009-July.txt.gz',
 '2009-June.txt.gz',
 '2009-May.txt.gz',
 '2009-April.txt.gz',
 '2009-March.txt.gz',
 '2009-February.txt.gz',
 '2009-January.txt.gz',
 '2008-December.txt.gz',
 '2008-November.txt.gz',
 '2008-October.txt.gz',
 '2008-September.txt.gz',
 '2008-August.txt.gz',
 '2008-July.txt.gz',
 '2008-June.txt.gz',
 '2008-May.txt.gz',
 '2008-April.txt.gz',
 '2008-March.txt.gz',
 '2008-February.txt.gz',
 '2008-January.txt.gz',
 '2007-December.txt.gz',
 '2007-November.txt.gz',
 '2007-October.txt.gz',
 '2007-September.txt.gz',
 '2007-August.txt.gz',
 '2007-July.txt.gz',
 '2007-June.txt.gz',
 '2007-May.txt.gz',
 '2007-April.txt.gz',
 '2007-March.txt.gz',
 '2007-February.txt.gz',
 '2007-January.txt.gz',
 '2006-December.txt.gz',
 '2006-November.txt.gz',
 '2006-October.txt.gz',
 '2006-September.txt.gz',
 '2006-August.txt.gz',
 '2006-July.txt.gz',
 '2006-June.txt.gz',
 '2006-May.txt.gz',
 '2006-April.txt.gz',
 '2006-March.txt.gz',
 '2006-February.txt.gz',
 '2006-January.txt.gz',
 '2005-December.txt.gz',
 '2005-November.txt.gz',
 '2005-October.txt.gz',
 '2005-September.txt.gz',
 '2005-August.txt.gz',
 '2005-July.txt.gz',
 '2005-June.txt.gz',
 '2005-May.txt.gz',
 '2005-April.txt.gz',
 '2005-March.txt.gz',
 '2005-February.txt.gz',
 '2005-January.txt.gz',
 '2004-December.txt.gz',
 '2004-November.txt.gz',
 '2004-October.txt.gz',
 '2004-September.txt.gz',
 '2004-August.txt.gz',
 '2004-July.txt.gz',
 '2004-June.txt.gz',
 '2004-May.txt.gz',
 '2004-April.txt.gz',
 '2004-March.txt.gz',
 '2004-February.txt.gz',
 '2004-January.txt.gz',
 '2003-December.txt.gz',
 '2003-November.txt.gz',
 '2003-October.txt.gz',
 '2003-September.txt.gz',
 '2003-August.txt.gz',
 '2003-July.txt.gz',
 '2003-June.txt.gz',
 '2003-May.txt.gz',
 '2003-April.txt.gz',
 '2003-March.txt.gz',
 '2003-February.txt.gz',
 '2003-January.txt.gz',
 '2002-December.txt.gz',
 '2002-November.txt.gz',
 '2002-October.txt.gz',
 '2002-September.txt.gz',
 '2002-August.txt.gz',
 '2002-July.txt.gz',
 '2002-June.txt.gz',
 '2002-May.txt.gz',
 '2002-April.txt.gz',
 '2002-March.txt.gz',
 '2002-February.txt.gz',
 '2002-January.txt.gz',
 '2001-December.txt.gz',
 '2001-November.txt.gz',
 '2001-October.txt.gz',
 '2001-September.txt.gz',
 '2001-August.txt.gz',
 '2001-July.txt.gz',
 '2001-June.txt.gz',
 '2001-May.txt.gz',
 '2001-April.txt.gz',
 '2001-March.txt.gz',
 '2001-February.txt.gz',
 '2001-January.txt.gz',
 '2000-December.txt.gz',
 '2000-November.txt.gz',
 '2000-October.txt.gz',
 '2000-September.txt.gz',
 '2000-August.txt.gz',
 '2000-July.txt.gz',
 '2000-June.txt.gz',
 '2000-May.txt.gz',
 '2000-April.txt.gz',
 '2000-March.txt.gz',
 '2000-February.txt.gz',
 '2000-January.txt.gz']
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-November.txt.gz'
200 - writing file to archives/numpy-discussion/2014-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-October.txt.gz'
200 - writing file to archives/numpy-discussion/2014-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-September.txt.gz'
200 - writing file to archives/numpy-discussion/2014-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-August.txt.gz'
200 - writing file to archives/numpy-discussion/2014-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-July.txt.gz'
200 - writing file to archives/numpy-discussion/2014-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-June.txt.gz'
200 - writing file to archives/numpy-discussion/2014-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-May.txt.gz'
200 - writing file to archives/numpy-discussion/2014-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-April.txt.gz'
200 - writing file to archives/numpy-discussion/2014-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-March.txt.gz'
200 - writing file to archives/numpy-discussion/2014-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-February.txt.gz'
200 - writing file to archives/numpy-discussion/2014-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2014-January.txt.gz'
200 - writing file to archives/numpy-discussion/2014-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-December.txt.gz'
200 - writing file to archives/numpy-discussion/2013-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-November.txt.gz'
200 - writing file to archives/numpy-discussion/2013-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-October.txt.gz'
200 - writing file to archives/numpy-discussion/2013-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-September.txt.gz'
200 - writing file to archives/numpy-discussion/2013-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-August.txt.gz'
200 - writing file to archives/numpy-discussion/2013-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-July.txt.gz'
200 - writing file to archives/numpy-discussion/2013-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-June.txt.gz'
200 - writing file to archives/numpy-discussion/2013-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-May.txt.gz'
200 - writing file to archives/numpy-discussion/2013-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-April.txt.gz'
200 - writing file to archives/numpy-discussion/2013-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-March.txt.gz'
200 - writing file to archives/numpy-discussion/2013-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-February.txt.gz'
200 - writing file to archives/numpy-discussion/2013-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2013-January.txt.gz'
200 - writing file to archives/numpy-discussion/2013-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-December.txt.gz'
200 - writing file to archives/numpy-discussion/2012-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-November.txt.gz'
200 - writing file to archives/numpy-discussion/2012-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-October.txt.gz'
200 - writing file to archives/numpy-discussion/2012-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-September.txt.gz'
200 - writing file to archives/numpy-discussion/2012-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-August.txt.gz'
200 - writing file to archives/numpy-discussion/2012-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-July.txt.gz'
200 - writing file to archives/numpy-discussion/2012-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-June.txt.gz'
200 - writing file to archives/numpy-discussion/2012-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-May.txt.gz'
200 - writing file to archives/numpy-discussion/2012-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-April.txt.gz'
200 - writing file to archives/numpy-discussion/2012-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-March.txt.gz'
200 - writing file to archives/numpy-discussion/2012-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-February.txt.gz'
200 - writing file to archives/numpy-discussion/2012-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2012-January.txt.gz'
200 - writing file to archives/numpy-discussion/2012-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-December.txt.gz'
200 - writing file to archives/numpy-discussion/2011-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-November.txt.gz'
200 - writing file to archives/numpy-discussion/2011-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-October.txt.gz'
200 - writing file to archives/numpy-discussion/2011-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-September.txt.gz'
200 - writing file to archives/numpy-discussion/2011-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-August.txt.gz'
200 - writing file to archives/numpy-discussion/2011-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-July.txt.gz'
200 - writing file to archives/numpy-discussion/2011-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-June.txt.gz'
200 - writing file to archives/numpy-discussion/2011-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-May.txt.gz'
200 - writing file to archives/numpy-discussion/2011-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-April.txt.gz'
200 - writing file to archives/numpy-discussion/2011-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-March.txt.gz'
200 - writing file to archives/numpy-discussion/2011-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-February.txt.gz'
200 - writing file to archives/numpy-discussion/2011-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2011-January.txt.gz'
200 - writing file to archives/numpy-discussion/2011-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-December.txt.gz'
200 - writing file to archives/numpy-discussion/2010-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-November.txt.gz'
200 - writing file to archives/numpy-discussion/2010-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-October.txt.gz'
200 - writing file to archives/numpy-discussion/2010-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-September.txt.gz'
200 - writing file to archives/numpy-discussion/2010-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-August.txt.gz'
200 - writing file to archives/numpy-discussion/2010-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-July.txt.gz'
200 - writing file to archives/numpy-discussion/2010-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-June.txt.gz'
200 - writing file to archives/numpy-discussion/2010-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-May.txt.gz'
200 - writing file to archives/numpy-discussion/2010-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-April.txt.gz'
200 - writing file to archives/numpy-discussion/2010-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-March.txt.gz'
200 - writing file to archives/numpy-discussion/2010-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-February.txt.gz'
200 - writing file to archives/numpy-discussion/2010-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2010-January.txt.gz'
200 - writing file to archives/numpy-discussion/2010-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-December.txt.gz'
200 - writing file to archives/numpy-discussion/2009-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-November.txt.gz'
200 - writing file to archives/numpy-discussion/2009-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-October.txt.gz'
200 - writing file to archives/numpy-discussion/2009-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-September.txt.gz'
200 - writing file to archives/numpy-discussion/2009-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-August.txt.gz'
200 - writing file to archives/numpy-discussion/2009-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-July.txt.gz'
200 - writing file to archives/numpy-discussion/2009-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-June.txt.gz'
200 - writing file to archives/numpy-discussion/2009-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-May.txt.gz'
200 - writing file to archives/numpy-discussion/2009-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-April.txt.gz'
200 - writing file to archives/numpy-discussion/2009-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-March.txt.gz'
200 - writing file to archives/numpy-discussion/2009-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-February.txt.gz'
200 - writing file to archives/numpy-discussion/2009-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2009-January.txt.gz'
200 - writing file to archives/numpy-discussion/2009-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-December.txt.gz'
200 - writing file to archives/numpy-discussion/2008-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-November.txt.gz'
200 - writing file to archives/numpy-discussion/2008-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-October.txt.gz'
200 - writing file to archives/numpy-discussion/2008-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-September.txt.gz'
200 - writing file to archives/numpy-discussion/2008-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-August.txt.gz'
200 - writing file to archives/numpy-discussion/2008-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-July.txt.gz'
200 - writing file to archives/numpy-discussion/2008-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-June.txt.gz'
200 - writing file to archives/numpy-discussion/2008-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-May.txt.gz'
200 - writing file to archives/numpy-discussion/2008-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-April.txt.gz'
200 - writing file to archives/numpy-discussion/2008-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-March.txt.gz'
200 - writing file to archives/numpy-discussion/2008-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-February.txt.gz'
200 - writing file to archives/numpy-discussion/2008-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2008-January.txt.gz'
200 - writing file to archives/numpy-discussion/2008-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-December.txt.gz'
200 - writing file to archives/numpy-discussion/2007-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-November.txt.gz'
200 - writing file to archives/numpy-discussion/2007-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-October.txt.gz'
200 - writing file to archives/numpy-discussion/2007-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-September.txt.gz'
200 - writing file to archives/numpy-discussion/2007-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-August.txt.gz'
200 - writing file to archives/numpy-discussion/2007-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-July.txt.gz'
200 - writing file to archives/numpy-discussion/2007-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-June.txt.gz'
200 - writing file to archives/numpy-discussion/2007-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-May.txt.gz'
200 - writing file to archives/numpy-discussion/2007-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-April.txt.gz'
200 - writing file to archives/numpy-discussion/2007-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-March.txt.gz'
200 - writing file to archives/numpy-discussion/2007-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-February.txt.gz'
200 - writing file to archives/numpy-discussion/2007-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2007-January.txt.gz'
200 - writing file to archives/numpy-discussion/2007-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-December.txt.gz'
200 - writing file to archives/numpy-discussion/2006-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-November.txt.gz'
200 - writing file to archives/numpy-discussion/2006-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-October.txt.gz'
200 - writing file to archives/numpy-discussion/2006-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-September.txt.gz'
200 - writing file to archives/numpy-discussion/2006-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-August.txt.gz'
200 - writing file to archives/numpy-discussion/2006-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-July.txt.gz'
200 - writing file to archives/numpy-discussion/2006-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-June.txt.gz'
200 - writing file to archives/numpy-discussion/2006-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-May.txt.gz'
200 - writing file to archives/numpy-discussion/2006-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-April.txt.gz'
200 - writing file to archives/numpy-discussion/2006-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-March.txt.gz'
200 - writing file to archives/numpy-discussion/2006-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-February.txt.gz'
200 - writing file to archives/numpy-discussion/2006-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2006-January.txt.gz'
200 - writing file to archives/numpy-discussion/2006-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-December.txt.gz'
200 - writing file to archives/numpy-discussion/2005-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-November.txt.gz'
200 - writing file to archives/numpy-discussion/2005-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-October.txt.gz'
200 - writing file to archives/numpy-discussion/2005-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-September.txt.gz'
200 - writing file to archives/numpy-discussion/2005-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-August.txt.gz'
200 - writing file to archives/numpy-discussion/2005-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-July.txt.gz'
200 - writing file to archives/numpy-discussion/2005-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-June.txt.gz'
200 - writing file to archives/numpy-discussion/2005-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-May.txt.gz'
200 - writing file to archives/numpy-discussion/2005-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-April.txt.gz'
200 - writing file to archives/numpy-discussion/2005-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-March.txt.gz'
200 - writing file to archives/numpy-discussion/2005-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-February.txt.gz'
200 - writing file to archives/numpy-discussion/2005-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2005-January.txt.gz'
200 - writing file to archives/numpy-discussion/2005-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-December.txt.gz'
200 - writing file to archives/numpy-discussion/2004-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-November.txt.gz'
200 - writing file to archives/numpy-discussion/2004-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-October.txt.gz'
200 - writing file to archives/numpy-discussion/2004-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-September.txt.gz'
200 - writing file to archives/numpy-discussion/2004-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-August.txt.gz'
200 - writing file to archives/numpy-discussion/2004-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-July.txt.gz'
200 - writing file to archives/numpy-discussion/2004-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-June.txt.gz'
200 - writing file to archives/numpy-discussion/2004-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-May.txt.gz'
200 - writing file to archives/numpy-discussion/2004-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-April.txt.gz'
200 - writing file to archives/numpy-discussion/2004-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-March.txt.gz'
200 - writing file to archives/numpy-discussion/2004-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-February.txt.gz'
200 - writing file to archives/numpy-discussion/2004-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2004-January.txt.gz'
200 - writing file to archives/numpy-discussion/2004-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-December.txt.gz'
200 - writing file to archives/numpy-discussion/2003-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-November.txt.gz'
200 - writing file to archives/numpy-discussion/2003-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-October.txt.gz'
200 - writing file to archives/numpy-discussion/2003-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-September.txt.gz'
200 - writing file to archives/numpy-discussion/2003-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-August.txt.gz'
200 - writing file to archives/numpy-discussion/2003-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-July.txt.gz'
200 - writing file to archives/numpy-discussion/2003-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-June.txt.gz'
200 - writing file to archives/numpy-discussion/2003-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-May.txt.gz'
200 - writing file to archives/numpy-discussion/2003-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-April.txt.gz'
200 - writing file to archives/numpy-discussion/2003-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-March.txt.gz'
200 - writing file to archives/numpy-discussion/2003-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-February.txt.gz'
200 - writing file to archives/numpy-discussion/2003-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2003-January.txt.gz'
200 - writing file to archives/numpy-discussion/2003-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-December.txt.gz'
200 - writing file to archives/numpy-discussion/2002-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-November.txt.gz'
200 - writing file to archives/numpy-discussion/2002-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-October.txt.gz'
200 - writing file to archives/numpy-discussion/2002-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-September.txt.gz'
200 - writing file to archives/numpy-discussion/2002-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-August.txt.gz'
200 - writing file to archives/numpy-discussion/2002-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-July.txt.gz'
200 - writing file to archives/numpy-discussion/2002-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-June.txt.gz'
200 - writing file to archives/numpy-discussion/2002-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-May.txt.gz'
200 - writing file to archives/numpy-discussion/2002-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-April.txt.gz'
200 - writing file to archives/numpy-discussion/2002-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-March.txt.gz'
200 - writing file to archives/numpy-discussion/2002-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-February.txt.gz'
200 - writing file to archives/numpy-discussion/2002-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2002-January.txt.gz'
200 - writing file to archives/numpy-discussion/2002-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-December.txt.gz'
200 - writing file to archives/numpy-discussion/2001-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-November.txt.gz'
200 - writing file to archives/numpy-discussion/2001-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-October.txt.gz'
200 - writing file to archives/numpy-discussion/2001-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-September.txt.gz'
200 - writing file to archives/numpy-discussion/2001-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-August.txt.gz'
200 - writing file to archives/numpy-discussion/2001-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-July.txt.gz'
200 - writing file to archives/numpy-discussion/2001-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-June.txt.gz'
200 - writing file to archives/numpy-discussion/2001-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-May.txt.gz'
200 - writing file to archives/numpy-discussion/2001-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-April.txt.gz'
200 - writing file to archives/numpy-discussion/2001-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-March.txt.gz'
200 - writing file to archives/numpy-discussion/2001-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-February.txt.gz'
200 - writing file to archives/numpy-discussion/2001-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2001-January.txt.gz'
200 - writing file to archives/numpy-discussion/2001-January.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-December.txt.gz'
200 - writing file to archives/numpy-discussion/2000-December.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-November.txt.gz'
200 - writing file to archives/numpy-discussion/2000-November.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-October.txt.gz'
200 - writing file to archives/numpy-discussion/2000-October.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-September.txt.gz'
200 - writing file to archives/numpy-discussion/2000-September.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-August.txt.gz'
200 - writing file to archives/numpy-discussion/2000-August.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-July.txt.gz'
200 - writing file to archives/numpy-discussion/2000-July.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-June.txt.gz'
200 - writing file to archives/numpy-discussion/2000-June.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-May.txt.gz'
200 - writing file to archives/numpy-discussion/2000-May.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-April.txt.gz'
200 - writing file to archives/numpy-discussion/2000-April.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-March.txt.gz'
200 - writing file to archives/numpy-discussion/2000-March.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-February.txt.gz'
200 - writing file to archives/numpy-discussion/2000-February.txt.gz
'retrieving http://mail.scipy.org/pipermail/numpy-discussion/2000-January.txt.gz'
200 - writing file to archives/numpy-discussion/2000-January.txt.gz
unzipping 179 archive files
Opening 179 archive files
Date parsing error on: 
Wed, 01 Nov 2006 15:46:73 +0800
Date parsing error on: 
Wed, 01 Nov 2006 15:46:73 +0800

In [4]:
arx.data[:1]


Out[4]:
Message-ID   <NDBBIEFMILBFPMDHJIMFEEAGCCAA.pauldubois@home.com>
From         pauldubois at home.com (Paul F. Dubois)
Subject      [Numpy-discussion] test
Date         2000-01-20 23:07:52
In-Reply-To  None
References   None
Body         This is a test.\nIgnore.\n\nPaul\n\n\n\n

1 rows × 6 columns

Archive objects have a method that reports for each user how many emails they sent each day.


In [5]:
act = arx.get_activity()


/home/aryan/urap/bigbang/bigbang/archive.py:92: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  mdf2['Date'] = mdf['Date'].apply(lambda x: x.toordinal())
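
To make the shape of such an activity table concrete, here is a minimal sketch on made-up messages. It assumes, based on the warning above, that dates are converted to day ordinals, and that the result has one row per day and one column per sender. The `toy` frame, its values, and the `Day` column are hypothetical; this is not BigBang's actual implementation.

```python
import datetime as dt
import pandas as pd

# Hypothetical messages, one row per email (not real list data)
toy = pd.DataFrame({
    "From": ["alice", "bob", "alice", "alice", "carol"],
    "Date": [dt.date(2000, 1, 20), dt.date(2000, 1, 20),
             dt.date(2000, 1, 20), dt.date(2000, 1, 21),
             dt.date(2000, 1, 22)],
})

# Convert each date to its ordinal (an integer day number)
toy["Day"] = toy["Date"].apply(lambda d: d.toordinal())

# Count messages per (day, sender); a sender with no mail on a day gets 0
toy_act = toy.groupby(["Day", "From"]).size().unstack(fill_value=0)
```

Each cell of `toy_act` then holds the number of messages a sender posted on a given day, which is the layout the rest of this notebook relies on.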

This plot shows when each sender sent their first post, with senders ordered by that date. A shallow slope marks a period in which many people joined.


In [6]:
fig = plt.figure(figsize=(12.5, 7.5))

#act.idxmax().sort_values().T.plot()
# Series.order() was removed from pandas; sort_values() is its replacement
(act > 0).idxmax().sort_values().plot()

fig.axes[0].yaxis_date()
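
The expression `(act > 0).idxmax()` works because `idxmax` returns the index label of the first occurrence of the maximum. On a boolean column the maximum is `True`, so the result is the first day with at least one message. A small sketch, using a hypothetical `toy_act` frame with made-up day numbers and senders:

```python
import pandas as pd

# Hypothetical activity table: rows are day numbers, columns are senders
toy_act = pd.DataFrame(
    {"alice": [2, 1, 0, 0], "bob": [0, 0, 3, 1], "carol": [0, 1, 0, 0]},
    index=[730139, 730140, 730141, 730142],
)

# First day on which each sender posted, sorted chronologically
toy_first_post = (toy_act > 0).idxmax().sort_values()
```

One caveat of this idiom: a column with no positive entries would report the first row's label, so it relies on every sender having posted at least once.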


This is the same data, but plotted as a histogram. It's easier to see the trends here.


In [7]:
fig = plt.figure(figsize=(12.5, 7.5))

(act > 0).idxmax().sort_values().hist()

fig.axes[0].xaxis_date()


While this is interesting, what if we want to know how well different "cohorts" of participants stick around and continue to participate in the community over time?

What we want to do is divide the participants into N cohorts based on the percentile of when they joined the mailing list. That is, the first 1/N of the people to participate in the mailing list form the first cohort, the second 1/N form the second cohort, and so on.

Then we can combine the activities of each cohort and do a stackplot of how each cohort has participated over time.
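
As a minimal sketch of the splitting step, assume a hypothetical array of senders that is already sorted by first-post date. `np.array_split` divides it into N consecutive groups whose sizes differ by at most one, so cohorts are defined by join order rather than by equal calendar intervals.

```python
import numpy as np

# Hypothetical senders, already sorted by the day of their first post
senders = np.array(["a", "b", "c", "d", "e", "f", "g"])

# Split into 3 cohorts; unlike np.split, array_split accepts
# a length that is not evenly divisible by the number of parts
toy_cohorts = np.array_split(senders, 3)
sizes = [len(c) for c in toy_cohorts]
```

With seven senders and three cohorts, the earliest cohort receives the extra member.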


In [8]:
n = 5

In [9]:
import numpy as np

# A series, indexed by users, of the day of their first post
# This series is ordered by time
first_post = (act > 0).idxmax().sort_values()

# Splitting the ordered users into n equal parts (here five),
# each representing a chronological quintile of list members
cohorts = np.array_split(first_post.index.values,n)

# In order to make the data easier to visualize,
# we will take a rolling average over ten days.
# In signal processing, this operation is called
# a convolution
#   http://en.wikipedia.org/wiki/Convolution
convolution_array = [.1,.1,.1,.1,.1,.1,.1,.1,.1,.1]

# Sum each cohort's daily message counts across its members, then smooth
cohort_activity = [np.convolve(act[cohorts[i]].values.sum(1),
                               convolution_array,mode='same')
                   for i in range(n)]
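
To see why convolving with ten weights of 0.1 amounts to a ten-day average, here is a small sketch on a made-up daily series (the `daily` values are hypothetical). With `mode='same'` the output has the same length as the input, and values near the edges are averaged as if the missing days contained zeros.

```python
import numpy as np

# Hypothetical daily message counts for one cohort
daily = np.array([0., 10., 0., 0., 20., 0., 0., 0., 10., 0., 0., 0.])

# Ten equal weights summing to 1: each output is a ten-day mean
kernel = np.full(10, 0.1)

smoothed = np.convolve(daily, kernel, mode="same")

# An interior point equals the mean of the ten-day window around it
window_mean = daily[0:10].mean()
```

Here `smoothed[5]` matches `window_mean`, while the first and last few entries are pulled toward zero by the implicit padding.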

In [10]:
fig = plt.figure(figsize=(12.5, 7.5))

# np.row_stack is deprecated in recent NumPy; np.vstack is equivalent
d = np.vstack(cohort_activity)

plt.stackplot(act.index,d,linewidth=0)

fig.axes[0].xaxis_date()

plt.show()


This gives us a sense of when new members are taking the lead in the community. But what if the old members are just changing their email addresses? To test that case, we should clean our data with entity resolution techniques.
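
As a rough illustration of what such cleaning could look like, the sketch below merges activity columns whose senders share a display name. It assumes sender labels follow the "address (Name)" pattern seen in the `From` field earlier in this notebook; the `toy_act` frame and the `display_name` helper are hypothetical. This is a naive heuristic, not BigBang's entity resolution method, and it would wrongly merge distinct people who share a name.

```python
import re
import pandas as pd

# Hypothetical activity table whose columns are "From" strings
toy_act = pd.DataFrame({
    "pauldubois at home.com (Paul F. Dubois)": [1, 0, 0],
    "dubois at example.org (Paul F. Dubois)": [0, 0, 2],
    "jane at example.org (Jane Roe)": [0, 1, 1],
})

def display_name(sender):
    """Return the lower-cased name in trailing parentheses, else the whole string."""
    match = re.search(r"\(([^)]*)\)\s*$", sender)
    return (match.group(1) if match else sender).strip().lower()

# Group columns by display name and sum the activity of each group
merged = toy_act.T.groupby(display_name).sum().T
```

After merging, the two addresses for the same display name contribute to a single column, so a change of email address no longer looks like a new participant.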

