``````

In :

%matplotlib inline
from bigbang.archive import Archive
import matplotlib.pyplot as plt
import datetime

``````

First, collect data from a public email archive.

``````

In :

url = "https://lists.wikimedia.org/pipermail/analytics/"
arx = Archive(url, archive_dir="../archives")

``````

We can easily count the number of threads in the archive. The first call to `Archive.get_threads` may take some time to compute, but the result is cached in the Archive object, so later calls return quickly.

``````

In :

len(arx.get_threads())

``````
``````

Out:

628

``````
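
The caching behaviour described above can be sketched with a small stand-in class. This is not BigBang's implementation: `CachedThreads`, its `(message_id, in_reply_to)` input format, and the `computations` counter are all hypothetical, chosen only to show the "compute once, reuse afterwards" pattern.

```python
class CachedThreads:
    """Hypothetical stand-in for an archive that caches its thread list."""

    def __init__(self, messages):
        # messages: list of (message_id, in_reply_to) pairs (assumed format)
        self.messages = messages
        self._threads = None      # filled in on the first get_threads() call
        self.computations = 0     # counts how often the grouping really ran

    def get_threads(self):
        if self._threads is None:
            self.computations += 1
            parent = dict(self.messages)
            roots = {}
            for mid, _ in self.messages:
                # Walk up the reply chain until a message with no known parent.
                root = mid
                while parent.get(root) is not None:
                    root = parent[root]
                roots.setdefault(root, []).append(mid)
            self._threads = list(roots.values())
        return self._threads


archive = CachedThreads([("a", None), ("b", "a"), ("c", None), ("d", "b")])
print(len(archive.get_threads()))  # first call does the grouping
print(len(archive.get_threads()))  # second call reuses the cached list
print(archive.computations)
```

Because the result is stored on the object, creating a new object discards the cache and the grouping runs again.
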

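We can plot a histogram of the number of messages in each thread. In most cases the distribution is heavily right-skewed, roughly resembling a power law: most threads contain only a few messages, while a small number grow very long.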

``````

In :

y = [t.get_num_messages() for t in arx.get_threads()]

plt.hist(y, bins=30)
plt.xlabel('number of messages in a thread')
plt.show()

``````
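
To get a feel for this kind of skewed distribution without plotting, we can tally thread sizes directly. The sketch below uses made-up message counts, not values from the archive, and assumes Python 3.

```python
from collections import Counter

# Hypothetical message counts per thread (not taken from the archive).
sizes = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5, 8, 21]

# How many threads have each size, shown as a crude text histogram.
freq = Counter(sizes)
for size in sorted(freq):
    print(f"{size:>3} messages: {'#' * freq[size]}")

# Share of threads that have at most two messages.
small_share = sum(1 for s in sizes if s <= 2) / len(sizes)
print(round(small_share, 2))
```

With real data, passing `log=True` to `plt.hist` puts the counts on a logarithmic axis, which can make a long tail easier to see.
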

We can also plot the number of people participating in each thread. Here, the participants are differentiated by the From: header on the emails they've sent.

``````

In :

n = [t.get_num_people() for t in arx.get_threads()]

plt.hist(n, bins=20)
plt.xlabel('number of participants in a thread')
plt.show()

``````
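
Counting participants by their raw From: header can overcount, because one person may appear under several spellings of the same address. The sketch below shows one possible normalization using the standard library. The sample headers are invented, and this is an assumption about how one might clean the data, not a description of what `get_num_people` does.

```python
from email.utils import parseaddr

# Invented From: headers; three of them belong to the same address.
from_headers = [
    "Alice Example <alice@example.org>",
    "alice@example.org (Alice Example)",
    "Bob Example <bob@example.org>",
    "ALICE@example.org",
]

# Distinct raw header strings.
raw = set(from_headers)

# Distinct addresses after extracting the address part and lowercasing it.
normalized = {parseaddr(h)[1].lower() for h in from_headers}

print(len(raw), len(normalized))
```
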

The duration of a thread is the amount of elapsed time between its first and last message.

``````

In :

y = [t.get_duration().days for t in arx.get_threads()]

plt.hist(y, bins=10)
plt.xlabel('thread duration (days)')
plt.show()

``````
``````

In :

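# Assuming get_duration() returns a datetime.timedelta, .seconds is only the
# remainder after whole days are removed (0-86399), not the full duration.
y = [t.get_duration().seconds for t in arx.get_threads()]

plt.hist(y, bins=10)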
plt.show()

``````
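
The `.days` and `.seconds` attributes used above suggest that `get_duration()` returns a `datetime.timedelta`; that is an assumption here. Under that assumption, the difference between the two attributes and the full duration can be seen with two made-up timestamps:

```python
from datetime import datetime

# Made-up timestamps for the first and last message of a thread.
first = datetime(2014, 1, 1, 9, 0, 0)
last = datetime(2014, 1, 4, 10, 30, 0)
duration = last - first

print(duration.days)             # whole days only
print(duration.seconds)          # leftover seconds within the last partial day
print(duration.total_seconds())  # the entire duration expressed in seconds
```

A histogram of `.seconds` therefore shows the sub-day remainder of each thread's duration, while `.total_seconds()` would be the value to plot for overall length.
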

You can examine the properties of a single thread.

``````

In :

``````
``````

19:49:47

``````
``````

In :

``````
``````

In :

content

``````
``````

Out:

'Welcome to the the inaugural Analytics Mailing list email.\n\nHere all your analytics wishes comes true, \n\n\nso proposals, ideas, crazy ideas, crazy crazy ideas are welcome here!\nas long as we can count something it is welcome. \n\n\nD\n\n'

``````
``````

In :

len(content.split())

``````
``````

Out:

38

``````

Suppose we want to know whether longer threads (those containing more messages) also tend to have wordier messages. We compare the average number of words per message in short and long threads.

``````

In :

``````
``````

In :

``````
``````

471
157

``````
``````

In :

``````
``````

Out:

13

``````
``````

In :

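# short_threads and long_threads are assumed (hypothetical) names for the
# lists of short and long threads built in the cells above.
dist_short = []
dist_long = []
for t in short_threads:
    contents = t.get_content()
    dist_short.append(sum(len(i.split()) for i in contents) / len(contents))
for t in long_threads:
    contents = t.get_content()
    dist_long.append(sum(len(i.split()) for i in contents) / len(contents))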

``````
``````

In :

plt.hist(dist_short, bins=15)
plt.show()

``````
``````

In :

plt.hist(dist_long, bins=15)
plt.show()

``````
``````

In :

print(sum(dist_short) / len(dist_short))
print(sum(dist_long) / len(dist_long))

``````
``````

140
110

``````
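
The comparison above can be reproduced end to end on toy data. In this sketch each thread is just a list of message bodies, and `THRESHOLD` is a hypothetical cutoff between short and long threads; neither the data nor the cutoff comes from the archive.

```python
from statistics import mean


def mean_words_per_message(bodies):
    """Average word count over the message bodies of one thread."""
    return sum(len(body.split()) for body in bodies) / len(bodies)


# Made-up threads, each represented as a list of message bodies.
threads = [
    ["hello world", "thanks a lot"],
    ["one two three four"],
    ["a b", "c d", "e f", "g h"],
]
THRESHOLD = 3  # hypothetical: fewer messages than this counts as "short"

dist_short = [mean_words_per_message(t) for t in threads if len(t) < THRESHOLD]
dist_long = [mean_words_per_message(t) for t in threads if len(t) >= THRESHOLD]

print(mean(dist_short), mean(dist_long))
```

Averaging per thread first, and then across threads, gives every thread equal weight regardless of how many messages it contains.
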
``````

In :

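s_leaves = []
s_notleaves = []
# Call arx.get_threads() to get the list; looping over the method object
# itself raises "TypeError: 'instancemethod' object is not iterable".
for t in arx.get_threads():
    for node in t.get_leaves():
        s_leaves.append(len(node.data['Body'].split()))
    for node in t.get_not_leaves():
        s_notleaves.append(len(node.data['Body'].split()))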

``````
``````

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-69067a14eb67> in <module>()
2 s_leaves = []
3 s_notleaves = []
----> 4 for t in threads:
5     for node in t.get_leaves():
6         s_leaves.append(len(node.data['Body'].split()))

TypeError: 'instancemethod' object is not iterable

``````
``````

In :

plt.hist(s_leaves, bins=15)
plt.show()

``````
``````

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-20-804337a1ae9d> in <module>()
----> 1 plt.hist(s_leaves, bins = (15))
2 plt.show()

/home/sb/anaconda/envs/bigbang/lib/python2.7/site-packages/matplotlib/pyplot.pyc in hist(x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, hold, **kwargs)
2825                       histtype=histtype, align=align, orientation=orientation,
2826                       rwidth=rwidth, log=log, color=color, label=label,
-> 2827                       stacked=stacked, **kwargs)
2828         draw_if_interactive()
2829     finally:

/home/sb/anaconda/envs/bigbang/lib/python2.7/site-packages/matplotlib/axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
8247         # Massage 'x' for processing.
8248         # NOTE: Be sure any changes here is also done below to 'weights'
-> 8249         if isinstance(x, np.ndarray) or not iterable(x):
8250             # TODO: support masked arrays;
8251             x = np.asarray(x)

IndexError: list index out of range

``````
``````

In :

plt.hist(s_notleaves, bins=15)
plt.show()

``````
``````

In :

print(sum(s_leaves) / len(s_leaves))
print(sum(s_notleaves) / len(s_notleaves))

``````
``````

264
295

``````
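
A leaf is a message that nobody replied to; every other message in the thread has at least one reply. The sketch below shows that distinction on a tiny made-up thread. The `replies` and `bodies` dictionaries are hypothetical structures for illustration and do not reflect how BigBang stores its thread tree.

```python
# Made-up thread: message id -> id of the message it replies to (None = root).
replies = {"m1": None, "m2": "m1", "m3": "m1", "m4": "m2"}
bodies = {
    "m1": "shall we meet on friday",
    "m2": "yes friday works",
    "m3": "ok",
    "m4": "see you then",
}

# Any message that appears as someone's parent has a reply, so it is not a leaf.
parents = {p for p in replies.values() if p is not None}
leaves = sorted(m for m in replies if m not in parents)
not_leaves = sorted(m for m in replies if m in parents)

# Word counts for each group, mirroring s_leaves and s_notleaves above.
s_leaves = [len(bodies[m].split()) for m in leaves]
s_notleaves = [len(bodies[m].split()) for m in not_leaves]

print(leaves, s_leaves)
print(not_leaves, s_notleaves)
```
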
``````

In :

import re

``````
``````

In :

print(mess)

``````
``````

Fernando Perez wrote:

>
> Great!  Many thanks for this.  Please give it a bit more pounding, and
> I'd encourage other users of dreload to also try it out.  The code is
> definitely far simpler than the original dreload, but since I don't
> understand that code too well, I'd like to tiptoe a bit on this
> issue.  If it survives a bit of pounding and discussion, I'll
> definitely be glad to put it in.

One of the things where it differ's from the old deep_reload is that
when importing a submodule, say ipbug.vm, it will not reload
ipbug/__init__.py. I've attached another version, which tries to do just
that and I'm using that version currently without problems.
However I think there must be an even  less complicated version.
Clearing sys.modules and 'reimporting' the module like in the following
code seems to work ok.
====
import sys, bbutils.textproc.spellcheck
m=sys.modules.copy()
sys.modules.clear()
sys.modules['sys'] = sys
import bbutils.textproc.spellcheck
====

Currently I don't have the time to investigate this further, but in a
week or two, I'll  have another look at this.

- Ralf

-------------- next part --------------
import sys
import __builtin__

builtin_import = None   # will be set to __builtin__.__import__ by reload function
old_modules = {}        # will be set to sys.modules by reload function
reloaded = {}           # names of reloaded modules, uses same keys as sys.modules

def determineParentName(globals):
"""determine name of the module which has called the import statement"""
if not globals or  not globals.has_key("__name__"):
return None
pname = globals['__name__']
if globals.has_key("__path__"):
return pname
if '.' in pname:
i = pname.rfind('.')
pname = pname[:i]
return pname
return None

"""for name='some.module.bar', reload some, some.module, some.module.bar.
Determines module, which has called the import statement, and prefers
module relative paths, i.e. 'import os' in module m, from m/__init__.py
imports m/os.py if it's there.
"""

if old_modules.has_key(n):

sys.modules[mname] = old_modules[mname]

if mname != old_modules[mname].__name__:
# module changed it's name. otherwise dreload(xml.sax) fails.
print "Module changed name:", mname, old_modules[mname].__name__
else:

return None

mods = name.split(".")
parent = determineParentName(globals)

retval = None
for i in range(len(mods)):
mname = ".".join(mods[:i+1])
if parent:
relative = "%s.%s" % (parent, mname)
if old_modules.has_key(relative) and old_modules[relative]:
continue
return retval

def my_import_hook(name, globals=None, locals=None, fromlist=None):
"""replacement for __builtin__.__import__

and then calls original __builtin__.__import__.
"""

##     if fromlist:
##         print 'Importing', fromlist, 'from module', name
##     else:
##         print 'Importing module', name
return builtin_import(name, globals, locals, fromlist)

"""Recursively reload all modules used in the given module.  Optionally
takes a list of modules to exclude from reloading.  The default exclude
list contains sys, __main__, and __builtin__, to prevent, e.g., resetting
display, exception, and io hooks.
"""
global builtin_import
global old_modules

old_modules = sys.modules.copy()
sys.modules.clear()
for ex in exclude+list(sys.builtin_module_names):
if old_modules.has_key(ex) and not sys.modules.has_key(ex):
print "EXCLUDING", ex
sys.modules[ex] = old_modules[ex]

builtin_import = __builtin__.__import__
__builtin__.__import__ = my_import_hook

try:
finally:
# restore old values
__builtin__.__import__ = builtin_import
for m in old_modules:
if not sys.modules.has_key(m):
sys.modules[m] = old_modules[m]

``````
``````

In :

# Keep only the new text: drop blank lines, quoted lines (starting with '>'),
# and attribution lines ending in 'wrote:'.
message = []
for l in mess.split('\n'):
    if l and not l.startswith('>') and not l.endswith('wrote:'):
        message.append(l)
new = ''.join(l + '\n' for l in message)

``````
``````

In :

print(new)

``````
``````

One of the things where it differ's from the old deep_reload is that
when importing a submodule, say ipbug.vm, it will not reload
ipbug/__init__.py. I've attached another version, which tries to do just
that and I'm using that version currently without problems.
However I think there must be an even  less complicated version.
Clearing sys.modules and 'reimporting' the module like in the following
code seems to work ok.
====
import sys, bbutils.textproc.spellcheck
m=sys.modules.copy()
sys.modules.clear()
sys.modules['sys'] = sys
import bbutils.textproc.spellcheck
====
Currently I don't have the time to investigate this further, but in a
week or two, I'll  have another look at this.
- Ralf
-------------- next part --------------
import sys
import __builtin__
builtin_import = None   # will be set to __builtin__.__import__ by reload function
old_modules = {}        # will be set to sys.modules by reload function
reloaded = {}           # names of reloaded modules, uses same keys as sys.modules
def determineParentName(globals):
"""determine name of the module which has called the import statement"""
if not globals or  not globals.has_key("__name__"):
return None
pname = globals['__name__']
if globals.has_key("__path__"):
return pname
if '.' in pname:
i = pname.rfind('.')
pname = pname[:i]
return pname
return None
"""for name='some.module.bar', reload some, some.module, some.module.bar.
Determines module, which has called the import statement, and prefers
module relative paths, i.e. 'import os' in module m, from m/__init__.py
imports m/os.py if it's there.
"""
if old_modules.has_key(n):

sys.modules[mname] = old_modules[mname]
if mname != old_modules[mname].__name__:
# module changed it's name. otherwise dreload(xml.sax) fails.
print "Module changed name:", mname, old_modules[mname].__name__
else:
return None
mods = name.split(".")
parent = determineParentName(globals)
retval = None
for i in range(len(mods)):
mname = ".".join(mods[:i+1])
if parent:
relative = "%s.%s" % (parent, mname)
if old_modules.has_key(relative) and old_modules[relative]:
continue
return retval

def my_import_hook(name, globals=None, locals=None, fromlist=None):
"""replacement for __builtin__.__import__
and then calls original __builtin__.__import__.
"""
##     if fromlist:
##         print 'Importing', fromlist, 'from module', name
##     else:
##         print 'Importing module', name
return builtin_import(name, globals, locals, fromlist)
"""Recursively reload all modules used in the given module.  Optionally
takes a list of modules to exclude from reloading.  The default exclude
list contains sys, __main__, and __builtin__, to prevent, e.g., resetting
display, exception, and io hooks.
"""
global builtin_import
global old_modules
old_modules = sys.modules.copy()
sys.modules.clear()
for ex in exclude+list(sys.builtin_module_names):
if old_modules.has_key(ex) and not sys.modules.has_key(ex):
print "EXCLUDING", ex
sys.modules[ex] = old_modules[ex]
builtin_import = __builtin__.__import__
__builtin__.__import__ = my_import_hook
try:
finally:
# restore old values
__builtin__.__import__ = builtin_import
for m in old_modules:
if not sys.modules.has_key(m):
sys.modules[m] = old_modules[m]

``````
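
The same filtering can be written with regular expressions, which tolerate indented quote markers and trailing whitespace after "wrote:". This is a minimal sketch using only the standard library; `strip_quoted` and the two patterns are hypothetical helpers, and the sample text is a shortened version of the message printed above.

```python
import re

# Assumed patterns: a line ending in "wrote:" is an attribution line,
# and a line whose first non-space character is ">" is quoted text.
ATTRIBUTION = re.compile(r"^.*\bwrote:\s*$")
QUOTED = re.compile(r"^\s*>")


def strip_quoted(text):
    """Return text without blank, quoted, or attribution lines."""
    kept = []
    for line in text.split("\n"):
        if not line.strip():
            continue
        if QUOTED.match(line) or ATTRIBUTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = (
    "Fernando Perez wrote:\n"
    "\n"
    ">\n"
    "> Great!  Many thanks for this.\n"
    "\n"
    "One of the things\n"
    "- Ralf\n"
)
print(strip_quoted(sample))
```

Neither version removes attachments such as the code after the "next part" separator, so word counts based on the result still include any attached text.
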
``````

In :

``````
``````

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-29-671024fec275> in <module>()

NameError: name 'EmailReplyParser' is not defined

``````