In :
%matplotlib inline
from bigbang.archive import Archive
from bigbang.thread import Thread
from bigbang.thread import Node
import matplotlib.pyplot as plt
import datetime
First, collect data from a public email archive.
In :
url = "https://lists.wikimedia.org/pipermail/analytics/"
arx = Archive(url, archive_dir="../archives")
We can easily count the number of threads in the archive. The first call to Archive.get_threads() may take some time to compute, but the result is cached in the Archive object, so later calls are fast.
In :
#threads = arx.get_threads()
len(arx.get_threads())
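We can plot a histogram of the number of messages in each thread. The distribution is usually heavy-tailed and often resembles a power law: most threads contain only a few messages, while a small number of threads contain many.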
In :
y = [t.get_num_messages() for t in arx.get_threads()]
plt.hist(y, bins=30)
plt.xlabel('number of messages in a thread')
plt.show()
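One informal way to check for power-law-like behaviour is to look at the size counts on log-log axes, where a power law appears roughly as a straight line. The sketch below is only an illustration: the helper name log_log_points and the sample sizes are made up, and a straight-looking line is suggestive rather than conclusive.

```python
from collections import Counter
import math


def log_log_points(sizes):
    """Return (log10(size), log10(count)) pairs, one per distinct thread size."""
    counts = Counter(sizes)
    return [
        (math.log10(size), math.log10(count))
        for size, count in sorted(counts.items())
        if size > 0  # log10 is undefined for zero
    ]


# Made-up thread sizes for illustration; in the notebook this would be `y`.
example_sizes = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 8]
points = log_log_points(example_sizes)
```

The resulting pairs can be passed to plt.plot, for example plt.plot(*zip(*points), 'o'), to inspect the shape by eye.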
We can also plot the number of people participating in each thread. Here, the participants are differentiated by the From: header on the emails they've sent.
In :
n = [t.get_num_people() for t in arx.get_threads()]
plt.hist(n, bins=20)
plt.xlabel('number of email addresses in a thread')
plt.show()
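To make the idea of counting participants by From: header concrete, here is a minimal standalone sketch on plain dictionaries. It is not BigBang's implementation: the function count_distinct_senders and the sample messages are hypothetical, and the case and whitespace normalization is an assumption about what one might want.

```python
def count_distinct_senders(messages):
    """Count unique From: values, ignoring case and surrounding whitespace."""
    return len({m["From"].strip().lower() for m in messages})


# Hypothetical messages standing in for one thread.
example_thread = [
    {"From": "alice@example.org", "Body": "First post"},
    {"From": "bob@example.org", "Body": "A reply"},
    {"From": "Alice@example.org ", "Body": "A follow-up"},
]
num_people = count_distinct_senders(example_thread)
```

Note that counting by address treats one person who posts from two addresses as two participants.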
The duration of a thread is the amount of elapsed time between its first and last message.
In :
y = [t.get_duration().days for t in arx.get_threads()]
plt.hist(y, bins=10)
plt.xlabel('duration of a thread (days)')
plt.show()
In :
# .seconds holds only the sub-day remainder; total_seconds() gives the full duration
y = [t.get_duration().total_seconds() for t in arx.get_threads()]
plt.hist(y, bins=10)
plt.xlabel('duration of a thread (seconds)')
plt.show()
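A note on units when working with durations. Assuming get_duration() returns a standard datetime.timedelta (which its .days attribute suggests), the attributes behave as sketched below with a made-up duration: .days and .seconds are separate components, while total_seconds() covers the whole span.

```python
import datetime

# A made-up duration of 2 days and 3 hours.
duration = datetime.timedelta(days=2, hours=3)

whole_days = duration.days                 # whole days only
leftover_seconds = duration.seconds        # remainder after whole days are removed
total_seconds = duration.total_seconds()   # the full duration, in seconds
```

Here leftover_seconds is 10800 (three hours), while total_seconds is 183600.0, so the two can differ greatly for threads that last longer than a day.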
You can examine the properties of a single thread.
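In :
# get_threads() returns a collection, so pick one thread (here, the first) before calling get_root()
content = arx.get_threads()[0].get_root().data['Body']
content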
Out:'Welcome to the the inaugural Analytics Mailing list email.\n\nHere all your analytics wishes comes true, \n\n\nso proposals, ideas, crazy ideas, crazy crazy ideas are welcome here!\nas long as we can count something it is welcome. \n\n\nD\n\n'
Suppose we want to know whether or not longer threads (that contain more distinct messages) have fewer words in them per message.
In :
short_threads = []
long_threads = []
# Treat threads with fewer than 6 messages as "short"
for t in arx.get_threads():
    if t.get_num_messages() < 6:
        short_threads.append(t)
    else:
        long_threads.append(t)
In :
print(len(short_threads))
print(len(long_threads))
You can get the content of a thread like this:
How would you test whether longer threads contain fewer words per message than shorter ones?
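One possible starting point, sketched on plain strings rather than BigBang objects: compute the mean number of words per message for each thread, then compare the averages of the two groups. The helper words_per_message and the sample message bodies are hypothetical, and splitting on whitespace is only a rough definition of a word.

```python
from statistics import mean


def words_per_message(bodies):
    """Mean number of whitespace-separated words per message body."""
    return mean(len(body.split()) for body in bodies)


# Each inner list stands in for the message bodies of one thread (made-up data).
short_example = [["hello there everyone", "thanks"], ["one two three four"]]
long_example = [["yes", "no", "maybe so", "ok", "fine", "sure thing"]]

short_avg = mean(words_per_message(t) for t in short_example)
long_avg = mean(words_per_message(t) for t in long_example)
```

Comparing two averages shows only the direction of a difference. To judge whether it could be due to chance, the per-thread values could be passed to a two-sample test such as scipy.stats.mannwhitneyu, which does not assume normally distributed data.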
In [ ]: