This notebook divide a single mailing list corpus into threads.
What it does: -identifies the more participated threads -identifies the long lasting threads -export each thread's emails into seperate .csv files, setting thresholds of participation and duration
Parameters to set options: -set a single URL related to a mailing list, setting the 'url' variable -it exports files in the file path specified in the variable ‘path’ -you can set a threshold of participation and of duration for the threads to export, by setting 'min_participation' and 'min_duration' variables
In [1]:
%matplotlib inline
from bigbang.archive import Archive
from bigbang.archive import load as load_archive
from bigbang.thread import Thread
from bigbang.thread import Node
from bigbang.utils import remove_quoted
import matplotlib.pyplot as plt
import datetime
import csv
from collections import defaultdict
First, collect data from a public email archive.
In [6]:
#insert one URL related to the mailing list of interest
url = "http://mm.icann.org/pipermail/wp4/"
try:
arch_path = '../archives/'+url[:-1].replace('://','_/')+'.csv'
arx = load_archive(arch_path)
except:
arch_path = '../archives/'+url[:-1].replace('//','/')+'.csv'
print(url)
arx = load_archive(arch_path)
Let's check the number of threads in this mailing list corpus
In [27]:
print(len(arx.get_threads()))
Out[27]:
We can plot the number of people participating in each thread.
In [60]:
n = [t.get_num_people() for t in arx.get_threads()]
plt.hist(n, bins = 20)
plt.xlabel('number of email-address in a thread')
plt.show()
The duration of a thread is the amount of elapsed time between its first and last message.
Let's plot the number of threads per each number of days of duration
In [61]:
y = [t.get_duration().days for t in arx.get_threads()]
plt.hist(y, bins = (10))
plt.xlabel('duration of a thread(days)')
plt.show()
Export the content of each thread into a .csv file (named: thread_1.csv, thread2.csv, ...).
You can set a minimum level of participation and duration, based on the previous analyses
In [62]:
#Insert the participation threshold (number of people)
#(for no threeshold: 'min_participation = 0')
min_participation = 0
#Insert the duration threshold (number of days)
#(for no threeshold: 'min_duration = 0')
min_duration = 0
#Insert the directory path where to save the files
path = 'c:/users/davide/bigbang/'
i = 0
for thread in arx.get_threads():
if thread.get_num_people() >= min_participation and thread.get_duration().days >= min_duration:
i += 1
f = open(path+'thread_'+str(i)+'.csv', "wb")
f_w = csv.writer(f)
f_w.writerow(thread.get_content())
f.close()