This notebook divides a single mailing list corpus into threads.
What it does:
- identifies the threads with the most participants
- identifies the longest-lasting threads
- exports each thread's emails into a separate .csv file, subject to participation and duration thresholds
Parameters to set:
- 'url': the URL of a single mailing list
- 'path': the directory where the exported files are saved
- 'min_participation' and 'min_duration': the participation and duration thresholds a thread must meet to be exported
In :
%matplotlib inline
from bigbang.archive import Archive
from bigbang.archive import load as load_archive
from bigbang.thread import Thread
from bigbang.thread import Node
from bigbang.utils import remove_quoted
import matplotlib.pyplot as plt
import datetime
import csv
from collections import defaultdict
First, collect data from a public email archive.
In :
# Insert one URL related to the mailing list of interest
url = "http://mm.icann.org/pipermail/wp4/"

# The local archive file name is derived from the URL; try both naming conventions
try:
    arch_path = '../archives/' + url[:-1].replace('://', '_/') + '.csv'
    arx = load_archive(arch_path)
except Exception:
    arch_path = '../archives/' + url[:-1].replace('//', '/') + '.csv'
    print(url)
    arx = load_archive(arch_path)
Let's check the number of threads in this mailing list corpus
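In this notebook the count would presumably come from something like `len(arx.get_threads())`, assuming `get_threads()` returns a list as its use in the following cells suggests. To illustrate what splitting a corpus into threads means, here is a minimal self-contained sketch that groups messages by following reply links back to a root message. The message dictionaries, the `id` and `in_reply_to` keys, and both helper functions are hypothetical and are not part of BigBang.

```python
from collections import defaultdict

# Hypothetical messages: each has an id and the id it replies to (None for a thread root).
messages = [
    {"id": "a", "in_reply_to": None},
    {"id": "b", "in_reply_to": "a"},
    {"id": "c", "in_reply_to": "b"},
    {"id": "d", "in_reply_to": None},
    {"id": "e", "in_reply_to": "d"},
]


def find_root(msg_id, parent_of):
    """Follow reply links upward until reaching an id with no recorded parent."""
    seen = set()  # guards against malformed reply cycles
    while parent_of.get(msg_id) is not None and msg_id not in seen:
        seen.add(msg_id)
        msg_id = parent_of[msg_id]
    return msg_id


def group_into_threads(messages):
    """Map each root id to the list of message ids in its thread."""
    parent_of = {m["id"]: m["in_reply_to"] for m in messages}
    threads = defaultdict(list)
    for m in messages:
        threads[find_root(m["id"], parent_of)].append(m["id"])
    return dict(threads)


threads = group_into_threads(messages)
print(len(threads))  # number of threads in the toy corpus
```

Real archives are messier than this (missing headers, replies to messages outside the archive), so this only conveys the idea behind the thread count.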
We can plot the number of people participating in each thread.
In :
# Number of distinct participants in each thread
n = [t.get_num_people() for t in arx.get_threads()]
plt.hist(n, bins=20)
plt.xlabel('number of email addresses in a thread')
plt.show()
The duration of a thread is the amount of elapsed time between its first and last message.
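The definition above can be sketched with the standard `datetime` module. This is only an illustration with made-up timestamps; `thread_duration` is a hypothetical helper, not BigBang's `get_duration()` implementation.

```python
from datetime import datetime


def thread_duration(timestamps):
    """Return the time elapsed between the earliest and the latest message."""
    return max(timestamps) - min(timestamps)


# Hypothetical timestamps for three messages in one thread (not in order)
timestamps = [
    datetime(2015, 3, 1, 9, 0),
    datetime(2015, 3, 4, 18, 30),
    datetime(2015, 3, 2, 12, 15),
]

duration = thread_duration(timestamps)
print(duration.days)  # whole days only; the remaining hours are dropped
```

Note that the `.days` attribute of a `timedelta` counts only whole days, so a thread whose messages all arrive within 24 hours has a duration of 0 days.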
Let's plot the distribution of thread durations, in days.
In :
# Duration of each thread, in whole days
y = [t.get_duration().days for t in arx.get_threads()]
plt.hist(y, bins=10)
plt.xlabel('duration of a thread (days)')
plt.show()
Export the content of each thread into a .csv file (named thread_1.csv, thread_2.csv, ...).
You can set a minimum level of participation and duration, based on the previous analyses.
In :
# Insert the participation threshold (number of people)
# (for no threshold: 'min_participation = 0')
min_participation = 0

# Insert the duration threshold (number of days)
# (for no threshold: 'min_duration = 0')
min_duration = 0

# Insert the directory path where to save the files
path = 'c:/users/davide/bigbang/'

i = 0
for thread in arx.get_threads():
    if thread.get_num_people() >= min_participation and thread.get_duration().days >= min_duration:
        i += 1
        # Python 3: the csv module needs a text-mode file opened with newline=''
        with open(path + 'thread_' + str(i) + '.csv', 'w', newline='') as f:
            f_w = csv.writer(f)
            f_w.writerow(thread.get_content())
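The filter-and-export step can be tried without an archive. The sketch below uses plain dictionaries in place of BigBang thread objects and writes into a temporary directory. The `people`, `days` and `content` keys and the `export_threads` function are hypothetical names chosen for illustration, and Python 3 is assumed.

```python
import csv
import os
import tempfile

# Hypothetical thread summaries standing in for BigBang thread objects
threads = [
    {"people": 5, "days": 12, "content": ["first message", "a reply"]},
    {"people": 1, "days": 0, "content": ["unanswered message"]},
    {"people": 3, "days": 2, "content": ["question", "answer", "thanks"]},
]


def export_threads(threads, out_dir, min_participation=0, min_duration=0):
    """Write each thread meeting both thresholds to thread_<i>.csv; return the paths."""
    written = []
    for t in threads:
        if t["people"] >= min_participation and t["days"] >= min_duration:
            file_path = os.path.join(out_dir, "thread_%d.csv" % (len(written) + 1))
            # newline="" lets the csv module control line endings
            with open(file_path, "w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerow(t["content"])
            written.append(file_path)
    return written


out_dir = tempfile.mkdtemp()
paths = export_threads(threads, out_dir, min_participation=2, min_duration=1)
print([os.path.basename(p) for p in paths])
```

Numbering the files by how many threads have been written so far, rather than by position in the archive, keeps the file names consecutive even when some threads are filtered out, which matches the counter `i` in the cell above.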