This notebook was put together by [Roman Prokofyev](http://prokofyev.ch)@[eXascale Infolab](http://exascale.info/). Source and license info is on [GitHub](https://github.com/dragoon/kilogram/).

Prerequisites

You will need to install the following Python packages to run the notebook:

Step 0: Downloading StackExchange data

Download from: https://archive.org/details/stackexchange.

Each downloaded file is a 7z archive that contains a PostHistory.xml file. For this example, I have unpacked the travel forum archive and renamed the history XML to travel_post_history.xml.

Step 1: Processing XML edit history files

Now we are going to extract all edits that have the word "grammar" in the comment field, which implies that the user fixed the grammar of a post with this edit:


In [1]:
import subprocess
print subprocess.check_output("se_parse_edit_history.py -o /home/roman/travel.tsv /home/roman/travel_post_history.xml; exit 0",
                              shell=True, stderr=subprocess.STDOUT)


Processing XML /home/roman/travel_post_history.xml ...
Sorting...
Filtering edits...
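
To make the filtering step concrete, here is a minimal sketch of the idea, not the actual logic of se_parse_edit_history.py. It assumes the attribute names of the public StackExchange dump (PostHistoryTypeId, PostId, Comment, Text), assumes rows arrive in chronological order, and uses made-up sample rows. The function name `grammar_edits` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for PostHistory.xml. Attribute names follow the public
# StackExchange dump schema; the rows themselves are made up.
SAMPLE = u"""<posthistory>
  <row Id="1" PostHistoryTypeId="2" PostId="10" Text="I am looking for am exhaustive list" />
  <row Id="2" PostHistoryTypeId="5" PostId="10" Comment="fixed grammar" Text="I am looking for an exhaustive list" />
  <row Id="3" PostHistoryTypeId="2" PostId="11" Text="Where is the station" />
  <row Id="4" PostHistoryTypeId="5" PostId="11" Comment="added link" Text="Where is the station? See map." />
</posthistory>"""

BODY_TYPES = {"2", "5"}  # 2 = initial body, 5 = body edit


def grammar_edits(source):
    """Yield (text_before, text_after) for body edits whose comment mentions grammar."""
    last_text = {}  # PostId -> most recent body text seen so far
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row" or elem.get("PostHistoryTypeId") not in BODY_TYPES:
            continue
        post_id, text = elem.get("PostId"), elem.get("Text")
        comment = (elem.get("Comment") or "").lower()
        if "grammar" in comment and post_id in last_text:
            yield last_text[post_id], text
        last_text[post_id] = text
        elem.clear()  # drop the row's attributes once it has been handled


pairs = list(grammar_edits(io.BytesIO(SAMPLE.encode("utf-8"))))
for before, after in pairs:
    print(before + "\t" + after)
```

Only the first post yields a pair here, because the second edit's comment does not mention grammar.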

Use pandas to look at our data:


In [2]:
from pandas import DataFrame, isnull
import pandas as pd
df = pd.read_csv('/home/roman/travel.tsv', sep='\t', names=('text1', 'text2'))
df.head(10)


Out[2]:
text1 text2
0 My finance and myself are looking for a good C... My fiance and I are looking for a good Caribbe...
1 What's are some Caribbean cruises for October What are some Caribbean cruises for October?
2 I am looking for am exhaustive and light set o... I am looking for an exhaustive and light set o...
3 November is very much off-season for Disney - ... November is very much off-season for Disney - ...
4 As homebase choose Irkutsk. Travel to Olkhon i... Choose Irkutsk as home base. Travel to Olkhon ...
5 I live in Poland and I would like to take my c... I live in Poland and I would like to take my c...
6 1) To avoid that you're laptop is stolen I wou... 1) To avoid your laptop from being stolen I wo...
7 Not sure if this is quite the scenario you are... Not sure if this is quite the scenario you are...
8 While going to India via Emirates, i am planni... While going to India via Emirates, I am planni...
9 Is the first time I post in this forum but I n... I have been accepted as student for the 2013 J...

10 rows × 2 columns

Check for NULL values:


In [3]:
# check for None values
print [i for i, x in enumerate(isnull(df['text1'])) if x]
df.iloc[210]


[210]
Out[3]:
text1                                                  NaN
text2    A country in the Caucasus region bordering [ta...
Name: 210, dtype: object

Since our data is text rather than numbers, we are not going to iterate over it with a DataFrame, so we can drop it:


In [4]:
del df
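
Without a DataFrame, the same file can be read line by line. The sketch below is one way to do that while skipping rows like row 210 above. It assumes the NaN came from an empty field in the TSV; the helper name `read_pairs` and the sample rows are hypothetical.

```python
import io

# Made-up rows in the same two-column layout as travel.tsv: before<TAB>after.
SAMPLE_TSV = (
    u"What's are some cruises\tWhat are some cruises?\n"
    u"\tA country in the Caucasus region\n"  # empty first column, like row 210
    u"i am planning\tI am planning\n"
)


def read_pairs(handle):
    """Return (text1, text2) tuples, skipping rows where either side is empty."""
    pairs = []
    for line in handle:
        fields = line.rstrip(u"\n").split(u"\t")
        if len(fields) != 2:
            continue  # malformed row
        text1, text2 = fields
        if text1.strip() and text2.strip():
            pairs.append((text1, text2))
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE_TSV))
print(len(pairs))
```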

The data looks good, so let's move on to extracting specific edits.

Step 2: Edit extraction

The following code processes every edit made by users and extracts the specific tokens that were changed. An edit can span one or more words:


In [5]:
from kilogram import extract_edits
edits = extract_edits('/home/roman/travel.tsv')


Total edits extracted: 1334

In [6]:
print edits[0]


finance→fiance
My fiance and I are
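
How kilogram finds the changed tokens internally is not shown here. As a rough illustration of the idea, token-level replacements can be found with the standard library's difflib. This is a sketch on whitespace-split tokens, not the library's implementation, and `token_edits` is a hypothetical name.

```python
from difflib import SequenceMatcher


def token_edits(text1, text2):
    """Return (old_tokens, new_tokens, position_in_text2) for each replaced span."""
    tokens1, tokens2 = text1.split(), text2.split()
    matcher = SequenceMatcher(a=tokens1, b=tokens2, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a run of tokens swapped for another run;
        # pure insertions and deletions are ignored in this sketch
        if tag == "replace":
            edits.append((tokens1[i1:i2], tokens2[j1:j2], j1))
    return edits


result = token_edits("My finance and myself are looking",
                     "My fiance and I are looking")
for old, new, pos in result:
    print(" ".join(old) + " -> " + " ".join(new))
```

Because each replaced span is a list of tokens, an edit that covers several words is represented the same way as a single-word edit.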

Let's keep only the edits that altered prepositions. For this I have prepared a small file with the set of prepositions to consider.

You can find the file here: https://github.com/dragoon/kilogram/blob/master/extra/preps.txt


In [7]:
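# Read one preposition per line, skipping blank lines such as a trailing newline
with open('/home/roman/ngrams_data/preps.txt') as preps_file:
    PREPS_1GRAM = set(line.strip() for line in preps_file if line.strip())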
prep_edits = [x for x in edits if x.edit1 in PREPS_1GRAM and x.edit2 in PREPS_1GRAM]
print ', '.join([x.edit1+u'→'+x.edit2 for x in prep_edits])
print
print prep_edits[2]


of→off, towards→of, on→in, on→in, on→in, on→of, off→of, at→on, in→on, through→with, in→around, at→in, on→of, on→of, at→on, in→at, on→in, for→at, in→during, on→in, of→over, of→off, to→into, on→in, from→of, in→on, in→of, of→off, on→in, for→with, from→of, on→in, to→into, at→in, for→on, for→in, in→at, at→in, than→from, in→at, of→off, for→over, into→in

on→in
the biggest building in the world .

Each edit is an object of type kilogram.edit.Edit, from which you can retrieve the following information:


In [8]:
print prep_edits[2].text1
print
print prep_edits[2].text2
print prep_edits[2].edit1, prep_edits[2].edit2
print
print str(prep_edits[2].tokens)


Today I saw a picture of a building that claims to be the biggest building on the whole world . I really doubt that , but the problem is , I couldn't verify it![enter image description here][1.It is somewhere in Eastern Europe , probably a capital . So where should I travel to , when I want to see this building . [ 1 ] : http://i.stack.imgur.com/VbPEl.jpg

Today I saw a picture of a building that claims to be the biggest building in the world . I really doubt that , but the problem is , I couldn't verify it . Here it is![enter image description here][1.It is somewhere in Eastern Europe , probably a capital . So where should I travel to , when I want to see this building . [ 1 ] : http://i.stack.imgur.com/VbPEl.jpg
on in

[u'Today', u'I', u'saw', u'a', u'picture', u'of', u'a', u'building', u'that', u'claims', u'to', u'be', u'the', u'biggest', u'building', u'in', u'the', u'world', u'.', u'I', u'really', u'doubt', u'that', u',', u'but', u'the', u'problem', u'is', u',', u'I', u"couldn't", u'verify', u'it', u'.', u'Here', u'it', u'is![enter', u'image', u'description', u'here][1.It', u'is', u'somewhere', u'in', u'Eastern', u'Europe', u',', u'probably', u'a', u'capital', u'.', u'So', u'where', u'should', u'I', u'travel', u'to', u',', u'when', u'I', u'want', u'to', u'see', u'this', u'building', u'.', u'[', u'1', u']', u':', u'http://i.stack.imgur.com/VbPEl.jpg']

It is also possible to retrieve n-gram contexts of a specified size around the edit:


In [9]:
print prep_edits[2].ngram_context(size=2)
print prep_edits[2].ngram_context(size=3)


{2: [building in, in the]}
{2: [building in, in the], 3: [biggest building in, building in the, in the world]}
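
The shape of this output can be reproduced with a few lines of plain Python. The function below is a standalone sketch of the windowing logic, assuming the context for size n is every n-gram that contains the edited token. It is not the code behind Edit.ngram_context, and its signature is hypothetical.

```python
def ngram_context(tokens, position, size):
    """Map each n in 2..size to the n-grams of `tokens` that contain `position`."""
    context = {}
    for n in range(2, size + 1):
        ngrams = []
        # slide the window start so that `position` always falls inside it
        for start in range(position - n + 1, position + 1):
            if start >= 0 and start + n <= len(tokens):
                ngrams.append(" ".join(tokens[start:start + n]))
        context[n] = ngrams
    return context


tokens = "to be the biggest building in the world .".split()
position = tokens.index("in")  # index of the edited preposition
context = ngram_context(tokens, position, size=3)
print(context)
```

Windows that would run past either end of the sentence are dropped, so an edit near a sentence boundary yields fewer n-grams.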

The next part of this notebook describes how to process the Google Books N-gram corpus to calculate association measures between words, which we are going to use to fix grammar.