This notebook was put together by [Roman Prokofyev](http://prokofyev.ch)@[eXascale Infolab](http://exascale.info/). Source and license info is on [GitHub](https://github.com/dragoon/kilogram/).
This notebook is part of a bigger tutorial on fixing grammatical edits.
You will need to install the following Python packages to run the notebook: kilogram and pandas.
Download the Stack Exchange data dump from: https://archive.org/details/stackexchange.
Each file in the downloaded dump is a 7z archive that contains a PostHistory.xml file inside. For this example, I have unpacked the travel forum archive and renamed the history XML to travel_post_history.xml.
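For reference, here is a minimal sketch of that unpacking step, assuming the 7z command-line tool is installed (the archive name below is illustrative; use the file you actually downloaded):

import subprocess
# extract PostHistory.xml from the travel archive and rename it
subprocess.check_call("7z e travel.stackexchange.com.7z PostHistory.xml", shell=True)
subprocess.check_call("mv PostHistory.xml travel_post_history.xml", shell=True)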
In [1]:
import subprocess
# se_parse_edit_history.py parses the post edit history into a TSV with two
# columns: the text before and the text after an edit
print subprocess.check_output("se_parse_edit_history.py -o /home/roman/travel.tsv /home/roman/travel_post_history.xml; exit 0",
                              shell=True, stderr=subprocess.STDOUT)
Use pandas to look at our data:
In [2]:
from pandas import DataFrame, isnull
import pandas as pd
df = pd.read_csv('/home/roman/travel.tsv', sep='\t', names=('text1', 'text2'))
df.head(10)
Out[2]:
Check for NULL values:
In [3]:
# check for None values
print [i for i, x in enumerate(isnull(df['text1'])) if x]
df.iloc[210]
Out[3]:
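The check above flags rows where text1 is missing. If you wanted to drop such rows, pandas' dropna method does it (shown for completeness only):

df = df.dropna()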
Since our data is not numeric, we are not going to use the DataFrame for iteration, so we can free it:
In [4]:
del df
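Plain line-by-line reading of the TSV works fine instead; a minimal sketch (the loop body is just a placeholder):

with open('/home/roman/travel.tsv') as f:
    for line in f:
        # each line holds the text before and after an edit, tab-separated
        text1, text2 = line.rstrip('\n').split('\t', 1)
        # ... process the pair here ...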
The data looks good; now let's move on to extracting specific edits.
In [5]:
from kilogram import extract_edits
edits = extract_edits('/home/roman/travel.tsv')
In [6]:
print edits[0]
Let's keep only the edits that altered prepositions. For this, I have prepared a small file with the set of prepositions to consider.
You can find the file here: https://github.com/dragoon/kilogram/blob/master/extra/preps.txt
In [7]:
# load the set of prepositions to consider, one per line
PREPS_1GRAM = set(open('/home/roman/ngrams_data/preps.txt').read().split('\n'))
# keep only the edits where both the old and the new token are prepositions
prep_edits = [x for x in edits if x.edit1 in PREPS_1GRAM and x.edit2 in PREPS_1GRAM]
print ', '.join([x.edit1+u'→'+x.edit2 for x in prep_edits])
print
print prep_edits[2]
Each edit is an object of type kilogram.edit.Edit, and you can retrieve the following info from it:
In [8]:
print prep_edits[2].text1
print
print prep_edits[2].text2
print prep_edits[2].edit1, prep_edits[2].edit2
print
print str(prep_edits[2].tokens)
It is also possible to retrieve n-gram contexts of a specified size around the edit:
In [9]:
print prep_edits[2].ngram_context(size=2)
print prep_edits[2].ngram_context(size=3)
The next part of this tutorial describes how to process the Google Books N-gram corpus in order to calculate association measures between words, which we are going to use to fix grammar.
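As a preview, one standard association measure is pointwise mutual information (PMI). Here is a minimal sketch of how it can be estimated from raw n-gram counts (the function and the counts below are illustrative and not part of kilogram):

import math

def pmi(count_xy, count_x, count_y, total):
    # PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ), estimated from corpus counts
    return math.log(float(count_xy) * total / (count_x * count_y))

# illustrative numbers: a bigram seen 1000 times in a corpus of 10^9 tokens
print pmi(1000, 50000, 40000, 10**9)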