This notebook was put together by [Roman Prokofyev](http://prokofyev.ch)@[eXascale Infolab](http://exascale.info/). Source and license info is on [GitHub](https://github.com/dragoon/kilogram/).
This notebook is part of a bigger tutorial on fixing grammatical edits.
You will need to install the following Python packages to run the notebook: kilogram and pandas.
Download the Stack Exchange data dump from: https://archive.org/details/stackexchange.
Each file in the downloaded dump is a 7z archive that contains a PostHistory.xml file inside. For this example, I have unpacked the travel forum archive and renamed the history XML to travel_post_history.xml.
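For reference, here is a minimal sketch of that unpacking step, assuming the 7z command-line tool is installed (the archive name below is illustrative; use the file you actually downloaded):

import subprocess
# extract PostHistory.xml from the travel archive and rename it
subprocess.check_call("7z e travel.stackexchange.com.7z PostHistory.xml", shell=True)
subprocess.check_call("mv PostHistory.xml travel_post_history.xml", shell=True)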
In [1]:
import subprocess
# se_parse_edit_history.py parses the post edit history into a TSV with two
# columns: the text before and the text after an edit
print subprocess.check_output("se_parse_edit_history.py -o /home/roman/travel.tsv /home/roman/travel_post_history.xml; exit 0",
                              shell=True, stderr=subprocess.STDOUT)
Use pandas to look at our data:
In [2]:
from pandas import DataFrame, isnull
import pandas as pd
df = pd.read_csv('/home/roman/travel.tsv', sep='\t', names=('text1', 'text2'))
df.head(10)
Out[2]:
Check for NULL values:
In [3]:
# check for None values
print [i for i, x in enumerate(isnull(df['text1'])) if x]
df.iloc[210]
Out[3]:
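The check above flags rows where text1 is missing. If you wanted to drop such rows, pandas' dropna method does it (shown for completeness only):

df = df.dropna()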
Since our data is not numeric, we are not going to use the DataFrame for iteration, so we can free it:
In [4]:
del df
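Plain line-by-line reading of the TSV works fine instead; a minimal sketch (the loop body is just a placeholder):

with open('/home/roman/travel.tsv') as f:
    for line in f:
        # each line holds the text before and after an edit, tab-separated
        text1, text2 = line.rstrip('\n').split('\t', 1)
        # ... process the pair here ...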
The data looks good; now let's move on to extracting specific edits.
In [5]:
from kilogram import extract_edits
edits = extract_edits('/home/roman/travel.tsv')
In [6]:
print edits[0]
Let's keep only the edits that altered prepositions. For this, I have prepared a small file with the set of prepositions to consider.
You can find the file here: https://github.com/dragoon/kilogram/blob/master/extra/preps.txt
In [7]:
# load the set of prepositions to consider, one per line
PREPS_1GRAM = set(open('/home/roman/ngrams_data/preps.txt').read().split('\n'))
# keep only the edits where both the old and the new token are prepositions
prep_edits = [x for x in edits if x.edit1 in PREPS_1GRAM and x.edit2 in PREPS_1GRAM]
print ', '.join([x.edit1+u'→'+x.edit2 for x in prep_edits])
print
print prep_edits[2]
Each edit is an object of type kilogram.edit.Edit, and you can retrieve the following info from it:
In [8]:
print prep_edits[2].text1
print
print prep_edits[2].text2
print prep_edits[2].edit1, prep_edits[2].edit2
print
print str(prep_edits[2].tokens)
It is also possible to retrieve n-gram contexts of a specified size around the edit:
In [9]:
print prep_edits[2].ngram_context(size=2)
print prep_edits[2].ngram_context(size=3)
The next part of this tutorial describes how to process the Google Books N-gram corpus in order to calculate association measures between words, which we are going to use to fix grammar.
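As a preview, one standard association measure is pointwise mutual information (PMI). Here is a minimal sketch of how it can be estimated from raw n-gram counts (the function and the counts below are illustrative and not part of kilogram):

import math

def pmi(count_xy, count_x, count_y, total):
    # PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ), estimated from corpus counts
    return math.log(float(count_xy) * total / (count_x * count_y))

# illustrative numbers: a bigram seen 1000 times in a corpus of 10^9 tokens
print pmi(1000, 50000, 40000, 10**9)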