This notebook was put together by [Roman Prokofyev](http://prokofyev.ch)@[eXascale Infolab](http://exascale.info/). Source and license info is on [GitHub](https://github.com/dragoon/kilogram/).
This notebook is part of a larger tutorial on fixing grammatical edits.
You will need to install the following Python packages to run the notebook:
In [1]:
from kilogram import extract_edits

# Load all edits, then keep only preposition-to-preposition substitutions
edits = extract_edits('/home/roman/travel.tsv')
with open('../extra/preps.txt') as f:
    PREPS_1GRAM = set(f.read().splitlines())
prep_edits = [x for x in edits if x.edit1 in PREPS_1GRAM and x.edit2 in PREPS_1GRAM]
# n-grams surrounding the chosen edit; size=3 is the default
context_ngrams = prep_edits[2].ngram_context(size=3)
print context_ngrams
With the kilogram library, computing association measures is straightforward.
We only need to configure the HBase and MongoDB endpoints that store our n-gram data:
In [2]:
from kilogram import NgramService
from kilogram import EditNgram
NgramService.configure(PREPS_1GRAM, mongo_host=('localhost', '27017'), hbase_host=('diufpc301', '9090'))
def print_measure(results):
    # Each result pairs an n-gram (a sequence of tokens) with its score
    for res in results:
        print ' '.join(res[0]), res[1]
    print
ngram = context_ngrams[3][0]
# PMI by default
print 'PMI:'
print_measure(ngram.association())
# Can also use anything implemented in NLTK
print 'Student T:'
print_measure(ngram.association('student_t'))
print 'MI Likelihood:'
print_measure(ngram.association('mi_like'))
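
To give a feel for what the three scores above measure, here is a minimal standard-library sketch of the textbook bigram formulas that, as far as we understand, correspond to NLTK's `pmi`, `student_t` and `mi_like` measures. It does not call kilogram or NLTK, the function names are our own, and the counts are hypothetical values invented for illustration.

```python
from __future__ import division, print_function  # keeps the sketch valid on Python 2 and 3

import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information (log base 2) of a word pair."""
    return math.log(n_xy * n_total / (n_x * n_y), 2)


def student_t(n_xy, n_x, n_y, n_total):
    """Student's t score: observed minus expected count, scaled by sqrt(observed)."""
    expected = n_x * n_y / n_total
    return (n_xy - expected) / math.sqrt(n_xy)


def mi_like(n_xy, n_x, n_y, power=3):
    """Mutual-information-like score: n_xy ** power / (n_x * n_y)."""
    return n_xy ** power / (n_x * n_y)


# Hypothetical counts, invented for illustration only
n_total = 1000000  # total number of bigrams in the corpus
n_x = 2000         # occurrences of the first word
n_y = 50000        # occurrences of the second word
n_xy = 800         # occurrences of the two words together

print('PMI:', pmi(n_xy, n_x, n_y, n_total))
print('Student T:', student_t(n_xy, n_x, n_y, n_total))
print('MI Likelihood:', mi_like(n_xy, n_x, n_y))
```

All three scores grow when the pair occurs more often than its parts would predict by chance, but they weight raw frequency differently: PMI favours rare, tightly bound pairs, while the t score and the likelihood-style score give more credit to frequent ones.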