This notebook was put together by Roman Prokofyev (eXascale Infolab). Source and license info is on GitHub.


You will need to install the `kilogram` python package (which uses NLTK for its association measures) to run the notebook.
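A typical setup might look like the following; the local checkout path for `kilogram` is illustrative, not a package name guaranteed to exist on PyPI:

```shell
# NLTK is on PyPI; kilogram is installed from a local
# checkout of its source repository (path is illustrative).
pip install nltk
pip install ./kilogram
```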

Step 1: Retrieving n-gram contexts

A small recap from the previous notebook to get the desired contexts.

In [1]:
from kilogram import extract_edits
edits = extract_edits('/home/roman/travel.tsv')
PREPS_1GRAM = set(open('../extra/preps.txt').read().split('\n'))
prep_edits = [x for x in edits if x.edit1 in PREPS_1GRAM and x.edit2 in PREPS_1GRAM]

context_ngrams = prep_edits[2].ngram_context(size=3)  # size=3 is default
print context_ngrams

Total edits extracted: 1334
{2: [building in, in the], 3: [biggest building in, building in the, in the world]}

Step 2: Computing association measures

With the kilogram library, computing association measures is straightforward.

All we need is to properly configure the endpoints to HBase and MongoDB which store our n-gram data:

In [2]:
from kilogram import NgramService
from kilogram import EditNgram
NgramService.configure(PREPS_1GRAM, mongo_host=('localhost', '27017'), hbase_host=('diufpc301', '9090'))

def print_measure(results):
    for res in results:
        print ' '.join(res[0]), res[1]

ngram = context_ngrams[3][0]
print 'PMI:'
# PMI is the default association measure
print_measure(ngram.association())
print 'Student T:'
# Can also use anything implemented in NLTK
print_measure(ngram.association(measure='student_t'))
print 'MI Likelihood:'
print_measure(ngram.association(measure='mi_like'))

PMI:
biggest building in 0.86337295448
biggest building on -0.194145914319
biggest building at -2.0029574526
biggest building of -3.47388436759

Student T:
biggest building on -16.6575710442
biggest building in -21.945374765
biggest building at -25.9761580745
biggest building of -130.521765404

MI Likelihood:
biggest building in 2.77298228106e-15
biggest building on 2.35666190597e-17
biggest building of 1.02821534354e-18
biggest building at 2.64008318578e-19
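To make the numbers above less opaque: PMI compares how often the full n-gram occurs with how often it would occur if its parts were independent. Below is a self-contained sketch of the computation for the bigram case; all counts are made up for illustration and do not come from the n-gram data queried above.

```python
import math

def pmi(ngram_count, part_counts, total):
    """Pointwise mutual information in base 2:
    log2( P(ngram) / product of P(part) )."""
    p_ngram = float(ngram_count) / total
    p_parts = 1.0
    for count in part_counts:
        p_parts *= float(count) / total
    return math.log(p_ngram / p_parts, 2)

# Hypothetical counts: "building in" seen 50 times, "building"
# 200 times, "in" 10000 times, in a corpus of 1,000,000 tokens.
print(pmi(50, [200, 10000], 1000000))  # log2(25), about 4.64
```

A positive score means the words co-occur more often than chance (a likely collocation); scores near zero or negative mean the pairing is no better than independence, which is why the correct preposition tends to get the highest score in the output above.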