This notebook was put together by Roman Prokofyev (eXascale Infolab). Source and license info is on GitHub.

Pre-processed dataset

If you don't want to process the data yourself, we have prepared a dataset processed according to the steps below, available on S3: http://coming.soon

You just need to load this dataset into your local database to compute association measures (Step 2).


Requirements:

  • Hadoop cluster with Cloudera Hadoop Distribution (CDH5): HDFS, YARN, Pig, HBase

Step 0: Downloading n-gram counts

We need to download n-gram counts from a sufficiently large corpus. In this example we used the Google Books N-gram corpus, version 20120701. The dataset is split into multiple parts and can be downloaded from here:

The tutorial assumes you already have the data aggregated by year.

Step 1: Aggregating counts and identifying entities

The dataset contains n-gram counts per year, which we don't need here, so we first aggregate the counts over all years in Hadoop. We also remove POS-tagged n-grams and lowercase everything. This step significantly reduces the amount of data we need to store, so that we can, for example, load all 2grams into HBase.
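As a minimal sketch of what the year-aggregation and lowercasing amount to (assuming the standard Google Books line format `ngram TAB year TAB match_count TAB volume_count`; the real job runs as a Hadoop MapReduce task, not a single-machine script):

```python
from collections import defaultdict

def aggregate_years(lines):
    """Sum match_counts over all years for each (lowercased) n-gram.

    Assumes the Google Books format: ngram \t year \t match_count \t volume_count.
    """
    totals = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volume_count = line.split('\t')
        totals[ngram.lower()] += int(match_count)
    return dict(totals)

lines = [
    "analysis is\t1991\t4\t2",
    "analysis is\t1992\t2\t1",
    "Analysis is\t1992\t1\t1",
]
print(aggregate_years(lines))  # {'analysis is': 7}
```

In the actual MapReduce job, the map phase would emit `(lowercased n-gram, match_count)` pairs and the reduce phase would perform the summation.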

The following table shows the sizes of ngrams at different processing stages:

Type     Original size (gzipped)   Year-aggregated   Final size
1grams   5.3GB                     342MB             90MB
2grams   135GB                     13.4GB            1.9GB
3grams   1.3TB                     190.6GB           9.5GB
4grams   294GB                     50.9GB            17.6GB
5grams   XXXGB                     51.1GB            12GB

At the final processing step, we try to aggregate n-grams containing various entities by their conceptual type, using regular expressions and dictionaries. For example, we distinguish numeric entities, person names, cities (geo-entities) and DBPedia entities (see below). We will show how to use this information in the next part of this tutorial.

Internally numeric entities have 5 types which are identified by regular expressions:

  • Datetimes (2 different types)
  • Percentages
  • Area or Volume metrics
  • Generic numbers
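The five sub-types above can be identified with patterns along the following lines. This is an illustrative sketch only: the actual regular expressions live in the kilogram library, and the label names and patterns here are assumptions.

```python
import re

# Hypothetical patterns for the five numeric sub-types; kilogram's real
# regexes are more thorough. Order matters: more specific types come first.
PATTERNS = [
    ('DATE_YEAR', re.compile(r'^(1[0-9]{3}|20[0-9]{2})$')),      # datetimes, kind 1
    ('DATE_DAY',  re.compile(r'^\d{1,2}(st|nd|rd|th)$')),        # datetimes, kind 2
    ('PERCENT',   re.compile(r'^\d+(\.\d+)?%$')),                # percentages
    ('MEASURE',   re.compile(r'^\d+(\.\d+)?(km2|m2|km3|m3)$')),  # area/volume
    ('NUMBER',    re.compile(r'^-?\d+(\.\d+)?$')),               # generic numbers
]

def numeric_type(token):
    """Return the numeric entity type of a token, or None."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return None

print(numeric_type('1984'))  # DATE_YEAR
print(numeric_type('42%'))   # PERCENT
print(numeric_type('3.5'))   # NUMBER
```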

Person names are identified by a dictionary lookup in the names corpus of the NLTK library. For each match, we generate a new n-gram with the original string replaced by the "PERSON" token.

Cities are also identified by a dictionary lookup; we use Geonames 15000 as our dictionary. Since it is very small (~2MB), we ship it with the kilogram library. For each match, we generate a new n-gram with the original string replaced by the "CITY" token.
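Both substitutions follow the same dictionary-lookup pattern, which can be sketched as follows. The tiny inline dictionaries stand in for the real ones (NLTK's names corpus and the bundled Geonames 15000 list) so the example is self-contained; the function name is illustrative, not kilogram's API.

```python
# Stand-ins for the real dictionaries:
NAMES = {'john', 'mary'}          # in kilogram: the NLTK names corpus
CITIES = {'zurich', 'fribourg'}   # in kilogram: the bundled Geonames 15000 list

def substituted_ngrams(ngram):
    """Yield a copy of the n-gram for every PERSON/CITY dictionary match."""
    tokens = ngram.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in NAMES:
            yield ' '.join(tokens[:i] + ['PERSON'] + tokens[i + 1:])
        if tok.lower() in CITIES:
            yield ' '.join(tokens[:i] + ['CITY'] + tokens[i + 1:])

print(list(substituted_ngrams('John lives in Fribourg')))
# ['PERSON lives in Fribourg', 'John lives in CITY']
```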

DBPedia entities are a bit more complicated. On the one hand, there are many interesting properties inside DBPedia that we can leverage. On the other hand, it is generally impossible to match an arbitrary DBPedia entity inside an n-gram due to the lack of context: that would only be possible if we had a list of unambiguous entities that always match their string representations, which we don't have. However, we found a way to remove incorrect entity linkings (i.e., entities that often match generic phrases) by applying statistical outlier detection to the entity counts per DBPedia type and removing the outliers.
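The intuition can be sketched with a simple median/MAD rule: within one DBPedia type, an entity whose total n-gram count is an extreme outlier is usually a generic phrase wrongly linked to an entity. Note that the specific statistic and threshold here are illustrative assumptions, not the method used in kilogram.

```python
import statistics

def drop_outliers(counts, cutoff=3.5):
    """Drop outlying entities from {entity: total n-gram count} for one
    DBpedia type, using a median absolute deviation (MAD) rule."""
    values = list(counts.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1  # avoid div by 0
    return {e: c for e, c in counts.items()
            if abs(c - med) / mad <= cutoff}

# 'May' matches the month far more often than the entity, so its count
# is wildly out of line with the other entities of the same type:
counts = {'Zurich': 120, 'Fribourg': 95, 'May': 50000}
print(drop_outliers(counts))  # {'Zurich': 120, 'Fribourg': 95}
```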

That was the theoretical part; now on to the n-gram processing itself.

First clone the kilogram library:

git clone
cd kilogram/mapreduce

Pre-filter n-grams

Run the pre-filter job (again, on the already year-aggregated data). The job removes POS tags, repeated punctuation symbols and some other noise:


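The filtering logic itself can be sketched as below (the actual job is a Hadoop streaming script in kilogram/mapreduce; the tag list and the exact rules here are assumptions). Google Books POS-tagged tokens look like "house_NOUN":

```python
import re

# Hypothetical sketch of the pre-filter rules, not the kilogram implementation.
POS_TAG = re.compile(r'_(NOUN|VERB|ADJ|ADV|PRON|DET|ADP|NUM|CONJ|PRT|X|\.)($|_)')
PUNCT_ONLY = re.compile(r'^[\W_]+$')

def keep_ngram(ngram):
    """Keep an n-gram only if it has no POS-tagged tokens and is not
    made up entirely of punctuation."""
    tokens = ngram.split()
    if any(POS_TAG.search(tok) for tok in tokens):
        return False
    if all(PUNCT_ONLY.match(tok) for tok in tokens):
        return False
    return True

print(keep_ngram('interested in'))       # True
print(keep_ngram('interested_VERB in'))  # False
print(keep_ngram('! ?'))                 # False
```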
While the job is running, install NLTK and the kilogram library, and download the names dataset on every machine using parallel-ssh:

parallel-ssh -h hosts.txt -l root 'apt-get install -y python-dev libxml2-dev libxslt-dev'
parallel-ssh -h hosts.txt -l root 'pip install -U'
parallel-ssh -h hosts.txt -l root 'python -m nltk.downloader -d /usr/local/lib/nltk_data names'

The hosts.txt file should contain a list of the hosts in your cluster, one per line.

Pre-filtered n-grams are required to extract DBPedia entities. This method is covered in a separate tutorial: Linking DBPedia Entities in N-gram corpus

Post-filter n-grams

After DBPedia entities are linked, run the post-filter job to resolve numeric entities and lowercase all other n-grams:


After post-filtering, n-grams should consume ~44GB.

In the next step, we will extract only the n-grams we are interested in.

Step 1: Filtering preposition n-gram counts

To efficiently compute association measures between words, we need efficient access to every n-gram count up to 3grams. Unfortunately, the new n-gram counts are still too large to be put in a local key-value store.

Another issue with storing n-gram counts as simple key-value pairs is that it is highly inefficient for our grammatical correction task. For instance, to correct a preposition in a sentence, we need to retrieve n-gram counts for every preposition under consideration.

With simple key-value counts, we would have to make one request per preposition per n-gram. However, we know in advance which prepositions we want to consider, so we can aggregate all preposition counts into one value and use a special n-gram as the key. We call such n-grams preposition n-grams.
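The aggregation described above can be sketched as follows (the placeholder token `_PREP_` and the function name are illustrative assumptions; the real filtering runs as a MapReduce job over the post-filtered n-grams):

```python
from collections import defaultdict

# A tiny stand-in for the real preposition set (extra/preps.txt, 49 entries):
PREPS = {'in', 'on', 'at'}

def build_prep_ngrams(counts):
    """Fold counts for the same context with different prepositions into one
    record, keyed by the n-gram with the preposition replaced by _PREP_."""
    merged = defaultdict(dict)
    for ngram, count in counts.items():
        tokens = ngram.split()
        for i, tok in enumerate(tokens):
            if tok in PREPS:
                key = ' '.join(tokens[:i] + ['_PREP_'] + tokens[i + 1:])
                merged[key][tok] = count
    return dict(merged)

counts = {'interested in sports': 120, 'interested on sports': 3}
print(build_prep_ngrams(counts))
# {'interested _PREP_ sports': {'in': 120, 'on': 3}}
```

With this layout, a single lookup of `interested _PREP_ sports` returns the counts for every candidate preposition at once, instead of one request per preposition.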

Assuming that our current directory is something like ~/kilogram/mapreduce, we can filter preposition n-grams using the following scripts:

./ -n 2 -m ./filter/ --filter-file ../extra/preps.txt $NGRAM_POSTFILTER_DIR $2GRAMS_PREPS_OUT
./ -n 3 -m ./filter/ --filter-file ../extra/preps.txt $NGRAM_POSTFILTER_DIR $3GRAMS_PREPS_OUT 

This way we filter out the 2grams and 3grams that contain prepositions; their sizes should be approximately 170MB and 2.9GB, respectively. Now we can load them into a local database.

To compute association measures for 3grams, we also require access to arbitrary 2gram and 1gram counts. We will show how to extract and load them into storage in the next section.

Step 2: Loading n-gram counts into database

Given what we discussed in the previous section, we decided on the following solution:

  • MongoDB to store preposition n-grams, since they are small enough to fit into one machine;
  • HBase to store other arbitrary n-grams.

To store preposition n-grams in MongoDB in the right format, we use scripts from the kilogram library. The library needs to be installed via pip as well in order to run the scripts:

hdfs dfs -cat $2GRAMS_PREPS_OUT/* > preps_2grams

cat preps_2grams | --sub preps.txt | sort -k1,1 -t $'\t' > preps_2grams_mongo

hdfs dfs -cat $3GRAMS_PREPS_OUT/* > preps_3grams

cat preps_3grams | --sub preps.txt | sort -k1,1 -t $'\t' > preps_3grams_mongo
 -d ngrams --subs preps_3grams_mongo
 -d ngrams --subs preps_2grams_mongo

The file preps.txt contains our preposition set, which counts 49 prepositions in total.
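To illustrate how the aggregated records are used at correction time, consider the following sketch. The field names are assumptions, not the actual kilogram/MongoDB schema: the point is only that one record holds the counts for all candidate prepositions, so choosing the most likely preposition needs a single lookup.

```python
# Hypothetical record shape; field names ('ngram', 'counts') are assumptions.
record = {'ngram': 'interested _PREP_ sports',
          'counts': {'in': 120, 'on': 3, 'at': 1}}

def best_preposition(record):
    """Pick the preposition with the highest count in one aggregated record."""
    return max(record['counts'], key=record['counts'].get)

print(best_preposition(record))  # in
```

In practice the decision uses association measures rather than raw counts, as described in the next part of this notebook, but the access pattern is the same.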

Compute all 1gram counts and store them in MongoDB:

./ -n 1 -m ./filter/ $NGRAM_POSTFILTER_DIR $1GRAMS_OUT_DIR
hdfs dfs -cat $1GRAMS_OUT_DIR/* > all_1grams
 -d 1grams all_1grams

Next, we upload the 2grams to an HBase table named ngrams using Pig:

hbase shell
> create 'ngrams', 'ngram'
> ^D
pig -p table=ngrams -p path=$NGRAM_POSTFILTER_DIR -p n=2 ../extra/hbase_upload.pig

Grab another cup of something, this will take a while.

Step 3: Determiner skips

Filter arbitrary n-grams with determiner skips and insert them into HBase:

pig -p table=ngrams2 -p path=$NGRAM_DT_SKIPS_DIR -p n=2 ../extra/hbase_upload.pig
pig -p table=ngrams2 -p path=$NGRAM_DT_SKIPS_DIR -p n=3 ../extra/hbase_upload.pig

Filter preposition n-grams with determiner skips and insert them into MongoDB:

./ -m ./grammar/ --filter-file ../extra/preps.txt $NGRAM_POSTFILTER_DIR $NGRAM_DT_SKIPS_DIR
hdfs dfs -cat $NGRAM_DT_SKIPS_DIR/* > preps_dt_skips

cat preps_dt_skips | --sub preps.txt | sort -k1,1 -t $'\t' > preps_dt_skips_mongo
 -d ngrams --subs preps_dt_skips_mongo

The next part of this notebook describes how to calculate association measures between words in order to fix grammar.