This notebook was put together by [Roman Prokofyev](http://prokofyev.ch)@[eXascale Infolab](http://exascale.info/). Source and license info is on [GitHub](https://github.com/dragoon/kilogram/).
This notebook is part of a bigger tutorial on fixing grammatical errors.
If you don't want to process the data yourself, we have prepared a dataset processed according to the steps below and put it on S3: http://coming.soon
You just need to load this dataset into your local database to compute association measures (Step 2).
We need to download n-gram counts from a sufficiently large corpus. In this example we used the Google Books N-gram corpus, version 20120701. The dataset is split into multiple parts and can be downloaded from here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
The tutorial assumes you already have the data aggregated by year.
The original dataset contains n-gram counts per year, which we don't need here, so we first aggregate the counts over the years in Hadoop (the "year-aggregated" data referred to below). We also remove POS-tagged n-grams and lowercase everything. This step significantly reduces the amount of data we need to store, so we can, for example, load all 2grams into HBase.
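To make the aggregation step concrete, here is a minimal sketch of what it boils down to. The real job runs as a Hadoop streaming job via run_job.py; this sketch only assumes the raw Google Books v2 line format (ngram, year, match_count, volume_count, tab-separated):

```python
import sys
from collections import defaultdict

# Sketch of the year-aggregation step (the real job runs on Hadoop via run_job.py).
# Input: raw Google Books v2 lines "ngram<TAB>year<TAB>match_count<TAB>volume_count".
counts = defaultdict(int)
for line in sys.stdin:
    ngram, year, match_count, _volume_count = line.rstrip('\n').split('\t')
    if '_' in ngram:          # drop POS-tagged n-grams such as "analysis_NOUN of"
        continue
    counts[ngram.lower()] += int(match_count)   # lowercase and sum over all years

for ngram, total in counts.items():
    sys.stdout.write('%s\t%d\n' % (ngram, total))
```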
The following table shows the sizes of ngrams at different processing stages:
Type | Original Size (gzipped) | Year-aggregated | Final size |
---|---|---|---|
1grams | 5.3GB | 342MB | 90MB |
2grams | 135GB | 13.4GB | 1.9GB |
3grams | 1.3TB | 190.6GB | 9.5GB |
4grams | 294GB | 50.9GB | 17.6GB |
5grams | XXXGB | 51.1GB | 12GB |
At the final processing step, we try to aggregate n-grams containing various entities by their conceptual type using regular expressions and dictionaries. For example, we distinguish numeric entities, person names, cities (geo-entities) and DBPedia entities (see below). We will show how to use this information in the next part of this tutorial.
Internally, numeric entities are divided into five types, each identified by a regular expression.
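As an illustration of the idea only (the patterns and type names below are invented for this sketch and are not the library's actual five types), regex-based typing of numeric tokens looks roughly like this:

```python
import re

# Illustrative patterns only -- NOT the library's actual numeric types.
NUMERIC_PATTERNS = [
    ('<YEAR>',    re.compile(r'^(1[5-9]|20)\d{2}$')),   # 1500-2099
    ('<PERCENT>', re.compile(r'^\d+(\.\d+)?%$')),
    ('<ORDINAL>', re.compile(r'^\d+(st|nd|rd|th)$')),
    ('<DECIMAL>', re.compile(r'^\d+\.\d+$')),
    ('<INTEGER>', re.compile(r'^\d+$')),
]

def numeric_type(token):
    """Return the first matching numeric type label, or None."""
    for label, pattern in NUMERIC_PATTERNS:
        if pattern.match(token):
            return label
    return None

print(numeric_type('1984'))   # <YEAR>
print(numeric_type('3.14'))   # <DECIMAL>
```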
Person names are identified by a dictionary lookup against the names corpus of the NLTK library. For each match we generate a new n-gram with the original string replaced by the "PERSON" token.
Cities are also identified by a dictionary lookup; we use the Geonames 15000 list as our dictionary. Since it is very small (~2MB), we ship it with the kilogram library. For each match we generate a new n-gram with the original string replaced by the "CITY" token.
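Here is a minimal sketch of this dictionary-lookup substitution. The real mappers live in the kilogram library; the city set is left empty here, while in the pipeline it comes from the Geonames 15000 dump:

```python
from nltk.corpus import names   # requires: python -m nltk.downloader names

# Sketch of the dictionary-lookup substitution described above.
PERSON_NAMES = set(n.lower() for n in names.words())
CITIES = set()  # in the real pipeline this is loaded from the Geonames 15000 dump

def substitute_entities(ngram):
    """Yield additional n-grams with matched tokens replaced by type tokens."""
    tokens = ngram.split()
    for i, token in enumerate(tokens):
        if token.lower() in PERSON_NAMES:
            yield ' '.join(tokens[:i] + ['PERSON'] + tokens[i+1:])
        if token.lower() in CITIES:
            yield ' '.join(tokens[:i] + ['CITY'] + tokens[i+1:])

print(list(substitute_entities('told John about')))  # ['told PERSON about']
```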
DBPedia entities are a bit more complicated. On the one hand, there are many interesting properties inside DBPedia that we can leverage. On the other hand, it is generally impossible to match an arbitrary DBPedia entity inside an n-gram due to the lack of context. That would only be possible if we had a list of unambiguous entities that always match their string representations, which we don't have. However, we found a way to remove incorrect entity links (i.e., entities that often match generic phrases) by applying statistical outlier detection to the match counts of DBPedia types and removing the outliers.
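Purely to illustrate the outlier-removal idea (the actual statistic and threshold used in the pipeline may differ and are covered in the DBPedia linking tutorial referenced below), a robust z-score filter over per-type entity counts could look like this:

```python
import numpy as np

# Within one DBPedia type, entities whose n-gram match counts are extreme
# outliers tend to be generic phrases that were linked by mistake.
# The median-absolute-deviation statistic below is just one reasonable choice,
# not necessarily what the pipeline uses.
def remove_count_outliers(entity_counts, z_threshold=3.5):
    """entity_counts: dict of entity -> total n-gram match count for one type."""
    counts = np.array(list(entity_counts.values()), dtype=float)
    median = np.median(counts)
    mad = np.median(np.abs(counts - median)) or 1.0
    return {e: c for e, c in entity_counts.items()
            if 0.6745 * abs(c - median) / mad <= z_threshold}

# Made-up counts: "Can" matches a generic word far more often than real city mentions.
print(remove_count_outliers({'Berlin': 1200, 'Dresden': 900, 'Can': 2500000}))
# {'Berlin': 1200, 'Dresden': 900}
```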
That was the theoretical part. Now on to the n-gram processing itself.
First clone the kilogram library:
git clone https://github.com/dragoon/kilogram.git
cd kilogram/mapreduce
Run the pre-filter job (again, on the already year-aggregated data). The job removes POS tags, multiple punctuation symbols, and some other noise:
./run_job.py [--reducers NUMBER_OF_REDUCERS] $RAW_NGRAMS_DIR $NGRAM_PREFILTER_DIR
While the job is running, install NLTK and the kilogram library, and download the names dataset on every machine using parallel-ssh:
parallel-ssh -h hosts.txt -l root 'apt-get install -y python-dev libxml2-dev libxslt-dev'
parallel-ssh -h hosts.txt -l root 'pip install -U https://github.com/dragoon/kilogram/zipball/master'
parallel-ssh -h hosts.txt -l root 'python -m nltk.downloader -d /usr/local/lib/nltk_data names'
The hosts.txt file should contain the list of hosts in your cluster, one per line.
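For example, a hosts.txt for a three-node cluster could look like this (the hostnames are made up):

```
hadoop-worker-01.example.com
hadoop-worker-02.example.com
hadoop-worker-03.example.com
```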
Pre-filtered n-grams are required to extract DBPedia entities. This method is covered in a separate tutorial: Linking DBPedia Entities in N-gram corpus
After DBPedia entities are linked, run the post-filter job to resolve numeric entities and lowercase all other n-grams:
./run_job.py -m ./filter/mapper_postfilter.py $NGRAM_PREFILTER_DIR:$ENTITY_DIR:$DBPEDIA_TYPE_DIR $NGRAM_POSTFILTER_DIR
After post-filtering, n-grams should consume ~44GB.
In the next step, we will extract only the n-grams we are interested in.
To efficiently compute association measures between words, we need fast access to every n-gram count up to 3grams. Unfortunately, the new n-gram counts are still too large to fit into any local key-value store.
Another issue with storing n-gram counts as plain key-value pairs is that it is very inefficient for our grammatical correction task. For instance, to correct a preposition in a sentence, we need to retrieve n-gram counts for all prepositions under consideration.
With plain key-value counts, that means one request per preposition per n-gram. However, since we know in advance which prepositions we want to consider, we can aggregate all preposition counts into a single value and use a special n-gram as the key. We will call such n-grams preposition n-grams.
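As a toy example with made-up counts and a truncated preposition set (the full 49-preposition list ships in extra/preps.txt), the aggregation looks like this; the PREP placeholder token is only an illustration, not necessarily the key format the library uses:

```python
from collections import defaultdict

PREPS = {'in', 'on', 'at', 'of', 'for'}  # truncated; the real set comes from extra/preps.txt

def aggregate_preposition_ngrams(ngram_counts):
    """Collapse counts of n-grams that differ only in the preposition."""
    aggregated = defaultdict(dict)
    for ngram, count in ngram_counts.items():
        tokens = ngram.split()
        for i, token in enumerate(tokens):
            if token in PREPS:
                # 'PREP' is a placeholder token used for this sketch only
                key = ' '.join(tokens[:i] + ['PREP'] + tokens[i+1:])
                aggregated[key][token] = aggregated[key].get(token, 0) + count
    return dict(aggregated)

# Hypothetical counts:
print(aggregate_preposition_ngrams({'depends on the': 900, 'depends of the': 40, 'depends in the': 15}))
# {'depends PREP the': {'on': 900, 'of': 40, 'in': 15}}
```

With this layout, correcting one preposition slot costs a single lookup instead of one lookup per candidate preposition.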
Assuming that our current directory is something like ~/kilogram/mapreduce, we can filter preposition n-grams using the following scripts:
./run_job.py -n 2 -m ./filter/mapper_filter.py --filter-file ../extra/preps.txt $NGRAM_POSTFILTER_DIR $2GRAMS_PREPS_OUT
./run_job.py -n 3 -m ./filter/mapper_filter.py --filter-file ../extra/preps.txt $NGRAM_POSTFILTER_DIR $3GRAMS_PREPS_OUT
This way we filter the 2grams and 3grams that contain prepositions; their sizes should be approximately 170MB and 2.9GB respectively. Now we can continue and load them into a local database.
To compute association measures for 3grams, we also require access to arbitrary 2gram and 1gram counts. We will show how to extract them and load them into storage in the next section.
Given what we discussed in the previous section, we decided on the following setup: the relatively small preposition n-grams and all 1grams go into MongoDB, while arbitrary 2grams go into HBase.
To store the preposition n-grams in MongoDB in the right format, we use scripts from the kilogram library, which also needs to be installed locally via pip to run them:
hdfs dfs -cat $2GRAMS_PREPS_OUT/* > preps_2grams
cat preps_2grams | convert_to_mongo.py --sub preps.txt | sort -k1,1 -t $'\t' > preps_2grams_mongo
hdfs dfs -cat $3GRAMS_PREPS_OUT/* > preps_3grams
cat preps_3grams | convert_to_mongo.py --sub preps.txt | sort -k1,1 -t $'\t' > preps_3grams_mongo
insert_to_mongo.py -d ngrams --subs preps_3grams_mongo
insert_to_mongo.py -d ngrams --subs preps_2grams_mongo
The file preps.txt contains our preposition set, which comprises 49 prepositions in total.
Extract all 1grams and store them in MongoDB:
./run_job.py -n 1 -m ./filter/mapper_filter.py $NGRAM_POSTFILTER_DIR $1GRAMS_OUT_DIR
hdfs dfs -cat $1GRAMS_OUT_DIR/* > all_1grams
insert_to_mongo.py -d 1grams all_1grams
Next, we upload the 2grams to an HBase table named ngrams using Pig:
hbase shell
> create 'ngrams', 'ngram'
> ^D
pig -p table=ngrams -p path=$NGRAM_POSTFILTER_DIR -p n=2 ../extra/hbase_upload.pig
Grab another cup of something, this will take a while.
Filter arbitrary n-grams with determiner skips and insert them into HBase:
./run_job.py -m ./grammar/mapper_DT_strip.py $NGRAM_POSTFILTER_DIR $NGRAM_DT_SKIPS_DIR
pig -p table=ngrams2 -p path=$NGRAM_DT_SKIPS_DIR -p n=2 ../extra/hbase_upload.pig
pig -p table=ngrams2 -p path=$NGRAM_DT_SKIPS_DIR -p n=3 ../extra/hbase_upload.pig
Filter preposition n-grams with determiner skips and insert them into MongoDB:
./run_job.py -m ./grammar/mapper_DT_strip.py --filter-file ../extra/preps.txt $NGRAM_POSTFILTER_DIR $PREPS_DT_SKIPS_DIR
hdfs dfs -cat $PREPS_DT_SKIPS_DIR/* > preps_dt_skips
cat preps_dt_skips | convert_to_mongo.py --sub preps.txt | sort -k1,1 -t $'\t' > preps_dt_skips_mongo
insert_to_mongo.py -d ngrams --subs preps_dt_skips_mongo
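For intuition, the determiner skipping performed by grammar/mapper_DT_strip.py conceptually maps each n-gram to a variant with determiners removed, so that counts of phrases differing only in an article can be pooled. A minimal sketch follows; the determiner set here is an assumption, check the mapper for the real behavior:

```python
DETERMINERS = {'a', 'an', 'the'}  # assumed set; see grammar/mapper_DT_strip.py for the real one

def strip_determiners(ngram):
    """Map an n-gram to its determiner-skipped form."""
    return ' '.join(t for t in ngram.split() if t.lower() not in DETERMINERS)

print(strip_determiners('results of the study'))  # results of study
```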
The next part of this notebook describes how to calculate association measures between words in order to fix grammar.