This notebook was put together by [Roman Prokofyev](http://prokofyev.ch)@[eXascale Infolab](http://exascale.info/). Source and license info is on [GitHub](https://github.com/dragoon/kilogram/).

Prerequisites

You will need to install the following Python packages to run the notebook: kilogram, dataset-utils, matplotlib, and mpltools.

We will also do some plotting in this notebook; here are some preparations:


In [1]:
from __future__ import division
from collections import defaultdict
import matplotlib.pyplot as plt
from mpltools import style
style.use('ggplot')
%matplotlib inline

Step 0: Downloading the FCE exam collection

In this notebook we will work with the standard First Certificate of English (FCE) exam collection. This corpus can be downloaded from: http://ilexir.co.uk/media/fce-released-dataset.zip. Detailed information on the corpus can be found in the following paper: http://ucrel.lancs.ac.uk/publications/cl2003/papers/nicholls.pdf

We can also process this dataset using the dataset-utils library:


In [2]:
import subprocess
print subprocess.check_output("fce_parse_edit_history.py -o /home/roman/fce_edits.tsv /home/roman/fce-released-dataset.zip; exit 0",
                              shell=True, stderr=subprocess.STDOUT)



Step 1: Extracting edits of prepositions

Extract all edits from the FCE collection and keep only prepositions:


In [3]:
from kilogram import extract_edits
edits = extract_edits('/home/roman/fce_edits.tsv')
PREPS_1GRAM = set(open('../extra/preps.txt').read().split('\n'))
prep_edits = [x for x in edits if x.edit1 in PREPS_1GRAM and x.edit2 in PREPS_1GRAM]
del edits

print 'Preposition edits:', len(prep_edits)


Total edits extracted: 45704
Preposition edits: 2245

Next we extract all prepositions from the dataset:


In [12]:
from kilogram import extract_filtered
# Lowercase the first character so sentence-initial prepositions still match
filter_func = lambda x: x[0].lower()+x[1:] in PREPS_1GRAM
all_preps = extract_filtered('/home/roman/fce_edits.tsv', filter_func)
print 'Percentage preposition replaces: {0:.2f}%'.format(len(prep_edits)/len(all_preps)*100)


Total edits extracted: 61053
Percentage preposition replaces: 3.68%
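The sentence-initial lowercasing trick above can be illustrated in isolation. This is a self-contained sketch using a toy preposition set rather than the full preps.txt list:

```python
# Toy stand-in for the contents of preps.txt
PREPS = {'in', 'on', 'at', 'from', 'to'}

def is_preposition(token):
    # Lowercase only the first character, so sentence-initial forms
    # such as "In" or "At" still match the lowercase preposition list.
    return token[0].lower() + token[1:] in PREPS

print([t for t in ['In', 'Paris', 'at', 'Noon'] if is_preposition(t)])
# ['In', 'at']
```

Lowercasing the whole token would also work for single-word prepositions, but touching only the first character preserves any internal capitalization for tokens that are not sentence-initial.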

Step 2: Analyzing numeric patterns

While exploring the data, we noticed that many prepositions occur close to linguistic constructs representing times, dates, percentages, etc. For example, in the phrase

the starting time has been changed from 19:30 to 20:15.

we have two constructs representing the time of day. Such constructs are usually called named entities. Unfortunately, these entities occur very rarely in our n-gram counts corpus, if they occur at all, because of the large space of possible numbers, especially floating-point ones. A more appealing choice is to replace numeric entities with placeholders, both in the text and in the n-gram counts corpus.

The kilogram library provides a number_replace function which tests whether a token represents one of the predefined numeric entities. Currently this function tries to resolve the following numeric entities:

  • Times, including am/pm variations;
  • Volumes/areas;
  • Percentages;
  • Integers;
  • Generic numbers.
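The placeholder idea can be sketched with a few regular expressions. This is a simplified stand-in, not kilogram's actual number_replace implementation; the <TIME> and <PERC> placeholder names are illustrative assumptions (the corpus output later in this notebook only shows <NUM> and $<NUM>):

```python
import re

# Ordered list of (pattern, placeholder) pairs; the first match wins.
PATTERNS = [
    (re.compile(r'^\d{1,2}:\d{2}(am|pm)?$'), '<TIME>'),   # 19:30, 8:15pm
    (re.compile(r'^\d+(\.\d+)?%$'), '<PERC>'),            # 45%, 3.68%
    (re.compile(r'^\$\d+(\.\d+)?$'), '$<NUM>'),           # $120, $9.99
    (re.compile(r'^\d+(\.\d+)?$'), '<NUM>'),              # 2245, 0.5
]

def number_replace_sketch(token):
    for pattern, placeholder in PATTERNS:
        if pattern.match(token):
            return placeholder
    return token  # unchanged if no numeric entity matched

print([number_replace_sketch(t) for t in ['19:30', '45%', '$120', '2245', 'from']])
# ['<TIME>', '<PERC>', '$<NUM>', '<NUM>', 'from']
```

Because the function returns the token unchanged when nothing matches, the test `token1 != token` used below is enough to detect that a numeric entity was recognized.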

Let's first check how many numeric entities occur in the proximity of prepositions, and how often people make errors in those prepositions:


In [5]:
from kilogram.lang import number_replace
def get_num_distributions(edits):
    number_positions = 0
    for prep_edit in edits:
        # offsets -2..2 around the preposition; offset 0 is the preposition itself
        for i, token in zip(range(-2,3), prep_edit.context(2)):
            if i != 0:
                token1 = number_replace(token)
                if token1 != token:
                    number_positions += 1
    return number_positions

all_numeric = get_num_distributions(all_preps)
error_numeric = get_num_distributions(prep_edits)
print 'Numeric + all prepositions:', all_numeric
print 'Numeric + erroneous prepositions:', error_numeric
print 'Percentage numeric errors of all numeric: {0:.2f}%'.format(error_numeric/all_numeric*100)
print 'Percentage numeric errors of all errors: {0:.2f}%'.format(error_numeric/len(prep_edits)*100)


Numeric + all prepositions: 566
Numeric + erroneous prepositions: 9
Percentage numeric errors of all numeric: 1.59%
Percentage numeric errors of all errors: 0.40%

We can see that, despite being very rare, preposition errors close to numeric entities can still contribute to better results if taken into account.

We can further investigate the data and calculate proportions per numeric entity type and per distance to the preposition:


In [16]:
def get_num_distributions(edits):
    number_positions_dist = defaultdict(int)
    number_type_dist = defaultdict(int)
    for prep_edit in edits:
        for i, token in zip(range(-2,3), prep_edit.context(2)):
            if i != 0 and token:
                if (token[0] == '<' and token[-1] == '>') or '<NUM>' in token:
                    number_positions_dist[i] += 1
                    number_type_dist[token] += 1
    return number_positions_dist, number_type_dist

number_positions_dist, number_type_dist = get_num_distributions(all_preps)
e_number_positions_dist, e_number_type_dist = get_num_distributions(prep_edits)

fig = plt.figure(figsize=(9,5))
plt.bar(*zip(*number_positions_dist.items()), align='center', log=True)
plt.bar(*zip(*e_number_positions_dist.items()), align='center', color='y')
number_type_data = sorted(number_type_dist.items(), key=lambda x: x[1], reverse=True)[:12]
categories, values = zip(*number_type_data)
e_values = [e_number_type_dist.get(category, 0) for category in categories]
fig = plt.figure(figsize=(18,5))
plt.bar(range(len(categories)), values, align='center', log=True)
plt.bar(range(len(categories)), e_values, align='center', color='y')
plt.xticks(range(len(categories)), categories)
print 'Done'


Done

Step 3: Central n-grams with zero counts

Now that we have looked at a specific solution for rare n-grams in the case of numeric entities, let's look more closely at n-grams with zero counts. Maybe we will be able to isolate other generic n-gram classes.

First, configure the n-gram backend service as in the previous part:


In [6]:
from kilogram import NgramService
NgramService.configure(PREPS_1GRAM, mongo_host=('localhost', '27017'), hbase_host=("diufpc301", "9090"))

Next we process the first 1000 n-grams and see which ones have zero counts:


In [13]:
for prep_edit in all_preps[:1000]:
    central_ngram = prep_edit.ngram_context(fill=False)[3][1]
    if central_ngram:
        # lowercase the n-gram, since our n-gram counts were lowercased; numbers are also replaced
        assoc = central_ngram.association()
        if len(assoc) == 0:
            print central_ngram


HBase req rate: 1.04969622934 r/s
winner of himself
show with danny
created by bloody
fights but by
cards from switzerland
frustrated to read
<NUM> but forty-five
lack of waiters
variety of micro-computers
stay at pat's
keen on musicals
<NUM> but unfortunately
! of course
go to 'theatre
$<NUM> for the
potential to download
influenced by achievements
available' but it
tickets for $<NUM>
secret to pat
talk to katrin
closed after finishing
apologised to maria
cheque for $<NUM>
$<NUM> to the
problems with tickets

Observing the output, we can identify the following n-gram classes:

  • Normal n-grams, such as "created by bloody" and "influenced by achievements", which just happen to have zero counts.
  • N-grams containing names of persons, such as "apologised to maria", or names of places, such as "cards from switzerland", or, more generally, other named entities.

Again, we probably want to replace these entities with placeholders, but this time it is much harder than with numeric entities. To recognize them, we will have to employ advanced tools such as Stanford NER or the named-entity chunker from the NLTK library (http://www.nltk.org/book/ch07.html, Sec. 7.5).
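The substitution step itself is independent of which NER tool provides the entity labels. Since Stanford NER and NLTK's chunker both require model downloads, this sketch uses a toy gazetteer (built from names seen in the output above) purely to illustrate the placeholder replacement we would apply after real NER tagging:

```python
# Toy gazetteer standing in for a real NER tool's output;
# a production system would get these labels from Stanford NER or NLTK.
PERSONS = {'maria', 'katrin', 'danny', 'pat'}
LOCATIONS = {'switzerland'}

def replace_entities(tokens):
    out = []
    for token in tokens:
        if token in PERSONS:
            out.append('<PERSON>')
        elif token in LOCATIONS:
            out.append('<LOCATION>')
        else:
            out.append(token)
    return out

print(replace_entities(['apologised', 'to', 'maria']))
# ['apologised', 'to', '<PERSON>']
print(replace_entities(['cards', 'from', 'switzerland']))
# ['cards', 'from', '<LOCATION>']
```

With such placeholders, "apologised to maria" and "talk to katrin" collapse into the single n-gram "〈verb〉 to <PERSON>", which is far more likely to have non-zero counts in the corpus.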


In [ ]: