This notebook was put together by [Roman Prokofyev](http://prokofyev.ch)@[eXascale Infolab](http://exascale.info/). Source and license info is on [GitHub](https://github.com/dragoon/kilogram/).
This notebook is part of a bigger tutorial on fixing grammatical errors.
You will need to install the following python packages to run the notebook:
We will also do some plotting in this notebook; here are the necessary preparations:
In [1]:
from __future__ import division
from collections import defaultdict
import matplotlib.pyplot as plt
from mpltools import style
style.use('ggplot')
%matplotlib inline
In this notebook we will work with the standard First Certificate in English (FCE) exam collection. The corpus can be downloaded from: http://ilexir.co.uk/media/fce-released-dataset.zip. Detailed information on the corpus can be found in the following paper: http://ucrel.lancs.ac.uk/publications/cl2003/papers/nicholls.pdf
We can then process this dataset using the dataset-utils library:
In [2]:
import subprocess
print subprocess.check_output("fce_parse_edit_history.py -o /home/roman/fce_edits.tsv /home/roman/fce-released-dataset.zip; exit 0",
                              shell=True, stderr=subprocess.STDOUT)
In [3]:
from kilogram import extract_edits
edits = extract_edits('/home/roman/fce_edits.tsv')
PREPS_1GRAM = set(open('../extra/preps.txt').read().split('\n'))
prep_edits = [x for x in edits if x.edit1 in PREPS_1GRAM and x.edit2 in PREPS_1GRAM]
del edits
print 'Preposition edits:', len(prep_edits)
Next we extract all prepositions from the dataset:
In [12]:
from kilogram import extract_filtered
# Lowercase the first letter in case the token appears at the beginning of a sentence
filter_func = lambda x: x[0].lower()+x[1:] in PREPS_1GRAM
all_preps = extract_filtered('/home/roman/fce_edits.tsv', filter_func)
print 'Percentage preposition replaces: {0:.2f}%'.format(len(prep_edits)/len(all_preps)*100)
While exploring the data, we noticed that many prepositions occur close to linguistic constructs representing times, dates, percentages, etc. For example, in the phrase
the starting time has been changed from 19:30 to 20:15.
we have two constructs representing the time of day. Such constructs are usually called named entities. Unfortunately, these entities occur very rarely in our n-gram counts corpus, if they occur at all. This is due to the large space of possible numbers, especially floating-point ones. A more appealing choice would be to replace numeric entities with placeholders, both in the text and in the n-gram counts corpus.
The kilogram library provides a number_replace function which tests whether a token from the text represents one of the predefined numeric entities. Currently this function tries to resolve the following numeric entities:
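As an illustration, here is a minimal sketch of how number_replace can be used (the example tokens and the exact placeholder strings returned are assumptions; the code later in this notebook only relies on placeholders having the <...> form, e.g. <NUM>):
In [ ]:
from kilogram.lang import number_replace

# Minimal sketch: tokens recognized as numeric entities are mapped to
# placeholders of the form <...> (e.g. <NUM>); other tokens are returned
# unchanged. The example tokens below are made up for illustration.
for token in ['19:30', '20', 'breakfast']:
    print token, '->', number_replace(token)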
Let's first check how many numeric entities we have in the proximity of prepositions, and how often people make errors in those prepositions:
In [5]:
from kilogram.lang import number_replace
def get_num_distributions(edits):
    number_positions = 0
    for prep_edit in edits:
        # positions -2..2 relative to the preposition; position 0 is the preposition itself
        for i, token in zip(range(-2, 3), prep_edit.context(2)):
            if i != 0:
                token1 = number_replace(token)
                if token1 != token:
                    number_positions += 1
    return number_positions
all_numeric = get_num_distributions(all_preps)
error_numeric = get_num_distributions(prep_edits)
print 'Numeric + all prepositions:', all_numeric
print 'Numeric + erroneous prepositions:', error_numeric
print 'Percentage numeric errors of all numeric: {0:.2f}%'.format(error_numeric/all_numeric*100)
print 'Percentage numeric errors of all errors: {0:.2f}%'.format(error_numeric/len(prep_edits)*100)
We can see that, despite being rare, preposition errors close to numeric entities can still contribute to better results if taken into account.
We can further investigate the data and calculate proportions per numeric entity type and per distance to the preposition:
In [16]:
def get_num_distributions(edits):
    number_positions_dist = defaultdict(lambda: 0)
    number_type_dist = defaultdict(lambda: 0)
    for prep_edit in edits:
        for i, token in zip(range(-2, 3), prep_edit.context(2)):
            if i != 0 and token:
                # count tokens that are already entity placeholders, e.g. <NUM>
                if (token[0] == '<' and token[-1] == '>') or '<NUM>' in token:
                    number_positions_dist[i] += 1
                    number_type_dist[token] += 1
    return number_positions_dist, number_type_dist
number_positions_dist, number_type_dist = get_num_distributions(all_preps)
e_number_positions_dist, e_number_type_dist = get_num_distributions(prep_edits)
fig = plt.figure(figsize=(9,5))
plt.bar(*zip(*number_positions_dist.items()), align='center', log=True)
plt.bar(*zip(*e_number_positions_dist.items()), align='center', color='y')
number_type_data = sorted(number_type_dist.items(), key=lambda x: x[1], reverse=True)[:12]
categories, values = zip(*number_type_data)
e_values = [e_number_type_dist.get(category, 0) for category in categories]
fig = plt.figure(figsize=(18,5))
plt.bar(range(len(categories)), values, align='center', log=True)
plt.bar(range(len(categories)), e_values, align='center', color='y')
plt.xticks(range(len(categories)), categories)
print 'Done'
Now that we have looked at a specific solution for rare n-grams in the case of numeric entities, let's look more closely at n-grams with zero counts. Maybe we will be able to isolate other generic n-gram classes.
First, configure the n-gram backend service as in the previous part:
In [6]:
from kilogram import NgramService
NgramService.configure(PREPS_1GRAM, mongo_host=('localhost', '27017'), hbase_host=("diufpc301", "9090"))
Next we process the first 1000 n-grams and see which ones have zero counts:
In [13]:
for prep_edit in all_preps[:1000]:
    central_ngram = prep_edit.ngram_context(fill=False)[3][1]
    if central_ngram:
        # lowercase n-gram since we lowercased our n-gram counts, also number_replace
        assoc = central_ngram.association()
        if len(assoc) == 0:
            print central_ngram
Observing the output, we can extract the following n-gram classes:
Again, we probably want to replace these entities with placeholders, but this time it is much more difficult than with numeric entities. In order to recognize them, we will have to employ advanced tools such as Stanford NER or the named entity chunker from the NLTK library (http://www.nltk.org/book/ch07.html, Sec. 7.5).
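For the NLTK option, a minimal sketch of the named entity chunker looks like this (the example sentence is made up; the tokenizer, POS tagger and NE chunker models have to be downloaded once via nltk.download()):
In [ ]:
import nltk

# Minimal sketch of NLTK's named entity chunker (NLTK book, Ch. 7.5).
# ne_chunk expects POS-tagged tokens and returns a tree whose subtrees
# mark the recognized entities (PERSON, ORGANIZATION, GPE, ...).
sentence = "I moved from London to Geneva in October."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print nltk.ne_chunk(tagged)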