In [3]:
import os
import sys
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import utils
%matplotlib inline
%load_ext autoreload
%autoreload 2
CSV_PATH = '../../data/unique_counts_semi.csv'
# load data
initial_df = utils.load_queries(CSV_PATH)
Do some cleanup
In [2]:
# filter out queries less than 2 characters long (i.e. single-character queries)
start_num = len(initial_df)
df = utils.clean_queries(initial_df)
print("{} distinct queries after stripping {} queries of length 1".format(len(df), start_num-len(df)))
print("Yielding a total of {} query occurrences.".format(df['countqstring'].sum()))
Let's take a look
In [3]:
df.head(10)
Out[3]:
The frequency of queries drops off pretty quickly, suggesting a long tail of low-frequency queries. Let's get a sense of this by looking at the cumulative coverage of queries with frequencies between 1 and 10.
While we're at it, we can plot the cumulative coverage up to a frequency of 200 (in ascending order of frequency).
In [4]:
total = df['countqstring'].sum()
fig, ax = plt.subplots(ncols=2, figsize=(20, 8))

# cumulative share of query occurrences covered by queries with frequency <= n
freqs = range(1, 201)
cum_coverage = pd.Series(
    [df[df['countqstring'] <= n]['countqstring'].sum() for n in freqs],
    index=freqs,
) / total
cum_coverage = (cum_coverage * 100).round(2)

# plot the cumulative coverage
cum_coverage.plot(ax=ax[0])
ax[0].set_xlabel('Query Frequency')
ax[0].set_ylabel('Cumulative Coverage (%)')

# see if it looks Zipfian, i.e. plot a log-log graph of query frequency against query rank
df.plot(ax=ax[1], y='countqstring', use_index=True, logx=True, logy=True)
ax[1].set_xlabel('Rank of Query (i.e. most frequent to least frequent)')
ax[1].set_ylabel('Query Frequency');

print("Freq  Cumulative Coverage")
for freq, val in cum_coverage.iloc[:10].items():
    print("{:>4}  {:5.2f}%".format(freq, val))
That is, queries with a frequency of 1 account for about 30% of all query occurrences, queries with a frequency of 2 or less account for 48%, 3 or less account for 58%, and so on.
Looking at the graph, it seems like the coverage rate drops off exponentially. Plotting a log-log graph of the query frequencies (y-axis) against the descending rank of the query frequency (x-axis) shows a linear-ish trend, suggesting that the distribution does indeed look like some kind of inverse power law.
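To put a rough number on that trend, we can fit a straight line to the log-log points; the negated slope is an estimate of the power-law (Zipf) exponent. A quick sketch, assuming df is sorted by descending countqstring so its index order is the frequency rank:
import numpy as np

# log(freq) ≈ intercept + slope * log(rank); for a Zipf-like distribution slope < 0
ranks = np.arange(1, len(df) + 1)
freqs = df['countqstring'].to_numpy()
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print("Estimated power-law exponent: {:.2f}".format(-slope))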
The pilot annotation round consisted of 50 queries sampled randomly from the total 84,011 query instances. Below is a summary of the annotators' results.
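First, a quick aside on the sampling: the sampling code isn't included in this notebook, but assuming the 50 queries were drawn in proportion to their frequency (i.e. sampling query instances rather than distinct query strings), the draw could be sketched as:
# Hypothetical sketch of the pilot sample: 50 queries, with each distinct
# query weighted by its occurrence count (random_state is illustrative).
pilot_sample = df.sample(n=50, weights='countqstring', random_state=1)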
Q2.
'YY' = Yes -- with place name
'YN' = Yes -- without place name
'NY' = No (but still a place)
'NN' = Not applicable (i.e. not an explicit location and not a place)
Q3.
'IAD' = INFORMATIONAL_ADVICE
'IDC' = INFORMATIONAL_DIRECTED_CLOSED
'IDO' = INFORMATIONAL_DIRECTED_OPEN
'ILI' = INFORMATIONAL_LIST
'ILO' = INFORMATIONAL_LOCATE
'IUN' = INFORMATIONAL_UNDIRECTED
'NAV' = NAVIGATIONAL
'RDE' = RESOURCE_ENTERTAINMENT
'RDO' = RESOURCE_DOWNLOAD
'RIN' = RESOURCE_INTERACT
'ROB' = RESOURCE_OBTAIN
In [5]:
print(utils.get_user_results('annotator1'))
print('\n')
print(utils.get_user_results('annotator2'))
print('\n')
print(utils.get_user_results('martin'))
The following results present inter-annotator agreement for the pilot round using Fleiss' kappa.
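The kappa computation itself is wrapped inside utils.do_iaa_pairs; as a reference for what's being computed, here's a sketch of Fleiss' kappa via statsmodels (not necessarily the implementation utils uses):
# Sketch: Fleiss' kappa with statsmodels, for a list of per-annotator label
# lists aligned so that labels[a][i] is annotator a's label for query i.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def fleiss_kappa_from_labels(labels):
    data = np.array(labels).T            # rows = queries, columns = raters
    table, _ = aggregate_raters(data)    # per-query counts of each category
    return fleiss_kappa(table)

# e.g. fleiss_kappa_from_labels([a1_q3_labels, a2_q3_labels, martin_q3_labels]),
# where each argument is a hypothetical list of Q3 codes for the 50 pilot queries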
Super handwavy consensus guide to interpreting kappa scores for annotation exercises in computational linguistics (Artstein and Poesio 2008:576): kappa below 0.67 is usually discounted, 0.67–0.8 supports only tentative conclusions, and 0.8 or above is needed for the annotations to count as reliable.
In [6]:
user_pairs = [
    ['annotator1', 'annotator2'],
    ['martin', 'annotator1'],
    ['martin', 'annotator2'],
]
results = utils.do_iaa_pairs(user_pairs)
utils.print_iaa_pairs(results, user_pairs)
These scores are not particularly high. We're struggling to get into even 'tentative' reliability land. We're probably going to need to do some disagreement analysis to work out what's going on.
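One obvious first step for that disagreement analysis is a pairwise confusion matrix over the Q3 labels, e.g. with pandas (a sketch; the label lists here are placeholders standing in for two annotators' aligned Q3 codes):
# Sketch of a pairwise confusion matrix for Q3. The lists below are
# placeholders, not real annotation results.
a1_q3 = ['NAV', 'IUN', 'IDO', 'NAV', 'ILO']
a2_q3 = ['NAV', 'IAD', 'IDO', 'RIN', 'ILO']
confusion = pd.crosstab(pd.Series(a1_q3, name='annotator1'),
                        pd.Series(a2_q3, name='annotator2'))
print(confusion)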
We can, however, look at agreement for Q2 and Q3 using a coarser level of agreement. For Q2, this is whether annotators agreed that a location was explicit in the query (ignoring whether the query included a place name).
For Q3, this is whether they agreed that the query was navigational, informational, or pertaining to a resource.
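The coarsening itself happens inside utils; the idea, roughly, is to collapse each fine-grained code to its top-level class, which the code prefixes above make easy (a sketch, assuming the code strings listed earlier):
# Sketch of the coarse collapses, assuming the codes listed above.
def coarse_q2(code):
    # keep only whether an explicit location was present,
    # dropping the place-name distinction: 'YY'/'YN' -> 'Y', 'NY'/'NN' -> 'N'
    return code[0]

def coarse_q3(code):
    # collapse to the top-level class: navigational / informational / resource
    return {'N': 'NAVIGATIONAL', 'I': 'INFORMATIONAL', 'R': 'RESOURCE'}[code[0]]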
In [7]:
results = utils.do_iaa_pairs(user_pairs, questions=(2,3), level='coarse')
utils.print_iaa_pairs(results, user_pairs)
Agreement has improved, especially for Q2. Q3, however, is still a bit on the low side.
In [8]:
for question in (1, 2, 3):
    print(utils.show_agreement(question, ['annotator1', 'annotator2', 'martin']))
    print('\n')