Reduced coverage experiments

In these experiments, the set of entries contained in one thesaurus is constrained to match that of another, smaller thesaurus (a minimal sketch of the reduction is given below). This lets us separate performance differences due to

  • higher coverage of the space of all words/phrases (due to different filtering or algorithm specifics), versus
  • better vector quality

There are currently (19-5-15) two sets of such experiments, both for NPs only:

  • w2v vectors reduced to the coverage of the count-windows vectors
  • count-windows vectors reduced to the coverage of the Baroni vectors
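
The reduction itself is just an intersection of vocabularies. Below is a minimal, illustrative sketch, assuming each thesaurus is represented as a plain mapping from entry to vector; the actual experiments use the project's own vector-store classes, so this is not the pipeline code.

import numpy as np

def reduce_coverage(large, small):
    # Restrict `large` to the entries also present in `small`.
    # Both arguments map an entry (word/phrase) to its vector. The result
    # keeps the vectors of `large`, but only for entries that `small` also
    # covers, so any performance gap between the two is down to vector
    # quality rather than coverage.
    shared = large.keys() & small.keys()
    return {entry: large[entry] for entry in shared}

# toy example: a larger thesaurus reduced to the coverage of a smaller one
w2v = {'cat/N': np.array([0.1, 0.2]), 'dog/N': np.array([0.3, 0.1]),
       'red/J_car/N': np.array([0.0, 0.5])}
counts = {'cat/N': np.array([1.0, 0.0]), 'dog/N': np.array([0.0, 1.0])}
reduced = reduce_coverage(w2v, counts)
assert set(reduced) == {'cat/N', 'dog/N'}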

In [1]:
%cd ~/NetBeansProjects/ExpLosion/
from copy import deepcopy
from notebooks.common_imports import *
from gui.output_utils import *
from gui.user_code import pretty_names
from pprint import pprint

# patch seaborn to use our own bootstrap for confidence intervals,
# both in tsplot (timeseries) and in the categorical (bar) plots
sns.timeseries.algo.bootstrap = my_bootstrap
sns.categorical.bootstrap = my_bootstrap


/Volumes/LocalDataHD/m/mm/mmb28/NetBeansProjects/ExpLosion
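
my_bootstrap comes from notebooks.common_imports and is substituted for seaborn's own bootstrap so that the time-series and categorical plots share the same resampling procedure. A minimal sketch of such a replacement is shown below, assuming the old seaborn bootstrap interface (positional data arrays plus n_boot/func keyword arguments); the real my_bootstrap may differ.

import numpy as np

def my_bootstrap(*args, **kwargs):
    # Hypothetical stand-in for the my_bootstrap defined in
    # notebooks.common_imports: resample all input arrays with
    # replacement n_boot times and apply func to each resample,
    # returning the bootstrap distribution of the statistic.
    n_boot = int(kwargs.get('n_boot', 1000))
    func = kwargs.get('func', np.mean)
    n = len(args[0])
    boot_dist = []
    for _ in range(n_boot):
        idx = np.random.randint(0, n, n)
        resampled = [np.asarray(a).take(idx, axis=0) for a in args]
        boot_dist.append(func(*resampled))
    return np.array(boot_dist)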

In [2]:
def plot_matching(exp_with_constraints, labels=None, rotation=0):
    # for each coverage-constrained experiment, find its unconstrained twin:
    # the experiment with identical settings but no entries_of restriction
    matching = []
    for e in exp_with_constraints:
        settings = settings_of(e.id)
        settings['expansions__entries_of_id'] = None
        matching.append(Experiment.objects.get(**settings))

    ids1 = list(exp_with_constraints.values_list('id', flat=True))
    ids2 = [x.id for x in matching]
    print(ids1, '--->', ids2)
    if not labels:
        labels = ['%s-%s' % (a.id, b.id) for a, b in zip(exp_with_constraints, matching)]
    with sns.color_palette("cubehelix", 2):
        diff_plot_bar([ids1, ids2], ['Limited', 'Unlimited'],
                      labels, rotation=rotation, hue_order=['Unlimited', 'Limited'],
                      legend_title='Coverage')

count-windows vectors (add, mult, ...) reduced to Baroni's coverage

We know these vectors are better and have higher coverage; does reducing their coverage also reduce their accuracy?


In [3]:
experiments = Experiment.objects.filter(expansions__entries_of__isnull=False, 
                                        expansions__entries_of__composer='Baroni')
names = [n.expansions.vectors.composer for n in experiments]

plot_matching(experiments, labels=names)
plt.axhline(random_vect_baseline(), c='k');
plt.savefig('plot-reduced-coverage1.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)


[188, 190, 192, 194] ---> [27, 28, 29, 30]

word2vec reduced to the coverage of the count-windows vectors (there shouldn't be a large difference)


In [4]:
exp_ids = Experiment.objects.filter(expansions__entries_of__isnull=False,
                                    expansions__entries_of__algorithm='count_windows'
                                   ).exclude(expansions__entries_of__composer='Baroni')
names = [n.expansions.vectors.composer for n in exp_ids]
plot_matching(exp_ids, labels=names)
plt.axhline(random_vect_baseline(), c='k');
plt.savefig('plot-reduced-coverage2.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)


[187, 189, 191, 193] ---> [34, 35, 36, 37]

wiki-100 reduced to the coverage of wiki-15 (vectors trained on 100% of the unlabelled data, restricted to the entries of vectors trained on 15%)


In [5]:
exp_ids = Experiment.objects.filter(expansions__entries_of__isnull=False,
                                    expansions__vectors__unlabelled_percentage=100,
                                    expansions__entries_of__unlabelled_percentage=15)
names = [n.expansions.vectors.composer for n in exp_ids]
plot_matching(exp_ids, labels=names)
plt.axhline(random_vect_baseline(), c='k');
plt.savefig('plot-reduced-coverage3.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)


[282, 283, 284, 285] ---> [34, 35, 36, 37]

Learning curve with reduced coverage

An extended version of the cells above: accuracy as a function of the amount of unlabelled data used to train the vectors (N), with and without a coverage constraint (M).


In [6]:
constrained_exp_ids = Experiment.objects.filter(expansions__entries_of__isnull=False,
                                    expansions__entries_of__unlabelled_percentage__in=[1, 10],
                                    expansions__vectors__composer='Add').values_list('id', flat=True)
print(constrained_exp_ids)


[324, 325, 326, 327, 328, 329, 330, 331, 332, 333]

In [7]:
# for each coverage-constrained experiment, find the unconstrained experiments
# with the same settings (any amount of unlabelled data, no entries_of restriction)
unconstrained_exp_ids = set()
for eid in constrained_exp_ids:
    s = settings_of(eid)
    s['expansions__entries_of_id'] = None
    del s['expansions__vectors__unlabelled_percentage']
    unconstrained_exp_ids.update(set(Experiment.objects.filter(**s).values_list('id', flat=True)))
print(unconstrained_exp_ids, '-->', constrained_exp_ids)


{34, 101, 102, 103, 104, 105, 106, 75} --> [324, 325, 326, 327, 328, 329, 330, 331, 332, 333]

In [8]:
# notation from thesis: N = unlabelled data used to train the vectors,
# M = unlabelled data used for the thesaurus that constrains their coverage
names = {'N': 'expansions__vectors__unlabelled_percentage',
         'M': 'expansions__entries_of__unlabelled_percentage'}
df1 = dataframe_from_exp_ids(unconstrained_exp_ids, names)
# df1['kind'] = 'Unlimited'
df2 = dataframe_from_exp_ids(constrained_exp_ids, names)
# df2['kind'] = 'Limited'


Accuracy has 4000 values
folds has 4000 values
M has 4000 values
N has 4000 values
Accuracy has 5000 values
folds has 5000 values
M has 5000 values
N has 5000 values

In [9]:
# runs without a coverage constraint have no M value; fillna treats them as 100%
df = pd.concat([df1, df2], ignore_index=True).convert_objects(convert_numeric=True).fillna(100)
# df['Tokens'] = df.percent * 1500000000 / 100
df.M = df.M.astype(int).astype(str) + '%'
with sns.color_palette("cubehelix", 4):
    ax = sns.tsplot(df, time='N', unit='folds', condition='M', value='Accuracy')
plt.xlim(0, df.N.max())
# plt.axvline(15 * 1500000000/ 100, color='red', linestyle='dotted');
# plt.axvline(1 * 1500000000/ 100, color='green', linestyle='dotted');
plt.savefig('plot-learning-curve-with-reduction.pdf', format='pdf', dpi=300)


