word2vec's performance at the word analogy task and a range of word similarity tasks improves with more data. Earlier experiments (see the euroscipy_demo notebook) show that performance at the analogy task roughly scales like $\log(x)$. Let's investigate if the same holds for the document classification task. For simplicity, we only look at word2vec vectors with simple composition on the Reuters and Amazon data sets.
In [1]:
%cd ~/NetBeansProjects/ExpLosion/
from notebooks.common_imports import *
from gui.output_utils import *
# monkey-patch seaborn to use our own bootstrap routine for confidence intervals
sns.timeseries.algo.bootstrap = my_bootstrap
sns.categorical.bootstrap = my_bootstrap
In [8]:
# total token counts of the unlabelled corpora
corpus_sizes = {'cwiki': 525000000, 'wiki': 1500000000}

def get_exp_ids_for_varying_amounts_of_unlabelled(lab_corpus, unlab=['wiki'], k=3):
    query_dict = {'expansions__vectors__rep': 0,
                  'expansions__k': k,
                  'labelled': lab_corpus,
                  'expansions__use_similarity': 0,
                  'expansions__neighbour_strategy': 'linear',
                  'expansions__vectors__dimensionality': 100,
                  'document_features_ev': 'AN+NN',
                  'document_features_tr': 'J+N+AN+NN',
                  'expansions__allow_overlap': False,
                  'expansions__entries_of': None,
                  'expansions__vectors__algorithm': 'word2vec',
                  'expansions__vectors__composer__in': ['Add', 'Mult', 'Left', 'Right'],  # todo Verb???
                  'expansions__vectors__unlabelled__in': unlab,
                  'expansions__decode_handler': 'SignifiedOnlyFeatureHandler',
                  'expansions__noise': 0}
    return Experiment.objects.filter(**query_dict).order_by('expansions__vectors__unlabelled_percentage',
                                                            'expansions__vectors__composer').values_list('id', flat=True)
# convert percentages of the unlabelled corpus to (approximate) token counts
def compute_token_count(row):
    corpus_sizes = {'cwiki': 525000000, 'wiki': 1500000000}
    return corpus_sizes[row.unlab] * (row.percent / 100)
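# Quick sanity check of the conversion above (illustrative sketch; Row is a
# hypothetical stand-in for a dataframe row, not part of the pipeline):
# 50% of the 1.5B-token wiki corpus should come out as 750M tokens.
from collections import namedtuple
Row = namedtuple('Row', ['unlab', 'percent'])
assert compute_token_count(Row('wiki', 50.0)) == 750000000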
def plot(lab_corpus, unlab, k=3):
    ids = get_exp_ids_for_varying_amounts_of_unlabelled(lab_corpus, unlab, k=k)
    print(len(ids), 'ids in total:', ids)
    cols = {'Composer': 'expansions__vectors__composer',
            'percent': 'expansions__vectors__unlabelled_percentage',
            'unlab': 'expansions__vectors__unlabelled'}
    df = dataframe_from_exp_ids(ids, cols).convert_objects(convert_numeric=True)
    # convert percentages to token counts and build a readable label per condition
    df['tokens'] = df.apply(compute_token_count, axis=1)
    df['Method'] = df.unlab + '-' + df.Composer
    ax = sns.tsplot(df, time='tokens', condition='Method', value='Accuracy',
                    unit='folds', linewidth=4, ci=68)
    # set axis labels
    ax.set(xlabel='Tokens', ylabel='Accuracy')
    # random baseline
    plt.axhline(random_vect_baseline(lab_corpus), label='RandV', color='black')
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    return df, ax
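For clarity, sns.tsplot expects the long-form layout produced by dataframe_from_exp_ids: one Accuracy value per (Method, tokens, fold) row, which it then averages over folds and wraps in a bootstrap confidence band per Method. Below is a minimal sketch with made-up numbers; the column names match the ones used in plot above, and it assumes a seaborn version that still ships tsplot (removed in 0.9).
import pandas as pd
toy = pd.DataFrame({
    'tokens':   [1e8, 1e8, 5e8, 5e8] * 2,
    'folds':    [0, 1, 0, 1] * 2,
    'Method':   ['wiki-Add'] * 4 + ['wiki-Mult'] * 4,
    'Accuracy': [.60, .62, .66, .67, .58, .59, .61, .63],
})
sns.tsplot(toy, time='tokens', condition='Method', value='Accuracy', unit='folds', ci=68)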
In [5]:
with sns.color_palette("cubehelix", 4):
    df1, ax = plot('amazon_grouped-tagged', ['wiki', 'cwiki'])
    plt.savefig('plot-w2v_learning_curve_amazon.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
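To relate this back to the $\log(x)$ scaling mentioned in the introduction, one can fit Accuracy = a + b * log(tokens) separately for each method. This is a rough sketch, not part of the original analysis; it assumes df1 has the tokens, Accuracy and Method columns created in plot above.
import numpy as np
for method, grp in df1.groupby('Method'):
    grp = grp[grp.tokens > 0]  # guard against log(0) if a 0% run is present
    slope, intercept = np.polyfit(np.log(grp.tokens.values), grp.Accuracy.values, 1)
    print('%s: accuracy ~ %.3f + %.3f * log(tokens)' % (method, intercept, slope))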
In [6]:
with sns.color_palette("cubehelix", 4):
    plot('reuters21578/r8-tagged-grouped', ['wiki'])
    plt.savefig('plot-w2v_learning_curve_r2.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
In [9]:
with sns.color_palette("cubehelix", 4):
    plot('reuters21578/r8-tagged-grouped', ['wiki'], k=30)
    plt.savefig('plot-w2v_learning_curve_r2-k30.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
In [10]:
with sns.color_palette("cubehelix", 4):
    plot('reuters21578/r8-tagged-grouped', ['wiki'], k=60)
    plt.savefig('plot-w2v_learning_curve_r2-k60.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
In [5]:
plt.rcParams['legend.fontsize'] = 16
plt.rcParams['axes.labelsize'] = 16

# do main plot; df is the dataframe returned by plot() above (e.g. df1 from the Amazon cell)
df['Algorithm'] = df.Composer
g = sns.lmplot('percent', 'Accuracy', df, hue='Algorithm', lowess=True,
               line_kws={"linewidth": 4},  # line size
               legend=False,  # hide legend for now, I'll show it later
               aspect=1.5);
# convert percent of corpus to a rough token count in billions (wiki has ~1.5B tokens)
labels = np.arange(0, 101, 20) * 0.01 * 1.5
g.set(xlim=(0, 100), xticklabels=labels, xlabel='Token count, B')
# add random baseline
plt.axhline(random_vect_baseline(), color='k', label='RandV');
# add a legend with extra bits and bobs
plt.legend(markerscale=3,
           # bbox_to_anchor=(0., 1.02, 1., .102),
           bbox_to_anchor=(1., .5),
           loc=3,
           ncol=1,
           borderaxespad=0.)
g.savefig('plot-w2v_learning_curve_amazon2.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
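seaborn computes the lowess=True fit via statsmodels; to read the smoothed accuracies off directly rather than from the plot, something like the following should work. This is a sketch under the assumption that statsmodels is installed and df is the same dataframe as in the cell above.
from statsmodels.nonparametric.smoothers_lowess import lowess
for algo_name, grp in df.groupby('Algorithm'):
    fitted = lowess(grp.Accuracy.values, grp.percent.values)  # sorted (percent, smoothed accuracy) pairs
    print(algo_name, 'fitted accuracy at %g%% of the corpus: %.3f' % (fitted[-1][0], fitted[-1][1]))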
In earlier iterations of this experiment we used Reuters data, where there was large variation between the accuracies of multiple runs of word2vec. The notes below look into why that might be the case. These differences no longer exist, so the items below are probably no longer relevant.
Things to look at:
- Compare vectors between multiple runs (before and after composition). The spaces produced by different runs are completely different, so comparing vectors directly makes no sense.
- Could the whole variation simply be down to the small size of Reuters? Would using Amazon help?
- Performance of the w2v models I've trained on Google's analogy task.
- If the neighbours returned by different word2vec runs vary so much, why is the performance of some composers so stable? (A sketch of one way to measure this follows below.)
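One way to quantify the last point is to measure how much the k nearest neighbours of each word agree between two word2vec runs. Below is a minimal sketch with plain numpy; the two {word: vector} dictionaries and the helper names are hypothetical, not part of the existing pipeline.
import numpy as np

def nearest_neighbours(word, vectors, k=10):
    """Return the k words whose vectors are closest (by cosine) to `word`."""
    target = vectors[word]
    sims = {w: np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
            for w, v in vectors.items() if w != word}
    return set(sorted(sims, key=sims.get, reverse=True)[:k])

def neighbour_overlap(vectors_a, vectors_b, k=10):
    """Mean Jaccard overlap of the k-NN sets for words present in both runs."""
    shared = set(vectors_a) & set(vectors_b)
    overlaps = []
    for w in shared:
        na = nearest_neighbours(w, vectors_a, k)
        nb = nearest_neighbours(w, vectors_b, k)
        overlaps.append(len(na & nb) / len(na | nb))
    return float(np.mean(overlaps)) if overlaps else 0.0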