Bruni, Tran and Baroni (multimodal distributional semantics):
RG-65 (state of the art $\rho=0.89$): http://aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)
Add noise as usual, evaluate intrinsically.
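In outline, the evaluation computes Spearman's $\rho$ between the model's cosine similarities and human judgements. A minimal sketch of the protocol, assuming additive zero-mean uniform noise and a relaxed treatment of missing words (the function and variable names are mine, not those of intrinsic_eval_words.py):

import numpy as np
from scipy.stats import spearmanr

def evaluate_with_noise(vectors, test_pairs, noise=0.0, seed=0):
    # vectors: dict mapping a word to a numpy array
    # test_pairs: (word1, word2, human_score) triples from e.g. MEN or RG
    rng = np.random.RandomState(seed)
    model_scores, human_scores = [], []
    for w1, w2, human in test_pairs:
        if w1 not in vectors or w2 not in vectors:
            continue  # relaxed treatment: skip pairs with a missing word
        # assumed noise model: zero-mean uniform noise added to each dimension
        v1 = vectors[w1] + rng.uniform(-noise, noise, size=vectors[w1].shape)
        v2 = vectors[w2] + rng.uniform(-noise, noise, size=vectors[w2].shape)
        cos = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_scores.append(cos)
        human_scores.append(human)
    return spearmanr(model_scores, human_scores)  # (rho, p-value)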
Vectors trained on less data will have lower coverage of types, so they may not be able to provide an answer for all word pairs in the test sets. I handle this in two ways: a strict score, where a pair containing a missing word counts against the model, and a relaxed score, where such pairs are skipped.
Words in three of the four datasets are provided without PoS tags (only MEN provides them). In my work cat/J, cat/N and cat/V have different vectors, so before evaluation I need to map cat to one of these tagged versions. I use the first PoS tag for which a vector is found, in the order J, N, V (a sketch of this fallback follows below). An alternative would have been to ignore test-set words that map to multiple PoS tags.
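A minimal sketch of that fallback; the helper name and the dict-like interface of vectors are assumptions for illustration:

def map_to_tagged(word, vectors, tag_order='JNV'):
    # test sets give bare words (cat), vectors are keyed by tagged
    # entries (cat/J, cat/N, cat/V); return the first tagged variant
    # the model covers, trying the tags in the given order
    for tag in tag_order:
        candidate = '{}/{}'.format(word, tag)
        if candidate in vectors:
            return candidate
    return None  # the word is not covered at any PoS

For example, map_to_tagged('cat', vectors) returns 'cat/N' when the model has no cat/J entry.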
WS353 data? Run thesisgenerator/scripts/intrinsic_eval_words.py first and make sure the results are in the right place.
In [1]:
%cd ~/NetBeansProjects/ExpLosion/
from notebooks.common_imports import *
# swap in a custom bootstrap (my_bootstrap comes from common_imports)
sns.timeseries.algo.bootstrap = my_bootstrap
sns.categorical.bootstrap = my_bootstrap
In [2]:
# word-level intrinsic evaluation results under noise
noise_df = pd.read_csv('../thesisgenerator/intrinsic_noise_word_level.csv', index_col=0)
noise_df['Dataset'] = noise_df.test.map(str.upper)  # men -> MEN, etc.
noise_df = noise_df.drop('test', axis=1)
noise_df.head()
Out[2]:
In [3]:
with sns.axes_style('whitegrid'):
    g = sns.factorplot(x='noise', hue='vect', col='Dataset', y='corr', col_wrap=2,
                       data=noise_df[noise_df.kind == 'strict'], kind='point',
                       x_order=sorted(noise_df.noise.unique()), aspect=1.5,
                       col_order=['MEN', 'WS353', 'RG', 'MC'])
    g.set_ylabels('Spearman $\\rho$')
    sns.despine(left=True, bottom=True)
    for ax in g.axes.flat:
        sparsify_axis_labels(ax)
        ax.axhline(0, c='k')
In [4]:
# same strict scores as above, drawn as curves and saved for the thesis
with sns.color_palette("cubehelix", 4):
    g = sns.FacetGrid(noise_df[noise_df.kind == 'strict'], col='Dataset', col_wrap=2,
                      col_order=['MEN', 'WS353', 'RG', 'MC'])
    g.map_dataframe(tsplot_for_facetgrid, time='noise', value='corr', condition='vect',
                    unit='folds', ci=68).add_legend()
    for ax in g.axes.flat:
        sparsify_axis_labels(ax)
    g.set_ylabels('Spearman $\\rho$')
    g.set_xlabels('Noise')
    plt.savefig('plot-intrinsic-noise.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
In [5]:
with sns.axes_style('whitegrid'):
    g = sns.factorplot(x='noise', col='vect', hue='Dataset', y='pval',
                       data=noise_df[noise_df.kind == 'relaxed'],
                       kind='point',
                       x_order=sorted(noise_df.noise.unique()))
    sns.despine(left=True)
    for ax in g.axes.flat:
        sparsify_axis_labels(ax)
        ax.set_ylim(-0.05, 1)
In [6]:
g = sns.FacetGrid(noise_df[noise_df.kind == 'relaxed'], col='vect')
g.map_dataframe(tsplot_for_facetgrid, time='noise', value='pval',
                condition='Dataset', unit='folds', ci=0.1).add_legend()
for ax in g.axes.flat:
    sparsify_axis_labels(ax)
    ax.set_title('{1} {2}%'.format(*ax.title._text.split('-')))
In [7]:
# version of the above without percentiles
noise_df2 = noise_df[noise_df.kind == 'relaxed'].groupby(['Dataset', 'vect', 'noise']).mean().reset_index()
with sns.color_palette("cubehelix", 4):
    g = sns.FacetGrid(noise_df2, col='vect')
    g.map_dataframe(tsplot_for_facetgrid, time='noise', value='pval',
                    condition='Dataset', unit='folds', ci=0.1).add_legend()
    for ax in g.axes.flat:
        sparsify_axis_labels(ax)
        _, corpus, percent = ax.title._text.split('-')
        ax.set_title('{} {}%'.format(corpus.title(), percent), fontsize=18)
    g.set_ylabels('P-value')
    g.set_xlabels('Noise')
    plt.savefig('plot-intrinsic-pvals.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
In [8]:
curve_df = pd.read_csv('../thesisgenerator/intrinsic_learning_curve_word_level.csv')
with sns.axes_style('whitegrid'):
    sns.factorplot(data=curve_df, col='test', x='percent', y='corr',
                   hue='kind', col_wrap=2, col_order=['men', 'ws353', 'rg', 'mc'])
In [9]:
g = sns.FacetGrid(curve_df, col='test', col_wrap=2,
                  col_order=['men', 'ws353', 'rg', 'mc'])
g.map_dataframe(tsplot_for_facetgrid, time='percent', value='corr', condition='kind',
                unit='folds', ci=68).add_legend();
plt.savefig('plot-intrinsic-learning-curve.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
None of the intrinsic tests can tell wiki-15 and wiki-100 apart, regardless of dataset size.
I thought this might be because I was using the relaxed score, but the difference between relaxed and strict is generally small. Such a difference only arises when a model's coverage of the test words is poor, i.e. when the unlabelled data is very limited. This is not a real issue (see below).
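To make the strict/relaxed relationship concrete, a toy illustration; the assumption that the strict score falls back to a similarity of 0 for uncovered pairs is mine, and the actual scripts may penalise missing pairs differently:

import numpy as np
from scipy.stats import spearmanr

def strict_and_relaxed(model_sims, human_sims, covered):
    # covered: boolean mask, True where both words of a pair have vectors
    model_sims = np.asarray(model_sims, dtype=float)
    human_sims = np.asarray(human_sims, dtype=float)
    covered = np.asarray(covered, dtype=bool)
    # assumed strict rule: missing pairs default to a similarity of 0
    strict_rho, _ = spearmanr(np.where(covered, model_sims, 0.0), human_sims)
    # relaxed rule: missing pairs are dropped before correlating
    relaxed_rho, _ = spearmanr(model_sims[covered], human_sims[covered])
    return strict_rho, relaxed_rho

With full coverage the two scores coincide exactly, so any gap between them is driven entirely by the few defaulted pairs.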
In [10]:
# coverage check: how many test words are missing at each unlabelled-data size
g = sns.factorplot(y='missing', x='percent', hue='test', data=curve_df, aspect=2);
In [11]:
rep_df = pd.read_csv('../thesisgenerator/intrinsic_w2v_repeats_word_level.csv')
In [ ]:
rep_df.head()
In [ ]:
sns.factorplot(data=rep_df, x='rep_id', y='corr', col='test', kind='bar')
In [ ]:
# compare three word2vec runs on 15% of wikipedia to their average
from discoutils.thesaurus_loader import Vectors as V
from thesisgenerator.plugins.multivectors import MultiVectors

prefix = 'lustre/scratch/inf/mmb28/FeatureExtractionToolkit/word2vec_vectors/'
pattern = os.path.join(prefix, 'word2vec-wiki-15perc.unigr.strings.rep%d')
rep_vectors = [V.from_tsv(pattern % i) for i in [0, 1, 2]]  # three independent repeats
avg_vectors = [V.from_tsv(os.path.join(prefix, 'word2vec-wiki-15perc.unigr.strings.avg3'))]
mv = [MultiVectors(tuple(rep_vectors))]  # treat the repeats as a single model
In [ ]:
mv[0].get_nearest_neighbours('love/N')