Bruni, Tran and Baroni (multimodal distributional semantics):
RG-65 (state of the art $\rho=0.89$): http://aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)
Add noise as usual, evaluate intrinsically.
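In outline, the evaluation computes Spearman's $\rho$ between the model's cosine similarities and human judgements. A minimal sketch of the protocol, assuming additive zero-mean uniform noise and a relaxed treatment of missing words (the function and variable names are mine, not those of intrinsic_eval_words.py):

import numpy as np
from scipy.stats import spearmanr

def evaluate_with_noise(vectors, test_pairs, noise=0.0, seed=0):
    # vectors: dict mapping a word to a numpy array
    # test_pairs: (word1, word2, human_score) triples from e.g. MEN or RG
    rng = np.random.RandomState(seed)
    model_scores, human_scores = [], []
    for w1, w2, human in test_pairs:
        if w1 not in vectors or w2 not in vectors:
            continue  # relaxed treatment: skip pairs with a missing word
        # assumed noise model: zero-mean uniform noise added to each dimension
        v1 = vectors[w1] + rng.uniform(-noise, noise, size=vectors[w1].shape)
        v2 = vectors[w2] + rng.uniform(-noise, noise, size=vectors[w2].shape)
        cos = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_scores.append(cos)
        human_scores.append(human)
    return spearmanr(model_scores, human_scores)  # (rho, p-value)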
Vectors trained on less data will have lower coverage of types, so they may not be able to provide an answer for all word pairs in the test sets. I handle this in two ways: a strict score, where a pair containing a missing word counts against the model, and a relaxed score, where such pairs are skipped.
Words in three of the four datasets are provided without PoS tags (only MEN provides them). In my work cat/J, cat/N and cat/V have different vectors, so before evaluation I need to map cat to one of these tagged versions. I use the first PoS tag for which a vector is found, in the order J, N, V (a sketch of this fallback follows below). An alternative would have been to ignore test-set words that map to multiple PoS tags.
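A minimal sketch of that fallback; the helper name and the dict-like interface of vectors are assumptions for illustration:

def map_to_tagged(word, vectors, tag_order='JNV'):
    # test sets give bare words (cat), vectors are keyed by tagged
    # entries (cat/J, cat/N, cat/V); return the first tagged variant
    # the model covers, trying the tags in the given order
    for tag in tag_order:
        candidate = '{}/{}'.format(word, tag)
        if candidate in vectors:
            return candidate
    return None  # the word is not covered at any PoS

For example, map_to_tagged('cat', vectors) returns 'cat/N' when the model has no cat/J entry.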
WS353 data? Run thesisgenerator/scripts/intrinsic_eval_words.py first and make sure the results are in the right place.
In [1]:
%cd ~/NetBeansProjects/ExpLosion/
from notebooks.common_imports import *
# swap in a custom bootstrap (my_bootstrap comes from common_imports)
sns.timeseries.algo.bootstrap = my_bootstrap
sns.categorical.bootstrap = my_bootstrap
In [2]:
# word-level intrinsic evaluation results under noise
noise_df = pd.read_csv('../thesisgenerator/intrinsic_noise_word_level.csv', index_col=0)
noise_df['Dataset'] = noise_df.test.map(str.upper)  # men -> MEN, etc.
noise_df = noise_df.drop('test', axis=1)
noise_df.head()
Out[2]:
In [3]:
with sns.axes_style('whitegrid'):
    g = sns.factorplot(x='noise', hue='vect', col='Dataset', y='corr', col_wrap=2,
                       data=noise_df[noise_df.kind == 'strict'], kind='point',
                       x_order=sorted(noise_df.noise.unique()), aspect=1.5,
                       col_order=['MEN', 'WS353', 'RG', 'MC'])
    g.set_ylabels('Spearman $\\rho$')
    sns.despine(left=True, bottom=True)
    for ax in g.axes.flat:
        sparsify_axis_labels(ax)
        ax.axhline(0, c='k')
In [4]:
# same strict scores as above, drawn as curves and saved for the thesis
with sns.color_palette("cubehelix", 4):
    g = sns.FacetGrid(noise_df[noise_df.kind == 'strict'], col='Dataset', col_wrap=2,
                      col_order=['MEN', 'WS353', 'RG', 'MC'])
    g.map_dataframe(tsplot_for_facetgrid, time='noise', value='corr', condition='vect',
                    unit='folds', ci=68).add_legend()
    for ax in g.axes.flat:
        sparsify_axis_labels(ax)
    g.set_ylabels('Spearman $\\rho$')
    g.set_xlabels('Noise')
    plt.savefig('plot-intrinsic-noise.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
In [5]:
with sns.axes_style('whitegrid'):
    g = sns.factorplot(x='noise', col='vect', hue='Dataset', y='pval',
                       data=noise_df[noise_df.kind == 'relaxed'],
                       kind='point',
                       x_order=sorted(noise_df.noise.unique()))
    sns.despine(left=True)
    for ax in g.axes.flat:
        sparsify_axis_labels(ax)
        ax.set_ylim(-0.05, 1)
In [6]:
g = sns.FacetGrid(noise_df[noise_df.kind == 'relaxed'], col='vect')
g.map_dataframe(tsplot_for_facetgrid, time='noise', value='pval',
                condition='Dataset', unit='folds', ci=0.1).add_legend()
for ax in g.axes.flat:
    sparsify_axis_labels(ax)
    ax.set_title('{1} {2}%'.format(*ax.title._text.split('-')))
In [7]:
# version of the above without percentiles
noise_df2 = noise_df[noise_df.kind == 'relaxed'].groupby(['Dataset', 'vect', 'noise']).mean().reset_index()
with sns.color_palette("cubehelix", 4):
    g = sns.FacetGrid(noise_df2, col='vect')
    g.map_dataframe(tsplot_for_facetgrid, time='noise', value='pval',
                    condition='Dataset', unit='folds', ci=0.1).add_legend()
    for ax in g.axes.flat:
        sparsify_axis_labels(ax)
        _, corpus, percent = ax.title._text.split('-')
        ax.set_title('{} {}%'.format(corpus.title(), percent), fontsize=18)
    g.set_ylabels('P-value')
    g.set_xlabels('Noise')
    plt.savefig('plot-intrinsic-pvals.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
In [8]:
curve_df = pd.read_csv('../thesisgenerator/intrinsic_learning_curve_word_level.csv')
with sns.axes_style('whitegrid'):
    sns.factorplot(data=curve_df, col='test', x='percent', y='corr',
                   hue='kind', col_wrap=2, col_order=['men', 'ws353', 'rg', 'mc'])
In [9]:
g = sns.FacetGrid(curve_df, col='test', col_wrap=2,
                  col_order=['men', 'ws353', 'rg', 'mc'])
g.map_dataframe(tsplot_for_facetgrid, time='percent', value='corr', condition='kind',
                unit='folds', ci=68).add_legend();
plt.savefig('plot-intrinsic-learning-curve.pdf', format='pdf', dpi=300, bbox_inches='tight', pad_inches=0.1)
None of the intrinsic tests can tell wiki-15 and wiki-100 apart, regardless of dataset size.
I thought this might be because I was using the relaxed score, but the difference between relaxed and strict is generally small. Such a difference only arises when a model's coverage of the test words is poor, i.e. when the unlabelled data is very limited. This is not a real issue (see below).
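To make the strict/relaxed relationship concrete, a toy illustration; the assumption that the strict score falls back to a similarity of 0 for uncovered pairs is mine, and the actual scripts may penalise missing pairs differently:

import numpy as np
from scipy.stats import spearmanr

def strict_and_relaxed(model_sims, human_sims, covered):
    # covered: boolean mask, True where both words of a pair have vectors
    model_sims = np.asarray(model_sims, dtype=float)
    human_sims = np.asarray(human_sims, dtype=float)
    covered = np.asarray(covered, dtype=bool)
    # assumed strict rule: missing pairs default to a similarity of 0
    strict_rho, _ = spearmanr(np.where(covered, model_sims, 0.0), human_sims)
    # relaxed rule: missing pairs are dropped before correlating
    relaxed_rho, _ = spearmanr(model_sims[covered], human_sims[covered])
    return strict_rho, relaxed_rho

With full coverage the two scores coincide exactly, so any gap between them is driven entirely by the few defaulted pairs.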
In [10]:
# coverage check: how many test words are missing at each unlabelled-data size
g = sns.factorplot(y='missing', x='percent', hue='test', data=curve_df, aspect=2);
In [11]:
rep_df = pd.read_csv('../thesisgenerator/intrinsic_w2v_repeats_word_level.csv')
In [ ]:
rep_df.head()
In [ ]:
sns.factorplot(data=rep_df, x='rep_id', y='corr', col='test', kind='bar')
In [ ]:
# compare three word2vec runs on 15% of wikipedia to their average
from discoutils.thesaurus_loader import Vectors as V
from thesisgenerator.plugins.multivectors import MultiVectors

prefix = 'lustre/scratch/inf/mmb28/FeatureExtractionToolkit/word2vec_vectors/'
pattern = os.path.join(prefix, 'word2vec-wiki-15perc.unigr.strings.rep%d')
rep_vectors = [V.from_tsv(pattern % i) for i in [0, 1, 2]]  # three independent repeats
avg_vectors = [V.from_tsv(os.path.join(prefix, 'word2vec-wiki-15perc.unigr.strings.avg3'))]
mv = [MultiVectors(tuple(rep_vectors))]  # treat the repeats as a single model
In [ ]:
mv[0].get_nearest_neighbours('love/N')