In [1]:
%cd ~/NetBeansProjects/ExpLosion/
from notebooks.common_imports import *


/Users/miroslavbatchkarov/NetBeansProjects/ExpLosion

In [2]:
# reported accuracies: Add=50%, Mult=57%, SOTA=81.6% (Turney's holistic model)
df = pd.read_csv('../thesisgenerator/intrinsic_turney_phraselevel.csv')

In [3]:
df.head()


Out[3]:
   Unnamed: 0      unigrams composer  coverage     kind  accuracy  folds
0           0  w2v-giga-100      Add  0.788991   strict         0      0
1           1  w2v-giga-100      Add  0.272018  relaxed         0      0
2           2  w2v-giga-100      Add  0.788991   strict         0      1
3           3  w2v-giga-100      Add  0.272018  relaxed         0      1
4           4  w2v-giga-100      Add  0.788991   strict         0      2

In [4]:
# bar chart of accuracy per composer, one panel per coverage kind
g = sns.factorplot(data=df, x='unigrams', hue='composer', col='kind',
                   y='accuracy', kind='bar', aspect=1.5)
g.set_xticklabels(rotation=30)
sns.despine(left=True)
plt.savefig('plot-intrinsic-turney.pdf', format='pdf', dpi=300,
            bbox_inches='tight', pad_inches=0.1)



In [5]:
# coverage by kind (strict/relaxed), one panel per unigram model
sns.factorplot(data=df, y='coverage', x='kind', col='unigrams', kind='bar')


Out[5]:
<seaborn.axisgrid.FacetGrid at 0x10b09a050>

In [6]:
# mean coverage per unigram model, composer and kind
df.groupby(['unigrams', 'composer', 'kind']).coverage.mean()


Out[6]:
unigrams      composer  kind   
w2v-giga-100  Add       relaxed    0.272018
                        strict     0.788991
              Avg       relaxed    0.272018
                        strict     0.788991
              Mult      relaxed    0.272018
                        strict     0.788991
              Right     relaxed    0.303211
                        strict     0.923853
w2v-wiki-100  Add       relaxed    0.781193
                        strict     0.982569
              Avg       relaxed    0.781193
                        strict     0.982569
              Mult      relaxed    0.781193
                        strict     0.982569
              Right     relaxed    0.788532
                        strict     0.994954
w2v-wiki-15   Add       relaxed    0.524771
                        strict     0.937615
              Avg       relaxed    0.524771
                        strict     0.937615
              Mult      relaxed    0.524771
                        strict     0.937615
              Right     relaxed    0.544037
                        strict     0.983486
Name: coverage, dtype: float64

Observations

  • the task is messy: many questions cannot be answered by an average native speaker
  • need to count how many times we lack a vector for the gold unigram: if we don't have one, we can never return it as a neighbour (see the sketch after this list)
    • without this check, models attempt >95% of questions; with it, only 30%-75%
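
A minimal sketch of that count, i.e. of the strict/relaxed coverage figures above, assuming `questions` is a list of (phrase_words, gold_unigram) pairs and `vectors` is any dict-like word-to-vector store; both are made-up names, not the real thesisgenerator API, and which fraction maps onto the 'strict'/'relaxed' labels in the CSV is my guess.

def coverage(questions, vectors):
    # a question can be attempted iff all phrase constituents have vectors
    has_phrase = [all(w in vectors for w in words)
                  for words, gold in questions]
    # ...and answered correctly only if the gold unigram has one too
    has_gold = [gold in vectors for _, gold in questions]
    n = float(len(questions))  # float for py2-style division
    attemptable = sum(has_phrase) / n
    answerable = sum(p and g for p, g in zip(has_phrase, has_gold)) / n
    return attemptable, answerable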

The task is set up to test whether models get by on lexical overlap, i.e. by returning a constituent of the query phrase as the answer. I am not disallowing that here, as it would be slightly unfair to the models, but then the results of this test will not be comparable with my DC results. Should run both versions; the overlap filter is sketched below.
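
A sketch of that disallowed-overlap variant: drop candidate neighbours that share a word form with the query phrase, so a model cannot score by returning one of the phrase's own constituents. `candidates` and `phrase_words` are illustrative names.

def filter_overlap(candidates, phrase_words):
    # e.g. 'dog' stops being a legal answer for 'dog house'
    banned = {w.lower() for w in phrase_words}
    return [c for c in candidates if c.lower() not in banned]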

See §4.2 of Turney (2012), esp. eqs. 30 and 31 and Table 15, for an explanation of how he runs the evaluation.
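
For reference, a rough paraphrase of one evaluation step as I read it, not Turney's exact scoring (that is what eqs. 30/31 define): compose the constituent vectors, rank the candidate unigrams by cosine similarity to the composed vector, and count a hit if the gold unigram ranks first. All names here are assumed.

import numpy as np
from functools import reduce

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_correct(phrase_words, gold, candidates, vectors, compose=np.add):
    # compose the phrase, e.g. by vector addition
    phrase_vec = reduce(compose, (vectors[w] for w in phrase_words))
    # pick the candidate unigram closest to the composed phrase vector
    best = max(candidates, key=lambda c: cosine(vectors[c], phrase_vec))
    return best == gold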


In [ ]: