QVEC

This notebook is a replication of Tsvetkov et al. (2015), "Evaluation of Word Vector Representations by Subspace Alignment", which introduces QVEC. QVEC is an intrinsic evaluation method for word embeddings that measures the correlation between dimensions of the embeddings and linguistic features. The original code is available, but I'm replicating it for two reasons: (i) as a learning exercise and (ii) because the original implementation looks messy.

To implement QVEC, I'm going to need two things:

1. Gold standard linguistic features
2. The QVEC model

The linguistic features used in the original paper come from SemCor, a WordNet-annotated subset of the Brown corpus. The extraction is done separately; the cell below just loads the resulting CSV.



In [1]:

    
import os
import csv
import pandas as pd
import numpy as np
from scipy import stats



In [2]:

    
data_path = '../../data'
tmp_path = '../../tmp'



In [3]:

    
feature_path = os.path.join(data_path, 'evaluation/semcor/tsvetkov_semcor.csv')
subset = pd.read_csv(feature_path, index_col=0)
subset.columns = [c.replace('semcor.', '') for c in subset.columns]
subset.head()









    Out[3]:







  
    
      
	noun.Tops	noun.act	noun.animal	noun.artifact	noun.attribute	noun.body	noun.cognition	noun.communication	noun.event	noun.feeling	...	verb.contact	verb.creation	verb.emotion	verb.motion	verb.perception	verb.possession	verb.social	verb.stative	verb.weather	words
0	0.0	0.0	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.000000	0.0	0.0	0.0	0
1	0.0	0.0	0.0	0.0	0.000000	0.0	0.000000	0.272727	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.000000	0.0	0.0	0.0	a
2	0.0	0.0	0.0	0.0	0.035714	0.0	0.000000	0.000000	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.571429	0.0	0.0	0.0	abandon
3	0.0	0.0	0.0	0.0	0.703704	0.0	0.296296	0.000000	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.000000	0.0	0.0	0.0	ability
4	0.0	0.0	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.000000	1.0	0.0	0.0	abolish
    
  

5 rows × 42 columns

QVEC model

QVEC finds an alignment between the dimensions of learnt word embeddings and a set of linguistic features by maximising the cumulative correlation.

$N$ is the size of the vocabulary (in common between the embeddings and the linguistic features).

$D$ is the dimensionality of the embeddings.

$X \in \mathbb{R}^{D \times N}$ is the matrix of embeddings. Note that a word's embedding is a column, rows are individual dimensions.

$P$ is the number of linguistic features.

$S \in \mathbb{R}^{P \times N}$ is the matrix of linguistic features, created above. Again, each word is a column and rows are individual features.

QVEC finds an alignment between the rows of $X$ and the rows of $S$ that maximises the correlation between the aligned rows. Each row of $X$ is aligned to at most one row of $S$, but each row of $S$ may be aligned to more than one row of $X$.

$A \in \{0,1\}^{D \times P}$ holds the alignments. $a_{ij}$ is 1 if dimension $i$ of $X$ is aligned with linguistic feature $j$, and 0 otherwise.

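The sum of correlations is the authors' measure of the quality of the word embeddings in $X$. Each correlation is at most 1 and each dimension is aligned at most once, so the sum is bounded by $D$: it can keep growing as dimensions are added, and adding features can only raise it, never lower it.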

$QVEC = \max_{A|\sum_{j}a_{ij} \leq 1}\sum_{i=1}^{D}\sum_{j=1}^{P}r(x_i, s_j) \times a_{ij}$

In words, for any possible alignment $A$, subject to the constraint that each embedding dimension is aligned to 0 or 1 linguistic features, sum up the correlations. The sum for the best alignment is the measure of the embeddings.
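Because the constraint applies to each embedding dimension separately, the maximisation decomposes: every dimension simply picks its best-correlated feature, or stays unaligned if all its correlations are negative. Here is a minimal NumPy sketch of that reading on made-up data. The function name `qvec_score` and the toy matrices are illustrative assumptions, not code from the paper. Clipping at zero follows the "0 or 1 features" constraint literally; taking the plain row maximum gives the same number whenever every dimension has at least one positive correlation.

```python
import numpy as np

def qvec_score(X, S):
    """QVEC for X (D x N) and S (P x N): sum over dimensions of the best correlation."""
    D = X.shape[0]
    # np.corrcoef treats each row as a variable; stack X on S and keep the cross block
    R = np.corrcoef(np.vstack([X, S]))[:D, D:]  # shape (D, P)
    best = R.max(axis=1)
    # an unaligned dimension contributes 0 rather than a negative correlation
    return np.maximum(best, 0.0).sum(), R.argmax(axis=1)

rng = np.random.default_rng(0)
S = rng.random((3, 200))              # 3 toy features over 200 "words"
X = np.vstack([
    2.0 * S[1] + 1.0,                 # perfectly correlated with feature 1
    -S[0],                            # perfectly anti-correlated with feature 0
    rng.normal(size=200),             # pure noise
])
score, aligned = qvec_score(X, S)
print(score, aligned)
```

The first dimension contributes a correlation of 1 and is aligned to feature 1; the other two contribute whatever small positive correlation they happen to have with some feature, or nothing.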

Crucially, this assumes that the dimensions of the embeddings end up encoding linguistic features. The authors justify this by the effectiveness of using word embeddings in linear models in downstream tasks.

First things first, transform my linguistic features into the format mentioned above (i.e., the matrix $S$).



In [6]:

    
subset.set_index('words', inplace=True)
#subset.drop('count_in_semcor', inplace=True, axis=1)
subset = subset.T



In [9]:

    
subset.head()









    Out[9]:







  
    
words	0	a	abandon	ability	abolish	absence	absorb	absorption	abstract	abstraction	...	yes	yesterday	yield	yokuts	york	young	youth	yr	zero	zinc
noun.Tops	0.0	0.0	0.000000	0.000000	0.0	0.00	0.0	0.0	0.000000	0.000000	...	0.0	0.0	0.00	0.0	0.0	0.000000	0.00	0.0	0.0	0.0
noun.act	0.0	0.0	0.000000	0.000000	0.0	0.25	0.0	0.0	0.000000	0.250000	...	0.0	0.0	0.00	0.0	0.0	0.000000	0.00	0.0	0.0	0.0
noun.animal	0.0	0.0	0.000000	0.000000	0.0	0.00	0.0	0.0	0.000000	0.000000	...	0.0	0.0	0.00	0.0	0.0	0.304348	0.00	0.0	0.0	0.0
noun.artifact	0.0	0.0	0.000000	0.000000	0.0	0.00	0.0	0.0	0.272727	0.166667	...	0.0	0.0	0.16	0.0	0.0	0.000000	0.00	0.0	0.0	0.0
noun.attribute	0.0	0.0	0.035714	0.703704	0.0	0.00	0.0	0.0	0.000000	0.000000	...	0.0	0.0	0.00	0.0	0.0	0.000000	0.08	0.0	0.0	0.0
    
  

5 rows × 4199 columns

Learnt word embeddings

The original paper trains several models of varying sizes. At a later stage I could do that too, but for now I'm happy to use pre-trained embeddings.



In [4]:

    
size = 50
fname = 'embeddings/glove.6B.{}d.txt'.format(size)
embedding_path = os.path.join(data_path, fname)
embeddings = pd.read_csv(embedding_path, sep=' ', header=None, index_col=0, quoting=csv.QUOTE_NONE).T
embeddings.head()









    Out[4]:







  
    
      
	the	,	.	of	to	and	in	a	"	's	...	sigarms	katuna	aqm	1.3775	corythosaurus	chanty	kronik	rolonda	zsombor	sandberger
1	0.41800	0.013441	0.15164	0.70853	0.680470	0.268180	0.330420	0.21705	0.25769	0.23727	...	-0.743970	-0.30016	-1.11670	-0.24171	-0.042672	0.232040	-0.60921	-0.511810	-0.75898	0.072617
2	0.24968	0.236820	0.30177	0.57088	-0.039263	0.143460	0.249950	0.46515	0.45629	0.40478	...	0.082164	-0.80268	0.14057	-0.23367	-0.088106	0.025672	-0.67218	0.058706	-0.47426	-0.513930
3	-0.41242	-0.168990	-0.16763	-0.47160	0.301860	-0.278770	-0.608740	-0.46757	-0.76974	-0.20547	...	-0.009147	-0.46637	0.36302	0.10672	-0.317240	-0.706990	0.23521	1.091300	0.47370	0.472800
4	0.12170	0.409510	0.17684	0.18048	-0.177920	0.016257	0.109230	0.10082	-0.37679	0.58805	...	0.412900	-0.29822	-0.13836	-1.60230	-0.252090	-0.045465	-0.11195	-0.551630	0.77250	-0.522020
5	0.34527	0.638120	0.31719	0.54449	0.429620	0.113840	0.036372	1.01350	0.59272	0.65533	...	-0.422550	-1.03200	-1.47970	0.12440	-0.268510	0.139890	-0.46094	-0.102490	-0.78064	-0.355340
    
  

5 rows × 400000 columns



In [8]:

    
common_words = embeddings.columns.intersection(subset.columns)
embeddings = embeddings[common_words]
fname = os.path.join(tmp_path, 'glove_embeddings.csv')
embeddings.to_csv(fname)
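The vocabulary intersection is what makes the columns of $S$ and $X$ line up word for word. A small pandas sketch of the same step on toy frames follows; the frames `features` and `vectors` and their contents are made up for illustration.

```python
import pandas as pd

# words are columns in both frames, as in the notebook
features = pd.DataFrame(
    {"cat": [1.0, 0.0], "run": [0.0, 1.0], "zinc": [1.0, 0.0]},
    index=["noun.animal", "verb.motion"],
)
vectors = pd.DataFrame(
    {"the": [0.1, 0.2], "run": [0.3, -0.4], "cat": [0.5, 0.6]},
    index=[1, 2],
)

# keep only the words present in both, and select them in the same order
common = vectors.columns.intersection(features.columns)
S_toy = features[common]
X_toy = vectors[common]
print(list(common))
```

Selecting both frames with the same `common` index is what guarantees that column $k$ of `S_toy` and column $k$ of `X_toy` refer to the same word.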



In [9]:

    
from sklearn.metrics.pairwise import cosine_similarity as cos



In [11]:

    
pairwise = cos(embeddings.T)



In [20]:

    
distances = pd.DataFrame(pairwise, columns=common_words, index=common_words)
distances.to_csv(os.path.join(data_path, 'pairwise_sim.csv'))
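For reference, pairwise cosine similarity is just a matrix product of the length-normalised vectors. A minimal NumPy sketch follows; the helper name `cosine_similarity_matrix` is an assumption for illustration, and it assumes no row is all zeros. Note that these values are similarities (1 means same direction), even though the variable above is called `distances`.

```python
import numpy as np

def cosine_similarity_matrix(M):
    """Rows of M are vectors; returns all pairwise cosine similarities."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / norms  # every row now has length 1
    return unit @ unit.T

M = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
sims = cosine_similarity_matrix(M)
print(sims.round(3))
```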

The Python variables S and X refer to $S$ and $X$ exactly as above.



In [11]:

    
S = subset[common_words]
X = embeddings[common_words]

Now we want the correlation between the rows of S and the rows of X. This may not be the easiest way to do it but it works.



In [12]:

    
correlations = pd.DataFrame({i:X.corrwith(S.iloc[i], axis=1) for i in range(len(S))})
correlations.columns = S.index
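An alternative to looping over features is to standardise every row and take a single matrix product. The sketch below does this in NumPy on random data and checks it against `np.corrcoef`. The function name `row_correlations` is hypothetical, and it assumes no row is constant (a feature that is zero for every common word would give a division by zero here).

```python
import numpy as np

def row_correlations(X, S):
    """Pearson correlation between every row of X (D x N) and every row of S (P x N)."""
    Xz = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    Sz = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
    # the mean of products of z-scores is the Pearson correlation
    return Xz @ Sz.T / X.shape[1]

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 50))
S = rng.random((3, 50))
R = row_correlations(X, S)

# the same numbers via np.corrcoef, keeping only the X-versus-S block
reference = np.corrcoef(np.vstack([X, S]))[:4, 4:]
print(np.allclose(R, reference))
```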

For each row of this correlation matrix (i.e. for each of the dimensions of the embeddings), we want the linguistic feature that it is most correlated with. We also get the value of that correlation.



In [13]:

    
alignments = correlations.idxmax(axis=1)
correlations.max(axis=1).head()









    Out[13]:





1    0.073672
2    0.114031
3    0.128516
4    0.103564
5    0.207579
dtype: float64

The score of the embeddings relative to the linguistic features is the sum of the maximum correlations. Note that this value depends on the number of dimensions in the embeddings. For 300-dimensional GloVe vectors that the authors trained themselves, they report 34.4, while I get 32.4. Our linguistic features still differ, so it is encouraging that the discrepancy is not larger. The cell below uses the 50-dimensional vectors loaded above, hence the much smaller number.



In [14]:

    
qvec = correlations.max(axis=1).sum()
qvec









    Out[14]:





8.0863957435227896
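To make the dependence on dimensionality concrete, here is a small NumPy sketch on random data. The helper name `qvec_sum` and the data are assumptions for illustration. Stacking a copy of every dimension adds no information but exactly doubles the score.

```python
import numpy as np

def qvec_sum(X, S):
    """Sum over the rows of X of the maximum correlation with any row of S."""
    D = X.shape[0]
    R = np.corrcoef(np.vstack([X, S]))[:D, D:]
    return R.max(axis=1).sum()

rng = np.random.default_rng(2)
S = rng.random((5, 300))
X = rng.normal(size=(10, 300))

single = qvec_sum(X, S)
doubled = qvec_sum(np.vstack([X, X]), S)  # same dimensions, repeated
print(single, doubled)
```

This is why scores are only meaningful when compared between embeddings of the same size.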

We don't really need it, but just to be explicit let's get the matrix $A$ of alignments.



In [15]:

    
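# index A by embedding dimension (X.index) so that its rows line up with `alignments`
A = pd.DataFrame(0, index=X.index, columns=S.index)
for dim, feat in alignments.items():  # Series.iteritems() was removed in pandas 2.0
    A.loc[dim, feat] = 1
A.head()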









    Out[15]:







  
    
      
	noun.Tops	noun.act	noun.animal	noun.artifact	noun.attribute	noun.body	noun.cognition	noun.communication	noun.event	noun.feeling	...	verb.consumption	verb.contact	verb.creation	verb.emotion	verb.motion	verb.perception	verb.possession	verb.social	verb.stative	verb.weather
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	1	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
    
  

5 rows × 41 columns
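The alignment matrix can also be built without a loop, and multiplying it elementwise with the correlations recovers the score from the formula above. A NumPy sketch on a small hand-written correlation matrix follows; the numbers in `R` are made up.

```python
import numpy as np

# toy correlations: 4 embedding dimensions (rows) x 3 features (columns)
R = np.array([[0.10, 0.40, -0.20],
              [0.30, 0.05, 0.25],
              [-0.10, 0.00, 0.50],
              [0.20, 0.60, 0.10]])
D, P = R.shape

best = R.argmax(axis=1)        # best feature for each dimension
A = np.zeros((D, P), dtype=int)
A[np.arange(D), best] = 1      # one 1 per row: a_ij = 1 iff j is the best feature for i

score = (R * A).sum()          # sum_i sum_j r(x_i, s_j) * a_ij
print(A)
print(score)
```

Each row of `A` sums to 1, while a column may contain several ones: here the second feature is aligned with two dimensions, which the constraint allows.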

The rest of the paper is a series of experiments training large models and evaluating them on both intrinsic and extrinsic tasks, including QVEC. I'm not going to replicate that here, but the QVEC implementation is complete.

Canonical Correlation Analysis

In a follow-up 2016 paper, a subset of the original authors introduce QVEC-CCA. It's really just QVEC except instead of summing the highest row-wise correlations, they use canonical correlation analysis. I didn't know what that was, but after reading a bit I have a reasonable grasp of it. I'm going to replicate that 2016 paper, or at least the most important part of it which is the use of CCA. Note that the other new thing in the 2016 paper is the use of syntactic features, in addition to semantic, which I won't do right now.

Scikit-learn has an implementation of CCA. It took me a while to figure out which of the learnt parameters I want, and I'm only 80% confident I have it right.



In [16]:

    
from sklearn.cross_decomposition import CCA

cca = CCA(n_components=1)
cca = cca.fit(X.T, S.T)

I believe the linear combinations I want are stored in the x_weights_ and y_weights_ attributes.



In [21]:

    
a = np.dot(X.T, cca.x_weights_)
b = np.dot(S.T, cca.y_weights_)
stats.pearsonr(a, b)









    Out[21]:





(array([ 0.73791464]), array([ 0.]))
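As a cross-check that does not depend on picking the right fitted attribute, the first canonical correlation can be computed directly with linear algebra: centre both matrices, orthonormalise their columns with a QR decomposition, and take the largest singular value of the product of the two bases. The sketch below does this on made-up data. The function name `first_canonical_correlation` is hypothetical, and it assumes more observations than columns and full column rank. It has not been checked against the notebook's data, and scikit-learn's iterative estimator with its default scaling may give slightly different numbers.

```python
import numpy as np

def first_canonical_correlation(X, Y):
    """X is (N x D), Y is (N x P); rows are observations (words)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)  # orthonormal basis for the column space of Xc
    Qy, _ = np.linalg.qr(Yc)
    # singular values of Qx^T Qy are the canonical correlations, largest first
    singular_values = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(singular_values[0], 1.0))  # guard against rounding above 1

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
# one column of Y is an exact linear combination of the columns of X
Y = np.column_stack([X @ np.array([1.0, -2.0, 0.5]), rng.normal(size=100)])
rho = first_canonical_correlation(X, Y)
print(rho)
```

Since one column of `Y` lies exactly in the span of `X`, the first canonical correlation here is 1 up to rounding, even though no single pair of columns is perfectly correlated.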

Succinct implementation



In [27]:

    
def qvec(features, embeddings):
    """
    Returns correlations between rows of `embeddings` and rows of `features`.

    Both arguments have words as columns. The result has one row per
    embedding dimension and one column per linguistic feature.
    The aligned feature is the one with the highest correlation.
    The qvec score is the sum of correlations of aligned features.
    """
    # use the `features` argument here, not the global `subset`
    common_words = embeddings.columns.intersection(features.columns)
    S = features[common_words]
    X = embeddings[common_words]
    correlations = pd.DataFrame({i:X.corrwith(S.iloc[i], axis=1) for i in range(len(S))})
    correlations.columns = S.index
    return correlations



In [30]:

    
qvec(subset, embeddings).head()









    Out[30]:







  
    
      
	noun.Tops	noun.act	noun.animal	noun.artifact	noun.attribute	noun.body	noun.cognition	noun.communication	noun.event	noun.feeling	...	verb.consumption	verb.contact	verb.creation	verb.emotion	verb.motion	verb.perception	verb.possession	verb.social	verb.stative	verb.weather
1	0.040100	0.026092	0.056525	-0.037133	-0.021099	0.042961	-0.037595	-0.067074	0.056121	-0.020713	...	0.034959	-0.045302	0.061470	0.014180	0.036282	0.047905	0.056783	0.045944	0.066336	0.009226
2	0.010114	-0.036833	-0.073421	0.031081	0.089190	-0.024908	0.062203	0.114031	0.003696	0.060540	...	-0.042820	-0.125930	-0.046195	-0.018379	-0.077600	-0.043725	-0.075241	-0.118294	-0.023466	-0.000403
3	-0.017465	-0.113412	-0.071950	0.107044	-0.040075	-0.040329	-0.123985	-0.154424	-0.018798	-0.036100	...	0.025986	0.109754	0.011967	0.024162	0.075199	0.021415	0.128516	0.039837	0.026042	0.030984
4	0.046736	0.028456	0.016308	0.039305	0.007064	-0.020841	-0.013023	0.024967	0.046805	-0.046602	...	-0.045769	-0.011350	-0.033100	-0.054945	-0.064516	-0.007210	-0.037468	-0.047789	-0.017277	-0.012592
5	0.076649	-0.096512	0.021287	-0.025811	0.107881	-0.032195	0.101387	0.021477	-0.041974	0.032073	...	0.000242	-0.034706	-0.035682	0.005210	-0.119434	-0.000516	0.037925	-0.028063	-0.016718	-0.069846
    
  

5 rows × 41 columns
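The function returns the full correlation table, so the score and the alignments still have to be read off it. A minimal pandas sketch of that last step on a hand-written table follows; the helper name `qvec_summary` and the toy numbers are assumptions for illustration.

```python
import pandas as pd

def qvec_summary(correlations):
    """Takes a (dimensions x features) correlation table; returns (score, alignments)."""
    alignments = correlations.idxmax(axis=1)  # best feature label per dimension
    score = correlations.max(axis=1).sum()    # sum of the best correlations
    return score, alignments

toy = pd.DataFrame(
    {"noun.animal": [0.10, 0.30], "verb.motion": [0.40, 0.05]},
    index=[1, 2],  # embedding dimensions
)
score, alignments = qvec_summary(toy)
print(score)
print(alignments)
```

With the notebook's data, the same helper could be applied to the result of `qvec(subset, embeddings)`.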
