QVEC

This notebook is a replication of Tsvetkov et al. (2015), "Evaluation of Word Vector Representations by Subspace Alignment", which introduces QVEC. QVEC is an intrinsic evaluation method for word embeddings: it measures the correlation between dimensions of the embeddings and linguistic features. The original code is available, but I'm replicating it for two reasons: (i) as a learning exercise and (ii) the original implementation looks messy.

To implement QVEC, I'm going to need two things:

  • Gold standard linguistic features
  • The QVEC model

The linguistic features used in the original paper come from SemCor, a WordNet-annotated subset of the Brown corpus. Extracting them is done in a separate step; here I only load the result.
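The feature values loaded in the next cell appear to be per-word relative frequencies of WordNet supersenses (for example, the row for "ability" sums to 1). A minimal sketch of that kind of normalisation, assuming a hypothetical table of raw (word, supersense) counts; the counts below are invented for illustration and are not taken from SemCor:

```python
import pandas as pd

# Hypothetical raw counts: how often each word was annotated with each supersense.
# These numbers are made up for illustration; they are not from SemCor.
counts = pd.DataFrame(
    {
        "noun.attribute": [19, 0],
        "noun.cognition": [8, 0],
        "verb.social": [0, 5],
    },
    index=["ability", "abolish"],
)

# Divide each row by its total so every word gets a distribution over supersenses.
features = counts.div(counts.sum(axis=1), axis=0)
print(features.round(6))
```

Each row of `features` then sums to 1, which is the shape of data the rest of this notebook expects (before transposing).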


In [1]:
import os
import csv
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data_path = '../../data'
tmp_path = '../../tmp'

In [3]:
feature_path = os.path.join(data_path, 'evaluation/semcor/tsvetkov_semcor.csv')
subset = pd.read_csv(feature_path, index_col=0)
subset.columns = [c.replace('semcor.', '') for c in subset.columns]
subset.head()


Out[3]:
noun.Tops noun.act noun.animal noun.artifact noun.attribute noun.body noun.cognition noun.communication noun.event noun.feeling ... verb.contact verb.creation verb.emotion verb.motion verb.perception verb.possession verb.social verb.stative verb.weather words
0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0
1 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.272727 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 a
2 0.0 0.0 0.0 0.0 0.035714 0.0 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.571429 0.0 0.0 0.0 abandon
3 0.0 0.0 0.0 0.0 0.703704 0.0 0.296296 0.000000 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 ability
4 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 1.0 0.0 0.0 abolish

5 rows × 42 columns

QVEC model

QVEC finds an alignment between the dimensions of learnt word embeddings and a set of linguistic features by maximising the cumulative correlation.

$N$ is the size of the vocabulary (in common between the embeddings and the linguistic features).

$D$ is the dimensionality of the embeddings.

$X \in \mathbb{R}^{D \times N}$ is the matrix of embeddings. Note that a word's embedding is a column, rows are individual dimensions.

$P$ is the number of linguistic features.

$S \in \mathbb{R}^{P \times N}$ is the matrix of linguistic features, created above. Again, each word is a column and rows are individual features.

QVEC finds an alignment between the rows of $X$ and the rows of $S$ that maximises the correlation between the aligned rows. Each row of $X$ is aligned to at most one row of $S$, but each row of $S$ may be aligned to more than one row of $X$.

$A \in \{0,1\}^{D \times P}$ holds the alignments. $a_{ij}$ is 1 if dimension $i$ of $X$ is aligned with linguistic feature $j$, and 0 otherwise.

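The sum of correlations is their measure of the quality of the word embeddings in $X$. Each correlation is at most 1, so the sum is bounded above by $D$: it tends to grow with more embedding dimensions, and it cannot decrease when more features are added.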

$QVEC = \max_{A|\sum_{j}a_{ij} \leq 1}\sum_{i=1}^{D}\sum_{j=1}^{P}r(x_i, s_j) \times a_{ij}$

In words, for any possible alignment $A$, subject to the constraint that each embedding dimension is aligned to 0 or 1 linguistic features, sum up the correlations. The sum for the best alignment is the measure of the embeddings.
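One way to see why the maximisation is cheap: the constraint applies to each row of $A$ separately, so the best alignment can be chosen one embedding dimension at a time. As a sketch of that reasoning (my own rearrangement of the formula above, not taken from the paper):

$$\mathrm{QVEC} = \sum_{i=1}^{D} \max\left(0,\ \max_{1 \leq j \leq P} r(x_i, s_j)\right)$$

The outer $\max(0, \cdot)$ corresponds to leaving dimension $i$ unaligned. The code in this notebook always picks the best feature for every dimension, which gives the same value whenever each dimension has at least one positive correlation.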

Crucially, this assumes that the dimensions of the embeddings end up encoding linguistic features. The authors justify this by the effectiveness of using word embeddings in linear models in downstream tasks.

First things first, transform my linguistic features into the format mentioned above (i.e., the matrix $S$).


In [6]:
subset.set_index('words', inplace=True)
#subset.drop('count_in_semcor', inplace=True, axis=1)
subset = subset.T

In [9]:
subset.head()


Out[9]:
words 0 a abandon ability abolish absence absorb absorption abstract abstraction ... yes yesterday yield yokuts york young youth yr zero zinc
noun.Tops 0.0 0.0 0.000000 0.000000 0.0 0.00 0.0 0.0 0.000000 0.000000 ... 0.0 0.0 0.00 0.0 0.0 0.000000 0.00 0.0 0.0 0.0
noun.act 0.0 0.0 0.000000 0.000000 0.0 0.25 0.0 0.0 0.000000 0.250000 ... 0.0 0.0 0.00 0.0 0.0 0.000000 0.00 0.0 0.0 0.0
noun.animal 0.0 0.0 0.000000 0.000000 0.0 0.00 0.0 0.0 0.000000 0.000000 ... 0.0 0.0 0.00 0.0 0.0 0.304348 0.00 0.0 0.0 0.0
noun.artifact 0.0 0.0 0.000000 0.000000 0.0 0.00 0.0 0.0 0.272727 0.166667 ... 0.0 0.0 0.16 0.0 0.0 0.000000 0.00 0.0 0.0 0.0
noun.attribute 0.0 0.0 0.035714 0.703704 0.0 0.00 0.0 0.0 0.000000 0.000000 ... 0.0 0.0 0.00 0.0 0.0 0.000000 0.08 0.0 0.0 0.0

5 rows × 4199 columns

Learnt word embeddings

The original paper trains several models of varying sizes. At a later stage I could do that, but for now I'm happy with using pre-trained embeddings.


In [4]:
size = 50
fname = 'embeddings/glove.6B.{}d.txt'.format(size)
embedding_path = os.path.join(data_path, fname)
embeddings = pd.read_csv(embedding_path, sep=' ', header=None, index_col=0, quoting=csv.QUOTE_NONE).T
embeddings.head()


Out[4]:
the , . of to and in a " 's ... sigarms katuna aqm 1.3775 corythosaurus chanty kronik rolonda zsombor sandberger
1 0.41800 0.013441 0.15164 0.70853 0.680470 0.268180 0.330420 0.21705 0.25769 0.23727 ... -0.743970 -0.30016 -1.11670 -0.24171 -0.042672 0.232040 -0.60921 -0.511810 -0.75898 0.072617
2 0.24968 0.236820 0.30177 0.57088 -0.039263 0.143460 0.249950 0.46515 0.45629 0.40478 ... 0.082164 -0.80268 0.14057 -0.23367 -0.088106 0.025672 -0.67218 0.058706 -0.47426 -0.513930
3 -0.41242 -0.168990 -0.16763 -0.47160 0.301860 -0.278770 -0.608740 -0.46757 -0.76974 -0.20547 ... -0.009147 -0.46637 0.36302 0.10672 -0.317240 -0.706990 0.23521 1.091300 0.47370 0.472800
4 0.12170 0.409510 0.17684 0.18048 -0.177920 0.016257 0.109230 0.10082 -0.37679 0.58805 ... 0.412900 -0.29822 -0.13836 -1.60230 -0.252090 -0.045465 -0.11195 -0.551630 0.77250 -0.522020
5 0.34527 0.638120 0.31719 0.54449 0.429620 0.113840 0.036372 1.01350 0.59272 0.65533 ... -0.422550 -1.03200 -1.47970 0.12440 -0.268510 0.139890 -0.46094 -0.102490 -0.78064 -0.355340

5 rows × 400000 columns


In [8]:
common_words = embeddings.columns.intersection(subset.columns)
embeddings = embeddings[common_words]
fname = os.path.join(tmp_path, 'glove_embeddings.csv')
embeddings.to_csv(fname)

In [9]:
from sklearn.metrics.pairwise import cosine_similarity as cos

In [11]:
pairwise = cos(embeddings.T)

In [20]:
# These are cosine similarities (higher means more similar), not distances.
similarities = pd.DataFrame(pairwise, columns=common_words, index=common_words)
similarities.to_csv(os.path.join(data_path, 'pairwise_sim.csv'))

The Python variables S and X refer to $S$ and $X$ exactly as above.


In [11]:
S = subset[common_words]
X = embeddings[common_words]

Now we want the correlation between the rows of S and the rows of X. This may not be the easiest way to do it but it works.
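As an alternative to looping over features, the whole $D \times P$ correlation matrix can be computed in one matrix product. This is a minimal NumPy sketch on random toy data; the function name `rowwise_correlations` and the toy arrays are my own and are not part of the original code:

```python
import numpy as np

def rowwise_correlations(X, S):
    """Pearson correlation between every row of X (D x N) and every row of S (P x N)."""
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    # Centre each row, then scale it to unit length.
    Xc = X - X.mean(axis=1, keepdims=True)
    Sc = S - S.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Sn = Sc / np.linalg.norm(Sc, axis=1, keepdims=True)
    # Dot products of unit-length centred rows are Pearson correlations.
    return Xn @ Sn.T  # shape (D, P)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(3, 50))  # 3 embedding dimensions, 50 words
S_toy = rng.random(size=(2, 50))  # 2 linguistic features, 50 words
R = rowwise_correlations(X_toy, S_toy)
print(R.shape)
```

A row that is constant across all words has zero norm after centring, so its correlations come out as NaN; that case would need handling on real data.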


In [12]:
correlations = pd.DataFrame({i:X.corrwith(S.iloc[i], axis=1) for i in range(len(S))})
correlations.columns = S.index

For each row of this correlation matrix (i.e. for each of the dimensions of the embeddings), we want the linguistic feature that it is most correlated with. We also get the value of that correlation.


In [13]:
alignments = correlations.idxmax(axis=1)
correlations.max(axis=1).head()


Out[13]:
1    0.073672
2    0.114031
3    0.128516
4    0.103564
5    0.207579
dtype: float64

The score of the embeddings relative to the linguistic features is the sum of the maximum correlations. Note how this value depends on the number of dimensions in the embeddings. For 300-dimensional GloVe vectors (which they trained themselves), the authors get 34.4, while I get 32.4; the value in the cell below is for the 50-dimensional vectors loaded above. Our linguistic features are still different, so the fact that the discrepancy is not too big is encouraging.
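To illustrate the dependence on dimensionality, here is a small sketch on synthetic data: even pure-noise "embeddings" get a larger score when more dimensions are included. The helper `qvec_score` and the random arrays are assumptions made for this illustration, not part of the original implementation:

```python
import numpy as np

def qvec_score(X, S):
    """Sum over rows of X of the best correlation with any row of S."""
    D = X.shape[0]
    # np.corrcoef stacks the rows of X and S; the top-right block
    # holds the cross-correlations between X rows and S rows.
    R = np.corrcoef(X, S)[:D, D:]
    return R.max(axis=1).sum()

rng = np.random.default_rng(0)
S_toy = rng.random(size=(5, 200))   # 5 made-up features, 200 words
X_toy = rng.normal(size=(20, 200))  # 20 dimensions of pure noise

score_5 = qvec_score(X_toy[:5], S_toy)  # first 5 dimensions only
score_20 = qvec_score(X_toy, S_toy)     # all 20 dimensions
print(score_5, score_20)
```

This suggests comparing scores only between embeddings of the same dimensionality.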


In [14]:
qvec = correlations.max(axis=1).sum()
qvec


Out[14]:
8.0863957435227896

We don't really need it, but just to be explicit let's get the matrix $A$ of alignments.


In [15]:
# Label the rows like X (dimensions 1..D) so they match the labels in `alignments`.
A = pd.DataFrame(0, index=X.index, columns=S.index)
for dim, feat in alignments.items():  # Series.iteritems() was removed in pandas 2.0
    A.loc[dim, feat] = 1  # .loc avoids chained assignment, which may write to a copy
A.head()


Out[15]:
noun.Tops noun.act noun.animal noun.artifact noun.attribute noun.body noun.cognition noun.communication noun.event noun.feeling ... verb.consumption verb.contact verb.creation verb.emotion verb.motion verb.perception verb.possession verb.social verb.stative verb.weather
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 41 columns
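As a sanity check on the role of $A$, the double sum in the QVEC formula should equal the sum of row-wise maxima. A toy sketch with an invented 3 × 2 correlation matrix (the values and the names `corr_toy`, `A_toy` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy correlation matrix: 3 embedding dimensions (rows) x 2 features (columns).
# The values are invented for illustration.
corr_toy = pd.DataFrame(
    [[0.10, 0.30], [0.25, 0.05], [0.20, 0.40]],
    index=[1, 2, 3],
    columns=["noun.act", "verb.motion"],
)

best = corr_toy.idxmax(axis=1)  # aligned feature for each dimension

# One-hot encode the alignments: entry (i, j) is 1 iff dimension i is aligned to feature j.
A_toy = pd.DataFrame(0, index=corr_toy.index, columns=corr_toy.columns)
for dim, feat in best.items():
    A_toy.loc[dim, feat] = 1

# The double sum over r(x_i, s_j) * a_ij equals the sum of the row-wise maxima.
double_sum = (corr_toy.values * A_toy.values).sum()
row_max_sum = corr_toy.max(axis=1).sum()
print(double_sum, row_max_sum)
```

Each row of `A_toy` contains exactly one 1, which is the "at most one feature per dimension" constraint in its tightest form.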

The rest of the paper is a series of experiments training large models and evaluating them on both intrinsic and extrinsic tasks, including QVEC. I'm not going to replicate that here, but the QVEC implementation is complete.

Canonical Correlation Analysis

In a follow-up 2016 paper, a subset of the original authors introduce QVEC-CCA. It's really just QVEC except instead of summing the highest row-wise correlations, they use canonical correlation analysis. I didn't know what that was, but after reading a bit I have a reasonable grasp of it. I'm going to replicate that 2016 paper, or at least the most important part of it which is the use of CCA. Note that the other new thing in the 2016 paper is the use of syntactic features, in addition to semantic, which I won't do right now.
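As I understand it, in the notation used above CCA looks for one weight vector per matrix, $v \in \mathbb{R}^{D}$ and $w \in \mathbb{R}^{P}$, such that the two projected views of the vocabulary are maximally correlated:

$$\mathrm{QVEC\text{-}CCA} = \max_{v, w} r(X^\top v, S^\top w)$$

Here $X^\top v$ and $S^\top w$ are both vectors of length $N$ (one value per word), and $r$ is the Pearson correlation as before. Unlike QVEC, the result is a single correlation, so it always lies in $[-1, 1]$ regardless of $D$.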

Scikit-learn has an implementation of CCA. It took me a while to figure out which learnt parameters I want, and I'm only 80% confident I have it right.


In [16]:
from sklearn.cross_decomposition import CCA

cca = CCA(n_components=1)
cca = cca.fit(X.T, S.T)

I believe the linear combinations I want are stored in the x_weights_ and y_weights_ attributes.


In [21]:
a = np.dot(X.T, cca.x_weights_)
b = np.dot(S.T, cca.y_weights_)
stats.pearsonr(a, b)


Out[21]:
(array([ 0.73791464]), array([ 0.]))

Succinct implementation


In [27]:
def qvec(features, embeddings):
    """
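    Returns correlations between rows of `embeddings` and rows of `features`.
    
    Both arguments have words as columns. For each embedding dimension, the
    aligned feature is the one with the highest correlation. The qvec score
    is the sum of those maxima, i.e. `result.max(axis=1).sum()`.
    """
    # Use the `features` argument here, not the global `subset`.
    common_words = embeddings.columns.intersection(features.columns)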
    S = features[common_words]
    X = embeddings[common_words]
    correlations = pd.DataFrame({i:X.corrwith(S.iloc[i], axis=1) for i in range(len(S))})
    correlations.columns = S.index
    return correlations

In [30]:
qvec(subset, embeddings).head()


Out[30]:
noun.Tops noun.act noun.animal noun.artifact noun.attribute noun.body noun.cognition noun.communication noun.event noun.feeling ... verb.consumption verb.contact verb.creation verb.emotion verb.motion verb.perception verb.possession verb.social verb.stative verb.weather
1 0.040100 0.026092 0.056525 -0.037133 -0.021099 0.042961 -0.037595 -0.067074 0.056121 -0.020713 ... 0.034959 -0.045302 0.061470 0.014180 0.036282 0.047905 0.056783 0.045944 0.066336 0.009226
2 0.010114 -0.036833 -0.073421 0.031081 0.089190 -0.024908 0.062203 0.114031 0.003696 0.060540 ... -0.042820 -0.125930 -0.046195 -0.018379 -0.077600 -0.043725 -0.075241 -0.118294 -0.023466 -0.000403
3 -0.017465 -0.113412 -0.071950 0.107044 -0.040075 -0.040329 -0.123985 -0.154424 -0.018798 -0.036100 ... 0.025986 0.109754 0.011967 0.024162 0.075199 0.021415 0.128516 0.039837 0.026042 0.030984
4 0.046736 0.028456 0.016308 0.039305 0.007064 -0.020841 -0.013023 0.024967 0.046805 -0.046602 ... -0.045769 -0.011350 -0.033100 -0.054945 -0.064516 -0.007210 -0.037468 -0.047789 -0.017277 -0.012592
5 0.076649 -0.096512 0.021287 -0.025811 0.107881 -0.032195 0.101387 0.021477 -0.041974 0.032073 ... 0.000242 -0.034706 -0.035682 0.005210 -0.119434 -0.000516 0.037925 -0.028063 -0.016718 -0.069846

5 rows × 41 columns