In [1]:
%matplotlib inline
import os
import csv
from itertools import product
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
data_path = '../../data'
tmp_path = '../../tmp'
In [2]:
feature_path = os.path.join(data_path, 'evaluation/semcor/tsvetkov_semcor.csv')
subset = pd.read_csv(feature_path, index_col=0)
subset.columns = [c.replace('semcor.', '') for c in subset.columns]
subset.set_index('words', inplace=True)
subset = subset.T
In [3]:
size = 300
fname = 'embeddings/glove.6B.{}d.txt'.format(size)
embedding_path = os.path.join(data_path, fname)
embeddings = pd.read_csv(embedding_path, sep=' ', header=None, index_col=0, quoting=csv.QUOTE_NONE).T
In [4]:
def qvec(features, embeddings):
    """
    Returns correlations between columns of `features` and `embeddings`.
    The aligned feature is the one with the highest correlation.
    The qvec score is the sum of correlations of aligned features.
    """
    common_words = embeddings.columns.intersection(features.columns)
    S = features[common_words]
    X = embeddings[common_words]
    correlations = pd.DataFrame({i: X.corrwith(S.iloc[i], axis=1) for i in range(len(S))})
    correlations.columns = S.index
    return correlations
In [46]:
correlations = qvec(subset, embeddings)
V = len(embeddings.columns.intersection(subset.columns))
In [47]:
correlations.head()
Out[47]:
In the dataframe below, the index is the dimension of the learnt embedding, 'feature' is the name of the linguistic feature aligned with that dimension, and 'max_corr' is the correlation between the dimension and the feature. The sum of the 'max_corr' column is the qvec score.
39 dimensions pick out 'noun.person', 37 'noun.artifact', 19 'noun.body', and 15 'verb.change'.
In [48]:
alignments = pd.DataFrame(correlations.idxmax(axis=1))
alignments.columns = ['feature']
alignments['max_corr'] = correlations.max(axis=1)
alignments.sort_values(by='max_corr', ascending=False).head(10)
Out[48]:
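As a sanity check, the qvec score and the per-feature alignment counts quoted above can be computed directly from `alignments`:
In [ ]:
# qvec score: sum of each dimension's highest correlation
print(alignments['max_corr'].sum())
# number of dimensions aligned with each feature
print(alignments['feature'].value_counts().head())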
QVEC looks at 41 correlation coefficients per dimension (one for each linguistic feature) and finds the maximum. Here, I show the scatterplot for the highest correlation.
A consistent observation is that the distributions of the linguistic features are strongly peaked at 0. That is, almost all words have 0 for most features; sometimes there is some mass at 1. This suggests to me that the linguistic features being used are not appropriate.
In [8]:
common_words = embeddings.columns.intersection(subset.columns)
S = subset[common_words]
X = embeddings[common_words]
def plot(i, j, X=X, S=S):
    """Plot ith dimension of embeddings against feature j."""
    x = X.loc[i]
    s = S.loc[j]
    sns.jointplot(x, s);
In [9]:
plot(300,'noun.person')
In sum: each dimension looks pretty normal, but the formal tests I'm using suggest otherwise. Most are centered at 0 with std around 0.4.
From the marginal distribution plots above, it looks like each dimension is normally distributed. I don't know whether that's done deliberately during training or whether it just turns out that way.
In [10]:
sns.distplot(X.loc[89]);
In [11]:
fig, axs = plt.subplots(1,2)
vector = X.loc[1]
sm.qqplot(vector, ax=axs[0]);
stats.probplot(vector, plot=axs[1]);
In [12]:
def do_kstest(i):
    """Return the p value of a KS test of the ith dimension against a fitted normal."""
    vector = X.loc[i]
    ybar = vector.mean()
    s = vector.std()
    result = stats.kstest(vector, cdf='norm', args=(ybar, s))
    return result.pvalue
p_values = [do_kstest(i) for i in X.index]
sns.distplot(p_values);
In [13]:
def do_shapirotest(i):
    """Return the p value of a Shapiro-Wilk test of normality for the ith dimension."""
    vector = X.loc[i]
    result = stats.shapiro(vector)
    return result[1]
p_values = [do_shapirotest(i) for i in X.index]
sns.distplot(p_values);
In [14]:
def do_lillieforstest(i):
    """Return the p value of a Lilliefors test of normality for the ith dimension."""
    vector = X.loc[i]
    result = sm.stats.lilliefors(vector)
    return result[1]
In [15]:
p_values = [do_lillieforstest(i) for i in X.index]
sns.distplot(p_values);
In [16]:
fig, axs = plt.subplots(1,2)
sns.distplot(X.mean(axis=1), ax=axs[0]);
sns.distplot(X.std(axis=1), ax=axs[1]);
In [17]:
sns.distplot(X['bird']);
In [18]:
fig, axs = plt.subplots(1,2)
sns.distplot(X.mean(), ax=axs[0]);
sns.distplot(X.std(), ax=axs[1]);
In [19]:
S.mean(axis=1).sort_values(ascending=False).head()
Out[19]:
In [20]:
fig, axs = plt.subplots(ncols=4, figsize=(10, 4), sharey=True)
sns.distplot(S.loc['noun.artifact'], ax=axs[0], kde=False);
sns.distplot(S.loc['noun.person'], ax=axs[1], kde=False);
sns.distplot(S.loc['noun.act'], ax=axs[2], kde=False);
sns.distplot(S.loc['noun.communication'], ax=axs[3], kde=False);
In [21]:
proportions = S.astype(bool).sum(axis=1) / len(S.columns)
print(proportions.sort_values(ascending=False).head())
proportions.describe()
Out[21]:
Knowing that the dimensions of the learnt embeddings are normally distributed and that the features are strongly bimodal, what is the best way to measure their correlation? It's clear that Pearson's $r$ and Spearman's $\rho$ are not appropriate because of the high number of ties.
I see two broad approaches: treat the 0's as missing values, or binarize the features and compare each dimension's distribution for words with and without a feature. Both are tried below.
In sum: Neither is very insightful. I need to use different (less sparse) features.
In sum: Removing 0's seems to help, but is not principled. It picks out one extremely rare feature. The fact that removing 0's helps tells me the presence of such rare features is a problem.
I changed the 0's to missing values and then used the usual QVEC code from above. I checked the source of `corrwith`, and it looks like it ignores missing values, which is what I want.
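A quick toy check of that reading of `corrwith` (a sketch; the small frame and its values are made up for illustration):
In [ ]:
# Toy check that corrwith drops missing values pairwise
toy = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], index=['dim0'])
feat = pd.Series([0.0, np.nan, 1.0, 1.0])
with_nan = toy.corrwith(feat, axis=1)['dim0']
dropped = toy.loc['dim0', [0, 2, 3]].corr(feat.dropna())
print(with_nan, dropped)  # the two agree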
The weird thing is that one feature, 'noun.motive', is the most highly correlated feature for 66 of the 300 dimensions; most of the best-aligned features are 'noun.motive'. Previously, it didn't appear at all, and it has only 11 nonzero entries.
In [22]:
S_no_zeroes = S[(S != 0)]  # masking turns the 0's into NaN
S_no_zeroes.head()
Out[22]:
In [23]:
tmp = qvec(S_no_zeroes, embeddings)
tmp.head()
Out[23]:
In [24]:
alignments = pd.DataFrame(tmp.idxmax(axis=1))
alignments.columns = ['feature']
alignments['max_corr'] = tmp.max(axis=1)
alignments.sort_values(by='max_corr', ascending=False).head()
Out[24]:
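The 'noun.motive' claims above can be checked directly from the objects already defined:
In [ ]:
# how often each feature is now the best-aligned one
print(alignments['feature'].value_counts().head())
# 'noun.motive' is extremely sparse: count its nonzero entries
print(S.loc['noun.motive'].astype(bool).sum())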
The following dimensions and features were aligned previously:
In [25]:
plot(122, 'noun.person', S=S_no_zeroes);
In [26]:
plot(255, 'noun.person', S=S_no_zeroes);
In sum: I binarize S, so the linguistic features are now presence/absence. For each dimension-feature pair, you can look at the distribution of the dimension for words with and without that feature. Dimension-feature pairs that are aligned by the original method show separation. But quantifying the separation across all dimension-feature pairs is problematic with either a t test or a KS test: you get seemingly significant results for unaligned pairs. I have not done any multiple test corrections. This approach does not seem as promising as changing the features to less sparse ones.
One suggestion here is to compare the means directly. Below, I plot the distribution of a dimension split by words that have a given feature and those that don't. For dimension-feature pairs that were identified above as aligned, this plot shows good separation; for non-aligned pairs, there is none. However, when I perform a two-tailed t test for a difference of means between the two groups, I get "significant" results even when there is no visible difference, so I cannot blindly trust the t test here. To show this, I perform all $41 \times 300$ t tests and plot the p values. The plot suggests most pairs are significantly different, in line with my earlier eyeball checks.
In [27]:
def do_ttest(i, feature, X=X, S=S):
    """Do two sample t test for difference of means between the ith dimension
    of words with feature and those without."""
    dim = X.loc[i]
    have = S.loc[feature].astype(bool)
    result = stats.ttest_ind(dim[have], dim[~have])
    return result[1]
In [28]:
def do_2kstest(i, feature, X=X, S=S):
    """Returns p value from 2 sided KS test that the ith dimension from words with
    feature and those from words without feature come from the same distribution."""
    dim = X.loc[i]
    have = S.loc[feature].astype(bool)
    result = stats.ks_2samp(dim[have], dim[~have])
    return result[1]
In [29]:
def plot_by_presence(i, feature, X=X, S=S):
    """Plot distribution of the ith dimension of X for those that have
    feature and those that don't."""
    dim = X.loc[i]
    have = S.loc[feature].astype(bool)
    has_label = feature
    has_not_label = 'no {}'.format(feature)
    sns.distplot(dim[have], label=has_label);
    sns.distplot(dim[~have], label=has_not_label);
    t_test = do_ttest(i, feature, X, S)
    ks_test = do_2kstest(i, feature, X, S)
    plt.legend();
    plt.title('t: {}\nks: {}'.format(t_test, ks_test))
This dimension-feature pair was aligned in the original method using correlation between raw values. I see good separation between the distributions, which is consistent with a relationship between the variables. The t test result strongly suggests the population means are different (but I can see that from the plot).
In [30]:
plot_by_presence(255, 'noun.person')
This dimension-feature pair is weakly negatively correlated using the original method ($r=-0.06$). Consistent with that, the distributions overlap a lot. However, a t test gives a small p value, suggesting the population means are different.
In [31]:
plot_by_presence(1, 'noun.body')
This is the distribution of p values from a t test for all dimension-feature pairs. I think it shows the inappropriateness of a t test more than anything else.
In [32]:
tmp = [do_ttest(i, f) for (i, f) in product(X.index, S.index)]
sns.distplot(tmp);
In [33]:
tmp = [do_2kstest(i, f) for (i, f) in product(X.index, S.index)]
sns.distplot(tmp);
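I have not applied any multiple-test correction above; as a rough check, these p values could be adjusted, e.g. with Benjamini-Hochberg (a sketch using statsmodels' `multipletests`):
In [ ]:
from statsmodels.stats.multitest import multipletests
# Benjamini-Hochberg FDR correction over all dimension-feature pairs
reject, p_adj, _, _ = multipletests(tmp, alpha=0.05, method='fdr_bh')
print(reject.mean())  # fraction of pairs still "significant" after correction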
In [34]:
# fraction of zero entries in the feature matrix
(subset.size - np.count_nonzero(subset.values)) / subset.size
Out[34]:
In [36]:
sns.distplot(correlations.values.flatten());
The following summary shows that the $41 \times 300$ correlations are centered at 0 with std 0.05. The largest is 0.32 and the smallest is -0.30. These don't seem like very high numbers; the 75th percentile is 0.03. Looking back at the histogram above, it's obvious that these dimensions are not highly correlated with these linguistic features.
Another important point is that the distribution of correlations is symmetric. So using the max, rather than the max absolute value, seems arbitrary: it doesn't allow for reversed dimensions.
In [37]:
pd.Series(correlations.values.flatten()).describe()
Out[37]:
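Given that symmetry, here is a quick look at what aligning on the absolute correlation would give instead (a sketch):
In [ ]:
# align each dimension with the feature of largest absolute correlation
abs_corr = correlations.abs()
abs_alignments = pd.DataFrame({'feature': abs_corr.idxmax(axis=1),
                               'max_abs_corr': abs_corr.max(axis=1)})
# compare the score under each convention
print(correlations.max(axis=1).sum(), abs_alignments['max_abs_corr'].sum())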
In [38]:
sns.distplot(correlations.max(axis=1));
In the heatmap below, nothing really sticks out. It does not look like these features are captured well by these dimensions.
In [14]:
sns.heatmap(correlations);
In [53]:
subset.index.difference(alignments['feature'])
Out[53]:
NB: This stinks with the whole embedding matrix, but looks promising with smaller vocab.
In the paper, they give the top K words for a dimension. The code prints, for each dimension, the dimension number, the aligned linguistic feature, the correlation between the two, and the top k words associated with the dimension. I understand the last bit to mean "the k words with the highest values in the dimension".
Importantly, it matters whether you look for the top K words in the whole embedding matrix or only in the reduced-vocab matrix X (the one you have linguistic features for); see the sketch below. You get much better results when you use X. Ideally, the method would generalize to the larger vocab. Clearly, X will have words with much higher frequency, which may give more sensible (stable?) results.
How can I assess whether these top K words are "correct" or not?
In [112]:
def highest_value(i, k=20, X=X):
    """Return the top `k` words with highest values for ith dimension in X."""
    dim = X.loc[i]
    return dim.nlargest(n=k).index
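To see the vocabulary effect described above, the same helper can be run against the full embedding matrix (a sketch; dimension 255 is one of the aligned dimensions from earlier):
In [ ]:
i = 255
print(highest_value(i, X=X))           # reduced vocab (words with features)
print(highest_value(i, X=embeddings))  # whole GloVe vocab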
In [113]:
k = 10
largest = pd.DataFrame([highest_value(i, k) for i in alignments.index], index=alignments.index)
top_k = pd.merge(alignments, largest, left_index=True, right_index=True)
top_k.sort_values(by='max_corr', ascending=False).head()
Out[113]:
In [120]:
def get_dims(feature, df=top_k):
    """Return the dimensions aligned with `feature` in `df`."""
    return df[df['feature']==feature].sort_values(by='max_corr', ascending=False)
get_dims('noun.time').head()
Out[120]: