In this notebook I replicate Köhn (2015): What's in an embedding? Analyzing word embeddings through multilingual evaluation. The paper proposes to i) evaluate an embedding method on more than one language, and ii) evaluate an embedding model by how well its embeddings capture syntactic features. Köhn uses an L2-regularized linear classifier, compared against a baseline that assigns the most frequent class. He finds that most methods perform similarly on this task, but that dependency-based embeddings perform better, and that their advantage grows as the dimensionality is decreased. Overall, the aim is an evaluation method that says something about the structure of the learnt representations. He evaluates a range of models on their ability to capture a number of morphosyntactic features across several languages.
Embedding models tested:
Features tested:
Languages tested:
Word embeddings were trained on data that was automatically PoS-tagged and dependency-parsed with existing tools, which is necessary for training the dependency-based embeddings. The evaluation is on hand-labelled data. English training data is a subset of Wikipedia; English test data comes from the Penn Treebank. For all other languages, both the training and test data come from a shared task on parsing morphologically rich languages. Köhn trained embeddings with window sizes 5 and 11 and dimensionalities 10, 100, and 200.
Dependency-based embeddings perform the best on almost all tasks. They even do well when the dimensionality is reduced to 10, while other methods perform poorly in this case.
I'll need:
In [21]:
%matplotlib inline
import os
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
data_path = '../../data'
tmp_path = '../../tmp'
In [22]:
size = 50
fname = 'embeddings/glove.6B.{}d.txt'.format(size)
glove_path = os.path.join(data_path, fname)
glove = pd.read_csv(glove_path, sep=' ', header=None, index_col=0, quoting=csv.QUOTE_NONE)
glove.head()
Out[22]:
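A quick sanity check (my addition, not part of the replication): the DataFrame is indexed by word form, with the embedding dimensions as integer-named columns, so looking up a vector is a plain `.loc`:

In [ ]:
# Look up a single word vector; 'the' is certainly in the glove.6B vocabulary.
glove.loc['the'].values[:5]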
In [23]:
fname = 'UD_English/features.csv'
features_path = os.path.join(data_path, os.path.join('evaluation/dependency', fname))
features = pd.read_csv(features_path).set_index('form')
features.head()
Out[23]:
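The features table is loaded here ready-made. As a rough sketch of how such a table could be built (this is not the actual preprocessing; the function name and split handling are assumptions), morphological features can be pulled out of a Universal Dependencies CoNLL-U file, where FEATS is the sixth tab-separated column with entries like `Number=Sing|Tense=Past`:

In [ ]:
def features_from_conllu(path, split):
    """Sketch: build a (form, set, feature, ...) table from a CoNLL-U file."""
    rows = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            cols = line.split('\t')
            if not cols[0].isdigit():  # skip multiword ranges and empty nodes
                continue
            feats = {}
            if cols[5] != '_':  # FEATS column: Key=Value pairs joined by '|'
                feats = dict(kv.split('=') for kv in cols[5].split('|'))
            feats['form'] = cols[1]
            feats['set'] = split  # 'train', 'dev' or 'test'
            rows.append(feats)
    return pd.DataFrame(rows)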
In [24]:
df = pd.merge(glove, features, how='inner', left_index=True, right_index=True)
df.head()
Out[24]:
In [25]:
def prepare_X_and_y(feature, data):
    """Return X and y ready for predicting feature from embeddings."""
    relevant_data = data[data[feature].notnull()]
    columns = list(range(1, size + 1))
    X = relevant_data[columns]
    y = relevant_data[feature]
    train = relevant_data['set'] == 'train'
    test = (relevant_data['set'] == 'test') | (relevant_data['set'] == 'dev')
    X_train, X_test = X[train].values, X[test].values
    y_train, y_test = y[train].values, y[test].values
    return X_train, X_test, y_train, y_test

def predict(model, X_test):
    """Wrapper for getting positive-class probabilities."""
    # predict_proba returns one column per class; take the second,
    # which corresponds to model.classes_[1].
    return model.predict_proba(X_test)[:, 1]

def conmat(model, X_test, y_test):
    """Wrapper for sklearn's confusion matrix."""
    y_pred = model.predict(X_test)
    c = confusion_matrix(y_test, y_pred)
    sns.heatmap(c, annot=True, fmt='d',
                xticklabels=model.classes_,
                yticklabels=model.classes_,
                cmap="YlGnBu", cbar=False)
    plt.ylabel('Ground truth')
    plt.xlabel('Prediction')

def draw_roc(model, X_test, y_test):
    """Convenience function to draw ROC curve."""
    y_pred = predict(model, X_test)
    # pos_label is needed because the class labels are strings.
    fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=model.classes_[1])
    roc = roc_auc_score(y_test, y_pred)
    label = r'$AUC={}$'.format(str(round(roc, 3)))
    plt.plot(fpr, tpr, label=label)
    plt.title('ROC')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend()

def cross_val_auc(model, X, y):
    """Draw ROC curves for five random stratified train/test splits."""
    for _ in range(5):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
        model = model.fit(X_train, y_train)
        draw_roc(model, X_test, y_test)
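The helpers above are enough for the ROC-style evaluation. A usage sketch (my addition; it assumes 'Tense' is binary here, which draw_roc requires):

In [ ]:
# Cross-validated ROC curves over the pooled train and test data (sketch).
X_tr, X_te, y_tr, y_te = prepare_X_and_y('Tense', df)
cross_val_auc(LogisticRegression(penalty='l2', solver='liblinear'),
              np.concatenate([X_tr, X_te]),
              np.concatenate([y_tr, y_te]))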
In [32]:
X_train, X_test, y_train, y_test = prepare_X_and_y('Tense', df)
model = LogisticRegression(penalty='l2', solver='liblinear')
model = model.fit(X_train, y_train)
conmat(model, X_test, y_test)
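The paper's point of comparison is a most-frequent-class baseline. A minimal sketch of that comparison, using sklearn's DummyClassifier as a stand-in (the paper's exact baseline setup may differ):

In [ ]:
# Most-frequent-class baseline vs. the logistic regression (sketch).
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('baseline accuracy:   {:.3f}'.format(baseline.score(X_test, y_test)))
print('classifier accuracy: {:.3f}'.format(model.score(X_test, y_test)))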
In [31]:
# distplot was removed from recent seaborn; histplot + rugplot is the modern equivalent.
sns.histplot(model.coef_[0]); sns.rugplot(x=model.coef_[0]);