Köhn

In this notebook I replicate Koehn (2015): What's in an embedding? Analyzing word embeddings through multilingual evaluation. This paper proposes to i) evaluate an embedding method on more than one language, and ii) evaluate an embedding model by how well its embeddings capture syntactic features. He uses an L2-regularized linear classifier, with an upper baseline that assigns the most frequent class. He finds that most methods perform similarly on this task, but that dependency based embeddings perform better. Dependency based embeddings particularly perform better when you decrease the dimensionality. Overall, the aim is to have an evalation method that tells you something about the structure of the learnt representations. He evaulates a range of different models on their ability to capture a number of different morphosyntactic features in a bunch of languages.

Embedding models tested:

cbow
skip-gram
glove
dep
cca
brown

Features tested:

pos
headpos (the pos of the word's head)
label
gender
case
number
tense

Languages tested:

Basque
English
French
German
Hungarian
Polish
Swedish

Word embeddings were trained on automatically PoS-tagged and dependency-parsed data using existing models. This is so the dependency-based embeddings can be trained. The evaluation is on hand-labelled data. English training data is a subset of Wikipedia; English test data comes from PTB. For all other languages, both the training and test data come from a shared task on parsing morphologically rich languages. Koehn trained embeddings with window size 5 and 11 and dimensionality 10, 100, 200.

Dependency-based embeddings perform the best on almost all tasks. They even do well when the dimensionality is reduced to 10, while other methods perform poorly in this case.

I'll need:

models
learnt representations
automatically labeled data
hand-labeled data



In [21]:

    
%matplotlib inline
import os
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

data_path = '../../data'
tmp_path = '../../tmp'

Learnt representations

GloVe



In [22]:

    
size = 50
fname = 'embeddings/glove.6B.{}d.txt'.format(size)
glove_path = os.path.join(data_path, fname)
glove = pd.read_csv(glove_path, sep=' ', header=None, index_col=0, quoting=csv.QUOTE_NONE)
glove.head()









    Out[22]:







  
    
      
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      ...
      41
      42
      43
      44
      45
      46
      47
      48
      49
      50
    
    
      0
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      the
      0.418000
      0.249680
      -0.41242
      0.12170
      0.34527
      -0.044457
      -0.49688
      -0.17862
      -0.00066
      -0.656600
      ...
      -0.298710
      -0.157490
      -0.347580
      -0.045637
      -0.44251
      0.187850
      0.002785
      -0.184110
      -0.115140
      -0.78581
    
    
      ,
      0.013441
      0.236820
      -0.16899
      0.40951
      0.63812
      0.477090
      -0.42852
      -0.55641
      -0.36400
      -0.239380
      ...
      -0.080262
      0.630030
      0.321110
      -0.467650
      0.22786
      0.360340
      -0.378180
      -0.566570
      0.044691
      0.30392
    
    
      .
      0.151640
      0.301770
      -0.16763
      0.17684
      0.31719
      0.339730
      -0.43478
      -0.31086
      -0.44999
      -0.294860
      ...
      -0.000064
      0.068987
      0.087939
      -0.102850
      -0.13931
      0.223140
      -0.080803
      -0.356520
      0.016413
      0.10216
    
    
      of
      0.708530
      0.570880
      -0.47160
      0.18048
      0.54449
      0.726030
      0.18157
      -0.52393
      0.10381
      -0.175660
      ...
      -0.347270
      0.284830
      0.075693
      -0.062178
      -0.38988
      0.229020
      -0.216170
      -0.225620
      -0.093918
      -0.80375
    
    
      to
      0.680470
      -0.039263
      0.30186
      -0.17792
      0.42962
      0.032246
      -0.41376
      0.13228
      -0.29847
      -0.085253
      ...
      -0.094375
      0.018324
      0.210480
      -0.030880
      -0.19722
      0.082279
      -0.094340
      -0.073297
      -0.064699
      -0.26044
    
  

5 rows × 50 columns

Features



In [23]:

    
fname = 'UD_English/features.csv'
features_path = os.path.join(data_path, os.path.join('evaluation/dependency', fname))
features = pd.read_csv(features_path).set_index('form')
features.head()









    



/home/bacon/miniconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)






    Out[23]:







  
    
      
      lemma
      universal_pos
      lg_pos
      Case
      Definite
      Degree
      Foreign
      Gender
      Mood
      NumType
      Number
      Person
      Poss
      PronType
      Reflex
      Tense
      VerbForm
      Voice
      set
    
    
      form
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      What
      what
      PRON
      WP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      Int
      NaN
      NaN
      NaN
      NaN
      test
    
    
      if
      if
      SCONJ
      IN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      test
    
    
      Google
      Google
      PROPN
      NNP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      Sing
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      test
    
    
      Morphed
      morph
      VERB
      VBD
      NaN
      NaN
      NaN
      NaN
      NaN
      Ind
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      Past
      Fin
      NaN
      test
    
    
      Into
      into
      ADP
      IN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      test



In [24]:

    
df = pd.merge(glove, features, how='inner', left_index=True, right_index=True)
df.head()









    Out[24]:







  
    
      
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      ...
      NumType
      Number
      Person
      Poss
      PronType
      Reflex
      Tense
      VerbForm
      Voice
      set
    
  
  
    
      !
      -0.58402
      0.39031
      0.65282
      -0.3403
      0.19493
      -0.83489
      0.11929
      -0.57291
      -0.56844
      0.72989
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      test
    
    
      !
      -0.58402
      0.39031
      0.65282
      -0.3403
      0.19493
      -0.83489
      0.11929
      -0.57291
      -0.56844
      0.72989
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      test
    
    
      !
      -0.58402
      0.39031
      0.65282
      -0.3403
      0.19493
      -0.83489
      0.11929
      -0.57291
      -0.56844
      0.72989
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      test
    
    
      !
      -0.58402
      0.39031
      0.65282
      -0.3403
      0.19493
      -0.83489
      0.11929
      -0.57291
      -0.56844
      0.72989
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      test
    
    
      !
      -0.58402
      0.39031
      0.65282
      -0.3403
      0.19493
      -0.83489
      0.11929
      -0.57291
      -0.56844
      0.72989
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      test
    
  

5 rows × 69 columns

Prediction



In [25]:

    
def prepare_X_and_y(feature, data):
    """Return X and y ready for predicting feature from embeddings."""
    relevant_data = data[data[feature].notnull()]
    columns = list(range(1, size+1))
    X = relevant_data[columns]
    y = relevant_data[feature]
    train = relevant_data['set'] == 'train'
    test = (relevant_data['set'] == 'test') | (relevant_data['set'] == 'dev')
    X_train, X_test = X[train].values, X[test].values
    y_train, y_test = y[train].values, y[test].values
    return X_train, X_test, y_train, y_test

def predict(model, X_test):
    """Wrapper for getting predictions."""
    results = model.predict_proba(X_test)
    return np.array([t for f,t in results]).reshape(-1,1)

def conmat(model, X_test, y_test):
    """Wrapper for sklearn's confusion matrix."""
    y_pred = model.predict(X_test)
    c = confusion_matrix(y_test, y_pred)
    sns.heatmap(c, annot=True, fmt='d', 
                xticklabels=model.classes_, 
                yticklabels=model.classes_, 
                cmap="YlGnBu", cbar=False)
    plt.ylabel('Ground truth')
    plt.xlabel('Prediction')

def draw_roc(model, X_test, y_test):
    """Convenience function to draw ROC curve."""
    y_pred = predict(model, X_test)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    roc = roc_auc_score(y_test, y_pred)
    label = r'$AUC={}$'.format(str(round(roc, 3)))
    plt.plot(fpr, tpr, label=label);
    plt.title('ROC')
    plt.xlabel('False positive rate');
    plt.ylabel('True positive rate');
    plt.legend();

def cross_val_auc(model, X, y):
    for _ in range(5):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
        model = model.fit(X_train, y_train)
        draw_roc(model, X_test, y_test)



In [32]:

    
X_train, X_test, y_train, y_test = prepare_X_and_y('Tense', df)

model = LogisticRegression(penalty='l2', solver='liblinear')
model = model.fit(X_train, y_train)
conmat(model, X_test, y_test)



In [31]:

    
sns.distplot(model.coef_[0], rug=True, kde=False);

Hyperparameter optimization before error analysis



In [ ]:

	1	2	3	4	5	6	7	8	9	10	...	41	42	43	44	45	46	47	48	49	50
0
the	0.418000	0.249680	-0.41242	0.12170	0.34527	-0.044457	-0.49688	-0.17862	-0.00066	-0.656600	...	-0.298710	-0.157490	-0.347580	-0.045637	-0.44251	0.187850	0.002785	-0.184110	-0.115140	-0.78581
,	0.013441	0.236820	-0.16899	0.40951	0.63812	0.477090	-0.42852	-0.55641	-0.36400	-0.239380	...	-0.080262	0.630030	0.321110	-0.467650	0.22786	0.360340	-0.378180	-0.566570	0.044691	0.30392
.	0.151640	0.301770	-0.16763	0.17684	0.31719	0.339730	-0.43478	-0.31086	-0.44999	-0.294860	...	-0.000064	0.068987	0.087939	-0.102850	-0.13931	0.223140	-0.080803	-0.356520	0.016413	0.10216
of	0.708530	0.570880	-0.47160	0.18048	0.54449	0.726030	0.18157	-0.52393	0.10381	-0.175660	...	-0.347270	0.284830	0.075693	-0.062178	-0.38988	0.229020	-0.216170	-0.225620	-0.093918	-0.80375
to	0.680470	-0.039263	0.30186	-0.17792	0.42962	0.032246	-0.41376	0.13228	-0.29847	-0.085253	...	-0.094375	0.018324	0.210480	-0.030880	-0.19722	0.082279	-0.094340	-0.073297	-0.064699	-0.26044

	lemma	universal_pos	lg_pos	Case	Definite	Degree	Foreign	Gender	Mood	NumType	Number	Person	Poss	PronType	Reflex	Tense	VerbForm	Voice	set
form
What	what	PRON	WP	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Int	NaN	NaN	NaN	NaN	test
if	if	SCONJ	IN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	test
Google	Google	PROPN	NNP	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Sing	NaN	NaN	NaN	NaN	NaN	NaN	NaN	test
Morphed	morph	VERB	VBD	NaN	NaN	NaN	NaN	NaN	Ind	NaN	NaN	NaN	NaN	NaN	NaN	Past	Fin	NaN	test
Into	into	ADP	IN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	test

	1	2	3	4	5	6	7	8	9	10	...	NumType	Number	Person	Poss	PronType	Reflex	Tense	VerbForm	Voice	set
!	-0.58402	0.39031	0.65282	-0.3403	0.19493	-0.83489	0.11929	-0.57291	-0.56844	0.72989	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	test
!	-0.58402	0.39031	0.65282	-0.3403	0.19493	-0.83489	0.11929	-0.57291	-0.56844	0.72989	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	test
!	-0.58402	0.39031	0.65282	-0.3403	0.19493	-0.83489	0.11929	-0.57291	-0.56844	0.72989	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	test
!	-0.58402	0.39031	0.65282	-0.3403	0.19493	-0.83489	0.11929	-0.57291	-0.56844	0.72989	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	test
!	-0.58402	0.39031	0.65282	-0.3403	0.19493	-0.83489	0.11929	-0.57291	-0.56844	0.72989	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	test