In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from gensim.models.word2vec import Word2Vec

In this section, you decide which models, trained on which corpus, to use for the analysis. By default it is the critical corpus over the years 1820-1940, split into 20-year slices. We also tested 10-year slices, but they are clearly less effective, due to the small size of the French corpora. Before running the cell below, set the path to your directory of models (one .bin file per slice, named after its starting year) and adjust the slice boundaries in the range(...) call.


In [3]:
from collections import OrderedDict

# One Word2Vec model per 20-year slice, labelled by its start year: 1820, 1840, ..., 1920.
models = OrderedDict([
    (year, Word2Vec.load('/Your/Path/To/Your/Models/{}.bin'.format(year)))
    for year in range(1820, 1940, 20)
])

In [4]:
def cosine_series(anchor, query):
    """Cosine similarity of `query` to `anchor` in each time slice.

    Slices where `query` is out of vocabulary get a similarity of 0.
    """
    series = OrderedDict()

    for year, model in models.items():

        series[year] = (
            model.similarity(anchor, query)
            if query in model else 0
        )

    return series
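
Note that model.similarity() and the "query in model" membership test rely on the pre-1.0 gensim API, which survived as a deprecated alias through gensim 3.x. If your models were saved and loaded under gensim >= 4, the same function would go through the KeyedVectors interface instead; a minimal sketch, untested against your models:

def cosine_series(anchor, query):
    # gensim >= 4: vectors and vocabulary live on model.wv
    series = OrderedDict()
    for year, model in models.items():
        series[year] = (
            model.wv.similarity(anchor, query)
            if query in model.wv.key_to_index else 0
        )
    return series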

In [5]:
import numpy as np
import statsmodels.api as sm

def lin_reg(series):

    # Years on x, similarities on y.
    x = np.array(list(series.keys()))
    y = np.array(list(series.values()))

    # Add an intercept column, so that params = [intercept, slope].
    x = sm.add_constant(x)

    return sm.OLS(y, x).fit()

In [6]:
def plot_cosine_series(anchor, query, w=5, h=4):
    
    series = cosine_series(anchor, query)
    
    fit = lin_reg(series)

    # Endpoints of the fitted trend line.
    x1 = list(series.keys())[0]
    x2 = list(series.keys())[-1]

    y1 = fit.predict()[0]
    y2 = fit.predict()[-1]
    
    print(query)
    
    # Raw similarity series plus the fitted linear trend.
    plt.figure(figsize=(w, h))
    plt.ylim(0, 1)
    plt.title(query)
    plt.xlabel('Year')
    plt.ylabel('Similarity')
    plt.plot(list(series.keys()), list(series.values()))
    plt.plot([x1, x2], [y1, y2], color='gray', linewidth=0.5)
    plt.show()

The cell below (which calls the functions above) shows how the cosine similarity between two words evolves through time. Here we track how "poésie" relates to "littérature"; you can change both words below.


In [7]:
plot_cosine_series('littérature', 'poésie')


poésie

The next two cells collect, from each trained model, the 200 terms most similar to a given entry (here "littérature") and take the union of those lists.


In [8]:
def union_neighbor_vocab(anchor, topn=200):
    
    # Union, across all time slices, of the topn nearest neighbours.
    vocab = set()
    
    for year, model in models.items():
        similar = model.most_similar(anchor, topn=topn)
        vocab.update([s[0] for s in similar])
        
    return vocab

In [9]:
union_vocab = union_neighbor_vocab('littérature')

At this point, we do the same thing as above: for each token in the union of nearest neighbours to the main entry, we fit a linear regression on its similarity series and record the slope of the trend along with its significance. Significance is assessed with the p-value: below the usual threshold of 0.05, the trend is unlikely to be an artefact of chance.
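
As a quick sanity check on what lin_reg exposes (fit.params[1] is the slope, fit.pvalues[1] its p-value; index 0 holds the intercept), here is a toy series with a known upward trend plus a little noise so the fit is not degenerate; the numbers are illustrative only:

rng = np.random.RandomState(0)
toy = OrderedDict(
    (year, 0.001 * (year - 1820) + rng.normal(0, 0.01))
    for year in range(1820, 1940, 20)
)
fit = lin_reg(toy)
print(fit.params[1], fit.pvalues[1])  # slope close to 0.001, small p-value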


In [10]:
data = []
for token in union_vocab:
    
    series = cosine_series('littérature', token)
    fit = lin_reg(series)
    
    # Keep the token, the slope of its trend, and the slope's p-value.
    data.append((token, fit.params[1], fit.pvalues[1]))

In [11]:
import pandas as pd

df1 = pd.DataFrame(data, columns=('token', 'slope', 'p'))

Increasing

In this part, we want to show which terms emerge more and more alongside the main entry, by default "littérature". The "slope" column gives the rate at which the similarity grows, and the "p" column the statistical significance of that trend. Here the strongest significant emergence alongside "littérature" is "humanisme". Most terms are significant, except "fédéralisme", "welschinger", "maniere", "hennet", "réapparition", "deffense", "bourgin", "colonisation", "naturalisme", "dramaturgie", "réalisme", "sillery", "gréco", "compétence", "symbolisme", "catholique", "japonais", "manuel", "programme", "romand", "topographie", "organisme" and "prédominance". Those terms may rank among the nearest, but statistically their trends are too uncertain to rely on, whereas the others are more certain.
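
If you prefer to see only the significant risers, you can filter on the p column before sorting (a convenience on top of the cells below, not part of the original pipeline):

df1[df1['p'] < 0.05].sort_values('slope', ascending=False).head(50)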


In [12]:
pd.set_option('display.max_rows', 1000)

df1.sort_values('slope', ascending=False).head(50)


Out[12]:
token slope p
255 humanisme 0.008528 0.008731
402 mentalité 0.007059 0.038732
239 sociologie 0.006669 0.024114
302 apport 0.006484 0.004768
63 éthique 0.006427 0.004038
494 professionnel 0.006414 0.005583
457 adaptation 0.006290 0.012927
64 orientation 0.006222 0.017795
519 symboliste 0.006036 0.011135
333 fédéralisme 0.005919 0.074813
514 réaliste 0.005792 0.004040
182 classicisme 0.005337 0.041930
405 individualisme 0.005269 0.007675
111 artistique 0.005224 0.016393
60 psychologue 0.005079 0.031881
21 théoricien 0.005076 0.008933
0 welschinger 0.005076 0.068333
479 lyrisme 0.004927 0.004797
204 maniere 0.004815 0.158302
416 hennet 0.004772 0.158302
508 humaniste 0.004746 0.039770
398 linguistique 0.004736 0.047843
42 initiateur 0.004529 0.024463
480 psychologie 0.004525 0.020063
344 réapparition 0.004494 0.147884
322 deffense 0.004470 0.158302
37 bourgin 0.004424 0.158302
372 enquête 0.004417 0.029462
305 xviie 0.004348 0.030010
506 belge 0.004339 0.035691
79 colonisation 0.004247 0.069903
266 naturalisme 0.004245 0.116436
523 technique 0.004121 0.017245
159 dramaturgie 0.004010 0.096570
353 réalisme 0.003938 0.107986
40 sillery 0.003738 0.278006
276 évolution 0.003532 0.011508
489 gréco 0.003503 0.105689
161 compétence 0.003387 0.133109
401 symbolisme 0.003367 0.123809
203 catholique 0.003174 0.140966
154 japonais 0.002915 0.268274
94 manuel 0.002900 0.232273
400 programme 0.002806 0.076685
234 culture 0.002757 0.007773
526 romand 0.002748 0.285405
341 topographie 0.002721 0.316263
250 organisme 0.002658 0.324369
425 renaissance 0.002433 0.000671
516 prédominance 0.002374 0.294377

In the following cell, we show how the twenty vectors with the steepest upward trends change through time relative to "littérature": "humanisme", for example, seems to be very rare before 1860, and then becomes more and more similar to "littérature". These are the terms that were not similar at the start but tend to become more and more related to "littérature". Keep in mind the p-value associated with each vector.


In [13]:
for i, row in df1.sort_values('slope', ascending=False).head(20).iterrows():
    plot_cosine_series('littérature', row['token'], 3, 2)


humanisme
mentalité
sociologie
apport
éthique
professionnel
adaptation
orientation
symboliste
fédéralisme
réaliste
classicisme
individualisme
artistique
psychologue
théoricien
welschinger
lyrisme
maniere
hennet

Decreasing

The same process applies here: we want to see which terms tend to dissociate themselves from "littérature" over time (again, you can swap in any word from the trained models). Once more, check the p-values: "septante", "transplantation", "choeur" and "philé" are not considered significant, while "chaldéen" is, as are "destination", "morceau", and so on. That many of these trends are less significant is to be expected: the rarer a term, the more erratic its similarity series tends to be.
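
One way to check that intuition is to look at each token's corpus frequency per slice. A minimal sketch, assuming the gensim 1.x-3.x API where counts are stored on model.wv.vocab (under gensim >= 4 it would be model.wv.get_vecattr(token, 'count')); "philé" is just an example token taken from the table below:

def corpus_counts(token):
    # Occurrence count of `token` in each slice's training corpus.
    return OrderedDict(
        (year, model.wv.vocab[token].count if token in model.wv.vocab else 0)
        for year, model in models.items()
    )

corpus_counts('philé')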


In [14]:
df1.sort_values('slope', ascending=True).head(50)


Out[14]:
token slope p
303 chaldéen -0.006042 0.014349
404 septante -0.005784 0.054396
334 transplantation -0.005606 0.109092
454 choeur -0.005519 0.159394
102 philé -0.005426 0.158302
532 pannonie -0.005413 0.152058
458 tirer -0.005310 0.136123
125 destination -0.005293 0.017138
260 morceau -0.005033 0.032446
430 diviniser -0.004939 0.160054
27 scène -0.004797 0.083719
15 marche -0.004495 0.096892
179 romance -0.004431 0.147998
325 héraclite -0.004322 0.197699
335 homère -0.004240 0.078792
433 précéder -0.004189 0.024624
528 acteur -0.004062 0.207038
387 différent -0.003973 0.121160
499 prologue -0.003941 0.127740
187 poème -0.003800 0.034563
240 traiter -0.003777 0.120664
447 fable -0.003747 0.038174
156 chirurgie -0.003639 0.142766
254 renfermer -0.003637 0.063433
536 divers -0.003591 0.162928
122 quarantaine -0.003545 0.298731
6 personnage -0.003534 0.197593
380 investigation -0.003489 0.028636
297 barbarie -0.003480 0.002474
424 alphabet -0.003417 0.086647
215 principal -0.003368 0.259143
449 tragique -0.003354 0.204413
284 arabe -0.003341 0.090054
518 ascension -0.003330 0.224813
328 tribune -0.003298 0.022713
59 former -0.003296 0.126915
198 réunir -0.003262 0.188630
467 épique -0.003252 0.065215
142 comique -0.003221 0.184657
223 seconde -0.003201 0.240705
317 pièce -0.003177 0.204008
498 subordonner -0.003151 0.131390
386 positif -0.003150 0.102832
357 théologique -0.003126 0.001978
232 considérer -0.003069 0.105707
365 anatomie -0.003059 0.022842
492 distinguer -0.003051 0.167331
167 partie -0.003016 0.146112
461 vraiment -0.002996 0.125920
439 poète -0.002988 0.044902

In [15]:
for i, row in df1.sort_values('slope', ascending=True).head(20).iterrows():
    plot_cosine_series('littérature', row['token'], 3, 2)


chaldéen
septante
transplantation
choeur
philé
pannonie
tirer
destination
morceau
diviniser
scène
marche
romance
héraclite
homère
précéder
acteur
différent
prologue
poème

In [16]:
def intersect_neighbor_vocab(anchor, topn=2000):
    
    # Terms that appear among the topn nearest neighbours in every
    # time slice; topn is much larger than before because the
    # intersection across all slices is very restrictive.
    vocabs = []
    
    for year, model in models.items():
        similar = model.most_similar(anchor, topn=topn)
        vocabs.append(set([s[0] for s in similar]))
        
    return set.intersection(*vocabs)

In [17]:
intersect_vocab = intersect_neighbor_vocab('littérature')

In [18]:
data = []
for token in intersect_vocab:
    
    series = cosine_series('littérature', token)
    fit = lin_reg(series)
    
    # Keep only tokens whose trend is significant at the 0.05 level.
    if fit.pvalues[1] < 0.05:
        data.append((token, fit.params[1], fit.pvalues[1]))

In [19]:
import pandas as pd

df2 = pd.DataFrame(data, columns=('token', 'slope', 'p'))

Intersected neighbors

In this part, we show which significant terms remain, throughout the whole period, among the nearest neighbours of the main entry. In other words, these vectors stay close to the "littérature" vector in every time slice. At this stage we keep only significant terms (the filter above: if fit.pvalues[1] < 0.05).
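
Note that among these stable neighbours only "culture" and "renaissance" trend upward; the rest drift away slowly. You can split the table accordingly (again a convenience, not part of the original pipeline):

rising = df2[df2['slope'] > 0].sort_values('slope', ascending=False)
declining = df2[df2['slope'] < 0].sort_values('slope')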


In [20]:
df2.sort_values('slope', ascending=False)


Out[20]:
token slope p
11 culture 0.002757 0.007773
19 renaissance 0.002433 0.000671
24 goût -0.001155 0.012860
12 résumé -0.001584 0.033721
1 recherche -0.001631 0.034747
13 traduction -0.001667 0.041989
22 version -0.001856 0.030805
2 spéculation -0.001864 0.029492
5 dialecte -0.001944 0.013789
9 original -0.001986 0.041103
10 comparaison -0.002137 0.005343
8 primitif -0.002153 0.041237
6 phase -0.002208 0.035145
0 faculté -0.002256 0.021378
21 oriental -0.002304 0.043778
3 savant -0.002420 0.035848
17 cours -0.002546 0.035036
4 cité -0.002567 0.025337
14 monument -0.002576 0.001468
18 éloquence -0.002825 0.024742
15 commentaire -0.002871 0.028624
23 chevalerie -0.002971 0.028932
20 poète -0.002988 0.044902
16 anatomie -0.003059 0.022842
7 poème -0.003800 0.034563

In [21]:
for i, row in df2.sort_values('slope', ascending=False).iterrows():
    plot_cosine_series('littérature', row['token'], 3, 2)


culture
renaissance
goût
résumé
recherche
traduction
version
spéculation
dialecte
original
comparaison
primitif
phase
faculté
oriental
savant
cours
cité
monument
éloquence
commentaire
chevalerie
poète
anatomie
poème
