In [1]:

    
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:

    
from gensim.models.word2vec import Word2Vec

In this section, you choose which models, trained on which corpus, to use for the analysis. The default is the critical corpus for the years 1820-1940, split into 20-year slices. We also tested 10-year slices, but they were clearly much less effective because the French corpora are small. In the cell below, replace the path passed to "Word2Vec.load" with the path to your directory of models, then set the period and the slice length in "for year in range(...)".



In [3]:

    
from collections import OrderedDict

models = OrderedDict([
    (year, Word2Vec.load('/Your/Path/To/Your/Models/{}.bin'.format(year)))
    for year in range(1820, 1940, 20)
])
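
Before loading, it can help to check which model files the loop above will look for. The following is a minimal sketch, not part of the original notebook: "expected_model_paths" is a hypothetical helper, and the directory is the same placeholder as in the cell above.

```python
import os

def expected_model_paths(model_dir, start=1820, stop=1940, step=20):
    """Map each slice's starting year to the file name the loading cell expects.

    Hypothetical helper; it mirrors '{}.bin'.format(year) from the cell above.
    """
    return {
        year: os.path.join(model_dir, '{}.bin'.format(year))
        for year in range(start, stop, step)
    }

paths = expected_model_paths('/Your/Path/To/Your/Models')

# range() excludes its stop value, so the last slice starts in 1920.
print(list(paths))

# List the files that are missing before calling Word2Vec.load on them.
missing = [path for path in paths.values() if not os.path.exists(path)]
print(missing)
```

With the default arguments, six models are expected, one per 20-year slice.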



In [4]:

    
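def cosine_series(anchor, query):
    """Return the cosine similarity of anchor and query in each model, keyed by year."""

    series = OrderedDict()

    for year, model in models.items():

        # Look words up through model.wv (the KeyedVectors); a query missing
        # from a model's vocabulary is recorded as 0 for that year.
        series[year] = (
            model.wv.similarity(anchor, query)
            if query in model.wv else 0
        )

    return series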
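
For readers unfamiliar with the measure, here is a minimal sketch of cosine similarity itself, written with NumPy on toy vectors. It is not how gensim is called in this notebook; "cosine_similarity" is a name introduced here for illustration, and the vectors are made up.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors (illustrative values, not taken from any model).
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [-2.0, 1.0, 0.0]  # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

The measure depends only on the direction of the vectors, not on their length: vectors pointing the same way score 1 and orthogonal vectors score 0.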



In [5]:

    
import numpy as np
import statsmodels.api as sm

def lin_reg(series):

    x = np.array(list(series.keys()))
    y = np.array(list(series.values()))

    x = sm.add_constant(x)

    return sm.OLS(y, x).fit()
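
To see why the slope is later read from position 1 of the fitted parameters, here is a sketch of the same regression done with NumPy only, on a toy series. It assumes the default behaviour of "sm.add_constant", which puts the column of ones first; the similarity values are invented for the example.

```python
import numpy as np

# Toy series: similarity rising by exactly 0.002 per year (illustrative values).
years = np.array([1820, 1840, 1860, 1880, 1900, 1920], dtype=float)
sims = 0.1 + 0.002 * (years - 1820)

# Same layout as sm.add_constant(x): a column of ones first, then the years.
design = np.column_stack([np.ones_like(years), years])

# Ordinary least squares via NumPy; coef[0] is the intercept, coef[1] the slope.
coef, _, _, _ = np.linalg.lstsq(design, sims, rcond=None)
intercept, slope = coef
print(slope)
```

Because the x axis is the year, the slope is a change in cosine similarity per year.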



In [6]:

    
def plot_cosine_series(anchor, query, w=5, h=4):
    
    series = cosine_series(anchor, query)
    
    fit = lin_reg(series)

    x1 = list(series.keys())[0]
    x2 = list(series.keys())[-1]

    y1 = fit.predict()[0]
    y2 = fit.predict()[-1]
    
    print(query)
    
    plt.figure(figsize=(w, h))
    plt.ylim(0, 1)
    plt.title(query)
    plt.xlabel('Year')
    plt.ylabel('Similarity')
    plt.plot(list(series.keys()), list(series.values()))
    plt.plot([x1, x2], [y1, y2], color='gray', linewidth=0.5)
    plt.show()

The cell below (which calls the functions above) shows how the cosine similarity between two words changes over time. Here, we show how "poésie" evolves relative to "littérature". You can change both words in the call below.



In [7]:

    
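plot_cosine_series('littérature', 'poésie')

poésie

[Figure: cosine similarity between "littérature" and "poésie" for each model year, with the fitted regression line in gray.]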

The next two cells collect, for a given term (here "littérature"), the 200 most similar terms in each of the trained models and merge them into a single vocabulary.



In [8]:

    
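def union_neighbor_vocab(anchor, topn=200):
    """Return the union, across all models, of the topn nearest neighbours of anchor."""

    vocab = set()

    for year, model in models.items():
        # most_similar returns (word, similarity) pairs; keep only the words.
        similar = model.wv.most_similar(anchor, topn=topn)
        vocab.update(s[0] for s in similar)

    return vocab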
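
The effect of taking the union can be sketched on toy data. The neighbour lists below are invented for the example and are not model output.

```python
# Toy neighbour lists for three periods (illustrative words, not model output).
neighbours_by_year = {
    1820: ['poésie', 'éloquence', 'histoire'],
    1840: ['poésie', 'histoire', 'critique'],
    1860: ['poésie', 'critique', 'roman'],
}

vocab = set()
for year, words in neighbours_by_year.items():
    # Same accumulation as in union_neighbor_vocab: add every neighbour once.
    vocab.update(words)

print(sorted(vocab))
print(len(vocab))
```

The union is usually larger than any single list, which is why the result tables further down have row indices well above 200 even though each model contributes only 200 neighbours.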



In [9]:

    
union_vocab = union_neighbor_vocab('littérature')

At this point, we do the same thing as above for each token among the nearest neighbours of the main entry: we compute its similarity to the main entry in each period, fit a linear regression over time, and record the slope and its significance. Significance is measured by the p value, that is, the probability of observing a slope at least this steep if there were in fact no trend. Below a conventional threshold (0.05), we treat the trend as statistically significant.
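
With six 20-year slices, each regression rests on only six points, which leaves n - 2 = 4 degrees of freedom. The sketch below computes the t statistic of the slope by hand with NumPy, on invented values. It is an illustration, not the notebook's method: statsmodels computes exact p values, whereas here we only compare against 2.776, the standard two-sided 5% critical value of Student's t with 4 degrees of freedom. "slope_t_statistic" is a name introduced for the example.

```python
import numpy as np

def slope_t_statistic(years, sims):
    """Return (slope, t) for a simple linear regression of sims on years."""
    x = np.asarray(years, dtype=float)
    y = np.asarray(sims, dtype=float)
    n = len(x)
    x_centered = x - x.mean()
    sxx = (x_centered ** 2).sum()
    slope = (x_centered * (y - y.mean())).sum() / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    residual_variance = (residuals ** 2).sum() / (n - 2)
    standard_error = np.sqrt(residual_variance / sxx)
    return slope, slope / standard_error

# Toy series with a clear upward trend (illustrative values only).
years = [1820, 1840, 1860, 1880, 1900, 1920]
sims = [0.10, 0.22, 0.28, 0.41, 0.52, 0.58]

slope, t_stat = slope_t_statistic(years, sims)
CRITICAL_T_DF4 = 2.776  # two-sided 5% threshold for 4 degrees of freedom
print(slope, t_stat, abs(t_stat) > CRITICAL_T_DF4)
```

A slope is significant at the 0.05 level exactly when the absolute t statistic exceeds that critical value, so with so few points only fairly regular series pass the test.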



In [10]:

    
data = []
for token in union_vocab:
    
    series = cosine_series('littérature', token)
    fit = lin_reg(series)
    
    data.append((token, fit.params[1], fit.pvalues[1]))



In [11]:

    
import pandas as pd

df1 = pd.DataFrame(data, columns=('token', 'slope', 'p'))

Increasing

In this part, we show which terms move closer and closer to the main entry, by default "littérature". The "slope" is the rate of change in similarity per year, and "p" is its significance. Here, the steepest significant rise toward "littérature" is that of "humanisme". Among the 50 terms below, all are significant at the 0.05 threshold except "fédéralisme", "welschinger", "maniere", "hennet", "réapparition", "deffense", "bourgin", "colonisation", "naturalisme", "dramaturgie", "réalisme", "sillery", "gréco", "compétence", "symbolisme", "catholique", "japonais", "manuel", "programme", "romand", "topographie", "organisme" and "prédominance". Those terms may well be moving closer, but the trend is not statistically significant enough to be sure, while the others are more certain.
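
Rather than reading the p column by eye, the split between significant and non-significant terms can be made in pandas. This is a small sketch on four rows copied from the table below; the variable names are ours, and 0.05 is the threshold used in the text.

```python
import pandas as pd

# A few rows copied from the "Increasing" table.
rows = [
    ('humanisme', 0.008528, 0.008731),
    ('mentalité', 0.007059, 0.038732),
    ('fédéralisme', 0.005919, 0.074813),
    ('maniere', 0.004815, 0.158302),
]
df = pd.DataFrame(rows, columns=('token', 'slope', 'p'))

alpha = 0.05  # conventional significance threshold

# Boolean masks keep the rows on either side of the threshold.
significant = df[df['p'] < alpha].sort_values('slope', ascending=False)
not_significant = df[df['p'] >= alpha]

print(significant['token'].tolist())
print(not_significant['token'].tolist())
```

The same two lines can be applied to df1 directly once it has been built.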



In [12]:

    
pd.set_option('display.max_rows', 1000)

df1.sort_values('slope', ascending=False).head(50)









    Out[12]:

     token                slope         p
255  humanisme         0.008528  0.008731
402  mentalité         0.007059  0.038732
239  sociologie        0.006669  0.024114
302  apport            0.006484  0.004768
 63  éthique           0.006427  0.004038
494  professionnel     0.006414  0.005583
457  adaptation        0.006290  0.012927
 64  orientation       0.006222  0.017795
519  symboliste        0.006036  0.011135
333  fédéralisme       0.005919  0.074813
514  réaliste          0.005792  0.004040
182  classicisme       0.005337  0.041930
405  individualisme    0.005269  0.007675
111  artistique        0.005224  0.016393
 60  psychologue       0.005079  0.031881
 21  théoricien        0.005076  0.008933
  0  welschinger       0.005076  0.068333
479  lyrisme           0.004927  0.004797
204  maniere           0.004815  0.158302
416  hennet            0.004772  0.158302
508  humaniste         0.004746  0.039770
398  linguistique      0.004736  0.047843
 42  initiateur        0.004529  0.024463
480  psychologie       0.004525  0.020063
344  réapparition      0.004494  0.147884
322  deffense          0.004470  0.158302
 37  bourgin           0.004424  0.158302
372  enquête           0.004417  0.029462
305  xviie             0.004348  0.030010
506  belge             0.004339  0.035691
 79  colonisation      0.004247  0.069903
266  naturalisme       0.004245  0.116436
523  technique         0.004121  0.017245
159  dramaturgie       0.004010  0.096570
353  réalisme          0.003938  0.107986
 40  sillery           0.003738  0.278006
276  évolution         0.003532  0.011508
489  gréco             0.003503  0.105689
161  compétence        0.003387  0.133109
401  symbolisme        0.003367  0.123809
203  catholique        0.003174  0.140966
154  japonais          0.002915  0.268274
 94  manuel            0.002900  0.232273
400  programme         0.002806  0.076685
234  culture           0.002757  0.007773
526  romand            0.002748  0.285405
341  topographie       0.002721  0.316263
250  organisme         0.002658  0.324369
425  renaissance       0.002433  0.000671
516  prédominance      0.002374  0.294377

In the following cell, we plot how the similarity to "littérature" changes over time for the 20 terms with the steepest positive slopes. "humanisme", for example, seems to be very rare before 1860 and then becomes more and more similar to "littérature". These are terms that were not similar at the beginning but become increasingly related to "littérature". Keep in mind the p value associated with each term.



In [13]:

    
for i, row in df1.sort_values('slope', ascending=False).head(20).iterrows():
    plot_cosine_series('littérature', row['token'], 3, 2)

[Output: 20 plots of cosine similarity to "littérature" by year, one per token, in this order: humanisme, mentalité, sociologie, apport, éthique, professionnel, adaptation, orientation, symboliste, fédéralisme, réaliste, classicisme, individualisme, artistique, psychologue, théoricien, welschinger, lyrisme, maniere, hennet.]

Decreasing

The process is the same here: we want to see which terms tend to move away from "littérature" (the default entry, which you can change to any term present in the trained models). Again, check the p values. "transplantation", "choeur" and "philé" are not significant, whereas "chaldéen", "destination" and "morceau" are. It is logical that some of these are less significant: the rarer a term, the more erratic its series tends to be.
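
One concrete source of erratic series is that cosine_series records 0 for every period in which a term is absent from the model. The sketch below uses NumPy only and toy values, and "slope_t_from_correlation" is a name introduced for the example. It shows that a term present in a single end slice always gets the same t statistic, the square root of 3 (about 1.73), whatever its similarity in that slice. That is below 2.776, the two-sided 5% critical value for 4 degrees of freedom. This is consistent with several rare terms in the tables sharing the p value 0.158302, although that reading is our inference and is not checked in the notebook.

```python
import numpy as np

years = np.array([1820, 1840, 1860, 1880, 1900, 1920], dtype=float)

def slope_t_from_correlation(x, y):
    """t statistic of the OLS slope, computed from the Pearson correlation r."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

# A term found only in the last model: 0 in every other period (toy values).
t_values = []
for last_value in (0.2, 0.7):
    series = np.array([0, 0, 0, 0, 0, last_value], dtype=float)
    t_values.append(slope_t_from_correlation(years, series))

print(t_values)
```

Because the correlation does not depend on the scale of the series, the size of the single non-zero value changes the slope but not the significance.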



In [14]:

    
df1.sort_values('slope', ascending=True).head(50)









    Out[14]:

     token                slope         p
303  chaldéen         -0.006042  0.014349
404  septante         -0.005784  0.054396
334  transplantation  -0.005606  0.109092
454  choeur           -0.005519  0.159394
102  philé            -0.005426  0.158302
532  pannonie         -0.005413  0.152058
458  tirer            -0.005310  0.136123
125  destination      -0.005293  0.017138
260  morceau          -0.005033  0.032446
430  diviniser        -0.004939  0.160054
 27  scène            -0.004797  0.083719
 15  marche           -0.004495  0.096892
179  romance          -0.004431  0.147998
325  héraclite        -0.004322  0.197699
335  homère           -0.004240  0.078792
433  précéder         -0.004189  0.024624
528  acteur           -0.004062  0.207038
387  différent        -0.003973  0.121160
499  prologue         -0.003941  0.127740
187  poème            -0.003800  0.034563
240  traiter          -0.003777  0.120664
447  fable            -0.003747  0.038174
156  chirurgie        -0.003639  0.142766
254  renfermer        -0.003637  0.063433
536  divers           -0.003591  0.162928
122  quarantaine      -0.003545  0.298731
  6  personnage       -0.003534  0.197593
380  investigation    -0.003489  0.028636
297  barbarie         -0.003480  0.002474
424  alphabet         -0.003417  0.086647
215  principal        -0.003368  0.259143
449  tragique         -0.003354  0.204413
284  arabe            -0.003341  0.090054
518  ascension        -0.003330  0.224813
328  tribune          -0.003298  0.022713
 59  former           -0.003296  0.126915
198  réunir           -0.003262  0.188630
467  épique           -0.003252  0.065215
142  comique          -0.003221  0.184657
223  seconde          -0.003201  0.240705
317  pièce            -0.003177  0.204008
498  subordonner      -0.003151  0.131390
386  positif          -0.003150  0.102832
357  théologique      -0.003126  0.001978
232  considérer       -0.003069  0.105707
365  anatomie         -0.003059  0.022842
492  distinguer       -0.003051  0.167331
167  partie           -0.003016  0.146112
461  vraiment         -0.002996  0.125920
439  poète            -0.002988  0.044902



In [15]:

    
for i, row in df1.sort_values('slope', ascending=True).head(20).iterrows():
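    plot_cosine_series('littérature', row['token'], 3, 2)

[Output: 20 plots of cosine similarity to "littérature" by year, one per token, in this order: chaldéen, septante, transplantation, choeur, philé, pannonie, tirer, destination, morceau, diviniser, scène, marche, romance, héraclite, homère, précéder, acteur, différent, prologue, poème.]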



In [16]:

    
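def intersect_neighbor_vocab(anchor, topn=2000):
    """Return the terms that are among the topn nearest neighbours of anchor in every model."""

    vocabs = []

    for year, model in models.items():
        # most_similar returns (word, similarity) pairs; keep only the words.
        similar = model.wv.most_similar(anchor, topn=topn)
        vocabs.append(set(s[0] for s in similar))

    return set.intersection(*vocabs)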



In [17]:

    
intersect_vocab = intersect_neighbor_vocab('littérature')



In [18]:

    
data = []
for token in intersect_vocab:
    
    series = cosine_series('littérature', token)
    fit = lin_reg(series)
    
    if fit.pvalues[1] < 0.05:
        data.append((token, fit.params[1], fit.pvalues[1]))



In [19]:

    
import pandas as pd

df2 = pd.DataFrame(data, columns=('token', 'slope', 'p'))

Intersected neighbors

In this part, we show which terms remain among the nearest neighbours of the main entry throughout the whole period and also show a significant trend. That is, these terms appear among the 2,000 nearest neighbours of "littérature" in every model, so they stay close to it over time. At this stage, we keep only significant terms (see the filter above: "if fit.pvalues[1] < 0.05").
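
The intersection step can be sketched on toy data as follows. The neighbour lists are invented for the example and are not model output.

```python
# Toy neighbour lists for three periods (illustrative words, not model output).
neighbours_by_year = {
    1820: ['poésie', 'éloquence', 'histoire'],
    1840: ['poésie', 'histoire', 'critique'],
    1860: ['poésie', 'critique', 'roman'],
}

# One set per period, as in intersect_neighbor_vocab.
vocabs = [set(words) for words in neighbours_by_year.values()]

# Unpacking the list passes every set to set.intersection at once.
common = set.intersection(*vocabs)

print(common)
```

Only terms present in every period survive, which is why this table is much shorter than the union-based ones even with a larger topn.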



In [20]:

    
df2.sort_values('slope', ascending=False)









    Out[20]:

     token                slope         p
 11  culture           0.002757  0.007773
 19  renaissance       0.002433  0.000671
 24  goût             -0.001155  0.012860
 12  résumé           -0.001584  0.033721
  1  recherche        -0.001631  0.034747
 13  traduction       -0.001667  0.041989
 22  version          -0.001856  0.030805
  2  spéculation      -0.001864  0.029492
  5  dialecte         -0.001944  0.013789
  9  original         -0.001986  0.041103
 10  comparaison      -0.002137  0.005343
  8  primitif         -0.002153  0.041237
  6  phase            -0.002208  0.035145
  0  faculté          -0.002256  0.021378
 21  oriental         -0.002304  0.043778
  3  savant           -0.002420  0.035848
 17  cours            -0.002546  0.035036
  4  cité             -0.002567  0.025337
 14  monument         -0.002576  0.001468
 18  éloquence        -0.002825  0.024742
 15  commentaire      -0.002871  0.028624
 23  chevalerie       -0.002971  0.028932
 20  poète            -0.002988  0.044902
 16  anatomie         -0.003059  0.022842
  7  poème            -0.003800  0.034563



In [21]:

    
for i, row in df2.sort_values('slope', ascending=False).iterrows():
    plot_cosine_series('littérature', row['token'], 3, 2)

[Output: 25 plots of cosine similarity to "littérature" by year, one per token, in this order: culture, renaissance, goût, résumé, recherche, traduction, version, spéculation, dialecte, original, comparaison, primitif, phase, faculté, oriental, savant, cours, cité, monument, éloquence, commentaire, chevalerie, poète, anatomie, poème.]


