In [40]:
import matplotlib.pyplot as plt
%matplotlib inline

In [41]:
from gensim.models.word2vec import Word2Vec

In this section, you can decide which model and which corpus you want to use for your analysis. By default, it is the Revue des Deux Mondes corpus over the years 1820-1900, in 20-year slices. We also tested 10-year slices; they are clearly much less effective, due to the small size of the French corpora. After the load, set the path to your directory of models, then select your period range in the range(...) call.


In [42]:
from collections import OrderedDict

models = OrderedDict([
    (year, Word2Vec.load('/home/odysseus/Téléchargements/hist-vec-master/r2m/LemCorpR2M_models/{}.bin'.format(year)))
    for year in range(1820, 1900, 20)
])

In [43]:
def cosine_series(anchor, query):
    """Cosine similarity between anchor and query in each period's model
    (0 when the query word is missing from that period's vocabulary)."""

    series = OrderedDict()

    for year, model in models.items():

        series[year] = (
            model.similarity(anchor, query)
            if query in model else 0
        )

    return series
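The fallback-to-zero pattern above can be illustrated without the trained models. Below is a minimal sketch with invented toy vectors (the words, periods, and vector values are made up for illustration): cosine similarity is computed per period, and 0 is returned whenever the query is absent from that period's vocabulary, just as in `cosine_series`.

```python
from collections import OrderedDict
import numpy as np

def cosine(u, v):
    # plain cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy "models": one word->vector dict per period (invented data)
toy_models = OrderedDict([
    (1820, {'littérature': np.array([1.0, 0.0]), 'poésie': np.array([1.0, 1.0])}),
    (1840, {'littérature': np.array([1.0, 0.2]), 'poésie': np.array([1.0, 0.5])}),
    (1860, {'littérature': np.array([1.0, 0.4])}),  # 'poésie' missing here
])

def toy_cosine_series(anchor, query):
    series = OrderedDict()
    for year, vectors in toy_models.items():
        # fall back to 0 when the query word is out of vocabulary
        series[year] = (
            cosine(vectors[anchor], vectors[query])
            if query in vectors else 0
        )
    return series

print(toy_cosine_series('littérature', 'poésie'))
```

Note that a missing word yields a similarity of exactly 0 for that period, which is what makes the series comparable across periods with different vocabularies.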

In [44]:
import numpy as np
import statsmodels.api as sm

def lin_reg(series):
    """Ordinary least squares fit of similarity against year."""

    x = np.array(list(series.keys()))
    y = np.array(list(series.values()))

    x = sm.add_constant(x)  # add intercept term

    return sm.OLS(y, x).fit()

In [45]:
def plot_cosine_series(anchor, query, w=8, h=4):
    """Plot the cosine-similarity series and its fitted trend line."""

    series = cosine_series(anchor, query)

    fit = lin_reg(series)

    # endpoints of the fitted trend line
    x1 = list(series.keys())[0]
    x2 = list(series.keys())[-1]

    y1 = fit.predict()[0]
    y2 = fit.predict()[-1]

    print(query)

    plt.figure(figsize=(w, h))
    plt.ylim(0, 1)
    plt.title(query)
    plt.xlabel('Year')
    plt.ylabel('Similarity')
    plt.plot(list(series.keys()), list(series.values()))
    plt.plot([x1, x2], [y1, y2], color='gray', linewidth=0.5)
    plt.show()

Similarity of a list of key terms to "littérature"

The cell below (which calls the methods above) shows how the cosine similarity between two words evolves through time. Here, we show how a list of selected concepts evolves relative to "littérature". You can change both manually below.


In [46]:
testList = ('littérature','poésie', 'science', 'savoir', 'histoire', 'philosophie', 'lettre', 'critique', 
            'roman', 'théâtre', 'drame', 'esprit', 'langue', 'diplomatie', 'politique', 'morale', 'société', 
            'pouvoir', 'théologie', 'droit', 'loi', 'méthode', 'génie', 'romantisme', 'réalisme', 'symbolisme', 
            'naturalisme')

for val in testList[1:]:  # skip 'littérature' itself
    plot_cosine_series('littérature', val)


poésie
science
savoir
histoire
philosophie
lettre
critique
roman
théâtre
drame
esprit
langue
diplomatie
politique
morale
société
pouvoir
théologie
droit
loi
méthode
génie
romantisme
réalisme
symbolisme
naturalisme

The next two cells get the 200 terms most similar to a specific term (here "littérature") from the trained models.


In [47]:
def union_neighbor_vocab(anchor, topn=200):
    """Union of the topn nearest neighbours of anchor across all period models."""

    vocab = set()

    for model in models.values():
        similar = model.most_similar(anchor, topn=topn)
        vocab.update([s[0] for s in similar])

    return vocab

At this point, we do the same thing as above: for each token among the 200 terms nearest to the main entry, we compute the proximity of the term and its significance. Significance is assessed with the p-value: below a certain threshold (0.05), the estimated trend is unlikely to be due to chance, so we consider it significant.


In [48]:
entries = {}

for word in testList:
    data = []
    for token in union_neighbor_vocab(word):

        series = cosine_series(word, token)
        fit = lin_reg(series)

        # keep only tokens whose trend (slope) is significant at p < 0.05
        if fit.pvalues[1] < 0.05:
            data.append((token, fit.params[1], fit.pvalues[1]))
    entries[word] = data

Increasing

In this part, we want to show which terms emerge more and more alongside the main entry, that is, each word of the given test list. The "slope" is the rate of increase, and the "p" value its significance. So here, the main significant emergence with "littérature" is "humanisme". All terms seem to be significant, except "fédéralisme", "welschinger", "maniere", "hennet", "réapparition", "deffence", "bourgin", "colonie", "naturalisme", "réalisme", "sillery", "gréco", "compétence", "symbolisme", "catholique", "japonais", "manuel", "romand", "topographie", "organisme", "prédominance". That is to say, those terms may be among the nearest, but statistically they are not significant enough to be certain, while the others are more reliable.

In the following cell, we show how the ten most similar vectors change through time relative to the words in the test list: "humanisme", for example, seems to be very rare before 1860 and then becomes more and more similar to "littérature". These plots show the terms that were not similar at the beginning but tend to become more and more related to "littérature". Keep in mind the p-value associated with each vector.


In [49]:
import pandas as pd
from IPython.display import Markdown, display
pd.set_option('display.max_rows', 1000)

for word in testList:
    display(Markdown("### <i><b>" + word + "</b></i>"))
    df1 = pd.DataFrame(entries[word], columns=('token', 'slope', 'p'))
    top = df1.sort_values('slope', ascending=False).head(10)
    print(top)
    print('\n\n')

    for i, row in top.iterrows():
        plot_cosine_series(word, row['token'], 8, 4)


littérature

            token     slope         p
0     physiologie  0.015517  0.039857
121   positiviste  0.015063  0.002245
28    sauvegarder  0.014201  0.024549
105     arbitrage  0.011269  0.003807
87         adepte  0.008166  0.026672
7    consécration  0.003431  0.000266
45      orthodoxe  0.003419  0.002439
122   suppression  0.001464  0.040740
52   traditionnel  0.001363  0.007826
63   enseignement -0.000700  0.043674



physiologie
positiviste
sauvegarder
arbitrage
adepte
consécration
orthodoxe
suppression
traditionnel
enseignement

poésie

             token     slope         p
35    infiltration  0.015045  0.024431
103  compromettant  0.014531  0.043260
19       expansion  0.001542  0.028793
55         artiste  0.000468  0.040553
86       curiosité -0.001054  0.011632
61        grandeur -0.001207  0.031502
53    enseignement -0.001276  0.018840
75         analyse -0.001356  0.031715
58       ascendant -0.001456  0.039471
34         raffiné -0.001463  0.041801



infiltration
compromettant
expansion
artiste
curiosité
grandeur
enseignement
analyse
ascendant
raffiné

science

           token     slope         p
49      religion -0.001481  0.019389
14       opinion -0.001647  0.040320
44       société -0.001911  0.003703
29       naturel -0.001914  0.020147
28       intérêt -0.002073  0.009383
48      pratique -0.002115  0.006009
52  civilisation -0.002135  0.011585
23     puissance -0.002276  0.030968
54        peuple -0.002345  0.012885
38        morale -0.002404  0.005120



religion
opinion
société
naturel
intérêt
pratique
civilisation
puissance
peuple
morale

savoir

         token     slope         p
35  comprendre -0.001639  0.045174
21   autrement -0.002523  0.027493
27        user -0.002640  0.008462
46    souffrir -0.002643  0.043424
53   remarquer -0.002696  0.046478
31   ressentir -0.002883  0.028704
42     bonheur -0.003029  0.025884
48   respecter -0.003246  0.045202
33     accuser -0.003294  0.047635
13       voilà -0.003382  0.039623



comprendre
autrement
user
souffrir
remarquer
ressentir
bonheur
respecter
accuser
voilà

histoire

         token     slope         p
46        pays -0.001362  0.042655
57  expérience -0.001631  0.022928
24      auteur -0.001642  0.039254
91    ailleurs -0.002154  0.019824
52   éducation -0.002207  0.006406
34      époque -0.002505  0.004581
42        rôle -0.002513  0.042427
65     langage -0.002643  0.011432
75    résultat -0.002722  0.008607
38       usage -0.002742  0.008103



pays
expérience
auteur
ailleurs
éducation
époque
rôle
langage
résultat
usage

philosophie

           token     slope         p
26   liquidation  0.010727  0.001520
75       méthode -0.001888  0.036232
39       opinion -0.002189  0.003355
64       naturel -0.002267  0.039408
1      nécessité -0.002334  0.016193
105      société -0.002446  0.004877
123      système -0.002475  0.036814
22        absolu -0.002508  0.018984
58       artiste -0.002551  0.044241
85       justice -0.002602  0.022261



liquidation
méthode
opinion
naturel
nécessité
société
système
absolu
artiste
justice

lettre

            token     slope         p
227       pénates  0.010997  0.014294
211         henri -0.003865  0.020310
155       réponse -0.003868  0.047355
29            duc -0.004054  0.014019
86       dépêcher -0.004230  0.012138
173   ambassadeur -0.004524  0.026064
160     damoiseau -0.004694  0.007392
171     entretien -0.004750  0.040872
182  portefeuille -0.004987  0.028571
198   explication -0.005044  0.010016



pénates
henri
réponse
duc
dépêcher
ambassadeur
damoiseau
entretien
portefeuille
explication

critique

             token     slope         p
160       ébruiter  0.013833  0.006994
18         émérite  0.013364  0.007787
6    théoriquement  0.013105  0.013248
107         adepte  0.001428  0.002170
35     méconnaître -0.000892  0.038641
60           moral -0.001231  0.014476
97   impossibilité -0.001629  0.014588
117       sécurité -0.001638  0.047308
45            fait -0.001909  0.015290
9          penseur -0.002023  0.042820



ébruiter
émérite
théoriquement
adepte
méconnaître
moral
impossibilité
sécurité
fait
penseur

roman

           token     slope         p
11          legs  0.009040  0.007757
8   matérialisme  0.004433  0.017241
33  superstition -0.001313  0.034652
59       ivresse -0.001412  0.042952
39         drame -0.001577  0.013694
65     ignorance -0.001581  0.042358
75         style -0.001635  0.007702
35       mystère -0.001818  0.033856
19     isolement -0.001871  0.010875
30       légende -0.001903  0.023818



legs
matérialisme
superstition
ivresse
drame
ignorance
style
mystère
isolement
légende

théâtre

           token     slope         p
164    charivari  0.015093  0.048832
67        rabais  0.013027  0.031442
5        ontario  0.010890  0.041629
18   historiette  0.010846  0.030062
68         trône -0.002647  0.041369
163       niveau -0.002704  0.015317
84         siège -0.002929  0.039352
176    monastère -0.003078  0.017424
144      secours -0.003158  0.038287
93      chapitre -0.003207  0.041014



charivari
rabais
ontario
historiette
trône
niveau
siège
monastère
secours
chapitre

drame

       token     slope         p
37   envolée  0.015334  0.010577
56  hobereau  0.015157  0.032581
13   mayence  0.015043  0.016630
21      caïd  0.013901  0.015423
41   percher  0.013572  0.040128
33   stipulé  0.011056  0.048340
30  chapitre -0.000901  0.030932
8      roman -0.001577  0.013694
45   poisson -0.001788  0.043484
4     dessin -0.001954  0.034184



envolée
hobereau
mayence
caïd
percher
stipulé
chapitre
roman
poisson
dessin

esprit

           token     slope         p
38      habitude -0.001963  0.048457
35        défaut -0.002021  0.034724
5   intelligence -0.002076  0.037064
29          idée -0.002177  0.010229
8        réalité -0.002329  0.032418
48     influence -0.002350  0.002875
3       tendance -0.002388  0.036438
47       méthode -0.002456  0.022145
16        talent -0.002510  0.034719
49        morale -0.002615  0.028271



habitude
défaut
intelligence
idée
réalité
influence
tendance
méthode
talent
morale

langue

          token     slope         p
58       ligure  0.017885  0.045427
60     offusqué  0.013719  0.041684
51    évolution  0.003874  0.048935
18       devise  0.003626  0.007943
46        thèse  0.001993  0.046253
12      procédé -0.001295  0.040765
32  disposition -0.001998  0.018117
0      croyance -0.002137  0.034141
48   historique -0.002321  0.013468
30      mission -0.002328  0.033181



ligure
offusqué
évolution
devise
thèse
procédé
disposition
croyance
historique
mission

diplomatie

           token     slope         p
6    compétition  0.015603  0.028272
68   dégrèvement  0.014307  0.046187
8      stratégie  0.006472  0.032153
90      donation  0.004327  0.027324
56  félicitation  0.004280  0.005215
23    confession -0.001293  0.003261
76    accusation -0.001351  0.018766
73    auxiliaire -0.001495  0.021666
97     tentative -0.001501  0.049495
71    conversion -0.001529  0.040239



compétition
dégrèvement
stratégie
donation
félicitation
confession
accusation
auxiliaire
tentative
conversion

politique

         token     slope         p
13   ingérence  0.010740  0.043739
12   électoral  0.005655  0.023942
71    principe -0.001436  0.026769
5        force -0.001503  0.041200
3    nécessité -0.001731  0.045832
14    critique -0.002062  0.036110
94    religion -0.002148  0.015891
33         art -0.002300  0.000617
23      absolu -0.002527  0.030603
100     peuple -0.002530  0.032991



ingérence
électoral
principe
force
nécessité
critique
religion
art
absolu
peuple

morale

            token     slope         p
68   irréductible  0.013985  0.038140
28            art -0.001824  0.028311
67        intérêt -0.001889  0.035990
38      caractère -0.002122  0.014709
118        social -0.002285  0.013524
113      pratique -0.002296  0.024600
15     démocratie -0.002310  0.037612
72        science -0.002404  0.005120
92       garantie -0.002444  0.002345
7    intelligence -0.002473  0.019460



irréductible
art
intérêt
caractère
social
pratique
démocratie
science
garantie
intelligence

société

          token     slope         p
9     ingérence  0.012192  0.022791
61    influence -0.001841  0.022127
21      procédé -0.001878  0.023496
28   importance -0.001897  0.046111
54      science -0.001911  0.003703
40    monarchie -0.001963  0.020835
64       nation -0.001965  0.045277
36        école -0.001971  0.005828
24      opinion -0.002028  0.030748
17  institution -0.002050  0.016399



ingérence
influence
procédé
importance
science
monarchie
nation
école
opinion
institution

pouvoir

              token     slope         p
44      humainement  0.009377  0.010440
8            berner  0.008327  0.037252
1         médisance  0.007886  0.031092
55  péremptoirement  0.007415  0.029162
15         illusion -0.001537  0.029111
67        politique -0.001574  0.004878
37            moyen -0.002209  0.035905
54          justice -0.002359  0.043892
20              but -0.002465  0.049889
63        situation -0.002489  0.001543



humainement
berner
médisance
péremptoirement
illusion
politique
moyen
justice
but
situation

théologie

          token     slope         p
10      breslau  0.016342  0.036485
51  positiviste  0.016208  0.036273
16      mayence  0.015294  0.024593
33       genèse  0.014879  0.014241
0       enclave  0.013961  0.036653
42    arbitrage  0.012126  0.016487
3     stratégie  0.008039  0.001387
20    orthodoxe  0.003543  0.024684
50  suppression  0.002274  0.019547
11     criminel -0.001746  0.004708



breslau
positiviste
mayence
genèse
enclave
arbitrage
stratégie
orthodoxe
suppression
criminel

droit

              token     slope         p
13        ingérence  0.010870  0.042973
68           raison -0.002296  0.011160
85         principe -0.002452  0.029501
106         liberté -0.002456  0.042787
87            parti -0.002463  0.049357
75        influence -0.002511  0.010790
54           besoin -0.002609  0.002606
33              but -0.002980  0.014875
103  responsabilité -0.003007  0.010548
39      conséquence -0.003025  0.022695



ingérence
raison
principe
liberté
parti
influence
besoin
but
responsabilité
conséquence

loi

             token     slope         p
14       ingérence  0.011580  0.037077
8    théoriquement  0.011070  0.043036
110         nation -0.001591  0.031155
142          règle -0.002100  0.049025
39       religieux -0.002454  0.032087
113        essence -0.002571  0.022237
70       monarchie -0.002584  0.000842
45         procédé -0.002679  0.036882
141           état -0.002693  0.024566
7         tendance -0.002805  0.010189



ingérence
théoriquement
nation
règle
religieux
essence
monarchie
procédé
état
tendance

méthode

            token     slope         p
24        quinine  0.015808  0.037095
114  entraînement -0.000991  0.009192
70        préjugé -0.001271  0.023138
121    profession -0.001377  0.024176
6        tendance -0.001392  0.026156
65   intervention -0.001435  0.049911
5        croyance -0.001439  0.047691
47        qualité -0.001494  0.012870
68       grandeur -0.001512  0.026904
37      ressource -0.001578  0.001074



quinine
entraînement
préjugé
profession
tendance
intervention
croyance
qualité
grandeur
ressource

génie

           token     slope         p
12     ingérence  0.014089  0.007894
28      tactique  0.007133  0.027211
86      fraction  0.002154  0.008874
42         moral -0.001368  0.046415
83        modéré -0.001582  0.036823
65       élément -0.001732  0.002654
39      doctrine -0.001842  0.041764
22     extension -0.002049  0.005894
48      noblesse -0.002070  0.012684
17  intellectuel -0.002073  0.002055



ingérence
tactique
fraction
moral
modéré
élément
doctrine
extension
noblesse
intellectuel

romantisme

          token     slope         p
17        vigny  0.018439  0.025716
23       borgia  0.017853  0.024510
3      invalide  0.017548  0.025692
6     électorat  0.017533  0.020528
30          for  0.016387  0.026694
20     objectif  0.016373  0.031353
9    moralement  0.015458  0.021212
11     conjugal  0.015429  0.026207
0   physiologie  0.015150  0.025562
7   notoirement  0.015140  0.046848



vigny
borgia
invalide
électorat
for
objectif
moralement
conjugal
physiologie
notoirement

réalisme

             token     slope         p
6          assumer  0.019816  0.027574
8        condillac  0.019199  0.040982
10      stephenson  0.016852  0.048451
12         picorer  0.016681  0.033938
3         giboulée  0.016597  0.042483
5             pâté  0.016554  0.048890
15     contrariant  0.016062  0.026361
2               pé  0.015816  0.018827
13      pourriture  0.015720  0.049079
9   outrageusement  0.015652  0.006961



assumer
condillac
stephenson
picorer
giboulée
pâté
contrariant
pé
pourriture
outrageusement

symbolisme

            token     slope         p
8       boulanger  0.014525  0.025732
2         fédéral  0.013281  0.032013
10  inappréciable  0.012813  0.015061
3         surtaxe  0.012059  0.034377
1          primat  0.011806  0.032218
5          octroi  0.011617  0.024266
6      marécageux  0.010829  0.047873
4     omnipotence  0.010525  0.029241
0        patrocle  0.010407  0.026735
9            hung  0.009692  0.033139



boulanger
fédéral
inappréciable
surtaxe
primat
octroi
marécageux
omnipotence
patrocle
hung

naturalisme

           token     slope         p
21        conrad  0.017726  0.049440
12          watt  0.017302  0.037589
8          isaïe  0.017289  0.038633
1         crèche  0.017085  0.020745
16         vigny  0.017012  0.018021
6        laybach  0.016977  0.010704
5   prolégomènes  0.016840  0.028643
4     cosmogonie  0.016348  0.038720
14         ferré  0.016041  0.019310
3         mylord  0.015907  0.045543



conrad
watt
isaïe
crèche
vigny
laybach
prolégomènes
cosmogonie
ferré
mylord

Decreasing

This is the same process: we want to see which terms tend to dissociate themselves from "littérature" (the default, which you can change with the trained models). Then again, you have to check the p-values: "transplantation", "choeur" and "philé" are not considered significant; "chaldéen" is, and so are "destination", "morceau", etc. The fact that some of these are less significant is logical: the less frequent the terms, the more erratic their series tend to be.


In [50]:
for word in testList:
    display(Markdown("### <i><b>" + word + "</b></i>"))
    df2 = pd.DataFrame(entries[word], columns=('token', 'slope', 'p'))
    bottom = df2.sort_values('slope', ascending=True).head(10)
    print(bottom)
    print('\n\n')

    for i, row in bottom.iterrows():
        plot_cosine_series(word, row['token'], 8, 4)


littérature

        token     slope         p
83  géographe -0.015757  0.027133
43  mahométan -0.015267  0.037757
46   dialecte -0.013514  0.037056
23     cortès -0.012651  0.006905
58      étage -0.009194  0.014811
10    détroit -0.008516  0.001436
62     troupe -0.008314  0.017857
70      armée -0.007865  0.003266
80      ligne -0.007592  0.029598
98   barbarie -0.007355  0.003667



géographe
mahométan
dialecte
cortès
étage
détroit
troupe
armée
ligne
barbarie

poésie

         token     slope         p
10       mably -0.016867  0.025046
38   mahométan -0.015830  0.038863
42    dialecte -0.013144  0.028955
1          eau -0.011813  0.048568
16    français -0.011090  0.021440
97       corps -0.010094  0.042067
36     presque -0.009632  0.017308
43        jury -0.009430  0.014418
92  angleterre -0.009342  0.012717
57  navigateur -0.009145  0.034269



mably
mahométan
dialecte
eau
français
corps
presque
jury
angleterre
navigateur

science

           token     slope         p
16     mahométan -0.018531  0.006524
36     géographe -0.017698  0.025349
1       idolâtre -0.015748  0.012714
19      dialecte -0.012565  0.023560
27    navigateur -0.010153  0.020703
42    évaluation -0.009018  0.043152
24      moralité -0.007411  0.010340
6   régénération -0.007083  0.013936
39         telle -0.006333  0.012455
7       français -0.005956  0.000211



mahométan
géographe
idolâtre
dialecte
navigateur
évaluation
moralité
régénération
telle
français

savoir

          token     slope         p
38  généraliser -0.011913  0.008846
39  extravagant -0.010626  0.037561
7      alléguer -0.009960  0.017112
30     remédier -0.008072  0.037651
10      lecteur -0.007703  0.028545
11    restituer -0.007683  0.045184
15     regarder -0.007588  0.030870
37    attention -0.007535  0.017882
0    inquiétude -0.006533  0.036228
45        amour -0.006297  0.017418



généraliser
extravagant
alléguer
remédier
lecteur
restituer
regarder
attention
inquiétude
amour

histoire

           token     slope         p
9          toise -0.013701  0.031257
41          jury -0.013347  0.002065
40      dialecte -0.012157  0.006093
71        sicile -0.010535  0.041529
86      spontané -0.010534  0.014053
36       gaulois -0.010194  0.014669
64    navigation -0.010044  0.025507
0   bouffonnerie -0.009923  0.024940
77      barbarie -0.009282  0.031884
14  régénération -0.008863  0.018909



toise
jury
dialecte
sicile
spontané
gaulois
navigation
bouffonnerie
barbarie
régénération

philosophie

            token     slope         p
89     évaluation -0.012249  0.042411
38        lumière -0.011484  0.010008
84        branche -0.010371  0.042822
28       archipel -0.010356  0.026675
108       couleur -0.009989  0.007510
102       irlande -0.009210  0.029722
17      discrédit -0.009105  0.030422
86   débarquement -0.008625  0.031224
34        costume -0.008605  0.007252
10        tableau -0.008357  0.007530



évaluation
lumière
branche
archipel
couleur
irlande
discrédit
débarquement
costume
tableau

lettre

       token     slope         p
225  profond -0.016110  0.039243
6     cacher -0.014542  0.007088
74      oeil -0.014402  0.002285
28    battre -0.014329  0.033207
73       âme -0.014258  0.044471
12      face -0.014156  0.007536
104    bruit -0.013794  0.000804
23   tremper -0.013704  0.032849
63      sang -0.013685  0.000440
24    tendre -0.013287  0.030964



profond
cacher
oeil
battre
âme
face
bruit
tremper
sang
tendre

critique

        token     slope         p
17     radeau -0.010104  0.042286
128   irlande -0.009081  0.017950
76      pérou -0.008826  0.013934
55   pamphlet -0.008527  0.033846
120    purger -0.008482  0.049345
135       vif -0.008074  0.036254
74    rejeter -0.007979  0.017083
136     chair -0.007800  0.000651
64     captif -0.007586  0.005665
32    ébauche -0.007546  0.041981



radeau
irlande
pérou
pamphlet
purger
vif
rejeter
chair
captif
ébauche

roman

            token     slope         p
55         caboul -0.018629  0.044791
43    sympathiser -0.017304  0.006314
45          pérou -0.010837  0.001280
21  incontestable -0.008452  0.004896
68       imminent -0.008144  0.021157
77         persan -0.008018  0.007179
76     déposition -0.007161  0.037771
20        prodige -0.007042  0.046436
82          juste -0.006605  0.023989
67    naturaliste -0.006409  0.028567



caboul
sympathiser
pérou
incontestable
imminent
persan