In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
from gensim.models.word2vec import Word2Vec
In this section, you choose which models, and therefore which corpus, to use for your analysis. The default is the critical corpus for the years 1820-1940, in 20-year slices. We also tested 10-year slices, but they are clearly much less effective because the French corpora are small. In the load call below, put the path to your models directory, and then select your range in "year in range".
In [3]:
from collections import OrderedDict

# Load one model per time slice, keyed by the year in its file name.
models = OrderedDict([
    (year, Word2Vec.load('/Your/Path/To/Your/Models/{}.bin'.format(year)))
    for year in range(1820, 1940, 20)
])
In [4]:
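def cosine_series(anchor, query):
    """Cosine similarity between anchor and query in each model, by year."""
    # Note: recent gensim versions expose these calls on model.wv
    # (model.wv.similarity, query in model.wv) rather than on the model.
    series = OrderedDict()
    for year, model in models.items():
        # A query word missing from a model's vocabulary is scored 0.
        series[year] = (
            model.similarity(anchor, query)
            if query in model else 0
        )
    return series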
In [5]:
import numpy as np
import statsmodels.api as sm
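def lin_reg(series):
    """Fit an ordinary least squares line of similarity against year."""
    x = np.array(list(series.keys()))
    y = np.array(list(series.values()))
    # add_constant adds the intercept column, so the slope is params[1].
    x = sm.add_constant(x)
    return sm.OLS(y, x).fit()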
In [6]:
def plot_cosine_series(anchor, query, w=5, h=4):
    """Plot the similarity series for a word pair with its fitted trend line."""
    series = cosine_series(anchor, query)
    fit = lin_reg(series)

    # Endpoints of the regression line: first and last year, fitted values.
    x1 = list(series.keys())[0]
    x2 = list(series.keys())[-1]
    y1 = fit.predict()[0]
    y2 = fit.predict()[-1]

    print(query)
    plt.figure(figsize=(w, h))
    plt.ylim(0, 1)
    plt.title(query)
    plt.xlabel('Year')
    plt.ylabel('Similarity')
    plt.plot(list(series.keys()), list(series.values()))
    plt.plot([x1, x2], [y1, y2], color='gray', linewidth=0.5)
    plt.show()
The cell below, which calls the functions defined above, shows how the cosine similarity between two words changes through time. Here, we show how "poésie" evolves relative to "littérature". You can change both words manually in the call below.
In [7]:
plot_cosine_series('littérature', 'poésie')
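As a reminder of what is being plotted, here is a minimal sketch of cosine similarity on plain Python lists. The function name `cosine_similarity` and the toy vectors are illustrative assumptions; in the notebook itself the value comes from the gensim model, not from this helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (invented for illustration, not taken from any model).
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])          # unrelated directions
```

Vectors pointing in the same direction score close to 1 and unrelated ones close to 0, which is why the plots above fix the y-axis to the range 0 to 1.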
The next two cells collect the 200 terms most similar to a given term (here "littérature") in each of the trained models, and merge them into a single vocabulary.
In [8]:
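def union_neighbor_vocab(anchor, topn=200):
    """Union of the topn nearest neighbours of anchor across all models."""
    vocab = set()
    for year, model in models.items():
        similar = model.most_similar(anchor, topn=topn)
        # most_similar returns (word, score) pairs; keep only the words.
        vocab.update([s[0] for s in similar])
    return vocab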
In [9]:
union_vocab = union_neighbor_vocab('littérature')
At this point, we do the same thing as above for each token among the 200 nearest terms to the main entry: we compute its similarity to the main entry through time, fit a linear regression, and record the slope and its significance. The significance is measured by the p-value of the slope: below the conventional threshold of 0.05, the trend is unlikely to be due to chance alone, and we treat it as statistically significant.
In [10]:
data = []
for token in union_vocab:
    series = cosine_series('littérature', token)
    fit = lin_reg(series)
    # Index 1 is the year coefficient (index 0 is the intercept).
    data.append((token, fit.params[1], fit.pvalues[1]))
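To make the slope and its significance more concrete, the following sketch computes the OLS slope and its t statistic by hand for six time points, matching the six slices produced by `range(1820, 1940, 20)`. The similarity values are invented for illustration, and the helper `slope_t_stat` is hypothetical. With six points there are 4 degrees of freedom, and the standard two-sided 5% critical value of the t distribution is about 2.776, so a slope with |t| above that corresponds to p < 0.05.

```python
import math

def slope_t_stat(xs, ys):
    """Return the OLS slope of ys on xs and the slope's t statistic."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err

years = [1820, 1840, 1860, 1880, 1900, 1920]
T_CRIT = 2.776  # two-sided 5% critical value for 4 degrees of freedom

# Invented series: one steadily rising, one flat with noise.
rising_slope, rising_t = slope_t_stat(years, [0.10, 0.18, 0.22, 0.35, 0.41, 0.48])
flat_slope, flat_t = slope_t_stat(years, [0.30, 0.25, 0.35, 0.28, 0.33, 0.29])

rising_significant = abs(rising_t) > T_CRIT
flat_significant = abs(flat_t) > T_CRIT
```

In this toy case the rising series passes the threshold and the noisy flat one does not, which is the distinction the p-value column is meant to capture.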
In [11]:
import pandas as pd
df1 = pd.DataFrame(data, columns=('token', 'slope', 'p'))
In this part, we want to show which terms become more and more associated with the main entry, by default "littérature". The "slope" is the rate at which the similarity changes per year, and the "p" value indicates how statistically significant that trend is. Here, the strongest significant emergence alongside "littérature" is "humanisme". All terms seem to be significant, except "fédéralisme", "welschinger", "maniere", "hennet", "réapparition", "deffence", "bourgin", "colonie", "naturalisme", "réalisme", "sillery", "gréco", "compétence", "symbolisme", "catholique", "japonais", "manuel", "romand", "topographie", "organisme", "prédominance". These terms may be among the nearest neighbours, but their trends are not statistically significant enough to be relied on, while the others are more certain.
In [12]:
pd.set_option('display.max_rows', 1000)
df1.sort_values('slope', ascending=False).head(50)
Out[12]:
In the following cell, we plot how the twenty terms with the steepest positive slopes change through time relative to "littérature". For example, "humanisme" seems to be very rare before 1860, and then becomes more and more similar to "littérature". These are the terms that were not similar at the beginning, but tend to become more and more related to "littérature". Keep in mind the p-value associated with each term.
In [13]:
for i, row in df1.sort_values('slope', ascending=False).head(20).iterrows():
    plot_cosine_series('littérature', row['token'], 3, 2)
The process is the same here: we want to see which terms tend to dissociate themselves from "littérature" (the default, which you can change to any term in the trained models). Again, you have to check the p-values: "transplantation", "choeur" and "philé" are not considered significant, whereas "chaldéen", "destination", "morceau", etc. are. It is logical that some of these are less significant: the rarer a term, the more erratic its series tends to be.
In [14]:
df1.sort_values('slope', ascending=True).head(50)
Out[14]:
In [15]:
for i, row in df1.sort_values('slope', ascending=True).head(20).iterrows():
    plot_cosine_series('littérature', row['token'], 3, 2)
In [16]:
def intersect_neighbor_vocab(anchor, topn=2000):
    """Tokens among the topn nearest neighbours of anchor in every model."""
    vocabs = []
    for year, model in models.items():
        similar = model.most_similar(anchor, topn=topn)
        vocabs.append(set([s[0] for s in similar]))
    return set.intersection(*vocabs)
In [17]:
intersect_vocab = intersect_neighbor_vocab('littérature')
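The difference between the union used earlier and the intersection used here can be seen on a toy example. The neighbour lists below are invented for illustration and the variable names are hypothetical; real lists would come from `most_similar` on each model.

```python
# Invented nearest-neighbour lists for three time slices.
neighbors_by_year = {
    1820: ["poésie", "histoire", "philosophie"],
    1840: ["poésie", "roman", "histoire"],
    1860: ["poésie", "roman", "critique"],
}

# Union: every token that is a neighbour in at least one slice.
toy_union = set().union(*neighbors_by_year.values())

# Intersection: only tokens that are neighbours in every slice.
toy_intersection = set.intersection(
    *(set(words) for words in neighbors_by_year.values())
)
```

The union keeps terms that appear or disappear over the period, which suits the search for emerging and fading associations, while the intersection keeps only terms that remain close throughout.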
In [18]:
data = []
for token in intersect_vocab:
    series = cosine_series('littérature', token)
    fit = lin_reg(series)
    # Keep only tokens whose slope is significant at the 0.05 level.
    if fit.pvalues[1] < 0.05:
        data.append((token, fit.params[1], fit.pvalues[1]))
In [19]:
import pandas as pd
df2 = pd.DataFrame(data, columns=('token', 'slope', 'p'))
In this part, we show which significant terms remain, throughout time, among the nearest neighbours of the main entry. The cells above keep only the tokens that appear among the 2,000 nearest neighbours of "littérature" in every model, so these vectors stay close to the "littérature" vector over the whole period. At this stage, we only keep significant terms (filter above: "if fit.pvalues[1] < 0.05").
In [20]:
df2.sort_values('slope', ascending=False)
Out[20]:
In [21]:
for i, row in df2.sort_values('slope', ascending=False).iterrows():
    plot_cosine_series('littérature', row['token'], 3, 2)