“You shall know a word by the company it keeps”
- J. R. Firth, 1957
... government debt problems are turning into banking crises...
... European governments need unified banking regulation to replace the hodgepodge of debt regulations...
The words government, regulation, and debt probably represent some aspect of banking, since they frequently appear near one another.
The words Pokemon and tubular probably don't represent any aspect of banking, since they rarely appear near such words.
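We can make "the company it keeps" concrete by counting co-occurrences within a small window. A minimal sketch (the window size of 2 and whitespace tokenization are illustrative assumptions, not part of the original analysis):

from collections import Counter

sentences = [
    "government debt problems are turning into banking crises".split(),
    "european governments need unified banking regulation to replace "
    "the hodgepodge of debt regulations".split(),
]

window = 2  # assumed: count words within 2 positions of each other
cooccurrence = Counter()
for tokens in sentences:
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooccurrence[(word, tokens[j])] += 1

# Words that keep similar company (banking, debt, regulation)
# accumulate overlapping co-occurrence counts.
print(cooccurrence.most_common(5))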
The man and woman meet each other ...
They become king and queen ...
They get old and stop talking to each other. Instead, they read books and magazines ...
In [7]:
corpus = """The man and woman meet each other ...
They become king and queen ...
They get old and stop talking to each other. Instead, they read books and magazines ...
"""
In [8]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [9]:
# Let's hand-assign the words to vectors
important_words = ['queen', 'book', 'king', 'magazine', 'woman', 'man']
vectors = np.array([[0.1, 0.3],    # queen
                    [-0.5, -0.1],  # book
                    [0.2, 0.2],    # king
                    [-0.3, -0.2],  # magazine
                    [-0.5, 0.4],   # woman
                    [-0.45, 0.3]]) # man
In [10]:
plt.plot(vectors[:,0], vectors[:,1], 'o')
plt.xlim(-0.6, 0.3)
plt.ylim(-0.3, 0.5)
# Label each point with its word
for word, x, y in zip(important_words, vectors[:,0], vectors[:,1]):
    plt.annotate(word, (x, y), size=12)
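Once words are vectors, "similar meaning" can be measured as a small angle between vectors. A quick check with cosine similarity (the helper below is an illustration, not part of the original notebook):

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

queen, book, king, magazine, woman, man = vectors
print(cosine_similarity(king, queen))     # high: related words point the same way
print(cosine_similarity(book, magazine))  # high
print(cosine_similarity(king, magazine))  # low: unrelated words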
In [11]:
# Encode each word using one-hot encoding
# (six words need six dimensions, one per word)
{'queen':    [0, 0, 0, 0, 0, 1],
 'book':     [0, 0, 0, 0, 1, 0],
 'king':     [0, 0, 0, 1, 0, 0],
 'magazine': [0, 0, 1, 0, 0, 0],
 'woman':    [0, 1, 0, 0, 0, 0],
 'man':      [1, 0, 0, 0, 0, 0],
}
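It's worth noting what one-hot encoding discards: every pair of distinct words is orthogonal, so no notion of similarity survives. A small check (sketch):

one_hot = np.eye(len(important_words))  # one row per word
# The dot product between any two distinct one-hot vectors is 0,
# so 'king' is exactly as (dis)similar to 'queen' as to 'magazine'.
print(one_hot @ one_hot.T)  # identity matrix: no off-diagonal similarity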
2) “Skip-gram”: each current word is used as the input to a log-linear classifier to predict words within a certain range before and after that current word.
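Concretely, skip-gram turns running text into (input word, context word) training pairs, as in this sketch (the window size of 2 is an assumed hyperparameter):

def skipgram_pairs(tokens, window=2):
    # Each word predicts every word within `window` positions of it.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("insurgents killed in ongoing fighting".split()))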
“Insurgents killed in ongoing fighting.”
Bi-grams = {insurgents killed, killed in, in ongoing, ongoing fighting}.
2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}.
Tri-grams = {insurgents killed in, killed in ongoing, in ongoing fighting}.
2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.
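These sets can be generated mechanically: a k-skip-n-gram is an in-order selection of n words with at most k words skipped between neighbours. A sketch of that definition (illustrative, not from the original):

from itertools import combinations

def k_skip_n_grams(tokens, k, n):
    # Choose n positions in order; allow gaps of at most k between neighbours.
    grams = []
    for idx in combinations(range(len(tokens)), n):
        if all(b - a - 1 <= k for a, b in zip(idx, idx[1:])):
            grams.append(tuple(tokens[i] for i in idx))
    return grams

tokens = "insurgents killed in ongoing fighting".split()
print(k_skip_n_grams(tokens, k=2, n=2))  # reproduces the 2-skip-bi-grams above
print(k_skip_n_grams(tokens, k=2, n=3))  # reproduces the 2-skip-tri-grams above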
CBOW is several times faster to train than skip-gram and has slightly better accuracy for frequent words.
Skip-gram works well with small amounts of training data and represents rare words well.
Skip-gram tends to be the more common architecture.
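Both architectures are available off the shelf, e.g. in gensim. A hedged sketch (assuming gensim 4.x, where the dimensionality parameter is called vector_size; the sg flag switches between CBOW and skip-gram):

from gensim.models import Word2Vec

# Tokenize the toy corpus from above, one sentence per line.
sentences = [line.split() for line in corpus.lower().splitlines() if line.strip()]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv.most_similar('king'))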