Introduction to Word Embeddings

Analyzing Meaning through Word Embeddings



Using vectors to represent things



The Geometry of Culture

Analyzing Meaning through Word Embeddings



Austin C. Kozlowski; Matt Taddy; James A. Evans

https://arxiv.org/abs/1803.09288

Word embeddings represent semantic relations between words as geometric relationships between vectors in a high-dimensional space, operationalizing a relational model of meaning consistent with contemporary theories of identity and culture.

  • Dimensions induced by word differences (e.g., man − woman, rich − poor, black − white, liberal − conservative) in these vector spaces closely correspond to dimensions of cultural meaning.
  • The paper demonstrates macro-cultural investigation with a longitudinal analysis of the coevolution of gender and class associations in the United States over the 20th century.

The success of these high-dimensional models motivates a move towards "high-dimensional theorizing" of meanings, identities and cultural processes.

HistWords

HistWords is a collection of tools and datasets for analyzing language change using word vector embeddings.

  • The goal of this project is to facilitate quantitative research in diachronic linguistics, history, and the digital humanities.
  • We used the historical word vectors in HistWords to study the semantic evolution of more than 30,000 words across 4 languages.

  • This study led us to propose two statistical laws that govern the evolution of word meaning

https://nlp.stanford.edu/projects/histwords/

https://github.com/williamleif/histwords

Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change

Word embeddings quantify 100 years of gender and ethnic stereotypes

http://www.pnas.org/content/early/2018/03/30/1720347115

The Illustrated Word2vec

Jay Alammar. https://jalammar.github.io/illustrated-word2vec/

Personality Embeddings

What are you like?

Big Five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism

  • the five-factor model (FFM)
  • the OCEAN model
  • Openness: imaginative, aesthetically sensitive, emotionally rich, novelty-seeking, creative, and intellectual.
  • Conscientiousness: competent, fair, orderly, dutiful, achievement-oriented, self-disciplined, cautious, and restrained.
  • Extraversion: warm, sociable, assertive, active, adventurous, and optimistic.
  • Agreeableness: trusting, altruistic, straightforward, compliant, modest, and empathetic.
  • Neuroticism (or emotional stability): the capacity to balance emotions such as anxiety, hostility, depression, self-consciousness, impulsiveness, and vulnerability; that is, the ability to remain emotionally stable.

In [15]:
# Personality Embeddings: What are you like?
# Toy 5-dimensional "personality vectors": one made-up score per Big Five trait
# (openness, conscientiousness, extraversion, agreeableness, neuroticism).
jay = [-0.4, 0.8, 0.5, -0.2, 0.3]
john = [-0.3, 0.2, 0.3, -0.4, 0.9]
mike = [-0.5, -0.4, -0.2, 0.7, -0.1]

Cosine Similarity

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

$$ \mathbf{A}\cdot\mathbf{B} =\left\|\mathbf{A}\right\|\left\|\mathbf{B}\right\|\cos\theta $$
$$ \text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }, $$

where $A_i$ and $B_i$ are components of vector $A$ and $B$ respectively.


In [10]:
from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
    # cosine similarity: dot product divided by the product of the norms
    return dot(a, b)/(norm(a)*norm(b))

cos_sim([1, 0, -1], [-1,-1, 0])


Out[10]:
-0.4999999999999999

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])


Out[8]:
array([[-0.5]])
$$\text{Cosine distance} = 1 - \text{Cosine similarity}$$

In [13]:
from scipy import spatial
# spatial.distance.cosine computes the cosine distance between 1-D arrays,
# so 1 - distance recovers the cosine similarity.
1 - spatial.distance.cosine([1, 0, -1], [-1,-1, 0])


Out[13]:
-0.5

In [11]:
cos_sim(jay, john)


Out[11]:
0.6582337075311759

In [12]:
cos_sim(jay, mike)


Out[12]:
-0.3683509554826695

Cosine similarity works for any number of dimensions.

  • We can represent people (and things) as vectors of numbers (which is great for machines!).
  • We can easily calculate how similar vectors are to each other.

Word Embeddings

Google News Word2Vec

You can download Google’s pre-trained model from the link below.

  • It’s 1.5GB!
  • It includes word vectors for a vocabulary of 3 million words and phrases
  • It is trained on roughly 100 billion words from a Google News dataset.
  • The vector length is 300 features.

http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

Using the Gensim library in Python, we can

  • add and subtract word vectors,
  • find the most similar words to the resulting vector.

In [16]:
import gensim
# Load Google's pre-trained Word2Vec model.
filepath = '/Users/datalab/bigdata/GoogleNews-vectors-negative300.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=True)

In [24]:
model['woman'][:10]


Out[24]:
array([ 0.24316406, -0.07714844, -0.10302734, -0.10742188,  0.11816406,
       -0.10742188, -0.11425781,  0.02563477,  0.11181641,  0.04858398],
      dtype=float32)

In [20]:
model.most_similar('woman')


Out[20]:
[('man', 0.7664012312889099),
 ('girl', 0.7494640946388245),
 ('teenage_girl', 0.7336829900741577),
 ('teenager', 0.631708562374115),
 ('lady', 0.6288785934448242),
 ('teenaged_girl', 0.6141784191131592),
 ('mother', 0.607630729675293),
 ('policewoman', 0.6069462299346924),
 ('boy', 0.5975908041000366),
 ('Woman', 0.5770983099937439)]

In [18]:
model.similarity('woman', 'man')


Out[18]:
0.76640123

In [23]:
cos_sim(model['woman'], model['man'])


Out[23]:
0.76640123

In [17]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)


Out[17]:
[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431607246399),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133)]
$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
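To make this concrete, here is a minimal check (a sketch, assuming the `model` loaded above and the `cos_sim` helper defined earlier are still in memory) that the combined vector lands close to "queen" and far from an unrelated word:

print(cos_sim(model['king'] - model['man'] + model['woman'], model['queen']))
print(cos_sim(model['king'] - model['man'] + model['woman'], model['car']))  # an unrelated word, for contrast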

Now that we’ve looked at trained word embeddings,

  • let’s learn more about the training process.
  • But before we get to word2vec, we need to look at a conceptual parent of word embeddings: the neural language model.

The neural language model

“You shall know a word by the company it keeps” J.R. Firth

Bengio et al. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.

After being trained, early neural language models (Bengio et al. 2003) would calculate a prediction in three steps:

  1. Look up the embeddings of the input words.
  2. Calculate a prediction by passing those embeddings through the network's hidden layer.
  3. Project the result onto the output vocabulary to get a probability for every word the model knows.

The output of the neural language model is a probability score for all the words the model knows.

  • We’re referring to the probability as a percentage here, but 40% would actually be represented as 0.4 in the output vector (see the softmax sketch below).
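As a minimal sketch of that output step (toy numbers, not the model's actual scores), a softmax turns raw scores into a probability vector that sums to 1:

import numpy as np

scores = np.array([2.0, 1.0, 0.1, -1.0])       # made-up output scores for a 4-word vocabulary
probs = np.exp(scores) / np.exp(scores).sum()  # softmax
print(probs, probs.sum())                      # probabilities like "40%" appear as 0.4, and they sum to 1.0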

Language Model Training

  • We get a lot of text data (say, all Wikipedia articles, for example).
  • We have a window (say, of three words) that we slide against all of that text.
  • The sliding window generates training samples for our model.

As this window slides against the text, we (virtually) generate a dataset that we use to train a model.
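Here is a minimal sketch of that virtual dataset on a toy sentence: the first two words in each window are the features and the third is the label.

sentence = "thou shalt not make a machine in the likeness of a human mind".split()

samples = [((sentence[i], sentence[i + 1]), sentence[i + 2])
           for i in range(len(sentence) - 2)]
print(samples[:3])
# [(('thou', 'shalt'), 'not'), (('shalt', 'not'), 'make'), (('not', 'make'), 'a')]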

Instead of only looking two words before the target word, we can also look at two words after it.

If we do this, the dataset we’re virtually building and training the model against would pair the two words before and the two words after each target word (as inputs) with the target word itself (as the label).

This architecture is called Continuous Bag of Words (CBOW): https://arxiv.org/pdf/1301.3781.pdf. A small sketch of such samples follows.
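A minimal sketch of CBOW-style samples on the same toy sentence: the two words before and the two words after form the input, and the middle word is the label.

sentence = "thou shalt not make a machine in the likeness of a human mind".split()

cbow_samples = [((sentence[i - 2], sentence[i - 1], sentence[i + 1], sentence[i + 2]), sentence[i])
                for i in range(2, len(sentence) - 2)]
print(cbow_samples[0])
# (('thou', 'shalt', 'make', 'a'), 'not')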

Skip-gram

Instead of guessing a word based on its context (the words before and after it), this other architecture tries to guess neighboring words using the current word.

https://arxiv.org/pdf/1301.3781.pdf

Each position of the sliding window actually creates four separate samples in our training dataset (one per neighboring word).

  • We then slide our window to the next position,
  • which generates our next four examples (sketched below).
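Here is a minimal sketch of that pair generation on a toy sentence, with a window of two words on each side, so four (input word, neighboring word) samples per position:

sentence = "thou shalt not make a machine in the likeness of a human mind".split()

pairs = []
for i in range(2, len(sentence) - 2):
    for j in (i - 2, i - 1, i + 1, i + 2):
        pairs.append((sentence[i], sentence[j]))
print(pairs[:4])
# [('not', 'thou'), ('not', 'shalt'), ('not', 'make'), ('not', 'a')]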

Negative Sampling

Instead of predicting the output word directly (which requires an expensive softmax over the whole vocabulary), we switch to a model that takes the input and output word and outputs a score indicating whether they are neighbors or not:

  • 0 for “not neighbors”, 1 for “neighbors”.

Because every example generated this way has the label 1, a model that always answers “neighbors” would score perfectly without learning anything, so we need to introduce negative samples into our dataset:

  • samples of words that are not neighbors;
  • our model needs to return 0 for those samples.
  • This leads to a great tradeoff of computational and statistical efficiency (a minimal sketch of the labeled dataset follows).
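A minimal sketch of the resulting labeled dataset: true (input, neighbor) pairs keep label 1, and randomly drawn words get label 0 (the create_skipgram_dataset function later in this notebook builds the real version of this). A random draw can occasionally collide with a true neighbor; on a large corpus this is rare enough to ignore.

import random

random.seed(0)
sentence = "thou shalt not make a machine in the likeness of a human mind".split()
positive = [('not', 'thou', 1), ('not', 'shalt', 1), ('not', 'make', 1), ('not', 'a', 1)]
negative = [('not', random.choice(sentence), 0) for _ in range(4)]  # random words as "not neighbors"
print(positive + negative)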

Skipgram with Negative Sampling (SGNS)

Word2vec Training Process


In [118]:
# see http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


Out[118]:
<torch._C.Generator at 0x116c571d0>

In [202]:
text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

text = text.replace(',', '').replace('.', '').lower().split()

In [203]:
# By deriving a set from `text`, we deduplicate the tokens
vocab = set(text)
vocab_size = len(vocab)
print('vocab_size:', vocab_size)

w2i = {w: i for i, w in enumerate(vocab)}
i2w = {i: w for i, w in enumerate(vocab)}


vocab_size: 44

In [204]:
# context window size is two
def create_cbow_dataset(text):
    data = []
    for i in range(2, len(text) - 2):
        context = [text[i - 2], text[i - 1],
                   text[i + 1], text[i + 2]]
        target = text[i]
        data.append((context, target))
    return data

cbow_train = create_cbow_dataset(text)
print('cbow sample', cbow_train[0])


cbow sample (['we', 'are', 'to', 'study'], 'about')

In [205]:
def create_skipgram_dataset(text):
    import random
    data = []
    for i in range(2, len(text) - 2):
        data.append((text[i], text[i-2], 1))
        data.append((text[i], text[i-1], 1))
        data.append((text[i], text[i+1], 1))
        data.append((text[i], text[i+2], 1))
        # negative sampling
        for _ in range(4):
            if random.random() < 0.5 or i >= len(text) - 3:
                rand_id = random.randint(0, i-1)
            else:
                rand_id = random.randint(i+3, len(text)-1)
            data.append((text[i], text[rand_id], 0))
    return data


skipgram_train = create_skipgram_dataset(text)
print('skipgram sample', skipgram_train[0])


skipgram sample ('about', 'we', 1)

In [206]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embd_size, context_size, hidden_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, inputs):
        embedded = self.embeddings(inputs).view((1, -1))
        hid = F.relu(self.linear1(embedded))
        out = self.linear2(hid)
        log_probs = F.log_softmax(out, dim = 1)
        return log_probs
    
    def extract(self, inputs):
        embeds = self.embeddings(inputs)
        return embeds

In [207]:
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embd_size):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
    
    def forward(self, focus, context):
        embed_focus = self.embeddings(focus).view((1, -1)) # input
        embed_ctx = self.embeddings(context).view((1, -1)) # output
        score = torch.mm(embed_focus, torch.t(embed_ctx)) # input*output
        log_probs = F.logsigmoid(score) # log sigmoid of the dot-product score
        return log_probs
    
    def extract(self, focus):
        embed_focus = self.embeddings(focus)
        return embed_focus

torch.mm performs a matrix multiplication of two matrices.

torch.t expects the input to be a matrix (2-D tensor) and transposes dimensions 0 and 1; it can be seen as a shorthand for transpose(input, 0, 1).
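A quick sketch of the two helpers used in SkipGram.forward above, with toy 1×3 tensors standing in for embeddings:

import torch

a = torch.tensor([[1., 2., 3.]])   # shape (1, 3): one embedding as a row vector
b = torch.tensor([[4., 5., 6.]])   # shape (1, 3)
print(torch.t(b).shape)            # torch.Size([3, 1])
print(torch.mm(a, torch.t(b)))     # tensor([[32.]]): (1x3) @ (3x1) gives the dot product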


In [208]:
embd_size = 100
learning_rate = 0.001
n_epoch = 30
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right

In [209]:
def train_cbow():
    hidden_size = 64
    losses = []
    loss_fn = nn.NLLLoss()
    model = CBOW(vocab_size, embd_size, CONTEXT_SIZE, hidden_size)
    print(model)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    for epoch in range(n_epoch):
        total_loss = .0
        for context, target in cbow_train:
            ctx_idxs = [w2i[w] for w in context]
            ctx_var = Variable(torch.LongTensor(ctx_idxs))

            model.zero_grad()
            log_probs = model(ctx_var)

            loss = loss_fn(log_probs, Variable(torch.LongTensor([w2i[target]])))

            loss.backward()
            optimizer.step()

            total_loss += loss.data.item()
        losses.append(total_loss)
    return model, losses

In [210]:
def train_skipgram():
    losses = []
    loss_fn = nn.MSELoss()
    model = SkipGram(vocab_size, embd_size)
    print(model)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    
    for epoch in range(n_epoch):
        total_loss = .0
        for in_w, out_w, target in skipgram_train:
            in_w_var = Variable(torch.LongTensor([w2i[in_w]]))
            out_w_var = Variable(torch.LongTensor([w2i[out_w]]))
            
            model.zero_grad()
            log_probs = model(in_w_var, out_w_var)
            loss = loss_fn(log_probs[0], Variable(torch.Tensor([target])))
            
            loss.backward()
            optimizer.step()

            total_loss += loss.data.item()
        losses.append(total_loss)
    return model, losses

In [211]:
cbow_model, cbow_losses = train_cbow()
sg_model, sg_losses = train_skipgram()


CBOW(
  (embeddings): Embedding(44, 100)
  (linear1): Linear(in_features=400, out_features=64, bias=True)
  (linear2): Linear(in_features=64, out_features=44, bias=True)
)
SkipGram(
  (embeddings): Embedding(44, 100)
)

In [212]:
import matplotlib.pyplot as plt  # pyplot has not been imported earlier in this notebook

plt.figure(figsize= (10, 4))
plt.subplot(121)
plt.plot(range(n_epoch), cbow_losses, 'r-o', label = 'CBOW Losses')
plt.legend()
plt.subplot(122)
plt.plot(range(n_epoch), sg_losses, 'g-s', label = 'SkipGram Losses')
plt.legend()
plt.tight_layout()



In [213]:
cbow_vec = cbow_model.extract(Variable(torch.LongTensor([v for v in w2i.values()])))
cbow_vec = cbow_vec.data.numpy()
len(cbow_vec[0])


Out[213]:
100

In [214]:
sg_vec = sg_model.extract(Variable(torch.LongTensor([v for v in w2i.values()])))
sg_vec = sg_vec.data.numpy()
len(sg_vec[0])


Out[214]:
100

In [217]:
# Reduce dimensionality with PCA
from sklearn.decomposition import PCA
X_reduced = PCA(n_components=2).fit_transform(sg_vec)

# Plot the 2-D projection of all word vectors
import matplotlib.pyplot as plt
import matplotlib

fig = plt.figure(figsize = (20, 10))
ax = fig.gca()
ax.set_facecolor('black')
ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.4, color = 'white')
# Highlight and label every word in the vocabulary
words = list(w2i.keys())
for w in words:
    if w in w2i:
        ind = w2i[w]
        xy = X_reduced[ind]
        plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')
        plt.text(xy[0], xy[1], w, alpha = 1, color = 'white', fontsize = 20)


The N-Gram Word Vector Model

This section is adapted from the companion source code for Lesson VI of the “火炬上的深度学习” (Deep Learning with PyTorch) course produced by Swarma AI Academy (集智AI学园), http://campus.swarma.org.

Idea: use a neural network to predict the next word from the previous N words; the word vectors are obtained as a by-product of this training.

We use Liu Cixin's famous science-fiction novel The Three-Body Problem (《三体》) as the corpus to demonstrate training word vectors with an N-Gram model.

  • Preprocessing consists of three steps: 1) read the file, 2) tokenize, 3) split the corpus into (N+1)-tuples to prepare the training data.
  • We do not strip punctuation from the raw text beforehand, partly to keep the code simple and partly because the tokenizer treats punctuation marks as separate tokens, which we then filter out.

In [35]:
with open("../data/3body.txt", 'r') as f:
    text = str(f.read())

In [37]:
import jieba, re
temp = jieba.lcut(text)
words = []
for i in temp:
    # filter out all punctuation marks
    i = re.sub("[\s+\.\!\/_,$%^*(+\"\'””《》]+|[+——!,。?、~@#¥%……&*():]+", "", i)
    if len(i) > 0:
        words.append(i)
print(len(words))


7754

In [36]:
text[:100]


Out[36]:
'八万五千三体时(约8.6个地球年)后。\n\n元首下令召开三体世界全体执政官紧急会议,这很不寻常,一定有什么重大的事件发生。\n\n两万三体时前,三体舰队启航了,它们只知道目标的大致方向,却不知道它的距离。也'

In [44]:
print(*words[:50])


八万五千 三体 时 约 86 个 地球 年 后 元首 下令 召开 三体 世界 全体 执政官 紧急会议 这 很 不 寻常 一定 有 什么 重大 的 事件 发生 两万 三体 时前 三体 舰队 启航 了 它们 只 知道 目标 的 大致 方向 却 不 知道 它 的 距离 也许 目标

In [38]:
trigrams = [([words[i], words[i + 1]], words[i + 2]) for i in range(len(words) - 2)]
# print the first three elements to check
print(trigrams[:3])


[(['八万五千', '三体'], '时'), (['三体', '时'], '约'), (['时', '约'], '86')]

In [84]:
# Build the vocabulary
vocab = set(words)
print(len(vocab))
# word_to_idx maps each word to [index, count]; idx_to_word maps the index back to the word
word_to_idx = {i:[k, 0] for k, i in enumerate(vocab)}
idx_to_word = {k:i for k, i in enumerate(vocab)}
for w in words:
    word_to_idx[w][1] += 1


2000

Building the N-Gram neural network model (a three-layer network)

  1. Input layer: an embedding layer. It first maps each input word index to a one-hot vector (e.g. 001000) whose dimension equals the vocabulary size, then projects it through a linear layer to the word's vector representation of size embedding_dim.
  2. A linear layer from the concatenated context embeddings (context_size × embedding_dim) to 128 dimensions, followed by a nonlinear ReLU.
  3. A linear layer from 128 dimensions to the vocabulary size, followed by log softmax, which gives the predicted probability of each word.

In [87]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

import torch

class NGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # embedding layer
        self.linear1 = nn.Linear(context_size * embedding_dim, 128) # linear layer
        self.linear2 = nn.Linear(128, vocab_size) # linear layer

    def forward(self, inputs):
        # Embedding lookup: internally this maps each input word index to a one-hot
        # vector and passes it through a linear layer to obtain the word vector
        embeds = self.embeddings(inputs).view(1, -1)
        # linear layer followed by ReLU
        out = F.relu(self.linear1(embeds))

        # linear layer followed by log softmax
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim = 1)
        return log_probs
    def extract(self, inputs):
        embeds = self.embeddings(inputs)
        return embeds

In [89]:
losses = [] # record the total loss at each epoch
criterion = nn.NLLLoss() # negative log-likelihood loss, a common objective for multi-class classification
model = NGram(len(vocab), 10, 2) # NGram model: embedding dimension 10, context window size N = 2
optimizer = optim.SGD(model.parameters(), lr=0.001) # stochastic gradient descent optimizer
# loop over 100 epochs
for epoch in range(100):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:
        # prepare the input: map the context words to their indices
        context_idxs = [word_to_idx[w][0] for w in context]
        # wrap them in a PyTorch Variable
        context_var = Variable(torch.LongTensor(context_idxs))
        # clear the gradients: PyTorch accumulates gradients on backward(), so they must be zeroed at every step
        optimizer.zero_grad()
        # forward pass: the log-probability of every word in the vocabulary
        log_probs = model(context_var)
        # compute the loss; the target word must also be converted to its index and wrapped in a Variable
        loss = criterion(log_probs, Variable(torch.LongTensor([word_to_idx[target][0]])))
        # backpropagate the gradients
        loss.backward()
        # update the network parameters
        optimizer.step()
        # accumulate the loss
        total_loss += loss.data
    losses.append(total_loss)
    print('第{}轮,损失函数为:{:.2f}'.format(epoch, total_loss.numpy()[0]))


第0轮,损失函数为:56704.61
第1轮,损失函数为:53935.28
第2轮,损失函数为:52241.16
第3轮,损失函数为:51008.51
第4轮,损失函数为:50113.76
第5轮,损失函数为:49434.07
第6轮,损失函数为:48879.33
第7轮,损失函数为:48404.71
第8轮,损失函数为:47983.95
第9轮,损失函数为:47600.01
第10轮,损失函数为:47240.32
第11轮,损失函数为:46897.53
第12轮,损失函数为:46566.24
第13轮,损失函数为:46241.59
第14轮,损失函数为:45920.18
第15轮,损失函数为:45599.50
第16轮,损失函数为:45277.74
第17轮,损失函数为:44953.10
第18轮,损失函数为:44624.41
第19轮,损失函数为:44290.34
第20轮,损失函数为:43950.63
第21轮,损失函数为:43604.48
第22轮,损失函数为:43251.90
第23轮,损失函数为:42891.99
第24轮,损失函数为:42524.64
第25轮,损失函数为:42149.46
第26轮,损失函数为:41766.14
第27轮,损失函数为:41374.89
第28轮,损失函数为:40975.62
第29轮,损失函数为:40568.36
第30轮,损失函数为:40153.31
第31轮,损失函数为:39730.61
第32轮,损失函数为:39300.70
第33轮,损失函数为:38863.39
第34轮,损失函数为:38419.11
第35轮,损失函数为:37968.16
第36轮,损失函数为:37510.99
第37轮,损失函数为:37048.06
第38轮,损失函数为:36579.82
第39轮,损失函数为:36106.78
第40轮,损失函数为:35629.46
第41轮,损失函数为:35148.57
第42轮,损失函数为:34665.39
第43轮,损失函数为:34180.25
第44轮,损失函数为:33693.93
第45轮,损失函数为:33207.48
第46轮,损失函数为:32721.72
第47轮,损失函数为:32237.36
第48轮,损失函数为:31755.00
第49轮,损失函数为:31275.05
第50轮,损失函数为:30798.38
第51轮,损失函数为:30325.62
第52轮,损失函数为:29857.59
第53轮,损失函数为:29394.65
第54轮,损失函数为:28937.08
第55轮,损失函数为:28485.72
第56轮,损失函数为:28041.07
第57轮,损失函数为:27603.33
第58轮,损失函数为:27173.14
第59轮,损失函数为:26750.82
第60轮,损失函数为:26336.92
第61轮,损失函数为:25931.60
第62轮,损失函数为:25534.87
第63轮,损失函数为:25147.07
第64轮,损失函数为:24768.02
第65轮,损失函数为:24397.92
第66轮,损失函数为:24036.68
第67轮,损失函数为:23684.69
第68轮,损失函数为:23341.30
第69轮,损失函数为:23006.46
第70轮,损失函数为:22680.18
第71轮,损失函数为:22361.95
第72轮,损失函数为:22051.86
第73轮,损失函数为:21749.46
第74轮,损失函数为:21454.48
第75轮,损失函数为:21167.06
第76轮,损失函数为:20886.72
第77轮,损失函数为:20613.04
第78轮,损失函数为:20346.13
第79轮,损失函数为:20085.52
第80轮,损失函数为:19831.27
第81轮,损失函数为:19583.16
第82轮,损失函数为:19341.03
第83轮,损失函数为:19104.43
第84轮,损失函数为:18873.11
第85轮,损失函数为:18646.91
第86轮,损失函数为:18425.87
第87轮,损失函数为:18209.80
第88轮,损失函数为:17998.34
第89轮,损失函数为:17791.97
第90轮,损失函数为:17589.94
第91轮,损失函数为:17392.24
第92轮,损失函数为:17199.04
第93轮,损失函数为:17009.97
第94轮,损失函数为:16824.82
第95轮,损失函数为:16643.87
第96轮,损失函数为:16466.76
第97轮,损失函数为:16293.54
第98轮,损失函数为:16123.99
第99轮,损失函数为:15957.75

Training time: 12m 24s!


In [91]:
# Extract every word's vector from the trained model
vec = model.extract(Variable(torch.LongTensor([v[0] for v in word_to_idx.values()])))
vec = vec.data.numpy()

# Reduce dimensionality with PCA
from sklearn.decomposition import PCA
X_reduced = PCA(n_components=2).fit_transform(vec)

In [107]:
# Plot the 2-D projection of all word vectors
import matplotlib.pyplot as plt
import matplotlib

fig = plt.figure(figsize = (20, 10))
ax = fig.gca()
ax.set_facecolor('black')
ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.4, color = 'white')
# Highlight a few specific words
words = ['智子', '地球', '三体', '质子', '科学', '世界', '文明', '太空', '加速器', '平面', '宇宙', '信息']
# Set a Chinese font, otherwise Chinese characters will not display in the figure
zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/华文仿宋.ttf', size = 35)
for w in words:
    if w in word_to_idx:
        ind = word_to_idx[w][0]
        xy = X_reduced[ind]
        plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')
        plt.text(xy[0], xy[1], w, fontproperties = zhfont1, alpha = 1, color = 'white')



In [109]:
# Define a function that computes cosine similarity
import numpy as np
def cos_similarity(vec1, vec2):
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    norm = norm1 * norm2
    dot = np.dot(vec1, vec2)
    result = dot / norm if norm > 0 else 0
    return result
    
# Among all word vectors, find those closest to the target word, sorted by similarity
def find_most_similar(word, vectors, word_idx):
    vector = vectors[word_to_idx[word][0]]
    simi = [[cos_similarity(vector, vectors[num]), key] for num, key in enumerate(word_idx.keys())]
    sort = sorted(simi)[::-1]
    words = [i[1] for i in sort]
    return words

# Words closest to '智子' (Sophon)
find_most_similar('智子', vec, word_to_idx)[:10]


Out[109]:
['局部', '一场', '来', '错误', '一生', '正中', '航行', '地面', '只是', '政府']

Gensim Word2vec


In [110]:
import gensim
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from gensim.models.word2vec import LineSentence

In [112]:
f = open("../data/三体.txt", 'r')
lines = []
for line in f:
    temp = jieba.lcut(line)
    words = []
    for i in temp:
        # filter out all punctuation marks
        i = re.sub("[\s+\.\!\/_,$%^*(+\"\'””《》]+|[+——!,。?、~@#¥%……&*():;‘]+", "", i)
        if len(i) > 0:
            words.append(i)
    if len(words) > 0:
        lines.append(words)

In [113]:
# Train word vectors by calling gensim's Word2Vec.
# Parameters: size is the dimensionality of the word vectors; window is the width of the context; min_count is the minimum word-frequency threshold for a word to be included
model = Word2Vec(lines, size = 20, window = 2 , min_count = 0)

In [115]:
model.wv.most_similar('三体', topn = 10)


Out[115]:
[('地球', 0.9994192123413086),
 ('组织', 0.9993805289268494),
 ('文明', 0.9993024468421936),
 ('中', 0.9992458820343018),
 ('发展', 0.9992369413375854),
 ('的', 0.9992075562477112),
 ('与', 0.9990826845169067),
 ('社会', 0.9990127682685852),
 ('一次', 0.9990046620368958),
 ('状态', 0.9989502429962158)]

In [1]:
# Project the word vectors into 2-D space
rawWordVec = []
word2ind = {}
for i, w in enumerate(model.wv.vocab):
    rawWordVec.append(model[w])
    word2ind[w] = i
rawWordVec = np.array(rawWordVec)
X_reduced = PCA(n_components=2).fit_transform(rawWordVec)



In [117]:
# Draw a "starry sky" plot
# Plot the 2-D projection of all word vectors
fig = plt.figure(figsize = (15, 10))
ax = fig.gca()
ax.set_facecolor('black')
ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.3, color = 'white')
# Highlight a few specific words
words = ['智子', '地球', '三体', '质子', '科学', '世界', '文明', '太空', '加速器', '平面', '宇宙', '进展','的']
# Set a Chinese font, otherwise Chinese characters will not display in the figure
zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/华文仿宋.ttf', size=26)
for w in words:
    if w in word2ind:
        ind = word2ind[w]
        xy = X_reduced[ind]
        plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')
        plt.text(xy[0], xy[1], w, fontproperties = zhfont1, alpha = 1, color = 'yellow')


END