In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
from datascience import *
import numpy as np
from scipy.spatial.distance import cosine
import gensim
import nltk
The Word2Vec algorithm is a clever application of a more general Neural Network architecture.
Neural Networks can be thought of as function approximators. Oftentimes we have inputs and outputs that can be represented as vectors, but we do not know the function that transforms input to output, even though we would like to reproduce it. For example, if we throw a ball directly upward, we know from Newtonian physics that its height will be a function of time, determined by gravity and the initial velocity, and we can describe its path in terms of these. Without knowing anything about these features of motion, however, a Neural Network could be used to reverse-engineer the path of the ball by showing it a series of inputs (times) and their associated outputs (heights). In turn, the Neural Network would be able to predict the ball's height at later times or in between the times it had been shown.
Neural Networks are very often used to approximate functions that identify whether images have cats in them.
At the grittiest mathematical level, Neural Networks consist of (at least!) two matrices. For convenience, we'll call them Matrix A and Matrix B. Without going into the Linear Algebra, the input vector is initially multiplied by Matrix A, creating a new vector that is then multiplied by Matrix B, producing the network's output vector. And that's it. What matters here are the values contained in each of the matrices.
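To make this concrete, here is a minimal numpy sketch of that forward pass, using the ball-throwing example with arbitrarily chosen sizes. It is purely illustrative and not part of the Word2Vec model we build below.
In [ ]:
# Toy forward pass: an input vector is multiplied by Matrix A, then by Matrix B
toy_input = np.array([[2.0]])            # e.g. a time of 2 seconds, as a 1-dimensional input
toy_A = np.random.randn(1, 3) * 0.1      # Matrix A: 1 input value -> 3 hidden values
toy_B = np.random.randn(3, 1) * 0.1      # Matrix B: 3 hidden values -> 1 output value
toy_hidden = np.dot(toy_input, toy_A)    # multiply by Matrix A
toy_output = np.dot(toy_hidden, toy_B)   # multiply by Matrix B
toy_output                               # the network's (so far random) predicted height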
The values of the network's matrices are learned through an iterative correction process. Initially, the matrices are assigned random values, so naturally the network's output is random as well. However, that initial output gets compared to a desired, target vector, and based on its error the matrices get tweaked. In our previous example, the network would have been trained by showing it a given time (say, 2 seconds after throwing the ball) and comparing its predicted output (say, a randomly chosen -127 feet above ground) to the target output (say, 11 feet above ground). The values of Matrix B would first be revised in order to produce the correct output, and the values of Matrix A would be revised based on those of Matrix B. Further pairs of inputs and outputs would be shown to the Network and the values of each matrix further updated.
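Continuing the toy cell above, a single correction step might look like the following. The error rule and fixed learning rate here are simplified assumptions for illustration; the update we actually use for Word2Vec appears in train_function further down.
In [ ]:
# Toy correction step, continuing from the cell above:
# nudge both matrices so the predicted height moves toward the true height
toy_target = np.array([[11.0]])                  # the observed height at 2 seconds
toy_error = toy_target - toy_output              # how far off was the prediction?
toy_hidden_error = np.dot(toy_error, toy_B.T)    # push the error back through Matrix B
learning_rate = 0.01                             # illustrative, fixed learning rate
toy_B += np.dot(toy_hidden.T, toy_error) * learning_rate
toy_A += np.dot(toy_input.T, toy_hidden_error) * learning_rate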
Once the network has been trained, we might think of Matrix A as encoding relationships among inputs and Matrix B as decoding these into outputs. Word2Vec takes advantage of the fact that when we train a Neural Network on CBOW or Skip-Gram models of language, each row of Matrix A represents a unique word's projection into semantic space.
I have used my own terms to describe the structure of a Neural Network, since they are simpler for our particular application. Networks are often described in terms of Layers, Nodes, and Weights.
The input vector might be referred to as Layer 0, or the input layer. The vector produced by multiplying the input vector by Matrix A is Layer 1, or more often, the hidden layer. And the vector produced after multiplication with Matrix B is Layer 2, or the output layer. The network I have described has just one hidden layer, but there may potentially be very many hidden layers between the input and output. This is what is referred to as deep learning.
Matrices A and B are typically referred to as Weight Matrices, and their dimensions correspond to what are called nodes. The idea is that each node in each layer is connected to every node in the next one, akin to the neurons of a human brain, and the matrices record the strength of the connection between each pair of nodes.
In order to make useful inputs and targets for our Neural Network, each word initially gets represented by a one-hot vector. Essentially, the one-hot encoding of a word is a document-term matrix for a single document containing a single word, with as many columns as there are unique words in the entire corpus. Every value is zero except in the column for the word in question, where it is one. This strategy is used to distinguish words from one another when they are represented as inputs and targets. Since the values of the vector are entirely independent of one another, no relationships between words are expressed by these encodings.
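As a quick illustration, here is a one-hot vector over a tiny, made-up five-word vocabulary; the real encoding below uses the full 10,000-word vocabulary.
In [ ]:
# Toy one-hot encoding over a hypothetical five-word vocabulary
toy_vocab = ['whale', 'sea', 'ship', 'captain', 'harpoon']
toy_word = 'ship'
toy_one_hot = np.zeros(len(toy_vocab))
toy_one_hot[toy_vocab.index(toy_word)] = 1
toy_one_hot                               # array([0., 0., 1., 0., 0.])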
For CBOW, the target output is simply the one-hot vector of the word in question. The input is the average of the one-hot vectors representing context words within a given window.
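Continuing the made-up vocabulary, a single CBOW training pair for the target word 'ship' with context words 'sea' and 'captain' might look like this; the full version over our corpus is implemented in CBOW_input_output below.
In [ ]:
# Toy CBOW pair: the target is the one-hot of 'ship'; the input averages the one-hots of its context
toy_vocab = ['whale', 'sea', 'ship', 'captain', 'harpoon']   # same toy vocabulary as above
toy_target_vector = np.zeros(len(toy_vocab))
toy_target_vector[toy_vocab.index('ship')] = 1
toy_context = ['sea', 'captain']
toy_input_vector = np.zeros(len(toy_vocab))
for word in toy_context:
    toy_input_vector[toy_vocab.index(word)] += 1
toy_input_vector = toy_input_vector / len(toy_context)
toy_input_vector, toy_target_vector       # (array([0., 0.5, 0., 0.5, 0.]), array([0., 0., 1., 0., 0.]))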
In [ ]:
# Set a value for each hyperparameter
vocab_size = 10000             # number of unique tokens to keep
semantic_space_size = 100      # dimensions of the hidden (embedding) layer
window = 5                     # context words considered on each side of the target
alpha = 1e-4                   # learning rate
epochs = 5                     # passes of training
batch_size = 10000             # sentences sampled per pass
In [ ]:
# Assign each unique token to a column in the one-hot vector
from collections import Counter
all_tokens = [token for sentence in words_by_sentence for token in sentence]
token_counts = Counter(all_tokens)
common_tokens = [token for token,frequency in token_counts.most_common(vocab_size-1)]
# We'll add a dummy term to represent all infrequent words
common_tokens.append('_SLUG_')
token_to_ix = { tk:i for i,tk in enumerate(common_tokens) }
ix_to_token = { i:tk for i,tk in enumerate(common_tokens) }
In [ ]:
# Inspect
token_to_ix['whale']
In [ ]:
# Inspect
ix_to_token[1618]
In [ ]:
# Initialize matrices with random values
Matrix_A = np.random.randn(vocab_size,semantic_space_size)*0.01
Matrix_B = np.random.randn(semantic_space_size,vocab_size)*0.01
In [ ]:
# Inspect
Matrix_A
In [ ]:
# We need two functions:
# 1) Produce appropriate input and output vectors for our CBOW model, given a list of tokens in a sentence
# 2) Update the values in our matrices after each observation
def CBOW_input_output(sentence):
    # Get the one-hot column index for each word in the sentence
    ix_map = [token_to_ix[word] if word in common_tokens else token_to_ix['_SLUG_'] for word in sentence]
    # Randomly select a target word
    target_index = np.random.choice(range(len(sentence)))
    # Create (target) vector in which all values are zero
    target_vector = np.zeros((vocab_size,1))
    # Set target-word value to 1
    target_vector[ix_map[target_index]] = 1
    # Create (input) vector in which all values are zero
    input_vector = np.zeros((vocab_size,1))
    # Iterate through each of the window-distance words to the left of the target
    for i in range(target_index-window, target_index):
        # Add 1 to each word's value in the input vector
        if i in range(len(sentence)) and sentence[i] in common_tokens:
            input_vector[ix_map[i]] += 1
    # Iterate through each of the window-distance words to the right of the target
    for i in range(target_index+1, target_index+window+1):
        # Add 1 to each word's value in the input vector
        if i in range(len(sentence)) and sentence[i] in common_tokens:
            input_vector[ix_map[i]] += 1
    # Divide by the number of context words observed
    input_vector = input_vector/(sum(input_vector)+1e-8)
    return input_vector, target_vector
def train_function(input_vector_, target_vector_, Matrix_A_, Matrix_B_, alpha_):
    # Multiply input by matrices to produce output
    hidden_vector = np.dot(input_vector_.T, Matrix_A_)
    output_vector = 1/(1+np.exp(-(np.dot(hidden_vector, Matrix_B_))))
    # Note that after the output vector is produced, it is put into an activation function
    # -- in this case, the logistic function -- that squashes its values between (0,1).
    # This basically measures each vocabulary word's probability of being the true output.
    # Determine error for each matrix
    output_delta = (target_vector_.T - output_vector)
    hidden_delta = np.dot(output_delta, Matrix_B_.T)
    # Update matrix values
    Matrix_B_ += np.dot(hidden_vector.T, output_delta)*alpha_
    Matrix_A_ += np.dot(input_vector_, hidden_delta)*alpha_
    return(Matrix_A_, Matrix_B_)
In [ ]:
# Train our Neural Network
for _ in range(epochs):
    # Sample sentence positions (np.random.choice requires a 1-D argument)
    sampled_indices = np.random.choice(len(words_by_sentence), batch_size)
    for i in sampled_indices:
        sentence = words_by_sentence[i]
        if len(sentence) == 0:   # skip empty sentences, which have no target word
            continue
        input_vector, target_vector = CBOW_input_output(sentence)
        Matrix_A, Matrix_B = train_function(input_vector, target_vector, Matrix_A, Matrix_B, alpha)
In [ ]:
# Inspect
Matrix_A
In [ ]:
# Normalize each matrix row
row_norm = np.linalg.norm(Matrix_A, axis=1)
Matrix_A_norm = Matrix_A / row_norm[:, np.newaxis]
In [ ]:
# Create functions to retrieve word embeddings, determine similarities
def get_vector(word, embedding_matrix=Matrix_A_norm):
    if word in common_tokens:
        ix = token_to_ix[word]
        return embedding_matrix[ix]
    else:
        print(word, "not in vocabulary")

def token_similarity(token_1, token_2):
    vector_1 = get_vector(token_1)
    vector_2 = get_vector(token_2)
    return 1 - cosine(vector_1, vector_2)
In [ ]:
# Retrieve word embedding
get_vector('whale')
In [ ]:
# Pronouns should be close to one another
token_similarity('she','her')
In [ ]:
# Let's compare a pronoun to a control word with a different POS
token_similarity('call','me')
In [ ]:
# Can we recreate the analogy 'king' - 'man' + 'woman' ~ 'queen'?
# This formulation plays with some of the underlying algebra;
# we will revisit it with direct vector arithmetic at the end of the notebook
token_similarity('king','queen') - token_similarity('man','queen') + token_similarity('woman','queen')
In [ ]:
## EX. Write a function that will produce inputs and outputs for the Skip-Gram model.
## EX. Our model randomly selects words from the corpus, so it implicitly prefers frequent tokens.
## Modify the script to minimize (but not eliminate) appearances of high-frequency terms
## as components of input vectors.
## EX. It has been found that the learning process can be sped up by showing "wrong" targets
## during training and penalizing the network when it predicts them as outputs.
## Implement this.
In [ ]:
# Create functions that reject vectors, find nearest words
def vector_rejector(vector_1, vector_2):
    # Rejects vector_2 from vector_1
    # Note that vectors are passed in, not words; also the mean of
    # several vectors could be passed as a single argument
    vector_2_norm = vector_2 / np.linalg.norm(vector_2)
    projection = np.dot(vector_1, vector_2_norm) * vector_2_norm
    return vector_1 - projection

def nearest_words(vector, embedding_matrix=Matrix_A_norm):
    # Note that this relies on our 'ix_to_token' dictionary;
    # the embedding matrix is expected to be a 2-D numpy array
    new_matrix = np.dot(embedding_matrix, vector)
    nearest_rows = gensim.matutils.argsort(new_matrix, topn=10, reverse=True)
    result = [(ix_to_token[row_id], float(new_matrix[row_id])) for row_id in nearest_rows]
    return result
In [ ]:
she_vector = vector_rejector(get_vector('she'),get_vector('he'))
In [ ]:
# Inspect
she_vector
In [ ]:
nearest_words(she_vector)
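With these tools we can also revisit the 'king' - 'man' + 'woman' analogy from earlier, this time performing the arithmetic directly on the embedding vectors. This assumes all three words made it into the vocabulary, and whether 'queen' surfaces near the top depends on how well our small model has trained.
In [ ]:
# Revisit the analogy with direct vector arithmetic on the embeddings
analogy_vector = get_vector('king') - get_vector('man') + get_vector('woman')
analogy_vector = analogy_vector / np.linalg.norm(analogy_vector)
nearest_words(analogy_vector)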
In [ ]: