In [3]:
%matplotlib inline

import matplotlib
import seaborn as sns
# matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

Neural networks and word2vec

A neural network is a machine learning method that chains together a number of simpler models (nodes) across a number of layers, with each layer feeding into the next. The most basic (1-node, 1-layer) neural network is the 'perceptron'.

The Perceptron

The perceptron is a linear decision-boundary classifier that trains by an iterative learning procedure. The model works as follows (a minimal prediction sketch follows this list):

  • Input: a data point is transformed into an $n$-length 'feature vector' $\vec{v} \in \mathbb{R}^n$
  • Output: a classification, one of two classes (conventionally $-1$ or $1$; the example below uses $0$ and $1$)
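
Here's a minimal sketch of a single perceptron prediction, assuming a hard sign threshold and hypothetical hand-picked weights (not the trained values used later in this notebook):

In [ ]:
import numpy as np

def perceptron_predict(v, w, b):
    # Classify feature vector v with weights w and bias b: returns -1 or 1
    return 1 if np.dot(w, v) + b > 0 else -1

# w chosen by hand so that the decision boundary is the line y = x
perceptron_predict(np.array([0.5, 2.0]), w=np.array([-1.0, 1.0]), b=0.0)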

Here's an example of some randomly generated data. In the scatterplot below, we've created random data points in a 2-dimensional vector space and classified them relative to the line $y = x$ (points above the line are class 1, points below are class 0).


In [4]:
from sklearn import metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.special import expit as sigmoid

SIZE = 100

# Generating training data: seed the global RNG so results are reproducible
np.random.seed(1087)

training_x = np.random.normal(size=[SIZE, 2])
# Class 1 if the point lies above the line y = x, class 0 otherwise
training_t = (training_x[:, 1] > training_x[:, 0]).astype(int)

In [5]:
plt.scatter(training_x[:, 0], training_x[:, 1], c=training_t)
plt.xlabel("x[0]")
plt.ylabel("x[1]")
plt.title("Training Data")


[Figure: "Training Data" scatterplot of the two classes]
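
Below we train the model with gradient descent. Note that instead of the classic perceptron's hard threshold, this implementation uses a sigmoid activation $\sigma(z) = \frac{1}{1 + e^{-z}}$, so the update performs gradient descent on the squared error $E = \frac{1}{2}\sum_i \left(\sigma(\vec{w} \cdot \vec{x}_i) - t_i\right)^2$, whose gradient $\nabla_{\vec{w}} E = \sum_i (\sigma_i - t_i)\,\sigma_i (1 - \sigma_i)\,\vec{x}_i$ is exactly the deriv/update pair computed in the loop.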

In [10]:
epochs = 1000
weights = np.array([1., 0.])
learning_rate = .001

normalized_weight_history = []
for epoch in range(epochs):
    # Forward pass: sigmoid of each training point's weighted sum
    pred = sigmoid(np.dot(training_x, weights))
    # Gradient of the squared error, passed back through the sigmoid
    deriv = pred * (1 - pred) * (pred - training_t)
    update = np.dot(deriv, training_x)
    weights -= learning_rate * update
    # Every 50 epochs, record the L1-normalized weights to visualize convergence
    if epoch % 50 == 0:
        normalized_weight_history += [weights / np.abs(weights).sum()]
normalized_weight_history = np.array(normalized_weight_history)

preds = (sigmoid(np.dot(training_x, weights)) > .5).astype(int)

print("Mean Absolute Error:", metrics.mean_absolute_error(training_t, preds))
print("Mean Squared Error:", metrics.mean_squared_error(training_t, preds)) 
print("R^2:", metrics.r2_score(training_t, preds)) 

plt.scatter(
    normalized_weight_history[:, 0],
    normalized_weight_history[:, 1]
)
plt.xlim([-1, 1])
plt.ylim([-1, 1])
plt.xlabel("weight[0]")
plt.ylabel("weight[1]")
plt.title("Convergence of Weights")


Mean Absolute Error: 0.02
Mean Squared Error: 0.02
R^2: 0.918400652795
[Figure: "Convergence of Weights" scatterplot]

In [12]:
# The decision boundary is the line w[0]*x + w[1]*y = 0, i.e. the line
# perpendicular to the weight vector, with slope -w[0]/w[1].
m = -(weights[0]) / (weights[1])
# Since we didn't include a bias term, the boundary passes through (0, 0)
p1 = [0, 0]
x2 = 1.3
y2 = (x2 - p1[0]) * m + p1[1]
p2 = [x2, y2]

def drawLine2P(x, y, xlims):
    # Draw the straight line through the points (x[0], y[0]) and (x[1], y[1])
    xrange = np.arange(xlims[0], xlims[1], 0.1)
    A = np.vstack([x, np.ones(len(x))]).T
    k, b = np.linalg.lstsq(A, y, rcond=None)[0]
    plt.plot(xrange, k * xrange + b, 'k')

plt.scatter(training_x[:,0], training_x[:,1],  c=training_t)
drawLine2P([p1[0], p2[0]], [p1[1], p2[1]], [-3, 3])
plt.xlabel("x[0]")
plt.ylabel("x[1]")
plt.title("Training Data with Prediction")


[Figure: training data with the learned decision boundary]

word2vec

The skip-gram and continuous-bag-of-words (CBOW) architectures used in word2vec are simple neural networks that produce a surprisingly cool result on natural language data: skip-gram trains the network to predict the words surrounding a given word, while CBOW predicts a word from its surrounding context.
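
As a rough sketch of the skip-gram training setup (the pre-tokenized input and window size here are illustrative assumptions, not word2vec's actual preprocessing pipeline), the network is fed (center, context) word pairs like these:

In [ ]:
def skipgram_pairs(tokens, window=2):
    # Yield a (center, context) pair for every word within `window`
    # positions of each center word
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield (center, tokens[j])

list(skipgram_pairs(["man", "is", "to", "king"]))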

word2vec is a library that, given a corpus of natural language text data, maps each word into a high-dimensional vector space. For those of you familiar with information retrieval theory, the concept is widely used in that field: a word's position in the vector space encodes its relationship to other words, so words that appear in similar contexts end up close together. Using this process, we create interesting clusters of related words and, somewhat more compellingly, learn non-trivial associations between words.

For example, consider this sentence,

"Man is to King as Woman is to __"

As an English speaker, you know the answer is "Queen". As it turns out, a well-trained word2vec vector space can also come to this conclusion by the following vector computation:

closest((vector("king") - vector("man")) + vector("woman"))

where we measure closeness via cosine similarity.
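
Here's a minimal sketch of that closeness measure in plain numpy (nothing gensim-specific is assumed):

In [ ]:
import numpy as np

def cosine_similarity(a, b):
    # cos of the angle between vectors a and b: a.b / (|a| |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine_similarity(np.array([1., 0.]), np.array([1., 1.]))  # ~0.707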

The tool: gensim

Gensim is a great Python library for various topic-modelling and clustering tasks. The library recently added word2vec to its toolbox - it's a thin wrapper around the highly optimized C implementation.

To install gensim, you can simply run conda install gensim (or pip install gensim if you're still on pip). You'll need:

  • at least numpy/scipy installed
  • ideally a C compiler, so that you can use the optimized word2vec training (as noted here)

Once installed, training a model on your own corpus takes a single call, as sketched below.
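
A minimal sketch with a toy corpus; note the dimensionality argument is named size in older gensim releases and vector_size in gensim 4.x+ (a version-dependent assumption):

In [ ]:
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per list
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1)
model.wv["king"]  # the learned 10-dimensional vector for "king"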

We'll also use the first pre-trained model (from the Google News dataset) listed on the word2vec Google Code page above to illustrate some of the classic analogies we mentioned. Note these vectors weigh in at about 1.5GB, so sit back and relax!

In [ ]:
import gensim

# Loading the 1.5GB GoogleNews vectors takes a while (and several GB of RAM)
googlenews_model = gensim.models.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

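With the vectors loaded, the analogy from above becomes a single call (most_similar is gensim's actual API, and cosine similarity is its default closeness measure). A well-trained model should rank "queen" at the top:

In [ ]:
googlenews_model.most_similar(positive=["king", "woman"], negative=["man"], topn=3)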