In [1]:
import numpy as np
import pandas as pd

SOURCE_FILE = 'HN_posts_year_to_Sep_26_2016.csv'

hn = pd.read_csv(SOURCE_FILE)

Fitting a TF-IDF matrix

See the scikit-learn documentation on Tf-idf term weighting for more details. The Wikipedia article on TF-IDF is also a good source.


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer(max_df=0.5, min_df=1, stop_words='english', use_idf=True)
tfidf_matrix = vectoriser.fit_transform(hn['title'])
feature_names = vectoriser.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
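
As a quick sanity check (an addition, not part of the original run), the cell below prints the matrix shape (one row per title, one column per vocabulary term) and the highest-weighted terms for the first title.


In [ ]:
# Sanity check: one row per title, one column per vocabulary term
print(tfidf_matrix.shape)

# Highest-weighted terms for the first title
row = tfidf_matrix[0].toarray().ravel()
top = row.argsort()[::-1][:5]
print([(feature_names[i], round(row[i], 3)) for i in top if row[i] > 0])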

Load Word2Vec model

In this case we are loading a 300-dimensional word2vec model that has been pre-trained on the Google News corpus.

This means if we ask the model for a specific word (and the word is known to the model), we will get an ordered list of 300 numbers that represent that word.


In [3]:
from gensim.models import KeyedVectors

MODEL_DIM = 300
WORD2VEC_MODEL = "GoogleNews-vectors-negative300.bin"
# Word2Vec.load_word2vec_format was removed in gensim 1.0; KeyedVectors is the current API
model = KeyedVectors.load_word2vec_format(WORD2VEC_MODEL, binary=True)
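
To make this concrete, the short check below (an addition; it assumes a common word such as 'cat' is in the Google News vocabulary) looks up one word:


In [ ]:
# Look up one word; 'cat' is assumed to be in the vocabulary
vec = model['cat']
print(vec.shape)  # (300,): one number per dimension
print(vec[:5])    # the first few of the 300 numbers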

A single vector per title

Each title (treated as one sentence) is boiled down to one vector.

This process generally consists of taking the vector for each word in the sentence and combining them, via some computation, into a single vector of the same dimension as each word.

E.g. a six-word sentence:

The cat sat on the mat.

where each word is represented by a 3-dimensional vector

The -> [1 2 3]
cat -> [4 3 7]
sat -> [6 2 2]
on  -> [3 9 1]
the -> [1 2 3]
mat -> [8 3 5]

undergoes some computation (see below) to result in one vector of 3 dimensions.
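
The simplest such computation is a plain element-wise average; a minimal NumPy sketch of the toy example (the weighted variant actually used here is described next):


In [ ]:
# Plain element-wise average of the toy 3-dimensional word vectors above
words = np.array([[1, 2, 3],   # The
                  [4, 3, 7],   # cat
                  [6, 2, 2],   # sat
                  [3, 9, 1],   # on
                  [1, 2, 3],   # the
                  [8, 3, 5]])  # mat
print(words.mean(axis=0))  # one 3-dimensional vector for the whole sentence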

Use TF-IDF scores to generate a vector for each sentence

The word vector for each word in the sentence (if the word is known to the model) is multiplied by the word's TF-IDF score, and the average of these weighted vectors is taken as the sentence vector.


In [4]:
def title_to_vec(words, model, vectoriser, num_features):
    """Return the TF-IDF-weighted average of the word vectors in a title."""
    title_vector = np.zeros(num_features, dtype="float64")
    nwords = 0
    response = vectoriser.transform([words])
    # Walk over the non-zero TF-IDF entries for this title
    for col in response.nonzero()[1]:
        word = feature_names[col]
        # Skip words the word2vec model does not know
        if word in model.key_to_index:  # model.index2word in gensim < 4.0
            word_tfidf = response[0, col]
            word_vector = model[word]
            nwords += 1
            title_vector = np.add(title_vector, word_vector * word_tfidf)
    # Average over the matched words (leave the zero vector if none matched)
    if nwords > 0:
        title_vector = np.divide(title_vector, nwords)
    return title_vector
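
As a spot check (assuming the model and vectoriser from the cells above), the function can be applied to a single title:


In [ ]:
# The result should be a single 300-dimensional vector
sample = title_to_vec(hn['title'].iloc[0], model, vectoriser, MODEL_DIM)
print(sample.shape)  # (300,)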

In [5]:
from functools import partial

vectorise_titles = partial(title_to_vec, vectoriser=vectoriser, num_features=MODEL_DIM, model=model)

In [6]:
hn_small = hn.head(1000).copy()  # copy to avoid a SettingWithCopyWarning

# This step could take a while! Shrink the sample above if it takes too long...
hn_small.insert(0, 'vector', hn_small['title'].apply(vectorise_titles))

Visualising the list of title vectors

Step 1

Reduce the dimensionality of the vectors from 300 to 2 so that they can be plotted on a 2D surface.

For this process the Barnes-Hut implementation of the t-SNE algorithm is used.


In [7]:
from sklearn.manifold import TSNE

tsne_model = TSNE(n_components=2, method="barnes_hut")
X_2d = tsne_model.fit_transform(np.array(hn_small['vector'].values.tolist()))

Step 2

The 2D versions of the title vectors are now plotted as a scatter plot using Plotly.js.

Note: unless the random_state parameter of the tsne_model is fixed, the layout will change with each run. See the documentation for more details.
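
For a reproducible layout, the seed can be pinned (a sketch; 42 is an arbitrary choice):


In [ ]:
# Pinning random_state makes the embedding identical across runs
tsne_model = TSNE(n_components=2, method="barnes_hut", random_state=42)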


In [8]:
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import Scatter, Layout  # named imports instead of import *
import colorlover as cl

init_notebook_mode(connected=True) # inject plotly.js into the notebook

trace = Scatter(
    x = X_2d[:, 0],
    y = X_2d[:, 1],
    mode = "markers",
    text = hn_small['title'].values.tolist()
    )

iplot({
        "data": [trace],
        "layout": Layout(title="HN")
    })