In [1]:
import numpy as np
import pandas as pd
SOURCE_FILE = 'HN_posts_year_to_Sep_26_2016.csv'
hn = pd.read_csv(SOURCE_FILE)
Each post title is converted into a TF-IDF weighted bag-of-words vector. See the scikit-learn documentation on Tf-idf term weighting for more details. The Wikipedia article on TF-IDF is also a good source.
In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Build TF-IDF vectors over the HN titles, ignoring English stop words and
# any term that appears in more than half of the titles.
vectoriser = TfidfVectorizer(max_df=0.5, min_df=1, stop_words='english', use_idf=True)
tfidf_matrix = vectoriser.fit_transform(hn['title'])
feature_names = vectoriser.get_feature_names()  # vocabulary term for each column
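As a quick, illustrative spot check (not part of the original pipeline; the example title below is made up), the non-zero TF-IDF weights for a single title can be inspected like this:

# Illustrative only: print each vocabulary word found in a made-up title
# together with the TF-IDF weight the vectoriser assigns to it.
example = vectoriser.transform(["Show HN: A neural network that writes headlines"])
for col in example.nonzero()[1]:
    print(feature_names[col], example[0, col])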
In this case we are loading a 300-dimensional word2vec model that has been pre-trained on the Google News corpus.
This means that if we ask the model for a specific word (and the word is known to the model), we get back an ordered list of 300 numbers that represents that word.
In [3]:
from gensim.models import Word2Vec

MODEL_DIM = 300
WORD2VEC_MODEL = "GoogleNews-vectors-negative300.bin"

# Load the pre-trained Google News vectors. (In newer gensim releases this
# loader lives on gensim.models.KeyedVectors rather than Word2Vec.)
model = Word2Vec.load_word2vec_format(WORD2VEC_MODEL, binary=True)
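A quick sanity check (illustrative, not part of the original notebook) shows what the model returns for a single word, assuming that word is in the Google News vocabulary:

# Illustrative: a known word maps to a 300-dimensional vector, and the model
# can also return its nearest neighbours in that space.
print(model['cat'].shape)                 # -> (300,)
print(model.most_similar('cat', topn=3))  # three nearest words in vector space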
Each title is boiled down to a single vector.
This generally consists of taking the vector for each word in the title and performing some computation to produce one vector of the same dimension as the word vectors.
E.g. take the sentence:
The cat sat on the mat.
where each word is represented by a 3-dimensional vector
The -> [1 2 3]
cat -> [4 3 7]
sat -> [6 2 2]
on -> [3 9 1]
the -> [1 2 3]
mat -> [8 3 5]
undergoes some computation (see below) to result in one vector of 3 dimensions.
The word vector for each word in the sentence (if the word is known to the model) is multiplied by the word's TF-IDF score, and the average of these weighted vectors is used as the sentence vector.
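As a concrete sketch of that computation (with made-up TF-IDF weights, and assuming the stop words "the" and "on" are discarded by the vectoriser), the toy vectors above would be combined like this:

import numpy as np

# Toy 3-dimensional word vectors from the example above.
vectors = {'cat': np.array([4, 3, 7]),
           'sat': np.array([6, 2, 2]),
           'mat': np.array([8, 3, 5])}
# Hypothetical TF-IDF weights for the non-stop words.
weights = {'cat': 0.6, 'sat': 0.5, 'mat': 0.7}

# Weight each word vector by its TF-IDF score, then average the results.
sentence_vector = np.mean([vectors[w] * weights[w] for w in vectors], axis=0)
print(sentence_vector)  # one 3-dimensional vector representing the whole sentence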
In [4]:
def title_to_vec(words, model, vectoriser, num_features):
    # Running TF-IDF weighted sum of the word vectors for this title.
    title_vector = np.zeros((num_features), dtype="float64")
    nwords = 0
    # Words the word2vec model knows about.
    index2word_set = set(model.index2word)
    # Sparse row of TF-IDF scores for this title.
    response = vectoriser.transform([words])
    for col in response.nonzero()[1]:
        word = feature_names[col]  # feature_names is defined with the vectoriser above
        if word in index2word_set:
            word_tfidf = response[0, col]
            word_vector = model[word]
            nwords = nwords + 1
            title_vector = np.add(title_vector, word_vector * word_tfidf)
    # Average the weighted vectors; guard against titles with no known words.
    if nwords > 0:
        title_vector = np.divide(title_vector, nwords)
    return title_vector
In [5]:
from functools import partial
vectorise_titles = partial(title_to_vec, vectoriser=vectoriser, num_features=MODEL_DIM, model=model)
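To check that the pipeline behaves as expected before running it over the whole dataset, the partial can be tried on a single (made-up) title; it should return one 300-dimensional vector.

# Illustrative check with a made-up title: the result should be a 300-dim vector.
vec = vectorise_titles("Show HN: Hacker News titles as vectors")
print(vec.shape)  # -> (300,)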
In [6]:
hn_small = hn.head(1000).copy()
# This step could take a while! Reduce the size of the sample if it takes too long...
hn_small.insert(0, 'vector', hn_small['title'].apply(vectorise_titles))
Reduce the dimensionality of the vectors from 300 to 2 so that they can be plotted on a 2D surface.
For this, the Barnes-Hut implementation of the t-SNE algorithm is used.
In [7]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, method="barnes_hut")
X_2d = tsne_model.fit_transform(np.array(hn_small['vector'].values.tolist()))
The 2D versions of the sentence vectors are now plotted as a scatter plot using Plotly.js.
Note: depending on the random_state parameter of the tsne_model, the graph will change with each run. See the documentation for more details.
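If a repeatable layout is wanted, the seed can be pinned (a sketch; random_state=42 is an arbitrary choice):

# Optional, illustrative: fix the seed so the t-SNE layout is the same on every run.
tsne_model = TSNE(n_components=2, method="barnes_hut", random_state=42)
X_2d = tsne_model.fit_transform(np.array(hn_small['vector'].values.tolist()))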
In [8]:
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import Scatter, Layout
import colorlover as cl

init_notebook_mode(connected=True)  # inject plotly.js into the notebook

trace = Scatter(
    x=X_2d[:, 0],
    y=X_2d[:, 1],
    mode="markers",
    text=hn_small['title'].values.tolist()
)
iplot({
    "data": [trace],
    "layout": Layout(title="HN")
})