This notebook will show you how to lay out a set of documents/articles/posts in 2D so that items with similar content end up near each other. The example here uses the set of articles linked from Wikipedia's "Vital articles" page, but in principle the pipeline works for any set of documents.
The result of this notebook can be viewed live here.
The pipeline consists of two steps.
1) Convert all of the articles to a tf-idf matrix.
tf-idf stands for "term frequency inverse document frequency" and is commonly used in natural language processing applications that deal with large collections of documents. A tf-idf matrix is an $n \times m$ sparse matrix with $n$ rows, corresponding to our $n$ documents, and $m$ columns, corresponding to the $m$ unique "terms" (usually just words, but they can be n-grams or other kinds of tokens) that appear in the entire corpus.
Each entry in the matrix, $tfidf(i,j)$, can be interpreted as the "relative importance" of term $j$ to document $i$. It is calculated as

$$tfidf(i,j) = tf(i,j) \cdot idf(j)$$

$tf(i, j)$ is the "term frequency," i.e. the fraction of the terms in document $i$ which are term $j$. For example, in the document "the cat in the hat", the term "the" has a $tf$ of $2 / 5 = 0.4$. Thus $tf$ is high when the term appears frequently in the document.

$idf(j)$ is the "inverse document frequency." Note that it depends only on the term $j$, not on the document. It is given by:

$$idf(j) = \log\left(\frac{N}{n_j}\right)$$

where $N$ is the number of documents in the corpus and $n_j$ is the number of documents which contain term $j$. When $n_j$ is large, this value shrinks towards 0. This happens when the term appears in many or all documents, so common terms like "the", "a", "it", etc. will have a low $idf$ score because they appear in most documents. Conversely, when a term appears in few documents ($n_j$ is small), its $idf$ score will be high. These tend to be special or topic-specific words which appear in only a handful of the documents.
So intuitively, the $tfidf$ score of a term in a document is high when that term appears frequently in the document but infrequently in the rest of the corpus, i.e. when the term is distinctive of that document.
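To make these formulas concrete, here is a small worked example on a made-up three-document corpus. The corpus and helper functions are purely illustrative; note that scikit-learn's TfidfVectorizer, which we use later, applies a smoothed variant of $idf$ and normalizes each row, so its values will differ slightly from this textbook version.

import math

# toy corpus: 3 "documents", already tokenized
docs = [
    "the cat in the hat".split(),
    "the dog chased the cat".split(),
    "quantum physics is hard".split(),
]
N = len(docs)

def tf(term, doc):
    # fraction of the tokens in `doc` that are `term`
    return doc.count(term) / len(doc)

def idf(term):
    # log(N / n_j), where n_j = number of documents containing `term`
    n_j = sum(1 for doc in docs if term in doc)
    return math.log(N / n_j)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("the", docs[0]))   # "the" is in 2 of 3 docs: 0.4 * log(3/2) ≈ 0.162
print(tfidf("hat", docs[0]))   # "hat" is only in doc 0:  0.2 * log(3/1) ≈ 0.220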
2) Reduce this high-dimensional matrix of $n$ documents to 2 dimensions using the t-SNE dimensionality reduction technique. In practice we first compress the tf-idf matrix to a few hundred dimensions with a truncated SVD, which makes t-SNE much faster; a more thorough description of how t-SNE itself works can be found in the scikit-learn documentation.
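To illustrate the pattern used later (truncated SVD followed by t-SNE), here is a minimal sketch with random data standing in for the real tf-idf features; the matrix size, number of SVD components, and perplexity below are arbitrary placeholders.

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# stand-in for a tf-idf matrix: 200 "documents" x 1000 "terms"
X = np.abs(np.random.randn(200, 1000))

# compress to a smaller dense representation first...
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# ...then embed into 2 dimensions with t-SNE
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)
print(X_2d.shape)   # (200, 2)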
You need nltk and scikit-learn to run most of this notebook. You also need to download the 'punkt' and 'stopwords' datasets from nltk.
pip install -U nltk
pip install -U scikit-learn
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
Also, if you don't already have numpy, you should install it as well (it's only used to normalize the data later).
pip install numpy
Additionally, if you are following this example and wish to extract articles from Wikipedia, you need the python wikipedia library.
pip install wikipedia
First make sure all of these imports work (minus wikipedia if you intend to use another corpus). We will assume wikipedia from here on.
In [2]:
import string
import os
import time
import pickle
import json
import re
import wikipedia
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.preprocessing import normalize
from sklearn.decomposition import TruncatedSVD
nltk.download('punkt')
nltk.download('stopwords')
In this example, we are going to cluster all of the articles linked from the Wikipedia page "Vital articles." First, we will load that page into the variable main, then download each linked article and add its raw text to a dictionary called token_dict, keyed by article title.
In [3]:
main = wikipedia.page('Wikipedia:Vital articles')
#https://en.wikipedia.org/wiki/Wikipedia:1,000_core_topics
#https://en.wikipedia.org/wiki/User:West.andrew.g/2016_Popular_pages
In [7]:
token_dict = {}
for i, article in enumerate(main.links):
    if article not in token_dict:
        if i % 20 == 0:
            print("getting text for article %d/%d : %s" % (i, len(main.links), article))
        try:
            page = wikipedia.page(article)
            token_dict[article] = page.content
        except Exception:
            print(" ==> error processing " + article)
You can save the text to disk in the following cell.
In [3]:
pickle.dump(token_dict, open('fulltext_WikiVitalArticles.p', 'wb'))
Later, you can retrieve it like this:
In [5]:
token_dict = pickle.load(open('fulltext_WikiVitalArticles.p', 'rb'))
The next cell will calculate the tf-idf matrix and then reduce it to 500 dimensions with a truncated SVD.
In [7]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def tokenize(text):
    text = text.lower()  # lower-case everything
    text = re.sub(r"[%s\n\t]+" % re.escape(string.punctuation), ' ', text)  # replace punctuation with spaces
    text = re.sub(r" +", " ", text)  # collapse repeated spaces
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]  # drop stopwords
    stems = [stemmer.stem(t) for t in tokens]  # stem each remaining token
    return stems
# calculate tfidf (might take a while)
print("calculating tf-idf")
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())
print("reducing tf-idf to 500 dim")
tfs_reduced = TruncatedSVD(n_components=500, random_state=0).fit_transform(tfs)
print("done")
You can save the calculation to disk in the following cell.
In [8]:
pickle.dump(tfs_reduced, open('tfidf_WikiVitalArticles.p', 'wb'))
You can load it back later like this:
In [10]:
tfs_reduced = pickle.load(open('tfidf_WikiVitalArticles.p', 'rb'))
Now calculate t-SNE on the reduced feature vectors and normalize the resulting coordinates to [0, 1].
In [18]:
# calculate t-SNE
tsne = TSNE(n_components=2, perplexity=50, verbose=2).fit_transform(tfs_reduced)
# save to json file
x_axis, y_axis = tsne[:, 0], tsne[:, 1]
x_norm = (x_axis-np.min(x_axis)) / (np.max(x_axis) - np.min(x_axis))
y_norm = (y_axis-np.min(y_axis)) / (np.max(y_axis) - np.min(y_axis))
data = {"x": [float(x) for x in x_norm.tolist()], "y": [float(y) for y in y_norm.tolist()], "names": list(token_dict.keys())}
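If you want a quick visual sanity check of the layout before exporting it, you can scatter-plot the normalized points. This assumes you have matplotlib installed; it is not required anywhere else in this notebook.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
plt.scatter(x_norm, y_norm, s=4)
plt.title('t-SNE layout of the Wikipedia articles')
plt.show()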
Save it to a JSON file for safekeeping.
In [13]:
with open('tsne_wikiVitalArticles.json', 'w') as outfile:
    json.dump(data, outfile)
We can also convert the t-SNE points to an nx by ny grid assignment using the rasterfairy library (which you will need to have installed).
In [15]:
import rasterfairy

nx, ny = 32, 31

# assign the first nx*ny t-SNE points to cells of an nx-by-ny grid
grid_assignment = rasterfairy.transformPointCloud2D(tsne[0:nx*ny, :], target=(nx, ny))[0]
The next cell will create an HTML file with the gridded Wikipedia articles arranged by similarity.
In [17]:
keys = list(token_dict.keys())

# place each article title into its assigned grid cell
links_grid = [[0 for x in range(nx)] for y in range(ny)]
for i, g in enumerate(grid_assignment):
    links_grid[int(g[1])][int(g[0])] = keys[i]

# build an HTML table with one linked article per cell
table_html = '<table>\n'
for row in links_grid:
    table_html += '\t<tr>\n'
    for col in row:
        table_html += '\t\t<td><a href="https://en.wikipedia.org/wiki/%s">%s</a></td>\n' % (col, col)
    table_html += '\t</tr>\n'
table_html += '</table>\n'
html = '''
<head>
<style>
body {
padding-top: 80px;
text-align: center;
font-family: monaco, monospace;
background-size: cover;
}
table {
text-align: center;
}
tr {
background-color:#ff0;
}
td {
padding:10px;
}
</style>
</head>
<body>
%s
</body>
''' % table_html
with open('index.html', 'wb') as text_file:
    text_file.write(html.encode('utf-8'))