This notebook will show you how to organize a set of documents/articles/posts in 2D so that articles with similar content are grouped near each other. The example here uses the set of articles linked from Wikipedia's "Vital articles" page, but in principle the pipeline works for any set of documents.

The result of this notebook can be viewed live here.

Procedure

The pipeline consists of two steps.

1) Convert all of the articles to a tf-idf matrix.

tf-idf stands for "term frequency inverse document frequency" and is commonly used in natural language processing applications dealing with large collections of documents. A tf-idf matrix is an $n * m$ sparse matrix consisting of $n$ rows, corresponding to our $n$ documents, and $m$ columns, corresponding to the $m$ unique "terms" (usually just words but can be n-grams or other kinds of tokens) that appear in the entire corpus.

Each entry in the matrix, $tfidf(i,j)$ can be interpreted as the "relative importance" of term $j$ to document $i$. It is calculated as

$$tfidf(i,j) = tf(i,j)*idf(i,j)$$

$tf(i, j)$ is the "term frequency," i.e. the fraction of the terms in document $i$ that are term $j$. For example, in the document "the cat in the hat", the term "the" has a $tf$ of (2 / 5) = 0.4. Thus $tf$ is high when the term appears frequently in the document.
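
As a quick sanity check, here is that term-frequency calculation in plain Python (a minimal sketch: tokenization is just a whitespace split, which is an assumption for illustration only):

doc = "the cat in the hat"
tokens = doc.split()                        # ['the', 'cat', 'in', 'the', 'hat']
tf_the = tokens.count("the") / len(tokens)  # 2 / 5
print(tf_the)                               # 0.4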

$idf(i, j)$ is the "inverse document frequency." It is given by:

$$idf(i, j) = \log\left(\frac{N}{n_j}\right)$$

where $N$ is the number of documents in the corpus and $n_j$ is the number of documents which contain term $j$. When $n_j$ is high, this value shrinks towards 0; this happens when a term appears in many or all documents, so common terms like "the", "a", "it", etc. get a low $idf$ score because they appear in most documents. Conversely, when a term rarely appears in documents ($n_j$ is low), its $idf$ score will be high. Such terms tend to be special or topic-specific words which appear in only a few of the documents.

So intuitively, the $tfidf$ score for a term in a document is high when the term appears frequently in that document and infrequently in the rest of the corpus, i.e. when the term is distinctive of that document.
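
To make the whole formula concrete, here is a hand-rolled sketch for a tiny three-document corpus (pure Python, using the natural-log $idf$ exactly as defined above; real implementations such as scikit-learn's TfidfVectorizer use smoothed variants, so their exact numbers will differ):

import math

docs = [
    "the cat in the hat".split(),
    "the dog ate the cat".split(),
    "quantum chromodynamics is hard".split(),
]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)              # term frequency within this document
    n_j = sum(1 for d in docs if term in d)      # number of documents containing the term
    idf = math.log(len(docs) / n_j)              # inverse document frequency
    return tf * idf

print(tfidf("the", docs[0], docs))       # common term -> low score (~0.16)
print(tfidf("quantum", docs[2], docs))   # rare term   -> high score (~0.27)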

2) Reduce the dimensionality. The tf-idf step gives you a high-dimensional feature vector for each of the $n$ documents; these vectors are reduced to 2 dimensions using the t-SNE dimensionality reduction technique (preceded here by a truncated SVD step to make t-SNE faster). A more thorough description of how t-SNE works can be found in the scikit-learn documentation.
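
The rest of this notebook carries out these two steps on the full article set, but the whole pipeline fits in a few lines of scikit-learn. A minimal sketch on a toy corpus (with the perplexity lowered to suit the tiny sample size, and the sparse matrix densified before t-SNE) looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "protons and neutrons are made of quarks",
    "the strong interaction binds quarks together",
]

tfs = TfidfVectorizer().fit_transform(corpus)   # step 1: documents -> tf-idf matrix
xy = TSNE(n_components=2, perplexity=2).fit_transform(tfs.toarray())  # step 2: reduce to 2D
print(xy.shape)                                 # (4, 2)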

Installation

You need nltk and scikit-learn to run most of this notebook. You also need to download the 'punkt' tokenizer data from nltk.

pip install -U nltk
pip install -U scikit-learn
python -c "import nltk; nltk.download('punkt')"

Also, if you don't already have numpy, you should install it as well (it's only used to normalize the data later).

pip install numpy

Additionally, if you are following this example and wish to extract articles from Wikipedia, you need the python wikipedia library.

pip install wikipedia

First, make sure all of these imports work (minus wikipedia if you intend to use another corpus). The rest of this notebook assumes the wikipedia library.


In [2]:
import string
import os
import time
import pickle
import json
import re
import wikipedia
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.preprocessing import normalize
from sklearn.decomposition import TruncatedSVD

nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/itpstudent/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/itpstudent/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[2]:
True

In this example, we are going to cluster all of the links found in the Wikipedia page "Vital articles." First, we will open that page into the variable main, then add the raw text of each linked article to a dictionary called token_dict.


In [3]:
main = wikipedia.page('Wikipedia:Vital articles')
#https://en.wikipedia.org/wiki/Wikipedia:1,000_core_topics
#https://en.wikipedia.org/wiki/User:West.andrew.g/2016_Popular_pages

In [7]:
token_dict = {}
for i, article in enumerate(main.links):
    if article not in token_dict:
        if i % 20 == 0:
            print("getting text for article %d/%d : %s" % (i, len(main.links), article))
        try:
            page = wikipedia.page(article)
            token_dict[article] = page.content
        except Exception:
            print(" ==> error processing " + article)


getting text for article 0/997 : 0
getting text for article 20/997 : Age of Enlightenment
getting text for article 40/997 : Amazon rainforest
getting text for article 60/997 : Anthropology
getting text for article 80/997 : Association football
getting text for article 100/997 : Benjamin Franklin
getting text for article 120/997 : Brazil
getting text for article 140/997 : Carbon
getting text for article 160/997 : Chemical reaction
getting text for article 180/997 : Coal
getting text for article 200/997 : Confucius
getting text for article 220/997 : DNA
getting text for article 240/997 : Discrimination
getting text for article 260/997 : Economics
getting text for article 280/997 : Engineering
getting text for article 300/997 : Existence
getting text for article 320/997 : Flood
getting text for article 340/997 : Function (mathematics)
getting text for article 360/997 : Geometry
getting text for article 380/997 : Great Wall of China
getting text for article 400/997 : Hippocrates
getting text for article 420/997 : History of the world
getting text for article 440/997 : Ibn Khaldun
getting text for article 460/997 : Inorganic chemistry
getting text for article 480/997 : Jainism
getting text for article 500/997 : Judaism
getting text for article 520/997 : Leo Tolstoy
getting text for article 540/997 : Ludwig van Beethoven
getting text for article 560/997 : Marketing
getting text for article 580/997 : Medicine
getting text for article 600/997 : Middle Ages
getting text for article 620/997 : Moon landing
getting text for article 640/997 : Nationalism
getting text for article 660/997 : Nikola Tesla
getting text for article 680/997 : Organic chemistry
getting text for article 700/997 : Personal name
getting text for article 720/997 : Play (activity)
getting text for article 740/997 : Prime number
getting text for article 760/997 : Ramesses II
getting text for article 780/997 : Robotics
getting text for article 800/997 : Scientific method
getting text for article 820/997 : Short story
getting text for article 840/997 : Socialism
getting text for article 860/997 : Spanish language
getting text for article 880/997 : Strong interaction
getting text for article 900/997 : Tank
getting text for article 920/997 : Theravada
getting text for article 940/997 : Tuberculosis
getting text for article 960/997 : Virus
getting text for article 980/997 : Wind power

You can save the text to disk in the following cell.


In [3]:
pickle.dump(token_dict, open('fulltext_WikiVitalArticles.p', 'wb'))

Later, you can retrieve it like this:


In [5]:
token_dict = pickle.load(open('fulltext_WikiVitalArticles.p', 'rb'))

The next cell will tokenize the articles, calculate the tf-idf matrix, and then reduce it to 500 dimensions with a truncated SVD decomposition (which speeds up the t-SNE step later).


In [7]:
def tokenize(text):
    text = text.lower()    # lower case
    text = re.sub(r"[%s\n\t]+" % re.escape(string.punctuation), ' ', text)  # replace punctuation and whitespace with spaces
    text = re.sub(r"[ ]+", " ", text)  # collapse repeated spaces
    tokens = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]  # stem each token
    return stems

# calculate tfidf (might take a while)
print("calculating tf-idf")
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

print("reducing tf-idf to 500 dim")
tfs_reduced = TruncatedSVD(n_components=500, random_state=0).fit_transform(tfs)
print("done")


calculating tf-idf
reducing tf-idf to 500 dim
done
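
Before moving on, it can be useful to peek at what tf-idf actually learned. This quick sketch assumes the cell above has just run (so tfidf, tfs and token_dict are still in memory) and that you are on scikit-learn 1.0 or newer; on older releases, use get_feature_names() instead of get_feature_names_out(). It prints the ten highest-scoring stems for the first article:

terms = tfidf.get_feature_names_out()   # vocabulary terms, in column order
names = list(token_dict.keys())

row = tfs[0].toarray().ravel()          # tf-idf scores for the first article
top = np.argsort(row)[::-1][:10]        # indices of the 10 highest-scoring terms
print(names[0], [terms[j] for j in top])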

You can save the calculation to disk in the following cell.


In [8]:
pickle.dump(tfs_reduced, open('tfidf_WikiVitalArticles.p', 'wb'))

You can load it back later like this:


In [10]:
tfs_reduced = pickle.load(open('tfidf_WikiVitalArticles.p', 'rb'))

Now calculate t-SNE on the reduced feature vectors and normalize the resulting coordinates to the range [0, 1].


In [18]:
# calculate t-SNE
tsne = TSNE(n_components=2, perplexity=50, verbose=2).fit_transform(tfs_reduced)

# normalize the coordinates to [0, 1] and collect into a dict (saved to json below)
x_axis, y_axis = tsne[:, 0], tsne[:, 1]
x_norm = (x_axis - np.min(x_axis)) / (np.max(x_axis) - np.min(x_axis))
y_norm = (y_axis - np.min(y_axis)) / (np.max(y_axis) - np.min(y_axis))
data = {"x": [float(x) for x in x_norm.tolist()], "y": [float(y) for y in y_norm.tolist()], "names": list(token_dict.keys())}


[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 997 samples in 0.005s...
[t-SNE] Computed neighbors for 997 samples in 0.780s...
[t-SNE] Computed conditional probabilities for sample 997 / 997
[t-SNE] Mean sigma: 0.381256
[t-SNE] Computed conditional probabilities in 0.052s
[t-SNE] Iteration 50: error = 72.6430206, gradient norm = 0.3293283 (50 iterations in 1.764s)
[t-SNE] Iteration 100: error = 74.4200134, gradient norm = 0.3027419 (50 iterations in 1.741s)
[t-SNE] Iteration 150: error = 75.8908005, gradient norm = 0.2851454 (50 iterations in 1.800s)
[t-SNE] Iteration 200: error = 75.6168747, gradient norm = 0.3014873 (50 iterations in 1.789s)
[t-SNE] Iteration 250: error = 76.6445007, gradient norm = 0.2806742 (50 iterations in 1.782s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 76.644501
[t-SNE] Iteration 300: error = 1.5614842, gradient norm = 0.0028795 (50 iterations in 1.657s)
[t-SNE] Iteration 350: error = 1.4017555, gradient norm = 0.0009432 (50 iterations in 1.724s)
[t-SNE] Iteration 400: error = 1.3293533, gradient norm = 0.0004273 (50 iterations in 1.510s)
[t-SNE] Iteration 450: error = 1.3067455, gradient norm = 0.0002495 (50 iterations in 1.451s)
[t-SNE] Iteration 500: error = 1.2943467, gradient norm = 0.0002196 (50 iterations in 1.454s)
[t-SNE] Iteration 550: error = 1.2851582, gradient norm = 0.0002550 (50 iterations in 1.453s)
[t-SNE] Iteration 600: error = 1.2779895, gradient norm = 0.0003128 (50 iterations in 1.447s)
[t-SNE] Iteration 650: error = 1.2735915, gradient norm = 0.0001538 (50 iterations in 1.452s)
[t-SNE] Iteration 700: error = 1.2691839, gradient norm = 0.0001373 (50 iterations in 1.471s)
[t-SNE] Iteration 750: error = 1.2658666, gradient norm = 0.0001648 (50 iterations in 1.447s)
[t-SNE] Iteration 800: error = 1.2637744, gradient norm = 0.0001543 (50 iterations in 1.446s)
[t-SNE] Iteration 850: error = 1.2607478, gradient norm = 0.0001401 (50 iterations in 1.472s)
[t-SNE] Iteration 900: error = 1.2588487, gradient norm = 0.0000997 (50 iterations in 1.463s)
[t-SNE] Iteration 950: error = 1.2562903, gradient norm = 0.0001329 (50 iterations in 1.483s)
[t-SNE] Iteration 1000: error = 1.2551097, gradient norm = 0.0000897 (50 iterations in 1.480s)
[t-SNE] Error after 1000 iterations: 1.255110
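
If you want a quick look at the layout before building the HTML grid below, a scatter plot of the normalized coordinates works. This is an optional sketch and assumes matplotlib is installed (it is not part of the requirements listed above):

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
plt.scatter(data["x"], data["y"], s=6)
# label every 100th article so the clusters stay readable
for idx in range(0, len(data["names"]), 100):
    plt.annotate(data["names"][idx], (data["x"][idx], data["y"][idx]), fontsize=7)
plt.show()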

Save the data to a JSON file for safe-keeping.


In [13]:
with open('tsne_wikiVitalArticles.json', 'w') as outfile:
    json.dump(data, outfile)

We can also convert the t-SNE points to an nx-by-ny grid assignment using the RasterFairy library (not included in the installs above, so install it separately). Note that nx * ny here is 992, so only the first 992 articles are placed on the grid.


In [15]:
nx, ny = 32, 31

import rasterfairy
grid_assignment = rasterfairy.transformPointCloud2D(tsne[0:nx*ny, :], target=(nx, ny))[0]

The next cell will create an HTML file with the gridded Wikipedia articles arranged by similarity.


In [17]:
grid_sorted = sorted(range(len(grid_assignment)), key=lambda k: grid_assignment[k][1]*nx + grid_assignment[k][0])
keys = list(token_dict.keys())

links_grid = [[0 for x in range(nx)] for y in range(ny)] 
for i, g in enumerate(grid_assignment):
    links_grid[int(g[1])][int(g[0])] = keys[i]

table_html = '<table>\n'
for row in links_grid:
    table_html += '\t<tr>\n'
    for col in row:
        table_html += '\t\t<td><a href=\"https://en.wikipedia.org/wiki/%s\">%s</a></td>\n' % (col, col)
    table_html += '\t</tr>\n'
table_html += '</table>\n'

html = '''
    <head>
    <style>
        body {
          padding-top: 80px;
          text-align: center;
          font-family: monaco, monospace;
          background-size: cover;
        }
        table {
            text-align: center;
        }
        tr {
          background-color:#ff0;
        }
        td {
          padding:10px; 
        }
    </style>
    </head>
    <body>
        %s
    </body>
''' % table_html

with open('index.html', 'wb') as text_file:
    text_file.write(html.encode('utf-8'))