Text Clustering


This notebook is a work in progress regarding text clustering task. Overall, the libraries used are Numpy, NLTK, Seaborn, Sklearn and Gensim.

Feel free to report any feedback by creating a issue on Github, or contacting me directly.


Clustering is about categorizing/organizing/labelling objects such as to maximize the similarity between objects in one cluster/group (inner class similarity) while maximazing the dissimilarity between different clusters (inter class similarity).

Clustering is an example of an unsupervised learning algorithm.

In the following cells I will explore clustering related to text/sentences. In such context similarity should target the semantic and pragmatic meaning of the text: sentences with the same or closely similar meaning should fall into the same category.

In [12]:
import itertools 
import nltk
import csv
import scipy

import numpy as np
import pandas as pd

import seaborn as sns

sns.set_context("notebook", font_scale=1.5)

%matplotlib inline

Data Preprocessing

We start with clean sentences as data to feed to the pipeline. Different preprocessing steps might be needed or can be used to possibly improve final results; these include:

  • HTML or other markup tags removal [BeautifulSoup]
  • Remove possible non relevant content (e.g. URLs, numbers, standard strings)
  • Remove stopwords
  • Lemmization and stemming [NLTK]

In [3]:
# Dummy example data
vocabulary_size = 1000
sentences = ["A brown fox jumped on the lazy dog", 
            "A brown fox jumped on the brown duck",
            "A brown fox jumped on the lazy elephant",
            "An elephant is eating green grass near the alpaca",
            "A green alpaca tried to jump over an elephant",
            "May you rest in a deep and dreamless slumber"]
df = pd.DataFrame(sentences, columns=['sentences'])

0 A brown fox jumped on the lazy dog
1 A brown fox jumped on the brown duck
2 A brown fox jumped on the lazy elephant
3 An elephant is eating green grass near the alpaca
4 A green alpaca tried to jump over an elephant
5 May you rest in a deep and dreamless slumber

Text Vectorization

We want to represent our text using a Vector Space Model, meaning that each sentence should be encoded as a continuous vector of numbers, such that semantic similarity between sentences are computable using mathematical distance metrics (e.g. Euclidean distance, Manhattan distance, cosine).

Notice that for more complex techniques, your sentence can be encoded as a matrix (for example if each word is embedded as a vector), or more generally as a tensor (a N-dimensional vector).

Basic Vectorizer Methods

Common ways to vectorize your sentences are based on words count. Each sentence is represented by a vector of length N, where N is the size of your vocabulary. Each element of the vector is then associated with a word (or N-gram), and has a value that depends on the technique used for the vectorization.

  • binarize
  • count
  • tf-idf (term frequency * inverse term frequency)
  • hashing

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [5]:
# This class accepts functions for preprocessing and tokenization, 
# so you can operate your data cleaning directly at this point
vectorizer =TfidfVectorizer(analyzer="word", max_features=vocabulary_size, stop_words="english", ngram_range=(1,2))
X = vectorizer.fit_transform(df['sentences'].values)

In [6]:

(6, 37)

In [7]:

 'alpaca tried',
 'brown duck',
 'brown fox',
 'deep dreamless',
 'dreamless slumber',
 'eating green',
 'elephant eating',
 'fox jumped',
 'grass near',
 'green alpaca',
 'green grass',
 'jump elephant',
 'jumped brown',
 'jumped lazy',
 'lazy dog',
 'lazy elephant',
 'near alpaca',
 'rest deep',
 'tried jump']

In [67]:

array([[ 0.29315886,  0.35750457,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.24750486,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.29315886,
         0.35750457,  0.        ,  0.35750457,  0.35750457,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.35750457,  0.35750457]])

Words Embeddings [TODO]

More advanced techniques might be able to capture a deeper semantic meaning of sentences

  • WordToVec
  • Doc2Vec
  • ..

Similarity and Distance Metrics

You can test how similar/close sentences are, or directly rank all your data in one round. A common measure used for this case and in NLP tasks in general is cosine similarity: it consider the cosine of the angle between two vectors which is 1 for 0 degrees, lower otherwise.

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

In [74]:
# Retrieve the sentence of our example dataset most similar to some test sentences
test_sentences = ["How to put an alpaca to sleep", "How to achieve a deep sleep"]
test_data = vectorizer.transform(test_sentences)
for y in test_data:
    res = cosine_similarity(y, X)
    index = np.argsort(res[0])[-1]

A green alpaca tried to jump over an elephant
May you rest in a deep and dreamless slumber

Dimensionality Reduction and Features Selection

Ref 1

Reducing the dimensionality of your data can reduce the computation time for next steps in the pipeline, as well as improving your overall results. More data doesn't always imply better outcomes.

An initial analysis can be operated considering factors that are indicative of a feature importance, these include:

  • missing values ratio (remove if percentage of missing values is higher than a predefined threshold)
  • low variance (remove if variance is lower than a predefined threshold). Given that variance is range dependent, normalized is required.

A second step can rely on correlation measures, for which similar features are manually reduced to a single one.

Using machine learning models to get insight about features importance is another good method that leverages the power already embedded in the models themselves. A good model is the Random Forest one, especially given how the results are easily interpretable. [??others]

A different approach sees the measurements of results quality/performances using different set of features. This approach can be operated in two way: backward feature elimination (remove one at each step) and forward feature construction (add one at each step).

A multitude of other techniques can be used, more complex and self-contained:

  • Principal Component Analysis (PCA)
  • Latent Dirichlet allocation (LDA)
  • Latent semantic analysis/indexing (LSA)
  • chi-squared
  • ...


Now we can simply treat our sentences as vectors and rely on the preferred clustering technique. As already mentioned, clustering is a type of unsupervised learning task.

It is common for clustering algorithms to consist of two steps: initialization and refinement. The clustering algorithm used will start from the initial (possibly random) clustering state and iteratively refine our clusters, based on an a criterion function, which also defines the stopping criteria.

Some possible clustering approaches are:

  • partitional (k-means)
  • hierarchical (agglomerative): build a tree representation of the data
  • density-based (DBSCAN)
  • spectral
  • mixtures


Fast, partitional-type, noisy

You specify how many clusters you want to generate, and the algorithm will iteratively process the data trying to define clusters that minimize inner-cluster difference and maximize the inter-cluster one.

Given that the algorithm randomly initialize the clusters centroids, results can be non-optimal. To improve results it is common to run the algorithm a different number of times, and pick only the clustering which gives the best results (lowest value for the cost function). This process is already embedded in Sklearn implementation.

There is no right number of clusters, the choice should be based on the context, requirements and possible visual/expert-knowledge judgment. A more theoretical approach consists in monitoring the cost function with different numbers of clusters, and spot the "elbow" of the cost curve.

In [68]:
from sklearn.cluster import KMeans

In [71]:
# Specify number of clusters and fit the data
num_clusters = 3
km = KMeans(num_clusters=num_clusters)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [72]:
# Predict/retrieve the cluster ID of our data
df['Cluster'] = km.predict(X)

sentences Cluster
0 A brown fox jumped on the lazy dog 0
1 A brown fox jumped on the brown duck 0
2 A brown fox jumped on the lazy elephant 0
3 An elephant is eating green grass near the alpaca 1
4 A green alpaca tried to jump over an elephant 1
5 May you rest in a deep and dreamless slumber 2

Elbow Method

This method consist in visualizing how some clustering metrics evolve based on the number of clusters used. Metrics like percentage of variance explained, or inner-cluster mean distance can be plotted for a range of values. Where improvements show a clean slow down (elbow) it might be an indicator of a good choice for value to use.

In [ ]:
from scipy.spatial.distance import cdist, pdist
centroids = []
dists = []
data = X
num_clusters = np.arange(1, 30)
for k in num_clusters:
    km = KMeans(n_clusters=k)
    dist = np.min(cdist(data, km.cluster_centers_, 'euclidean'), axis=1)
    dist_avg = sum(dist)/len(data)

In [ ]:
sns.pointplot(x=num_clusters, y=dists)

Singular Value Decomposition (SVD)

SVD: "factorizes a matrix into one matrix with orthogonal columns and one with orthogonal rows (along with a diagonal matrix, which contains the relative importance of each factor)."


In [9]:
X = X.todense() #get our data matrix in dense representation

In [11]:

(6, 37)

In [18]:
vocab = np.array(vectorizer.get_feature_names())

In [15]:
U, s, Vh = scipy.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vh.shape)

(6, 6) (6,) (6, 37)

In [21]:

def show_topics(a):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]

In [22]:

['dreamless dreamless slumber slumber',
 'brown fox brown fox',
 'deep deep dreamless dreamless']

In [ ]: