This notebook is a work in progress on the task of text clustering. Overall, the libraries used are Numpy, NLTK, Seaborn, Sklearn and Gensim.
Feel free to report any feedback by creating an issue on Github, or by contacting me directly.
Clustering is about categorizing/organizing/labelling objects so as to maximize the similarity between objects within the same cluster/group (intra-class similarity) while minimizing the similarity between different clusters (inter-class similarity).
Clustering is an example of an unsupervised learning algorithm.
In the following cells I will explore clustering applied to text/sentences. In such a context, similarity should target the semantic and pragmatic meaning of the text: sentences with the same or closely similar meaning should fall into the same category.
In [12]:
import itertools
import nltk
import csv
import scipy.linalg
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_context("notebook", font_scale=1.5)
%matplotlib inline
We start with clean sentences as the data to feed into the pipeline. Different preprocessing steps might be needed, or can be used to possibly improve the final results; these include lowercasing, punctuation removal, stop-word removal, and stemming or lemmatization.
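For instance, a minimal preprocessing sketch using NLTK could look like the following (the exact steps, the Porter stemmer and the English stop-word list are just one possible choice here):
In [ ]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# one-time downloads, uncomment if the resources are missing
# nltk.download('punkt')
# nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    # lowercase, tokenize, drop stop words and non-alphabetic tokens, stem the rest
    tokens = word_tokenize(sentence.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

preprocess("A brown fox jumped on the lazy dog")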
In [3]:
# Dummy example data
vocabulary_size = 1000
sentences = ["A brown fox jumped on the lazy dog",
"A brown fox jumped on the brown duck",
"A brown fox jumped on the lazy elephant",
"An elephant is eating green grass near the alpaca",
"A green alpaca tried to jump over an elephant",
"May you rest in a deep and dreamless slumber"]
df = pd.DataFrame(sentences, columns=['sentences'])
df
Out[3]:
We want to represent our text using a Vector Space Model, meaning that each sentence should be encoded as a continuous vector of numbers, such that the semantic similarity between sentences is computable using mathematical distance metrics (e.g. Euclidean distance, Manhattan distance, cosine similarity).
Notice that for more complex techniques your sentence can be encoded as a matrix (for example if each word is embedded as a vector), or more generally as a tensor (an N-dimensional array).
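For example, with word embeddings each sentence becomes a matrix of shape (number of words, embedding size). A minimal sketch with Gensim (assuming the 4.x API; our toy corpus is of course far too small to learn meaningful embeddings) could be:
In [ ]:
from gensim.models import Word2Vec

# train a toy Word2Vec model on the tokenized sentences (Gensim >= 4.0 API)
tokenized = [s.lower().split() for s in sentences]
w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, seed=0)

# each sentence is now a (num_words, 50) matrix of word vectors
sentence_matrix = np.array([w2v.wv[w] for w in tokenized[0]])
sentence_matrix.shape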
Common ways to vectorize your sentences are based on word counts. Each sentence is represented by a vector of length N, where N is the size of your vocabulary. Each element of the vector is then associated with a word (or N-gram), and has a value that depends on the technique used for the vectorization.
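As a quick illustration of the plain count-based variant (the cells below use tf-idf weighting instead), a sketch with CountVectorizer might look like:
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

# plain bag of words: each entry is the raw count of a word in the sentence
count_vec = CountVectorizer(analyzer="word", stop_words="english")
counts = count_vec.fit_transform(df['sentences'].values)
counts.toarray()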
In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
In [5]:
# This class accepts functions for preprocessing and tokenization,
# so you can operate your data cleaning directly at this point
vectorizer = TfidfVectorizer(analyzer="word", max_features=vocabulary_size, stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(df['sentences'].values)
In [6]:
X.shape
Out[6]:
In [7]:
vectorizer.get_feature_names()
Out[7]:
In [67]:
X[4].toarray()
Out[67]:
More advanced techniques might be able to capture a deeper semantic meaning of sentences
You can test how similar/close sentences are, or directly rank all your data in one round. A common measure used for this case, and in NLP tasks in general, is cosine similarity: it considers the cosine of the angle between two vectors, which is 1 for an angle of 0 degrees and lower otherwise.
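As a minimal sketch of the underlying formula, the cosine similarity of two vectors is their dot product divided by the product of their norms; computed here with plain NumPy on the first two tf-idf vectors:
In [ ]:
# cosine similarity from its definition: cos(a, b) = a . b / (||a|| * ||b||)
a = X[0].toarray().ravel()
b = X[1].toarray().ravel()
np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))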
In [6]:
from sklearn.metrics.pairwise import cosine_similarity
In [74]:
# Retrieve the sentence of our example dataset most similar to some test sentences
test_sentences = ["How to put an alpaca to sleep", "How to achieve a deep sleep"]
test_data = vectorizer.transform(test_sentences)
for y in test_data:
    res = cosine_similarity(y, X)
    index = np.argsort(res[0])[-1]
    print(sentences[index])
Reducing the dimensionality of your data can reduce the computation time of the next steps in the pipeline, as well as improve your overall results. More data doesn't always imply better outcomes.
An initial analysis can consider factors that are indicative of a feature's importance (e.g. its variance, or its fraction of missing values).
A second step can rely on correlation measures: highly correlated features can be manually merged into a single one.
Using machine learning models to get insight into feature importance is another good method, as it leverages the power already embedded in the models themselves. A good candidate is the Random Forest, especially given how easily its results can be interpreted (see the sketch below).
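A minimal sketch of this idea, with completely made-up labels just to illustrate the API (a real use case would need an actual supervised signal); feature_importances_ then ranks the tf-idf features:
In [ ]:
from sklearn.ensemble import RandomForestClassifier

# hypothetical labels, one per sentence, only to make the model fit
fake_labels = [0, 0, 0, 1, 1, 2]
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, fake_labels)

# show the 5 most important features according to the forest
importances = rf.feature_importances_
top = np.argsort(importances)[::-1][:5]
list(np.array(vectorizer.get_feature_names())[top])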
A different approach measures the quality/performance of the results obtained with different sets of features. This can be operated in two ways: backward feature elimination (remove one feature at each step) and forward feature construction (add one feature at each step); a sketch follows.
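A minimal sketch of the backward-style variant using scikit-learn's RFE (recursive feature elimination), again with the hypothetical labels from the previous sketch, since these selectors also need a supervised signal:
In [ ]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# recursively drop the weakest features until only 5 remain
selector = RFE(LogisticRegression(), n_features_to_select=5)
selector.fit(X, fake_labels)
list(np.array(vectorizer.get_feature_names())[selector.support_])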
A multitude of other, more complex and self-contained techniques can be used, such as matrix-factorization methods (e.g. the SVD shown later in this notebook).
Now we can simply treat our sentences as vectors and rely on the preferred clustering technique. As already mentioned, clustering is a type of unsupervised learning task.
It is common for clustering algorithms to consist of two steps: initialization and refinement. The clustering algorithm will start from the initial (possibly random) clustering state and iteratively refine the clusters based on a criterion function, which also defines the stopping criteria.
Some possible clustering approaches are:
K-Means: fast, partitional-type, can be noisy.
You specify how many clusters you want to generate, and the algorithm will iteratively process the data trying to define clusters that minimize inner-cluster difference and maximize the inter-cluster one.
Given that the algorithm randomly initializes the cluster centroids, results can be non-optimal. To improve results it is common to run the algorithm several times and keep only the clustering which gives the best result (lowest value of the cost function). This process is already embedded in the Sklearn implementation (the n_init parameter).
There is no single right number of clusters; the choice should be based on the context, requirements and possibly visual/expert-knowledge judgment. A more theoretical approach consists in monitoring the cost function for different numbers of clusters and spotting the "elbow" of the cost curve.
In [68]:
from sklearn.cluster import KMeans
In [71]:
# Specify number of clusters and fit the data
num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(X)
Out[71]:
In [72]:
# Predict/retrieve the cluster ID of our data
df['Cluster'] = km.predict(X)
df
Out[72]:
This method consists in visualizing how some clustering metrics evolve based on the number of clusters used. Metrics like the percentage of variance explained, or the mean intra-cluster distance, can be plotted over a range of values. Where the improvement clearly slows down (the elbow), it might indicate a good value to use.
In [ ]:
from scipy.spatial.distance import cdist
centroids = []
dists = []
data = X.toarray()  # cdist needs a dense array
num_clusters = np.arange(1, min(30, data.shape[0] + 1))  # k cannot exceed the number of samples
for k in num_clusters:
    km = KMeans(n_clusters=k)
    km.fit(data)
    centroids.append(km.cluster_centers_)
    # mean distance of each point to its closest centroid
    dist = np.min(cdist(data, km.cluster_centers_, 'euclidean'), axis=1)
    dist_avg = sum(dist) / len(data)
    dists.append(dist_avg)
In [ ]:
import matplotlib.pyplot as plt
sns.pointplot(x=num_clusters, y=dists)
plt.show()
SVD: "factorizes a matrix into one matrix with orthogonal columns and one with orthogonal rows (along with a diagonal matrix, which contains the relative importance of each factor)."
In [9]:
X = X.todense()  # get our data matrix in dense representation
In [11]:
X.shape
Out[11]:
In [18]:
vocab = np.array(vectorizer.get_feature_names())
In [15]:
U, s, Vh = scipy.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vh.shape)
In [21]:
num_top_words = 3

def show_topics(a):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]
In [22]:
show_topics(Vh[:3])
Out[22]:
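As a sanity check, the factorization reconstructs X, and keeping only the top k singular components gives a low-dimensional representation of the sentences that could be fed back to KMeans in place of the full tf-idf matrix. A minimal sketch (k=3 is an arbitrary choice here):
In [ ]:
# the factorization reconstructs X: X ≈ U @ diag(s) @ Vh
print(np.allclose(X, U @ np.diag(s) @ Vh))

# keep only the top k components as a reduced representation of the sentences
k = 3
X_reduced = U[:, :k] * s[:k]   # shape: (num_sentences, k)
X_reduced.shape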
In [ ]: