This notebook is a work in progress regarding text clustering task. Overall, the libraries used are Numpy, NLTK, Seaborn, Sklearn and Gensim.

Feel free to report any feedback by creating a issue on Github, or contacting me directly.

Clustering is about categorizing/organizing/labelling objects such as to maximize the similarity between objects in one cluster/group (inner class similarity) while maximazing the dissimilarity between different clusters (inter class similarity).

Clustering is an example of an unsupervised learning algorithm.

In the following cells I will explore clustering related to text/sentences. In such context similarity should target the semantic and pragmatic meaning of the text: sentences with the same or closely similar meaning should fall into the same category.

```
In [12]:
```import itertools
import nltk
import csv
import scipy
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_context("notebook", font_scale=1.5)
%matplotlib inline

We start with clean sentences as data to feed to the pipeline. Different preprocessing steps might be needed or can be used to possibly improve final results; these include:

- HTML or other markup tags removal [BeautifulSoup]
- Remove possible non relevant content (e.g. URLs, numbers, standard strings)
- Remove stopwords
- Lemmization and stemming [NLTK]

```
In [3]:
```# Dummy example data
vocabulary_size = 1000
sentences = ["A brown fox jumped on the lazy dog",
"A brown fox jumped on the brown duck",
"A brown fox jumped on the lazy elephant",
"An elephant is eating green grass near the alpaca",
"A green alpaca tried to jump over an elephant",
"May you rest in a deep and dreamless slumber"]
df = pd.DataFrame(sentences, columns=['sentences'])
df

```
Out[3]:
```

We want to represent our text using a Vector Space Model, meaning that each sentence should be encoded as a continuous vector of numbers, such that semantic similarity between sentences are computable using mathematical distance metrics (e.g. Euclidean distance, Manhattan distance, cosine).

Notice that for more complex techniques, your sentence can be encoded as a matrix (for example if each word is embedded as a vector), or more generally as a tensor (a N-dimensional vector).

Common ways to vectorize your sentences are based on words count. Each sentence is represented by a vector of length N, where N is the size of your vocabulary. Each element of the vector is then associated with a word (or N-gram), and has a value that depends on the technique used for the vectorization.

- binarize
- count
- tf-idf (term frequency * inverse term frequency)
- hashing

```
In [4]:
```from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

```
In [5]:
```# This class accepts functions for preprocessing and tokenization,
# so you can operate your data cleaning directly at this point
vectorizer =TfidfVectorizer(analyzer="word", max_features=vocabulary_size, stop_words="english", ngram_range=(1,2))
X = vectorizer.fit_transform(df['sentences'].values)

```
In [6]:
```X.shape

```
Out[6]:
```

```
In [7]:
```vectorizer.get_feature_names()

```
Out[7]:
```

```
In [67]:
```X[4].toarray()

```
Out[67]:
```

More advanced techniques might be able to capture a deeper semantic meaning of sentences

- WordToVec
- Doc2Vec
- ..

```
In [6]:
```from sklearn.metrics.pairwise import cosine_similarity

```
In [74]:
```# Retrieve the sentence of our example dataset most similar to some test sentences
test_sentences = ["How to put an alpaca to sleep", "How to achieve a deep sleep"]
test_data = vectorizer.transform(test_sentences)
for y in test_data:
res = cosine_similarity(y, X)
index = np.argsort(res[0])[-1]
print(sentences[index])

```
```

Reducing the dimensionality of your data can reduce the computation time for next steps in the pipeline, as well as improving your overall results. More data doesn't always imply better outcomes.

An initial analysis can be operated considering factors that are indicative of a feature importance, these include:

- missing values ratio (remove if percentage of missing values is higher than a predefined threshold)
- low variance (remove if variance is lower than a predefined threshold). Given that variance is range dependent, normalized is required.

A second step can rely on **correlation measures**, for which similar features are manually reduced to a single one.

Using **machine learning models** to get insight about features importance is another good method that leverages the power already embedded in the models themselves. A good model is the Random Forest one, especially given how the results are easily interpretable. [??others]

A different approach sees the measurements of results quality/performances using **different set of features**. This approach can be operated in two way: backward feature elimination (remove one at each step) and forward feature construction (add one at each step).

A multitude of other techniques can be used, more complex and self-contained:

- Principal Component Analysis (PCA)
- Latent Dirichlet allocation (LDA)
- Latent semantic analysis/indexing (LSA)
- chi-squared
- ...

Now we can simply treat our sentences as vectors and rely on the preferred clustering technique. As already mentioned, clustering is a type of unsupervised learning task.

It is common for clustering algorithms to consist of two steps: initialization and refinement. The clustering algorithm used will start from the initial (possibly random) clustering state and iteratively refine our clusters, based on an a criterion function, which also defines the stopping criteria.

Some possible clustering approaches are:

- partitional (k-means)
- hierarchical (agglomerative): build a tree representation of the data
- density-based (DBSCAN)
- spectral
- mixtures

Fast, partitional-type, noisy

You specify how many clusters you want to generate, and the algorithm will iteratively process the data trying to define clusters that minimize inner-cluster difference and maximize the inter-cluster one.

Given that the algorithm randomly initialize the clusters centroids, results can be non-optimal. To improve results it is common to run the algorithm a different number of times, and pick only the clustering which gives the best results (lowest value for the cost function). This process is already embedded in Sklearn implementation.

There is no right number of clusters, the choice should be based on the context, requirements and possible visual/expert-knowledge judgment. A more theoretical approach consists in monitoring the cost function with different numbers of clusters, and spot the "elbow" of the cost curve.

```
In [68]:
```from sklearn.cluster import KMeans

```
In [71]:
```# Specify number of clusters and fit the data
num_clusters = 3
km = KMeans(num_clusters=num_clusters)
km.fit(X)

```
Out[71]:
```

```
In [72]:
```# Predict/retrieve the cluster ID of our data
df['Cluster'] = km.predict(X)
df

```
Out[72]:
```

```
In [ ]:
```from scipy.spatial.distance import cdist, pdist
centroids = []
dists = []
data = X
num_clusters = np.arange(1, 30)
for k in num_clusters:
km = KMeans(n_clusters=k)
km.fit(data)
centroids.append(km.cluster_centers_)
dist = np.min(cdist(data, km.cluster_centers_, 'euclidean'), axis=1)
dist_avg = sum(dist)/len(data)
dists.append(dist_avg)

```
In [ ]:
```sns.pointplot(x=num_clusters, y=dists)
sns.plt.show()

```
In [9]:
```X = X.todense() #get our data matrix in dense representation

```
In [11]:
```X.shape

```
Out[11]:
```

```
In [18]:
```vocab = np.array(vectorizer.get_feature_names())

```
In [15]:
```U, s, Vh = scipy.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vh.shape)

```
```

```
In [21]:
```num_top_words=3
def show_topics(a):
top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
topic_words = ([top_words(t) for t in a])
return [' '.join(t) for t in topic_words]

```
In [22]:
```show_topics(Vh[:3])

```
Out[22]:
```

```
In [ ]:
```