Text mining - Clustering

Machine Learning types:

  • Supervised learning (labeled data),
  • Unsupervised learning (unlabeled data),
  • Semi-supervised learning (somewhere in between).

In this notebook we:

  • Scrape all quotes (keeping both the full set and the first page only),
  • Vectorize the quotes with a TF-IDF vectorizer (see the formula sketch after this list),
    • TF: term frequency = how frequently a term appears in the target observation (quote),
    • IDF: inverse document frequency = how unique that term is to the selected observation (quote).
  • Use the vectorized quotes for clustering with:
    • k-means clustering: an unsupervised learning method that computes distances between vectors and groups quotes that are "close" to each other under some similarity metric (e.g. Euclidean distance). The number of clusters is fixed in advance.
    • hierarchical (agglomerative) clustering: starts with each quote as its own cluster (bottom-up approach) and repeatedly merges the most similar clusters until a single cluster covers the whole input. The largest jump in merge distance suggests the number of clusters.
  • Tokenize the quotes (splitting on whitespace for simplicity) and train word vectors to retrieve similar words (word2vec trains a shallow neural network and is sometimes described as a semi-supervised approach).
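
For reference, scikit-learn's TfidfVectorizer (with its defaults smooth_idf=True and L2 row normalization) computes, in sketch form:

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \left( \ln \frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right)

where n is the number of documents (quotes), tf(t, d) is the raw count of term t in document d, and df(t) is the number of documents containing t; each row vector is then scaled to unit length.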

In [9]:
import time
import requests
import numpy as np
import pandas as pd
from itertools import chain
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

from textblob import TextBlob
from gensim.models import word2vec
from scipy.cluster.hierarchy import ward, dendrogram

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

Scraping


In [2]:
def get_quotes(url):
    # Download the page and parse it with BeautifulSoup.
    page = BeautifulSoup(requests.get(url).content, "html.parser")
    # Each quote lives in a <span class="text"> element.
    quotes = [i.get_text() for i in page.find_all("span", class_="text")]
    time.sleep(3)  # be polite to the server between requests
    return quotes

# First page only.
quotes = get_quotes("http://quotes.toscrape.com/")

# All 10 pages, flattened into a single list so it can be iterated more than once.
urls = ["http://quotes.toscrape.com/page/" + str(i) + "/" for i in range(1, 11)]
quotes_all = list(chain.from_iterable(get_quotes(i) for i in urls))
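
As a quick sanity check (added here, not part of the original run), the site serves 10 quotes per page across the 10 pages scraped:

print(len(quotes), len(quotes_all))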

TF-IDF vectorization


In [3]:
tfidf_vectorizer = TfidfVectorizer()
# Rows are quotes, columns are vocabulary terms (sparse matrix).
tfidf_matrix = tfidf_vectorizer.fit_transform(quotes)
print(tfidf_matrix.shape)


(10, 97)

In [4]:
features = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
data = tfidf_matrix.toarray()
tfidf_df = pd.DataFrame(data, columns=features)
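
As a quick sanity check (a snippet added here, not part of the original run), the highest-weighted terms of a single quote can be read straight off the DataFrame:

# Top 5 TF-IDF terms for the first quote.
print(tfidf_df.iloc[0].sort_values(ascending=False).head(5))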

K-Means clustering


In [5]:
k = 5
k5 = KMeans(n_clusters=k)
k5.fit(tfidf_matrix)
clusters = k5.labels_.tolist()  # cluster label assigned to each quote

# Pair each quote with its cluster label.
my_dict = {'quotes': quotes, 'cluster': clusters}
df = pd.DataFrame(my_dict)
print(df)
df.cluster.value_counts()


   cluster                                             quotes
0        3  “The world as we have created it is a process ...
1        3  “It is our choices, Harry, that show what we t...
2        2  “There are only two ways to live your life. On...
3        2  “The person, be it gentleman or lady, who has ...
4        2  “Imperfection is beauty, madness is genius and...
5        0  “Try not to become a man of success. Rather be...
6        2  “It is better to be hated for what you are tha...
7        4  “I have not failed. I've just found 10,000 way...
8        1  “A woman is like a tea bag; you never know how...
9        1  “A day without sunshine is like, you know, nig...
Out[5]:
2    4
3    2
1    2
4    1
0    1
Name: cluster, dtype: int64
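
Here k = 5 was fixed up front. A common sanity check for that choice (a sketch added here, not part of the original notebook) is the elbow method: fit K-Means for a range of k and look for the bend in the inertia curve.

# Elbow method sketch: inertia = within-cluster sum of squares.
inertias = [KMeans(n_clusters=i).fit(tfidf_matrix).inertia_ for i in range(2, 9)]
plt.plot(range(2, 9), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()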

Important terms according to K-Means


In [6]:
# Sort each cluster centroid's term weights in descending order; the first
# columns are then the indices of the most important terms per cluster.
important_terms = k5.cluster_centers_.argsort()[:, ::-1]

# vocabulary_ maps term -> column index; build reverse-lookup lists.
key_list = list(tfidf_vectorizer.vocabulary_.keys())
val_list = list(tfidf_vectorizer.vocabulary_.values())

# Print the top 5 terms for each cluster.
for i in range(k):
    for j in important_terms[i, :5]:
        print("Cluster: ", i, key_list[val_list.index(j)])


Cluster:  0 become
Cluster:  0 man
Cluster:  0 of
Cluster:  0 success
Cluster:  0 value
Cluster:  1 know
Cluster:  1 like
Cluster:  1 you
Cluster:  1 is
Cluster:  1 sunshine
Cluster:  2 be
Cluster:  2 is
Cluster:  2 to
Cluster:  2 absolutely
Cluster:  2 are
Cluster:  3 our
Cluster:  3 thinking
Cluster:  3 we
Cluster:  3 it
Cluster:  3 more
Cluster:  4 000
Cluster:  4 failed
Cluster:  4 found
Cluster:  4 ve
Cluster:  4 just
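
Several of these "important" terms are plain stop words ("is", "to", "be"), because TfidfVectorizer keeps them by default. A variant worth trying (an addition, not in the original run) drops English stop words before weighting:

# Hypothetical variant: exclude English stop words from the vocabulary.
tfidf_vectorizer_sw = TfidfVectorizer(stop_words="english")
tfidf_matrix_sw = tfidf_vectorizer_sw.fit_transform(quotes)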

Hierarchical (Agglomerative) clustering


In [10]:
# Cosine distance between quote vectors (0 = identical, 1 = orthogonal).
dist = 1 - cosine_similarity(tfidf_matrix)
# ward() expects a condensed distance matrix, so convert the square matrix first.
from scipy.spatial.distance import squareform
linkage_matrix = ward(squareform(dist, checks=False))

plt.subplots(figsize=(15, 20))
dendrogram(linkage_matrix, orientation="right", labels=quotes)

plt.savefig('clusters.png')
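
The dendrogram only draws the merge tree; to get flat cluster labels comparable to K-Means (a sketch added here, assuming the same k = 5 as above), scipy's fcluster can cut the tree:

from scipy.cluster.hierarchy import fcluster

# Cut the linkage tree into at most 5 flat clusters.
hier_labels = fcluster(linkage_matrix, t=5, criterion="maxclust")
print(hier_labels)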

Gensim - Word2Vec


In [11]:
# Tokenize by whitespace; punctuation stays attached to words (see tokens like 'new.”' below).
tokenized_sentences = [sentence.split() for sentence in quotes_all]
model = word2vec.Word2Vec(tokenized_sentences, min_count=1)  # min_count=1 keeps every token
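
TextBlob is imported at the top but never used; as a sketch of a less crude alternative to split() (an addition, not part of the original run), its tokenizer also strips punctuation, so tokens like 'new.”' become 'new':

# Hypothetical alternative tokenization with TextBlob.
tokenized_tb = [list(TextBlob(sentence).words) for sentence in quotes_all]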

In [12]:
w1 = "world"
w2 = "man"
w3 = w1

In [13]:
# Cosine similarity between the two word vectors, then the 10 nearest neighbours of w3.
print(model.wv.similarity(w1, w2))
print("\n")
model.wv.most_similar(w3)


0.151384


Out[13]:
[('changing', 0.27232587337493896),
 ('new.”', 0.25895264744758606),
 ('been.”', 0.25502169132232666),
 ('things', 0.251057505607605),
 ('die,', 0.24658332765102386),
 ('seconds', 0.2448711395263672),
 ('suit', 0.24468350410461426),
 ('“Good', 0.24021926522254944),
 ('that,', 0.2382466197013855),
 ('six', 0.23802979290485382)]
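
Note how low these similarity scores are: roughly a hundred short quotes is far too little text for word2vec to learn reliable embeddings, so the nearest neighbours above are largely noise and the example is illustrative only.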