K-means Clustering in scikit-learn

This example uses a dataset downloaded from https://www.opensubtitles.org/en/search/vip, with the raw data at opus.lingfil.uu.se/OpenSubtitles2016/raw/en. Metadata such as title, actor, and director was scraped from IMDb and is not guaranteed to be complete. The example uses the 5000 most recent movies. A full archive (1.1 GB) is also available.

The code does the following:

  1. Counts words
  2. Builds a TFIDF-weighted vocabulary
  3. Applies the TFIDF weights to the word counts to create a sparse matrix
  4. Runs K-means clustering on the sparse matrix
  5. Prints top words for each cluster using the largest features in the cluster centroid

Be sure to install the following:

  1. pip3 install scikit-learn
  2. pip3 install pandas
  3. pip3 install scipy

In [1]:
import pandas as pd
import sys

# Show the Python version used for this run
sys.version

'3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:04:38) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'


In [2]:
import tempfile
import zipfile
import os.path

zipFile = "./openSubtitles-5000.json.zip"

print( "Unarchiving ...")
temp_dir = tempfile.mkdtemp()
# Extract the archive into the temporary directory
with zipfile.ZipFile(zipFile, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

openSubtitlesFile = os.path.join(temp_dir, "openSubtitles-5000.json")
print ("file unarchived to:" + openSubtitlesFile)

Unarchiving ...
file unarchived to:/var/folders/k1/ywpsl_ld2fj1bn5vp9bbgsr40000gn/T/tmp155tiu8f/openSubtitles-5000.json

Tokenizing and Filtering a Vocabulary

In [31]:
import json
from sklearn.feature_extraction.text import CountVectorizer
#from log_progress import log_progress

maxDocsToload = 50000

titles = []
def make_corpus(file):
    with open(file) as f:
        for i, line in enumerate(f):
            doc = json.loads(line)
            #if 'Sci-Fi' not in doc.get('Genre',''):
            #    continue
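            if i % 100 == 0:
                print ("%d " % i, end='')
            # Assumption: the metadata stores the movie title under the 'Title' key
            titles.append(doc.get('Title', ''))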
            yield doc.get('Text','')
            if i == maxDocsToload:
                break  # stop once the document limit is reached

print ("Starting load ...")
textGenerator = make_corpus(openSubtitlesFile)              
count_vectorizer = CountVectorizer(min_df=2, max_df=0.75, ngram_range=(1,2), max_features=50000,
                                   stop_words='english', analyzer="word", token_pattern="[a-zA-Z]{3,}")
term_freq_matrix = count_vectorizer.fit_transform(textGenerator)
print ("Done.")
print ( "term_freq_matrix shape = %s" % (term_freq_matrix.shape,) )
print ("term_freq_matrix = \n%s" % term_freq_matrix)

Starting load ...
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 Done.
term_freq_matrix shape = (5000, 50000)
term_freq_matrix = 

Feature Vocabulary

In [32]:
print( "Vocabulary length = ", len(count_vectorizer.vocabulary_))
word = "data"
wordIndex = count_vectorizer.vocabulary_[word]
print( "token index for \"%s\" = %d" % (word,wordIndex))
# scikit-learn >= 1.0; older versions use get_feature_names(), which was removed in 1.2
feature_names = count_vectorizer.get_feature_names_out()
print( "feature_names[%d] = %s" % (wordIndex, feature_names[wordIndex]))

Vocabulary length =  50000
token index for "data" = 8419
feature_names[8419] = data

In [33]:
for i in range(0,1000):
    print( "feature_names[%d] = %s" % (i, feature_names[i]))

TFIDF Weighting

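This applies the TFIDF weights to the term-count matrix. Roughly speaking,

tfidf value = word count / number of documents the word is in

so words that occur in many documents are down-weighted. By default, scikit-learn's TfidfTransformer uses a smoothed logarithmic form of this ratio: tfidf(t, d) = count(t, d) * (ln((1 + n) / (1 + df(t))) + 1), where n is the number of documents and df(t) is the number of documents containing term t.

The document vectors are also normalized so they have a Euclidean magnitude of 1.0.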
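
The weighting can be checked by hand on a small matrix. The sketch below uses made-up counts for three documents and two terms, recomputes the default smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, and compares the result with the transformer's output.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: 3 documents (rows) x 2 terms (columns).
counts = np.array([[3, 0],
                   [2, 1],
                   [1, 0]])

tfidf = TfidfTransformer(norm="l2")  # smooth_idf=True by default
weighted = tfidf.fit_transform(counts).toarray()

# Recompute by hand: scale counts by the smoothed IDF, then L2-normalize each row.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(weighted, manual))
print(np.linalg.norm(weighted, axis=1))
```

The first term appears in every document, so its IDF is exactly 1; the second appears in one document of three and receives a larger weight.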

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")

# Learn the IDF weights from the counts, then apply them
tf_idf_matrix = tfidf.fit_transform(term_freq_matrix)
print( tf_idf_matrix)

In [58]:
from sklearn.cluster import KMeans,MiniBatchKMeans
import numpy

num_clusters = 5
#km = KMeans(n_clusters=num_clusters, verbose=True, init='k-means++', n_init=3, n_jobs=-1)
km = MiniBatchKMeans(n_clusters=num_clusters, verbose=True, init='k-means++', n_init=25, batch_size=2000)
km.fit(tf_idf_matrix)  # cluster the TFIDF-weighted document vectors

clusters = km.labels_.tolist()
print ("cluster id for each document = %s" % clusters)

# for each centroid, sort the feature indices by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
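
The `argsort()[:, ::-1]` idiom is easier to follow on a tiny example. In this sketch the `terms` and `centers` arrays are made-up stand-ins for the vocabulary and for `km.cluster_centers_`.

```python
import numpy as np

# Hypothetical centroids: 2 clusters (rows) x 4 vocabulary terms (columns).
terms = np.array(["car", "king", "police", "queen"])
centers = np.array([[0.1, 0.7, 0.0, 0.5],
                    [0.6, 0.0, 0.8, 0.1]])

# argsort is ascending, so reversing each row puts the largest weights first.
order = centers.argsort()[:, ::-1]

# The first columns of each row index the heaviest terms of that cluster.
top2 = [terms[row[:2]].tolist() for row in order]
print(top2)  # [['king', 'queen'], ['police', 'car']]
```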

In [60]:
labels = pd.DataFrame(clusters, columns=['Cluster Labels'])
counts = pd.DataFrame(labels['Cluster Labels'].value_counts().sort_index())
counts.columns=['Document Count']
counts

   Document Count
0            1756
1             415
2            1209
3            1057
4             563

In [61]:
topNWords = 50

df = pd.DataFrame()

for i in range(num_clusters):
    # the first topNWords indices of each sorted centroid are its heaviest features
    clusterWords = [feature_names[ind] for ind in order_centroids[i, :topNWords]]
    df['Cluster %d' % i] = pd.Series(clusterWords)

df.style.set_properties(**{'text-align': 'right'})

    Cluster 0     Cluster 1     Cluster 2     Cluster 3     Cluster 4
 0  guys          fuck          sighs         mom           sir
 1  music         fucking       chuckles      dad           king
 2  laughs        shit          police        guys          father
 3  world         guy           phone         baby          men
 4  guy           gotta         door          guy           lord
 5  whoa          guys          guy           girl          mary
 6  shit          money         killed        school        brother
 7  huh           wanna         car           cause         majesty
 8  sighs         jesus         detective     party         queen
 9  hell          fucked        murder        house         mother
10  hello         dad           jane          don want      son
11  money         fuckin        case          sighs         kill
12  car           baby          agent         honey         shall
13  chuckles      ain           kill          mother        captain
14  grunting      phone         killer        family        war
15  joe           alright       dead          hmm           die
16  dad           ray           hell          danny         gods
17  gotta         huh           dad           kids          lady
18  baby          yeah yeah     house         happy         dead
19  team          girl          money         chuckles      child
20  ooh           cause         victim        wedding       death
21  door          music         guys          ooh           wife
22  real          marty         fbi           wow           francis
23  cause         listen        woman         tonight       heart
24  job           sighs         gun           huh           woman
25  sir           mom           ago           money         fight
26  grunts        asshole       went          hello         boy
27  today         job           mom           guess         highness
28  school        fuck fuck     sam           laughs        general
29  cool          ass           father        phone         family
30  bleep         sir           sir           fun           george
31  hmm           car           alex          hey hey       girl
32  house         bullshit      son           job           killed
33  playing       house         saw           mike          army
34  president     bitch         hmm           pretty        world
35  bit           laughs        indistinct    married       husband
36  wanna         real          family        yeah yeah     door
37  hulk          hey hey       henry         care          city
38  girl          door          blood         car           ship
39  family        kids          knew          having        sighs
40  listen        hell          laughs        hell          miss
41  kill          hmm           body          son           dear
42  game          dude          evidence      stuff         daughter
43  phil          brother       mother        today         prince
44  pretty        father        wife          music         speak
45  wow           johnny        job           listen        return
46  woman         kid           took          father        magic
47  guess         motherfucker  girl          cool          john
48  don want      don want      music         real          blood
49  fun           business      emma          friend        power

In [38]:
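# Pair each document's cluster label with its title.
# Assumes `titles` holds one title per document, in corpus order.
titlesFrame = pd.DataFrame({'Labels': km.labels_, 'Titles': titles})
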
sort = titlesFrame.sort_values(by=['Labels','Titles'])
for i in range(num_clusters):
    display( sort.query('Labels == %d' % i) )

The End ...

