Clustering with Hypertools

The cluster feature performs clustering analysis on the data (an array, dataframe, or list) and returns a list of cluster labels.

The default clustering method is K-Means (argument 'KMeans'). MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration, SpectralClustering, and HDBSCAN are also supported.

Note that if a list is passed, the arrays in it will be stacked and clustering will be performed across all of them together (not within each array separately).
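The stacking behavior described above can be sketched with NumPy and scikit-learn directly. This is a minimal illustration, not hypertools' own implementation: the arrays `a` and `b`, the choice of 2 clusters, and the final step that splits the labels back into per-array chunks are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical arrays with the same number of features (columns).
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, size=(5, 2))
b = rng.normal(loc=5.0, size=(7, 2))
data = [a, b]

# Stack every array into one (12, 2) matrix and cluster it once,
# so cluster labels are shared across all arrays in the list.
stacked = np.vstack(data)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(stacked)

# Optionally split the flat label vector back into one chunk per input array.
boundaries = np.cumsum([len(x) for x in data])[:-1]
labels_per_array = np.split(labels, boundaries)
```

Because the model is fit once on the stacked matrix, label 0 in the first array refers to the same cluster as label 0 in the second.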

Import Packages



In [ ]:

    
import hypertools as hyp
from collections import Counter

%matplotlib inline

Load your data

We will load one of the sample datasets. This dataset consists of 8,124 samples of mushrooms with various text features.



In [ ]:

    
geo = hyp.load('mushrooms')
mushrooms = geo.get_data()

We can peek at the first few rows of the dataframe using the pandas method head().



In [ ]:

    
mushrooms.head()

Obtain cluster labels

To obtain cluster labels, simply pass the data to hyp.cluster. Since we have not specified a desired number of clusters, the default of 3 clusters is used (labels 0, 1, and 2). Additionally, since we have not specified a desired clustering algorithm, K-Means is used by default.



In [ ]:

    
labels = hyp.cluster(mushrooms)
set(labels)

We can further examine the number of datapoints assigned each label.



In [ ]:

    
Counter(labels)
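Beyond raw counts, it can be useful to look at each cluster's share of the data. The sketch below uses a short, made-up `labels` list as a stand-in for the output of hyp.cluster, so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical labels standing in for the output of hyp.cluster.
labels = [0, 2, 1, 0, 0, 2, 1, 0]

counts = Counter(labels)
total = sum(counts.values())

# Fraction of datapoints in each cluster, ordered by label.
proportions = {label: counts[label] / total for label in sorted(counts)}

# most_common() sorts clusters from largest to smallest.
largest_label, largest_size = counts.most_common(1)[0]
```

A heavily skewed distribution (one cluster holding most of the points) can be a hint to try a different number of clusters or a different algorithm.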

Specify number of cluster labels

You can also specify the number of desired clusters by setting the n_clusters argument to an integer number of clusters, as below. We can see that when we pass the int 10 to n_clusters, 10 cluster labels are assigned.

Since we have not specified a desired clustering algorithm, K-Means is used by default.



In [ ]:

    
labels_10 = hyp.cluster(mushrooms, n_clusters = 10)
set(labels_10)

Different clustering models

You may prefer to use a clustering model other than K-Means. To do so, simply pass a string to the cluster argument specifying the desired clustering algorithm.

In this case, we specify only the clustering model (HDBSCAN). No number of clusters is passed, since HDBSCAN determines the number of clusters from the data.



In [ ]:

    
labels_HDBSCAN = hyp.cluster(mushrooms, cluster='HDBSCAN')
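One way to picture how a string can select the clustering model is a lookup table from names to scikit-learn estimators. The sketch below is an assumption for illustration, not hypertools' actual code: `MODELS` and `cluster_sketch` are hypothetical names, and only three of the supported models are included. HDBSCAN is left out because it does not accept an n_clusters parameter.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans

# Hypothetical lookup table; hypertools' real internals may differ.
MODELS = {
    "KMeans": KMeans,
    "AgglomerativeClustering": AgglomerativeClustering,
    "Birch": Birch,
}


def cluster_sketch(x, cluster="KMeans", n_clusters=3):
    """Fit the named scikit-learn model and return its labels as a list."""
    if cluster not in MODELS:
        raise ValueError(f"Unsupported clustering model: {cluster!r}")
    model = MODELS[cluster](n_clusters=n_clusters)
    return model.fit_predict(x).tolist()


# Random data used purely to exercise the function.
rng = np.random.default_rng(0)
x = rng.normal(size=(60, 4))
labels_km = cluster_sketch(x)
labels_agg = cluster_sketch(x, cluster="AgglomerativeClustering", n_clusters=5)
```

Keeping the name-to-model mapping in one dictionary means an unsupported string fails early with a clear error instead of deep inside the fitting code.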



In [ ]:

    
geo = hyp.plot(mushrooms, '.', hue=labels_10, title='K-means clustering')
geo = hyp.plot(mushrooms, '.', hue=labels_HDBSCAN, title='HDBSCAN clustering')