The cluster feature performs clustering analysis on the data (an array, dataframe, or list) and returns a list of cluster labels.
The default clustering method is K-Means (argument 'KMeans'); MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration, SpectralClustering, and HDBSCAN are also supported.
Note that, if a list of arrays is passed, the arrays will be stacked and clustering will be performed across all of them (not within each array).
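To make the stacking behavior concrete, here is a minimal sketch that reproduces the idea with NumPy and scikit-learn directly. It is an illustration of "stack, then cluster", not hypertools' actual implementation; the arrays `a` and `b` and the helper names are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical arrays; each holds points from two well-separated groups.
a = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
b = np.array([[1.0, 0.0], [10.0, 11.0], [11.0, 10.0]])

# Stack the list into one (6, 2) array and cluster all rows together.
stacked = np.vstack([a, b])
model = KMeans(n_clusters=2, n_init=10, random_state=0)
flat_labels = model.fit_predict(stacked).tolist()

# Split the flat labels back into one chunk per input array.
labels_a = flat_labels[:len(a)]
labels_b = flat_labels[len(a):]
print(labels_a, labels_b)
```

Because the clusters are fit on the stacked data, a given label means the same thing in both arrays: the points near the origin share one label whether they came from `a` or `b`.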
In [ ]:
import hypertools as hyp
from collections import Counter
%matplotlib inline
We will load one of the sample datasets. This dataset consists of 8,124 samples of mushrooms with various text features.
In [ ]:
geo = hyp.load('mushrooms')
mushrooms = geo.get_data()
We can peek at the first few rows of the dataframe using the pandas function head().
In [ ]:
mushrooms.head()
To obtain cluster labels, simply pass the data to hyp.cluster. Since we have not specified a desired number of clusters, the default of 3 clusters is used (labels 0, 1, and 2). Additionally, since we have not specified a desired clustering algorithm, K-Means is used by default.
In [ ]:
labels = hyp.cluster(mushrooms)
set(labels)
We can further examine the number of datapoints assigned to each label.
In [ ]:
Counter(labels)
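A Counter can also rank clusters by size and give the share of points in each. The sketch below uses a short, made-up label list (`example_labels`) in place of real hyp.cluster output, so the numbers are purely illustrative.

```python
from collections import Counter

# Hypothetical labels standing in for the output of hyp.cluster.
example_labels = [0, 2, 1, 0, 0, 2, 0, 0]

counts = Counter(example_labels)

# Clusters ordered from largest to smallest, as (label, count) pairs.
ranked = counts.most_common()

# Fraction of all points that falls in each cluster.
fractions = {label: n / len(example_labels) for label, n in counts.items()}

print(ranked)     # [(0, 5), (2, 2), (1, 1)]
print(fractions)  # {0: 0.625, 2: 0.25, 1: 0.125}
```

A very uneven split like this one can be a hint that the chosen number of clusters does not match the structure of the data.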
You can also specify the number of desired clusters by setting the n_clusters argument to an integer, as below. We can see that when we pass the int 10 to n_clusters, 10 cluster labels are assigned.
Since we have not specified a desired clustering algorithm, K-Means is used by default.
In [ ]:
labels_10 = hyp.cluster(mushrooms, n_clusters=10)
set(labels_10)
You may prefer to use a clustering model other than K-Means. To do so, simply pass a string to the cluster argument specifying the desired clustering algorithm.
In this case, we specify only the clustering model (HDBSCAN). No number of clusters is passed, because HDBSCAN determines the number of clusters from the data itself.
In [ ]:
labels_HDBSCAN = hyp.cluster(mushrooms, cluster='HDBSCAN')
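HDBSCAN marks points it treats as noise with the label -1, so the number of distinct labels is not necessarily the number of clusters found. The following sketch shows one way to separate the two counts; `example_hdbscan_labels` is a made-up list used in place of the real labels_HDBSCAN.

```python
from collections import Counter

# Hypothetical labels standing in for HDBSCAN output; -1 marks noise points.
example_hdbscan_labels = [0, 0, 1, -1, 1, 2, -1, 0]

hdbscan_counts = Counter(example_hdbscan_labels)

# Number of points flagged as noise (0 if the label -1 never appears).
n_noise = hdbscan_counts.get(-1, 0)

# Number of actual clusters, ignoring the noise label.
n_found = len(set(example_hdbscan_labels) - {-1})

print(n_found, n_noise)  # 3 2
```

The same two lines can be applied to labels_HDBSCAN to see how many clusters were found for the mushrooms data and how many samples were left unassigned.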
In [ ]:
geo = hyp.plot(mushrooms, '.', hue=labels_10, title='K-means clustering')
geo = hyp.plot(mushrooms, '.', hue=labels_HDBSCAN, title='HDBSCAN clustering')