K-Means falls in the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groupings such that observation in a given group is more similar to another observation in the same group than to another observation in a different group.
More about H2O K-means Clustering: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html
Using the cluster_size_constraints parameter, a user can set the minimum size of each cluster during the training by an array of numbers. The size of the array must be equal as the k parameter.
To satisfy the custom minimal cluster size, the calculation of clusters is converted to the Minimal Cost Flow problem. Instead of using the Lloyd iteration algorithm, a graph is constructed based on the distances and constraints. The goal is to go iteratively through the input edges and create an optimal spanning tree that satisfies the constraints.
More information about how to convert the standard K-means algorithm to the Minimal Cost Flow problem is described in this paper: https://pdfs.semanticscholar.org/ecad/eb93378d7911c2f7b9bd83a8af55d7fa9e06.pdf.
Minimum-cost flow problem can be efficiently solved in polynomial time. Currently, the performance of this implementation of Constrained K-means algorithm is slow due to many repeatable calculations which cannot be parallelized and more optimized at H2O backend.
Expected time with various sized data:
(OS debian 10.0 (x86-64), processor Intel© Core™ i7-7700HQ CPU @ 2.80GHz × 4, RAM 23.1 GiB)
To solve Constrained K-means in a shorter time, you can used the H2O Aggregator model to aggregate data to smaller size first and then pass these data to the Constrained K-means model to calculate the final centroids to be used with scoring. The results won't be as accurate as a result from a model with the whole dataset. However, it should help solve the problem of a huge datasets.
However, there are some assumptions:
The H2O Aggregator method is a clustering-based method for reducing a numerical/categorical dataset into a dataset with fewer rows. Aggregator maintains outliers as outliers but lumps together dense clusters into exemplars with an attached count column showing the member points.
More about H2O Aggregator: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/aggregator.html
In [2]:
# run h2o Kmeans
# Import h2o library
import h2o
from h2o.estimators import H2OKMeansEstimator
# init h2o cluster
h2o.init(strict_version_check=False, url="http://192.168.59.147:54321")
In [3]:
# import time to measure elapsed time
from timeit import default_timer as timer
from datetime import timedelta
import time
start = timer()
end = timer()
print("Time:", timedelta(seconds=end-start))
source: G. Karypis, "CLUTO A Clustering Toolkit," Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto. Karypis, George, Eui-Hong Han, and Vipin Kumar.
In [4]:
# load data
import pandas as pd
cluto = pd.read_csv("../../smalldata/cluto/cluto_t7_10k.csv", header=None)
cluto.columns = ["x", "y", "class"]
cluto.loc[cluto["class"] == "noise", "class"] = 9
cluto["class"] = cluto["class"].astype("category")
cluto
Out[4]:
In [5]:
import matplotlib.pyplot as plt
groups = cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Original Cluto dataset", fontsize=20)
ax.legend(numpoints=1)
Out[5]:
In [6]:
# load data to h2o
data_h2o_cluto = h2o.H2OFrame(cluto)
# run h2o Kmeans to estimate good start points
h2o_km_cluto = H2OKMeansEstimator(k=10, init="furthest", standardize=True)
start = timer()
h2o_km_cluto.train(x=["x", "y"], training_frame=data_h2o_cluto)
end = timer()
user_points = h2o.H2OFrame(h2o_km_cluto.centers())
# show details
h2o_km_cluto.show()
print("Time:", timedelta(seconds=end-start))
In [7]:
# run h2o constrained Kmeans
h2o_km_co_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[100, 200, 100, 200, 100, 100, 100, 100, 100, 100], standardize=True)
start = timer()
h2o_km_co_cluto.train(x=["x", "y"], training_frame=data_h2o_cluto)
end = timer()
# show details
h2o_km_co_cluto.show()
time_h2o_km_co_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_cluto)
In [8]:
from h2o.estimators.aggregator import H2OAggregatorEstimator
# original data size 10000, constraints [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
# aggregated data size 5000, constaints [50, 100, 50, 100, 50, 50, 50, 50, 50, 50]
params = {
"target_num_exemplars": 5000,
"rel_tol_num_exemplars": 0.5,
"categorical_encoding": "eigen"
}
agg = H2OAggregatorEstimator(**params)
start = timer()
agg.train(x=["x","y","class"], training_frame=data_h2o_cluto)
data_agg_12_cluto = agg.aggregated_frame
# run h2o Kmeans
h2o_km_co_agg_12_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[50, 100, 50, 100, 50, 50, 50, 50, 50, 50], standardize=True)
h2o_km_co_agg_12_cluto.train(x=["x", "y"],training_frame=data_agg_12_cluto)
end = timer()
# show details
h2o_km_co_agg_12_cluto.show()
time_h2o_km_co_agg_12_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_12_cluto)
In [9]:
# original data size 10000, constraints [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
# aggregated data size 2500, constaints [50, 100, 50, 100, 50, 50, 50, 50, 50, 50]
params = {
"target_num_exemplars": 2500,
"rel_tol_num_exemplars": 0.5,
"categorical_encoding": "eigen"
}
agg_14 = H2OAggregatorEstimator(**params)
start = timer()
agg_14.train(x=["x","y","class"], training_frame=data_h2o_cluto)
data_agg_14_cluto = agg_14.aggregated_frame
# run h2o Kmeans
h2o_km_co_agg_14_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[25, 50, 25, 50, 25, 25, 25, 25, 25, 25], standardize=True)
h2o_km_co_agg_14_cluto.train(x=["x","y"],training_frame=data_agg_14_cluto)
end = timer()
# show details
h2o_km_co_agg_14_cluto.show()
time_h2o_km_co_agg_14_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_14_cluto)
In [10]:
groups = cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Original Cluto dataset", fontsize=20)
ax.legend(numpoints=1)
Out[10]:
In [11]:
data_agg_df_12_cluto = data_agg_12_cluto.as_data_frame()
data_agg_df_12_cluto["class"] = data_agg_df_12_cluto["class"].astype("category")
groups = data_agg_df_12_cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Aggregated (1/2 size) Cluto Dataset", fontsize=20)
ax.legend(numpoints=1)
Out[11]:
In [12]:
data_agg_df_14_cluto = data_agg_14_cluto.as_data_frame()
data_agg_df_14_cluto["class"] = data_agg_df_14_cluto["class"].astype("category")
groups = data_agg_df_14_cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Aggregated (1/4 size) Cluto Dataset", fontsize=20)
ax.legend(numpoints=1)
Out[12]:
In [13]:
cluto["km_pred"] = h2o_km_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
groups = cluto.groupby('km_pred')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Predictions of standard K-means", fontsize=20)
ax.legend(numpoints=1)
Out[13]:
In [14]:
cluto["km_co_pred"] = h2o_km_co_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
groups = cluto.groupby('km_co_pred')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Predictions of Constrained K-means trained with whole Cluto Dataset", fontsize=20)
ax.legend(numpoints=1)
Out[14]:
In [15]:
cluto["km_co_pred_1/2"] = h2o_km_co_agg_12_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
groups = cluto.groupby('km_co_pred_1/2')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Predictions of Constrained K-means trained with aggregated (1/2 of size) Cluto Dataset", fontsize=20)
ax.legend(numpoints=1)
Out[15]:
In [16]:
cluto["km_co_pred_1/4"] = h2o_km_co_agg_14_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
groups = cluto.groupby('km_co_pred_1/4')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Predictions of Constrained K-means trained with aggregated (1/4 of size) Cluto Dataset", fontsize=20)
ax.legend(numpoints=1)
Out[16]:
In [17]:
centers_km_co_cluto = pd.DataFrame(h2o_km_co_cluto.centers())
centers_km_co_cluto["algo"] = "km_co"
centers_km_co_agg_12_cluto = pd.DataFrame(h2o_km_co_agg_12_cluto.centers())
centers_km_co_agg_12_cluto["algo"] = "km_co_agg_1/2"
centers_km_co_agg_14_cluto = pd.DataFrame(h2o_km_co_agg_14_cluto.centers())
centers_km_co_agg_14_cluto["algo"] = "km_co_agg_1/4"
centers_all_cluto = pd.concat([centers_km_co_cluto, centers_km_co_agg_12_cluto, centers_km_co_agg_14_cluto])
centers_all_cluto
Out[17]:
In [18]:
groups = centers_all_cluto.groupby('algo')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group[0], group[1], marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Centroids of Constrained K-means algos", fontsize=20)
ax.legend(numpoints=1)
Out[18]: