K-Means falls in the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groups such that each observation in a given group is more similar to the other observations in its own group than to observations in different groups.
More about H2O K-means Clustering: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html
Using the cluster_size_constraints parameter, a user can set the minimum size of each cluster during training via an array of numbers. The length of the array must be equal to the k parameter.
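For example (a minimal sketch only; the full training runs appear in the cells below), a model with k=3 and minimum cluster sizes of 100, 200, and 100 would be configured as:

from h2o.estimators import H2OKMeansEstimator
# k=3, so cluster_size_constraints must contain exactly three minimum sizes
km = H2OKMeansEstimator(k=3, cluster_size_constraints=[100, 200, 100], standardize=True)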
To satisfy the custom minimal cluster sizes, the cluster calculation is converted into a Minimum Cost Flow problem. Instead of running the Lloyd iteration algorithm, a graph is constructed from the distances and constraints. The goal is to iterate over the input edges and build an optimal spanning tree that satisfies the constraints.
More information about how to convert the standard K-means algorithm into a Minimum Cost Flow problem is available in this paper: https://pdfs.semanticscholar.org/ecad/eb93378d7911c2f7b9bd83a8af55d7fa9e06.pdf.
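To illustrate the idea (a toy sketch, not H2O's implementation; it assumes numpy and networkx), one constrained assignment step can be phrased as a min-cost flow: every point supplies one unit of flow, every cluster node demands at least its minimum size, and a sink absorbs the remainder:

import numpy as np
import networkx as nx

def constrained_assignment(X, centers, min_sizes):
    n, k = len(X), len(centers)
    G = nx.DiGraph()
    for i in range(n):
        G.add_node(("pt", i), demand=-1)            # each point supplies one unit
    for c in range(k):
        G.add_node(("cl", c), demand=min_sizes[c])  # each cluster must absorb its minimum
    G.add_node("sink", demand=n - sum(min_sizes))   # the sink absorbs the rest
    for i in range(n):
        for c in range(k):
            # integer edge cost = scaled squared distance from point i to center c
            cost = int(1e4 * np.sum((np.asarray(X[i]) - np.asarray(centers[c])) ** 2))
            G.add_edge(("pt", i), ("cl", c), capacity=1, weight=cost)
    for c in range(k):
        G.add_edge(("cl", c), "sink", capacity=n, weight=0)
    flow = nx.min_cost_flow(G)
    return [next(c for c in range(k) if flow[("pt", i)][("cl", c)] == 1) for i in range(n)]

labels = constrained_assignment(np.random.rand(12, 2), np.random.rand(3, 2), [3, 4, 3])

Solving this flow assigns every point to exactly one cluster while guaranteeing that each cluster receives at least its minimum number of points.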
The minimum-cost flow problem can be solved efficiently in polynomial time. Currently, however, this implementation of the Constrained K-means algorithm is slow due to many repeated calculations that cannot be parallelized or further optimized on the H2O backend.
The training times below were measured on: Debian 10.0 (x86-64), Intel© Core™ i7-7700HQ CPU @ 2.80 GHz × 4, 23.1 GiB RAM.
To solve Constrained K-means in a shorter time, you can use the H2O Aggregator model to reduce the data to a smaller size first, and then pass this aggregated data to the Constrained K-means model to calculate the final centroids to be used with scoring. The results won't be as accurate as those from a model trained on the whole dataset, but it should make huge datasets tractable.
One caveat, however: the minimum cluster size constraints have to be scaled down in proportion to the aggregated data size, as the cells below do.
The H2O Aggregator method is a clustering-based method for reducing a numerical/categorical dataset into a dataset with fewer rows. Aggregator maintains outliers as outliers but lumps together dense clusters into exemplars with an attached count column showing the number of member points.
More about H2O Aggregator: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/aggregator.html
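As a minimal sketch of the call shape used later in this notebook (assuming data_h2o is the H2OFrame created below), the Aggregator reduces the frame and exposes the result via aggregated_frame:

from h2o.estimators.aggregator import H2OAggregatorEstimator
# reduce the frame to roughly half of its rows; the aggregated frame
# keeps the exemplars plus the attached count column
agg_sketch = H2OAggregatorEstimator(target_num_exemplars=2581, rel_tol_num_exemplars=0.01)
agg_sketch.train(training_frame=data_h2o)
print(agg_sketch.aggregated_frame.nrows, agg_sketch.aggregated_frame.columns)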
In [2]:
# import the h2o library and the K-means estimator
import h2o
from h2o.estimators import H2OKMeansEstimator
# init h2o cluster
h2o.init(strict_version_check=False, url="http://192.168.59.147:54321")
In [3]:
# load data
import pandas as pd
data = pd.read_csv("../../smalldata/chicago/chicagoAllWeather.csv")
data = data.iloc[:, [1, 2, 3, 4, 5]]  # keep month, day, year, maxTemp, meanTemp
print(data.shape)
data.head()
Out[3]:
In [4]:
# import utilities to measure elapsed time
from timeit import default_timer as timer
from datetime import timedelta
# sanity check of the timing pattern used below (start/end back to back, ~0 s)
start = timer()
end = timer()
print("Time:", timedelta(seconds=end-start))
In [5]:
data_h2o = h2o.H2OFrame(data)
# run h2o Kmeans to get good starting points
h2o_km = H2OKMeansEstimator(k=3, init="furthest", standardize=True)
start = timer()
h2o_km.train(training_frame=data_h2o)
end = timer()
# use the trained centroids as starting points for the constrained runs below
user_points = h2o.H2OFrame(h2o_km.centers())
# show details
h2o_km.show()
time_km = timedelta(seconds=end-start)
print("Time:", time_km)
In [6]:
# run h2o constrained Kmeans
h2o_km_co = H2OKMeansEstimator(k=3, user_points=user_points, cluster_size_constraints=[1000, 2000, 1000], standardize=True)
start = timer()
h2o_km_co.train(training_frame=data_h2o)
end = timer()
# show details
h2o_km_co.show()
time_km_co = timedelta(seconds=end-start)
print("Time:", time_km_co)
In [7]:
from h2o.estimators.aggregator import H2OAggregatorEstimator
# original data size 5162, constraints 1000, 2000, 1000
# aggregated data size ~ 2581, constraints 500, 1000, 500
params = {
"target_num_exemplars": 2581,
"rel_tol_num_exemplars": 0.01,
"categorical_encoding": "eigen"
}
agg = H2OAggregatorEstimator(**params)
start = timer()
agg.train(training_frame=data_h2o)
data_agg = agg.aggregated_frame
# run h2o Kmeans
h2o_km_co_agg = H2OKMeansEstimator(k=3, user_points=user_points, cluster_size_constraints=[500, 1000, 500], standardize=True)
h2o_km_co_agg.train(x=["month", "day", "year", "maxTemp", "meanTemp"],training_frame=data_agg)
end = timer()
# show details
h2o_km_co_agg.show()
time_km_co_12 = timedelta(seconds=end-start)
print("Time:", time_km_co_12)
In [8]:
from h2o.estimators.aggregator import H2OAggregatorEstimator
# original data size 5162, constraints 1000, 2000, 1000
# aggregated data size ~ 1290, constraints scaled down to 240, 480, 240
params = {
"target_num_exemplars": 1290,
"rel_tol_num_exemplars": 0.01,
"categorical_encoding": "eigen"
}
agg_14 = H2OAggregatorEstimator(**params)
start = timer()
agg_14.train(training_frame=data_h2o)
data_agg_14 = agg_14.aggregated_frame
# run h2o Kmeans
h2o_km_co_agg_14 = H2OKMeansEstimator(k=3, user_points=user_points, cluster_size_constraints=[240, 480, 240], standardize=True)
h2o_km_co_agg_14.train(x=list(range(5)), training_frame=data_agg_14)
end = timer()
# show details
h2o_km_co_agg_14.show()
time_km_co_14 = timedelta(seconds=end-start)
print("Time:", time_km_co_14)
In [9]:
centers_km_co = h2o_km_co.centers()
centers_km_co_agg_12 = h2o_km_co_agg.centers()
centers_km_co_agg_14 = h2o_km_co_agg_14.centers()
centers_all = pd.concat([
    pd.DataFrame(centers_km_co).sort_values(by=[0]),
    pd.DataFrame(centers_km_co_agg_12).sort_values(by=[0]),
    pd.DataFrame(centers_km_co_agg_14).sort_values(by=[0])
])
In [10]:
# difference between the full-data centers and the 1/2- and 1/4-aggregated centers
diff_first_cluster = pd.concat([centers_all.iloc[0,:] - centers_all.iloc[3,:], centers_all.iloc[0,:] - centers_all.iloc[6,:]], axis=1, ignore_index=True).transpose()
diff_first_cluster.index = ["1/2", "1/4"]
diff_first_cluster.style.bar(subset=[0,1,2,3,4], align='mid', color=['#d65f5f', '#5fba7d'], width=90)
Out[10]:
In [11]:
diff_second_cluster = pd.concat([centers_all.iloc[1,:] - centers_all.iloc[4,:], centers_all.iloc[1,:] - centers_all.iloc[7,:]], axis=1, ignore_index=True).transpose()
diff_second_cluster.index = ["1/2", "1/4"]
diff_second_cluster.style.bar(subset=[0,1,2,3,4], align='mid', color=['#d65f5f', '#5fba7d'], width=90)
Out[11]:
In [12]:
diff_third_cluster = pd.concat([centers_all.iloc[2,:] - centers_all.iloc[5,:], centers_all.iloc[2,:] - centers_all.iloc[8,:]], axis=1, ignore_index=True).transpose()
diff_third_cluster.index = ["1/2", "1/4"]
diff_third_cluster.style.bar(subset=[0,1,2,3,4], color=['#d65f5f', '#5fba7d'], align="mid", width=90)
Out[12]: