Constrained K-Means demo - Cluto dataset

H2O K-Means algorithm

K-Means belongs to the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structure in the data without using any labels or target values. It partitions a set of observations into separate groups such that an observation in a given group is more similar to the other observations in that group than to observations in a different group.

More about H2O K-means Clustering: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html
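
For orientation, the standard (unconstrained) Lloyd iteration can be sketched in a few lines of plain Python. This is an illustrative one-dimensional toy and not H2O's implementation; the function name lloyd_1d and the sample values are made up for the example.

```python
def lloyd_1d(points, centers, max_iter=10):
    """Toy Lloyd iteration on 1-D data (illustration only, not H2O code)."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda h: (p - centers[h]) ** 2)
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for h, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == h]
            new_centers.append(sum(members) / len(members) if members else old)
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return labels, centers


labels, centers = lloyd_1d([0, 1, 2, 10, 11, 12], centers=[0, 10])
print(labels, centers)
```

Nothing in this loop controls how many points end up in each cluster, which is the gap the constrained variant addresses.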

Constrained K-Means algorithm in H2O

Using the cluster_size_constraints parameter, a user can set the minimum size of each cluster during training by passing an array of numbers. The length of the array must equal the k parameter.
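
A small pre-flight check can catch unusable constraint arrays before a long training run starts. The sketch below is a hypothetical helper, not part of the H2O API. The length rule comes from the text above; the non-negativity and sum rules are assumptions that follow from the fact that a partition of n rows cannot give every cluster its minimum if the minimums add up to more than n. Whether H2O performs the same checks internally is not stated here.

```python
def check_size_constraints(constraints, k, n_rows):
    """Return a list of problems found; an empty list means the input looks usable."""
    problems = []
    if len(constraints) != k:
        problems.append(f"expected {k} values, got {len(constraints)}")
    if any(c < 0 for c in constraints):
        problems.append("minimum sizes must be non-negative")
    if sum(constraints) > n_rows:
        problems.append(
            f"minimum sizes sum to {sum(constraints)} but only {n_rows} rows exist"
        )
    return problems


# The constraints used later in this notebook: k=10 on 10 000 rows.
constraints = [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
print(check_size_constraints(constraints, k=10, n_rows=10000))
```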

To satisfy the custom minimum cluster sizes, the cluster assignment step is converted to a minimum-cost flow problem. Instead of using the Lloyd iteration algorithm, a graph is constructed based on the distances and constraints. The goal is to go iteratively through the input edges and create an optimal spanning tree that satisfies the constraints.

How to convert the standard K-means algorithm to the minimum-cost flow problem is described in this paper: https://pdfs.semanticscholar.org/ecad/eb93378d7911c2f7b9bd83a8af55d7fa9e06.pdf
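
To make the flow formulation concrete, here is a minimal sketch of a single constrained assignment step on one-dimensional integer data. It is not H2O's implementation. The function names, the source/sink layout, and the trick of putting a large negative cost on the "required" arcs (so that the minimum sizes are filled first) are choices made for this illustration; the solver is a plain successive-shortest-paths routine rather than the spanning-tree based method mentioned above.

```python
from collections import deque


def min_cost_max_flow(n_nodes, edges, source, sink):
    """Successive shortest paths; the queue-based Bellman-Ford copes with negative costs."""
    graph = [[] for _ in range(n_nodes)]
    for u, v, cap, cost in edges:
        # Each arc is [target, residual capacity, cost, index of the reverse arc].
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
    while True:
        dist = [float("inf")] * n_nodes
        prev = [None] * n_nodes
        queued = [False] * n_nodes
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            queued[u] = False
            for i, (v, cap, cost, _) in enumerate(graph[u]):
                if cap > 0 and dist[u] + cost < dist[v]:
                    dist[v], prev[v] = dist[u] + cost, (u, i)
                    if not queued[v]:
                        queue.append(v)
                        queued[v] = True
        if prev[sink] is None:
            return graph  # no augmenting path left
        # Push one unit along the cheapest path (every path crosses a capacity-1 arc).
        v = sink
        while v != source:
            u, i = prev[v]
            graph[u][i][1] -= 1
            graph[v][graph[u][i][3]][1] += 1
            v = u


def constrained_assign(points, centers, min_sizes):
    """Assign each point to one center so that every center gets its minimum size."""
    n, k = len(points), len(centers)
    source, sink = 0, n + k + 1
    cost = [[(p - c) ** 2 for c in centers] for p in points]
    big = 1 + sum(map(sum, cost))  # larger than any achievable total distance
    edges = []
    for i in range(n):
        edges.append((source, 1 + i, 1, 0))  # each point supplies one unit
        for h in range(k):
            edges.append((1 + i, 1 + n + h, 1, cost[i][h]))
    for h in range(k):
        edges.append((1 + n + h, sink, min_sizes[h], -big))  # required members
        edges.append((1 + n + h, sink, n, 0))                # optional members
    graph = min_cost_max_flow(n + k + 2, edges, source, sink)
    labels = []
    for i in range(n):
        for v, cap, _, _ in graph[1 + i]:
            if n < v <= n + k and cap == 0:  # saturated point -> cluster arc
                labels.append(v - n - 1)
    return labels


print(constrained_assign([0, 1, 2, 10], centers=[1, 10], min_sizes=[2, 2]))
print(constrained_assign([0, 1, 2, 10], centers=[1, 10], min_sizes=[0, 0]))
```

With minimum sizes of 2, the point at 2 is moved to the far center because it is the cheapest point to move; with no minimums the result is the ordinary nearest-center assignment.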

The minimum-cost flow problem can be solved efficiently in polynomial time. Currently, however, this implementation of the Constrained K-means algorithm is slow because of many repeated calculations that cannot be parallelized or further optimized in the H2O backend.

Expected training time for various data sizes:

  • 5 000 rows, 5 features ~ 0h 4m 3s
  • 10 000 rows, 5 features ~ 0h 9m 21s
  • 15 000 rows, 5 features ~ 0h 22m 25s
  • 20 000 rows, 5 features ~ 0h 39m 27s
  • 25 000 rows, 5 features ~ 1h 06m 8s
  • 30 000 rows, 5 features ~ 1h 26m 43s
  • 35 000 rows, 5 features ~ 1h 44m 7s
  • 40 000 rows, 5 features ~ 2h 13m 31s
  • 45 000 rows, 5 features ~ 2h 4m 29s
  • 50 000 rows, 5 features ~ 4h 4m 18s

(OS debian 10.0 (x86-64), processor Intel© Core™ i7-7700HQ CPU @ 2.80GHz × 4, RAM 23.1 GiB)
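
As a rough reading of these measurements, the growth rate can be estimated from the smallest and largest runs. The sketch below only re-uses the numbers listed above; the helper name to_seconds is made up for the example, and a two-point estimate like this is a rule of thumb, not a benchmark.

```python
import math

# Elapsed times copied from the list above (rows -> "Xh Ym Zs").
timings = {
    5000: "0h 4m 3s",
    10000: "0h 9m 21s",
    15000: "0h 22m 25s",
    20000: "0h 39m 27s",
    25000: "1h 06m 8s",
    30000: "1h 26m 43s",
    35000: "1h 44m 7s",
    40000: "2h 13m 31s",
    45000: "2h 4m 29s",
    50000: "4h 4m 18s",
}


def to_seconds(text):
    """Convert a string such as '1h 06m 8s' to seconds."""
    hours, minutes, seconds = (int(part[:-1]) for part in text.split())
    return 3600 * hours + 60 * minutes + seconds


seconds = {rows: to_seconds(t) for rows, t in timings.items()}
n_min, n_max = min(seconds), max(seconds)
# If time ~ rows^p, then p = log(t_max / t_min) / log(n_max / n_min).
exponent = math.log(seconds[n_max] / seconds[n_min]) / math.log(n_max / n_min)
print(f"time grows roughly like rows^{exponent:.2f}")
```

On these numbers the exponent comes out at roughly 1.8, so doubling the row count more than triples the run time. The individual measurements are noisy (the 45 000-row run is faster than the 40 000-row run), so the figure should be read as an order-of-magnitude guide.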

Shorter time using Aggregator Model

To solve Constrained K-means in a shorter time, you can use the H2O Aggregator model to reduce the data to a smaller size first and then pass the aggregated data to the Constrained K-means model to calculate the final centroids used for scoring. The results won't be as accurate as those from a model trained on the whole dataset, but this approach should help with huge datasets.

This approach comes with some assumptions and caveats:

  • the large dataset has to consist of many similar data points; if not, the insensitive aggregation can break the structure of the dataset
  • the resulting clustering may not meet the initial constraints exactly when scoring (this also applies to the Constrained K-means model itself: scoring uses only the resulting centroids and none of the constraints defined for training)
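
Because scoring ignores the constraints, it can be worth counting the predicted labels afterwards. The sketch below is a hypothetical helper working on a plain list of labels; in this notebook the list would come from the 'predict' column returned by model.predict(...). It assumes the labels are 0-based cluster indices in the same order as the constraint array; shift them first if that is not the case.

```python
from collections import Counter


def unmet_constraints(predicted_labels, min_sizes):
    """Map cluster index -> (actual size, required minimum) for each violated constraint."""
    sizes = Counter(predicted_labels)
    return {
        h: (sizes.get(h, 0), required)
        for h, required in enumerate(min_sizes)
        if sizes.get(h, 0) < required
    }


# Made-up labels standing in for the 'predict' column of a scored frame.
labels = [0, 0, 0, 1, 2, 2]
print(unmet_constraints(labels, min_sizes=[2, 2, 2]))
```

An empty dictionary means every cluster reached its minimum size on the scored data.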

The H2O Aggregator method is a clustering-based method for reducing a numerical/categorical dataset into a dataset with fewer rows. Aggregator maintains outliers as outliers but lumps together dense clusters into exemplars with an attached count column showing the number of member points.

More about H2O Aggregator: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/aggregator.html
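
When training on aggregated data, the minimum sizes have to shrink along with the row count. One simple option, sketched below, is to scale each constraint by the ratio of aggregated to original rows and round down. The helper name scale_constraints is hypothetical, and proportional scaling is an assumption of this sketch rather than a rule imposed by H2O. The Aggregator may return a different number of rows than the requested target, so in practice the actual row count of the aggregated frame is the safer input.

```python
def scale_constraints(constraints, n_original, n_aggregated):
    """Scale each minimum size by n_aggregated / n_original, rounding down."""
    # Integer arithmetic avoids floating-point surprises when rounding down.
    return [c * n_aggregated // n_original for c in constraints]


original = [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
print(scale_constraints(original, n_original=10000, n_aggregated=5000))
print(scale_constraints(original, n_original=10000, n_aggregated=2500))
```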


In [2]:
# set up H2O for K-means

# Import h2o library
import h2o
from h2o.estimators import H2OKMeansEstimator

# init h2o cluster
h2o.init(strict_version_check=False, url="http://192.168.59.147:54321")


versionFromGradle='3.29.0',projectVersion='3.29.0.99999',branch='maurever_PUBDEV-6447_constrained_kmeans_improvement',lastCommitHash='162ceb18eae8b773028f27b284129c3ef752d001',gitDescribe='jenkins-master-4952-11-g162ceb18ea-dirty',compiledOn='2020-02-20 15:01:59',compiledBy='mori'
Checking whether there is an H2O instance running at http://192.168.59.147:54321 . connected.
H2O cluster uptime: 12 secs
H2O cluster timezone: Europe/Berlin
H2O data parsing timezone: UTC
H2O cluster version: 3.29.0.99999
H2O cluster version age: 1 hour and 17 minutes
H2O cluster name: mori
H2O cluster total nodes: 1
H2O cluster free memory: 4.821 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: locked, healthy
H2O connection url: http://192.168.59.147:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.7.3 candidate

In [3]:
# import timer helpers to measure elapsed time
from timeit import default_timer as timer
from datetime import timedelta

start = timer()
end = timer()
print("Time:", timedelta(seconds=end-start))


Time: 0:00:00.000010

Data - Cluto-t7.10k

source: G. Karypis, "CLUTO A Clustering Toolkit," Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto. Karypis, George, Eui-Hong Han, and Vipin Kumar.

  • 10 000 rows
  • 3 features (x, y, class {0,1,2,3,4,5,6,7,8,noise})

In [4]:
# load data
import pandas as pd
cluto = pd.read_csv("../../smalldata/cluto/cluto_t7_10k.csv", header=None)
cluto.columns = ["x", "y", "class"]
cluto.loc[cluto["class"] == "noise", "class"] = 9
cluto["class"] = cluto["class"].astype("category")
cluto


Out[4]:
x y class
0 539.512024 411.975006 1
1 542.241028 147.626007 2
2 653.468994 370.727997 0
3 598.585999 284.882996 1
4 573.062988 294.562988 1
5 139.570007 401.381012 6
6 228.970001 281.992004 6
7 305.747009 94.350998 7
8 610.617004 167.190002 1
9 500.450012 118.780998 2
10 341.804993 361.755005 5
11 128.582993 81.598999 7
12 39.105000 381.265991 6
13 294.911987 421.480011 6
14 189.214005 322.730011 4
15 251.889008 69.384003 7
16 116.098999 296.382996 6
17 86.839996 307.898010 6
18 509.272003 429.165985 1
19 90.306999 73.486000 7
20 269.270996 163.341995 8
21 137.112000 289.688995 6
22 63.613998 406.865997 6
23 224.473007 92.762001 7
24 345.226013 306.391998 5
25 188.507996 282.093994 6
26 531.218994 468.524994 9
27 47.276001 338.063995 6
28 543.882019 319.738007 1
29 379.937988 97.189003 9
... ... ... ...
9970 560.979980 214.078995 2
9971 286.790985 252.617004 6
9972 284.575012 412.520996 6
9973 455.101990 382.811005 6
9974 530.895996 320.019012 1
9975 189.263000 272.364990 6
9976 251.432007 154.587006 8
9977 325.653015 427.165009 6
9978 16.472000 39.292999 9
9979 51.148998 85.004997 7
9980 435.549988 145.395004 3
9981 176.287994 406.208008 6
9982 243.806000 97.797997 7
9983 545.987976 437.963989 1
9984 355.358002 373.989990 5
9985 288.807007 245.080994 6
9986 442.739014 153.699005 3
9987 72.005997 158.556000 8
9988 52.868999 389.467010 6
9989 55.768002 424.278992 9
9990 17.427000 102.032997 7
9991 36.368000 303.322998 6
9992 58.061001 111.898003 7
9993 9.794000 123.472000 9
9994 484.895996 201.332001 2
9995 451.783997 372.544006 6
9996 550.674988 327.447998 1
9997 474.742004 161.518005 3
9998 535.835022 375.765991 1
9999 234.878006 181.878006 8

10000 rows × 3 columns


In [5]:
import matplotlib.pyplot as plt
groups = cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Original Cluto dataset", fontsize=20)    
ax.legend(numpoints=1)


Out[5]:
<matplotlib.legend.Legend at 0x7fc5d22316d8>

In [6]:
# load data to h2o
data_h2o_cluto = h2o.H2OFrame(cluto)

# run h2o Kmeans to estimate good start points
h2o_km_cluto = H2OKMeansEstimator(k=10, init="furthest", standardize=True)

start = timer()
h2o_km_cluto.train(x=["x", "y"], training_frame=data_h2o_cluto)
end = timer()

user_points = h2o.H2OFrame(h2o_km_cluto.centers())

# show details
h2o_km_cluto.show()
print("Time:", timedelta(seconds=end-start))


Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Model Details
=============
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1582207404277_1


Model Summary: 
number_of_rows number_of_clusters number_of_categorical_columns number_of_iterations within_cluster_sum_of_squares total_sum_of_squares between_cluster_sum_of_squares
0 10000.0 10.0 0.0 10.0 1667.636656 19998.0 18330.363344

ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1666.5094477161151
Total Sum of Square Error to Grand Mean: 19998.000020727988
Between Cluster Sum of Square Error: 18331.490573011873

Centroid Statistics: 
centroid size within_cluster_sum_of_squares
0 1.0 1041.0 179.995729
1 2.0 973.0 157.996571
2 3.0 1171.0 208.030397
3 4.0 1008.0 112.034030
4 5.0 855.0 123.911769
5 6.0 1065.0 190.608740
6 7.0 981.0 152.060651
7 8.0 973.0 159.002221
8 9.0 899.0 145.266071
9 10.0 1034.0 237.603267
Scoring History: 
timestamp duration iterations number_of_reassigned_observations within_cluster_sum_of_squares
0 2020-02-20 15:03:38 0.058 sec 0.0 NaN NaN
1 2020-02-20 15:03:38 0.208 sec 1.0 10000.0 2603.218838
2 2020-02-20 15:03:38 0.244 sec 2.0 993.0 1865.508498
3 2020-02-20 15:03:38 0.254 sec 3.0 456.0 1772.786949
4 2020-02-20 15:03:38 0.267 sec 4.0 329.0 1725.575052
5 2020-02-20 15:03:38 0.278 sec 5.0 231.0 1699.701634
6 2020-02-20 15:03:38 0.289 sec 6.0 177.0 1687.171620
7 2020-02-20 15:03:38 0.302 sec 7.0 142.0 1678.999864
8 2020-02-20 15:03:38 0.309 sec 8.0 127.0 1673.706844
9 2020-02-20 15:03:38 0.314 sec 9.0 99.0 1669.989434
10 2020-02-20 15:03:38 0.319 sec 10.0 67.0 1667.636656
Time: 0:00:00.565814

In [7]:
# run h2o constrained Kmeans
h2o_km_co_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[100, 200, 100, 200, 100, 100, 100, 100, 100, 100], standardize=True)

start = timer()
h2o_km_co_cluto.train(x=["x", "y"], training_frame=data_h2o_cluto)
end = timer()

# show details
h2o_km_co_cluto.show()
time_h2o_km_co_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_cluto)


kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
=============
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1582207404277_2


Model Summary: 
number_of_rows number_of_clusters number_of_categorical_columns number_of_iterations within_cluster_sum_of_squares total_sum_of_squares between_cluster_sum_of_squares
0 10000.0 10.0 0.0 10.0 1664.966569 19998.0 18333.033431

ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1664.9665694840944
Total Sum of Square Error to Grand Mean: 19997.999999999996
Between Cluster Sum of Square Error: 18333.0334305159

Centroid Statistics: 
centroid size within_cluster_sum_of_squares
0 1.0 1011.0 170.119422
1 2.0 993.0 165.447394
2 3.0 1176.0 208.789514
3 4.0 997.0 109.136555
4 5.0 859.0 124.607615
5 6.0 1062.0 191.610032
6 7.0 966.0 145.764267
7 8.0 959.0 155.069414
8 9.0 906.0 147.223218
9 10.0 1071.0 247.199140
Scoring History: 
timestamp duration iterations number_of_reassigned_observations within_cluster_sum_of_squares
0 2020-02-20 15:03:39 0.011 sec 0.0 NaN NaN
1 2020-02-20 15:04:26 47.307 sec 1.0 10000.0 1666.509446
2 2020-02-20 15:05:00 1 min 21.174 sec 2.0 42.0 1666.002233
3 2020-02-20 15:05:40 2 min 1.024 sec 3.0 32.0 1665.601672
4 2020-02-20 15:06:17 2 min 38.436 sec 4.0 18.0 1665.330971
5 2020-02-20 15:06:50 3 min 11.585 sec 5.0 21.0 1665.222647
6 2020-02-20 15:07:28 3 min 49.345 sec 6.0 16.0 1665.083857
7 2020-02-20 15:08:06 4 min 27.729 sec 7.0 15.0 1665.013420
8 2020-02-20 15:08:44 5 min 4.867 sec 8.0 4.0 1664.971517
9 2020-02-20 15:09:21 5 min 42.196 sec 9.0 1.0 1664.967125
10 2020-02-20 15:09:56 6 min 17.489 sec 10.0 1.0 1664.966569
Time: 0:06:17.922825

In [8]:
from h2o.estimators.aggregator import H2OAggregatorEstimator

# original data size 10000, constraints [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
# aggregated data size 5000, constraints [50, 100, 50, 100, 50, 50, 50, 50, 50, 50]

params = {
    "target_num_exemplars": 5000,
    "rel_tol_num_exemplars": 0.5,
    "categorical_encoding": "eigen"
}
agg = H2OAggregatorEstimator(**params)

start = timer()
agg.train(x=["x","y","class"], training_frame=data_h2o_cluto)
data_agg_12_cluto = agg.aggregated_frame

# run h2o constrained Kmeans on the aggregated data
h2o_km_co_agg_12_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[50, 100, 50, 100, 50, 50, 50, 50, 50, 50], standardize=True)

h2o_km_co_agg_12_cluto.train(x=["x", "y"],training_frame=data_agg_12_cluto)
end = timer()

# show details
h2o_km_co_agg_12_cluto.show()
time_h2o_km_co_agg_12_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_12_cluto)


aggregator Model Build progress: |████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
=============
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1582207404277_4


Model Summary: 
number_of_rows number_of_clusters number_of_categorical_columns number_of_iterations within_cluster_sum_of_squares total_sum_of_squares between_cluster_sum_of_squares
0 4704.0 10.0 0.0 10.0 833.419474 9406.0 8572.580526

ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 833.4194743305884
Total Sum of Square Error to Grand Mean: 9406.000000000002
Between Cluster Sum of Square Error: 8572.580525669413

Centroid Statistics: 
centroid size within_cluster_sum_of_squares
0 1.0 495.0 89.102803
1 2.0 451.0 81.021073
2 3.0 549.0 103.322650
3 4.0 447.0 52.938713
4 5.0 403.0 65.250578
5 6.0 509.0 98.360575
6 7.0 458.0 76.074637
7 8.0 448.0 73.799758
8 9.0 441.0 76.886811
9 10.0 503.0 116.661877
Scoring History: 
timestamp duration iterations number_of_reassigned_observations within_cluster_sum_of_squares
0 2020-02-20 15:09:58 0.004 sec 0.0 NaN NaN
1 2020-02-20 15:10:12 14.465 sec 1.0 4704.0 836.396491
2 2020-02-20 15:10:24 26.487 sec 2.0 29.0 834.019564
3 2020-02-20 15:10:36 38.478 sec 3.0 18.0 833.666722
4 2020-02-20 15:10:48 50.340 sec 4.0 12.0 833.523871
5 2020-02-20 15:11:00 1 min 2.571 sec 5.0 4.0 833.471111
6 2020-02-20 15:11:13 1 min 14.848 sec 6.0 3.0 833.457722
7 2020-02-20 15:11:24 1 min 26.745 sec 7.0 6.0 833.448221
8 2020-02-20 15:11:37 1 min 39.728 sec 8.0 3.0 833.432040
9 2020-02-20 15:11:49 1 min 51.767 sec 9.0 2.0 833.424375
10 2020-02-20 15:12:01 2 min 3.394 sec 10.0 1.0 833.419474
Time: 0:02:05.151901

In [9]:
# original data size 10000, constraints [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
# aggregated data size 2500, constraints [25, 50, 25, 50, 25, 25, 25, 25, 25, 25]

params = {
    "target_num_exemplars": 2500,
    "rel_tol_num_exemplars": 0.5,
    "categorical_encoding": "eigen"
}
agg_14 = H2OAggregatorEstimator(**params)

start = timer()
agg_14.train(x=["x","y","class"], training_frame=data_h2o_cluto)
data_agg_14_cluto = agg_14.aggregated_frame

# run h2o constrained Kmeans on the aggregated data
h2o_km_co_agg_14_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[25, 50, 25, 50, 25, 25, 25, 25, 25, 25], standardize=True)

h2o_km_co_agg_14_cluto.train(x=["x","y"],training_frame=data_agg_14_cluto)
end = timer()

# show details
h2o_km_co_agg_14_cluto.show()
time_h2o_km_co_agg_14_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_14_cluto)


aggregator Model Build progress: |████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
=============
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1582207404277_6


Model Summary: 
number_of_rows number_of_clusters number_of_categorical_columns number_of_iterations within_cluster_sum_of_squares total_sum_of_squares between_cluster_sum_of_squares
0 1998.0 10.0 0.0 10.0 386.550663 3994.0 3607.449337

ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 386.55066275115064
Total Sum of Square Error to Grand Mean: 3993.999999999999
Between Cluster Sum of Square Error: 3607.4493372488487

Centroid Statistics: 
centroid size within_cluster_sum_of_squares
0 1.0 219.0 44.646723
1 2.0 197.0 39.017929
2 3.0 223.0 46.074045
3 4.0 184.0 25.204827
4 5.0 188.0 32.548179
5 6.0 215.0 46.961160
6 7.0 195.0 35.921999
7 8.0 200.0 35.803402
8 9.0 187.0 36.819424
9 10.0 190.0 43.552975
Scoring History: 
timestamp duration iterations number_of_reassigned_observations within_cluster_sum_of_squares
0 2020-02-20 15:12:02 0.001 sec 0.0 NaN NaN
1 2020-02-20 15:12:08 5.375 sec 1.0 1998.0 394.928571
2 2020-02-20 15:12:13 10.380 sec 2.0 31.0 388.292709
3 2020-02-20 15:12:18 15.378 sec 3.0 21.0 387.354860
4 2020-02-20 15:12:22 20.198 sec 4.0 10.0 387.051973
5 2020-02-20 15:12:27 24.858 sec 5.0 9.0 386.968308
6 2020-02-20 15:12:32 29.360 sec 6.0 8.0 386.847480
7 2020-02-20 15:12:36 34.053 sec 7.0 8.0 386.732143
8 2020-02-20 15:12:42 39.803 sec 8.0 5.0 386.664586
9 2020-02-20 15:12:47 44.387 sec 9.0 5.0 386.607298
10 2020-02-20 15:12:51 48.896 sec 10.0 4.0 386.550663
Time: 0:00:49.909734

In [10]:
groups = cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Original Cluto dataset", fontsize=20)
ax.legend(numpoints=1)


Out[10]:
<matplotlib.legend.Legend at 0x7fc5d5ffc198>

In [11]:
data_agg_df_12_cluto = data_agg_12_cluto.as_data_frame()
data_agg_df_12_cluto["class"] = data_agg_df_12_cluto["class"].astype("category")

groups = data_agg_df_12_cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Aggregated (1/2 size) Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)


Out[11]:
<matplotlib.legend.Legend at 0x7fc5cff2e4a8>

In [12]:
data_agg_df_14_cluto = data_agg_14_cluto.as_data_frame()
data_agg_df_14_cluto["class"] = data_agg_df_14_cluto["class"].astype("category")

groups = data_agg_df_14_cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Aggregated (1/4 size) Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)


Out[12]:
<matplotlib.legend.Legend at 0x7fc5cff2e320>

In [13]:
cluto["km_pred"] = h2o_km_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")

groups = cluto.groupby('km_pred')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Predictions of standard K-means", fontsize=20)  
ax.legend(numpoints=1)


kmeans prediction progress: |█████████████████████████████████████████████| 100%
Out[13]:
<matplotlib.legend.Legend at 0x7fc5d093fba8>

In [14]:
cluto["km_co_pred"] = h2o_km_co_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
groups = cluto.groupby('km_co_pred')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Predictions of Constrained K-means trained with whole Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)


kmeans prediction progress: |█████████████████████████████████████████████| 100%
Out[14]:
<matplotlib.legend.Legend at 0x7fc5cfe27390>

In [15]:
cluto["km_co_pred_1/2"] = h2o_km_co_agg_12_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")

groups = cluto.groupby('km_co_pred_1/2')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Predictions of Constrained K-means trained with aggregated (1/2 of size) Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)


kmeans prediction progress: |█████████████████████████████████████████████| 100%
Out[15]:
<matplotlib.legend.Legend at 0x7fc5cfd5dd30>

In [16]:
cluto["km_co_pred_1/4"] = h2o_km_co_agg_14_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")

groups = cluto.groupby('km_co_pred_1/4')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Predictions of Constrained K-means trained with aggregated (1/4 of size) Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)


kmeans prediction progress: |█████████████████████████████████████████████| 100%
Out[16]:
<matplotlib.legend.Legend at 0x7fc5cfc99518>

Difference between result centroids calculated based on all data and aggregated data


In [17]:
centers_km_co_cluto = pd.DataFrame(h2o_km_co_cluto.centers())
centers_km_co_cluto["algo"] =  "km_co"
centers_km_co_agg_12_cluto = pd.DataFrame(h2o_km_co_agg_12_cluto.centers())
centers_km_co_agg_12_cluto["algo"] =  "km_co_agg_1/2"
centers_km_co_agg_14_cluto = pd.DataFrame(h2o_km_co_agg_14_cluto.centers())
centers_km_co_agg_14_cluto["algo"] =  "km_co_agg_1/4"

centers_all_cluto = pd.concat([centers_km_co_cluto, centers_km_co_agg_12_cluto, centers_km_co_agg_14_cluto])
centers_all_cluto


Out[17]:
0 1 algo
0 573.728842 203.508799 km_co
1 133.744288 393.898371 km_co
2 100.095328 126.964439 km_co
3 585.305011 418.143112 km_co
4 345.861888 281.141212 km_co
5 266.990470 122.451603 km_co
6 578.007404 318.342757 km_co
7 355.965845 389.771451 km_co
8 128.568808 286.907431 km_co
9 521.531612 104.519515 km_co
0 568.775922 201.920099 km_co_agg_1/2
1 131.608865 396.226071 km_co_agg_1/2
2 104.380929 124.546519 km_co_agg_1/2
3 585.544331 417.416145 km_co_agg_1/2
4 340.504389 277.464477 km_co_agg_1/2
5 274.120831 118.697088 km_co_agg_1/2
6 574.578165 316.891841 km_co_agg_1/2
7 352.010968 391.491030 km_co_agg_1/2
8 122.792111 285.733267 km_co_agg_1/2
9 527.328463 98.201551 km_co_agg_1/2
0 561.261476 192.532877 km_co_agg_1/4
1 127.089594 396.637400 km_co_agg_1/4
2 104.014574 116.121202 km_co_agg_1/4
3 582.102163 417.127032 km_co_agg_1/4
4 333.999154 273.114326 km_co_agg_1/4
5 285.327334 116.059544 km_co_agg_1/4
6 571.867079 311.335805 km_co_agg_1/4
7 352.365579 394.340531 km_co_agg_1/4
8 118.349588 275.254412 km_co_agg_1/4
9 539.858625 86.908300 km_co_agg_1/4
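
To put a number on the difference, the shift of each centroid can be measured as the Euclidean distance between the full-data centroid and its aggregated counterpart. The sketch below copies the coordinates from the table above and assumes that rows with the same index refer to the same cluster, which seems reasonable here because all three models were started from the same user_points; the variable names are made up for the example.

```python
import math

# Centroids copied from the table above, in row order 0..9.
km_co = [(573.728842, 203.508799), (133.744288, 393.898371),
         (100.095328, 126.964439), (585.305011, 418.143112),
         (345.861888, 281.141212), (266.990470, 122.451603),
         (578.007404, 318.342757), (355.965845, 389.771451),
         (128.568808, 286.907431), (521.531612, 104.519515)]
agg_half = [(568.775922, 201.920099), (131.608865, 396.226071),
            (104.380929, 124.546519), (585.544331, 417.416145),
            (340.504389, 277.464477), (274.120831, 118.697088),
            (574.578165, 316.891841), (352.010968, 391.491030),
            (122.792111, 285.733267), (527.328463, 98.201551)]
agg_quarter = [(561.261476, 192.532877), (127.089594, 396.637400),
               (104.014574, 116.121202), (582.102163, 417.127032),
               (333.999154, 273.114326), (285.327334, 116.059544),
               (571.867079, 311.335805), (352.365579, 394.340531),
               (118.349588, 275.254412), (539.858625, 86.908300)]


def shifts(reference, other):
    """Euclidean distance between centroids that share the same row index."""
    return [math.hypot(rx - ox, ry - oy)
            for (rx, ry), (ox, oy) in zip(reference, other)]


mean_half = sum(shifts(km_co, agg_half)) / len(km_co)
mean_quarter = sum(shifts(km_co, agg_quarter)) / len(km_co)
print(f"mean shift, 1/2 aggregation: {mean_half:.2f}")
print(f"mean shift, 1/4 aggregation: {mean_quarter:.2f}")
```

The distances are in the original x/y units. On these values the centroids from the 1/4 aggregation move noticeably further from the full-data centroids than those from the 1/2 aggregation.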

In [18]:
groups = centers_all_cluto.groupby('algo')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group[0], group[1], marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Centroids of Constrained K-means algos", fontsize=20)  
ax.legend(numpoints=1)


Out[18]:
<matplotlib.legend.Legend at 0x7fc5cfc3b160>