Constrained K-Means demo - Cluto dataset

H2O K-Means algorithm

K-Means falls in the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groupings such that observation in a given group is more similar to another observation in the same group than to another observation in a different group.

More about H2O K-means Clustering: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html

Constrained K-Means algorithm in H2O

Using the cluster_size_constraints parameter, a user can set the minimum size of each cluster during the training by an array of numbers. The size of the array must be equal as the k parameter.

To satisfy the custom minimal cluster size, the calculation of clusters is converted to the Minimal Cost Flow problem. Instead of using the Lloyd iteration algorithm, a graph is constructed based on the distances and constraints. The goal is to go iteratively through the input edges and create an optimal spanning tree that satisfies the constraints.

More information about how to convert the standard K-means algorithm to the Minimal Cost Flow problem is described in this paper: https://pdfs.semanticscholar.org/ecad/eb93378d7911c2f7b9bd83a8af55d7fa9e06.pdf.

Minimum-cost flow problem can be efficiently solved in polynomial time. Currently, the performance of this implementation of Constrained K-means algorithm is slow due to many repeatable calculations which cannot be parallelized and more optimized at H2O backend.

Expected time with various sized data:

5 000 rows, 5 features ~ 0h 4m 3s
10 000 rows, 5 features ~ 0h 9m 21s
15 000 rows, 5 features ~ 0h 22m 25s
20 000 rows, 5 features ~ 0h 39m 27s
25 000 rows, 5 features ~ 1h 06m 8s
30 000 rows, 5 features ~ 1h 26m 43s
35 000 rows, 5 features ~ 1h 44m 7s
40 000 rows, 5 features ~ 2h 13m 31s
45 000 rows, 5 features ~ 2h 4m 29s
50 000 rows, 5 features ~ 4h 4m 18s

(OS debian 10.0 (x86-64), processor Intel© Core™ i7-7700HQ CPU @ 2.80GHz × 4, RAM 23.1 GiB)

Shorter time using Aggregator Model

To solve Constrained K-means in a shorter time, you can used the H2O Aggregator model to aggregate data to smaller size first and then pass these data to the Constrained K-means model to calculate the final centroids to be used with scoring. The results won't be as accurate as a result from a model with the whole dataset. However, it should help solve the problem of a huge datasets.

However, there are some assumptions:

the large dataset has to consist of many similar data points - if not, the insensitive aggregation can break the structure of the dataset
the resulting clustering may not meet the initial constraints exactly when scoring (this also applies to Constrained K-means model, scoring use only result centroids to score and no constraints defined before)

The H2O Aggregator method is a clustering-based method for reducing a numerical/categorical dataset into a dataset with fewer rows. Aggregator maintains outliers as outliers but lumps together dense clusters into exemplars with an attached count column showing the member points.

More about H2O Aggregator: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/aggregator.html



In [2]:

    
# run h2o Kmeans

# Import h2o library
import h2o
from h2o.estimators import H2OKMeansEstimator

# init h2o cluster
h2o.init(strict_version_check=False, url="http://192.168.59.147:54321")









    



versionFromGradle='3.29.0',projectVersion='3.29.0.99999',branch='maurever_PUBDEV-6447_constrained_kmeans_improvement',lastCommitHash='162ceb18eae8b773028f27b284129c3ef752d001',gitDescribe='jenkins-master-4952-11-g162ceb18ea-dirty',compiledOn='2020-02-20 15:01:59',compiledBy='mori'
Checking whether there is an H2O instance running at http://192.168.59.147:54321 . connected.






    




H2O cluster uptime:
12 secs
H2O cluster timezone:
Europe/Berlin
H2O data parsing timezone:
UTC
H2O cluster version:
3.29.0.99999
H2O cluster version age:
1 hour and 17 minutes 
H2O cluster name:
mori
H2O cluster total nodes:
1
H2O cluster free memory:
4.821 Gb
H2O cluster total cores:
8
H2O cluster allowed cores:
8
H2O cluster status:
locked, healthy
H2O connection url:
http://192.168.59.147:54321
H2O connection proxy:
None
H2O internal security:
False
H2O API Extensions:
Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version:
3.7.3 candidate



In [3]:

    
# import time to measure elapsed time
from timeit import default_timer as timer
from datetime import timedelta
import time

start = timer()
end = timer()
print("Time:", timedelta(seconds=end-start))









    



Time: 0:00:00.000010

Data - Cluto-t7.10k

source: G. Karypis, "CLUTO A Clustering Toolkit," Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto. Karypis, George, Eui-Hong Han, and Vipin Kumar.

10 000 rows
3 features (x, y, class {0,1,2,3,4,5,6,7,8,noise})



In [4]:

    
# load data
import pandas as pd
cluto = pd.read_csv("../../smalldata/cluto/cluto_t7_10k.csv", header=None)
cluto.columns = ["x", "y", "class"]
cluto.loc[cluto["class"] == "noise", "class"] = 9
cluto["class"] = cluto["class"].astype("category")
cluto









    Out[4]:







  
    
      
      x
      y
      class
    
  
  
    
      0
      539.512024
      411.975006
      1
    
    
      1
      542.241028
      147.626007
      2
    
    
      2
      653.468994
      370.727997
      0
    
    
      3
      598.585999
      284.882996
      1
    
    
      4
      573.062988
      294.562988
      1
    
    
      5
      139.570007
      401.381012
      6
    
    
      6
      228.970001
      281.992004
      6
    
    
      7
      305.747009
      94.350998
      7
    
    
      8
      610.617004
      167.190002
      1
    
    
      9
      500.450012
      118.780998
      2
    
    
      10
      341.804993
      361.755005
      5
    
    
      11
      128.582993
      81.598999
      7
    
    
      12
      39.105000
      381.265991
      6
    
    
      13
      294.911987
      421.480011
      6
    
    
      14
      189.214005
      322.730011
      4
    
    
      15
      251.889008
      69.384003
      7
    
    
      16
      116.098999
      296.382996
      6
    
    
      17
      86.839996
      307.898010
      6
    
    
      18
      509.272003
      429.165985
      1
    
    
      19
      90.306999
      73.486000
      7
    
    
      20
      269.270996
      163.341995
      8
    
    
      21
      137.112000
      289.688995
      6
    
    
      22
      63.613998
      406.865997
      6
    
    
      23
      224.473007
      92.762001
      7
    
    
      24
      345.226013
      306.391998
      5
    
    
      25
      188.507996
      282.093994
      6
    
    
      26
      531.218994
      468.524994
      9
    
    
      27
      47.276001
      338.063995
      6
    
    
      28
      543.882019
      319.738007
      1
    
    
      29
      379.937988
      97.189003
      9
    
    
      ...
      ...
      ...
      ...
    
    
      9970
      560.979980
      214.078995
      2
    
    
      9971
      286.790985
      252.617004
      6
    
    
      9972
      284.575012
      412.520996
      6
    
    
      9973
      455.101990
      382.811005
      6
    
    
      9974
      530.895996
      320.019012
      1
    
    
      9975
      189.263000
      272.364990
      6
    
    
      9976
      251.432007
      154.587006
      8
    
    
      9977
      325.653015
      427.165009
      6
    
    
      9978
      16.472000
      39.292999
      9
    
    
      9979
      51.148998
      85.004997
      7
    
    
      9980
      435.549988
      145.395004
      3
    
    
      9981
      176.287994
      406.208008
      6
    
    
      9982
      243.806000
      97.797997
      7
    
    
      9983
      545.987976
      437.963989
      1
    
    
      9984
      355.358002
      373.989990
      5
    
    
      9985
      288.807007
      245.080994
      6
    
    
      9986
      442.739014
      153.699005
      3
    
    
      9987
      72.005997
      158.556000
      8
    
    
      9988
      52.868999
      389.467010
      6
    
    
      9989
      55.768002
      424.278992
      9
    
    
      9990
      17.427000
      102.032997
      7
    
    
      9991
      36.368000
      303.322998
      6
    
    
      9992
      58.061001
      111.898003
      7
    
    
      9993
      9.794000
      123.472000
      9
    
    
      9994
      484.895996
      201.332001
      2
    
    
      9995
      451.783997
      372.544006
      6
    
    
      9996
      550.674988
      327.447998
      1
    
    
      9997
      474.742004
      161.518005
      3
    
    
      9998
      535.835022
      375.765991
      1
    
    
      9999
      234.878006
      181.878006
      8
    
  

10000 rows × 3 columns



In [5]:

    
import matplotlib.pyplot as plt
groups = cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Original Cluto dataset", fontsize=20)    
ax.legend(numpoints=1)









    Out[5]:





<matplotlib.legend.Legend at 0x7fc5d22316d8>



In [6]:

    
# load data to h2o
data_h2o_cluto = h2o.H2OFrame(cluto)

# run h2o Kmeans to estimate good start points
h2o_km_cluto = H2OKMeansEstimator(k=10, init="furthest", standardize=True)

start = timer()
h2o_km_cluto.train(x=["x", "y"], training_frame=data_h2o_cluto)
end = timer()

user_points = h2o.H2OFrame(h2o_km_cluto.centers())

# show details
h2o_km_cluto.show()
print("Time:", timedelta(seconds=end-start))









    



Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Model Details
=============
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1582207404277_1


Model Summary: 






    







  
    
      
      
      number_of_rows
      number_of_clusters
      number_of_categorical_columns
      number_of_iterations
      within_cluster_sum_of_squares
      total_sum_of_squares
      between_cluster_sum_of_squares
    
  
  
    
      0
      
      10000.0
      10.0
      0.0
      10.0
      1667.636656
      19998.0
      18330.363344
    
  








    




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1666.5094477161151
Total Sum of Square Error to Grand Mean: 19998.000020727988
Between Cluster Sum of Square Error: 18331.490573011873

Centroid Statistics: 






    







  
    
      
      
      centroid
      size
      within_cluster_sum_of_squares
    
  
  
    
      0
      
      1.0
      1041.0
      179.995729
    
    
      1
      
      2.0
      973.0
      157.996571
    
    
      2
      
      3.0
      1171.0
      208.030397
    
    
      3
      
      4.0
      1008.0
      112.034030
    
    
      4
      
      5.0
      855.0
      123.911769
    
    
      5
      
      6.0
      1065.0
      190.608740
    
    
      6
      
      7.0
      981.0
      152.060651
    
    
      7
      
      8.0
      973.0
      159.002221
    
    
      8
      
      9.0
      899.0
      145.266071
    
    
      9
      
      10.0
      1034.0
      237.603267
    
  








    



Scoring History: 






    







  
    
      
      
      timestamp
      duration
      iterations
      number_of_reassigned_observations
      within_cluster_sum_of_squares
    
  
  
    
      0
      
      2020-02-20 15:03:38
      0.058 sec
      0.0
      NaN
      NaN
    
    
      1
      
      2020-02-20 15:03:38
      0.208 sec
      1.0
      10000.0
      2603.218838
    
    
      2
      
      2020-02-20 15:03:38
      0.244 sec
      2.0
      993.0
      1865.508498
    
    
      3
      
      2020-02-20 15:03:38
      0.254 sec
      3.0
      456.0
      1772.786949
    
    
      4
      
      2020-02-20 15:03:38
      0.267 sec
      4.0
      329.0
      1725.575052
    
    
      5
      
      2020-02-20 15:03:38
      0.278 sec
      5.0
      231.0
      1699.701634
    
    
      6
      
      2020-02-20 15:03:38
      0.289 sec
      6.0
      177.0
      1687.171620
    
    
      7
      
      2020-02-20 15:03:38
      0.302 sec
      7.0
      142.0
      1678.999864
    
    
      8
      
      2020-02-20 15:03:38
      0.309 sec
      8.0
      127.0
      1673.706844
    
    
      9
      
      2020-02-20 15:03:38
      0.314 sec
      9.0
      99.0
      1669.989434
    
    
      10
      
      2020-02-20 15:03:38
      0.319 sec
      10.0
      67.0
      1667.636656
    
  








    



Time: 0:00:00.565814



In [7]:

    
# run h2o constrained Kmeans
h2o_km_co_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[100, 200, 100, 200, 100, 100, 100, 100, 100, 100], standardize=True)

start = timer()
h2o_km_co_cluto.train(x=["x", "y"], training_frame=data_h2o_cluto)
end = timer()

# show details
h2o_km_co_cluto.show()
time_h2o_km_co_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_cluto)









    



kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
=============
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1582207404277_2


Model Summary: 






    







  
    
      
      
      number_of_rows
      number_of_clusters
      number_of_categorical_columns
      number_of_iterations
      within_cluster_sum_of_squares
      total_sum_of_squares
      between_cluster_sum_of_squares
    
  
  
    
      0
      
      10000.0
      10.0
      0.0
      10.0
      1664.966569
      19998.0
      18333.033431
    
  








    




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1664.9665694840944
Total Sum of Square Error to Grand Mean: 19997.999999999996
Between Cluster Sum of Square Error: 18333.0334305159

Centroid Statistics: 






    







  
    
      
      
      centroid
      size
      within_cluster_sum_of_squares
    
  
  
    
      0
      
      1.0
      1011.0
      170.119422
    
    
      1
      
      2.0
      993.0
      165.447394
    
    
      2
      
      3.0
      1176.0
      208.789514
    
    
      3
      
      4.0
      997.0
      109.136555
    
    
      4
      
      5.0
      859.0
      124.607615
    
    
      5
      
      6.0
      1062.0
      191.610032
    
    
      6
      
      7.0
      966.0
      145.764267
    
    
      7
      
      8.0
      959.0
      155.069414
    
    
      8
      
      9.0
      906.0
      147.223218
    
    
      9
      
      10.0
      1071.0
      247.199140
    
  








    



Scoring History: 






    







  
    
      
      
      timestamp
      duration
      iterations
      number_of_reassigned_observations
      within_cluster_sum_of_squares
    
  
  
    
      0
      
      2020-02-20 15:03:39
      0.011 sec
      0.0
      NaN
      NaN
    
    
      1
      
      2020-02-20 15:04:26
      47.307 sec
      1.0
      10000.0
      1666.509446
    
    
      2
      
      2020-02-20 15:05:00
      1 min 21.174 sec
      2.0
      42.0
      1666.002233
    
    
      3
      
      2020-02-20 15:05:40
      2 min  1.024 sec
      3.0
      32.0
      1665.601672
    
    
      4
      
      2020-02-20 15:06:17
      2 min 38.436 sec
      4.0
      18.0
      1665.330971
    
    
      5
      
      2020-02-20 15:06:50
      3 min 11.585 sec
      5.0
      21.0
      1665.222647
    
    
      6
      
      2020-02-20 15:07:28
      3 min 49.345 sec
      6.0
      16.0
      1665.083857
    
    
      7
      
      2020-02-20 15:08:06
      4 min 27.729 sec
      7.0
      15.0
      1665.013420
    
    
      8
      
      2020-02-20 15:08:44
      5 min  4.867 sec
      8.0
      4.0
      1664.971517
    
    
      9
      
      2020-02-20 15:09:21
      5 min 42.196 sec
      9.0
      1.0
      1664.967125
    
    
      10
      
      2020-02-20 15:09:56
      6 min 17.489 sec
      10.0
      1.0
      1664.966569
    
  








    



Time: 0:06:17.922825



In [8]:

    
from h2o.estimators.aggregator import H2OAggregatorEstimator

# original data size 10000, constraints [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
# aggregated data size 5000, constaints [50, 100, 50, 100, 50, 50, 50, 50, 50, 50]

params = {
    "target_num_exemplars": 5000,
    "rel_tol_num_exemplars": 0.5,
    "categorical_encoding": "eigen"
}
agg = H2OAggregatorEstimator(**params)

start = timer()
agg.train(x=["x","y","class"], training_frame=data_h2o_cluto)
data_agg_12_cluto = agg.aggregated_frame

# run h2o Kmeans
h2o_km_co_agg_12_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[50, 100, 50, 100, 50, 50, 50, 50, 50, 50], standardize=True)

h2o_km_co_agg_12_cluto.train(x=["x", "y"],training_frame=data_agg_12_cluto)
end = timer()

# show details
h2o_km_co_agg_12_cluto.show()
time_h2o_km_co_agg_12_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_12_cluto)









    



aggregator Model Build progress: |████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
=============
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1582207404277_4


Model Summary: 






    







  
    
      
      
      number_of_rows
      number_of_clusters
      number_of_categorical_columns
      number_of_iterations
      within_cluster_sum_of_squares
      total_sum_of_squares
      between_cluster_sum_of_squares
    
  
  
    
      0
      
      4704.0
      10.0
      0.0
      10.0
      833.419474
      9406.0
      8572.580526
    
  








    




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 833.4194743305884
Total Sum of Square Error to Grand Mean: 9406.000000000002
Between Cluster Sum of Square Error: 8572.580525669413

Centroid Statistics: 






    







  
    
      
      
      centroid
      size
      within_cluster_sum_of_squares
    
  
  
    
      0
      
      1.0
      495.0
      89.102803
    
    
      1
      
      2.0
      451.0
      81.021073
    
    
      2
      
      3.0
      549.0
      103.322650
    
    
      3
      
      4.0
      447.0
      52.938713
    
    
      4
      
      5.0
      403.0
      65.250578
    
    
      5
      
      6.0
      509.0
      98.360575
    
    
      6
      
      7.0
      458.0
      76.074637
    
    
      7
      
      8.0
      448.0
      73.799758
    
    
      8
      
      9.0
      441.0
      76.886811
    
    
      9
      
      10.0
      503.0
      116.661877
    
  








    



Scoring History: 






    







  
    
      
      
      timestamp
      duration
      iterations
      number_of_reassigned_observations
      within_cluster_sum_of_squares
    
  
  
    
      0
      
      2020-02-20 15:09:58
      0.004 sec
      0.0
      NaN
      NaN
    
    
      1
      
      2020-02-20 15:10:12
      14.465 sec
      1.0
      4704.0
      836.396491
    
    
      2
      
      2020-02-20 15:10:24
      26.487 sec
      2.0
      29.0
      834.019564
    
    
      3
      
      2020-02-20 15:10:36
      38.478 sec
      3.0
      18.0
      833.666722
    
    
      4
      
      2020-02-20 15:10:48
      50.340 sec
      4.0
      12.0
      833.523871
    
    
      5
      
      2020-02-20 15:11:00
      1 min  2.571 sec
      5.0
      4.0
      833.471111
    
    
      6
      
      2020-02-20 15:11:13
      1 min 14.848 sec
      6.0
      3.0
      833.457722
    
    
      7
      
      2020-02-20 15:11:24
      1 min 26.745 sec
      7.0
      6.0
      833.448221
    
    
      8
      
      2020-02-20 15:11:37
      1 min 39.728 sec
      8.0
      3.0
      833.432040
    
    
      9
      
      2020-02-20 15:11:49
      1 min 51.767 sec
      9.0
      2.0
      833.424375
    
    
      10
      
      2020-02-20 15:12:01
      2 min  3.394 sec
      10.0
      1.0
      833.419474
    
  








    



Time: 0:02:05.151901



In [9]:

    
# original data size 10000, constraints [100, 200, 100, 200, 100, 100, 100, 100, 100, 100]
# aggregated data size 2500, constaints [50, 100, 50, 100, 50, 50, 50, 50, 50, 50]

params = {
    "target_num_exemplars": 2500,
    "rel_tol_num_exemplars": 0.5,
    "categorical_encoding": "eigen"
}
agg_14 = H2OAggregatorEstimator(**params)

start = timer()
agg_14.train(x=["x","y","class"], training_frame=data_h2o_cluto)
data_agg_14_cluto = agg_14.aggregated_frame

# run h2o Kmeans
h2o_km_co_agg_14_cluto = H2OKMeansEstimator(k=10, user_points=user_points, cluster_size_constraints=[25, 50, 25, 50, 25, 25, 25, 25, 25, 25], standardize=True)

h2o_km_co_agg_14_cluto.train(x=["x","y"],training_frame=data_agg_14_cluto)
end = timer()

# show details
h2o_km_co_agg_14_cluto.show()
time_h2o_km_co_agg_14_cluto = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_14_cluto)









    



aggregator Model Build progress: |████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
=============
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1582207404277_6


Model Summary: 






    







  
    
      
      
      number_of_rows
      number_of_clusters
      number_of_categorical_columns
      number_of_iterations
      within_cluster_sum_of_squares
      total_sum_of_squares
      between_cluster_sum_of_squares
    
  
  
    
      0
      
      1998.0
      10.0
      0.0
      10.0
      386.550663
      3994.0
      3607.449337
    
  








    




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 386.55066275115064
Total Sum of Square Error to Grand Mean: 3993.999999999999
Between Cluster Sum of Square Error: 3607.4493372488487

Centroid Statistics: 






    







  
    
      
      
      centroid
      size
      within_cluster_sum_of_squares
    
  
  
    
      0
      
      1.0
      219.0
      44.646723
    
    
      1
      
      2.0
      197.0
      39.017929
    
    
      2
      
      3.0
      223.0
      46.074045
    
    
      3
      
      4.0
      184.0
      25.204827
    
    
      4
      
      5.0
      188.0
      32.548179
    
    
      5
      
      6.0
      215.0
      46.961160
    
    
      6
      
      7.0
      195.0
      35.921999
    
    
      7
      
      8.0
      200.0
      35.803402
    
    
      8
      
      9.0
      187.0
      36.819424
    
    
      9
      
      10.0
      190.0
      43.552975
    
  








    



Scoring History: 






    







  
    
      
      
      timestamp
      duration
      iterations
      number_of_reassigned_observations
      within_cluster_sum_of_squares
    
  
  
    
      0
      
      2020-02-20 15:12:02
      0.001 sec
      0.0
      NaN
      NaN
    
    
      1
      
      2020-02-20 15:12:08
      5.375 sec
      1.0
      1998.0
      394.928571
    
    
      2
      
      2020-02-20 15:12:13
      10.380 sec
      2.0
      31.0
      388.292709
    
    
      3
      
      2020-02-20 15:12:18
      15.378 sec
      3.0
      21.0
      387.354860
    
    
      4
      
      2020-02-20 15:12:22
      20.198 sec
      4.0
      10.0
      387.051973
    
    
      5
      
      2020-02-20 15:12:27
      24.858 sec
      5.0
      9.0
      386.968308
    
    
      6
      
      2020-02-20 15:12:32
      29.360 sec
      6.0
      8.0
      386.847480
    
    
      7
      
      2020-02-20 15:12:36
      34.053 sec
      7.0
      8.0
      386.732143
    
    
      8
      
      2020-02-20 15:12:42
      39.803 sec
      8.0
      5.0
      386.664586
    
    
      9
      
      2020-02-20 15:12:47
      44.387 sec
      9.0
      5.0
      386.607298
    
    
      10
      
      2020-02-20 15:12:51
      48.896 sec
      10.0
      4.0
      386.550663
    
  








    



Time: 0:00:49.909734



In [10]:

    
groups = cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Original Cluto dataset", fontsize=20)
ax.legend(numpoints=1)









    Out[10]:





<matplotlib.legend.Legend at 0x7fc5d5ffc198>



In [11]:

    
data_agg_df_12_cluto = data_agg_12_cluto.as_data_frame()
data_agg_df_12_cluto["class"] = data_agg_df_12_cluto["class"].astype("category")

groups = data_agg_df_12_cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Aggregated (1/2 size) Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)









    Out[11]:





<matplotlib.legend.Legend at 0x7fc5cff2e4a8>



In [12]:

    
data_agg_df_14_cluto = data_agg_14_cluto.as_data_frame()
data_agg_df_14_cluto["class"] = data_agg_df_14_cluto["class"].astype("category")

groups = data_agg_df_14_cluto.groupby('class')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Aggregated (1/4 size) Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)









    Out[12]:





<matplotlib.legend.Legend at 0x7fc5cff2e320>



In [13]:

    
cluto["km_pred"] = h2o_km_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")

groups = cluto.groupby('km_pred')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Predictions of standard K-means", fontsize=20)  
ax.legend(numpoints=1)









    



kmeans prediction progress: |█████████████████████████████████████████████| 100%






    Out[13]:





<matplotlib.legend.Legend at 0x7fc5d093fba8>



In [14]:

    
cluto["km_co_pred"] = h2o_km_co_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")
groups = cluto.groupby('km_co_pred')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Predictions of Constrained K-means trained with whole Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)









    



kmeans prediction progress: |█████████████████████████████████████████████| 100%






    Out[14]:





<matplotlib.legend.Legend at 0x7fc5cfe27390>



In [15]:

    
cluto["km_co_pred_1/2"] = h2o_km_co_agg_12_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")

groups = cluto.groupby('km_co_pred_1/2')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Predictions of Constrained K-means trained with aggregated (1/2 of size) Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)









    



kmeans prediction progress: |█████████████████████████████████████████████| 100%






    Out[15]:





<matplotlib.legend.Legend at 0x7fc5cfd5dd30>



In [16]:

    
cluto["km_co_pred_1/4"] = h2o_km_co_agg_14_cluto.predict(data_h2o_cluto).as_data_frame()['predict'].astype("category")

groups = cluto.groupby('km_co_pred_1/4')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Predictions of Constrained K-means trained with aggregated (1/4 of size) Cluto Dataset", fontsize=20)  
ax.legend(numpoints=1)









    



kmeans prediction progress: |█████████████████████████████████████████████| 100%






    Out[16]:





<matplotlib.legend.Legend at 0x7fc5cfc99518>

Difference between result centroids calculated based on all data and aggregated data



In [17]:

    
centers_km_co_cluto = pd.DataFrame(h2o_km_co_cluto.centers())
centers_km_co_cluto["algo"] =  "km_co"
centers_km_co_agg_12_cluto = pd.DataFrame(h2o_km_co_agg_12_cluto.centers())
centers_km_co_agg_12_cluto["algo"] =  "km_co_agg_1/2"
centers_km_co_agg_14_cluto = pd.DataFrame(h2o_km_co_agg_14_cluto.centers())
centers_km_co_agg_14_cluto["algo"] =  "km_co_agg_1/4"

centers_all_cluto = pd.concat([centers_km_co_cluto, centers_km_co_agg_12_cluto, centers_km_co_agg_14_cluto])
centers_all_cluto









    Out[17]:







  
    
      
      0
      1
      algo
    
  
  
    
      0
      573.728842
      203.508799
      km_co
    
    
      1
      133.744288
      393.898371
      km_co
    
    
      2
      100.095328
      126.964439
      km_co
    
    
      3
      585.305011
      418.143112
      km_co
    
    
      4
      345.861888
      281.141212
      km_co
    
    
      5
      266.990470
      122.451603
      km_co
    
    
      6
      578.007404
      318.342757
      km_co
    
    
      7
      355.965845
      389.771451
      km_co
    
    
      8
      128.568808
      286.907431
      km_co
    
    
      9
      521.531612
      104.519515
      km_co
    
    
      0
      568.775922
      201.920099
      km_co_agg_1/2
    
    
      1
      131.608865
      396.226071
      km_co_agg_1/2
    
    
      2
      104.380929
      124.546519
      km_co_agg_1/2
    
    
      3
      585.544331
      417.416145
      km_co_agg_1/2
    
    
      4
      340.504389
      277.464477
      km_co_agg_1/2
    
    
      5
      274.120831
      118.697088
      km_co_agg_1/2
    
    
      6
      574.578165
      316.891841
      km_co_agg_1/2
    
    
      7
      352.010968
      391.491030
      km_co_agg_1/2
    
    
      8
      122.792111
      285.733267
      km_co_agg_1/2
    
    
      9
      527.328463
      98.201551
      km_co_agg_1/2
    
    
      0
      561.261476
      192.532877
      km_co_agg_1/4
    
    
      1
      127.089594
      396.637400
      km_co_agg_1/4
    
    
      2
      104.014574
      116.121202
      km_co_agg_1/4
    
    
      3
      582.102163
      417.127032
      km_co_agg_1/4
    
    
      4
      333.999154
      273.114326
      km_co_agg_1/4
    
    
      5
      285.327334
      116.059544
      km_co_agg_1/4
    
    
      6
      571.867079
      311.335805
      km_co_agg_1/4
    
    
      7
      352.365579
      394.340531
      km_co_agg_1/4
    
    
      8
      118.349588
      275.254412
      km_co_agg_1/4
    
    
      9
      539.858625
      86.908300
      km_co_agg_1/4



In [18]:

    
groups = centers_all_cluto.groupby('algo')
fig, ax = plt.subplots(1,1,figsize=(20,15))


for name, group in groups:
    ax.plot(group[0], group[1], marker='o', linestyle='', ms=7, label=name)

fig.suptitle("Centroids of Constrained K-means algos", fontsize=20)  
ax.legend(numpoints=1)









    Out[18]:





<matplotlib.legend.Legend at 0x7fc5cfc3b160>

H2O cluster uptime:	12 secs
H2O cluster timezone:	Europe/Berlin
H2O data parsing timezone:	UTC
H2O cluster version:	3.29.0.99999
H2O cluster version age:	1 hour and 17 minutes
H2O cluster name:	mori
H2O cluster total nodes:	1
H2O cluster free memory:	4.821 Gb
H2O cluster total cores:	8
H2O cluster allowed cores:	8
H2O cluster status:	locked, healthy
H2O connection url:	http://192.168.59.147:54321
H2O connection proxy:	None
H2O internal security:	False
H2O API Extensions:	Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version:	3.7.3 candidate

	x	y	class
0	539.512024	411.975006	1
1	542.241028	147.626007	2
2	653.468994	370.727997	0
3	598.585999	284.882996	1
4	573.062988	294.562988	1
5	139.570007	401.381012	6
6	228.970001	281.992004	6
7	305.747009	94.350998	7
8	610.617004	167.190002	1
9	500.450012	118.780998	2
10	341.804993	361.755005	5
11	128.582993	81.598999	7
12	39.105000	381.265991	6
13	294.911987	421.480011	6
14	189.214005	322.730011	4
15	251.889008	69.384003	7
16	116.098999	296.382996	6
17	86.839996	307.898010	6
18	509.272003	429.165985	1
19	90.306999	73.486000	7
20	269.270996	163.341995	8
21	137.112000	289.688995	6
22	63.613998	406.865997	6
23	224.473007	92.762001	7
24	345.226013	306.391998	5
25	188.507996	282.093994	6
26	531.218994	468.524994	9
27	47.276001	338.063995	6
28	543.882019	319.738007	1
29	379.937988	97.189003	9
...	...	...	...
9970	560.979980	214.078995	2
9971	286.790985	252.617004	6
9972	284.575012	412.520996	6
9973	455.101990	382.811005	6
9974	530.895996	320.019012	1
9975	189.263000	272.364990	6
9976	251.432007	154.587006	8
9977	325.653015	427.165009	6
9978	16.472000	39.292999	9
9979	51.148998	85.004997	7
9980	435.549988	145.395004	3
9981	176.287994	406.208008	6
9982	243.806000	97.797997	7
9983	545.987976	437.963989	1
9984	355.358002	373.989990	5
9985	288.807007	245.080994	6
9986	442.739014	153.699005	3
9987	72.005997	158.556000	8
9988	52.868999	389.467010	6
9989	55.768002	424.278992	9
9990	17.427000	102.032997	7
9991	36.368000	303.322998	6
9992	58.061001	111.898003	7
9993	9.794000	123.472000	9
9994	484.895996	201.332001	2
9995	451.783997	372.544006	6
9996	550.674988	327.447998	1
9997	474.742004	161.518005	3
9998	535.835022	375.765991	1
9999	234.878006	181.878006	8

	centroid	size	within_cluster_sum_of_squares
0	1.0	1041.0	179.995729
1	2.0	973.0	157.996571
2	3.0	1171.0	208.030397
3	4.0	1008.0	112.034030
4	5.0	855.0	123.911769
5	6.0	1065.0	190.608740
6	7.0	981.0	152.060651
7	8.0	973.0	159.002221
8	9.0	899.0	145.266071
9	10.0	1034.0	237.603267

	timestamp	duration	iterations	number_of_reassigned_observations	within_cluster_sum_of_squares
0	2020-02-20 15:03:38	0.058 sec	0.0	NaN	NaN
1	2020-02-20 15:03:38	0.208 sec	1.0	10000.0	2603.218838
2	2020-02-20 15:03:38	0.244 sec	2.0	993.0	1865.508498
3	2020-02-20 15:03:38	0.254 sec	3.0	456.0	1772.786949
4	2020-02-20 15:03:38	0.267 sec	4.0	329.0	1725.575052
5	2020-02-20 15:03:38	0.278 sec	5.0	231.0	1699.701634
6	2020-02-20 15:03:38	0.289 sec	6.0	177.0	1687.171620
7	2020-02-20 15:03:38	0.302 sec	7.0	142.0	1678.999864
8	2020-02-20 15:03:38	0.309 sec	8.0	127.0	1673.706844
9	2020-02-20 15:03:38	0.314 sec	9.0	99.0	1669.989434
10	2020-02-20 15:03:38	0.319 sec	10.0	67.0	1667.636656

	centroid	size	within_cluster_sum_of_squares
0	1.0	1011.0	170.119422
1	2.0	993.0	165.447394
2	3.0	1176.0	208.789514
3	4.0	997.0	109.136555
4	5.0	859.0	124.607615
5	6.0	1062.0	191.610032
6	7.0	966.0	145.764267
7	8.0	959.0	155.069414
8	9.0	906.0	147.223218
9	10.0	1071.0	247.199140

	timestamp	duration	iterations	number_of_reassigned_observations	within_cluster_sum_of_squares
0	2020-02-20 15:03:39	0.011 sec	0.0	NaN	NaN
1	2020-02-20 15:04:26	47.307 sec	1.0	10000.0	1666.509446
2	2020-02-20 15:05:00	1 min 21.174 sec	2.0	42.0	1666.002233
3	2020-02-20 15:05:40	2 min 1.024 sec	3.0	32.0	1665.601672
4	2020-02-20 15:06:17	2 min 38.436 sec	4.0	18.0	1665.330971
5	2020-02-20 15:06:50	3 min 11.585 sec	5.0	21.0	1665.222647
6	2020-02-20 15:07:28	3 min 49.345 sec	6.0	16.0	1665.083857
7	2020-02-20 15:08:06	4 min 27.729 sec	7.0	15.0	1665.013420
8	2020-02-20 15:08:44	5 min 4.867 sec	8.0	4.0	1664.971517
9	2020-02-20 15:09:21	5 min 42.196 sec	9.0	1.0	1664.967125
10	2020-02-20 15:09:56	6 min 17.489 sec	10.0	1.0	1664.966569

	centroid	size	within_cluster_sum_of_squares
0	1.0	495.0	89.102803
1	2.0	451.0	81.021073
2	3.0	549.0	103.322650
3	4.0	447.0	52.938713
4	5.0	403.0	65.250578
5	6.0	509.0	98.360575
6	7.0	458.0	76.074637
7	8.0	448.0	73.799758
8	9.0	441.0	76.886811
9	10.0	503.0	116.661877

	timestamp	duration	iterations	number_of_reassigned_observations	within_cluster_sum_of_squares
0	2020-02-20 15:09:58	0.004 sec	0.0	NaN	NaN
1	2020-02-20 15:10:12	14.465 sec	1.0	4704.0	836.396491
2	2020-02-20 15:10:24	26.487 sec	2.0	29.0	834.019564
3	2020-02-20 15:10:36	38.478 sec	3.0	18.0	833.666722
4	2020-02-20 15:10:48	50.340 sec	4.0	12.0	833.523871
5	2020-02-20 15:11:00	1 min 2.571 sec	5.0	4.0	833.471111
6	2020-02-20 15:11:13	1 min 14.848 sec	6.0	3.0	833.457722
7	2020-02-20 15:11:24	1 min 26.745 sec	7.0	6.0	833.448221
8	2020-02-20 15:11:37	1 min 39.728 sec	8.0	3.0	833.432040
9	2020-02-20 15:11:49	1 min 51.767 sec	9.0	2.0	833.424375
10	2020-02-20 15:12:01	2 min 3.394 sec	10.0	1.0	833.419474

	centroid	size	within_cluster_sum_of_squares
0	1.0	219.0	44.646723
1	2.0	197.0	39.017929
2	3.0	223.0	46.074045
3	4.0	184.0	25.204827
4	5.0	188.0	32.548179
5	6.0	215.0	46.961160
6	7.0	195.0	35.921999
7	8.0	200.0	35.803402
8	9.0	187.0	36.819424
9	10.0	190.0	43.552975

	timestamp	duration	iterations	number_of_reassigned_observations	within_cluster_sum_of_squares
0	2020-02-20 15:12:02	0.001 sec	0.0	NaN	NaN
1	2020-02-20 15:12:08	5.375 sec	1.0	1998.0	394.928571
2	2020-02-20 15:12:13	10.380 sec	2.0	31.0	388.292709
3	2020-02-20 15:12:18	15.378 sec	3.0	21.0	387.354860
4	2020-02-20 15:12:22	20.198 sec	4.0	10.0	387.051973
5	2020-02-20 15:12:27	24.858 sec	5.0	9.0	386.968308
6	2020-02-20 15:12:32	29.360 sec	6.0	8.0	386.847480
7	2020-02-20 15:12:36	34.053 sec	7.0	8.0	386.732143
8	2020-02-20 15:12:42	39.803 sec	8.0	5.0	386.664586
9	2020-02-20 15:12:47	44.387 sec	9.0	5.0	386.607298
10	2020-02-20 15:12:51	48.896 sec	10.0	4.0	386.550663

	0	1	algo
0	573.728842	203.508799	km_co
1	133.744288	393.898371	km_co
2	100.095328	126.964439	km_co
3	585.305011	418.143112	km_co
4	345.861888	281.141212	km_co
5	266.990470	122.451603	km_co
6	578.007404	318.342757	km_co
7	355.965845	389.771451	km_co
8	128.568808	286.907431	km_co
9	521.531612	104.519515	km_co
0	568.775922	201.920099	km_co_agg_1/2
1	131.608865	396.226071	km_co_agg_1/2
2	104.380929	124.546519	km_co_agg_1/2
3	585.544331	417.416145	km_co_agg_1/2
4	340.504389	277.464477	km_co_agg_1/2
5	274.120831	118.697088	km_co_agg_1/2
6	574.578165	316.891841	km_co_agg_1/2
7	352.010968	391.491030	km_co_agg_1/2
8	122.792111	285.733267	km_co_agg_1/2
9	527.328463	98.201551	km_co_agg_1/2
0	561.261476	192.532877	km_co_agg_1/4
1	127.089594	396.637400	km_co_agg_1/4
2	104.014574	116.121202	km_co_agg_1/4
3	582.102163	417.127032	km_co_agg_1/4
4	333.999154	273.114326	km_co_agg_1/4
5	285.327334	116.059544	km_co_agg_1/4
6	571.867079	311.335805	km_co_agg_1/4
7	352.365579	394.340531	km_co_agg_1/4
8	118.349588	275.254412	km_co_agg_1/4
9	539.858625	86.908300	km_co_agg_1/4