In [1]:

```
import h2o
import imp
from h2o.estimators.kmeans import H2OKMeansEstimator
```
In [2]:

```
# Start a local instance of the H2O engine.
h2o.init();
```
In [3]:
```iris = h2o.import_file(path="https://github.com/h2oai/h2o-3/raw/master/h2o-r/h2o-package/inst/extdata/iris_wheader.csv")

```
```

```
In [4]:

```
iris.describe()
```
The iris data set is labeled with three classes, and four measurements were taken for each iris. While we will not use the labels for clustering, they provide a convenient way to compare and visualize the data as it was provided. In this example I use Seaborn for the visualization.

_(As an aside, the approach taken here of using all the data for visualization does not scale to large datasets. One approach to dealing with large data sets is to sample the data in H2O and then transfer the sample to the Python environment for plotting.)_
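A minimal sketch of that sample-then-plot idea, using pandas/NumPy stand-ins so it runs without an H2O cluster (in H2O itself a per-row uniform draw via `frame.runif()` is a common filtering idiom, mentioned here only as an assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for a large H2O frame: 100,000 rows of one measurement.
big = pd.DataFrame({"sepal_len": rng.normal(5.8, 0.8, size=100_000)})
# Keep roughly 1% of rows by comparing a uniform draw per row to the fraction.
sample = big[rng.random(len(big)) < 0.01]
# `sample` is now small enough to hand to a plotting library.
```

The resulting sample is small enough to transfer to the Python environment and plot with Seaborn exactly as below.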

In [5]:

```
try:
    imp.find_module('pandas')
    can_pandas = True
    import pandas as pd
except ImportError:
    can_pandas = False
try:
    imp.find_module('seaborn')
    can_seaborn = True
    import seaborn as sns
except ImportError:
    can_seaborn = False
%matplotlib inline
if can_seaborn:
    sns.set()
```
In [14]:

```
if can_seaborn:
    sns.set_context("notebook")
    sns.pairplot(iris.as_data_frame(True), vars=["sepal_len", "sepal_wid", "petal_len", "petal_wid"], hue="class");
```
In [7]:

```
results = [H2OKMeansEstimator(k=clusters, init="Random", seed=2, standardize=True) for clusters in range(2, 13)]
for estimator in results:
    estimator.train(x=iris.col_names[0:-1], training_frame=iris)
```

Three diagnostics will be demonstrated to help determine the number of clusters: total within-cluster sum of squares, AIC, and BIC.

Total within-cluster sum of squares sums the squared distance from each point to its assigned cluster center; this is the minimization criterion of k-means. The standard guideline for picking the number of clusters is to look for a 'knee' in the plot, the point where the total within-cluster sum of squares stops decreasing rapidly. This can be difficult to interpret, since identifying the knee is somewhat arbitrary.
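One rough way to make the knee less arbitrary is to look at second differences of the curve: the knee is where the improvement from adding a cluster falls off most sharply. The numbers below are made up for illustration, not actual iris output:

```python
# Hypothetical total within-cluster SS for k = 1..6 (illustrative values only).
ks = list(range(1, 7))
wss = [680.8, 350.0, 140.0, 115.0, 100.0, 90.0]
# Drop achieved by moving from k to k + 1 clusters.
drops = [wss[i] - wss[i + 1] for i in range(len(wss) - 1)]
# Knee: the k after which the per-cluster improvement collapses.
knee = ks[max(range(1, len(drops)), key=lambda i: drops[i - 1] - drops[i])]
```

With these invented values the largest falloff occurs after k = 3, which is where one would place the knee by eye.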

Because of this challenge with total within-cluster sum of squares, we will also use two merit statistics for determining the number of clusters. AIC and BIC are both measures of the relative quality of a statistical model. Both introduce penalty terms for the number of parameters in the model to counter overfitting; BIC's penalty term is larger than AIC's. With these merit statistics, one looks for the number of clusters that minimizes the statistic.
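To see the two penalties concretely, here is a small worked example with made-up numbers (the real computation for each model follows below); for k-means the parameter count is taken as one coordinate per dimension per cluster center:

```python
import math

# Made-up example: n rows, k clusters, d dimensions, and some within-cluster SS.
n, k, d = 150, 3, 4
params = k * d          # one coordinate per dimension per cluster center
within_ss = 100.0
aic = within_ss + 2 * params
bic = within_ss + math.log(n) * params
# BIC's penalty exceeds AIC's whenever log(n) > 2, i.e. for more than ~7 rows.
```

At n = 150 the BIC penalty (log(150) ≈ 5.0 per parameter) is about two and a half times the AIC penalty, so BIC favors fewer clusters.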

Here we build a method that extracts the inputs for each diagnostic and calculates the AIC and BIC values for a model. Each model is then inspected by the method and the results plotted for quick analysis.

In [8]:

```
import math

def diagnostics_from_clusteringmodel(model):
    total_within_sumofsquares = model.tot_withinss()
    # centers() returns one row per cluster centroid.
    number_of_clusters = len(model.centers())
    number_of_dimensions = len(model.centers()[0])
    number_of_rows = sum(model.size())
    aic = total_within_sumofsquares + 2 * number_of_dimensions * number_of_clusters
    bic = total_within_sumofsquares + math.log(number_of_rows) * number_of_dimensions * number_of_clusters
    return {'Clusters': number_of_clusters,
            'Total Within SS': total_within_sumofsquares,
            'AIC': aic,
            'BIC': bic}
```
In [9]:

```
if can_pandas:
    diagnostics = pd.DataFrame([diagnostics_from_clusteringmodel(model) for model in results])
    diagnostics.set_index('Clusters', inplace=True)
```
In [10]:

```
if can_pandas:
    diagnostics.plot(kind='line');
```
In [11]:

```
clusters = 4
# Models were built for k = 2..12, so the model with k clusters is at index k - 2.
predicted = results[clusters - 2].predict(iris)
iris["Predicted"] = predicted["predict"].asfactor()
```
In [13]:

```
if can_seaborn:
    sns.pairplot(iris.as_data_frame(True), vars=["sepal_len", "sepal_wid", "petal_len", "petal_wid"], hue="Predicted");
```

This IPython notebook is available for download. Grab the latest version of H2O to try it out, and be sure to check out the other H2O Python demos.

-- Hank