KMeans Clustering using scikit-learn


In [25]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

Generate data


In [10]:
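# Draw 1000 two-dimensional points from 3 Gaussian blobs; y holds the true blob index.
X, y = make_blobs(n_samples=1000, centers=3, n_features=2)
# Name the feature columns directly instead of renaming them afterwards.
df = pd.DataFrame(X, columns=['x1', 'x2'])
df['y'] = y
df.head()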


Out[10]:
         x1        x2  y
0  3.515992  5.718453  2
1  0.979406  6.933223  2
2  8.014965 -6.052760  1
3  0.714026  5.140077  2
4  6.871349  0.790905  0
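make_blobs draws random samples, so the values shown above will differ on every run. The sketch below shows one way to make the data reproducible by fixing a seed. The value random_state=42 is an arbitrary choice made for illustration and is not part of the original notebook, so it will not reproduce the exact numbers in Out[10].

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Assumption: random_state=42 is an arbitrary seed chosen for this sketch.
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=42)
df = pd.DataFrame(X, columns=['x1', 'x2'])
df['y'] = y

# Calling make_blobs again with the same seed yields identical samples.
X_again, y_again = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=42)
same_data = bool((X == X_again).all() and (y == y_again).all())
print(same_data)
```

Passing the same random_state to KMeans should likewise make the centroid initialization repeatable.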

Cluster data using KMeans


In [31]:
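# k-means++ seeding; n_init=10 runs 10 initializations and keeps the lowest-inertia fit.
k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
# Fit on the two feature columns only, so the true labels in 'y' are not used.
k_means.fit(df[['x1', 'x2']])
# labels_ holds the cluster index assigned to each row, in row order.
df['y_pred'] = k_means.labels_
k_means.cluster_centers_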


Out[31]:
array([[ 6.14040257, -0.49850067],
       [ 2.01084733,  5.8233732 ],
       [ 9.04113487, -4.90957409]])
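The cluster indices stored in y_pred are arbitrary, so cluster 0 need not correspond to blob 0 in y, and comparing the two columns for equality can be misleading. The following is a minimal sketch of a comparison that ignores the numbering, using a contingency table and the adjusted Rand index. It regenerates the data so that it runs on its own, and the random_state=0 seeds are assumptions added for illustration, not values from the original notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Assumption: fixed seeds, chosen only so that this sketch is repeatable.
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=0)
df = pd.DataFrame(X, columns=['x1', 'x2'])
df['y'] = y

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10, random_state=0)
df['y_pred'] = k_means.fit_predict(df[['x1', 'x2']])

# Rows are true blobs, columns are predicted clusters.
contingency = pd.crosstab(df['y'], df['y_pred'])
# The adjusted Rand index is invariant to how the clusters are numbered.
ari = adjusted_rand_score(df['y'], df['y_pred'])

print(contingency)
print(f"Adjusted Rand index: {ari:.3f}")
```

If the clustering recovers the blobs, each row of the table has one dominant column and the index is close to 1. Overlapping blobs give a lower value.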

Plot data and color by cluster


In [32]:
# Color each point by its assigned cluster; the low alpha makes dense regions visible.
sns.pairplot(df, hue='y_pred', vars=('x1', 'x2'), diag_kind="kde",
             plot_kws=dict(alpha=0.1, edgecolor=None))


Out[32]:
<seaborn.axisgrid.PairGrid at 0x12640b6d8>
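To see where the fitted centroids sit relative to the points, the cluster centers can be drawn on top of a scatter plot. This is a sketch only: it uses plain matplotlib rather than the pairplot above, regenerates the data so that it runs on its own, and the seeds, marker style, and sizes are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: fixed seeds, chosen only so that this sketch is repeatable.
X, _ = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=0)
df = pd.DataFrame(X, columns=['x1', 'x2'])

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10, random_state=0)
df['y_pred'] = k_means.fit_predict(df[['x1', 'x2']])
centers = k_means.cluster_centers_

fig, ax = plt.subplots()
# Points colored by assigned cluster.
ax.scatter(df['x1'], df['x2'], c=df['y_pred'], alpha=0.1)
# Fitted centroids drawn on top as black crosses.
ax.scatter(centers[:, 0], centers[:, 1], marker='x', c='black', s=100)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
```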