Title: Group Observations Using K-Means Clustering
Slug: group_observations_using_clustering
Summary: How to group observations using k-means clustering for machine learning in Python.
Date: 2016-09-06 12:00
Category: Machine Learning
Tags: Preprocessing Structured Data
Authors: Chris Albon

Preliminaries


In [1]:
# Load libraries
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import pandas as pd

Create Data


In [2]:
# Make simulated feature matrix: 50 observations, 2 features, 3 blobs
X, _ = make_blobs(n_samples=50,
                  n_features=2,
                  centers=3,
                  random_state=1)

# Create DataFrame
df = pd.DataFrame(X, columns=['feature_1','feature_2'])

Train Clusterer


In [3]:
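# Make k-means clusterer with three clusters
# (n_init is set explicitly to match the value shown in the output below)
clusterer = KMeans(n_clusters=3, n_init=10, random_state=1)

# Fit clusterer to the feature matrix
clusterer.fit(X)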


Out[3]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=1, tol=0.0001, verbose=0)
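
Once the clusterer is fit, its learned state can be inspected. The sketch below is a minimal, self-contained illustration rather than part of the original notebook: it rebuilds the same simulated data, refits the model, and reads the fitted attributes `cluster_centers_`, `labels_`, and `inertia_`. Passing `n_init=10` explicitly is an assumption made here so the run does not depend on the default of the installed scikit-learn version.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Recreate the simulated feature matrix used in this post
X, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)

# Fit k-means (n_init given explicitly; an assumption for this sketch)
clusterer = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# One centroid per cluster, one coordinate per feature
centers = clusterer.cluster_centers_
print(centers.shape)

# Cluster assignment of each training observation
labels = clusterer.labels_
print(labels[:5])

# Sum of squared distances from each observation to its closest centroid
print(clusterer.inertia_)
```

The integer cluster ids are arbitrary labels, so the same grouping may be numbered differently under another seed or library version.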

Create Feature Based On Predicted Cluster


In [4]:
# Predict each observation's cluster and store it as a new feature
df['group'] = clusterer.predict(X)

# First few observations
df.head(5)


Out[4]:
   feature_1  feature_2  group
0  -9.877554  -3.336145      0
1  -7.287210  -8.353986      2
2  -6.943061  -7.023744      2
3  -7.440167  -8.791959      2
4  -6.641388  -8.075888      2
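
Because the values in `group` are nominal ids rather than quantities, a downstream model may treat them more sensibly if they are one-hot encoded. The following is a hedged sketch, not something the original notebook does: it rebuilds the data, assigns clusters with `fit_predict` (fitting and predicting in one call), counts observations per cluster, and expands `group` into indicator columns with `pd.get_dummies`. The names `group_counts`, `dummies`, and `df_encoded` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Recreate the simulated data and DataFrame used in this post
X, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)
df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])

# Fit the clusterer and assign each observation to a cluster in one step
clusterer = KMeans(n_clusters=3, n_init=10, random_state=1)
df['group'] = clusterer.fit_predict(X)

# Number of observations assigned to each cluster
group_counts = df['group'].value_counts().sort_index()
print(group_counts)

# Replace the nominal cluster id with one indicator column per cluster
dummies = pd.get_dummies(df['group'], prefix='group')
df_encoded = pd.concat([df.drop(columns='group'), dummies], axis=1)
print(df_encoded.head())
```

Whether to keep the single integer column or the indicator columns depends on the model that consumes the feature; tree-based models often cope with either, while linear models generally need the encoded form.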