Cluster Analysis

Data reduction in statistics and machine learning: factor analysis, principal components analysis, cluster analysis

  • in statistics, several observed variables might represent a single latent variable (reduce number of dimensions)
  • in spatial analysis, several lat-long points may represent a single "place" (reduce number of examples)

Cluster analysis algorithms

  • k-means partitions the data space into Voronoi cells. Translation: you give it a desired number of clusters and it assigns each point to the cluster with the nearest centroid. It minimizes within-cluster variance, tends to produce similarly sized, roughly spherical clusters, and fails at density-based clustering.
  • DBSCAN discovers clusters as dense areas in space, surrounded by sparser areas. Points in the sparse areas are usually considered noise (needs there to be drop-offs in density to detect cluster borders).
  • OPTICS is similar to DBSCAN, but lets you find clusters of varying density.
  • many more...

DBSCAN is appropriate for geospatial data (when using haversine/great-circle distances or projected data), and we'll focus on it today using scikit-learn.
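
Before the notebook begins, here is a minimal toy sketch (not part of this notebook, using made-up synthetic data) of the difference described above: k-means splits non-spherical clusters apart, while DBSCAN recovers them from density alone.

# toy comparison on synthetic two-moons data (illustrative only)
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # cuts each moon in half
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)                   # recovers the two moons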


In [1]:
# magic command to display matplotlib plots inline within the ipython notebook
%matplotlib inline

# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint

In [2]:
# define the number of kilometers in one radian
kms_per_radian = 6371.0088

Part 1: spatial clustering into groups


In [3]:
# load the data set
df = pd.read_csv('data/summer-travel-gps-full.csv')
df.head()


Out[3]:
lat lon date city country
0 51.481292 -0.451011 05/14/2014 09:07 West Drayton United Kingdom
1 51.474005 -0.450999 05/14/2014 09:22 Hounslow United Kingdom
2 51.478199 -0.446081 05/14/2014 10:51 Hounslow United Kingdom
3 51.478199 -0.446081 05/14/2014 11:24 Hounslow United Kingdom
4 51.474146 -0.451562 05/14/2014 11:38 Hounslow United Kingdom

In [4]:
# how many rows are in this data set?
len(df)


Out[4]:
1759

In [5]:
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)


Compute DBSCAN

The scikit-learn DBSCAN haversine distance metric requires data in the form of [latitude, longitude] and both inputs and outputs are in units of radians.

  • eps is the physical distance from each point that forms its neighborhood
  • min_samples is the minimum number of points needed to form a cluster; points that don't fall inside any cluster are labeled noise (set it to 1 and every point joins a cluster, so there is no noise - we do that in Part 2)

Extract the lat, lon columns into a numpy matrix of coordinates, then convert to radians when you call fit, for use by scikit-learn's haversine metric.
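
As a quick sanity check (not from the original notebook; the London/Paris coordinates below are just illustrative values), you can verify the radians convention by comparing scikit-learn's haversine distance, multiplied by the earth's radius, against geopy's great-circle distance:

from sklearn.metrics.pairwise import haversine_distances

# two (lat, lon) points converted to radians, as the haversine metric expects
london = np.radians([51.4813, -0.4510])
paris = np.radians([48.8566, 2.3522])

# haversine_distances returns radians; multiply by earth's radius to get kilometers
d_km = haversine_distances([london, paris])[0, 1] * kms_per_radian
print(d_km, great_circle((51.4813, -0.4510), (48.8566, 2.3522)).km)  # the two values should agree closely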


In [6]:
# represent points consistently as (lat, lon)
coords = df[['lat', 'lon']].values

# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian

In [7]:
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)

# get the number of clusters (ignore noise, which is labeled -1, if present)
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)

In [8]:
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))

# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    
    size = 150
    if cluster_label == -1: #make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    
    # plot the points that match the current cluster label
    x_coords = coords[cluster_labels==cluster_label][:,1]
    y_coords = coords[cluster_labels==cluster_label][:,0]
    ax.scatter(x=x_coords, y=y_coords, c=color, edgecolor='k', s=size, alpha=0.5)

ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()


matplotlib offers lots of built-in colormaps: http://matplotlib.org/examples/color/colormaps_reference.html

The silhouette coefficient evaluates how close a point is to the other points in its own cluster compared with how close it is to the points in the next nearest cluster. A high silhouette coefficient indicates the points are well matched to their clusters; a low value indicates points that sit near a cluster boundary or may be outliers.

http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient
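
For each point the coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. If you want per-point detail rather than one overall score, a sketch (not in the original notebook) using scikit-learn's silhouette_samples looks like this:

# silhouette_samples returns one value per observation, so you can see which
# points are well matched to their cluster and which sit near a boundary
sample_scores = metrics.silhouette_samples(coords, cluster_labels)
print('min {:.3f}, mean {:.3f}, max {:.3f}'.format(sample_scores.min(), sample_scores.mean(), sample_scores.max()))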


In [9]:
coefficient = metrics.silhouette_score(coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))


Silhouette coefficient: 0.854

Now you try: experiment with different epsilon values, minimum sizes, and colormaps.


In [ ]:

Part 2: clustering to reduce data set size

Rather than clustering to discover groups, I want to cluster to reduce the size of my data set. Even zoomed in very close, several locations have hundreds of data points stacked directly on top of each other due to the duration of time spent at one location. Unless we are interested in time dynamics, we simply do not need all of these spatially redundant points – they just bloat the data set’s size.
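
To see the redundancy this paragraph describes, one quick check (illustrative, not part of the original notebook) is to count how many rows share their exact coordinates with another row:

# count rows whose exact lat/lon pair appears more than once in the data set
n_duplicated = df.duplicated(subset=['lat', 'lon'], keep=False).sum()
print('{} of {} rows share coordinates with at least one other row'.format(n_duplicated, len(df)))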


In [10]:
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian

# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)

# get the number of clusters
num_clusters = len(set(cluster_labels))

# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))


Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.05 seconds

In [11]:
coefficient = metrics.silhouette_score(coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))


Silhouette coefficient: 0.652

In [12]:
# number of clusters, ignoring noise if present (min_samples=1 here, so there is none)
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print('Number of clusters: {}'.format(num_clusters))


Number of clusters: 138

In [13]:
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()


Out[13]:
133                           [[50.3736895, 18.8892047]]
134    [[50.4487039, 19.0594976], [50.4492979, 19.060...
135                           [[50.4622715, 19.0815141]]
136                           [[50.4893042, 19.0983815]]
137    [[51.474005, -0.4509991], [51.4741456, -0.4515...
dtype: object

Find the point in each cluster that is closest to its centroid

DBSCAN clusters may be non-convex. This technique just returns one representative point from each cluster. First get the lat,lon coordinates of the cluster's centroid (shapely represents the first coordinate in the tuple as x and the second as y, so lat is x and lon is y here). Then find the member of the cluster with the smallest great circle distance to the centroid.


In [14]:
# given a cluster of points, return the point nearest to the cluster's centroid
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

Geopy's great circle distance calculates the shortest distance between two points along the surface of a sphere. https://en.wikipedia.org/wiki/Great-circle_distance
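
A tiny usage sketch (with made-up coordinates, not from the original notebook) shows what get_centermost_point returns for a cluster of three points:

# the centroid of these three points is (51.2, 0.2), so the middle point is returned
toy_cluster = np.array([[51.0, 0.0], [51.1, 0.1], [51.5, 0.5]])
print(get_centermost_point(toy_cluster))  # the middle point, (51.1, 0.1)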


In [15]:
# find the point in each cluster that is closest to its centroid
centermost_points = clusters.map(get_centermost_point)

# unzip the list of centermost points (lat, lon) tuples into separate lat and lon lists
lats, lons = zip(*centermost_points)

# from these lats/lons create a new df of one representative point for each cluster
representative_points = pd.DataFrame({'lon':lons, 'lat':lats})
representative_points.tail()


Out[15]:
lat lon
133 50.373690 18.889205
134 50.449298 19.060560
135 50.462271 19.081514
136 50.489304 19.098381
137 51.478199 -0.446081

My data set of 1,759 points has been reduced down to a spatially representative sample of 138 points. Last thing - grab the rows in the original dataframe that correspond to these 138 points (so that I keep city, country, and timestamp).


In [16]:
# pull row from full data set (df) where lat/lon match the lat/lon of each row of representative points
# use iloc[0] to pull just the first row if there are multiple matches
rs = representative_points.apply(lambda row: df[(df['lat']==row['lat']) & (df['lon']==row['lon'])].iloc[0], axis=1)
rs.to_csv('data/reduced-set.csv', index=False)
rs.tail()


Out[16]:
lat lon date city country
133 50.373690 18.889205 06/02/2014 07:25 Bytom Poland
134 50.449298 19.060560 05/30/2014 16:39 Tarnowskie Góry County Poland
135 50.462271 19.081514 06/02/2014 07:10 Tarnowskie Góry County Poland
136 50.489304 19.098381 05/30/2014 16:10 Zendek Poland
137 51.478199 -0.446081 05/14/2014 10:51 Hounslow United Kingdom
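
As a design note, an equivalent way to do this lookup (a sketch, which works here because the representative points' floats were taken directly from df, so exact equality holds) is a merge followed by de-duplication:

# merge representative points back onto the full data set, keeping one match per coordinate pair
rs_merged = pd.merge(representative_points, df, on=['lat', 'lon'], how='left')
rs_merged = rs_merged.drop_duplicates(subset=['lat', 'lon'])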

In [17]:
# to demonstrate the data reduction, compare how many observations of 'Spain' in each data set
print(len(df[df['country']=='Spain']))
print(len(rs[rs['country']=='Spain']))


646
4

In [18]:
# plot the final reduced set of coordinate points vs the original full set
fig, ax = plt.subplots(figsize=[10,7])
rs_scatter = ax.scatter(rs['lon'], rs['lat'], c='m', alpha=0.3, s=200)
df_scatter = ax.scatter(df['lon'], df['lat'], c='k', alpha=0.5, s=5)
ax.set_title('Full data set vs DBSCAN reduced set')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend([df_scatter, rs_scatter], ['Full set', 'Reduced set'], loc='upper left')
plt.show()


Despite the massive reduction in data set size, our smaller set is still spatially representative of the larger set (until you get to very fine spatial scales, as determined by the DBSCAN epsilon value).
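
One quick way to back up that claim (illustrative, not from the original notebook) is to compare the spatial extent of the two sets:

# compare the bounding box of the full set vs the reduced set
print(df[['lat', 'lon']].describe().loc[['min', 'max']])
print(rs[['lat', 'lon']].describe().loc[['min', 'max']])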

If you're interested in cluster analysis and want to explore further, advanced topics include: