Data reduction in statistics and machine learning: factor analysis, principal components analysis, cluster analysis
Cluster analysis algorithms
DBSCAN is appropriate for geospatial data (when using haversine/great-circle distances or projected data), and we'll focus on it today using scikit-learn.
In [1]:
# magic command to display matplotlib plots inline within the ipython notebook
%matplotlib inline
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
In [2]:
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
In [3]:
# load the data set
df = pd.read_csv('data/summer-travel-gps-full.csv')
df.head()
Out[3]:
In [4]:
# how many rows are in this data set?
len(df)
Out[4]:
In [5]:
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
Scikit-learn's DBSCAN haversine distance metric requires the data as [latitude, longitude] pairs, and both its inputs and its outputs are in units of radians.
Extract the lat and lon columns into a numpy array of coordinates, then convert them to radians when calling fit so they work with scikit-learn's haversine metric.
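To make the radians convention concrete, here is a quick sanity check (a sketch, not part of the original analysis, using two made-up points): applying the haversine formula to radian coordinates yields an angular distance in radians, which, scaled by kms_per_radian, should closely match geopy's great-circle distance in kilometers.
In [ ]:
# sanity-check sketch: two made-up (lat, lon) points, purely for illustration
p1, p2 = (41.38, 2.17), (48.85, 2.35)
# convert to radians and apply the haversine formula to get an angular distance in radians
(lat1, lon1), (lat2, lon2) = np.radians(p1), np.radians(p2)
a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
angular_distance = 2 * np.arcsin(np.sqrt(a))
# scaling radians by the earth's mean radius recovers kilometers,
# which should closely match geopy's great-circle distance
print(angular_distance * kms_per_radian)
print(great_circle(p1, p2).km)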
In [6]:
# represent points consistently as (lat, lon)
coords = df[['lat', 'lon']].values
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
In [7]:
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
print('Clustered {:,} points in {:.2f} seconds'.format(len(coords), time.time() - start_time))
In [8]:
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):

    size = 150
    if cluster_label == -1: # make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30

    # plot the points that match the current cluster label
    x_coords = coords[cluster_labels==cluster_label][:,1]
    y_coords = coords[cluster_labels==cluster_label][:,0]
    ax.scatter(x=x_coords, y=y_coords, c=color, edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
matplotlib offers lots of built-in colormaps: http://matplotlib.org/examples/color/colormaps_reference.html
The silhouette coefficient evaluates how close a point is to the other points in its own cluster compared with how close it is to the points in the next nearest cluster. It ranges from -1 to +1: a high value indicates the point is well matched to its cluster, while a low or negative value suggests it lies between clusters or may be assigned to the wrong one.
http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient
In [9]:
coefficient = metrics.silhouette_score(coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
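For a bit more intuition, here is a sketch (not part of the original analysis) using scikit-learn's silhouette_samples to inspect the per-point silhouette values, each computed as s = (b - a) / max(a, b), where a is a point's mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster, so each value lies between -1 and +1.
In [ ]:
# sketch: distribution of per-point silhouette values for the clustering above
sample_scores = metrics.silhouette_samples(coords, cluster_labels)
print('min={:0.3f}, mean={:0.3f}, max={:0.3f}'.format(sample_scores.min(), sample_scores.mean(), sample_scores.max()))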
Now you try: experiment with different epsilon values, min_samples settings, and colormaps. One possible sketch follows the empty cell below.
In [ ]:
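One possible sketch for this exercise (the candidate values below are arbitrary): refit DBSCAN over a small grid of epsilon and min_samples values and compare how many clusters each combination produces. For colormaps, swap plt.cm.rainbow in the plotting cell above for another colormap such as plt.cm.viridis.
In [ ]:
# sketch: try a few arbitrary epsilon (km) and min_samples values,
# and count the clusters each combination produces (excluding noise)
for eps_km in [1, 5, 10, 50]:
    for min_samples in [3, 10, 25]:
        labels = DBSCAN(eps=eps_km / kms_per_radian, min_samples=min_samples,
                        algorithm='ball_tree', metric='haversine').fit(np.radians(coords)).labels_
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print('eps={}km, min_samples={}: {} clusters'.format(eps_km, min_samples, n_clusters))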
Rather than clustering to discover groups, I want to cluster to reduce the size of my data set. Even zoomed in very close, several locations have hundreds of data points stacked directly on top of each other due to the duration of time spent at one location. Unless we are interested in time dynamics, we simply do not need all of these spatially redundant points – they just bloat the data set’s size.
In [10]:
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
In [11]:
coefficient = metrics.silhouette_score(coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
In [12]:
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print('Number of clusters: {}'.format(num_clusters))
In [13]:
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
Out[13]:
DBSCAN clusters may be non-convex, so a cluster's centroid can lie outside the cluster itself. This technique instead returns one representative member of each cluster. First get the lat, lon coordinates of the cluster's centroid (shapely treats the first coordinate in each tuple as x and the second as y, so x is lat and y is lon here). Then find the member of the cluster with the smallest great-circle distance to that centroid.
In [14]:
# given a cluster of points, return the point nearest to the cluster's centroid
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)
Geopy's great circle distance calculates the shortest distance between two points along the surface of a sphere. https://en.wikipedia.org/wiki/Great-circle_distance
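As a quick illustration (a sketch, not part of the original analysis), great_circle exposes the computed distance in several units, such as .km and .m, and the helper above can be applied to a single cluster to return one representative (lat, lon) tuple.
In [ ]:
# sketch: great-circle distance between the first two points in the data set
print(great_circle(coords[0], coords[1]).km)
print(great_circle(coords[0], coords[1]).m)
# and the helper applied to one cluster returns a single representative (lat, lon) tuple
print(get_centermost_point(clusters[0]))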
In [15]:
# find the point in each cluster that is closest to its centroid
centermost_points = clusters.map(get_centermost_point)
# unzip the list of centermost points (lat, lon) tuples into separate lat and lon lists
lats, lons = zip(*centermost_points)
# from these lats/lons create a new df of one representative point for each cluster
representative_points = pd.DataFrame({'lon':lons, 'lat':lats})
representative_points.tail()
Out[15]:
My data set of 1,759 points has been reduced down to a spatially representative sample of 158 points. Last thing - grab the rows in the original dataframe that correspond to these 158 points (so that I keep city, country, and timestamp).
In [16]:
# pull row from full data set (df) where lat/lon match the lat/lon of each row of representative points
# use iloc[0] to pull just the first row if there are multiple matches
rs = representative_points.apply(lambda row: df[(df['lat']==row['lat']) & (df['lon']==row['lon'])].iloc[0], axis=1)
rs.to_csv('data/reduced-set.csv', index=False)
rs.tail()
Out[16]:
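An alternative sketch for the same step (not part of the original analysis): because the representative coordinates were taken directly from rows of the original dataframe, an exact-match inner merge on the lat and lon columns, followed by drop_duplicates to keep one row per coordinate pair, should recover the same reduced set.
In [ ]:
# sketch: attach the original attributes with a merge instead of apply,
# then keep one row per coordinate pair to mimic the iloc[0] behavior above
rs_merge = pd.merge(representative_points, df, how='inner', on=['lat', 'lon'])
rs_merge = rs_merge.drop_duplicates(subset=['lat', 'lon'])
print(len(rs_merge))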
In [17]:
# to demonstrate the data reduction, compare how many observations of 'Spain' are in each data set
print(len(df[df['country']=='Spain']))
print(len(rs[rs['country']=='Spain']))
In [18]:
# plot the final reduced set of coordinate points vs the original full set
fig, ax = plt.subplots(figsize=[10,7])
rs_scatter = ax.scatter(rs['lon'], rs['lat'], c='m', alpha=0.3, s=200)
df_scatter = ax.scatter(df['lon'], df['lat'], c='k', alpha=0.5, s=5)
ax.set_title('Full data set vs DBSCAN reduced set')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend([df_scatter, rs_scatter], ['Full set', 'Reduced set'], loc='upper left')
plt.show()
Despite the massive reduction in data set size, our smaller set is still spatially representative of the larger set (until you get to very fine spatial scales, as determined by the DBSCAN epsilon value).
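To check that claim quantitatively, here is a sketch (not part of the original analysis) that uses scikit-learn's BallTree with the haversine metric to measure, for every point in the full set, the distance to its nearest representative point. Most of these distances should be small, on the order of the 1.5 km epsilon used for the reduction, although chaining within clusters can make some of them larger.
In [ ]:
# sketch: distance (km) from each original point to its nearest representative point
from sklearn.neighbors import BallTree
rep_coords = representative_points[['lat', 'lon']].values
tree = BallTree(np.radians(rep_coords), metric='haversine')
distances, _ = tree.query(np.radians(coords), k=1)
distances_km = distances[:, 0] * kms_per_radian
print('mean: {:.2f} km, max: {:.2f} km'.format(distances_km.mean(), distances_km.max()))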
If you're interested in cluster analysis and want to explore further, advanced topics include: