Data reduction in statistics and machine learning: factor analysis, principal components analysis, cluster analysis
Cluster analysis algorithms
DBSCAN is appropriate for geospatial data (when using haversine/great-circle distances or projected data), and we'll focus on it today using scikit-learn.
In [1]:
# magic command to display matplotlib plots inline within the ipython notebook
%matplotlib inline
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
In [2]:
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
In [3]:
# load the data set
df = pd.read_csv('data/summer-travel-gps-full.csv')
df.head()
Out[3]:
In [4]:
# how many rows are in this data set?
len(df)
Out[4]:
In [5]:
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
Scikit-learn's DBSCAN haversine distance metric requires the data as [latitude, longitude] pairs, and both its inputs and its outputs are in units of radians.
Extract the lat and lon columns into a numpy array of coordinates, then convert them to radians when calling fit so they work with scikit-learn's haversine metric.
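To make the radians convention concrete, here is a quick sanity check (a sketch, not part of the original analysis, using two made-up points): applying the haversine formula to radian coordinates yields an angular distance in radians, which, scaled by kms_per_radian, should closely match geopy's great-circle distance in kilometers.
In [ ]:
# sanity-check sketch: two made-up (lat, lon) points, purely for illustration
p1, p2 = (41.38, 2.17), (48.85, 2.35)
# convert to radians and apply the haversine formula to get an angular distance in radians
(lat1, lon1), (lat2, lon2) = np.radians(p1), np.radians(p2)
a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
angular_distance = 2 * np.arcsin(np.sqrt(a))
# scaling radians by the earth's mean radius recovers kilometers,
# which should closely match geopy's great-circle distance
print(angular_distance * kms_per_radian)
print(great_circle(p1, p2).km)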
In [6]:
# represent points consistently as (lat, lon)
coords = df[['lat', 'lon']].values
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
In [7]:
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
print('Clustered {:,} points in {:.2f} seconds'.format(len(coords), time.time() - start_time))
In [8]:
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):

    size = 150
    if cluster_label == -1: # make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30

    # plot the points that match the current cluster label
    x_coords = coords[cluster_labels==cluster_label][:,1]
    y_coords = coords[cluster_labels==cluster_label][:,0]
    ax.scatter(x=x_coords, y=y_coords, c=color, edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
matplotlib offers lots of built-in colormaps: http://matplotlib.org/examples/color/colormaps_reference.html
The silhouette coefficient evaluates how close a point is to the other points in its own cluster compared with how close it is to the points in the next nearest cluster. It ranges from -1 to +1: a high value indicates the point is well matched to its cluster, while a low or negative value suggests it lies between clusters or may be assigned to the wrong one.
http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient
In [9]:
coefficient = metrics.silhouette_score(coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
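For a bit more intuition, here is a sketch (not part of the original analysis) using scikit-learn's silhouette_samples to inspect the per-point silhouette values, each computed as s = (b - a) / max(a, b), where a is a point's mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster, so each value lies between -1 and +1.
In [ ]:
# sketch: distribution of per-point silhouette values for the clustering above
sample_scores = metrics.silhouette_samples(coords, cluster_labels)
print('min={:0.3f}, mean={:0.3f}, max={:0.3f}'.format(sample_scores.min(), sample_scores.mean(), sample_scores.max()))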
Now you try: experiment with different epsilon values, min_samples settings, and colormaps. One possible sketch follows the empty cell below.
In [ ]:
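One possible sketch for this exercise (the candidate values below are arbitrary): refit DBSCAN over a small grid of epsilon and min_samples values and compare how many clusters each combination produces. For colormaps, swap plt.cm.rainbow in the plotting cell above for another colormap such as plt.cm.viridis.
In [ ]:
# sketch: try a few arbitrary epsilon (km) and min_samples values,
# and count the clusters each combination produces (excluding noise)
for eps_km in [1, 5, 10, 50]:
    for min_samples in [3, 10, 25]:
        labels = DBSCAN(eps=eps_km / kms_per_radian, min_samples=min_samples,
                        algorithm='ball_tree', metric='haversine').fit(np.radians(coords)).labels_
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print('eps={}km, min_samples={}: {} clusters'.format(eps_km, min_samples, n_clusters))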
Rather than clustering to discover groups, I want to cluster to reduce the size of my data set. Even zoomed in very close, several locations have hundreds of data points stacked directly on top of each other due to the duration of time spent at one location. Unless we are interested in time dynamics, we simply do not need all of these spatially redundant points – they just bloat the data set’s size.
In [10]:
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
In [11]:
coefficient = metrics.silhouette_score(coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
In [12]:
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print('Number of clusters: {}'.format(num_clusters))
In [13]:
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
Out[13]:
DBSCAN clusters may be non-convex, so a cluster's centroid can lie outside the cluster itself. This technique instead returns one representative member of each cluster. First get the lat, lon coordinates of the cluster's centroid (shapely treats the first coordinate in each tuple as x and the second as y, so x is lat and y is lon here). Then find the member of the cluster with the smallest great-circle distance to that centroid.
In [14]:
# given a cluster of points, return the point nearest to the cluster's centroid
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)
Geopy's great circle distance calculates the shortest distance between two points along the surface of a sphere. https://en.wikipedia.org/wiki/Great-circle_distance
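As a quick illustration (a sketch, not part of the original analysis), great_circle exposes the computed distance in several units, such as .km and .m, and the helper above can be applied to a single cluster to return one representative (lat, lon) tuple.
In [ ]:
# sketch: great-circle distance between the first two points in the data set
print(great_circle(coords[0], coords[1]).km)
print(great_circle(coords[0], coords[1]).m)
# and the helper applied to one cluster returns a single representative (lat, lon) tuple
print(get_centermost_point(clusters[0]))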
In [15]:
# find the point in each cluster that is closest to its centroid
centermost_points = clusters.map(get_centermost_point)
# unzip the list of centermost points (lat, lon) tuples into separate lat and lon lists
lats, lons = zip(*centermost_points)
# from these lats/lons create a new df of one representative point for each cluster
representative_points = pd.DataFrame({'lon':lons, 'lat':lats})
representative_points.tail()
Out[15]:
My data set of 1,759 points has been reduced down to a spatially representative sample of 158 points. Last thing - grab the rows in the original dataframe that correspond to these 158 points (so that I keep city, country, and timestamp).
In [16]:
# pull row from full data set (df) where lat/lon match the lat/lon of each row of representative points
# use iloc[0] to pull just the first row if there are multiple matches
rs = representative_points.apply(lambda row: df[(df['lat']==row['lat']) & (df['lon']==row['lon'])].iloc[0], axis=1)
rs.to_csv('data/reduced-set.csv', index=False)
rs.tail()
Out[16]:
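An alternative sketch for the same step (not part of the original analysis): because the representative coordinates were taken directly from rows of the original dataframe, an exact-match inner merge on the lat and lon columns, followed by drop_duplicates to keep one row per coordinate pair, should recover the same reduced set.
In [ ]:
# sketch: attach the original attributes with a merge instead of apply,
# then keep one row per coordinate pair to mimic the iloc[0] behavior above
rs_merge = pd.merge(representative_points, df, how='inner', on=['lat', 'lon'])
rs_merge = rs_merge.drop_duplicates(subset=['lat', 'lon'])
print(len(rs_merge))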
In [17]:
# to demonstrate the data reduction, compare how many observations of 'Spain' are in each data set
print(len(df[df['country']=='Spain']))
print(len(rs[rs['country']=='Spain']))
In [18]:
# plot the final reduced set of coordinate points vs the original full set
fig, ax = plt.subplots(figsize=[10,7])
rs_scatter = ax.scatter(rs['lon'], rs['lat'], c='m', alpha=0.3, s=200)
df_scatter = ax.scatter(df['lon'], df['lat'], c='k', alpha=0.5, s=5)
ax.set_title('Full data set vs DBSCAN reduced set')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend([df_scatter, rs_scatter], ['Full set', 'Reduced set'], loc='upper left')
plt.show()
Despite the massive reduction in data set size, our smaller set is still spatially representative of the larger set (until you get to very fine spatial scales, as determined by the DBSCAN epsilon value).
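To check that claim quantitatively, here is a sketch (not part of the original analysis) that uses scikit-learn's BallTree with the haversine metric to measure, for every point in the full set, the distance to its nearest representative point. Most of these distances should be small, on the order of the 1.5 km epsilon used for the reduction, although chaining within clusters can make some of them larger.
In [ ]:
# sketch: distance (km) from each original point to its nearest representative point
from sklearn.neighbors import BallTree
rep_coords = representative_points[['lat', 'lon']].values
tree = BallTree(np.radians(rep_coords), metric='haversine')
distances, _ = tree.query(np.radians(coords), k=1)
distances_km = distances[:, 0] * kms_per_radian
print('mean: {:.2f} km, max: {:.2f} km'.format(distances_km.mean(), distances_km.max()))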
If you're interested in cluster analysis and want to explore further, advanced topics include: