by Talha Oz (submitted as GeoSocial Class Assignment #4)

The question asks to identify *clusters of tweets* as a broad concept and does not specify a particular domain to cluster on; i.e., one could cluster the tweets solely by their geographic properties, or alternatively color them by their sentiment polarity. Here I chose the former viewpoint, where the clusters are formed only according to location.

I first compute the distances between every unique pair of tweeting points using the `Vincenty` algorithm implemented in the `geopy` library, because the default implementations of the clustering algorithms in the `sklearn` package use `Euclidean distance` as their similarity metric. Since the `Vincenty` algorithm measures near-actual distances between geographic coordinates, it is a better metric than the `Euclidean` approach.
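
The gap between the two metrics is easy to demonstrate. The sketch below uses the spherical haversine formula from the standard `math` module as a stand-in for Vincenty (which further refines the result with an ellipsoidal Earth model); the helper names `haversine_mi` and `euclidean_deg` are my own illustrations, not part of `geopy`:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_mi(p1, p2):
    """Great-circle distance in miles between two (lat, lon) points.
    A spherical stand-in for Vincenty, which additionally models the
    Earth as an ellipsoid."""
    lat1, lon1, lat2, lon2 = map(radians, (p1[0], p1[1], p2[0], p2[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))  # 3958.8 = Earth's mean radius in miles

def euclidean_deg(p1, p2):
    """Naive 'distance' in raw degrees -- what a default sklearn metric sees."""
    return sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

london, paris = (51.5074, -0.1278), (48.8566, 2.3522)
print(haversine_mi(london, paris))   # roughly 214 miles
print(euclidean_deg(london, paris))  # ~3.6 "degrees" -- not a physical distance
```

The Euclidean value also distorts with latitude, since a degree of longitude shrinks toward the poles, whereas the geodesic distance does not.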

I initially tried the `DBSCAN` algorithm, but got only a few clusters and too many unclustered points. Since the tweeps' locations are scattered across several countries and the density of points in our dataset varies widely, this density-based clustering approach apparently did not fit our case well.
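
The density issue can be reproduced on a toy example. Below is a minimal 1-D DBSCAN sketch (my own illustration, not the `sklearn` implementation): with a single global `eps`, the dense run of points forms one cluster while every scattered point is labeled noise (-1):

```python
def dbscan_1d(pts, eps, min_samples):
    """Minimal DBSCAN sketch on 1-D points.
    Returns one label per point: cluster id >= 0, or -1 for noise."""
    labels = [None] * len(pts)
    cid = -1
    neighbors = lambda i: [j for j, q in enumerate(pts) if abs(pts[i] - q) <= eps]
    for i in range(len(pts)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1            # provisionally noise
            continue
        cid += 1                      # i is a core point: start a new cluster
        labels[i] = cid
        seeds = list(nbrs)
        while seeds:                  # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid       # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_samples:
                seeds.extend(jn)      # j is also core: keep expanding
    return labels

# one dense cluster around 0, plus widely scattered points
pts = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 25.0, 60.0]
print(dbscan_1d(pts, eps=0.5, min_samples=3))  # [0, 0, 0, 0, 0, -1, -1, -1]
```

No single `eps` works when densities vary this much: large enough to capture the sparse points, it merges everything; small enough to separate clusters, it discards the sparse points as noise.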

As I did not want to set the number of clusters before running the algorithm, but rather wanted the algorithm to choose the appropriate number from the dataset itself, I used the `Affinity Propagation (AP)` algorithm implemented in `sklearn`. The details of the implementation are discussed below:

- Read the CSV file into a dataframe and assign column names (as there is no header row in the provided CSV file).
- Group the tweets by their coordinates [i.e., (lat, lon) pairs]:
- Average the sentiment polarities
- Count the number of tweets in each group

- As a result of the groupby operation, the 5729 tweets are reduced to 1185 unique locations.
- The location with the highest tweet count is London city center, where 1778 of the 5729 tweets originated.
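
In plain Python terms, the groupby step amounts to the following (a stdlib sketch with made-up rows in the CSV's (time, lat, lon, sentiment) shape, not the actual dataset):

```python
from collections import defaultdict

# hypothetical rows: (tweet time, lat, lon, sentiment polarity)
rows = [
    ('2012-07-21 10:00', 51.5, -0.12, 0.8),
    ('2012-07-21 10:05', 51.5, -0.12, 0.4),
    ('2012-07-21 10:07', 53.4, -2.99, -0.2),
]

# collect the sentiment polarities per unique (lat, lon)
groups = defaultdict(list)
for _, lat, lon, sp in rows:
    groups[(lat, lon)].append(sp)

# per unique location: tweet count and mean sentiment polarity
summary = {loc: (len(sps), sum(sps) / len(sps)) for loc, sps in groups.items()}
print(summary)
```

Three tweets collapse to two locations, each carrying a count and a mean polarity, exactly the reduction performed on the real data.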

- I created the distance matrix using `pairwise_distances` from `sklearn.metrics.pairwise` with `vincenty` from `geopy.distance` as the metric, and described this matrix to get a better grasp of the dataset. The average distance between any two points is 116 miles, and the most distant points are 938 miles apart.
- I used `X.max() - X` to transform the distance matrix into a similarity matrix. This transformation gave better results than the suggested 1/(1+X) method. The `preference` parameter of the AffinityPropagation algorithm defaults to the median of the input similarities, so the inversion does not change it. Together with the `damping` factor, it controls the number of exemplars used (and hence the number of clusters); as the default values gave satisfactory results, I did not change them.
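
The two candidate transformations behave quite differently, which a few illustrative distance values make clear (only the 116- and 938-mile landmarks come from the dataset; the rest are made up):

```python
import statistics

distances = [0.0, 5.0, 20.0, 116.0, 938.0]  # miles

max_minus = [max(distances) - d for d in distances]
reciprocal = [1 / (1 + d) for d in distances]

# X.max() - X preserves the spread among large distances, while 1/(1+X)
# squashes everything beyond a few miles toward zero.
print(max_minus)    # [938.0, 933.0, 918.0, 822.0, 0.0]
print(reciprocal)

# Affinity Propagation's default preference is the median input similarity:
print(statistics.median(max_minus))  # 918.0
```

With 1/(1+X), a 116-mile and a 938-mile separation become nearly indistinguishable similarities, which helps explain why the linear inversion clustered better here.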

- 64 clusters are detected. I created plots to see the cluster sizes:
- the first plot shows the size of each individual cluster
- the second plot is a histogram showing the distribution of cluster sizes across 20 bins

- Every point is clustered.
- The time complexity of the algorithm (in big-O notation) is O(N^2 T), i.e., quadratic in the number of input points N, where T is the number of iterations (a constant, in our case 200). The memory complexity of the algorithm is also O(N^2).
- I think the algorithm did a pretty good job of clustering the locations (based on my eyeball test).
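
For our N = 1185 unique locations, these complexities translate into concrete numbers (back-of-the-envelope only):

```python
n, t = 1185, 200  # unique locations, AP iterations

print(n * n)            # 1404225 pairwise entries in the similarity matrix
print(n * n * 8 / 1e6)  # about 11.2 MB if stored as float64
print(n * n * t)        # on the order of 2.8e8 elementary message updates
```

This is comfortable at N ≈ 10^3, but the quadratic memory footprint is why AP stops being practical somewhere around N ≈ 10^4-10^5 points.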

The Folium (Leaflet.js) library is used for interactive mapping: locations are marked with circles whose radii are proportional to the number of tweets originating from the same (lat, lon). Moreover, Folium offers five tile sets. I used 'Mapbox Bright' because I found it to have the least distracting colors (simply white), which will not bias the perception of the cluster colors (e.g., green forests make it hard to see markers on top of them). As the colormap, I used a qualitative palette (Dark2 with 8 colors), which I believe reflects the clusters best on this tileset.

Please see below; an interactive map is provided as the output of the last command.

In [1]:
```
import pandas as pd
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import pairwise_distances
from geopy.distance import vincenty
import numpy as np
import folium
from IPython.display import HTML
from itertools import cycle
%matplotlib inline

```
In [2]:
```
def vincenty_mi(p1, p2):
    """Take two (lat, lon) points and return their distance in miles."""
    return vincenty((p1[0], p1[1]), (p2[0], p2[1])).miles
```
In [3]:
```
df = pd.read_csv('Olympic_torch_2012_UK.csv', header=None,
                 names=['twtime', 'lat', 'lon', 'sp'], parse_dates=[0])
df['cnt'] = 0
# average sp (sentiment polarities) and count tweets from the same lat/lon
df = pd.DataFrame(df.groupby(by=['lat','lon'],as_index=False).agg({'cnt':len,'sp':np.mean}))
print('Total number of tweets:',df['cnt'].sum())
print('Number of unique locations:',len(df))
print('Location with the highest tweet count (London city center):')
df[df.cnt == df['cnt'].max()]

```
Out[3]:

In [4]:
```
# this takes about 1 min 14 secs (measured by %timeit -n1 -r1)...
X = pairwise_distances(df[['lat','lon']],metric=vincenty_mi)
pd.Series(X.flatten()).describe()

```
Out[4]:

In [5]:
```
# convert the distance matrix to a similarity matrix
X = X.max()-X

```
In [6]:
```
# feed the precomputed similarity matrix to the clustering algorithm
# db = DBSCAN(eps=15, min_samples=10, metric='precomputed').fit_predict(X)
db = AffinityPropagation(affinity='precomputed').fit_predict(X)
df['cluster'] = db
df.head() # every point now has a cluster value

```
Out[6]:

In [7]:
```
# let's group by cluster and print the number of points in each cluster
grouped = df.groupby(by='cluster',as_index=False)
print('size of each cluster:',[{k:len(v)} for k,v in grouped.groups.items()])

```
In [8]:
```
# let's plot the clusters ordered by cluster size (x: cluster ID, y: number of points)
pd.options.display.mpl_style = 'default'
clusters = pd.Series([len(v) for v in grouped.groups.values()]).order()
clusters.plot(kind='bar',figsize = (16,8),title='Cluster sizes');

```
In [9]:
```
# let's see the cluster sizes in a histogram with 20 bins
clusters.plot(kind='hist',bins=20,figsize = (16,8), title='Histogram of Cluster sizes');

```
In [10]:
```
# among the alternatives, this colormap in this palette looks the best
from palettable.colorbrewer.qualitative import Dark2_8 as colmap
# from palettable.tableau import Tableau_20 as colmap
colors = {}
for i, c in enumerate(set(df['cluster'])):
    i = i % colmap.number
    colors.update({c: colmap.hex_colors[i]})

```
In [11]:
```
uk = folium.Map(location=[53.3, -3.5], zoom_start=6, width=991, height=1000, tiles='Mapbox Bright')
df.apply(lambda x: uk.circle_marker(location=[x['lat'], x['lon']],
                                    radius=x['cnt'],
                                    popup=str(x['cluster']),
                                    line_color=colors[x['cluster']],
                                    fill_color=colors[x['cluster']],
                                    fill_opacity=0.2),
         axis=1);
uk.create_map(path='uk.html')
HTML('<iframe src="uk.html" style="width: 100%; height: 1000px; border: none"></iframe>')

```
Out[11]: