by Talha Oz (submitted as GeoSocial Class Assignment #4)

The question asks to identify *clusters of tweets* as a broad concept and does not specify a particular domain to cluster on; i.e., one could cluster the tweets solely by their geographic properties, or alternatively color them by their sentiment polarity. Here I chose the former viewpoint, where the clusters are formed only according to location.

I first compute the distances between every unique pair of tweeting points using the `Vincenty` algorithm implemented in the `geopy` library, because the default implementations of the clustering algorithms in the `sklearn` package use `Euclidean distance` as their similarity metric. Since the `Vincenty` algorithm measures near-actual distances between geographic coordinates, it is a better metric than the `Euclidean` approach.
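
The gap between the two metrics is easy to demonstrate. The sketch below uses the spherical haversine formula from the standard `math` module as a stand-in for Vincenty (which further refines the result with an ellipsoidal Earth model); the helper names `haversine_mi` and `euclidean_deg` are my own illustrations, not part of `geopy`:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_mi(p1, p2):
    """Great-circle distance in miles between two (lat, lon) points.
    A spherical stand-in for Vincenty, which additionally models the
    Earth as an ellipsoid."""
    lat1, lon1, lat2, lon2 = map(radians, (p1[0], p1[1], p2[0], p2[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))  # 3958.8 = Earth's mean radius in miles

def euclidean_deg(p1, p2):
    """Naive 'distance' in raw degrees -- what a default sklearn metric sees."""
    return sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

london, paris = (51.5074, -0.1278), (48.8566, 2.3522)
print(haversine_mi(london, paris))   # roughly 214 miles
print(euclidean_deg(london, paris))  # ~3.6 "degrees" -- not a physical distance
```

The Euclidean value also distorts with latitude, since a degree of longitude shrinks toward the poles, whereas the geodesic distance does not.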

I initially tried the `DBSCAN` algorithm, but got only a few clusters and too many unclustered points. Since the tweeps' locations are scattered across several countries and the density of points in our dataset varies widely, this density-based clustering approach apparently did not fit our case well.
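
The density issue can be reproduced on a toy example. Below is a minimal 1-D DBSCAN sketch (my own illustration, not the `sklearn` implementation): with a single global `eps`, the dense run of points forms one cluster while every scattered point is labeled noise (-1):

```python
def dbscan_1d(pts, eps, min_samples):
    """Minimal DBSCAN sketch on 1-D points.
    Returns one label per point: cluster id >= 0, or -1 for noise."""
    labels = [None] * len(pts)
    cid = -1
    neighbors = lambda i: [j for j, q in enumerate(pts) if abs(pts[i] - q) <= eps]
    for i in range(len(pts)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1            # provisionally noise
            continue
        cid += 1                      # i is a core point: start a new cluster
        labels[i] = cid
        seeds = list(nbrs)
        while seeds:                  # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid       # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_samples:
                seeds.extend(jn)      # j is also core: keep expanding
    return labels

# one dense cluster around 0, plus widely scattered points
pts = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 25.0, 60.0]
print(dbscan_1d(pts, eps=0.5, min_samples=3))  # [0, 0, 0, 0, 0, -1, -1, -1]
```

No single `eps` works when densities vary this much: large enough to capture the sparse points, it merges everything; small enough to separate clusters, it discards the sparse points as noise.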

As I did not want to set the number of clusters before running the algorithm, but rather wanted the algorithm to choose the appropriate number from the dataset itself, I used the `Affinity Propagation (AP)` algorithm implemented in `sklearn`. The details of the implementation are discussed below:

- Read the CSV file into a dataframe and assign column names (as there is no header row in the provided CSV file).
- Group the tweets by their coordinates [i.e., (lat, lon) pairs]:
- Average the sentiment polarities
- Count the number of tweets in each group

- As a result of the groupby operation, the 5729 tweets are reduced to 1185 unique locations.
- The location with the highest tweet count is London city center, where 1778 of the 5729 tweets originated.
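
In plain Python terms, the groupby step amounts to the following (a stdlib sketch with made-up rows in the CSV's (time, lat, lon, sentiment) shape, not the actual dataset):

```python
from collections import defaultdict

# hypothetical rows: (tweet time, lat, lon, sentiment polarity)
rows = [
    ('2012-07-21 10:00', 51.5, -0.12, 0.8),
    ('2012-07-21 10:05', 51.5, -0.12, 0.4),
    ('2012-07-21 10:07', 53.4, -2.99, -0.2),
]

# collect the sentiment polarities per unique (lat, lon)
groups = defaultdict(list)
for _, lat, lon, sp in rows:
    groups[(lat, lon)].append(sp)

# per unique location: tweet count and mean sentiment polarity
summary = {loc: (len(sps), sum(sps) / len(sps)) for loc, sps in groups.items()}
print(summary)
```

Three tweets collapse to two locations, each carrying a count and a mean polarity, exactly the reduction performed on the real data.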

- I created the distance matrix using `pairwise_distances` from `sklearn.metrics.pairwise` with `vincenty` from `geopy.distance` as the metric, and described this matrix to get a better grasp of the dataset. The average distance between any two points is 116 miles, and the most distant points are 938 miles apart.
- I used `X.max() - X` to transform the distance matrix into a similarity matrix. This transformation gave better results than the suggested 1/(1+X) method. The `preference` parameter of the AffinityPropagation algorithm defaults to the median of the input similarities, so the inversion does not change it. Together with the `damping` factor, it controls the number of exemplars used (and hence the number of clusters); as the default values gave satisfactory results, I did not change them.
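
The two candidate transformations behave quite differently, which a few illustrative distance values make clear (only the 116- and 938-mile landmarks come from the dataset; the rest are made up):

```python
import statistics

distances = [0.0, 5.0, 20.0, 116.0, 938.0]  # miles

max_minus = [max(distances) - d for d in distances]
reciprocal = [1 / (1 + d) for d in distances]

# X.max() - X preserves the spread among large distances, while 1/(1+X)
# squashes everything beyond a few miles toward zero.
print(max_minus)    # [938.0, 933.0, 918.0, 822.0, 0.0]
print(reciprocal)

# Affinity Propagation's default preference is the median input similarity:
print(statistics.median(max_minus))  # 918.0
```

With 1/(1+X), a 116-mile and a 938-mile separation become nearly indistinguishable similarities, which helps explain why the linear inversion clustered better here.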

- 64 clusters are detected. I created plots to see the cluster sizes:
- the first plot shows the size of each individual cluster
- the second plot is a histogram showing the distribution of cluster sizes across 20 bins

- Every point is clustered.
- The time complexity of the algorithm (in big-O notation) is O(N^2 T), i.e., quadratic in the number of input points N, where T is the number of iterations (a constant, in our case 200). The memory complexity of the algorithm is also O(N^2).
- I think the algorithm did a pretty good job of clustering the locations (based on my eyeball test).
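
For our N = 1185 unique locations, these complexities translate into concrete numbers (back-of-the-envelope only):

```python
n, t = 1185, 200  # unique locations, AP iterations

print(n * n)            # 1404225 pairwise entries in the similarity matrix
print(n * n * 8 / 1e6)  # about 11.2 MB if stored as float64
print(n * n * t)        # on the order of 2.8e8 elementary message updates
```

This is comfortable at N ≈ 10^3, but the quadratic memory footprint is why AP stops being practical somewhere around N ≈ 10^4-10^5 points.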

The Folium (Leaflet.js) library is used for interactive mapping: locations are marked with circles whose radii are proportional to the number of tweets originating from the same (lat, lon). Moreover, Folium offers five tile sets. I used 'Mapbox Bright' because I found it to have the least distracting colors (simply white), which will not bias the perception of the cluster colors (e.g., green forests make it hard to see markers on top of them). As the colormap, I used a qualitative palette (Dark2 with 8 colors), which I believe reflects the clusters best on this tileset.

Please see below; an interactive map is provided as the output of the last command.

In [1]:
```
import pandas as pd
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import pairwise_distances
from geopy.distance import vincenty
import numpy as np
import folium
from IPython.display import HTML
from itertools import cycle
%matplotlib inline

```
In [2]:
```
def vincenty_mi(p1, p2):
    """Take two (lat, lon) points and return their distance in miles."""
    return vincenty((p1[0], p1[1]), (p2[0], p2[1])).miles
```
In [3]:
```
df = pd.read_csv('Olympic_torch_2012_UK.csv', header=None,
                 names=['twtime', 'lat', 'lon', 'sp'], parse_dates=[0])
df['cnt'] = 0
# average sp (sentiment polarities) and count tweets from the same lat/lon
df = pd.DataFrame(df.groupby(by=['lat','lon'],as_index=False).agg({'cnt':len,'sp':np.mean}))
print('Total number of tweets:',df['cnt'].sum())
print('Number of unique locations:',len(df))
print('Location with the highest tweet count (London city center):')
df[df.cnt == df['cnt'].max()]

```
Out[3]:

In [4]:
```
# this takes about 1 min 14 secs (measured by %timeit -n1 -r1)...
X = pairwise_distances(df[['lat','lon']],metric=vincenty_mi)
pd.Series(X.flatten()).describe()

```
Out[4]:

In [5]:
```
# convert the distance matrix to a similarity matrix
X = X.max()-X

```
In [6]:
```
# feed the precomputed similarity matrix to the clustering algorithm
# db = DBSCAN(eps=15, min_samples=10, metric='precomputed').fit_predict(X)
db = AffinityPropagation(affinity='precomputed').fit_predict(X)
df['cluster'] = db
df.head() # every point now has a cluster value

```
Out[6]:

In [7]:
```
# let's group by cluster and print the number of points in each cluster
grouped = df.groupby(by='cluster',as_index=False)
print('size of each cluster:',[{k:len(v)} for k,v in grouped.groups.items()])

```
In [8]:
```
# let's plot the clusters ordered by cluster size (x: cluster ID, y: number of points)
pd.options.display.mpl_style = 'default'
clusters = pd.Series([len(v) for v in grouped.groups.values()]).order()
clusters.plot(kind='bar',figsize = (16,8),title='Cluster sizes');

```
In [9]:
```
# let's see the cluster sizes in a histogram with 20 bins
clusters.plot(kind='hist',bins=20,figsize = (16,8), title='Histogram of Cluster sizes');

```
In [10]:
```
# among the alternatives, this colormap in this palette looks the best
from palettable.colorbrewer.qualitative import Dark2_8 as colmap
# from palettable.tableau import Tableau_20 as colmap
colors = {}
for i, c in enumerate(set(df['cluster'])):
    i = i % colmap.number
    colors.update({c: colmap.hex_colors[i]})

```
In [11]:
```
uk = folium.Map(location=[53.3, -3.5], zoom_start=6, width=991, height=1000, tiles='Mapbox Bright')
df.apply(lambda x: uk.circle_marker(location=[x['lat'], x['lon']],
                                    radius=x['cnt'],
                                    popup=str(x['cluster']),
                                    line_color=colors[x['cluster']],
                                    fill_color=colors[x['cluster']],
                                    fill_opacity=0.2),
         axis=1);
uk.create_map(path='uk.html')
HTML('<iframe src="uk.html" style="width: 100%; height: 1000px; border: none"></iframe>')

```
Out[11]: