In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
%matplotlib inline
plt.rcParams['text.usetex'] = True
plt.rcParams['figure.figsize'] = [10, 8]
plt.rcParams['font.size'] = 16
This notebook looks at some location data provided by the Google Takeout website. It was inspired by this original post by Chris Albon.
In the first part we follow Chris's advice on how to import the JSON data, then set up a helper function to
plot the data using matplotlib's Basemap toolkit. In the second part we look at ways to separate the data: initially a simple bisection is used to separate home from away, then we apply a clustering algorithm.
Having downloaded the JSON data to a file LocationHistory.json in the local
directory (or just amend the path below), we start by following Chris Albon's post
and generating a first plot of the data.
In [4]:
# Create a dataframe from the JSON file in the filepath
raw = pd.read_json('LocationHistory.json')
df = raw['locations'].apply(pd.Series)

# The coordinates are stored as integers scaled by 1e7, so multiply by 1e-7
# to recover the latitude and longitude in degrees
df['latitude'] = df['latitudeE7'] * 1e-7
df['longitude'] = df['longitudeE7'] * 1e-7
This gives us a pandas dataframe with latitude and longitude columns for each recorded point in my location history. There are several other columns as well, but for the time being let's ignore these.
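For the curious, the remaining columns can be inspected directly. The exact schema depends on the version of the Takeout export, so treat the timestampMs column used below as an assumption; if present, it holds the recording time in milliseconds since the epoch and can be converted to a proper datetime.
In [ ]:
# Inspect the remaining columns (the schema varies between Takeout export versions)
print(df.columns.tolist())

# timestampMs, if present, is a string of milliseconds since the epoch
if 'timestampMs' in df.columns:
    df['datetime'] = pd.to_datetime(df['timestampMs'].astype('int64'), unit='ms')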
Now we define a helper function which will plot all of the relevant data in a data frame. This will be useful later on when we want to plot only some subset.
In [5]:
def PaddingFunction(xL, xR, frac=0.1):
    """ Return xL and xR with an added padding factor of frac either side """
    xRange = xR - xL
    xL_new = xL - frac*xRange
    xR_new = xR + frac*xRange
    return xL_new, xR_new
def GeneratePlot(data, fig=None, ignore_first=False, *args, **kwargs):
    """ Helper function to plot points on a map

    Parameters
    ----------
    data : pd.DataFrame or list of pd.DataFrame
        The location data to plot; the first data frame sets the map extent
    ignore_first : bool
        If true the data in the first df in data is ignored and used only to set
        up the map
    """
    if isinstance(data, pd.DataFrame):
        # Single df
        df = data
        df_list = [df]
    elif isinstance(data, list):
        df_list = data
        df = data[0]

    if not fig:
        fig = plt.figure()

    # Calculate some parameters which will be reused
    lat_0 = df.latitude.mean()
    lon_0 = df.longitude.mean()
    llcrnrlon, urcrnrlon = PaddingFunction(df.longitude.min(), df.longitude.max(), frac=0.3)
    llcrnrlat, urcrnrlat = PaddingFunction(df.latitude.min(), df.latitude.max())

    # Create a map using the Gall stereographic projection
    m = Basemap(projection='gall',
                resolution='l',
                area_thresh=10000.0,
                lat_0=lat_0, lon_0=lon_0,
                llcrnrlon=llcrnrlon,
                urcrnrlon=urcrnrlon,
                llcrnrlat=llcrnrlat,
                urcrnrlat=urcrnrlat,
                ax=fig.gca()
                )

    m.drawcoastlines()
    m.drawcountries()
    m.fillcontinents(color='#996633')
    m.drawmapboundary(fill_color='#0099FF')

    if ignore_first:
        df_list = df_list[1:]

    for df in df_list:
        # Convert the longitude and latitude points to map coordinates
        x, y = m(df['longitude'].values, df['latitude'].values)
        # Plot them using round markers
        m.plot(x, y, "o", zorder=100, *args, **kwargs)

    return fig
Okay, so finally, let's plot all the data.
In [6]:
fig = GeneratePlot(df, color="r")
Okay, there is clearly a large amount of data in and around my home in the UK, then several holidays and conferences spread across Europe. The first thing to do will be to separate our data geographically.
Clustering algorithms are a good way to separate data such as this. However, in this data set an unreasonably large fraction of the points lies in one cluster (around the UK). While there may be some algorithms which can handle this, a more sensible method is to use our intuition first. We can do this by bisecting the data based on the distance from a single point. For simplicity we will use Greenwich as this point (the longitude is rather easy to remember here).
The distance between any two points $A$ and $B$ with latitudes and longitudes $(\phi, \lambda)$ is given by the haversine formula:
\begin{equation} d = 2r \arcsin\left(\sqrt{\mathrm{haversine}(\phi_{A} - \phi_{B}) + \cos(\phi_A)\cos(\phi_B)\,\mathrm{haversine}(\lambda_{A} - \lambda_{B})}\right) \end{equation}
where $r$ is the Earth's radius and the haversine function is given by
\begin{equation} \mathrm{haversine}(\theta) = \sin^{2}\left(\frac{\theta}{2}\right) \end{equation}
Knowing the latitude and longitude of Greenwich, we can code this up as a Python function and add the resulting distance as a new column to our data frame.
In [7]:
def Haversine(theta):
    return np.sin(theta/2.0)**2

def DistanceFromGreenwhich(lat, lon):
    """ Return the great-circle distance (in metres) from Greenwich """
    R = 6.371e6  # Earth's radius in m
    latG, lonG = 51.48, 0.00  # Greenwich latitude and longitude
    latG = np.radians(latG)
    lonG = np.radians(lonG)
    lat = np.radians(lat)
    lon = np.radians(lon)
    arg = Haversine(lat - latG) + np.cos(latG)*np.cos(lat)*Haversine(lon - lonG)
    return 2 * R * np.arcsin(np.sqrt(arg))

df['DistanceFromGreenwhich'] = DistanceFromGreenwhich(df.latitude, df.longitude)

fig, ax = plt.subplots()
out = ax.hist(df.DistanceFromGreenwhich * 1e-3, bins=50)
ax.set_xlabel("Distance from Greenwich (km)")
ax.set_ylabel("Count")
plt.show()
This shows a clear gap between the points recorded at home and those recorded abroad, so any cut-off of a few hundred kilometres will separate the two; below we use 300 km. Let's generate a new dataframe containing the away data only.
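Before doing so, a quick sanity check on the distance function and the chosen threshold (the reference value of roughly 340 km for Greenwich to central Paris is approximate):
In [ ]:
# Greenwich to central Paris (approx. 48.86 N, 2.35 E) should come out near 340 km
print(DistanceFromGreenwhich(48.86, 2.35) * 1e-3)

# Count how many points fall on either side of the 300 km threshold
print((df.DistanceFromGreenwhich > 300e3).value_counts())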
In [13]:
df_away = df[df.DistanceFromGreenwhich > 300e3].copy(deep=True)
fig = GeneratePlot(df_away, color="r")
We now have a good set of sample data on which to apply a basic clustering algorithm. Since the clusters are
of relatively even size, and by eye we can easily assign 7 clusters, we will use the K-means clustering
algorithm. Several Python implementations exist; for now I
will use the kmeans and vq routines from scipy.cluster.vq.
In [35]:
from scipy.cluster.vq import kmeans, vq

# Stack the coordinates into an (N, 2) array and run k-means with 7 clusters
data = np.vstack((df_away.latitude.values, df_away.longitude.values)).T
centroids, _ = kmeans(data, 7, iter=50, thresh=1e-9)

# Assign each point to its nearest centroid
idx, _ = vq(data, centroids)
df_away['cluster_idx'] = idx
This has created a new column in the data frame: the index of the cluster to which k-means has assigned each point. We can plot all the data frames, indicating the cluster by colour, as follows. Note that the original data frame is passed in as the first element of df_list; it is used only to set up the map and is otherwise ignored thanks to the ignore_first flag.
In [36]:
df_list = [df]
for idx in df_away.cluster_idx.unique():
    df_list.append(df_away[df_away.cluster_idx == idx])

fig = GeneratePlot(df_list, ignore_first=True)
We have successfully clustered the data by region. It should be noted that some tweaking of the iter and thresh arguments of the kmeans routine was required.
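As an aside, scikit-learn provides an alternative K-means implementation whose k-means++ initialisation tends to be less sensitive to the starting centroids than a plain random start. A minimal sketch, assuming scikit-learn is installed (the cluster_idx_sklearn column name is just for illustration):
In [ ]:
from sklearn.cluster import KMeans

# k-means++ initialisation with several restarts; labels come straight from fit_predict
km = KMeans(n_clusters=7, n_init=10)
df_away['cluster_idx_sklearn'] = km.fit_predict(data)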
In [ ]: