All students:
Evaluate: Are you able to interpret the results and justify your interpretation based on the observed data?
Advanced students:
Apply: Are you able to execute code (using the supplied examples) that performs the required functionality on supplied or generated data sets?
Analyze: Are you able to pick the relevant method or library to resolve specific stated questions?
Create: Are you able to produce notebooks that serve as a computational record of a session, and that can be used to share your insights with others?
By the end of this notebook you will be expected to understand and apply the steps involved in geotagging, which are the following:
- Match the records in time.
- Compute the median location of each access point (AP).
- Filter out the non-stationary routers.
- Exercise 1: Identification of stationary WiFi routers.
- Exercise 2 [Advanced]: Identification of non-stationary WiFi routers.
This notebook uses the same Dartmouth StudentLife data set as previous exercises. In this exercise, you will combine WiFi scans with location information to create a small database of WiFi access point (AP) locations, using Google's location services. You will be replicating the work of Piotr Sapieżyński et al. (2015).
You will start by importing the necessary modules and variable definitions.
In [ ]:
# Load relevant libraries.
from os import path
import pandas as pd
import numpy as np
import folium
import glob
from tqdm import tqdm
import random
%matplotlib inline
# Load custom modules.
import sys
sys.path.append('..')
from utils import getmedian, haversine
from utils import llaToECEF as coords_to_geomedian
from utils import ECEFTolla as geomedian_to_coords
from IPython.display import Image
In [ ]:
# Define the data paths.
wifi_path = '../data/dartmouth/wifi'
location_path = '../data/dartmouth/location/'
For simplicity, start by reviewing a single user's data, examining its properties, and seeing whether the analysis yields any results of value. You should be familiar with both WiFi and location data from previous exercises, so they will be loaded and presented without extensive explanation.
In [ ]:
# Load WiFi data.
u00_wifi = pd.read_csv(path.join(wifi_path, 'wifi_u00.csv'))
u00_wifi.head(3)
In [ ]:
# Load location data.
u00_loc = pd.read_csv(path.join(location_path, 'gps_u00.csv'))
u00_loc.head(3)
In [ ]:
# Remove columns from WiFi dataset.
u00_wifi.drop(['freq', 'level'], axis=1, inplace=True)
u00_wifi.head(3)
In [ ]:
# Remove irrelevant columns from location dataset.
u00_loc.drop(['provider', 'network_type', 'bearing', 'speed', 'travelstate'], axis=1, inplace=True)
u00_loc.head(3)
The accuracy reported in location records is interpreted counterintuitively: the higher the value, the less accurate the measurement. It denotes the radius of a circle within which 68% (one standard deviation) of the reported coordinates fall. Since the range of an outdoor access point can reach 250 metres (Sapieżyński et al. 2015), it is safe to assume a more conservative measure of 200 metres (at an elevated risk of classifying routers as non-stationary).
The accuracy of location measurements is a major source of noise, hence the need for additional consideration. To do that, you need to plot the cumulative distribution of the accuracy.
In [ ]:
# Plot histogram of accuracy observations.
u00_loc.accuracy.hist(cumulative=True, density=1, histtype='step', bins=100)
It looks like the data set contains quite accurate location measurements, as a visual inspection of the histogram suggests that almost 90% of the observations have relatively good accuracy. It is therefore safe to select only the most accurate observations.
Using the Pandas "describe" function, you can get a quick view of the data set.
In [ ]:
# Review the dataset with the Pandas describe function.
u00_loc.accuracy.describe()
Next, determine how many observations to keep. The impact of using an accuracy value of 40 is demonstrated in the cell below.
In [ ]:
# Determine the number of records meeting our threshold of 40 for accuracy.
result = len(u00_loc[u00_loc.accuracy <= 40]) / float(len(u00_loc))
print('Proportion of records that meet the criteria is {:.1f}%'.format(100*result))
Approximately 73% of the records meet the criterion, and this threshold will be used as a filter in subsequent steps.
In [ ]:
# Make a copy of the original dataset before applying the filter.
u00_loc_raw = u00_loc.copy()
# Apply the filter.
u00_loc = u00_loc[u00_loc['accuracy'] <= 40]
# Get the lengths of each of the data objects.
original_location_count = len(u00_loc_raw)
filtered_location_count = len(u00_loc)
print("Number of location observations before filtering: {}".format(original_location_count))
print("Number of observations remaining after filtering: {}".format(filtered_location_count))
Drop the accuracy column from the DataFrame, as it is no longer required.
In [ ]:
# Update the object to remove accuracy.
u00_loc.drop('accuracy', axis=1, inplace=True)
# Display the head of the new dataset.
u00_loc.head(3)
Note:
For certain methods, Pandas has the option of applying changes to data sets "inplace". While convenient, this feature should be used with care, as earlier cells may no longer produce the same results if re-executed. The guiding principle is to use this feature only in data cleaning and wrangling steps that you will not need to go back and revisit.
Should you need to revisit earlier steps, you can either restart the notebook and execute all the cells up to that point, or only execute the cells needed to get the object in the required form to continue your analysis.
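As a minimal illustration (using a hypothetical throwaway frame, not the StudentLife data), the two styles are contrasted below; the non-inplace form returns a new DataFrame and leaves the original intact for later re-use.
In [ ]:
# Hypothetical throwaway example contrasting inplace modification with reassignment.
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
demo.drop('b', axis=1, inplace=True)     # Mutates demo; the original frame is gone.
demo2 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
demo2_trimmed = demo2.drop('b', axis=1)  # Returns a new frame; demo2 is unchanged.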
In order to geotag, location and WiFi readouts need to be matched based on the time of the observations. As in the paper by Sapieżyński et al. (2015), readouts will be constrained to those happening at exactly the same second, to reduce the impact of readouts from moving vehicles.
There are three steps involved in geotagging:
- Match the records in time.
- Compute the median location of each access point (AP).
- Filter out the non-stationary routers.
These three steps will be explored in further detail in the following sections of this notebook.
This requires the use of Pandas to join the DataFrames based on time (much like SQL's JOIN). First, set the time as the index with the "df.set_index()" method, and then join the DataFrames with the "df.join()" method.
In [ ]:
# Set the index for WiFi.
u00_wifi = u00_wifi.set_index('time')
u00_wifi.head(3)
In [ ]:
# Set the index for location.
u00_loc = u00_loc.set_index('time')
u00_loc.head(3)
With time as the index in both DataFrames, you can simply "join" them on the index by leaving the "on" argument at its default value of "None", as demonstrated below.
A JOIN clause is used to merge DataFrames by combining rows from two or more tables based on a common field between them. The most common type of join is the "inner join": an inner join between two tables (A and B) returns only the rows from A and B where the join condition is met, that is, the intersection of the two tables.
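For intuition, the following minimal sketch (using hypothetical toy frames, not the StudentLife data) shows how an inner join on the index keeps only the rows whose index values appear in both DataFrames.
In [ ]:
# Hypothetical toy frames illustrating an inner join on the index.
left = pd.DataFrame({'BSSID': ['a', 'b']}, index=[1, 2])
right = pd.DataFrame({'latitude': [43.7, 43.8]}, index=[2, 3])
# Only index value 2 appears in both frames, so only that row survives.
left.join(right, how='inner')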
In [ ]:
# Join the two data sets, print the number of records found and display the head of the new dataset.
u00_raw_geotags = u00_wifi.join(u00_loc, how='inner', on=None)
print('{} WiFi records found time matching location records.'.format(len(u00_raw_geotags)))
u00_raw_geotags.head(3)
It is time to account for possible noise and remove the routers with sparse data (i.e., fewer than five observations, as in the referenced paper). The Pandas "df.groupby()" method will be used to do this.
In [ ]:
# Create object u00_groups by grouping the geotags by access point.
u00_groups = u00_raw_geotags.groupby('BSSID')
# Create a new object containing only the groups that meet the filter criterion.
u00_geotags = u00_groups.filter(lambda gr: len(gr) >= 5)
# Re-group after trimming so that subsequent steps use only the remaining APs.
u00_groups = u00_geotags.groupby('BSSID')
print("{} geotagged records remained after trimming for sparse data.".format(len(u00_geotags)))
print("They correspond to {} unique router APs.".format(len(u00_groups)))
Define stationary routers as those for which 95% of observations fall inside a radius of 200 metres from the geometric median of all of the observations. To compute the medians and calculate the distances, you will use the custom functions imported from the "utils" directory at the start of this notebook.
In order to compute the geometric medians with the tools at your disposal, the "getmedian()" method needs properly-formatted data: a list of points, where each point is an array of "latitude", "longitude", and "altitude". The algorithm accepts input in degrees.
In [ ]:
# Create a new DataFrame with latitude and longitude.
u00_geo_medians = pd.DataFrame(columns=[u'latitude', u'longitude'])
# Transform the data set using the provided set of utilities.
for (BSSID, geotags) in u00_groups:
    geotags = [row for row in np.array(geotags[['latitude', 'longitude', 'altitude']])]
    geotags = [coords_to_geomedian(row) for row in geotags]
    median = getmedian(geotags)
    median = geomedian_to_coords(median)[:2]
    u00_geo_medians.loc[BSSID] = median
After completing the above, you will have your geomedians, and will be ready to move on to the last step, which is to filter out the non-stationary access points.
In [ ]:
# Display the head of the geomedians object.
u00_geo_medians.head(3)
In [ ]:
# Calculate the distances from the geomedian.
u00_distances = {}
for BSSID, geotags in u00_groups:
    u00_distances[BSSID] = []
    (lat_median, lon_median) = u00_geo_medians.loc[BSSID]
    for (lat, lon) in np.array(geotags[['latitude', 'longitude']]):
        # haversine() returns the distance in km; convert to metres.
        u00_distances[BSSID].append(haversine(lon, lat, lon_median, lat_median) * 1000)
Now, check how many of the routers pass the threshold. Iterate over the access points, and compute the ratio of measurements within the threshold to all measurements. Each access point is assigned to "static" or "others" based on whether this ratio meets your confidence level.
In [ ]:
# Group access points as static or non-static.
# Set the thresholds.
distance_threshold = 200
confidence_level = 0.95
# Create empty lists.
static = []
others = []
for BSSID, distances in u00_distances.items():
    all_count = len(distances)
    near_count = len([d for d in distances if d <= distance_threshold])
    if near_count / float(all_count) >= confidence_level:
        static.append(BSSID)
    else:
        others.append(BSSID)
# Print summary results.
print("We identified {} static routers and {} non-static (moved or mobile).".format(len(static), len(others)))
The tagged routers (access points) can now be visualized on a map.
In [ ]:
# Plot the access points on a map.
map_center = list(u00_geo_medians.median())
routers_map = folium.Map(location=map_center, zoom_start=14)
# Add a point to the map for each of the static routers.
for router in static:
    folium.CircleMarker(list(u00_geo_medians.loc[router]), fill_color='red', radius=15, fill_opacity=0.5).add_to(routers_map)
# Display the map.
routers_map
Note:
In order to validate your results, you can compare the computed medians for two of the access points with their known location (lat: 43.7068263, long: -72.2868704).
In [ ]:
# Set the provided location.
lat = 43.7068263
lon = -72.2868704
bssid1 = '00:01:36:57:be:88'
bssid2 = '00:01:36:57:be:87'
You can now compare this to your computed values.
In [ ]:
u00_geo_medians.loc[[bssid1, bssid2]]
The results are acceptable. You can compute the actual distance between the points with the "haversine" function.
In [ ]:
# Calculate and display the difference between the calculated and the Google API provided locations.
lat_m1, lon_m1 = u00_geo_medians.loc[bssid1]
lat_m2, lon_m2 = u00_geo_medians.loc[bssid2]
print('Distance from the Google API provided location to our first router '
      'estimate is {:2g}m'.format(haversine(lon, lat, lon_m1, lat_m1) * 1000))
print('Distance from the Google API provided location to our second router '
      'estimate is {:2g}m'.format(haversine(lon, lat, lon_m2, lat_m2) * 1000))
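The imported "haversine" function is not shown in this notebook. For reference, a standard spherical-Earth sketch of the computation is given below, matching the (lon, lat) argument order used above; the actual utils implementation may differ in detail.
In [ ]:
# A minimal sketch of the haversine great-circle distance (in km), assuming
# the imported utils.haversine behaves similarly. Inputs are in degrees and
# use the (lon, lat) argument order seen above.
from math import radians, sin, cos, asin, sqrt

def haversine_sketch(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    # Haversine formula on a sphere with mean Earth radius 6371 km.
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Example: the distance between the two router medians, in metres.
haversine_sketch(lon_m1, lat_m1, lon_m2, lat_m2) * 1000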
Next, repeat the analysis from the previous section for all users. This analysis will be used in the next exercise.
Note:
There will be less contextual information provided in this section, as the details have already been provided in Section 1 of this notebook.
This section utilizes two new libraries that fall outside the scope of this course. Interested students can read more about glob, which is used to list the files in the specified directory, and tqdm, which is used to render a progress bar. Set your variables, then create the required function to process the input files, and finally, execute this function to process all the files.
Note:
Processing a large amount of files or records can be time consuming. It is good practice to include progress bars to provide visual feedback where applicable.
In [ ]:
# Set variables.
all_geotags = pd.DataFrame(columns=['time','BSSID','latitude','longitude','altitude'])
all_geotags = all_geotags.set_index('time')
pcounter = 0
In [ ]:
# Define a function to build the dataset, all_geotags, using the input files supplied.
def build_ds(file_in, all_geotags):
    # Get the user id from the filename.
    user_id = path.basename(file_in)[5:-4]
    # Read the WiFi and location data for the user.
    wifi = pd.read_csv(file_in)
    loc = pd.read_csv(path.join(location_path, 'gps_' + user_id + '.csv'))
    # Filter out location data not meeting the accuracy threshold.
    loc = loc[loc.accuracy <= 40]
    # Drop the columns not required.
    wifi.drop(['freq', 'level'], axis=1, inplace=True)
    loc.drop(['accuracy', 'provider', 'network_type', 'bearing', 'speed', 'travelstate'], axis=1, inplace=True)
    # Index the datasets based on time.
    loc = loc.set_index('time')
    wifi = wifi.set_index('time')
    # Join the datasets based on the time index.
    raw_tags = wifi.join(loc, how='inner')
    # Return the dataset for the user (as a one-element list, for appending).
    return [raw_tags]
In [ ]:
# Iterate through the files in the specified directory and append the results of the function to the all_geotags variable.
for f in tqdm(glob.glob(wifi_path + '/*.csv')):
    # Concatenate the result from our function for each input file supplied.
    # (pd.concat is used here because DataFrame.append was removed in recent pandas.)
    all_geotags = pd.concat([all_geotags] + build_ds(f, all_geotags))
In [ ]:
print("{} all geotags found".format(len(all_geotags)))
all_groups = all_geotags.groupby('BSSID')
print("{} unique routers found".format(len(all_groups)))
# Drop sparsely populated access points.
all_geotags = all_groups.filter(lambda gr: len(gr)>=5)
all_groups = all_geotags.groupby('BSSID')
print("{} unique router APs remaining after dropping routers with sparse data".format(len(all_groups)))
In [ ]:
# Create a new DataFrame to contain all the geomedian coordinates.
all_geo_medians = pd.DataFrame(columns=[u'latitude', u'longitude'])
# Compute the geomedians and add them to all_geo_medians.
# Initiate the progress bar.
with tqdm(total=len(all_groups)) as pbar:
    # Iterate through the data in all_groups as per the single-user example.
    for BSSID, geotags in all_groups:
        geotags = [row for row in np.array(geotags[['latitude', 'longitude', 'altitude']])]
        geotags = [coords_to_geomedian(row) for row in geotags]
        median = getmedian(geotags)
        median = geomedian_to_coords(median)[:2]
        all_geo_medians.loc[BSSID] = median
        pbar.update()
In [ ]:
# Calculate the distances from the geomedian.
all_distances = {}
# Initiate the progress bar.
with tqdm(total=len(all_groups)) as pbar:
    # Iterate through the data in all_groups as per the single-user example.
    for BSSID, geotags in all_groups:
        all_distances[BSSID] = []
        (lat_median, lon_median) = all_geo_medians.loc[BSSID]
        for (lat, lon) in np.array(geotags[['latitude', 'longitude']]):
            all_distances[BSSID].append(haversine(lon, lat, lon_median, lat_median) * 1000)
        pbar.update()
In [ ]:
# Group access points as static or non-static.
# Set the thresholds.
distance_threshold = 200
confidence_level = 0.95
# Create empty lists.
all_static = []
all_others = []
for BSSID, distances in all_distances.items():
    all_count = len(distances)
    near_count = len([d for d in distances if d <= distance_threshold])
    if near_count / float(all_count) >= confidence_level:
        all_static.append(BSSID)
    else:
        all_others.append(BSSID)
# Print summary results.
print("We identified {} static routers and {} non-static (moved or mobile).".format(len(all_static), len(all_others)))
In [ ]:
# Plot the access points on a map.
all_map_center = list(all_geo_medians.median())
all_routers_map = folium.Map(location=all_map_center, zoom_start=10)
# Randomly sample 1000 static access points to plot.
# A fixed random seed is used for reproducibility of results.
random.seed(3)
rn_static = random.sample(all_static, 1000)
# Add the points in rn_static to the map.
for router in rn_static:
    folium.CircleMarker(list(all_geo_medians.loc[router]), fill_color='red', radius=15, fill_opacity=0.5).add_to(all_routers_map)
# Display the map.
all_routers_map
Question: Are you able to locate a static router located outside North America? If so, where?
Your markdown answer here.
In [ ]:
Exercise complete:
This is a good time to "Save and Checkpoint".
Note:
This activity is for advanced students only and extra credit will be allocated. Students will not be penalized for not completing this activity.
Can you identify moving BSSIDs and plot the observed locations on a map?
This is not a trivial task (compared to geotagging stationary routers) and there are many possible ways to perform the analysis.
Input: All data points for an access point.
Output: Boolean mobility status.
- Perform agglomerative clustering. (This is not covered in the scope of this course.)
- Discard clusters with n<5 observations as noise.
- Compare pairwise timeframes in which each of the clusters were observed.
- If at least two clusters overlap (in time), consider the AP as mobile.
Note:
Keep in mind that other, possibly better, solutions exist, and you are encouraged to provide your input.
Hints:
You can start by reviewing SciPy; one possible skeleton using the following imports is sketched after the hints.
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
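The sketch below is one possible starting point, not a model answer. The helper name "is_mobile", the degree-based distance threshold, and the single-linkage clustering choice are all assumptions for illustration; you are free to structure your answer differently.
In [ ]:
# One possible skeleton (an illustrative sketch, not a model answer). It follows
# the approach outlined above: cluster an AP's observations spatially, discard
# small clusters as noise, and flag the AP as mobile if two clusters overlap in
# time. The name is_mobile and the thresholds are assumptions for illustration.
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def is_mobile(geotags, min_cluster_size=5, degree_threshold=0.002):
    # geotags: a time-indexed DataFrame with 'latitude' and 'longitude' columns,
    # as produced by the groupby above. Roughly 0.002 degrees ~ 200 m at
    # mid-latitudes (a simplifying assumption for this sketch).
    coords = geotags[['latitude', 'longitude']].values
    if len(coords) < 2 * min_cluster_size:
        return False
    # Agglomerative (single-linkage) clustering on pairwise distances.
    labels = fcluster(linkage(pdist(coords), method='single'),
                      t=degree_threshold, criterion='distance')
    times = geotags.index.values
    # Keep the observation timeframes of sufficiently large clusters.
    intervals = []
    for label in np.unique(labels):
        cluster_times = times[labels == label]
        if len(cluster_times) >= min_cluster_size:
            intervals.append((cluster_times.min(), cluster_times.max()))
    # If any two clusters' timeframes overlap, the AP appeared in two distinct
    # places during the same period, so consider it mobile.
    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            (s1, e1), (s2, e2) = intervals[i], intervals[j]
            if s1 <= e2 and s2 <= e1:
                return True
    return False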
In [ ]:
# Your answer here.
# Please add as many cells as you require in this section.
In [ ]:
# Your plot here.
Exercise complete:
This is a good time to "Save and Checkpoint".
It is strongly recommended that you select "Close and Halt" when you have completed this notebook, to ensure that it does not continue to consume available resources.
In [ ]: