Our goal is to reconstruct users' travel trajectories from photos with geo-tag and timestamp around Melbourne area. To do this, we use the YFCC100M dataset dataset which is a collection of photos/videos uploaded to Flickr between 2004 and 2014.
The original YFCC100M dataset contains 100 million photos/videos and does not provide any trajectories about users. In this notebook, we will describe how we reconstruct user trajectories from the original dataset.
(*We will not distinguish a photo and video in the rest of this notebook)
Three major steps for constructing trajectories:
Step 1. Extract photos taken near Melbourne area from original YFCC100M dataset (filtering_bigbox.py
)
Step 2. Extract candidate trajectories based on the extracted photos (generate_tables.py
)
Step 3. Filter out some abnormal trajectories using various criteria (this notebook
)
Below we will describe the details of each step with source codes.
In [1]:
%matplotlib inline
import os
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import generate_tables
# data files
data_dir = '../data/'
table0 = 'Melb_photos_bigbox.csv'
table1 = 'trajectory_photos.csv'
table2 = 'trajectory_stats.csv'
raw_table = os.path.join(data_dir, table0)
photo_table = os.path.join(data_dir, table1)
traj_table = os.path.join(data_dir, table2)
# parameters for generating trajectories from extracted photos
time_gap = 8 # hour
minimum_photo = 1 # minimum number of photos for each trajectory
lng_min = 144.597363 # small bounding box
lat_min = -38.072257
lng_max = 145.360413
lat_max = -37.591764
# parameters for filtering trajectories
minimum_distance = 0.5 # km
speed_filter = 1 #(0: filter by average speed, 1: filter by point-to-point speed)
maximum_speed = 100 # km/h
maximum_duration = 1440 # minutes
minimum_duration = 30 #minute
The original YFCC100M dataset contains 100million photos. We are only interested in user behavior near Melbourne area. Therefore, we first extract the photos belongs to the below region to reduce the further computational cost.
The latitudes and longitudes of this region is described in data/Melbourne-bbox.kml
file.
This process can be done by src/filtering_bigbox.py
.
filtering_bigbox.py
file takes the original YFCC100M file to extract photos and videos from above region as well as a time window [2000-01-01 00:00:00, 2015-03-05 23:59:59]
, then generates a cvs file containing:
The usage of this file is :
python filtering_bigbox.py YFCC100M_DATA_FILE
which will generate filtered output out.YFCC100M_DATA_FILE
file.
The original YFCC100M data files are not incorporated in this repository. But we incorporate the filtered output in data/Melb_photos_bigbox.csv
Here are some basic statistics after extracting relevant photos from the YFCC100M.
In [2]:
raw = pd.read_csv(raw_table, parse_dates=[2], skipinitialspace=True)
print('Number of photos:', raw['Photo_ID'].shape[0])
print('Number of users: ', raw['User_ID'].unique().shape[0])
raw[['Longitude', 'Latitude', 'Accuracy']].describe()
Out[2]:
We also plot the location of extracted photos. The high density area represents the areas where a lot of photos has been taken.
In [3]:
plt.figure(figsize=[8, 8])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.scatter(raw['Longitude'], raw['Latitude'])
Out[3]:
From the extracted photos, we reconstruct user trajectories using geo-tag and timestamp of photos as follows:
src/generate_tables.py
will generate the inital trajectories using arguments.
The usage of this file is :
`python generate_tables.py extracted_points_file lng_min lat_min lng_max lat_max minimum_photo time_gap
with arguments:
extracted_points_file
= the output of src/filtering_bigbox.py
lng_min
= min longitude of target regionlat_min
= min latitude of target regionlng_max
= max longtidue of target regionlat_max
= max latitude of targer regionminimum_photo
= minimum number of photos for each trajectorytime_gap
= Split the sorted photos into trajectories if the time gap between two consecutive photos is greater than this
In [4]:
extracted_points_file = raw_table # outputfile path of extracted points
%run generate_tables $extracted_points_file $lng_min $lat_min $lng_max $lat_max $minimum_photo $time_gap
This will result two data files: 1 trajectory data file (\$photo table), and 2 trajectory statistic file (\$traj_table).
trajectory data file: each entry(line) of this file reprsents single photo with additional information about the photo:
* Trajectory_ID: trajectory ID of entry (multiple entries belong to the same trajectory will have the same trajectory ID) * Photo_ID: Unique Photo ID of entry * User_ID: User ID * Timestamp: Timestamp of when the photo was taken * Longitude: Longitude of entry * Latitude: Latitude of entry * Accuracy: GPS Accuracy level (16 - the most accurate, 1 - the least accurate) * Marker: 0 if the entry is photo, 1 if the entry is video * URL: flickr URL to the entry
trajectory statistic file: each entry(line) of this file represents single trajectory with addtional information about the trajectory:
* Trajectory_ID: Unique trajectory ID * User_ID: User ID * #Photo: Number of photos in the trajectory * Start_Time: When the first photo was taken * Travel_Distance(km): Sum of the distances between sequantially consecutive photos (Euclidean Distance) * Total_Time(min): The time gap between the first photo and the last photo * Average_Speed(km/h): Travel_Distances(km)/Total_Time(h)
We read these files by using pandas library for further processing:
In [5]:
traj = pd.read_csv(photo_table, parse_dates=[3], skipinitialspace=True)
traj_stats = pd.read_csv(traj_table, parse_dates=[3], skipinitialspace=True)
Here are the basic statistics about candidate trajectories from src/generate tables.py
:
In [6]:
num_photo = traj['Photo_ID'].shape[0]
num_user = traj_stats['User_ID'].unique().shape[0]
num_traj = traj_stats['Trajectory_ID'].shape[0]
print('Number of photos:', num_photo)
print('Number of users: ', num_user)
print('Number of trajectories:', num_traj)
print('Average number of photos per user:', num_photo / num_user)
print('Average number of trajectories per user:', num_traj / num_user)
In [7]:
traj_stats[['#Photo', 'Travel_Distance(km)', 'Total_Time(min)', 'Average_Speed(km/h)']].describe()
Out[7]:
We plot the location of extracted photos in the cadidate trajectories. The high density area represents the areas where a lot of photos has been taken.
In [8]:
plt.figure(figsize=[8, 8])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.scatter(traj['Longitude'], traj['Latitude'])
Out[8]:
In [9]:
plt.figure(figsize=[18, 10])
plt.subplot(2,2,1)
plt.xlabel('#Photo')
plt.ylabel('#Trajectory')
plt.title('Histogram of #Photo in trajectories')
ax0 = traj_stats['#Photo'].hist(bins=50)
ax0.set_yscale('log')
plt.subplot(2,2,2)
plt.xlabel('Travel Distance (km)')
plt.ylabel('#Trajectory')
plt.title('Histogram of Travel Distance of Trajectories')
ax1 = traj_stats['Travel_Distance(km)'].hist(bins=50)
ax1.set_yscale('log')
plt.subplot(2,2,3)
plt.xlabel('Total Time (minute)')
plt.ylabel('#Trajectory')
plt.title('Histogram of Total Time of Trajectories')
ax2 = traj_stats['Total_Time(min)'].hist(bins=50)
ax2.set_yscale('log')
plt.subplot(2,2,4)
plt.xlabel('Average Speed (km/h)')
plt.ylabel('#Trajectory')
plt.title('Histogram of Average Speed of Trajectories')
ax3 = traj_stats['Average_Speed(km/h)'].hist(bins=50)
ax3.set_yscale('log')
As these histogram indicates, there are several abnormal trajectories in this dataset. For example, some trajectories span several days, some trajectory shows improbably high speed (3500000 km/h), and travel distance of some trajectories is almost zero which might not be interesting as a travel trajectory.
In the following section, we will provide guidelines to filter out these abnormal trajectories.
After getting an initial list of trajectories, we further filter out improbable trajectories with various criteria. We use four different criteria as follows:
maximum_duration
, minimum_duration
).minimum_distance
)The list of arguments we used to generate final trajectories are available at the top of the notebook.
First, we filter out trajectories which have suspiciously long or short travel times. We want to see the one-day long travel trajectories of users, and also want to avoid the trajectory that are captured in very short time.
In this step, we filtered out the trajectories of which travel time is greather than maximum_duration or less than minimum_duration.
In [10]:
traj_stats1 = traj_stats[traj_stats['Total_Time(min)'] < maximum_duration]
traj_stats1 = traj_stats1[traj_stats1['Total_Time(min)'] > minimum_duration]
traj1 = traj[traj['Trajectory_ID'].isin(traj_stats1['Trajectory_ID'])]
Here's the histogram of travel time before and after filtering. We removed several trajectories of which travel time is less than 30 min and greater than 24 hours.
In [11]:
plt.figure(figsize=[18, 5])
plt.subplot(1,2,1)
plt.xlabel('Travel_Time(min)')
plt.ylabel('#traj')
plt.title('Before Filtering')
ax0 = traj_stats['Total_Time(min)'].hist(bins=50)
ax0.set_yscale('log')
plt.subplot(1,2,2)
plt.xlabel('Travel_Time(min)')
plt.ylabel('#traj')
plt.title('After filtering')
ax1 = traj_stats1['Total_Time(min)'].hist(bins=50)
ax1.set_yscale('log')
To be a meaningful trajectory, the travel distance of trajactory spans at least several hundred meters. Extremely short travel distance only shows the interesting area where the photo has been taken.
To get the trajectory, we filter out the trajectories of which travel distance is less than minimum_distance.
In [12]:
traj_stats2 = traj_stats1[traj_stats1['Travel_Distance(km)'] > minimum_distance]
traj2 = traj[traj['Trajectory_ID'].isin(traj_stats2['Trajectory_ID'])]
Here's the histogram of travel distances before and after filtering. Trajectories with very short travel distance has been removed from our dataset.
In [13]:
plt.figure(figsize=[18, 5])
plt.subplot(1,2,1)
plt.xlabel('Travel_Distance(km)')
plt.ylabel('#traj')
plt.title('Before Filtering')
ax1 = traj_stats1['Travel_Distance(km)'].hist(bins=50)
ax1.set_yscale('log')
plt.subplot(1,2,2)
plt.xlabel('Travel_Distance(km)')
plt.ylabel('#traj')
plt.title('After filtering')
ax2 = traj_stats2['Travel_Distance(km)'].hist(bins=50)
ax2.set_yscale('log')
In [14]:
traj_stats_new = traj_stats2
traj_new = traj2
Some trajectories have suspiciously high speed. It may caused by various reasons. For example, errors in GPS system or errors in time stampmight yeild super sonic users.
There are two (or more) alternative ways to filter out trajectory which has suspiciously high speed.
Here, we provide two filtering method: (the switch to use one of the methods can be set at the top of the notebook)
We check average speed of every trajectory, and then throw out all trajectories of which average speed is less than predefined maximum_speed
In [15]:
if speed_filter == 0:
traj_stats_new = traj_stats_new[traj_stats_new['Average_Speed(km/h)'] < maximum_speed]
traj_new = traj_new[traj_new['Trajectory_ID'].isin(traj_stats_new['Trajectory_ID'])]
In [16]:
if speed_filter == 0:
plt.figure(figsize=[18, 5])
plt.subplot(1,2,1)
plt.xlabel('Average_Speed(km/h)')
plt.ylabel('#traj')
ax = traj_stats_new['Average_Speed(km/h)'].hist(bins=50)
ax.set_yscale('log')
The first approach might be inefficient when the improbable speed occurs by GPS calibration error. To keep as much information as possible, we propose more sophisticated method to recover information from abnormal trajectories.
There are four cases of improbably fast trajectory might be happened
The first and second cases are easy to recover by cutting the corresponding point. But it seems we could not easily decide which point(s) should be cut for third and fourth cases. We've decided to remove trajectories in case 3 and 4.
Compute point-to-point speed before filtering
In [17]:
speeds = []
if speed_filter == 1:
for tid in traj_stats_new['Trajectory_ID']:
photos = traj_new[traj_new['Trajectory_ID'] == tid]
if photos.shape[0] < 2: continue
for i in range(len(photos.index)-1):
idx1 = photos.index[i]
idx2 = photos.index[i+1]
dist = generate_tables.calc_dist(photos.loc[idx1, 'Longitude'], photos.loc[idx1, 'Latitude'], \
photos.loc[idx2, 'Longitude'], photos.loc[idx2, 'Latitude'])
seconds = (photos.loc[idx1, 'Timestamp'] - photos.loc[idx2, 'Timestamp']).total_seconds()
if seconds == 0: continue
speed = dist * 60. * 60. / abs(seconds)
speeds.append(speed)
Histogram of point-to-point speed before filtering
In [18]:
#S = [100, 150, 200, 250, 500, 1000, 1236, 100000]
S = [100, 150, 200, 250]
if speed_filter == 1:
p2pspeeds = pd.Series(speeds)
plt.figure(figsize=[18,20])
for it in range(len(S)):
plt.subplot(4,2,it+1)
plt.xlabel('Point-to-Point Speed (km/h)')
plt.ylabel('#Point-pair')
plt.title('Speed < ' + str(S[it]) + ' km/h')
ax = p2pspeeds[p2pspeeds < S[it]].hist(bins=50)
ax.set_yscale('log')
Drop the first/last point in a trajectories for case1/case2, drop enter trajectories for case3 and case4
In [19]:
if speed_filter == 1:
# raise an exception if assigning value to a copy (instead of the original data) of DataFrame
pd.set_option('mode.chained_assignment','raise')
traj_stats_new = traj_stats_new.copy()
traj_new = traj_new.copy()
indicator_traj = pd.Series(data=np.ones(traj_stats_new.shape[0], dtype=np.bool), index=traj_stats_new.index)
indicator_photo = pd.Series(data=np.ones(traj_new.shape[0], dtype=np.bool), index=traj_new.index)
cnt1 = 0
cnt2 = 0
cnt34 = 0
for i in traj_stats_new['Trajectory_ID'].index:
tid = traj_stats_new.loc[i, 'Trajectory_ID']
photos = traj_new[traj_new['Trajectory_ID'] == tid]
if photos.shape[0] <= 2:
if traj_stats_new.loc[i, 'Average_Speed(km/h)'] > maximum_speed: # drop the trajectory
indicator_traj.loc[i] = False
indicator_photo.loc[photos.index] = False
continue
# trajectory: 1-->2-->...-->3-->4, 2 and 3 could be the same
idx1 = photos.index[0]
idx2 = photos.index[1]
idx3 = photos.index[-2]
idx4 = photos.index[-1]
d12 = generate_tables.calc_dist(photos.loc[idx1, 'Longitude'], photos.loc[idx1, 'Latitude'], \
photos.loc[idx2, 'Longitude'], photos.loc[idx2, 'Latitude'])
d24 = traj_stats_new.loc[i, 'Travel_Distance(km)'] - d12
t12 = abs((photos.loc[idx1, 'Timestamp'] - photos.loc[idx2, 'Timestamp']).total_seconds())
t24 = abs((photos.loc[idx2, 'Timestamp'] - photos.loc[idx4, 'Timestamp']).total_seconds())
# check case 1
if t12 == 0 or (d12 * 60. * 60. / t12) > maximum_speed: #photo1-->photo2, inf speed or large speed
if t24 == 0 or abs(d24) < 1e-3 or (d24 * 60. * 60. / t24) > maximum_speed: # drop the trajectory
indicator_traj.loc[i] = False
indicator_photo.loc[photos.index] = False
continue
else: # case 1, drop the first photo, update trajectory statistics
assert(d24 > 0.)
#traj_stats.ix[i]['Start_Time'] = photos.ix[idx2]['Timestamp'] # SettingWithCopyWarning
indicator_photo.loc[idx1] = False
traj_stats_new.loc[i, 'Start_Time'] = photos.loc[idx2, 'Timestamp']
traj_stats_new.loc[i, 'Travel_Distance(km)'] = d24
traj_stats_new.loc[i, 'Total_Time(min)'] = t24 / 60.
traj_stats_new.loc[i, 'Average_Speed(km/h)'] = d24 * 60. * 60. / t24
cnt1 += 1
continue
# check case 2
d34 = generate_tables.calc_dist(photos.loc[idx3, 'Longitude'], photos.loc[idx3, 'Latitude'], \
photos.loc[idx4, 'Longitude'], photos.loc[idx4, 'Latitude'])
d13 = traj_stats_new.loc[i, 'Travel_Distance(km)'] - d34
t34 = abs((photos.loc[idx3, 'Timestamp'] - photos.loc[idx4, 'Timestamp']).total_seconds())
t13 = abs((photos.loc[idx1, 'Timestamp'] - photos.loc[idx3, 'Timestamp']).total_seconds())
if t34 == 0 or (d34 * 60. * 60. / t34) > maximum_speed: #photo3-->photo4, inf speed or large speed
if t13 == 0 or abs(d13) < 1e-3 or (d13 * 60. * 60. / t13) > maximum_speed: # drop the trajectory
indicator_traj.loc[i] = False
indicator_photo.loc[photos.index] = False
continue
else: # case 2, drop the last photo, update trajectory statistics
assert(d13 > 0.)
#traj_stats.ix[i]['Travel_Distance(km)'] = d13 # SettingWithCopyWarning
indicator_photo.loc[idx4] = False
traj_stats_new.loc[i, 'Travel_Distance(km)'] = d13
traj_stats_new.loc[i, 'Total_Time(min)'] = d13 / 60.
traj_stats_new.loc[i, 'Average_Speed(km/h)'] = d13 * 60. * 60. / t13
cnt2 += 1
continue
# case 3 or 4, drop trajectory
if traj_stats_new.loc[i, 'Average_Speed(km/h)'] > maximum_speed:
indicator_traj.loc[i] = False
indicator_photo.loc[photos.index] = False
cnt34 += 1
print('Number of trajectories in case 1:', cnt1)
print('Number of trajectories in case 2:', cnt2)
print('Number of trajectories in case 3 & 4:', cnt34)
traj_new = traj_new[indicator_photo]
traj_stats_new = traj_stats_new[indicator_traj]
Compute point-to-point speed after filtering
In [20]:
speeds_new = []
if speed_filter == 1:
for tid in traj_stats_new['Trajectory_ID']:
photos = traj_new[traj_new['Trajectory_ID'] == tid]
if photos.shape[0] < 2: continue
for i in range(len(photos.index)-1):
idx1 = photos.index[i]
idx2 = photos.index[i+1]
dist = generate_tables.calc_dist(photos.loc[idx1, 'Longitude'], photos.loc[idx1, 'Latitude'], \
photos.loc[idx2, 'Longitude'], photos.loc[idx2, 'Latitude'])
seconds = (photos.loc[idx1, 'Timestamp'] - photos.loc[idx2, 'Timestamp']).total_seconds()
if seconds == 0: continue
speed = dist * 60. * 60. / abs(seconds)
speeds_new.append(speed)
Histogram of point-to-point speed after filtering
In [21]:
#S = [100, 150, 200, 250, 500, 1000, 1236, 100000]
S = [100, 150, 200, 250]
if speed_filter == 1:
p2pspeeds_new = pd.Series(speeds_new)
plt.figure(figsize=[18,20])
for it in range(len(S)):
plt.subplot(4,2,it+1)
plt.xlabel('Point-to-Point Speed(km/h)')
plt.ylabel('#Point-pair')
plt.title('Speed < ' + str(S[it]) + ' km/h')
ax = p2pspeeds_new[p2pspeeds_new < S[it]].hist(bins=50)
ax.set_yscale('log')
In this section, we will show some basic statistics about our final trajectory data.
More detail analysis will be included in filckr_analysis.ipynb
and slides. Here we show simple stats from the final result.
In [22]:
num_photo = traj_new['Photo_ID'].shape[0]
num_user = traj_stats_new['User_ID'].unique().shape[0]
num_traj = traj_stats_new['Trajectory_ID'].shape[0]
print('Number of photos:', num_photo)
print('Number of users: ', num_user)
print('Number of trajectories:', num_traj)
print('Average number of photos per user:', num_photo / num_user)
print('Average number of trajectories per user:', num_traj / num_user)
In [23]:
traj_stats_new[['#Photo', 'Travel_Distance(km)', 'Total_Time(min)', 'Average_Speed(km/h)']].describe()
Out[23]:
Histograms of number of photos in trajectories, total time/distance and average speed of trajectories
In [24]:
plt.figure(figsize=[18, 10])
plt.subplot(2,2,1)
plt.xlabel('#Photo')
plt.ylabel('#Trajectory')
plt.title('Histogram of #Photo in trajectories after Filtering')
ax0 = traj_stats_new['#Photo'].hist(bins=50)
ax0.set_yscale('log')
plt.subplot(2,2,2)
plt.xlabel('Travel Distance (km)')
plt.ylabel('#Trajectory')
plt.title('Histogram of Travel Distance of Trajectories after Filtering')
ax1 = traj_stats_new['Travel_Distance(km)'].hist(bins=50)
ax1.set_yscale('log')
plt.subplot(2,2,3)
plt.xlabel('Total Time (minute)')
plt.ylabel('#Trajectory')
plt.title('Histogram of Total Time of Trajectories after Filtering')
ax2 = traj_stats_new['Total_Time(min)'].hist(bins=50)
ax2.set_yscale('log')
plt.subplot(2,2,4)
plt.xlabel('Average Speed (km/h)')
plt.ylabel('#Trajectory')
plt.title('Histogram of Average Speed of Trajectories after Filtering')
ax3 = traj_stats_new['Average_Speed(km/h)'].hist(bins=50)
ax3.set_yscale('log')
Save final trajectories to the data folder
In [25]:
file1 = os.path.join(data_dir + table1)
file2 = os.path.join(data_dir + table2)
traj_new.to_csv(file1, index=False)
traj_stats_new.to_csv(file2, index=False)