NOTE: Please view this page via IPython Notebook Viewer Service, otherwise the within-page links may not work properly.
The dataset used in this paper can be downloaded here. It also gives some description and statistics of this dataset. However, one critical portion of information is missing in this dataset, i.e. the geo-location of each Points of Interest(POI), which makes it impossible to calculate the travel time from one POI to another unless the longitude and latitude of each POI is provided by other means.
Simple statistics of this dataset
City | #POIs | #Users | #POI_Visits | #Travel_Sequences |
Toronto | 29 | 1,395 | 39,419 | 6,057 |
Osaka | 27 | 450 | 7,747 | 1,115 |
Glasgow | 27 | 601 | 11,434 | 2,227 |
Edinburgh | 28 | 1,454 | 33,944 | 5,028 |
NOTE: the number of photos for each city described in paper is NOT available in this dataset
Fortunately, this info could be retrived from the original YFCC100M dataset by search the individual photoID which is included in this dataset. The YFCC100M dataset could be downloaded from here easily (but with patience).
For each photo used in this paper, its POI ID is available in the dataset, thus, the longitude and latitude of each POI could be approximated by the mean value of all the corresponding photos' coordinates which could be retrived from the YFCC100M dataset.
To accelerate the searching process, first extract the photo id, longitude and latitude columns from the whole dataset
In [ ]:
cut -d $'\t' -f1,11,12 yfcc100m_dataset >> dataset.yfcc
then search the coordinates by grepping photo id
In [ ]:
cut -d ';' -f 1,4 uservisit.txt |while read line
do
photoid=`echo "$line" |cut -d ';' -f 1`
result=`grep -P "^$photoid\t" dataset.yfcc`
if [ ! -z "$result" ]; then
coords=`echo "$result" |sed 's/\t/:/g'`
echo "$coords" >> poi.coords
fi
done
For further accelertion, one could import YFCC100M dataset to a database, though very time-consuming, then search by photo id. The SQL statements for create a database and a table for the dataset looks like
In [ ]:
CREATE DATABASE yfcc100m;
CREATE TABLE yfcc100m.tdata(
pv_id BIGINT UNSIGNED NOT NULL UNIQUE PRIMARY KEY, /* Photo/video identifier */
longitude FLOAT, /* Longitude */
latitude FLOAT /* Latitude */
);
COMMIT;
and Python code for importing dataset
In [ ]:
import mysql.connector as db
def import_data(fname):
"""Import data records from file"""
dbconnection = db.connect(user='USERNAME', password='PASSWORD')
cursor = dbconnection.cursor()
with open(fname, 'r') as f:
for line in f:
items = line.split('\t')
assert(len(items) == 3)
pv_id = items[0]
longitude = items[1]
latitude = items[2]
if len(longitude.strip()) == 0 or len(latitude.strip()) == 0:
continue
sqlstr = 'INSERT INTO yfcc100m.tdata VALUES (' + pv_id + ', ' + longitude.strip() + ', ' + latitude.strip() + ')'
try:
cursor.execute(sqlstr)
except db.Error as error:
print('ERROR: {}'.format(error))
dbconnection.commit()
dbconnection.close()
Then, searching the coordinates of each photo is very easy and fast
In [ ]:
import mysql.connector as db
def search_coords(fin, fout):
"""Search Longitude and Latitude for each geo-tagged photo"""
dbconnection = db.connect(user='USERNAME', password='PASSWORD', database='yfcc100m')
cursor = dbconnection.cursor()
records = []
with open(fin, 'r') as f:
for line in f:
items = line.split(';')
assert len(items) == 7
photoID = items[0]
poiID = items[3]
sqlstr = "SELECT ROUND(longitude, 6), ROUND(latitude, 6) FROM tdata WHERE pv_id = " + photoID
cursor.execute(sqlstr)
for longitude, latitude in cursor:
records.append(poiID + ':' + photoID + ':' + str(longitude) + ':' + str(latitude))
dbconnection.commit()
dbconnection.close()
with open(fout, 'w') as f:
for line in records:
f.write(line + '\n')
The paper formulates an Integer Linear Programming(ILP) to recommend an itinerary given a set of POIs, a budget, a source and destination POI, it maximize an objective function which combines travelling time with personalized visit durations.
PuLP from the COIN-OR project is a useful Python library for modeling linear and integer programs, many LP solvers such as GLPK, CBC, CPLEX and Gurobi can be called to solve the model. Its comprehensive documentation is available here.The formulation details can be found in the MIP_recommend() method.
Is it necessary to consider visiting a certain POI more than one times? This paper ignores this setting.
Dealing with edge case $\bar{V}(p) = 0$
It appears when POIs at which just one photo was taken for each visited user (including some user just took/uploaded two or more photos with the same timestamp), the case does appear in this dataset.
For all users $U$, POI $p$, arrival time $p^a$ and depature time $p^d$, The Average POI Visit Duration is defined as: $\bar{V}(p) = \frac{1}{n}\sum_{u \in U}\sum_{p_x \in S_u}(t_{p_x}^d - t_{p_x}^a)\delta(p_x = p), \forall p \in P$
and Time-based User Interest is defined as: $Int_u^Time(c) = \sum_{p_x \in S_u} \frac{t_{p_x}^d - t_{p_x}^a}{\bar{V}(p_x)} \delta(Cat_{p_x} = c), \forall c \in C$
Up to now, two strategies have been tried:
CBC is still too slow for large sequences (length >= 4)
The interest of a user $u$ in POI category $c$: $ Int_u^{Time}(c) = \sum_{p_x \in S_u} \frac{(t_{p_x}^d - t_{p_x}^a)}{\bar{V}(p_x)}\delta(Cat_{p_x} = c), \forall c \in C $
where $Cat_{p_x}$ is the category of POI $p_x$, $t_{p_x}^a$ is user $u$'s arrival time at POI $p_x$ and $t_{p_x}^d$ is the departure time. $S_u$ is the user $u$'s tranvel history and $\delta(Cat_{p_x} = c)$ equals $1$ if $Cat_{p_x} = c$ and $0$ otherwise.$\bar{V}(p)$ is the average visit duration at POI $p$ for all users, $ \bar{V}(p) = \frac{1}{n} \sum_{u \in U} \sum_{p_x \in S_u} (t_{p_x}^d - t_{p_x}^a) \delta(p_x = p), \forall p \in P $
Frequency-based User Interest $Int_u^{Freq}(c)$: the number of times user $u$ visits POIs of category $c$
Settings: Toronto, $\eta$=0.5 with time-based user interest and POI popularity, 42/335 ≈ 12.5% solutions are suboptimal, leave-one-out
Recall | Precision | F1-score | |
Paper | 0.779±0.010 | 0.706±0.013 | 0.732±0.012 |
Reproduce | 0.732±0.179 | 0.736±0.181 | 0.734±0.179 |
Length of Recall, Precision and F1-score: 335
Unique values of Recall, Precision, F1-score: 0.40, 0.44, 0.50, 0.57, 0.60, 0.63, 0.67, 0.71, 0.75, 0.78, 0.80, 0.83, 0.85, 1.00
Box plot of Recall, Precision and F1-score
Settings: Toronto, $\eta$=0.5 with frequency-based user interest and POI popularity, 43/335 ≈ 12.8% solutions are suboptimal, leave-one-out
Recall | Precision | F1-score | |
Paper | 0.760±0.009 | 0.679±0.013 | 0.708±0.012 |
Reproduce | 0.709±0.179 | 0.711±0.180 | 0.710±0.179 |
Length of Recall, Precision and F1-score: 335
Unique values of Recall, Precision, F1-score: 0.40, 0.43, 0.50, 0.56, 0.57, 0.60, 0.63, 0.67, 0.71, 0.75, 0.80, 0.83, 0.85, 1.00
Settings for recommended trajectories: Toronto, $\eta$=0.5 with time-based user interest and 59/335 ≈ 17.6% solutions are suboptimal
NOTE: the value of matrix[i, j] denotes the probability of visiting category j after visiting category i for an average visitor.
Transition matrix for recommended trajectories:
Amusement | Beach | Cultural | Shopping | Sport | Structure | |
Amusement | 0.035 | 0.266 | 0.259 | 0.098 | 0.266 | 0.077 |
Beach | 0.177 | 0.222 | 0.254 | 0.113 | 0.069 | 0.165 |
Cultural | 0.209 | 0.295 | 0.175 | 0.107 | 0.124 | 0.090 |
Shopping | 0.146 | 0.404 | 0.281 | 0.000 | 0.090 | 0.079 |
Sport | 0.229 | 0.219 | 0.305 | 0.086 | 0.086 | 0.076 |
Structure | 0.212 | 0.282 | 0.247 | 0.094 | 0.141 | 0.024 |
Transition matrix for actual trajectories:
Amusement | Beach | Cultural | Shopping | Sport | Structure | |
Amusement | 0.091 | 0.117 | 0.344 | 0.110 | 0.240 | 0.097 |
Beach | 0.059 | 0.127 | 0.183 | 0.269 | 0.056 | 0.305 |
Cultural | 0.131 | 0.204 | 0.114 | 0.201 | 0.064 | 0.286 |
Shopping | 0.057 | 0.358 | 0.219 | 0.057 | 0.065 | 0.244 |
Sport | 0.317 | 0.183 | 0.167 | 0.103 | 0.063 | 0.167 |
Structure | 0.084 | 0.303 | 0.261 | 0.200 | 0.077 | 0.074 |
For recommended tour/trajectories with length 3, 4 and 5, enumerating all possibilities are much easier to understand the recommendation results than using ILP to get a single best recommendation.
Settings for enumeration: Toronto, $\eta$=0.5 with time-based user interest, length for actual and recommended trajectories $\in \{3, 4, 5\}$, data are available here.
Interactive Python scripts to generate boxplots of enumerated trajectories' scores as well as the score of the actual travel sequences, e.g.
In [ ]:
python3 plot_enumseq.py Toro_eta05_time_seqs.list
file Toro_eta05_time_seqs.list
is available here.
Visit duration matrix $M$ for each user at each POI, data: POI_visit_duration.zip.
For each cell of a matrix, $M_{ij}$, denotes the $log10$ of visit duration for user $i$ at POI $j$, i.e. $M_{ij}$ = $log10$(visit duration of user $i$ at POI $j$) if the visit duration is greater than $0$ and $0$ otherwise.