NOTE: Before running this notebook, please run the script src/ijcai15_setup.py
to set up the data properly.
The dataset used in this paper can be downloaded here, and a summary of the dataset is also available. Unfortunately, one critical piece of information is missing: the geo-coordinates of each POI, which are necessary for calculating the travel time from one POI to another. However, they can be approximated by averaging the (longitude, latitude) of all photos mapped to a specific POI, after retrieving the coordinates of those photos from the original YFCC100M dataset via their photoID.
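For illustration, here is a minimal sketch of this averaging step on toy data (the column names mirror those of the merged `traj` DataFrame built later in the notebook, where the same computation is performed in In [11]):

```python
import pandas as pd

# Toy example: three photos mapped to two POIs, with per-photo coordinates.
photos = pd.DataFrame({
    'photoID':  [101, 102, 103],
    'poiID':    [1, 1, 2],
    'photoLon': [-79.390, -79.388, -79.420],
    'photoLat': [43.642, 43.644, 43.670],
})

# Approximate each POI's location by the mean coordinates of its photos.
poi_coords = (photos.groupby('poiID')[['photoLon', 'photoLat']]
                    .mean()
                    .rename(columns={'photoLon': 'poiLon', 'photoLat': 'poiLat'}))
print(poi_coords)
```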
Simple statistics of this dataset
| City | #POIs | #Users | #POI_Visits | #Travel_Sequences |
|-----------|----|-------|--------|-------|
| Toronto   | 29 | 1,395 | 39,419 | 6,057 |
| Osaka     | 27 | 450   | 7,747  | 1,115 |
| Glasgow   | 27 | 601   | 11,434 | 2,227 |
| Edinburgh | 28 | 1,454 | 33,944 | 5,028 |
NOTE: the number of photos for each city described in the paper is NOT available in this dataset
To compute the mean coordinates of all photos mapped to a POI, we need to look up the coordinates of each photo among the 100 million records. To speed up the search, first extract the photo ID, longitude and latitude columns from the whole dataset
cut -d $'\t' -f1,11,12 yfcc100m_dataset >> dataset.yfcc
and then import them into a database created with the following SQL script
CREATE DATABASE yfcc100m;
CREATE TABLE yfcc100m.tdata(
pv_id BIGINT UNSIGNED NOT NULL UNIQUE PRIMARY KEY, /* Photo/video identifier */
longitude FLOAT, /* Longitude */
latitude FLOAT /* Latitude */
);
COMMIT;
The Python script to import the data into the database looks like this:
import mysql.connector as db
def import_data(fname):
dbconnection = db.connect(user='USERNAME', password='PASSWORD')
cursor = dbconnection.cursor()
with open(fname, 'r') as f:
for line in f:
items = line.split('\t')
assert(len(items) == 3)
pv_id = items[0].strip()
lon = items[1].strip()
lat = items[2].strip()
if len(lon) == 0 or len(lat) == 0:
continue
sqlstr = 'INSERT INTO yfcc100m.tdata VALUES (' + pv_id + ', ' + lon + ', ' + lat + ')'
try:
cursor.execute(sqlstr)
except db.Error as error:
print('ERROR: {}'.format(error))
dbconnection.commit()
dbconnection.close()
The Python script to look up the coordinates of the photos looks like this:
import mysql.connector as db
def search_coords(fin, fout):
dbconnection = db.connect(user='USERNAME', password='PASSWORD', database='yfcc100m')
cursor = dbconnection.cursor()
with open(fout, 'w') as fo:
with open(fin, 'r') as fi:
for line in fi:
items = line.split(';')
assert(len(items) == 7)
photoID = items[0].strip()
sqlstr = 'SELECT longitude, latitude FROM tdata WHERE pv_id = ' + photoID
cursor.execute(sqlstr)
for longitude, latitude in cursor:
fo.write(photoID + ';' + str(longitude) + ';' + str(latitude) + '\n')
dbconnection.commit()
dbconnection.close()
The retrieved results are available and will be downloaded automatically by running the script src/ijcai15_setup.py.
For user $u$ with travel history $S_u$, POI $p$, arrival time $t_p^a$ and departure time $t_p^d$, define the average POI visit duration as
\begin{equation*}
\bar{V}(p) = \frac{1}{n}\sum_{u \in U}\sum_{p_x \in S_u} \left(t_{p_x}^d - t_{p_x}^a\right)\delta(p_x = p), \quad \forall p \in P
\end{equation*}
where $n$ is the number of visits to POI $p$; the popularity $Pop(p)$ is the total number of visits to $p$.
Define the (time-based) interest of user $u$ in POI category $c$ as
\begin{equation*}
Int_u^{Time}(c) = \sum_{p_x \in S_u} \frac{t_{p_x}^d - t_{p_x}^a}{\bar{V}(p_x)}\,\delta(Cat_{p_x} = c), \quad \forall c \in C
\end{equation*}
(the frequency-based variant simply counts the user's visits to POIs of category $c$).
Evaluation metrics: let $P_r$ be the set of POIs of the recommended trajectory and $P_v$ the set of POIs visited in the real-life travel sequence; then
\begin{equation*}
\text{Recall} = \frac{|P_r \cap P_v|}{|P_v|}, \qquad \text{Precision} = \frac{|P_r \cap P_v|}{|P_r|}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{equation*}
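As a minimal illustration on toy data of how $\bar{V}(p)$ and $Int_u^{Time}(c)$ can be computed (the `calc_poi_info` and `calc_user_interest` functions below do the same on the training sequences):

```python
import pandas as pd

# Toy visit records: one row per (user, POI) visit with its duration in seconds.
visits = pd.DataFrame({
    'userID':           ['u1', 'u1', 'u2', 'u2'],
    'poiID':            [1, 2, 1, 2],
    'poiTheme':         ['Park', 'Museum', 'Park', 'Museum'],
    'poiDuration(sec)': [600.0, 1800.0, 1200.0, 900.0],
})

# Average POI visit duration: mean visit duration over all visits to each POI.
avg_duration = visits.groupby('poiID')['poiDuration(sec)'].mean()

# Time-based interest: sum of duration / avg_duration over the user's visits to
# POIs of each category (the frequency-based variant uses the visit count).
visits['timeRatio'] = visits['poiDuration(sec)'] / visits['poiID'].map(avg_duration)
time_interest = visits.groupby(['userID', 'poiTheme'])['timeRatio'].sum()
print(avg_duration, time_interest, sep='\n\n')
```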
The paper formulates the itinerary recommendation problem as an integer linear program (ILP) as follows.
Given a set of POIs, a time budget $B$ and the starting/ending POIs $p_1$/$p_N$, recommend a trajectory $(p_1,\dots,p_N)$ to user $u$ that

\begin{equation*}
\text{Maximize} \quad \sum_{i=2}^{N-1} \sum_{j=2}^{N} x_{i,j} \left(\eta\, Int(u, Cat_{p_i}) + (1-\eta)\, Pop(p_i)\right)
\end{equation*}

Subject to

\begin{align*}
\sum_{j=2}^N x_{1,j} &= \sum_{i=1}^{N-1} x_{i,N} = 1 && \text{(starts at $p_1$, ends at $p_N$)} \\
\sum_{i=1}^{N-1} x_{i,k} &= \sum_{j=2}^{N} x_{k,j} \le 1, \quad \forall k = 2,\dots,N-1 && \text{(connected; enter/leave $p_k$ at most once)} \\
q_i - q_j + 1 &\le (N-1)(1-x_{i,j}), \quad \forall i,j = 2,\dots,N && \text{(sub-tour elimination)} \\
\sum_{i=1}^{N-1} \sum_{j=2}^N x_{i,j} \left(Time(p_i, p_j) + Int(u, Cat_{p_j}) \cdot \bar{V}(p_j)\right) &\le B && \text{(time budget)}
\end{align*}

where $x_{i,j} \in \{0, 1\}$ indicates that POI $p_j$ is visited immediately after POI $p_i$, and $q_i \in \{2,\dots,N\}$ are auxiliary ordering variables used for sub-tour elimination.
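As a sanity check on the sub-tour elimination constraints (these are the standard Miller–Tucker–Zemlin constraints, realised in the code below by the dummy variables named `u`): whenever $x_{i,j} = 1$ the constraint reduces to
\begin{equation*}
q_j \ge q_i + 1,
\end{equation*}
so the ordering variables strictly increase along every selected edge among $p_2,\dots,p_N$. If a solution contained a closed loop $p_{i_1} \to p_{i_2} \to \dots \to p_{i_k} \to p_{i_1}$ that avoids $p_1$, summing this inequality around the loop would give $0 \ge k$ (the left-hand side telescopes to zero), a contradiction; hence only a single connected path from $p_1$ to $p_N$ can be selected.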
In [1]:
%matplotlib inline
import os
import re
import sys
import math
import pulp
import random
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
In [2]:
speed = 4 # 4km/h
random.seed(123456789)
In [3]:
data_dir = 'data/data-ijcai15'
#fvisit = os.path.join(data_dir, 'userVisits-Osak.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Osak.csv')
#fvisit = os.path.join(data_dir, 'userVisits-Glas.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Glas.csv')
#fvisit = os.path.join(data_dir, 'userVisits-Edin.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Edin.csv')
fvisit = os.path.join(data_dir, 'userVisits-Toro.csv')
fcoord = os.path.join(data_dir, 'photoCoords-Toro.csv')
In [4]:
suffix = fvisit.split('-')[-1].split('.')[0]
frecseq = os.path.join(data_dir, 'reccommendSeq-' + suffix + '.pkl')
In [5]:
visits = pd.read_csv(fvisit, sep=';')
visits.head()
Out[5]:
In [6]:
coords = pd.read_csv(fcoord, sep=';')
coords.head()
Out[6]:
In [7]:
# merge data frames according to column 'photoID'
assert(visits.shape[0] == coords.shape[0])
traj = pd.merge(visits, coords, on='photoID')
traj.head()
Out[7]:
In [8]:
pd.DataFrame([traj[['photoLon', 'photoLat']].min(), traj[['photoLon', 'photoLat']].max(), \
traj[['photoLon', 'photoLat']].max() - traj[['photoLon', 'photoLat']].min()], \
index = ['min', 'max', 'range'])
Out[8]:
In [9]:
plt.figure(figsize=[15, 5])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.scatter(traj['photoLon'], traj['photoLat'], marker='+')
Out[9]:
In [10]:
num_photo = traj['photoID'].unique().shape[0]
num_user = traj['userID'].unique().shape[0]
num_seq = traj['seqID'].unique().shape[0]
num_poi = traj['poiID'].unique().shape[0]
pd.DataFrame([num_photo, num_user, num_seq, num_poi, num_photo/num_user, num_seq/num_user], \
index = ['#photo', '#user', '#seq', '#poi', '#photo/user', '#seq/user'], columns=[str(suffix)])
Out[10]:
Compute POI (Longitude, Latitude) as the average coordinates of the assigned photos.
In [11]:
poi_coords = traj[['poiID', 'photoLon', 'photoLat']].groupby('poiID').agg(np.mean)
poi_coords.reset_index(inplace=True)
poi_coords.rename(columns={'photoLon':'poiLon', 'photoLat':'poiLat'}, inplace=True)
poi_coords.head()
Out[11]:
Extract POI category and visiting frequency.
In [12]:
poi_catfreq = traj[['poiID', 'poiTheme', 'poiFreq']].groupby('poiID').first()
poi_catfreq.reset_index(inplace=True)
poi_catfreq.head()
Out[12]:
In [13]:
poi_all = pd.merge(poi_catfreq, poi_coords, on='poiID')
poi_all.set_index('poiID', inplace=True)
poi_all.head()
Out[13]:
In [14]:
seq_all = traj[['userID', 'seqID', 'poiID', 'dateTaken']].copy()\
.groupby(['userID', 'seqID', 'poiID']).agg([np.min, np.max])
seq_all.head()
Out[14]:
In [15]:
seq_all.columns = seq_all.columns.droplevel()
seq_all.head()
Out[15]:
In [16]:
seq_all.reset_index(inplace=True)
seq_all.head()
Out[16]:
In [17]:
seq_all.rename(columns={'amin':'arrivalTime', 'amax':'departureTime'}, inplace=True)
seq_all['poiDuration(sec)'] = seq_all['departureTime'] - seq_all['arrivalTime']
seq_all.head()
Out[17]:
In [18]:
#tseq = seq_all[['poiID', 'poiDuration(sec)']].copy().groupby('poiID').agg(np.mean)
#tseq
In [19]:
seq_user = seq_all[['seqID', 'userID']].copy()
seq_user = seq_user.groupby('seqID').first()
seq_user.head()
Out[19]:
In [20]:
seq_len = seq_all[['userID', 'seqID', 'poiID']].copy()
seq_len = seq_len.groupby(['userID', 'seqID']).agg(np.size)
seq_len.reset_index(inplace=True)
seq_len.rename(columns={'poiID':'seqLen'}, inplace=True)
#seq_len.head()
ax = seq_len['seqLen'].hist(bins=20)
ax.set_yscale('log')
In [21]:
users = seq_all['userID'].unique()
transmat_time = pd.DataFrame(np.zeros((len(users), poi_all.index.shape[0]), dtype=np.float64), \
index=users, columns=poi_all.index)
In [22]:
poi_time = seq_all[['userID', 'poiID', 'poiDuration(sec)']].copy().groupby(['userID', 'poiID']).agg(np.sum)
poi_time.head()
Out[22]:
In [23]:
for idx in poi_time.index:
transmat_time.loc[idx[0], idx[1]] += poi_time.loc[idx].iloc[0]
print(transmat_time.shape)
transmat_time.head()
Out[23]:
In [24]:
# add 1 (sec) to each cell as a smooth factor
log10_transmat_time = np.log10(transmat_time.copy() + 1)
print(log10_transmat_time.shape)
log10_transmat_time.head()
Out[24]:
In [25]:
poi_cats = traj['poiTheme'].unique().tolist()
poi_cats.sort()
poi_cats
Out[25]:
In [26]:
ncats = len(poi_cats)
transmat_cat = pd.DataFrame(data=np.zeros((ncats, ncats), dtype=np.float64), index=poi_cats, columns=poi_cats)
In [29]:
for seqid in seq_all['seqID'].unique().tolist():
seqi = seq_all[seq_all['seqID'] == seqid].copy()
seqi.sort(columns=['arrivalTime'], ascending=True, inplace=True)
for j in range(len(seqi.index)-1):
idx1 = seqi.index[j]
idx2 = seqi.index[j+1]
poi1 = seqi.loc[idx1, 'poiID']
poi2 = seqi.loc[idx2, 'poiID']
cat1 = poi_all.loc[poi1, 'poiTheme']
cat2 = poi_all.loc[poi2, 'poiTheme']
transmat_cat.loc[cat1, cat2] += 1
transmat_cat
Out[29]:
Normalise each row to get an estimate of transition probabilities (MLE).
In [30]:
for r in transmat_cat.index:
rowsum = transmat_cat.ix[r].sum()
if rowsum == 0: continue # deal with lack of data
transmat_cat.loc[r] /= rowsum
transmat_cat
Out[30]:
Compute the log of transition probabilities with smooth factor $\epsilon=10^{-12}$.
In [31]:
log10_transmat_cat = np.log10(transmat_cat.copy() + 1e-12)
log10_transmat_cat
Out[31]:
A different leave-one-out cross-validation approach:
In [32]:
cv_seqs = seq_all[['userID', 'seqID', 'poiID']].copy().groupby(['userID', 'seqID']).agg(np.size)
cv_seqs.rename(columns={'poiID':'seqLen'}, inplace=True)
cv_seqs = cv_seqs[cv_seqs['seqLen'] > 2]
cv_seqs.reset_index(inplace=True)
print(cv_seqs.shape)
cv_seqs.head()
Out[32]:
In [33]:
cv_seq_set = []
In [34]:
# choose one sequence for each user in cv_seqs uniformly at random
for user in cv_seqs['userID'].unique():
seqlist = cv_seqs[cv_seqs['userID'] == user]['seqID'].tolist()
seqid = random.choice(seqlist)
cv_seq_set.append(seqid)
In [35]:
len(cv_seq_set)
Out[35]:
In [37]:
def calc_poi_info(seqid_set, seq_all, poi_all):
poi_info = seq_all[seq_all['seqID'].isin(seqid_set)][['poiID', 'poiDuration(sec)']].copy()
poi_info = poi_info.groupby('poiID').agg([np.mean, np.size])
poi_info.columns = poi_info.columns.droplevel()
poi_info.reset_index(inplace=True)
poi_info.rename(columns={'mean':'avgDuration(sec)', 'size':'popularity'}, inplace=True)
poi_info.set_index('poiID', inplace=True)
poi_info['poiTheme'] = poi_all.loc[poi_info.index, 'poiTheme']
poi_info['poiLon'] = poi_all.loc[poi_info.index, 'poiLon']
poi_info['poiLat'] = poi_all.loc[poi_info.index, 'poiLat']
return poi_info.copy()
In [38]:
def calc_user_interest(seqid_set, seq_all, poi_all, poi_info):
user_interest = seq_all[seq_all['seqID'].isin(seqid_set)][['userID', 'poiID', 'poiDuration(sec)']].copy()
user_interest['timeRatio'] = [poi_info.loc[x, 'avgDuration(sec)'] for x in user_interest['poiID']]
user_interest['timeRatio'] = user_interest['poiDuration(sec)'] / user_interest['timeRatio']
user_interest['poiTheme'] = [poi_all.loc[x, 'poiTheme'] for x in user_interest['poiID']]
user_interest.drop(['poiID', 'poiDuration(sec)'], axis=1, inplace=True)
user_interest = user_interest.groupby(['userID', 'poiTheme']).agg([np.sum, np.size]) # the sum
user_interest.columns = user_interest.columns.droplevel()
user_interest.rename(columns={'sum':'timeBased', 'size':'freqBased'}, inplace=True)
user_interest.reset_index(inplace=True)
user_interest.set_index(['userID', 'poiTheme'], inplace=True)
return user_interest.copy()
In [39]:
def calc_dist(longitude1, latitude1, longitude2, latitude2):
"""Calculate the distance (unit: km) between two places on earth"""
# convert degrees to radians
lon1 = math.radians(longitude1)
lat1 = math.radians(latitude1)
lon2 = math.radians(longitude2)
lat2 = math.radians(latitude2)
radius = 6371.009 # mean earth radius is 6371.009km, en.wikipedia.org/wiki/Earth_radius#Mean_radius
# The haversine formula, en.wikipedia.org/wiki/Great-circle_distance
dlon = math.fabs(lon1 - lon2)
dlat = math.fabs(lat1 - lat2)
return 2 * radius * math.asin(math.sqrt(\
(math.sin(0.5*dlat))**2 + math.cos(lat1) * math.cos(lat2) * (math.sin(0.5*dlon))**2 ))
In [40]:
def calc_dist_mat(poi_info):
poi_dist_mat = pd.DataFrame(data=np.zeros((poi_info.shape[0], poi_info.shape[0]), dtype=np.float64), \
index=poi_info.index, columns=poi_info.index)
for i in range(poi_info.index.shape[0]):
for j in range(i+1, poi_info.index.shape[0]):
r = poi_info.index[i]
c = poi_info.index[j]
dist = calc_dist(poi_info.loc[r, 'poiLon'], poi_info.loc[r, 'poiLat'], \
poi_info.loc[c, 'poiLon'], poi_info.loc[c, 'poiLat'])
assert(dist > 0.)
poi_dist_mat.loc[r, c] = dist
poi_dist_mat.loc[c, r] = dist
return poi_dist_mat
In [41]:
def calc_seq_budget(user, seq, poi_info, poi_dist_mat, user_interest):
"""Calculate the travel budget for the given travelling sequence"""
assert(len(seq) > 1)
budget = 0. # travel budget
for i in range(len(seq)-1):
px = seq[i]
py = seq[i+1]
assert(px in poi_info.index)
assert(py in poi_info.index)
budget += 60 * 60 * poi_dist_mat.loc[px, py] / speed # travel time (seconds)
caty = poi_info.loc[py, 'poiTheme']
avgtime = poi_info.loc[py, 'avgDuration(sec)']
userint = 0
if (user, caty) in user_interest.index: userint = user_interest.loc[user, caty] # for testing set
budget += userint * avgtime # expected visit duration
return budget
In [42]:
def recommend_ILP(user, budget, startPoi, endPoi, poi_info, poi_dist_mat, eta, speed, user_interest):
assert(0 <= eta <= 1); assert(budget > 0)
p0 = str(startPoi); pN = str(endPoi); N = poi_info.index.shape[0]
# REF: pythonhosted.org/PuLP/index.html
pois = [str(p) for p in poi_info.index] # create a string list for each POI
prob = pulp.LpProblem('TourRecommendation', pulp.LpMaximize) # create problem
# visit_i_j = 1 means POI i and j are visited in sequence
visit_vars = pulp.LpVariable.dicts('visit', (pois, pois), 0, 1, pulp.LpInteger)
# a dictionary contains all dummy variables
dummy_vars = pulp.LpVariable.dicts('u', [x for x in pois if x != p0], 2, N, pulp.LpInteger)
# add objective
objlist = []
for pi in [x for x in pois if x not in {p0, pN}]:
for pj in [y for y in pois if y != p0]:
cati = poi_info.loc[int(pi), 'poiTheme']
userint = 0; poipop = 0
if (user, cati) in user_interest.index: userint = user_interest.loc[user, cati]
if int(pi) in poi_info.index: poipop = poi_info.loc[int(pi), 'popularity']
objlist.append(visit_vars[pi][pj] * (eta * userint + (1.-eta) * poipop))
prob += pulp.lpSum(objlist), 'Objective'
# add constraints, each constraint should be in ONE line
prob += pulp.lpSum([visit_vars[p0][pj] for pj in pois if pj != p0]) == 1, 'StartAtp0'
prob += pulp.lpSum([visit_vars[pi][pN] for pi in pois if pi != pN]) == 1, 'EndAtpN'
for pk in [x for x in pois if x not in {p0, pN}]:
prob += pulp.lpSum([visit_vars[pi][pk] for pi in pois if pi != pN]) == \
pulp.lpSum([visit_vars[pk][pj] for pj in pois if pj != p0]), 'Connected_' + pk
prob += pulp.lpSum([visit_vars[pi][pk] for pi in pois if pi != pN]) <= 1, 'LeaveAtMostOnce_' + pk
prob += pulp.lpSum([visit_vars[pk][pj] for pj in pois if pj != p0]) <= 1, 'EnterAtMostOnce_' + pk
costlist = []
for pi in [x for x in pois if x != pN]:
for pj in [y for y in pois if y != p0]:
catj = poi_info.loc[int(pj), 'poiTheme']
traveltime = 60 * 60 * poi_dist_mat.loc[int(pi), int(pj)] / speed # seconds
userint = 0; avgtime = 0
if (user, catj) in user_interest.index: userint = user_interest.loc[user, catj]
if int(pj) in poi_info.index: avgtime = poi_info.loc[int(pj), 'avgDuration(sec)']
costlist.append(visit_vars[pi][pj] * (traveltime + userint * avgtime))
prob += pulp.lpSum(costlist) <= budget, 'WithinBudget'
for pi in [x for x in pois if x != p0]:
for pj in [y for y in pois if y != p0]:
prob += dummy_vars[pi] - dummy_vars[pj] + 1 <= (N - 1) * (1 - visit_vars[pi][pj]), \
'SubTourElimination_' + str(pi) + '_' + str(pj)
# solve problem
#prob.solve() # using PuLP's default solver
#prob.solve(pulp.PULP_CBC_CMD(options=['-threads', '8', '-strategy', '1', '-maxIt', '2000000'])) # CBC
#prob.solve(pulp.GLPK_CMD()) # GLPK
gurobi_options = [('TimeLimit', '7200'), ('Threads', '18'), ('NodefileStart', '0.9'), ('Cuts', '2')]
prob.solve(pulp.GUROBI_CMD(options=gurobi_options)) # GUROBI
print('status:', pulp.LpStatus[prob.status]) # print the status of the solution
#print('obj:', pulp.value(prob.objective)) # print the optimised objective function value
#for v in prob.variables(): # print each variable with it's resolved optimum value
# print(v.name, '=', v.varValue)
# if v.varValue != 0: print(v.name, '=', v.varValue)
visit_mat = pd.DataFrame(data=np.zeros((len(pois), len(pois)), dtype=np.float), index=pois, columns=pois)
for pi in pois:
for pj in pois: visit_mat.loc[pi, pj] = visit_vars[pi][pj].varValue
# build the recommended trajectory
recseq = [p0]
while True:
pi = recseq[-1]
pj = visit_mat.loc[pi].idxmax()
assert(round(visit_mat.loc[pi, pj]) == 1)
recseq.append(pj);
#print(recseq); sys.stdout.flush()
if pj == pN: return [int(x) for x in recseq]
In [43]:
cv_seq_dict = dict()
rec_seq_dict = dict()
In [46]:
for seqid in cv_seq_set:
seqi = seq_all[seq_all['seqID'] == seqid].copy()
seqi.sort(columns=['arrivalTime'], ascending=True, inplace=True)
cv_seq_dict[seqid] = seqi['poiID'].tolist()
In [47]:
eta = 0.5
time_based = True
In [48]:
doCompute = True
In [49]:
if os.path.exists(frecseq):
seq_dict = pickle.load(open(frecseq, 'rb'))
if (np.array(sorted(cv_seq_dict.keys())) == np.array(sorted(seq_dict.keys()))).all():
rec_seq_dict = seq_dict
doCompute = False
In [50]:
if doCompute:
n = 1
print('#sequences', len(cv_seq_set))
for seqid, seq in cv_seq_dict.items():
train_set = [x for x in seq_all['seqID'].unique() if x != seqid]
poi_info = calc_poi_info(train_set, seq_all, poi_all)
user_interest = calc_user_interest(train_set, seq_all, poi_all, poi_info)
poi_dist_mat = calc_dist_mat(poi_info)
user = seq_user.loc[seqid].iloc[0]
the_user_interest = None
if time_based == True: the_user_interest = user_interest['timeBased'].copy()
else: the_user_interest = user_interest['freqBased'].copy()
budget = calc_seq_budget(user, seq, poi_info, poi_dist_mat, the_user_interest)
print(n, 'sequence', seq, ', user', user, ', budget', budget); sys.stdout.flush()
recseq = recommend_ILP(user, budget, seq[0], seq[-1], poi_info, poi_dist_mat, eta, speed, the_user_interest)
rec_seq_dict[seqid] = recseq
print('->', recseq, '\n'); sys.stdout.flush()
n += 1
pickle.dump(rec_seq_dict, open(frecseq, 'wb'))
Results from the paper (Toronto data, time-based user interest, eta=0.5):
In [51]:
def calc_recall_precision_F1score(seq_act, seq_rec):
assert(len(seq_act) > 0)
assert(len(seq_rec) > 0)
actset = set(seq_act)
recset = set(seq_rec)
intersect = actset & recset
recall = len(intersect) / len(seq_act)
precision = len(intersect) / len(seq_rec)
F1score = 2. * precision * recall / (precision + recall)
return recall, precision, F1score
In [52]:
recall = []
precision = []
F1score = []
In [53]:
for seqid in rec_seq_dict.keys():
assert(seqid in cv_seq_dict)
seq = cv_seq_dict[seqid]
recseq = rec_seq_dict[seqid]
r, p, F1 = calc_recall_precision_F1score(seq, recseq)
recall.append(r)
precision.append(p)
F1score.append(F1)
In [54]:
print('Recall:', np.mean(recall), np.std(recall))
print('Precision:', np.mean(precision), np.std(precision))
print('F1-score:', np.mean(F1score), np.std(F1score))
The paper states: "We evaluate PERSTOUR and the baselines using leave-one-out cross-validation [Kohavi, 1995] (i.e., when evaluating a specific travel sequence of a user, we use this user's other travel sequences for training our algorithms)."
It is not entirely clear whether, when evaluating a travel sequence of a user, only that user's other sequences are used for training, or all other sequences in the dataset.
Trajectories with at least three POIs are used in the paper.
In [558]:
seq_ge3 = seq_len[seq_len['seqLen'] >= 3]
seq_ge3['seqLen'].hist(bins=20)
Out[558]:
Split the travelling sequences into a training set and a testing set using leave-one-out for each user.
For testing purposes, users with fewer than two travelling sequences are not considered in this experiment.
In [559]:
train_set = []
test_set = []
In [560]:
user_seqs = seq_ge3[['userID', 'seqID']].groupby('userID')
In [561]:
for user, indices in user_seqs.groups.items():
if len(indices) < 2: continue
idx = random.choice(indices)
test_set.append(seq_ge3.loc[idx, 'seqID'])
train_set.extend([seq_ge3.loc[x, 'seqID'] for x in indices if x != idx])
In [562]:
print('#seq in trainset:', len(train_set))
print('#seq in testset:', len(test_set))
seq_ge3[seq_ge3['seqID'].isin(train_set)]['seqLen'].hist(bins=20)
#data = np.array(seqs1['seqLen'])
#hist, bins = np.histogram(data, bins=3)
#print(hist)
Out[562]:
In [563]:
seq_exp = seq_ge3[['userID', 'seqID']].copy()
seq_exp = seq_exp.groupby('userID').agg(np.size)
seq_exp.reset_index(inplace=True)
seq_exp.rename(columns={'seqID':'#seq'}, inplace=True)
seq_exp = seq_exp[seq_exp['#seq'] > 1] # user with more than 1 sequences
print('total #seq for experiment:', seq_exp['#seq'].sum())
#seq_exp.head()
Compute the average POI visit duration and POI popularity as defined at the top of the notebook.
In [564]:
poi_info = seq_all[seq_all['seqID'].isin(train_set)]
poi_info = poi_info[['poiID', 'poiDuration(sec)']].copy()
In [565]:
poi_info = poi_info.groupby('poiID').agg([np.mean, np.size])
poi_info.columns = poi_info.columns.droplevel()
poi_info.reset_index(inplace=True)
poi_info.rename(columns={'mean':'avgDuration(sec)', 'size':'popularity'}, inplace=True)
poi_info.set_index('poiID', inplace=True)
print('#poi:', poi_info.shape[0])
if poi_info.shape[0] < poi_all.shape[0]:
extra_index = list(set(poi_all.index) - set(poi_info.index))
extra_poi = pd.DataFrame(data=np.zeros((len(extra_index), 2), dtype=np.float64), \
index=extra_index, columns=['avgDuration(sec)', 'popularity'])
poi_info = poi_info.append(extra_poi)
print('#poi after extension:', poi_info.shape[0])
poi_info['poiTheme'] = poi_all.loc[poi_info.index, 'poiTheme']
poi_info['poiLon'] = poi_all.loc[poi_info.index, 'poiLon']
poi_info['poiLat'] = poi_all.loc[poi_info.index, 'poiLat']
poi_info.head()
Out[565]:
Compute the time/frequency-based user interest as defined at the top of the notebook.
In [567]:
user_interest = seq_all[seq_all['seqID'].isin(train_set)]
user_interest = user_interest[['userID', 'poiID', 'poiDuration(sec)']].copy()
In [568]:
user_interest['timeRatio'] = [poi_info.loc[x, 'avgDuration(sec)'] for x in user_interest['poiID']]
#user_interest[user_interest['poiID'].isin({9, 10, 12, 18, 20, 26})]
#user_interest[user_interest['timeRatio'] < 1]
user_interest.head()
Out[568]:
In [569]:
user_interest['timeRatio'] = user_interest['poiDuration(sec)'] / user_interest['timeRatio']
user_interest.head()
Out[569]:
In [570]:
user_interest['poiTheme'] = [poi_all.loc[x, 'poiTheme'] for x in user_interest['poiID']]
user_interest.drop(['poiID', 'poiDuration(sec)'], axis=1, inplace=True)
The paper defines the time-based user interest as a sum of time ratios, but the resulting expected visit duration $Int(u, Cat_p) \cdot \bar{V}(p)$ then becomes unrealistically large in some cases; switch between the sum and the mean below to compare their effects.
In [571]:
#user_interest = user_interest.groupby(['userID', 'poiTheme']).agg([np.sum, np.size]) # the sum
user_interest = user_interest.groupby(['userID', 'poiTheme']).agg([np.mean, np.size]) # try the mean value
In [572]:
user_interest.columns = user_interest.columns.droplevel()
#user_interest.rename(columns={'sum':'timeBased', 'size':'freqBased'}, inplace=True)
user_interest.rename(columns={'mean':'timeBased', 'size':'freqBased'}, inplace=True)
user_interest.reset_index(inplace=True)
user_interest.set_index(['userID', 'poiTheme'], inplace=True)
user_interest.head()
Out[572]:
In [573]:
#user_interest.columns.shape[0]
In [575]:
poi_dist_mat = pd.DataFrame(data=np.zeros((poi_info.shape[0], poi_info.shape[0]), dtype=np.float64), \
index=poi_info.index, columns=poi_info.index)
for i in range(poi_info.index.shape[0]):
for j in range(i+1, poi_info.index.shape[0]):
r = poi_info.index[i]
c = poi_info.index[j]
dist = calc_dist(poi_info.loc[r, 'poiLon'], poi_info.loc[r, 'poiLat'], \
poi_info.loc[c, 'poiLon'], poi_info.loc[c, 'poiLat'])
assert(dist > 0.)
poi_dist_mat.loc[r, c] = dist
poi_dist_mat.loc[c, r] = dist
In [577]:
def generate_ILP(lpFilename, user, budget, startPoi, endPoi, poi_info, poi_dist_mat, eta, speed, user_interest):
"""Recommend a trajectory given an existing travel sequence S_N,
the first/last POI and travel budget calculated based on S_N
"""
assert(0 <= eta <= 1)
assert(budget > 0)
p0 = str(startPoi)
pN = str(endPoi)
N = poi_info.index.shape[0]
# The MIP problem
# REF: pythonhosted.org/PuLP/index.html
# create a string list for each POI
pois = [str(p) for p in poi_info.index]
# create problem
prob = pulp.LpProblem('TourRecommendation', pulp.LpMaximize)
# visit_i_j = 1 means POI i and j are visited in sequence
visit_vars = pulp.LpVariable.dicts('visit', (pois, pois), 0, 1, pulp.LpInteger)
# a dictionary contains all dummy variables
dummy_vars = pulp.LpVariable.dicts('u', [x for x in pois if x != p0], 2, N, pulp.LpInteger)
# add objective
objlist = []
for pi in [x for x in pois if x not in {p0, pN}]:
for pj in [y for y in pois if y != p0]:
cati = poi_info.loc[int(pi), 'poiTheme']
userint = 0
if (user, cati) in user_interest.index: userint = user_interest.loc[user, cati]
objlist.append(visit_vars[pi][pj] * (eta * userint + (1.-eta) * poi_info.loc[int(pi), 'popularity']))
prob += pulp.lpSum(objlist), 'Objective'
# add constraints
# each constraint should be in ONE line
prob += pulp.lpSum([visit_vars[p0][pj] for pj in pois if pj != p0]) == 1, 'StartAtp0' # starts at the first POI
prob += pulp.lpSum([visit_vars[pi][pN] for pi in pois if pi != pN]) == 1, 'EndAtpN' # ends at the last POI
for pk in [x for x in pois if x not in {p0, pN}]:
prob += pulp.lpSum([visit_vars[pi][pk] for pi in pois if pi != pN]) == \
pulp.lpSum([visit_vars[pk][pj] for pj in pois if pj != p0]), \
'Connected_' + pk # the itinerary is connected
prob += pulp.lpSum([visit_vars[pi][pk] for pi in pois if pi != pN]) <= 1, \
'LeaveAtMostOnce_' + pk # LEAVE POIk at most once
prob += pulp.lpSum([visit_vars[pk][pj] for pj in pois if pj != p0]) <= 1, \
'EnterAtMostOnce_' + pk # ENTER POIk at most once
# travel cost within budget
costlist = []
for pi in [x for x in pois if x != pN]:
for pj in [y for y in pois if y != p0]:
catj = poi_info.loc[int(pj), 'poiTheme']
traveltime = 60 * 60 * poi_dist_mat.loc[int(pi), int(pj)] / speed # seconds
userint = 0
if (user, catj) in user_interest.index: userint = user_interest.loc[user, catj]
costlist.append(visit_vars[pi][pj] * (traveltime + userint * poi_info.loc[int(pj), 'avgDuration(sec)']))
prob += pulp.lpSum(costlist) <= budget, 'WithinBudget'
for pi in [x for x in pois if x != p0]:
for pj in [y for y in pois if y != p0]:
prob += dummy_vars[pi] - dummy_vars[pj] + 1 <= \
(N - 1) * (1 - visit_vars[pi][pj]), \
'SubTourElimination_' + str(pi) + '_' + str(pj) # TSP sub-tour elimination
# write problem data to an .lp file
prob.writeLP(lpFilename)
In [88]:
def extract_seq(seqid_set, seq_all):
"""Extract the actual sequences (i.e. a list of POI) from a set of sequence ID"""
seq_dict = dict()
for seqid in seqid_set:
seqi = seq_all[seq_all['seqID'] == seqid].copy()
seqi.sort(columns=['arrivalTime'], ascending=True, inplace=True)
seq_dict[seqid] = seqi['poiID'].tolist()
return seq_dict
In [579]:
train_seqs = extract_seq(train_set, seq_all)
In [580]:
lpDir = os.path.join(data_dir, 'lp_' + suffix)
if not os.path.exists(lpDir):
print('Please create directory "' + lpDir + '"')
In [581]:
eta = 0.5
#eta = 1
time_based = True
In [585]:
for seqid in sorted(train_seqs.keys()):
if not os.path.exists(lpDir):
print('Please create directory "' + lpDir + '"')
break
seq = train_seqs[seqid]
lpFile = os.path.join(lpDir, str(seqid) + '.lp')
user = seq_user.loc[seqid].iloc[0]
the_user_interest = None
if time_based == True:
the_user_interest = user_interest['timeBased'].copy()
else:
the_user_interest = user_interest['freqBased'].copy()
budget = calc_seq_budget(user, seq, poi_info, poi_dist_mat, the_user_interest)
print('generating ILP', lpFile, 'for user', user, 'sequence', seq, 'budget', round(budget, 2))
generate_ILP(lpFile, user, budget, seq[0], seq[-1], poi_info, poi_dist_mat, eta, speed, the_user_interest)
In [586]:
test_seqs = extract_seq(test_set, seq_all)
In [587]:
for seqid in sorted(test_seqs.keys()):
if not os.path.exists(lpDir):
print('Please create directory "' + lpDir + '"')
break
seq = test_seqs[seqid]
lpFile = os.path.join(lpDir, str(seqid) + '.lp')
user = seq_user.loc[seqid].iloc[0]
the_user_interest = None
if time_based == True:
the_user_interest = user_interest['timeBased'].copy()
else:
the_user_interest = user_interest['freqBased'].copy()
budget = calc_seq_budget(user, seq, poi_info, poi_dist_mat, the_user_interest)
print('generating ILP', lpFile, 'for user', user, 'sequence', seq, 'budget', round(budget, 2))
generate_ILP(lpFile, user, budget, seq[0], seq[-1], poi_info, poi_dist_mat, eta, speed, the_user_interest)
In [588]:
def load_solution_gurobi(fsol, startPoi, endPoi):
"""Load recommended itinerary from MIP solution file by GUROBI"""
seqterm = []
with open(fsol, 'r') as f:
for line in f:
if re.search('^visit_', line): # e.g. visit_0_7 1\n
item = line.strip().split(' ') # visit_21_16 1.56406801399038e-09\n
if round(float(item[1])) == 1:
fromto = item[0].split('_')
seqterm.append((int(fromto[1]), int(fromto[2])))
p0 = startPoi
pN = endPoi
recseq = [p0]
while True:
px = recseq[-1]
for term in seqterm:
if term[0] == px:
recseq.append(term[1])
if term[1] == pN:
return recseq
else:
seqterm.remove(term)
break
In [590]:
train_seqs_rec = dict()
In [591]:
solDir = os.path.join(data_dir, os.path.join('lp_' + suffix, 'eta05_time'))
#solDir = os.path.join(data_dir, os.path.join('lp_' + suffix, 'eta10_time'))
if not os.path.exists(solDir):
print('Directory for solution files', solDir, 'does not exist.')
In [592]:
for seqid in sorted(train_seqs.keys()):
if not os.path.exists(solDir):
print('Directory for solution files', solDir, 'does not exist.')
break
seq = train_seqs[seqid]
solFile = os.path.join(solDir, str(seqid) + '.lp.sol')
recseq = load_solution_gurobi(solFile, seq[0], seq[-1])
train_seqs_rec[seqid] = recseq
print('Sequence', seqid, 'Actual:', seq, ', Recommended:', recseq)
In [593]:
recall = []
precision = []
F1score = []
for seqid in train_seqs.keys():
r, p, F1 = calc_recall_precision_F1score(train_seqs[seqid], train_seqs_rec[seqid])
recall.append(r)
precision.append(p)
F1score.append(F1)
In [594]:
print('Recall:', round(np.mean(recall), 2), ',', round(np.std(recall), 2))
print('Precision:', round(np.mean(precision), 2), ',', round(np.std(recall), 2))
print('F1-score:', round(np.mean(F1score), 2), ',', round(np.std(recall), 2))
Results from the paper (Toronto data, time-based user interest, eta=0.5):
In [595]:
test_seqs_rec = dict()
In [596]:
solDirTest = os.path.join(data_dir, os.path.join('lp_' + suffix, 'eta05_time.test'))
if not os.path.exists(solDirTest):
print('Directory for solution files', solDirTest, 'does not exist.')
In [597]:
for seqid in sorted(test_seqs.keys()):
if not os.path.exists(solDirTest):
print('Directory for solution files', solDirTest, 'does not exist.')
break
seq = test_seqs[seqid]
solFile = os.path.join(solDirTest, str(seqid) + '.lp.sol')
recseq = load_solution_gurobi(solFile, seq[0], seq[-1])
test_seqs_rec[seqid] = recseq
print('Sequence', seqid, 'Actual:', seq, ', Recommended:', recseq)
In [598]:
recallT = []
precisionT = []
F1scoreT = []
for seqid in test_seqs.keys():
r, p, F1 = calc_recall_precision_F1score(test_seqs[seqid], test_seqs_rec[seqid])
recallT.append(r)
precisionT.append(p)
F1scoreT.append(F1)
In [599]:
print('Recall:', round(np.mean(recallT), 2), ',', round(np.std(recallT), 2))
print('Precision:', round(np.mean(precisionT), 2), ',', round(np.std(recallT), 2))
print('F1-score:', round(np.mean(F1scoreT), 2), ',', round(np.std(recallT), 2))
A large travel budget leads to unrealistic recommended trajectories.
Is it necessary to consider visiting a POI more than once? The paper ignores this setting.
Dealing with the edge case $\bar{V}(p) = 0$.
This occurs for POIs where every visiting user took just one photo (or uploaded two or more photos sharing the same timestamp), so the recorded visit duration is zero; such cases do appear in this dataset (see the sketch after the definitions below).
For all users $U$, POI $p$, arrival time $t_p^a$ and departure time $t_p^d$, the average POI visit duration is defined as $\bar{V}(p) = \frac{1}{n}\sum_{u \in U}\sum_{p_x \in S_u}(t_{p_x}^d - t_{p_x}^a)\,\delta(p_x = p), \forall p \in P$,
and the time-based user interest is defined as $Int_u^{Time}(c) = \sum_{p_x \in S_u} \frac{t_{p_x}^d - t_{p_x}^a}{\bar{V}(p_x)}\,\delta(Cat_{p_x} = c), \forall c \in C$.
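A minimal sketch of one possible guard (an assumption on my part, not something prescribed by the paper): clamp the average duration to a small positive floor before computing the time ratio, so the division in $Int_u^{Time}(c)$ never hits zero.

```python
import pandas as pd

# Toy data with a degenerate POI: every visit to POI 3 lasted 0 seconds,
# so its average visit duration is 0 and the time ratio would divide by zero.
poi_avg = pd.Series({2: 1200.0, 3: 0.0}, name='avgDuration(sec)')
visits = pd.DataFrame({'userID': ['u1', 'u1'],
                       'poiID': [2, 3],
                       'poiDuration(sec)': [600.0, 0.0]})

# Clamp the average duration to a 1-second floor (an arbitrary choice) before
# dividing, so timeRatio stays well defined for such POIs.
avg = visits['poiID'].map(poi_avg).clip(lower=1.0)
visits['timeRatio'] = visits['poiDuration(sec)'] / avg
print(visits)
```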
Up to now, two strategies have been tried for solving the ILP: solving it directly from the notebook via PuLP, and writing .lp files to be solved offline by Gurobi (see above).
CBC is too slow for large sequences (length >= 4).