Methods and Experiments

1. Experimental Setup

Enumerate all trajectories for each user given the trajectory length (e.g. 3, 4, 5) and the (start, end) POIs.

For each trajectory, compute a score based on the features below:

  • User Interest (time-based)
  • User Interest (frequency-based)
  • POI Popularity
  • Travelling Cost (distance/time)
  • Trajectory probability based on the transition probabilities between different POI categories and one of the following rules for choosing a specific POI within a certain category:
    • The Nearest Neighbor of the current POI
    • The most Popular POI
    • A random POI chosen with probability proportional to the reciprocal of its distance to the current POI
    • A random POI chosen with probability proportional to its popularity

To avoid numerical underflow, use the log of probabilities instead of the probabilities themselves; to avoid zero probabilities, add a smoothing value $\epsilon = 10^{-12}$ to each probability.
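A minimal sketch of this log-probability trick (the toy probabilities are made up; EPS mirrors the $\epsilon = 10^{-12}$ defined later in this notebook):

import numpy as np

EPS = 1e-12
probs = np.array([0.3, 0.0, 0.25, 0.4])  # toy transition probabilities, one of them zero

# multiplying raw probabilities collapses to zero here, and long products can underflow
print(np.prod(probs))                    # 0.0

# summing smoothed log-probabilities keeps the score finite and comparable across trajectories
print(np.sum(np.log10(probs + EPS)))     # a finite (negative) score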

Plot the scores of generated and actual trajectories for each (user, trajectoryLength, startPOI, endPOI) tuple with some degree of transparency (alpha).

Recommend the trajectory with the highest score and measure the performance of the recommendation using recall, precision and F1-score.

Optimise the parameters in the score function by learning; in this specific case, the cost function could be based on recall, precision or F1-score, and we can also control the estimation of the transition matrix.

1.1 Basic definitions

For user $u$ and POI $p$, define

  • Travel History: \begin{equation*} S_u = \{(p_1, t_{p_1}^a, t_{p_1}^d), \dots, (p_n, t_{p_n}^a, t_{p_n}^d)\} \end{equation*} where $t_{p_i}^a$ is the arrival time and $t_{p_i}^d$ the departure time of user $u$ at POI $p_i$

  • Travel Sequences: split $S_u$ if \begin{equation*} |t_{p_i}^d - t_{p_{i+1}}^a| > \tau ~(\text{e.g.}~ \tau = 8 ~\text{hours}) \end{equation*}

  • POI Popularity: \begin{equation*} Pop(p) = \sum_{u \in U} \sum_{p_i \in S_u} \delta(p_i == p) \end{equation*}

  • Average POI Visit Duration: \begin{equation*} \bar{V}(p) = \frac{1}{N} \sum_{u \in U} \sum_{p_i \in S_u} (t_{p_i}^d - t_{p_i}^a) \delta(p_i == p) \end{equation*} where $N$ is the total number of visits to POI $p$ by all users

Define the interest of user $u$ in POI category $c$ as

  • Time based User Interest: \begin{equation*} Int^{Time}(u, c) = \sum_{p_i \in S_u} \frac{(t_{p_i}^d - t_{p_i}^a)}{\bar{V}(p_i)} \delta(Cat_{p_i} == c) \end{equation*} where $Cat_{p_i}$ is the category of POI $p_i$
    We also tried a normalised variant \begin{equation*} Int^{Time}(u, c) = \frac{1}{n} \sum_{p_i \in S_u} \frac{(t_{p_i}^d - t_{p_i}^a)}{\bar{V}(p_i)} \delta(Cat_{p_i} == c) \end{equation*} where $n$ is the number of visits to category $c$ by user $u$ (i.e. the frequency-based user interest defined below).
  • Frequency based User Interest: \begin{equation*} Int^{Freq}(u, c) = \sum_{p_i \in S_u} \delta(Cat_{p_i} == c) \end{equation*}
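For concreteness, below is a minimal pandas sketch of the interest definitions above on a toy visit table; the frame and its columns (userID, poiID, poiCat, duration) are illustrative and are not the notebook's actual data frames.

import pandas as pd

toy_visits = pd.DataFrame({
    'userID':   ['u1', 'u1', 'u2', 'u2'],
    'poiID':    [1, 2, 1, 2],
    'poiCat':   ['Beach', 'Cultural', 'Beach', 'Cultural'],
    'duration': [600, 1200, 300, 2400]})  # seconds spent at each POI

# average visit duration and popularity per POI
poi_stats = toy_visits.groupby('poiID')['duration'].agg(['mean', 'size'])
poi_stats.columns = ['avgDuration', 'popularity']

# each visit's duration relative to the POI's average visit duration
toy_visits['timeRatio'] = toy_visits['duration'] / toy_visits['poiID'].map(poi_stats['avgDuration'])

# time-based interest (sum of time ratios) and frequency-based interest (visit count) per category
interest = toy_visits.groupby(['userID', 'poiCat'])['timeRatio'].agg(['sum', 'size'])
interest.columns = ['timeBased', 'freqBased']
print(interest)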

2. Computation Steps

  1. Split the actual trajectories into two parts, one for training, the other for testing.
    Concretely, for each user, consider all of their trajectories with length 3, 4 or 5, pick one for the testing set and put all the others into the training set.

  2. Use the trajectories in the training set to compute (MLE) a transition matrix whose element [i, j] denotes the transition probability from POI category i to POI category j.

  3. For each trajectory $T$ in the training set, enumerate all possible trajectories that satisfy the following requirements:

    • The trajectory length is the same as that of $T$
    • The start/end POI are the same as those of $T$
    • No sub-tour exists (i.e. no POI is visited more than once)
  4. Compute the 8 scores described above as features, rescale each score into the range [-1, 1], and compute the weighted sum of these scores to get a single score.
    (Originally the weights were normalised so that they lay in the range [0, 1] and summed to 1; the values in the parameter/weight vector are now drawn from the range [-1, 1] and their sum no longer needs to be 1.)

  5. Choose the trajectory $T^*$ with the highest score and compute the F1-score as follows (a worked example is given after this list):

    • recall = $\frac{|T^* \cap T|}{|T|}$
    • precision = $\frac{|T^* \cap T|}{|T^*|}$
    • F1-score = $\frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}}$
  6. Compute the mean F1-score over all trajectories $T$ in the training set.

  7. Use coordinate-wise grid search to find a good weight vector such that the mean F1-score is as large as possible.
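A quick worked example for step 5: if the actual trajectory is $T = (p_1, p_2, p_3)$ and the recommended one is $T^* = (p_1, p_4, p_3)$, the two share only the start and end POIs, so recall = precision = 2/3 and the F1-score is 2/3. These per-length values (2/3, 2/4, 2/5) are exactly what the "random guessing" baseline computed below averages over trajectories of length 3, 4 and 5.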

3. Compute Experimental Data

NOTE: Before running this notebook, please run the script src/ijcai15_setup.py to set up the data properly.

3.1 Load Trajectory Data


In [229]:
%matplotlib inline

import os
import math
import random
import pickle
import pandas as pd
import numpy as np
import numpy.matlib
from datetime import datetime
from joblib import Parallel, delayed
import matplotlib.pyplot as plt

In [230]:
nfeatures = 8 # number of features
EPS = 1e-12 # smooth, deal with 0 probability
random.seed(123456789) # control random choice when splitting training/testing set

In [231]:
data_dir = 'data/data-ijcai15'
#fvisit = os.path.join(data_dir, 'userVisits-Osak.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Osak.csv')
#fvisit = os.path.join(data_dir, 'userVisits-Glas.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Glas.csv')
#fvisit = os.path.join(data_dir, 'userVisits-Edin.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Edin.csv')
fvisit = os.path.join(data_dir, 'userVisits-Toro.csv')
fcoord = os.path.join(data_dir, 'photoCoords-Toro.csv')

In [232]:
suffix = fvisit.split('-')[-1].split('.')[0]

In [233]:
fpoi = os.path.join(data_dir, 'poi-' + suffix + '.csv')  
fseq = os.path.join(data_dir, 'seq-' + suffix + '.csv')
ftrain = os.path.join(data_dir, 'trainset-' + suffix + '.pkl')
ftest  = os.path.join(data_dir, 'testset-' + suffix + '.pkl')
ffeatures_train = os.path.join(data_dir, 'featuresTrain-' + suffix + '.pkl')
ffeatures_test  = os.path.join(data_dir, 'featuresTest-' + suffix + '.pkl')

In [234]:
visits = pd.read_csv(fvisit, sep=';')
visits.head()


Out[234]:
photoID userID dateTaken poiID poiTheme poiFreq seqID
0 7941504100 10007579@N00 1346844688 30 Structure 1538 1
1 4886005532 10012675@N05 1142731848 6 Cultural 986 2
2 4886006468 10012675@N05 1142732248 6 Cultural 986 2
3 4885404441 10012675@N05 1142732373 6 Cultural 986 2
4 4886008334 10012675@N05 1142732445 6 Cultural 986 2

In [235]:
coords = pd.read_csv(fcoord, sep=';')
coords.head()


Out[235]:
photoID photoLon photoLat
0 7941504100 -79.380844 43.645641
1 4886005532 -79.391525 43.654335
2 4886006468 -79.391525 43.654335
3 4885404441 -79.391525 43.654335
4 4886008334 -79.391525 43.654335

In [236]:
# merge data frames according to column 'photoID'
assert(visits.shape[0] == coords.shape[0])
traj = pd.merge(visits, coords, on='photoID')
traj.head()


Out[236]:
photoID userID dateTaken poiID poiTheme poiFreq seqID photoLon photoLat
0 7941504100 10007579@N00 1346844688 30 Structure 1538 1 -79.380844 43.645641
1 4886005532 10012675@N05 1142731848 6 Cultural 986 2 -79.391525 43.654335
2 4886006468 10012675@N05 1142732248 6 Cultural 986 2 -79.391525 43.654335
3 4885404441 10012675@N05 1142732373 6 Cultural 986 2 -79.391525 43.654335
4 4886008334 10012675@N05 1142732445 6 Cultural 986 2 -79.391525 43.654335

In [237]:
pd.DataFrame([traj[['photoLon', 'photoLat']].min(), traj[['photoLon', 'photoLat']].max(), \
             traj[['photoLon', 'photoLat']].max() - traj[['photoLon', 'photoLat']].min()], \
             index = ['min', 'max', 'range'])


Out[237]:
photoLon photoLat
min -79.465088 43.618523
max -79.180336 43.822018
range 0.284752 0.203495

In [238]:
plt.figure(figsize=[15, 5])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.scatter(traj['photoLon'], traj['photoLat'], marker='+')


Out[238]:
<matplotlib.collections.PathCollection at 0x7fda7e294c88>

In [239]:
num_photo = traj['photoID'].unique().shape[0]
num_user = traj['userID'].unique().shape[0]
num_poi = traj['poiID'].unique().shape[0]
num_seq = traj['seqID'].unique().shape[0]
pd.DataFrame([num_photo, num_user, num_poi, num_seq, num_photo/num_user, num_seq/num_user], \
             index = ['#photo', '#user', '#poi', '#seq', '#photo/user', '#seq/user'], columns=[str(suffix)])


Out[239]:
Toro
#photo 39419.000000
#user 1395.000000
#poi 29.000000
#seq 6057.000000
#photo/user 28.257348
#seq/user 4.341935

3.2 Compute POI Info

Compute POI (Longitude, Latitude) as the average coordinates of the assigned photos.


In [240]:
poi_coords = traj[['poiID', 'photoLon', 'photoLat']].groupby('poiID').mean()
poi_coords.reset_index(inplace=True)
poi_coords.rename(columns={'photoLon':'poiLon', 'photoLat':'poiLat'}, inplace=True)
#poi_coords

Extract POI category and visiting frequency.


In [241]:
poi_catfreq = traj[['poiID', 'poiTheme', 'poiFreq']].groupby('poiID').first()
poi_catfreq.reset_index(inplace=True)
#poi_catfreq

In [242]:
poi_all = pd.merge(poi_catfreq, poi_coords, on='poiID')
poi_all.set_index('poiID', inplace=True)
#poi_all

Save POI info to file.


In [243]:
poi_all.to_csv(fpoi, index=True)
#poi_all2 = pd.read_csv(fpoi, index_col=0)
#poi_all2

3.3 Construct Travelling Sequences


In [244]:
seq_all = traj[['userID', 'seqID', 'poiID', 'dateTaken']].copy()\
          .groupby(['userID', 'seqID', 'poiID']).agg([np.min, np.max])
#seq_all.head()

In [245]:
seq_all.columns = seq_all.columns.droplevel()
seq_all.head()


Out[245]:
amin amax
userID seqID poiID
10007579@N00 1 30 1346844688 1346844688
10012675@N05 2 6 1142731848 1142732445
3 6 1142916492 1142916492
4 13 1319327174 1319332848
10014440@N06 5 24 1196128621 1196128878

In [246]:
seq_all.reset_index(inplace=True)
seq_all.head()


Out[246]:
userID seqID poiID amin amax
0 10007579@N00 1 30 1346844688 1346844688
1 10012675@N05 2 6 1142731848 1142732445
2 10012675@N05 3 6 1142916492 1142916492
3 10012675@N05 4 13 1319327174 1319332848
4 10014440@N06 5 24 1196128621 1196128878

In [247]:
seq_all.rename(columns={'amin':'arrivalTime', 'amax':'departureTime'}, inplace=True)
seq_all['poiDuration(sec)'] = seq_all['departureTime'] - seq_all['arrivalTime']
#print('Found %d sequences' % len(seq_all))
#pickle.dump(seq_all, open('all_trajectories.pkl', 'bw'))
seq_all.head()


Out[247]:
userID seqID poiID arrivalTime departureTime poiDuration(sec)
0 10007579@N00 1 30 1346844688 1346844688 0
1 10012675@N05 2 6 1142731848 1142732445 597
2 10012675@N05 3 6 1142916492 1142916492 0
3 10012675@N05 4 13 1319327174 1319332848 5674
4 10014440@N06 5 24 1196128621 1196128878 257

Save travelling sequences to file.


In [248]:
seq_all.to_csv(fseq, index=False)
#seq_all2 = pd.read_csv(fseq)
#seq_all2.head()

In [249]:
#seq_all = pickle.load(open('all_trajectories.pkl'))
seq_user = seq_all[['seqID', 'userID']].copy()
seq_user = seq_user.groupby('seqID').first()
#type(seq_user)
#seq_user.loc[1].iloc[0]
#seq_user.reset_index(inplace=True)
#seq_user.set_index('seqID', inplace=True)
seq_user.head()


Out[249]:
userID
seqID
1 10007579@N00
2 10012675@N05
3 10012675@N05
4 10012675@N05
5 10014440@N06

3.4 Choose Travelling Sequences for experiment

Trajectories with length {3, 4, 5} are used in our experiment.


In [250]:
seq_len = seq_all[['userID', 'seqID', 'poiID']].copy()
seq_len = seq_len.groupby(['userID', 'seqID']).agg(np.size)
seq_len.reset_index(inplace=True)
seq_len.rename(columns={'poiID':'seqLen'}, inplace=True)
#seq_len.head()
ax = seq_len['seqLen'].hist(bins=20)
ax.set_yscale('log')



In [251]:
seq_345 = seq_len[seq_len['seqLen'].isin({3, 4, 5})]
seq_345['seqLen'].hist(bins=9)


Out[251]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fda6c7c9f60>

Split the travelling sequences into a training set and a testing set using leave-one-out for each user.
Since leave-one-out needs at least two sequences per user, users with fewer than two travelling sequences are not considered in this experiment.


In [252]:
train_set = []
test_set = []

In [253]:
user_seqs = seq_345[['userID', 'seqID']].groupby('userID')

In [254]:
for user, indices in user_seqs.groups.items():
    if len(indices) < 2: continue
    idx = random.choice(indices)
    test_set.append(seq_345.loc[idx, 'seqID'])
    train_set.extend([seq_345.loc[x, 'seqID'] for x in indices if x != idx])

In [255]:
print('#seq in trainset:', len(train_set))
print('#seq in testset:', len(test_set))
seq_345[seq_345['seqID'].isin(train_set)]['seqLen'].hist(bins=9)
#data = np.array(seqs1['seqLen'])
#hist, bins = np.histogram(data, bins=3)
#print(hist)


#seq in trainset: 128
#seq in testset: 41
Out[255]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fda7be289e8>

Sanity check: the total number of travelling sequences used in the experiment.


In [256]:
seq_exp = seq_345[['userID', 'seqID']].copy()
seq_exp = seq_exp.groupby('userID').agg(np.size)
seq_exp.reset_index(inplace=True)
seq_exp.rename(columns={'seqID':'#seq'}, inplace=True)
seq_exp = seq_exp[seq_exp['#seq'] > 1] # user with more than 1 sequences
print('total #seq for experiment:', seq_exp['#seq'].sum())
#seq_exp.head()


total #seq for experiment: 169

In [257]:
seq_t = seq_345[seq_345['seqID'].isin(train_set)]
nseq3 = seq_t[seq_t['seqLen'] == 3].shape[0]
nseq4 = seq_t[seq_t['seqLen'] == 4].shape[0]
nseq5 = seq_t[seq_t['seqLen'] == 5].shape[0]
nseq345 = nseq3 + nseq4 + nseq5
randF1 = (2/3) * (nseq3 / nseq345) + (2/4) * (nseq4 / nseq345) + (2/5) * (nseq5 / nseq345)
print('%d sequences with length 3, %d sequences with length 4, %d sequences with length 5' % (nseq3, nseq4, nseq5))
print('F1-score by random guessing is %.3f' % randF1)


86 sequences with length 3, 29 sequences with length 4, 13 sequences with length 5
F1-score by random guessing is 0.602

Save training/testing set to file.


In [258]:
pickle.dump(train_set, open(ftrain, 'wb'))
pickle.dump(test_set, open(ftest, 'wb'))

3.5 Compute transition matrix using training set

Compute the transition probabilities between different POI categories.


In [259]:
poi_cats = traj['poiTheme'].unique().tolist()
poi_cats.sort()
poi_cats


Out[259]:
['Amusement', 'Beach', 'Cultural', 'Shopping', 'Sport', 'Structure']

In [260]:
ncats = len(poi_cats)
trans_mat = pd.DataFrame(data=np.zeros((ncats, ncats), dtype=np.float64), index=poi_cats, columns=poi_cats)
#trans_mat

Count the number of observed transitions for each pair of POI categories.


In [261]:
#train_set = [4, 13, 32, 33, 34, 99, 101]
#seq_all[seq_all['seqID'] == train_set[0]]
for seqid in train_set:
    seqi = seq_all[seq_all['seqID'] == seqid].copy()
    seqi.sort_values(by='arrivalTime', ascending=True, inplace=True)
    for j in range(len(seqi.index)-1):
        idx1 = seqi.index[j]
        idx2 = seqi.index[j+1]
        poi1 = seqi.loc[idx1, 'poiID']
        poi2 = seqi.loc[idx2, 'poiID']
        cat1 = poi_all.loc[poi1, 'poiTheme']
        cat2 = poi_all.loc[poi2, 'poiTheme']
        trans_mat.loc[cat1, cat2] += 1
trans_mat


Out[261]:
Amusement Beach Cultural Shopping Sport Structure
Amusement 5 3 3 2 5 3
Beach 4 12 10 19 3 26
Cultural 2 13 5 12 3 19
Shopping 4 27 8 6 5 18
Sport 6 3 2 2 1 5
Structure 2 21 16 24 4 8

Normalise each row to get an estimate of transition probabilities (MLE).


In [262]:
for r in trans_mat.index:
    rowsum = trans_mat.loc[r].sum()
    if rowsum == 0: continue  # deal with lack of data
    for c in trans_mat.columns:
        trans_mat.loc[r, c] /= rowsum
trans_mat


Out[262]:
Amusement Beach Cultural Shopping Sport Structure
Amusement 0.238095 0.142857 0.142857 0.095238 0.238095 0.142857
Beach 0.054054 0.162162 0.135135 0.256757 0.040541 0.351351
Cultural 0.037037 0.240741 0.092593 0.222222 0.055556 0.351852
Shopping 0.058824 0.397059 0.117647 0.088235 0.073529 0.264706
Sport 0.315789 0.157895 0.105263 0.105263 0.052632 0.263158
Structure 0.026667 0.280000 0.213333 0.320000 0.053333 0.106667

Compute the log of the transition probabilities with smoothing factor $\epsilon=10^{-12}$.


In [263]:
log10_trans_mat = np.log10(trans_mat.copy() + EPS)
log10_trans_mat


Out[263]:
Amusement Beach Cultural Shopping Sport Structure
Amusement -0.623249 -0.845098 -0.845098 -1.021189 -0.623249 -0.845098
Beach -1.267172 -0.790050 -0.869232 -0.590478 -1.392110 -0.454258
Cultural -1.431364 -0.618450 -1.033424 -0.653213 -1.255273 -0.453640
Shopping -1.230449 -0.401145 -0.929419 -1.054358 -1.133539 -0.577236
Sport -0.500602 -0.801632 -0.977724 -0.977724 -1.278754 -0.579784
Structure -1.574031 -0.552842 -0.670941 -0.494850 -1.273001 -0.971971

3.6 Compute POI popularity and user interest using training set

Compute the average POI visit duration and the POI popularity as defined at the top of the notebook.


In [264]:
poi_avg_pop = seq_all[seq_all['seqID'].isin(train_set)]
poi_avg_pop = poi_avg_pop[['poiID', 'poiDuration(sec)']].copy()

In [265]:
poi_avg_pop = poi_avg_pop.groupby('poiID').agg([np.mean, np.size])
poi_avg_pop.columns = poi_avg_pop.columns.droplevel()
poi_avg_pop.reset_index(inplace=True)
poi_avg_pop.rename(columns={'mean':'avgDuration(sec)', 'size':'popularity'}, inplace=True)
poi_avg_pop.set_index('poiID', inplace=True)
print('#poi:', poi_avg_pop.shape[0])
if poi_avg_pop.shape[0] < poi_all.shape[0]:
    extra_index = list(set(poi_all.index) - set(poi_avg_pop.index))
    extra_poi = pd.DataFrame(data=np.zeros((len(extra_index), 2), dtype=np.float64), \
                             index=extra_index, columns=['avgDuration(sec)', 'popularity'])
    poi_avg_pop = pd.concat([poi_avg_pop, extra_poi])
    print('#poi after extension:', poi_avg_pop.shape[0])
poi_avg_pop


#poi: 23
#poi after extension: 29
Out[265]:
avgDuration(sec) popularity
poiID
1 6691.200000 5
2 4099.833333 6
3 1045.923077 13
4 3473.285714 7
6 2593.300000 10
7 4030.868421 38
8 5465.090909 11
11 1320.583333 12
13 2341.000000 2
14 505.166667 6
15 851.500000 2
16 2593.200000 15
17 9601.500000 8
19 1604.000000 4
21 2432.566038 53
22 2727.909091 55
23 1374.573770 61
24 1580.454545 11
25 246.250000 8
27 1921.375000 8
28 1215.000000 51
29 1031.750000 8
30 3745.222222 45
9 0.000000 0
10 0.000000 0
12 0.000000 0
18 0.000000 0
20 0.000000 0
26 0.000000 0

Compute the time-based and frequency-based user interest as defined at the top of the notebook.


In [266]:
user_interest = seq_all[seq_all['seqID'].isin(train_set)]
user_interest = user_interest[['userID', 'poiID', 'poiDuration(sec)']].copy()

In [267]:
user_interest['timeRatio'] = [poi_avg_pop.loc[x, 'avgDuration(sec)'] for x in user_interest['poiID']]
user_interest['timeRatio'] = user_interest['poiDuration(sec)'] / user_interest['timeRatio']
user_interest['poiTheme'] = [poi_all.loc[x, 'poiTheme'] for x in user_interest['poiID']]
user_interest.drop(['poiID', 'poiDuration(sec)'], axis=1, inplace=True)

The IJCAI paper defines the time-based interest as a sum, but the cumulative sum of time ratios becomes extremely large in many cases, which seems unrealistic, so the mean is used here instead.


In [268]:
#user_interest = user_interest.groupby(['userID', 'poiTheme']).agg([np.sum, np.size])
user_interest = user_interest.groupby(['userID', 'poiTheme']).agg([np.mean, np.size]) # try the mean

In [269]:
user_interest.columns = user_interest.columns.droplevel()
#user_interest.rename(columns={'sum':'timeBased', 'size':'freqBased'}, inplace=True)
user_interest.rename(columns={'mean':'timeBased', 'size':'freqBased'}, inplace=True)
user_interest.reset_index(inplace=True)
user_interest.set_index(['userID', 'poiTheme'], inplace=True)
user_interest.head()


Out[269]:
timeBased freqBased
userID poiTheme
10502709@N05 Beach 0.793282 1
Cultural 0.657675 2
Shopping 1.930100 2
Structure 14.311111 1
10627620@N06 Shopping 0.046924 2

3.7 Enumerate Trajectories with length {3, 4, 5}


In [270]:
poi_list = poi_all.index.tolist()

In [271]:
def enum_345_seq(seqid_set, poi_list):
    """Enumerate all possible travelling sequences with length {3, 4, 5}"""
    act_seqs = dict()
    enum_seqs = dict()
    
    for seqid in seqid_set:
        seqi = seq_all[seq_all['seqID'] == seqid].copy()
        seqi.sort_values(by='arrivalTime', ascending=True, inplace=True)
        act_seqs[seqid] = seqi['poiID'].tolist()
        p0 = seqi.loc[seqi.index[0], 'poiID']
        pN = seqi.loc[seqi.index[-1],'poiID']
    
        # enumerate sequences with length 3
        if seqi.shape[0] == 3:
            enum_seqs[seqid] = [[p0, p, pN] \
                                for p in poi_list if p not in {p0, pN}]
            continue

        # enumerate sequences with length 4
        if seqi.shape[0] == 4:
            enum_seqs[seqid] = [[p0, p1, p2, pN] \
                                for p1 in poi_list if p1 not in {p0, pN} \
                                for p2 in poi_list if p2 not in {p0, p1, pN}]
            continue

        # enumerate sequences with length 5
        if seqi.shape[0] == 5:
            enum_seqs[seqid] = [[p0, p1, p2, p3, pN] \
                                for p1 in poi_list if p1 not in {p0, pN} \
                                for p2 in poi_list if p2 not in {p0, p1, pN} \
                                for p3 in poi_list if p3 not in {p0, p1, p2, pN}]
            continue
    return enum_seqs, act_seqs
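
As a rough sanity check on the size of this enumeration (a back-of-envelope sketch, assuming the start and end POIs of a sequence are distinct):

npoi = 29  # number of POIs in the Toronto data
counts = {}
for length in (3, 4, 5):
    c = 1
    for k in range(length - 2):
        c *= npoi - 2 - k  # 27, 26, 25 choices remain for the intermediate slots
    counts[length] = c
print(counts)  # {3: 27, 4: 702, 5: 17550}

# with 86, 29 and 13 training sequences of length 3, 4 and 5 (Section 3.4):
print(86 * counts[3] + 29 * counts[4] + 13 * counts[5])  # 250830 candidate sequences

This matches the number of feature vectors computed for the training set below.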

3.8 Compute Features (scores) for each enumerated sequence

As described at the top of the notebook, features for each trajectory used in this experiment are:

  1. total time-based user interest
  2. total freq-based user interest
  3. total POI popularity
  4. total (negative) travel distance (excluding the visit duration at each POI)
  5. features 5 to 8 are trajectory (log) probabilities based on the transition matrix between different POI categories, combined with one of the following rules for choosing a specific POI within a certain category:
    • The Nearest Neighbor of the current POI
    • The most Popular POI
    • A random POI chosen with probability proportional to the reciprocal of its distance to the current POI
    • A random POI chosen with probability proportional to its popularity

In [272]:
def calc_dist(longitude1, latitude1, longitude2, latitude2):
    """Calculate the distance (unit: km) between two places on earth"""
    # convert degrees to radians
    lon1 = math.radians(longitude1)
    lat1 = math.radians(latitude1)
    lon2 = math.radians(longitude2)
    lat2 = math.radians(latitude2)
    radius = 6371.009 # mean earth radius is 6371.009km, en.wikipedia.org/wiki/Earth_radius#Mean_radius
    # The haversine formula, en.wikipedia.org/wiki/Great-circle_distance
    dlon = math.fabs(lon1 - lon2)
    dlat = math.fabs(lat1 - lat2)
    return 2 * radius * math.asin(math.sqrt(\
                (math.sin(0.5*dlat))**2 + math.cos(lat1) * math.cos(lat2) * (math.sin(0.5*dlon))**2 ))
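
A quick illustrative check of calc_dist (the coordinates are made up, roughly central Toronto): two points 0.01 degrees of longitude apart at latitude ~43.65 should be about 0.8 km apart, since longitude spacing shrinks with the cosine of the latitude.

print(calc_dist(-79.38, 43.65, -79.39, 43.65))  # roughly 0.80 km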

In [273]:
poi_dist_mat = pd.DataFrame(data=np.zeros((poi_all.shape[0], poi_all.shape[0]), dtype=np.float64), \
                        index=poi_all.index, columns=poi_all.index)
poi_rdist_mat = poi_dist_mat.copy()

In [274]:
for i in range(poi_all.index.shape[0]):
    for j in range(i+1, poi_all.index.shape[0]):
        r = poi_all.index[i]
        c = poi_all.index[j]
        dist = calc_dist(poi_all.loc[r, 'poiLon'], poi_all.loc[r, 'poiLat'], \
                         poi_all.loc[c, 'poiLon'], poi_all.loc[c, 'poiLat'])
        poi_dist_mat.loc[r, c] = dist
        poi_dist_mat.loc[c, r] = dist
        assert(dist > 0.)
        rdist = 1./dist
        poi_rdist_mat.loc[r, c] = rdist
        poi_rdist_mat.loc[c, r] = rdist

In [275]:
def calc_features(user, seq, poi_all, user_interest, log10_trans_mat, poi_dist_mat, poi_rdist_mat):
    """Compute 8 features for each enumerated trajectory"""
    features = np.zeros(nfeatures, dtype=np.float64)
    
    # POI based features
    for poi in seq:
        cat = poi_all.loc[poi, 'poiTheme']
        if (user, cat) in user_interest.index:
            features[0] += user_interest.loc[user, cat]['timeBased'] # 1. time-based user interest
            features[1] += user_interest.loc[user, cat]['freqBased'] # 2. freq-based user interest
        features[2] += poi_avg_pop.loc[poi, 'popularity']            # 3. POI popularity
    
    # POI-pair based features
    for k in range(len(seq)-1):
        poi1 = seq[k]
        poi2 = seq[k+1]
        assert(poi1 != poi2)
        cat1 = poi_all.loc[poi1, 'poiTheme']
        cat2 = poi_all.loc[poi2, 'poiTheme']                     
        features[3] += -1 * poi_dist_mat.loc[poi1, poi2]              # 4. travel distance
        trans_prob = log10_trans_mat.loc[cat1, cat2] # log of transition probability
        for l in range(4, 8):
            features[l] += trans_prob
        poi_cat2 = poi_all[poi_all['poiTheme'] == cat2].copy()
        if cat1 == cat2: 
            poi_cat2.drop(poi1, axis=0, inplace=True) # drop row
        distvec = pd.DataFrame(data=[poi_rdist_mat.loc[poi1, x] for x in poi_cat2.index], index=poi_cat2.index)
        if distvec.idxmax().iloc[0] == poi2:
            features[4] += math.log10(1. + EPS) # poi2 is the nearest neighbor of poi1
        else:
            features[4] += math.log10(0. + EPS) # poi2 is not the nearest neighbor of poi1
        popvec = pd.DataFrame(data=[poi_avg_pop.loc[x,'popularity'] for x in poi_cat2.index], index=poi_cat2.index)
        #popvec = pd.DataFrame(data=[poi_all.loc[x,'poiFreq'] for x in poi_cat2.index], index=poi_cat2.index)
        if popvec.idxmax().iloc[0] == poi2:
            features[5] += math.log10(1. + EPS) # poi2 is the most popular one within cat2
        else:
            features[5] += math.log10(0. + EPS) # poi2 is not the most popular one within cat2
        features[6] += math.log10(EPS + distvec.loc[poi2].iloc[0] / distvec.sum().iloc[0])
        features[7] += math.log10(EPS +  popvec.loc[poi2].iloc[0] /  popvec.sum().iloc[0])
        
    # normalise score, range [-1, 1]
    features /= abs(features).max()
    return features

In [276]:
enum_seqs, train_seqs = enum_345_seq(train_set, poi_list)

Load features from file if possible.


In [277]:
doCompute = True

In [278]:
train_set1 = pickle.load(open(ftrain, 'rb'))
if (np.array(sorted(train_set1)) == np.array(sorted(train_set))).all() and os.path.exists(ffeatures_train):
    doCompute = False

In [279]:
all_features = None

In [280]:
if doCompute:
    #[(seqid, seqidx_in_enum_seqs_dict)]
    seq_info = [(seqid, j) for seqid in train_set for j in range(len(enum_seqs[seqid]))]
    # all CPUs but one are used
    all_features = Parallel\
                   (n_jobs=-2)\
                   (delayed\
                   (calc_features)\
                   (seq_user.loc[x[0]].iloc[0], enum_seqs[x[0]][x[1]], poi_all, user_interest, \
                    log10_trans_mat, poi_dist_mat, poi_rdist_mat) \
                   for x in seq_info)
    pickle.dump(all_features, open(ffeatures_train, 'wb'))
else:
    # load features
    all_features = pickle.load(open(ffeatures_train, 'rb'))

In [281]:
print(len(all_features))


250830

In [282]:
num_enum_seq = 0 # total number of enumerated sequences
score_indices = dict() # score vector index in the feature matrix

In [283]:
for seqid in train_set:
    num = len(enum_seqs[seqid])
    score_indices[seqid] = [x for x in range(num_enum_seq, num_enum_seq + num)]
    num_enum_seq += num

In [284]:
print(num_enum_seq)


250830

In [285]:
assert(len(all_features) == num_enum_seq)

4. Experiments with training set


In [286]:
features_name = \
['total time-based user interest', 'total freq-based user interest', \
 'total POI popularity', 'total (negative) travel distance', \
 'trajectory probability with nearest neighbor rule',   'trajectory probability with most popular POI rule', \
 'trajectory probability which prefers near neighbors', 'trajectory probability which prefers popular POIs']

In [287]:
def calc_F1score(seq_act, seq_rec):
    assert(len(seq_act) > 0)
    assert(len(seq_rec) > 0)
    actset = set(seq_act)
    recset = set(seq_rec)
    intersect = actset & recset
    recall = len(intersect) / len(seq_act)
    precision = len(intersect) / len(seq_rec)
    return 2. * precision * recall / (precision + recall)

In [288]:
def calc_mean_F1score(train_set, train_seqs, enum_seqs, all_scores, score_indices):
    F1scores = []
    for seqid in train_set:
        scores = np.array([all_scores[x] for x in score_indices[seqid]])
        bestseq = enum_seqs[seqid][scores.argmax()]
        F1scores.append(calc_F1score(train_seqs[seqid], bestseq))
    return np.mean(F1scores)

In [289]:
features_mat = np.array(all_features)

In [290]:
print(features_mat.shape)
#print(features_mat[0])


(250830, 8)

4.1 Experiment with random weights


In [291]:
N = 1000
rand_weights = np.zeros((N, nfeatures), dtype=np.float64)
rand_F1scores = np.zeros(N, dtype=np.float64)

In [292]:
for j in range(N):
    weights = np.random.uniform(-1, 1, nfeatures)
    rand_weights[j] = weights
    all_scores = features_mat.dot(weights)
    rand_F1scores[j] = calc_mean_F1score(train_set, train_seqs, enum_seqs, all_scores, score_indices)
maxidx = rand_F1scores.argmax()
minidx = rand_F1scores.argmin()
print('max avgF1:', rand_F1scores[maxidx], ', weights:', rand_weights[maxidx])
print('min avgF1:', rand_F1scores[minidx], ', weights:', rand_weights[minidx])


max avgF1: 0.742708333333 , weights: [-0.50901545  0.21959893  0.45506513  0.85326357  0.00535068  0.63971964
  0.2620045  -0.48840356]
min avgF1: 0.601822916667 , weights: [-0.38066391 -0.50872977 -0.18906913 -0.15051161 -0.80622014 -0.74454076
  0.16886518  0.43388463]

In [293]:
# the all_scores matrices for all weight vectors are too large to fit in memory
#randscores = Parallel(n_jobs=-2)\
#                        (delayed\
#                        (calc_mean_F1score)\
#                        (train_set, train_seqs, enum_seqs, all_scores[j], score_indices) for j in range(N))

In [294]:
plt.figure(figsize=[10, 8])
plt.xlim([0, N])
plt.ylim([rand_F1scores.min()-0.01, max(rand_F1scores.max(), randF1)+0.01])
plt.xlabel('Experiment')
plt.ylabel('Mean F1score')
plt.plot([0, N], [randF1, randF1], color='r', linestyle='--', label='random guessing: ' + str(round(randF1,3)))
plt.scatter([range(N)], rand_F1scores, marker='+')
plt.legend()


Out[294]:
<matplotlib.legend.Legend at 0x7fda5a548630>

In [295]:
pd.Series(rand_F1scores).hist(bins=50)


Out[295]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fda78230828>

4.2 Experiment with single feature

NOTE:

  • When the trajectories in the training set do not cover all POIs and/or all types of POI category transitions, the single-feature plots above show uniform results and all points lie below the green line (e.g. with the Osaka data).
  • For the single-feature experiment, if the weight of the feature is 0, then effectively no feature is used: the scores of all candidate trajectories are identical, so the recommendation depends only on the order of the candidates (numpy's argmax picks the first one; see the snippet below). As a result the F1-score at weight 0 is meaningless, and 0 should be skipped in this case.
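The tie-breaking behaviour mentioned above comes from numpy: argmax returns the first index among equal maxima, so with an all-zero score vector the first enumerated candidate is always the one recommended. A minimal illustration:

import numpy as np
scores = np.zeros(5)    # all candidates tie
print(scores.argmax())  # 0 -> the first candidate wins the tie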

In [296]:
#values = [-0.5, 0, 0.5]
#values = np.linspace(-1, 1, 11)
#values = [-0.1, 0.1]
values = [-1, -0.5, -0.1, 0.1, 0.5, 1]
plt.figure(figsize=[16, 16])
for j in range(nfeatures):
    all_F1scores = []
    for k in range(len(values)):
        weights = np.zeros(nfeatures, dtype=np.float64)
        weights[j] = values[k]
        all_scores = features_mat.dot(weights)
        F1scores = []
        for seqid in train_set:
            scores = np.array([all_scores[x] for x in score_indices[seqid]])
            bestseq = enum_seqs[seqid][scores.argmax()]
            F1scores.append(calc_F1score(train_seqs[seqid], bestseq))
        all_F1scores.append(F1scores)
    plt.subplot(4, 2, j+1)
    xlim = [min(values)-0.1, max(values)+0.1]
    ylim = [0.3, 1.1]
    plt.xlim(xlim)
    plt.ylim(ylim)
    plt.xlabel(features_name[j])
    plt.ylabel('Mean F1-score')
    plt.boxplot(all_F1scores, labels=values) #OK
    avgF1scores = np.array([np.mean(x) for x in all_F1scores])
    #plt.scatter(values, avgF1scores, marker='+')
    #plt.plot([values[avgF1scores.argmax()]], [avgF1scores.max()], marker='o', color='r', \
    #                  label='max: ' + str(round(avgF1scores.max(),3)))
    #plt.plot(xlim, [randF1, randF1], color='g', linestyle='--', label='random guessing: ' + str(round(randF1,3)))
    #plt.legend()


4.2.1 Observations with single feature

The first feature is the total time-based user interest in a trajectory, i.e. the sum of (expected time the user spent at each POI) / (average time a user spent at each POI). It seems the presence of this feature negatively affects the algorithm, which is strange, as the IJCAI paper argues that capturing the expected time a user spends at a POI improves the accuracy of trajectory recommendation.

The second feature is the total number of visits (by the user) to all POIs in a trajectory. Similar to the first feature, the presence of this feature seems to negatively affect the algorithm, which is also strange: experiments in the IJCAI paper show that capturing a user's POI visiting frequency improves the accuracy of trajectory recommendation, less so than capturing visit duration, but still much better than greedy and random selection strategies.

The third feature is the total POI popularity (i.e. the number of visits to a POI by all users) of all POIs in a trajectory. It doesn't seem to affect the recommendation much, though a positive weight on this feature helps the recommendation algorithm slightly.

The fourth feature is the negative (i.e. multiplied by -1) of the total travelling cost (i.e. the total travel distance of the trajectory) for a user. It is strange that the algorithm prefers long travelling distances.

The fifth feature is the probability of a recommended trajectory based on the transition probabilities between POI categories and the nearest neighbor rule for choosing a specific POI within a certain category. It's clear the algorithm likes nearest neighbors.

The sixth feature is the probability of a recommended trajectory based on the transition probabilities between POI categories and the most popular POI rule for choosing a specific POI within a certain category. It's clear the algorithm likes popular POIs.

The seventh feature is the probability of a recommended trajectory based on the transition probabilities between POI categories and the following rule for choosing a specific POI within a certain category:
Rule: choose a random POI with probability proportional to the reciprocal of its distance to the current POI.
Similar to the fifth feature, which utilises the nearest neighbor rule, the algorithm doesn't like far neighbors.

The last feature is the probability of a recommended trajectory based on the transition probabilities between POI categories and the following rule for choosing a specific POI within a certain category:
Rule: choose a random POI with probability proportional to its popularity.
Similar to the sixth feature, which utilises the most popular POI rule, the algorithm doesn't like unpopular POIs.


In [297]:
weights = np.array([0.01, -0.01, 0, 0.88, -0.01, 0.01, -0.05, 0.05])
calc_mean_F1score(train_set, train_seqs, enum_seqs, features_mat.dot(weights), score_indices)


Out[297]:
0.78203124999999996

In [352]:
# uniform

# 0.714, [-1.    0.    0.    0.85  0.64 -0.05  0.03  0.]
#weights = np.zeros(nfeatures, dtype=np.float64)

# 0.728, [-0.3   0.65 -0.37  0.99  1.    0.05  0.95  1.]
#weights = np.ones(nfeatures, dtype=np.float64)

# 0.694, [-1.   -1.   -1.    0.22  0.99  0.49  0.25  0.76]
#weights = -1 * np.ones(nfeatures, dtype=np.float64)

# random

# 0.723, [-0.74  0.09 -0.19  0.99  0.84 -0.04  0.03  0.45]
#weights = np.random.uniform(-1, 1, nfeatures)

# 0.753, [0.35 -0.34 -0.17  0.86  0.78  0.35  0.89  0.63]
#weights = rand_weights[ rand_F1scores.argmax() ]

# hints from single feature experiment

# 0.732, [1 -1  1  1  1  1  1 -1]
#weights = np.array([-1, -1, 1, 1, 1, 1, 1, 1])

# 0.756, [0.12 -0.1  -0.03  0.75  0.1   0.1   0.1   0.1]
#weights = np.array([-0.1, -0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])

# 0.7996, [0.06 -0.05 -0.01  0.98 -0.02  0.04  0.23  0.43]
#weights = np.array([-0.05, -0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])
#weights = np.array([0.06, -0.05, -0.01, 0.98, -0.02, 0.04, 0.23, 0.43]) # init: 0.7996, end: 0.803
#weights = np.array([0.05, -0.05, -0.01, 0.86, -0.02, 0.03, 0.16, 0.43]) # init: 0.803, end: 0.805
#weights = np.array([0.05, -0.05, -0.01, 0.97, -0.02, 0.03, 0.16, 0.36]) # init: 0.805, end: 0.805

# 0.808, [0.02 -0.01  0.01  0.92 -0.02  0.01  0.28 -0.19]
#weights = np.array([-0.02, -0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02])
weights = np.array([0.02, -0.01, 0.01, 0.92, -0.02, 0.01, 0.28, -0.19]) # init: 0.808, end: 0.810
#weights = np.array([0.02, -0.01, 0, 0.87, -0.02, 0.01, 0.37, -0.19]) # init: 0.810, end: 0.810

# 0.810             [0.02, -0.01, 0.01, 0.87, -0.02, 0.01, 0.37, -0.19]
#weights = np.array([0.02, -0.01, 0,    1,    -0.02, 0.01, 0.37, -0.19])

# 0.794, [0.01, -0.01, 0.01, 0.45, -0.01, 0.02, 0.07, 0.01]
#weights = np.array([-0.01, -0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])

# 0.794, [0.01 -0.01  0.01  0.45 -0.01  0.02  0.07  0.01]
#weights = np.array([-0.01, -0.01, 0, 0.01, 0.01, 0.01, 0.01, 0.01])

#weights = np.array([0, 0, 0, 0.01, 0.01, 0.01, 0.01, 0.01]) # 0.789

#weights = np.array([0.01, -0.01, 0.01, 0.45, -0.01, 0.02, 0.07, 0.01]) # init: 0.794, end: 0.795
#weights = np.array([0.02, -0.01, 0, 0.44, -0.01, 0.02, 0.07, 0.01]) # init: 0.795, end: 0.795

In [353]:
#params = np.linspace(-1, 1, 41)
#params = np.linspace(-1, 1, 101)
params = np.linspace(-1, 1, 201)
F1scores = np.zeros((weights.shape[0], params.shape[0]), dtype=np.float64)

In [354]:
t1 = datetime.now()
for k in range(nfeatures):    
    for j in range(params.shape[0]):
        weights[k] = params[j]
        all_scores = features_mat.dot(weights)
        F1scores[k, j] = calc_mean_F1score(train_set, train_seqs, enum_seqs, all_scores, score_indices)
    maxidx = F1scores[k].argmax()
    weights[k] = params[maxidx]
t2 = datetime.now()
print('%d seconds used' % (t2-t1).total_seconds()) # 120 seconds


107 seconds used

In [355]:
#t1 = datetime.now()
#for k in range(nfeatures):
#    all_weights = np.matlib.repmat(weights, params.shape[0], 1)
#    for j in range(params.shape[0]):
#        all_weights[j, k] = params[j]
#    F1scores1 = Parallel(n_jobs=-2)\
#                        (delayed(calc_mean_F1score)
#                         (train_set, train_seqs, enum_seqs, features_mat.dot(all_weights[j]), score_indices)
#                         for j in range(params.shape[0]))
#    F1scores[k] = np.array(F1scores1)       
#    maxidx = F1scores[k].argmax()
#    weights[k] = params[maxidx]
#t2 = datetime.now()
#print('%d seconds used' % (t2-t1).total_seconds()) # 407 seconds

In [356]:
print(weights)
print(F1scores[-1].max())


[ 0.32 -0.01  0.01  0.92 -0.02  0.03  0.27  0.08]
0.80078125

In [357]:
plt.figure(figsize=[16, 16])
for k in range(nfeatures):
    plt.subplot(4, 2, k+1)
    plt.xlim([-1.1, 1.1])
    plt.ylim([0.3, 0.9])
    plt.xlabel(features_name[k])
    plt.ylabel('Mean F1-score')
    plt.scatter(params, F1scores[k], marker='+')
    plt.plot([-1.1, 1.1], [randF1, randF1], color='g', linestyle='--', label='random guessing: ' + str(round(randF1,3)))
    plt.legend()


5. Apply search results to testing set


In [358]:
enum_seqs_test, test_seqs = enum_345_seq(test_set, poi_list)

Load features from file if possible.


In [359]:
doCompute = True

In [360]:
test_set1 = pickle.load(open(ftest, 'rb'))
if (np.array(sorted(test_set1))  == np.array(sorted(test_set))).all() and os.path.exists(ffeatures_test):
    doCompute = False

In [361]:
all_features_test = None

In [362]:
if doCompute:
    #[(seqid, seqidx_in_enum_seqs_dict)]
    seq_info_test = [(seqid, j) for seqid in test_set for j in range(len(enum_seqs_test[seqid]))]
    # all CPUs but one are used
    all_features_test = Parallel\
                        (n_jobs=-2)\
                        (delayed\
                        (calc_features)\
                        (seq_user.loc[x[0]].iloc[0], enum_seqs_test[x[0]][x[1]], poi_all, user_interest, \
                         log10_trans_mat, poi_dist_mat, poi_rdist_mat) \
                        for x in seq_info_test)
    pickle.dump(all_features_test, open(ffeatures_test, 'wb'))
else:
    # load features
    all_features_test = pickle.load(open(ffeatures_test, 'rb'))

In [363]:
print(len(all_features_test))


109620

In [364]:
score_indices_test = dict() # score vector index in the feature matrix
num_enum_seq_test = 0

In [365]:
for seqid in test_set:
    num = len(enum_seqs_test[seqid])
    score_indices_test[seqid] = [x for x in range(num_enum_seq_test, num_enum_seq_test + num)]
    num_enum_seq_test += num

In [366]:
print(num_enum_seq_test)


109620

In [367]:
assert(len(all_features_test) == num_enum_seq_test)

In [368]:
features_mat_test = np.array(all_features_test)
print(features_mat_test.shape)


(109620, 8)

In [372]:
weights = np.array([0.01, -0.01, 0, 0.88, -0.01, 0.01, -0.05, 0.05]) # train: 0.7993, test: 0.75040650406504061
#weights = np.array([0.32, -0.01, 0.01, 0.92, -0.02, 0.03, 0.27, 0.08]) 
#weights = np.array([0.02, -0.01, 0, 0.87, -0.02, 0.01, 0.37, -0.19]) # train: 0.810, test: 0.744
#weights = np.array([0.02, -0.01, 0.01, 0.92, -0.02, 0.01, 0.28, -0.19]) # train: 0.808, test: 0.744
#weights = np.array([0.05, -0.05, -0.01, 0.97, -0.02, 0.03, 0.16, 0.36]) # train: 0.805, test: 0.732
#weights = np.array([0.06, -0.05, -0.01, 0.98, -0.02, 0.04, 0.23, 0.43]) # train: 0.7996, test: 0.732
#weights = np.array([0.02, -0.01, 0, 0.44, -0.01, 0.02, 0.07, 0.01]) # init: 0.795, test: 0.740
calc_mean_F1score(test_set, test_seqs, enum_seqs_test, features_mat_test.dot(weights), score_indices_test)


Out[372]:
0.7544715447154472

In [ ]:


In [ ]: