Enumerate all trajectories for each user given the trajectory length (e.g. 3, 4, 5) and the (start, end) POIs.
For each trajectory, compute a score based on the features below:
To avoid numerical underflow, use the log of probabilities instead of the probabilities themselves; to avoid zero probabilities, add a smoothing value $\epsilon = 10^{-12}$ to each probability (see the short sketch after this list).
Plot the scores of generated and actual trajectories for each (user, trajectoryLength, startPOI, endPOI) tuple with some degree of transparency (alpha).
Recommend the trajectory with the highest score and measure the performance of the recommendation using recall, precision and F1-score.
Optimise the parameters of the score function by learning. In this specific case, the cost function could be based on recall, precision or F1-score; we can also control the estimation of the transition matrix.
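For illustration, here is a minimal sketch of the smoothed log-probability trick mentioned above (the probability values are made up):
In [ ]:
import numpy as np

EPS = 1e-12                        # smoothing constant
probs = np.array([0.6, 0.4, 0.0])  # hypothetical transition probabilities; one of them is zero
log_probs = np.log10(probs + EPS)  # log10 with smoothing: a zero probability maps to -12 instead of -inf
print(log_probs)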
For user $u$ and POI $p$, define
Travel History: \begin{equation*} S_u = \{(p_1, t_{p_1}^a, t_{p_1}^d), \dots, (p_n, t_{p_n}^a, t_{p_n}^d)\} \end{equation*} where $t_{p_i}^a$ is the arrival time and $t_{p_i}^d$ the departure time of user $u$ at POI $p_i$
Travel Sequences: split $S_u$ into separate sequences whenever \begin{equation*} |t_{p_i}^d - t_{p_{i+1}}^a| > \tau ~(\text{e.g.}~ \tau = 8 ~\text{hours}) \end{equation*} (a minimal splitting sketch appears after these definitions)
POI Popularity: \begin{equation*} Pop(p) = \sum_{u \in U} \sum_{p_i \in S_u} \delta(p_i == p) \end{equation*}
Average POI Visit Duration: \begin{equation*} \bar{V}(p) = \frac{1}{N} \sum_{u \in U} \sum_{p_i \in S_u} (t_{p_i}^d - t_{p_i}^a) \delta(p_i == p) \end{equation*} where $N$ is #visits of POI $p$ by all users
Define the interest of user $u$ in POI category $c$ in two ways: a time-based interest \begin{equation*} Int^{time}(u, c) = \sum_{p_i \in S_u} \frac{t_{p_i}^d - t_{p_i}^a}{\bar{V}(p_i)} \delta(Cat_{p_i} == c) \end{equation*} and a frequency-based interest \begin{equation*} Int^{freq}(u, c) = \sum_{p_i \in S_u} \delta(Cat_{p_i} == c) \end{equation*} where $Cat_{p_i}$ denotes the category of POI $p_i$.
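A minimal sketch of the travel-sequence split defined above, assuming a DataFrame visits_u that holds one user's visits sorted by arrival time, with columns 'arrivalTime' and 'departureTime' in Unix seconds (the variable and column names are hypothetical):
In [ ]:
def split_sequences(visits_u, tau=8*60*60):
    """Split one user's visit history into travel sequences whenever the gap exceeds tau seconds."""
    sequences = []
    current = [visits_u.index[0]]
    for prev, cur in zip(visits_u.index[:-1], visits_u.index[1:]):
        gap = visits_u.loc[cur, 'arrivalTime'] - visits_u.loc[prev, 'departureTime']
        if gap > tau:
            sequences.append(current)  # close the current travel sequence
            current = []
        current.append(cur)
    sequences.append(current)
    return sequences  # list of lists of row indices, one list per travel sequence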
Split actual trajectories into two parts, one for training, the other for testing. Concretely, for each user, consider all of their trajectories with length 3, 4 and 5; pick one for the testing set and put all the others into the training set.
Use trajectories in the training set to compute (via MLE) a transition matrix whose element $[i, j]$ denotes the transition probability from POI category $i$ to POI category $j$.
For each trajectory $T$ in the training set, enumerate all possible trajectories that satisfy the following requirements: the same length as $T$, the same start and end POIs as $T$, and no repeated intermediate POIs.
Compute the 8 scores described above as features, rescale each score into the range [-1, 1], and compute the weighted sum of these scores to get a single score.
(Originally, the weights were normalised so that they lie in the range [0, 1] and their sum is 1.)
(The values in the parameter/weight vector are now drawn from the range [-1, 1], and their sum no longer needs to be 1.)
Choose the trajectory with the highest score $T^*$ and compute the F1 score as follows: \begin{equation*} \text{recall} = \frac{|T \cap T^*|}{|T|}, \quad \text{precision} = \frac{|T \cap T^*|}{|T^*|}, \quad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \end{equation*}
Compute the mean F1 score over all trajectories $T$ in the training set.
Use coordinate-wise grid search to find a good weight vector such that the mean F1 score is as large as possible.
NOTE: Before running this notebook, please run the script src/ijcai15_setup.py to set up the data properly.
In [229]:
%matplotlib inline
import os
import math
import random
import pickle
import pandas as pd
import numpy as np
import numpy.matlib
from datetime import datetime
from joblib import Parallel, delayed
import matplotlib.pyplot as plt
In [230]:
nfeatures = 8 # number of features
EPS = 1e-12 # smooth, deal with 0 probability
random.seed(123456789) # control random choice when splitting training/testing set
In [231]:
data_dir = 'data/data-ijcai15'
#fvisit = os.path.join(data_dir, 'userVisits-Osak.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Osak.csv')
#fvisit = os.path.join(data_dir, 'userVisits-Glas.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Glas.csv')
#fvisit = os.path.join(data_dir, 'userVisits-Edin.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Edin.csv')
fvisit = os.path.join(data_dir, 'userVisits-Toro.csv')
fcoord = os.path.join(data_dir, 'photoCoords-Toro.csv')
In [232]:
suffix = fvisit.split('-')[-1].split('.')[0]
In [233]:
fpoi = os.path.join(data_dir, 'poi-' + suffix + '.csv')
fseq = os.path.join(data_dir, 'seq-' + suffix + '.csv')
ftrain = os.path.join(data_dir, 'trainset-' + suffix + '.pkl')
ftest = os.path.join(data_dir, 'testset-' + suffix + '.pkl')
ffeatures_train = os.path.join(data_dir, 'featuresTrain-' + suffix + '.pkl')
ffeatures_test = os.path.join(data_dir, 'featuresTest-' + suffix + '.pkl')
In [234]:
visits = pd.read_csv(fvisit, sep=';')
visits.head()
Out[234]:
In [235]:
coords = pd.read_csv(fcoord, sep=';')
coords.head()
Out[235]:
In [236]:
# merge data frames according to column 'photoID'
assert(visits.shape[0] == coords.shape[0])
traj = pd.merge(visits, coords, on='photoID')
traj.head()
Out[236]:
In [237]:
pd.DataFrame([traj[['photoLon', 'photoLat']].min(), traj[['photoLon', 'photoLat']].max(), \
traj[['photoLon', 'photoLat']].max() - traj[['photoLon', 'photoLat']].min()], \
index = ['min', 'max', 'range'])
Out[237]:
In [238]:
plt.figure(figsize=[15, 5])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.scatter(traj['photoLon'], traj['photoLat'], marker='+')
Out[238]:
In [239]:
num_photo = traj['photoID'].unique().shape[0]
num_user = traj['userID'].unique().shape[0]
num_poi = traj['poiID'].unique().shape[0]
num_seq = traj['seqID'].unique().shape[0]
pd.DataFrame([num_photo, num_user, num_poi, num_seq, num_photo/num_user, num_seq/num_user], \
index = ['#photo', '#user', '#poi', '#seq', '#photo/user', '#seq/user'], columns=[str(suffix)])
Out[239]:
Compute POI (Longitude, Latitude) as the average coordinates of the assigned photos.
In [240]:
poi_coords = traj[['poiID', 'photoLon', 'photoLat']].groupby('poiID').mean()
poi_coords.reset_index(inplace=True)
poi_coords.rename(columns={'photoLon':'poiLon', 'photoLat':'poiLat'}, inplace=True)
#poi_coords
Extract POI category and visiting frequency.
In [241]:
poi_catfreq = traj[['poiID', 'poiTheme', 'poiFreq']].groupby('poiID').first()
poi_catfreq.reset_index(inplace=True)
#poi_catfreq
In [242]:
poi_all = pd.merge(poi_catfreq, poi_coords, on='poiID')
poi_all.set_index('poiID', inplace=True)
#poi_all
Save POI info to file.
In [243]:
poi_all.to_csv(fpoi, index=True)
#poi_all2 = pd.read_csv(fpoi, index_col=0)
#poi_all2
In [244]:
seq_all = traj[['userID', 'seqID', 'poiID', 'dateTaken']].copy()\
.groupby(['userID', 'seqID', 'poiID']).agg([np.min, np.max])
#seq_all.head()
In [245]:
seq_all.columns = seq_all.columns.droplevel()
seq_all.head()
Out[245]:
In [246]:
seq_all.reset_index(inplace=True)
seq_all.head()
Out[246]:
In [247]:
seq_all.rename(columns={'amin':'arrivalTime', 'amax':'departureTime'}, inplace=True)
seq_all['poiDuration(sec)'] = seq_all['departureTime'] - seq_all['arrivalTime']
#print('Found %d sequences' % len(seq_all))
#pickle.dump(seq_all, open('all_trajectories.pkl', 'bw'))
seq_all.head()
Out[247]:
Save travelling sequences to file.
In [248]:
seq_all.to_csv(fseq, index=False)
#seq_all2 = pd.read_csv(fseq)
#seq_all2.head()
In [249]:
#seq_all = pickle.load(open('all_trajectories.pkl'))
seq_user = seq_all[['seqID', 'userID']].copy()
seq_user = seq_user.groupby('seqID').first()
#type(seq_user)
#seq_user.loc[1].iloc[0]
#seq_user.reset_index(inplace=True)
#seq_user.set_index('seqID', inplace=True)
seq_user.head()
Out[249]:
Trajectories with length {3, 4, 5} are used in our experiment.
In [250]:
seq_len = seq_all[['userID', 'seqID', 'poiID']].copy()
seq_len = seq_len.groupby(['userID', 'seqID']).agg(np.size)
seq_len.reset_index(inplace=True)
seq_len.rename(columns={'poiID':'seqLen'}, inplace=True)
#seq_len.head()
ax = seq_len['seqLen'].hist(bins=20)
ax.set_yscale('log')
In [251]:
seq_345 = seq_len[seq_len['seqLen'].isin({3, 4, 5})]
seq_345['seqLen'].hist(bins=9)
Out[251]:
Split travelling sequences into a training set and a testing set using leave-one-out for each user.
For testing purposes, users with fewer than two travelling sequences are not considered in this experiment.
In [252]:
train_set = []
test_set = []
In [253]:
user_seqs = seq_345[['userID', 'seqID']].groupby('userID')
In [254]:
for user, indices in user_seqs.groups.items():
if len(indices) < 2: continue
idx = random.choice(indices)
test_set.append(seq_345.loc[idx, 'seqID'])
train_set.extend([seq_345.loc[x, 'seqID'] for x in indices if x != idx])
In [255]:
print('#seq in trainset:', len(train_set))
print('#seq in testset:', len(test_set))
seq_345[seq_345['seqID'].isin(train_set)]['seqLen'].hist(bins=9)
#data = np.array(seqs1['seqLen'])
#hist, bins = np.histogram(data, bins=3)
#print(hist)
Out[255]:
In [256]:
seq_exp = seq_345[['userID', 'seqID']].copy()
seq_exp = seq_exp.groupby('userID').agg(np.size)
seq_exp.reset_index(inplace=True)
seq_exp.rename(columns={'seqID':'#seq'}, inplace=True)
seq_exp = seq_exp[seq_exp['#seq'] > 1] # users with more than 1 sequence
print('total #seq for experiment:', seq_exp['#seq'].sum())
#seq_exp.head()
In [257]:
seq_t = seq_345[seq_345['seqID'].isin(train_set)]
nseq3 = seq_t[seq_t['seqLen'] == 3].shape[0]
nseq4 = seq_t[seq_t['seqLen'] == 4].shape[0]
nseq5 = seq_t[seq_t['seqLen'] == 5].shape[0]
nseq345 = nseq3 + nseq4 + nseq5
randF1 = (2/3) * (nseq3 / nseq345) + (2/4) * (nseq4 / nseq345) + (2/5) * (nseq5 / nseq345)
print('%d sequences with length 3, %d sequences with length 4, %d sequences with length 5' % (nseq3, nseq4, nseq5))
print('F1-score by random guessing is %.3f' % randF1)
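The random-guessing baseline computed above follows from the fact that a random recommendation $T^*$ of length $n$ with the same start and end POIs as the actual trajectory $T$ shares at least those two POIs with it, so \begin{equation*} F_1 = \frac{2\,|T \cap T^*|}{|T| + |T^*|} \approx \frac{2 \cdot 2}{n + n} = \frac{2}{n} \end{equation*} and randF1 is simply the length-weighted average of $2/n$ over the training sequences of length 3, 4 and 5.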
Save training/testing set to file.
In [258]:
pickle.dump(train_set, open(ftrain, 'wb'))
pickle.dump(test_set, open(ftest, 'wb'))
Compute transition probabilities between different POI categories.
In [259]:
poi_cats = traj['poiTheme'].unique().tolist()
poi_cats.sort()
poi_cats
Out[259]:
In [260]:
ncats = len(poi_cats)
trans_mat = pd.DataFrame(data=np.zeros((ncats, ncats), dtype=np.float64), index=poi_cats, columns=poi_cats)
#trans_mat
Count the number of occurrences of each possible transition.
In [261]:
#train_set = [4, 13, 32, 33, 34, 99, 101]
#seq_all[seq_all['seqID'] == train_set[0]]
for seqid in train_set:
seqi = seq_all[seq_all['seqID'] == seqid].copy()
    seqi.sort_values(by='arrivalTime', ascending=True, inplace=True)
for j in range(len(seqi.index)-1):
idx1 = seqi.index[j]
idx2 = seqi.index[j+1]
poi1 = seqi.loc[idx1, 'poiID']
poi2 = seqi.loc[idx2, 'poiID']
cat1 = poi_all.loc[poi1, 'poiTheme']
cat2 = poi_all.loc[poi2, 'poiTheme']
trans_mat.loc[cat1, cat2] += 1
trans_mat
Out[261]:
Normalise each row to get an estimate of transition probabilities (MLE).
In [262]:
for r in trans_mat.index:
    rowsum = trans_mat.loc[r].sum()
if rowsum == 0: continue # deal with lack of data
for c in trans_mat.columns:
trans_mat.loc[r, c] /= rowsum
trans_mat
Out[262]:
Compute the log of transition probabilities with smooth factor $\epsilon=10^{-12}$.
In [263]:
log10_trans_mat = np.log10(trans_mat.copy() + EPS)
log10_trans_mat
Out[263]:
Compute the average POI visit duration and POI popularity as defined at the top of the notebook.
In [264]:
poi_avg_pop = seq_all[seq_all['seqID'].isin(train_set)]
poi_avg_pop = poi_avg_pop[['poiID', 'poiDuration(sec)']].copy()
In [265]:
poi_avg_pop = poi_avg_pop.groupby('poiID').agg([np.mean, np.size])
poi_avg_pop.columns = poi_avg_pop.columns.droplevel()
poi_avg_pop.reset_index(inplace=True)
poi_avg_pop.rename(columns={'mean':'avgDuration(sec)', 'size':'popularity'}, inplace=True)
poi_avg_pop.set_index('poiID', inplace=True)
print('#poi:', poi_avg_pop.shape[0])
if poi_avg_pop.shape[0] < poi_all.shape[0]:
extra_index = list(set(poi_all.index) - set(poi_avg_pop.index))
extra_poi = pd.DataFrame(data=np.zeros((len(extra_index), 2), dtype=np.float64), \
index=extra_index, columns=['avgDuration(sec)', 'popularity'])
    poi_avg_pop = pd.concat([poi_avg_pop, extra_poi])
print('#poi after extension:', poi_avg_pop.shape[0])
poi_avg_pop
Out[265]:
Compute the time-based and frequency-based user interest as defined at the top of the notebook.
In [266]:
user_interest = seq_all[seq_all['seqID'].isin(train_set)]
user_interest = user_interest[['userID', 'poiID', 'poiDuration(sec)']].copy()
In [267]:
user_interest['timeRatio'] = [poi_avg_pop.loc[x, 'avgDuration(sec)'] for x in user_interest['poiID']]
user_interest['timeRatio'] = user_interest['poiDuration(sec)'] / user_interest['timeRatio']
user_interest['poiTheme'] = [poi_all.loc[x, 'poiTheme'] for x in user_interest['poiID']]
user_interest.drop(['poiID', 'poiDuration(sec)'], axis=1, inplace=True)
The sum is what the paper defines, but the cumulative value of (time ratio) * (average POI visit duration) becomes extremely large in many cases, which is unrealistic, so the mean is used here instead.
In [268]:
#user_interest = user_interest.groupby(['userID', 'poiTheme']).agg([np.sum, np.size])
user_interest = user_interest.groupby(['userID', 'poiTheme']).agg([np.mean, np.size]) # try the mean
In [269]:
user_interest.columns = user_interest.columns.droplevel()
#user_interest.rename(columns={'sum':'timeBased', 'size':'freqBased'}, inplace=True)
user_interest.rename(columns={'mean':'timeBased', 'size':'freqBased'}, inplace=True)
user_interest.reset_index(inplace=True)
user_interest.set_index(['userID', 'poiTheme'], inplace=True)
user_interest.head()
Out[269]:
In [270]:
poi_list = poi_all.index.tolist()
In [271]:
def enum_345_seq(seqid_set, poi_list):
"""Enumerate all possible travelling sequences with length {3, 4, 5}"""
act_seqs = dict()
enum_seqs = dict()
for seqid in seqid_set:
seqi = seq_all[seq_all['seqID'] == seqid].copy()
        seqi.sort_values(by='arrivalTime', ascending=True, inplace=True)
act_seqs[seqid] = seqi['poiID'].tolist()
p0 = seqi.loc[seqi.index[0], 'poiID']
pN = seqi.loc[seqi.index[-1],'poiID']
# enumerate sequences with length 3
if seqi.shape[0] == 3:
enum_seqs[seqid] = [[p0, p, pN] \
for p in poi_list if p not in {p0, pN}]
continue
# enumerate sequences with length 4
if seqi.shape[0] == 4:
enum_seqs[seqid] = [[p0, p1, p2, pN] \
for p1 in poi_list if p1 not in {p0, pN} \
for p2 in poi_list if p2 not in {p0, p1, pN}]
continue
# enumerate sequences with length 5
if seqi.shape[0] == 5:
enum_seqs[seqid] = [[p0, p1, p2, p3, pN] \
for p1 in poi_list if p1 not in {p0, pN} \
for p2 in poi_list if p2 not in {p0, p1, pN} \
for p3 in poi_list if p3 not in {p0, p1, p2, pN}]
continue
return enum_seqs, act_seqs
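The number of enumerated candidates per actual sequence grows quickly with the number of POIs, which is why feature computation is parallelised later. A rough count, assuming a hypothetical city with m = 30 POIs and distinct start/end POIs:
In [ ]:
m = 30                                # hypothetical number of POIs in the city
n_len3 = m - 2                        # one free intermediate slot
n_len4 = (m - 2) * (m - 3)            # two free slots, no repeated POIs
n_len5 = (m - 2) * (m - 3) * (m - 4)  # three free slots, no repeated POIs
print(n_len3, n_len4, n_len5)         # 28, 756 and 19656 candidates per actual sequence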
As described at the top of the notebook, the features computed for each trajectory in this experiment are: total time-based user interest; total frequency-based user interest; total POI popularity; total (negative) travel distance; and four trajectory (log) probabilities that combine the POI-category transition matrix with, respectively, the nearest neighbor rule, the most popular POI rule, a rule preferring near neighbors, and a rule preferring popular POIs.
In [272]:
def calc_dist(longitude1, latitude1, longitude2, latitude2):
"""Calculate the distance (unit: km) between two places on earth"""
# convert degrees to radians
lon1 = math.radians(longitude1)
lat1 = math.radians(latitude1)
lon2 = math.radians(longitude2)
lat2 = math.radians(latitude2)
radius = 6371.009 # mean earth radius is 6371.009km, en.wikipedia.org/wiki/Earth_radius#Mean_radius
# The haversine formula, en.wikipedia.org/wiki/Great-circle_distance
dlon = math.fabs(lon1 - lon2)
dlat = math.fabs(lat1 - lat2)
return 2 * radius * math.asin(math.sqrt(\
(math.sin(0.5*dlat))**2 + math.cos(lat1) * math.cos(lat2) * (math.sin(0.5*dlon))**2 ))
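A quick sanity check of calc_dist using two well-known Toronto landmarks (coordinates are approximate):
In [ ]:
# CN Tower vs. Royal Ontario Museum, given as (lon, lat) pairs; the haversine distance should be roughly 3 km.
print(calc_dist(-79.3871, 43.6426, -79.3948, 43.6677))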
In [273]:
poi_dist_mat = pd.DataFrame(data=np.zeros((poi_all.shape[0], poi_all.shape[0]), dtype=np.float64), \
index=poi_all.index, columns=poi_all.index)
poi_rdist_mat = poi_dist_mat.copy()
In [274]:
for i in range(poi_all.index.shape[0]):
for j in range(i+1, poi_all.index.shape[0]):
r = poi_all.index[i]
c = poi_all.index[j]
dist = calc_dist(poi_all.loc[r, 'poiLon'], poi_all.loc[r, 'poiLat'], \
poi_all.loc[c, 'poiLon'], poi_all.loc[c, 'poiLat'])
poi_dist_mat.loc[r, c] = dist
poi_dist_mat.loc[c, r] = dist
assert(dist > 0.)
rdist = 1./dist
poi_rdist_mat.loc[r, c] = rdist
poi_rdist_mat.loc[c, r] = rdist
In [275]:
def calc_features(user, seq, poi_all, user_interest, log10_trans_mat, poi_dist_mat, poi_rdist_mat):
"""Compute 8 features for each enumerated trajectory"""
features = np.zeros(nfeatures, dtype=np.float64)
# POI based features
for poi in seq:
cat = poi_all.loc[poi, 'poiTheme']
if (user, cat) in user_interest.index:
features[0] += user_interest.loc[user, cat]['timeBased'] # 1. time-based user interest
features[1] += user_interest.loc[user, cat]['freqBased'] # 2. freq-based user interest
features[2] += poi_avg_pop.loc[poi, 'popularity'] # 3. POI popularity
# POI-pair based features
for k in range(len(seq)-1):
poi1 = seq[k]
poi2 = seq[k+1]
assert(poi1 != poi2)
cat1 = poi_all.loc[poi1, 'poiTheme']
cat2 = poi_all.loc[poi2, 'poiTheme']
features[3] += -1 * poi_dist_mat.loc[poi1, poi2] # 4. travel distance
trans_prob = log10_trans_mat.loc[cat1, cat2] # log of transition probability
for l in range(4, 8):
features[l] += trans_prob
poi_cat2 = poi_all[poi_all['poiTheme'] == cat2].copy()
if cat1 == cat2:
poi_cat2.drop(poi1, axis=0, inplace=True) # drop row
distvec = pd.DataFrame(data=[poi_rdist_mat.loc[poi1, x] for x in poi_cat2.index], index=poi_cat2.index)
if distvec.idxmax().iloc[0] == poi2:
features[4] += math.log10(1. + EPS) # poi2 is the nearest neighbor of poi1
else:
features[4] += math.log10(0. + EPS) # poi2 is not the nearest neighbor of poi1
popvec = pd.DataFrame(data=[poi_avg_pop.loc[x,'popularity'] for x in poi_cat2.index], index=poi_cat2.index)
#popvec = pd.DataFrame(data=[poi_all.loc[x,'poiFreq'] for x in poi_cat2.index], index=poi_cat2.index)
if popvec.idxmax().iloc[0] == poi2:
features[5] += math.log10(1. + EPS) # poi2 is the most popular one within cat2
else:
features[5] += math.log10(0. + EPS) # poi2 is not the most popular one within cat2
features[6] += math.log10(EPS + distvec.loc[poi2].iloc[0] / distvec.sum().iloc[0])
features[7] += math.log10(EPS + popvec.loc[poi2].iloc[0] / popvec.sum().iloc[0])
# normalise score, range [-1, 1]
features /= abs(features).max()
return features
In [276]:
enum_seqs, train_seqs = enum_345_seq(train_set, poi_list)
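For reference, a single feature vector can also be computed directly; a hedged illustration using the first candidate of the first training sequence:
In [ ]:
seqid = train_set[0]
f = calc_features(seq_user.loc[seqid].iloc[0], enum_seqs[seqid][0], poi_all,
                  user_interest, log10_trans_mat, poi_dist_mat, poi_rdist_mat)
print(f)  # 8 feature values rescaled into [-1, 1]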
Load features from file if possible.
In [277]:
doCompute = True
In [278]:
train_set1 = pickle.load(open(ftrain, 'rb'))
if (np.array(sorted(train_set1)) == np.array(sorted(train_set))).all() and os.path.exists(ffeatures_train):
doCompute = False
In [279]:
all_features = None
In [280]:
if doCompute:
#[(seqid, seqidx_in_enum_seqs_dict)]
seq_info = [(seqid, j) for seqid in train_set for j in range(len(enum_seqs[seqid]))]
# all CPUs but one are used
    all_features = Parallel(n_jobs=-2)(
        delayed(calc_features)(seq_user.loc[x[0]].iloc[0], enum_seqs[x[0]][x[1]], poi_all,
                               user_interest, log10_trans_mat, poi_dist_mat, poi_rdist_mat)
        for x in seq_info)
pickle.dump(all_features, open(ffeatures_train, 'wb'))
else:
# load features
all_features = pickle.load(open(ffeatures_train, 'rb'))
In [281]:
print(len(all_features))
In [282]:
num_enum_seq = 0 # total number of enumerated sequences
score_indices = dict() # score vector index in the feature matrix
In [283]:
for seqid in train_set:
num = len(enum_seqs[seqid])
score_indices[seqid] = [x for x in range(num_enum_seq, num_enum_seq + num)]
num_enum_seq += num
In [284]:
print(num_enum_seq)
In [285]:
assert(len(all_features) == num_enum_seq)
In [286]:
features_name = \
['total time-based user interest', 'total freq-based user interest', \
'total POI popularity', 'total (negative) travel distance', \
'trajectory probability with nearest neighbor rule', 'trajectory probability with most popular POI rule', \
'trajectory probability which prefers near neighbors', 'trajectory probability which prefers popular POIs']
In [287]:
def calc_F1score(seq_act, seq_rec):
assert(len(seq_act) > 0)
assert(len(seq_rec) > 0)
actset = set(seq_act)
recset = set(seq_rec)
intersect = actset & recset
recall = len(intersect) / len(seq_act)
precision = len(intersect) / len(seq_rec)
return 2. * precision * recall / (precision + recall)
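A quick sanity check of calc_F1score with hypothetical POI IDs: the actual and recommended trajectories below share two of their three POIs, so recall = precision = 2/3 and the F1 score is 2/3.
In [ ]:
print(calc_F1score([1, 2, 3], [1, 4, 3]))  # expected: 0.666...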
In [288]:
def calc_mean_F1score(train_set, train_seqs, enum_seqs, all_scores, score_indices):
F1scores = []
for seqid in train_set:
scores = np.array([all_scores[x] for x in score_indices[seqid]])
bestseq = enum_seqs[seqid][scores.argmax()]
F1scores.append(calc_F1score(train_seqs[seqid], bestseq))
return np.mean(F1scores)
In [289]:
features_mat = np.array(all_features)
In [290]:
print(features_mat.shape)
#print(features_mat[0])
In [291]:
N = 1000
rand_weights = np.zeros((N, nfeatures), dtype=np.float64)
rand_F1scores = np.zeros(N, dtype=np.float64)
In [292]:
for j in range(N):
weights = np.random.uniform(-1, 1, nfeatures)
rand_weights[j] = weights
all_scores = features_mat.dot(weights)
rand_F1scores[j] = calc_mean_F1score(train_set, train_seqs, enum_seqs, all_scores, score_indices)
maxidx = rand_F1scores.argmax()
minidx = rand_F1scores.argmin()
print('max avgF1:', rand_F1scores[maxidx], ', weights:', rand_weights[maxidx])
print('min avgF1:', rand_F1scores[minidx], ', weights:', rand_weights[minidx])
In [293]:
# the full all_scores matrix (one score vector per weight vector) is too large to fit in memory
#randscores = Parallel(n_jobs=-2)\
# (delayed\
# (calc_mean_F1score)\
# (train_set, train_seqs, enum_seqs, all_scores[j], score_indices) for j in range(N))
In [294]:
plt.figure(figsize=[10, 8])
plt.xlim([0, N])
plt.ylim([rand_F1scores.min()-0.01, max(rand_F1scores.max(), randF1)+0.01])
plt.xlabel('Experiment')
plt.ylabel('Mean F1score')
plt.plot([0, N], [randF1, randF1], color='r', linestyle='--', label='random guessing: ' + str(round(randF1,3)))
plt.scatter([range(N)], rand_F1scores, marker='+')
plt.legend()
Out[294]:
In [295]:
pd.Series(rand_F1scores).hist(bins=50)
Out[295]:
NOTE: if the weight is 0, then no features are used in the algorithm and the scores of all candidate trajectories are the same, so the F1 score of the algorithm depends only on the order of the candidates (Python will choose the first one). As a result, the actual value of the F1 score is meaningless, which indicates that 0 should be skipped in the list of values below.
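Indeed, with an all-zero weight vector every candidate gets the same score and numpy's argmax simply returns the first index:
In [ ]:
print(np.zeros(5).argmax())  # 0 -- ties are broken by taking the first candidate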
In [296]:
#values = [-0.5, 0, 0.5]
#values = np.linspace(-1, 1, 11)
#values = [-0.1, 0.1]
values = [-1, -0.5, -0.1, 0.1, 0.5, 1]
plt.figure(figsize=[16, 16])
for j in range(nfeatures):
all_F1scores = []
for k in range(len(values)):
weights = np.zeros(nfeatures, dtype=np.float64)
weights[j] = values[k]
all_scores = features_mat.dot(weights)
F1scores = []
for seqid in train_set:
scores = np.array([all_scores[x] for x in score_indices[seqid]])
bestseq = enum_seqs[seqid][scores.argmax()]
F1scores.append(calc_F1score(train_seqs[seqid], bestseq))
all_F1scores.append(F1scores)
plt.subplot(4, 2, j+1)
xlim = [min(values)-0.1, max(values)+0.1]
ylim = [0.3, 1.1]
plt.xlim(xlim)
plt.ylim(ylim)
plt.xlabel(features_name[j])
plt.ylabel('Mean F1-score')
plt.boxplot(all_F1scores, labels=values) #OK
avgF1scores = np.array([np.mean(x) for x in all_F1scores])
#plt.scatter(values, avgF1scores, marker='+')
#plt.plot([values[avgF1scores.argmax()]], [avgF1scores.max()], marker='o', color='r', \
# label='max: ' + str(round(avgF1scores.max(),3)))
#plt.plot(xlim, [randF1, randF1], color='g', linestyle='--', label='random guessing: ' + str(round(randF1,3)))
#plt.legend()
The first feature is the total time-based user interest in a trajectory (i.e. the sum of (expected time the user spends at each POI) / (average time a user spends at each POI)). It seems the existence of this feature negatively affected the algorithm, which is strange, as the IJCAI paper argues that capturing the expected time a user spends at a POI will improve the accuracy of trajectory recommendation.
The second feature is the total number of visits (by the user) to all POIs in a trajectory. Similar to the first feature, it seems the existence of this feature negatively affected the algorithm, which is also strange, as experiments in the IJCAI paper show that capturing a user's POI visiting frequency will improve the accuracy of trajectory recommendation; the improvement is smaller than that of capturing visit duration, but still much better than the greedy and random selection strategies.
The third feature is the total POI popularity (i.e. the number of visits to a POI by all users) of all POIs in a trajectory. It seems this feature doesn't affect the recommendation much, though a positive weight for it helps the recommendation algorithm slightly.
The fourth feature is the negative (i.e. multiplied by -1) of the total travelling cost (i.e. the total travel distance of the trajectory) for a user. It is strange that the algorithm prefers long travelling distances.
The fifth feature is the probability of a recommended trajectory based on the transition probabilities between POI categories and the nearest neighbor rule for choosing a specific POI within a certain category. It is clear the algorithm likes nearest neighbors.
The sixth feature is the probability of a recommended trajectory based on the transition probabilities between POI categories and the most popular POI rule for choosing a specific POI within a certain category. It is clear the algorithm likes popular POIs.
The seventh feature is the probability of a recommended trajectory based on the transition probabilities between POI categories and the following rule for choosing a specific POI within a certain category: choose a random POI with probability proportional to the reciprocal of its distance to the current POI. Similar to the fifth feature, which utilises the nearest neighbor rule, the algorithm doesn't like far neighbors.
The last feature is the probability of a recommended trajectory based on the transition probabilities between POI categories and the following rule for choosing a specific POI within a certain category: choose a random POI with probability proportional to its popularity. Similar to the sixth feature, which utilises the most popular POI rule, the algorithm doesn't like unpopular POIs.
In [297]:
weights = np.array([0.01, -0.01, 0, 0.88, -0.01, 0.01, -0.05, 0.05])
calc_mean_F1score(train_set, train_seqs, enum_seqs, features_mat.dot(weights), score_indices)
Out[297]:
In [352]:
# uniform
# 0.714, [-1. 0. 0. 0.85 0.64 -0.05 0.03 0.]
#weights = np.zeros(nfeatures, dtype=np.float64)
# 0.728, [-0.3 0.65 -0.37 0.99 1. 0.05 0.95 1.]
#weights = np.ones(nfeatures, dtype=np.float64)
# 0.694, [-1. -1. -1. 0.22 0.99 0.49 0.25 0.76]
#weights = -1 * np.ones(nfeatures, dtype=np.float64)
# random
# 0.723, [-0.74 0.09 -0.19 0.99 0.84 -0.04 0.03 0.45]
#weights = np.random.uniform(-1, 1, nfeatures)
# 0.753, [0.35 -0.34 -0.17 0.86 0.78 0.35 0.89 0.63]
#weights = rand_weights[ rand_F1scores.argmax() ]
# hints from single feature experiment
# 0.732, [1 -1 1 1 1 1 1 -1]
#weights = np.array([-1, -1, 1, 1, 1, 1, 1, 1])
# 0.756, [0.12 -0.1 -0.03 0.75 0.1 0.1 0.1 0.1]
#weights = np.array([-0.1, -0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
# 0.7996, [0.06 -0.05 -0.01 0.98 -0.02 0.04 0.23 0.43]
#weights = np.array([-0.05, -0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])
#weights = np.array([0.06, -0.05, -0.01, 0.98, -0.02, 0.04, 0.23, 0.43]) # init: 0.7996, end: 0.803
#weights = np.array([0.05, -0.05, -0.01, 0.86, -0.02, 0.03, 0.16, 0.43]) # init: 0.803, end: 0.805
#weights = np.array([0.05, -0.05, -0.01, 0.97, -0.02, 0.03, 0.16, 0.36]) # init: 0.805, end: 0.805
# 0.808, [0.02 -0.01 0.01 0.92 -0.02 0.01 0.28 -0.19]
#weights = np.array([-0.02, -0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02])
weights = np.array([0.02, -0.01, 0.01, 0.92, -0.02, 0.01, 0.28, -0.19]) # init: 0.808, end: 0.810
#weights = np.array([0.02, -0.01, 0, 0.87, -0.02, 0.01, 0.37, -0.19]) # init: 0.810, end: 0.810
# 0.810 [0.02, -0.01, 0.01, 0.87, -0.02, 0.01, 0.37, -0.19]
#weights = np.array([0.02, -0.01, 0, 1, -0.02, 0.01, 0.37, -0.19])
# 0.794, [0.01, -0.01, 0.01, 0.45, -0.01, 0.02, 0.07, 0.01]
#weights = np.array([-0.01, -0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
# 0.794, [0.01 -0.01 0.01 0.45 -0.01 0.02 0.07 0.01]
#weights = np.array([-0.01, -0.01, 0, 0.01, 0.01, 0.01, 0.01, 0.01])
#weights = np.array([0, 0, 0, 0.01, 0.01, 0.01, 0.01, 0.01]) # 0.789
#weights = np.array([0.01, -0.01, 0.01, 0.45, -0.01, 0.02, 0.07, 0.01]) # init: 0.794, end: 0.795
#weights = np.array([0.02, -0.01, 0, 0.44, -0.01, 0.02, 0.07, 0.01]) # init: 0.795, end: 0.795
In [353]:
#params = np.linspace(-1, 1, 41)
#params = np.linspace(-1, 1, 101)
params = np.linspace(-1, 1, 201)
F1scores = np.zeros((weights.shape[0], params.shape[0]), dtype=np.float64)
In [354]:
t1 = datetime.now()
for k in range(nfeatures):
for j in range(params.shape[0]):
weights[k] = params[j]
all_scores = features_mat.dot(weights)
F1scores[k, j] = calc_mean_F1score(train_set, train_seqs, enum_seqs, all_scores, score_indices)
maxidx = F1scores[k].argmax()
weights[k] = params[maxidx]
t2 = datetime.now()
print('%d seconds used' % (t2-t1).total_seconds()) # 120 seconds
In [355]:
#t1 = datetime.now()
#for k in range(nfeatures):
# all_weights = np.matlib.repmat(weights, params.shape[0], 1)
# for j in range(params.shape[0]):
# all_weights[j, k] = params[j]
# F1scores1 = Parallel(n_jobs=-2)\
# (delayed(calc_mean_F1score)
# (train_set, train_seqs, enum_seqs, features_mat.dot(all_weights[j]), score_indices)
# for j in range(params.shape[0]))
# F1scores[k] = np.array(F1scores1)
# maxidx = F1scores[k].argmax()
# weights[k] = params[maxidx]
#t2 = datetime.now()
#print('%d seconds used' % (t2-t1).total_seconds()) # 407 seconds
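The cells above perform a single coordinate-wise sweep. A minimal sketch of repeating the sweep until the mean F1 score stops improving (it works on a copy of the weight vector, so the results below are unaffected; runtime grows with the number of passes):
In [ ]:
w = weights.copy()
best_F1 = calc_mean_F1score(train_set, train_seqs, enum_seqs, features_mat.dot(w), score_indices)
improved = True
while improved:
    improved = False
    for k in range(nfeatures):
        best_v = w[k]
        for v in params:
            w[k] = v
            F1 = calc_mean_F1score(train_set, train_seqs, enum_seqs, features_mat.dot(w), score_indices)
            if F1 > best_F1 + 1e-6:           # keep only strict improvements
                best_F1, best_v, improved = F1, v, True
        w[k] = best_v                         # fix the best value found for this coordinate
print(best_F1, w)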
In [356]:
print(weights)
print(F1scores[-1].max())
In [357]:
plt.figure(figsize=[16, 16])
for k in range(nfeatures):
plt.subplot(4, 2, k+1)
plt.xlim([-1.1, 1.1])
plt.ylim([0.3, 0.9])
plt.xlabel(features_name[k])
plt.ylabel('Mean F1-score')
plt.scatter(params, F1scores[k], marker='+')
plt.plot([-1.1, 1.1], [randF1, randF1], color='g', linestyle='--', label='random guessing: ' + str(round(randF1,3)))
plt.legend()
In [358]:
enum_seqs_test, test_seqs = enum_345_seq(test_set, poi_list)
Load features from file if possible.
In [359]:
doCompute = True
In [360]:
test_set1 = pickle.load(open(ftest, 'rb'))
if (np.array(sorted(test_set1)) == np.array(sorted(test_set))).all() and os.path.exists(ffeatures_test):
doCompute = False
In [361]:
all_features_test = None
In [362]:
if doCompute:
#[(seqid, seqidx_in_enum_seqs_dict)]
seq_info_test = [(seqid, j) for seqid in test_set for j in range(len(enum_seqs_test[seqid]))]
# all CPUs but one are used
    all_features_test = Parallel(n_jobs=-2)(
        delayed(calc_features)(seq_user.loc[x[0]].iloc[0], enum_seqs_test[x[0]][x[1]], poi_all,
                               user_interest, log10_trans_mat, poi_dist_mat, poi_rdist_mat)
        for x in seq_info_test)
pickle.dump(all_features_test, open(ffeatures_test, 'wb'))
else:
# load features
all_features_test = pickle.load(open(ffeatures_test, 'rb'))
In [363]:
print(len(all_features_test))
In [364]:
score_indices_test = dict() # score vector index in the feature matrix
num_enum_seq_test = 0
In [365]:
for seqid in test_set:
num = len(enum_seqs_test[seqid])
score_indices_test[seqid] = [x for x in range(num_enum_seq_test, num_enum_seq_test + num)]
num_enum_seq_test += num
In [366]:
print(num_enum_seq_test)
In [367]:
assert(len(all_features_test) == num_enum_seq_test)
In [368]:
features_mat_test = np.array(all_features_test)
print(features_mat_test.shape)
In [372]:
weights = np.array([0.01, -0.01, 0, 0.88, -0.01, 0.01, -0.05, 0.05]) # train: 0.7993, test: 0.75040650406504061
#weights = np.array([0.32, -0.01, 0.01, 0.92, -0.02, 0.03, 0.27, 0.08])
#weights = np.array([0.02, -0.01, 0, 0.87, -0.02, 0.01, 0.37, -0.19]) # train: 0.810, test: 0.744
#weights = np.array([0.02, -0.01, 0.01, 0.92, -0.02, 0.01, 0.28, -0.19]) # train: 0.808, test: 0.744
#weights = np.array([0.05, -0.05, -0.01, 0.97, -0.02, 0.03, 0.16, 0.36]) # train: 0.805, test: 0.732
#weights = np.array([0.06, -0.05, -0.01, 0.98, -0.02, 0.04, 0.23, 0.43]) # train: 0.7996, test: 0.732
#weights = np.array([0.02, -0.01, 0, 0.44, -0.01, 0.02, 0.07, 0.01]) # init: 0.795, test: 0.740
calc_mean_F1score(test_set, test_seqs, enum_seqs_test, features_mat_test.dot(weights), score_indices_test)
Out[372]:
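As promised at the top of the notebook, recall and precision can also be reported separately on the test set; a minimal sketch reusing the variables defined above:
In [ ]:
def calc_recall_precision(seq_act, seq_rec):
    """Recall and precision of a recommended trajectory against the actual one."""
    intersect = set(seq_act) & set(seq_rec)
    return len(intersect) / len(seq_act), len(intersect) / len(seq_rec)

recalls, precisions = [], []
for seqid in test_set:
    scores = features_mat_test[score_indices_test[seqid]].dot(weights)
    bestseq = enum_seqs_test[seqid][scores.argmax()]
    r, p = calc_recall_precision(test_seqs[seqid], bestseq)
    recalls.append(r)
    precisions.append(p)
print('mean recall: %.3f, mean precision: %.3f' % (np.mean(recalls), np.mean(precisions)))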