Features for Trajectory Recommendation

1. Features

POI Features ?

  1. POI category
  2. POI popularity
  3. POI location

Transition Features

  1. POI category (a transition matrix between different categories)
  2. POI popularity (a transition matrix between different classes of popularity)
  3. POI pair distance (a transition matrix between different classes of distance)

Computation Steps

  1. First, compute the above features from a set of travelling sequences.
  2. Then, compute the log likelihood of actual and enumerated sequences using the POI category, POI popularity and POI pair distance transition matrices, and compare them (a toy sketch of this pipeline follows this list).
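
To make the two steps concrete before loading the data, here is a self-contained toy sketch for a single feature (POI category). The category names and sequences are made up, and a flat +0.1 smoothing constant is used for brevity; the actual pipeline below works on the IJCAI'15 data and uses row-specific smoothing.

```python
import numpy as np
import pandas as pd

# Toy travelling sequences over made-up categories.
toy_seqs = [['Park', 'Museum', 'Beach'], ['Museum', 'Beach', 'Park']]
cats = sorted({c for s in toy_seqs for c in s})

# Step 1: count observed transitions, then row-normalise into probabilities.
counts = pd.DataFrame(0.0, index=cats, columns=cats)
for s in toy_seqs:
    for a, b in zip(s[:-1], s[1:]):
        counts.loc[a, b] += 1
transmat = (counts + 0.1).div((counts + 0.1).sum(axis=1), axis=0)  # flat smoothing

# Step 2: log likelihood of a candidate sequence under the transition matrix;
# actual and enumerated sequences are then compared by this score.
def toy_loglikelihood(seq):
    return sum(np.log10(transmat.loc[a, b]) for a, b in zip(seq[:-1], seq[1:]))

print(toy_loglikelihood(['Park', 'Museum', 'Beach']))  # observed transitions
print(toy_loglikelihood(['Park', 'Beach', 'Museum']))  # unobserved transitions
```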

2. Load Data


In [101]:
%matplotlib inline

import os
import re
import math
import random
import pickle
import pandas as pd
import numpy as np
import scipy.stats
#from numba import jit
from datetime import datetime
from joblib import Parallel, delayed
import matplotlib.pyplot as plt

In [102]:
nfeatures = 8 # number of features
EPS = 1e-12 # smooth, deal with 0 probability
random.seed(123456789) # control random choice when splitting training/testing set

In [103]:
data_dir = 'data/data-ijcai15'
#fvisit = os.path.join(data_dir, 'userVisits-Osak.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Osak.csv')
#fvisit = os.path.join(data_dir, 'userVisits-Glas.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Glas.csv')
#fvisit = os.path.join(data_dir, 'userVisits-Edin.csv')
#fcoord = os.path.join(data_dir, 'photoCoords-Edin.csv')
fvisit = os.path.join(data_dir, 'userVisits-Toro.csv')
fcoord = os.path.join(data_dir, 'photoCoords-Toro.csv')

In [104]:
suffix = fvisit.split('-')[-1].split('.')[0]

In [105]:
visits = pd.read_csv(fvisit, sep=';')
coords = pd.read_csv(fcoord, sep=';')
# merge data frames according to column 'photoID'
assert(visits.shape[0] == coords.shape[0])
traj = pd.merge(visits, coords, on='photoID')
traj.head()


Out[105]:
photoID userID dateTaken poiID poiTheme poiFreq seqID photoLon photoLat
0 7941504100 10007579@N00 1346844688 30 Structure 1538 1 -79.380844 43.645641
1 4886005532 10012675@N05 1142731848 6 Cultural 986 2 -79.391525 43.654335
2 4886006468 10012675@N05 1142732248 6 Cultural 986 2 -79.391525 43.654335
3 4885404441 10012675@N05 1142732373 6 Cultural 986 2 -79.391525 43.654335
4 4886008334 10012675@N05 1142732445 6 Cultural 986 2 -79.391525 43.654335

In [106]:
num_photo = traj['photoID'].unique().shape[0]
num_user = traj['userID'].unique().shape[0]
num_poi = traj['poiID'].unique().shape[0]
num_seq = traj['seqID'].unique().shape[0]
pd.DataFrame({'#photo': num_photo, '#user': num_user, '#poi': num_poi, '#seq': num_seq, \
              '#photo/user': num_photo/num_user, '#seq/user': num_seq/num_user}, index=[str(suffix)])


Out[106]:
#photo #photo/user #poi #seq #seq/user #user
Toro 39419 28.257348 29 6057 4.341935 1395

In [107]:
#plt.figure(figsize=[15, 5])
#plt.xlabel('Longitude')
#plt.ylabel('Latitude')
#plt.scatter(traj['photoLon'], traj['photoLat'], marker='+')

2.1 Compute POI Info

Compute POI (Longitude, Latitude) as the average coordinates of the assigned photos.


In [108]:
poi_coords = traj[['poiID', 'photoLon', 'photoLat']].groupby('poiID').mean()
poi_coords.reset_index(inplace=True)
poi_coords.rename(columns={'photoLon':'poiLon', 'photoLat':'poiLat'}, inplace=True)

Extract POI category and visiting frequency.


In [109]:
poi_catfreq = traj[['poiID', 'poiTheme', 'poiFreq']].groupby('poiID').first()
poi_catfreq.reset_index(inplace=True)

In [110]:
poi_all = pd.merge(poi_catfreq, poi_coords, on='poiID')
poi_all.set_index('poiID', inplace=True)
#poi_all.to_csv(fpoi, index=True)

2.2 Construct Travelling Sequences


In [111]:
seq_all = traj[['userID', 'seqID', 'poiID', 'dateTaken']].copy().groupby(['userID', 'seqID', 'poiID'])\
          .agg([np.min, np.max, np.size])
seq_all.columns = seq_all.columns.droplevel()
seq_all.reset_index(inplace=True)
seq_all.rename(columns={'amin':'arrivalTime', 'amax':'departureTime', 'size':'#photo'}, inplace=True)
seq_all['poiDuration(sec)'] = seq_all['departureTime'] - seq_all['arrivalTime']
#seq_all.head()

In [112]:
seq_user = seq_all[['userID', 'seqID', 'poiID']].copy().groupby(['userID', 'seqID']).agg(np.size)
seq_user.reset_index(inplace=True)
seq_user.rename(columns={'size':'seqLen'}, inplace=True)
seq_user.set_index('seqID', inplace=True)
#seq_user.head()

2.3 Compute Some Sequence Statistics


In [113]:
seq_len = seq_all[['seqID', 'poiID']].copy().groupby('seqID').agg(np.size)
seq_len.reset_index(inplace=True)
seq_len.rename(columns={'poiID':'seqLen'}, inplace=True)
#seq_len.head()

In [114]:
seq_stats = seq_all[['seqID', '#photo', 'poiDuration(sec)']].copy().groupby('seqID').agg(np.sum)
seq_stats.reset_index(inplace=True)
#seq_stats.rename(columns={'poiDuration(sec)':'totalPoiDuration(sec)'}, inplace=True)
seq_stats = pd.merge(seq_len, seq_stats, on='seqID')
seq_stats['poiDuration(sec)'] /= 60
seq_stats.rename(columns={'poiDuration(sec)':'totalPoiDuration(min)'}, inplace=True)
seq_stats.set_index('seqID', inplace=True)
#seq_stats.head()

In [115]:
#ax = seq_stats['seqLen'].hist(bins=50)
#ax.set_xlabel('sequence length')
#ax.set_ylim([0.1, 1e4])
#ax.set_yscale('log')

In [116]:
#ax = seq_stats['#photo'].hist(bins=50)
#ax.set_xlabel('#photo for sequence')
#ax.set_ylim([0.1, 1e4])
#ax.set_yscale('log')

In [117]:
#ax = seq_stats['totalPoiDuration(min)'].hist(bins=100)
#ax.set_xlabel('totalPoiDuration(min)')
#ax.set_ylim([0.1, 1e4])
#ax.set_yscale('log')
#ax.set_xscale('log')

Sequences with length {3, 4, 5}


In [118]:
#seq_stats = seq_stats[seq_stats['seqLen'].isin({3, 4, 5})]
#ax = seq_stats['totalPoiDuration(min)'].hist(bins=50)
#ax.set_xlabel('totalPoiDuration(min)')
#ax.set_ylim([0.1, 1e4])
#ax.set_yscale('log')

In [119]:
def extract_seq(seqid, seq_all):
    seqi = seq_all[seq_all['seqID'] == seqid].copy()
    seqi.sort_values(by='arrivalTime', ascending=True, inplace=True)
    return seqi['poiID'].tolist()

3. Compute Transition Probabilities

3.1 Basic Definitions

  • $\text{Pr}(\text{POI}_i \to \text{POI}_j)$: the transition probability from $\text{POI}_i$ to $\text{POI }_j$
  • $\text{Pr}(\text{Cat}_i \to \text{Cat}_j)$: the transition probability from a POI of category $\text{Cat}_i$ to a POI of category $\text{Cat}_j$
  • $\text{Pr}(\text{Pop}_i \to \text{Pop}_j)$: the transition probability from a POI of Popularity class $\text{Pop}_i$ to a POI of Popularity class $\text{Pop}_j$
  • $\text{Pr}(\text{Dist}_i \to \text{Dist}_j)$: the transition probability from a POI-POI pair whose distance falls in class $\text{Dist}_i$ to a POI-POI pair whose distance falls in class $\text{Dist}_j$

By design, $\text{Pr}(\text{POI}_i \to \text{POI}_j)$ should become larger if any of $\text{Pr}(\text{Cat}_i \to \text{Cat}_j)$, $\text{Pr}(\text{Pop}_i \to \text{Pop}_j)$ and $\text{Pr}(\text{Dist}_i \to \text{Dist}_j)$ becomes larger (with the other factors held fixed).

So how should these probabilities be combined?
Both addition and multiplication seem able to serve this purpose, so what is the difference?

The Addition Case

For the addition case,

\begin{equation} \text{Pr}(\text{POI}_i \to \text{POI}_j) = \frac{ \text{Pr}(\text{Cat}_{\text{POI}_i} \to \text{Cat}_{\text{POI}_j}) + \text{Pr}(\text{Pop}_{\text{POI}_i} \to \text{Pop}_{\text{POI}_j}) + \text{Pr}(\text{Dist}_{\text{POI}_{i-1} \to \text{POI}_i} \to \text{Dist}_{\text{POI}_{i} \to \text{POI}_j}) } {Z_i} \end{equation}

where $\text{POI}_{i-1}$ is the direct predecessor of $\text{POI}_i$ in a trajectory and $Z_i$ is a normalizing constant.

The Multiplication Case

For the multiplication case,

\begin{equation} \text{Pr}(\text{POI}_i \to \text{POI}_j) = \frac{ \text{Pr}(\text{Cat}_{\text{POI}_i} \to \text{Cat}_{\text{POI}_j}) \times \text{Pr}(\text{Pop}_{\text{POI}_i} \to \text{Pop}_{\text{POI}_j}) \times \text{Pr}(\text{Dist}_{\text{POI}_{i-1} \to \text{POI}_i} \to \text{Dist}_{\text{POI}_{i} \to \text{POI}_j}) } {Z_i} \end{equation}

Similarly, $\text{POI}_{i-1}$ is the direct predecessor of $\text{POI}_i$ in a trajectory and $Z_i$ is again a normalizing constant.

The Difference between Addition and Multiplication

It is important to note that, by design, $\text{Pr}(\text{POI}_i \to \text{POI}_j)$ should be very small if any of $\text{Pr}(\text{Cat}_i \to \text{Cat}_j)$, $\text{Pr}(\text{Pop}_i \to \text{Pop}_j)$ and $\text{Pr}(\text{Dist}_i \to \text{Dist}_j)$ is very small. In the extreme case, if any of the three probabilities is $0$, then $\text{Pr}(\text{POI}_i \to \text{POI}_j)$ should be $0$, because the transition from POI$_i$ to POI$_j$ is impossible.

From the equation for the addition case, it is clear that the addition rule contradicts this fact, while the multiplication rule is consistent with it.

Intuitively, the addition rule can make an unlikely event much more likely; in particular, it can turn an impossible event into a possible one, as the toy example below shows.
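
A toy numeric illustration of this difference; the probabilities and the normalising constant $Z_i = 3$ are made up purely for this example.

```python
# Category and popularity transitions are plausible, but this particular
# distance-class transition has never been observed, so its probability is 0.
p_cat, p_pop, p_dist = 0.5, 0.4, 0.0

# Addition rule: the impossible transition still receives substantial mass.
add_score = (p_cat + p_pop + p_dist) / 3.0   # Z_i = 3, for illustration only
# Multiplication rule: a single zero factor forces the combined score to zero.
mul_score = p_cat * p_pop * p_dist

print(add_score, mul_score)  # 0.3 0.0
```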

3.2 Transition Probabilities between POI Categories

We model transition probabilities between POI categories, i.e. $\text{Pr}(\text{Cat}_{\text{POI}_i} \to \text{Cat}_{\text{POI}_j})$.

We first count the number of transitions, then normalise each row; if a row contains any zero cells, we smooth it by adding a small value (i.e. $0.1$ times the minimum nonzero value of that row) to every cell in that row before normalising.


In [120]:
def calc_poi_cat_transmat(seqid_set, poi_all, seq_all):
    poi_cats = poi_all['poiTheme'].unique().tolist()
    poi_cats.sort()
    poi_cat_transmat = pd.DataFrame(data=np.zeros((len(poi_cats), len(poi_cats)), dtype=np.float), \
                                    index=poi_cats, columns=poi_cats)
    for seqid in seqid_set:
        seq = extract_seq(seqid, seq_all)
        for j in range(len(seq)-1):
            poi1 = seq[j]
            poi2 = seq[j+1]
            cat1 = poi_all.loc[poi1, 'poiTheme']
            cat2 = poi_all.loc[poi2, 'poiTheme']
            poi_cat_transmat.loc[cat1, cat2] += 1
    return poi_cat_transmat

In [121]:
def normalise_transmat(transmat):
    assert(isinstance(transmat, pd.DataFrame))
    for row in range(transmat.index.shape[0]):
        nonzeroidx = np.nonzero(transmat.iloc[row])[0].tolist()
        if len(nonzeroidx) < transmat.columns.shape[0]:
            minv = np.min(transmat.iloc[row, nonzeroidx])
            EPS = 0.1 * minv  # row-specific smooth factor
            #zeroidx = list(set(range(len(transmat.columns))) - set(nonzeroidx))
            #transmat.iloc[row, zeroidx] = EPS
            transmat.iloc[row] += EPS
        rowsum = np.sum(transmat.iloc[row])
        assert(rowsum > 0)
        transmat.iloc[row] /= rowsum
    return transmat

In [122]:
poi_cat_transmat = calc_poi_cat_transmat(seq_all['seqID'].unique(), poi_all, seq_all)
poi_cat_transmat


Out[122]:
Amusement Beach Cultural Shopping Sport Structure
Amusement 14 20 51 17 35 16
Beach 18 43 61 92 19 106
Cultural 46 70 40 72 22 94
Shopping 16 101 58 16 19 71
Sport 42 23 21 13 8 21
Structure 26 91 81 59 24 24

In [123]:
poi_cat_transmat = normalise_transmat(poi_cat_transmat)
poi_cat_transmat


Out[123]:
Amusement Beach Cultural Shopping Sport Structure
Amusement 0.091503 0.130719 0.333333 0.111111 0.228758 0.104575
Beach 0.053097 0.126844 0.179941 0.271386 0.056047 0.312684
Cultural 0.133721 0.203488 0.116279 0.209302 0.063953 0.273256
Shopping 0.056940 0.359431 0.206406 0.056940 0.067616 0.252669
Sport 0.328125 0.179688 0.164062 0.101562 0.062500 0.164062
Structure 0.085246 0.298361 0.265574 0.193443 0.078689 0.078689

In [124]:
poi_cat_transmat_log = np.log10(poi_cat_transmat)
poi_cat_transmat_log


Out[124]:
Amusement Beach Cultural Shopping Sport Structure
Amusement -1.038563 -0.883661 -0.477121 -0.954243 -0.640623 -0.980571
Beach -1.274927 -0.896731 -0.744870 -0.566412 -1.251446 -0.504894
Cultural -0.873801 -0.691460 -0.934498 -0.679226 -1.194136 -0.563431
Shopping -1.244586 -0.444385 -0.685278 -1.244586 -1.169953 -0.597448
Sport -0.483961 -0.745482 -0.784991 -0.993267 -1.204120 -0.784991
Structure -1.069326 -0.525258 -0.575815 -0.713448 -1.104089 -1.104089

3.3 Transition Probabilities between POI Popularity Classes

We model transition probabilities between POI popularities, i.e. $\text{Pr}(\text{Pop}_{\text{POI}_i} \to \text{Pop}_{\text{POI}_j})$ after discretizing POI popularities.

3.3.1 Discretize POI Popularity

What are the general criteria for data discretization? See Discretization of continuous features from Wikipedia.

TODO: Improve the discretization using Fayyad's minimum description length principle (MDLP) described by this paper.

NOTE: MDLP is a supervised discretization method which requires some sort of class labels, and these are not available here. We need an unsupervised discretization method; well-known ones include equal width, equal frequency, clustering based, etc.
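
As a minimal sketch of the two unsupervised strategies compared below, pandas' pd.cut (equal width) and pd.qcut (equal frequency) can be contrasted on a small, skewed sample of POI popularities; the values are a handful of the Toronto POI visiting frequencies shown later.

```python
import pandas as pd

# A skewed sample of POI visiting frequencies (popularity).
pop = pd.Series([26, 104, 141, 278, 346, 609, 936, 1538, 2064, 3506, 4142])

equal_width = pd.cut(pop, bins=3)   # bins spanning equal value ranges
equal_freq = pd.qcut(pop, q=3)      # bins holding roughly equal numbers of POIs

# With skewed data, equal-width bins are dominated by one crowded bin,
# while quantile bins keep the popularity classes balanced.
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```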

Try different discretization strategies and choose the best one using the following metrics (a sketch of how they are computed follows):

actSeqRank(Top%) = $100 \times \frac{\text{rank of actual sequence}~S_a}{\text{number of sequences enumerated w.r.t.}~S_a}$

rankedTop5(%) = $100 \times \frac{\sum{\delta(\text{rank of actual sequence} \le 5)}} {\text{number of actual sequences}}$, where $\delta(\text{True}) = 1$ and $\delta(\text{False}) = 0$,
i.e. the percentage of actual sequences $S_a$ ranked in the top 5 among all sequences enumerated with respect to $S_a$.
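
A minimal sketch of how these two metrics are computed, using hypothetical (rank, number of enumerated sequences) pairs; the enumeration counts 27, 702 and 17550 correspond to lengths 3, 4 and 5 over 29 POIs with distinct start and end. The real numbers in the table below come from Section 4.

```python
import numpy as np

# Hypothetical per-sequence results: (rank of actual sequence S_a,
# number of sequences enumerated w.r.t. S_a).
results = [(1, 27), (4, 702), (12, 27), (150, 17550), (30, 702)]

act_seq_rank_top = [100.0 * rank / n_enum for rank, n_enum in results]
ranked_top5 = 100.0 * sum(rank <= 5 for rank, _ in results) / len(results)

# ddof=1 matches the sample standard deviation reported by pandas.
print('actSeqRank(Top%%): mean %.3f, std %.3f'
      % (np.mean(act_seq_rank_top), np.std(act_seq_rank_top, ddof=1)))
print('rankedTop5(%%): %.2f%%' % ranked_top5)
```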

| discretization strategy | actSeqRank(Top%) Toronto (smaller is better) | rankedTop5(%) Toronto (larger is better) | actSeqRank(Top%) Glasgow | rankedTop5(%) Glasgow |
|---|---|---|---|---|
| quantile, nbins=2 | mean: 18.496, std: 24.746 | 131/266 = 49.25% | mean: 16.328, std: 19.763 | 55/100 = 55.00% |
| quantile, nbins=3 | mean: 16.049, std: 19.873 | 131/266 = 49.25% | mean: 16.188, std: 18.117 | 52/100 = 52.00% |
| quantile, nbins=4 | mean: 16.478, std: 21.002 | 132/266 = 49.62% | mean: 15.130, std: 18.643 | 58/100 = 58.00% |
| quantile, nbins=5 | mean: 16.811, std: 22.159 | 127/266 = 47.74% | mean: 14.316, std: 17.294 | 61/100 = 61.00% |
| quantile, nbins=6 | mean: 15.049, std: 18.722 | 128/266 = 48.12% | mean: 14.581, std: 17.522 | 58/100 = 58.00% |
| quantile, nbins=7 | mean: 15.831, std: 20.140 | 130/266 = 48.87% | mean: 14.777, std: 17.175 | 59/100 = 59.00% |
| quantile, nbins=8 | mean: 15.567, std: 19.811 | 135/266 = 50.75% | mean: 15.019, std: 17.101 | 56/100 = 56.00% |
| quantile, nbins=9 | mean: 14.556, std: 17.937 | 138/266 = 51.88% | mean: 14.255, std: 16.184 | 64/100 = 64.00% |
| quantile, nbins=10 | mean: 14.398, std: 17.694 | 135/266 = 50.75% | mean: 13.771, std: 16.043 | 62/100 = 62.00% |
| quantile, nbins=11 | mean: 14.874, std: 18.915 | 138/266 = 51.88% | mean: 13.568, std: 15.522 | 63/100 = 63.00% |
| quantile, nbins=12 | mean: 13.491, std: 16.760 | 145/266 = 54.51% | mean: 11.464, std: 13.896 | 66/100 = 66.00% |
| quantile, nbins=13 | mean: 13.528, std: 16.964 | 144/266 = 54.14% | mean: 12.416, std: 13.081 | 63/100 = 63.00% |
| quantile, nbins=14 | mean: 13.279, std: 15.960 | 142/266 = 53.38% | mean: 12.596, std: 14.259 | 65/100 = 65.00% |
| quantile, nbins=15 | mean: 12.784, std: 15.796 | 151/266 = 56.77% | mean: 12.209, std: 13.527 | 66/100 = 66.00% |
| equalWidth, nbins=2 | mean: 24.761, std: 29.356 | 115/266 = 43.23% | mean: 23.705, std: 24.304 | 47/100 = 47.00% |
| equalWidth, nbins=3 | mean: 21.841, std: 28.366 | 127/266 = 47.74% | mean: 23.225, std: 25.065 | 47/100 = 47.00% |
| equalWidth, nbins=4 | mean: 23.636, std: 29.252 | 120/266 = 45.11% | mean: 19.416, std: 22.059 | 53/100 = 53.00% |
| equalWidth, nbins=5 | mean: 20.154, std: 26.713 | 125/266 = 46.99% | mean: 19.201, std: 23.250 | 54/100 = 54.00% |
| equalWidth, nbins=6 | mean: 19.129, std: 25.530 | 125/266 = 46.99% | mean: 19.907, std: 23.132 | 53/100 = 53.00% |
| equalWidth, nbins=7 | mean: 16.922, std: 20.619 | 126/266 = 47.37% | mean: 17.030, std: 19.898 | 57/100 = 57.00% |
| equalWidth, nbins=8 | mean: 18.240, std: 22.739 | 121/266 = 45.49% | mean: 19.977, std: 23.118 | 54/100 = 54.00% |
| equalWidth, nbins=9 | mean: 17.507, std: 22.659 | 129/266 = 48.50% | mean: 18.183, std: 22.388 | 55/100 = 55.00% |
| equalWidth, nbins=10 | mean: 18.196, std: 23.321 | 125/266 = 46.99% | mean: 18.158, std: 21.773 | 52/100 = 52.00% |
| equalWidth, nbins=11 | mean: 17.711, std: 23.398 | 131/266 = 49.25% | mean: 15.555, std: 19.722 | 60/100 = 60.00% |
| equalWidth, nbins=12 | mean: 17.389, std: 22.718 | 132/266 = 49.62% | mean: 15.174, std: 17.188 | 60/100 = 60.00% |
| equalWidth, nbins=13 | mean: 15.820, std: 20.856 | 141/266 = 53.01% | mean: 15.184, std: 18.837 | 61/100 = 61.00% |
| equalWidth, nbins=14 | mean: 15.681, std: 19.559 | 136/266 = 51.13% | mean: 12.922, std: 15.119 | 63/100 = 63.00% |
| equalWidth, nbins=15 | mean: 15.712, std: 19.794 | 136/266 = 51.13% | mean: 14.274, std: 16.828 | 61/100 = 61.00% |
| another, bins=[0, 500, 1500, 10000] | mean: 15.977, std: 20.373 | 132/266 = 49.62% | mean: 20.427, std: 21.596 | 50/100 = 50.00% |

In [125]:
rank_mean_Toro = [18.496, 16.049, 16.478, 16.811, 15.049, 15.831, 15.567, 14.556, 14.398, 14.874, 13.491, 13.528, \
                  13.279, 12.784, 24.761, 21.841, 23.636, 20.154, 19.129, 16.922, 18.240, 17.507, 18.196, 17.711, \
                  17.389, 15.820, 15.681, 15.712, 15.977]
rank_std_Toro = [24.746, 19.873, 21.002, 22.159, 18.722, 20.140, 19.811, 17.937, 17.694, 18.915, 16.760, 16.964, \
                 15.960, 15.796, 29.356, 28.366, 29.252, 26.713, 25.530, 20.619, 22.739, 22.659, 23.321, 23.398, \
                 22.718, 20.856, 19.559, 19.794, 20.373]
rank_mean_Glas = [16.328, 16.188, 15.130, 14.316, 14.581, 14.777, 15.019, 14.255, 13.771, 13.568, 11.464, 12.416, \
                  12.596, 12.209, 23.705, 23.225, 19.416, 19.201, 19.907, 17.030, 19.977, 18.183, 18.158, 15.555, \
                  15.174, 15.184, 12.922, 14.274, 20.427]
rank_std_Glas = [19.763, 18.117, 18.643, 17.294, 17.522, 17.175, 17.101, 16.184, 16.043, 15.522, 13.896, 13.081, \
                 14.259, 13.527, 24.304, 25.065, 22.059, 23.250, 23.132, 19.898, 23.118, 22.388, 21.773, 19.722, \
                 17.188, 18.837, 15.119, 16.828, 21.596]
rank_top5_Toro = [49.25, 49.25, 49.62, 47.74, 48.12, 48.87, 50.75, 51.88, 50.75, 51.88, 54.51, 54.14, \
                  53.38, 56.77, 43.23, 47.74, 45.11, 46.99, 46.99, 47.37, 45.49, 48.50, 46.99, 49.25, \
                  49.62, 53.01, 51.13, 51.13, 49.62]
rank_top5_Glas = [55, 52, 58, 61, 58, 59, 56, 64, 62, 63, 66, 63, 65, 66, 47, 47, 53, 54, 53, 57, 54, 55, 52, 60, 60, \
                  61, 63, 61, 50]
#xlabels = ['qbins='+str(x) for x in range(2, 16)]
#xlabels.extend(['ebins='+str(x) for x in range(2, 16)])
#xlabels.append('another')
xlabels = [x for x in range(2, len(rank_mean_Toro)+2)]

In [126]:
plt.figure(figsize=[15, 10])
plt.xlim([0, len(rank_mean_Toro)+2])
plt.ylim([-20, 100])
plt.errorbar(xlabels, rank_mean_Toro, rank_std_Toro, linestyle='--', marker='s', label='errorbar_Toronto')
plt.errorbar(xlabels, rank_mean_Glas, rank_std_Glas, linestyle='--', marker='s', label='errorbar_Glasgow')
plt.plot(xlabels, rank_top5_Toro, linestyle='--', marker='s', label='top5_Toronto')
plt.plot(xlabels, rank_top5_Glas, linestyle='--', marker='s', label='top5_Glasgow')
plt.legend()
#idx = 10
idx = 7
plt.annotate('choose', xy=(xlabels[idx], rank_top5_Glas[idx]), xytext=(xlabels[idx], rank_top5_Glas[idx]+15), \
             arrowprops=dict(facecolor='green', shrink=0.1))


Out[126]:
<matplotlib.text.Annotation at 0x7f289a5207b8>

It can be seen from the above plot that discretization based on equal frequency (quantiles) performs better than discretization based on equal width. To balance complexity and accuracy, we choose "quantile, nbins=9".


In [127]:
poi_all['poiFreq'].get_values()


Out[127]:
array([3506,  609,  688, 3056,  986, 2064, 1736,  278,  346, 4142,  481,
        964,  141,  113, 3553,  808,   26,  111,   89, 3594, 3619, 1874,
       1028, 1701,  104,  631,  936,  744, 1538])

In [128]:
poi_all['poiFreq'].describe()


Out[128]:
count      29.000000
mean     1360.896552
std      1289.191534
min        26.000000
25%       346.000000
50%       936.000000
75%      1874.000000
max      4142.000000
Name: poiFreq, dtype: float64

In [129]:
poi_all['poiFreq'].quantile([.25, .5, .75]).tolist()


Out[129]:
[346.0, 936.0, 1874.0]

In [130]:
ax = poi_all['poiFreq'].hist(bins=10)
ax.set_xlabel('POI Popularity')
ax.set_ylabel('#POI')


Out[130]:
<matplotlib.text.Text at 0x7f289a5209e8>

In [131]:
#plt.plot(np.ones(poi_all.index.shape[0]), np.sqrt(poi_all['poiFreq']), marker='+')

Quantile based bins (equal frequency)


In [132]:
nbins = 9
quantiles = np.round(np.linspace(0, 1, nbins+1), 2)[1:-1]
quantiles


Out[132]:
array([ 0.11,  0.22,  0.33,  0.44,  0.56,  0.67,  0.78,  0.89])

In [133]:
bins_qt = [0]
bins_qt.extend(poi_all['poiFreq'].quantile(quantiles))
bins_qt.append(poi_all['poiFreq'].max() + 1)
bins_qt


Out[133]:
[0,
 111.16,
 288.88,
 614.27999999999997,
 764.48000000000002,
 978.96000000000004,
 1661.8800000000003,
 2033.5999999999999,
 3549.2400000000002,
 4143]

Equal width bins


In [134]:
#nbins = 15
#inter = round((poi_all['poiFreq'].max() + 1) / nbins)
#bins_ew = [x*inter for x in range(nbins)]
#bins_ew.append(poi_all['poiFreq'].max() + 1)
#bins_ew

Another set of equal-frequency bins


In [135]:
#bins = np.linspace(0, 10000, 11)
#bins = np.logspace(0, 4, 5)
#bins = [1, 100, 500, 1000, 2000, 5000]
#bins_ef = [0, 500, 1500, 10000]

In [136]:
bins_pop = bins_qt
#bins_pop = bins_ew
#bins_pop = bins_ef

In [137]:
ax = poi_all['poiFreq'].hist(bins=bins_pop)
ax.set_xlabel('POI Popularity')
ax.set_ylabel('#POI')
ax.set_xscale('log')



In [138]:
poi_all['popClass'] = np.digitize(poi_all['poiFreq'].get_values(), bins_pop)
#poi_all

3.3.2 Compute Transition Probabilities


In [139]:
def calc_poi_pop_transmat(seqid_set, poi_all, seq_all):
    pop_class = poi_all['popClass'].unique().tolist()
    pop_class.sort()
    poi_pop_transmat = pd.DataFrame(data=np.zeros((len(pop_class), len(pop_class)), dtype=np.float), \
                                    index=pop_class, columns=pop_class)
    for seqid in seqid_set:
        seq = extract_seq(seqid, seq_all)
        for j in range(len(seq)-1):
            poi1 = seq[j]
            poi2 = seq[j+1]
            pc1 = poi_all.loc[poi1, 'popClass']
            pc2 = poi_all.loc[poi2, 'popClass']
            poi_pop_transmat.loc[pc1, pc2] += 1
    return poi_pop_transmat

In [140]:
poi_pop_transmat = calc_poi_pop_transmat(seq_all['seqID'].unique(), poi_all, seq_all)
poi_pop_transmat


Out[140]:
1 2 3 4 5 6 7 8 9
1 0 4 0 0 0 3 0 4 5
2 4 0 2 1 6 0 0 3 1
3 0 5 0 0 12 0 1 1 3
4 0 0 1 4 3 17 13 6 45
5 1 7 2 6 1 17 38 13 80
6 2 2 1 6 13 16 36 67 92
7 2 1 0 18 45 46 7 23 131
8 2 2 3 5 16 68 32 18 76
9 2 1 1 54 86 73 131 72 91

In [141]:
poi_pop_transmat = normalise_transmat(poi_pop_transmat)
poi_pop_transmat


Out[141]:
1 2 3 4 5 6 7 8 9
1 0.016043 0.229947 0.016043 0.016043 0.016043 0.176471 0.016043 0.229947 0.283422
2 0.229050 0.005587 0.117318 0.061453 0.340782 0.005587 0.005587 0.173184 0.061453
3 0.004367 0.222707 0.004367 0.004367 0.528384 0.004367 0.048035 0.048035 0.135371
4 0.001112 0.001112 0.012236 0.045606 0.034483 0.190211 0.145717 0.067853 0.501669
5 0.006061 0.042424 0.012121 0.036364 0.006061 0.103030 0.230303 0.078788 0.484848
6 0.008511 0.008511 0.004255 0.025532 0.055319 0.068085 0.153191 0.285106 0.391489
7 0.007667 0.004016 0.000365 0.066083 0.164659 0.168310 0.025922 0.084337 0.478642
8 0.009009 0.009009 0.013514 0.022523 0.072072 0.306306 0.144144 0.081081 0.342342
9 0.003914 0.001957 0.001957 0.105675 0.168297 0.142857 0.256360 0.140900 0.178082

In [142]:
poi_pop_transmat_log = np.log10(poi_pop_transmat)
poi_pop_transmat_log


Out[142]:
1 2 3 4 5 6 7 8 9
1 -1.794720 -0.638373 -1.794720 -1.794720 -1.794720 -0.753328 -1.794720 -0.638373 -0.547566
2 -0.640069 -2.252853 -0.930634 -1.211460 -0.467523 -2.252853 -2.252853 -0.761491 -1.211460
3 -2.359835 -0.652265 -2.359835 -2.359835 -0.277050 -2.359835 -1.318443 -1.318443 -0.868474
4 -2.953760 -2.953760 -1.912367 -1.340976 -1.462398 -0.720764 -0.836488 -1.168430 -0.299583
5 -2.217484 -1.372386 -1.916454 -1.439333 -2.217484 -0.987035 -0.637700 -1.103541 -0.314394
6 -2.070038 -2.070038 -2.371068 -1.592917 -1.257125 -1.166948 -0.814765 -0.544993 -0.407280
7 -2.115373 -2.396199 -3.437592 -1.179913 -0.783415 -0.773891 -1.586334 -1.073980 -0.319989
8 -2.045323 -2.045323 -1.869232 -1.647383 -1.142233 -0.513844 -0.841203 -1.091080 -0.465539
9 -2.407391 -2.708421 -2.708421 -0.976027 -0.773922 -0.845098 -0.591150 -0.851088 -0.749380

3.4 Transition Probabilities between POI Pair Distance Classes

We model transition probabilities between different POI pair distances, i.e. $\text{Pr}(\text{Dist}_{\text{POI}_{i-1} \to \text{POI}_i} \to \text{Dist}_{\text{POI}_{i} \to \text{POI}_j})$, after discretizing POI pair distances.

TODO: Improve the distance calculation using Google maps distance API with different travel modes demonstrated here.

3.4.1 Compute POI Pair Distance

Compute the POI-pair distance for each pair observed in the dataset.


In [143]:
def calc_dist(longitude1, latitude1, longitude2, latitude2):
    """Calculate the distance (unit: km) between two places on earth"""
    # convert degrees to radians
    lon1 = math.radians(longitude1)
    lat1 = math.radians(latitude1)
    lon2 = math.radians(longitude2)
    lat2 = math.radians(latitude2)
    radius = 6371.009 # mean earth radius is 6371.009km, en.wikipedia.org/wiki/Earth_radius#Mean_radius
    # The haversine formula, en.wikipedia.org/wiki/Great-circle_distance
    dlon = math.fabs(lon1 - lon2)
    dlat = math.fabs(lat1 - lat2)
    return 2 * radius * math.asin( math.sqrt( \
               (math.sin(0.5*dlat))**2 + math.cos(lat1) * math.cos(lat2) * (math.sin(0.5*dlon))**2 ))

In [144]:
def calc_obs_poipair_distmat(seqid_set, poi_all, seq_all):
    poi_distmat = pd.DataFrame(data=np.full((poi_all.shape[0], poi_all.shape[0]), np.nan, dtype=np.float), \
                               index=poi_all.index, columns=poi_all.index)
    for seqid in seqid_set:
        seq = extract_seq(seqid, seq_all)
        if len(seq) < 2: continue
        for j in range(len(seq)-1):
            poi1 = seq[j]
            poi2 = seq[j+1]
            if np.isnan(poi_distmat.loc[poi1, poi2]):
                dist = calc_dist(poi_all.loc[poi1, 'poiLon'], poi_all.loc[poi1, 'poiLat'], \
                                 poi_all.loc[poi2, 'poiLon'], poi_all.loc[poi2, 'poiLat'])
                poi_distmat.loc[poi1, poi2] = dist
                poi_distmat.loc[poi2, poi1] = dist
    return poi_distmat

In [145]:
poi_distmat = calc_obs_poipair_distmat(seq_all['seqID'].unique(), poi_all, seq_all)
#poi_distmat

3.4.2 Discretize POI Pair Distance

We use the same metrics as described above to choose a discretization strategy for POI pair distances.

| discretization strategy | actSeqRank(Top%) Toronto (smaller is better) | rankedTop5(%) Toronto (larger is better) | actSeqRank(Top%) Glasgow | rankedTop5(%) Glasgow |
|---|---|---|---|---|
| quantile, nbins=2 | mean: 15.156, std: 16.911 | 120/266 = 45.11% | mean: 14.354, std: 16.541 | 60/100 = 60.00% |
| quantile, nbins=3 | mean: 14.917, std: 17.484 | 130/266 = 48.87% | mean: 14.450, std: 16.173 | 60/100 = 60.00% |
| quantile, nbins=4 | mean: 14.389, std: 17.527 | 139/266 = 52.26% | mean: 14.255, std: 16.184 | 64/100 = 64.00% |
| quantile, nbins=5 | mean: 13.645, std: 16.550 | 148/266 = 55.64% | mean: 14.085, std: 16.159 | 63/100 = 63.00% |
| quantile, nbins=6 | mean: 14.299, std: 17.550 | 138/266 = 51.88% | mean: 13.156, std: 14.436 | 63/100 = 63.00% |
| quantile, nbins=7 | mean: 12.689, std: 14.674 | 140/266 = 52.63% | mean: 12.755, std: 13.681 | 62/100 = 62.00% |
| quantile, nbins=8 | mean: 12.996, std: 15.606 | 143/266 = 53.76% | mean: 11.716, std: 13.003 | 64/100 = 64.00% |
| quantile, nbins=9 | mean: 12.510, std: 14.946 | 140/266 = 52.63% | mean: 11.355, std: 12.893 | 66/100 = 66.00% |
| quantile, nbins=10 | mean: 12.467, std: 14.549 | 146/266 = 54.89% | mean: 11.181, std: 12.107 | 68/100 = 68.00% |
| quantile, nbins=11 | mean: 12.548, std: 14.758 | 143/266 = 53.76% | mean: 10.214, std: 10.872 | 68/100 = 68.00% |
| quantile, nbins=12 | mean: 11.980, std: 13.883 | 147/266 = 55.26% | mean: 10.041, std: 10.895 | 73/100 = 73.00% |
| quantile, nbins=13 | mean: 12.170, std: 13.983 | 140/266 = 52.63% | mean: 9.345, std: 9.759 | 73/100 = 73.00% |
| quantile, nbins=14 | mean: 11.384, std: 12.787 | 153/266 = 57.52% | mean: 9.008, std: 9.444 | 73/100 = 73.00% |
| quantile, nbins=15 | mean: 11.444, std: 12.888 | 151/266 = 56.77% | mean: 8.613, std: 9.052 | 75/100 = 75.00% |
| quantile, nbins=16 | mean: 10.932, std: 12.621 | 156/266 = 58.65% | mean: 8.553, std: 9.336 | 74/100 = 74.00% |
| quantile, nbins=17 | mean: 10.991, std: 12.950 | 156/266 = 58.65% | mean: 8.025, std: 8.502 | 79/100 = 79.00% |
| quantile, nbins=18 | mean: 10.728, std: 12.285 | 155/266 = 58.27% | EMPTY ROW 13 | NaN |
| quantile, nbins=19 | mean: 10.754, std: 12.368 | 150/266 = 56.39% | EMPTY ROW 11 | NaN |
| quantile, nbins=20 | mean: 10.836, std: 12.383 | 152/266 = 57.14% | mean: 7.922, std: 8.702 | 79/100 = 79.00% |
| quantile, nbins=21 | mean: 10.579, std: 12.301 | 162/266 = 60.90% | EMPTY ROW 15 | NaN |
| equalWidth, nbins=2 | EMPTY LAST BIN | NaN | EMPTY LAST BIN | NaN |
| equalWidth, nbins=3 | EMPTY LAST BIN | NaN | EMPTY LAST TWO BINS | NaN |
| equalWidth, nbins=4 | EMPTY LAST TWO BINS | NaN | EMPTY LAST TWO BINS | NaN |
| another, bins=[0, 2, 5, 100] | mean: 15.110, std: 16.204 | 117/266 = 43.98% | mean: 14.937, std: 16.630 | 60/100 = 60.00% |

In [146]:
rank_mean_Toro = [15.156, 14.917, 14.389, 13.645, 14.299, 12.689, 12.996, 12.510, 12.467, 12.548, 11.980, \
                  12.170, 11.384, 11.444, 10.932, 10.991, 10.836, 15.110]
rank_std_Toro = [16.911, 17.484, 17.527, 16.550, 17.550, 14.674, 15.606, 14.946, 14.549, 14.758, 13.883, \
                 13.983, 12.787, 12.888, 12.621, 12.950, 12.383, 16.204]
rank_mean_Glas = [14.354, 14.450, 14.255, 14.085, 13.156, 12.755, 11.716, 11.355, 11.181, 10.214, 10.041, \
                  9.345, 9.008, 8.613, 8.553, 8.025, 7.922, 14.937]
rank_std_Glas = [16.541, 16.173, 16.184, 16.159, 14.436, 13.681, 13.003, 12.893, 12.107, 10.872, 10.895, \
                 9.759, 9.444, 9.052, 9.336, 8.502, 8.702, 16.630]
rank_top5_Toro = [45.11, 48.87, 52.26, 55.64, 51.88, 52.63, 53.76, 52.63, 54.89, 53.76, 55.26, 52.63, 57.52, \
                  56.77, 58.65, 58.65, 57.14, 43.98]
rank_top5_Glas = [60, 60, 64, 63, 63, 62, 64, 66, 68, 68, 73, 73, 73, 75, 74, 79, 79, 60]
xlabels = [x for x in range(2, len(rank_mean_Toro)+2)]

In [147]:
plt.figure(figsize=[15, 10])
plt.xlim([0, len(rank_mean_Toro)+2])
plt.ylim([-20, 100])
plt.errorbar(xlabels, rank_mean_Toro, rank_std_Toro, linestyle='--', marker='s', label='errorbar_Toronto')
plt.errorbar(xlabels, rank_mean_Glas, rank_std_Glas, linestyle='--', marker='s', label='errorbar_Glasgow')
plt.plot(xlabels, rank_top5_Toro, linestyle='--', marker='s', label='top5_Toronto')
plt.plot(xlabels, rank_top5_Glas, linestyle='--', marker='s', label='top5_Glasgow')
plt.legend()
#idx = 10
idx = 8
plt.annotate('choose', xy=(xlabels[idx], rank_top5_Glas[idx]), xytext=(xlabels[idx], rank_top5_Glas[idx]+15), \
             arrowprops=dict(facecolor='green', shrink=0.1))


Out[147]:
<matplotlib.text.Annotation at 0x7f28bfa15860>

After removing the rows that contain NaN and plotting the curves, we choose "quantile, nbins=10" to balance complexity and accuracy.


In [148]:
#distdata = pd.Series([x for x in np.unique(poi_distmat.get_values().flatten()) if not np.isnan(x)])
distdata = pd.Series([poi_distmat.iloc[x, y] \
                      for x in range(poi_distmat.index.shape[0]) \
                      for y in range(x+1, poi_distmat.index.shape[0]) \
                      if not np.isnan(poi_distmat.iloc[x, y])])

In [149]:
distdata.describe()


Out[149]:
count    179.000000
mean       2.069742
std        1.659092
min        0.147439
25%        0.975327
50%        1.710602
75%        2.655934
max        9.857157
dtype: float64

In [150]:
ax = distdata.hist(bins=20)
ax.set_xlabel('POI-Pair Distance (km)')
ax.set_ylabel('#POI-Pair')


Out[150]:
<matplotlib.text.Text at 0x7f28bf9b94e0>

Quantile based bins (equal frequency)


In [151]:
nbins = 10
quantiles = np.round(np.linspace(0, 1, nbins+1), 2)[1:-1]
quantiles


Out[151]:
array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])

In [152]:
bins_qt = [0]
bins_qt.extend(distdata.quantile(quantiles))
bins_qt.append(10*round(distdata.max()))
bins_qt


Out[152]:
[0,
 0.49867513032727601,
 0.83513546479600187,
 1.1161185353028702,
 1.3750370714330364,
 1.7106015051827066,
 2.0006314262495546,
 2.4371396900920255,
 2.8751184490006279,
 3.7563375218812638,
 100.0]

Equal width bins


In [153]:
#nbins = 4
#inter = round((round(distdata.max()) + 1) / nbins)
#maxdist = 30 # Toronto, maximum distance among all POI pairs
#maxdist = 46 # Glasgow
#inter = round(maxdist / nbins) 
#bins_ew = [x*inter for x in range(nbins)]
#bins_ew.append(maxdist)
#bins_ew

Another set of equal-frequency bins


In [154]:
#bins = np.linspace(0, 10, 7)
#bins = np.logspace(0, 2, 4)
#bins = [0, 1, 2, 3, 10]
#bins_a = [0, 2, 5, 100] # walk, ride, drive

In [155]:
#bins_ef = [0, 1.15, 2.25, 100]

In [156]:
bins_dist = bins_qt
#bins_dist = bins_ew
#bins_dist = bins_ef
#bins_dist = bins_a

In [157]:
ax = distdata.ix[np.nonzero(distdata)].hist(bins=bins_dist)
ax.set_xlabel('POI-Pair Distance (km)')
ax.set_ylabel('#POI-Pair')
ax.set_xscale('log')



In [158]:
poi_distclass_mat = pd.DataFrame(data=np.zeros((poi_all.shape[0], poi_all.shape[0]), dtype=np.int), \
                                 index=poi_all.index, columns=poi_all.index)

In [159]:
for i in range(poi_all.index.shape[0]):
    poi1 = poi_all.index[i]
    for j in range(i+1, poi_all.index.shape[0]):
        poi2 = poi_all.index[j]
        dc = None
        if np.isnan(poi_distmat.loc[poi1, poi2]):
            dist = calc_dist(poi_all.loc[poi1, 'poiLon'], poi_all.loc[poi1, 'poiLat'], \
                             poi_all.loc[poi2, 'poiLon'], poi_all.loc[poi2, 'poiLat'])
            dc = np.digitize([dist], bins_dist)[0]
        else:
            dc = np.digitize([poi_distmat.loc[poi1, poi2]], bins_dist)[0]
        assert(dc is not None)
        poi_distclass_mat.loc[poi1, poi2] = dc
        poi_distclass_mat.loc[poi2, poi1] = dc

In [160]:
poi_distclass_mat


Out[160]:
poiID 1 2 3 4 6 7 8 9 10 11 ... 21 22 23 24 25 26 27 28 29 30
poiID
1 0 9 7 3 5 1 2 10 9 9 ... 5 3 4 2 7 10 9 3 7 1
2 9 0 10 8 9 9 9 10 10 10 ... 10 9 10 9 8 10 10 9 10 9
3 7 10 0 8 4 5 7 10 5 4 ... 2 4 3 8 6 10 4 3 3 6
4 3 8 8 0 5 4 1 10 9 9 ... 6 4 5 2 6 10 9 4 7 3
6 5 9 4 5 0 5 5 10 9 5 ... 3 2 3 6 2 10 6 3 3 4
7 1 9 5 4 5 0 2 10 8 8 ... 3 2 2 3 7 10 8 2 7 1
8 2 9 7 1 5 2 0 10 9 8 ... 5 4 4 1 6 10 9 4 7 2
9 10 10 10 10 10 10 10 0 10 10 ... 10 10 10 10 10 10 10 10 10 10
10 9 10 5 9 9 8 9 10 0 8 ... 6 7 7 9 9 10 7 7 8 8
11 9 10 4 9 5 8 8 10 8 0 ... 5 6 6 9 5 10 1 6 2 8
12 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
13 10 10 9 10 9 10 10 10 10 6 ... 9 9 9 10 8 10 6 9 7 10
14 9 1 10 7 9 9 8 10 10 10 ... 10 9 9 8 8 10 10 9 10 9
15 8 10 10 8 10 9 8 10 10 10 ... 10 9 10 7 10 10 10 9 10 9
16 2 8 7 1 4 3 1 10 9 8 ... 5 3 4 2 6 10 9 4 7 2
17 9 1 10 7 9 9 8 10 10 10 ... 10 9 9 8 8 10 10 9 10 9
18 10 8 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
19 7 9 10 7 9 8 7 10 10 10 ... 10 9 9 6 10 10 10 9 10 8
20 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
21 5 10 2 6 3 3 5 10 6 5 ... 0 1 1 6 5 10 6 1 3 4
22 3 9 4 4 2 2 4 10 7 6 ... 1 0 1 5 5 10 7 1 4 2
23 4 10 3 5 3 2 4 10 7 6 ... 1 1 0 5 6 10 7 1 4 3
24 2 9 8 2 6 3 1 10 9 9 ... 6 5 5 0 7 10 9 5 8 2
25 7 8 6 6 2 7 6 10 9 5 ... 5 5 6 7 0 10 6 5 4 6
26 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 0 10 10 10 10
27 9 10 4 9 6 8 9 10 7 1 ... 6 7 7 9 6 10 0 7 3 8
28 3 9 3 4 3 2 4 10 7 6 ... 1 1 1 5 5 10 7 0 4 2
29 7 10 3 7 3 7 7 10 8 2 ... 3 4 4 8 4 10 3 4 0 7
30 1 9 6 3 4 1 2 10 8 8 ... 4 2 3 2 6 10 8 2 7 0

29 rows × 29 columns

3.4.3 Compute Transition Probabilities

Use the POI pairs observed in the dataset to compute the transition matrix between different classes of distance.


In [161]:
def calc_poipair_dist_transmat(seqid_set, poi_all, seq_all, poi_distclass_mat, bins_dist):
    dist_class = list(range(1, len(bins_dist)))
    poipair_dist_transmat = pd.DataFrame(data=np.zeros((len(dist_class), len(dist_class)), dtype=np.float), \
                                         index=dist_class, columns=dist_class)
    for seqid in seqid_set:
        seq = extract_seq(seqid, seq_all)
        if len(seq) < 3: continue
        for j in range(1, len(seq)-1):
            poi1 = seq[j-1]
            poi2 = seq[j]
            poi3 = seq[j+1]
            dc1 = poi_distclass_mat.loc[poi1, poi2]
            dc2 = poi_distclass_mat.loc[poi2, poi3]
            poipair_dist_transmat.loc[dc1, dc2] += 1
    return poipair_dist_transmat

In [162]:
poipair_dist_transmat = calc_poipair_dist_transmat(seq_all['seqID'].unique(), poi_all, seq_all, \
                                                   poi_distclass_mat, bins_dist)
poipair_dist_transmat


Out[162]:
1 2 3 4 5 6 7 8 9 10
1 121 54 36 30 9 5 3 3 0 2
2 65 25 11 7 5 3 4 3 0 1
3 25 20 4 4 2 2 3 2 1 0
4 20 8 3 2 0 0 6 2 0 0
5 6 1 3 2 3 2 0 1 1 1
6 5 3 1 1 2 1 0 2 0 1
7 4 2 0 1 1 2 1 0 1 1
8 3 2 0 0 2 3 0 1 3 0
9 1 2 0 0 1 1 0 0 2 2
10 5 3 0 0 0 0 1 0 1 0

In [163]:
poipair_dist_transmat = normalise_transmat(poipair_dist_transmat)
poipair_dist_transmat


Out[163]:
1 2 3 4 5 6 7 8 9 10
1 0.457358 0.204528 0.136604 0.113962 0.034717 0.019623 0.012075 0.012075 0.000755 0.008302
2 0.520800 0.200800 0.088800 0.056800 0.040800 0.024800 0.032800 0.024800 0.000800 0.008800
3 0.392188 0.314063 0.064062 0.064062 0.032813 0.032813 0.048438 0.032813 0.017188 0.001563
4 0.469767 0.190698 0.074419 0.051163 0.004651 0.004651 0.144186 0.051163 0.004651 0.004651
5 0.290476 0.052381 0.147619 0.100000 0.147619 0.100000 0.004762 0.052381 0.052381 0.052381
6 0.300000 0.182353 0.064706 0.064706 0.123529 0.064706 0.005882 0.123529 0.005882 0.064706
7 0.292857 0.150000 0.007143 0.078571 0.078571 0.150000 0.078571 0.007143 0.078571 0.078571
8 0.206667 0.140000 0.006667 0.006667 0.140000 0.206667 0.006667 0.073333 0.206667 0.006667
9 0.110000 0.210000 0.010000 0.010000 0.110000 0.110000 0.010000 0.010000 0.210000 0.210000
10 0.463636 0.281818 0.009091 0.009091 0.009091 0.009091 0.100000 0.009091 0.100000 0.009091

In [164]:
poipair_dist_transmat_log = np.log10(poipair_dist_transmat)
poipair_dist_transmat_log


Out[164]:
1 2 3 4 5 6 7 8 9 10
1 -0.339743 -0.689247 -0.864537 -0.943239 -1.459458 -1.707243 -1.918096 -1.918096 -3.122216 -2.080823
2 -0.283329 -0.697236 -1.051587 -1.245652 -1.389340 -1.605548 -1.484126 -1.605548 -3.096910 -2.055517
3 -0.406506 -0.502984 -1.193396 -1.193396 -1.483961 -1.483961 -1.314818 -1.483961 -1.764787 -2.806180
4 -0.328117 -0.719655 -1.128318 -1.291046 -2.332438 -2.332438 -0.841077 -1.291046 -2.332438 -2.332438
5 -0.536889 -1.280827 -0.830858 -1.000000 -0.830858 -1.000000 -2.322219 -1.280827 -1.280827 -1.280827
6 -0.522879 -0.739087 -1.189056 -1.189056 -0.908230 -1.189056 -2.230449 -0.908230 -2.230449 -1.189056
7 -0.533344 -0.823909 -2.146128 -1.104735 -1.104735 -0.823909 -1.104735 -2.146128 -1.104735 -1.104735
8 -0.684730 -0.853872 -2.176091 -2.176091 -0.853872 -0.684730 -2.176091 -1.134699 -0.684730 -2.176091
9 -0.958607 -0.677781 -2.000000 -2.000000 -0.958607 -0.958607 -2.000000 -2.000000 -0.677781 -0.677781
10 -0.333823 -0.550031 -2.041393 -2.041393 -2.041393 -2.041393 -1.000000 -2.041393 -1.000000 -2.041393

4. Compute Trajectory Likelihood (For tuning the discretization strategy)

Log likelihood of trajectory $[\text{POI}_1, \text{POI}_2, \dots, \text{POI}_i, ..., \text{POI}_N]$ is defined as \begin{align} \text{logl} =& \sum_{i=1}^{N-1} \log(\text{Pr}(\text{Cat}_{\text{POI}_i} \to \text{Cat}_{\text{POI}_{i+1}})) + \sum_{i=1}^{N-1} \log(\text{Pr}(\text{Pop}_{\text{POI}_i} \to \text{Pop}_{\text{POI}_{i+1}})) + \sum_{i=2}^{N-1} \log(\text{Pr}(\text{Dist}_{\text{POI}_{i-1} \to \text{POI}_i} \to \text{Dist}_{\text{POI}_{i} \to \text{POI}_{i+1}})) \\ & + \log(\text{Pr}(\text{POI}_1)) \end{align}

where $\text{Pr}(\text{POI}_1)$ is the prior of $\text{POI}_1$ and we assume $\text{Pr}(\text{POI}_1)=1.0$; the base-10 logarithm is used here.


In [165]:
def calc_seq_loglikelihood(seq, poi_all, poi_cat_transmat_log, poi_pop_transmat_log, \
                           poi_distclass_mat, poipair_dist_transmat_log):
    assert(len(seq) > 1)
    cat1 = poi_all.loc[seq[0], 'poiTheme']
    cat2 = poi_all.loc[seq[1], 'poiTheme']
    pc1  = poi_all.loc[seq[0], 'popClass']
    pc2  = poi_all.loc[seq[1], 'popClass']
    logL = poi_cat_transmat_log.loc[cat1, cat2] + poi_pop_transmat_log.loc[pc1, pc2]
    for j in range(1, len(seq)-1):
        poi1 = seq[j-1]
        poi2 = seq[j]
        poi3 = seq[j+1]
        cat2 = poi_all.loc[poi2, 'poiTheme']
        cat3 = poi_all.loc[poi3, 'poiTheme']
        pc2  = poi_all.loc[poi2, 'popClass']
        pc3  = poi_all.loc[poi3, 'popClass']
        dc12 = poi_distclass_mat.loc[poi1, poi2]
        dc23 = poi_distclass_mat.loc[poi2, poi3]
        logL += poi_cat_transmat_log.loc[cat2, cat3] + poi_pop_transmat_log.loc[pc2, pc3]
        #print(seq, dc12, dc23)
        logL += poipair_dist_transmat_log.loc[dc12, dc23]
    return logL

Simple check.


In [166]:
seq1 = [10, 21, 28, 22]
d12 = calc_dist(poi_all.loc[10,'poiLon'], poi_all.loc[10,'poiLat'], poi_all.loc[21,'poiLon'], poi_all.loc[21, 'poiLat'])
d23 = calc_dist(poi_all.loc[21,'poiLon'], poi_all.loc[21,'poiLat'], poi_all.loc[28,'poiLon'], poi_all.loc[28, 'poiLat'])
d34 = calc_dist(poi_all.loc[28,'poiLon'], poi_all.loc[28,'poiLat'], poi_all.loc[22,'poiLon'], poi_all.loc[22, 'poiLat'])

In [167]:
print(d12, d23, d34)
print(bins_dist)


1.965215721038275 0.45890229014413647 0.20523940866045706
[0, 0.49867513032727601, 0.83513546479600187, 1.1161185353028702, 1.3750370714330364, 1.7106015051827066, 2.0006314262495546, 2.4371396900920255, 2.8751184490006279, 3.7563375218812638, 100.0]

In [168]:
s1 = poi_cat_transmat_log.loc[poi_all.loc[10, 'poiTheme'], poi_all.loc[21, 'poiTheme']] + \
     poi_cat_transmat_log.loc[poi_all.loc[21, 'poiTheme'], poi_all.loc[28, 'poiTheme']] + \
     poi_cat_transmat_log.loc[poi_all.loc[28, 'poiTheme'], poi_all.loc[22, 'poiTheme']] + \
     poi_pop_transmat_log.loc[poi_all.loc[10, 'popClass'], poi_all.loc[21, 'popClass']] + \
     poi_pop_transmat_log.loc[poi_all.loc[21, 'popClass'], poi_all.loc[28, 'popClass']] + \
     poi_pop_transmat_log.loc[poi_all.loc[28, 'popClass'], poi_all.loc[22, 'popClass']]
s2 = poipair_dist_transmat_log.loc[np.digitize([d12], bins_dist)[0], np.digitize([d23], bins_dist)[0]] + \
     poipair_dist_transmat_log.loc[np.digitize([d23], bins_dist)[0], np.digitize([d34], bins_dist)[0]]
print(s1+s2)


-4.54102487653

In [169]:
calc_seq_loglikelihood([10, 21, 28, 22], poi_all, poi_cat_transmat_log, poi_pop_transmat_log, \
                       poi_distclass_mat, poipair_dist_transmat_log)


Out[169]:
-4.5410248765268788

In [170]:
def parse_seqstr(seqstr):
    term = re.sub('[ \[\]]', '', seqstr).split(',')
    return [int(x) for x in term]

4.1 Log Likelihood of Actual Trajectories

To save computation, we consider unique travelling sequences only (i.e. sequences from different users that visit the same list of POIs in the same order are treated as one sequence), since no user-specific features are used here.


In [171]:
unique_seq = dict() # seq -> [(seqid, userid)]

In [172]:
for seqid in sorted(seq_all['seqID'].unique().tolist()):
    seq = extract_seq(seqid, seq_all)
    if str(seq) not in unique_seq:
        unique_seq[str(seq)] = [(seqid, seq_user.loc[seqid])]
    else:
        unique_seq[str(seq)].append((seqid, seq_user.loc[seqid]))

In [173]:
unique_seq345 = [parse_seqstr(x) for x in sorted(unique_seq.keys()) if len(x.split(',')) in {3,4,5}]

In [174]:
unique_seq345_logL = pd.DataFrame(data=np.zeros((len(unique_seq345), 2), dtype=np.float), \
                                  index=[str(x) for x in unique_seq345], columns=['logLikelihood', 'seqLen'])
unique_seq345_logL.index.name = 'actSeq'

In [175]:
for seq in unique_seq345:
    assert(len(seq) in {3,4,5})
    logL = calc_seq_loglikelihood(seq, poi_all, poi_cat_transmat_log, poi_pop_transmat_log, \
                                  poi_distclass_mat, poipair_dist_transmat_log)
    unique_seq345_logL.loc[str(seq), 'logLikelihood'] = logL
    unique_seq345_logL.loc[str(seq), 'seqLen'] = len(seq)
    #print('Sequence %-20s Log likelihood: %.3f' % (str(seq), logL))

In [176]:
print(unique_seq345_logL.index.shape[0])
unique_seq345_logL.head()


266
Out[176]:
logLikelihood seqLen
actSeq
[1, 30, 4] -3.812454 3
[1, 30, 7] -2.759386 3
[1, 7, 28] -4.270981 3
[1, 8, 24] -3.362640 3
[10, 21, 28, 22] -4.541025 4

In [177]:
unique_seq345_logL['seqLen'].hist(bins=10)


Out[177]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f289881fb00>

4.2 Log Likelihood of Enumerated Trajectories

Compute log likelihood of enumerated trajectories for all unique actual sequences of length {3, 4, 5}.


In [178]:
poi_list = poi_all.index.tolist()
#poi_list

Enumerate trajectories with the same (start, end) POIs and the same length (3, 4 or 5) as an actual sequence.


In [179]:
def enum_seq345(seq, poi_list): 
    assert(len(seq) in {3, 4, 5})
    p0 = seq[0]
    pN = seq[-1]
    
    # enumerate sequences with length 3
    if len(seq) == 3:
        return [[p0, p, pN] \
                for p in poi_list if p not in {p0, pN}]
    
    # enumerate sequences with length 4
    if len(seq) == 4:
        return [[p0, p1, p2, pN] \
                for p1 in poi_list if p1 not in {p0, pN} \
                for p2 in poi_list if p2 not in {p0, p1, pN}]
    
    # enumerate sequences with length 5
    if len(seq) == 5:
        return [[p0, p1, p2, p3, pN] \
                for p1 in poi_list if p1 not in {p0, pN} \
                for p2 in poi_list if p2 not in {p0, p1, pN} \
                for p3 in poi_list if p3 not in {p0, p1, p2, pN}]

Compute the log likelihood of enumerated trajectories.


In [180]:
enum_logL_df = pd.DataFrame()

In [181]:
for seq in unique_seq345:
    enum_seqs = enum_seq345(seq, poi_list)
    df = pd.DataFrame(data=sorted([str(x) for x in enum_seqs]), columns=['enumSeq'])
    df.set_index('enumSeq', inplace=True)
    df['actSeq'] = str(seq)
    enum_logL_df = enum_logL_df.append(df)

In [182]:
print(enum_logL_df.shape)
enum_logL_df.head()


(622566, 1)
Out[182]:
actSeq
enumSeq
[1, 10, 4] [1, 30, 4]
[1, 11, 4] [1, 30, 4]
[1, 12, 4] [1, 30, 4]
[1, 13, 4] [1, 30, 4]
[1, 14, 4] [1, 30, 4]

In [183]:
t1 = datetime.now()
logL = Parallel(n_jobs=-2)(delayed(calc_seq_loglikelihood)\
       (seq, poi_all, poi_cat_transmat_log, poi_pop_transmat_log, poi_distclass_mat, poipair_dist_transmat_log)\
       for seq in [parse_seqstr(x) for x in enum_logL_df.index])
print('%d seconds used' % (datetime.now()-t1).total_seconds()) # 930 seconds


930 seconds used

In [184]:
enum_logL_df['enumSeqLogLikelihood'] = logL

In [185]:
#enum_logL_df.head(23)

4.3 Compare the Log Likelihood of Actual and Enumerated Trajectories

For each actual sequence $S_a$, compare its log likelihood with the highest log likelihood among the sequences enumerated with respect to $S_a$, and compute the rank of $S_a$ by log likelihood.


In [186]:
df = pd.DataFrame(data=sorted([str(x) for x in unique_seq345]), columns=['actSeq'])
df.set_index('actSeq', inplace=True)
df['actSeqLogLikelihood'] = unique_seq345_logL.loc[df.index, 'logLikelihood']
df['enumSeq'] = ''
df['enumSeqLogLikelihood'] = 0
df['actSeqRank'] = 0
df['#enumSeq'] = 0
for seqstr in df.index:
    sub_df = enum_logL_df[enum_logL_df['actSeq'] == seqstr].copy()
    sub_df.reset_index(inplace=True)
    sub_df.sort_values(by='enumSeqLogLikelihood', ascending=False, inplace=True)
    df.loc[seqstr, 'enumSeq'] = sub_df.iloc[0]['enumSeq']
    df.loc[seqstr, 'enumSeqLogLikelihood'] = sub_df.iloc[0]['enumSeqLogLikelihood']
    df.loc[seqstr, 'actSeqRank'] = 1 + np.nonzero(sub_df['enumSeq'] == seqstr)[0][0] # rank of actual sequence
    df.loc[seqstr, '#enumSeq'] = sub_df.index.shape[0]
df['actSeqRank(Top%)'] = 100*df['actSeqRank']/df['#enumSeq']

In [187]:
#df

In [188]:
print('mean: %.3f, std: %.3f' % (round(df['actSeqRank(Top%)'].mean(),3), round(df['actSeqRank(Top%)'].std(),3)))
df['actSeqRank(Top%)'].describe()


mean: 12.467, std: 14.549
Out[188]:
count    266.000000
mean      12.467022
std       14.548930
min        0.011396
25%        3.703704
50%        7.407407
75%       15.455840
max       66.666667
Name: actSeqRank(Top%), dtype: float64

In [189]:
ntop = np.nonzero(df['actSeqRank'] <= 5)[0].shape[0]
print('%d/%d = %.2f%%' % (ntop, df.index.shape[0], 100*ntop/df.index.shape[0]))


146/266 = 54.89%