Reproduce the Results of the IJCAI'15 Paper

NOTE: please view this page via the IPython Notebook Viewer service; otherwise the within-page links may not work properly.

1. Dataset

The dataset used in this paper can be downloaded here, along with a description and some statistics. However, one critical piece of information is missing from this dataset: the geo-location of each Point of Interest (POI). Without it, the travel time from one POI to another cannot be calculated unless the longitude and latitude of each POI are obtained by other means.

Simple statistics of this dataset

| City | #POIs | #Users | #POI_Visits | #Travel_Sequences |
|-----------|------:|-------:|------------:|------------------:|
| Toronto   | 29 | 1,395 | 39,419 | 6,057 |
| Osaka     | 27 |   450 |  7,747 | 1,115 |
| Glasgow   | 27 |   601 | 11,434 | 2,227 |
| Edinburgh | 28 | 1,454 | 33,944 | 5,028 |

NOTE: the number of photos for each city reported in the paper is NOT available in this dataset.

Fortunately, this information can be retrieved from the original YFCC100M dataset by searching for the individual photo IDs included in this dataset. The YFCC100M dataset can be downloaded from here easily (but with patience).

2. POI Geo-coordinates

For each photo used in this paper, its POI ID is available in the dataset. Thus, the longitude and latitude of each POI can be approximated by the mean of the coordinates of all its corresponding photos, which can be retrieved from the YFCC100M dataset.

To speed up the search, first extract the photo ID, longitude and latitude columns from the whole dataset:


In [ ]:
# field 1 is the photo ID; fields 11 and 12 are the longitude and latitude
cut -d $'\t' -f1,11,12 yfcc100m_dataset >> dataset.yfcc

then search for the coordinates by grepping each photo ID:


In [ ]:
# fields 1 and 4 of uservisit.txt are the photo ID and the POI ID
cut -d ';' -f 1,4 uservisit.txt |while read line
do
    photoid=`echo "$line" |cut -d ';' -f 1`
    # look up the photo ID at the start of a line in the extracted YFCC100M columns
    result=`grep -P "^$photoid\t" dataset.yfcc`
    if [ ! -z "$result" ]; then
        # store "photoID:longitude:latitude"
        coords=`echo "$result" |sed 's/\t/:/g'`
        echo "$coords" >> poi.coords
    fi
done

For further acceleration, one could import the YFCC100M dataset into a database (a time-consuming step) and then search by photo ID. The SQL statements to create a database and a table for the dataset look like:


In [ ]:
CREATE DATABASE yfcc100m;
CREATE TABLE yfcc100m.tdata(
    pv_id           BIGINT UNSIGNED NOT NULL UNIQUE PRIMARY KEY, /* Photo/video identifier */
    longitude       FLOAT,  /* Longitude */
    latitude        FLOAT   /* Latitude */
);
COMMIT;

and the Python code for importing the dataset:


In [ ]:
import mysql.connector as db

def import_data(fname):
    """Import data records from file"""
    dbconnection = db.connect(user='USERNAME', password='PASSWORD')
    cursor = dbconnection.cursor()
    with open(fname, 'r') as f:
        for line in f:
            items = line.split('\t')
            assert(len(items) == 3)
            pv_id     = items[0]
            longitude = items[1]
            latitude  = items[2]
            if len(longitude.strip()) == 0 or len(latitude.strip()) == 0:
                continue
            sqlstr = 'INSERT INTO yfcc100m.tdata VALUES (' + pv_id + ', ' + longitude.strip() + ', ' + latitude.strip() + ')' 
            try:
                cursor.execute(sqlstr)
            except db.Error as error:
                print('ERROR: {}'.format(error))
    dbconnection.commit()
    dbconnection.close()

Then searching for the coordinates of each photo becomes easy and fast:


In [ ]:
import mysql.connector as db

def search_coords(fin, fout):
    """Search Longitude and Latitude for each geo-tagged photo"""
    dbconnection = db.connect(user='USERNAME', password='PASSWORD', database='yfcc100m')
    cursor = dbconnection.cursor()
    records = []
    with open(fin, 'r') as f:
        for line in f:
            items = line.split(';')
            assert len(items) == 7
            photoID = items[0]
            poiID   = items[3]
            sqlstr = "SELECT ROUND(longitude, 6), ROUND(latitude, 6) FROM tdata WHERE pv_id = " + photoID
            cursor.execute(sqlstr)
            for longitude, latitude in cursor:
                records.append(poiID + ':' + photoID + ':' + str(longitude) + ':' + str(latitude))
    dbconnection.commit()
    dbconnection.close()
    with open(fout, 'w') as f:
        for line in records:
            f.write(line + '\n')
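
Once the per-photo coordinates are collected (e.g. in the output of search_coords, with one poiID:photoID:longitude:latitude record per line), the coordinates of each POI can be approximated by averaging those of its photos. A minimal sketch, assuming that input format and the illustrative file names below:


In [ ]:
from collections import defaultdict

def average_poi_coords(fin, fout):
    """Approximate each POI's location by the mean of its photos' coordinates."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # poiID -> [sum_lon, sum_lat, count]
    with open(fin, 'r') as f:
        for line in f:
            poiID, photoID, lon, lat = line.strip().split(':')
            sums[poiID][0] += float(lon)
            sums[poiID][1] += float(lat)
            sums[poiID][2] += 1
    with open(fout, 'w') as f:
        for poiID, (sum_lon, sum_lat, n) in sorted(sums.items()):
            f.write('%s:%.6f:%.6f\n' % (poiID, sum_lon / n, sum_lat / n))

# Illustrative file names, not part of the released data
average_poi_coords('poi-photo.coords', 'poi.coords')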

3. Recommend Itinerary

The paper formulates an Integer Linear Program (ILP) to recommend an itinerary given a set of POIs, a time budget, and source and destination POIs; it maximizes an objective function that combines POI popularity with personalized user interest (based on visit durations), subject to a budget on total travel and visiting time.

PuLP, from the COIN-OR project, is a useful Python library for modeling linear and integer programs; solvers such as GLPK, CBC, CPLEX and Gurobi can be called to solve the model. Its comprehensive documentation is available here. The formulation details can be found in the MIP_recommend() method.
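
To illustrate the PuLP modelling style only (this toy is not the formulation used in MIP_recommend(); all data values and names below are made up), a budget-constrained POI selection could be written as:


In [ ]:
import pulp

# Toy example: pick a subset of POIs maximizing a score under a time budget.
score  = {'p1': 3.0, 'p2': 1.5, 'p3': 2.0, 'p4': 2.5}   # attractiveness of each POI (made up)
cost   = {'p1': 1.0, 'p2': 0.5, 'p3': 2.0, 'p4': 1.5}   # visit time in hours (made up)
budget = 3.0                                            # total time budget in hours

prob = pulp.LpProblem('toy_poi_selection', pulp.LpMaximize)
x = pulp.LpVariable.dicts('visit', score.keys(), cat=pulp.LpBinary)

prob += pulp.lpSum(score[p] * x[p] for p in score)           # objective: total score
prob += pulp.lpSum(cost[p] * x[p] for p in cost) <= budget   # time budget constraint

prob.solve()   # uses the bundled CBC solver by default
print([p for p in score if x[p].value() == 1])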

4. Issues

  1. Is it necessary to consider visiting a certain POI more than once? The paper ignores this setting.

  2. Dealing with the edge case $\bar{V}(p) = 0$

    This happens at POIs where every visiting user took just one photo (or took/uploaded two or more photos with the same timestamp), so every recorded visit has zero duration; the case does appear in this dataset.

    For all users $U$, POI $p$, arrival time $t_{p}^a$ and departure time $t_{p}^d$, the average POI visit duration is defined as: $\bar{V}(p) = \frac{1}{n}\sum_{u \in U}\sum_{p_x \in S_u}(t_{p_x}^d - t_{p_x}^a)\delta(p_x = p), \forall p \in P$

    and Time-based User Interest is defined as: $Int_u^{Time}(c) = \sum_{p_x \in S_u} \frac{t_{p_x}^d - t_{p_x}^a}{\bar{V}(p_x)} \delta(Cat_{p_x} = c), \forall c \in C$

    Up to now, two strategies have been tried:

    • set the term $\frac{t_{p_x}^d - t_{p_x}^a}{\bar{V}(p_x)}$ to a constant $K$ (e.g. $K = 2$) whenever $\bar{V}(p_x) = 0$. This approach seems to work, but the effect of different constants should be tested (a sketch of this strategy follows the list below)
    • discard all photo records in the dataset related to the edge case. This approach throws away too much information and makes the usable dataset too small (sometimes only about 1% of the original dataset remains)
  3. CBC is still too slow for long sequences (length >= 4)
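
A minimal sketch of the first strategy in issue 2, computing $\bar{V}(p)$ and the time-based user interest with a constant fallback $K$. It assumes visits is a list of (user, poi, category, arrival, departure) tuples, an illustrative representation rather than the actual file format:


In [ ]:
from collections import defaultdict

K = 2.0  # fallback value for (t^d - t^a) / V(p) when V(p) = 0

def avg_visit_duration(visits):
    """V(p): mean visit duration at each POI over all recorded visits."""
    total, count = defaultdict(float), defaultdict(int)
    for user, poi, cat, arrival, departure in visits:
        total[poi] += departure - arrival
        count[poi] += 1
    return {poi: total[poi] / count[poi] for poi in total}

def time_based_interest(visits):
    """Int_u^Time(c): per-user interest in each POI category."""
    v_bar = avg_visit_duration(visits)
    interest = defaultdict(float)   # (user, category) -> interest
    for user, poi, cat, arrival, departure in visits:
        if v_bar[poi] > 0:
            interest[(user, cat)] += (departure - arrival) / v_bar[poi]
        else:
            interest[(user, cat)] += K   # edge case: V(p) = 0
    return interest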

5. Results

5.1 POI Popularity

POI Popularity $Pop(p)$: the number of times POI $p$ has been visited.

5.2 Time-based User Interest

The interest of a user $u$ in POI category $c$: $ Int_u^{Time}(c) = \sum_{p_x \in S_u} \frac{(t_{p_x}^d - t_{p_x}^a)}{\bar{V}(p_x)}\delta(Cat_{p_x} = c), \forall c \in C $

where $Cat_{p_x}$ is the category of POI $p_x$, $t_{p_x}^a$ is user $u$'s arrival time at POI $p_x$ and $t_{p_x}^d$ is the departure time. $S_u$ is user $u$'s travel history and $\delta(Cat_{p_x} = c)$ equals $1$ if $Cat_{p_x} = c$ and $0$ otherwise. $\bar{V}(p)$ is the average visit duration at POI $p$ over all users, $ \bar{V}(p) = \frac{1}{n} \sum_{u \in U} \sum_{p_x \in S_u} (t_{p_x}^d - t_{p_x}^a) \delta(p_x = p), \forall p \in P $

5.3 Frequency-based User Interest

Frequency-based User Interest $Int_u^{Freq}(c)$: the number of times user $u$ visits POIs of category $c$
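
Both $Pop(p)$ (Section 5.1) and $Int_u^{Freq}(c)$ are plain counts over the visit records; a minimal sketch, reusing the illustrative visits representation from Section 4:


In [ ]:
from collections import Counter

def poi_popularity(visits):
    """Pop(p): number of times each POI has been visited."""
    return Counter(poi for user, poi, cat, arrival, departure in visits)

def freq_based_interest(visits):
    """Int_u^Freq(c): number of times user u visits POIs of category c."""
    return Counter((user, cat) for user, poi, cat, arrival, departure in visits)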

5.4 Precision, Recall and F1-score

Settings: Toronto, $\eta = 0.5$ with time-based user interest and POI popularity, leave-one-out evaluation; 42/335 ≈ 12.5% of solutions are suboptimal.

|           | Recall | Precision | F1-score |
|-----------|--------|-----------|----------|
| Paper     | 0.779±0.010 | 0.706±0.013 | 0.732±0.012 |
| Reproduce | 0.732±0.179 | 0.736±0.181 | 0.734±0.179 |

Number of Recall, Precision and F1-score values: 335
Unique values of Recall, Precision, F1-score: 0.40, 0.44, 0.50, 0.57, 0.60, 0.63, 0.67, 0.71, 0.75, 0.78, 0.80, 0.83, 0.85, 1.00

Box plot of Recall, Precision and F1-score

Settings: Toronto, $\eta = 0.5$ with frequency-based user interest and POI popularity, leave-one-out evaluation; 43/335 ≈ 12.8% of solutions are suboptimal.

|           | Recall | Precision | F1-score |
|-----------|--------|-----------|----------|
| Paper     | 0.760±0.009 | 0.679±0.013 | 0.708±0.012 |
| Reproduce | 0.709±0.179 | 0.711±0.180 | 0.710±0.179 |

Number of Recall, Precision and F1-score values: 335
Unique values of Recall, Precision, F1-score: 0.40, 0.43, 0.50, 0.56, 0.57, 0.60, 0.63, 0.67, 0.71, 0.75, 0.80, 0.83, 0.85, 1.00

Box plot of Recall, Precision and F1-score
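
For reference, a minimal sketch of how the recall, precision and F1-score of a single leave-one-out fold could be computed, assuming (this is an assumption, not taken from the evaluation code) that they are defined over the sets of POIs in the actual and recommended sequences:


In [ ]:
def evaluate(actual, recommended):
    """Recall, precision and F1-score of a recommended POI sequence
    against the actual travel sequence (set-based; an assumption)."""
    actual, recommended = set(actual), set(recommended)
    hits = len(actual & recommended)
    recall    = hits / len(actual)
    precision = hits / len(recommended)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return recall, precision, f1

# Example: actual POIs [1, 3, 5, 7], recommended [1, 3, 6, 7]
print(evaluate([1, 3, 5, 7], [1, 3, 6, 7]))  # (0.75, 0.75, 0.75)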

5.5 Transition matrix for POI categories

Settings for the recommended trajectories: Toronto, $\eta = 0.5$ with time-based user interest; 59/335 ≈ 17.6% of solutions are suboptimal.

NOTE: the value of matrix[i, j] denotes the probability of visiting category j after visiting category i for an average visitor.

Transition matrix for recommended trajectories:

| From \ To | Amusement | Beach | Cultural | Shopping | Sport | Structure |
|-----------|-----------|-------|----------|----------|-------|-----------|
| Amusement | 0.035 | 0.266 | 0.259 | 0.098 | 0.266 | 0.077 |
| Beach     | 0.177 | 0.222 | 0.254 | 0.113 | 0.069 | 0.165 |
| Cultural  | 0.209 | 0.295 | 0.175 | 0.107 | 0.124 | 0.090 |
| Shopping  | 0.146 | 0.404 | 0.281 | 0.000 | 0.090 | 0.079 |
| Sport     | 0.229 | 0.219 | 0.305 | 0.086 | 0.086 | 0.076 |
| Structure | 0.212 | 0.282 | 0.247 | 0.094 | 0.141 | 0.024 |

Transition matrix for actual trajectories:

| From \ To | Amusement | Beach | Cultural | Shopping | Sport | Structure |
|-----------|-----------|-------|----------|----------|-------|-----------|
| Amusement | 0.091 | 0.117 | 0.344 | 0.110 | 0.240 | 0.097 |
| Beach     | 0.059 | 0.127 | 0.183 | 0.269 | 0.056 | 0.305 |
| Cultural  | 0.131 | 0.204 | 0.114 | 0.201 | 0.064 | 0.286 |
| Shopping  | 0.057 | 0.358 | 0.219 | 0.057 | 0.065 | 0.244 |
| Sport     | 0.317 | 0.183 | 0.167 | 0.103 | 0.063 | 0.167 |
| Structure | 0.084 | 0.303 | 0.261 | 0.200 | 0.077 | 0.074 |
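
A minimal sketch of how such a transition matrix could be built from a list of category sequences (one list of POI categories per trajectory; the function and variable names are illustrative):


In [ ]:
import numpy as np

def transition_matrix(cat_sequences, categories):
    """Row-normalised counts of category j immediately following category i."""
    index = {c: i for i, c in enumerate(categories)}
    counts = np.zeros((len(categories), len(categories)))
    for seq in cat_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[index[a], index[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums > 0, row_sums, 1)  # avoid division by zero

categories = ['Amusement', 'Beach', 'Cultural', 'Shopping', 'Sport', 'Structure']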

6. Tour Enumeration

For recommended tours/trajectories of length 3, 4 and 5, enumerating all possibilities makes the recommendation results much easier to understand than using the ILP to get only a single best recommendation.

Settings for enumeration: Toronto, $\eta = 0.5$ with time-based user interest, lengths of actual and recommended trajectories $\in \{3, 4, 5\}$; data are available here.
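
A minimal sketch of the enumeration itself, assuming fixed source and destination POIs and an externally supplied score(sequence) function implementing the objective (both are assumptions for illustration):


In [ ]:
from itertools import permutations

def enumerate_tours(pois, source, dest, length, score):
    """Score every tour of the given length from source to dest."""
    middle = [p for p in pois if p not in (source, dest)]
    tours = []
    for inner in permutations(middle, length - 2):
        seq = [source] + list(inner) + [dest]
        tours.append((score(seq), seq))
    return sorted(tours, key=lambda t: t[0], reverse=True)   # best-scoring tours first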

Interactive Python scripts generate boxplots of the enumerated trajectories' scores as well as the scores of the actual travel sequences, e.g.


In [ ]:
python3 plot_enumseq.py Toro_eta05_time_seqs.list

The file Toro_eta05_time_seqs.list is available here.

7. Transition matrix for time at POI

Visit duration matrix $M$ for each user at each POI, data: POI_visit_duration.zip.

Each cell $M_{ij}$ denotes the $\log_{10}$ of the visit duration of user $i$ at POI $j$, i.e. $M_{ij} = \log_{10}(\text{visit duration of user } i \text{ at POI } j)$ if the visit duration is greater than $0$, and $0$ otherwise.
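
A minimal sketch of building $M$, assuming a dict durations mapping (user, poi) pairs to visit durations in seconds (the names and data layout are illustrative):


In [ ]:
import math

def duration_matrix(durations, users, pois):
    """M[i][j] = log10(visit duration of user i at POI j), or 0 if no/zero duration."""
    return [[math.log10(durations[(u, p)]) if durations.get((u, p), 0) > 0 else 0.0
             for p in pois]
            for u in users]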