Expedia Hotel Recommendations Kaggle competition

Peeter Piksarv (piksarv .at. gmail.com)

The latest version of this Jupyter notebook is available at https://github.com/ppik/playdata/tree/master/Kaggle-Expedia

This is my take on this particular Kaggle competition, started off using the Dataquest tutorial by Vik Paruchuri.


In [1]:
import itertools
import operator
import random

import matplotlib.pyplot as plt
import ml_metrics as metrics
import numpy as np
import pandas as pd
import sklearn
import sklearn.decomposition
import sklearn.ensemble

%matplotlib notebook

Data import

There's actually no need to unpack the gzipped csv files; pandas' read_csv can handle those, although it can be slower (reading 1000000 rows from train.csv.gz seems to be about 9% slower than from train.csv on my laptop).
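
As a rough check, the two reads can be timed directly (a sketch; assumes data/train.csv has been extracted next to the gzipped file):


In [ ]:
import timeit
t_gz = timeit.timeit(lambda: pd.read_csv('data/train.csv.gz', nrows=1000000), number=1)
t_csv = timeit.timeit(lambda: pd.read_csv('data/train.csv', nrows=1000000), number=1)
print('gz: {:.1f} s, csv: {:.1f} s'.format(t_gz, t_csv))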

Additionally, it's a good idea to specify the data types for each column to ease the memory requirements. By default pandas detects the following data types:


In [2]:
train = pd.read_csv('data/train.csv.gz', nrows=10)
train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 24 columns):
date_time                    10 non-null object
site_name                    10 non-null int64
posa_continent               10 non-null int64
user_location_country        10 non-null int64
user_location_region         10 non-null int64
user_location_city           10 non-null int64
orig_destination_distance    6 non-null float64
user_id                      10 non-null int64
is_mobile                    10 non-null int64
is_package                   10 non-null int64
channel                      10 non-null int64
srch_ci                      10 non-null object
srch_co                      10 non-null object
srch_adults_cnt              10 non-null int64
srch_children_cnt            10 non-null int64
srch_rm_cnt                  10 non-null int64
srch_destination_id          10 non-null int64
srch_destination_type_id     10 non-null int64
is_booking                   10 non-null int64
cnt                          10 non-null int64
hotel_continent              10 non-null int64
hotel_country                10 non-null int64
hotel_market                 10 non-null int64
hotel_cluster                10 non-null int64
dtypes: float64(1), int64(20), object(3)
memory usage: 2.0+ KB

According to the specification, the data fields are the following:

train.csv

| Column name | Description | Data type | Equiv. type | Notes |
| --- | --- | --- | --- | --- |
| date_time | Timestamp | string | | [1] |
| site_name | ID of the Expedia point of sale | int | np.int32 | |
| posa_continent | ID of continent associated with site_name | int | np.int32 | |
| user_location_country | The ID of the country the customer is located in | int | np.int32 | |
| user_location_region | The ID of the region the customer is located in | int | np.int32 | |
| user_location_city | The ID of the city the customer is located in | int | np.int32 | |
| orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double | np.float64 | |
| user_id | ID of user | int | np.int32 | |
| is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint | np.uint8 | [2] |
| is_package | 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise | int | np.uint8 | [2] |
| channel | ID of a marketing channel | int | np.int32 | |
| srch_ci | Checkin date | string | | [1] |
| srch_co | Checkout date | string | | [1] |
| srch_adults_cnt | The number of adults specified in the hotel room | int | np.int32 | |
| srch_children_cnt | The number of (extra occupancy) children specified in the hotel room | int | np.int32 | [4] |
| srch_rm_cnt | The number of hotel rooms specified in the search | int | np.int32 | [4] |
| srch_destination_id | ID of the destination where the hotel search was performed | int | np.int32 | |
| srch_destination_type_id | Type of destination | int | np.int32 | |
| hotel_continent | Hotel continent | int | np.int32 | |
| hotel_country | Hotel country | int | np.int32 | |
| hotel_market | Hotel market | int | np.int32 | |
| is_booking | 1 if a booking, 0 if a click | tinyint | np.uint8 | [2] |
| cnt | Number of similar events in the context of the same user session | bigint | np.int64 | |
| hotel_cluster | ID of a hotel cluster | int | np.int32 | |

destinations.csv

| Column name | Description | Data type | Equiv. type | Notes |
| --- | --- | --- | --- | --- |
| srch_destination_id | ID of the destination where the hotel search was performed | int | np.int32 | |
| d1-d149 | latent description of search regions | double | np.float64 | [3,5] |

Notes

  1. Probably it would be a good idea to parse dates while loading the data. Useful features derived from the date information may include the duration of the stay, the season/month, how far in advance the booking is made, etc. (see the sketch after these notes).
  2. May use np.bool instead.
  3. Single or even half precision might be enough when taking the latent descriptions of the search regions into account.
  4. If taking the columns srch_children_cnt or srch_rm_cnt into account, it may be worthwhile to first simplify them to boolean values indicating whether any children and/or rooms were specified in the search.
  5. Maybe the required clustering can be done solely on the basis of the latent descriptions of the search regions.
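
A sketch of the date-derived features mentioned in note 1, assuming the date columns have been parsed to datetime (as done below); the feature names are my own:


In [ ]:
stay_length = (train['srch_co'] - train['srch_ci']).dt.days    # duration of the stay
booking_lead = (train['srch_ci'] - train['date_time']).dt.days  # how far in advance the booking is made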

In [2]:
traincols = ['date_time', 'site_name', 'posa_continent', 'user_location_country',
             'user_location_region', 'user_location_city', 'orig_destination_distance',
             'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co',
             'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
             'srch_destination_type_id', 'is_booking', 'cnt', 'hotel_continent',
             'hotel_country', 'hotel_market', 'hotel_cluster']
testcols = ['id', 'date_time', 'site_name', 'posa_continent', 'user_location_country',
            'user_location_region', 'user_location_city', 'orig_destination_distance',
            'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co',
            'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
            'srch_destination_type_id', 'hotel_continent', 'hotel_country', 'hotel_market']

Finding columns in testcols but not in traincols and vice versa:


In [4]:
[col for col in testcols if col not in traincols]


Out[4]:
['id']

In [5]:
[col for col in traincols if col not in testcols]


Out[5]:
['is_booking', 'cnt', 'hotel_cluster']

I don't know exactly which data columns I will eventually be using, but I will define the data types for them here anyway, just in case. Looking at the data, most of the columns are actually non-negative integers, so I can use unsigned integers in most cases. The choice between uint8, uint16, uint32, and others was determined by the min and max values in the test dataset.
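
The ranges themselves can be checked on a sample (a sketch; describe() keeps only the numeric columns by default):


In [ ]:
sample = pd.read_csv('data/train.csv.gz', nrows=100000)
sample.describe().loc[['min', 'max']].T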


In [3]:
def read_csv(filename, cols, nrows=None):
    datecols = ['date_time', 'srch_ci', 'srch_co']
    dateparser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S', errors='coerce')

    dtypes = {
        'id': np.uint32,
        'site_name': np.uint8,
        'posa_continent': np.uint8,
        'user_location_country': np.uint16,
        'user_location_region': np.uint16,
        'user_location_city': np.uint16,
        'orig_destination_distance': np.float32,
        'user_id': np.uint32,
        'is_mobile': bool,
        'is_package': bool,
        'channel': np.uint8,
        'srch_adults_cnt': np.uint8,
        'srch_children_cnt': np.uint8,
        'srch_rm_cnt': np.uint8,
        'srch_destination_id': np.uint32,
        'srch_destination_type_id': np.uint8,
        'is_booking': bool,
        'cnt': np.uint64,
        'hotel_continent': np.uint8,
        'hotel_country': np.uint16,
        'hotel_market': np.uint16,
        'hotel_cluster': np.uint8,
    }

    df = pd.read_csv(
        filename,
        nrows=nrows,
        usecols=cols,
        dtype=dtypes, # dtype can also specify data types for columns that do not exist in this particular data file
        parse_dates=[col for col in datecols if col in cols], # columns here must also be in usecols
        date_parser=dateparser,
    )
    return df

In [5]:
train = read_csv('data/train.csv.gz', nrows=None, cols=traincols)
train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37670293 entries, 0 to 37670292
Data columns (total 24 columns):
date_time                    datetime64[ns]
site_name                    uint8
posa_continent               uint8
user_location_country        uint16
user_location_region         uint16
user_location_city           uint16
orig_destination_distance    float32
user_id                      uint32
is_mobile                    bool
is_package                   bool
channel                      uint8
srch_ci                      datetime64[ns]
srch_co                      datetime64[ns]
srch_adults_cnt              uint8
srch_children_cnt            uint8
srch_rm_cnt                  uint8
srch_destination_id          uint32
srch_destination_type_id     uint8
is_booking                   bool
cnt                          uint64
hotel_continent              uint8
hotel_country                uint16
hotel_market                 uint16
hotel_cluster                uint8
dtypes: bool(3), datetime64[ns](3), float32(1), uint16(5), uint32(2), uint64(1), uint8(9)
memory usage: 2.3 GB

With these type definitions the entire training set of 37 million entries takes 2.3 GB of memory.
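
The exact in-memory size can also be computed directly (deep=True additionally measures object columns):


In [ ]:
train.memory_usage(deep=True).sum() / 2**30  # size in GiB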


In [6]:
test = read_csv('data/test.csv.gz', cols=testcols)
test.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2528243 entries, 0 to 2528242
Data columns (total 22 columns):
id                           uint32
date_time                    datetime64[ns]
site_name                    uint8
posa_continent               uint8
user_location_country        uint16
user_location_region         uint16
user_location_city           uint16
orig_destination_distance    float32
user_id                      uint32
is_mobile                    bool
is_package                   bool
channel                      uint8
srch_ci                      datetime64[ns]
srch_co                      datetime64[ns]
srch_adults_cnt              uint8
srch_children_cnt            uint8
srch_rm_cnt                  uint8
srch_destination_id          uint32
srch_destination_type_id     uint8
hotel_continent              uint8
hotel_country                uint16
hotel_market                 uint16
dtypes: bool(2), datetime64[ns](3), float32(1), uint16(5), uint32(3), uint8(8)
memory usage: 144.7 MB

Finding missing values in test data:


In [9]:
test.isnull().sum()


Out[9]:
id                                0
date_time                         0
site_name                         0
posa_continent                    0
user_location_country             0
user_location_region              0
user_location_city                0
orig_destination_distance    847461
user_id                           0
is_mobile                         0
is_package                        0
channel                           0
srch_ci                          22
srch_co                          17
srch_adults_cnt                   0
srch_children_cnt                 0
srch_rm_cnt                       0
srch_destination_id               0
srch_destination_type_id          0
hotel_continent                   0
hotel_country                     0
hotel_market                      0
dtype: int64

There are also some rows where the check-in date is later than the check-out date:


In [10]:
(test.srch_ci > test.srch_co).sum()


Out[10]:
2184
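
One hedged way to handle these (not done in this notebook) would be to null out the inverted pairs so that derived stay lengths cannot go negative:


In [ ]:
bad = test.srch_ci > test.srch_co
test.loc[bad, ['srch_ci', 'srch_co']] = pd.NaT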

Checking that all of the user_ids in the test set are contained in the training set:


In [7]:
test_ids = set(test.user_id.unique())
train_ids = set(train.user_id.unique())
test_ids <= train_ids # issubset


Out[7]:
True

However, not all user_ids that are in the training data appear in the test data:


In [12]:
len(train_ids - test_ids)


Out[12]:
17209

Extract month and year fields from the date


In [8]:
train['month'] = train['date_time'].dt.month.astype(np.uint8)
train['year'] = train['date_time'].dt.year.astype(np.uint16)

Pick 10,000 users for smaller-scale testing


In [12]:
sel_user_ids = sorted(random.sample(list(train_ids), 10000))  # list() because newer Python can't sample from a set
sel_train = train[train.user_id.isin(sel_user_ids)]

Create new test and training sets


In [13]:
t1 = sel_train[((sel_train.year == 2013) | ((sel_train.year == 2014) & (sel_train.month < 8)))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >= 8))]

Remove click events from t2, as in the original test data.


In [14]:
t2 = t2[t2.is_booking == True]

Model 0: Most common clusters

Start by looking at the most common clusters and their properties.


In [57]:
most_common_clusters = list(train.hotel_cluster.value_counts().head().index)

Predicting most_common_clusters for every single row in the selected test data.


In [13]:
predictions = [most_common_clusters for i in range(len(t2))]

Calculating Mean Average Precision with mapk from ml_metrics.


In [14]:
target = [[l] for l in t2['hotel_cluster']]
metrics.mapk(target, predictions, k=5)


Out[14]:
0.066735918744228989
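
For reference, a minimal re-implementation of the per-row average precision that mapk averages over rows (mirroring ml_metrics' definition; not used below):


In [ ]:
def apk(actual, predicted, k=5):
    """Average precision at k for a single row, as averaged by mapk."""
    score, hits = 0.0, 0
    for i, p in enumerate(predicted[:k]):
        if p in actual and p not in predicted[:i]:  # count each correct cluster once
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(actual), k) if actual else 0.0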

That's not too great.

Model 1


In [20]:
#train.corr()['hotel_cluster']
# Calculating the correlations takes a while, and no linear correlations were found in the tutorial either.

Generating features from destinations


In [25]:
dest = pd.read_csv('data/destinations.csv.gz')
dest.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62106 entries, 0 to 62105
Columns: 150 entries, srch_destination_id to d149
dtypes: float64(149), int64(1)
memory usage: 71.1 MB

In [26]:
dest.head()


Out[26]:
srch_destination_id d1 d2 d3 d4 d5 d6 d7 d8 d9 ... d140 d141 d142 d143 d144 d145 d146 d147 d148 d149
0 0 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -1.897627 -2.198657 -2.198657 -1.897627 ... -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657
1 1 -2.181690 -2.181690 -2.181690 -2.082564 -2.181690 -2.165028 -2.181690 -2.181690 -2.031597 ... -2.165028 -2.181690 -2.165028 -2.181690 -2.181690 -2.165028 -2.181690 -2.181690 -2.181690 -2.181690
2 2 -2.183490 -2.224164 -2.224164 -2.189562 -2.105819 -2.075407 -2.224164 -2.118483 -2.140393 ... -2.224164 -2.224164 -2.196379 -2.224164 -2.192009 -2.224164 -2.224164 -2.224164 -2.224164 -2.057548
3 3 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.115485 -2.177409 -2.177409 -2.177409 ... -2.161081 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409
4 4 -2.189562 -2.187783 -2.194008 -2.171153 -2.152303 -2.056618 -2.194008 -2.194008 -2.145911 ... -2.187356 -2.194008 -2.191779 -2.194008 -2.194008 -2.185161 -2.194008 -2.194008 -2.194008 -2.188037

5 rows × 150 columns


In [27]:
pca = sklearn.decomposition.PCA(n_components=3)
dest_small = pca.fit_transform(dest[['d{}'.format(i) for i in range(1,150)]])
dest_small = pd.DataFrame(dest_small)
dest_small['srch_destination_id'] = dest['srch_destination_id']

In [46]:
dest_small.head()


Out[46]:
0 1 2 srch_destination_id
0 0.044268 -0.169419 -0.032522 0
1 0.440761 -0.077405 0.091572 1
2 -0.001033 -0.020677 -0.012108 2
3 0.480467 0.040345 0.019320 3
4 0.207253 0.042694 0.011744 4

The variance ratio retained using principal component analysis with 3 principal components:


In [49]:
sum(pca.explained_variance_ratio_)


Out[49]:
0.61572578062540773
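
To see how much more variance additional components would retain, one could fit a full PCA and inspect the cumulative ratios (a sketch; pca_full is my own name):


In [ ]:
pca_full = sklearn.decomposition.PCA().fit(dest[['d{}'.format(i) for i in range(1, 150)]])
np.cumsum(pca_full.explained_variance_ratio_)[:10]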

Generating features

  • New date features based on date_time, srch_ci, and srch_co.
  • Remove non-numeric columns like date_time.
  • Add in features from dest_small.
  • Replace any missing values with -1. (I initially planned to use unsigned integers for most of the variables; using -1 as a fill value would not work then. May test replacing NAs with the most common values instead.)

In [62]:
def calc_fast_features(df):
    # Assumes that the data frame columns date_time, srch_ci, and srch_co have already been converted to datetime.
    props = {}
    for prop in ['month', 'day', 'hour', 'minute', 'dayofweek', 'quarter']:
        props[prop] = getattr(df['date_time'].dt, prop)

    carryover = [p for p in df.columns if p not in ['date_time', 'srch_ci', 'srch_co']]
    for prop in carryover:
        props[prop] = df[prop]

    date_props = ['month', 'day', 'dayofweek', 'quarter']
    for prop in date_props:
        props['ci_{}'.format(prop)] = getattr(df['srch_ci'].dt, prop)
        props['co_{}'.format(prop)] = getattr(df['srch_co'].dt, prop)
    props['stay_span'] = (df['srch_co'] - df['srch_ci']).astype('timedelta64[h]')

    ret = pd.DataFrame(props)

    ret = ret.join(dest_small, on='srch_destination_id', how='left', rsuffix='dest')
    ret = ret.drop('srch_destination_iddest', axis=1)
    return ret

In [63]:
df = calc_fast_features(t1)

Using mean values to fill missing data.


In [74]:
df = df.fillna(df.mean())

Random forest classifier

Using 5-fold cross validation on the training data to estimate error.


In [82]:
import sklearn.cross_validation  # not imported at the top; superseded by sklearn.model_selection in newer scikit-learn

predictors = [c for c in df.columns if c not in ['hotel_cluster']]

clf = sklearn.ensemble.RandomForestClassifier(
    n_estimators=10, 
    min_weight_fraction_leaf=0.1,
)
scores = sklearn.cross_validation.cross_val_score(
    clf,
    df[predictors],
    df['hotel_cluster'],
    cv=5,
)
scores


Out[82]:
array([ 0.0652551 ,  0.06915912,  0.06577469,  0.06241378,  0.06689502])

Classifier accuracy seems rather low here as well.

Random forest with binary classifiers

Random forests should work better if only a single hotel cluster is predicted at a time.


In [107]:
all_probs = []
unique_clusters = df['hotel_cluster'].unique()
for cluster in unique_clusters:
    df['target'] = 0
    df.loc[df['hotel_cluster'] == cluster, 'target'] = 1
    predictors = [c for c in df.columns if c not in ['hotel_cluster', 'target']]
    probs = []
    cv = sklearn.cross_validation.KFold(len(df), n_folds=5)
    clf = sklearn.ensemble.RandomForestClassifier(
        n_estimators=10,
        min_weight_fraction_leaf=0.1,
    )
    for i, (tr, te) in enumerate(cv):
        clf.fit(df[predictors].iloc[tr], df['target'].iloc[tr])
        preds = clf.predict_proba(df[predictors].iloc[te])
        probs.append([p[1] for p in preds])  # probability of the positive class
    full_probs = itertools.chain.from_iterable(probs)
    all_probs.append(list(full_probs))

prediction_frame = pd.DataFrame(all_probs).T
prediction_frame.columns = unique_clusters
def find_top5(row):
    return list(row.nlargest(5).index)

preds = []
for index, row in prediction_frame.iterrows():
    preds.append(find_top5(row))

metrics.mapk([[l] for l in t2['hotel_cluster']], preds, k=5)


Out[107]:
0.046900535362073816

Using just the most popular clusters gives better scores, so the approach here isn't particularly promising. One thing to note is that the input is full of categorical features; therefore, to properly apply machine learning, converting those values to separate binary (one-hot) features may be a more appropriate approach.
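
As an untested sketch of that idea, a few of the categorical ID columns could be one-hot encoded with pandas (cat_cols is an illustrative choice of mine):


In [ ]:
cat_cols = ['site_name', 'srch_destination_type_id']  # illustrative choice, not tuned
df_onehot = pd.get_dummies(df, columns=[c for c in cat_cols if c in df.columns])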

Model 2

Finding top hotel clusters for each destination.


In [16]:
def make_key(items):
    return '_'.join([str(i) for i in items])

In [71]:
match_cols = ['srch_destination_id']
cluster_cols = match_cols + ['hotel_cluster']
groups = t1.groupby(cluster_cols)

In [73]:
top_clusters = {}
for name, group in groups:
    bookings = group['is_booking'].sum()
    clicks = len(group) - bookings
    
    score = bookings + .15*clicks
    
    clus_name = make_key(name[:len(match_cols)])
    if clus_name not in top_clusters:
        top_clusters[clus_name] = {}
    top_clusters[clus_name][name[-1]] = score

This dictionary has srch_destination_id keys, and each value is another dictionary with hotel clusters as keys and scores as values.
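
For example (with made-up numbers), top_clusters['8250'] might look like {91: 4.3, 41: 2.15, 48: 1.15}.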

Finding the top 5 for each destination.


In [19]:
cluster_dict = {}
for n in top_clusters:
    tc = top_clusters[n]
    top = [l[0] for l in sorted(tc.items(), key=operator.itemgetter(1), reverse=True)[:5]]
    cluster_dict[n] = top

Making predictions based on destination


In [20]:
preds = []
for index, row in t2.iterrows():
    key = make_key([row[m] for m in match_cols])
    if key in cluster_dict:
        preds.append(cluster_dict[key])
    else:
        preds.append(most_common_clusters)

metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)


Out[20]:
0.24709650231774125

In [ ]:
cluster_dict

Data leak

Utilizing the data leak that allows matching rows in the training set from the test set, using a set of columns including user_location_country, user_location_region, user_location_city, hotel_market, and orig_destination_distance.


In [41]:
match_cols = [
    'user_location_country',
    'user_location_region',
    'user_location_city',
    'hotel_market',
    'orig_destination_distance',
]

groups = t1.groupby(match_cols)

def generate_exact_matches(row, match_cols):
    index = tuple(row[t] for t in match_cols)
    try:
        group = groups.get_group(index)
    except KeyError:
        return []
    clus = list(set(group.hotel_cluster))
    return clus

exact_matches = []
for i in range(t2.shape[0]):
    exact_matches.append(generate_exact_matches(t2.iloc[i], match_cols))

Combining predictions

Combine exact_matches, preds, and most_common_clusters.


In [43]:
def f5(seq, idfun=None):
    """Uniquify a list by Peter Bengtsson
    https://www.peterbe.com/plog/uniqifiers-benchmark
    """
    if idfun is None:
        def idfun(x):
            return x

    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen:
            continue
        seen[marker] = 1
        result.append(item)
    return result
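
For example:


In [ ]:
f5([3, 1, 3, 2, 1])  # -> [3, 1, 2]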

In [44]:
full_preds = [
    f5(exact_matches[p] + preds[p] + most_common_clusters)[:5]
    for p
    in range(len(preds))
]
metrics.mapk([[l] for l in t2["hotel_cluster"]], full_preds, k=5)


Out[44]:
0.28300884955752215

Submission file

I'll clean up the code here and make a separate script that uses the full dataset for generating a submission file.

Here I'll just test the file-writing part.


In [56]:
write_p = [" ".join([str(l) for l in p]) for p in full_preds]
write_frame = ["{},{}".format(t2.index[i], write_p[i]) for i in range(len(full_preds))]
write_frame = ["id,hotel_clusters"] + write_frame
with open('predictions.csv', 'w+') as f:
    f.write('\n'.join(write_frame))
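
Equivalently, the same file could be written with pandas (a sketch; sub is my own name):


In [ ]:
sub = pd.DataFrame({
    'id': t2.index,
    'hotel_cluster': write_p,
}, columns=['id', 'hotel_cluster'])
sub.to_csv('predictions.csv', index=False)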

-- Peeter Piksarv