Expedia Hotel Recommendations Kaggle competition

Peeter Piksarv (piksarv .at. gmail.com)

The latest version of this Jupyter notebook is available at https://github.com/ppik/playdata/tree/master/Kaggle-Expedia

Here I'll try to test some machine learning techniques on this dataset.



In [1]:

    
import collections
import itertools
import operator
import random

import heapq
import matplotlib.pyplot as plt
import ml_metrics as metrics
import numpy as np
import pandas as pd
import sklearn
import sklearn.decomposition
import sklearn.linear_model
import sklearn.preprocessing

%matplotlib notebook

Data import

Defining a list of available data columns:



In [2]:

    
traincols = ['date_time', 'site_name', 'posa_continent', 'user_location_country',
             'user_location_region', 'user_location_city', 'orig_destination_distance',
             'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co',
             'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
             'srch_destination_type_id', 'is_booking', 'cnt', 'hotel_continent',
             'hotel_country', 'hotel_market', 'hotel_cluster']
testcols = ['id', 'date_time', 'site_name', 'posa_continent', 'user_location_country',
            'user_location_region', 'user_location_city', 'orig_destination_distance',
            'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co',
            'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
            'srch_destination_type_id', 'hotel_continent', 'hotel_country', 'hotel_market']

Convenience function for reading the data in:



In [3]:

    
def read_csv(filename, cols, nrows=None):
    datecols = ['date_time', 'srch_ci', 'srch_co']
    dateparser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S', errors='coerce')

    dtypes = {
        'id': np.uint32,
        'site_name': np.uint8,
        'posa_continent': np.uint8,
        'user_location_country': np.uint16,
        'user_location_region': np.uint16,
        'user_location_city': np.uint16,
        'orig_destination_distance': np.float32,
        'user_id': np.uint32,
        'is_mobile': bool,
        'is_package': bool,
        'channel': np.uint8,
        'srch_adults_cnt': np.uint8,
        'srch_children_cnt': np.uint8,
        'srch_rm_cnt': np.uint8,
        'srch_destination_id': np.uint32,
        'srch_destination_type_id': np.uint8,
        'is_booking': bool,
        'cnt': np.uint64,
        'hotel_continent': np.uint8,
        'hotel_country': np.uint16,
        'hotel_market': np.uint16,
        'hotel_cluster': np.uint8,
    }

    df = pd.read_csv(
        filename,
        nrows=nrows,
        usecols=cols,
        dtype=dtypes,
        parse_dates=[col for col in datecols if col in cols],
        date_parser=dateparser,
    )
    
    if 'date_time' in df.columns:
        df['month'] = df['date_time'].dt.month.astype(np.uint8)
        df['year'] = df['date_time'].dt.year.astype(np.uint16)
        
    if 'srch_ci' in df.columns and 'srch_co' in df.columns:
        df['srch_ngt'] = (df['srch_co'] - df['srch_ci']).astype('timedelta64[h]')
    
    if 'srch_children_cnt' in df.columns:
        df['is_family'] = np.array(df['srch_children_cnt'] > 0)
    
    return df
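
As a small illustration of the stay-length feature, here is a minimal standard-library sketch that counts whole nights between a check-in and a check-out date. The helper name `nights_between` is made up for this example, and returning None for unparseable dates is an assumption of the sketch. Note that it counts days, whereas `srch_ngt` above is cast to hours.

```python
from datetime import datetime

def nights_between(check_in, check_out, fmt='%Y-%m-%d'):
    """Return the number of nights between two date strings, or None if either is invalid."""
    try:
        ci = datetime.strptime(check_in, fmt)
        co = datetime.strptime(check_out, fmt)
    except (TypeError, ValueError):
        # Missing or malformed dates are reported as None (an assumption of this sketch)
        return None
    return (co - ci).days

print(nights_between('2014-08-27', '2014-08-31'))  # 4
print(nights_between('2014-08-27', 'not a date'))  # None
```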



In [4]:

    
train = read_csv('data/train.csv.gz', nrows=None, cols=traincols)

Getting the set of all unique user_ids in the sample.



In [5]:

    
train_ids = set(train.user_id.unique())
len(train_ids)









    Out[5]:





1198786

Pick a subset of users for testing and validation



In [7]:

    
# random.sample() needs a sequence; sampling directly from a set fails on Python 3.11+
sel_user_ids = sorted(random.sample(sorted(train_ids), 12000))
sel_train = train[train.user_id.isin(sel_user_ids)]
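
The sample above is drawn without a fixed seed, so every run picks different users and the scores further down shift a little between runs. A minimal sketch of a reproducible draw follows; the helper name `sample_users` and its `seed` parameter are assumptions for illustration and are not used elsewhere in this notebook.

```python
import random

def sample_users(user_ids, n, seed=0):
    """Draw n distinct user ids reproducibly.

    Sorting the ids first makes the draw independent of set iteration order.
    """
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return sorted(rng.sample(sorted(user_ids), n))

all_ids = {12, 93, 501, 756, 1048, 1482, 2451, 3925}  # hypothetical user ids
subset = sample_users(all_ids, 3, seed=42)
print(subset)
```

Calling `sample_users` again with the same ids and seed returns the same subset.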

Create new training and test sets, using events (clicks and bookings) from 2013 as training data and events from 2014 as test data.



In [8]:

    
cv_train = sel_train[sel_train.year == 2013]
cv_test = sel_train[sel_train.year == 2014]

Remove click events from cv_test, since the original test data contains only bookings.



In [9]:

    
cv_test = cv_test[cv_test.is_booking]

Model 0: Most common clusters

Public solutions to the competition (the Dataquest tutorial by Vik Paruchuri and the leakage solution by ZFTurbo) use the most common clusters within the following groups:

srch_destination_id
user_location_city, orig_destination_distance (data leak)
srch_destination_id, hotel_country, hotel_market (for year 2014)
srch_destination_id
hotel_country

Finding the most common overall clusters



In [10]:

    
most_common_clusters = list(cv_train.hotel_cluster.value_counts().head().index)

Predicting the most common clusters in groups of srch_destination_id, hotel_country, hotel_market.



In [11]:

    
# Alternative grouping: match_cols = ['srch_destination_id']
match_cols = ['srch_destination_id', 'hotel_country', 'hotel_market']
groups = cv_train.groupby(match_cols + ['hotel_cluster'])



In [12]:

    
top_clusters = {}
for name, group in groups:
    bookings = group['is_booking'].sum()
    clicks = len(group) - bookings
    
    score = bookings + .15*clicks
    
    clus_name = name[:len(match_cols)]
    if clus_name not in top_clusters:
        top_clusters[clus_name] = {}
    top_clusters[clus_name][name[-1]] = score

This dictionary is keyed by (srch_destination_id, hotel_country, hotel_market) tuples; each value is another dictionary that maps hotel clusters to their scores.
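
To make the nested structure concrete, here is a minimal sketch that builds the same kind of score table from a handful of made-up events, keyed by destination only for brevity. The event tuples are hypothetical; the click weight of 0.15 is the one used in the loop above.

```python
from collections import defaultdict

CLICK_WEIGHT = 0.15  # a booking counts 1.0, a click counts 0.15

# Hypothetical events: (srch_destination_id, hotel_cluster, is_booking)
events = [
    (8250, 1, True),
    (8250, 1, False),
    (8250, 45, False),
    (8250, 45, False),
    (8267, 21, True),
]

# scores[destination][cluster] -> weighted count of events
scores = defaultdict(lambda: defaultdict(float))
for dest, cluster, is_booking in events:
    scores[dest][cluster] += 1.0 if is_booking else CLICK_WEIGHT

for dest in scores:
    print(dest, dict(scores[dest]))
```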

Finding the top 5 for each destination.



In [13]:

    
cluster_dict = {}
for n in top_clusters:
    tc = top_clusters[n]
    top = [l[0] for l in sorted(tc.items(), key=operator.itemgetter(1), reverse=True)[:5]]
    cluster_dict[n] = top

Making predictions based on destination



In [14]:

    
preds = []
for index, row in cv_test.iterrows():
    key = tuple([row[m] for m in match_cols])
    pred = cluster_dict.get(key, most_common_clusters)
    preds.append(pred)
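
The loop above falls back to the overall most common clusters only when a group was never seen in training, so a group with fewer than five known clusters yields a shorter prediction list. One possible variation, which is not what the cell above does, is to top up short lists from the fallback. The function name `predict_top5` and the toy lookup tables are assumptions for illustration.

```python
def predict_top5(key, cluster_lookup, fallback, k=5):
    """Return up to k clusters for key, topping up from fallback without duplicates."""
    preds = list(cluster_lookup.get(key, []))
    for cluster in fallback:
        if len(preds) >= k:
            break
        if cluster not in preds:
            preds.append(cluster)
    return preds[:k]

# Hypothetical lookup tables
toy_cluster_dict = {(8250, 50, 628): [1, 45, 79]}
toy_most_common = [91, 41, 48, 64, 65]

print(predict_top5((8250, 50, 628), toy_cluster_dict, toy_most_common))  # [1, 45, 79, 91, 41]
print(predict_top5((1, 2, 3), toy_cluster_dict, toy_most_common))        # [91, 41, 48, 64, 65]
```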



In [15]:

    
cv_target = [[l] for l in cv_test['hotel_cluster']]
metrics.mapk(cv_target, preds, k=5)









    Out[15]:





0.22201242636972485

srch_destination_id, is_booking: 0.212
srch_destination_id, hotel_country, hotel_market: 0.214
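
For reading these numbers: each test row has exactly one correct cluster, so its average precision at 5 is simply 1/rank of the correct cluster among the five predictions, or 0 if it is missing. The sketch below is a simplified re-implementation for this single-label case, written for illustration; it is not the code of `ml_metrics.mapk`.

```python
def average_precision_single(actual, predicted, k=5):
    """AP@k when there is exactly one correct label: 1/rank of the hit, else 0."""
    for rank, p in enumerate(predicted[:k], start=1):
        if p == actual:
            return 1.0 / rank
    return 0.0

def map_at_k(actuals, predictions, k=5):
    """Mean of the per-row average precision."""
    row_scores = [average_precision_single(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(row_scores) / len(row_scores)

toy_actual = [10, 20, 30]
toy_preds = [[10, 1, 2, 3, 4],   # hit at rank 1 -> 1.0
             [1, 20, 2, 3, 4],   # hit at rank 2 -> 0.5
             [1, 2, 3, 4, 5]]    # no hit        -> 0.0
print(map_at_k(toy_actual, toy_preds))  # 0.5
```

A score of about 0.22 therefore means that, averaged over the rows, the correct cluster sits between the fourth and fifth position, although in practice it is a mix of good hits and complete misses.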

Model 1: Logistic regression

One-vs-all classification using stochastic gradient descent and forward stepwise feature selection.



In [ ]:

    
# loss='log' selects logistic regression; scikit-learn 1.1 and later name it 'log_loss'
clf = sklearn.linear_model.SGDClassifier(loss='log', n_jobs=4)

Make dummy variables from the categorical features. Pandas has get_dummies(), but it currently returns only float64 columns, which tends to be rather memory hungry and slow. See #8725.



In [ ]:

    
cv_train_data = pd.DataFrame()
for elem in cv_train['srch_destination_id'].unique():
    cv_train_data[str(elem)] = cv_train['srch_destination_id'] == elem



In [ ]:

    
cv_test_data = pd.DataFrame()
for elem in cv_train_data.columns:
    cv_test_data[elem] = cv_test['srch_destination_id'] == int(elem)
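
The two cells above build the test columns from the training columns, so both matrices share the same layout. A minimal pure-Python sketch of that idea follows; the helper name `one_hot` and the toy ids are made up for illustration. A destination that never occurs in training ends up as a row of all False values.

```python
def one_hot(values, vocabulary):
    """Return {category: [bool, ...]} with one boolean column per category in vocabulary."""
    return {cat: [v == cat for v in values] for cat in vocabulary}

toy_train = [8250, 8267, 8250, 12206]
toy_test = [8267, 99999]            # 99999 never occurs in training

vocab = sorted(set(toy_train))      # columns are fixed by the training data
train_cols = one_hot(toy_train, vocab)
test_cols = one_hot(toy_test, vocab)

print(test_cols)
# {8250: [False, False], 8267: [True, False], 12206: [False, False]}
```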



In [ ]:

    
# cv_train_data['is_booking'] = cv_train['is_booking']
# cv_test_data['is_booking'] = np.ones(len(cv_test_data), dtype=bool)



In [ ]:

    
clf.fit(cv_train_data, cv_train['hotel_cluster'])



In [ ]:

    
result = clf.predict_proba(cv_test_data)



In [ ]:

    
preds = [heapq.nlargest(5, clf.classes_, row.take) for row in result]
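
A caveat about the line above: `row.take` uses each class label as a position in the probability row, which gives the right answer only when the labels in `clf.classes_` are exactly 0, 1, 2, ... in order, that is, when every cluster id occurs in the training sample. A sketch that pairs each probability with its label explicitly, and so does not rely on that, could look like this; `top_k_labels` and the toy data are assumptions for illustration.

```python
import heapq

def top_k_labels(probabilities, classes, k=5):
    """Return the k labels with the highest probability, best first."""
    ranked = heapq.nlargest(k, zip(probabilities, classes))
    return [label for _, label in ranked]

# Hypothetical case: some cluster ids never appeared in training, so labels have gaps
toy_classes = [0, 2, 3, 5, 8, 9]
toy_row = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]

print(top_k_labels(toy_row, toy_classes, k=3))  # [2, 5, 8]
```

With the classifier above this would be called as `top_k_labels(row, clf.classes_)` for each row of `result`.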



In [ ]:

    
metrics.mapk(cv_target, preds, k=5)

I would say that is not bad at all (compared with the random forest classifier in the Dataquest tutorial).

Using destination latent features from the destination description data file.



In [ ]:

    
dest = pd.read_csv(
    'data/destinations.csv.gz',
    index_col = 'srch_destination_id',
)



In [ ]:

    
pca = sklearn.decomposition.PCA(n_components=10)
dest_small = pca.fit_transform(dest[['d{}'.format(i) for i in range(1,150)]])
dest_small = pd.DataFrame(dest_small, index=dest.index)



In [ ]:

    
cv_train_data = pd.DataFrame({key: cv_train[key] for key in ['srch_destination_id']})
cv_train_data = cv_train_data.join(dest_small, on=['srch_destination_id'], how='left')
cv_train_data = cv_train_data.fillna(dest_small.mean())



In [ ]:

    
cv_test_data = pd.DataFrame({key: cv_test[key] for key in ['srch_destination_id']})
cv_test_data = cv_test_data.join(dest_small, on='srch_destination_id', how='left', rsuffix='dest')
cv_test_data = cv_test_data.fillna(dest_small.mean())
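
The join-then-fillna pattern above attaches the reduced latent features to each event and substitutes the column means for destinations that are missing from the destinations file. The same logic in plain Python, as a minimal sketch with made-up two-component features; `attach_features` is a hypothetical helper.

```python
def attach_features(dest_ids, feature_table):
    """Look up a feature vector per destination id, using the column means for unknown ids."""
    rows = list(feature_table.values())
    n = len(rows[0])
    means = [sum(r[i] for r in rows) / len(rows) for i in range(n)]
    return [feature_table.get(d, means) for d in dest_ids]

# Hypothetical 2-component latent features for three destinations
toy_features = {
    8250: [0.5, -1.0],
    8267: [1.5, 0.0],
    12206: [-2.0, 4.0],
}

print(attach_features([8267, 4242], toy_features))  # [[1.5, 0.0], [0.0, 1.0]]
```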



In [ ]:

    
clf = sklearn.linear_model.SGDClassifier(loss='log', n_jobs=4)
clf.fit(cv_train_data, cv_train['hotel_cluster'])



In [ ]:

    
result = clf.predict_proba(cv_test_data)



In [ ]:

    
preds = [heapq.nlargest(5, clf.classes_, row.take) for row in result]



In [ ]:

    
metrics.mapk(cv_target, preds, k=5)

=> The destination latent features do not seem to be of much use?!



In [16]:

    
features = [
    'site_name', 'posa_continent', 'user_location_country',
    'user_location_region', 'user_location_city',
    'is_mobile', 'is_package',
    'channel', 'srch_adults_cnt', 'srch_destination_id',
    'srch_destination_type_id', 'is_booking', 'cnt',
    'hotel_continent', 'hotel_country', 'hotel_market',
    'month', 'year', 'is_family',
]



In [17]:

    
def fit_features(features, train, test):
    # Data manipulation - split categorical features
    train_data = pd.DataFrame()
    test_data = pd.DataFrame()
    for feature in features:
        if train[feature].dtype == np.dtype('bool'):
            train_data[feature] = train[feature]
            test_data[feature] = test[feature]
        else:
            for elem in train[feature].unique():
                train_data['{}_{}'.format(feature, elem)] = train[feature] == elem
                test_data['{}_{}'.format(feature, elem)] = test[feature] == elem
    
    # Fitting
    clf = sklearn.linear_model.SGDClassifier(loss='log', n_jobs=4)
    clf.fit(train_data, train['hotel_cluster'])
    
    # Cross-validate the fit
    result = clf.predict_proba(test_data)
    preds = [heapq.nlargest(5, clf.classes_, row.take) for row in result]
    target = [[l] for l in test['hotel_cluster']]
    
    return metrics.mapk(target, preds, k=5)



In [20]:

    
cv_results = {}
for feature in features:
    cv_results[feature] = fit_features([feature], cv_train, cv_test)
    print('{}: {}'.format(feature, cv_results[feature]))









    



site_name: 0.07409979582998205
posa_continent: 0.075557342902514
user_location_country: 0.07867145125698612
user_location_region: 0.07620594361607788
user_location_city: 0.06994576089422343
is_mobile: 0.06114428014601249
is_package: 0.0689037719895234
channel: 0.06076687496133144
srch_adults_cnt: 0.06571387324960301
srch_destination_id: 0.17892150797088
srch_destination_type_id: 0.05779145785642104
is_booking: 0.0717564808513271
cnt: 0.06537152756295242
hotel_continent: 0.09783662273917795
hotel_country: 0.13986419600321723
hotel_market: 0.1959428942646786
month: 0.05816680071768855
year: 0.06597011693373753
is_family: 0.05973622883540597



In [23]:

    
sorted(cv_results.items(), key=operator.itemgetter(1), reverse=True)









    Out[23]:





[('hotel_market', 0.19594289426467859),
 ('srch_destination_id', 0.17892150797087999),
 ('hotel_country', 0.13986419600321723),
 ('hotel_continent', 0.097836622739177953),
 ('user_location_country', 0.078671451256986116),
 ('user_location_region', 0.076205943616077881),
 ('posa_continent', 0.075557342902513994),
 ('site_name', 0.074099795829982051),
 ('is_booking', 0.071756480851327104),
 ('user_location_city', 0.069945760894223427),
 ('is_package', 0.068903771989523396),
 ('year', 0.065970116933737527),
 ('srch_adults_cnt', 0.065713873249603011),
 ('cnt', 0.06537152756295242),
 ('is_mobile', 0.061144280146012489),
 ('channel', 0.060766874961331437),
 ('is_family', 0.059736228835405969),
 ('month', 0.058166800717688552),
 ('srch_destination_type_id', 0.057791457856421043)]

The best single predictor of a hotel cluster seems to be hotel_market.



In [25]:

    
features2 = [['hotel_market'] + [f] for f in features if f not in ['hotel_market']]



In [34]:

    
cv_results2 = {}
for feature in features2:
    cv_results2[tuple(feature)] = fit_features(feature, cv_train, cv_test)
    print('{}: {}'.format(feature, cv_results2[tuple(feature)]))









    



['hotel_market', 'site_name']: 0.1969524015756151
['hotel_market', 'posa_continent']: 0.19707665656128195
['hotel_market', 'user_location_country']: 0.1990002887252779
['hotel_market', 'user_location_region']: 0.19808255480624473
['hotel_market', 'user_location_city']: 0.19851512714223843
['hotel_market', 'is_mobile']: 0.19776959722823734
['hotel_market', 'is_package']: 0.19729577842397247
['hotel_market', 'channel']: 0.19772938192167294
['hotel_market', 'srch_adults_cnt']: 0.19720606735548268
['hotel_market', 'srch_destination_id']: 0.21783445729959375
['hotel_market', 'srch_destination_type_id']: 0.19991389799748396
['hotel_market', 'is_booking']: 0.19450854833054923
['hotel_market', 'cnt']: 0.19629916063437067
['hotel_market', 'hotel_continent']: 0.20295118480480107
['hotel_market', 'hotel_country']: 0.2116753284250036
['hotel_market', 'month']: 0.19871053228567304
['hotel_market', 'year']: 0.19773453773020686
['hotel_market', 'is_family']: 0.19823722906226157



In [42]:

    
sorted(cv_results2.items(), key=operator.itemgetter(1), reverse=True)[:3]









    



This cell was last run while cv_results2 was undefined, so it raised a NameError instead of showing the ranking. Going by the scores printed above, the best features to pair with hotel_market are srch_destination_id (0.218), hotel_country (0.212) and hotel_continent (0.203).



In [18]:

    
features3 = [['hotel_market', 'srch_destination_id'] + [f] for f in features if f not in ['hotel_market', 'srch_destination_id']]



In [19]:

    
cv_results3 = {}
for feature in features3:
    cv_results3[tuple(feature)] = fit_features(feature, cv_train, cv_test)
    print('{}: {}'.format(feature, cv_results3[tuple(feature)]))









    



['hotel_market', 'srch_destination_id', 'site_name']: 0.2094012749132575
['hotel_market', 'srch_destination_id', 'posa_continent']: 0.21044944726861936
['hotel_market', 'srch_destination_id', 'user_location_country']: 0.21253530218671832
['hotel_market', 'srch_destination_id', 'user_location_region']: 0.21202775760509965
['hotel_market', 'srch_destination_id', 'user_location_city']: 0.21004115226337447
['hotel_market', 'srch_destination_id', 'is_mobile']: 0.20901234567901233
['hotel_market', 'srch_destination_id', 'is_package']: 0.20973291374162836
['hotel_market', 'srch_destination_id', 'channel']: 0.20971193415637862
['hotel_market', 'srch_destination_id', 'srch_adults_cnt']: 0.20703058178003714
['hotel_market', 'srch_destination_id', 'srch_destination_type_id']: 0.2112700718147341
['hotel_market', 'srch_destination_id', 'is_booking']: 0.2091680787541354
['hotel_market', 'srch_destination_id', 'cnt']: 0.21018074719599772
['hotel_market', 'srch_destination_id', 'hotel_continent']: 0.21682885499878965
['hotel_market', 'srch_destination_id', 'hotel_country']: 0.22068506414911643
['hotel_market', 'srch_destination_id', 'month']: 0.20766158315177924
['hotel_market', 'srch_destination_id', 'year']: 0.2134801904300815
['hotel_market', 'srch_destination_id', 'is_family']: 0.2122553054143468



In [41]:

    
sorted(cv_results3.items(), key=operator.itemgetter(1), reverse=True)[:3]









    Out[41]:





[(('hotel_market', 'srch_destination_id', 'hotel_country'),
  0.22068506414911643),
 (('hotel_market', 'srch_destination_id', 'hotel_continent'),
  0.21682885499878965),
 (('hotel_market', 'srch_destination_id', 'year'), 0.21348019043008151)]



In [21]:

    
features4 = [['hotel_market', 'srch_destination_id', 'hotel_country'] + [f] for f in features if f not in ['hotel_market', 'srch_destination_id', 'hotel_country']]



In [24]:

    
cv_results4 = {}
for feature in features4:
    cv_results4[tuple(feature)] = fit_features(feature, cv_train, cv_test)
    print('{}: {}'.format(feature, cv_results4[tuple(feature)]))









    



['hotel_market', 'srch_destination_id', 'hotel_country', 'site_name']: 0.2215855725006052
['hotel_market', 'srch_destination_id', 'hotel_country', 'posa_continent']: 0.21947147583313162
['hotel_market', 'srch_destination_id', 'hotel_country', 'user_location_country']: 0.22125958202210927
['hotel_market', 'srch_destination_id', 'hotel_country', 'user_location_region']: 0.22095779875736302
['hotel_market', 'srch_destination_id', 'hotel_country', 'user_location_city']: 0.22198015008472524
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_mobile']: 0.22127733397885904
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package']: 0.22352295650770596
['hotel_market', 'srch_destination_id', 'hotel_country', 'channel']: 0.21973129992737836
['hotel_market', 'srch_destination_id', 'hotel_country', 'srch_adults_cnt']: 0.2192939562656338
['hotel_market', 'srch_destination_id', 'hotel_country', 'srch_destination_type_id']: 0.22126038892923425
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_booking']: 0.22305252965383682
['hotel_market', 'srch_destination_id', 'hotel_country', 'cnt']: 0.221037682562737
['hotel_market', 'srch_destination_id', 'hotel_country', 'hotel_continent']: 0.22103284111998708
['hotel_market', 'srch_destination_id', 'hotel_country', 'month']: 0.21971435487775356
['hotel_market', 'srch_destination_id', 'hotel_country', 'year']: 0.2227580085532155
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_family']: 0.22101589607036234



In [40]:

    
sorted(cv_results4.items(), key=operator.itemgetter(1), reverse=True)[:3]









    Out[40]:





[(('hotel_market', 'srch_destination_id', 'hotel_country', 'is_package'),
  0.22352295650770596),
 (('hotel_market', 'srch_destination_id', 'hotel_country', 'is_booking'),
  0.22305252965383682),
 (('hotel_market', 'srch_destination_id', 'hotel_country', 'year'),
  0.22275800855321551)]



In [27]:

    
sel_features = ['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package']
features5 = [sel_features + [f] for f in features if f not in sel_features]



In [29]:

    
cv_results5 = {}
for feature in features5:
    cv_results5[tuple(feature)] = fit_features(feature, cv_train, cv_test)
    print('{}: {}'.format(feature, cv_results5[tuple(feature)]))









    



['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'site_name']: 0.2237359799887033
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'posa_continent']: 0.22256515775034297
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'user_location_country']: 0.22157588961510527
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'user_location_region']: 0.22076494795449042
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'user_location_city']: 0.22298959089808765
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_mobile']: 0.22112402162511094
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'channel']: 0.22243928023884452
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'srch_adults_cnt']: 0.22246106673121924
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'srch_destination_type_id']: 0.22149761962398126
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking']: 0.22440006455257
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'cnt']: 0.2215492616799806
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'hotel_continent']: 0.21973614137012829
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'month']: 0.2188404744613895
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'year']: 0.22280965060921487
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_family']: 0.22214072460259823



In [39]:

    
sorted(cv_results5.items(), key=operator.itemgetter(1), reverse=True)[:3]









    Out[39]:





[(('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'is_booking'),
  0.22440006455257),
 (('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'site_name'),
  0.22373597998870329),
 (('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'user_location_city'),
  0.22298959089808765)]



In [31]:

    
sel_features = ['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking']
features6 = [sel_features + [f] for f in features if f not in sel_features]



In [32]:

    
cv_results6 = {}
for feature in features6:
    cv_results6[tuple(feature)] = fit_features(feature, cv_train, cv_test)
    print('{}: {}'.format(feature, cv_results6[tuple(feature)]))









    



['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'site_name']: 0.2228467683369644
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent']: 0.22623577826192204
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'user_location_country']: 0.22255708867909302
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'user_location_region']: 0.22595336076817557
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'user_location_city']: 0.22120229161623498
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'is_mobile']: 0.22327120148470908
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'channel']: 0.22534818042443314
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'srch_adults_cnt']: 0.22382231905107722
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'srch_destination_type_id']: 0.22400710078269992
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'cnt']: 0.222798353909465
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'hotel_continent']: 0.22509803921568625
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'month']: 0.22231420963447104
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'year']: 0.22387396110707658
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'is_family']: 0.2245017348503187



In [38]:

    
sorted(cv_results6.items(), key=operator.itemgetter(1), reverse=True)[:3]









    Out[38]:





[(('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'is_booking',
   'posa_continent'),
  0.22623577826192204),
 (('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'is_booking',
   'user_location_region'),
  0.22595336076817557),
 (('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'is_booking',
   'channel'),
  0.22534818042443314)]



In [34]:

    
sel_features = ['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent']
features7 = [sel_features + [f] for f in features if f not in sel_features]



In [35]:

    
cv_results7 = {}
for feature in features7:
    cv_results7[tuple(feature)] = fit_features(feature, cv_train, cv_test)
    print('{}: {}'.format(feature, cv_results7[tuple(feature)]))









    



['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'site_name']: 0.21952876623900588
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'user_location_country']: 0.21938271604938273
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'user_location_region']: 0.22359638505608004
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'user_location_city']: 0.22424271766319692
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'is_mobile']: 0.22468570967481644
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'channel']: 0.22337367868958286
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'srch_adults_cnt']: 0.22353021867183087
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'srch_destination_type_id']: 0.226082465908174
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'cnt']: 0.2223045267489712
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'hotel_continent']: 0.22457193577019288
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'month']: 0.2214137012829823
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'year']: 0.2246816751391915
['hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent', 'is_family']: 0.2227854433954652



In [37]:

    
sorted(cv_results7.items(), key=operator.itemgetter(1), reverse=True)[:3]









    Out[37]:





[(('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'is_booking',
   'posa_continent',
   'srch_destination_type_id'),
  0.22608246590817399),
 (('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'is_booking',
   'posa_continent',
   'is_mobile'),
  0.22468570967481644),
 (('hotel_market',
   'srch_destination_id',
   'hotel_country',
   'is_package',
   'is_booking',
   'posa_continent',
   'year'),
  0.22468167513919149)]

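No improvement over the previous round (0.2261 here versus 0.2262 there), so the forward selection stops. The best logistic regression result, a MAP@5 of about 0.226, uses the features 'hotel_market', 'srch_destination_id', 'hotel_country', 'is_package', 'is_booking', 'posa_continent'.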
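
The rounds above were run by hand, one cell per round. The same greedy procedure can be written as a loop. This is a minimal sketch with a made-up scoring function; `forward_select`, `toy_gain` and `toy_score` are names invented for illustration.

```python
def forward_select(candidates, score_fn):
    """Greedily add the feature that improves score_fn most; stop when nothing improves."""
    selected, best_score = [], float('-inf')
    remaining = list(candidates)
    while remaining:
        # Score every remaining feature when added to the current selection
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        round_score, round_feature = max(scored)
        if round_score <= best_score:
            break  # no candidate improves on the previous round
        selected.append(round_feature)
        remaining.remove(round_feature)
        best_score = round_score
    return selected, best_score

# Hypothetical scorer: a fixed gain per feature, minus a small penalty per feature used
toy_gain = {'hotel_market': 0.20, 'srch_destination_id': 0.02, 'month': 0.0}

def toy_score(feature_list):
    return sum(toy_gain[f] for f in feature_list) - 0.005 * len(feature_list)

print(forward_select(list(toy_gain), toy_score))
```

In this notebook the scorer would be something like `lambda fs: fit_features(fs, cv_train, cv_test)`. Since each SGD fit is somewhat noisy, differences in the third decimal between rounds should be read with caution.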




-- Peeter Piksarv