Malo Grisard, Guillaume Jaume, Cyril Pecoraro - EPFL - January 15, 2017
Pipeline:
1. Data exploration and cleaning
2. Machine learning preprocessing
3. Machine learning optimization
4. Results
The purpose of this project is to predict the destination country of a new user's first booking. We are given a list of users along with their demographics, web session records, and some summary statistics. All the users in this dataset are from the USA.
There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL','DE', 'AU', 'NDF' (no destination found), and 'other'.
In this notebook, we explore and clean the given data and highlight the most relevant extracted features.
The cleaned data are saved into separate files and analysed in the Machine Learning notebook.
Important Remark: Due to size constraints, we could not include the file sessions.csv in the Git repository; you can download it directly from the Kaggle Competition here.
In [1]:
import pandas as pd
import os
import preprocessing_helper
import matplotlib.pyplot as plt
%matplotlib inline
The dataset is composed of several files. First, we will explore each of them and clean some variables. For a complete explanation of each file, please see the file DATA.md.
This file is the most important one in our dataset, as it contains the users, information about them, and the country of destination.
When a user has booked a trip through Airbnb, the destination country is specified. Otherwise, 'NDF' is indicated.
In [2]:
filename = "train_users_2.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
df.head()
Out[2]:
There are missing values in the following columns:
We will go through each of these variables and decide how to handle the missing values.
In [3]:
df.isnull().any()
Out[3]:
There are two problems regarding ages in the dataset.
First, many users did not specify an age. Second, some users entered their year of birth instead of their age.
To keep the data relevant, we will only trust ages between 15 and 100 years old. All other values, including missing ones, will naively be set to -1.
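A minimal sketch of this cleaning rule (hypothetical; the actual logic lives in `preprocessing_helper.cleanAge`, and the reference year 2015 is an assumption), assuming a pandas DataFrame with an `age` column:

```python
import pandas as pd

def clean_age_sketch(df, min_age=15, max_age=100):
    """Convert apparent birth years to ages, keep [min_age, max_age], else -1."""
    age = df['age'].copy()
    # Users who entered a birth year (e.g. 1985) instead of an age;
    # 2015 is an assumed reference year for the dataset
    birth_year_mask = age > 1900
    age[birth_year_mask] = 2015 - age[birth_year_mask]
    # Anything still outside the plausible range becomes -1
    age[(age < min_age) | (age > max_age)] = -1
    # Missing ages also become -1
    age = age.fillna(-1)
    df['age'] = age.astype(int)
    return df

demo = pd.DataFrame({'age': [25, 1985, 5, None, 130]})
print(clean_age_sketch(demo)['age'].tolist())  # [25, 30, -1, -1, -1]
```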
In [4]:
df = preprocessing_helper.cleanAge(df,'k')
The following graph shows the distribution of ages in the dataset. The irrelevant ages are also represented here, with their value of -1.
In [5]:
preprocessing_helper.plotAge(df)
In [6]:
df = preprocessing_helper.cleanGender(df)
preprocessing_helper.plotGender(df)
In [7]:
df = preprocessing_helper.cleanFirst_affiliate_tracked(df)
In [8]:
df = preprocessing_helper.cleanDate_First_booking(df)
preprocessing_helper.plotDate_First_booking_years(df)
This histogram shows that the bookings are fairly well spread over the year. Far fewer bookings are made in November and December, while May and June are the months in which users book the most. Together, these two months account for more than 20,000 bookings, almost a quarter of the bookings in our dataset.
In [9]:
preprocessing_helper.plotDate_First_booking_months(df)
As for the day on which most bookings are made, Tuesdays and Wednesdays seem to be the days when people book the most apartments on Airbnb.
In [10]:
preprocessing_helper.plotDate_First_booking_weekdays(df)
In [11]:
filename = "cleaned_train_user.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(df, fileAddress)
In [12]:
# extract file
filename = "test_users.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
# process file
df = preprocessing_helper.cleanAge(df,'k')
df = preprocessing_helper.cleanGender(df)
df = preprocessing_helper.cleanFirst_affiliate_tracked(df)
# save file
filename = "cleaned_test_user.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(df, fileAddress)
This file presents a summary of the destination countries in the dataset. Its contents are as follows:
All the variables are computed with respect to the US and English. The Levenshtein distance indicates how far the language spoken in the destination country is from English. All the other variables are general geographic attributes. This file will not be used in our model, as it gives no direct information about the users.
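For intuition, the Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one into the other; applied to languages, a larger distance suggests a language further from English. A small self-contained sketch:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```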
In [13]:
filename = "countries.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
df
Out[13]:
In [14]:
df.describe()
Out[14]:
In [15]:
filename = "age_gender_bkts.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
df.head()
Out[15]:
In [16]:
df_country = df.groupby(['country_destination'],as_index=False).sum()
df_country
Out[16]:
In [20]:
filename = "sessions.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
df.head()
Out[20]:
In [21]:
df.isnull().any()
Out[21]:
In [22]:
df = preprocessing_helper.cleanSubset(df, 'user_id')
In [23]:
df['secs_elapsed'].fillna(-1, inplace = True)
In [24]:
df = preprocessing_helper.cleanAction(df)
As shown in the following, there are no more NaN values.
In [25]:
df.isnull().any()
Out[25]:
From the sessions file, we can compute the total number of actions per user. Intuitively, a user with only a few actions might be a user who does not book in the end. This value will be used as a new feature for the machine learning model.
Note: The total number of actions is shown on a logarithmic scale.
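A minimal sketch of such a count feature (hypothetical; the actual logic lives in `preprocessing_helper.createActionFeature`), assuming the sessions DataFrame has `user_id` and `action` columns:

```python
import pandas as pd

def total_actions_per_user(df):
    """Count the number of session rows (actions) recorded for each user."""
    counts = df.groupby('user_id', as_index=False)['action'].count()
    return counts.rename(columns={'action': 'total_actions'})

demo_sessions = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'a', 'b'],
    'action':  ['search', 'view', 'search', 'book', 'view'],
})
print(total_actions_per_user(demo_sessions).values.tolist())  # [['a', 3], ['b', 2]]
```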
In [26]:
# Get total number of action per user_id
data_session_number_action = preprocessing_helper.createActionFeature(df)
# Save to .csv file
filename = "total_action_user_id.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_session_number_action, fileAddress)
# Plot distribution total number of action per user_id
preprocessing_helper.plotActionFeature(data_session_number_action)
In [27]:
preprocessing_helper.plotHist(df['device_type'])
In [28]:
# Get Time spent on average per user_id
data_time_mean = preprocessing_helper.createAverageTimeFeature(df)
# Save to .csv file
data_time_mean = data_time_mean.rename(columns={'user_id': 'id'})
filename = "time_mean_user_id.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_time_mean, fileAddress)
# Plot distribution average time of session per user_id
preprocessing_helper.plotTimeFeature(data_time_mean['secs_elapsed'],'mean')
In [29]:
# Get Time spent in total per user_id
data_time_total = preprocessing_helper.createTotalTimeFeature(df)
# Save to .csv file
data_time_total = data_time_total.rename(columns={'user_id': 'id'})
filename = "time_total_user_id.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_time_total, fileAddress)
# Plot distribution total time of session per user_id
preprocessing_helper.plotTimeFeature(data_time_total['secs_elapsed'],'total')
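The two time features above can be sketched with plain pandas aggregations (hypothetical stand-ins for `createAverageTimeFeature` and `createTotalTimeFeature`), assuming `user_id` and `secs_elapsed` columns:

```python
import pandas as pd

def time_features_per_user(df):
    """Mean and total secs_elapsed per user, as two DataFrames."""
    grouped = df.groupby('user_id', as_index=False)['secs_elapsed']
    return grouped.mean(), grouped.sum()

demo_sessions = pd.DataFrame({
    'user_id':      ['a', 'a', 'b'],
    'secs_elapsed': [10.0, 30.0, 5.0],
})
mean_df, total_df = time_features_per_user(demo_sessions)
print(mean_df['secs_elapsed'].tolist())   # [20.0, 5.0]
print(total_df['secs_elapsed'].tolist())  # [40.0, 5.0]
```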
In [30]:
preprocessing_helper.plotTimeFeature(df['secs_elapsed'],'dist')
Throughout this notebook, we explored all the files in the dataset and displayed the most relevant statistics. From the sessions file, we constructed features to enrich the train_users_2 file.
Starting from the generated cleaned data, we are now able to design a machine learning model. This problem is addressed in the second notebook, Machine Learning.