Malo Grisard, Guillaume Jaume, Cyril Pecoraro - EPFL - January 15, 2017
Pipeline:
1. Data exploration and cleaning
2. Machine learning preprocessing
3. Machine learning optimization
4. Results
The purpose of this project is to predict the destination country of a new user's first booking. We are given a list of users along with their demographics, web session records, and some summary statistics. All the users in this dataset are from the USA.
There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL','DE', 'AU', 'NDF' (no destination found), and 'other'.
In this notebook, we explore and clean the given data and highlight the most relevant extracted features.
The cleaned data are saved into separate files and analysed in the Machine Learning notebook.
Important Remark: Due to size constraints, we could not include the file sessions.csv in the Git repository; you can download it directly from the Kaggle Competition here.
In [1]:
import pandas as pd
import os
import preprocessing_helper
import matplotlib.pyplot as plt
%matplotlib inline
The dataset is composed of several files. First, we will explore each of them and clean some variables. For a complete explanation of each file, please see the file DATA.md.
This file is the most important one in our dataset, as it contains the users, information about them, and the country of destination.
When a user has booked a trip through Airbnb, the destination country is specified. Otherwise, 'NDF' is indicated.
In [2]:
filename = "train_users_2.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
df.head()
Out[2]:
There are missing values in the following columns:
We will go through each of these variables and decide how to handle the missing values.
In [3]:
df.isnull().any()
Out[3]:
There are two problems regarding ages in the dataset.
First, many users did not specify an age. Second, some users entered their year of birth instead of their age.
To keep the data relevant, we will only trust ages between 15 and 100 years old. All other values, including missing ones, will naively be set to -1.
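A minimal sketch of this cleaning rule (hypothetical; the actual logic lives in `preprocessing_helper.cleanAge`, and the reference year 2015 is an assumption), assuming a pandas DataFrame with an `age` column:

```python
import pandas as pd

def clean_age_sketch(df, min_age=15, max_age=100):
    """Convert apparent birth years to ages, keep [min_age, max_age], else -1."""
    age = df['age'].copy()
    # Users who entered a birth year (e.g. 1985) instead of an age;
    # 2015 is an assumed reference year for the dataset
    birth_year_mask = age > 1900
    age[birth_year_mask] = 2015 - age[birth_year_mask]
    # Anything still outside the plausible range becomes -1
    age[(age < min_age) | (age > max_age)] = -1
    # Missing ages also become -1
    age = age.fillna(-1)
    df['age'] = age.astype(int)
    return df

demo = pd.DataFrame({'age': [25, 1985, 5, None, 130]})
print(clean_age_sketch(demo)['age'].tolist())  # [25, 30, -1, -1, -1]
```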
In [4]:
df = preprocessing_helper.cleanAge(df,'k')
The following graph shows the distribution of ages in the dataset. The irrelevant ages are also represented here, with their value of -1.
In [5]:
preprocessing_helper.plotAge(df)
In [6]:
df = preprocessing_helper.cleanGender(df)
preprocessing_helper.plotGender(df)
In [7]:
df = preprocessing_helper.cleanFirst_affiliate_tracked(df)
In [8]:
df = preprocessing_helper.cleanDate_First_booking(df)
preprocessing_helper.plotDate_First_booking_years(df)
This histogram shows that the bookings are fairly well spread over the year. Far fewer bookings are made in November and December, while May and June are the months in which users book the most. Together, these two months account for more than 20,000 bookings, almost a quarter of the bookings in our dataset.
In [9]:
preprocessing_helper.plotDate_First_booking_months(df)
As for the day on which most bookings are made, Tuesdays and Wednesdays seem to be the days when people book the most apartments on Airbnb.
In [10]:
preprocessing_helper.plotDate_First_booking_weekdays(df)
In [11]:
filename = "cleaned_train_user.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(df, fileAddress)
In [12]:
# extract file
filename = "test_users.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
# process file
df = preprocessing_helper.cleanAge(df,'k')
df = preprocessing_helper.cleanGender(df)
df = preprocessing_helper.cleanFirst_affiliate_tracked(df)
# save file
filename = "cleaned_test_user.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(df, fileAddress)
This file presents a summary of the destination countries in the dataset. Its contents are as follows:
All the variables are computed with respect to the US and English. The Levenshtein distance indicates how far the language spoken in the destination country is from English. All the other variables are general geographic attributes. This file will not be used in our model, as it gives no direct information about the users.
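For intuition, the Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one into the other; applied to languages, a larger distance suggests a language further from English. A small self-contained sketch:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```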
In [13]:
filename = "countries.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
df
Out[13]:
In [14]:
df.describe()
Out[14]:
In [15]:
filename = "age_gender_bkts.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
df.head()
Out[15]:
In [16]:
df_country = df.groupby(['country_destination'],as_index=False).sum()
df_country
Out[16]:
In [20]:
filename = "sessions.csv"
folder = 'data'
fileAddress = os.path.join(folder, filename)
df = pd.read_csv(fileAddress)
df.head()
Out[20]:
In [21]:
df.isnull().any()
Out[21]:
In [22]:
df = preprocessing_helper.cleanSubset(df, 'user_id')
In [23]:
df['secs_elapsed'].fillna(-1, inplace = True)
In [24]:
df = preprocessing_helper.cleanAction(df)
As shown in the following, there are no more NaN values.
In [25]:
df.isnull().any()
Out[25]:
From the sessions file, we can compute the total number of actions per user. Intuitively, a user with only a few actions might be a user who does not book in the end. This value will be used as a new feature for the machine learning model.
Note: The total number of actions is shown on a logarithmic scale.
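A minimal sketch of such a count feature (hypothetical; the actual logic lives in `preprocessing_helper.createActionFeature`), assuming the sessions DataFrame has `user_id` and `action` columns:

```python
import pandas as pd

def total_actions_per_user(df):
    """Count the number of session rows (actions) recorded for each user."""
    counts = df.groupby('user_id', as_index=False)['action'].count()
    return counts.rename(columns={'action': 'total_actions'})

demo_sessions = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'a', 'b'],
    'action':  ['search', 'view', 'search', 'book', 'view'],
})
print(total_actions_per_user(demo_sessions).values.tolist())  # [['a', 3], ['b', 2]]
```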
In [26]:
# Get total number of action per user_id
data_session_number_action = preprocessing_helper.createActionFeature(df)
# Save to .csv file
filename = "total_action_user_id.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_session_number_action, fileAddress)
# Plot distribution total number of action per user_id
preprocessing_helper.plotActionFeature(data_session_number_action)
In [27]:
preprocessing_helper.plotHist(df['device_type'])
In [28]:
# Get Time spent on average per user_id
data_time_mean = preprocessing_helper.createAverageTimeFeature(df)
# Save to .csv file
data_time_mean = data_time_mean.rename(columns={'user_id': 'id'})
filename = "time_mean_user_id.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_time_mean, fileAddress)
# Plot distribution average time of session per user_id
preprocessing_helper.plotTimeFeature(data_time_mean['secs_elapsed'],'mean')
In [29]:
# Get Time spent in total per user_id
data_time_total = preprocessing_helper.createTotalTimeFeature(df)
# Save to .csv file
data_time_total = data_time_total.rename(columns={'user_id': 'id'})
filename = "time_total_user_id.csv"
folder = 'cleaned_data'
fileAddress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_time_total, fileAddress)
# Plot distribution total time of session per user_id
preprocessing_helper.plotTimeFeature(data_time_total['secs_elapsed'],'total')
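The two time features above can be sketched with plain pandas aggregations (hypothetical stand-ins for `createAverageTimeFeature` and `createTotalTimeFeature`), assuming `user_id` and `secs_elapsed` columns:

```python
import pandas as pd

def time_features_per_user(df):
    """Mean and total secs_elapsed per user, as two DataFrames."""
    grouped = df.groupby('user_id', as_index=False)['secs_elapsed']
    return grouped.mean(), grouped.sum()

demo_sessions = pd.DataFrame({
    'user_id':      ['a', 'a', 'b'],
    'secs_elapsed': [10.0, 30.0, 5.0],
})
mean_df, total_df = time_features_per_user(demo_sessions)
print(mean_df['secs_elapsed'].tolist())   # [20.0, 5.0]
print(total_df['secs_elapsed'].tolist())  # [40.0, 5.0]
```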
In [30]:
preprocessing_helper.plotTimeFeature(df['secs_elapsed'],'dist')
Throughout this notebook, we explored all the files in the dataset and displayed the most relevant statistics. From the sessions file, we constructed features to enrich the train_users_2 file.
Starting from the generated cleaned data, we are now able to design a machine learning model. This problem is addressed in the second notebook, Machine Learning.