In this notebook we try to engineer some new numerical features from the raw data.
For example, I expect cheaper apartments to attract a higher interest level. We already saw this tendency in the exploration notebook for numerical features. Now we want to refine it into a new feature that captures the price advantage of an apartment compared to other apartments of the same size (number of rooms).
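One possible way to express such a price advantage is the relative difference between an apartment's price and the median price of all apartments with the same number of bedrooms. The following is only a minimal sketch under that assumption: the helper `add_price_advantage`, the column name `price_advantage`, and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_price_advantage(frame):
    """Return a copy of frame with a hypothetical 'price_advantage' column."""
    out = frame.copy()
    # Median price of all apartments with the same number of bedrooms.
    median_price = out.groupby('bedrooms')['price'].transform('median')
    # Positive values mean the apartment is cheaper than the median of its size.
    out['price_advantage'] = (median_price - out['price']) / median_price
    return out

# Toy data for illustration only; not taken from train.json.
toy = pd.DataFrame({
    'bedrooms': [1, 1, 1, 2, 2],
    'price': [2000, 2500, 3000, 3000, 5000],
})
toy = add_price_advantage(toy)
print(toy)
```

The median is used here instead of the mean because it is less sensitive to a few very expensive listings; whether that is the better reference is a design choice that would need checking against the real data.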
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
df = pd.read_json('../data/raw/train.json')
df['created'] = pd.to_datetime(df['created'])
In [2]:
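# Work on a copy and drop extreme prices so the violins stay readable.
df_bedrooms = df.copy(deep=True)
df_bedrooms = df_bedrooms[df_bedrooms['price'] < 10000]
# Bucket apartments with five or more bedrooms into a single '5+' category.
df_bedrooms['bedrooms'] = df_bedrooms['bedrooms'].apply(lambda b: str(b) if b <= 4 else '5+')
# sns.factorplot was renamed to sns.catplot in seaborn 0.9 and later removed.
sns.catplot(x='price', y='interest_level', col='bedrooms', data=df_bedrooms, kind='violin',
            col_order=['0', '1', '2', '3', '4', '5+'], order=['low', 'medium', 'high']);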
Across all apartment sizes, cheaper apartments tend to have a higher interest level. The effect is not as clear as we might have expected or hoped for, but it is there, and it is more pronounced for 0-2 bedrooms than for larger apartments.
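To put a number on what the violin plots suggest, one could compare the median price per bedroom bucket and interest level. This is only a sketch on made-up toy data; in the notebook the same `groupby` would be applied to `df_bedrooms`, and the values below say nothing about the real dataset.

```python
import pandas as pd

# Toy data for illustration only; the notebook would use df_bedrooms instead.
toy = pd.DataFrame({
    'bedrooms': ['1', '1', '1', '2', '2', '2'],
    'interest_level': ['low', 'medium', 'high', 'low', 'medium', 'high'],
    'price': [3200, 2800, 2400, 4500, 3900, 3500],
})

# One row per bedroom bucket, one column per interest level.
median_price = (
    toy.groupby(['bedrooms', 'interest_level'])['price']
       .median()
       .unstack('interest_level')[['low', 'medium', 'high']]
)
print(median_price)
```

If the tendency holds, the medians should decrease from the `low` column to the `high` column within each row.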
Let's check the same for the number of bathrooms.
In [3]:
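import math
# Work on a copy and drop extreme prices, as for the bedrooms plot.
df_bathrooms = df.copy(deep=True)
df_bathrooms = df_bathrooms[df_bathrooms['price'] < 10000]
# Round half bathrooms down (e.g. 1.5 -> 1), then bucket three or more into '3+'.
df_bathrooms['bathrooms'] = df_bathrooms['bathrooms'].apply(lambda x: int(math.floor(x)))
df_bathrooms['bathrooms'] = df_bathrooms['bathrooms'].apply(lambda b: str(b) if b <= 2 else '3+')
# sns.factorplot was renamed to sns.catplot in seaborn 0.9 and later removed.
sns.catplot(x='price', y='interest_level', col='bathrooms', data=df_bathrooms, kind='violin',
            col_order=['0', '1', '2', '3+'], order=['low', 'medium', 'high']);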
For zero bathrooms, there seem to be very few samples with a high interest level, which is why that distribution is concentrated around a price of about 800 dollars. For the number of bathrooms, too, a lower price tends to go with a higher interest level. I assume that the number of samples with three or more bathrooms and a high interest level is again low.
In [4]:
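# Number of high-interest samples with three or more bathrooms.
# Printed explicitly because a notebook cell only displays its last expression.
print(len(df_bathrooms[(df_bathrooms['interest_level'] == 'high') & (df_bathrooms['bathrooms'] == '3+')]))
# Price statistics (including the sample count) per bathroom bucket and interest level.
grouped = df_bathrooms.groupby(['bathrooms', 'interest_level'])['price']
grouped.describe()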
Out[4]:
We can see that there is indeed only one sample with a high interest level and 0 bathrooms, and only 17 samples with 3+ bathrooms and a high interest level.
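If only the group sizes are of interest, a contingency table is a more compact alternative to `describe()`. The sketch below uses made-up toy data, so the counts do not correspond to the numbers quoted above; in the notebook, `pd.crosstab` would be called on the columns of `df_bathrooms`.

```python
import pandas as pd

# Toy data for illustration only; the notebook would use df_bathrooms instead.
toy = pd.DataFrame({
    'bathrooms': ['0', '1', '1', '1', '2', '3+'],
    'interest_level': ['low', 'low', 'medium', 'high', 'low', 'high'],
})

# Rows: bathroom bucket, columns: interest level, cells: number of samples.
counts = pd.crosstab(toy['bathrooms'], toy['interest_level'])
print(counts)
```

Combinations that never occur show up as 0, which makes sparsely populated groups easy to spot.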