Numerical Features Engineering

In this notebook we try to engineer new numerical features from the raw data.

For example, I expect cheaper apartments to attract a higher interest level. We already saw this tendency in the exploration notebook for numerical features. Now we want to turn it into a new feature that captures the price advantage of an apartment compared to other apartments of the same size (number of rooms).


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

df = pd.read_json('../data/raw/train.json')

# Parse the whole column at once instead of converting row by row.
df['created'] = pd.to_datetime(df['created'])

Relative price

Let's start by checking whether this tendency shows up in the data. Again, we exclude the expensive apartments (price of 10000 or more), because they distort our visualisations.


In [2]:
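# Keep only listings below 10000 and work on an independent copy.
df_bedrooms = df[df['price'] < 10000].copy()
# Bucket the bedroom counts into the categories '0' to '4' and '5+'.
df_bedrooms['bedrooms'] = df_bedrooms['bedrooms'].apply(lambda b: str(b) if b <= 4 else '5+')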

# factorplot was renamed to catplot in seaborn 0.9.
sns.catplot(x='price', y='interest_level', col='bedrooms', data=df_bedrooms, kind='violin',
            col_order=['0', '1', '2', '3', '4', '5+'], order=['low', 'medium', 'high']);


For all apartment sizes, cheaper apartments tend to have a higher interest level. The effect is not as clear as we might have expected or hoped for, but it is there, and it is more pronounced for 0-2 bedrooms than for larger apartments.

Let's check the same for the number of bathrooms.


In [3]:
import math

# Keep only listings below 10000 and work on an independent copy.
df_bathrooms = df[df['price'] < 10000].copy()
# Round half bathrooms down (e.g. 1.5 -> 1), then bucket into '0', '1', '2' and '3+'.
df_bathrooms['bathrooms'] = df_bathrooms['bathrooms'].apply(lambda x: int(math.floor(x)))
df_bathrooms['bathrooms'] = df_bathrooms['bathrooms'].apply(lambda b: str(b) if b <= 2 else '3+')

# factorplot was renamed to catplot in seaborn 0.9.
sns.catplot(x='price', y='interest_level', col='bathrooms', data=df_bathrooms, kind='violin',
            col_order=['0', '1', '2', '3+'], order=['low', 'medium', 'high']);


For zero bathrooms, there seem to be very few samples with a high interest level: the whole distribution is concentrated around a price of roughly 800 dollars. For the number of bathrooms, too, the interest level tends to be higher when the price is lower. I assume that there is again only a small number of samples for apartments with three or more bathrooms and a high interest level.


In [4]:
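# Number of high-interest listings with 3+ bathrooms. A notebook cell only displays
# its last expression, so this value is not shown; describe() below reports the same count.
len(df_bathrooms[(df_bathrooms['interest_level'] == 'high') & (df_bathrooms['bathrooms'] == '3+')])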

grouped = df_bathrooms.groupby(['bathrooms', 'interest_level'])['price']
grouped.describe()


Out[4]:
bathrooms  interest_level       
0          high            count       1.000000
                           mean      868.000000
                           std              NaN
                           min       868.000000
                           25%       868.000000
                           50%       868.000000
                           75%       868.000000
                           max       868.000000
           low             count     296.000000
                           mean     3134.479730
                           std      1461.772671
                           min       975.000000
                           25%      2398.750000
                           50%      2699.000000
                           75%      3300.000000
                           max      9800.000000
           medium          count       6.000000
                           mean     3067.500000
                           std      1650.214380
                           min      1050.000000
                           25%      2418.750000
                           50%      2737.500000
                           75%      3363.750000
                           max      5995.000000
1          high            count    3412.000000
                           mean     2457.004982
                           std       846.925473
                           min       700.000000
                           25%      1800.000000
                           50%      2300.000000
                                       ...     
2          medium          std      1308.147382
                           min      1100.000000
                           25%      4000.000000
                           50%      4900.000000
                           75%      5785.000000
                           max      9990.000000
3+         high            count      17.000000
                           mean     3998.235294
                           std      2488.816775
                           min      1000.000000
                           25%      1650.000000
                           50%      2895.000000
                           75%      6100.000000
                           max      7458.000000
           low             count     344.000000
                           mean     7235.040698
                           std      1728.866935
                           min      2050.000000
                           25%      6250.000000
                           50%      7500.000000
                           75%      8500.000000
                           max      9999.000000
           medium          count      57.000000
                           mean     6088.228070
                           std      1737.357528
                           min      1650.000000
                           25%      5300.000000
                           50%      6250.000000
                           75%      7250.000000
                           max      9750.000000
Name: price, dtype: float64

We can see that there is indeed only one sample with a high interest level and 0 bathrooms, and only 17 samples with 3+ bathrooms and a high interest level.
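
The group sizes can also be read off more compactly from a count table. The following is a minimal sketch on made-up toy data (the frame `toy` and its values are assumptions for illustration, not taken from the dataset); with the real data, `df_bathrooms` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for df_bathrooms; the values are made up for illustration.
toy = pd.DataFrame({
    'bathrooms': ['0', '0', '1', '1', '1', '3+'],
    'interest_level': ['low', 'high', 'low', 'medium', 'high', 'low'],
})

# Rows: bathroom bucket, columns: interest level, cells: number of listings.
counts = pd.crosstab(toy['bathrooms'], toy['interest_level'])

# Order the columns from low to high; combinations without listings show up as 0.
counts = counts.reindex(columns=['low', 'medium', 'high'], fill_value=0)
print(counts)
```

Such a table makes sparsely populated combinations easy to spot before interpreting a violin plot for them.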

Next Steps

Now we have to create a new feature from this knowledge.
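
One possible way to express the price advantage, sketched here on made-up toy data, is to compare each price with the median price of apartments that have the same number of bedrooms. The column name `price_advantage`, the use of the median and the normalisation by the median are all assumptions for illustration; the feature that is actually built may look different.

```python
import pandas as pd

# Toy listings; the values are made up for illustration.
toy = pd.DataFrame({
    'bedrooms': [1, 1, 1, 2, 2],
    'price': [2000, 2500, 3000, 3000, 5000],
})

# Median price of all listings with the same number of bedrooms,
# aligned row by row with the original frame.
median_by_size = toy.groupby('bedrooms')['price'].transform('median')

# Hypothetical feature: positive values mean the listing is cheaper than
# comparable listings, negative values mean it is more expensive.
toy['price_advantage'] = (median_by_size - toy['price']) / median_by_size
print(toy)
```

If such a feature is later applied to a test set, the group medians would probably have to come from the training data only, so that no information from the test set leaks into the feature.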