Numerical Features

The simplest features to check are the numerical ones. There might be some correlation between the interest level and numerical features of the data set such as the number of bedrooms or bathrooms.



In [2]:

    
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

df = pd.read_json('../data/raw/train.json')

df['created'] = pd.to_datetime(df['created'])

Bedrooms

First, let's check the relation between the interest level and the number of bedrooms and bathrooms, starting with the bedrooms. From our exploration we know that there is not much data available for apartments with 5+ bedrooms, so we will combine them into one category.



In [30]:

    
def relative_count(df, column):
    # Calculate counts per value of `column` and interest_level
    grouped = df.groupby([column, 'interest_level'])[column].count()
    grouped_df = pd.DataFrame({column: grouped.index.get_level_values(0),
                           'interest_level': grouped.index.get_level_values(1),
                           'count': grouped.values})

    # Get the total counts per value of `column`
    group_counts = df.groupby([column])[column].count()

    # Calculate relative counts per group. This allows us to see more easily
    # if there are differences
    grouped_df['relative_count'] = grouped_df.apply(
        lambda row: row['count'] / group_counts[row[column]], axis=1)
    
    return grouped_df

# Combine all apartments with 5+ bedrooms into one category
df_bedrooms = df.copy(deep=True)
df_bedrooms['bedrooms'] = df_bedrooms['bedrooms'].apply(lambda b: str(b) if b <= 4 else '5+')
grouped_df = relative_count(df_bedrooms, 'bedrooms')

plt.figure(figsize=(8, 6))
sns.barplot(x='bedrooms', y='relative_count', hue='interest_level', data=grouped_df,
            hue_order=['low', 'medium', 'high']);

We can see that the interest level rises slightly for apartments with more than one bedroom, but falls sharply for apartments with five or more bedrooms.
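As a cross-check, the relative counts can also be sketched with pd.crosstab and normalize='index'. This is not the method used in the cell above, only an alternative under the assumption that a wide table of shares is acceptable. The toy data frame and the names toy, shares and shares_long are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the real listings (values are made up)
toy = pd.DataFrame({
    'bedrooms': ['0', '0', '1', '1', '1', '5+'],
    'interest_level': ['low', 'high', 'low', 'low', 'medium', 'low'],
})

# Share of each interest level within every bedroom group (each row sums to 1)
shares = pd.crosstab(toy['bedrooms'], toy['interest_level'], normalize='index')

# Long format with the same columns that sns.barplot is given above
shares_long = shares.reset_index().melt(
    id_vars='bedrooms', var_name='interest_level', value_name='relative_count')
print(shares_long)
```

Unlike relative_count, this variant also lists combinations that never occur, with a share of 0.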

It's also time to find out why there are apartments with 0 bedrooms. I was not able to get a definitive answer for this, but I assume that these are studio apartments. At least, renthop displays either "Studio" or something along the lines of "2 beds" on its website, and on Wikipedia I found the following definition for Portugal:

Studio apartments are designated T0 (T-Zero). This designation follows the Portuguese house classification system, where apartments are classified by their typology as Tx, with the "x" representing the number of independent bedrooms. In the case of the T0, the "0" means that the apartment has no independent bedrooms.

Of course, Portugal is not the US, but I assume that it might be similar in the US (maybe without standards enforcing it).

At least we know that 0 bed apartments are valid apartments.

Bathrooms

We can do the same analysis for bathrooms, but here it makes sense to split the data into groups with 0, 1, 2 or 3+ bathrooms.



In [32]:

    
import math

# Combine all apartments with 3+ bathrooms into one category and round half bathrooms down
df_bathrooms = df.copy(deep=True)
df_bathrooms['bathrooms'] = df_bathrooms['bathrooms'].apply(lambda x: int(math.floor(x)))
df_bathrooms['bathrooms'] = df_bathrooms['bathrooms'].apply(lambda b: str(b) if b <= 2 else '3+')
grouped_df = relative_count(df_bathrooms, 'bathrooms')

plt.figure(figsize=(8, 6))
sns.barplot(x='bathrooms', y='relative_count', hue='interest_level', data=grouped_df,
            hue_order=['low', 'medium', 'high']);

We can see that the interest level is lowest for apartments with 0 bathrooms and also low for apartments with 3+ bathrooms.
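The bedroom and bathroom cells both floor a value and collapse everything above a threshold into one "N+" label. A minimal vectorized sketch of that step could look as follows; cap_category is a hypothetical helper, not something from the notebook, and the sample values are made up.

```python
import numpy as np
import pandas as pd

def cap_category(values, max_value):
    """Hypothetical helper: floor the values, convert them to strings and
    label everything above max_value as '<max_value + 1>+'."""
    floored = np.floor(values).astype(int)
    labels = floored.astype(str)
    # Keep the label where the floored value is small enough, else use 'N+'
    return labels.where(floored <= max_value, f'{max_value + 1}+')

# Made-up bathroom counts, including half bathrooms
bathrooms = pd.Series([0.0, 1.0, 1.5, 2.0, 3.0, 4.5])
print(cap_category(bathrooms, 2).tolist())
```

With max_value=4 the same helper would produce the '5+' label used for bedrooms.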

Bedrooms / Bathrooms

I suspect that there is a relation between the number of bedrooms and the number of bathrooms, so let's check this next.



In [59]:

    
# Note: `height` replaces the `size` parameter, which was renamed in seaborn 0.9
sns.lmplot(x='bedrooms', y='bathrooms', x_jitter=0.5, y_jitter=0.5,
           data=df, hue='interest_level', hue_order=['low', 'medium', 'high'], height=8);

This plot is not easy to read because I applied a lot of jitter. We can see that apartments with zero or one bedroom almost always have one bathroom. For two and three bedrooms there are apartments with one and with two bathrooms. For more bedrooms the data gets quite sparse, but the number of bathrooms rises.

Apart from this, there are areas that are much more sparse (I checked this by plotting only 10% of the data) and have a low interest level. This is interesting for us:

apartments with 0 bathrooms have low interest most of the time (already seen above)
apartments with 1 bedroom but 2 bathrooms have low interest
apartments with 3 bedrooms and 3 or more bathrooms have low interest

So there seem to be some areas where it's simple to always predict a low interest level. In the middle of the plot, however, all interest levels occur; the low interest levels are just hidden in the plot.
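Because the jittered scatter plot hides points, the observations above could also be checked numerically. The sketch below counts the listings per bedroom/bathroom combination and computes the share with low interest. It runs on made-up toy data, and the names toy, cells, listings and low_share are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the real listings (values are made up)
toy = pd.DataFrame({
    'bedrooms':  [1, 1, 1, 1, 3, 3],
    'bathrooms': [1, 1, 2, 2, 3, 1],
    'interest_level': ['low', 'high', 'low', 'low', 'low', 'medium'],
})

# For every bedroom/bathroom combination: number of listings and the
# fraction of them that have a low interest level
cells = (toy.assign(is_low=toy['interest_level'] == 'low')
            .groupby(['bedrooms', 'bathrooms'])['is_low']
            .agg(listings='size', low_share='mean')
            .reset_index())
print(cells)
```

Combinations with few listings and a low_share close to 1 would correspond to the sparse, low-interest areas described above.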

Number of photos

Next, let's check the relation between the interest level and the number of photos. From the initial exploration we know that there are few listings with 11 or more photos, so let's combine those.



In [64]:

    
# Combine all listings with 11+ photos into one category
df_photos = df.copy(deep=True)
df_photos['photo_count'] = df_photos['photos'].apply(len)
df_photos['photo_count'] = df_photos['photo_count'].apply(lambda b: str(b) if b <= 10 else '11+')
grouped_df = relative_count(df_photos, 'photo_count')

plt.figure(figsize=(8, 6))
sns.barplot(x='photo_count', y='relative_count', hue='interest_level', data=grouped_df,
            order=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11+'],
            hue_order=['low', 'medium', 'high']);

We can see that the interest level for listings with 0 photos is really low (which was expected). For listings with photos, the distribution looks quite regular, with the highest interest at about 4-8 photos.

Price

The last remaining simple numeric feature is the price. As we have seen during the initial exploration, there are a few apartments with a very high price. So we should split the data into cheap and expensive ones.



In [99]:

    
df_cheaper = df[df['price'] < 10000]

sns.violinplot(x='interest_level', y='price', data=df_cheaper);

We can see that the interest level tends to be high for apartments cheaper than $2000 per month.

For the expensive apartments, there is too little data to get any real insight.
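The price observation could be made more concrete by binning the price and looking at the share of each interest level per bin. The following is only a sketch on made-up toy data; the bin edges and labels are hypothetical and chosen purely for illustration, not taken from the analysis.

```python
import pandas as pd

# Toy data standing in for the real listings (values are made up)
toy = pd.DataFrame({
    'price': [1500, 1800, 2500, 3200, 3900, 6000],
    'interest_level': ['high', 'medium', 'low', 'medium', 'low', 'low'],
})

# Hypothetical bin edges; right=False makes each bin include its lower edge
bins = [0, 2000, 4000, 10000]
labels = ['<2000', '2000-3999', '4000-9999']
toy['price_bin'] = pd.cut(toy['price'], bins=bins, labels=labels, right=False)

# Share of each interest level within every price bin (each row sums to 1)
price_shares = pd.crosstab(toy['price_bin'], toy['interest_level'], normalize='index')
print(price_shares)
```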