In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
df = pd.read_json('../data/raw/train.json')
df['created'] = pd.to_datetime(df['created'])
In [30]:
def relative_count(df, column):
    # Count rows for each combination of `column` and interest_level
    grouped = df.groupby([column, 'interest_level'])[column].count()
    grouped_df = pd.DataFrame({column: grouped.index.get_level_values(0),
                               'interest_level': grouped.index.get_level_values(1),
                               'count': grouped.values})
    # Get the total number of rows for each value of `column`
    group_counts = df.groupby([column])[column].count()
    # Calculate relative counts per group. This allows us to see more easily
    # if there are differences
    grouped_df['relative_count'] = grouped_df.apply(
        lambda row: row['count'] / group_counts[row[column]], axis=1)
    return grouped_df
# Combine all apartments with 5+ bedrooms into one category
df_bedrooms = df.copy(deep=True)
df_bedrooms['bedrooms'] = df_bedrooms['bedrooms'].apply(lambda b: str(b) if b <= 4 else '5+')
grouped_df = relative_count(df_bedrooms, 'bedrooms')
plt.figure(figsize=(8, 6))
sns.barplot(x='bedrooms', y='relative_count', hue='interest_level', data=grouped_df,
hue_order=['low', 'medium', 'high']);
We can see that the interest level rises slightly for apartments with more than one bedroom, but for 5 or more bedrooms it falls sharply.
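As a cross-check on the plot, the same per-group shares can also be computed with `pd.crosstab` and `normalize='index'`. This is only a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the real listings); each row of the result sums to 1.

```python
import pandas as pd

# Hypothetical toy data standing in for the real listings.
toy = pd.DataFrame({
    'bedrooms': ['0', '0', '1', '1', '1', '1', '5+', '5+'],
    'interest_level': ['low', 'high', 'low', 'low', 'medium', 'high', 'low', 'low'],
})

# normalize='index' divides every count by its row total, i.e. by the
# number of listings in that bedroom category.
shares = pd.crosstab(toy['bedrooms'], toy['interest_level'], normalize='index')

# Put the columns in the same order as the plots; missing levels become 0.
shares = shares.reindex(columns=['low', 'medium', 'high'], fill_value=0.0)
print(shares)
```

Compared with `relative_count`, this returns a wide table (one column per interest level) instead of the long format that `sns.barplot` expects, so it is more useful for reading off numbers than for plotting.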
It's also time to find out why there are apartments with 0 bedrooms. I was not able to get a definitive answer, but I assume that these are studio apartments. At least renthop displays either "Studio" or something along the lines of "2 beds" on its website, and on Wikipedia I found the following definition for Portugal:
Studio apartments are designated T0 (T-Zero). This designation follows the Portuguese house classification system, where apartments are classified by their typology as Tx, with the "x" representing the number of independent bedrooms. In the case of the T0, the "0" means that the apartment has no independent bedrooms.
Of course, Portugal is not the US, but I assume that it might be similar in the US (maybe without standards enforcing it).
So we can at least treat 0-bedroom apartments as valid listings.
In [32]:
import math
# Combine all apartments with 3+ bathrooms into one category and round half bathrooms down
df_bathrooms = df.copy(deep=True)
df_bathrooms['bathrooms'] = df_bathrooms['bathrooms'].apply(lambda x: int(math.floor(x)))
df_bathrooms['bathrooms'] = df_bathrooms['bathrooms'].apply(lambda b: str(b) if b <= 2 else '3+')
grouped_df = relative_count(df_bathrooms, 'bathrooms')
plt.figure(figsize=(8, 6))
sns.barplot(x='bathrooms', y='relative_count', hue='interest_level', data=grouped_df,
hue_order=['low', 'medium', 'high']);
In [59]:
# `height` replaces the `size` argument used by seaborn versions before 0.9
sns.lmplot(x='bedrooms', y='bathrooms', x_jitter=0.5, y_jitter=0.5,
data=df, hue='interest_level', hue_order=['low', 'medium', 'high'], height=8);
This plot is not easy to read because I applied a lot of jitter. We can see that apartments with zero or one bedroom almost always have 1 bathroom. Among two- and three-bedroom apartments there are listings with one bathroom and listings with two. For more bedrooms the data gets quite sparse, but the number of bathrooms rises.
Apart from this, there are areas which are much more sparse (I checked this by plotting only 10% of the data) and have a low interest level. This is interesting for us:
So there seem to be some areas where it's simple to always predict a low interest level. In the middle, however, every interest level occurs; the low interest levels are just hidden in the plot.
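The 10% check mentioned above can be reproduced by subsampling the frame before plotting. A minimal sketch, assuming a fixed `random_state` for repeatability and using a made-up frame (`toy`, with random values) in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the listings frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'bedrooms': rng.integers(0, 6, size=1000),
    'bathrooms': rng.integers(1, 4, size=1000),
})

# Keep a random 10% of the rows. With fewer overlapping points, sparse
# regions are easier to tell apart from dense ones.
subset = toy.sample(frac=0.1, random_state=0)
print(len(subset))
```

The resulting `subset` can then be passed as `data=` to the same `sns.lmplot` call in place of the full frame.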
In [64]:
# Combine all listings with 11+ photos into one category
df_photos = df.copy(deep=True)
df_photos['photo_count'] = df_photos['photos'].apply(len)
df_photos['photo_count'] = df_photos['photo_count'].apply(lambda b: str(b) if b <= 10 else '11+')
grouped_df = relative_count(df_photos, 'photo_count')
plt.figure(figsize=(8, 6))
sns.barplot(x='photo_count', y='relative_count', hue='interest_level', data=grouped_df,
order=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11+'],
hue_order=['low', 'medium', 'high']);
We can see that the interest level for listings with 0 photos is really low (which was expected). For listings with photos, the interest level varies smoothly with the photo count and is highest at about 4-8 photos.
In [99]:
df_cheaper = df[df['price'] < 10000]
sns.violinplot(x='interest_level', y='price', data=df_cheaper);
We can see that interest is high for apartments cheaper than $2000 per month.
For the expensive apartments, there is too little data to get any real insight.
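One way to back up the remark about sparse data is to count listings per price band. The following is a sketch on made-up prices; the frame `toy`, the band edges, and the labels are arbitrary assumptions chosen for illustration, not values taken from the data set.

```python
import pandas as pd

# Hypothetical monthly prices standing in for the real `price` column.
toy = pd.DataFrame({'price': [1500, 1800, 2100, 2500, 3200, 4000, 7500, 9500]})

# pd.cut assigns each price to a right-closed interval:
# (0, 2000], (2000, 4000], (4000, 10000].
bands = pd.cut(toy['price'], bins=[0, 2000, 4000, 10000],
               labels=['<=2000', '2000-4000', '4000-10000'])

# Number of listings in each band, in band order.
counts = bands.value_counts().sort_index()
print(counts)
```

On the real data, a band with only a handful of listings would indicate that the shape of the violin plot in that price range should not be over-interpreted.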