First we should try to get an overview of the data. The dataset is from a rental listings website. Let's see what it contains.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [2]:
df = pd.read_json('../data/raw/train.json')
df['created'] = pd.to_datetime(df['created'])
df.head()
Out[2]:
The data contains information about the apartment, like the number of rooms (bathrooms and bedrooms), possibly information about the building it is located in (some elements seem to be zero), the textual description, and a list of features of the apartment. The monthly price is also given.
There's also geographical information: latitude/longitude, the displayed address, and the street address (which might be a backend-only address hidden from the user).
The dataset also contains references to images - which are provided in the competition, too.
The target variable of the dataset is the interest level (how many people contacted the owner).
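Before diving into plots, a quick df.info() gives us the column types and non-null counts (just a sanity check, not part of the later analysis):
df.info()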
So let's first check the distribution of the prices:
In [3]:
sns.distplot(df['price']);
There seem to be some apartments with very high prices, so let's check those:
In [4]:
df.sort_values(by='price', ascending=False).head()
Out[4]:
The most expensive apartment is 4 million per month (this could be a data entry error or genuine; we might want to check it manually later), but prices drop quickly after that. So let's exclude the most expensive apartments:
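If we want to inspect that suspicious listing manually, a quick lookup would do (a small sketch using the columns shown above):
# Full record of the most expensive listing, for manual inspection
df.loc[df['price'].idxmax()]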
In [5]:
df_cheaper = df[df['price'] < 10000]
sns.distplot(df_cheaper['price']);
Now we can see more clearly that most apartments cost around 2300 dollars per month and that the distribution is skewed to the right (common in situations where negative values are impossible).
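To back the visual impression with numbers, we can look at the summary statistics (just a quick check):
# Median and quartiles of the filtered prices
df_cheaper['price'].describe()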
Let's go on and check the other numeric features, bedrooms and bathrooms.
In [6]:
sns.countplot(df['bedrooms']);
Most apartments have 1 or 2 bedrooms. It's a bit surprising to me that there are apartments with 0 bedrooms. We should investigate this later.
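As a quick first look (my guess is that these are studio apartments, but that's an assumption), we could peek at a few of them:
# A few listings with zero bedrooms - possibly studios
df[df['bedrooms'] == 0][['bedrooms', 'bathrooms', 'price', 'description']].head()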
In [7]:
sns.countplot(df['bathrooms']);
As for the bathrooms, we can see that some apartments have half bathrooms. To a native German this counting pattern seems uncommon (we count half bedrooms instead, so in Germany you would see half numbers for bedrooms), but there is a general rule behind it: a full bathroom has a toilet, sink, and shower/tub, while a half bathroom has only a toilet and sink.
It's not surprising that most apartments have exactly one bathroom.
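The exact counts are easy to read off as a table, too:
# Bathroom counts, including the half bathrooms
df['bathrooms'].value_counts().sort_index()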
Next we could check the creation date, just wondering whether there were more offers during specific years and which years we have data for.
In [8]:
months = df['created'].dt.strftime('%Y-%m')
sns.countplot(months.sort_values());
The data only covers three months, so there is not much variation in the number of listings. However, considering we only have three months, we should check whether the test data comes from the same months or from different ones.
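A sketch of that check could look like this (assuming the test set lives at ../data/raw/test.json with the same schema):
# Month distribution of the test set, for comparison with the training months
df_test = pd.read_json('../data/raw/test.json')
test_months = pd.to_datetime(df_test['created']).dt.strftime('%Y-%m')
test_months.value_counts().sort_index()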
We could also check how many photos each listing has.
In [9]:
df['photo_count'] = df['photos'].apply(len)
sns.distplot(df['photo_count']);
This is not exactly a nice Gaussian curve, but we can see that it's common for an apartment to have roughly between 3 and 8 pictures.
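We can confirm that range with a few quantiles (just a rough check):
# Most listings fall roughly between 3 and 8 photos
df['photo_count'].quantile([0.1, 0.25, 0.5, 0.75, 0.9])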
Next we should check what the possible features of apartments are and how often they are used:
In [10]:
import itertools
from collections import Counter
plt.figure(figsize=(12,8))
# Flatten the per-listing feature lists into a single iterable and count occurrences
features = itertools.chain.from_iterable(df['features'].values)
most_common_features = Counter(features).most_common(30)
# Re-expand each (feature, count) pair into repeated items so countplot reproduces the counts
features_weighted = list(itertools.chain.from_iterable(map(lambda item: [item[0]] * item[1], most_common_features)))
sns.countplot(y=features_weighted);
From entries like HARDWOOD or SIMPLEX we can see that the features seem to be free text that managers define themselves. This could be interesting, because there might be a relation between interest_level
and writing style (e.g. all caps vs. normal case). Of course, we should also check whether there is any relation between specific features and interest levels. This might also interact with the price: perhaps nobody is interested in an expensive apartment without a doorman?
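As a rough sketch of the writing-style idea (my own quick check, not a definitive analysis), we could compute the share of all-caps features per listing and compare it across interest levels:
# Fraction of a listing's features written entirely in caps
def caps_ratio(features):
    if len(features) == 0:
        return 0.0
    return sum(f.isupper() for f in features) / len(features)

df['caps_ratio'] = df['features'].apply(caps_ratio)
df.groupby('interest_level')['caps_ratio'].mean()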
Out of curiosity, let's check whether any street addresses appear several times in the data (i.e. either duplicate offers or several offers from the same building).
In [11]:
from collections import Counter
most_common_addr = Counter(df['street_address'].values).most_common(10)
pd.DataFrame(most_common_addr, columns=['Address', 'Count'])
Out[11]:
Now this is interesting. There are not just two or three offers per building; for several buildings there are more than 100 offers each! By checking Google Maps we can see that 3333 Broadway is a really huge building, so 174 offers for it seem legitimate. 505 West 37th Street and 200 Water Street are also tall skyscrapers, so I think we can assume that these offers from the same buildings are not duplicates. It might be interesting to check whether offers within one building correlate more with each other regarding interest level than more general signals like the latitude/longitude geolocation. But we should not focus too much on this field, because I assume it's only visible in the backend.
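For a first impression, we could look at the interest level distribution inside 3333 Broadway (the address string is taken from the table above):
# Interest levels within one huge building, for comparison with the overall distribution
df[df['street_address'] == '3333 Broadway']['interest_level'].value_counts(normalize=True)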
So let's see the same chart for the display address.
In [12]:
from collections import Counter
most_common_addr = Counter(df['display_address'].values).most_common(10)
pd.DataFrame(most_common_addr, columns=['Address', 'Count'])
Out[12]:
We see more duplicates here, because the display address typically contains only the street name without the house number.
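Comparing the number of unique values supports this (a quick check):
# Fewer unique display addresses than street addresses
df['display_address'].nunique(), df['street_address'].nunique()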
Let's do one more simple thing and see how long the descriptions usually are.
In [13]:
df['description_length'] = df['description'].map(len)
sns.distplot(df['description_length']);
So, there is a large number of listings without a description, and the rest follows a nice distribution with its peak at about 600 characters. The average word length in English is said to be about 5.1 characters and the average sentence length about 14 words, giving an average sentence length of roughly 71.4 characters. So a typical offer would have around eight sentences.
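We can quantify the spike at zero directly (a small check):
# Number and share of listings with an empty description
(df['description_length'] == 0).sum(), (df['description_length'] == 0).mean()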
Finally, we should also check our target variable. So what is the distribution of the interest levels?
In [14]:
sns.countplot(df['interest_level']);
There are relatively few listings with high interest and many with low interest, so a model that always predicts "low" would perform best among the three constant-prediction baselines. However, this competition is not scored on a hard class decision, but on the predicted probabilities for each of the three interest levels (which could probably be mapped onto a numeric number of inquiries relatively easily).
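The empirical class frequencies double as the constant-probability baseline mentioned above:
# Class distribution = probabilities of an "always predict the base rates" model
df['interest_level'].value_counts(normalize=True)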