First, we should get an overview of the data. The dataset comes from a rental listings website. Let's see what it contains.


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [2]:
df = pd.read_json('../data/raw/train.json')

df['created'] = pd.to_datetime(df['created'])

df.head()


Out[2]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address
10 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 2016-06-24 07:54:24 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Metropolitan Avenue [] medium 40.7145 7211212 -73.9425 5ba989232d0489da1b5f2c45f6688adc [https://photos.renthop.com/2/7211212_1ed4542e... 3000 792 Metropolitan Avenue
10000 1.0 2 c5c8a357cba207596b04d1afd1e4f130 2016-06-12 12:19:27 Columbus Avenue [Doorman, Elevator, Fitness Center, Cats Allow... low 40.7947 7150865 -73.9667 7533621a882f71e25173b27e3139d83d [https://photos.renthop.com/2/7150865_be3306c5... 5465 808 Columbus Avenue
100004 1.0 1 c3ba40552e2120b0acfc3cb5730bb2aa 2016-04-17 03:26:41 Top Top West Village location, beautiful Pre-w... W 13 Street [Laundry In Building, Dishwasher, Hardwood Flo... high 40.7388 6887163 -74.0018 d9039c43983f6e564b1482b273bd7b01 [https://photos.renthop.com/2/6887163_de85c427... 2850 241 W 13 Street
100007 1.0 1 28d9ad350afeaab8027513a3e52ac8d5 2016-04-18 02:22:02 Building Amenities - Garage - Garden - fitness... East 49th Street [Hardwood Floors, No Fee] low 40.7539 6888711 -73.9677 1067e078446a7897d2da493d2f741316 [https://photos.renthop.com/2/6888711_6e660cee... 3275 333 East 49th Street
100013 1.0 4 0 2016-04-28 01:32:41 Beautifully renovated 3 bedroom flex 4 bedroom... West 143rd Street [Pre-War] low 40.8241 6934781 -73.9493 98e13ad4b495b9613cef886d79a6291f [https://photos.renthop.com/2/6934781_1fa4b41a... 3350 500 West 143rd Street

The data contains information about each apartment, such as the number of rooms (bathrooms and bedrooms), an ID for the building in which it is located (some building IDs seem to be zero), a free-text description and a list of features. The monthly price is also given.
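To put a number on the zero-valued building IDs, one could measure their share. The following is a minimal sketch that uses only the five building_id values visible in Out[2]; the helper name placeholder_share and the assumption that the placeholder is stored as the string '0' are mine and not confirmed by the data description.

```python
import pandas as pd

# building_id values copied from the five rows in Out[2] (not the full dataset).
building_ids = pd.Series([
    '53a5b119ba8f7b61d4e010512e0dfc85',
    'c5c8a357cba207596b04d1afd1e4f130',
    'c3ba40552e2120b0acfc3cb5730bb2aa',
    '28d9ad350afeaab8027513a3e52ac8d5',
    '0',
])

def placeholder_share(ids, placeholder='0'):
    """Fraction of IDs equal to the placeholder (assumed to mean 'unknown building')."""
    return float((ids.astype(str) == placeholder).mean())

share = placeholder_share(building_ids)
```

On the full data one would call placeholder_share(df['building_id']) instead.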

There's also geographical information like latitude/longitude, the displayed address and the street address (which might be a backend-only address that is hidden from the user).

The dataset also contains references to images - which are provided in the competition, too.

The target variable of the dataset is the interest level (how many people contacted the owner).

So let's first check the distribution of the prices:


In [3]:
sns.distplot(df['price']);


There seem to be some apartments with very high prices, so let's check those:


In [4]:
df.sort_values(by='price', ascending=False).head()


Out[4]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address
32611 1.0 2 cd25bbea2af848ebe9821da820b725da 2016-06-24 05:02:11 Hudson Street [Doorman, Elevator, Cats Allowed, Dogs Allowed... low 40.7299 7208764 -74.0071 d1737922fe92ccb0dc37ba85589e6415 [] 4490000 421 Hudson Street
12168 1.0 2 5d3525a5085445e7fcd64a53aac3cb0a 2016-06-24 05:02:58 West 116th Street [Doorman, Elevator, Cats Allowed, Dogs Allowed... low 40.8011 7208794 -73.9480 d1737922fe92ccb0dc37ba85589e6415 [] 1150000 40 West 116th Street
55437 1.0 1 37385c8a58176b529964083315c28e32 2016-05-14 05:21:28 West 57th Street [Doorman, Cats Allowed, Dogs Allowed] low 40.7676 7013217 -73.9844 8f5a9c893f6d602f4953fcc0b8e6e9b4 [] 1070000 333 West 57th Street
57803 1.0 1 37385c8a58176b529964083315c28e32 2016-05-19 02:37:06 This 1 Bedroom apartment is located on a prime... West 57th Street [Doorman, Elevator, Pre-War, Dogs Allowed, Cat... low 40.7676 7036279 -73.9844 18133bc914e6faf6f8cc1bf29d66fc0d [https://photos.renthop.com/2/7036279_924b52f0... 1070000 333 West 57th Street
123877 0.0 0 b9c72643feb2652536a898a5f13d2543 2016-04-12 02:11:10 Originally built in 1862, this extraordinary l... Duane Street [Elevator, Pre-War, Terrace, Dogs Allowed, Cat... low 40.7161 6857401 -74.0080 d98acd4fa3c463bd468603bd873cc54c [https://photos.renthop.com/2/6857401_a4a4c2f2... 135000 144 Duane Street

The most expensive apartment costs about 4.5 million per month (this could be a data entry error or real; we might want to check manually later), but prices drop quickly after that: the fifth most expensive listing is already at 135,000. So let's exclude the most expensive apartments:


In [5]:
df_cheaper = df[df['price'] < 10000]
sns.distplot(df_cheaper['price']);


Now we can see more clearly that most apartments cost around 2300 dollars per month and that the distribution is skewed to the right (common when negative values are impossible).
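Before settling on a cutoff, it can help to know how much data the cutoff removes, and a logarithmic scale is a common alternative for right-skewed prices. The sketch below is only an illustration: it uses the ten prices visible in Out[2] and Out[4], which are not representative because half of them are the extreme listings, and the helper share_above is a name I made up.

```python
import numpy as np
import pandas as pd

# Prices copied from the ten rows shown in Out[2] and Out[4]; not a representative sample.
prices = pd.Series(
    [3000, 5465, 2850, 3275, 3350, 4490000, 1150000, 1070000, 1070000, 135000],
    name='price',
)

def share_above(prices, threshold):
    """Fraction of listings priced at or above `threshold`."""
    return float((prices >= threshold).mean())

# Share of listings that the `price < 10000` filter would drop.
share_dropped = share_above(prices, 10000)

# A log scale keeps the outliers but compresses their distance to the bulk.
log_prices = np.log10(prices)
```

On the full data, share_above(df['price'], 10000) shows how many listings the filter in the cell above discards.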

Let's go on and check the other numeric features, bedrooms and bathrooms.


In [6]:
sns.countplot(x=df['bedrooms']);


Most apartments have 1 or 2 bedrooms. It's a bit surprising to me that there are apartments with 0 bedrooms. We should investigate this later.


In [7]:
sns.countplot(x=df['bathrooms']);


As for the bathrooms, we can see that some apartments have half bathrooms. For a native German this way of counting seems unusual (we count half bedrooms instead, so in our area you would see fractional numbers for bedrooms), but there seems to be a general rule: a full bathroom has a toilet, a sink and a shower or tub, while a half bathroom has only a toilet and a sink.

It's not surprising that most apartments have exactly one bathroom.

Next we could check the creation date to see which period the data covers and whether there were more offers at specific times.


In [8]:
months = df['created'].dt.strftime('%Y-%m')
sns.countplot(x=months.sort_values());


The data only covers three months, so the number of listings does not change much over time. However, since we only have three months, we should check whether the test data comes from the same months or from different ones.
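The comparison with the test data could look roughly like the sketch below. The two helper functions are hypothetical names of mine, the training timestamps are copied from Out[2], and the "test" timestamps are invented purely to exercise the code; the real test file is not shown in this notebook.

```python
import pandas as pd

def month_counts(created):
    """Count listings per calendar month ('YYYY-MM') for a datetime Series."""
    return created.dt.strftime('%Y-%m').value_counts().sort_index()

def months_missing_from_train(train_created, test_created):
    """Months that occur in the test data but not in the training data."""
    train_months = set(month_counts(train_created).index)
    test_months = set(month_counts(test_created).index)
    return sorted(test_months - train_months)

# Timestamps copied from Out[2].
train_created = pd.to_datetime(pd.Series([
    '2016-06-24 07:54:24', '2016-06-12 12:19:27', '2016-04-17 03:26:41',
    '2016-04-18 02:22:02', '2016-04-28 01:32:41',
]))
# Invented values, only here to make the example runnable.
test_created = pd.to_datetime(pd.Series(['2016-05-01 10:00:00', '2016-06-02 09:30:00']))

counts = month_counts(train_created)
missing = months_missing_from_train(train_created, test_created)
```

An empty result from months_missing_from_train would mean that every month in the test data is also present in the training data.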

We could also check the number of images apartments have.


In [9]:
df['photo_count'] = df['photos'].apply(len)
sns.distplot(df['photo_count']);


This is not a really nice Gaussian curve, but we can see that it is common for an apartment to have roughly 3 to 8 pictures.

Next we should check what the possible features of apartments are and how often they are used:


In [10]:
import itertools
from collections import Counter

plt.figure(figsize=(12, 8))
# Flatten the per-listing feature lists and keep the 30 most frequent features.
all_features = itertools.chain.from_iterable(df['features'])
most_common_features = Counter(all_features).most_common(30)
labels, counts = zip(*most_common_features)
sns.barplot(x=list(counts), y=list(labels));


Entries like HARDWOOD or SIMPLEX suggest that features are free text that managers can define themselves. This could be interesting, because there might be a relation between interest_level and writing style (e.g. all caps versus mixed case). Of course, we should also check whether specific features are related to interest levels. This might also depend on the price: maybe nobody is interested in an expensive apartment without a doorman?
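Because the features are free text, counting them reliably probably requires some normalization first. Here is a small sketch of that idea together with an all-caps check. The feature lists mimic the ones in Out[2], 'HARDWOOD' is taken from the chart discussed above, and the lower-case variant in the last row is made up to show that spelling variants collapse; the function names are mine.

```python
from collections import Counter

def normalize(feature):
    """Lower-case and trim a feature string so spelling variants collapse."""
    return feature.strip().lower()

def is_all_caps(feature):
    """True if the feature contains letters and all of them are upper-case."""
    return feature.isupper()

# Feature lists in the style of Out[2]; the last row is an invented variant.
listings = [
    ['Doorman', 'Elevator', 'Fitness Center'],
    ['Laundry In Building', 'Dishwasher', 'Hardwood Floors'],
    ['Hardwood Floors', 'No Fee'],
    ['HARDWOOD', 'Pre-War'],
    ['hardwood floors'],
]

# Frequency of each feature after normalization.
normalized_counts = Counter(normalize(f) for features in listings for f in features)

# Number of listings that contain at least one all-caps feature.
caps_listings = sum(any(is_all_caps(f) for f in features) for features in listings)
```

A per-listing flag such as "has an all-caps feature" could then be compared against interest_level.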

Out of curiosity, let's check whether any street address appears several times in the data (i.e. either duplicate offers or offers from the same building).


In [11]:
from collections import Counter

most_common_addr = Counter(df['street_address'].values).most_common(10)
pd.DataFrame(most_common_addr, columns=['Address', 'Count'])


Out[11]:
Address Count
0 3333 Broadway 174
1 505 West 37th Street 167
2 200 Water Street 160
3 90 Washington Street 142
4 100 Maiden Lane 131
5 401 East 34th Street 129
6 2 Gold Street 120
7 1 West Street 119
8 100 John Street 115
9 95 Wall Street 106

Now this is interesting. Buildings do not just have two or three offers each: several of them have more than 100. Google Maps shows that 3333 Broadway is a really huge building, so 174 offers for it seem plausible. 505 West 37th Street and 200 Water Street are also tall skyscrapers, so I think we can assume that these offers from the same building are not duplicates. It might be interesting to check whether offers within one building are more similar to each other in interest level than offers grouped by a more general measure such as geolocation (latitude and longitude). But we should not focus too much on this field, because I assume that it is only visible in the backend.
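One way to probe the duplicate question further is to look at how many distinct listings, managers and prices share an address. The sketch below uses three rows copied from Out[4], reduced to the columns it needs; address_summary is a hypothetical helper, and treating "same address, same price" as a hint for a duplicate is only a heuristic of mine.

```python
import pandas as pd

# Three rows copied from Out[4] (columns reduced to the ones used here).
rows = pd.DataFrame({
    'street_address': ['333 West 57th Street', '333 West 57th Street', '421 Hudson Street'],
    'listing_id': [7013217, 7036279, 7208764],
    'manager_id': [
        '8f5a9c893f6d602f4953fcc0b8e6e9b4',
        '18133bc914e6faf6f8cc1bf29d66fc0d',
        'd1737922fe92ccb0dc37ba85589e6415',
    ],
    'price': [1070000, 1070000, 4490000],
})

def address_summary(df):
    """Per address: number of listings, distinct managers and distinct prices."""
    return df.groupby('street_address').agg(
        listings=('listing_id', 'nunique'),
        managers=('manager_id', 'nunique'),
        distinct_prices=('price', 'nunique'),
    )

summary = address_summary(rows)
```

In this small sample, 333 West 57th Street has two listings from two different managers at the same price, which might be the same apartment offered twice. A large building with many different prices looks more like genuinely separate apartments.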

So let's see the same chart for the display address.


In [12]:
from collections import Counter

most_common_addr = Counter(df['display_address'].values).most_common(10)
pd.DataFrame(most_common_addr, columns=['Address', 'Count'])


Out[12]:
Address Count
0 Broadway 438
1 East 34th Street 355
2 Second Avenue 349
3 Wall Street 332
4 West 37th Street 287
5 West Street 258
6 First Avenue 244
7 Gold Street 241
8 Washington Street 237
9 York Avenue 228

We see more duplicates here, because the display address only gives us the street name without the house number.
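The relation between the two address fields can be checked with a few lines. This sketch assumes that the display address is simply the street address without its leading house number, which holds for the five rows in Out[2] but is not guaranteed for the whole dataset; strip_house_number is a name I chose.

```python
import re

def strip_house_number(street_address):
    """Remove a leading house number such as '792 ' from a street address."""
    return re.sub(r'^\d+\s+', '', street_address.strip())

# (street_address, display_address) pairs copied from Out[2].
pairs = [
    ('792 Metropolitan Avenue', 'Metropolitan Avenue'),
    ('808 Columbus Avenue', 'Columbus Avenue'),
    ('241 W 13 Street', 'W 13 Street'),
    ('333 East 49th Street', 'East 49th Street'),
    ('500 West 143rd Street', 'West 143rd Street'),
]

# True for every pair where stripping the number reproduces the display address.
matches = [strip_house_number(street) == display for street, display in pairs]
```

Rows where the comparison fails on the full data would be worth a manual look, since free-text addresses may be formatted inconsistently.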

Let's do one more simple thing and see how long the descriptions usually are.


In [13]:
df['description_length'] = df['description'].map(len)

sns.distplot(df['description_length']);


So, a large number of listings have no description, and the rest have a nice distribution with a peak at about 600 characters. The average word length in English seems to be about 5.1 characters and the average sentence length about 14 words, which gives an average sentence length of 5.1 × 14 = 71.4 characters. So a typical offer would have around eight sentences (600 / 71.4 ≈ 8.4).
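The back-of-the-envelope estimate can be written out as follows. The first part reproduces the numbers from the text; the second part adds one space per word, which is my own refinement and not part of the original estimate.

```python
# Figures quoted in the text; both averages are rough estimates.
avg_word_length = 5.1        # characters per English word
avg_sentence_words = 14      # words per sentence
typical_description = 600    # characters, the peak of the distribution above

chars_per_sentence = avg_word_length * avg_sentence_words
sentences = typical_description / chars_per_sentence

# Counting one space after each word makes a sentence longer and the estimate smaller.
chars_with_spaces = (avg_word_length + 1) * avg_sentence_words
sentences_with_spaces = typical_description / chars_with_spaces
```

Either way the typical description is on the order of seven to eight sentences.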

Finally, we should also check our target variable. So what is the distribution of the interest levels?


In [14]:
sns.countplot(x=df['interest_level']);


There are relatively few listings with a high interest level and many with a low one, so a model that always predicts low would perform best among the three constant models. However, this competition does not ask for a single predicted class, but rather for the probabilities of each of the three interest levels (which can probably be mapped onto a numeric number of inquiries relatively easily).
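When probabilities are submitted, the natural constant baseline is to predict the class frequencies rather than a hard "always low". The sketch below illustrates this under the assumption that predictions are scored with multi-class log loss; the text above does not name the metric, so treat that as my assumption. The labels are the five interest levels from Out[2], not the real class distribution.

```python
import math
from collections import Counter

def class_priors(labels):
    """Relative frequency of each interest level."""
    counts = Counter(labels)
    total = len(labels)
    return {level: n / total for level, n in counts.items()}

def log_loss(labels, predicted, eps=1e-15):
    """Mean negative log-probability assigned to the true class (assumed metric)."""
    total = 0.0
    for label in labels:
        # Clip probabilities so that log(0) cannot occur.
        p = min(max(predicted.get(label, 0.0), eps), 1 - eps)
        total -= math.log(p)
    return total / len(labels)

# Interest levels of the five rows in Out[2]; not the real class distribution.
labels = ['medium', 'low', 'high', 'low', 'low']

priors = class_priors(labels)
always_low = {'low': 1.0, 'medium': 0.0, 'high': 0.0}

prior_loss = log_loss(labels, priors)
always_low_loss = log_loss(labels, always_low)
```

Under this metric a hard "always low" prediction is penalized heavily for every medium or high listing, while predicting the class frequencies gives a much smaller loss.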

Further investigations

  • Is the most expensive apartment, at about 4.5 million per month, a data entry error or real?
  • Why are there so many apartments with 0 bedrooms? There are too many of them for this to be a data entry error.
  • What about the 0 bathrooms? There are not many such listings, but I assume these could be very old apartments, which might all have a low interest level.
  • Check which years/months the test data covers.
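The questions about 0 bedrooms and 0 bathrooms could be started with a small helper like the one below. It is only a sketch: the sample consists of the ten rows visible in Out[2] and Out[4], so the counts say nothing about the full dataset, and zero_room_summary is a hypothetical name.

```python
import pandas as pd

# bedrooms, bathrooms and interest_level copied from the ten rows in Out[2] and Out[4].
sample = pd.DataFrame({
    'bedrooms': [3, 2, 1, 1, 4, 2, 2, 1, 1, 0],
    'bathrooms': [1.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0],
    'interest_level': ['medium', 'low', 'high', 'low', 'low',
                       'low', 'low', 'low', 'low', 'low'],
})

def zero_room_summary(df, column):
    """Count listings with 0 in `column` and tabulate their interest levels."""
    zero = df[df[column] == 0]
    return len(zero), zero['interest_level'].value_counts().to_dict()

zero_bath_count, zero_bath_levels = zero_room_summary(sample, 'bathrooms')
zero_bed_count, zero_bed_levels = zero_room_summary(sample, 'bedrooms')
```

Running zero_room_summary(df, 'bathrooms') on the full data would show whether the listings without a bathroom really all have a low interest level.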