First we should try to get an overview of the data. The dataset is from a rental listings website. Let's see what it contains.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [2]:
df = pd.read_json('../data/raw/train.json')
df['created'] = pd.to_datetime(df['created'])
df.head()
Out[2]:
The data contains information about the apartment, like the number of rooms (bathrooms and bedrooms), possibly information about the building it is located in (some elements seem to be zero), the textual description, and a list of features of the apartment. The monthly price is also given.
There's also geographical information: latitude/longitude, the displayed address, and the street address (which might be a backend-only address hidden from the user).
The dataset also contains references to images - which are provided in the competition, too.
The target variable of the dataset is the interest level (how many people contacted the owner).
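Before diving into plots, a quick df.info() gives us the column types and non-null counts (just a sanity check, not part of the later analysis):
df.info()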
So let's first check the distribution of the prices:
In [3]:
sns.distplot(df['price']);
There seem to be some apartments with very high prices, so let's check those:
In [4]:
df.sort_values(by='price', ascending=False).head()
Out[4]:
The most expensive apartment is 4 million per month (this could be a data entry error or genuine; we might want to check it manually later), but prices drop quickly after that. So let's exclude the most expensive apartments:
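If we want to inspect that suspicious listing manually, a quick lookup would do (a small sketch using the columns shown above):
# Full record of the most expensive listing, for manual inspection
df.loc[df['price'].idxmax()]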
In [5]:
df_cheaper = df[df['price'] < 10000]
sns.distplot(df_cheaper['price']);
Now we can see more clearly that most apartments cost around 2300 dollars per month and that the distribution is skewed to the right (common in situations where negative values are impossible).
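To back the visual impression with numbers, we can look at the summary statistics (just a quick check):
# Median and quartiles of the filtered prices
df_cheaper['price'].describe()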
Let's go on and check the other numeric features, bedrooms and bathrooms.
In [6]:
sns.countplot(df['bedrooms']);
Most apartments have 1 or 2 bedrooms. It's a bit surprising to me that there are apartments with 0 bedrooms. We should investigate this later.
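As a quick first look (my guess is that these are studio apartments, but that's an assumption), we could peek at a few of them:
# A few listings with zero bedrooms - possibly studios
df[df['bedrooms'] == 0][['bedrooms', 'bathrooms', 'price', 'description']].head()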
In [7]:
sns.countplot(df['bathrooms']);
As for the bathrooms, we can see that some apartments have half bathrooms. To a native German this counting pattern seems uncommon (we count half bedrooms instead, so in Germany you would see half numbers for bedrooms), but there is a general rule behind it: a full bathroom has a toilet, sink, and shower/tub, while a half bathroom has only a toilet and sink.
It's not surprising that most apartments have exactly one bathroom.
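The exact counts are easy to read off as a table, too:
# Bathroom counts, including the half bathrooms
df['bathrooms'].value_counts().sort_index()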
Next we could check the creation date, just wondering whether there were more offers during specific years and which years we have data for.
In [8]:
months = df['created'].dt.strftime('%Y-%m')
sns.countplot(months.sort_values());
The data only covers three months, so there is not much variation in the number of listings. However, considering we only have three months, we should check whether the test data comes from the same months or from different ones.
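A sketch of that check could look like this (assuming the test set lives at ../data/raw/test.json with the same schema):
# Month distribution of the test set, for comparison with the training months
df_test = pd.read_json('../data/raw/test.json')
test_months = pd.to_datetime(df_test['created']).dt.strftime('%Y-%m')
test_months.value_counts().sort_index()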
We could also check how many photos each listing has.
In [9]:
df['photo_count'] = df['photos'].apply(len)
sns.distplot(df['photo_count']);
This is not exactly a nice Gaussian curve, but we can see that it's common for an apartment to have roughly between 3 and 8 pictures.
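We can confirm that range with a few quantiles (just a rough check):
# Most listings fall roughly between 3 and 8 photos
df['photo_count'].quantile([0.1, 0.25, 0.5, 0.75, 0.9])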
Next we should check what the possible features of apartments are and how often they are used:
In [10]:
import itertools
from collections import Counter
plt.figure(figsize=(12,8))
# Flatten the per-listing feature lists into a single iterable and count occurrences
features = itertools.chain.from_iterable(df['features'].values)
most_common_features = Counter(features).most_common(30)
# Re-expand each (feature, count) pair into repeated items so countplot reproduces the counts
features_weighted = list(itertools.chain.from_iterable(map(lambda item: [item[0]] * item[1], most_common_features)))
sns.countplot(y=features_weighted);
From entries like HARDWOOD or SIMPLEX we can see that the features seem to be free text that managers define themselves. This could be interesting, because there might be a relation between interest_level
and writing style (e.g. all caps vs. normal case). Of course, we should also check whether there is any relation between specific features and interest levels. This might also interact with the price: perhaps nobody is interested in an expensive apartment without a doorman?
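As a rough sketch of the writing-style idea (my own quick check, not a definitive analysis), we could compute the share of all-caps features per listing and compare it across interest levels:
# Fraction of a listing's features written entirely in caps
def caps_ratio(features):
    if len(features) == 0:
        return 0.0
    return sum(f.isupper() for f in features) / len(features)

df['caps_ratio'] = df['features'].apply(caps_ratio)
df.groupby('interest_level')['caps_ratio'].mean()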
Out of curiosity, let's check whether any street addresses appear several times in the data (i.e. either duplicate offers or several offers from the same building).
In [11]:
from collections import Counter
most_common_addr = Counter(df['street_address'].values).most_common(10)
pd.DataFrame(most_common_addr, columns=['Address', 'Count'])
Out[11]:
Now this is interesting. There are not just two or three offers per building; for several buildings there are more than 100 offers each! By checking Google Maps we can see that 3333 Broadway is a really huge building, so 174 offers for it seem legitimate. 505 West 37th Street and 200 Water Street are also tall skyscrapers, so I think we can assume that these offers from the same buildings are not duplicates. It might be interesting to check whether offers within one building correlate more with each other regarding interest level than more general signals like the latitude/longitude geolocation. But we should not focus too much on this field, because I assume it's only visible in the backend.
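For a first impression, we could look at the interest level distribution inside 3333 Broadway (the address string is taken from the table above):
# Interest levels within one huge building, for comparison with the overall distribution
df[df['street_address'] == '3333 Broadway']['interest_level'].value_counts(normalize=True)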
So let's see the same chart for the display address.
In [12]:
from collections import Counter
most_common_addr = Counter(df['display_address'].values).most_common(10)
pd.DataFrame(most_common_addr, columns=['Address', 'Count'])
Out[12]:
We see more duplicates here, because the display address typically contains only the street name without the house number.
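Comparing the number of unique values supports this (a quick check):
# Fewer unique display addresses than street addresses
df['display_address'].nunique(), df['street_address'].nunique()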
Let's do one more simple thing and see how long the descriptions usually are.
In [13]:
df['description_length'] = df['description'].map(len)
sns.distplot(df['description_length']);
So, there is a large number of listings without a description, and the rest follows a nice distribution with its peak at about 600 characters. The average word length in English is said to be about 5.1 characters and the average sentence length about 14 words, giving an average sentence length of roughly 71.4 characters. So a typical offer would have around eight sentences.
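We can quantify the spike at zero directly (a small check):
# Number and share of listings with an empty description
(df['description_length'] == 0).sum(), (df['description_length'] == 0).mean()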
Finally, we should also check our target variable. So what is the distribution of the interest levels?
In [14]:
sns.countplot(df['interest_level']);
There are relatively few listings with high interest and many with low interest, so a model that always predicts "low" would perform best among the three constant-prediction baselines. However, this competition is not scored on a hard class decision, but on the predicted probabilities for each of the three interest levels (which could probably be mapped onto a numeric number of inquiries relatively easily).
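The empirical class frequencies double as the constant-probability baseline mentioned above:
# Class distribution = probabilities of an "always predict the base rates" model
df['interest_level'].value_counts(normalize=True)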