Rental Listings Analysis
By Prashant Tatineni

Project Overview

In this project, I attempt to predict the popularity (target variable: interest_level) of apartment rental listings in New York City from listing characteristics. The data comes from a Kaggle competition hosted in conjunction with renthop.com.

The dataset was provided as a single file train.json (49,352 rows).

An additional file, test.json (74,659 rows), contains the same columns as train.json, except that the target variable, interest_level, is missing. Predictions of the target variable are to be made on the test.json file and submitted to Kaggle.

Summary of Solution Steps

  1. Load data from JSON.
  2. Build initial predictor variables, with interest_level as the target.
  3. Run an initial set of classification models.
  4. Add category indicators and aggregated features based on manager_id.
  5. Run a new Random Forest model.
  6. Attempt to use the listing images for classification.
  7. Discuss further opportunities with this dataset.

In [1]:
# imports

import pandas as pd
import dateutil.parser
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import log_loss

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

%matplotlib inline

Step 1: Load Data


In [2]:
# Load the training dataset from Kaggle.
df = pd.read_json('train.json')
print(df.shape)


(49352, 15)

In [3]:
df.head(2)


Out[3]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address
10 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 2016-06-24 07:54:24 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Metropolitan Avenue [] medium 40.7145 7211212 -73.9425 5ba989232d0489da1b5f2c45f6688adc [https://photos.renthop.com/2/7211212_1ed4542e... 3000 792 Metropolitan Avenue
10000 1.0 2 c5c8a357cba207596b04d1afd1e4f130 2016-06-12 12:19:27 Columbus Avenue [Doorman, Elevator, Fitness Center, Cats Allow... low 40.7947 7150865 -73.9667 7533621a882f71e25173b27e3139d83d [https://photos.renthop.com/2/7150865_be3306c5... 5465 808 Columbus Avenue

The 15 columns consist of 14 feature columns plus 1 target:

  • 1 target variable (interest_level), with classes low, medium, high
  • 1 list of photo links
  • lat/long, street address, display address
  • listing_id, building_id, manager_id
  • numerical (price, bathrooms, bedrooms)
  • created date
  • text (description, features)

Features for modeling:

  • bathrooms
  • bedrooms
  • created date (calculate age of posting in days)
  • description (number of words in description)
  • features (number of features)
  • photos (number of photos)
  • price
  • features (split into category indicators)
  • manager_id (with manager skill level)

Further opportunities for modeling:

  • description (with NLP)
  • building_id (with a building popularity level)
  • photos (quality, discussed in Step 6)

In [43]:
# Distribution of target value: interest_level

s = df.groupby('interest_level')['listing_id'].count()
s.plot.bar();



In [5]:
df_high = df.loc[df['interest_level'] == 'high']
df_medium = df.loc[df['interest_level'] == 'medium']
df_low = df.loc[df['interest_level'] == 'low']

In [46]:
plt.figure(figsize=(6,10))
plt.scatter(df_low.longitude, df_low.latitude, color='yellow', alpha=0.2, marker='.', label='Low')
plt.scatter(df_medium.longitude, df_medium.latitude, color='green', alpha=0.2, marker='.',  label='Medium')
plt.scatter(df_high.longitude, df_high.latitude, color='purple', alpha=0.2, marker='.',  label='High')

plt.xlim(-74.04,-73.80)
plt.ylim(40.6,40.9)
plt.title('Map of the listings in NYC')
plt.ylabel('N Lat.')
plt.xlabel('W Long.')
plt.legend(loc=2);


Step 2: Initial Features


In [7]:
(pd.to_datetime(df['created'])).sort_values(ascending=False).head()


Out[7]:
28466   2016-06-29 21:41:47
34914   2016-06-29 18:30:41
622     2016-06-29 18:14:48
26349   2016-06-29 17:56:12
335     2016-06-29 17:47:34
Name: created, dtype: datetime64[ns]

In [8]:
# The most recent records are 6/29/2016. Computing days old from 6/30/2016.
df['days_old'] = (dateutil.parser.parse('2016-06-30') - pd.to_datetime(df['created'])).apply(lambda x: x.days)

In [9]:
# Add other "count" features
df['num_words'] = df['description'].apply(lambda x: len(x.split()))
df['num_features'] = df['features'].apply(len)
df['num_photos'] = df['photos'].apply(len)

Step 3: Modeling, First Pass


In [10]:
X = df[['bathrooms','bedrooms','price','latitude','longitude','days_old','num_words','num_features','num_photos']]
y = df['interest_level']

In [11]:
# Scale the features; Logistic Regression and KNN are sensitive to feature scale
X_scaled = pd.DataFrame(preprocessing.scale(X))
X_scaled.columns = X.columns

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)

Logistic Regression


In [13]:
lr = LogisticRegression()
lr.fit(X_train, y_train)


Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
y_test_predicted_proba = lr.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[14]:
0.72358877058510196

In [15]:
lr = LogisticRegression(solver='newton-cg', multi_class='multinomial')
lr.fit(X_train, y_train)


Out[15]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

In [16]:
y_test_predicted_proba = lr.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[16]:
0.71939997996865968

KNN


In [17]:
for i in [95,100,105]:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_test_predicted_proba = knn.predict_proba(X_test)
    print(log_loss(y_test, y_test_predicted_proba))


0.765759270842
0.765584342679
0.765560629848

Random Forest


In [33]:
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rf.fit(X_train, y_train)


Out[33]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [34]:
y_test_predicted_proba = rf.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[34]:
0.62981513318807736

Random Forest performs the best with respect to Log Loss.


In [40]:
y_test_predicted = rf.predict(X_test)
accuracy_score(y_test, y_test_predicted)


Out[40]:
0.72288863673204728

In [41]:
precision_recall_fscore_support(y_test, y_test_predicted)


Out[41]:
(array([ 0.49083503,  0.7713531 ,  0.48831488]),
 array([ 0.25556734,  0.92851254,  0.27341598]),
 array([ 0.33612273,  0.84266781,  0.35055188]),
 array([ 943, 8491, 2904]))

In [37]:
rf.classes_


Out[37]:
array([u'high', u'low', u'medium'], dtype=object)

In [42]:
plt.figure(figsize=(5,5))
pd.Series(index = X_train.columns, data = rf.feature_importances_).sort_values().plot(kind= 'bar');


Naive Bayes


In [72]:
bnb = BernoulliNB()
bnb.fit(X_train, y_train)


Out[72]:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [73]:
y_test_predicted_proba = bnb.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[73]:
0.76915047053579133

In [74]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)


Out[74]:
GaussianNB(priors=None)

In [75]:
y_test_predicted_proba = gnb.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[75]:
4.445755558020128

Neural Network


In [76]:
clf = MLPClassifier(hidden_layer_sizes=(100,50,10))
clf.fit(X_train, y_train)


Out[76]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100, 50, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [77]:
y_test_predicted_proba = clf.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[77]:
0.65649310927267335

Step 4: More Complex Features

Splitting out categories into 0/1 dummy variables


In [62]:
# Reduce the 1556 unique feature text values into 35 main categories

def reduce_categories(full_list):
    reduced_list = []
    for i in full_list:
        item = i.lower()
        if 'cats allowed' in item:
            reduced_list.append('cats')
        if 'dogs allowed' in item:
            reduced_list.append('dogs')
        if 'elevator' in item:
            reduced_list.append('elevator')
        if 'hardwood' in item:
            reduced_list.append('hardwood')
        if 'doorman' in item or 'concierge' in item:
            reduced_list.append('doorman')
        if 'dishwasher' in item:
            reduced_list.append('dishwasher')
        if 'laundry' in item or 'dryer' in item:
            if 'unit' in item:
                reduced_list.append('laundry_in_unit')
            else:
                reduced_list.append('laundry')
        if 'no fee' in item:
            reduced_list.append('no_fee')
        if 'reduced fee' in item:
            reduced_list.append('reduced_fee')
        if 'fitness' in item or 'gym' in item:
            reduced_list.append('gym')
        if 'prewar' in item or 'pre-war' in item:
            reduced_list.append('prewar')
        if 'dining room' in item:
            reduced_list.append('dining')
        if 'pool' in item:
            reduced_list.append('pool')
        if 'internet' in item:
            reduced_list.append('internet')
        if 'new construction' in item:
            reduced_list.append('new_construction')
        if 'wheelchair' in item:
            reduced_list.append('wheelchair')
        if 'exclusive' in item:
            reduced_list.append('exclusive')
        if 'loft' in item:
            reduced_list.append('loft')
        if 'simplex' in item:
            reduced_list.append('simplex')
        if 'fire' in item:
            reduced_list.append('fireplace')
        if 'lowrise' in item or 'low-rise' in item:
            reduced_list.append('lowrise')
        if 'midrise' in item or 'mid-rise' in item:
            reduced_list.append('midrise')
        if 'highrise' in item or 'high-rise' in item:
            reduced_list.append('highrise')
        if 'ceiling' in item:
            reduced_list.append('high_ceiling')
        if 'garage' in item or 'parking' in item:
            reduced_list.append('parking')
        if 'furnished' in item:
            reduced_list.append('furnished')
        if 'multi-level' in item:
            reduced_list.append('multilevel')
        if 'renovated' in item:
            reduced_list.append('renovated')
        if 'super' in item:
            reduced_list.append('live_in_super')
        if 'green building' in item:
            reduced_list.append('green_building')
        if 'appliances' in item:
            reduced_list.append('new_appliances')
        if 'luxury' in item:
            reduced_list.append('luxury')
        if 'penthouse' in item:
            reduced_list.append('penthouse')
        if 'deck' in item or 'terrace' in item or 'balcony' in item or 'outdoor' in item or 'roof' in item or 'garden' in item or 'patio' in item:
            reduced_list.append('outdoor_space')
    return list(set(reduced_list))

In [63]:
df['categories'] = df['features'].apply(reduce_categories)

In [30]:
# Concatenate all reduced category labels into one string for the word cloud
text = ' '.join(cat for cats in df['categories'] for cat in cats)

In [32]:
plt.figure(figsize=(12,6))
wc = WordCloud(background_color='white', width=1200, height=600).generate(text)
plt.title('Reduced Categories', fontsize=30)
plt.axis("off")
wc.recolor(random_state=0)
plt.imshow(wc);



In [64]:
# Create indicators
X_dummies = pd.get_dummies(df['categories'].apply(pd.Series).stack()).sum(level=0)

Aggregate manager_id to get features representing manager performance

Note: Manager performance must be aggregated ONLY over the training subset so it can be fairly validated against the test subset; otherwise, part of a manager's calculated skill level would come from listings in the test set. The train-test split is therefore performed in this step, before the manager-performance columns are created.


In [65]:
# Choose features for modeling (and sorting)
df = df.sort_values('listing_id')
X = df[['bathrooms','bedrooms','price','latitude','longitude','days_old','num_words','num_features','num_photos','listing_id','manager_id']]
y = df['interest_level']

In [66]:
# Merge indicators to X dataframe and sort again to match sorting of y
X = X.merge(X_dummies, how='outer', left_index=True, right_index=True).fillna(0)
X = X.sort_values('listing_id')

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [68]:
# For each manager in the training subset, compute the fraction of their listings at each interest level
mgr_perf = pd.concat([X_train.manager_id,pd.get_dummies(y_train)], axis=1).groupby('manager_id').mean()

In [69]:
mgr_perf.head(2)


Out[69]:
high low medium
manager_id
0000abd7518b94c35a90d64b56fbf3e6 0.0 0.272727 0.727273
001ce808ce1720e24a9510e014c69707 0.0 1.000000 0.000000

In [70]:
# Count each manager's listings, then score manager skill by weighting
# their listings: +1 for High, 0 for Medium, -1 for Low.
mgr_perf['manager_count'] = X_train.groupby('manager_id').count().iloc[:,1]
mgr_perf['manager_skill'] = mgr_perf['high']*1 + mgr_perf['medium']*0 + mgr_perf['low']*-1

In [71]:
# for training set
X_train = X_train.merge(mgr_perf.reset_index(), how='left', left_on='manager_id', right_on='manager_id')

In [72]:
# for test set
X_test = X_test.merge(mgr_perf.reset_index(), how='left', left_on='manager_id', right_on='manager_id')

# Managers unseen in the training subset have NaNs here; fill with mean skill and median count
X_test['manager_skill'] = X_test.manager_skill.fillna(X_test.manager_skill.mean())
X_test['manager_count'] = X_test.manager_count.fillna(X_test.manager_count.median())

In [73]:
# Delete unnecessary columns before modeling
del X_train['listing_id']
del X_train['manager_id']
del X_test['listing_id']
del X_test['manager_id']
del X_train['high']
del X_train['medium']
del X_train['low']
del X_test['high']
del X_test['medium']
del X_test['low']

Step 5: Modeling, Second Pass with Random Forest


In [78]:
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rf.fit(X_train, y_train)


Out[78]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [114]:
y_test_predicted_proba = rf.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[114]:
0.59991955704408473

In [79]:
y_test_predicted = rf.predict(X_test)
accuracy_score(y_test, y_test_predicted)


Out[79]:
0.73066947641432967

In [80]:
precision_recall_fscore_support(y_test, y_test_predicted)


Out[80]:
(array([ 0.51055662,  0.8005352 ,  0.46216088]),
 array([ 0.27366255,  0.90747871,  0.34740608]),
 array([ 0.35632954,  0.85065894,  0.39665033]),
 array([ 972, 8571, 2795]))

In [117]:
rf.classes_


Out[117]:
array([u'high', u'low', u'medium'], dtype=object)

In [122]:
plt.figure(figsize=(15,5))
pd.Series(index = X_train.columns, data = rf.feature_importances_).sort_values().plot(kind = 'bar');


As seen here, introducing the category indicators and manager-performance features has improved the model: test log loss drops from about 0.63 to 0.60. In particular, manager_skill is the dominant feature by importance in this Random Forest model.

Step 6: Image Classification Attempt

I did not use the actual listing images in my model. Here, I outline an attempt at classifying listings by image quality, using a "blurry image" detector that I created with a Convolutional Neural Network. For more details, see my discussion of that project.


In [100]:
import numpy as np
from keras.models import Sequential
from keras.layers.convolutional import Convolution2D
from keras.layers.pooling import MaxPooling2D
from keras.layers.core import Flatten, Dense, Activation, Dropout
from keras.preprocessing import image

In [ ]:
# My neural network layer sequence is based on the original LeNet architecture
model = Sequential()

# Layer 1
model.add(Convolution2D(32, 5, 5, input_shape=(192, 192, 3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Layer 2
model.add(Convolution2D(64, 5, 5))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))    

model.add(Flatten())

# Layer 3
model.add(Dense(1024))
model.add(Activation("relu"))
model.add(Dropout(0.5))

# Layer 4
model.add(Dense(512))
model.add(Activation("relu"))
model.add(Dropout(0.5))

# Layer 5
model.add(Dense(2))
model.add(Activation("softmax"))

# "lenet_weights.h5" is the file containing weights from my trained neural network.
model.load_weights('lenet_weights.h5')

In [142]:
# Loading three images from the dataset.
# img_1 & img_2 are from the same High popularity listing
# img_3 is from a Low popularity listing.
pics = []

img_1 = image.load_img('6811966_1.jpg', target_size=(192,192))
pics.append(np.asarray(img_1))
img_2 = image.load_img('6811966_2.jpg', target_size=(192,192))
pics.append(np.asarray(img_2))
img_3 = image.load_img('6812150_1.jpg', target_size=(192,192))
pics.append(np.asarray(img_3))
pics_array = np.stack(pics)/255.
plt.figure(figsize=(12,12))
plt.subplot(131),plt.imshow(img_1),plt.title('6811966, interest_level: High')
plt.xticks([]), plt.yticks([])
plt.subplot(132),plt.imshow(img_2),plt.title('6811966, interest_level: High')
plt.xticks([]), plt.yticks([])
plt.subplot(133),plt.imshow(img_3),plt.title('6812150, interest_level: Low')
plt.xticks([]), plt.yticks([])
plt.show()



In [134]:
model.predict_classes(pics_array)


3/3 [==============================] - 0s
Out[134]:
array([0, 1, 1])

My model classified the first image for listing 6811966 as 0 = clear, but the second as 1 = blurry, likely because of the strong white wash-out from sunlight in the second image. The third image was correctly classified as 1 = blurry; it is indeed a blurry photo. However, blurriness alone is probably not enough to decide listing popularity. As seen below, the typical listing has 5 photos; the High-interest listing discussed here has 7, while the Low-interest listing has only this single photo. So the number of photos is just as likely as blurriness to affect listing popularity in this case.


In [151]:
df['num_photos'].mode()


Out[151]:
0    5
dtype: int64

In [154]:
df['num_photos'].median()


Out[154]:
5.0

In [143]:
df[df.listing_id == 6811966][['listing_id','description','interest_level','num_photos']]


Out[143]:
listing_id description interest_level num_photos
114617 6811966 --- East 31st St & Lexington Avenue --- This S... high 7

In [144]:
df[df.listing_id == 6812150][['listing_id','description','interest_level','num_photos']]


Out[144]:
listing_id description interest_level num_photos
93083 6812150 Great bright and spacious 2 bedrooms two bathr... low 1

Step 7: Prediction and Further Opportunities

To make a prediction for submission to Kaggle, the feature-engineering and modeling steps in this notebook can be re-run on the test.json dataset. The submission requires the predicted high, medium, and low probabilities for each listing_id, and Kaggle rankings are based on the Log Loss of those predictions on the test set.
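
A minimal sketch of that submission step is shown below. It is not run in this notebook: X_kaggle is a hypothetical DataFrame holding the engineered features for test.json, and rf is the fitted Random Forest from Step 5.

In [ ]:
# Sketch only (not run here): X_kaggle is a placeholder for test.json passed
# through the same feature engineering as Steps 2 and 4 above.
test_df = pd.read_json('test.json')
# ... rebuild days_old, num_words, category indicators, manager features, etc. on test_df ...
# X_kaggle = test_df[...]  # same columns as X_train

proba = rf.predict_proba(X_kaggle)                      # column order follows rf.classes_
submission = pd.DataFrame(proba, columns=rf.classes_)   # 'high', 'low', 'medium'
submission['listing_id'] = test_df['listing_id'].values
submission[['listing_id', 'high', 'medium', 'low']].to_csv('submission.csv', index=False)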

Further opportunities to improve prediction on this dataset may lie in NLP of the text descriptions, which so far have been used only as a word count. Building popularity could also be assessed via the building_id variable, similar to the aggregation performed on manager_id; a rough sketch of that idea follows.
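
The sketch below derives a building_skill feature the same way as manager_skill, again aggregating only over the training subset to avoid leakage. It is not run here and assumes a building_id column was carried along in X_train and X_test, which the notebook above does not currently do.

In [ ]:
# Sketch only: building popularity analogous to manager_skill.
# Assumes X_train / X_test still carry a building_id column at this point.
bldg_perf = pd.concat([X_train.building_id, pd.get_dummies(y_train)], axis=1).groupby('building_id').mean()
bldg_perf['building_skill'] = bldg_perf['high'] - bldg_perf['low']

X_train = X_train.merge(bldg_perf[['building_skill']].reset_index(), how='left', on='building_id')
X_test = X_test.merge(bldg_perf[['building_skill']].reset_index(), how='left', on='building_id')

# Buildings unseen in training get the mean training-set skill
X_test['building_skill'] = X_test.building_skill.fillna(X_train.building_skill.mean())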


In [ ]: