Project:  McNulty
Date:     02/22/2017
Name:     Prashant Tatineni

Project Overview

In this project, I attempt to predict the popularity (target variable: interest_level) of apartment rental listings based on listing characteristics. The data comes from a Kaggle Competition.

AWS and SQL were not needed to assemble the dataset, since it was provided as a single file, train.json (49,352 rows).

An additional file, test.json (74,659 rows), contains the same columns as train.json except the target variable, interest_level. Predictions of interest_level are made on test.json and submitted to Kaggle.

Summary of Solution Steps

  1. Load data from JSON.
  2. Build initial predictor variables, with interest_level as the target.
  3. Initial run of classification models.
  4. Add category indicators and aggregated features based on manager_id.
  5. Run new Random Forest model.
  6. Predict interest_level for the available test dataset.

In [18]:
# imports

import pandas as pd
import dateutil.parser
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import log_loss

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

%matplotlib inline

Step 1: Load Data


In [2]:
# Load the training dataset from Kaggle.
df = pd.read_json('data/raw/train.json')
print df.shape


(49352, 15)

In [3]:
df.head(2)


Out[3]:
bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id longitude manager_id photos price street_address
10 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 2016-06-24 07:54:24 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Metropolitan Avenue [] medium 40.7145 7211212 -73.9425 5ba989232d0489da1b5f2c45f6688adc [https://photos.renthop.com/2/7211212_1ed4542e... 3000 792 Metropolitan Avenue
10000 1.0 2 c5c8a357cba207596b04d1afd1e4f130 2016-06-12 12:19:27 Columbus Avenue [Doorman, Elevator, Fitness Center, Cats Allow... low 40.7947 7150865 -73.9667 7533621a882f71e25173b27e3139d83d [https://photos.renthop.com/2/7150865_be3306c5... 5465 808 Columbus Avenue

The 15 columns break down into 14 predictors plus 1 target:

  • 1 target variable (interest_level), with classes low, medium, high
  • 1 photo link
  • lat/long, street address, display address
  • listing_id, building_id, manager_id
  • numerical (price, bathrooms, bedrooms)
  • created date
  • text (description, features)

Features for modeling:

  • bathrooms
  • bedrooms
  • created date (calculate age of posting in days)
  • description (number of words in description)
  • features (number of features)
  • photos (number of photos)
  • price
  • features (split into category indicators)
  • manager_id (with manager skill level)

Further opportunities for modeling:

  • description (full text parsing; see the sketch after this list)
  • building_id (possibly with a building popularity level)
  • photos (quality)
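
As a rough sketch of the first of these, the description text could be vectorized with a bag-of-words model and scored on its own. The snippet below is only illustrative and is not run in this notebook; it assumes nothing beyond the df loaded above, and the names (vectorizer, X_text) are placeholders.

# Hedged sketch: TF-IDF features from the listing descriptions (not run here)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_text = vectorizer.fit_transform(df['description'].fillna(''))

# Quick check of how much signal the raw text carries by itself
scores = cross_val_score(LogisticRegression(), X_text, df['interest_level'],
                         scoring='neg_log_loss', cv=3)
print(-scores.mean())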

In [6]:
# Distribution of target, interest_level

s = df.groupby('interest_level')['listing_id'].count()
s.plot.bar();



In [9]:
df_high = df.loc[df['interest_level'] == 'high']
df_medium = df.loc[df['interest_level'] == 'medium']
df_low = df.loc[df['interest_level'] == 'low']

In [17]:
plt.figure(figsize=(6,10))
plt.scatter(df_low.longitude, df_low.latitude, color='yellow', alpha=0.2, marker='.', label='Low')
plt.scatter(df_medium.longitude, df_medium.latitude, color='green', alpha=0.2, marker='.',  label='Medium')
plt.scatter(df_high.longitude, df_high.latitude, color='purple', alpha=0.2, marker='.',  label='High')

plt.xlim(-74.04,-73.80)
plt.ylim(40.6,40.9)
plt.legend(loc=2);


Step 2: Initial Features


In [22]:
(pd.to_datetime(df['created'])).sort_values(ascending=False).head()


Out[22]:
28466   2016-06-29 21:41:47
34914   2016-06-29 18:30:41
622     2016-06-29 18:14:48
26349   2016-06-29 17:56:12
335     2016-06-29 17:47:34
Name: created, dtype: datetime64[ns]

In [24]:
# The most recent records are 6/29/2016. Computing days old from 6/30/2016.
df['days_old'] = (dateutil.parser.parse('2016-06-30') - pd.to_datetime(df['created'])).apply(lambda x: x.days)

In [25]:
# Add other "count" features
df['num_words'] = df['description'].apply(lambda x: len(x.split()))
df['num_features'] = df['features'].apply(len)
df['num_photos'] = df['photos'].apply(len)

Step 3: Modeling, First Pass


In [62]:
X = df[['bathrooms','bedrooms','price','latitude','longitude','days_old','num_words','num_features','num_photos']]
y = df['interest_level']

In [63]:
# Scaling is necessary for Logistic Regression and KNN
X_scaled = pd.DataFrame(preprocessing.scale(X))
X_scaled.columns = X.columns

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)

Logistic Regression


In [65]:
lr = LogisticRegression()
lr.fit(X_train, y_train)


Out[65]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [66]:
y_test_predicted_proba = lr.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[66]:
0.72358863733883683

In [67]:
lr = LogisticRegression(solver='newton-cg', multi_class='multinomial')
lr.fit(X_train, y_train)


Out[67]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

In [68]:
y_test_predicted_proba = lr.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[68]:
0.71939998004777495

KNN


In [69]:
for i in [95,100,105]:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_test_predicted_proba = knn.predict_proba(X_test)
    print log_loss(y_test, y_test_predicted_proba)


0.765759270842
0.765584342679
0.765560629848

Random Forest


In [70]:
rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
rf.fit(X_train, y_train)


Out[70]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=1000, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [71]:
y_test_predicted_proba = rf.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[71]:
0.62858925206677441

Naive Bayes


In [72]:
bnb = BernoulliNB()
bnb.fit(X_train, y_train)


Out[72]:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [73]:
y_test_predicted_proba = bnb.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[73]:
0.76915047053579133

In [74]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)


Out[74]:
GaussianNB(priors=None)

In [75]:
y_test_predicted_proba = gnb.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[75]:
4.445755558020128

Neural Network


In [76]:
clf = MLPClassifier(hidden_layer_sizes=(100,50,10))
clf.fit(X_train, y_train)


Out[76]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100, 50, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [77]:
y_test_predicted_proba = clf.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[77]:
0.65649310927267335

Step 4: More Features

Splitting out categories into 0/1 dummy variables


In [26]:
# Reduce the 1556 unique feature text values into 35 main categories

def reduce_categories(full_list):
    reduced_list = []
    for i in full_list:
        item = i.lower()
        if 'cats allowed' in item:
            reduced_list.append('cats')
        if 'dogs allowed' in item:
            reduced_list.append('dogs')
        if 'elevator' in item:
            reduced_list.append('elevator')
        if 'hardwood' in item:
            reduced_list.append('hardwood')
        if 'doorman' in item or 'concierge' in item:
            reduced_list.append('doorman')
        if 'dishwasher' in item:
            reduced_list.append('dishwasher')
        if 'laundry' in item or 'dryer' in item:
            if 'unit' in item:
                reduced_list.append('laundry_in_unit')
            else:
                reduced_list.append('laundry')
        if 'no fee' in item:
            reduced_list.append('no_fee')
        if 'reduced fee' in item:
            reduced_list.append('reduced_fee')
        if 'fitness' in item or 'gym' in item:
            reduced_list.append('gym')
        if 'prewar' in item or 'pre-war' in item:
            reduced_list.append('prewar')
        if 'dining room' in item:
            reduced_list.append('dining')
        if 'pool' in item:
            reduced_list.append('pool')
        if 'internet' in item:
            reduced_list.append('internet')
        if 'new construction' in item:
            reduced_list.append('new_construction')
        if 'wheelchair' in item:
            reduced_list.append('wheelchair')
        if 'exclusive' in item:
            reduced_list.append('exclusive')
        if 'loft' in item:
            reduced_list.append('loft')
        if 'simplex' in item:
            reduced_list.append('simplex')
        if 'fire' in item:
            reduced_list.append('fireplace')
        if 'lowrise' in item or 'low-rise' in item:
            reduced_list.append('lowrise')
        if 'midrise' in item or 'mid-rise' in item:
            reduced_list.append('midrise')
        if 'highrise' in item or 'high-rise' in item:
            reduced_list.append('highrise')
        if 'ceiling' in item:
            reduced_list.append('high_ceiling')
        if 'garage' in item or 'parking' in item:
            reduced_list.append('parking')
        if 'furnished' in item:
            reduced_list.append('furnished')
        if 'multi-level' in item:
            reduced_list.append('multilevel')
        if 'renovated' in item:
            reduced_list.append('renovated')
        if 'super' in item:
            reduced_list.append('live_in_super')
        if 'green building' in item:
            reduced_list.append('green_building')
        if 'appliances' in item:
            reduced_list.append('new_appliances')
        if 'luxury' in item:
            reduced_list.append('luxury')
        if 'penthouse' in item:
            reduced_list.append('penthouse')
        if 'deck' in item or 'terrace' in item or 'balcony' in item or 'outdoor' in item or 'roof' in item or 'garden' in item or 'patio' in item:
            reduced_list.append('outdoor_space')
    return list(set(reduced_list))

In [27]:
df['categories'] = df['features'].apply(reduce_categories)

In [30]:
# Concatenate all reduced category names into one string for the word cloud
text = ' '.join(' '.join(row) for row in df['categories'])

In [32]:
plt.figure(figsize=(12,6))
wc = WordCloud(background_color='white', width=1200, height=600).generate(text)
plt.title('Reduced Categories', fontsize=30)
plt.axis("off")
wc.recolor(random_state=0)
plt.imshow(wc);



In [ ]:
# Create 0/1 indicators: expand each listing's category list into rows, one-hot
# encode, then collapse back to one row per listing by summing over the index
X_dummies = pd.get_dummies(df['categories'].apply(pd.Series).stack()).sum(level=0)

Aggregate manager_id to get features representing manager performance

Note: Manager performance must be aggregated ONLY over the training subset; otherwise target information from the held-out rows would leak into the features used for validation. The train-test split is therefore performed in this step, before the manager performance columns are created.


In [79]:
# Sort by listing_id and choose the features for modeling (listing_id and manager_id are kept for now)
df = df.sort_values('listing_id')
X = df[['bathrooms','bedrooms','price','latitude','longitude','days_old','num_words','num_features','num_photos','listing_id','manager_id']]
y = df['interest_level']

In [82]:
# Merge indicators to X dataframe and sort again to match sorting of y
X = X.merge(X_dummies, how='outer', left_index=True, right_index=True).fillna(0)
X = X.sort_values('listing_id')

In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [103]:
# Compute each manager's share of high/medium/low listings (training rows only)
mgr_perf = pd.concat([X_train.manager_id,pd.get_dummies(y_train)], axis=1).groupby('manager_id').mean()

In [104]:
mgr_perf.head(2)


Out[104]:
high low medium
manager_id
0000abd7518b94c35a90d64b56fbf3e6 0.0 0.272727 0.727273
001ce808ce1720e24a9510e014c69707 0.0 1.000000 0.000000

In [105]:
mgr_perf['manager_count'] = X_train.groupby('manager_id').count().iloc[:,1]
mgr_perf['manager_skill'] = mgr_perf['high']*1 + mgr_perf['medium']*0 + mgr_perf['low']*-1

In [106]:
# Attach manager performance features to the training set
X_train = X_train.merge(mgr_perf.reset_index(), how='left', left_on='manager_id', right_on='manager_id')

In [109]:
# Attach manager performance features to the test set
X_test = X_test.merge(mgr_perf.reset_index(), how='left', left_on='manager_id', right_on='manager_id')

# Fill NAs (managers not seen in training) with mean skill and median count
X_test['manager_skill'] = X_test.manager_skill.fillna(X_test.manager_skill.mean())
X_test['manager_count'] = X_test.manager_count.fillna(X_test.manager_count.median())

In [111]:
# Drop the ID columns and the raw ratio columns before modeling
for col in ['listing_id', 'manager_id', 'high', 'medium', 'low']:
    del X_train[col]
    del X_test[col]

Step 5: Modeling, Second Pass with Random Forest


In [112]:
rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
rf.fit(X_train, y_train)


Out[112]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=1000, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [114]:
y_test_predicted_proba = rf.predict_proba(X_test)
log_loss(y_test, y_test_predicted_proba)


Out[114]:
0.59991955704408473

In [115]:
y_test_predicted = rf.predict(X_test)
accuracy_score(y_test, y_test_predicted)


Out[115]:
0.73131787972118656

In [116]:
precision_recall_fscore_support(y_test, y_test_predicted)


Out[116]:
(array([ 0.5028463 ,  0.80115155,  0.46330935]),
 array([ 0.27263374,  0.90911212,  0.34561717]),
 array([ 0.35356905,  0.85172433,  0.39590164]),
 array([ 972, 8571, 2795]))

In [117]:
rf.classes_


Out[117]:
array([u'high', u'low', u'medium'], dtype=object)

In [122]:
plt.figure(figsize=(15,5))
pd.Series(index = X_train.columns, data = rf.feature_importances_).sort_values().plot(kind = 'bar');


Introducing the category indicators and the manager performance features improved the model: the random forest's log loss dropped from about 0.629 to 0.600. In particular, manager_skill stands out as the most important feature in this random forest.

Step 6: Prediction and Next opportunities

To make a prediction for submission to Kaggle, the same pipeline is rerun on the test.json dataset. The submission requires the predicted high, medium, and low probabilities for each listing_id; a minimal sketch of that final step is shown below.
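
The sketch assumes a dataframe df_test built from test.json with the same engineered columns as X_train, and a random forest rf refit on the full training data; these names are illustrative, not cells from this notebook.

# Hedged sketch of the Kaggle submission step (df_test and rf assumed as above)
test_proba = rf.predict_proba(df_test[X_train.columns])
submission = pd.DataFrame(test_proba, columns=rf.classes_)
submission['listing_id'] = df_test['listing_id'].values

# Kaggle expects listing_id plus the three class probabilities
submission = submission[['listing_id', 'high', 'medium', 'low']]
submission.to_csv('submission.csv', index=False)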

Further opportunities to improve prediction on this dataset lie in the text descriptions and photos, which so far enter the model only as simple counts (number of words, number of photos). Building popularity could also be derived from the building_id variable, analogous to the manager_id aggregation; a sketch of that idea follows.
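
The sketch below mirrors the manager_skill construction and assumes building_id is still carried in X_train and X_test (it was dropped from X above, so it would need to be retained through the split); the column names are illustrative.

# Hedged sketch: building popularity, mirroring the manager_skill feature
bldg_perf = pd.concat([X_train.building_id, pd.get_dummies(y_train)], axis=1) \
              .groupby('building_id').mean()
bldg_perf['building_popularity'] = bldg_perf['high'] - bldg_perf['low']

X_train = X_train.merge(bldg_perf[['building_popularity']].reset_index(),
                        how='left', on='building_id')
X_test = X_test.merge(bldg_perf[['building_popularity']].reset_index(),
                      how='left', on='building_id')

# Buildings not seen in training fall back to the average popularity
X_test['building_popularity'] = X_test.building_popularity.fillna(
    bldg_perf.building_popularity.mean())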

It would also be valuable to tune the hyperparameters of the model used here, for example with GridSearchCV.
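
A minimal sketch of such a search over the random forest is shown below; the grid values are illustrative assumptions, not tuned results from this notebook.

# Hedged sketch: grid search over a few random forest hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [500, 1000],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=42),
                    param_grid, scoring='neg_log_loss', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(-grid.best_score_)  # mean cross-validated log loss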