A Study of Airbnb Data - Boston, MA Metro Area

Team Hamel Husain - Xiaoyu Li - Yi Mao - Lana Awad

Project Website http://hamelsmu.github.io/AirbnbScrape/

Motivation

Airbnb is a website that lets people rent out lodging. One of our team members is an Airbnb host, and we wanted to optimize the price of his listing and understand questions such as: How do other hosts nearby price their listings, relative to dimensions such as location, amenities, reviews, and number of beds? How can we set our price more competitively?

Agenda

Our project focuses on three main parts:

  • Scraping data from the Airbnb website for the following cities: Boston, Cambridge, Brighton, Brookline, Watertown, Charlestown, Medford, and Somerville.
  • Dataset Cleanup
  • Data Visualization and Trend Analysis

Data Scraping

Our ScrapeAirbnb file contains two main functions: IterateMainPage() and iterateDetail().

  • IterateMainPage: this function takes a location string and a page limit as parameters and downloads a list of dictionaries corresponding to all of the distinct listings for that location. For example, calling IterateMainPage('Cambridge--MA', 10) will scrape all of the distinct listings that appear on pages 1-10 of the search results for that location. The output is a list of dictionaries, with each dictionary corresponding to one unique listing. The location string is in the format 'City--State', as that is how the URL is structured.
  • iterateDetail: this reads in the output of IterateMainPage() and visits each specific listing to get more detailed information. If more detailed information is found, the listing's dictionary is updated with the additional values. A rough sketch of the general scraping approach follows below.
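For readers who want a feel for the mechanics, here is a minimal, hypothetical sketch of the mechanize + lxml approach (the URL pattern matches the baseurl values seen later in the data; the '/rooms/' link selector and the helper name fetch_listing_urls are illustrative assumptions, not the project's actual parsing logic):

import mechanize
import lxml.html

def fetch_listing_urls(location, page):
    # Hypothetical helper (not the actual IterateMainPage code): download one
    # search-results page and collect the '/rooms/<id>' links that identify listings.
    br = mechanize.Browser()
    br.set_handle_robots(False)                      # allow scripted access
    br.addheaders = [('User-agent', 'Mozilla/5.0')]  # present a browser-like user agent
    url = 'https://www.airbnb.com/s/%s?page=%d' % (location, page)
    html = br.open(url).read()
    doc = lxml.html.fromstring(html)
    # the exact markup is an assumption; individual listing links contain '/rooms/'
    return sorted(set(a.get('href') for a in doc.xpath("//a[contains(@href, '/rooms/')]")))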

In [1]:
from ScrapeAirbnb import *
#ScrapeAirbnb is a separate Python file; in order to run it, please install the mechanize, cookielib, and lxml libraries


ScrapeAirbnb.py:35: UserWarning: gzip transfer encoding is experimental!
  br.set_handle_gzip(True)

In [2]:
test = IterateMainPage('Cambridge-MA', 1)
test2 = iterateDetail(test)


Processing Main Page 1 out of 1
Done Processing Main Page
Processing Listing 1 out of 18
Processing Listing 2 out of 18
Processing Listing 3 out of 18
Processing Listing 4 out of 18
Unable to parse stars listing id: 4615770
Processing Listing 5 out of 18
Processing Listing 6 out of 18
Processing Listing 7 out of 18
Processing Listing 8 out of 18
Processing Listing 9 out of 18
Processing Listing 10 out of 18
Processing Listing 11 out of 18
Processing Listing 12 out of 18
Processing Listing 13 out of 18
Processing Listing 14 out of 18
Processing Listing 15 out of 18
Processing Listing 16 out of 18
Processing Listing 17 out of 18
Processing Listing 18 out of 18

Here is an example of what the scraped dataset looks like.


In [3]:
import pandas as pd
test2 = pd.DataFrame(test2)
test2.head()


Out[3]:
A_AC A_Breakfast A_CableTV A_CarbonMonoxDetector A_Doorman A_Dryer A_Elevator A_Essentials A_Events A_FamilyFriendly A_FireExt A_Fireplace A_FirstAidKit A_Gym A_Heat A_HotTub A_Intercom A_Internet A_Kitchen A_Parking
0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 ...
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 ...
2 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 ...
3 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 ...
4 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1 1 ...

5 rows × 66 columns

Data Cleaning

Our DataCleanAirbnb.py file cleans the dataset scraped from the Airbnb website. Its main function, DataClean, takes the raw data as input and: calculates the length of membership in days; parses the ShortDesc variable (a string such as "Private room · 14 reviews · Cambridge") into three variables, namely property type, number of reviews, and neighborhood; and infers the gender of the host from their first name (female, male, mostly_female, mostly_male, couple, or andy, where andy stands for ambiguous or unrecognized names). The function writes the cleaned data out as a csv file named Final_v2.csv.
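As an illustration of the ShortDesc parsing and member-length steps, here is a minimal sketch only; the real logic lives in DataCleanAirbnb.py, and parse_short_desc is a hypothetical helper name:

import pandas as pd

def parse_short_desc(short_desc):
    # e.g. u'Private room \xb7 14 reviews \xb7 Cambridge' -> type, review count, neighborhood
    prop_type, reviews, neighborhood = [p.strip() for p in short_desc.split(u'\xb7')]
    return prop_type, int(reviews.split()[0]), neighborhood

parse_short_desc(u'Private room \xb7 14 reviews \xb7 Cambridge')
# -> (u'Private room', 14, u'Cambridge')

# membership length in days, computed from the "member since" date on the host profile
member_length = (pd.to_datetime('today') - pd.to_datetime('December 2011')).days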


In [2]:
from DataCleanAirbnb import*

In [3]:
# for simplicity, read in only the first 10 rows of the raw data and show the cleaned output
data = pd.read_csv("airbnbData.csv")[0:10]
DataClean(data)


Out[3]:
MemberLength HostGender SD_PropType SD_NumReviews SD_Neighborhood
0 1090 mostly_female Entire home/apt 24 Harvard Square, Cambridge
1 1243 male Entire home/apt 17 Charlestown, Boston
2 999 female Entire home/apt 5 Charlestown
3 1028 andy Entire home/apt 60 Brookline
4 907 female Private room 11 Brookline
5 1120 mostly_female Private room 26 Brookline
6 420 female Shared room 22 Brookline
7 662 andy Private room 4 Coolidge Corner, Brookline
8 1243 male Entire home/apt 18 Brookline
9 116 male Private room 24 Brookline

Imputation of data and filtering of outliers

We used the most-frequent strategy to impute missing values before fitting the Random Forest model.
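A minimal sketch of that imputation step on a toy frame (the same Imputer call appears again in the Random Forest section below):

import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer

toy = pd.DataFrame({'A_Gym': [0, np.nan, 0, 1], 'S_Bedrooms': [1, 2, np.nan, 1]})
imputer = Imputer(strategy='most_frequent')
# NaNs in each column are replaced by that column's most frequent value
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)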

Filtering Data


In [20]:
dat = pd.read_csv('Final_v2.csv', na_values=['Not Found'])

In [21]:
def filterAirbnbListings(x):
    filtered_dat = x[(x.SD_NumReviews > 1) & (x.MemberLength < 70000)]
    return filtered_dat

filtered_dat = filterAirbnbListings(dat)

Encoding - Dummy Variables

We want to convert some of the categorical variables in our data set to numeric values so we can more easily apply dimensionality reduction and clustering techniques. Below are the functions we use to do this.
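The actual encoding is done by dummyCode in DummyOneHot.py; the sketch below only illustrates the underlying idea with pandas' get_dummies, producing indicator columns named the way they appear later (e.g. SD_PropType_Entire home/apt):

import pandas as pd

example = pd.DataFrame({'SD_PropType': ['Private room', 'Entire home/apt', 'Shared room']})
# each category becomes its own 0/1 column, prefixed with the original column name
dummies = pd.get_dummies(example.SD_PropType, prefix='SD_PropType')
pd.concat([example, dummies], axis=1)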


In [22]:
from DummyOneHot import *

TransformedDat = dummyCode(filtered_dat)

TransformedDat.head()


Out[22]:
Unnamed: 0 ListingID Title UserID baseurl Price AboutListing HostName MemberDate Lat Long PageCounter PageNumber A_AC A_Breakfast A_CableTV A_CarbonMonoxDetector A_Doorman A_Dryer A_TV
0 0 281552 Harvard Sq Large 1BR overlooks park 1467518 https://www.airbnb.com/s/Cambridge--MA?page=1 175 \n Mary Catherine December 2011 42.377119 -71.120112 1 1 0 0 0 0 0 1 0 ...
1 1 182613 Luxury 2BR condo Charlestown Boston 875739 https://www.airbnb.com/s/Charlestown-MA?page=1 249 Entire large modern quiet city condo near ever... Max July 2011 42.377387 -71.060435 2 1 1 0 0 0 0 1 1 ...
2 2 1587540 Cozy House on Bunker Hill in Boston 2004732 https://www.airbnb.com/s/Charlestown-MA?page=1 225 \n Finola March 2012 42.378898 -71.061182 1 1 1 0 0 0 0 1 1 ...
3 3 469506 Luxury 1bd, Safe/Central Brookline 1766477 https://www.airbnb.com/s/Brookline--MA?page=1 140 \n Rupal February 2012 42.338848 -71.135475 18 1 1 0 1 0 0 1 1 ...
4 4 3937268 Boston bedroom & private bathroom 2530197 https://www.airbnb.com/s/Brookline--MA?page=1 99 I offer a private basement bedroom with a priv... Natasha June 2012 42.343206 -71.119955 17 1 1 0 0 0 0 0 0 ...

5 rows × 93 columns

Data Exploration

To get a general idea about the distribution of different potential predictors of price, we generated box-plots of these variables using Tableau software. We looked for trends to guide our further analysis.

a- Booking and Host Variables


In [1]:
from IPython.display import Image
Image(filename='Booking and Host.jpg')


Out[1]:

b- Amenities


In [4]:
Image(filename='Amenities.jpg')


Out[4]:

In [5]:
Image(filename='Amenties 2.jpg')


Out[5]:

In [6]:
Image(filename='Amenities 3.jpg')


Out[6]:

We found no striking conclusions just from visualizing the data, although a strict cancellation policy was associated with the highest-priced listings.

As we would expect, luxury amenities such as doorman, gym, pool, fireplace and elevator were associated with higher prices. Amenities that are considered essential and common to probably all listings on airbnb such as internet and safety devices showed no association with price.
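As a rough numeric companion to the box plots (column names as in our cleaned dataset; this is a quick check rather than part of the original pipeline), median price can be compared with and without each luxury amenity:

price = filtered_dat.Price.astype(float)   # guard in case Price was read in as text
for amenity in ['A_Doorman', 'A_Gym', 'A_Pool', 'A_Elevator', 'A_Internet']:
    # median price for listings without (0) and with (1) the amenity
    print amenity, price.groupby(filtered_dat[amenity]).median().to_dict()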

c- Member Length

We calculated member length by using the date the member joined Airbnb and graphed this to see the distribution.


In [23]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting 
import seaborn as sns

In [24]:
dat = pd.read_csv('Final_v2.csv', na_values=['Not Found'])

In [25]:
plt.hist(dat.MemberLength)
plt.xlabel('Member Length')
plt.ylabel('Number of Properties')
plt.title('Histogram of Member Length')
plt.show()



We will want to filter out members whose membership length is greater than 70,000 days, as those values are outliers and likely anomalies in the data.


In [26]:
from pylab import rcParams
rcParams['figure.figsize'] = 10, 5
plt.hist(dat.SD_NumReviews, bins = 50)
plt.xlabel('Number of Reviews')
plt.ylabel('Number of Properties')
plt.title('Histogram of Number of Reviews')
plt.show()


We also filtered out listings that don't have at least 3 reviews, as we want to capture properties that are actually being rented rather than inactive listings.

We plotted the number of reviews against the member length, to see how strong the relationship is.


In [27]:
filtered_dat = dat[(dat.SD_NumReviews > 2) & (dat.MemberLength < 70000)]
sns.lmplot("MemberLength", "SD_NumReviews", hue="SD_PropType", col = 'SD_PropType', data=filtered_dat, fit_reg=True)
plt.show()


Observations

It's pretty obvious that there is a wide distribution in how aggressively people rent out their properties. Some people have been members for very long periods of time yet have not rented out their property as much as people who have been members for a relatively short amount of time. This means that when comparing the price of different properties, we will have to take into account that even though a property is listed, the owner may nevertheless not be willing to rent it. One approach may be to look at the ratio of number of reviews to membership length and see how that relates to price. Below I plot a histogram of number of reviews / membership length.


In [28]:
rcParams['figure.figsize'] = 20, 5
plt.hist(filtered_dat.SD_NumReviews / filtered_dat.MemberLength, bins = 30)
plt.xlabel('Ratio of Reviews To Membership Length')
plt.ylabel('Number of Properties')
plt.title('Histogram of Reviews/Membership Length')
plt.show()


We came up with a metric calculated as follows: (Number of Reviews / Membership Length). This is meant to "measure", in a crude way, how actively a property has been rented. We wanted to normalize the number of reviews by the amount of time the property has been available for rent. While we don't have perfect information, we used this metric as a proxy to check for outliers. The above histogram confirms our earlier observation that there is a wide range of activity among properties in terms of how often they are rented on Airbnb.

Correlation Plot

This plot shows the pairwise correlation between all the variables in the dataset


In [29]:
sns.set(style="darkgrid")

f, ax = plt.subplots(figsize=(9, 9))
sns.corrplot(filtered_dat, annot=False, sig_stars=False,
             diag_names=False, ax=ax)
plt.title('Correlation Matrix - All Variables')
plt.show()


Observations From Correlation Plot:

  • The review ratings are highly correlated with one another (good reviews for cleanliness go along with good reviews for communication).
  • Price seems to be most strongly correlated with three variables: the number of people a property accommodates, the number of bedrooms, and the number of beds (essentially the space). Furthermore, these three variables are all correlated with each other!
  • Surprisingly, price is somewhat negatively correlated with the number of reviews. This might be noise created by mixing different types of properties together and will be worth revisiting later (see the sketch after this list).
  • Some amenities are highly correlated in ways that make intuitive sense, for example washer + dryer, gym + pool, and doorman + elevator. This suggests that we can reduce the dimensionality of the amenities.
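As a hedged follow-up to the third bullet, one quick way to revisit the price/reviews relationship is to compute the correlation separately within each accommodation type, so that different property types are no longer mixed together:

for prop_type, grp in filtered_dat.groupby('SD_PropType'):
    # correlation of price with review count within a single accommodation type
    print prop_type, round(grp.Price.astype(float).corr(grp.SD_NumReviews), 3)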

In [30]:
filtered_dat.HostGender.value_counts()


Out[30]:
female           501
male             402
andy             173
couple           166
mostly_female     57
mostly_male       56
dtype: int64

One thing we also tried was to use host names to infer the host's gender. We used the Python package sexmachine to map first names to a gender; however, it turned out that many names did not yield a clear gender. In the table above, 'andy' means that the gender is not clear from the host name.
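A minimal sketch of that step with the sexmachine package ('andy' is the package's own label for androgynous or unrecognized first names; the 'couple' label in the table above comes from our own DataClean step rather than from sexmachine):

from sexmachine.detector import Detector

detector = Detector()
for first_name in ['Mary', 'Max', 'Finola', 'Rupal']:
    # returns one of: male, female, mostly_male, mostly_female, andy
    print first_name, '->', detector.get_gender(first_name)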

Clustering Of Properties

Since there are so many types of properties, and I want to be able to compare similar properties to each other, I want to create clusters of properties so that I can group them together more easily. I am going to use the following attributes to cluster properties:

  • All of the amenity attributes (columns that start with "A_")
  • S_PropType: the type of property (House, Apartment, etc.)
  • SD_PropType: the type of accommodation (Private Room, Entire House, Shared Room, etc.)

I did not want to cluster using price or reviews, as I wanted to cluster based on inherent qualities of the listings that are not easily changed and that are largely out of the host's control. Things like the response rate, reviews, and price are within the host's control, so I want to explore the relationship of those things to price in more detail. The purpose of the clustering is to simplify the wide variety of listings out there into a few groups so I can compare properties more easily.

One goal of clustering is to "group" the properties by consolidating the amenity variables and finding similarities between properties. Another goal is to reduce dimensionality by substituting all of the amenity variables with some kind of cluster assignment or a reduced set of dimensions.


In [32]:
#subset the variables you want to cluster by
cluster_dat = TransformedDat[[u'A_AC', u'A_Breakfast', u'A_CableTV', u'A_CarbonMonoxDetector', u'A_Doorman', 
                      u'A_Dryer', u'A_TV', u'A_Elevator', u'A_Essentials', u'A_Events', u'A_FamilyFriendly', 
                      u'A_FireExt', u'A_Fireplace', u'A_FirstAidKit', u'A_Gym', u'A_Heat', u'A_HotTub', u'A_Intercom', 
                      u'A_Internet', u'A_Kitchen', u'A_Parking', u'A_Pets', u'A_Pool', 
                      u'A_SafetyCard', u'A_Shampoo', u'A_SmokeDetector', u'A_Smoking', u'A_Washer', u'A_Wheelchair',
                      u'S_BedType_Airbed', u'S_BedType_Couch', u'S_BedType_Futon', 
                      u'S_BedType_Pull-out Sofa', u'S_BedType_Real Bed', u'S_PropType_Apartment', 
                      u'S_PropType_Bed & Breakfast', u'S_PropType_Cabin', u'S_PropType_House', u'S_PropType_Loft', 
                      u'S_PropType_Other', u'SD_PropType_Entire home/apt', u'SD_PropType_Private room', 
                      u'SD_PropType_Shared room']]

Apply PCA for dimensionality reduction


In [33]:
from sklearn.decomposition import PCA
from sklearn import preprocessing
pca = PCA()
pcaResults = pca.fit(cluster_dat)

In [34]:
plt.plot(np.cumsum(pcaResults.explained_variance_ratio_))
plt.title('Cumulative Proportion of Variance Explained - Principal Components')
plt.xlabel('Number of Principal Components')
plt.show()


Using PCA, we can reduce the number of features from 43 to 10 and still explain about 75% of the variance. If we are going to cluster using something like k-means, reducing dimensionality will be important. We chose to go with 10 principal components here.
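A small sanity check of that read-off from the cumulative variance curve (using the pcaResults object fitted above):

cum_var = np.cumsum(pcaResults.explained_variance_ratio_)
# first component count whose cumulative explained variance reaches 75%
n_components_75 = int(np.argmax(cum_var >= 0.75)) + 1
print n_components_75, cum_var[n_components_75 - 1]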


Attempt K-Means On Principal Components


In [38]:
#get first 10 principal components
pcaDat = pca.fit_transform(cluster_dat)[:, :10]
#confirm shape of new data
np.shape(pcaDat)


Out[38]:
(1493, 10)

In [39]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

K = range(1,30)
X = pcaDat

#Run k-means for each value of K from 1 to 29
KM = [kmeans(X,k) for k in K] # kmeans returns (centroids, distortion) for each K
centroids = [cent for (cent,var) in KM]   # cluster centroids

D_k = [cdist(X, cent, 'euclidean') for cent in centroids]

cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]

Choosing K in K-Means:

One of the key parameters in k-means is the number of clusters, or the value of K. I used the elbow method and chose 5 clusters. This is somewhat subjective, but unsupervised learning involves some subjective choices. I adapted code from the link below in order to make the elbow chart: http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb


In [40]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
# plot elbow curve
rcParams['figure.figsize'] = 8, 5
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
tt = plt.title('Elbow for K-Means clustering')


Using the elbow method, we went with K=5 as the number of clusters. We chose 5 because that appears to be where the elbow is, i.e., where the gradient of the curve starts to change drastically.

Now that we have chosen K = 5, I re-run k-means clustering with K = 5 and then inspect the output to see what the data look like.


In [41]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clusterin
from sklearn.cluster import KMeans

km = KMeans(5, init='k-means++') # initialize KMeans
c = km.fit_predict(X)

print np.shape(cluster_dat)
print np.shape(c)


(1493, 43)
(1493,)

In [42]:
#write out cluster data to a csv file so I can inspect it
#dat2 = pd.DataFrame.reset_index(cluster_dat)[[a for a in cluster_dat.columns if a != 'index']]
#concatenate cluster assignments to the original data
clusters = pd.concat([cluster_dat, pd.DataFrame(c)], axis = 1)
#rename clusterID column
clusters = clusters.rename(columns = {0:'ClusterID'})
#summarize clusters by clusterID, output to excel to inspect it
clusterSummary = clusters.groupby(['ClusterID']).mean().T
clusterSummary.to_csv('Cluster_Data.csv')

In [43]:
clusters.groupby(['ClusterID']).mean()


Out[43]:
A_AC A_Breakfast A_CableTV A_CarbonMonoxDetector A_Doorman A_Dryer A_TV A_Elevator A_Essentials A_Events A_FamilyFriendly A_FireExt A_Fireplace A_FirstAidKit A_Gym A_Heat A_HotTub A_Intercom A_Internet A_Kitchen
ClusterID
0 0.745387 0.154982 0.424354 0.457565 0.029520 0.749077 0.586716 0.103321 0.638376 0.055351 0.450185 0.321033 0.132841 0.250923 0.040590 0.940959 0.066421 0.228782 0.977860 0.896679 ...
1 0.759777 0.173184 0.441341 0.452514 0.033520 0.737430 0.642458 0.134078 0.659218 0.050279 0.379888 0.312849 0.117318 0.268156 0.044693 0.966480 0.072626 0.273743 0.972067 0.888268 ...
2 0.792079 0.168317 0.514851 0.500000 0.014851 0.722772 0.643564 0.094059 0.702970 0.019802 0.356436 0.361386 0.118812 0.292079 0.039604 0.955446 0.024752 0.217822 0.980198 0.861386 ...
3 0.728814 0.172881 0.413559 0.522034 0.013559 0.708475 0.616949 0.111864 0.711864 0.030508 0.362712 0.335593 0.101695 0.267797 0.030508 0.955932 0.077966 0.220339 0.976271 0.861017 ...
4 0.725490 0.127451 0.455882 0.426471 0.044118 0.700980 0.602941 0.161765 0.637255 0.053922 0.421569 0.264706 0.122549 0.225490 0.078431 0.941176 0.053922 0.269608 0.970588 0.882353 ...

5 rows × 43 columns

Below is the summary of the 5 clusters, with the mean of each feature calculated for each cluster. The reason I calculated this is to "explain" each cluster. We exported this data into Excel and generated a "heatmap" to better visualize what is going on in each cluster.


In [44]:
#Preview of the Cluster Mean Values
clusterSummary.head()


Out[44]:
ClusterID 0.0 1.0 2.0 3.0 4.0
A_AC 0.745387 0.759777 0.792079 0.728814 0.725490
A_Breakfast 0.154982 0.173184 0.168317 0.172881 0.127451
A_CableTV 0.424354 0.441341 0.514851 0.413559 0.455882
A_CarbonMonoxDetector 0.457565 0.452514 0.500000 0.522034 0.426471
A_Doorman 0.029520 0.033520 0.014851 0.013559 0.044118

5 rows × 5 columns


In [45]:
#Check to see how many listings are in each cluster
clusters.ClusterID.value_counts()


Out[45]:
3    382
0    354
4    268
2    258
1    231
dtype: int64

We imported the cluster data into Excel and created a heatmap to see if we could "explain" the clusters a little better. Below is a screenshot of this heatmap from our Excel file.


In [46]:
from IPython.display import Image
Image(filename='clusterheatmap.png')


Out[46]:
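For an in-notebook alternative to the Excel heatmap (assuming a seaborn version that ships sns.heatmap), the same clusterSummary table can be rendered directly:

plt.figure(figsize=(8, 14))
# rows are features, columns are cluster IDs; cell values are within-cluster means
sns.heatmap(clusterSummary, annot=False)
plt.title('Cluster Feature Means by ClusterID')
plt.show()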

Conclusion/Observations: After doing the clustering and exporting the data to excel where I could look at it, I could not really make sense of the clusters and give them a meaningful "name" or figure out why they might be similar. Therefore, I decided to ditch the idea of clustering, and instead am going to try running a random forest model for the outcome variable of price so that I can use the variable importance functionality to view the most important variables when considering price.

Random Forest - Variable Importance (Relationship To Price)


In [47]:
#set the seed so when instructors run this code they get the same results
np.random.seed(12345)
#Used This To Help Me: http://scikit-learn.org/stable/auto_examples/grid_search_digits.html
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import Imputer

#Create Parameter Grid - In This Case trying 1-20 trees
tunedParameters = [{'n_estimators':range(1,21)}]

#Create Grid Search Object - setting n_jobs = -1 would use all cores on my computer; here we keep n_jobs = 1
clf = GridSearchCV(RandomForestRegressor(n_jobs = 1), param_grid = tunedParameters, cv=10)

#Fit Random Forest
Y = TransformedDat[u'Price'].astype(float)
X = TransformedDat[[u'Lat', u'Long', u'PageNumber', u'A_AC', u'A_Breakfast', u'A_CableTV', u'A_CarbonMonoxDetector', 
                   u'A_Doorman', u'A_Dryer', u'A_TV', u'A_Elevator', u'A_Essentials', u'A_Events', 
                   u'A_FamilyFriendly', u'A_FireExt', u'A_Fireplace', u'A_FirstAidKit', u'A_Gym', 
                   u'A_Heat', u'A_HotTub', u'A_Intercom', u'A_Internet', u'A_Kitchen', u'A_Parking', 
                   u'A_Pets', u'A_Pool', u'A_SafetyCard', u'A_Shampoo', u'A_SmokeDetector', u'A_Smoking', 
                   u'A_Washer', u'A_Wheelchair', u'R_CI', u'R_acc', u'R_clean', u'R_comm', u'R_loc', 
                   u'R_val', u'RespRate', u'S_Accomodates', u'S_Bathrooms', u'S_Bedrooms', 
                   u'S_NumBeds', u'MemberLength', 
                   u'SD_NumReviews', u'BookInstantly_No', u'BookInstantly_Yes', 
                   u'Cancellation_Flexible', u'Cancellation_Moderate', u'Cancellation_Strict', 
                   u'Cancellation_Super Strict', u'RespTime_a few days or more', u'RespTime_within a day', 
                   u'RespTime_within a few hours', u'RespTime_within an hour', u'S_BedType_Airbed', 
                   u'S_BedType_Couch', u'S_BedType_Futon', u'S_BedType_Pull-out Sofa', u'S_BedType_Real Bed', 
                   u'S_PropType_Apartment', u'S_PropType_Bed & Breakfast', u'S_PropType_Cabin', u'S_PropType_House', 
                   u'S_PropType_Loft', u'S_PropType_Other', u'SD_PropType_Entire home/apt', 
                   u'SD_PropType_Private room', u'SD_PropType_Shared room', 
                   u'HostGender_couple', u'HostGender_female', u'HostGender_male', u'HostGender_unknownGender']]

ImputeMissing = Imputer(strategy = 'most_frequent')
Xt = pd.DataFrame(ImputeMissing.fit_transform(X))
Xt.columns = X.columns
clf.fit(Xt, Y)

###First, Extract Values Out of the CV Grid So I Can Graph It All
num_trees = []
meanCVScore = []
stdCVScore = []

for n, mean, cv in clf.grid_scores_:
    num_trees.append(n['n_estimators'])
    meanCVScore.append(mean)
    stdCVScore.append(np.std(cv) * 2)

In [48]:
clf.grid_scores_


Out[48]:
[mean: 0.33235, std: 0.13047, params: {'n_estimators': 1},
 mean: 0.48338, std: 0.14611, params: {'n_estimators': 2},
 mean: 0.53989, std: 0.10591, params: {'n_estimators': 3},
 mean: 0.57637, std: 0.08795, params: {'n_estimators': 4},
 mean: 0.59110, std: 0.08526, params: {'n_estimators': 5},
 mean: 0.58265, std: 0.11589, params: {'n_estimators': 6},
 mean: 0.63467, std: 0.07794, params: {'n_estimators': 7},
 mean: 0.62604, std: 0.07989, params: {'n_estimators': 8},
 mean: 0.63693, std: 0.08165, params: {'n_estimators': 9},
 mean: 0.63476, std: 0.07029, params: {'n_estimators': 10},
 mean: 0.61129, std: 0.11440, params: {'n_estimators': 11},
 mean: 0.63101, std: 0.07733, params: {'n_estimators': 12},
 mean: 0.64099, std: 0.07614, params: {'n_estimators': 13},
 mean: 0.65232, std: 0.06321, params: {'n_estimators': 14},
 mean: 0.63067, std: 0.06359, params: {'n_estimators': 15},
 mean: 0.65282, std: 0.07010, params: {'n_estimators': 16},
 mean: 0.64147, std: 0.06524, params: {'n_estimators': 17},
 mean: 0.64298, std: 0.05492, params: {'n_estimators': 18},
 mean: 0.64503, std: 0.07029, params: {'n_estimators': 19},
 mean: 0.65482, std: 0.05994, params: {'n_estimators': 20}]

In [49]:
bpData = [list(score.cv_validation_scores) for score in clf.grid_scores_]
plt.figure(figsize=(15,10))
sns.boxplot(bpData)
plt.xlabel('# of Trees')
plt.ylabel('Cross Validation Score (Rsquared)')
plt.title('# of Trees vs. Cross Validation Score')
plt.show()


Based on this output, I am going to choose 7 trees as the best model. Alternatively, I could have chosen 5 trees.


In [50]:
tunedParameters = [{'n_estimators':[7]}]
clf2 = GridSearchCV(RandomForestRegressor(n_jobs = 1, criterion='mse'), 
                    param_grid = tunedParameters, cv=10)
#Fit Model
clf2.fit(Xt, Y)


Out[50]:
GridSearchCV(cv=10,
       estimator=RandomForestRegressor(bootstrap=True, compute_importances=None,
           criterion='mse', max_depth=None, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid=[{'n_estimators': [7]}], pre_dispatch='2*n_jobs',
       refit=True, score_func=None, scoring=None, verbose=0)

In [51]:
FeatImp = pd.DataFrame({'feature': list(Xt.columns), 'importance': list(clf2.best_estimator_.feature_importances_)})
FeatImp = FeatImp.sort('importance', ascending = False)
#Set Index To Field You want to Sort Bar Chart By
FeatImp = FeatImp.set_index('feature')
FeatImp.head(20)


Out[51]:
importance
feature
SD_PropType_Entire home/apt 0.749379
S_Bathrooms 0.142344
S_Bedrooms 0.031190
R_loc 0.023942
Long 0.012472
PageNumber 0.009235
Lat 0.007985
S_Accomodates 0.006270
SD_NumReviews 0.002811
MemberLength 0.002244
RespRate 0.002159
A_Gym 0.001537
A_CableTV 0.000750
S_PropType_Bed & Breakfast 0.000728
Cancellation_Strict 0.000722
A_Elevator 0.000518
R_clean 0.000440
A_Shampoo 0.000401
A_Pets 0.000368
A_Essentials 0.000351

20 rows × 1 columns


In [52]:
FeatImp.head(20).plot(kind = 'barh', sort_columns = True)
plt.title('Feature Importance')
plt.show()



In [53]:
FeatImp.iloc[3:20, :].plot(kind = 'barh', sort_columns = True)
plt.title('Feature Importance - Excluding Top 3 Important Variables')
plt.show()


Observations:

The variable importances in relation to price did not return anything that was much of a surprise to me. Here are the definitions of the fields that were most important:

  • R_loc: the average star rating (1-5) of how good the location of the listing is
  • PageNumber: the page of the search results on which the listing appeared
  • Long / Lat: longitude and latitude
  • S_Accomodates: the number of guests the property can accommodate
  • MemberLength: the number of days the host has been a member of Airbnb
  • A_Intercom: binary variable indicating whether or not the property has an intercom
  • A_TV: binary variable indicating whether or not the property has a TV
  • RespRate: the host's response rate to inquiries regarding renting the property (0-100%)
  • SD_NumReviews: the number of reviews
  • A_Gym: binary variable indicating whether or not a gym exists

It looks like size and location are the most important factors that affect price. Listings for an entire apartment charge much more than those that merely rent a room. Likewise, latitude and longitude probably show up as important variables because there are neighborhoods with very high prices: we can see from the visualizations that properties around the MIT and Harvard campuses are very expensive, with clusters of red dots. Upon further investigation of the PageNumber variable, we found that it relates to how close a listing is to the center of the city being searched, which again reflects location. In the Tableau visualizations, which are discussed in the next section, we found that there are indeed areas with notably higher prices. These two areas are Kendall/MIT and the Back Bay area.

We made one very telling visualization that plotted all of the listings on a map and colored them by how expensive they are. To normalize for the size of the unit, we created the metric Price / Number of Bedrooms, which was the closest approximation we could get to price per square foot (square footage is not available). After normalizing for the size of the unit, we found that (1) Kendall/MIT and (2) Back Bay were the most expensive neighborhoods. This bolsters our previous finding that location is a very important variable when considering price. Note that the locations on the map are actually shifted slightly by the Airbnb website.
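A minimal sketch of that normalization on our cleaned data (PricePerBedroom is a hypothetical column name; listings with zero bedrooms are excluded to avoid division by zero):

map_dat = filtered_dat[filtered_dat.S_Bedrooms > 0].copy()
map_dat['PricePerBedroom'] = map_dat.Price.astype(float) / map_dat.S_Bedrooms
map_dat[['Lat', 'Long', 'PricePerBedroom']].head()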


In [54]:
Image(filename='locationexample.png')


Out[54]:

In the above visualization, we can see that Back Bay and Kendall/MIT are very expensive neighborhoods in which to rent a room on Airbnb, even after normalizing for space by dividing price by the number of bedrooms. This makes intuitive sense, as these are also among the most expensive neighborhoods in which to rent or buy real estate in the Boston area.

We have made many Tableau visualizations to really explore this data further, which we reference in subsequent parts of this paper.

Conclusion


The most important variables for determining price are:

  • Space: the type of apartment/house, the number of bedrooms, and the number of beds
  • Location: areas such as Kendall/MIT and Back Bay are expensive areas for business and shopping
  • Luxury amenities: amenities such as a doorman and an elevator are normally provided by luxury apartments, so their prices are higher than average

One of the goals of this project was to find the best price for Hamel Husain's (a member of the team) Airbnb listing. The visualizations we produced allowed us to see subtleties that are not easily seen in the raw data or captured by machine learning techniques. It was extremely useful to look at listings in Hamel's neighborhood that were priced well above average while also collecting lots of reviews, and to visit the pages for those specific listings. We discovered that these listings had very high-quality, professional photos and were decorated in interesting and unique ways. It was surprising to us that these listings performed so well simply by employing superior marketing. While these attributes are not represented in the data directly, we were able to find them by exploring interactive visualizations. Hamel feels confident in pricing his Cambridge apartment at 140 dollars per night (50 dollars above the median price of 90 dollars), as long as he decorates and markets his unit in a similar way to the outliers we observed. We highly recommend that you also explore all four tabs of the Tableau dashboard, as it is very interesting and fun to view this data!

We built a website for our project. It can be accessed at http://hamelsmu.github.io/AirbnbScrape/

Opportunities for further analysis

We didn't spend time on natural language processing to extract information from customers' reviews; the review ratings given by customers are almost all around 4.5/5, so they have little variance. Because of time limits, we couldn't scrape the data on a daily basis, but we would love to explore the demand for apartments/housing further by checking availability every day over a period of time. We could also calculate the Walk Score from each listing's latitude and longitude to provide additional information about its location.