Team Hamel Husain - Xiaoyu Li - Yi Mao - Lana Awad
Project Website http://hamelsmu.github.io/AirbnbScrape/
Airbnb is a website that lets people rent out lodging. As Airbnb hosts ourselves, we wanted to optimize the price of our listing and understand questions such as: How do other people around me price their listings, relative to dimensions such as location, amenities, reviews, and number of beds? How can I set my price more competitively?
Our ScrapeAirbnb file contains two main functions: IterateMainPage() and iterateDetail().
In [1]:
from ScrapeAirbnb import *
#ScrapeAirbnb is a separate Python file; to run it, please install the libraries mechanize, cookielib and lxml
In [2]:
test = IterateMainPage('Cambridge-MA', 1)
test2 = iterateDetail(test)
Here is an example of what the scraped dataset looks like.
In [3]:
import pandas as pd
test2 = pd.DataFrame(test2)
test2.head()
Out[3]:
Our DataCleanAirbnb.py file cleans the dataset scraped from the Airbnb website. Its main function, DataClean, takes the raw data and (1) calculates the length of membership from the member's join date; (2) parses the ShortDesc variable into three variables: property type, number of reviews, and neighborhood (ShortDesc is a string such as "Private room · 14 reviews · Cambridge"); and (3) infers the gender of the host from their first name (female, male, couple, or andy, which covers ambiguous or unknown foreign names). The function writes the cleaned data to a csv file named final_v2.csv.
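To make the ShortDesc parsing concrete, here is a minimal sketch of the kind of string splitting DataClean performs (the real implementation lives in DataCleanAirbnb.py; the helper name here is illustrative):

def parse_short_desc(short_desc):
    # Illustrative only: split e.g. "Private room · 14 reviews · Cambridge" on the middle-dot separator
    prop_type, reviews, neighborhood = [s.strip() for s in short_desc.split(u'·')]
    num_reviews = int(reviews.split()[0])
    return prop_type, num_reviews, neighborhood

parse_short_desc(u"Private room · 14 reviews · Cambridge")
# -> (u'Private room', 14, u'Cambridge')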
In [2]:
from DataCleanAirbnb import *
In [3]:
# For simplicity, read in only the first 10 rows of the raw data and show the cleaned output
data = pd.read_csv("airbnbData.csv")[0:10]
DataClean(data)
Out[3]:
Imputation of data and filtering the outliers
We used the most-frequent-value strategy to impute missing data before fitting the Random Forest model described later.
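The same strategy shows up again in the Random Forest section below via scikit-learn's Imputer; here is a minimal sketch of the idea, assuming dat is the data frame loaded in the next cell and restricting to its numeric columns:

from sklearn.preprocessing import Imputer

# Replace each missing value with the most frequent value in its column (numeric columns assumed)
numeric_dat = dat.select_dtypes(include=['number'])
imputed = pd.DataFrame(Imputer(strategy='most_frequent').fit_transform(numeric_dat),
                       columns=numeric_dat.columns)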
Filtering Data
In [20]:
dat = pd.read_csv('Final_v2.csv', na_values=['Not Found'])
In [21]:
def filterAirbnbListings(df):
    # Keep listings with more than one review and a plausible membership length
    filtered = df[(df.SD_NumReviews > 1) & (df.MemberLength < 70000)]
    return filtered

filtered_dat = filterAirbnbListings(dat)
Encoding - Dummy Variables
We want to convert some of the categorical variables in our data set to numeric values so we can more easily apply dimensionality reduction and clustering techniques. Below are functions that we use to do this.
In [22]:
from DummyOneHot import *
TransformedDat = dummyCode(filtered_dat)
TransformedDat.head()
Out[22]:
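dummyCode lives in the separate DummyOneHot.py file; as a rough sketch of the same idea, pandas' get_dummies can one-hot encode a categorical column (the example column here is illustrative):

# One-hot encode a categorical column such as the cancellation policy;
# get_dummies creates one 0/1 column per category level
example = pd.DataFrame({'Cancellation': ['Flexible', 'Moderate', 'Strict']})
pd.get_dummies(example, columns=['Cancellation'], prefix='Cancellation')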
To get a general idea about the distribution of different potential predictors of price, we generated box-plots of these variables using Tableau software. We looked for trends to guide our further analysis.
a- Booking and Host Variables
In [1]:
from IPython.display import Image
Image(filename='Booking and Host.jpg')
Out[1]:
b- Amenities
In [4]:
Image(filename='Amenities.jpg')
Out[4]:
In [5]:
Image(filename='Amenties 2.jpg')
Out[5]:
In [6]:
Image(filename='Amenities 3.jpg')
Out[6]:
We found no striking conclusions just from visualizing the data. A strict cancellation policy was associated with the highest-priced listings.
As we would expect, luxury amenities such as a doorman, gym, pool, fireplace, and elevator were associated with higher prices. Amenities that are essential and common to nearly all listings on Airbnb, such as internet and safety devices, showed no association with price.
c- Member Length
We calculated member length by using the date the member joined Airbnb and graphed this to see the distribution.
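As a minimal sketch, member length could be derived from a join date like so (the raw column name 'MemberSince' and the reference date are assumptions, not necessarily the exact values used in DataCleanAirbnb.py):

# Days between the host's join date and a fixed reference date
join_dates = pd.to_datetime(dat['MemberSince'])
member_length_days = (pd.Timestamp('2014-12-01') - join_dates).dt.days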
In [23]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting
import seaborn as sns
In [24]:
dat = pd.read_csv('Final_v2.csv', na_values=['Not Found'])
In [25]:
plt.hist(dat.MemberLength)
plt.xlabel('Member Length')
plt.ylabel('Number of Properties')
plt.title('Histogram of Member Length')
plt.show()
Out[25]:
We will want to filter out listings whose membership length is greater than 70,000 days, as that is an outlier and likely an anomaly in the data.
In [26]:
from pylab import rcParams
rcParams['figure.figsize'] = 10, 5
plt.hist(dat.SD_NumReviews, bins = 50)
plt.xlabel('Number of Reviews')
plt.ylabel('Number of Properties')
plt.title('Histogram of Number of Reviews')
plt.show()
We also filtered out members who don't have at least 3 reviews, as we want to capture properties that are actually being rented rather than inactive listings.
We plotted the number of reviews against the member length, to see how strong the relationship is.
In [27]:
filtered_dat = dat[(dat.SD_NumReviews > 2) & (dat.MemberLength < 70000)]
sns.lmplot("MemberLength", "SD_NumReviews", hue="SD_PropType", col = 'SD_PropType', data=filtered_dat, fit_reg=True)
plt.show()
It's pretty obvious that there is a wide distribution in how aggressively people rent out their properties. Some people have been members for a very long time yet have not rented out their property as much as people who have been members for a relatively short time. This means that when comparing the prices of different properties, we will have to take into account that even though a property is listed, the owner may not actually be willing to rent it. One approach is to look at the number of reviews divided by membership length and see how that relates to price. Below I plot some histograms of number of reviews / membership length.
In [28]:
rcParams['figure.figsize'] = 20, 5
plt.hist(filtered_dat.SD_NumReviews / filtered_dat.MemberLength, bins = 30)
plt.xlabel('Ratio of Reviews To Membership Length')
plt.ylabel('Number of Properties')
plt.title('Histogram of Reviews/Membership Length')
plt.show()
We came up with a metric calculated as follows: (Number of Reviews / Membership Length). This is meant to "measure", in a crude way, how actively a property has been rented. We wanted to normalize the number of reviews by the amount of time the property was available for rent. While we don't have perfect information, we used this metric as a proxy to check for outliers. The above histogram confirms our earlier observation that there is a wide range of activity amongst properties in terms of how actively they are rented on Airbnb.
In [29]:
sns.set(style="darkgrid")
f, ax = plt.subplots(figsize=(9, 9))
sns.corrplot(filtered_dat, annot=False, sig_stars=False,
diag_names=False, ax=ax)
plt.title('Correlation Matrix - All Variables')
plt.show()
Observations From Correlation Plot:
In [30]:
filtered_dat.HostGender.value_counts()
Out[30]:
One thing we also tried to do is use host names to infer the host's gender. We tried using the Python package sexmachine to map first names to a gender; however, it turns out that many of the names did not yield a gender. In the table above, 'andy' means that the gender is not clear from the host's name.
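A minimal sketch of how the sexmachine package can be used for this lookup (the example names are illustrative):

from sexmachine.detector import Detector

detector = Detector()
# get_gender returns 'male', 'female', 'mostly_male', 'mostly_female', or 'andy' for ambiguous names
print(detector.get_gender('Hamel'))
print(detector.get_gender('Lana'))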
Since there are so many types of properties, and I want to be able to compare similar properties to each other, I want to create clusters of properties so that I can group them together more easily. I am going to use the following attributes to cluster properties:
SD_PropType: the type of accommodation (Private Room, Entire House, Shared Room, etc.), along with the amenity and bed-type dummy variables selected in the code below.
I did not want to cluster using Price or Reviews as I wanted to cluster based on inherent qualities of the listings that are not easily changed and that are largely out of the host's control. Things like the response rate, reviews, and price are within the host's control so I want to explore the relationship of those things to price in more detail. The purpose of the clustering is to simplify the wide variety of listings out there into some groups so I can compare properties more easily.
One goal of clustering is to "group" the properties by consolidating the amenity variables and finding similarities between properties. Another goal is to reduce the dimensionality by substituting all of the amenity variables with some kind of cluster assignment or a reduced-dimension representation.
In [32]:
#subset the variables you want to cluster by
cluster_dat = TransformedDat[[u'A_AC', u'A_Breakfast', u'A_CableTV', u'A_CarbonMonoxDetector', u'A_Doorman',
u'A_Dryer', u'A_TV', u'A_Elevator', u'A_Essentials', u'A_Events', u'A_FamilyFriendly',
u'A_FireExt', u'A_Fireplace', u'A_FirstAidKit', u'A_Gym', u'A_Heat', u'A_HotTub', u'A_Intercom',
u'A_Internet', u'A_Kitchen', u'A_Parking', u'A_Pets', u'A_Pool',
u'A_SafetyCard', u'A_Shampoo', u'A_SmokeDetector', u'A_Smoking', u'A_Washer', u'A_Wheelchair',
u'S_BedType_Airbed', u'S_BedType_Couch', u'S_BedType_Futon',
u'S_BedType_Pull-out Sofa', u'S_BedType_Real Bed', u'S_PropType_Apartment',
u'S_PropType_Bed & Breakfast', u'S_PropType_Cabin', u'S_PropType_House', u'S_PropType_Loft',
u'S_PropType_Other', u'SD_PropType_Entire home/apt', u'SD_PropType_Private room',
u'SD_PropType_Shared room']]
Apply PCA on dimensions
In [33]:
from sklearn.decomposition import PCA
from sklearn import preprocessing
pca = PCA()
pcaResults = pca.fit(cluster_dat)
In [34]:
plt.plot(np.cumsum(pcaResults.explained_variance_ratio_))
plt.title('Cumulative Proportion of Variance Explained - Principal Components')
plt.xlabel('Number of Principal Components')
plt.show()
Using PCA, we can reduce the number of amenity features from 43 to 10 and still explain 75% of the variance. If we are going to cluster using something like k-means, reducing dimensionality will be important. We chose to go with 10 principal components here.
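As a quick check on that choice, the number of components needed to reach a given variance threshold can be read directly off the cumulative ratios; a small sketch using the pcaResults object fit above:

# Smallest number of principal components whose cumulative explained variance reaches 75%
cum_var = np.cumsum(pcaResults.explained_variance_ratio_)
n_components_75 = int(np.argmax(cum_var >= 0.75)) + 1
print(n_components_75)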
In [38]:
#get first 10 principal components
pcaDat = pca.fit_transform(cluster_dat)[:, :10]
#confirm shape of new data
np.shape(pcaDat)
Out[38]:
In [39]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
K = range(1,30)
X = pcaDat
#Run k-means for each K from 1 to 29
KM = [kmeans(X,k) for k in K] # apply kmeans for each number of clusters in K
centroids = [cent for (cent,var) in KM] # cluster centroids
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]
One of the key parameters in k-means is the number of clusters, or value of K. I used the elbow method and chose 5 clusters. This is somewhat subjective; however, with unsupervised learning there are some subjective elements. I adapted code from the link below in order to make the elbow chart: http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
In [40]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
# plot elbow curve
rcParams['figure.figsize'] = 8, 5
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
tt = plt.title('Elbow for K-Means clustering')
Using the elbow method, we went with K=5 as the number of clusters. We chose 5 because that appears to be where the elbow is, i.e. where the gradient of the curve starts to change drastically.
Now that we have chosen K = 5, I re-run k-means clustering with K = 5 and then inspect the output to see what the data might look like.
In [41]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
from sklearn.cluster import KMeans
km = KMeans(5, init='k-means++') # initialize KMeans
c = km.fit_predict(X)
print np.shape(cluster_dat)
print np.shape(c)
In [42]:
#write out cluster data to csv file so I can inspect it
#dat2 = pd.DataFrame.reset_index(cluster_dat)[[a for a in cluster_dat.columns if a != 'index']]
#concatenate cluster assignments to the original data
clusters = pd.concat([cluster_dat, pd.DataFrame(c)], axis = 1)
#rename clusterID column
clusters = clusters.rename(columns = {0:'ClusterID'})
#summarize clusters by clusterID, output to excel to inspect it
clusterSummary = clusters.groupby(['ClusterID']).mean().T
clusterSummary.to_csv('Cluster_Data.csv')
In [43]:
clusters.groupby(['ClusterID']).mean()
Out[43]:
Below is a summary of the 5 clusters, with the mean of each feature calculated for each cluster. The reason I calculated this is to "explain" each cluster. We exported this data to Excel and generated a "heatmap" to better visualize what was going on in each cluster.
In [44]:
#Preview of the Cluster Mean Values
clusterSummary.head()
Out[44]:
In [45]:
#Check to see how many listings are in each cluster
clusters.ClusterID.value_counts()
Out[45]:
We imported the cluster data into Excel and created a heatmap to see if we could "explain" the clusters a little better. Below is a screenshot of this heatmap from our Excel file.
In [46]:
from IPython.display import Image
Image(filename='clusterheatmap.png')
Out[46]:
Conclusion/Observations: After doing the clustering and exporting the data to Excel where I could look at it, I could not really make sense of the clusters, give them a meaningful "name", or figure out why their members might be similar. Therefore, I decided to drop the idea of clustering and instead run a random forest model with price as the outcome variable, so that I can use the variable-importance functionality to view the most important variables when considering price.
In [47]:
#set the seed so when instructors run this code they get the same results
np.random.seed(12345)
#Used This To Help Me: http://scikit-learn.org/stable/auto_examples/grid_search_digits.html
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import Imputer
#Create Parameter Grid - In This Case trying 1-20 trees
tunedParameters = [{'n_estimators':range(1,21)}]
#Create Grid Search Object - setting n_jobs = -1 would take advantage of all cores on my computer
clf = GridSearchCV(RandomForestRegressor(n_jobs = 1), param_grid = tunedParameters, cv=10)
#Fit Random Forest
Y = TransformedDat[u'Price'].astype(float)
X = TransformedDat[[u'Lat', u'Long', u'PageNumber', u'A_AC', u'A_Breakfast', u'A_CableTV', u'A_CarbonMonoxDetector',
u'A_Doorman', u'A_Dryer', u'A_TV', u'A_Elevator', u'A_Essentials', u'A_Events',
u'A_FamilyFriendly', u'A_FireExt', u'A_Fireplace', u'A_FirstAidKit', u'A_Gym',
u'A_Heat', u'A_HotTub', u'A_Intercom', u'A_Internet', u'A_Kitchen', u'A_Parking',
u'A_Pets', u'A_Pool', u'A_SafetyCard', u'A_Shampoo', u'A_SmokeDetector', u'A_Smoking',
u'A_Washer', u'A_Wheelchair', u'R_CI', u'R_acc', u'R_clean', u'R_comm', u'R_loc',
u'R_val', u'RespRate', u'S_Accomodates', u'S_Bathrooms', u'S_Bedrooms',
u'S_NumBeds', u'MemberLength',
u'SD_NumReviews', u'BookInstantly_No', u'BookInstantly_Yes',
u'Cancellation_Flexible', u'Cancellation_Moderate', u'Cancellation_Strict',
u'Cancellation_Super Strict', u'RespTime_a few days or more', u'RespTime_within a day',
u'RespTime_within a few hours', u'RespTime_within an hour', u'S_BedType_Airbed',
u'S_BedType_Couch', u'S_BedType_Futon', u'S_BedType_Pull-out Sofa', u'S_BedType_Real Bed',
u'S_PropType_Apartment', u'S_PropType_Bed & Breakfast', u'S_PropType_Cabin', u'S_PropType_House',
u'S_PropType_Loft', u'S_PropType_Other', u'SD_PropType_Entire home/apt',
u'SD_PropType_Private room', u'SD_PropType_Shared room',
u'HostGender_couple', u'HostGender_female', u'HostGender_male', u'HostGender_unknownGender']]
ImputeMissing = Imputer(strategy = 'most_frequent')
Xt = pd.DataFrame(ImputeMissing.fit_transform(X))
Xt.columns = X.columns
clf.fit(Xt, Y)
###First, Extract Values Out of the CV Grid So I Can Graph It All
num_trees = []
meanCVScore = []
stdCVScore = []
for n, mean, cv in clf.grid_scores_:
    num_trees.append(n['n_estimators'])
    meanCVScore.append(mean)
    stdCVScore.append(np.std(cv) * 2)
In [48]:
clf.grid_scores_
Out[48]:
In [49]:
bpData = [list(score.cv_validation_scores) for score in clf.grid_scores_]
plt.figure(figsize=(15,10))
sns.boxplot(bpData)
plt.xlabel('# of Trees')
plt.ylabel('Cross Validation Score (Rsquared)')
plt.title('# of Trees vs. Cross Validation Score')
plt.show()
Based on this output, I am going to choose 7 trees as the best model. Alternatively, I could have chosen 5 trees.
In [50]:
tunedParameters = [{'n_estimators':[7]}]
clf2 = GridSearchCV(RandomForestRegressor(n_jobs = 1, criterion='mse'),
param_grid = tunedParameters, cv=10)
#Fit Model
clf2.fit(Xt, Y)
Out[50]:
In [51]:
FeatImp = pd.DataFrame({'feature': list(Xt.columns), 'importance': list(clf2.best_estimator_.feature_importances_)})
FeatImp = FeatImp.sort('importance', ascending = False)
#Set Index To Field You want to Sort Bar Chart By
FeatImp = FeatImp.set_index('feature')
FeatImp.head(20)
Out[51]:
In [52]:
FeatImp.head(20).plot(kind = 'barh', sort_columns = True)
plt.title('Feature Importance')
plt.show()
In [53]:
FeatImp.iloc[3:20, :].plot(kind = 'barh', sort_columns = True)
plt.title('Feature Importance - Excluding Top 2 Important Variables')
plt.show()
The variable importance in relation to price did not return anything that was much of a surprise to me. Here are the definitions of the fields that were most important:
R_loc: the average star rating (1-5) of how good the location of the listing is
PageNumber: the page number on which the listing showed up in search results
Long: longitude
Lat: latitude
S_Accomodates: the number of guests the property can accommodate
MemberLength: the number of days the host has been a member on Airbnb
A_Intercom: binary variable indicating whether or not the property has an intercom
A_TV: binary variable indicating whether or not the property has a TV
RespRate: the rate at which the host responds to inquiries about renting the property (0-100%)
SD_NumReviews: the number of reviews
A_Gym: binary variable indicating whether or not a gym exists
It looks like size and location are the most important factors that affect price. Listings for an entire apartment charge much more than those renting merely a room. Likewise, latitude and longitude probably showed up as important variables because there are neighborhoods with very high prices. We can see from the visualizations that properties around the MIT and Harvard campuses are very expensive, shown as clusters of red dots. Upon further investigation of the "PageNumber" variable, we found that it relates to how close the listing is to the center of the city being searched, which again is related to location. In the Tableau visualizations, which are discussed in the next section, we found that there are indeed areas with markedly higher prices: Kendall/MIT and the Back Bay area.
We made one very telling visualization that plotted all of the listings on a map and colored the listings by how expensive they are. To normalize for the size of the unit, we created the metric Price / Number of Bedrooms, which was the closest approximation we could get to price per square foot, which is not available. After normalizing for the size of the unit, we found that (1) Kendall/MIT and (2) Back Bay were expensive neighborhoods. This bolsters our previous finding that location is a very important variable when considering price. Notice that the locations on the map are actually shifted a little bit by the Airbnb website.
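A minimal sketch of that Price / Number of Bedrooms calculation, assuming the Price and S_Bedrooms columns used elsewhere in this notebook:

# Price per bedroom as a rough stand-in for price per square foot;
# zero-bedroom (studio) listings are set to NaN to avoid dividing by zero
bedrooms = filtered_dat['S_Bedrooms'].replace(0, np.nan)
price_per_bedroom = filtered_dat['Price'].astype(float) / bedrooms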
In [54]:
Image(filename='locationexample.png')
Out[54]:
In the above visualization, we can see that Back Bay and MIT/Kendall are very expensive neighborhoods in which to rent a room on Airbnb, even after normalizing for space by dividing price by the number of bedrooms. This makes intuitive sense, as these are also the most expensive neighborhoods in which to rent or buy real estate in the Boston area.
We have made many Tableau visualizations to really explore this data further, which we reference in subsequent parts of this paper.
Observations / Conclusion
a. The most important variables for determining price are (in order of importance):
i. Space
ii. Location – look at MIT/Kendall and Back Bay
iii. Luxury amenities (maybe)
One of the goals of this project was to find the best price for Hamel Husain’s (a member of the team) Airbnb listing. The visualizations we produced allowed us to see subtleties that are not easily seen in the data or analyzed by machine learning techniques. It was extremely useful to look at listings in Hamel’s neighborhood that were priced well above average while also collecting lots of reviews and visit the pages for those specific listings. We discovered that these listings had very high-quality, professional photos and were decorated in interesting and unique ways. It was surprising to us that these listings performed so well by simply employing superior marketing. While these attributes are not represented in the data directly, we were able to find them through exploring interactive visualizations. Hamel feels confident in pricing his Cambridge apartment at 140 dollars per night – which is 50 dollars above the median price of 90 dollars, as long as he decorates and markets his unit in a similar way to the outliers we observed. We highly recommend that you also explore all four tabs of the Tableau dashboard, as it is very interesting and fun to view this data!
We built a website for our project. It can be accessed at http://hamelsmu.github.io/AirbnbScrape/
We didn't spend time on natural language processing to extract information from customers' reviews; the review ratings given by customers are almost all around 4.5/5, so they have little variance. Because of time constraints, we couldn't scrape the data on a daily basis, but we would love to explore the demand for apartments and houses further by checking availability every day over a period of time. We could also compute the Walk Score for each listing from its latitude and longitude to add more information about its location.