Team Hamel Husain - Xiaoyu Li - Yi Mao - Lana Awad
Project Website http://hamelsmu.github.io/AirbnbScrape/
Airbnb is a website that lets people rent out lodging. As Airbnb hosts ourselves, we wanted to optimize the price of our listing and understand questions such as: How do other people around me price their listings, relative to dimensions such as location, amenities, reviews, and number of beds? How can I set my price more competitively?
Our ScrapeAirbnb file contains two main functions: IterateMainPage() and iterateDetail().
In [1]:
from ScrapeAirbnb import *
#ScrapeAirbnb is a separate Python file; to run it, please install the libraries mechanize, cookielib and lxml
In [2]:
test = IterateMainPage('Cambridge-MA', 1)
test2 = iterateDetail(test)
Here is an example of what the scraped dataset looks like.
In [3]:
import pandas as pd
test2 = pd.DataFrame(test2)
test2.head()
Out[3]:
Our DataCleanAirbnb.py file cleans the dataset scraped from the Airbnb website. Its main function, DataClean, takes the raw data and (1) calculates the length of membership from the member's join date; (2) parses the ShortDesc variable into three variables: property type, number of reviews, and neighborhood (ShortDesc is a string such as "Private room · 14 reviews · Cambridge"); and (3) infers the gender of the host from their first name (female, male, couple, or andy, which covers ambiguous or unknown foreign names). The function writes the cleaned data to a csv file named final_v2.csv.
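To make the ShortDesc parsing concrete, here is a minimal sketch of the kind of string splitting DataClean performs (the real implementation lives in DataCleanAirbnb.py; the helper name here is illustrative):

def parse_short_desc(short_desc):
    # Illustrative only: split e.g. "Private room · 14 reviews · Cambridge" on the middle-dot separator
    prop_type, reviews, neighborhood = [s.strip() for s in short_desc.split(u'·')]
    num_reviews = int(reviews.split()[0])
    return prop_type, num_reviews, neighborhood

parse_short_desc(u"Private room · 14 reviews · Cambridge")
# -> (u'Private room', 14, u'Cambridge')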
In [2]:
from DataCleanAirbnb import *
In [3]:
# For simplicity, read in only the first 10 rows of the raw data and show the cleaned output
data = pd.read_csv("airbnbData.csv")[0:10]
DataClean(data)
Out[3]:
Imputation of data and filtering the outliers
We used the most-frequent-value strategy to impute missing data before fitting the Random Forest model described later.
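The same strategy shows up again in the Random Forest section below via scikit-learn's Imputer; here is a minimal sketch of the idea, assuming dat is the data frame loaded in the next cell and restricting to its numeric columns:

from sklearn.preprocessing import Imputer

# Replace each missing value with the most frequent value in its column (numeric columns assumed)
numeric_dat = dat.select_dtypes(include=['number'])
imputed = pd.DataFrame(Imputer(strategy='most_frequent').fit_transform(numeric_dat),
                       columns=numeric_dat.columns)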
Filtering Data
In [20]:
dat = pd.read_csv('Final_v2.csv', na_values=['Not Found'])
In [21]:
def filterAirbnbListings(df):
    # Keep listings with more than one review and a plausible membership length
    filtered = df[(df.SD_NumReviews > 1) & (df.MemberLength < 70000)]
    return filtered

filtered_dat = filterAirbnbListings(dat)
Encoding - Dummy Variables
We want to convert some of the categorical variables in our data set to numeric values so we can more easily apply dimensionality reduction and clustering techniques. Below are functions that we use to do this.
In [22]:
from DummyOneHot import *
TransformedDat = dummyCode(filtered_dat)
TransformedDat.head()
Out[22]:
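dummyCode lives in the separate DummyOneHot.py file; as a rough sketch of the same idea, pandas' get_dummies can one-hot encode a categorical column (the example column here is illustrative):

# One-hot encode a categorical column such as the cancellation policy;
# get_dummies creates one 0/1 column per category level
example = pd.DataFrame({'Cancellation': ['Flexible', 'Moderate', 'Strict']})
pd.get_dummies(example, columns=['Cancellation'], prefix='Cancellation')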
To get a general idea about the distribution of different potential predictors of price, we generated box-plots of these variables using Tableau software. We looked for trends to guide our further analysis.
a- Booking and Host Variables
In [1]:
from IPython.display import Image
Image(filename='Booking and Host.jpg')
Out[1]:
b- Amenities
In [4]:
Image(filename='Amenities.jpg')
Out[4]:
In [5]:
Image(filename='Amenties 2.jpg')
Out[5]:
In [6]:
Image(filename='Amenities 3.jpg')
Out[6]:
We found no striking conclusions just from visualizing the data. A strict cancellation policy was associated with the highest-priced listings.
As we would expect, luxury amenities such as a doorman, gym, pool, fireplace, and elevator were associated with higher prices. Amenities that are essential and common to nearly all listings on Airbnb, such as internet and safety devices, showed no association with price.
c- Member Length
We calculated member length by using the date the member joined Airbnb and graphed this to see the distribution.
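As a minimal sketch, member length could be derived from a join date like so (the raw column name 'MemberSince' and the reference date are assumptions, not necessarily the exact values used in DataCleanAirbnb.py):

# Days between the host's join date and a fixed reference date
join_dates = pd.to_datetime(dat['MemberSince'])
member_length_days = (pd.Timestamp('2014-12-01') - join_dates).dt.days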
In [23]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting
import seaborn as sns
In [24]:
dat = pd.read_csv('Final_v2.csv', na_values=['Not Found'])
In [25]:
plt.hist(dat.MemberLength)
plt.xlabel('Member Length')
plt.ylabel('Number of Properties')
plt.title('Histogram of Member Length')
plt.show()
Out[25]:
We will want to filter out listings whose membership length is greater than 70,000 days, as that is an outlier and likely an anomaly in the data.
In [26]:
from pylab import rcParams
rcParams['figure.figsize'] = 10, 5
plt.hist(dat.SD_NumReviews, bins = 50)
plt.xlabel('Number of Reviews')
plt.ylabel('Number of Properties')
plt.title('Histogram of Number of Reviews')
plt.show()
We also filtered out members who don't have at least 3 reviews, as we want to capture properties that are actually being rented rather than inactive listings.
We plotted the number of reviews against the member length, to see how strong the relationship is.
In [27]:
filtered_dat = dat[(dat.SD_NumReviews > 2) & (dat.MemberLength < 70000)]
sns.lmplot("MemberLength", "SD_NumReviews", hue="SD_PropType", col = 'SD_PropType', data=filtered_dat, fit_reg=True)
plt.show()
It's pretty obvious that there is a wide distribution in how aggressively people rent out their properties. Some people have been members for a very long time yet have not rented out their property as much as people who have been members for a relatively short time. This means that when comparing the prices of different properties, we will have to take into account that even though a property is listed, the owner may not actually be willing to rent it. One approach is to look at the number of reviews divided by membership length and see how that relates to price. Below I plot some histograms of number of reviews / membership length.
In [28]:
rcParams['figure.figsize'] = 20, 5
plt.hist(filtered_dat.SD_NumReviews / filtered_dat.MemberLength, bins = 30)
plt.xlabel('Ratio of Reviews To Membership Length')
plt.ylabel('Number of Properties')
plt.title('Histogram of Reviews/Membership Length')
plt.show()
We came up with a metric calculated as follows: (Number of Reviews / Membership Length). This is meant to "measure", in a crude way, how actively a property has been rented. We wanted to normalize the number of reviews by the amount of time the property was available for rent. While we don't have perfect information, we used this metric as a proxy to check for outliers. The above histogram confirms our earlier observation that there is a wide range of activity amongst properties in terms of how actively they are rented on Airbnb.
In [29]:
sns.set(style="darkgrid")
f, ax = plt.subplots(figsize=(9, 9))
sns.corrplot(filtered_dat, annot=False, sig_stars=False,
diag_names=False, ax=ax)
plt.title('Correlation Matrix - All Variables')
plt.show()
Observations From Correlation Plot:
In [30]:
filtered_dat.HostGender.value_counts()
Out[30]:
One thing we also tried to do is use host names to infer the host's gender. We tried using the Python package sexmachine to map first names to a gender; however, it turns out that many of the names did not yield a gender. In the table above, 'andy' means that the gender is not clear from the host's name.
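A minimal sketch of how the sexmachine package can be used for this lookup (the example names are illustrative):

from sexmachine.detector import Detector

detector = Detector()
# get_gender returns 'male', 'female', 'mostly_male', 'mostly_female', or 'andy' for ambiguous names
print(detector.get_gender('Hamel'))
print(detector.get_gender('Lana'))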
Since there are so many types of properties, and I want to be able to compare similar properties to each other, I want to create clusters of properties so that I can group them together more easily. I am going to use the following attributes to cluster properties:
SD_PropType: the type of accommodation (Private Room, Entire House, Shared Room, etc.), along with the amenity and bed-type dummy variables selected in the code below.
I did not want to cluster using Price or Reviews as I wanted to cluster based on inherent qualities of the listings that are not easily changed and that are largely out of the host's control. Things like the response rate, reviews, and price are within the host's control so I want to explore the relationship of those things to price in more detail. The purpose of the clustering is to simplify the wide variety of listings out there into some groups so I can compare properties more easily.
One goal of clustering is to "group" the properties by consolidating the amenity variables and finding similarities between properties. Another goal is to reduce the dimensionality by substituting all of the amenity variables with some kind of cluster assignment or a reduced-dimension representation.
In [32]:
#subset the variables you want to cluster by
cluster_dat = TransformedDat[[u'A_AC', u'A_Breakfast', u'A_CableTV', u'A_CarbonMonoxDetector', u'A_Doorman',
u'A_Dryer', u'A_TV', u'A_Elevator', u'A_Essentials', u'A_Events', u'A_FamilyFriendly',
u'A_FireExt', u'A_Fireplace', u'A_FirstAidKit', u'A_Gym', u'A_Heat', u'A_HotTub', u'A_Intercom',
u'A_Internet', u'A_Kitchen', u'A_Parking', u'A_Pets', u'A_Pool',
u'A_SafetyCard', u'A_Shampoo', u'A_SmokeDetector', u'A_Smoking', u'A_Washer', u'A_Wheelchair',
u'S_BedType_Airbed', u'S_BedType_Couch', u'S_BedType_Futon',
u'S_BedType_Pull-out Sofa', u'S_BedType_Real Bed', u'S_PropType_Apartment',
u'S_PropType_Bed & Breakfast', u'S_PropType_Cabin', u'S_PropType_House', u'S_PropType_Loft',
u'S_PropType_Other', u'SD_PropType_Entire home/apt', u'SD_PropType_Private room',
u'SD_PropType_Shared room']]
Apply PCA on dimensions
In [33]:
from sklearn.decomposition import PCA
from sklearn import preprocessing
pca = PCA()
pcaResults = pca.fit(cluster_dat)
In [34]:
plt.plot(np.cumsum(pcaResults.explained_variance_ratio_))
plt.title('Cumulative Proportion of Variance Explained - Principal Components')
plt.xlabel('Number of Principal Components')
plt.show()
Using PCA, we can reduce the number of amenity features from 43 to 10 and still explain 75% of the variance. If we are going to cluster using something like k-means, reducing dimensionality will be important. We chose to go with 10 principal components here.
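As a quick check on that choice, the number of components needed to reach a given variance threshold can be read directly off the cumulative ratios; a small sketch using the pcaResults object fit above:

# Smallest number of principal components whose cumulative explained variance reaches 75%
cum_var = np.cumsum(pcaResults.explained_variance_ratio_)
n_components_75 = int(np.argmax(cum_var >= 0.75)) + 1
print(n_components_75)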
In [38]:
#get first 10 principal components
pcaDat = pca.fit_transform(cluster_dat)[:, :10]
#confirm shape of new data
np.shape(pcaDat)
Out[38]:
In [39]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
K = range(1,30)
X = pcaDat
#Run k-means for each K from 1 to 29
KM = [kmeans(X,k) for k in K] # apply kmeans for each number of clusters in K
centroids = [cent for (cent,var) in KM] # cluster centroids
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]
One of the key parameters in k-means is the number of clusters, or value of K. I used the elbow method and chose 5 clusters. This is somewhat subjective; however, with unsupervised learning there are some subjective elements. I adapted code from the link below in order to make the elbow chart: http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
In [40]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
# plot elbow curve
rcParams['figure.figsize'] = 8, 5
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
tt = plt.title('Elbow for K-Means clustering')
Using the elbow method, we went with K=5 as the number of clusters. We chose 5 because that appears to be where the elbow is, i.e. where the gradient of the curve starts to change drastically.
Now that we have chosen K = 5, I re-run k-means clustering with K = 5 and then inspect the output to see what the data might look like.
In [41]:
#adapted code from http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb
from sklearn.cluster import KMeans
km = KMeans(5, init='k-means++') # initialize KMeans
c = km.fit_predict(X)
print np.shape(cluster_dat)
print np.shape(c)
In [42]:
#write out cluster data to csv file so I can inspect it
#dat2 = pd.DataFrame.reset_index(cluster_dat)[[a for a in cluster_dat.columns if a != 'index']]
#concatenate cluster assignments to the original data
clusters = pd.concat([cluster_dat, pd.DataFrame(c)], axis = 1)
#rename clusterID column
clusters = clusters.rename(columns = {0:'ClusterID'})
#summarize clusters by clusterID, output to excel to inspect it
clusterSummary = clusters.groupby(['ClusterID']).mean().T
clusterSummary.to_csv('Cluster_Data.csv')
In [43]:
clusters.groupby(['ClusterID']).mean()
Out[43]:
Below is a summary of the 5 clusters, with the mean of each feature calculated for each cluster. The reason I calculated this is to "explain" each cluster. We exported this data to Excel and generated a "heatmap" to better visualize what was going on in each cluster.
In [44]:
#Preview of the Cluster Mean Values
clusterSummary.head()
Out[44]:
In [45]:
#Check to see how many listings are in each cluster
clusters.ClusterID.value_counts()
Out[45]:
We imported the cluster data into Excel and created a heatmap to see if we could "explain" the clusters a little better. Below is a screenshot of this heatmap from our Excel file.
In [46]:
from IPython.display import Image
Image(filename='clusterheatmap.png')
Out[46]:
Conclusion/Observations: After doing the clustering and exporting the data to Excel where I could look at it, I could not really make sense of the clusters, give them a meaningful "name", or figure out why their members might be similar. Therefore, I decided to drop the idea of clustering and instead run a random forest model with price as the outcome variable, so that I can use the variable-importance functionality to view the most important variables when considering price.
In [47]:
#set the seed so when instructors run this code they get the same results
np.random.seed(12345)
#Used This To Help Me: http://scikit-learn.org/stable/auto_examples/grid_search_digits.html
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import Imputer
#Create Parameter Grid - In This Case trying 1-20 trees
tunedParameters = [{'n_estimators':range(1,21)}]
#Create Grid Search Object - setting n_jobs = -1 would take advantage of all cores on my computer
clf = GridSearchCV(RandomForestRegressor(n_jobs = 1), param_grid = tunedParameters, cv=10)
#Fit Random Forest
Y = TransformedDat[u'Price'].astype(float)
X = TransformedDat[[u'Lat', u'Long', u'PageNumber', u'A_AC', u'A_Breakfast', u'A_CableTV', u'A_CarbonMonoxDetector',
u'A_Doorman', u'A_Dryer', u'A_TV', u'A_Elevator', u'A_Essentials', u'A_Events',
u'A_FamilyFriendly', u'A_FireExt', u'A_Fireplace', u'A_FirstAidKit', u'A_Gym',
u'A_Heat', u'A_HotTub', u'A_Intercom', u'A_Internet', u'A_Kitchen', u'A_Parking',
u'A_Pets', u'A_Pool', u'A_SafetyCard', u'A_Shampoo', u'A_SmokeDetector', u'A_Smoking',
u'A_Washer', u'A_Wheelchair', u'R_CI', u'R_acc', u'R_clean', u'R_comm', u'R_loc',
u'R_val', u'RespRate', u'S_Accomodates', u'S_Bathrooms', u'S_Bedrooms',
u'S_NumBeds', u'MemberLength',
u'SD_NumReviews', u'BookInstantly_No', u'BookInstantly_Yes',
u'Cancellation_Flexible', u'Cancellation_Moderate', u'Cancellation_Strict',
u'Cancellation_Super Strict', u'RespTime_a few days or more', u'RespTime_within a day',
u'RespTime_within a few hours', u'RespTime_within an hour', u'S_BedType_Airbed',
u'S_BedType_Couch', u'S_BedType_Futon', u'S_BedType_Pull-out Sofa', u'S_BedType_Real Bed',
u'S_PropType_Apartment', u'S_PropType_Bed & Breakfast', u'S_PropType_Cabin', u'S_PropType_House',
u'S_PropType_Loft', u'S_PropType_Other', u'SD_PropType_Entire home/apt',
u'SD_PropType_Private room', u'SD_PropType_Shared room',
u'HostGender_couple', u'HostGender_female', u'HostGender_male', u'HostGender_unknownGender']]
ImputeMissing = Imputer(strategy = 'most_frequent')
Xt = pd.DataFrame(ImputeMissing.fit_transform(X))
Xt.columns = X.columns
clf.fit(Xt, Y)
###First, Extract Values Out of the CV Grid So I Can Graph It All
num_trees = []
meanCVScore = []
stdCVScore = []
for n, mean, cv in clf.grid_scores_:
    num_trees.append(n['n_estimators'])
    meanCVScore.append(mean)
    stdCVScore.append(np.std(cv) * 2)
In [48]:
clf.grid_scores_
Out[48]:
In [49]:
bpData = [list(score.cv_validation_scores) for score in clf.grid_scores_]
plt.figure(figsize=(15,10))
sns.boxplot(bpData)
plt.xlabel('# of Trees')
plt.ylabel('Cross Validation Score (Rsquared)')
plt.title('# of Trees vs. Cross Validation Score')
plt.show()
Based on this output, I am going to choose 7 trees as the best model. Alternatively, I could have chosen 5 trees.
In [50]:
tunedParameters = [{'n_estimators':[7]}]
clf2 = GridSearchCV(RandomForestRegressor(n_jobs = 1, criterion='mse'),
param_grid = tunedParameters, cv=10)
#Fit Model
clf2.fit(Xt, Y)
Out[50]:
In [51]:
FeatImp = pd.DataFrame({'feature': list(Xt.columns), 'importance': list(clf2.best_estimator_.feature_importances_)})
FeatImp = FeatImp.sort('importance', ascending = False)
#Set Index To Field You want to Sort Bar Chart By
FeatImp = FeatImp.set_index('feature')
FeatImp.head(20)
Out[51]:
In [52]:
FeatImp.head(20).plot(kind = 'barh', sort_columns = True)
plt.title('Feature Importance')
plt.show()
In [53]:
FeatImp.iloc[3:20, :].plot(kind = 'barh', sort_columns = True)
plt.title('Feature Importance - Excluding Top 2 Important Variables')
plt.show()
The variable importance in relation to price did not return anything that was much of a surprise to me. Here are the definitions of the fields that were most important:
R_loc: the average star rating (1-5) of how good the location of the listing is
PageNumber: the page number on which the listing showed up in search results
Long: longitude
Lat: latitude
S_Accomodates: the number of guests the property can accommodate
MemberLength: the number of days the host has been a member on Airbnb
A_Intercom: binary variable indicating whether or not the property has an intercom
A_TV: binary variable indicating whether or not the property has a TV
RespRate: the rate at which the host responds to inquiries about renting the property (0-100%)
SD_NumReviews: the number of reviews
A_Gym: binary variable indicating whether or not a gym exists
It looks like size and location are the most important factors that affect price. Listings for an entire apartment charge much more than those renting merely a room. Likewise, latitude and longitude probably showed up as important variables because there are neighborhoods with very high prices. We can see from the visualizations that properties around the MIT and Harvard campuses are very expensive, shown as clusters of red dots. Upon further investigation of the "PageNumber" variable, we found that it relates to how close the listing is to the center of the city being searched, which again is related to location. In the Tableau visualizations, which are discussed in the next section, we found that there are indeed areas with markedly higher prices: Kendall/MIT and the Back Bay area.
We made one very telling visualization that plotted all of the listings on a map and colored the listings by how expensive they are. To normalize for the size of the unit, we created the metric Price / Number of Bedrooms, which was the closest approximation we could get to price per square foot, which is not available. After normalizing for the size of the unit, we found that (1) Kendall/MIT and (2) Back Bay were expensive neighborhoods. This bolsters our previous finding that location is a very important variable when considering price. Notice that the locations on the map are actually shifted a little bit by the Airbnb website.
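A minimal sketch of that Price / Number of Bedrooms calculation, assuming the Price and S_Bedrooms columns used elsewhere in this notebook:

# Price per bedroom as a rough stand-in for price per square foot;
# zero-bedroom (studio) listings are set to NaN to avoid dividing by zero
bedrooms = filtered_dat['S_Bedrooms'].replace(0, np.nan)
price_per_bedroom = filtered_dat['Price'].astype(float) / bedrooms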
In [54]:
Image(filename='locationexample.png')
Out[54]:
In the above visualization, we can see that Back Bay and MIT/Kendall are very expensive neighborhoods in which to rent a room on Airbnb, even after normalizing for space by dividing price by the number of bedrooms. This makes intuitive sense, as these are also the most expensive neighborhoods in which to rent or buy real estate in the Boston area.
We have made many Tableau visualizations to really explore this data further, which we reference in subsequent parts of this paper.
Observations / Conclusion
a. The most important variables for determining price are (in order of importance):
i. Space
ii. Location – look at MIT/Kendall and Back Bay
iii. Luxury amenities (maybe)
One of the goals of this project was to find the best price for Hamel Husain’s (a member of the team) Airbnb listing. The visualizations we produced allowed us to see subtleties that are not easily seen in the data or analyzed by machine learning techniques. It was extremely useful to look at listings in Hamel’s neighborhood that were priced well above average while also collecting lots of reviews and visit the pages for those specific listings. We discovered that these listings had very high-quality, professional photos and were decorated in interesting and unique ways. It was surprising to us that these listings performed so well by simply employing superior marketing. While these attributes are not represented in the data directly, we were able to find them through exploring interactive visualizations. Hamel feels confident in pricing his Cambridge apartment at 140 dollars per night – which is 50 dollars above the median price of 90 dollars, as long as he decorates and markets his unit in a similar way to the outliers we observed. We highly recommend that you also explore all four tabs of the Tableau dashboard, as it is very interesting and fun to view this data!
We built a website for our project. It can be accessed at http://hamelsmu.github.io/AirbnbScrape/
We didn't spend time on natural language processing to extract information from customers' reviews; the review ratings given by customers are almost all around 4.5/5, so they have little variance. Because of time constraints, we couldn't scrape the data on a daily basis, but we would love to explore the demand for apartments and houses further by checking availability every day over a period of time. We could also compute the Walk Score for each listing from its latitude and longitude to add more information about its location.