For this lab’s exercise we are going to answer a few questions about AirBnB listings in San Francisco to make a better informed civic descisions. Spurred by Prop F in San Francisco, imagine you are the mayor of SF (or your respective city) and you need to decide what impact AirBnB has had on your own housing situation. We will collect the relevant data, parse and store this data in a structured form, and use statistics and visualization to both better understand our own city and potentially communicate these findings to the public at large.
I will explore SF's data, but the techniques should be generally applicable to any city. Inside AirBnB has many interesting cities to further explore: http://insideairbnb.com/
lxml
to parse web pages
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%pylab inline
In [2]:
# We will use the Inside AirBnB dataset from here on
df = pd.read_csv('data/sf_listings.csv')
df.head()
Out[2]:
In [3]:
df.room_type.value_counts().plot.bar()
Out[3]:
In [4]:
# Since SF doesn't have many neighborhoods (comparatively) we can also see the raw # per neighborhood
df.groupby('neighbourhood').count()['id'].plot.bar(figsize=(14,6))
Out[4]:
In [5]:
df.groupby('host_id').count()['id'].plot.hist(bins=50)
Out[5]:
In [6]:
# let's zoom in to the tail
subselect = df.groupby('host_id').count()['id']
subselect[subselect > 1].plot.hist(bins=50)
Out[6]:
In [7]:
def scale_free_plot(df, num):
subselect = df.groupby('host_id').count()['id']
return subselect[subselect > num].plot.hist(bins=75)
In [8]:
scale_free_plot(df, 2)
Out[8]:
In [9]:
# the shape of the distribution stays relatively the same as we subselect
for i in range(5):
scale_free_plot(df, i)
plt.show()
In an effort to find potential correlations (or outliers) you want a little bit more fine grained loot at the data. Create a scatterplot matrix of the data for your city. http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-scatter-matrix
In [10]:
from pandas.tools.plotting import scatter_matrix
# it only makes sense to plot the continuous columns
continuous_columns = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', \
'calculated_host_listings_count','availability_365']
# semicolon prevents the axis objests from printing
scatter_matrix(df[continuous_columns], alpha=0.6, figsize=(16, 16), diagonal='kde');
price
heavily skewed towards cheap prices (with a few extreme outliers). host_listings_count
and number_of_reviews
have similar distributions.minimum_nights
has a sharp bimodal distribution.
In [11]:
sns.distplot(df[(df.calculated_host_listings_count > 2) & (df.room_type == 'Entire home/apt')].availability_365, bins=50)
sns.distplot(df[(df.calculated_host_listings_count <= 2) & (df.room_type == 'Entire home/apt')].availability_365, bins=50)
Out[11]:
In [12]:
# Host with multiple listing for the entire home distribution is skewed to availability the entire year
# implying that these hosts are renting the AirBnB as short term sublets (or hotels)
entire_home = df[df.room_type == 'Entire home/apt']
plt.figure(figsize=(14,6))
sns.kdeplot(entire_home[entire_home.calculated_host_listings_count > 1].availability_365, label='Multiple Listings')
sns.kdeplot(entire_home[entire_home.calculated_host_listings_count == 1].availability_365, label = 'Single Listing')
plt.legend();
In [13]:
# Host with multiple listing for the entire home distribution is skewed to availability the entire year
# implying that these hosts are renting the AirBnB as short term sublets (or hotels)
plt.figure(figsize=(14,6))
sns.kdeplot(df[df.minimum_nights > 29].availability_365, label='Short term Sublet')
sns.kdeplot(df[df.minimum_nights <= 20].availability_365, label = 'Listing')
plt.legend();
In [14]:
# Host with multiple listing for the entire home distribution is skewed to availability the entire year
# implying that these hosts are renting the AirBnB as short term sublets (or hotels)
entire_home = df[df.minimum_nights > 29]
plt.figure(figsize=(14,6))
sns.kdeplot(entire_home[entire_home.calculated_host_listings_count > 1].availability_365, label='Multiple Listings')
sns.kdeplot(entire_home[entire_home.calculated_host_listings_count == 1].availability_365, label = 'Single Listing')
plt.legend();
In [15]:
# just a tocuh hard to interpret...
plt.figure(figsize=(16, 6))
sns.violinplot(data=df, x='neighbourhood', y='price')
Out[15]:
In [16]:
# boxplots can sometimes handle outliers better, we can see here there are some listings that are high priced extrema
plt.figure(figsize=(16, 6))
sns.boxplot(data=df, x='neighbourhood', y='price')
Out[16]:
Lets try to only show the 10 neighborhoods with the most listings and to zoom in on the distribution of the lower prices (now that we can identify the outliers) we can remove listings priced at > $2000
In [17]:
top_neighborhoods = df.groupby('neighbourhood').count().sort_values('id', ascending = False).index[:10]
top_neighborhoods
Out[17]:
In [18]:
neighborhood_subset = df[df.neighbourhood.isin(top_neighborhoods)]
In [19]:
plt.figure(figsize=(16, 6))
sns.boxplot(data=neighborhood_subset[neighborhood_subset.price < 2000], x='neighbourhood', y='price')
Out[19]:
In [20]:
plt.figure(figsize=(16, 6))
sns.violinplot(data=neighborhood_subset[neighborhood_subset.price < 2000], x='neighbourhood', y='price')
Out[20]: