Exercise: Wrangling Data: Acquisition, Integration, and Exploration

For this lab’s exercise we are going to answer a few questions about AirBnB listings in San Francisco to make a better informed civic descisions. Spurred by Prop F in San Francisco, imagine you are the mayor of SF (or your respective city) and you need to decide what impact AirBnB has had on your own housing situation. We will collect the relevant data, parse and store this data in a structured form, and use statistics and visualization to both better understand our own city and potentially communicate these findings to the public at large.

I will explore SF's data, but the techniques should be generally applicable to any city. Inside AirBnB has many interesting cities to further explore: http://insideairbnb.com/

Outline

  • Start with Effective Questions
    • Intro + Data Science Overview
    • Proposition F
    • How can we answer this?
  • Acquiring Data
    • What's an API? (Zillow API, SF Open Data, datausa.io)
    • How the Web Works (Socrata API)
  • What if there is no API?
    • Scrape an AirBnB listing
  • What to do now that we have data?
    • Basics of HTML (CSS selectors and grabbing what you want)
    • Use lxml to parse web pages
  • Storing Data
    • Schemas and Structure
    • Relations (users, listings, and reviews)
    • Store listing in SQLite
  • Manipulating Data
    • basics of Pandas
    • summary stats
    • split-apply-combine
    • Aggregations
    • Prop F. revenue lost
  • Exploratory Data Analysis
    • Inside AirBnB
    • Why visual?
    • Chart Types (visualizing continuous, categorical, and distributions and facets)
    • Distributions of Prop F. Revenue vs. point statistics

Visualize

Time to visualize! Using pandas (and matplotlib) create a visualization of each of the following:

  • Distribution of room_type (for entire city)
  • Histogram of # of listings per neighborhood
  • Histogram of # of listings for each user
  • City wide distribution of listing price
  • Distribution of median listing price per neighborhood
  • Histogram of number of reviews per listing

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
# We will use the Inside AirBnB dataset from here on
df = pd.read_csv('data/sf_listings.csv')
df.head()


Out[2]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 1162609 Lovely One Bedroom Apartment 6368122 Taylor NaN Seacliff 37.785217 -122.488655 Entire home/apt 350 4 8 2015-09-17 0.28 1 90
1 6032828 Historic Seacliff Home 30384615 Patricia NaN Seacliff 37.783658 -122.489398 Entire home/apt 300 1 0 NaN NaN 1 173
2 6938818 Best Secret in Town 36381578 Harris NaN Seacliff 37.781505 -122.504754 Private room 119 1 10 2015-10-08 2.33 1 341
3 8087607 Single Room Beautiful Beach Condo 3264449 Keith NaN Seacliff 37.775318 -122.511621 Private room 79 1 0 NaN NaN 1 40
4 4781448 3 Bd 2.5 Ba Full Flat Condo w Views 13112558 Pam NaN Seacliff 37.781797 -122.492492 Entire home/apt 695 2 1 2015-08-11 0.36 1 23

In [3]:
df.room_type.value_counts().plot.bar()


Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x10aa23710>

In [4]:
# Since SF doesn't have many neighborhoods (comparatively) we can also see the raw # per neighborhood
df.groupby('neighbourhood').count()['id'].plot.bar(figsize=(14,6))


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b2fa1d0>

In [5]:
df.groupby('host_id').count()['id'].plot.hist(bins=50)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b671910>

In [6]:
# let's zoom in to the tail
subselect = df.groupby('host_id').count()['id']
subselect[subselect > 1].plot.hist(bins=50)


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a9f0210>

In [7]:
def scale_free_plot(df, num):
    subselect = df.groupby('host_id').count()['id']
    return subselect[subselect > num].plot.hist(bins=75)

In [8]:
scale_free_plot(df, 2)


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bcf9450>

In [9]:
# the shape of the distribution stays relatively the same as we subselect
for i in range(5):
    scale_free_plot(df, i)
    plt.show()


Scatterplot Matrix

In an effort to find potential correlations (or outliers) you want a little bit more fine grained loot at the data. Create a scatterplot matrix of the data for your city. http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-scatter-matrix


In [10]:
from pandas.tools.plotting import scatter_matrix

# it only makes sense to plot the continuous columns
continuous_columns = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', \
           'calculated_host_listings_count','availability_365']

# semicolon prevents the axis objests from printing
scatter_matrix(df[continuous_columns], alpha=0.6, figsize=(16, 16), diagonal='kde');


Interesting insights from the scatter matrix:

  • price heavily skewed towards cheap prices (with a few extreme outliers). host_listings_count and number_of_reviews have similar distributions.
  • minimum_nights has a sharp bimodal distribution.
  • Listing are bimodal too and are either:
    • available for a relatively short period of the year
    • available for most of it (these are probably the "hotels")
  • Host with a large number of listings have them each for a relative low price.
  • Listings that are expensive have very few reviews (i.e. not many people stay at them)

In [11]:
sns.distplot(df[(df.calculated_host_listings_count > 2) & (df.room_type == 'Entire home/apt')].availability_365, bins=50)
sns.distplot(df[(df.calculated_host_listings_count <= 2) & (df.room_type == 'Entire home/apt')].availability_365, bins=50)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x10dd09510>

In [12]:
# Host with multiple listing for the entire home distribution is skewed to availability the entire year
# implying that these hosts are renting the AirBnB as short term sublets (or hotels)

entire_home = df[df.room_type == 'Entire home/apt']

plt.figure(figsize=(14,6))
sns.kdeplot(entire_home[entire_home.calculated_host_listings_count > 1].availability_365, label='Multiple Listings')
sns.kdeplot(entire_home[entire_home.calculated_host_listings_count == 1].availability_365, label = 'Single Listing')
plt.legend();



In [13]:
# Host with multiple listing for the entire home distribution is skewed to availability the entire year
# implying that these hosts are renting the AirBnB as short term sublets (or hotels)

plt.figure(figsize=(14,6))
sns.kdeplot(df[df.minimum_nights > 29].availability_365, label='Short term Sublet')
sns.kdeplot(df[df.minimum_nights <= 20].availability_365, label = 'Listing')
plt.legend();



In [14]:
# Host with multiple listing for the entire home distribution is skewed to availability the entire year
# implying that these hosts are renting the AirBnB as short term sublets (or hotels)

entire_home = df[df.minimum_nights > 29]

plt.figure(figsize=(14,6))
sns.kdeplot(entire_home[entire_home.calculated_host_listings_count > 1].availability_365, label='Multiple Listings')
sns.kdeplot(entire_home[entire_home.calculated_host_listings_count == 1].availability_365, label = 'Single Listing')
plt.legend();


Extra!

Advanced Plots with Seaborn

Make a violin plot of the price distribution of each neighborhood.

If your city has a large number of neighborhoods plot the 10 with the most listing.


In [15]:
# just a tocuh hard to interpret...
plt.figure(figsize=(16, 6))
sns.violinplot(data=df, x='neighbourhood', y='price')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x10dcfb2d0>

In [16]:
# boxplots can sometimes handle outliers better, we can see here there are some listings that are high priced extrema
plt.figure(figsize=(16, 6))
sns.boxplot(data=df, x='neighbourhood', y='price')


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f234910>

Lets try to only show the 10 neighborhoods with the most listings and to zoom in on the distribution of the lower prices (now that we can identify the outliers) we can remove listings priced at > $2000


In [17]:
top_neighborhoods = df.groupby('neighbourhood').count().sort_values('id', ascending = False).index[:10]
top_neighborhoods


Out[17]:
Index([u'Mission', u'Western Addition', u'South of Market',
       u'Castro/Upper Market', u'Haight Ashbury', u'Bernal Heights', u'Marina',
       u'Noe Valley', u'Downtown/Civic Center', u'Nob Hill'],
      dtype='object', name=u'neighbourhood')

In [18]:
neighborhood_subset = df[df.neighbourhood.isin(top_neighborhoods)]

In [19]:
plt.figure(figsize=(16, 6))
sns.boxplot(data=neighborhood_subset[neighborhood_subset.price < 2000], x='neighbourhood', y='price')


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d045110>

In [20]:
plt.figure(figsize=(16, 6))
sns.violinplot(data=neighborhood_subset[neighborhood_subset.price < 2000], x='neighbourhood', y='price')


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c3a5710>

Boxplots vs. Violinplots

  • Boxplot
    • Can be easier to interpret (a visual representation of point statistics, the quartiles)
    • Shows individual outlier data points (violinplot only shows range of outliers, max values)
    • More common and familiar
    • Can be easier to compare a large number of boxplots
    • Can be easier to compare the spread (upper - lower quartile) of the values
  • Violinplot
    • Shows more information (distribution rather than point statistic)
    • Will show bimodality (or more complex distribution of values, boxplot collapses this to one number)
    • Shows density of values