Homework 1 - Data Analysis and Regression

In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files are provided in hw/data/: bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [48]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt


# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create two data frames, listings and bookings, from their respective data files.


In [49]:
bookings = pd.read_csv('../data/bookings.csv', delimiter=",")
listings = pd.read_csv('../data/listings.csv', delimiter=",")

bookings.set_index(['prop_id'], inplace=True)
listings.set_index(['prop_id'], inplace=True)

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [50]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 1 to 408
Data columns (total 7 columns):
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(5), object(2)
memory usage: 25.5+ KB
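Note that info() reports column types and non-null counts rather than the statistics asked for. One possible way to get the mean, median and standard deviation in a single table is DataFrame.agg. The sketch below runs on a small made-up frame (listings_demo is a hypothetical stand-in, not the real listings data):

```python
import pandas as pd

# Hypothetical stand-in for the listings frame loaded from listings.csv.
listings_demo = pd.DataFrame({
    "price": [100, 150, 200],
    "person_capacity": [2, 4, 6],
    "picture_count": [5, 10, 15],
    "description_length": [100, 200, 300],
    "tenure_months": [3, 6, 9],
})

numeric_cols = ["price", "person_capacity", "picture_count",
                "description_length", "tenure_months"]

# One row per statistic, one column per variable.
# Note: pandas' std is the sample standard deviation (ddof=1).
summary = listings_demo[numeric_cols].agg(["mean", "median", "std"])
print(summary)
```

The same agg call should work unchanged on the real listings frame, since the column names match those reported by info().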

What are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [51]:
listings_by_prop_type = listings.groupby('prop_type')
listings_by_prop_type.describe()


Out[51]:
                       description_length  person_capacity  picture_count        price  tenure_months
prop_type
Property type 1 count          269.000000       269.000000     269.000000   269.000000     269.000000
                mean           313.171004         3.516729      14.695167   237.085502       8.464684
                std            214.769141         1.644955      10.623651   425.710534       5.773367
                min             17.000000         1.000000       1.000000    40.000000       1.000000
                25%            193.000000         2.000000       6.000000   120.000000       4.000000
                50%            266.000000         3.000000      12.000000   150.000000       7.000000
                75%            388.000000         4.000000      20.000000   229.000000      14.000000
                max           1719.000000        10.000000      71.000000  5000.000000      30.000000
Property type 2 count          135.000000       135.000000     135.000000   135.000000     135.000000
                mean           304.851852         2.000000      13.948148    93.288889       8.377778
                std            255.135332         0.846415      10.255191    42.261246       5.963654
                min              0.000000         1.000000       1.000000    39.000000       1.000000
                25%            150.500000         2.000000       6.500000    69.000000       4.000000
                50%            239.000000         2.000000      11.000000    89.000000       7.000000
                75%            402.500000         2.000000      19.500000    99.000000      10.500000
                max           1969.000000         6.000000      56.000000   350.000000      29.000000
Property type 3 count            4.000000         4.000000       4.000000     4.000000       4.000000
                mean           184.750000         1.750000       8.750000    63.750000      13.750000
                std             53.093471         0.500000       7.320064    16.520190       8.616844
                min            113.000000         1.000000       1.000000    40.000000       5.000000
                25%            170.000000         1.750000       3.250000    58.750000       7.250000
                50%            192.500000         2.000000       9.500000    70.000000      13.500000
                75%            207.250000         2.000000      15.000000    75.000000      20.000000
                max            241.000000         2.000000      15.000000    75.000000      23.000000
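describe() returns the full set of summary statistics per group. If only the group means are wanted, as the question asks, a shorter table can be produced with mean(). A minimal sketch on a made-up frame (listings_demo and its values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the listings frame.
listings_demo = pd.DataFrame({
    "prop_type": ["Property type 1", "Property type 1", "Property type 2"],
    "price": [100, 200, 80],
    "person_capacity": [2, 4, 2],
})

# Select the numeric columns explicitly so that text columns
# (such as neighborhood in the real data) are not averaged.
mean_by_type = listings_demo.groupby("prop_type")[["price", "person_capacity"]].mean()
print(mean_by_type)
```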

What are the same statistics, grouped by property type and neighborhood?


In [52]:
# Group by property type first, then by neighborhood within each type
listings_by_prop_type_then_neighborhood = listings.groupby(['prop_type', 'neighborhood'])
listings_by_prop_type_then_neighborhood.describe()


Out[52]:
                                       description_length  person_capacity  picture_count        price  tenure_months
prop_type       neighborhood
Property type 1 Neighborhood 1  count            1.000000         1.000000       1.000000     1.000000       1.000000
                                mean           209.000000         2.000000      26.000000    85.000000       6.000000
                                std                   NaN              NaN            NaN          NaN            NaN
                                min            209.000000         2.000000      26.000000    85.000000       6.000000
                                25%            209.000000         2.000000      26.000000    85.000000       6.000000
                                50%            209.000000         2.000000      26.000000    85.000000       6.000000
                                75%            209.000000         2.000000      26.000000    85.000000       6.000000
                                max            209.000000         2.000000      26.000000    85.000000       6.000000
                Neighborhood 10 count            6.000000         6.000000       6.000000     6.000000       6.000000
                                mean           391.000000         3.500000      13.333333   142.500000       3.833333
                                std            146.929915         1.224745       8.571270    36.979724       1.602082
                                min            160.000000         2.000000       4.000000    90.000000       1.000000
                                25%            312.250000         2.500000       6.750000   135.000000       3.250000
                                50%            425.500000         4.000000      12.000000   137.500000       4.500000
                                75%            499.000000         4.000000      20.250000   147.500000       5.000000
                                max            537.000000         5.000000      24.000000   205.000000       5.000000
                Neighborhood 11 count           14.000000        14.000000      14.000000    14.000000      14.000000
                                mean           379.000000         3.214286       9.928571   159.428571       9.642857
                                std            396.956111         1.311404       5.928605    70.962302       5.212517
                                min             82.000000         2.000000       5.000000    95.000000       1.000000
                                25%            242.500000         2.000000       6.000000   103.750000       6.000000
                                50%            295.500000         3.000000       8.000000   130.000000       9.500000
                                75%            373.000000         3.750000       9.750000   196.500000      15.500000
                                max           1719.000000         6.000000      23.000000   319.000000      16.000000
                Neighborhood 12 count           39.000000        39.000000      39.000000    39.000000      39.000000
                                mean           267.205128         3.435897      10.820513   365.615385       7.897436
                                std            137.867820         1.874972       7.118810   686.086484       5.290482
                                min             45.000000         1.000000       1.000000    60.000000       1.000000
                                25%            182.000000         2.000000       6.000000   125.000000       3.000000
                                50%            251.000000         3.000000       8.000000   150.000000       6.000000
...             ...             ...                   ...              ...            ...          ...            ...
Property type 3 Neighborhood 11 std                   NaN              NaN            NaN          NaN            NaN
                                min            196.000000         2.000000      15.000000    75.000000       8.000000
                                25%            196.000000         2.000000      15.000000    75.000000       8.000000
                                50%            196.000000         2.000000      15.000000    75.000000       8.000000
                                75%            196.000000         2.000000      15.000000    75.000000       8.000000
                                max            196.000000         2.000000      15.000000    75.000000       8.000000
                Neighborhood 14 count            1.000000         1.000000       1.000000     1.000000       1.000000
                                mean           113.000000         1.000000       1.000000    75.000000       5.000000
                                std                   NaN              NaN            NaN          NaN            NaN
                                min            113.000000         1.000000       1.000000    75.000000       5.000000
                                25%            113.000000         1.000000       1.000000    75.000000       5.000000
                                50%            113.000000         1.000000       1.000000    75.000000       5.000000
                                75%            113.000000         1.000000       1.000000    75.000000       5.000000
                                max            113.000000         1.000000       1.000000    75.000000       5.000000
                Neighborhood 17 count            1.000000         1.000000       1.000000     1.000000       1.000000
                                mean           189.000000         2.000000      15.000000    65.000000      23.000000
                                std                   NaN              NaN            NaN          NaN            NaN
                                min            189.000000         2.000000      15.000000    65.000000      23.000000
                                25%            189.000000         2.000000      15.000000    65.000000      23.000000
                                50%            189.000000         2.000000      15.000000    65.000000      23.000000
                                75%            189.000000         2.000000      15.000000    65.000000      23.000000
                                max            189.000000         2.000000      15.000000    65.000000      23.000000
                Neighborhood 4  count            1.000000         1.000000       1.000000     1.000000       1.000000
                                mean           241.000000         2.000000       4.000000    40.000000      19.000000
                                std                   NaN              NaN            NaN          NaN            NaN
                                min            241.000000         2.000000       4.000000    40.000000      19.000000
                                25%            241.000000         2.000000       4.000000    40.000000      19.000000
                                50%            241.000000         2.000000       4.000000    40.000000      19.000000
                                75%            241.000000         2.000000       4.000000    40.000000      19.000000
                                max            241.000000         2.000000       4.000000    40.000000      19.000000

320 rows × 5 columns

Plot daily bookings:


In [59]:
bookings.booking_date = pd.to_datetime(bookings.booking_date)
bookings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 9 to 408
Data columns (total 1 columns):
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 94.9 KB
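Converting the column to datetime is only the first step; the plot itself still needs a count of bookings per day. One way this could be done is sketched below on a few made-up rows (bookings_demo is a hypothetical stand-in, indexed by prop_id like the real frame):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the bookings frame, indexed by prop_id.
bookings_demo = pd.DataFrame(
    {"booking_date": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-02", "2011-01-04"])},
    index=pd.Index([1, 2, 1, 3], name="prop_id"),
)

# Number of bookings on each calendar day, in chronological order.
# Days without any booking do not appear in the result.
daily = bookings_demo.groupby("booking_date").size()

ax = daily.plot(title="Daily bookings")
ax.set_xlabel("booking date")
ax.set_ylabel("number of bookings")
```

In the notebook the figure is displayed inline; outside a notebook, a call to plt.show() would be needed.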

Plot the daily bookings per neighborhood (provide a legend)
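A possible approach, sketched on made-up data: since both frames are indexed by prop_id, the neighborhood of each booking can be attached with a join, and the counts can then be reshaped so that every neighborhood becomes its own line. The frames listings_demo and bookings_demo are hypothetical stand-ins.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-ins, both indexed by prop_id like the real frames.
listings_demo = pd.DataFrame(
    {"neighborhood": ["Neighborhood 1", "Neighborhood 2", "Neighborhood 1"]},
    index=pd.Index([1, 2, 3], name="prop_id"),
)
bookings_demo = pd.DataFrame(
    {"booking_date": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02"])},
    index=pd.Index([1, 2, 3, 3], name="prop_id"),
)

# Attach each booking's neighborhood via the shared prop_id index.
joined = bookings_demo.join(listings_demo["neighborhood"])

# Rows: days, columns: neighborhoods, values: booking counts (0 where none).
per_neighborhood = (
    joined.groupby(["booking_date", "neighborhood"]).size().unstack(fill_value=0)
)

# DataFrame.plot draws one line per column; the legend labels are the column names.
ax = per_neighborhood.plot(title="Daily bookings per neighborhood")
ax.set_ylabel("number of bookings")
ax.legend(title="neighborhood")
```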


In [ ]:

Part 2 - Develop a data set


In [ ]:

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame
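As a rough sketch of one way to do this, bookings can be counted per prop_id and aligned to the listings index. The example uses small made-up frames (hypothetical names and values) and assumes that a property without any bookings should get a count of 0:

```python
import pandas as pd

# Hypothetical stand-ins, both indexed by prop_id like the real frames.
listings_demo = pd.DataFrame(
    {"tenure_months": [10, 4, 5]},
    index=pd.Index([1, 2, 3], name="prop_id"),
)
bookings_demo = pd.DataFrame(
    {"booking_date": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"])},
    index=pd.Index([1, 1, 2], name="prop_id"),
)

# Bookings per property; level=0 groups on the prop_id index.
counts = bookings_demo.groupby(level=0).size()

# Align on prop_id; properties with no bookings get 0 (an assumption) instead of NaN.
listings_demo["number_of_bookings"] = counts.reindex(listings_demo.index, fill_value=0)
listings_demo["booking_rate"] = (
    listings_demo["number_of_bookings"] / listings_demo["tenure_months"]
)
print(listings_demo)
```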


In [ ]:

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months
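Filtering out tenures below 10 months is the same as keeping rows with tenure_months >= 10, which a boolean mask expresses directly. A minimal sketch on a made-up frame (listings_demo is hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the listings frame.
listings_demo = pd.DataFrame(
    {"tenure_months": [3, 10, 14]},
    index=pd.Index([1, 2, 3], name="prop_id"),
)

# Keep properties with a tenure of at least 10 months.
# .copy() avoids chained-assignment warnings when columns are added later.
established = listings_demo[listings_demo["tenure_months"] >= 10].copy()
print(established)
```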


In [ ]:

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values (after applying this function correctly, there should be 1 column for every prop_type and 1 column for every neighborhood category).
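To illustrate what get_dummies produces, here is a small sketch on made-up data (listings_demo is hypothetical). Each category value becomes its own indicator column, named after the source column and the value:

```python
import pandas as pd

# Hypothetical miniature of the listings frame.
listings_demo = pd.DataFrame({
    "prop_type": ["Property type 1", "Property type 2", "Property type 1"],
    "neighborhood": ["Neighborhood 1", "Neighborhood 1", "Neighborhood 2"],
    "price": [100, 80, 120],
})

# One indicator column per category, prefixed with the source column name.
dummies = pd.get_dummies(listings_demo[["prop_type", "neighborhood"]])

# Keep the original columns and append the indicators alongside them.
with_dummies = pd.concat([listings_demo, dummies], axis=1)
print(with_dummies.columns.tolist())
```

The indicator dtype depends on the pandas version (integer in older releases, boolean in newer ones); either works as regression input.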


In [ ]:

Create test and training sets for your target and predictors.

The target (y) is booking_rate; the predictors (X) are all remaining columns except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
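The split could look roughly like the sketch below. It assumes a scikit-learn release in which train_test_split is importable from sklearn.model_selection, and it uses a made-up frame (features_demo) whose columns and values are hypothetical. The test_size and random_state values are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame that already contains dummy columns and booking_rate.
features_demo = pd.DataFrame({
    "price": [100, 80, 120, 90, 150, 60, 110, 95],
    "tenure_months": [10, 12, 14, 11, 20, 16, 13, 18],
    "number_of_bookings": [20, 6, 28, 11, 10, 32, 13, 9],
    "prop_type": ["Property type 1", "Property type 2"] * 4,
    "neighborhood": ["Neighborhood 1"] * 4 + ["Neighborhood 2"] * 4,
    "prop_type_Property type 1": [1, 0] * 4,
    "prop_type_Property type 2": [0, 1] * 4,
})
features_demo["booking_rate"] = (
    features_demo["number_of_bookings"] / features_demo["tenure_months"]
)

# Target: the column to be predicted.
y = features_demo["booking_rate"]

# Predictors: drop the target, the raw categorical columns, and the
# column the target was computed from.
X = features_demo.drop(
    ["booking_rate", "prop_type", "neighborhood", "number_of_bookings"], axis=1
)

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

If prop_id is the index, as in the frames loaded above, it is not a column and does not need to be dropped.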


In [ ]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

In [ ]:

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [ ]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Fit your model with your training sets.
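A regression model is normally fitted on the training split, with the test split held back for evaluation. The sketch below shows the pattern on synthetic data with a known linear relationship (the data and coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noiseless data: y = 3*x0 - 2*x1 + 1.
rng = np.random.RandomState(0)
X = rng.rand(40, 2)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

lr = LinearRegression()
lr.fit(X_train, y_train)  # learn coefficients from the training rows only

# With noiseless data the fitted values recover the true coefficients.
print(lr.coef_, lr.intercept_)
```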


In [ ]:


In [ ]:

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?
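As a starting point for the first question: for LinearRegression, score(X, y) returns the coefficient of determination R^2 of the predictions on the data it is given. The sketch below checks this on synthetic data (made up for illustration) by comparing score with a manual computation and with sklearn.metrics.r2_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a linear trend plus a little noise.
rng = np.random.RandomState(0)
X = rng.rand(60, 1)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

X_train, X_test = X[:45], X[45:]
y_train, y_test = y[:45], y[45:]

lr = LinearRegression().fit(X_train, y_train)

score = lr.score(X_test, y_test)
y_pred = lr.predict(X_test)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_test - y_pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
manual = 1.0 - ss_res / ss_tot

print(score, manual, r2_score(y_test, y_pred))
```

A value near 1 means the model explains most of the variance in the held-out target, a value near 0 means it does no better than always predicting the mean, and the value can be negative when the model does worse than that.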

...type here...

Optional - Iterate

Create an alternative target (e.g. monthly revenue) and apply the same modeling pattern used in Part 3 to it.


In [ ]: