Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [58]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [59]:
# Create a data frame from the listings dataset

listings = pd.read_csv('../data/listings.csv')

# Create a data frame from the bookings dataset

bookings = pd.read_csv('../data/bookings.csv')

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [82]:
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()


Out[82]:
price person_capacity picture_count description_length tenure_months
count 408.000000 408.000000 408.000000 408.000000 408.000000
mean 187.806373 2.997549 14.389706 309.159314 8.487745
std 353.050858 1.594676 10.477428 228.021684 5.872088
min 39.000000 1.000000 1.000000 0.000000 1.000000
25% 90.000000 2.000000 6.000000 179.000000 4.000000
50% 125.000000 2.000000 12.000000 250.000000 7.000000
75% 199.000000 4.000000 20.000000 389.500000 13.000000
max 5000.000000 10.000000 71.000000 1969.000000 30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [61]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB

In [62]:
listings.groupby(['prop_type'])['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'].agg(['mean'])


Out[62]:
price person_capacity picture_count description_length tenure_months
mean mean mean mean mean
prop_type
Property type 1 237.085502 3.516729 14.695167 313.171004 8.464684
Property type 2 93.288889 2.000000 13.948148 304.851852 8.377778
Property type 3 63.750000 1.750000 8.750000 184.750000 13.750000

Same, but by property type per neighborhood?


In [63]:
listings.groupby(['prop_type', 'neighborhood'])['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'].agg(['mean'])


Out[63]:
price person_capacity picture_count description_length tenure_months
mean mean mean mean mean
prop_type neighborhood
Property type 1 Neighborhood 1 85.000000 2.000000 26.000000 209.000000 6.000000
Neighborhood 10 142.500000 3.500000 13.333333 391.000000 3.833333
Neighborhood 11 159.428571 3.214286 9.928571 379.000000 9.642857
Neighborhood 12 365.615385 3.435897 10.820513 267.205128 7.897436
Neighborhood 13 241.897959 4.061224 15.653061 290.408163 9.122449
Neighborhood 14 164.676471 3.205882 14.764706 317.205882 8.441176
Neighborhood 15 178.880000 3.720000 14.320000 321.760000 9.320000
Neighborhood 16 158.928571 2.928571 21.642857 310.714286 7.071429
Neighborhood 17 189.869565 3.521739 16.086957 317.347826 9.869565
Neighborhood 18 173.590909 2.954545 16.090909 369.227273 8.227273
Neighborhood 19 222.375000 3.625000 11.000000 254.500000 6.500000
Neighborhood 2 250.000000 6.000000 8.000000 423.000000 6.000000
Neighborhood 20 804.333333 2.777778 9.444444 223.555556 9.666667
Neighborhood 21 362.500000 4.250000 49.000000 306.250000 14.750000
Neighborhood 22 225.000000 3.000000 19.000000 500.000000 9.000000
Neighborhood 5 194.500000 2.500000 8.500000 266.500000 11.500000
Neighborhood 6 146.000000 3.333333 12.666667 290.666667 4.000000
Neighborhood 7 161.000000 3.666667 14.333333 343.000000 5.333333
Neighborhood 8 174.750000 5.000000 11.000000 300.000000 6.750000
Neighborhood 9 151.142857 4.285714 13.428571 471.428571 5.714286
Property type 2 Neighborhood 10 137.500000 2.000000 20.000000 126.000000 3.500000
Neighborhood 11 78.750000 2.000000 16.750000 161.250000 11.250000
Neighborhood 12 96.894737 1.947368 10.473684 244.526316 9.842105
Neighborhood 13 81.130435 1.826087 16.695652 418.565217 9.739130
Neighborhood 14 83.809524 1.857143 15.904762 348.619048 8.714286
Neighborhood 15 95.000000 2.266667 11.733333 301.733333 8.200000
Neighborhood 16 83.625000 2.062500 15.375000 246.250000 6.687500
Neighborhood 17 102.454545 2.000000 15.454545 308.272727 7.181818
Neighborhood 18 120.666667 2.222222 12.333333 297.777778 9.222222
Neighborhood 19 88.875000 2.000000 15.125000 383.375000 5.500000
Neighborhood 20 60.000000 1.000000 3.000000 101.000000 6.000000
Neighborhood 3 60.000000 2.000000 7.000000 264.000000 9.000000
Neighborhood 4 60.000000 2.000000 10.000000 95.000000 11.000000
Neighborhood 7 100.000000 2.000000 3.000000 148.000000 2.000000
Neighborhood 8 350.000000 4.000000 5.000000 223.000000 3.000000
Neighborhood 9 110.000000 2.000000 3.500000 114.500000 9.000000
Property type 3 Neighborhood 11 75.000000 2.000000 15.000000 196.000000 8.000000
Neighborhood 14 75.000000 1.000000 1.000000 113.000000 5.000000
Neighborhood 17 65.000000 2.000000 15.000000 189.000000 23.000000
Neighborhood 4 40.000000 2.000000 4.000000 241.000000 19.000000

Plot daily bookings:


In [64]:
# Generate a count of bookings by date
cross_tab = bookings.groupby(['booking_date'])['prop_id'].agg(['count']).unstack(0)

# Plot the table
cross_tab.plot()


Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bc97b90>

Plot the daily bookings per neighborhood (provide a legend)


In [65]:
# Create a smaller dataset with just the prop_id and neighborhood
neighborhoods = listings[['prop_id', 'neighborhood']]

# Merge bookings_by_prop into the listings data
bookings_neighborhoods = bookings.merge(neighborhoods, on='prop_id', how='left')

# Generate a count of bookings by date and neighborhood
cross_tab = bookings_neighborhoods.groupby(['neighborhood','booking_date'])['prop_id'].agg(['count']).unstack(0)

# Plot the table
cross_tab.plot()


Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bd99750>

Part 2 - Develop a data set


In [66]:
# Count the number of bookings by property
bookings_by_prop = bookings.groupby('prop_id')[['prop_id']].count()
bookings_by_prop

# Rename columns and reset the index
bookings_by_prop.rename(columns={'prop_id': 'number_of_bookings'}, inplace=True)
bookings_by_prop = bookings_by_prop.reset_index()

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [67]:
# Merge bookings_by_prop into the listings data
listings = listings.merge(bookings_by_prop, on='prop_id', how='left')

# Replace null values with 0's
listings.number_of_bookings.replace(np.nan, 0, inplace = True)

# Divide number_of_bookings by tenure_months
listings['booking_rate'] = listings.number_of_bookings / listings.tenure_months

# Describe the resulting dataset
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 10 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
number_of_bookings    408 non-null float64
booking_rate          408 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 35.1+ KB

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [68]:
listings_filtered = listings[listings.tenure_months > 10]
listings_filtered.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 120 entries, 0 to 119
Data columns (total 10 columns):
prop_id               120 non-null int64
prop_type             120 non-null object
neighborhood          120 non-null object
price                 120 non-null int64
person_capacity       120 non-null int64
picture_count         120 non-null int64
description_length    120 non-null int64
tenure_months         120 non-null int64
number_of_bookings    120 non-null float64
booking_rate          120 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 10.3+ KB

prop_type and neighborhood are categorical variables, use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [69]:
listings_filtered = pd.get_dummies(listings_filtered)

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [70]:
from sklearn.cross_validation import train_test_split

In [71]:
# Create a reduced data from of features from the original data frame
cols = list(listings_filtered.columns)
cols.remove('prop_id')
cols.remove('booking_rate')
cols.remove('number_of_bookings')
iv = listings_filtered[cols]

# Create a data frame with the target variable
dv = listings_filtered['booking_rate'].values

# Split those two data frames into training and test datasets
iv_train, iv_test, dv_train, dv_test = train_test_split(iv, dv, test_size=0.2)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [72]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

fit your model with your test sets


In [73]:
lr.fit(iv_train, dv_train)


Out[73]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [74]:
lr.score(iv_train, dv_train)


Out[74]:
0.31484836732897703

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

In [75]:
# The score tells us the amount of variance in the outcome variable that is explained by the predictor variables.
# A score of 0.41 tells us that the model explains about 40% of the variance in the outcome variable.

...type here...

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [76]:
# Create a reduced data from of features from the original data frame
listings_filtered['monthly_revenue'] = listings_filtered.price * listings_filtered.booking_rate

# Create a reduced data from of features from the original data frame
cols = list(listings_filtered.columns)
cols.remove('prop_id')
cols.remove('booking_rate')
cols.remove('number_of_bookings')
iv = listings_filtered[cols]

# Create a data frame with the target variable
dv = listings_filtered['booking_rate'].values

# Split those two data frames into training and test datasets
iv_train, iv_test, dv_train, dv_test = train_test_split(iv, dv, test_size=0.2)

# Fit model with the test sets
lr.fit(iv_train, dv_train)

# Report the score
lr.score(iv_train, dv_train)


Out[76]:
0.92979467316975706

In [57]:


In [ ]: