Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

pd.set_option('display.max_rows', 10)

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [3]:
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')
#bookings[bookings.prop_id==1]

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [4]:
listings[['price','person_capacity','picture_count','description_length','tenure_months']].mean()


Out[4]:
price                 187.806373
person_capacity         2.997549
picture_count          14.389706
description_length    309.159314
tenure_months           8.487745
dtype: float64

In [5]:
listings[['price','person_capacity','picture_count','description_length','tenure_months']].median()


Out[5]:
price                 125
person_capacity         2
picture_count          12
description_length    250
tenure_months           7
dtype: float64

In [6]:
listings[['price','person_capacity','picture_count','description_length','tenure_months']].std()


Out[6]:
price                 353.050858
person_capacity         1.594676
picture_count          10.477428
description_length    228.021684
tenure_months           5.872088
dtype: float64

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [7]:
listings.groupby('prop_type')['price','person_capacity','picture_count','description_length','tenure_months'].agg(['mean'])


Out[7]:
price person_capacity picture_count description_length tenure_months
mean mean mean mean mean
prop_type
Property type 1 237.085502 3.516729 14.695167 313.171004 8.464684
Property type 2 93.288889 2.000000 13.948148 304.851852 8.377778
Property type 3 63.750000 1.750000 8.750000 184.750000 13.750000

Same, but by property type per neighborhood?


In [8]:
listings.groupby(['prop_type','neighborhood'])['price','person_capacity','picture_count','description_length','tenure_months'].agg(['mean'])


Out[8]:
price person_capacity picture_count description_length tenure_months
mean mean mean mean mean
prop_type neighborhood
Property type 1 Neighborhood 1 85.000000 2.000000 26.000000 209.000000 6.000000
Neighborhood 10 142.500000 3.500000 13.333333 391.000000 3.833333
Neighborhood 11 159.428571 3.214286 9.928571 379.000000 9.642857
Neighborhood 12 365.615385 3.435897 10.820513 267.205128 7.897436
Neighborhood 13 241.897959 4.061224 15.653061 290.408163 9.122449
... ... ... ... ... ... ...
Property type 2 Neighborhood 9 110.000000 2.000000 3.500000 114.500000 9.000000
Property type 3 Neighborhood 11 75.000000 2.000000 15.000000 196.000000 8.000000
Neighborhood 14 75.000000 1.000000 1.000000 113.000000 5.000000
Neighborhood 17 65.000000 2.000000 15.000000 189.000000 23.000000
Neighborhood 4 40.000000 2.000000 4.000000 241.000000 19.000000

40 rows × 5 columns

Plot daily bookings:


In [9]:
bookings.booking_date.value_counts().sort_index().plot()


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b4d8590>

Plot the daily bookings per neighborhood (provide a legend)


In [10]:
booking_all = pd.merge(listings, bookings, how='outer', left_on='prop_id', right_on='prop_id')
booking_all.groupby(['booking_date','neighborhood']).count().sort_index().plot()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c030710>

Part 2 - Develop a data set

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [11]:
number_of_bookings = bookings.groupby('prop_id').count().reset_index()
number_of_bookings.rename(columns={'booking_date':'number_of_bookings'}, inplace = True)
listings2 = pd.merge(listings, number_of_bookings, how='left', on='prop_id')
listings2.number_of_bookings.fillna(0, inplace=True)
# Alternative way: listings2['number_of_bookings'] = listings2.number_of_bookings.fillna(0)
listings2['booking_rate'] = listings2.number_of_bookings/listings2.tenure_months
listings2


Out[11]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 0.133333
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 0 0.000000
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1 0.034483
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27 0.964286
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 0 0.000000
... ... ... ... ... ... ... ... ... ... ...
403 404 Property type 2 Neighborhood 14 100 1 8 235 1 3 3.000000
404 405 Property type 2 Neighborhood 13 85 2 27 1048 1 19 19.000000
405 406 Property type 1 Neighborhood 9 70 3 18 153 1 19 19.000000
406 407 Property type 1 Neighborhood 13 129 2 13 370 1 15 15.000000
407 408 Property type 1 Neighborhood 14 100 3 21 707 1 54 54.000000

408 rows × 10 columns

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [12]:
listings2[listings2.tenure_months > 10]


Out[12]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 0.133333
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 0 0.000000
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1 0.034483
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27 0.964286
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 0 0.000000
... ... ... ... ... ... ... ... ... ... ...
115 116 Property type 1 Neighborhood 17 125 2 9 89 11 0 0.000000
116 117 Property type 2 Neighborhood 14 49 2 14 417 11 8 0.727273
117 118 Property type 2 Neighborhood 4 60 2 10 95 11 11 1.000000
118 119 Property type 2 Neighborhood 12 55 2 8 333 11 1 0.090909
119 120 Property type 2 Neighborhood 9 100 1 4 212 11 0 0.000000

120 rows × 10 columns

prop_type and neighborhood are categorical variables, use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [13]:
listings2.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 10 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
number_of_bookings    408 non-null float64
booking_rate          408 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 35.1+ KB

In [14]:
pd.get_dummies(listings2[['prop_type', 'neighborhood', 'tenure_months']])


Out[14]:
tenure_months prop_type_Property type 1 prop_type_Property type 2 prop_type_Property type 3 neighborhood_Neighborhood 1 neighborhood_Neighborhood 10 neighborhood_Neighborhood 11 neighborhood_Neighborhood 12 neighborhood_Neighborhood 13 neighborhood_Neighborhood 14 ... neighborhood_Neighborhood 20 neighborhood_Neighborhood 21 neighborhood_Neighborhood 22 neighborhood_Neighborhood 3 neighborhood_Neighborhood 4 neighborhood_Neighborhood 5 neighborhood_Neighborhood 6 neighborhood_Neighborhood 7 neighborhood_Neighborhood 8 neighborhood_Neighborhood 9
0 30 1 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
1 29 1 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
2 29 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 28 0 1 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
4 28 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
403 1 0 1 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
404 1 0 1 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
405 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
406 1 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
407 1 1 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0

408 rows × 26 columns


In [15]:
#Best way
features_cols = [col for col in listings2.columns if col not in ['prop_id', 'booking_rate', 'number_of_bookings']]
features = pd.get_dummies(listings2[features_cols])

In [16]:
#Other way
#prop_type_dummies = pd.get_dummies(listings2.prop_type)
#neighborhood_dummies = pd.get_dummies(listings2.neighborhood)
#features1 = listings2.join(prop_type_dummies, on='prop_id')
#features1 = listings2.join(neighborhood_dummies, on='prop_id')
#features1

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [17]:
from sklearn.cross_validation import train_test_split
y = listings2.booking_rate
X = features
X_train, X_test, y_train, y_test = train_test_split(X , y, test_size=0.2, random_state = 42)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [18]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Below we Train our linear Regression
lr.fit(X_train, y_train)


Out[18]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

fit your model with your test sets


In [19]:
lr.score(X_test, y_test)


Out[19]:
0.13602860098044356

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

It returns the Linear Regression score and in this case it tells us that there is a low positive correlation between the booking_rate and the columns in the features table.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [20]:
month_rev = listings2['booking_rate']*listings2['price']
month_rev
alt_pred = month_rev
regres = features
A_train, A_test, b_train, b_test = train_test_split(regres , alt_pred, test_size=0.2, random_state = 42)
lr2 = LinearRegression()
lr2.fit(A_train, b_train)
lr2.score(A_test, b_test)


Out[20]:
0.058306396109671699