#Homework 1 - Data Analysis and Regression In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [331]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [332]:
pd.set_option('display.max_row', None)
pd.set_option('display.precision', 3)
pd.set_option('display.max_column', None)

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [333]:
bookings = pd.read_csv('../data/bookings.csv', parse_dates = ['booking_date'])
bookings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)

In [334]:
listings = pd.read_csv('../data/listings.csv')

In [335]:
merged = listings.merge(bookings)
merged.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
booking_date          6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(6), object(2)

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [336]:
listings[['price', 'person_capacity',
          'picture_count', 'description_length',
          'tenure_months']].describe().loc[['mean', '50%', 'std']]


Out[336]:
price person_capacity picture_count description_length tenure_months
mean 187.81 3.00 14.39 309.16 8.49
50% 125.00 2.00 12.00 250.00 7.00
std 353.05 1.59 10.48 228.02 5.87

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [337]:
listings.groupby('prop_type')['price',
                              'person_capacity', 'picture_count',
                              'description_length', 'tenure_months'].mean()


Out[337]:
price person_capacity picture_count description_length tenure_months
prop_type
Property type 1 237.09 3.52 14.70 313.17 8.46
Property type 2 93.29 2.00 13.95 304.85 8.38
Property type 3 63.75 1.75 8.75 184.75 13.75

Same, but by property type per neighborhood?


In [338]:
listings.groupby(['neighborhood', 'prop_type'])['price',
                                                'person_capacity', 'picture_count',
                                                'description_length', 'tenure_months'].mean()


Out[338]:
price person_capacity picture_count description_length tenure_months
neighborhood prop_type
Neighborhood 1 Property type 1 85.00 2.00 26.00 209.00 6.00
Neighborhood 10 Property type 1 142.50 3.50 13.33 391.00 3.83
Property type 2 137.50 2.00 20.00 126.00 3.50
Neighborhood 11 Property type 1 159.43 3.21 9.93 379.00 9.64
Property type 2 78.75 2.00 16.75 161.25 11.25
Property type 3 75.00 2.00 15.00 196.00 8.00
Neighborhood 12 Property type 1 365.62 3.44 10.82 267.21 7.90
Property type 2 96.89 1.95 10.47 244.53 9.84
Neighborhood 13 Property type 1 241.90 4.06 15.65 290.41 9.12
Property type 2 81.13 1.83 16.70 418.57 9.74
Neighborhood 14 Property type 1 164.68 3.21 14.76 317.21 8.44
Property type 2 83.81 1.86 15.90 348.62 8.71
Property type 3 75.00 1.00 1.00 113.00 5.00
Neighborhood 15 Property type 1 178.88 3.72 14.32 321.76 9.32
Property type 2 95.00 2.27 11.73 301.73 8.20
Neighborhood 16 Property type 1 158.93 2.93 21.64 310.71 7.07
Property type 2 83.62 2.06 15.38 246.25 6.69
Neighborhood 17 Property type 1 189.87 3.52 16.09 317.35 9.87
Property type 2 102.45 2.00 15.45 308.27 7.18
Property type 3 65.00 2.00 15.00 189.00 23.00
Neighborhood 18 Property type 1 173.59 2.95 16.09 369.23 8.23
Property type 2 120.67 2.22 12.33 297.78 9.22
Neighborhood 19 Property type 1 222.38 3.62 11.00 254.50 6.50
Property type 2 88.88 2.00 15.12 383.38 5.50
Neighborhood 2 Property type 1 250.00 6.00 8.00 423.00 6.00
Neighborhood 20 Property type 1 804.33 2.78 9.44 223.56 9.67
Property type 2 60.00 1.00 3.00 101.00 6.00
Neighborhood 21 Property type 1 362.50 4.25 49.00 306.25 14.75
Neighborhood 22 Property type 1 225.00 3.00 19.00 500.00 9.00
Neighborhood 3 Property type 2 60.00 2.00 7.00 264.00 9.00
Neighborhood 4 Property type 2 60.00 2.00 10.00 95.00 11.00
Property type 3 40.00 2.00 4.00 241.00 19.00
Neighborhood 5 Property type 1 194.50 2.50 8.50 266.50 11.50
Neighborhood 6 Property type 1 146.00 3.33 12.67 290.67 4.00
Neighborhood 7 Property type 1 161.00 3.67 14.33 343.00 5.33
Property type 2 100.00 2.00 3.00 148.00 2.00
Neighborhood 8 Property type 1 174.75 5.00 11.00 300.00 6.75
Property type 2 350.00 4.00 5.00 223.00 3.00
Neighborhood 9 Property type 1 151.14 4.29 13.43 471.43 5.71
Property type 2 110.00 2.00 3.50 114.50 9.00

Plot daily bookings:


In [339]:
merged.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
booking_date          6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(6), object(2)

In [340]:
merged.booking_date = pd.to_datetime(merged.booking_date)

In [341]:
merged.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
booking_date          6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(6), object(2)

In [342]:
merged['day_of_week'] = merged.booking_date.map(lambda x: x.dayofweek)

In [343]:
def day(x):
    day_of_week = {}
    day_of_week[0] = 'Mon'
    day_of_week[1] = 'Tues'
    day_of_week[2] = 'Wed'
    day_of_week[3] = 'Thurs'
    day_of_week[4] = 'Fri'
    day_of_week[5] = 'Sat'
    day_of_week[6] = 'Sun'
    return day_of_week[x]

merged['day_name'] = merged['day_of_week'].map(day)

In [344]:
merged.groupby(['day_of_week'])['day_name'].value_counts().plot(kind = 'bar')


Out[344]:
<matplotlib.axes._subplots.AxesSubplot at 0x110b7e5d0>

Plot the daily bookings per neighborhood (provide a legend)


In [345]:
sns.factorplot("neighborhood", hue="day_name", data=merged, palette="Purples_d", size = 25, legend = True);



In [345]:

Part 2 - Develop a data set

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [395]:
bookings_ct = bookings.groupby('prop_id').count().reset_index()
merged2 = listings.merge(bookings_ct, on = 'prop_id')
merged2.head()


Out[395]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months booking_date
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4
1 3 Property type 2 Neighborhood 16 95 2 16 172 29 1
2 4 Property type 2 Neighborhood 13 90 2 19 472 28 27
3 6 Property type 2 Neighborhood 13 89 2 10 886 28 88
4 7 Property type 2 Neighborhood 13 85 1 11 58 24 2

In [396]:
merged2.rename(columns = {'booking_date' : 'number_of_bookings'}, inplace = True)
merged2['booking_rate'] = merged2.number_of_bookings/merged2.tenure_months
merged2.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 328 entries, 0 to 327
Data columns (total 10 columns):
prop_id               328 non-null int64
prop_type             328 non-null object
neighborhood          328 non-null object
price                 328 non-null int64
person_capacity       328 non-null int64
picture_count         328 non-null int64
description_length    328 non-null int64
tenure_months         328 non-null int64
number_of_bookings    328 non-null int64
booking_rate          328 non-null float64
dtypes: float64(1), int64(7), object(2)

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [397]:
merged2 = merged2[merged2.tenure_months  > 9]
merged2.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 113 entries, 0 to 112
Data columns (total 10 columns):
prop_id               113 non-null int64
prop_type             113 non-null object
neighborhood          113 non-null object
price                 113 non-null int64
person_capacity       113 non-null int64
picture_count         113 non-null int64
description_length    113 non-null int64
tenure_months         113 non-null int64
number_of_bookings    113 non-null int64
booking_rate          113 non-null float64
dtypes: float64(1), int64(7), object(2)

prop_type and neighborhood are categorical variables, use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [398]:
dummy_prop = pd.get_dummies(merged2['prop_type'])
dummy_neigh = pd.get_dummies(merged2['neighborhood'])
merged2_prop = merged2.join(dummy_prop)

In [399]:
merged3 = merged2_prop.join(dummy_neigh)

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [400]:
from sklearn.cross_validation import train_test_split

In [401]:
merged_final = merged3.drop(['prop_id', 'prop_type', 'neighborhood', 'number_of_bookings'], axis = 1)
features = merged_final.drop(['booking_rate'], axis = 1).values
target = merged_final['booking_rate'].values

In [401]:


In [402]:
features.shape


Out[402]:
(113, 22)

In [403]:
target = np.log(target)

In [404]:
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size = .2, random_state = (9))

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [419]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [420]:
lr.fit(feature_train, target_train )


Out[420]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

fit your model with your test sets


In [421]:
target_pred = lr.predict(feature_test)

In [422]:
# target_pred[target_pred == target_pred.min()] = target_pred.mean()

target_pred[target_pred < -4] = target_pred.mean()

target_pred


Out[422]:
array([ 0.29523577, -2.04599243, -1.15709947, -1.87976958, -0.65496419,
       -0.1700946 , -0.73483983, -1.12476064,  0.48094131, -0.24461875,
       -0.21776611,  0.2702393 , -1.06444888, -1.18860123, -0.3414003 ,
       -0.76560807,  0.68593991, -0.20624223, -0.07908219, -1.11982075,
       -0.38995087,  0.64780865,  0.29358936])

In [423]:
sum_sq_model = np.sum((target_test - target_pred) ** 2)
sum_sq_model


Out[423]:
43.863998813931239

In [424]:
sum_sq_naive = np.sum((target_test - target_test.mean()) ** 2)
sum_sq_naive


Out[424]:
44.793393920738836

In [425]:
1 - sum_sq_model / sum_sq_naive


Out[425]:
0.020748486003363542

In [426]:
lr.score(feature_test, target_test, sample_weight = None)


Out[426]:
0.020748486003363542

In [388]:
fig, ax = plt.subplots(1,1)

ax.scatter(target_pred, target_test)
ax.plot(target, target, 'r')


Out[388]:
[<matplotlib.lines.Line2D at 0x1108f5e50>]

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

The score method is an indicator of how well the model is performing. It takes the squared difference between the target test y value and the predicted 'y' value. It then takes 1 and subtracts the squared difference. If value is 1 it means that the squared difference is 0, which would reflex prefectly predicted 'y' values. If the value is below 0 it means that the predicted values are worse than the just placing the mean of 'y' for every predicted value.

The model comes up with a generally low or negative number, varying each time because train_test_split takes a different sample each time it runs. That means that the prediction is generally slightly better or slightly worse than the mean depending on the target_test and feature_test sample.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [191]:
merged_final['monthly_rev'] = merged_final.booking_rate * merged_final.price

In [192]:
merged_final.monthly_rev.hist(bins = 50);



In [193]:
features1 = merged_final.drop(['booking_rate', 'monthly_rev'], axis = 1).values

In [194]:
target1 = merged_final['monthly_rev']

In [195]:
target1 = np.log(target1)

In [196]:
features_train1, features_test1, target_train1, target_test1 = train_test_split(features1, target1, test_size = .2)

In [197]:
lr.fit(features_train1, target_train1)


Out[197]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [198]:
features_train1


Out[198]:
array([[  49.,    2.,   14., ...,    0.,    0.,    0.],
       [ 350.,    4.,   71., ...,    0.,    0.,    0.],
       [ 175.,    3.,   24., ...,    0.,    0.,    0.],
       ..., 
       [ 180.,    6.,   26., ...,    0.,    0.,    0.],
       [ 110.,    3.,   29., ...,    0.,    0.,    0.],
       [  65.,    2.,   15., ...,    0.,    0.,    0.]])

In [199]:
target_pred1 = lr.predict(features_test1)

In [199]:


In [200]:
sum_sq_model = np.sum((target_test1 - target_pred1) ** 2)
sum_sq_model


Out[200]:
27.479035852227003

In [201]:
sum_sq_naive = np.sum((target_test1 - target_test1.mean()) ** 2)
sum_sq_naive


Out[201]:
21.537077221694485

In [202]:
1 - sum_sq_model / sum_sq_naive


Out[202]:
-0.27589438294566415

In [203]:
lr.score(features_test1, target_test1, sample_weight = None)


Out[203]:
-0.27589438294566415

In [204]:
fig, ax = plt.subplots(1,1)

ax.scatter(target_pred1, target_test1)
ax.plot(target1, target1, 'r')


Out[204]:
[<matplotlib.lines.Line2D at 0x10f78b890>]

In [43]:


In [43]:


In [ ]: