Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [3]:
# Load airbnb data from data folder under hw/data/
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')

bookings.head(5)
listings.head(5)


Out[3]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 Property type 1 Neighborhood 14 140 3 11 232 30
1 2 Property type 1 Neighborhood 14 95 2 3 37 29
2 3 Property type 2 Neighborhood 16 95 2 16 172 29
3 4 Property type 2 Neighborhood 13 90 2 19 472 28
4 5 Property type 1 Neighborhood 15 125 5 21 442 28

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [4]:
# Describe the dataset - This gives you a summary of numerical columns

listings.describe()


Out[4]:
prop_id price person_capacity picture_count description_length tenure_months
count 408.000000 408.000000 408.000000 408.000000 408.000000 408.000000
mean 204.500000 187.806373 2.997549 14.389706 309.159314 8.487745
std 117.923704 353.050858 1.594676 10.477428 228.021684 5.872088
min 1.000000 39.000000 1.000000 1.000000 0.000000 1.000000
25% 102.750000 90.000000 2.000000 6.000000 179.000000 4.000000
50% 204.500000 125.000000 2.000000 12.000000 250.000000 7.000000
75% 306.250000 199.000000 4.000000 20.000000 389.500000 13.000000
max 408.000000 5000.000000 10.000000 71.000000 1969.000000 30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [5]:
# There are 3 different property types. Group by prop_type first and then describe.

listings_prop_type = listings.groupby(listings.prop_type)
listings_prop_type.describe()


Out[5]:
description_length person_capacity picture_count price prop_id tenure_months
prop_type
Property type 1 count 269.000000 269.000000 269.000000 269.000000 269.000000 269.000000
mean 313.171004 3.516729 14.695167 237.085502 204.754647 8.464684
std 214.769141 1.644955 10.623651 425.710534 119.643061 5.773367
min 17.000000 1.000000 1.000000 40.000000 1.000000 1.000000
25% 193.000000 2.000000 6.000000 120.000000 95.000000 4.000000
50% 266.000000 3.000000 12.000000 150.000000 208.000000 7.000000
75% 388.000000 4.000000 20.000000 229.000000 310.000000 14.000000
max 1719.000000 10.000000 71.000000 5000.000000 408.000000 30.000000
Property type 2 count 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000
mean 304.851852 2.000000 13.948148 93.288889 206.392593 8.377778
std 255.135332 0.846415 10.255191 42.261246 114.074805 5.963654
min 0.000000 1.000000 1.000000 39.000000 3.000000 1.000000
25% 150.500000 2.000000 6.500000 69.000000 120.500000 4.000000
50% 239.000000 2.000000 11.000000 89.000000 200.000000 7.000000
75% 402.500000 2.000000 19.500000 99.000000 299.500000 10.500000
max 1969.000000 6.000000 56.000000 350.000000 405.000000 29.000000
Property type 3 count 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000
mean 184.750000 1.750000 8.750000 63.750000 123.500000 13.750000
std 53.093471 0.500000 7.320064 16.520190 132.879645 8.616844
min 113.000000 1.000000 1.000000 40.000000 10.000000 5.000000
25% 170.000000 1.750000 3.250000 58.750000 17.500000 7.250000
50% 192.500000 2.000000 9.500000 70.000000 99.000000 13.500000
75% 207.250000 2.000000 15.000000 75.000000 205.000000 20.000000
max 241.000000 2.000000 15.000000 75.000000 286.000000 23.000000

Same, but by property type per neighborhood?


In [6]:
# There are 3 different property types and 22 neighborhoods. Group by prop_type and neighborhood first and then describe.

listings_prop_type = listings.groupby([listings.prop_type, listings.neighborhood])
listings_prop_type.describe()


Out[6]:
description_length person_capacity picture_count price prop_id tenure_months
prop_type neighborhood
Property type 1 Neighborhood 1 count 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mean 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
std NaN NaN NaN NaN NaN NaN
min 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
25% 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
50% 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
75% 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
max 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
Neighborhood 10 count 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000
mean 391.000000 3.500000 13.333333 142.500000 307.500000 3.833333
std 146.929915 1.224745 8.571270 36.979724 55.301899 1.602082
min 160.000000 2.000000 4.000000 90.000000 255.000000 1.000000
25% 312.250000 2.500000 6.750000 135.000000 265.000000 3.250000
50% 425.500000 4.000000 12.000000 137.500000 297.000000 4.500000
75% 499.000000 4.000000 20.250000 147.500000 330.500000 5.000000
max 537.000000 5.000000 24.000000 205.000000 400.000000 5.000000
Neighborhood 11 count 14.000000 14.000000 14.000000 14.000000 14.000000 14.000000
mean 379.000000 3.214286 9.928571 159.428571 174.000000 9.642857
std 396.956111 1.311404 5.928605 70.962302 110.198422 5.212517
min 82.000000 2.000000 5.000000 95.000000 52.000000 1.000000
25% 242.500000 2.000000 6.000000 103.750000 70.500000 6.000000
50% 295.500000 3.000000 8.000000 130.000000 142.500000 9.500000
75% 373.000000 3.750000 9.750000 196.500000 243.500000 15.500000
max 1719.000000 6.000000 23.000000 319.000000 401.000000 16.000000
Neighborhood 12 count 39.000000 39.000000 39.000000 39.000000 39.000000 39.000000
mean 267.205128 3.435897 10.820513 365.615385 211.307692 7.897436
std 137.867820 1.874972 7.118810 686.086484 118.670520 5.290482
min 45.000000 1.000000 1.000000 60.000000 22.000000 1.000000
25% 182.000000 2.000000 6.000000 125.000000 105.500000 3.000000
50% 251.000000 3.000000 8.000000 150.000000 223.000000 6.000000
... ... ... ... ... ... ... ... ...
Property type 3 Neighborhood 11 std NaN NaN NaN NaN NaN NaN
min 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
25% 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
50% 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
75% 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
max 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
Neighborhood 14 count 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mean 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
std NaN NaN NaN NaN NaN NaN
min 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
25% 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
50% 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
75% 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
max 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
Neighborhood 17 count 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mean 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
std NaN NaN NaN NaN NaN NaN
min 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
25% 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
50% 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
75% 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
max 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
Neighborhood 4 count 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mean 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
std NaN NaN NaN NaN NaN NaN
min 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
25% 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
50% 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
75% 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
max 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000

320 rows × 6 columns

Plot daily bookings:


In [88]:
# datetime field is imported as "object" which is the same as strings
bookings.booking_date = pd.to_datetime(bookings.booking_date)
# create new data frame with date and number of bookings
plot_bookings = bookings['booking_date'].value_counts()
# sort by date
plot_bookings.sort_index()
plot_bookings.plot()


Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a341310>

In [70]:
###Plot the daily bookings per neighborhood (provide a legend)

In [91]:
listings.neighborhood.value_counts()


Out[91]:
Neighborhood 13    72
Neighborhood 12    58
Neighborhood 14    56
Neighborhood 15    40
Neighborhood 17    35
Neighborhood 18    31
Neighborhood 16    30
Neighborhood 11    19
Neighborhood 19    16
Neighborhood 20    10
Neighborhood 9      9
Neighborhood 10     8
Neighborhood 8      5
Neighborhood 7      4
Neighborhood 21     4
Neighborhood 6      3
Neighborhood 4      2
Neighborhood 5      2
Neighborhood 22     1
Neighborhood 2      1
Neighborhood 3      1
Neighborhood 1      1
dtype: int64

Part 2 - Develop a data set


In [ ]:

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [36]:
bookings_by_prop = bookings.groupby('prop_id')[['prop_id']].count()
bookings_by_prop.rename(columns={'prop_id': 'number_of_bookings'}, inplace=True)
bookings_by_prop.reset_index(inplace=True)
bookings_by_prop.head()

listings.merge(bookings_by_prop, on='prop_id', how='left')

#def get_booking_rate(numb):   
    #return numb
    
#listings['booking_rate'] = listings['number_of_bookings'].map(get_booking_rate)


Out[36]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 NaN
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 NaN
5 6 Property type 2 Neighborhood 13 89 2 10 886 28 88
6 7 Property type 2 Neighborhood 13 85 1 11 58 24 2
7 8 Property type 1 Neighborhood 18 120 2 13 685 23 28
8 9 Property type 1 Neighborhood 13 210 6 27 180 23 3
9 10 Property type 3 Neighborhood 17 65 2 15 189 23 26
10 11 Property type 1 Neighborhood 13 145 3 9 140 22 7
11 12 Property type 2 Neighborhood 13 89 2 11 153 22 45
12 13 Property type 2 Neighborhood 14 96 2 10 245 20 57
13 14 Property type 2 Neighborhood 17 95 2 8 139 20 31
14 15 Property type 2 Neighborhood 11 95 2 15 255 19 12
15 16 Property type 2 Neighborhood 14 95 2 39 334 19 44
16 17 Property type 2 Neighborhood 12 65 2 7 333 19 7
17 18 Property type 2 Neighborhood 14 100 2 5 39 19 NaN
18 19 Property type 2 Neighborhood 12 65 2 9 448 19 1
19 20 Property type 3 Neighborhood 4 40 2 4 241 19 9
20 21 Property type 1 Neighborhood 13 295 5 22 228 18 9
21 22 Property type 1 Neighborhood 12 275 4 20 523 18 NaN
22 23 Property type 2 Neighborhood 14 90 3 15 411 18 52
23 24 Property type 1 Neighborhood 16 110 2 10 495 17 26
24 25 Property type 1 Neighborhood 18 215 2 16 190 17 3
25 26 Property type 1 Neighborhood 14 139 2 20 395 17 2
26 27 Property type 1 Neighborhood 15 180 6 26 325 17 68
27 28 Property type 2 Neighborhood 12 95 2 8 137 16 8
28 29 Property type 1 Neighborhood 17 125 3 31 327 16 2
29 30 Property type 2 Neighborhood 17 90 3 5 582 16 9
... ... ... ... ... ... ... ... ... ...
378 379 Property type 1 Neighborhood 18 79 2 5 348 2 3
379 380 Property type 1 Neighborhood 14 145 4 9 253 2 33
380 381 Property type 1 Neighborhood 16 159 3 17 206 2 23
381 382 Property type 1 Neighborhood 19 120 4 19 203 2 4
382 383 Property type 1 Neighborhood 14 200 4 13 116 2 NaN
383 384 Property type 1 Neighborhood 14 425 6 13 449 2 4
384 385 Property type 1 Neighborhood 13 110 4 22 464 1 9
385 386 Property type 1 Neighborhood 14 97 2 5 195 1 52
386 387 Property type 1 Neighborhood 14 125 2 6 425 1 43
387 388 Property type 1 Neighborhood 15 229 4 5 301 1 NaN
388 389 Property type 2 Neighborhood 17 39 1 21 247 1 8
389 390 Property type 1 Neighborhood 12 150 2 9 141 1 10
390 391 Property type 1 Neighborhood 14 95 2 8 217 1 33
391 392 Property type 2 Neighborhood 17 105 2 26 224 1 18
392 393 Property type 1 Neighborhood 19 319 6 8 213 1 16
393 394 Property type 2 Neighborhood 15 190 4 12 179 1 6
394 395 Property type 1 Neighborhood 15 120 2 5 192 1 12
395 396 Property type 1 Neighborhood 13 120 4 4 374 1 NaN
396 397 Property type 1 Neighborhood 13 139 4 13 303 1 24
397 398 Property type 1 Neighborhood 12 60 2 10 102 1 1
398 399 Property type 2 Neighborhood 16 100 1 6 81 1 NaN
399 400 Property type 1 Neighborhood 10 90 2 23 523 1 9
400 401 Property type 1 Neighborhood 11 150 4 8 82 1 2
401 402 Property type 1 Neighborhood 7 300 4 19 550 1 5
402 403 Property type 1 Neighborhood 19 100 4 2 69 1 1
403 404 Property type 2 Neighborhood 14 100 1 8 235 1 3
404 405 Property type 2 Neighborhood 13 85 2 27 1048 1 19
405 406 Property type 1 Neighborhood 9 70 3 18 153 1 19
406 407 Property type 1 Neighborhood 13 129 2 13 370 1 15
407 408 Property type 1 Neighborhood 14 100 3 21 707 1 54

408 rows × 9 columns

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [35]:
listings_filtered = listings[listings.tenure_months >= 10]
listings_filtered.describe()


Out[35]:
prop_id price person_capacity picture_count description_length tenure_months
count 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000
mean 72.500000 188.895833 3.354167 14.104167 308.645833 15.173611
std 41.713307 215.771464 1.837712 10.890246 177.770044 4.215074
min 1.000000 40.000000 1.000000 1.000000 0.000000 10.000000
25% 36.750000 93.750000 2.000000 6.000000 198.500000 12.000000
50% 72.500000 135.000000 3.000000 11.000000 282.500000 15.000000
75% 108.250000 245.250000 4.000000 18.250000 388.000000 16.000000
max 144.000000 2394.000000 8.000000 71.000000 1111.000000 30.000000

prop_type and neighborhood are categorical variables, use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [81]:
import pandas as pd
dummies1 = pd.get_dummies(listings['neighborhood'])
dummies1

dummies2 = pd.get_dummies(listings['prop_type'])
dummies2


Out[81]:
Property type 1 Property type 2 Property type 3
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 1 0 0
5 0 1 0
6 0 1 0
7 1 0 0
8 1 0 0
9 0 0 1
10 1 0 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 1 0
15 0 1 0
16 0 1 0
17 0 1 0
18 0 1 0
19 0 0 1
20 1 0 0
21 1 0 0
22 0 1 0
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 0 1 0
28 1 0 0
29 0 1 0
... ... ... ...
378 1 0 0
379 1 0 0
380 1 0 0
381 1 0 0
382 1 0 0
383 1 0 0
384 1 0 0
385 1 0 0
386 1 0 0
387 1 0 0
388 0 1 0
389 1 0 0
390 1 0 0
391 0 1 0
392 1 0 0
393 0 1 0
394 1 0 0
395 1 0 0
396 1 0 0
397 1 0 0
398 0 1 0
399 1 0 0
400 1 0 0
401 1 0 0
402 1 0 0
403 0 1 0
404 0 1 0
405 1 0 0
406 1 0 0
407 1 0 0

408 rows × 3 columns

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [ ]:
from sklearn.cross_validation import train_test_split

In [ ]:

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [ ]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

fit your model with your test sets


In [ ]:


In [ ]:

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

...type here...

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [ ]: