Homework 1 - Data Analysis and Regression

In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and begin our exploration of regression.



In [48]:

    
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt


# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames, `listings` and `bookings`, from their respective data files.



In [49]:

    
bookings = pd.read_csv('../data/bookings.csv', delimiter=",")
listings = pd.read_csv('../data/listings.csv', delimiter=",")

bookings.set_index(['prop_id'], inplace=True)
listings.set_index(['prop_id'], inplace=True)

What are the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [50]:

    
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 1 to 408
Data columns (total 7 columns):
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(5), object(2)
memory usage: 25.5+ KB
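
`listings.info()` reports column types and non-null counts rather than the statistics asked for. One way to get the mean, median and standard deviation in a single table is `DataFrame.agg`. The sketch below uses a small made-up frame (`listings_demo`, not the real data) with the same column names as `listings`:

```python
import pandas as pd

# Toy stand-in for `listings`; these rows are made up for illustration only.
listings_demo = pd.DataFrame({
    'price': [100, 150, 200],
    'person_capacity': [2, 4, 6],
    'picture_count': [5, 10, 30],
    'description_length': [120, 300, 240],
    'tenure_months': [3, 12, 9],
})

numeric_cols = ['price', 'person_capacity', 'picture_count',
                'description_length', 'tenure_months']

# One row per statistic, one column per variable.
summary = listings_demo[numeric_cols].agg(['mean', 'median', 'std'])
print(summary)
```

Applied to the real frame, the same call would be `listings[numeric_cols].agg(['mean', 'median', 'std'])`.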

What are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [51]:

    
listings_by_prop_type = listings.groupby('prop_type')
listings_by_prop_type.describe()









    Out[51]:






  
    
      
      
                        description_length  person_capacity  picture_count        price  tenure_months
prop_type
Property type 1  count          269.000000       269.000000     269.000000   269.000000     269.000000
                 mean           313.171004         3.516729      14.695167   237.085502       8.464684
                 std            214.769141         1.644955      10.623651   425.710534       5.773367
                 min             17.000000         1.000000       1.000000    40.000000       1.000000
                 25%            193.000000         2.000000       6.000000   120.000000       4.000000
                 50%            266.000000         3.000000      12.000000   150.000000       7.000000
                 75%            388.000000         4.000000      20.000000   229.000000      14.000000
                 max           1719.000000        10.000000      71.000000  5000.000000      30.000000
Property type 2  count          135.000000       135.000000     135.000000   135.000000     135.000000
                 mean           304.851852         2.000000      13.948148    93.288889       8.377778
                 std            255.135332         0.846415      10.255191    42.261246       5.963654
                 min              0.000000         1.000000       1.000000    39.000000       1.000000
                 25%            150.500000         2.000000       6.500000    69.000000       4.000000
                 50%            239.000000         2.000000      11.000000    89.000000       7.000000
                 75%            402.500000         2.000000      19.500000    99.000000      10.500000
                 max           1969.000000         6.000000      56.000000   350.000000      29.000000
Property type 3  count            4.000000         4.000000       4.000000     4.000000       4.000000
                 mean           184.750000         1.750000       8.750000    63.750000      13.750000
                 std             53.093471         0.500000       7.320064    16.520190       8.616844
                 min            113.000000         1.000000       1.000000    40.000000       5.000000
                 25%            170.000000         1.750000       3.250000    58.750000       7.250000
                 50%            192.500000         2.000000       9.500000    70.000000      13.500000
                 75%            207.250000         2.000000      15.000000    75.000000      20.000000
                 max            241.000000         2.000000      15.000000    75.000000      23.000000

What are the same statistics by property type per neighborhood?



In [52]:

    
listings_by_prop_type_then_neighborhood = listings.groupby(['prop_type', 'neighborhood'])
listings_by_prop_type_then_neighborhood.describe()









    Out[52]:






  
    
      
      
      
                                         description_length  person_capacity  picture_count        price  tenure_months
prop_type        neighborhood
Property type 1  Neighborhood 1   count            1.000000         1.000000       1.000000     1.000000       1.000000
                                  mean           209.000000         2.000000      26.000000    85.000000       6.000000
                                  std                   NaN              NaN            NaN          NaN            NaN
                                  min            209.000000         2.000000      26.000000    85.000000       6.000000
                                  25%            209.000000         2.000000      26.000000    85.000000       6.000000
                                  50%            209.000000         2.000000      26.000000    85.000000       6.000000
                                  75%            209.000000         2.000000      26.000000    85.000000       6.000000
                                  max            209.000000         2.000000      26.000000    85.000000       6.000000
                 Neighborhood 10  count            6.000000         6.000000       6.000000     6.000000       6.000000
                                  mean           391.000000         3.500000      13.333333   142.500000       3.833333
                                  std            146.929915         1.224745       8.571270    36.979724       1.602082
                                  min            160.000000         2.000000       4.000000    90.000000       1.000000
                                  25%            312.250000         2.500000       6.750000   135.000000       3.250000
                                  50%            425.500000         4.000000      12.000000   137.500000       4.500000
                                  75%            499.000000         4.000000      20.250000   147.500000       5.000000
                                  max            537.000000         5.000000      24.000000   205.000000       5.000000
                 Neighborhood 11  count           14.000000        14.000000      14.000000    14.000000      14.000000
                                  mean           379.000000         3.214286       9.928571   159.428571       9.642857
                                  std            396.956111         1.311404       5.928605    70.962302       5.212517
                                  min             82.000000         2.000000       5.000000    95.000000       1.000000
                                  25%            242.500000         2.000000       6.000000   103.750000       6.000000
                                  50%            295.500000         3.000000       8.000000   130.000000       9.500000
                                  75%            373.000000         3.750000       9.750000   196.500000      15.500000
                                  max           1719.000000         6.000000      23.000000   319.000000      16.000000
                 Neighborhood 12  count           39.000000        39.000000      39.000000    39.000000      39.000000
                                  mean           267.205128         3.435897      10.820513   365.615385       7.897436
                                  std            137.867820         1.874972       7.118810   686.086484       5.290482
                                  min             45.000000         1.000000       1.000000    60.000000       1.000000
                                  25%            182.000000         2.000000       6.000000   125.000000       3.000000
                                  50%            251.000000         3.000000       8.000000   150.000000       6.000000
...              ...              ...                   ...              ...            ...          ...            ...
Property type 3  Neighborhood 11  std                   NaN              NaN            NaN          NaN            NaN
                                  min            196.000000         2.000000      15.000000    75.000000       8.000000
                                  25%            196.000000         2.000000      15.000000    75.000000       8.000000
                                  50%            196.000000         2.000000      15.000000    75.000000       8.000000
                                  75%            196.000000         2.000000      15.000000    75.000000       8.000000
                                  max            196.000000         2.000000      15.000000    75.000000       8.000000
                 Neighborhood 14  count            1.000000         1.000000       1.000000     1.000000       1.000000
                                  mean           113.000000         1.000000       1.000000    75.000000       5.000000
                                  std                   NaN              NaN            NaN          NaN            NaN
                                  min            113.000000         1.000000       1.000000    75.000000       5.000000
                                  25%            113.000000         1.000000       1.000000    75.000000       5.000000
                                  50%            113.000000         1.000000       1.000000    75.000000       5.000000
                                  75%            113.000000         1.000000       1.000000    75.000000       5.000000
                                  max            113.000000         1.000000       1.000000    75.000000       5.000000
                 Neighborhood 17  count            1.000000         1.000000       1.000000     1.000000       1.000000
                                  mean           189.000000         2.000000      15.000000    65.000000      23.000000
                                  std                   NaN              NaN            NaN          NaN            NaN
                                  min            189.000000         2.000000      15.000000    65.000000      23.000000
                                  25%            189.000000         2.000000      15.000000    65.000000      23.000000
                                  50%            189.000000         2.000000      15.000000    65.000000      23.000000
                                  75%            189.000000         2.000000      15.000000    65.000000      23.000000
                                  max            189.000000         2.000000      15.000000    65.000000      23.000000
                 Neighborhood 4   count            1.000000         1.000000       1.000000     1.000000       1.000000
                                  mean           241.000000         2.000000       4.000000    40.000000      19.000000
                                  std                   NaN              NaN            NaN          NaN            NaN
                                  min            241.000000         2.000000       4.000000    40.000000      19.000000
                                  25%            241.000000         2.000000       4.000000    40.000000      19.000000
                                  50%            241.000000         2.000000       4.000000    40.000000      19.000000
                                  75%            241.000000         2.000000       4.000000    40.000000      19.000000
                                  max            241.000000         2.000000       4.000000    40.000000      19.000000

320 rows × 5 columns
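
`describe()` returns every summary statistic for every group, which is why the grouped output runs to 320 rows. If only the group means are wanted, calling `mean()` on the selected columns of the grouped object gives a much shorter table. This is a minimal sketch on a made-up frame (`listings_demo` and its values are illustrative, not taken from listings.csv):

```python
import pandas as pd

# Toy stand-in for `listings`; rows are invented for illustration.
listings_demo = pd.DataFrame({
    'prop_type': ['Property type 1', 'Property type 1', 'Property type 2'],
    'neighborhood': ['Neighborhood 1', 'Neighborhood 2', 'Neighborhood 1'],
    'price': [100, 200, 80],
    'person_capacity': [2, 4, 2],
})

cols = ['price', 'person_capacity']

# Means per property type (one row per type).
mean_by_type = listings_demo.groupby('prop_type')[cols].mean()

# Means per property type and neighborhood (one row per pair).
mean_by_type_and_hood = (listings_demo
                         .groupby(['prop_type', 'neighborhood'])[cols]
                         .mean())

print(mean_by_type)
print(mean_by_type_and_hood)
```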

Plot daily bookings:



In [59]:

    
bookings.booking_date = pd.to_datetime(bookings.booking_date)
bookings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 9 to 408
Data columns (total 1 columns):
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 94.9 KB
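
Once `booking_date` is a datetime column, one possible way to get the daily series is to count rows per date and plot the result. The sketch below assumes one row per booking and uses a tiny invented frame (`bookings_demo`) in place of the real data:

```python
import pandas as pd

# Toy stand-in for `bookings`: one row per booking, indexed by prop_id.
# Dates and ids are invented for illustration.
bookings_demo = pd.DataFrame({
    'prop_id': [1, 1, 2, 3],
    'booking_date': pd.to_datetime(
        ['2011-06-17', '2011-06-17', '2011-06-18', '2011-06-18']),
}).set_index('prop_id')

# Number of bookings made on each calendar day.
daily_bookings = bookings_demo.groupby('booking_date').size()

# Line plot with the date on the x-axis.
ax = daily_bookings.plot(title='Daily bookings')
ax.set_ylabel('number of bookings')
```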

Plot the daily bookings per neighborhood (provide a legend)



In [ ]:
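
The neighborhood lives in `listings`, not `bookings`, so a per-neighborhood plot needs the two frames combined first. A possible approach, sketched on invented data, is to join on the shared `prop_id` index, count bookings per date and neighborhood, and unstack so that each neighborhood becomes its own column (and therefore its own line in the plot):

```python
import pandas as pd

# Toy stand-ins for `listings` and `bookings`; values are invented.
listings_demo = pd.DataFrame({
    'prop_id': [1, 2, 3],
    'neighborhood': ['Neighborhood 1', 'Neighborhood 2', 'Neighborhood 1'],
}).set_index('prop_id')

bookings_demo = pd.DataFrame({
    'prop_id': [1, 2, 3, 3],
    'booking_date': pd.to_datetime(
        ['2011-06-17', '2011-06-17', '2011-06-18', '2011-06-18']),
}).set_index('prop_id')

# Attach each booking's neighborhood by joining on the prop_id index.
bookings_with_hood = bookings_demo.join(listings_demo['neighborhood'])

# Rows are dates, columns are neighborhoods, cells are booking counts.
daily_by_hood = (bookings_with_hood
                 .groupby(['booking_date', 'neighborhood'])
                 .size()
                 .unstack(fill_value=0))

# One line per neighborhood; the column names become the legend entries.
ax = daily_by_hood.plot(legend=True)
ax.set_ylabel('number of bookings')
```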

Part 2 - Develop a data set



In [ ]:

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [ ]:
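
A sketch of one way to build the two columns, assuming each row of `bookings` is a single booking and both frames are indexed by `prop_id`. Treating a property with no bookings as having 0 is an assumption made here, not something the assignment states. The frames below are invented stand-ins:

```python
import pandas as pd

# Toy stand-ins for `listings` and `bookings`; values are invented.
listings_demo = pd.DataFrame({
    'prop_id': [1, 2, 3],
    'tenure_months': [4, 10, 12],
}).set_index('prop_id')

bookings_demo = pd.DataFrame({
    'prop_id': [1, 1, 2],
    'booking_date': pd.to_datetime(
        ['2011-06-17', '2011-06-18', '2011-06-18']),
}).set_index('prop_id')

# Count bookings per property (the index level is named prop_id).
counts = bookings_demo.groupby(level='prop_id').size()

# Align to the listings index; properties with no bookings get 0 (assumption).
listings_demo['number_of_bookings'] = counts.reindex(
    listings_demo.index, fill_value=0)

listings_demo['booking_rate'] = (listings_demo['number_of_bookings']
                                 / listings_demo['tenure_months'])
print(listings_demo)
```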

We only want to analyze well-established properties, so let's filter out any properties that have a tenure of less than 10 months.



In [ ]:
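
Filtering out tenures below 10 months means keeping rows where `tenure_months >= 10`. A minimal sketch with a boolean mask, on an invented frame:

```python
import pandas as pd

# Toy stand-in for `listings`; values are invented.
listings_demo = pd.DataFrame({
    'prop_id': [1, 2, 3],
    'tenure_months': [4, 10, 25],
}).set_index('prop_id')

# Keep only properties listed for at least 10 months.
established = listings_demo[listings_demo['tenure_months'] >= 10]
print(established)
```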

`prop_type` and `neighborhood` are categorical variables. Use `get_dummies()` (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. After applying this function correctly, there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [ ]:
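
As an illustration of the expected shape, the sketch below runs `pd.get_dummies` on an invented frame with two property types and three neighborhoods, then joins the indicator columns back onto the original frame. The `<column>_<value>` naming is pandas' default for a DataFrame input:

```python
import pandas as pd

# Toy stand-in for `listings`; values are invented.
listings_demo = pd.DataFrame({
    'prop_type': ['Property type 1', 'Property type 2', 'Property type 1'],
    'neighborhood': ['Neighborhood 1', 'Neighborhood 2', 'Neighborhood 3'],
    'price': [100, 80, 150],
})

# One indicator column per distinct prop_type and per distinct neighborhood.
dummies = pd.get_dummies(listings_demo[['prop_type', 'neighborhood']])

# Keep the original columns and add the indicators alongside them.
listings_with_dummies = listings_demo.join(dummies)
print(listings_with_dummies.columns.tolist())
```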

Create test and training sets for your regressors and response.

The response (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [ ]:

    
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split



In [ ]:
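
A sketch of the split under a few assumptions: the frame is invented, `prop_id` is the index (so it is not a column that needs dropping), and the 70/30 split and `random_state=0` are arbitrary choices for reproducibility. In recent scikit-learn releases `train_test_split` is imported from `sklearn.model_selection`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared `listings` frame; values are invented.
listings_demo = pd.DataFrame({
    'price': [100, 80, 150, 120, 90, 200, 60, 110, 95, 130],
    'person_capacity': [2, 2, 4, 3, 2, 6, 1, 3, 2, 4],
    'number_of_bookings': [5, 2, 9, 4, 3, 12, 1, 6, 2, 7],
    'booking_rate': [0.5, 0.2, 0.9, 0.4, 0.3, 1.2, 0.1, 0.6, 0.2, 0.7],
    'prop_type': ['Property type 1'] * 5 + ['Property type 2'] * 5,
    'neighborhood': ['Neighborhood 1', 'Neighborhood 2'] * 5,
})

# Response: the column to model.
y = listings_demo['booking_rate']

# Regressors: everything except the response and the excluded columns.
X = listings_demo.drop(
    ['booking_rate', 'number_of_bookings', 'prop_type', 'neighborhood'],
    axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)
```

In the actual homework, the dummy columns produced by `get_dummies()` would stay in `X`; they are left out of this toy frame only to keep it short.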

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [ ]:

    
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Fit your model with your training sets



In [ ]:

Report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score



In [ ]:
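
For reference, `LinearRegression.score(X, y)` returns the coefficient of determination R^2 of the predictions on the data passed in. The sketch below fits on a training split and scores on the held-out split. The data are synthetic and noise-free (an assumption chosen so the outcome is predictable), which is why R^2 comes out at essentially 1; real listings data will score lower:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y is an exact linear function of two features.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(40, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)              # estimate coefficients on training data

r2_test = lr.score(X_test, y_test)    # R^2 on data the model has not seen
print(lr.coef_, lr.intercept_, r2_test)
```

Scoring on the test split rather than the training split gives a more honest picture of how the model generalizes.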

Interpret the results of the above model:

What does the score method do?
What does this tell us about our model?

...type here...

Optional - Iterate

Create an alternative response variable (e.g. monthly revenue) and model it using the same pattern as in Part 3.



In [ ]:
