Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.



In [58]:

    
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [59]:

    
# Create a data frame from the listings dataset

listings = pd.read_csv('../data/listings.csv')

# Create a data frame from the bookings dataset

bookings = pd.read_csv('../data/bookings.csv')

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [82]:

    
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()









    Out[82]:






  
    
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
  
  
    
      count
        408.000000
       408.000000
       408.000000
        408.000000
       408.000000
    
    
      mean
        187.806373
         2.997549
        14.389706
        309.159314
         8.487745
    
    
      std
        353.050858
         1.594676
        10.477428
        228.021684
         5.872088
    
    
      min
         39.000000
         1.000000
         1.000000
          0.000000
         1.000000
    
    
      25%
         90.000000
         2.000000
         6.000000
        179.000000
         4.000000
    
    
      50%
        125.000000
         2.000000
        12.000000
        250.000000
         7.000000
    
    
      75%
        199.000000
         4.000000
        20.000000
        389.500000
        13.000000
    
    
      max
       5000.000000
        10.000000
        71.000000
       1969.000000
        30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [61]:

    
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB



In [62]:

    
listings.groupby(['prop_type'])['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'].agg(['mean'])









    Out[62]:






  
    
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      
      mean
      mean
      mean
      mean
      mean
    
    
      prop_type
      
      
      
      
      
    
  
  
    
      Property type 1
       237.085502
       3.516729
       14.695167
       313.171004
        8.464684
    
    
      Property type 2
        93.288889
       2.000000
       13.948148
       304.851852
        8.377778
    
    
      Property type 3
        63.750000
       1.750000
        8.750000
       184.750000
       13.750000

Same, but by property type per neighborhood?



In [63]:

    
listings.groupby(['prop_type', 'neighborhood'])['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'].agg(['mean'])









    Out[63]:






  
    
      
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      
      
      mean
      mean
      mean
      mean
      mean
    
    
      prop_type
      neighborhood
      
      
      
      
      
    
  
  
    
      Property type 1
      Neighborhood 1
        85.000000
       2.000000
       26.000000
       209.000000
        6.000000
    
    
      Neighborhood 10
       142.500000
       3.500000
       13.333333
       391.000000
        3.833333
    
    
      Neighborhood 11
       159.428571
       3.214286
        9.928571
       379.000000
        9.642857
    
    
      Neighborhood 12
       365.615385
       3.435897
       10.820513
       267.205128
        7.897436
    
    
      Neighborhood 13
       241.897959
       4.061224
       15.653061
       290.408163
        9.122449
    
    
      Neighborhood 14
       164.676471
       3.205882
       14.764706
       317.205882
        8.441176
    
    
      Neighborhood 15
       178.880000
       3.720000
       14.320000
       321.760000
        9.320000
    
    
      Neighborhood 16
       158.928571
       2.928571
       21.642857
       310.714286
        7.071429
    
    
      Neighborhood 17
       189.869565
       3.521739
       16.086957
       317.347826
        9.869565
    
    
      Neighborhood 18
       173.590909
       2.954545
       16.090909
       369.227273
        8.227273
    
    
      Neighborhood 19
       222.375000
       3.625000
       11.000000
       254.500000
        6.500000
    
    
      Neighborhood 2
       250.000000
       6.000000
        8.000000
       423.000000
        6.000000
    
    
      Neighborhood 20
       804.333333
       2.777778
        9.444444
       223.555556
        9.666667
    
    
      Neighborhood 21
       362.500000
       4.250000
       49.000000
       306.250000
       14.750000
    
    
      Neighborhood 22
       225.000000
       3.000000
       19.000000
       500.000000
        9.000000
    
    
      Neighborhood 5
       194.500000
       2.500000
        8.500000
       266.500000
       11.500000
    
    
      Neighborhood 6
       146.000000
       3.333333
       12.666667
       290.666667
        4.000000
    
    
      Neighborhood 7
       161.000000
       3.666667
       14.333333
       343.000000
        5.333333
    
    
      Neighborhood 8
       174.750000
       5.000000
       11.000000
       300.000000
        6.750000
    
    
      Neighborhood 9
       151.142857
       4.285714
       13.428571
       471.428571
        5.714286
    
    
      Property type 2
      Neighborhood 10
       137.500000
       2.000000
       20.000000
       126.000000
        3.500000
    
    
      Neighborhood 11
        78.750000
       2.000000
       16.750000
       161.250000
       11.250000
    
    
      Neighborhood 12
        96.894737
       1.947368
       10.473684
       244.526316
        9.842105
    
    
      Neighborhood 13
        81.130435
       1.826087
       16.695652
       418.565217
        9.739130
    
    
      Neighborhood 14
        83.809524
       1.857143
       15.904762
       348.619048
        8.714286
    
    
      Neighborhood 15
        95.000000
       2.266667
       11.733333
       301.733333
        8.200000
    
    
      Neighborhood 16
        83.625000
       2.062500
       15.375000
       246.250000
        6.687500
    
    
      Neighborhood 17
       102.454545
       2.000000
       15.454545
       308.272727
        7.181818
    
    
      Neighborhood 18
       120.666667
       2.222222
       12.333333
       297.777778
        9.222222
    
    
      Neighborhood 19
        88.875000
       2.000000
       15.125000
       383.375000
        5.500000
    
    
      Neighborhood 20
        60.000000
       1.000000
        3.000000
       101.000000
        6.000000
    
    
      Neighborhood 3
        60.000000
       2.000000
        7.000000
       264.000000
        9.000000
    
    
      Neighborhood 4
        60.000000
       2.000000
       10.000000
        95.000000
       11.000000
    
    
      Neighborhood 7
       100.000000
       2.000000
        3.000000
       148.000000
        2.000000
    
    
      Neighborhood 8
       350.000000
       4.000000
        5.000000
       223.000000
        3.000000
    
    
      Neighborhood 9
       110.000000
       2.000000
        3.500000
       114.500000
        9.000000
    
    
      Property type 3
      Neighborhood 11
        75.000000
       2.000000
       15.000000
       196.000000
        8.000000
    
    
      Neighborhood 14
        75.000000
       1.000000
        1.000000
       113.000000
        5.000000
    
    
      Neighborhood 17
        65.000000
       2.000000
       15.000000
       189.000000
       23.000000
    
    
      Neighborhood 4
        40.000000
       2.000000
        4.000000
       241.000000
       19.000000

Plot daily bookings:



In [64]:

    
# Generate a count of bookings by date
cross_tab = bookings.groupby(['booking_date'])['prop_id'].agg(['count']).unstack(0)

# Plot the table
cross_tab.plot()









    Out[64]:





<matplotlib.axes._subplots.AxesSubplot at 0x10bc97b90>

Plot the daily bookings per neighborhood (provide a legend)



In [65]:

    
# Create a smaller dataset with just the prop_id and neighborhood
neighborhoods = listings[['prop_id', 'neighborhood']]

# Merge bookings_by_prop into the listings data
bookings_neighborhoods = bookings.merge(neighborhoods, on='prop_id', how='left')

# Generate a count of bookings by date and neighborhood
cross_tab = bookings_neighborhoods.groupby(['neighborhood','booking_date'])['prop_id'].agg(['count']).unstack(0)

# Plot the table
cross_tab.plot()









    Out[65]:





<matplotlib.axes._subplots.AxesSubplot at 0x10bd99750>

Part 2 - Develop a data set



In [66]:

    
# Count the number of bookings by property
bookings_by_prop = bookings.groupby('prop_id')[['prop_id']].count()
bookings_by_prop

# Rename columns and reset the index
bookings_by_prop.rename(columns={'prop_id': 'number_of_bookings'}, inplace=True)
bookings_by_prop = bookings_by_prop.reset_index()

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [67]:

    
# Merge bookings_by_prop into the listings data
listings = listings.merge(bookings_by_prop, on='prop_id', how='left')

# Replace null values with 0's
listings.number_of_bookings.replace(np.nan, 0, inplace = True)

# Divide number_of_bookings by tenure_months
listings['booking_rate'] = listings.number_of_bookings / listings.tenure_months

# Describe the resulting dataset
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 10 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
number_of_bookings    408 non-null float64
booking_rate          408 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 35.1+ KB

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months



In [68]:

    
listings_filtered = listings[listings.tenure_months > 10]
listings_filtered.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 120 entries, 0 to 119
Data columns (total 10 columns):
prop_id               120 non-null int64
prop_type             120 non-null object
neighborhood          120 non-null object
price                 120 non-null int64
person_capacity       120 non-null int64
picture_count         120 non-null int64
description_length    120 non-null int64
tenure_months         120 non-null int64
number_of_bookings    120 non-null float64
booking_rate          120 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 10.3+ KB

`prop_type` and `neighborhood` are categorical variables, use `get_dummies()` (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [69]:

    
listings_filtered = pd.get_dummies(listings_filtered)

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [70]:

    
from sklearn.cross_validation import train_test_split



In [71]:

    
# Create a reduced data from of features from the original data frame
cols = list(listings_filtered.columns)
cols.remove('prop_id')
cols.remove('booking_rate')
cols.remove('number_of_bookings')
iv = listings_filtered[cols]

# Create a data frame with the target variable
dv = listings_filtered['booking_rate'].values

# Split those two data frames into training and test datasets
iv_train, iv_test, dv_train, dv_test = train_test_split(iv, dv, test_size=0.2)

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [72]:

    
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

fit your model with your test sets



In [73]:

    
lr.fit(iv_train, dv_train)









    Out[73]:





LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score



In [74]:

    
lr.score(iv_train, dv_train)









    Out[74]:





0.31484836732897703

Interpret the results of the above model:

What does the score method do?
What does this tell us about our model?



In [75]:

    
# The score tells us the amount of variance in the outcome variable that is explained by the predictor variables.
# A score of 0.41 tells us that the model explains about 40% of the variance in the outcome variable.

...type here...

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to



In [76]:

    
# Create a reduced data from of features from the original data frame
listings_filtered['monthly_revenue'] = listings_filtered.price * listings_filtered.booking_rate

# Create a reduced data from of features from the original data frame
cols = list(listings_filtered.columns)
cols.remove('prop_id')
cols.remove('booking_rate')
cols.remove('number_of_bookings')
iv = listings_filtered[cols]

# Create a data frame with the target variable
dv = listings_filtered['booking_rate'].values

# Split those two data frames into training and test datasets
iv_train, iv_test, dv_train, dv_test = train_test_split(iv, dv, test_size=0.2)

# Fit model with the test sets
lr.fit(iv_train, dv_train)

# Report the score
lr.score(iv_train, dv_train)









    Out[76]:





0.92979467316975706



In [57]:



In [ ]:

	price	person_capacity	picture_count	description_length	tenure_months
count	408.000000	408.000000	408.000000	408.000000	408.000000
mean	187.806373	2.997549	14.389706	309.159314	8.487745
std	353.050858	1.594676	10.477428	228.021684	5.872088
min	39.000000	1.000000	1.000000	0.000000	1.000000
25%	90.000000	2.000000	6.000000	179.000000	4.000000
50%	125.000000	2.000000	12.000000	250.000000	7.000000
75%	199.000000	4.000000	20.000000	389.500000	13.000000
max	5000.000000	10.000000	71.000000	1969.000000	30.000000