Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.



In [2]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

pd.set_option('display.max_rows', 10)

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [3]:

    
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')
#bookings[bookings.prop_id==1]

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [4]:

    
listings[['price','person_capacity','picture_count','description_length','tenure_months']].mean()









    Out[4]:





price                 187.806373
person_capacity         2.997549
picture_count          14.389706
description_length    309.159314
tenure_months           8.487745
dtype: float64



In [5]:

    
listings[['price','person_capacity','picture_count','description_length','tenure_months']].median()









    Out[5]:





price                 125
person_capacity         2
picture_count          12
description_length    250
tenure_months           7
dtype: float64



In [6]:

    
listings[['price','person_capacity','picture_count','description_length','tenure_months']].std()









    Out[6]:





price                 353.050858
person_capacity         1.594676
picture_count          10.477428
description_length    228.021684
tenure_months           5.872088
dtype: float64

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [7]:

    
listings.groupby('prop_type')['price','person_capacity','picture_count','description_length','tenure_months'].agg(['mean'])









    Out[7]:






  
    
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      
      mean
      mean
      mean
      mean
      mean
    
    
      prop_type
      
      
      
      
      
    
  
  
    
      Property type 1
       237.085502
       3.516729
       14.695167
       313.171004
        8.464684
    
    
      Property type 2
        93.288889
       2.000000
       13.948148
       304.851852
        8.377778
    
    
      Property type 3
        63.750000
       1.750000
        8.750000
       184.750000
       13.750000

Same, but by property type per neighborhood?



In [8]:

    
listings.groupby(['prop_type','neighborhood'])['price','person_capacity','picture_count','description_length','tenure_months'].agg(['mean'])









    Out[8]:






  
    
      
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      
      
      mean
      mean
      mean
      mean
      mean
    
    
      prop_type
      neighborhood
      
      
      
      
      
    
  
  
    
      Property type 1
      Neighborhood 1
        85.000000
       2.000000
       26.000000
       209.000000
        6.000000
    
    
      Neighborhood 10
       142.500000
       3.500000
       13.333333
       391.000000
        3.833333
    
    
      Neighborhood 11
       159.428571
       3.214286
        9.928571
       379.000000
        9.642857
    
    
      Neighborhood 12
       365.615385
       3.435897
       10.820513
       267.205128
        7.897436
    
    
      Neighborhood 13
       241.897959
       4.061224
       15.653061
       290.408163
        9.122449
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      Property type 2
      Neighborhood 9
       110.000000
       2.000000
        3.500000
       114.500000
        9.000000
    
    
      Property type 3
      Neighborhood 11
        75.000000
       2.000000
       15.000000
       196.000000
        8.000000
    
    
      Neighborhood 14
        75.000000
       1.000000
        1.000000
       113.000000
        5.000000
    
    
      Neighborhood 17
        65.000000
       2.000000
       15.000000
       189.000000
       23.000000
    
    
      Neighborhood 4
        40.000000
       2.000000
        4.000000
       241.000000
       19.000000
    
  

40 rows × 5 columns

Plot daily bookings:



In [9]:

    
bookings.booking_date.value_counts().sort_index().plot()









    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b4d8590>

Plot the daily bookings per neighborhood (provide a legend)



In [10]:

    
booking_all = pd.merge(listings, bookings, how='outer', left_on='prop_id', right_on='prop_id')
booking_all.groupby(['booking_date','neighborhood']).count().sort_index().plot()









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x10c030710>

Part 2 - Develop a data set

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [11]:

    
number_of_bookings = bookings.groupby('prop_id').count().reset_index()
number_of_bookings.rename(columns={'booking_date':'number_of_bookings'}, inplace = True)
listings2 = pd.merge(listings, number_of_bookings, how='left', on='prop_id')
listings2.number_of_bookings.fillna(0, inplace=True)
# Alternative way: listings2['number_of_bookings'] = listings2.number_of_bookings.fillna(0)
listings2['booking_rate'] = listings2.number_of_bookings/listings2.tenure_months
listings2









    Out[11]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      0  
         1
       Property type 1
       Neighborhood 14
       140
       3
       11
        232
       30
        4
        0.133333
    
    
      1  
         2
       Property type 1
       Neighborhood 14
        95
       2
        3
         37
       29
        0
        0.000000
    
    
      2  
         3
       Property type 2
       Neighborhood 16
        95
       2
       16
        172
       29
        1
        0.034483
    
    
      3  
         4
       Property type 2
       Neighborhood 13
        90
       2
       19
        472
       28
       27
        0.964286
    
    
      4  
         5
       Property type 1
       Neighborhood 15
       125
       5
       21
        442
       28
        0
        0.000000
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      403
       404
       Property type 2
       Neighborhood 14
       100
       1
        8
        235
        1
        3
        3.000000
    
    
      404
       405
       Property type 2
       Neighborhood 13
        85
       2
       27
       1048
        1
       19
       19.000000
    
    
      405
       406
       Property type 1
        Neighborhood 9
        70
       3
       18
        153
        1
       19
       19.000000
    
    
      406
       407
       Property type 1
       Neighborhood 13
       129
       2
       13
        370
        1
       15
       15.000000
    
    
      407
       408
       Property type 1
       Neighborhood 14
       100
       3
       21
        707
        1
       54
       54.000000
    
  

408 rows × 10 columns

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months



In [12]:

    
listings2[listings2.tenure_months > 10]









    Out[12]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      0  
         1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
        4
       0.133333
    
    
      1  
         2
       Property type 1
       Neighborhood 14
        95
       2
        3
        37
       29
        0
       0.000000
    
    
      2  
         3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
        1
       0.034483
    
    
      3  
         4
       Property type 2
       Neighborhood 13
        90
       2
       19
       472
       28
       27
       0.964286
    
    
      4  
         5
       Property type 1
       Neighborhood 15
       125
       5
       21
       442
       28
        0
       0.000000
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      115
       116
       Property type 1
       Neighborhood 17
       125
       2
        9
        89
       11
        0
       0.000000
    
    
      116
       117
       Property type 2
       Neighborhood 14
        49
       2
       14
       417
       11
        8
       0.727273
    
    
      117
       118
       Property type 2
        Neighborhood 4
        60
       2
       10
        95
       11
       11
       1.000000
    
    
      118
       119
       Property type 2
       Neighborhood 12
        55
       2
        8
       333
       11
        1
       0.090909
    
    
      119
       120
       Property type 2
        Neighborhood 9
       100
       1
        4
       212
       11
        0
       0.000000
    
  

120 rows × 10 columns

`prop_type` and `neighborhood` are categorical variables, use `get_dummies()` (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [13]:

    
listings2.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 10 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
number_of_bookings    408 non-null float64
booking_rate          408 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 35.1+ KB



In [14]:

    
pd.get_dummies(listings2[['prop_type', 'neighborhood', 'tenure_months']])









    Out[14]:






  
    
      
      tenure_months
      prop_type_Property type 1
      prop_type_Property type 2
      prop_type_Property type 3
      neighborhood_Neighborhood 1
      neighborhood_Neighborhood 10
      neighborhood_Neighborhood 11
      neighborhood_Neighborhood 12
      neighborhood_Neighborhood 13
      neighborhood_Neighborhood 14
      ...
      neighborhood_Neighborhood 20
      neighborhood_Neighborhood 21
      neighborhood_Neighborhood 22
      neighborhood_Neighborhood 3
      neighborhood_Neighborhood 4
      neighborhood_Neighborhood 5
      neighborhood_Neighborhood 6
      neighborhood_Neighborhood 7
      neighborhood_Neighborhood 8
      neighborhood_Neighborhood 9
    
  
  
    
      0  
       30
       1
       0
       0
       0
       0
       0
       0
       0
       1
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      1  
       29
       1
       0
       0
       0
       0
       0
       0
       0
       1
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      2  
       29
       0
       1
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      3  
       28
       0
       1
       0
       0
       0
       0
       0
       1
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      4  
       28
       1
       0
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      403
        1
       0
       1
       0
       0
       0
       0
       0
       0
       1
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      404
        1
       0
       1
       0
       0
       0
       0
       0
       1
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      405
        1
       1
       0
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       1
    
    
      406
        1
       1
       0
       0
       0
       0
       0
       0
       1
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      407
        1
       1
       0
       0
       0
       0
       0
       0
       0
       1
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
  

408 rows × 26 columns



In [15]:

    
#Best way
features_cols = [col for col in listings2.columns if col not in ['prop_id', 'booking_rate', 'number_of_bookings']]
features = pd.get_dummies(listings2[features_cols])



In [16]:

    
#Other way
#prop_type_dummies = pd.get_dummies(listings2.prop_type)
#neighborhood_dummies = pd.get_dummies(listings2.neighborhood)
#features1 = listings2.join(prop_type_dummies, on='prop_id')
#features1 = listings2.join(neighborhood_dummies, on='prop_id')
#features1

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [17]:

    
from sklearn.cross_validation import train_test_split
y = listings2.booking_rate
X = features
X_train, X_test, y_train, y_test = train_test_split(X , y, test_size=0.2, random_state = 42)

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [18]:

    
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Below we Train our linear Regression
lr.fit(X_train, y_train)









    Out[18]:





LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

fit your model with your test sets



In [19]:

    
lr.score(X_test, y_test)









    Out[19]:





0.13602860098044356

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score

Interpret the results of the above model:

What does the score method do?
What does this tell us about our model?

It returns the Linear Regression score and in this case it tells us that there is a low positive correlation between the booking_rate and the columns in the features table.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to



In [20]:

    
month_rev = listings2['booking_rate']*listings2['price']
month_rev
alt_pred = month_rev
regres = features
A_train, A_test, b_train, b_test = train_test_split(regres , alt_pred, test_size=0.2, random_state = 42)
lr2 = LinearRegression()
lr2.fit(A_train, b_train)
lr2.score(A_test, b_test)









    Out[20]:





0.058306396109671699

	price	person_capacity	picture_count	description_length	tenure_months
	mean	mean	mean	mean	mean
prop_type
Property type 1	237.085502	3.516729	14.695167	313.171004	8.464684
Property type 2	93.288889	2.000000	13.948148	304.851852	8.377778
Property type 3	63.750000	1.750000	8.750000	184.750000	13.750000

	prop_id	prop_type	neighborhood	price	person_capacity	picture_count	description_length	tenure_months	number_of_bookings	booking_rate
0	1	Property type 1	Neighborhood 14	140	3	11	232	30	4	0.133333
1	2	Property type 1	Neighborhood 14	95	2	3	37	29	0	0.000000
2	3	Property type 2	Neighborhood 16	95	2	16	172	29	1	0.034483
3	4	Property type 2	Neighborhood 13	90	2	19	472	28	27	0.964286
4	5	Property type 1	Neighborhood 15	125	5	21	442	28	0	0.000000
...	...	...	...	...	...	...	...	...	...	...
403	404	Property type 2	Neighborhood 14	100	1	8	235	1	3	3.000000
404	405	Property type 2	Neighborhood 13	85	2	27	1048	1	19	19.000000
405	406	Property type 1	Neighborhood 9	70	3	18	153	1	19	19.000000
406	407	Property type 1	Neighborhood 13	129	2	13	370	1	15	15.000000
407	408	Property type 1	Neighborhood 14	100	3	21	707	1	54	54.000000

	tenure_months	prop_type_Property type 1	prop_type_Property type 2	prop_type_Property type 3	neighborhood_Neighborhood 1	neighborhood_Neighborhood 10	neighborhood_Neighborhood 11	neighborhood_Neighborhood 12	neighborhood_Neighborhood 13	neighborhood_Neighborhood 14	...	neighborhood_Neighborhood 20	neighborhood_Neighborhood 21	neighborhood_Neighborhood 22	neighborhood_Neighborhood 3	neighborhood_Neighborhood 4	neighborhood_Neighborhood 5	neighborhood_Neighborhood 6	neighborhood_Neighborhood 7	neighborhood_Neighborhood 8	neighborhood_Neighborhood 9
0	30	1	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0
1	29	1	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0
2	29	0	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	28	0	1	0	0	0	0	0	1	0	...	0	0	0	0	0	0	0	0	0	0
4	28	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
403	1	0	1	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0
404	1	0	1	0	0	0	0	0	1	0	...	0	0	0	0	0	0	0	0	0	0
405	1	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	1
406	1	1	0	0	0	0	0	0	1	0	...	0	0	0	0	0	0	0	0	0	0
407	1	1	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0

	tenure_months	prop_type_Property type 1	prop_type_Property type 2	prop_type_Property type 3	neighborhood_Neighborhood 1	neighborhood_Neighborhood 10	neighborhood_Neighborhood 11	neighborhood_Neighborhood 12	neighborhood_Neighborhood 13	neighborhood_Neighborhood 14	...	neighborhood_Neighborhood 20	neighborhood_Neighborhood 21	neighborhood_Neighborhood 22	neighborhood_Neighborhood 3	neighborhood_Neighborhood 4	neighborhood_Neighborhood 5	neighborhood_Neighborhood 6	neighborhood_Neighborhood 7	neighborhood_Neighborhood 8	neighborhood_Neighborhood 9
0	30	1	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0
1	29	1	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0
2	29	0	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	28	0	1	0	0	0	0	0	1	0	...	0	0	0	0	0	0	0	0	0	0
4	28	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
403	1	0	1	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0
404	1	0	1	0	0	0	0	0	1	0	...	0	0	0	0	0	0	0	0	0	0
405	1	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	1
406	1	1	0	0	0	0	0	0	1	0	...	0	0	0	0	0	0	0	0	0	0
407	1	1	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0