Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

pd.set_option('display.max_rows', 20)
pd.set_option('display.precision', 4)

%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames, listings and bookings, from their respective data files.


In [2]:
bookings = pd.read_csv('../data/bookings.csv', parse_dates = ['booking_date'])
listings = pd.read_csv('../data/listings.csv')

What are the mean, median, and standard deviation of price, person capacity, picture count, description length, and tenure of the properties?


In [3]:
print('Mean of the variables of interest')
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Mean of the variables of interest
Out[3]:
price                 187.806
person_capacity         2.998
picture_count          14.390
description_length    309.159
tenure_months           8.488
dtype: float64

In [4]:
print('Standard deviation of the variables of interest')
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].std()


Standard deviation of the variables of interest
Out[4]:
price                 353.051
person_capacity         1.595
picture_count          10.477
description_length    228.022
tenure_months           5.872
dtype: float64

In [5]:
print('Median of the variables of interest')
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].median()


Median of the variables of interest
Out[5]:
price                 125.000
person_capacity         2.000
picture_count          12.000
description_length    250.000
tenure_months           7.000
dtype: float64

What are the mean price, person capacity, picture count, description length, and tenure of the properties grouped by property type?


In [6]:
print('Mean price, person capacity, picture count, description length, and tenure by property type')
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Mean price, person capacity, picture count, description length, and tenure by property type
Out[6]:
                   price  person_capacity  picture_count  description_length  tenure_months
prop_type
Property type 1  237.086            3.517         14.695             313.171          8.465
Property type 2   93.289            2.000         13.948             304.852          8.378
Property type 3   63.750            1.750          8.750             184.750         13.750
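The groupby pattern above can be sketched on toy data (hypothetical values, not the real listings):

```python
import pandas as pd

# Toy sketch of the groupby-then-mean pattern used above: rows are
# grouped by the categorical column, and mean() is taken per group.
df = pd.DataFrame({'prop_type': ['A', 'A', 'B'],
                   'price': [100.0, 200.0, 60.0]})
means = df.groupby('prop_type')['price'].mean()
print(means.tolist())  # [150.0, 60.0]
```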

Same, but by property type per neighborhood?


In [7]:
print('Property type 1: Mean price, person capacity, picture count, description length, and tenure by neighborhood')
listings[listings.prop_type == 'Property type 1'].groupby('neighborhood')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean().sort_index()


Property type 1: Mean price, person capacity, picture count, description length, and tenure by neighborhood
Out[7]:
                   price  person_capacity  picture_count  description_length  tenure_months
neighborhood
Neighborhood 1    85.000            2.000         26.000             209.000          6.000
Neighborhood 10  142.500            3.500         13.333             391.000          3.833
Neighborhood 11  159.429            3.214          9.929             379.000          9.643
Neighborhood 12  365.615            3.436         10.821             267.205          7.897
Neighborhood 13  241.898            4.061         15.653             290.408          9.122
Neighborhood 14  164.676            3.206         14.765             317.206          8.441
Neighborhood 15  178.880            3.720         14.320             321.760          9.320
Neighborhood 16  158.929            2.929         21.643             310.714          7.071
Neighborhood 17  189.870            3.522         16.087             317.348          9.870
Neighborhood 18  173.591            2.955         16.091             369.227          8.227
Neighborhood 19  222.375            3.625         11.000             254.500          6.500
Neighborhood 2   250.000            6.000          8.000             423.000          6.000
Neighborhood 20  804.333            2.778          9.444             223.556          9.667
Neighborhood 21  362.500            4.250         49.000             306.250         14.750
Neighborhood 22  225.000            3.000         19.000             500.000          9.000
Neighborhood 5   194.500            2.500          8.500             266.500         11.500
Neighborhood 6   146.000            3.333         12.667             290.667          4.000
Neighborhood 7   161.000            3.667         14.333             343.000          5.333
Neighborhood 8   174.750            5.000         11.000             300.000          6.750
Neighborhood 9   151.143            4.286         13.429             471.429          5.714

In [8]:
print('Property type 2: Mean price, person capacity, picture count, description length, and tenure by neighborhood')
listings[listings.prop_type == 'Property type 2'].groupby('neighborhood')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean().sort_index()


Property type 2: Mean price, person capacity, picture count, description length, and tenure by neighborhood
Out[8]:
                   price  person_capacity  picture_count  description_length  tenure_months
neighborhood
Neighborhood 10  137.500            2.000         20.000             126.000          3.500
Neighborhood 11   78.750            2.000         16.750             161.250         11.250
Neighborhood 12   96.895            1.947         10.474             244.526          9.842
Neighborhood 13   81.130            1.826         16.696             418.565          9.739
Neighborhood 14   83.810            1.857         15.905             348.619          8.714
Neighborhood 15   95.000            2.267         11.733             301.733          8.200
Neighborhood 16   83.625            2.062         15.375             246.250          6.688
Neighborhood 17  102.455            2.000         15.455             308.273          7.182
Neighborhood 18  120.667            2.222         12.333             297.778          9.222
Neighborhood 19   88.875            2.000         15.125             383.375          5.500
Neighborhood 20   60.000            1.000          3.000             101.000          6.000
Neighborhood 3    60.000            2.000          7.000             264.000          9.000
Neighborhood 4    60.000            2.000         10.000              95.000         11.000
Neighborhood 7   100.000            2.000          3.000             148.000          2.000
Neighborhood 8   350.000            4.000          5.000             223.000          3.000
Neighborhood 9   110.000            2.000          3.500             114.500          9.000

In [9]:
print('Property type 3: Mean price, person capacity, picture count, description length, and tenure by neighborhood')
listings[listings.prop_type == 'Property type 3'].groupby('neighborhood')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Property type 3: Mean price, person capacity, picture count, description length, and tenure by neighborhood
Out[9]:
                   price  person_capacity  picture_count  description_length  tenure_months
neighborhood
Neighborhood 11       75                2             15                 196              8
Neighborhood 14       75                1              1                 113              5
Neighborhood 17       65                2             15                 189             23
Neighborhood 4        40                2              4                 241             19

Plot daily bookings:


In [10]:
plot1 = bookings.booking_date.value_counts().sort_index().plot(figsize = (25, 10))
plot1.set_title('Daily bookings')
plot1.set_xlabel('Date')
plot1.set_ylabel('Number of bookings')


Out[10]:
<matplotlib.text.Text at 0x108fd7810>

Plot the daily bookings per neighborhood (provide a legend)


In [11]:
lb = listings.merge(bookings)

In [12]:
lb['day_of_week'] = lb.booking_date.dt.dayofweek

In [13]:
DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
             'Saturday', 'Sunday']

def day_word(number):
    # Map pandas dayofweek codes (0 = Monday) to weekday names.
    return DAY_NAMES[number]

In [14]:
lb['day'] = lb['day_of_week'].map(day_word)
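As an aside, recent pandas versions can skip the helper entirely: datetime columns expose weekday names directly through the `.dt.day_name()` accessor. A minimal sketch on hypothetical dates:

```python
import pandas as pd

# .dt.day_name() returns the weekday name for each timestamp,
# with no lookup table needed.
dates = pd.Series(pd.to_datetime(['2014-01-06', '2014-01-07', '2014-01-12']))
print(dates.dt.day_name().tolist())  # ['Monday', 'Tuesday', 'Sunday']
```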

In [15]:
# Warning: Super unpretty graph to follow. :( But I was told not to worry. :)
lb.groupby(['neighborhood', 'booking_date'])['prop_id'].count().unstack(0).plot(figsize = (20, 10))
plt.title('Daily bookings by neighborhood')
plt.xlabel('Date')
plt.ylabel('Number of bookings')
plt.ylim([0, 14]);



In [16]:
# Here's a prettier plot. But it uses day of the week on x axis.
# I could not figure out how to re-order x axis labels ... 
lb.groupby(['neighborhood', 'day'])['prop_id'].count().unstack(0).plot(figsize = (20, 10))
plt.title('Daily bookings by neighborhood')
plt.xlabel('Day of the week')
plt.ylabel('Number of bookings')
plt.ylim([0, 35]);
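On re-ordering the x-axis labels: one way (a sketch on toy data) is to make the day column an ordered categorical before grouping, so groupby emits the categories in calendar order rather than alphabetically:

```python
import pandas as pd

# Declare the calendar order, then group; observed=False keeps all
# seven categories in the declared order, even days with no bookings.
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
         'Saturday', 'Sunday']
df = pd.DataFrame({'day': ['Sunday', 'Friday', 'Monday', 'Friday'],
                   'prop_id': [1, 2, 3, 4]})
df['day'] = pd.Categorical(df['day'], categories=order, ordered=True)
counts = df.groupby('day', observed=False)['prop_id'].count()
print(counts.index.tolist())  # calendar order, Monday first
```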


Part 2 - Develop a data set

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [17]:
# Not all listings had bookings.
# Count the bookings per property: prop_id as key, number of bookings as value.
bookedProperties = {}
for prop in bookings['prop_id']:
    bookedProperties[prop] = bookedProperties.get(prop, 0) + 1

In [18]:
# Function to get the number of bookings if bookings were made, zero otherwise.
def bookings_map(propertyid):
    return bookedProperties.get(propertyid, 0)

In [19]:
# Apply function. 
listings['number_of_bookings'] = listings.prop_id.map(bookings_map)
listings['booking_rate'] = listings.number_of_bookings / listings.tenure_months
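The counting loop and map above can also be collapsed into one vectorized expression; a sketch on toy data (hypothetical prop_ids, not the real files):

```python
import pandas as pd

# value_counts() tallies bookings per prop_id, map() aligns those tallies
# to the listings, and fillna(0) covers listings that were never booked.
listings = pd.DataFrame({'prop_id': [1, 2, 3], 'tenure_months': [10, 5, 20]})
bookings = pd.DataFrame({'prop_id': [1, 1, 3]})
listings['number_of_bookings'] = (listings['prop_id']
                                  .map(bookings['prop_id'].value_counts())
                                  .fillna(0)
                                  .astype(int))
listings['booking_rate'] = listings['number_of_bookings'] / listings['tenure_months']
print(listings['number_of_bookings'].tolist())  # [2, 0, 1]
```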

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [20]:
listings_new = listings[listings.tenure_months >= 10]

In [21]:
# Recode categories in prop_type so that dummy variables can be named later (i.e., whitespace to underscore).
# The SettingWithCopyWarning below is raised because listings_new is a slice of listings;
# creating it with listings[listings.tenure_months >= 10].copy() would avoid it.
listings_new['prop_typeR'] = listings_new.loc[:, 'prop_type'].map(lambda x: x.replace(' ', '_'))
listings_new['neighborhoodR'] = listings_new.loc[:, 'neighborhood'].map(lambda x: x.replace(' ', '_'))


-c:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
-c:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
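For what it's worth, the same recoding can be done with the vectorized `.str` accessor instead of a per-element lambda (a sketch on toy strings):

```python
import pandas as pd

# .str.replace applies the substitution to every element of the Series.
s = pd.Series(['Property type 1', 'Property type 2'])
print(s.str.replace(' ', '_').tolist())  # ['Property_type_1', 'Property_type_2']
```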

prop_type and neighborhood are categorical variables; use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each column of categorical data into many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category).


In [22]:
dummies = pd.get_dummies(listings_new[['neighborhoodR', 'prop_typeR']], prefix = ['dummy', 'dummy'])

In [23]:
# Join with listings_new.
listings_with_dummies = listings_new.join(dummies)
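What get_dummies produces, sketched on toy data: one indicator column per distinct category, named with the requested prefix.

```python
import pandas as pd

# Each distinct category becomes its own indicator column.
df = pd.DataFrame({'prop_typeR': ['Property_type_1', 'Property_type_2',
                                  'Property_type_1']})
dummies = pd.get_dummies(df['prop_typeR'], prefix='dummy')
print(list(dummies.columns))  # ['dummy_Property_type_1', 'dummy_Property_type_2']
```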

Create training and test sets for your regressors and predictors

The predictor (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [24]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn

In [25]:
predictor = np.array(listings_with_dummies['booking_rate'])

In [26]:
listings_reduced = listings_with_dummies.drop(['prop_id', 'booking_rate', 'prop_type', 'neighborhood',
                                               'number_of_bookings', 'prop_typeR', 'neighborhoodR'],
                                              axis = 1)

In [27]:
regressors_train, regressors_test, predictor_train, predictor_test = train_test_split(listings_reduced,
                                                                                      predictor,
                                                                                      test_size = 0.2,
                                                                                      random_state = 12)
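The 80/20 split can be sanity-checked on toy arrays: test_size=0.2 holds out a fifth of the rows, and random_state fixes the shuffle (hypothetical data, not the listings):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 rows in, 8 train / 2 test out; the same random_state always
# reproduces the same split.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=12)
print(len(X_tr), len(X_te))  # 8 2
```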

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [28]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Fit your model with your training sets


In [29]:
lr.fit(regressors_train, predictor_train)


Out[29]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [30]:
# Print R squared: the share of variance explained.
print('Variance score of test data: ', lr.score(regressors_test, predictor_test))


Variance score of test data:  0.234231968507

In [31]:
predicted = lr.predict(regressors_test)
expected = predictor_test

In [32]:
fig, ax = plt.subplots(1, 1)
ax.scatter(expected, predicted)
ax.plot(predictor, predictor, 'r')
ax.axis('tight')
ax.set_xlabel('Observed booking rate')
ax.set_ylabel('Predicted booking rate')
print('Root mean squared error of the regression:', np.sqrt(np.mean((predicted - expected) ** 2)))


Root mean squared error of the regression: 1.48085566538

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

This means that 23% of the variance in the test data is explained by the model trained on the training data. R squared is the explained variation divided by the total variation, so higher scores mean that more of the variation is explained by the model. This suggests that the fitted model is not that bad (studies involving human behavior often report explained variance below 50%), but it would probably be better to select features to feed into the model rather than entering them all. With only 144 individual data points, the number of regressors (here: 24) is also pretty large.
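Concretely, score() returns the coefficient of determination, R² = 1 − SS_res / SS_tot; a minimal sketch of the formula on toy numbers:

```python
import numpy as np

# R^2 = 1 - (residual sum of squares) / (total sum of squares about the mean).
observed = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.1, 1.9, 3.2, 3.8])
ss_res = np.sum((observed - predicted) ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.98
```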

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern as in Part 3 to model it.


In [33]:
# Copy orig data.
listings_copy = listings_new.copy()

In [34]:
# Compute new variable of interest.
listings_copy['monthly_revenue'] = (listings_copy.loc[:, 'number_of_bookings'] * listings_copy.loc[:, 'price']) / (listings_copy.loc[:, 'tenure_months'])
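The definition above on toy numbers (hypothetical listings): monthly revenue = bookings × price / tenure.

```python
import pandas as pd

# monthly_revenue = number_of_bookings * price / tenure_months
df = pd.DataFrame({'number_of_bookings': [4, 0],
                   'price': [100.0, 80.0],
                   'tenure_months': [10, 16]})
df['monthly_revenue'] = df['number_of_bookings'] * df['price'] / df['tenure_months']
print(df['monthly_revenue'].tolist())  # [40.0, 0.0]
```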

In [35]:
# Get dummy variables. Join with data copy.
dummies2 = pd.get_dummies(listings_copy[['neighborhoodR', 'prop_typeR']], prefix = ['dummy', 'dummy'])
listings_copy_dummies = listings_copy.join(dummies2)

In [36]:
# Predictor np array. Drop unwanted variables. 
predictor2 = np.array(listings_copy_dummies['monthly_revenue'])
listings_copy_reduced = listings_copy_dummies.drop(['prop_id', 'booking_rate', 'prop_type', 'neighborhood', 'number_of_bookings', 'prop_typeR', 'neighborhoodR', 'monthly_revenue'], axis = 1)

In [37]:
# Split. 
regr_train, regr_test, pred_train, pred_test = train_test_split(listings_copy_reduced, predictor2, test_size = 0.2, random_state = 9)

In [38]:
lr2 = LinearRegression()

In [39]:
lr2.fit(regr_train, pred_train)


Out[39]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [40]:
print('R squared: ', lr2.score(regr_test, pred_test))
print('Booooh, a horrible model (negative means the fitted model is worse than a horizontal line at the mean).')


R squared:  -0.0609556205214
Booooh, a horrible model (negative means the fitted model is worse than a horizontal line at the mean).

In [41]:
predicted2 = lr2.predict(regr_test)
expected2 = pred_test

In [42]:
fig, ax = plt.subplots(1, 1)
ax.scatter(expected2, predicted2)
ax.plot(predictor2, predictor2, 'r')
ax.axis('tight')
ax.set_xlabel('Observed monthly revenue (USD)')
ax.set_ylabel('Predicted monthly revenue (USD)')
print('Root mean squared error of the regression:', np.sqrt(np.mean((predicted2 - expected2) ** 2)))


Root mean squared error of the regression: 166.779819473

In [43]:
# I tried some transformations. A log transform did not normalize predictor2,
# so I went for a sqrt transform - even though it still didn't look Gaussian.
predictor2_sqrt = np.sqrt(predictor2)
# Split again.
regr2_train, regr2_test, pred2_train, pred2_test = train_test_split(listings_copy_reduced, predictor2_sqrt, test_size = 0.2, random_state = 9)
# Fit.
lr3 = LinearRegression()
lr3.fit(regr2_train, pred2_train)
# Look at the score.
print('R squared: ', lr3.score(regr2_test, pred2_test))
print('Ha, made it worse. Thoughts?')


R squared:  -0.19290939359
Ha, made it worse. Thoughts?
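One caveat on the sqrt experiment: an R² computed on the sqrt scale is not directly comparable to one computed on the original scale. A fairer check (a sketch with hypothetical numbers) is to square the predictions back and score them against the original-scale targets:

```python
import numpy as np

def r2(observed, predicted):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([4.0, 9.0, 16.0, 25.0])        # original-scale targets
pred_sqrt = np.array([2.1, 2.9, 4.1, 4.9])  # hypothetical sqrt-scale predictions
score = r2(y, pred_sqrt ** 2)               # back-transform, then score
print(round(score, 4))
```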
