Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

pd.set_option('display.max_rows', 20)
pd.set_option('display.precision', 4)

%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames, listings and bookings, from their respective data files.


In [2]:
bookings = pd.read_csv('../data/bookings.csv', parse_dates = ['booking_date'])
listings = pd.read_csv('../data/listings.csv')

What are the mean, median, and standard deviation of price, person capacity, picture count, description length, and tenure of the properties?


In [3]:
print('Mean of the variables of interest')
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Mean of the variables of interest
Out[3]:
price                 187.806
person_capacity         2.998
picture_count          14.390
description_length    309.159
tenure_months           8.488
dtype: float64

In [4]:
print('Standard deviation of the variables of interest')
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].std()


Standard deviation of the variables of interest
Out[4]:
price                 353.051
person_capacity         1.595
picture_count          10.477
description_length    228.022
tenure_months           5.872
dtype: float64

In [5]:
print('Median of the variables of interest')
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].median()


Median of the variables of interest
Out[5]:
price                 125.000
person_capacity         2.000
picture_count          12.000
description_length    250.000
tenure_months           7.000
dtype: float64

What are the mean price, person capacity, picture count, description length, and tenure of the properties grouped by property type?


In [6]:
print('Mean price, person capacity, picture count, description length, and tenure by property type')
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Mean price, person capacity, picture count, description length, and tenure by property type
Out[6]:
                   price  person_capacity  picture_count  description_length  tenure_months
prop_type
Property type 1  237.086            3.517         14.695             313.171          8.465
Property type 2   93.289            2.000         13.948             304.852          8.378
Property type 3   63.750            1.750          8.750             184.750         13.750
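The groupby pattern above can be sketched on toy data (hypothetical values, not the real listings):

```python
import pandas as pd

# Toy sketch of the groupby-then-mean pattern used above: rows are
# grouped by the categorical column, and mean() is taken per group.
df = pd.DataFrame({'prop_type': ['A', 'A', 'B'],
                   'price': [100.0, 200.0, 60.0]})
means = df.groupby('prop_type')['price'].mean()
print(means.tolist())  # [150.0, 60.0]
```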

Same, but by property type per neighborhood?


In [7]:
print('Property type 1: Mean price, person capacity, picture count, description length, and tenure by neighborhood')
listings[listings.prop_type == 'Property type 1'].groupby('neighborhood')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean().sort_index()


Property type 1: Mean price, person capacity, picture count, description length, and tenure by neighborhood
Out[7]:
                   price  person_capacity  picture_count  description_length  tenure_months
neighborhood
Neighborhood 1    85.000            2.000         26.000             209.000          6.000
Neighborhood 10  142.500            3.500         13.333             391.000          3.833
Neighborhood 11  159.429            3.214          9.929             379.000          9.643
Neighborhood 12  365.615            3.436         10.821             267.205          7.897
Neighborhood 13  241.898            4.061         15.653             290.408          9.122
Neighborhood 14  164.676            3.206         14.765             317.206          8.441
Neighborhood 15  178.880            3.720         14.320             321.760          9.320
Neighborhood 16  158.929            2.929         21.643             310.714          7.071
Neighborhood 17  189.870            3.522         16.087             317.348          9.870
Neighborhood 18  173.591            2.955         16.091             369.227          8.227
Neighborhood 19  222.375            3.625         11.000             254.500          6.500
Neighborhood 2   250.000            6.000          8.000             423.000          6.000
Neighborhood 20  804.333            2.778          9.444             223.556          9.667
Neighborhood 21  362.500            4.250         49.000             306.250         14.750
Neighborhood 22  225.000            3.000         19.000             500.000          9.000
Neighborhood 5   194.500            2.500          8.500             266.500         11.500
Neighborhood 6   146.000            3.333         12.667             290.667          4.000
Neighborhood 7   161.000            3.667         14.333             343.000          5.333
Neighborhood 8   174.750            5.000         11.000             300.000          6.750
Neighborhood 9   151.143            4.286         13.429             471.429          5.714

In [8]:
print('Property type 2: Mean price, person capacity, picture count, description length, and tenure by neighborhood')
listings[listings.prop_type == 'Property type 2'].groupby('neighborhood')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean().sort_index()


Property type 2: Mean price, person capacity, picture count, description length, and tenure by neighborhood
Out[8]:
                   price  person_capacity  picture_count  description_length  tenure_months
neighborhood
Neighborhood 10  137.500            2.000         20.000             126.000          3.500
Neighborhood 11   78.750            2.000         16.750             161.250         11.250
Neighborhood 12   96.895            1.947         10.474             244.526          9.842
Neighborhood 13   81.130            1.826         16.696             418.565          9.739
Neighborhood 14   83.810            1.857         15.905             348.619          8.714
Neighborhood 15   95.000            2.267         11.733             301.733          8.200
Neighborhood 16   83.625            2.062         15.375             246.250          6.688
Neighborhood 17  102.455            2.000         15.455             308.273          7.182
Neighborhood 18  120.667            2.222         12.333             297.778          9.222
Neighborhood 19   88.875            2.000         15.125             383.375          5.500
Neighborhood 20   60.000            1.000          3.000             101.000          6.000
Neighborhood 3    60.000            2.000          7.000             264.000          9.000
Neighborhood 4    60.000            2.000         10.000              95.000         11.000
Neighborhood 7   100.000            2.000          3.000             148.000          2.000
Neighborhood 8   350.000            4.000          5.000             223.000          3.000
Neighborhood 9   110.000            2.000          3.500             114.500          9.000

In [9]:
print('Property type 3: Mean price, person capacity, picture count, description length, and tenure by neighborhood')
listings[listings.prop_type == 'Property type 3'].groupby('neighborhood')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Property type 3: Mean price, person capacity, picture count, description length, and tenure by neighborhood
Out[9]:
                   price  person_capacity  picture_count  description_length  tenure_months
neighborhood
Neighborhood 11       75                2             15                 196              8
Neighborhood 14       75                1              1                 113              5
Neighborhood 17       65                2             15                 189             23
Neighborhood 4        40                2              4                 241             19

Plot daily bookings:


In [10]:
plot1 = bookings.booking_date.value_counts().sort_index().plot(figsize = (25, 10))
plot1.set_title('Daily bookings')
plot1.set_xlabel('Date')
plot1.set_ylabel('Number of bookings')


Out[10]:
<matplotlib.text.Text at 0x108fd7810>

Plot the daily bookings per neighborhood (provide a legend)


In [11]:
lb = listings.merge(bookings)

In [12]:
lb['day_of_week'] = lb.booking_date.dt.dayofweek

In [13]:
DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
             'Saturday', 'Sunday']

def day_word(number):
    # Map pandas dayofweek codes (0 = Monday) to weekday names.
    return DAY_NAMES[number]

In [14]:
lb['day'] = lb['day_of_week'].map(day_word)
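As an aside, recent pandas versions can skip the helper entirely: datetime columns expose weekday names directly through the `.dt.day_name()` accessor. A minimal sketch on hypothetical dates:

```python
import pandas as pd

# .dt.day_name() returns the weekday name for each timestamp,
# with no lookup table needed.
dates = pd.Series(pd.to_datetime(['2014-01-06', '2014-01-07', '2014-01-12']))
print(dates.dt.day_name().tolist())  # ['Monday', 'Tuesday', 'Sunday']
```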

In [15]:
# Warning: Super unpretty graph to follow. :( But I was told not to worry. :)
lb.groupby(['neighborhood', 'booking_date'])['prop_id'].count().unstack(0).plot(figsize = (20, 10))
plt.title('Daily bookings by neighborhood')
plt.xlabel('Date')
plt.ylabel('Number of bookings')
plt.ylim([0, 14]);



In [16]:
# Here's a prettier plot. But it uses day of the week on x axis.
# I could not figure out how to re-order x axis labels ... 
lb.groupby(['neighborhood', 'day'])['prop_id'].count().unstack(0).plot(figsize = (20, 10))
plt.title('Daily bookings by neighborhood')
plt.xlabel('Day of the week')
plt.ylabel('Number of bookings')
plt.ylim([0, 35]);
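On re-ordering the x-axis labels: one way (a sketch on toy data) is to make the day column an ordered categorical before grouping, so groupby emits the categories in calendar order rather than alphabetically:

```python
import pandas as pd

# Declare the calendar order, then group; observed=False keeps all
# seven categories in the declared order, even days with no bookings.
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
         'Saturday', 'Sunday']
df = pd.DataFrame({'day': ['Sunday', 'Friday', 'Monday', 'Friday'],
                   'prop_id': [1, 2, 3, 4]})
df['day'] = pd.Categorical(df['day'], categories=order, ordered=True)
counts = df.groupby('day', observed=False)['prop_id'].count()
print(counts.index.tolist())  # calendar order, Monday first
```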


Part 2 - Develop a data set

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [17]:
# Not all listings had bookings.
# Count the bookings per property: prop_id as key, number of bookings as value.
bookedProperties = {}
for prop in bookings['prop_id']:
    bookedProperties[prop] = bookedProperties.get(prop, 0) + 1

In [18]:
# Function to get the number of bookings if bookings were made, zero otherwise.
def bookings_map(propertyid):
    return bookedProperties.get(propertyid, 0)

In [19]:
# Apply function. 
listings['number_of_bookings'] = listings.prop_id.map(bookings_map)
listings['booking_rate'] = listings.number_of_bookings / listings.tenure_months
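The counting loop and map above can also be collapsed into one vectorized expression; a sketch on toy data (hypothetical prop_ids, not the real files):

```python
import pandas as pd

# value_counts() tallies bookings per prop_id, map() aligns those tallies
# to the listings, and fillna(0) covers listings that were never booked.
listings = pd.DataFrame({'prop_id': [1, 2, 3], 'tenure_months': [10, 5, 20]})
bookings = pd.DataFrame({'prop_id': [1, 1, 3]})
listings['number_of_bookings'] = (listings['prop_id']
                                  .map(bookings['prop_id'].value_counts())
                                  .fillna(0)
                                  .astype(int))
listings['booking_rate'] = listings['number_of_bookings'] / listings['tenure_months']
print(listings['number_of_bookings'].tolist())  # [2, 0, 1]
```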

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [20]:
listings_new = listings[listings.tenure_months >= 10]

In [21]:
# Recode categories in prop_type so that dummy variables can be named later (i.e., whitespace to underscore).
# The SettingWithCopyWarning below is raised because listings_new is a slice of listings;
# creating it with listings[listings.tenure_months >= 10].copy() would avoid it.
listings_new['prop_typeR'] = listings_new.loc[:, 'prop_type'].map(lambda x: x.replace(' ', '_'))
listings_new['neighborhoodR'] = listings_new.loc[:, 'neighborhood'].map(lambda x: x.replace(' ', '_'))


-c:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
-c:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
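For what it's worth, the same recoding can be done with the vectorized `.str` accessor instead of a per-element lambda (a sketch on toy strings):

```python
import pandas as pd

# .str.replace applies the substitution to every element of the Series.
s = pd.Series(['Property type 1', 'Property type 2'])
print(s.str.replace(' ', '_').tolist())  # ['Property_type_1', 'Property_type_2']
```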

prop_type and neighborhood are categorical variables; use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each column of categorical data into many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category).


In [22]:
dummies = pd.get_dummies(listings_new[['neighborhoodR', 'prop_typeR']], prefix = ['dummy', 'dummy'])

In [23]:
# Join with listings_new.
listings_with_dummies = listings_new.join(dummies)
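What get_dummies produces, sketched on toy data: one indicator column per distinct category, named with the requested prefix.

```python
import pandas as pd

# Each distinct category becomes its own indicator column.
df = pd.DataFrame({'prop_typeR': ['Property_type_1', 'Property_type_2',
                                  'Property_type_1']})
dummies = pd.get_dummies(df['prop_typeR'], prefix='dummy')
print(list(dummies.columns))  # ['dummy_Property_type_1', 'dummy_Property_type_2']
```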

Create training and test sets for your regressors and predictors

The predictor (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [24]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn

In [25]:
predictor = np.array(listings_with_dummies['booking_rate'])

In [26]:
listings_reduced = listings_with_dummies.drop(['prop_id', 'booking_rate', 'prop_type', 'neighborhood',
                                               'number_of_bookings', 'prop_typeR', 'neighborhoodR'],
                                              axis = 1)

In [27]:
regressors_train, regressors_test, predictor_train, predictor_test = train_test_split(listings_reduced,
                                                                                      predictor,
                                                                                      test_size = 0.2,
                                                                                      random_state = 12)
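The 80/20 split can be sanity-checked on toy arrays: test_size=0.2 holds out a fifth of the rows, and random_state fixes the shuffle (hypothetical data, not the listings):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 rows in, 8 train / 2 test out; the same random_state always
# reproduces the same split.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=12)
print(len(X_tr), len(X_te))  # 8 2
```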

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [28]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Fit your model with your training sets


In [29]:
lr.fit(regressors_train, predictor_train)


Out[29]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [30]:
# Print R squared: the share of variance explained.
print('Variance score of test data: ', lr.score(regressors_test, predictor_test))


Variance score of test data:  0.234231968507

In [31]:
predicted = lr.predict(regressors_test)
expected = predictor_test

In [32]:
fig, ax = plt.subplots(1, 1)
ax.scatter(expected, predicted)
ax.plot(predictor, predictor, 'r')
ax.axis('tight')
ax.set_xlabel('Observed booking rate')
ax.set_ylabel('Predicted booking rate')
print('Root mean squared error of the regression:', np.sqrt(np.mean((predicted - expected) ** 2)))


Root mean squared error of the regression: 1.48085566538

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

This means that 23% of the variance in the test data is explained by the model trained on the training data. R squared is the explained variation divided by the total variation, so higher scores mean that more of the variation is explained by the model. This suggests that the fitted model is not that bad (studies involving human behavior often report explained variance below 50%), but it would probably be better to select features to feed into the model rather than entering them all. With only 144 individual data points, the number of regressors (here: 24) is also pretty large.
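Concretely, score() returns the coefficient of determination, R² = 1 − SS_res / SS_tot; a minimal sketch of the formula on toy numbers:

```python
import numpy as np

# R^2 = 1 - (residual sum of squares) / (total sum of squares about the mean).
observed = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.1, 1.9, 3.2, 3.8])
ss_res = np.sum((observed - predicted) ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.98
```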

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern as in Part 3 to model it.


In [33]:
# Copy orig data.
listings_copy = listings_new.copy()

In [34]:
# Compute new variable of interest.
listings_copy['monthly_revenue'] = (listings_copy.loc[:, 'number_of_bookings'] * listings_copy.loc[:, 'price']) / (listings_copy.loc[:, 'tenure_months'])
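The definition above on toy numbers (hypothetical listings): monthly revenue = bookings × price / tenure.

```python
import pandas as pd

# monthly_revenue = number_of_bookings * price / tenure_months
df = pd.DataFrame({'number_of_bookings': [4, 0],
                   'price': [100.0, 80.0],
                   'tenure_months': [10, 16]})
df['monthly_revenue'] = df['number_of_bookings'] * df['price'] / df['tenure_months']
print(df['monthly_revenue'].tolist())  # [40.0, 0.0]
```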

In [35]:
# Get dummy variables. Join with data copy.
dummies2 = pd.get_dummies(listings_copy[['neighborhoodR', 'prop_typeR']], prefix = ['dummy', 'dummy'])
listings_copy_dummies = listings_copy.join(dummies2)

In [36]:
# Predictor np array. Drop unwanted variables. 
predictor2 = np.array(listings_copy_dummies['monthly_revenue'])
listings_copy_reduced = listings_copy_dummies.drop(['prop_id', 'booking_rate', 'prop_type', 'neighborhood', 'number_of_bookings', 'prop_typeR', 'neighborhoodR', 'monthly_revenue'], axis = 1)

In [37]:
# Split. 
regr_train, regr_test, pred_train, pred_test = train_test_split(listings_copy_reduced, predictor2, test_size = 0.2, random_state = 9)

In [38]:
lr2 = LinearRegression()

In [39]:
lr2.fit(regr_train, pred_train)


Out[39]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [40]:
print('R squared: ', lr2.score(regr_test, pred_test))
print('Booooh, a horrible model (negative means the fitted model is worse than a horizontal line at the mean).')


R squared:  -0.0609556205214
Booooh, a horrible model (negative means the fitted model is worse than a horizontal line at the mean).

In [41]:
predicted2 = lr2.predict(regr_test)
expected2 = pred_test

In [42]:
fig, ax = plt.subplots(1, 1)
ax.scatter(expected2, predicted2)
ax.plot(predictor2, predictor2, 'r')
ax.axis('tight')
ax.set_xlabel('Observed monthly revenue (USD)')
ax.set_ylabel('Predicted monthly revenue (USD)')
print('Root mean squared error of the regression:', np.sqrt(np.mean((predicted2 - expected2) ** 2)))


Root mean squared error of the regression: 166.779819473

In [43]:
# I tried some transformations. A log transform did not normalize predictor2,
# so I went for a sqrt transform - even though it still didn't look Gaussian.
predictor2_sqrt = np.sqrt(predictor2)
# Split again.
regr2_train, regr2_test, pred2_train, pred2_test = train_test_split(listings_copy_reduced, predictor2_sqrt, test_size = 0.2, random_state = 9)
# Fit.
lr3 = LinearRegression()
lr3.fit(regr2_train, pred2_train)
# Look at the score.
print('R squared: ', lr3.score(regr2_test, pred2_test))
print('Ha, made it worse. Thoughts?')


R squared:  -0.19290939359
Ha, made it worse. Thoughts?
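One caveat on the sqrt experiment: an R² computed on the sqrt scale is not directly comparable to one computed on the original scale. A fairer check (a sketch with hypothetical numbers) is to square the predictions back and score them against the original-scale targets:

```python
import numpy as np

def r2(observed, predicted):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([4.0, 9.0, 16.0, 25.0])        # original-scale targets
pred_sqrt = np.array([2.1, 2.9, 4.1, 4.9])  # hypothetical sqrt-scale predictions
score = r2(y, pred_sqrt ** 2)               # back-transform, then score
print(round(score, 4))
```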
