Homework 1 - Data Analysis and Regression

In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and begin our exploration of regression.


In [1]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import itertools
import sklearn as sk
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# This enables inline Plots
%matplotlib inline

# Limit rows displayed in notebook
pd.set_option('display.max_rows', 20)
pd.set_option('display.precision', 4)

Part 1 - Data exploration

First, create two data frames, listings and bookings, from their respective data files.


In [2]:
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')

bookings['booking_date'] = pd.to_datetime(bookings.booking_date)

In [3]:
print listings.info()
print bookings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)
memory usage: 142.4 KB
None

What are the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [4]:
listings.describe()


Out[4]:
prop_id price person_capacity picture_count description_length tenure_months
count 408.000 408.000 408.000 408.000 408.000 408.000
mean 204.500 187.806 2.998 14.390 309.159 8.488
std 117.924 353.051 1.595 10.477 228.022 5.872
min 1.000 39.000 1.000 1.000 0.000 1.000
25% 102.750 90.000 2.000 6.000 179.000 4.000
50% 204.500 125.000 2.000 12.000 250.000 7.000
75% 306.250 199.000 4.000 20.000 389.500 13.000
max 408.000 5000.000 10.000 71.000 1969.000 30.000
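
Note that describe() reports the median as the 50% row. For exactly the statistics asked for, a direct computation works as well; a minimal sketch on the same listings frame:

cols = ['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']
pd.DataFrame({'mean': listings[cols].mean(),
              'median': listings[cols].median(),
              'std': listings[cols].std()})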

What are the mean price, person capacity, picture count, description length and tenure of the properties, grouped by property type?


In [5]:
listings.groupby(['prop_type'])['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'].agg(['mean'])


Out[5]:
price person_capacity picture_count description_length tenure_months
mean mean mean mean mean
prop_type
Property type 1 237.086 3.517 14.695 313.171 8.465
Property type 2 93.289 2.000 13.948 304.852 8.378
Property type 3 63.750 1.750 8.750 184.750 13.750

Same, but grouped by property type and neighborhood?


In [6]:
listings.groupby(['prop_type', 'neighborhood'])['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'].mean().sort(['price'], ascending=False)


Out[6]:
price person_capacity picture_count description_length tenure_months
prop_type neighborhood
Property type 1 Neighborhood 20 804.333 2.778 9.444 223.556 9.667
Neighborhood 12 365.615 3.436 10.821 267.205 7.897
Neighborhood 21 362.500 4.250 49.000 306.250 14.750
Property type 2 Neighborhood 8 350.000 4.000 5.000 223.000 3.000
Property type 1 Neighborhood 2 250.000 6.000 8.000 423.000 6.000
Neighborhood 13 241.898 4.061 15.653 290.408 9.122
Neighborhood 22 225.000 3.000 19.000 500.000 9.000
Neighborhood 19 222.375 3.625 11.000 254.500 6.500
Neighborhood 5 194.500 2.500 8.500 266.500 11.500
Neighborhood 17 189.870 3.522 16.087 317.348 9.870
... ... ... ... ... ... ...
Property type 2 Neighborhood 16 83.625 2.062 15.375 246.250 6.688
Neighborhood 13 81.130 1.826 16.696 418.565 9.739
Neighborhood 11 78.750 2.000 16.750 161.250 11.250
Property type 3 Neighborhood 11 75.000 2.000 15.000 196.000 8.000
Neighborhood 14 75.000 1.000 1.000 113.000 5.000
Neighborhood 17 65.000 2.000 15.000 189.000 23.000
Property type 2 Neighborhood 20 60.000 1.000 3.000 101.000 6.000
Neighborhood 3 60.000 2.000 7.000 264.000 9.000
Neighborhood 4 60.000 2.000 10.000 95.000 11.000
Property type 3 Neighborhood 4 40.000 2.000 4.000 241.000 19.000

40 rows × 5 columns

Plot daily bookings:


In [7]:
bookings.booking_date.value_counts().sort_index().plot(figsize=(20, 10))  # sort by date so the series plots chronologically


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x109771bd0>

Plot the daily bookings per neighborhood (provide a legend)


In [8]:
bookings_neighborhood = pd.merge(listings, bookings, on='prop_id')
bookings_neighborhood = bookings_neighborhood.groupby(['neighborhood','booking_date'])['neighborhood'].agg(['count']).unstack(0)

colors = itertools.cycle(['b', 'g', 'r', 'c', 'm', 'y'])
markers = itertools.cycle(mlines.Line2D.filled_markers)

fig, ax = plt.subplots(1,1)
fig.text(.5,.95, "Daily Bookings by Neighborhood", fontsize=20, ha='center')
fig.set_figwidth(30)
fig.set_figheight(15)
for neighborhood in bookings_neighborhood:
    # keep only the days on which this neighborhood had at least one booking
    y = bookings_neighborhood[bookings_neighborhood[neighborhood].notnull()][[neighborhood]]
    x = y.index
    marker = markers.next()
    color = colors.next()
    ax.scatter(x, y, marker=marker, s=50, c=color, label=neighborhood)
ax.legend(loc='upper right');
ax.set_ylabel('Bookings');


Part 2 - Develop a data set


In [9]:
bookings_neighborhood


Out[9]:
count
neighborhood Neighborhood 1 Neighborhood 10 Neighborhood 11 Neighborhood 12 Neighborhood 13 Neighborhood 14 Neighborhood 15 Neighborhood 16 Neighborhood 17 Neighborhood 18 ... Neighborhood 20 Neighborhood 21 Neighborhood 22 Neighborhood 3 Neighborhood 4 Neighborhood 5 Neighborhood 6 Neighborhood 7 Neighborhood 8 Neighborhood 9
booking_date
2011-01-01 NaN NaN NaN NaN 4 3 1 NaN 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
2011-01-02 NaN NaN 1 NaN 1 3 1 1 NaN 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-03 NaN NaN NaN NaN 5 2 NaN NaN 2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-01-04 NaN NaN NaN 1 1 1 2 NaN NaN 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
2011-01-05 NaN NaN 1 NaN 1 6 3 1 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-01-06 NaN NaN NaN NaN 4 NaN NaN NaN 2 2 ... NaN NaN NaN NaN NaN NaN NaN NaN 1 1
2011-01-07 NaN NaN 1 1 3 3 2 NaN 2 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-08 NaN NaN 1 NaN 1 1 1 1 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-09 NaN NaN 1 1 1 3 NaN NaN 3 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-10 NaN NaN NaN 1 2 3 2 1 2 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2011-12-22 NaN NaN NaN NaN 2 NaN NaN NaN 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-23 NaN NaN 1 1 3 2 NaN NaN 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-24 NaN NaN 2 1 2 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-25 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 ... NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN
2011-12-26 NaN NaN 1 NaN 2 2 2 NaN 1 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-12-27 NaN NaN NaN 1 2 2 1 2 NaN NaN ... 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-28 NaN NaN 1 2 NaN 4 3 NaN 1 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-29 NaN NaN NaN NaN 3 2 3 1 2 NaN ... 1 NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-12-30 NaN NaN NaN NaN 2 4 1 2 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN
2011-12-31 NaN NaN NaN 2 3 NaN 2 3 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

365 rows × 21 columns

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [10]:
df = bookings.groupby('prop_id').count().reset_index()
df = pd.merge(listings, df, on='prop_id')
df.rename(columns={'booking_date': 'number_of_bookings'}, inplace=True)
df['booking_rate'] = df.number_of_bookings / df.tenure_months
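
Note that pd.merge defaults to an inner join, so a listing with zero bookings would drop out of df here. If you wanted to keep such listings with a count of 0, a left merge works; a sketch, not part of the solution above (booking_counts and df_all are illustrative names):

booking_counts = bookings.groupby('prop_id')['booking_date'].count().reset_index()
booking_counts = booking_counts.rename(columns={'booking_date': 'number_of_bookings'})
df_all = pd.merge(listings, booking_counts, on='prop_id', how='left')
df_all['number_of_bookings'] = df_all['number_of_bookings'].fillna(0)
df_all['booking_rate'] = df_all.number_of_bookings / df_all.tenure_months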

We only want to analyze well-established properties, so filter out any property with a tenure of less than 10 months.


In [11]:
df = df[df['tenure_months'] >= 10]

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. After applying this function correctly there should be one column for every prop_type category and one column for every neighborhood category.


In [12]:
df = pd.get_dummies(df, columns=['prop_type', 'neighborhood'])
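
A quick sanity check that the encoding produced one boolean column per category (a sketch; the column prefixes follow get_dummies' default naming):

print [c for c in df.columns if c.startswith('prop_type')]
print [c for c in df.columns if c.startswith('neighborhood')]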

Create test and training sets for your regressors and target.

The target (y) is booking_rate; the regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [13]:
y = df.booking_rate.values
# tenure_months is dropped too, since booking_rate is derived from it;
# prop_type and neighborhood were already replaced by their dummy columns
X = df.drop(['prop_id', 'booking_rate', 'number_of_bookings', 'tenure_months'], axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [14]:
degree = 1  # degree-1 polynomial features just add a bias column, so this is ordinary linear regression
est = make_pipeline(PolynomialFeatures(degree), LinearRegression(normalize=True))

Fit your model on your training set.


In [15]:
regr = est.fit(X_train,y_train)

In [16]:
np.shape(y_test)


Out[16]:
(23,)

In [17]:
regr.score(X_test, y_test)


Out[17]:
-0.27961212193809937

In [18]:
test_predict = regr.predict(X_test)
train_predict = regr.predict(X_train)
fig, ax = plt.subplots(1,1)
ax.scatter(train_predict, y_train, c='r')
ax.scatter(test_predict, y_test)
ax.set_xlabel('Predicted Booking Rate');
ax.set_ylabel('Actual Booking Rate');


Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

The score method returns the coefficient of determination R^2 of the prediction.

R^2 is defined as 1 - u/v, where u is the residual sum of squares, ((y_true - y_pred) ** 2).sum(), and v is the total sum of squares, ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0; a model that always predicted the mean of y would score 0.0, and worse models score negative.

Here the test-set R^2 is negative (about -0.28), so the fitted model predicts the held-out booking rates worse than simply predicting their mean: this linear model generalizes poorly.
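
For concreteness, the reported score can be reproduced by hand from that definition. A minimal sketch, assuming the regr, X_test and y_test objects from the cells above:

# R^2 = 1 - u/v, computed directly on the held-out test set
y_pred = regr.predict(X_test)
u = ((y_test - y_pred) ** 2).sum()          # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()   # total sum of squares
print 1 - u / v                             # should match regr.score(X_test, y_test)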

Optional - Iterate

Create an alternative target (e.g. monthly revenue) and use the same modeling pattern as in Part 3 to model it.


In [19]:
df['monthly_revenue'] = df.number_of_bookings*df.price/df.tenure_months

In [20]:
y = df.monthly_revenue.values
X = df.drop(['prop_id','booking_rate','number_of_bookings', 'tenure_months','monthly_revenue','price'],axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [21]:
regr = est.fit(X_train, y_train)
regr.score(X_test, y_test)


Out[21]:
-0.28927345519073255

In [22]:
test_predict = regr.predict(X_test)

In [23]:
# Sum of squared deviations of the predictions from the mean of y_test
# (note: the total sum of squares of a naive mean model would use y_test here, not test_predict)
sum_sq_naive = np.sum((test_predict - y_test.mean())**2)
sum_sq_naive


Out[23]:
181888.60390989989

In [24]:
test_predict = regr.predict(X_test)
train_predict = regr.predict(X_train)
fig, ax = plt.subplots(1,1)
ax.scatter(train_predict, y_train, c='r')
ax.scatter(test_predict, y_test)
ax.set_xlabel('Predicted Monthly Revenue');
ax.set_ylabel('Actual Monthly Revenue');

