Homework 1 - Data Analysis and Regression

In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and begin our exploration of regression.


In [1]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import itertools
import sklearn as sk
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# This enables inline Plots
%matplotlib inline

# Limit rows displayed in notebook
pd.set_option('display.max_rows', 20)
pd.set_option('display.precision', 4)

Part 1 - Data exploration

First, create two data frames, listings and bookings, from their respective data files.


In [2]:
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')

bookings['booking_date'] = pd.to_datetime(bookings.booking_date)

In [3]:
print listings.info()
print bookings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)
memory usage: 142.4 KB
None

What are the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [4]:
listings.describe()


Out[4]:
prop_id price person_capacity picture_count description_length tenure_months
count 408.000 408.000 408.000 408.000 408.000 408.000
mean 204.500 187.806 2.998 14.390 309.159 8.488
std 117.924 353.051 1.595 10.477 228.022 5.872
min 1.000 39.000 1.000 1.000 0.000 1.000
25% 102.750 90.000 2.000 6.000 179.000 4.000
50% 204.500 125.000 2.000 12.000 250.000 7.000
75% 306.250 199.000 4.000 20.000 389.500 13.000
max 408.000 5000.000 10.000 71.000 1969.000 30.000
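
Note that describe() reports the median as the 50% row. For exactly the statistics asked for, a direct computation works as well; a minimal sketch on the same listings frame:

cols = ['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']
pd.DataFrame({'mean': listings[cols].mean(),
              'median': listings[cols].median(),
              'std': listings[cols].std()})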

What are the mean price, person capacity, picture count, description length and tenure of the properties, grouped by property type?


In [5]:
listings.groupby(['prop_type'])['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'].agg(['mean'])


Out[5]:
price person_capacity picture_count description_length tenure_months
mean mean mean mean mean
prop_type
Property type 1 237.086 3.517 14.695 313.171 8.465
Property type 2 93.289 2.000 13.948 304.852 8.378
Property type 3 63.750 1.750 8.750 184.750 13.750

Same, but grouped by property type and neighborhood?


In [6]:
listings.groupby(['prop_type', 'neighborhood'])['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'].mean().sort(['price'], ascending=False)


Out[6]:
price person_capacity picture_count description_length tenure_months
prop_type neighborhood
Property type 1 Neighborhood 20 804.333 2.778 9.444 223.556 9.667
Neighborhood 12 365.615 3.436 10.821 267.205 7.897
Neighborhood 21 362.500 4.250 49.000 306.250 14.750
Property type 2 Neighborhood 8 350.000 4.000 5.000 223.000 3.000
Property type 1 Neighborhood 2 250.000 6.000 8.000 423.000 6.000
Neighborhood 13 241.898 4.061 15.653 290.408 9.122
Neighborhood 22 225.000 3.000 19.000 500.000 9.000
Neighborhood 19 222.375 3.625 11.000 254.500 6.500
Neighborhood 5 194.500 2.500 8.500 266.500 11.500
Neighborhood 17 189.870 3.522 16.087 317.348 9.870
... ... ... ... ... ... ...
Property type 2 Neighborhood 16 83.625 2.062 15.375 246.250 6.688
Neighborhood 13 81.130 1.826 16.696 418.565 9.739
Neighborhood 11 78.750 2.000 16.750 161.250 11.250
Property type 3 Neighborhood 11 75.000 2.000 15.000 196.000 8.000
Neighborhood 14 75.000 1.000 1.000 113.000 5.000
Neighborhood 17 65.000 2.000 15.000 189.000 23.000
Property type 2 Neighborhood 20 60.000 1.000 3.000 101.000 6.000
Neighborhood 3 60.000 2.000 7.000 264.000 9.000
Neighborhood 4 60.000 2.000 10.000 95.000 11.000
Property type 3 Neighborhood 4 40.000 2.000 4.000 241.000 19.000

40 rows × 5 columns

Plot daily bookings:


In [7]:
bookings.booking_date.value_counts().sort_index().plot(figsize=(20, 10))  # sort by date so the series plots chronologically


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x109771bd0>

Plot the daily bookings per neighborhood (provide a legend)


In [8]:
bookings_neighborhood = pd.merge(listings, bookings, on='prop_id')
bookings_neighborhood = bookings_neighborhood.groupby(['neighborhood','booking_date'])['neighborhood'].agg(['count']).unstack(0)

colors = itertools.cycle(['b', 'g', 'r', 'c', 'm', 'y'])
markers = itertools.cycle(mlines.Line2D.filled_markers)

fig, ax = plt.subplots(1,1)
fig.text(.5,.95, "Daily Bookings by Neighborhood", fontsize=20, ha='center')
fig.set_figwidth(30)
fig.set_figheight(15)
for neighborhood in bookings_neighborhood:
    # keep only the days on which this neighborhood had at least one booking
    y = bookings_neighborhood[bookings_neighborhood[neighborhood].notnull()][[neighborhood]]
    x = y.index
    marker = markers.next()
    color = colors.next()
    ax.scatter(x, y, marker=marker, s=50, c=color, label=neighborhood)
ax.legend(loc='upper right');
ax.set_ylabel('Bookings');


Part 2 - Develop a data set


In [9]:
bookings_neighborhood


Out[9]:
count
neighborhood Neighborhood 1 Neighborhood 10 Neighborhood 11 Neighborhood 12 Neighborhood 13 Neighborhood 14 Neighborhood 15 Neighborhood 16 Neighborhood 17 Neighborhood 18 ... Neighborhood 20 Neighborhood 21 Neighborhood 22 Neighborhood 3 Neighborhood 4 Neighborhood 5 Neighborhood 6 Neighborhood 7 Neighborhood 8 Neighborhood 9
booking_date
2011-01-01 NaN NaN NaN NaN 4 3 1 NaN 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
2011-01-02 NaN NaN 1 NaN 1 3 1 1 NaN 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-03 NaN NaN NaN NaN 5 2 NaN NaN 2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-01-04 NaN NaN NaN 1 1 1 2 NaN NaN 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
2011-01-05 NaN NaN 1 NaN 1 6 3 1 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-01-06 NaN NaN NaN NaN 4 NaN NaN NaN 2 2 ... NaN NaN NaN NaN NaN NaN NaN NaN 1 1
2011-01-07 NaN NaN 1 1 3 3 2 NaN 2 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-08 NaN NaN 1 NaN 1 1 1 1 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-09 NaN NaN 1 1 1 3 NaN NaN 3 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-10 NaN NaN NaN 1 2 3 2 1 2 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2011-12-22 NaN NaN NaN NaN 2 NaN NaN NaN 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-23 NaN NaN 1 1 3 2 NaN NaN 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-24 NaN NaN 2 1 2 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-25 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 ... NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN
2011-12-26 NaN NaN 1 NaN 2 2 2 NaN 1 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-12-27 NaN NaN NaN 1 2 2 1 2 NaN NaN ... 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-28 NaN NaN 1 2 NaN 4 3 NaN 1 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-29 NaN NaN NaN NaN 3 2 3 1 2 NaN ... 1 NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-12-30 NaN NaN NaN NaN 2 4 1 2 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN
2011-12-31 NaN NaN NaN 2 3 NaN 2 3 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

365 rows × 21 columns

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [10]:
df = bookings.groupby('prop_id').count().reset_index()
df = pd.merge(listings, df, on='prop_id')
df.rename(columns={'booking_date': 'number_of_bookings'}, inplace=True)
df['booking_rate'] = df.number_of_bookings / df.tenure_months
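
Note that pd.merge defaults to an inner join, so a listing with zero bookings would drop out of df here. If you wanted to keep such listings with a count of 0, a left merge works; a sketch, not part of the solution above (booking_counts and df_all are illustrative names):

booking_counts = bookings.groupby('prop_id')['booking_date'].count().reset_index()
booking_counts = booking_counts.rename(columns={'booking_date': 'number_of_bookings'})
df_all = pd.merge(listings, booking_counts, on='prop_id', how='left')
df_all['number_of_bookings'] = df_all['number_of_bookings'].fillna(0)
df_all['booking_rate'] = df_all.number_of_bookings / df_all.tenure_months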

We only want to analyze well-established properties, so filter out any property with a tenure of less than 10 months.


In [11]:
df = df[df['tenure_months'] >= 10]

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. After applying this function correctly there should be one column for every prop_type category and one column for every neighborhood category.


In [12]:
df = pd.get_dummies(df, columns=['prop_type', 'neighborhood'])
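
A quick sanity check that the encoding produced one boolean column per category (a sketch; the column prefixes follow get_dummies' default naming):

print [c for c in df.columns if c.startswith('prop_type')]
print [c for c in df.columns if c.startswith('neighborhood')]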

Create test and training sets for your regressors and target.

The target (y) is booking_rate; the regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [13]:
y = df.booking_rate.values
# tenure_months is dropped too, since booking_rate is derived from it;
# prop_type and neighborhood were already replaced by their dummy columns
X = df.drop(['prop_id', 'booking_rate', 'number_of_bookings', 'tenure_months'], axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [14]:
degree = 1  # degree-1 polynomial features just add a bias column, so this is ordinary linear regression
est = make_pipeline(PolynomialFeatures(degree), LinearRegression(normalize=True))

Fit your model on your training set.


In [15]:
regr = est.fit(X_train,y_train)

In [16]:
np.shape(y_test)


Out[16]:
(23,)

In [17]:
regr.score(X_test, y_test)


Out[17]:
-0.27961212193809937

In [18]:
test_predict = regr.predict(X_test)
train_predict = regr.predict(X_train)
fig, ax = plt.subplots(1,1)
ax.scatter(train_predict, y_train, c='r')
ax.scatter(test_predict, y_test)
ax.set_xlabel('Predicted Booking Rate');
ax.set_ylabel('Actual Booking Rate');


Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

The score method returns the coefficient of determination R^2 of the prediction.

R^2 is defined as 1 - u/v, where u is the residual sum of squares, ((y_true - y_pred) ** 2).sum(), and v is the total sum of squares, ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0; a model that always predicted the mean of y would score 0.0, and worse models score negative.

Here the test-set R^2 is negative (about -0.28), so the fitted model predicts the held-out booking rates worse than simply predicting their mean: this linear model generalizes poorly.
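
For concreteness, the reported score can be reproduced by hand from that definition. A minimal sketch, assuming the regr, X_test and y_test objects from the cells above:

# R^2 = 1 - u/v, computed directly on the held-out test set
y_pred = regr.predict(X_test)
u = ((y_test - y_pred) ** 2).sum()          # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()   # total sum of squares
print 1 - u / v                             # should match regr.score(X_test, y_test)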

Optional - Iterate

Create an alternative target (e.g. monthly revenue) and use the same modeling pattern as in Part 3 to model it.


In [19]:
df['monthly_revenue'] = df.number_of_bookings*df.price/df.tenure_months

In [20]:
y = df.monthly_revenue.values
X = df.drop(['prop_id','booking_rate','number_of_bookings', 'tenure_months','monthly_revenue','price'],axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [21]:
regr = est.fit(X_train, y_train)
regr.score(X_test, y_test)


Out[21]:
-0.28927345519073255

In [22]:
test_predict = regr.predict(X_test)

In [23]:
# Sum of squared deviations of the predictions from the mean of y_test
# (note: the total sum of squares of a naive mean model would use y_test here, not test_predict)
sum_sq_naive = np.sum((test_predict - y_test.mean())**2)
sum_sq_naive


Out[23]:
181888.60390989989

In [24]:
test_predict = regr.predict(X_test)
train_predict = regr.predict(X_train)
fig, ax = plt.subplots(1,1)
ax.scatter(train_predict, y_train, c='r')
ax.scatter(test_predict, y_test)
ax.set_xlabel('Predicted Monthly Revenue');
ax.set_ylabel('Actual Monthly Revenue');

