Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from __future__ import print_function

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [3]:
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
bookings.info()
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null object
dtypes: int64(1), object(1)<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [4]:
listings.describe()


Out[4]:
prop_id price person_capacity picture_count description_length tenure_months
count 408.000000 408.000000 408.000000 408.000000 408.000000 408.000000
mean 204.500000 187.806373 2.997549 14.389706 309.159314 8.487745
std 117.923704 353.050858 1.594676 10.477428 228.021684 5.872088
min 1.000000 39.000000 1.000000 1.000000 0.000000 1.000000
25% 102.750000 90.000000 2.000000 6.000000 179.000000 4.000000
50% 204.500000 125.000000 2.000000 12.000000 250.000000 7.000000
75% 306.250000 199.000000 4.000000 20.000000 389.500000 13.000000
max 408.000000 5000.000000 10.000000 71.000000 1969.000000 30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [5]:
listings.groupby('prop_type').describe()


Out[5]:
description_length person_capacity picture_count price prop_id tenure_months
prop_type
Property type 1 count 269.000000 269.000000 269.000000 269.000000 269.000000 269.000000
mean 313.171004 3.516729 14.695167 237.085502 204.754647 8.464684
std 214.769141 1.644955 10.623651 425.710534 119.643061 5.773367
min 17.000000 1.000000 1.000000 40.000000 1.000000 1.000000
25% 193.000000 2.000000 6.000000 120.000000 95.000000 4.000000
50% 266.000000 3.000000 12.000000 150.000000 208.000000 7.000000
75% 388.000000 4.000000 20.000000 229.000000 310.000000 14.000000
max 1719.000000 10.000000 71.000000 5000.000000 408.000000 30.000000
Property type 2 count 135.000000 135.000000 135.000000 135.000000 135.000000 135.000000
mean 304.851852 2.000000 13.948148 93.288889 206.392593 8.377778
std 255.135332 0.846415 10.255191 42.261246 114.074805 5.963654
min 0.000000 1.000000 1.000000 39.000000 3.000000 1.000000
25% 150.500000 2.000000 6.500000 69.000000 120.500000 4.000000
50% 239.000000 2.000000 11.000000 89.000000 200.000000 7.000000
75% 402.500000 2.000000 19.500000 99.000000 299.500000 10.500000
max 1969.000000 6.000000 56.000000 350.000000 405.000000 29.000000
Property type 3 count 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000
mean 184.750000 1.750000 8.750000 63.750000 123.500000 13.750000
std 53.093471 0.500000 7.320064 16.520190 132.879645 8.616844
min 113.000000 1.000000 1.000000 40.000000 10.000000 5.000000
25% 170.000000 1.750000 3.250000 58.750000 17.500000 7.250000
50% 192.500000 2.000000 9.500000 70.000000 99.000000 13.500000
75% 207.250000 2.000000 15.000000 75.000000 205.000000 20.000000
max 241.000000 2.000000 15.000000 75.000000 286.000000 23.000000

Same, but by property type per neighborhood?


In [6]:
listings.groupby(['prop_type','neighborhood']).describe()


Out[6]:
description_length person_capacity picture_count price prop_id tenure_months
prop_type neighborhood
Property type 1 Neighborhood 1 count 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mean 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
std NaN NaN NaN NaN NaN NaN
min 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
25% 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
50% 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
75% 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
max 209.000000 2.000000 26.000000 85.000000 235.000000 6.000000
Neighborhood 10 count 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000
mean 391.000000 3.500000 13.333333 142.500000 307.500000 3.833333
std 146.929915 1.224745 8.571270 36.979724 55.301899 1.602082
min 160.000000 2.000000 4.000000 90.000000 255.000000 1.000000
25% 312.250000 2.500000 6.750000 135.000000 265.000000 3.250000
50% 425.500000 4.000000 12.000000 137.500000 297.000000 4.500000
75% 499.000000 4.000000 20.250000 147.500000 330.500000 5.000000
max 537.000000 5.000000 24.000000 205.000000 400.000000 5.000000
Neighborhood 11 count 14.000000 14.000000 14.000000 14.000000 14.000000 14.000000
mean 379.000000 3.214286 9.928571 159.428571 174.000000 9.642857
std 396.956111 1.311404 5.928605 70.962302 110.198422 5.212517
min 82.000000 2.000000 5.000000 95.000000 52.000000 1.000000
25% 242.500000 2.000000 6.000000 103.750000 70.500000 6.000000
50% 295.500000 3.000000 8.000000 130.000000 142.500000 9.500000
75% 373.000000 3.750000 9.750000 196.500000 243.500000 15.500000
max 1719.000000 6.000000 23.000000 319.000000 401.000000 16.000000
Neighborhood 12 count 39.000000 39.000000 39.000000 39.000000 39.000000 39.000000
mean 267.205128 3.435897 10.820513 365.615385 211.307692 7.897436
std 137.867820 1.874972 7.118810 686.086484 118.670520 5.290482
min 45.000000 1.000000 1.000000 60.000000 22.000000 1.000000
25% 182.000000 2.000000 6.000000 125.000000 105.500000 3.000000
50% 251.000000 3.000000 8.000000 150.000000 223.000000 6.000000
... ... ... ... ... ... ... ... ...
Property type 3 Neighborhood 11 std NaN NaN NaN NaN NaN NaN
min 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
25% 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
50% 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
75% 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
max 196.000000 2.000000 15.000000 75.000000 178.000000 8.000000
Neighborhood 14 count 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mean 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
std NaN NaN NaN NaN NaN NaN
min 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
25% 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
50% 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
75% 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
max 113.000000 1.000000 1.000000 75.000000 286.000000 5.000000
Neighborhood 17 count 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mean 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
std NaN NaN NaN NaN NaN NaN
min 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
25% 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
50% 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
75% 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
max 189.000000 2.000000 15.000000 65.000000 10.000000 23.000000
Neighborhood 4 count 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
mean 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
std NaN NaN NaN NaN NaN NaN
min 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
25% 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
50% 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
75% 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000
max 241.000000 2.000000 4.000000 40.000000 20.000000 19.000000

320 rows × 6 columns

Plot daily bookings:


In [7]:
import timeit
import pandas as pd
#Convert to bookings.booking_date from object to date
from datetime import datetime
import matplotlib.dates as mdates


bookings['booking_date']= pd.to_datetime(bookings['booking_date'])
#bookings.info()

daily_totals = bookings.groupby('booking_date').count()
daily_totals.rename(columns={'prop_id':'Daily Totals'},inplace=True)


#plot cumulative bookings by date
plt.style.use('fivethirtyeight')
ax = daily_totals.plot(kind='line')
ax.set_title('Daily Bookings')
ax.set_xlabel('days')
ax.set_ylabel('# of bookings')
ax.legend()


Out[7]:
<matplotlib.legend.Legend at 0xc5d3a58>

Plot the daily bookings per neighborhood (provide a legend)


In [8]:
combined_df = pd.merge(bookings, listings, on='prop_id')
neighborhood_daily_totals = combined_df.groupby(['booking_date','neighborhood']).count()
#neighborhood_daily_totals.info()
neighborhood_daily_totals.head()

plt.style.use('fivethirtyeight')

#ax = plt.plot(neighborhood_daily_totals)

Part 2 - Develop a data set


In [9]:
import math
#create new table that counts each prop_id
bookings_by_prop_id = bookings.groupby('prop_id').count()
#bookings_by_prop_id.head()

#Set an index on listings DF
new_listings = listings.set_index(['prop_id'])

#Join tables
new_listings = new_listings.join(bookings_by_prop_id)
new_listings.rename(columns={'booking_date':'number_of_bookings'},inplace=True)

#Replace all NaNs with 0s
new_listings.number_of_bookings = [0 if math.isnan(x) else x for x in new_listings.number_of_bookings]

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [10]:
new_listings['booking_rate'] = (new_listings['number_of_bookings'])/(new_listings['tenure_months'])

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [11]:
model_data = new_listings[new_listings.tenure_months>10]
model_data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 120 entries, 1 to 120
Data columns (total 9 columns):
prop_type             120 non-null object
neighborhood          120 non-null object
price                 120 non-null int64
person_capacity       120 non-null int64
picture_count         120 non-null int64
description_length    120 non-null int64
tenure_months         120 non-null int64
number_of_bookings    120 non-null float64
booking_rate          120 non-null float64
dtypes: float64(2), int64(5), object(2)

prop_type and neighborhood are categorical variables, use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [12]:
model_data = new_listings[new_listings.tenure_months>10]
prop_type_dummies = pd.get_dummies(model_data.prop_type)
neigborhood_dummies = pd.get_dummies(model_data.neighborhood)
model_data = model_data.join(prop_type_dummies)
model_data = model_data.join(neigborhood_dummies)

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [13]:
from sklearn.cross_validation import train_test_split

In [14]:
#Predictors
model_data_y = model_data.booking_rate

#Regressors
excluded_data = ['booking_rate','prop_type','neighborhood','number_of_bookings','prop_id']
cols = [col for col in model_data.columns if col not in excluded_data]
model_data_x = model_data[cols]
#model_data_x.columns

X_train, X_test, y_train, y_test = train_test_split(model_data_x, model_data_y, test_size=0.8)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm

#####################
#Ordinary Least Squares
degree = 1
est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
est.fit(X_train, y_train)

regr = LinearRegression()
regr.fit(X_train, y_train)

# The coefficients
print('Results from Ordinary Least Squares \n')
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))

print("\n \n statsmodels package OLS Results \n")
res = sm.OLS(y_train,X_train).fit() #create a model using statsmodels
print(res.params)
print (res.bse)
print (res.summary())

######################
#Ridge Regression
from sklearn import linear_model
clf = linear_model.Ridge (alpha = .5)
clf.fit(X_train, y_train)


# The coefficients
print('\n \n Results from Ridge Regression \n')
print('Coefficients: \n', clf.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((clf.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % clf.score(X_test, y_test))


######################
#Lasso
clf = linear_model.Lasso(alpha = 0.1)
clf = LinearRegression()
clf.fit(X_train, y_train)

# The coefficients
print('\n \n Results from Lasso \n')
print('Coefficients: \n', clf.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((clf.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % clf.score(X_test, y_test))


Results from Ordinary Least Squares 

Coefficients: 
 [  4.33231630e-03   1.31849210e-01   6.49068082e-02  -1.83175684e-03
  -8.29673567e-02  -1.19394747e+00   1.19394747e+00  -2.93210621e-17
  -8.63409526e-02  -2.84053936e-01  -1.13367671e-01  -1.57813656e+00
   8.26863417e-01   2.64065935e+00   5.48643488e-01  -5.47674390e-01
   0.00000000e+00   0.00000000e+00  -1.40659274e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
Residual sum of squares: 3.49
Variance score: -2.72

 
 statsmodels package OLS Results 

[  4.33231630e-03   1.31849210e-01   6.49068082e-02  -1.83175684e-03
  -8.29673567e-02  -3.67139320e-01   2.02075562e+00   8.50691366e-16
   9.73941915e-02  -1.00318792e-01   7.03674734e-02  -1.39440141e+00
   1.01059856e+00   2.82439449e+00   7.32378632e-01  -3.63939246e-01
   0.00000000e+00   0.00000000e+00  -1.22285760e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
[  1.03705827e-02   4.12275985e-01   6.68380431e-02   3.77950774e-03
   1.25164660e-01   2.08580548e+00   1.77102711e+00   9.05292184e-16
   1.41541439e+00   7.83376905e-01   8.61642177e-01   1.69544078e+00
   1.27010726e+00   2.11484856e+00   8.99000599e-01   1.13000524e+00
   0.00000000e+00   0.00000000e+00   2.31667759e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.672
Model:                            OLS   Adj. R-squared:                  0.126
Method:                 Least Squares   F-statistic:                     1.231
Date:                Tue, 09 Dec 2014   Prob (F-statistic):              0.387
Time:                        00:24:11   Log-Likelihood:                -30.706
No. Observations:                  24   AIC:                             91.41
Df Residuals:                       9   BIC:                             109.1
Df Model:                          15                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.0043      0.010      0.418      0.686        -0.019     0.028
x2             0.1318      0.412      0.320      0.756        -0.801     1.064
x3             0.0649      0.067      0.971      0.357        -0.086     0.216
x4            -0.0018      0.004     -0.485      0.639        -0.010     0.007
x5            -0.0830      0.125     -0.663      0.524        -0.366     0.200
x6            -0.3671      2.086     -0.176      0.864        -5.086     4.351
x7             2.0208      1.771      1.141      0.283        -1.986     6.027
const       8.507e-16   9.05e-16      0.940      0.372      -1.2e-15   2.9e-15
x8             0.0974      1.415      0.069      0.947        -3.104     3.299
x9            -0.1003      0.783     -0.128      0.901        -1.872     1.672
x10            0.0704      0.862      0.082      0.937        -1.879     2.020
x11           -1.3944      1.695     -0.822      0.432        -5.230     2.441
x12            1.0106      1.270      0.796      0.447        -1.863     3.884
x13            2.8244      2.115      1.336      0.215        -1.960     7.609
x14            0.7324      0.899      0.815      0.436        -1.301     2.766
x15           -0.3639      1.130     -0.322      0.755        -2.920     2.192
x16                 0          0        nan        nan             0         0
x17                 0          0        nan        nan             0         0
x18           -1.2229      2.317     -0.528      0.610        -6.464     4.018
x19                 0          0        nan        nan             0         0
x20                 0          0        nan        nan             0         0
x21                 0          0        nan        nan             0         0
==============================================================================
Omnibus:                        1.150   Durbin-Watson:                   1.681
Prob(Omnibus):                  0.563   Jarque-Bera (JB):                0.494
Skew:                           0.348   Prob(JB):                        0.781
Kurtosis:                       3.100   Cond. No.                          nan
==============================================================================

Warnings:
[1] The smallest eigenvalue is -8.46e-16. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

 
 Results from Ridge Regression 

Coefficients: 
 [ -2.22243437e-03   2.53053079e-01   2.87706628e-02   4.61191227e-04
  -5.79452947e-02  -7.53816269e-01   7.53816269e-01   0.00000000e+00
  -1.84238929e-01  -3.05435854e-01  -1.33245424e-01  -7.29369567e-01
   3.33498545e-01   9.96778128e-01   4.99834654e-01  -3.95814567e-01
   0.00000000e+00   0.00000000e+00  -8.20069869e-02   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
Residual sum of squares: 1.28
Variance score: -0.37

 
 Results from Lasso 

Coefficients: 
 [  4.33231630e-03   1.31849210e-01   6.49068082e-02  -1.83175684e-03
  -8.29673567e-02  -1.19394747e+00   1.19394747e+00  -2.93210621e-17
  -8.63409526e-02  -2.84053936e-01  -1.13367671e-01  -1.57813656e+00
   8.26863417e-01   2.64065935e+00   5.48643488e-01  -5.47674390e-01
   0.00000000e+00   0.00000000e+00  -1.40659274e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
Residual sum of squares: 3.49
Variance score: -2.72

fit your model with your test sets


In [19]:
degree = 1
est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
est.fit(X_train, y_train)
pred = est.predict(X_test)

In [21]:
print("Score from OLS: ", est.score(X_test,y_test))


Score from OLS:  -2.71572105337

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

The score compares the fit of our model vs the fit of a horizontal line. Generally, this score should be between 0 and 1. In our case, it is below 0, suggesting a horizontal line fits better than our data. This may probably be due to the # of predictors we have. A suggestion is to use lasso to determine which predictors are unnecessary to simplify our model.

Interestingly, statsmodels and scikitlearn gave slightly different coeficients. statsmodels gives more information on default than scikitlearn. It shows that many of the variables are not significant. Next steps would probably be try to transform the data. Interestingly, the R^2 for statsmodels was 0.67 which is R^2 based on the training set which means the model is not generalizable.

I am curious how to display predictor names on coeficients. It seems that the train/split function created 4 arrays without labels.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [225]:
#Part of the work. need to keep brainstorming
print("Currently Incomplete")
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
bookings['booking_date']= pd.to_datetime(bookings['booking_date'])
bookings.set_index(pd.DatetimeIndex(bookings['booking_date']),inplace=True)
bookings.rename(columns={'booking_date':'Monthly_Totals'},inplace=True)
bookings.info()
bookings = bookings.groupby('prop_id').resample("M", how="count").head()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6076 entries, 2011-06-17 00:00:00 to 2011-09-08 00:00:00
Data columns (total 2 columns):
prop_id           6076 non-null int64
Monthly_Totals    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)

In [237]:
listings.set_index(listings.prop_id, inplace=True)
new_data = bookings.join(listings)
new_data['monthly_revenue'] = new_data.Monthly_Totals*new_data.price


Out[237]:
Monthly_Totals prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months monthly_revenue
prop_id
1 2011-03-31 2 1 Property type 1 Neighborhood 14 140 3 11 232 30 280
2011-04-30 0 1 Property type 1 Neighborhood 14 140 3 11 232 30 0
2011-05-31 1 1 Property type 1 Neighborhood 14 140 3 11 232 30 140
2011-06-30 1 1 Property type 1 Neighborhood 14 140 3 11 232 30 140
3 2011-08-31 1 3 Property type 2 Neighborhood 16 95 2 16 172 29 95

In [ ]:
#Predictors
new_data_y = new_data.monthly_revenue

#Regressors
excluded_data = ['monthly_rate','prop_type','neighborhood','number_of_bookings','prop_id']
cols = [col for col in model_data.columns if col not in excluded_data]
model_data_x = model_data[cols]
#model_data_x.columns


X_train, X_test, y_train, y_test = train_test_split(model_data_x, model_data_y, test_size=0.8)