Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.



In [2]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from __future__ import print_function

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [3]:

    
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
bookings.info()
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null object
dtypes: int64(1), object(1)<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [4]:

    
listings.describe()









    Out[4]:






  
    
      
      prop_id
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
  
  
    
      count
       408.000000
        408.000000
       408.000000
       408.000000
        408.000000
       408.000000
    
    
      mean
       204.500000
        187.806373
         2.997549
        14.389706
        309.159314
         8.487745
    
    
      std
       117.923704
        353.050858
         1.594676
        10.477428
        228.021684
         5.872088
    
    
      min
         1.000000
         39.000000
         1.000000
         1.000000
          0.000000
         1.000000
    
    
      25%
       102.750000
         90.000000
         2.000000
         6.000000
        179.000000
         4.000000
    
    
      50%
       204.500000
        125.000000
         2.000000
        12.000000
        250.000000
         7.000000
    
    
      75%
       306.250000
        199.000000
         4.000000
        20.000000
        389.500000
        13.000000
    
    
      max
       408.000000
       5000.000000
        10.000000
        71.000000
       1969.000000
        30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [5]:

    
listings.groupby('prop_type').describe()









    Out[5]:






  
    
      
      
      description_length
      person_capacity
      picture_count
      price
      prop_id
      tenure_months
    
    
      prop_type
      
      
      
      
      
      
      
    
  
  
    
      Property type 1
      count
        269.000000
       269.000000
       269.000000
        269.000000
       269.000000
       269.000000
    
    
      mean
        313.171004
         3.516729
        14.695167
        237.085502
       204.754647
         8.464684
    
    
      std
        214.769141
         1.644955
        10.623651
        425.710534
       119.643061
         5.773367
    
    
      min
         17.000000
         1.000000
         1.000000
         40.000000
         1.000000
         1.000000
    
    
      25%
        193.000000
         2.000000
         6.000000
        120.000000
        95.000000
         4.000000
    
    
      50%
        266.000000
         3.000000
        12.000000
        150.000000
       208.000000
         7.000000
    
    
      75%
        388.000000
         4.000000
        20.000000
        229.000000
       310.000000
        14.000000
    
    
      max
       1719.000000
        10.000000
        71.000000
       5000.000000
       408.000000
        30.000000
    
    
      Property type 2
      count
        135.000000
       135.000000
       135.000000
        135.000000
       135.000000
       135.000000
    
    
      mean
        304.851852
         2.000000
        13.948148
         93.288889
       206.392593
         8.377778
    
    
      std
        255.135332
         0.846415
        10.255191
         42.261246
       114.074805
         5.963654
    
    
      min
          0.000000
         1.000000
         1.000000
         39.000000
         3.000000
         1.000000
    
    
      25%
        150.500000
         2.000000
         6.500000
         69.000000
       120.500000
         4.000000
    
    
      50%
        239.000000
         2.000000
        11.000000
         89.000000
       200.000000
         7.000000
    
    
      75%
        402.500000
         2.000000
        19.500000
         99.000000
       299.500000
        10.500000
    
    
      max
       1969.000000
         6.000000
        56.000000
        350.000000
       405.000000
        29.000000
    
    
      Property type 3
      count
          4.000000
         4.000000
         4.000000
          4.000000
         4.000000
         4.000000
    
    
      mean
        184.750000
         1.750000
         8.750000
         63.750000
       123.500000
        13.750000
    
    
      std
         53.093471
         0.500000
         7.320064
         16.520190
       132.879645
         8.616844
    
    
      min
        113.000000
         1.000000
         1.000000
         40.000000
        10.000000
         5.000000
    
    
      25%
        170.000000
         1.750000
         3.250000
         58.750000
        17.500000
         7.250000
    
    
      50%
        192.500000
         2.000000
         9.500000
         70.000000
        99.000000
        13.500000
    
    
      75%
        207.250000
         2.000000
        15.000000
         75.000000
       205.000000
        20.000000
    
    
      max
        241.000000
         2.000000
        15.000000
         75.000000
       286.000000
        23.000000

Same, but by property type per neighborhood?



In [6]:

    
listings.groupby(['prop_type','neighborhood']).describe()









    Out[6]:






  
    
      
      
      
      description_length
      person_capacity
      picture_count
      price
      prop_id
      tenure_months
    
    
      prop_type
      neighborhood
      
      
      
      
      
      
      
    
  
  
    
      Property type 1
      Neighborhood 1
      count
          1.000000
        1.000000
        1.000000
         1.000000
         1.000000
        1.000000
    
    
      mean
        209.000000
        2.000000
       26.000000
        85.000000
       235.000000
        6.000000
    
    
      std
               NaN
             NaN
             NaN
              NaN
              NaN
             NaN
    
    
      min
        209.000000
        2.000000
       26.000000
        85.000000
       235.000000
        6.000000
    
    
      25%
        209.000000
        2.000000
       26.000000
        85.000000
       235.000000
        6.000000
    
    
      50%
        209.000000
        2.000000
       26.000000
        85.000000
       235.000000
        6.000000
    
    
      75%
        209.000000
        2.000000
       26.000000
        85.000000
       235.000000
        6.000000
    
    
      max
        209.000000
        2.000000
       26.000000
        85.000000
       235.000000
        6.000000
    
    
      Neighborhood 10
      count
          6.000000
        6.000000
        6.000000
         6.000000
         6.000000
        6.000000
    
    
      mean
        391.000000
        3.500000
       13.333333
       142.500000
       307.500000
        3.833333
    
    
      std
        146.929915
        1.224745
        8.571270
        36.979724
        55.301899
        1.602082
    
    
      min
        160.000000
        2.000000
        4.000000
        90.000000
       255.000000
        1.000000
    
    
      25%
        312.250000
        2.500000
        6.750000
       135.000000
       265.000000
        3.250000
    
    
      50%
        425.500000
        4.000000
       12.000000
       137.500000
       297.000000
        4.500000
    
    
      75%
        499.000000
        4.000000
       20.250000
       147.500000
       330.500000
        5.000000
    
    
      max
        537.000000
        5.000000
       24.000000
       205.000000
       400.000000
        5.000000
    
    
      Neighborhood 11
      count
         14.000000
       14.000000
       14.000000
        14.000000
        14.000000
       14.000000
    
    
      mean
        379.000000
        3.214286
        9.928571
       159.428571
       174.000000
        9.642857
    
    
      std
        396.956111
        1.311404
        5.928605
        70.962302
       110.198422
        5.212517
    
    
      min
         82.000000
        2.000000
        5.000000
        95.000000
        52.000000
        1.000000
    
    
      25%
        242.500000
        2.000000
        6.000000
       103.750000
        70.500000
        6.000000
    
    
      50%
        295.500000
        3.000000
        8.000000
       130.000000
       142.500000
        9.500000
    
    
      75%
        373.000000
        3.750000
        9.750000
       196.500000
       243.500000
       15.500000
    
    
      max
       1719.000000
        6.000000
       23.000000
       319.000000
       401.000000
       16.000000
    
    
      Neighborhood 12
      count
         39.000000
       39.000000
       39.000000
        39.000000
        39.000000
       39.000000
    
    
      mean
        267.205128
        3.435897
       10.820513
       365.615385
       211.307692
        7.897436
    
    
      std
        137.867820
        1.874972
        7.118810
       686.086484
       118.670520
        5.290482
    
    
      min
         45.000000
        1.000000
        1.000000
        60.000000
        22.000000
        1.000000
    
    
      25%
        182.000000
        2.000000
        6.000000
       125.000000
       105.500000
        3.000000
    
    
      50%
        251.000000
        3.000000
        8.000000
       150.000000
       223.000000
        6.000000
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      Property type 3
      Neighborhood 11
      std
               NaN
             NaN
             NaN
              NaN
              NaN
             NaN
    
    
      min
        196.000000
        2.000000
       15.000000
        75.000000
       178.000000
        8.000000
    
    
      25%
        196.000000
        2.000000
       15.000000
        75.000000
       178.000000
        8.000000
    
    
      50%
        196.000000
        2.000000
       15.000000
        75.000000
       178.000000
        8.000000
    
    
      75%
        196.000000
        2.000000
       15.000000
        75.000000
       178.000000
        8.000000
    
    
      max
        196.000000
        2.000000
       15.000000
        75.000000
       178.000000
        8.000000
    
    
      Neighborhood 14
      count
          1.000000
        1.000000
        1.000000
         1.000000
         1.000000
        1.000000
    
    
      mean
        113.000000
        1.000000
        1.000000
        75.000000
       286.000000
        5.000000
    
    
      std
               NaN
             NaN
             NaN
              NaN
              NaN
             NaN
    
    
      min
        113.000000
        1.000000
        1.000000
        75.000000
       286.000000
        5.000000
    
    
      25%
        113.000000
        1.000000
        1.000000
        75.000000
       286.000000
        5.000000
    
    
      50%
        113.000000
        1.000000
        1.000000
        75.000000
       286.000000
        5.000000
    
    
      75%
        113.000000
        1.000000
        1.000000
        75.000000
       286.000000
        5.000000
    
    
      max
        113.000000
        1.000000
        1.000000
        75.000000
       286.000000
        5.000000
    
    
      Neighborhood 17
      count
          1.000000
        1.000000
        1.000000
         1.000000
         1.000000
        1.000000
    
    
      mean
        189.000000
        2.000000
       15.000000
        65.000000
        10.000000
       23.000000
    
    
      std
               NaN
             NaN
             NaN
              NaN
              NaN
             NaN
    
    
      min
        189.000000
        2.000000
       15.000000
        65.000000
        10.000000
       23.000000
    
    
      25%
        189.000000
        2.000000
       15.000000
        65.000000
        10.000000
       23.000000
    
    
      50%
        189.000000
        2.000000
       15.000000
        65.000000
        10.000000
       23.000000
    
    
      75%
        189.000000
        2.000000
       15.000000
        65.000000
        10.000000
       23.000000
    
    
      max
        189.000000
        2.000000
       15.000000
        65.000000
        10.000000
       23.000000
    
    
      Neighborhood 4
      count
          1.000000
        1.000000
        1.000000
         1.000000
         1.000000
        1.000000
    
    
      mean
        241.000000
        2.000000
        4.000000
        40.000000
        20.000000
       19.000000
    
    
      std
               NaN
             NaN
             NaN
              NaN
              NaN
             NaN
    
    
      min
        241.000000
        2.000000
        4.000000
        40.000000
        20.000000
       19.000000
    
    
      25%
        241.000000
        2.000000
        4.000000
        40.000000
        20.000000
       19.000000
    
    
      50%
        241.000000
        2.000000
        4.000000
        40.000000
        20.000000
       19.000000
    
    
      75%
        241.000000
        2.000000
        4.000000
        40.000000
        20.000000
       19.000000
    
    
      max
        241.000000
        2.000000
        4.000000
        40.000000
        20.000000
       19.000000
    
  

320 rows × 6 columns

Plot daily bookings:



In [7]:

    
import timeit
import pandas as pd
#Convert to bookings.booking_date from object to date
from datetime import datetime
import matplotlib.dates as mdates


bookings['booking_date']= pd.to_datetime(bookings['booking_date'])
#bookings.info()

daily_totals = bookings.groupby('booking_date').count()
daily_totals.rename(columns={'prop_id':'Daily Totals'},inplace=True)


#plot cumulative bookings by date
plt.style.use('fivethirtyeight')
ax = daily_totals.plot(kind='line')
ax.set_title('Daily Bookings')
ax.set_xlabel('days')
ax.set_ylabel('# of bookings')
ax.legend()









    Out[7]:





<matplotlib.legend.Legend at 0xc5d3a58>

Plot the daily bookings per neighborhood (provide a legend)



In [8]:

    
combined_df = pd.merge(bookings, listings, on='prop_id')
neighborhood_daily_totals = combined_df.groupby(['booking_date','neighborhood']).count()
#neighborhood_daily_totals.info()
neighborhood_daily_totals.head()

plt.style.use('fivethirtyeight')

#ax = plt.plot(neighborhood_daily_totals)

Part 2 - Develop a data set



In [9]:

    
import math
#create new table that counts each prop_id
bookings_by_prop_id = bookings.groupby('prop_id').count()
#bookings_by_prop_id.head()

#Set an index on listings DF
new_listings = listings.set_index(['prop_id'])

#Join tables
new_listings = new_listings.join(bookings_by_prop_id)
new_listings.rename(columns={'booking_date':'number_of_bookings'},inplace=True)

#Replace all NaNs with 0s
new_listings.number_of_bookings = [0 if math.isnan(x) else x for x in new_listings.number_of_bookings]

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [10]:

    
new_listings['booking_rate'] = (new_listings['number_of_bookings'])/(new_listings['tenure_months'])

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months



In [11]:

    
model_data = new_listings[new_listings.tenure_months>10]
model_data.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 120 entries, 1 to 120
Data columns (total 9 columns):
prop_type             120 non-null object
neighborhood          120 non-null object
price                 120 non-null int64
person_capacity       120 non-null int64
picture_count         120 non-null int64
description_length    120 non-null int64
tenure_months         120 non-null int64
number_of_bookings    120 non-null float64
booking_rate          120 non-null float64
dtypes: float64(2), int64(5), object(2)

`prop_type` and `neighborhood` are categorical variables, use `get_dummies()` (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [12]:

    
model_data = new_listings[new_listings.tenure_months>10]
prop_type_dummies = pd.get_dummies(model_data.prop_type)
neigborhood_dummies = pd.get_dummies(model_data.neighborhood)
model_data = model_data.join(prop_type_dummies)
model_data = model_data.join(neigborhood_dummies)

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [13]:

    
from sklearn.cross_validation import train_test_split



In [14]:

    
#Predictors
model_data_y = model_data.booking_rate

#Regressors
excluded_data = ['booking_rate','prop_type','neighborhood','number_of_bookings','prop_id']
cols = [col for col in model_data.columns if col not in excluded_data]
model_data_x = model_data[cols]
#model_data_x.columns

X_train, X_test, y_train, y_test = train_test_split(model_data_x, model_data_y, test_size=0.8)

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [15]:

    
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm

#####################
#Ordinary Least Squares
degree = 1
est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
est.fit(X_train, y_train)

regr = LinearRegression()
regr.fit(X_train, y_train)

# The coefficients
print('Results from Ordinary Least Squares \n')
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))

print("\n \n statsmodels package OLS Results \n")
res = sm.OLS(y_train,X_train).fit() #create a model using statsmodels
print(res.params)
print (res.bse)
print (res.summary())

######################
#Ridge Regression
from sklearn import linear_model
clf = linear_model.Ridge (alpha = .5)
clf.fit(X_train, y_train)


# The coefficients
print('\n \n Results from Ridge Regression \n')
print('Coefficients: \n', clf.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((clf.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % clf.score(X_test, y_test))


######################
#Lasso
clf = linear_model.Lasso(alpha = 0.1)
clf = LinearRegression()
clf.fit(X_train, y_train)

# The coefficients
print('\n \n Results from Lasso \n')
print('Coefficients: \n', clf.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((clf.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % clf.score(X_test, y_test))









    



Results from Ordinary Least Squares 

Coefficients: 
 [  4.33231630e-03   1.31849210e-01   6.49068082e-02  -1.83175684e-03
  -8.29673567e-02  -1.19394747e+00   1.19394747e+00  -2.93210621e-17
  -8.63409526e-02  -2.84053936e-01  -1.13367671e-01  -1.57813656e+00
   8.26863417e-01   2.64065935e+00   5.48643488e-01  -5.47674390e-01
   0.00000000e+00   0.00000000e+00  -1.40659274e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
Residual sum of squares: 3.49
Variance score: -2.72

 
 statsmodels package OLS Results 

[  4.33231630e-03   1.31849210e-01   6.49068082e-02  -1.83175684e-03
  -8.29673567e-02  -3.67139320e-01   2.02075562e+00   8.50691366e-16
   9.73941915e-02  -1.00318792e-01   7.03674734e-02  -1.39440141e+00
   1.01059856e+00   2.82439449e+00   7.32378632e-01  -3.63939246e-01
   0.00000000e+00   0.00000000e+00  -1.22285760e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
[  1.03705827e-02   4.12275985e-01   6.68380431e-02   3.77950774e-03
   1.25164660e-01   2.08580548e+00   1.77102711e+00   9.05292184e-16
   1.41541439e+00   7.83376905e-01   8.61642177e-01   1.69544078e+00
   1.27010726e+00   2.11484856e+00   8.99000599e-01   1.13000524e+00
   0.00000000e+00   0.00000000e+00   2.31667759e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.672
Model:                            OLS   Adj. R-squared:                  0.126
Method:                 Least Squares   F-statistic:                     1.231
Date:                Tue, 09 Dec 2014   Prob (F-statistic):              0.387
Time:                        00:24:11   Log-Likelihood:                -30.706
No. Observations:                  24   AIC:                             91.41
Df Residuals:                       9   BIC:                             109.1
Df Model:                          15                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.0043      0.010      0.418      0.686        -0.019     0.028
x2             0.1318      0.412      0.320      0.756        -0.801     1.064
x3             0.0649      0.067      0.971      0.357        -0.086     0.216
x4            -0.0018      0.004     -0.485      0.639        -0.010     0.007
x5            -0.0830      0.125     -0.663      0.524        -0.366     0.200
x6            -0.3671      2.086     -0.176      0.864        -5.086     4.351
x7             2.0208      1.771      1.141      0.283        -1.986     6.027
const       8.507e-16   9.05e-16      0.940      0.372      -1.2e-15   2.9e-15
x8             0.0974      1.415      0.069      0.947        -3.104     3.299
x9            -0.1003      0.783     -0.128      0.901        -1.872     1.672
x10            0.0704      0.862      0.082      0.937        -1.879     2.020
x11           -1.3944      1.695     -0.822      0.432        -5.230     2.441
x12            1.0106      1.270      0.796      0.447        -1.863     3.884
x13            2.8244      2.115      1.336      0.215        -1.960     7.609
x14            0.7324      0.899      0.815      0.436        -1.301     2.766
x15           -0.3639      1.130     -0.322      0.755        -2.920     2.192
x16                 0          0        nan        nan             0         0
x17                 0          0        nan        nan             0         0
x18           -1.2229      2.317     -0.528      0.610        -6.464     4.018
x19                 0          0        nan        nan             0         0
x20                 0          0        nan        nan             0         0
x21                 0          0        nan        nan             0         0
==============================================================================
Omnibus:                        1.150   Durbin-Watson:                   1.681
Prob(Omnibus):                  0.563   Jarque-Bera (JB):                0.494
Skew:                           0.348   Prob(JB):                        0.781
Kurtosis:                       3.100   Cond. No.                          nan
==============================================================================

Warnings:
[1] The smallest eigenvalue is -8.46e-16. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

 
 Results from Ridge Regression 

Coefficients: 
 [ -2.22243437e-03   2.53053079e-01   2.87706628e-02   4.61191227e-04
  -5.79452947e-02  -7.53816269e-01   7.53816269e-01   0.00000000e+00
  -1.84238929e-01  -3.05435854e-01  -1.33245424e-01  -7.29369567e-01
   3.33498545e-01   9.96778128e-01   4.99834654e-01  -3.95814567e-01
   0.00000000e+00   0.00000000e+00  -8.20069869e-02   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
Residual sum of squares: 1.28
Variance score: -0.37

 
 Results from Lasso 

Coefficients: 
 [  4.33231630e-03   1.31849210e-01   6.49068082e-02  -1.83175684e-03
  -8.29673567e-02  -1.19394747e+00   1.19394747e+00  -2.93210621e-17
  -8.63409526e-02  -2.84053936e-01  -1.13367671e-01  -1.57813656e+00
   8.26863417e-01   2.64065935e+00   5.48643488e-01  -5.47674390e-01
   0.00000000e+00   0.00000000e+00  -1.40659274e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
Residual sum of squares: 3.49
Variance score: -2.72

fit your model with your test sets



In [19]:

    
degree = 1
est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
est.fit(X_train, y_train)
pred = est.predict(X_test)

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score



In [21]:

    
print("Score from OLS: ", est.score(X_test,y_test))









    



Score from OLS:  -2.71572105337

Interpret the results of the above model:

What does the score method do?
What does this tell us about our model?

The score compares the fit of our model vs the fit of a horizontal line. Generally, this score should be between 0 and 1. In our case, it is below 0, suggesting a horizontal line fits better than our data. This may probably be due to the # of predictors we have. A suggestion is to use lasso to determine which predictors are unnecessary to simplify our model.

Interestingly, statsmodels and scikitlearn gave slightly different coeficients. statsmodels gives more information on default than scikitlearn. It shows that many of the variables are not significant. Next steps would probably be try to transform the data. Interestingly, the R^2 for statsmodels was 0.67 which is R^2 based on the training set which means the model is not generalizable.

I am curious how to display predictor names on coeficients. It seems that the train/split function created 4 arrays without labels.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to



In [225]:

    
#Part of the work. need to keep brainstorming
print("Currently Incomplete")
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
bookings['booking_date']= pd.to_datetime(bookings['booking_date'])
bookings.set_index(pd.DatetimeIndex(bookings['booking_date']),inplace=True)
bookings.rename(columns={'booking_date':'Monthly_Totals'},inplace=True)
bookings.info()
bookings = bookings.groupby('prop_id').resample("M", how="count").head()









    



<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6076 entries, 2011-06-17 00:00:00 to 2011-09-08 00:00:00
Data columns (total 2 columns):
prop_id           6076 non-null int64
Monthly_Totals    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)



In [237]:

    
listings.set_index(listings.prop_id, inplace=True)
new_data = bookings.join(listings)
new_data['monthly_revenue'] = new_data.Monthly_Totals*new_data.price









    Out[237]:






  
    
      
      
      Monthly_Totals
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      monthly_revenue
    
    
      prop_id
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      2011-03-31
       2
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
       280
    
    
      2011-04-30
       0
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
         0
    
    
      2011-05-31
       1
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
       140
    
    
      2011-06-30
       1
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
       140
    
    
      3
      2011-08-31
       1
       3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
        95



In [ ]:

    
#Predictors
new_data_y = new_data.monthly_revenue

#Regressors
excluded_data = ['monthly_rate','prop_type','neighborhood','number_of_bookings','prop_id']
cols = [col for col in model_data.columns if col not in excluded_data]
model_data_x = model_data[cols]
#model_data_x.columns


X_train, X_test, y_train, y_test = train_test_split(model_data_x, model_data_y, test_size=0.8)

	prop_id	price	person_capacity	picture_count	description_length	tenure_months
count	408.000000	408.000000	408.000000	408.000000	408.000000	408.000000
mean	204.500000	187.806373	2.997549	14.389706	309.159314	8.487745
std	117.923704	353.050858	1.594676	10.477428	228.021684	5.872088
min	1.000000	39.000000	1.000000	1.000000	0.000000	1.000000
25%	102.750000	90.000000	2.000000	6.000000	179.000000	4.000000
50%	204.500000	125.000000	2.000000	12.000000	250.000000	7.000000
75%	306.250000	199.000000	4.000000	20.000000	389.500000	13.000000
max	408.000000	5000.000000	10.000000	71.000000	1969.000000	30.000000

		description_length	person_capacity	picture_count	price	prop_id	tenure_months
prop_type
Property type 1	count	269.000000	269.000000	269.000000	269.000000	269.000000	269.000000
	mean	313.171004	3.516729	14.695167	237.085502	204.754647	8.464684
	std	214.769141	1.644955	10.623651	425.710534	119.643061	5.773367
	min	17.000000	1.000000	1.000000	40.000000	1.000000	1.000000
	25%	193.000000	2.000000	6.000000	120.000000	95.000000	4.000000
	50%	266.000000	3.000000	12.000000	150.000000	208.000000	7.000000
	75%	388.000000	4.000000	20.000000	229.000000	310.000000	14.000000
	max	1719.000000	10.000000	71.000000	5000.000000	408.000000	30.000000
Property type 2	count	135.000000	135.000000	135.000000	135.000000	135.000000	135.000000
	mean	304.851852	2.000000	13.948148	93.288889	206.392593	8.377778
	std	255.135332	0.846415	10.255191	42.261246	114.074805	5.963654
	min	0.000000	1.000000	1.000000	39.000000	3.000000	1.000000
	25%	150.500000	2.000000	6.500000	69.000000	120.500000	4.000000
	50%	239.000000	2.000000	11.000000	89.000000	200.000000	7.000000
	75%	402.500000	2.000000	19.500000	99.000000	299.500000	10.500000
	max	1969.000000	6.000000	56.000000	350.000000	405.000000	29.000000
Property type 3	count	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000
	mean	184.750000	1.750000	8.750000	63.750000	123.500000	13.750000
	std	53.093471	0.500000	7.320064	16.520190	132.879645	8.616844
	min	113.000000	1.000000	1.000000	40.000000	10.000000	5.000000
	25%	170.000000	1.750000	3.250000	58.750000	17.500000	7.250000
	50%	192.500000	2.000000	9.500000	70.000000	99.000000	13.500000
	75%	207.250000	2.000000	15.000000	75.000000	205.000000	20.000000
	max	241.000000	2.000000	15.000000	75.000000	286.000000	23.000000

			description_length	person_capacity	picture_count	price	prop_id	tenure_months
prop_type	neighborhood
Property type 1	Neighborhood 1	count	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
		mean	209.000000	2.000000	26.000000	85.000000	235.000000	6.000000
		std	NaN	NaN	NaN	NaN	NaN	NaN
		min	209.000000	2.000000	26.000000	85.000000	235.000000	6.000000
		25%	209.000000	2.000000	26.000000	85.000000	235.000000	6.000000
		50%	209.000000	2.000000	26.000000	85.000000	235.000000	6.000000
		75%	209.000000	2.000000	26.000000	85.000000	235.000000	6.000000
		max	209.000000	2.000000	26.000000	85.000000	235.000000	6.000000
	Neighborhood 10	count	6.000000	6.000000	6.000000	6.000000	6.000000	6.000000
		mean	391.000000	3.500000	13.333333	142.500000	307.500000	3.833333
		std	146.929915	1.224745	8.571270	36.979724	55.301899	1.602082
		min	160.000000	2.000000	4.000000	90.000000	255.000000	1.000000
		25%	312.250000	2.500000	6.750000	135.000000	265.000000	3.250000
		50%	425.500000	4.000000	12.000000	137.500000	297.000000	4.500000
		75%	499.000000	4.000000	20.250000	147.500000	330.500000	5.000000
		max	537.000000	5.000000	24.000000	205.000000	400.000000	5.000000
	Neighborhood 11	count	14.000000	14.000000	14.000000	14.000000	14.000000	14.000000
		mean	379.000000	3.214286	9.928571	159.428571	174.000000	9.642857
		std	396.956111	1.311404	5.928605	70.962302	110.198422	5.212517
		min	82.000000	2.000000	5.000000	95.000000	52.000000	1.000000
		25%	242.500000	2.000000	6.000000	103.750000	70.500000	6.000000
		50%	295.500000	3.000000	8.000000	130.000000	142.500000	9.500000
		75%	373.000000	3.750000	9.750000	196.500000	243.500000	15.500000
		max	1719.000000	6.000000	23.000000	319.000000	401.000000	16.000000
	Neighborhood 12	count	39.000000	39.000000	39.000000	39.000000	39.000000	39.000000
		mean	267.205128	3.435897	10.820513	365.615385	211.307692	7.897436
		std	137.867820	1.874972	7.118810	686.086484	118.670520	5.290482
		min	45.000000	1.000000	1.000000	60.000000	22.000000	1.000000
		25%	182.000000	2.000000	6.000000	125.000000	105.500000	3.000000
		50%	251.000000	3.000000	8.000000	150.000000	223.000000	6.000000
...	...	...	...	...	...	...	...	...
Property type 3	Neighborhood 11	std	NaN	NaN	NaN	NaN	NaN	NaN
		min	196.000000	2.000000	15.000000	75.000000	178.000000	8.000000
		25%	196.000000	2.000000	15.000000	75.000000	178.000000	8.000000
		50%	196.000000	2.000000	15.000000	75.000000	178.000000	8.000000
		75%	196.000000	2.000000	15.000000	75.000000	178.000000	8.000000
		max	196.000000	2.000000	15.000000	75.000000	178.000000	8.000000
	Neighborhood 14	count	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
		mean	113.000000	1.000000	1.000000	75.000000	286.000000	5.000000
		std	NaN	NaN	NaN	NaN	NaN	NaN
		min	113.000000	1.000000	1.000000	75.000000	286.000000	5.000000
		25%	113.000000	1.000000	1.000000	75.000000	286.000000	5.000000
		50%	113.000000	1.000000	1.000000	75.000000	286.000000	5.000000
		75%	113.000000	1.000000	1.000000	75.000000	286.000000	5.000000
		max	113.000000	1.000000	1.000000	75.000000	286.000000	5.000000
	Neighborhood 17	count	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
		mean	189.000000	2.000000	15.000000	65.000000	10.000000	23.000000
		std	NaN	NaN	NaN	NaN	NaN	NaN
		min	189.000000	2.000000	15.000000	65.000000	10.000000	23.000000
		25%	189.000000	2.000000	15.000000	65.000000	10.000000	23.000000
		50%	189.000000	2.000000	15.000000	65.000000	10.000000	23.000000
		75%	189.000000	2.000000	15.000000	65.000000	10.000000	23.000000
		max	189.000000	2.000000	15.000000	65.000000	10.000000	23.000000
	Neighborhood 4	count	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
		mean	241.000000	2.000000	4.000000	40.000000	20.000000	19.000000
		std	NaN	NaN	NaN	NaN	NaN	NaN
		min	241.000000	2.000000	4.000000	40.000000	20.000000	19.000000
		25%	241.000000	2.000000	4.000000	40.000000	20.000000	19.000000
		50%	241.000000	2.000000	4.000000	40.000000	20.000000	19.000000
		75%	241.000000	2.000000	4.000000	40.000000	20.000000	19.000000
		max	241.000000	2.000000	4.000000	40.000000	20.000000	19.000000

		Monthly_Totals	prop_id	prop_type	neighborhood	price	person_capacity	picture_count	description_length	tenure_months	monthly_revenue
prop_id
1	2011-03-31	2	1	Property type 1	Neighborhood 14	140	3	11	232	30	280
	2011-04-30	0	1	Property type 1	Neighborhood 14	140	3	11	232	30	0
	2011-05-31	1	1	Property type 1	Neighborhood 14	140	3	11	232	30	140
	2011-06-30	1	1	Property type 1	Neighborhood 14	140	3	11	232	30	140
3	2011-08-31	1	3	Property type 2	Neighborhood 16	95	2	16	172	29	95