Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.



In [2]:

    
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [3]:

    
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')



In [4]:

    
bookings.head()









    Out[4]:






  
    
      
      prop_id
      booking_date
    
  
  
    
      0
        9
       2011-06-17
    
    
      1
       13
       2011-08-12
    
    
      2
       21
       2011-06-20
    
    
      3
       28
       2011-05-05
    
    
      4
       29
       2011-11-17



In [5]:

    
listings.head()









    Out[5]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
  
  
    
      0
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
    
    
      1
       2
       Property type 1
       Neighborhood 14
        95
       2
        3
        37
       29
    
    
      2
       3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
    
    
      3
       4
       Property type 2
       Neighborhood 13
        90
       2
       19
       472
       28
    
    
      4
       5
       Property type 1
       Neighborhood 15
       125
       5
       21
       442
       28



In [6]:

    
bookings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null object
dtypes: int64(1), object(1)
memory usage: 142.4+ KB



In [7]:

    
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [8]:

    
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()









    Out[8]:






  
    
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
  
  
    
      count
        408.000000
       408.000000
       408.000000
        408.000000
       408.000000
    
    
      mean
        187.806373
         2.997549
        14.389706
        309.159314
         8.487745
    
    
      std
        353.050858
         1.594676
        10.477428
        228.021684
         5.872088
    
    
      min
         39.000000
         1.000000
         1.000000
          0.000000
         1.000000
    
    
      25%
         90.000000
         2.000000
         6.000000
        179.000000
         4.000000
    
    
      50%
        125.000000
         2.000000
        12.000000
        250.000000
         7.000000
    
    
      75%
        199.000000
         4.000000
        20.000000
        389.500000
        13.000000
    
    
      max
       5000.000000
        10.000000
        71.000000
       1969.000000
        30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [9]:

    
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()









    Out[9]:






  
    
      
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      prop_type
      
      
      
      
      
      
    
  
  
    
      Property type 1
      count
        269.000000
       269.000000
       269.000000
        269.000000
       269.000000
    
    
      mean
        237.085502
         3.516729
        14.695167
        313.171004
         8.464684
    
    
      std
        425.710534
         1.644955
        10.623651
        214.769141
         5.773367
    
    
      min
         40.000000
         1.000000
         1.000000
         17.000000
         1.000000
    
    
      25%
        120.000000
         2.000000
         6.000000
        193.000000
         4.000000
    
    
      50%
        150.000000
         3.000000
        12.000000
        266.000000
         7.000000
    
    
      75%
        229.000000
         4.000000
        20.000000
        388.000000
        14.000000
    
    
      max
       5000.000000
        10.000000
        71.000000
       1719.000000
        30.000000
    
    
      Property type 2
      count
        135.000000
       135.000000
       135.000000
        135.000000
       135.000000
    
    
      mean
         93.288889
         2.000000
        13.948148
        304.851852
         8.377778
    
    
      std
         42.261246
         0.846415
        10.255191
        255.135332
         5.963654
    
    
      min
         39.000000
         1.000000
         1.000000
          0.000000
         1.000000
    
    
      25%
         69.000000
         2.000000
         6.500000
        150.500000
         4.000000
    
    
      50%
         89.000000
         2.000000
        11.000000
        239.000000
         7.000000
    
    
      75%
         99.000000
         2.000000
        19.500000
        402.500000
        10.500000
    
    
      max
        350.000000
         6.000000
        56.000000
       1969.000000
        29.000000
    
    
      Property type 3
      count
          4.000000
         4.000000
         4.000000
          4.000000
         4.000000
    
    
      mean
         63.750000
         1.750000
         8.750000
        184.750000
        13.750000
    
    
      std
         16.520190
         0.500000
         7.320064
         53.093471
         8.616844
    
    
      min
         40.000000
         1.000000
         1.000000
        113.000000
         5.000000
    
    
      25%
         58.750000
         1.750000
         3.250000
        170.000000
         7.250000
    
    
      50%
         70.000000
         2.000000
         9.500000
        192.500000
        13.500000
    
    
      75%
         75.000000
         2.000000
        15.000000
        207.250000
        20.000000
    
    
      max
         75.000000
         2.000000
        15.000000
        241.000000
        23.000000

Same, but by property type per neighborhood?



In [10]:

    
listings.groupby(['prop_type', 'neighborhood'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()









    Out[10]:






  
    
      
      
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      prop_type
      neighborhood
      
      
      
      
      
      
    
  
  
    
      Property type 1
      Neighborhood 1
      count
         1.000000
        1.000000
        1.000000
          1.000000
        1.000000
    
    
      mean
        85.000000
        2.000000
       26.000000
        209.000000
        6.000000
    
    
      std
              NaN
             NaN
             NaN
               NaN
             NaN
    
    
      min
        85.000000
        2.000000
       26.000000
        209.000000
        6.000000
    
    
      25%
        85.000000
        2.000000
       26.000000
        209.000000
        6.000000
    
    
      50%
        85.000000
        2.000000
       26.000000
        209.000000
        6.000000
    
    
      75%
        85.000000
        2.000000
       26.000000
        209.000000
        6.000000
    
    
      max
        85.000000
        2.000000
       26.000000
        209.000000
        6.000000
    
    
      Neighborhood 10
      count
         6.000000
        6.000000
        6.000000
          6.000000
        6.000000
    
    
      mean
       142.500000
        3.500000
       13.333333
        391.000000
        3.833333
    
    
      std
        36.979724
        1.224745
        8.571270
        146.929915
        1.602082
    
    
      min
        90.000000
        2.000000
        4.000000
        160.000000
        1.000000
    
    
      25%
       135.000000
        2.500000
        6.750000
        312.250000
        3.250000
    
    
      50%
       137.500000
        4.000000
       12.000000
        425.500000
        4.500000
    
    
      75%
       147.500000
        4.000000
       20.250000
        499.000000
        5.000000
    
    
      max
       205.000000
        5.000000
       24.000000
        537.000000
        5.000000
    
    
      Neighborhood 11
      count
        14.000000
       14.000000
       14.000000
         14.000000
       14.000000
    
    
      mean
       159.428571
        3.214286
        9.928571
        379.000000
        9.642857
    
    
      std
        70.962302
        1.311404
        5.928605
        396.956111
        5.212517
    
    
      min
        95.000000
        2.000000
        5.000000
         82.000000
        1.000000
    
    
      25%
       103.750000
        2.000000
        6.000000
        242.500000
        6.000000
    
    
      50%
       130.000000
        3.000000
        8.000000
        295.500000
        9.500000
    
    
      75%
       196.500000
        3.750000
        9.750000
        373.000000
       15.500000
    
    
      max
       319.000000
        6.000000
       23.000000
       1719.000000
       16.000000
    
    
      Neighborhood 12
      count
        39.000000
       39.000000
       39.000000
         39.000000
       39.000000
    
    
      mean
       365.615385
        3.435897
       10.820513
        267.205128
        7.897436
    
    
      std
       686.086484
        1.874972
        7.118810
        137.867820
        5.290482
    
    
      min
        60.000000
        1.000000
        1.000000
         45.000000
        1.000000
    
    
      25%
       125.000000
        2.000000
        6.000000
        182.000000
        3.000000
    
    
      50%
       150.000000
        3.000000
        8.000000
        251.000000
        6.000000
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      Property type 3
      Neighborhood 11
      std
              NaN
             NaN
             NaN
               NaN
             NaN
    
    
      min
        75.000000
        2.000000
       15.000000
        196.000000
        8.000000
    
    
      25%
        75.000000
        2.000000
       15.000000
        196.000000
        8.000000
    
    
      50%
        75.000000
        2.000000
       15.000000
        196.000000
        8.000000
    
    
      75%
        75.000000
        2.000000
       15.000000
        196.000000
        8.000000
    
    
      max
        75.000000
        2.000000
       15.000000
        196.000000
        8.000000
    
    
      Neighborhood 14
      count
         1.000000
        1.000000
        1.000000
          1.000000
        1.000000
    
    
      mean
        75.000000
        1.000000
        1.000000
        113.000000
        5.000000
    
    
      std
              NaN
             NaN
             NaN
               NaN
             NaN
    
    
      min
        75.000000
        1.000000
        1.000000
        113.000000
        5.000000
    
    
      25%
        75.000000
        1.000000
        1.000000
        113.000000
        5.000000
    
    
      50%
        75.000000
        1.000000
        1.000000
        113.000000
        5.000000
    
    
      75%
        75.000000
        1.000000
        1.000000
        113.000000
        5.000000
    
    
      max
        75.000000
        1.000000
        1.000000
        113.000000
        5.000000
    
    
      Neighborhood 17
      count
         1.000000
        1.000000
        1.000000
          1.000000
        1.000000
    
    
      mean
        65.000000
        2.000000
       15.000000
        189.000000
       23.000000
    
    
      std
              NaN
             NaN
             NaN
               NaN
             NaN
    
    
      min
        65.000000
        2.000000
       15.000000
        189.000000
       23.000000
    
    
      25%
        65.000000
        2.000000
       15.000000
        189.000000
       23.000000
    
    
      50%
        65.000000
        2.000000
       15.000000
        189.000000
       23.000000
    
    
      75%
        65.000000
        2.000000
       15.000000
        189.000000
       23.000000
    
    
      max
        65.000000
        2.000000
       15.000000
        189.000000
       23.000000
    
    
      Neighborhood 4
      count
         1.000000
        1.000000
        1.000000
          1.000000
        1.000000
    
    
      mean
        40.000000
        2.000000
        4.000000
        241.000000
       19.000000
    
    
      std
              NaN
             NaN
             NaN
               NaN
             NaN
    
    
      min
        40.000000
        2.000000
        4.000000
        241.000000
       19.000000
    
    
      25%
        40.000000
        2.000000
        4.000000
        241.000000
       19.000000
    
    
      50%
        40.000000
        2.000000
        4.000000
        241.000000
       19.000000
    
    
      75%
        40.000000
        2.000000
        4.000000
        241.000000
       19.000000
    
    
      max
        40.000000
        2.000000
        4.000000
        241.000000
       19.000000
    
  

320 rows × 5 columns

Plot daily bookings:



In [11]:

    
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB



In [12]:

    
# merge
mrgd = bookings.merge(listings, on='prop_id')

# convert to datetime
mrgd['booking_date'] = pd.to_datetime(mrgd.booking_date)



In [13]:

    
mrgd.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
booking_date          6076 non-null datetime64[ns]
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
dtypes: datetime64[ns](1), int64(6), object(2)
memory usage: 474.7+ KB



In [14]:

    
count = mrgd.booking_date.value_counts()



In [15]:

    
fig, ax = plt.subplots(1,1)
ax.plot_date(count.index, count)









    Out[15]:





[<matplotlib.lines.Line2D at 0x103334f50>]



In [16]:

    
count.plot()









    Out[16]:





<matplotlib.axes.AxesSubplot at 0x107e3a2d0>



In [17]:

    
reset = count.sort_index().reset_index()



In [18]:

    
print '*'*80
print '''Lets regress ! I'm going to use np.polyfit. It minimizes the minimized squared error'''
print '*'*80
from IPython.display import Image
Image(url='http://docs.scipy.org/doc/numpy/_images/math/3e125ffecb04217922f4cb0a04367165ebf64864.png')









    



********************************************************************************
Lets regress ! I'm going to use np.polyfit. It minimizes the minimized squared error
********************************************************************************






    Out[18]:



In [19]:

    
for i in range(1,4):
    z = np.polyfit(reset.index, reset[0], i)
    p = np.poly1d(z)
    
    print "Polynomial (" + str(i) + ")", '\n', p, '\n'
    xp = np.linspace(1, max(reset.index), max(reset[0]))
    plt.plot(reset.index, reset[0], '.', xp, p(xp), '-')









    



Polynomial (1) 
 
-0.008861 x + 18.26 

Polynomial (2) 
           2
-0.000285 x + 0.09488 x + 11.98 

Polynomial (3) 
            3            2
-9.708e-08 x - 0.000232 x + 0.08717 x + 12.22



In [20]:

    
help('pandas.stats.moments.rolling_mean')









    



Help on function roll_mean in pandas.stats.moments:

pandas.stats.moments.rolling_mean = roll_mean(arg, window, min_periods=None, freq=None, center=False, how=None, **kwargs)
    Moving mean.
    
    Parameters
    ----------
    arg : Series, DataFrame
    window : int
        Size of the moving window. This is the number of observations used for
        calculating the statistic.
    min_periods : int, default None
        Minimum number of observations in window required to have a value
        (otherwise result is NA).
    freq : string or DateOffset object, optional (default None)
        Frequency to conform the data to before computing the statistic. Specified
        as a frequency string or DateOffset object.
    center : boolean, default False
        Set the labels at the center of the window.
    how : string, default 'None'
        Method for down- or re-sampling
    
    Returns
    -------
    y : type of input argument
    
    Notes
    -----
    By default, the result is set to the right edge of the window. This can be
    changed to the center of the window by setting ``center=True``.
    
    The `freq` keyword is used to conform time series data to a specified
    frequency by resampling the data. This is done with the default parameters
    of :meth:`~pandas.Series.resample` (i.e. using the `mean`).



In [21]:

    
print "I think this one seems to fit pretty well. Let's add an exponential weighted moving average, just for fun."
print """
1. Raw Data
2. Polyfit
3. 30 day exponential weighted moving average
4. 14 day exponential weighted moving average
"""
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row')
from pandas.stats.moments import ewma

z = np.polyfit(reset.index, reset[0], 2)
p = np.poly1d(z)

print "Polynomial (" + str(i) + ")", '\n', p, '\n'
xp = np.linspace(1, max(reset.index), max(reset[0]))

# Green is our polyfit
ax1.plot(reset.index, reset[0], '.')
ax2.plot(xp, p(xp), 'g-')

# Here is the exponential moving average, in red, for 1 months span
moving_ema_30 = ewma(reset[0], span=30)
ax3.plot(moving_ema_30)

moving_ema_2w = ewma(reset[0], span=14)
ax4.plot(moving_ema_2w, 'r-')









    



I think this one seems to fit pretty well. Let's add an exponential weighted moving average, just for fun.

1. Raw Data
2. Polyfit
3. 30 day exponential weighted moving average
4. 14 day exponential weighted moving average

Polynomial (3) 
           2
-0.000285 x + 0.09488 x + 11.98 







    Out[21]:





[<matplotlib.lines.Line2D at 0x107f41f90>]



In [22]:

    
reset.head()

Plot the daily bookings per neighborhood (provide a legend)



In [23]:

    
# smaller frame
nei = mrgd[['neighborhood', 'booking_date']]
# sort
nei = nei.sort(columns='booking_date')
# group by and count
neidate = nei.groupby('booking_date').neighborhood.value_counts()
# table that sucker
neidate = neidate.unstack()
neidate.plot()









    Out[23]:





<matplotlib.axes.AxesSubplot at 0x107f90ad0>



In [24]:

    
# Bleh, thats hard to read.

# How random is it ?
from pandas.tools.plotting import lag_plot
lag_plot(neidate)









    Out[24]:





<matplotlib.axes.AxesSubplot at 0x108183510>



In [25]:

    
from pandas.tools.plotting import autocorrelation_plot
autocorrelation_plot(neidate)

# I guess pretty random.









    Out[25]:





<matplotlib.axes.AxesSubplot at 0x1081f85d0>



In [26]:

    
chunk = pd.DataFrame(neidate)
maxnei = range(len(chunk.columns))
for i in range(20):
    plt.subplots(1,1)
    plt.title(str(chunk.columns[i]))
    chunk.ix[:,i].plot()



In [27]:

    
# I still feel like there is a better way of plotting those...

Part 2 - Develop a data set

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [28]:

    
l2 = listings.copy()
l2['number_of_bookings'] = bookings.groupby('prop_id').count()
l2['booking_rate'] = l2.number_of_bookings/l2.tenure_months
l2.head()









    Out[28]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      0
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
      NaN
            NaN
    
    
      1
       2
       Property type 1
       Neighborhood 14
        95
       2
        3
        37
       29
        4
       0.137931
    
    
      2
       3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
      NaN
            NaN
    
    
      3
       4
       Property type 2
       Neighborhood 13
        90
       2
       19
       472
       28
        1
       0.035714
    
    
      4
       5
       Property type 1
       Neighborhood 15
       125
       5
       21
       442
       28
       27
       0.964286

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months



In [29]:

    
l3 = l2[l2.tenure_months>10]
l3.head()









    Out[29]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      0
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
      NaN
            NaN
    
    
      1
       2
       Property type 1
       Neighborhood 14
        95
       2
        3
        37
       29
        4
       0.137931
    
    
      2
       3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
      NaN
            NaN
    
    
      3
       4
       Property type 2
       Neighborhood 13
        90
       2
       19
       472
       28
        1
       0.035714
    
    
      4
       5
       Property type 1
       Neighborhood 15
       125
       5
       21
       442
       28
       27
       0.964286

`prop_type` and `neighborhood` are categorical variables, use `get_dummies()` (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [30]:

    
l4 = pd.get_dummies(l3)
l4.columns
l4.head()
l4 = l4.fillna(0) # THIS IS A HUGE ASSUPTION : Is it ok to fill NaN's with a zero ?

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [31]:

    
from sklearn.cross_validation import train_test_split



In [32]:

    
training_data = l4.drop(['booking_rate'], axis=1)
training_data = training_data.drop(['number_of_bookings'], axis=1)
test_data = l4.booking_rate
train_a, test_a, train_b, test_b = train_test_split(training_data, test_data, test_size=.2)



In [33]:

    
test_data.head()
training_data.columns









    Out[33]:





Index([u'prop_id', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months', u'prop_type_Property type 1', u'prop_type_Property type 2', u'prop_type_Property type 3', u'neighborhood_Neighborhood 11', u'neighborhood_Neighborhood 12', u'neighborhood_Neighborhood 13', u'neighborhood_Neighborhood 14', u'neighborhood_Neighborhood 15', u'neighborhood_Neighborhood 16', u'neighborhood_Neighborhood 17', u'neighborhood_Neighborhood 18', u'neighborhood_Neighborhood 19', u'neighborhood_Neighborhood 20', u'neighborhood_Neighborhood 21', u'neighborhood_Neighborhood 4', u'neighborhood_Neighborhood 5', u'neighborhood_Neighborhood 9'], dtype='object')

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [34]:

    
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_a, train_b)









    Out[34]:





LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

fit your model with your test sets



In [35]:

    
# Mean square error
print np.mean((lr.predict(test_a)-test_b)**2)
# Score
print lr.score(test_a, test_b)









    



1.1945578579
-0.0413719506008



In [36]:

    
lr.fit(test_a, test_b)
# Mean square error
print np.mean((lr.predict(test_a)-test_b)**2)
# Score
print lr.score(test_a, test_b)









    



0.324432180458
0.717171863744

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score



In [37]:

    
train = lr.score(train_a, train_b)
test = lr.score(test_a, test_b)
print train, test









    



-1.15374927847 0.717171863744

What does the score method do?

Returns the coefficient of determination R^2 of the prediction. The coefficient R^2 is defined as (1 - u/v), where u is the regression sum of squares ((y_true - y_pred) ** 2).sum() and v is the residual sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.

So what actually is that?

R^2, or the coefficient of determination, is a way of "how well data fits a statistical model" (Wikipedia). R^2 can be defined in multiple ways - but here it is defined in a few parts.

First we have: (1 - u/ v)

u = regression sum of squares

((y_true - ypred) ** 2 )).sum()

summation of squared errors: the error between the predicted value and the true value is squared (to alleviate the possibility of positive and negative answers cancelling each other out), then they are added together.

and

v = residual sum of squares

((y_true - y_true.mean()) ** 2).sum()

residual sum of squares: the error between the predicted value and the mean of the true value is squared (to alleviate the possibility of positive and negative answers cancelling each other out), then they are added together.

What does this tell us about our model ?

We'll, the close the score is to 1.0 the better. That means that we did pretty good if our training set is at .95 ! When we add more data, from the test set, we see this explanation of variance go up. Which is good.

Just kidding !! I had left number_of_bookings in there so that is why it was getting such a great score. So now, I'm getting values much lower, but they are real (I think?).

Basically, the score is telling us the goodness of fit. Now that we have a much smaller score, we know that our model only explains a smaller part of the variance.

test size: .30 gives a r^2 of 0.717171863744



In [37]:

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to



In [34]:



In [ ]:

	index	0
0	2011-01-01	11
1	2011-01-02	9
2	2011-01-03	10
3	2011-01-04	8
4	2011-01-05	15

	prop_id	booking_date
0	9	2011-06-17
1	13	2011-08-12
2	21	2011-06-20
3	28	2011-05-05
4	29	2011-11-17

	prop_id	prop_type	neighborhood	price	person_capacity	picture_count	description_length	tenure_months
0	1	Property type 1	Neighborhood 14	140	3	11	232	30
1	2	Property type 1	Neighborhood 14	95	2	3	37	29
2	3	Property type 2	Neighborhood 16	95	2	16	172	29
3	4	Property type 2	Neighborhood 13	90	2	19	472	28
4	5	Property type 1	Neighborhood 15	125	5	21	442	28

	price	person_capacity	picture_count	description_length	tenure_months
count	408.000000	408.000000	408.000000	408.000000	408.000000
mean	187.806373	2.997549	14.389706	309.159314	8.487745
std	353.050858	1.594676	10.477428	228.021684	5.872088
min	39.000000	1.000000	1.000000	0.000000	1.000000
25%	90.000000	2.000000	6.000000	179.000000	4.000000
50%	125.000000	2.000000	12.000000	250.000000	7.000000
75%	199.000000	4.000000	20.000000	389.500000	13.000000
max	5000.000000	10.000000	71.000000	1969.000000	30.000000

		price	person_capacity	picture_count	description_length	tenure_months
prop_type
Property type 1	count	269.000000	269.000000	269.000000	269.000000	269.000000
	mean	237.085502	3.516729	14.695167	313.171004	8.464684
	std	425.710534	1.644955	10.623651	214.769141	5.773367
	min	40.000000	1.000000	1.000000	17.000000	1.000000
	25%	120.000000	2.000000	6.000000	193.000000	4.000000
	50%	150.000000	3.000000	12.000000	266.000000	7.000000
	75%	229.000000	4.000000	20.000000	388.000000	14.000000
	max	5000.000000	10.000000	71.000000	1719.000000	30.000000
Property type 2	count	135.000000	135.000000	135.000000	135.000000	135.000000
	mean	93.288889	2.000000	13.948148	304.851852	8.377778
	std	42.261246	0.846415	10.255191	255.135332	5.963654
	min	39.000000	1.000000	1.000000	0.000000	1.000000
	25%	69.000000	2.000000	6.500000	150.500000	4.000000
	50%	89.000000	2.000000	11.000000	239.000000	7.000000
	75%	99.000000	2.000000	19.500000	402.500000	10.500000
	max	350.000000	6.000000	56.000000	1969.000000	29.000000
Property type 3	count	4.000000	4.000000	4.000000	4.000000	4.000000
	mean	63.750000	1.750000	8.750000	184.750000	13.750000
	std	16.520190	0.500000	7.320064	53.093471	8.616844
	min	40.000000	1.000000	1.000000	113.000000	5.000000
	25%	58.750000	1.750000	3.250000	170.000000	7.250000
	50%	70.000000	2.000000	9.500000	192.500000	13.500000
	75%	75.000000	2.000000	15.000000	207.250000	20.000000
	max	75.000000	2.000000	15.000000	241.000000	23.000000

			price	person_capacity	picture_count	description_length	tenure_months
prop_type	neighborhood
Property type 1	Neighborhood 1	count	1.000000	1.000000	1.000000	1.000000	1.000000
		mean	85.000000	2.000000	26.000000	209.000000	6.000000
		std	NaN	NaN	NaN	NaN	NaN
		min	85.000000	2.000000	26.000000	209.000000	6.000000
		25%	85.000000	2.000000	26.000000	209.000000	6.000000
		50%	85.000000	2.000000	26.000000	209.000000	6.000000
		75%	85.000000	2.000000	26.000000	209.000000	6.000000
		max	85.000000	2.000000	26.000000	209.000000	6.000000
	Neighborhood 10	count	6.000000	6.000000	6.000000	6.000000	6.000000
		mean	142.500000	3.500000	13.333333	391.000000	3.833333
		std	36.979724	1.224745	8.571270	146.929915	1.602082
		min	90.000000	2.000000	4.000000	160.000000	1.000000
		25%	135.000000	2.500000	6.750000	312.250000	3.250000
		50%	137.500000	4.000000	12.000000	425.500000	4.500000
		75%	147.500000	4.000000	20.250000	499.000000	5.000000
		max	205.000000	5.000000	24.000000	537.000000	5.000000
	Neighborhood 11	count	14.000000	14.000000	14.000000	14.000000	14.000000
		mean	159.428571	3.214286	9.928571	379.000000	9.642857
		std	70.962302	1.311404	5.928605	396.956111	5.212517
		min	95.000000	2.000000	5.000000	82.000000	1.000000
		25%	103.750000	2.000000	6.000000	242.500000	6.000000
		50%	130.000000	3.000000	8.000000	295.500000	9.500000
		75%	196.500000	3.750000	9.750000	373.000000	15.500000
		max	319.000000	6.000000	23.000000	1719.000000	16.000000
	Neighborhood 12	count	39.000000	39.000000	39.000000	39.000000	39.000000
		mean	365.615385	3.435897	10.820513	267.205128	7.897436
		std	686.086484	1.874972	7.118810	137.867820	5.290482
		min	60.000000	1.000000	1.000000	45.000000	1.000000
		25%	125.000000	2.000000	6.000000	182.000000	3.000000
		50%	150.000000	3.000000	8.000000	251.000000	6.000000
...	...	...	...	...	...	...	...
Property type 3	Neighborhood 11	std	NaN	NaN	NaN	NaN	NaN
		min	75.000000	2.000000	15.000000	196.000000	8.000000
		25%	75.000000	2.000000	15.000000	196.000000	8.000000
		50%	75.000000	2.000000	15.000000	196.000000	8.000000
		75%	75.000000	2.000000	15.000000	196.000000	8.000000
		max	75.000000	2.000000	15.000000	196.000000	8.000000
	Neighborhood 14	count	1.000000	1.000000	1.000000	1.000000	1.000000
		mean	75.000000	1.000000	1.000000	113.000000	5.000000
		std	NaN	NaN	NaN	NaN	NaN
		min	75.000000	1.000000	1.000000	113.000000	5.000000
		25%	75.000000	1.000000	1.000000	113.000000	5.000000
		50%	75.000000	1.000000	1.000000	113.000000	5.000000
		75%	75.000000	1.000000	1.000000	113.000000	5.000000
		max	75.000000	1.000000	1.000000	113.000000	5.000000
	Neighborhood 17	count	1.000000	1.000000	1.000000	1.000000	1.000000
		mean	65.000000	2.000000	15.000000	189.000000	23.000000
		std	NaN	NaN	NaN	NaN	NaN
		min	65.000000	2.000000	15.000000	189.000000	23.000000
		25%	65.000000	2.000000	15.000000	189.000000	23.000000
		50%	65.000000	2.000000	15.000000	189.000000	23.000000
		75%	65.000000	2.000000	15.000000	189.000000	23.000000
		max	65.000000	2.000000	15.000000	189.000000	23.000000
	Neighborhood 4	count	1.000000	1.000000	1.000000	1.000000	1.000000
		mean	40.000000	2.000000	4.000000	241.000000	19.000000
		std	NaN	NaN	NaN	NaN	NaN
		min	40.000000	2.000000	4.000000	241.000000	19.000000
		25%	40.000000	2.000000	4.000000	241.000000	19.000000
		50%	40.000000	2.000000	4.000000	241.000000	19.000000
		75%	40.000000	2.000000	4.000000	241.000000	19.000000
		max	40.000000	2.000000	4.000000	241.000000	19.000000