In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.
In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns # for pretty layout of plots
import matplotlib.pyplot as plt
# This enables inline Plots
%matplotlib inline
In [3]:
pd.__version__
Out[3]:
In [4]:
# Let's explore the Datasets
bookings = pd.read_csv('../data/bookings.csv', parse_dates=['booking_date'])
listings = pd.read_csv('../data/listings.csv')
In [5]:
bookings.tail()
Out[5]:
In [6]:
# Make the booking date the index so the time-based resampling below works.
# NOTE(review): inplace=True mutates `bookings` — re-running this cell on an
# already-indexed frame raises KeyError.
bookings.set_index('booking_date' ,inplace=True)
In [7]:
bookings['number_of_bookings'] = 1
In [8]:
help(bookings.resample)
In [9]:
# Monthly booking counts over time. The resample(..., how='count') keyword
# was deprecated and then removed from pandas; the aggregation is now a
# method called on the Resampler object.
bookings.resample('M').count().number_of_bookings.plot()
Out[9]:
In [10]:
# Without an aggregation (.count(), .sum(), ...) this only builds and
# displays the Resampler object — no data is actually computed.
bookings.resample('D')
Out[10]:
In [11]:
listings.tail()
Out[11]:
In [12]:
listings.describe()
Out[12]:
In [13]:
len([1, 2, 3])
Out[13]:
In [14]:
[1, 2, 3].__len__()
Out[14]:
In [15]:
class MyClass(object):
    """Toy class demonstrating the ``__len__`` protocol.

    The built-in ``len(obj)`` simply delegates to ``obj.__len__()``,
    so ``len(MyClass(n))`` reports whatever value was stored.
    """

    def __init__(self, value):
        # The stored value doubles as the object's reported "length".
        self.value = value

    def __len__(self):
        return self.value
In [16]:
myobj = MyClass(5)
In [17]:
listings.groupby(['prop_type']).agg(['mean', 'count'])
Out[17]:
In [18]:
listings.head(2)
Out[18]:
In [19]:
pd.pivot_table(listings, values='person_capacity', index='neighborhood', columns='prop_type').head(2)
Out[19]:
In [20]:
group_cols = ['neighborhood', 'prop_type']
agg_cols = ['person_capacity', 'price']
listings.groupby(group_cols)[agg_cols].agg(['sum', 'count']).unstack(level='prop_type')
#listings.groupby(group_cols)[agg_cols].agg(['sum', 'count']).unstack(1)
Out[20]:
In [21]:
listings.groupby(['prop_type', 'neighborhood']).agg(['mean', 'count'])
Out[21]:
In [34]:
bookings = bookings.reset_index()
bookings.groupby('booking_date').count()
Out[34]:
In [35]:
# Plot daily bookings
#grid_plot = sns.FacetGrid(bookings, row='booking_date', col='prop_id')
#grid_plot.map(sns.regplot, 'booking_date', color='.3', fit_reg=False, x_jitter=.1)
#prop_id booking_date
#ax = sns.boxplot(bookings.age)
#ax.set_title('Age Distribution by class')
#bookings.groupby(['booking_date']).agg(['count']).plot(kind='bar')
#bookings.groupby('booking_date').agg(['count']).plot(kind='bar')
bookings.groupby('booking_date').count().plot(kind='bar')
Out[35]:
In [36]:
bookings.head()
Out[36]:
In [40]:
#first merge
# Join each booking row to its listing attributes on prop_id (default inner
# join: bookings whose prop_id is absent from listings are dropped).
merged_bookings_listings = pd.merge(bookings, listings, on='prop_id')
# NOTE(review): this head() is not the cell's last expression, so its result
# is discarded and never displayed — dead statement.
merged_bookings_listings.head()
# Bar chart of booking volume per neighborhood.
merged_bookings_listings.groupby('neighborhood')['neighborhood'].count().plot(kind='bar')
#groupby('cylinders')['mpg'].count().plot(kind='bar')
#bookings.resample('M', how='count').number_of_bookings.plot()
Out[40]:
In [ ]:
In [41]:
b =bookings.groupby('prop_id').count()
In [42]:
b.head()
Out[42]:
In [43]:
c = b.reset_index()
In [44]:
c.head()
Out[44]:
In [54]:
# Count bookings per property, then left-join the counts onto listings so
# every listing keeps a row even with zero bookings.
number_of_bookings = bookings.groupby('prop_id').count().reset_index()
number_of_bookings.rename(columns={'booking_date':'number_of_bookings2'}, inplace = True)
listings2 = pd.merge(listings, number_of_bookings, how='left', on='prop_id')
# Listings with no bookings come out of the left merge as NaN; count them as
# 0. Assign back to the column instead of calling fillna(inplace=True) on
# the attribute-accessed Series — that is chained assignment, which pandas
# warns about and which may silently fail to update listings2.
listings2['number_of_bookings'] = listings2['number_of_bookings'].fillna(0)
listings2.head()
Out[54]:
In [55]:
# Bookings per property, joined onto listings to derive a booking rate.
number_of_bookings = bookings.groupby('prop_id').count().reset_index()
number_of_bookings.rename(columns={'booking_date':'booking_date_old'}, inplace = True)
listings2 = pd.merge(listings, number_of_bookings, how='left', on='prop_id')
# Properties with no bookings are NaN after the left merge; treat them as 0.
# Assign back to the column rather than fillna(inplace=True) on the Series
# view — that is chained assignment and may not modify listings2 at all.
listings2['number_of_bookings'] = listings2['number_of_bookings'].fillna(0)
# Normalize by how long the listing has existed to get bookings per month.
listings2['booking_rate'] = listings2.number_of_bookings/listings2.tenure_months
listings2.tail()
Out[55]:
In [57]:
listings2[listings2.tenure_months > 9].tail()
Out[57]:
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. (After applying this function correctly there should be 1 column for every prop_type category and 1 column for every neighborhood category.)
In [70]:
# get_dummies expands each string (object) column into multiple columns, one
# per unique value in the original column. Note this only applies to the
# string columns; the numeric columns pass through unchanged.
listings3 = pd.get_dummies(listings2)
#pd.get_dummies(listings2[['prop_type', 'neighborhood', 'price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months', 'booking_rate']])
In [72]:
listings3.columns
Out[72]:
In [115]:
listings3.head()
Out[115]:
The target (y) is booking_rate; the regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
In [74]:
# sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed
# in 0.20; train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

# 80/20 train/test split; random_state pins the shuffle so the split is
# reproducible across runs.
feature_cols = ['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']
X_train, X_test, y_train, y_test = train_test_split(
    listings3[feature_cols], listings3.booking_rate, random_state=12, test_size=0.2)
In [ ]:
In [93]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
In [94]:
lr.fit(X_train, y_train)
Out[94]:
In [95]:
lr.score(X_test, y_test)
Out[95]:
The score method returns the coefficient of determination R^2 of the prediction. Our score was 0.1256213709762638, which is quite low (the best possible is 1). This suggests the features in X_test only weakly predict y_test.
In [120]:
# Second attempt with a reduced feature set to see if R^2 improves.
X_train, X_test, y_train, y_test = train_test_split(listings3[['price', 'picture_count', 'tenure_months']
], listings3.booking_rate, random_state=12, test_size=0.2)
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
# I tried adding and removing additional fields, and nothing seemed to significantly increase the coefficient.
# Not sure how to create monthly revenue.
Out[120]:
In [ ]:
In [121]:
#Is it possible to plot the final regression in the homework?
#What does “random_state=12” mean??
#How create “monthly revenue” predictor in HW1?
#How could I identify which of the x_test inputs are the most important to determine y?
In [ ]:
In [107]:
##Optional - can we plot this info?##
In [106]:
from sklearn.preprocessing import PolynomialFeatures
def f(x):
    """Ground-truth target curve: sin(2*pi*x)."""
    return np.sin(2 * np.pi * x)

# Dense grid on [0, 1] used to draw smooth curves.
x_plot = np.linspace(0, 1, 100)

def plot_approximation(est, ax, label=None):
    """Plot the approximation of ``est`` on axis ``ax``. """
    # Draw the true curve, the training points, and the model's prediction.
    # NOTE(review): relies on the notebook-level globals x_plot, X_train and
    # y_train; scatter() presumably expects X_train to be a single 1-D
    # feature — confirm before reusing with multi-column X.
    ax.plot(x_plot, f(x_plot), label='ground truth', color='green')
    ax.scatter(X_train, y_train)
    ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(loc='upper right', frameon=True)
In [101]:
# make_pipeline is never imported anywhere in this notebook, so this cell
# raised NameError as written; bring it in from sklearn.pipeline.
from sklearn.pipeline import make_pipeline

fig, ax = plt.subplots(1, 1)
# Degree-1 polynomial features reduce to plain linear regression; raise
# `degree` to fit higher-order approximations of the sin curve.
degree = 1
lr = make_pipeline(PolynomialFeatures(degree), LinearRegression())
plot_approximation(lr, ax, label='1')