In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files are provided in hw/data/: bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.
In [4]:
# Okay!
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
# This enables inline Plots
%matplotlib inline
In [5]:
# pd.read_csv() pulls in each file and creates a pandas DataFrame
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
# the following .head(5) call displays the column headers and first 5 rows of the DataFrame
# I used this to ensure the DataFrame was read in correctly
listings.head(5)
# bookings.head(5)
Out[5]:
In [6]:
# The .describe() function displays summary statistics for each numeric column.
# The median is equivalent to the 50% quartile (the 50% row in the output).
listings.describe()
# Which is the median?
# What is this question asking?
# Describe standard deviation:
# The standard deviation is a value that shows how spread out the numbers in a set are.
# For example, picture_count has a standard deviation of about 10.5.
# If the distribution were normal (which it is not), roughly 68.27% of listings would have a picture_count within 1 std. dev. (10.5) of the mean (14.4).
Out[6]:
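As a quick check on the standard-deviation note above, here is a small sketch (assuming picture_count is the column in question, as used elsewhere in this notebook) that computes the mean and standard deviation directly and the actual fraction of listings within one standard deviation of the mean:

# Sketch: verify the quoted mean/std and the within-1-std fraction for picture_count
pic = listings['picture_count']
pic_mean, pic_std = pic.mean(), pic.std()
within_one_std = ((pic - pic_mean).abs() <= pic_std).mean()
print(pic_mean, pic_std, within_one_std)  # would be ~0.6827 only for a roughly normal distribution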
In [7]:
# I used groupby to display the information in a table format
listings.groupby('prop_type')[['person_capacity', 'picture_count', 'description_length', 'tenure_months', 'price']].agg(['mean'])
Out[7]:
In [8]:
# I added a variable to the groupby to show neighborhood.
# Depending on how you want to look at the data, it might be more interesting to look at property type first, then neighborhood,
# or neighborhood, then prop type. Both are provided.
# listings.groupby(['prop_type', 'neighborhood'])[['person_capacity', 'picture_count', 'description_length', 'tenure_months', 'price']].agg(['mean'])
listings.groupby(['neighborhood', 'prop_type'])[['person_capacity', 'picture_count', 'description_length', 'tenure_months', 'price']].agg(['mean'])
Out[8]:
In [9]:
print(type(bookings))
bookings.head(5)
bookings.sort_values(by='booking_date', ascending=False)
bookingCounts = bookings.groupby(['booking_date'])['prop_id'].agg(['count'])
# Note to self: create a dataframe in order to graph the histogram with labels
# need to make sure that both columns were/are labeled.
# bookingCounts = bookings['booking_date'].value_counts()
# print(type(bookingCounts))
# df = pd.DataFrame(bookingCounts)
# print(df)
# df.info()
print(bookingCounts)
propBookingCounts = bookings.groupby(['prop_id'])['booking_date'].agg(['count'])
print(propBookingCounts.info())
In [10]:
bookingCounts.hist()
# need to label axis
Out[10]:
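To address the "# need to label axis" note above, one possible way to label the histogram of the same daily booking counts is to use matplotlib directly (a sketch, not part of the graded output):

# Sketch: same histogram of daily booking counts, with axis labels
fig, ax = plt.subplots(1, 1)
ax.hist(bookingCounts['count'].values)
ax.set_xlabel('bookings per day')
ax.set_ylabel('number of days')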
In [11]:
# First step is to merge the two DataFrames, since bookings has the booking dates and listings has the neighborhood
listMerge = listings.merge(bookings, on='prop_id')
listGroup = listMerge.groupby(['neighborhood','booking_date'])['prop_id'].agg(['count']).unstack(0)
listGroup.plot()
Out[11]:
In [12]:
listMerge.head()
Out[12]:
In [13]:
listGroup.head()
Out[13]:
In [14]:
# @Chad&Ramesh ... I don't understand how this is supposed to work.
# adding the columns is easy, but how are we supposed to iterate through and include the values?
listingsWithPropCount = listings.merge(propBookingCounts, left_on='prop_id', right_index=True)
listingsWithPropCount.rename(columns={'count': 'number_of_bookings'}, inplace=True)
# listings['booking_rate'] = listings.prop_id.map(booking_rate_map)
# listings['booking_rate'] = ""
# !!! things that don't work: !!!
# propBookingCounts.rename(columns={'count': 'number_of_bookings'}, inplace=True)
# propBookingCounts.ix[0:2, ['prop_id', 'number_of_bookings']]
# print propBookingCounts.head()
# listings['number_of_bookings'] = propBookingCounts.row_dt.map(lambda x: x.count)
# combiner = lambda x, y: np.where(isnull(x), y, x)
# listings.combine(propBookingCounts, combiner)
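For the question above about how to bring the per-property counts into listings without iterating: besides the merge used here, a map-based sketch (assuming any prop_id missing from propBookingCounts should count as 0 bookings) looks like this:

# Sketch: map each prop_id to its booking count via the propBookingCounts index
booking_count_map = propBookingCounts['count']  # Series indexed by prop_id
listings_alt = listings.copy()
listings_alt['number_of_bookings'] = listings_alt['prop_id'].map(booking_count_map).fillna(0)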
In [15]:
listingsWithPropCount.head()
Out[15]:
In [16]:
# def get_booking_rate(val):
# if number_of_bookings != 0:
# return number_of_bookings/listings['tenure_months']
# else:
# return 0
# booking_rate = number_of_bookings / tenure_months
listingsWithPropCount['booking_rate'] = (listingsWithPropCount['number_of_bookings']/listingsWithPropCount['tenure_months'])
listingsWithPropCount.head()
Out[16]:
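The commented-out get_booking_rate function above hints at guarding against a zero tenure. A hedged sketch (assuming some listings might have tenure_months == 0, which I have not verified in this data) would be:

# Sketch: avoid dividing by zero if any listing has tenure_months == 0
safe_tenure = listingsWithPropCount['tenure_months'].replace(0, np.nan)
listingsWithPropCount['booking_rate'] = (listingsWithPropCount['number_of_bookings'] / safe_tenure).fillna(0)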
In [17]:
established_properties = listingsWithPropCount[listingsWithPropCount.tenure_months > 10]
established_properties
Out[17]:
prop_type and neighborhood are categorical variables; use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category).
In [18]:
# is this something new? I cannot find references to this from lectures?
pd.get_dummies(established_properties)
Out[18]:
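One hedged aside: calling get_dummies on the whole frame encodes every object-dtype column it finds. Restricting it to the two categorical columns named in the instructions keeps the numeric columns untouched:

# Sketch: dummy-encode only the two categorical columns
dummied = pd.get_dummies(established_properties, columns=['prop_type', 'neighborhood'])
dummied.head()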
The predictor (y) is booking_rate; the regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
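Following the drop-labels link above, a sketch of building y and X by dropping the excluded columns (rather than listing the kept ones, as the next cell does) could look like this, assuming established_properties still contains all five excluded columns:

# Sketch: y is booking_rate, X is everything except the excluded columns
y_sketch = established_properties['booking_rate']
X_sketch = established_properties.drop(['prop_id', 'booking_rate', 'prop_type', 'neighborhood', 'number_of_bookings'], axis=1)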
In [19]:
# @Chad/Ramesh -- it's not clear to me what is supposed to be done here. Are we supposed to graph the data?
# Note: in newer scikit-learn versions, train_test_split lives in sklearn.model_selection instead.
from sklearn.cross_validation import train_test_split
from IPython.core.pylabtools import figsize
from sklearn.linear_model import LinearRegression
In [20]:
x = established_properties[['price','person_capacity','picture_count','description_length','tenure_months']].values
y = established_properties['booking_rate'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8)
clf = LinearRegression()
clf.fit(x_train, y_train)
Out[20]:
In [21]:
ratePrediction = clf.predict(x_test)
sum_sq_model = np.sum((y_test - ratePrediction) ** 2)
sum_sq_model
Out[21]:
In [22]:
sum_sq_naive = np.sum((y_test - y.mean()) ** 2)
sum_sq_naive
Out[22]:
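To connect the two sums of squares above: R^2 is one minus their ratio, which is essentially what LinearRegression.score reports (score uses the test-set mean for the naive baseline, whereas sum_sq_naive above uses y.mean()):

# Sketch: R^2 on the test set from the model and naive sums of squared errors
r_squared = 1 - sum_sq_model / sum_sq_naive
print(r_squared)
# compare with clf.score(x_test, y_test)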
In [23]:
fig, ax = plt.subplots(1, 1)
ax.scatter(ratePrediction, y_test)
ax.set_xlabel('Predicted booking_rate')
ax.set_ylabel('Actual booking_rate')
# Draw the ideal line
ax.plot(y, y, 'r')
Out[23]:
In [1]:
# a, b = np.arange(10).reshape((5, 2)), range(5)
# a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.33, random_state=42)
# print a_train
# print a_test
# print b_train
# print b_test
In [25]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
In [25]:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score
In [34]:
clf.score(x, y)
Out[34]:
...type here...
In [ ]: