Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.



In [229]:

    
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

pd.set_option('display.max_rows', 10)

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [230]:

    
bookings = pd.read_csv('../data/bookings.csv', parse_dates = ['booking_date'])
listings = pd.read_csv('../data/listings.csv')



In [231]:

    
print bookings









    



      prop_id booking_date
0           9   2011-06-17
1          13   2011-08-12
2          21   2011-06-20
3          28   2011-05-05
4          29   2011-11-17
...       ...          ...
6071      408   2011-06-02
6072      408   2011-08-22
6073      408   2011-07-24
6074      408   2011-01-12
6075      408   2011-09-08

[6076 rows x 2 columns]



In [232]:

    
print listings









    



     prop_id        prop_type     neighborhood  price  person_capacity  \
0          1  Property type 1  Neighborhood 14    140                3   
1          2  Property type 1  Neighborhood 14     95                2   
2          3  Property type 2  Neighborhood 16     95                2   
3          4  Property type 2  Neighborhood 13     90                2   
4          5  Property type 1  Neighborhood 15    125                5   
..       ...              ...              ...    ...              ...   
403      404  Property type 2  Neighborhood 14    100                1   
404      405  Property type 2  Neighborhood 13     85                2   
405      406  Property type 1   Neighborhood 9     70                3   
406      407  Property type 1  Neighborhood 13    129                2   
407      408  Property type 1  Neighborhood 14    100                3   

     picture_count  description_length  tenure_months  
0               11                 232             30  
1                3                  37             29  
2               16                 172             29  
3               19                 472             28  
4               21                 442             28  
..             ...                 ...            ...  
403              8                 235              1  
404             27                1048              1  
405             18                 153              1  
406             13                 370              1  
407             21                 707              1  

[408 rows x 8 columns]

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [233]:

    
bookings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)



In [234]:

    
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)



In [235]:

    
bookings.columns









    Out[235]:





Index([u'prop_id', u'booking_date'], dtype='object')



In [236]:

    
listings.columns









    Out[236]:





Index([u'prop_id', u'prop_type', u'neighborhood', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months'], dtype='object')



In [237]:

    
# for col in ['neighborhood']:
#    listings[col] = listings[col].map(lambda x: x.replace('Neighborhood ', '')).astype(int)
#for col in ['prop_type']:
#    listings[col] = listings[col].map(lambda x: x.replace('Property type ', '')).astype(int)

#print listings



In [238]:

    
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)



In [239]:

    
listings.price.std()









    Out[239]:





353.05085803923384



In [240]:

    
# listings.groupby('prop_id').price.mean()



In [241]:

    
listings[['price', 'person_capacity', 'picture_count', 'description_length']].describe()









    Out[241]:






  
    
      
      price
      person_capacity
      picture_count
      description_length
    
  
  
    
      count
        408.000000
       408.000000
       408.000000
        408.000000
    
    
      mean
        187.806373
         2.997549
        14.389706
        309.159314
    
    
      std
        353.050858
         1.594676
        10.477428
        228.021684
    
    
      min
         39.000000
         1.000000
         1.000000
          0.000000
    
    
      25%
         90.000000
         2.000000
         6.000000
        179.000000
    
    
      50%
        125.000000
         2.000000
        12.000000
        250.000000
    
    
      75%
        199.000000
         4.000000
        20.000000
        389.500000
    
    
      max
       5000.000000
        10.000000
        71.000000
       1969.000000



In [242]:

    
fig, ax = plt.subplots(1,2)
listings.boxplot('price', ax=ax[0])
listings.price.plot(ax=ax[1])

ax[0].set_ylim([0, 500])
listings[listings.price > 1000]









    Out[242]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
  
  
    
      48 
        49
       Property type 1
       Neighborhood 13
       2394
       8
       10
       363
       16
    
    
      196
       197
       Property type 1
       Neighborhood 12
       3050
       4
        6
       227
        7
    
    
      289
       290
       Property type 1
       Neighborhood 12
       2650
       4
        6
       232
        5
    
    
      290
       291
       Property type 1
       Neighborhood 12
       2350
       1
        6
       219
        5
    
    
      340
       341
       Property type 1
       Neighborhood 20
       5000
       4
        6
       145
        3



In [243]:

    
grouped = listings.groupby('prop_type')
grouped['price']









    Out[243]:





<pandas.core.groupby.SeriesGroupBy object at 0x00000000182FA1D0>



In [244]:

    
# grouped['price'].isnull() == False].describe()
# grouped[grouped['price'].isnull() == False].describe()

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [245]:

    
grouped = listings.groupby('prop_type')

grouped[['price', 'person_capacity', 'picture_count', 'description_length']].agg([np.mean, np.std, np.median])









    Out[245]:






  
    
      
      price
      person_capacity
      picture_count
      description_length
    
    
      
      mean
      std
      median
      mean
      std
      median
      mean
      std
      median
      mean
      std
      median
    
    
      prop_type
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Property type 1
       237.085502
       425.710534
       150
       3.516729
       1.644955
       3
       14.695167
       10.623651
       12.0
       313.171004
       214.769141
       266.0
    
    
      Property type 2
        93.288889
        42.261246
        89
       2.000000
       0.846415
       2
       13.948148
       10.255191
       11.0
       304.851852
       255.135332
       239.0
    
    
      Property type 3
        63.750000
        16.520190
        70
       1.750000
       0.500000
       2
        8.750000
        7.320064
        9.5
       184.750000
        53.093471
       192.5



In [246]:

    
grouped.prop_type.unique()









    Out[246]:





prop_type
Property type 1    [Property type 1]
Property type 2    [Property type 2]
Property type 3    [Property type 3]
Name: prop_type, dtype: object

Same, but by property type per neighborhood?



In [247]:

    
naybor = listings.groupby(['prop_type', 'neighborhood'])['price', 'person_capacity', 'picture_count', 'description_length']



In [248]:

    
naybor.agg([np.mean, np.std, np.median])









    Out[248]:






  
    
      
      
      price
      person_capacity
      picture_count
      description_length
    
    
      
      
      mean
      std
      median
      mean
      std
      median
      mean
      std
      median
      mean
      std
      median
    
    
      prop_type
      neighborhood
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Property type 1
      Neighborhood 1
        85.000000
              NaN
        85.0
       2.000000
            NaN
       2
       26.000000
            NaN
       26.0
       209.000000
              NaN
       209.0
    
    
      Neighborhood 10
       142.500000
        36.979724
       137.5
       3.500000
       1.224745
       4
       13.333333
       8.571270
       12.0
       391.000000
       146.929915
       425.5
    
    
      Neighborhood 11
       159.428571
        70.962302
       130.0
       3.214286
       1.311404
       3
        9.928571
       5.928605
        8.0
       379.000000
       396.956111
       295.5
    
    
      Neighborhood 12
       365.615385
       686.086484
       150.0
       3.435897
       1.874972
       3
       10.820513
       7.118810
        8.0
       267.205128
       137.867820
       251.0
    
    
      Neighborhood 13
       241.897959
       320.997874
       180.0
       4.061224
       1.586508
       4
       15.653061
       9.828850
       13.0
       290.408163
       124.417965
       256.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      Property type 2
      Neighborhood 9
       110.000000
        14.142136
       110.0
       2.000000
       1.414214
       2
        3.500000
       0.707107
        3.5
       114.500000
       137.885822
       114.5
    
    
      Property type 3
      Neighborhood 11
        75.000000
              NaN
        75.0
       2.000000
            NaN
       2
       15.000000
            NaN
       15.0
       196.000000
              NaN
       196.0
    
    
      Neighborhood 14
        75.000000
              NaN
        75.0
       1.000000
            NaN
       1
        1.000000
            NaN
        1.0
       113.000000
              NaN
       113.0
    
    
      Neighborhood 17
        65.000000
              NaN
        65.0
       2.000000
            NaN
       2
       15.000000
            NaN
       15.0
       189.000000
              NaN
       189.0
    
    
      Neighborhood 4
        40.000000
              NaN
        40.0
       2.000000
            NaN
       2
        4.000000
            NaN
        4.0
       241.000000
              NaN
       241.0
    
  

40 rows × 12 columns

Plot daily bookings:



In [249]:

    
bookings.plot('booking_date', 'prop_id').fig_size = (50, 20)
plt.suptitle('Daily Bookings')
plt.xlabel('Booking Date')
plt.ylabel('# of Bookings')









    Out[249]:





<matplotlib.text.Text at 0x1736ebe0>

Plot the daily bookings per neighborhood (provide a legend)



In [250]:

    
n_listings = listings.merge(bookings, on = 'prop_id', how = 'left')



In [251]:

    
n_listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6156 entries, 0 to 6155
Data columns (total 9 columns):
prop_id               6156 non-null int64
prop_type             6156 non-null object
neighborhood          6156 non-null object
price                 6156 non-null int64
person_capacity       6156 non-null int64
picture_count         6156 non-null int64
description_length    6156 non-null int64
tenure_months         6156 non-null int64
booking_date          6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(6), object(2)



In [252]:

    
#n_listings.booking_date = pd.to_datetime(n_listings.booking_date)



In [253]:

    
n_listings.booking_date.map(lambda x: x.dayofweek)









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-253-23c2edc35118> in <module>()
----> 1 n_listings.booking_date.map(lambda x: x.dayofweek)

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in map(self, arg, na_action)
   1953                                      index=self.index).__finalize__(self)
   1954         else:
-> 1955             mapped = map_f(values, arg)
   1956             return self._constructor(mapped,
   1957                                      index=self.index).__finalize__(self)

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\lib.pyd in pandas.lib.map_infer (pandas\lib.c:53184)()

<ipython-input-253-23c2edc35118> in <lambda>(x)
----> 1 n_listings.booking_date.map(lambda x: x.dayofweek)

AttributeError: 'NaTType' object has no attribute 'dayofweek'



In [253]:



In [253]:



In [254]:

    
sns.factorplot("neighborhood", hue ="day_name", data = n_listings, palette = "Purples_d", size = 25, legend = True);









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-254-585a63fc9d8c> in <module>()
----> 1 sns.factorplot("neighborhood", hue ="day_name", data = n_listings, palette = "Purples_d", size = 25, legend = True);

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\seaborn\linearmodels.pyc in factorplot(x, y, hue, data, row, col, col_wrap, estimator, ci, n_boot, units, x_order, hue_order, col_order, row_order, kind, markers, linestyles, dodge, join, hline, size, aspect, palette, legend, legend_out, dropna, sharex, sharey, margin_titles)
    900     cols = [a for a in [x, y, hue, col, row, units] if a is not None]
    901     cols = pd.unique(cols).tolist()
--> 902     data = data[cols]
    903 
    904     facet_hue = hue if hue in [row, col] else None

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1670         if isinstance(key, (Series, np.ndarray, list)):
   1671             # either boolean or fancy integer index
-> 1672             return self._getitem_array(key)
   1673         elif isinstance(key, DataFrame):
   1674             return self._getitem_frame(key)

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
   1714             return self.take(indexer, axis=0, convert=False)
   1715         else:
-> 1716             indexer = self.ix._convert_to_indexer(key, axis=1)
   1717             return self.take(indexer, axis=1, convert=True)
   1718 

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1083                     if isinstance(obj, tuple) and is_setter:
   1084                         return {'key': obj}
-> 1085                     raise KeyError('%s not in index' % objarr[mask])
   1086 
   1087                 return indexer

KeyError: "['day_name'] not in index"



In [255]:

    
fig, ax = bookings.plot('booking_date', 'neighborhood').fig_size = (50, 20)
plt.suptitle('Daily Bookings')
ax.set_xlabel('Neighborhood')
ax.set_ylabel('# of Bookings')
ax.legend()









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-255-fdef08f7f71c> in <module>()
----> 1 fig, ax = bookings.plot('booking_date', 'neighborhood').fig_size = (50, 20)
      2 plt.suptitle('Daily Bookings')
      3 ax.set_xlabel('Neighborhood')
      4 ax.set_ylabel('# of Bookings')
      5 ax.legend()

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\tools\plotting.pyc in plot_frame(frame, x, y, subplots, sharex, sharey, use_index, figsize, grid, legend, rot, ax, style, title, xlim, ylim, logx, logy, xticks, yticks, kind, sort_columns, fontsize, secondary_y, **kwds)
   2135             label = x if x is not None else frame.index.name
   2136             label = kwds.pop('label', label)
-> 2137             ser = frame[y]
   2138             ser.index.name = label
   2139 

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1676             return self._getitem_multilevel(key)
   1677         else:
-> 1678             return self._getitem_column(key)
   1679 
   1680     def _getitem_column(self, key):

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
   1683         # get column
   1684         if self.columns.is_unique:
-> 1685             return self._get_item_cache(key)
   1686 
   1687         # duplicate columns & possible reduce dimensionaility

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
   1050         res = cache.get(item)
   1051         if res is None:
-> 1052             values = self._data.get(item)
   1053             res = self._box_item_values(item, values)
   1054             cache[item] = res

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
   2563 
   2564             if not isnull(item):
-> 2565                 loc = self.items.get_loc(item)
   2566             else:
   2567                 indexer = np.arange(len(self.items))[isnull(self.items)]

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\index.pyc in get_loc(self, key)
   1179         loc : int if unique index, possibly slice or mask if not
   1180         """
-> 1181         return self._engine.get_loc(_values_from_object(key))
   1182 
   1183     def get_value(self, series, key):

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_loc (pandas\index.c:3656)()

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_loc (pandas\index.c:3534)()

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\hashtable.pyd in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11911)()

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\hashtable.pyd in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11864)()

KeyError: 'neighborhood'

Part 2 - Develop a data set



In [256]:

    
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [257]:

    
total_book = bookings.groupby('prop_id').count().reset_index()

total_book.rename(columns = {'booking_date' : 'number_of_bookings'}, inplace = True)

total_book


# listings['e'] = Series(np.random.randn(sLength), index=df1.index)









    Out[257]:






  
    
      
      prop_id
      number_of_bookings
    
  
  
    
      0  
         1
        4
    
    
      1  
         3
        1
    
    
      2  
         4
       27
    
    
      3  
         6
       88
    
    
      4  
         7
        2
    
    
      ...
      ...
      ...
    
    
      323
       404
        3
    
    
      324
       405
       19
    
    
      325
       406
       19
    
    
      326
       407
       15
    
    
      327
       408
       54
    
  

328 rows × 2 columns



In [258]:

    
listings = listings.merge(total_book, on = 'prop_id', how = 'left')



In [259]:

    
listings.number_of_bookings.fillna(0, inplace = True)



In [260]:

    
listings.head()









    Out[260]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
    
  
  
    
      0
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
        4
    
    
      1
       2
       Property type 1
       Neighborhood 14
        95
       2
        3
        37
       29
        0
    
    
      2
       3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
        1
    
    
      3
       4
       Property type 2
       Neighborhood 13
        90
       2
       19
       472
       28
       27
    
    
      4
       5
       Property type 1
       Neighborhood 15
       125
       5
       21
       442
       28
        0



In [261]:

    
listings['booking_rate'] = listings['number_of_bookings']/listings['tenure_months']

listings









    Out[261]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      0  
         1
       Property type 1
       Neighborhood 14
       140
       3
       11
        232
       30
        4
        0.133333
    
    
      1  
         2
       Property type 1
       Neighborhood 14
        95
       2
        3
         37
       29
        0
        0.000000
    
    
      2  
         3
       Property type 2
       Neighborhood 16
        95
       2
       16
        172
       29
        1
        0.034483
    
    
      3  
         4
       Property type 2
       Neighborhood 13
        90
       2
       19
        472
       28
       27
        0.964286
    
    
      4  
         5
       Property type 1
       Neighborhood 15
       125
       5
       21
        442
       28
        0
        0.000000
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      403
       404
       Property type 2
       Neighborhood 14
       100
       1
        8
        235
        1
        3
        3.000000
    
    
      404
       405
       Property type 2
       Neighborhood 13
        85
       2
       27
       1048
        1
       19
       19.000000
    
    
      405
       406
       Property type 1
        Neighborhood 9
        70
       3
       18
        153
        1
       19
       19.000000
    
    
      406
       407
       Property type 1
       Neighborhood 13
       129
       2
       13
        370
        1
       15
       15.000000
    
    
      407
       408
       Property type 1
       Neighborhood 14
       100
       3
       21
        707
        1
       54
       54.000000
    
  

408 rows × 10 columns

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months



In [268]:

    
listings = listings[listings.tenure_months >= 10]
listings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 10 columns):
prop_id               144 non-null int64
prop_type             144 non-null object
neighborhood          144 non-null object
price                 144 non-null int64
person_capacity       144 non-null int64
picture_count         144 non-null int64
description_length    144 non-null int64
tenure_months         144 non-null int64
number_of_bookings    144 non-null float64
booking_rate          144 non-null float64
dtypes: float64(2), int64(6), object(2)

`prop_type` and `neighborhood` are categorical variables, use `get_dummies()` (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [273]:

    
dummy_prop = pd.get_dummies(listings['prop_type'])
dummy_nay = pd.get_dummies(listings['neighborhood'])
listings_prop = listings.join([dummy_prop, dummy_nay])



In [299]:

    
listings_prop.booking_rate.fillna(0, inplace = True)



In [300]:

    
listings_prop.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 29 columns):
prop_id               144 non-null int64
prop_type             144 non-null object
neighborhood          144 non-null object
price                 144 non-null int64
person_capacity       144 non-null int64
picture_count         144 non-null int64
description_length    144 non-null int64
tenure_months         144 non-null int64
number_of_bookings    144 non-null float64
booking_rate          144 non-null float64
Property type 1       144 non-null float64
Property type 2       144 non-null float64
Property type 3       144 non-null float64
Neighborhood 11       144 non-null float64
Neighborhood 12       144 non-null float64
Neighborhood 13       144 non-null float64
Neighborhood 14       144 non-null float64
Neighborhood 15       144 non-null float64
Neighborhood 16       144 non-null float64
Neighborhood 17       144 non-null float64
Neighborhood 18       144 non-null float64
Neighborhood 19       144 non-null float64
Neighborhood 20       144 non-null float64
Neighborhood 21       144 non-null float64
Neighborhood 4        144 non-null float64
Neighborhood 5        144 non-null float64
Neighborhood 7        144 non-null float64
Neighborhood 8        144 non-null float64
Neighborhood 9        144 non-null float64
dtypes: float64(21), int64(6), object(2)

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [301]:

    
from sklearn.cross_validation import train_test_split



In [302]:

    
merged_final = listings_prop.drop(['prop_id', 'prop_type', 'neighborhood', 'number_of_bookings'], axis = 1)
features = merged_final.drop(['booking_rate'], axis = 1).values
target = merged_final['booking_rate'].values



In [304]:









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-304-2f710f4ceb78> in <module>()
----> 1 target.fillna(0)

AttributeError: 'numpy.ndarray' object has no attribute 'fillna'



In [305]:

    
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.3)



In [306]:

    
target = np.log(target)



In [306]:

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [307]:

    
from sklearn.linear_model import LinearRegression
lr = LinearRegression()



In [308]:

    
lr.fit(feature_train, target_train)









    Out[308]:





LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

fit your model with your test sets



In [ ]:

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score



In [ ]:

Interpret the results of the above model:

What does the score method do?
What does this tell us about our model?

...type here...

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to



In [ ]:

	price	person_capacity	picture_count	description_length
count	408.000000	408.000000	408.000000	408.000000
mean	187.806373	2.997549	14.389706	309.159314
std	353.050858	1.594676	10.477428	228.021684
min	39.000000	1.000000	1.000000	0.000000
25%	90.000000	2.000000	6.000000	179.000000
50%	125.000000	2.000000	12.000000	250.000000
75%	199.000000	4.000000	20.000000	389.500000
max	5000.000000	10.000000	71.000000	1969.000000

	prop_id	prop_type	neighborhood	price	person_capacity	picture_count	description_length	tenure_months
48	49	Property type 1	Neighborhood 13	2394	8	10	363	16
196	197	Property type 1	Neighborhood 12	3050	4	6	227	7
289	290	Property type 1	Neighborhood 12	2650	4	6	232	5
290	291	Property type 1	Neighborhood 12	2350	1	6	219	5
340	341	Property type 1	Neighborhood 20	5000	4	6	145	3

	price			person_capacity			picture_count			description_length
	mean	std	median	mean	std	median	mean	std	median	mean	std	median
prop_type
Property type 1	237.085502	425.710534	150	3.516729	1.644955	3	14.695167	10.623651	12.0	313.171004	214.769141	266.0
Property type 2	93.288889	42.261246	89	2.000000	0.846415	2	13.948148	10.255191	11.0	304.851852	255.135332	239.0
Property type 3	63.750000	16.520190	70	1.750000	0.500000	2	8.750000	7.320064	9.5	184.750000	53.093471	192.5

	prop_id	number_of_bookings
0	1	4
1	3	1
2	4	27
3	6	88
4	7	2
...	...	...
323	404	3
324	405	19
325	406	19
326	407	15
327	408	54

	prop_id	prop_type	neighborhood	price	person_capacity	picture_count	description_length	tenure_months	number_of_bookings
0	1	Property type 1	Neighborhood 14	140	3	11	232	30	4
1	2	Property type 1	Neighborhood 14	95	2	3	37	29	0
2	3	Property type 2	Neighborhood 16	95	2	16	172	29	1
3	4	Property type 2	Neighborhood 13	90	2	19	472	28	27
4	5	Property type 1	Neighborhood 15	125	5	21	442	28	0

	prop_id	prop_type	neighborhood	price	person_capacity	picture_count	description_length	tenure_months	number_of_bookings	booking_rate
0	1	Property type 1	Neighborhood 14	140	3	11	232	30	4	0.133333
1	2	Property type 1	Neighborhood 14	95	2	3	37	29	0	0.000000
2	3	Property type 2	Neighborhood 16	95	2	16	172	29	1	0.034483
3	4	Property type 2	Neighborhood 13	90	2	19	472	28	27	0.964286
4	5	Property type 1	Neighborhood 15	125	5	21	442	28	0	0.000000
...	...	...	...	...	...	...	...	...	...	...
403	404	Property type 2	Neighborhood 14	100	1	8	235	1	3	3.000000
404	405	Property type 2	Neighborhood 13	85	2	27	1048	1	19	19.000000
405	406	Property type 1	Neighborhood 9	70	3	18	153	1	19	19.000000
406	407	Property type 1	Neighborhood 13	129	2	13	370	1	15	15.000000
407	408	Property type 1	Neighborhood 14	100	3	21	707	1	54	54.000000

	prop_id	number_of_bookings
0	1	4
1	3	1
2	4	27
3	6	88
4	7	2
...	...	...
323	404	3
324	405	19
325	406	19
326	407	15
327	408	54