In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.
In [3]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# This enables inline Plots
%matplotlib inline
# Limit rows displayed in notebook
pd.set_option('display.max_rows', 10)
# Render floats with 2 decimal places in displayed tables
pd.set_option('display.precision', 2)
In [4]:
pd.__version__
Out[4]:
In [5]:
# Load the two raw data files (paths relative to the notebook's directory)
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')
In [6]:
listings.head(5)
Out[6]:
In [7]:
listings.info()
In [8]:
bookings.head(5)
Out[8]:
In [9]:
bookings.info()
In [10]:
listings.groupby('prop_type').prop_id.count()
Out[10]:
In [11]:
bookings.groupby('prop_id').count()
Out[11]:
In [12]:
print 'price', listings.price.mean(), '/', listings.price.median(), '/', listings.price.std()
print 'capacity', listings.person_capacity.mean(), '/', listings.person_capacity.median(), '/', listings.person_capacity.std()
print 'picture count', listings.picture_count.mean(), '/', listings.picture_count.median(), '/', listings.picture_count.std()
print 'desc length', listings.description_length.mean(), '/', listings.description_length.median(), '/', listings.description_length.std()
print 'tenure, in months', listings.tenure_months.mean(), '/', listings.tenure_months.median(), '/', listings.tenure_months.std()
In [13]:
print 'price by', listings.groupby('prop_type').price.mean()
print 'capacity by', listings.groupby('prop_type').person_capacity.mean()
print 'pictures by', listings.groupby('prop_type').picture_count.mean()
print 'description length by', listings.groupby('prop_type').description_length.mean()
print 'tenure by', listings.groupby('prop_type').tenure_months.mean()
In [14]:
# Mean price per (neighborhood, prop_type) cell, pivoted so that
# neighborhoods are rows and property types are columns.
avg_price = listings.groupby(['neighborhood', 'prop_type'])['price'].mean()
avg_price.unstack('prop_type')
Out[14]:
In [15]:
# Parse the booking_date strings into pandas Timestamps (bracket assignment
# replaces the column in place).
bookings['booking_date'] = pd.to_datetime(bookings['booking_date'])
# BUG FIX: the original `bookings.book_date = ...` attribute assignment does
# NOT create a DataFrame column — it only sets a plain Python attribute on
# the object. The derived calendar dates are kept as a standalone Series
# here, which preserves the original downstream state (no book_date column).
book_date = bookings['booking_date'].map(lambda ts: ts.date())
book_date[0]
# Bookings per day: row counts per unique booking timestamp
bookings_by_date = bookings.groupby('booking_date').count()
In [16]:
bookings_by_date.head(5)
Out[16]:
In [17]:
bookings_by_date.tail(5)
Out[17]:
In [18]:
bookings_by_date.info()
In [20]:
bookings_by_date.hist()
Out[20]:
In [150]:
# wait maybe we aren't supposed to build a histogram -- we may want a line chart / time series instead?
# BUG FIX: the original cell discarded both the unstack() result and the
# rename() result (rename returns a NEW frame), so neither took effect.
# Assign the renamed frame back so the count column is labelled 'bookings'.
bookings_by_date = bookings_by_date.rename(columns={'prop_id': 'bookings'})
bookings_by_date
Out[150]:
In [148]:
bookings_by_date.plot()
Out[148]:
In [22]:
# merge listings data into bookings (by prop_id)
# Make the join key explicit: a bare merge() silently joins on ALL column
# names the two frames happen to share, which is fragile if either file
# gains a column later.
bookinginfo = bookings.merge(listings, on='prop_id')
bookinginfo.info()
In [25]:
bookinginfo.head(5)
Out[25]:
In [251]:
# Group the merged frame by booking date.
neighbookinfo = bookinginfo.groupby(bookinginfo.booking_date)
# NOTE(review): GroupBy.head(5) returns the first 5 rows of EVERY date
# group, not a 5-row preview of the whole frame — likely not what was
# intended here.
neighbookinfo.head(5)
Out[251]:
In [265]:
neigh_new = bookinginfo[['booking_date','neighborhood']]
In [267]:
neigh_new.head(5)
Out[267]:
In [282]:
# Bookings per day, broken out by neighborhood.
# BUG FIX: the original cell passed Series objects to plot(x=..., y=...,
# by=...); DataFrame.plot expects column *labels*, so that call could never
# produce the intended chart (the author's comment says as much). Pivoting
# day x neighborhood counts into columns and plotting gives one line per
# neighborhood.
daily_by_neigh = (neigh_new.groupby(['booking_date', 'neighborhood'])
                           .size()
                           .unstack('neighborhood')
                           .fillna(0))
daily_by_neigh.plot()
In [50]:
# group bookings dataframe by prop_id and merge into listings
# calculate booking_rate
bookrate = bookings.groupby('prop_id').count()
bookrate.rename(columns={'booking_date':'times_booked'}, inplace=True)
bookrate.info()
In [46]:
bookrate.head(5)
Out[46]:
In [62]:
bookrate.info()
In [63]:
bookrate.reset_index(inplace=True)
In [66]:
listinginfo = listings.merge(bookrate, on='prop_id', how='left')
In [74]:
listinginfo['booking_rate'] = (listinginfo.times_booked / listinginfo.tenure_months)
In [75]:
listinginfo.info()
In [224]:
# Keep only listings with at least 10 months of tenure, so booking_rate is
# based on a meaningful observation window.
# .copy() makes srlistings an independent frame rather than a view of
# listinginfo, so later in-place edits (column drops/adds) cannot trigger
# SettingWithCopyWarning or alias the parent frame's data.
srlistings = listinginfo[listinginfo.tenure_months >= 10].copy()
srlistings.info()
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each of these columns of categorical data into many columns of boolean values. (After applying this function correctly there should be 1 column for every prop_type category and 1 column for every neighborhood category.)
In [225]:
srlistings.groupby('prop_type').prop_id.count()
Out[225]:
In [226]:
# need to construct dummy variables in pandas 0.14.1
# One-hot encode the two categorical columns, then attach the new columns
# alongside the originals. Concatenating on axis=1 aligns on the index,
# which is equivalent to the pair of left_index/right_index merges.
prop_type_dummies = pd.get_dummies(srlistings.prop_type)
neighborhood_dummies = pd.get_dummies(srlistings.neighborhood)
srlistings = pd.concat([srlistings, prop_type_dummies, neighborhood_dummies], axis=1)
In [230]:
srlistings.drop(['prop_id','prop_type','neighborhood'], axis=1, inplace=True)
The target (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
In [158]:
from sklearn.cross_validation import train_test_split
In [233]:
# Regressor matrix (X): everything except the target and its numerator.
# times_booked would leak the answer into the features, and booking_rate IS
# the answer. (Removed the superseded, commented-out column-list variant.)
modellistings = srlistings.drop(['times_booked', 'booking_rate'], axis=1)
In [234]:
modellistings.info()
In [235]:
X_train, X_test, y_train, y_test = train_test_split(modellistings,srlistings.booking_rate, test_size=.3)
In [236]:
modellistings.sort_index(by='tenure_months')
Out[236]:
In [237]:
y_train = np.nan_to_num(y_train)
In [238]:
# Ordinary least squares regression model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
In [170]:
#import matplotlib.pylab as plt
#from sklearn.preprocessing import PolynomialFeatures
#from sklearn.pipeline import make_pipeline
#from IPython.core.pylabtools import figsize
#figsize(5,5)
#plt.style.use('fivethirtyeight')
In [239]:
# Fit OLS on the training split and show the fitted coefficients.
lr.fit(X_train,y_train)
# NOTE(review): under Python 2 this print call renders a tuple.
print('Coefficients: \n', lr.coef_)
In [243]:
modellistings.info()
In [240]:
lr.predict(X_test)
Out[240]:
In [241]:
lr.score(X_train, y_train, sample_weight=None)
Out[241]:
The training R^2 estimates the variation in booking rate explained by the regressors at less than 20%.
Other factors than price / size / pics / desc / tenure may be important (e.g. rating, responsiveness). Some of the selected factors (e.g. tenure) may be unimportant and thus add noise to the model. Still need to add the category and neighborhood dummies.
In [ ]:
# Simplify the model: remove neighborhoods, tenure
In [246]:
# Simplified feature set: price / size / pics / desc plus the property-type
# dummies (neighborhoods and tenure removed).
# .loc replaces the deprecated, ambiguous .ix indexer, and a fixed
# random_state makes the split reproducible across runs.
newmodellistings = modellistings.loc[:, ['price', 'person_capacity',
                                         'picture_count', 'description_length',
                                         'Property type 1', 'Property type 2',
                                         'Property type 3']]
X_train, X_test, y_train, y_test = train_test_split(
    newmodellistings, srlistings.booking_rate, test_size=.3, random_state=42)
In [248]:
# Refit on the simplified features; zero out NaN targets as before.
y_train = np.nan_to_num(y_train)
lr.fit(X_train,y_train)
print('Coefficients: \n', lr.coef_)
In [249]:
lr.score(X_train, y_train, sample_weight=None)
Out[249]:
In [ ]:
# Maybe use a lasso or something to constrain the list of terms? With only 144 data points, 24 columns seems excessive