In this assignment your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and to begin our exploration of regression.
In [172]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns # for pretty layout of plots
import matplotlib.pyplot as plt
# This enables inline plots
%matplotlib inline
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)
In [22]:
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')
listings.info()
listings.columns
# Clean up listings: strip the "Property type " and "Neighborhood " prefixes
# and convert both columns to integer codes.
listings['prop_type'] = listings['prop_type'].str.replace('Property type ', '').astype(int)
listings['neighborhood'] = listings['neighborhood'].str.replace('Neighborhood ', '').astype(int)
listings.info()
In [23]:
print("mean:")
print(listings.drop('prop_id', axis=1).mean())
print("\nmedian:")
print(listings.drop('prop_id', axis=1).median())
print("\nstandard deviation:")
print(listings.drop('prop_id', axis=1).std())
In [24]:
groupedListings = listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].agg(['mean'])
groupedListings
Out[24]:
In [25]:
groupedListings = listings.groupby(['prop_type', 'neighborhood'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].agg(['mean'])
groupedListings
#bookings
Out[25]:
In [206]:
groupedBookings = bookings.groupby('booking_date').agg(['count'])
groupedBookings.hist()
#groupedBookings.plot()  # a line plot is a cleaner view of bookings over time
Out[206]:
In [27]:
combined = bookings.merge(listings, on='prop_id')  # inner join on the shared key
combined
groupedCombined = combined.groupby('neighborhood').agg(['count'])
groupedCombined.plot()
#TODO - add a legend
Out[27]:
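Note that `merge` defaults to an inner join on the shared column(s), so any listing with no bookings drops out of the combined table. A toy sketch (hypothetical values, not the assignment data):

```python
import pandas as pd

bookings = pd.DataFrame({'prop_id': [1, 1, 2], 'booking_date': ['d1', 'd2', 'd3']})
listings = pd.DataFrame({'prop_id': [1, 2, 3], 'price': [100, 80, 120]})

# Default merge is an inner join on the shared column(s),
# so prop_id 3 (which has no bookings) disappears from the result.
combined = bookings.merge(listings)
print(sorted(combined['prop_id'].unique().tolist()))  # [1, 2]
```
Pass `how='left'` (or `how='outer'`) if zero-booking listings should be kept.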
In [28]:
bookings
Out[28]:
In [29]:
listings
Out[29]:
In [30]:
groupedProperty = bookings.groupby(['prop_id']).count()
groupedProperty.reset_index(inplace=True)
In [31]:
groupedProperty.rename(columns = {'booking_date' : 'bookings'}, inplace=True)
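The count-and-rename pattern above can be sketched on toy data (hypothetical values):

```python
import pandas as pd

bookings = pd.DataFrame({'prop_id': [1, 1, 2], 'booking_date': ['d1', 'd2', 'd3']})

# count() tallies non-null values per group; reset_index turns the group
# key back into a column, and rename gives the count a meaningful label.
per_prop = bookings.groupby('prop_id').count().reset_index()
per_prop = per_prop.rename(columns={'booking_date': 'bookings'})
print(per_prop['bookings'].tolist())  # [2, 1]
```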
In [32]:
bookingListings = listings.merge(groupedProperty)
bookingListings
Out[32]:
In [33]:
bookingListings['booking_rate'] = bookingListings['bookings'] / bookingListings['tenure_months']
bookingListings
Out[33]:
In [34]:
establishedProperties = bookingListings[bookingListings['tenure_months'] > 9]
establishedProperties
Out[34]:
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values (after applying this function correctly there should be one column for every prop_type and one column for every neighborhood category).
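As a minimal sketch of what get_dummies does (toy column names and values, not the assignment data):

```python
import pandas as pd

df = pd.DataFrame({'prop_type': [1, 2, 1], 'price': [100, 80, 120]})

# One indicator column appears per distinct category value;
# non-categorical columns (price) pass through unchanged.
dummies = pd.get_dummies(df, columns=['prop_type'])
print(sorted(dummies.columns))  # ['price', 'prop_type_1', 'prop_type_2']
```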
In [163]:
establishedProperties.info()
establishedProperties.prop_type.value_counts()
full_table = pd.get_dummies(establishedProperties, columns=['prop_type', 'neighborhood'])
In [164]:
pd.__version__
full_table
Out[164]:
The predictor (y) is booking_rate; the regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood, and the number of bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
In [194]:
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
full_table = pd.get_dummies(establishedProperties, columns=['prop_type', 'neighborhood'])
features = full_table.drop(['prop_id', 'bookings', 'booking_rate'], axis=1)
full_table['booking_rate'].hist()
# Take the log of booking_rate to reduce skew before regressing.
full_table['log'] = np.log(full_table['booking_rate'])
full_table['log'].hist()
targetDF = full_table['log']
targetDF
target = targetDF.values
target
Out[194]:
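One caveat with the log transform above: if any booking_rate is exactly zero, np.log produces -inf, which most estimators reject. np.log1p (log of 1 + x) is a common alternative that keeps zeros finite; a small sketch:

```python
import numpy as np

rates = np.array([0.0, 0.5, 2.0])

# np.log maps zero to -inf; np.log1p computes log(1 + x),
# so a zero rate stays finite (and maps to 0).
with np.errstate(divide='ignore'):
    raw = np.log(rates)
shifted = np.log1p(rates)
print(np.isinf(raw[0]), shifted[0])  # True 0.0
```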
In [192]:
a_train, a_test, b_train, b_test = train_test_split(features, target, test_size=0.2, random_state=3)
print("a_train = {}".format(a_train.shape))
print("a_test = {}".format(a_test.shape))
print("b_train = {}".format(b_train.shape))
print("b_test = {}".format(b_test.shape))
In [195]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(a_train, b_train)
lr.get_params()
Out[195]:
In [189]:
a_predictions = lr.predict(a_test)
print(a_predictions)
print(a_test)
In [190]:
lr.score(a_train, b_train)
Out[190]:
In [191]:
lr.score(a_test, b_test)
Out[191]:
score() returns the coefficient of determination (R²), which measures how well the predictions match the observed values.
A value close to 1 indicates a near-perfect model; 0 means the model does no better than predicting the mean, and it can go negative on held-out data.
Our model does a mediocre job on the training data, and the test-set score is very poor, so it generalizes badly.
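For context, LinearRegression.score() is exactly R², identical to sklearn.metrics.r2_score; a small synthetic check (toy data, not the assignment's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # strong linear signal + small noise

lr = LinearRegression().fit(X, y)
# score() returns R^2: 1.0 is a perfect fit, 0.0 is no better
# than predicting the mean of y.
assert np.isclose(lr.score(X, y), r2_score(y, lr.predict(X)))
print(round(lr.score(X, y), 2))
```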