Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [4]:
# Okay!

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [5]:
# pd.read_csv loads each CSV file into a pandas data frame
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')

# .head(5) displays the column headers and the first 5 rows of the data frame
# I used this to check that the data loaded correctly

listings.head(5)
# bookings.head(5)


Out[5]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 Property type 1 Neighborhood 14 140 3 11 232 30
1 2 Property type 1 Neighborhood 14 95 2 3 37 29
2 3 Property type 2 Neighborhood 16 95 2 16 172 29
3 4 Property type 2 Neighborhood 13 90 2 19 472 28
4 5 Property type 1 Neighborhood 15 125 5 21 442 28

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [6]:
# .describe() displays summary statistics (count, mean, std, min, quartiles, max) for each numeric column
    # The median is equivalent to the 50% quantile (the "50%" row).

listings.describe()

# which is the median? 
#     what is this question asking?  

# describe standard dev
#     The standard deviation is a value that shows how spread out the numbers in a set are. 
#     For example, picture_count has a std dev of about 10.5 around a mean of about 14.4. 
#     If the distribution were normal (which it is not), about 68.27% of listings would have a picture count within 1 std dev of the mean, i.e. roughly between 3.9 and 24.9


Out[6]:
prop_id price person_capacity picture_count description_length tenure_months
count 408.000000 408.000000 408.000000 408.000000 408.000000 408.000000
mean 204.500000 187.806373 2.997549 14.389706 309.159314 8.487745
std 117.923704 353.050858 1.594676 10.477428 228.021684 5.872088
min 1.000000 39.000000 1.000000 1.000000 0.000000 1.000000
25% 102.750000 90.000000 2.000000 6.000000 179.000000 4.000000
50% 204.500000 125.000000 2.000000 12.000000 250.000000 7.000000
75% 306.250000 199.000000 4.000000 20.000000 389.500000 13.000000
max 408.000000 5000.000000 10.000000 71.000000 1969.000000 30.000000
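
The question only asks for the mean, median and standard deviation, so a narrower table than the full describe() output may be easier to read. Below is a minimal sketch using .agg; listings_demo is a hypothetical stand-in built from the five rows shown in Out[5], not the full data set.

```python
import pandas as pd

# Hypothetical stand-in for `listings`: the five rows shown in Out[5].
listings_demo = pd.DataFrame({
    'price': [140, 95, 95, 90, 125],
    'person_capacity': [3, 2, 2, 2, 5],
    'picture_count': [11, 3, 16, 19, 21],
    'description_length': [232, 37, 172, 472, 442],
    'tenure_months': [30, 29, 29, 28, 28],
})

cols = ['price', 'person_capacity', 'picture_count',
        'description_length', 'tenure_months']

# One row per statistic, one column per variable.
summary = listings_demo[cols].agg(['mean', 'median', 'std'])
print(summary)
```

The same call on the real listings frame should reproduce the mean, 50% and std rows of Out[6].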

What are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [7]:
# I used groupby to display the information in a table format
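# Select several columns with a list inside the brackets (double brackets); the bare-tuple form is no longer supported by newer pandas
listings.groupby(['prop_type'])[['person_capacity','picture_count','description_length','tenure_months','price']].agg(['mean'])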


Out[7]:
person_capacity picture_count description_length tenure_months price
mean mean mean mean mean
prop_type
Property type 1 3.516729 14.695167 313.171004 8.464684 237.085502
Property type 2 2.000000 13.948148 304.851852 8.377778 93.288889
Property type 3 1.750000 8.750000 184.750000 13.750000 63.750000

Same, but by property type per neighborhood?


In [8]:
# I added a variable to the groupby to show neighborhood. 
# Depending on how you want to look at the data, it might be more interesting to look at property type first, then neighborhood,
# or neighborhood, then prop type. Both are provided. 

# listings.groupby(['prop_type','neighborhood'])[['person_capacity','picture_count','description_length','tenure_months','price']].agg(['mean'])

listings.groupby(['neighborhood','prop_type'])[['person_capacity','picture_count','description_length','tenure_months','price']].agg(['mean'])


Out[8]:
person_capacity picture_count description_length tenure_months price
mean mean mean mean mean
neighborhood prop_type
Neighborhood 1 Property type 1 2.000000 26.000000 209.000000 6.000000 85.000000
Neighborhood 10 Property type 1 3.500000 13.333333 391.000000 3.833333 142.500000
Property type 2 2.000000 20.000000 126.000000 3.500000 137.500000
Neighborhood 11 Property type 1 3.214286 9.928571 379.000000 9.642857 159.428571
Property type 2 2.000000 16.750000 161.250000 11.250000 78.750000
Property type 3 2.000000 15.000000 196.000000 8.000000 75.000000
Neighborhood 12 Property type 1 3.435897 10.820513 267.205128 7.897436 365.615385
Property type 2 1.947368 10.473684 244.526316 9.842105 96.894737
Neighborhood 13 Property type 1 4.061224 15.653061 290.408163 9.122449 241.897959
Property type 2 1.826087 16.695652 418.565217 9.739130 81.130435
Neighborhood 14 Property type 1 3.205882 14.764706 317.205882 8.441176 164.676471
Property type 2 1.857143 15.904762 348.619048 8.714286 83.809524
Property type 3 1.000000 1.000000 113.000000 5.000000 75.000000
Neighborhood 15 Property type 1 3.720000 14.320000 321.760000 9.320000 178.880000
Property type 2 2.266667 11.733333 301.733333 8.200000 95.000000
Neighborhood 16 Property type 1 2.928571 21.642857 310.714286 7.071429 158.928571
Property type 2 2.062500 15.375000 246.250000 6.687500 83.625000
Neighborhood 17 Property type 1 3.521739 16.086957 317.347826 9.869565 189.869565
Property type 2 2.000000 15.454545 308.272727 7.181818 102.454545
Property type 3 2.000000 15.000000 189.000000 23.000000 65.000000
Neighborhood 18 Property type 1 2.954545 16.090909 369.227273 8.227273 173.590909
Property type 2 2.222222 12.333333 297.777778 9.222222 120.666667
Neighborhood 19 Property type 1 3.625000 11.000000 254.500000 6.500000 222.375000
Property type 2 2.000000 15.125000 383.375000 5.500000 88.875000
Neighborhood 2 Property type 1 6.000000 8.000000 423.000000 6.000000 250.000000
Neighborhood 20 Property type 1 2.777778 9.444444 223.555556 9.666667 804.333333
Property type 2 1.000000 3.000000 101.000000 6.000000 60.000000
Neighborhood 21 Property type 1 4.250000 49.000000 306.250000 14.750000 362.500000
Neighborhood 22 Property type 1 3.000000 19.000000 500.000000 9.000000 225.000000
Neighborhood 3 Property type 2 2.000000 7.000000 264.000000 9.000000 60.000000
Neighborhood 4 Property type 2 2.000000 10.000000 95.000000 11.000000 60.000000
Property type 3 2.000000 4.000000 241.000000 19.000000 40.000000
Neighborhood 5 Property type 1 2.500000 8.500000 266.500000 11.500000 194.500000
Neighborhood 6 Property type 1 3.333333 12.666667 290.666667 4.000000 146.000000
Neighborhood 7 Property type 1 3.666667 14.333333 343.000000 5.333333 161.000000
Property type 2 2.000000 3.000000 148.000000 2.000000 100.000000
Neighborhood 8 Property type 1 5.000000 11.000000 300.000000 6.750000 174.750000
Property type 2 4.000000 5.000000 223.000000 3.000000 350.000000
Neighborhood 9 Property type 1 4.285714 13.428571 471.428571 5.714286 151.142857
Property type 2 2.000000 3.500000 114.500000 9.000000 110.000000

Plot daily bookings:


In [9]:
print(type(bookings))
bookings.head(5)
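# sort_values returns a sorted copy; it is not assigned here, so bookings itself keeps its original order
bookings.sort_values(by='booking_date', ascending=False)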
bookingCounts = bookings.groupby(['booking_date'])['prop_id'].agg(['count'])

# Note to self: create a dataframe in order to graph the histogram with labels
# need to make sure that both columns were/are labeled. 

# bookingCounts = bookings['booking_date'].value_counts()
# print(type(bookingCounts))
# df = pd.DataFrame(bookingCounts)
# print(df)
# df.info()
print(bookingCounts)

propBookingCounts = bookings.groupby(['prop_id'])['booking_date'].agg(['count'])

print(propBookingCounts.info())


<class 'pandas.core.frame.DataFrame'>
              count
booking_date       
2011-01-01       11
2011-01-02        9
2011-01-03       10
2011-01-04        8
2011-01-05       15
2011-01-06       14
2011-01-07       14
2011-01-08        6
2011-01-09       10
2011-01-10       14
2011-01-11       15
2011-01-12       20
2011-01-13       20
2011-01-14       14
2011-01-15       11
2011-01-16       11
2011-01-17       21
2011-01-18        7
2011-01-19       14
2011-01-20       13
2011-01-21       21
2011-01-22       11
2011-01-23       14
2011-01-24       14
2011-01-25       16
2011-01-26       11
2011-01-27       16
2011-01-28       15
2011-01-29       10
2011-01-30       18
...             ...
2011-12-02        9
2011-12-03        4
2011-12-04       11
2011-12-05       12
2011-12-06       16
2011-12-07       14
2011-12-08       18
2011-12-09        7
2011-12-10       15
2011-12-11        6
2011-12-12       15
2011-12-13       14
2011-12-14       12
2011-12-15        8
2011-12-16       17
2011-12-17       13
2011-12-18       11
2011-12-19        9
2011-12-20        8
2011-12-21        8
2011-12-22        3
2011-12-23        8
2011-12-24        5
2011-12-25        2
2011-12-26       11
2011-12-27        9
2011-12-28       14
2011-12-29       14
2011-12-30       14
2011-12-31       10

[365 rows x 1 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 328 entries, 1 to 408
Data columns (total 1 columns):
count    328 non-null int64
dtypes: int64(1)
memory usage: 5.1 KB
None

In [10]:
bookingCounts.hist()
# TODO: label the axes. Note this is a histogram of the daily counts, not a plot of bookings over time.


Out[10]:
array([[<matplotlib.axes.AxesSubplot object at 0x10f409d10>]], dtype=object)
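
A histogram shows how often each daily count occurs, while the prompt asks for bookings per day. One possible way to plot the counts over time with labeled axes is sketched below. bookings_demo is made-up data, and the Agg backend is only chosen so the sketch runs outside a notebook (inside the notebook, %matplotlib inline is already set).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `bookings` (made-up rows).
bookings_demo = pd.DataFrame({
    'prop_id': [1, 2, 3, 1, 2, 4],
    'booking_date': ['2011-01-01', '2011-01-01', '2011-01-02',
                     '2011-01-03', '2011-01-03', '2011-01-03'],
})
# Parse the dates so the x axis is a real time axis rather than strings.
bookings_demo['booking_date'] = pd.to_datetime(bookings_demo['booking_date'])

# Number of bookings per day, indexed by date.
daily = bookings_demo.groupby('booking_date')['prop_id'].count()

fig, ax = plt.subplots()
daily.plot(ax=ax)
ax.set_xlabel('Booking date')
ax.set_ylabel('Number of bookings')
ax.set_title('Daily bookings')
```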

Plot the daily bookings per neighborhood (provide a legend)


In [11]:
# First step is to merge the two data frames: bookings has the booking dates, listings has the neighborhood
listMerge = listings.merge(bookings, on='prop_id')
listGroup = listMerge.groupby(['neighborhood','booking_date'])['prop_id'].agg(['count']).unstack(0)

listGroup.plot()


Out[11]:
<matplotlib.axes.AxesSubplot at 0x10bcec210>
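
The unstacked counts contain NaN for days on which a neighborhood had no bookings, which leaves gaps in the lines. A sketch of one alternative, assuming a merged frame shaped like listMerge: count with .size(), fill the missing days with 0, and give the legend a title. merged_demo is made-up data.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `listMerge` (made-up rows).
merged_demo = pd.DataFrame({
    'neighborhood': ['Neighborhood 13', 'Neighborhood 13', 'Neighborhood 14',
                     'Neighborhood 14', 'Neighborhood 13'],
    'booking_date': pd.to_datetime(['2011-01-01', '2011-01-01', '2011-01-01',
                                    '2011-01-02', '2011-01-03']),
    'prop_id': [4, 6, 1, 13, 4],
})

# Rows are dates, columns are neighborhoods; days without bookings become 0.
per_hood = (merged_demo
            .groupby(['booking_date', 'neighborhood'])
            .size()
            .unstack('neighborhood', fill_value=0))

fig, ax = plt.subplots()
per_hood.plot(ax=ax)
ax.set_xlabel('Booking date')
ax.set_ylabel('Number of bookings')
ax.legend(title='Neighborhood')
```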

Part 2 - Develop a data set


In [12]:
listMerge.head()


Out[12]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months booking_date
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 2011-03-09
1 1 Property type 1 Neighborhood 14 140 3 11 232 30 2011-03-07
2 1 Property type 1 Neighborhood 14 140 3 11 232 30 2011-05-24
3 1 Property type 1 Neighborhood 14 140 3 11 232 30 2011-06-18
4 3 Property type 2 Neighborhood 16 95 2 16 172 29 2011-08-16

In [13]:
listGroup.head()


Out[13]:
count
neighborhood Neighborhood 1 Neighborhood 10 Neighborhood 11 Neighborhood 12 Neighborhood 13 Neighborhood 14 Neighborhood 15 Neighborhood 16 Neighborhood 17 Neighborhood 18 ... Neighborhood 20 Neighborhood 21 Neighborhood 22 Neighborhood 3 Neighborhood 4 Neighborhood 5 Neighborhood 6 Neighborhood 7 Neighborhood 8 Neighborhood 9
booking_date
2011-01-01 NaN NaN NaN NaN 4 3 1 NaN 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
2011-01-02 NaN NaN 1 NaN 1 3 1 1 NaN 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-01-03 NaN NaN NaN NaN 5 2 NaN NaN 2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2011-01-04 NaN NaN NaN 1 1 1 2 NaN NaN 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
2011-01-05 NaN NaN 1 NaN 1 6 3 1 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1

5 rows × 21 columns

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [14]:
# @Chad&Ramesh ... I don't understand how this is supposed to work. 
# adding the columns is easy, but how are we supposed to iterate through and include the values?

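# Note: merge defaults to an inner join, so listings with no bookings (about 80 of 408, going by the 328 entries above) are dropped here
listingsWithPropCount = listings.merge(propBookingCounts, left_on='prop_id', right_index=True)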
listingsWithPropCount.rename(columns={'count': 'number_of_bookings'}, inplace=True)

# listings['booking_rate'] = listings.prop_id.map(booking_rate_map)
# listings['booking_rate'] = ""

# !!!   things that don't work:  !!!
# propBookingCounts.rename(columns={'count': 'number_of_bookings'}, inplace=True)
# propBookingCounts.ix[0:2, ['prop_id', 'number_of_bookings']]
# print propBookingCounts.head()
# listings['number_of_bookings'] = propBookingCounts.row_dt.map(lambda x: x.count)
# combiner = lambda x, y: np.where(isnull(x), y, x) 
# listings.combine(propBookingCounts, combiner)

In [15]:
listingsWithPropCount.head()


Out[15]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27
5 6 Property type 2 Neighborhood 13 89 2 10 886 28 88
6 7 Property type 2 Neighborhood 13 85 1 11 58 24 2

In [16]:
# def get_booking_rate(val):
#     if number_of_bookings != 0:
#         return number_of_bookings/listings['tenure_months']
#     else:
#         return 0
#of bookings/tenure_months
    
listingsWithPropCount['booking_rate'] = (listingsWithPropCount['number_of_bookings']/listingsWithPropCount['tenure_months'])
listingsWithPropCount.head()


Out[16]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 0.133333
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1 0.034483
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27 0.964286
5 6 Property type 2 Neighborhood 13 89 2 10 886 28 88 3.142857
6 7 Property type 2 Neighborhood 13 85 1 11 58 24 2 0.083333
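
The output above skips index 1 (prop_id 2), which is what an inner merge does to a listing with no bookings. If those listings should stay in the data with number_of_bookings = 0, a left merge is one option. This is a sketch on made-up frames (listings_demo, bookings_demo), not the cell that produced the output above.

```python
import pandas as pd

# Hypothetical stand-ins: prop_id 2 has no bookings at all.
listings_demo = pd.DataFrame({'prop_id': [1, 2, 3],
                              'tenure_months': [30, 29, 29]})
bookings_demo = pd.DataFrame({
    'prop_id': [1, 1, 1, 1, 3],
    'booking_date': ['2011-03-09', '2011-03-07', '2011-05-24',
                     '2011-06-18', '2011-08-16'],
})

# Bookings per property as a two-column frame: prop_id, number_of_bookings.
counts = (bookings_demo.groupby('prop_id')
          .size()
          .reset_index(name='number_of_bookings'))

# how='left' keeps every listing; unmatched listings get NaN, replaced by 0.
with_counts = listings_demo.merge(counts, on='prop_id', how='left')
with_counts['number_of_bookings'] = (with_counts['number_of_bookings']
                                     .fillna(0).astype(int))
with_counts['booking_rate'] = (with_counts['number_of_bookings']
                               / with_counts['tenure_months'])
print(with_counts)
```

No loop is needed: dividing one column by another works row by row.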

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [17]:
# Build the mask from the same frame that is being filtered; a mask taken from `listings`
# has a different index and makes pandas warn that the Boolean key will be reindexed.
# Note: `> 10` also drops properties with exactly 10 months of tenure; `>= 10` would keep them.
established_properties = listingsWithPropCount[listingsWithPropCount.tenure_months > 10]
established_properties


Out[17]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 0.133333
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1 0.034483
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27 0.964286
5 6 Property type 2 Neighborhood 13 89 2 10 886 28 88 3.142857
6 7 Property type 2 Neighborhood 13 85 1 11 58 24 2 0.083333
7 8 Property type 1 Neighborhood 18 120 2 13 685 23 28 1.217391
8 9 Property type 1 Neighborhood 13 210 6 27 180 23 3 0.130435
9 10 Property type 3 Neighborhood 17 65 2 15 189 23 26 1.130435
10 11 Property type 1 Neighborhood 13 145 3 9 140 22 7 0.318182
11 12 Property type 2 Neighborhood 13 89 2 11 153 22 45 2.045455
12 13 Property type 2 Neighborhood 14 96 2 10 245 20 57 2.850000
13 14 Property type 2 Neighborhood 17 95 2 8 139 20 31 1.550000
14 15 Property type 2 Neighborhood 11 95 2 15 255 19 12 0.631579
15 16 Property type 2 Neighborhood 14 95 2 39 334 19 44 2.315789
16 17 Property type 2 Neighborhood 12 65 2 7 333 19 7 0.368421
18 19 Property type 2 Neighborhood 12 65 2 9 448 19 1 0.052632
19 20 Property type 3 Neighborhood 4 40 2 4 241 19 9 0.473684
20 21 Property type 1 Neighborhood 13 295 5 22 228 18 9 0.500000
22 23 Property type 2 Neighborhood 14 90 3 15 411 18 52 2.888889
23 24 Property type 1 Neighborhood 16 110 2 10 495 17 26 1.529412
24 25 Property type 1 Neighborhood 18 215 2 16 190 17 3 0.176471
25 26 Property type 1 Neighborhood 14 139 2 20 395 17 2 0.117647
26 27 Property type 1 Neighborhood 15 180 6 26 325 17 68 4.000000
27 28 Property type 2 Neighborhood 12 95 2 8 137 16 8 0.500000
28 29 Property type 1 Neighborhood 17 125 3 31 327 16 2 0.125000
29 30 Property type 2 Neighborhood 17 90 3 5 582 16 9 0.562500
32 33 Property type 1 Neighborhood 13 246 8 10 437 16 1 0.062500
34 35 Property type 1 Neighborhood 13 170 4 39 404 16 1 0.062500
42 43 Property type 1 Neighborhood 12 229 4 6 281 16 3 0.187500
43 44 Property type 1 Neighborhood 19 500 4 6 342 16 2 0.125000
... ... ... ... ... ... ... ... ... ... ...
82 83 Property type 1 Neighborhood 14 565 8 24 1111 15 1 0.066667
83 84 Property type 1 Neighborhood 13 125 2 21 208 15 16 1.066667
84 85 Property type 2 Neighborhood 13 59 1 10 205 15 50 3.333333
85 86 Property type 2 Neighborhood 18 75 1 7 197 15 4 0.266667
86 87 Property type 2 Neighborhood 18 59 1 9 167 14 14 1.000000
87 88 Property type 1 Neighborhood 16 80 2 15 145 14 21 1.500000
88 89 Property type 1 Neighborhood 21 500 5 40 896 14 1 0.071429
89 90 Property type 1 Neighborhood 13 145 6 25 297 14 9 0.642857
92 93 Property type 2 Neighborhood 15 79 4 16 418 14 69 4.928571
93 94 Property type 1 Neighborhood 14 105 4 15 210 14 7 0.500000
94 95 Property type 1 Neighborhood 14 195 3 19 214 14 12 0.857143
95 96 Property type 1 Neighborhood 11 135 3 21 303 14 5 0.357143
97 98 Property type 1 Neighborhood 13 139 4 13 409 14 17 1.214286
98 99 Property type 1 Neighborhood 12 150 2 26 425 13 17 1.307692
100 101 Property type 1 Neighborhood 14 69 2 5 100 13 19 1.461538
101 102 Property type 2 Neighborhood 11 70 2 9 0 13 21 1.615385
103 104 Property type 1 Neighborhood 18 40 2 9 321 13 10 0.769231
105 106 Property type 1 Neighborhood 9 95 3 11 248 13 33 2.538462
106 107 Property type 1 Neighborhood 15 160 6 23 304 12 11 0.916667
107 108 Property type 1 Neighborhood 12 135 2 8 297 12 9 0.750000
108 109 Property type 2 Neighborhood 13 80 2 20 491 12 31 2.583333
109 110 Property type 1 Neighborhood 20 350 3 6 135 12 2 0.166667
110 111 Property type 2 Neighborhood 14 85 2 33 248 12 3 0.250000
111 112 Property type 1 Neighborhood 14 129 2 39 759 12 37 3.083333
112 113 Property type 2 Neighborhood 13 50 2 5 514 12 24 2.000000
113 114 Property type 1 Neighborhood 13 325 4 46 227 11 3 0.272727
114 115 Property type 1 Neighborhood 13 180 3 18 256 11 8 0.727273
116 117 Property type 2 Neighborhood 14 49 2 14 417 11 8 0.727273
117 118 Property type 2 Neighborhood 4 60 2 10 95 11 11 1.000000
118 119 Property type 2 Neighborhood 12 55 2 8 333 11 1 0.090909

92 rows × 10 columns

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category).


In [18]:
# is this something new?  I cannot find references to this from lectures?
pd.get_dummies(established_properties)


Out[18]:
prop_id price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate prop_type_Property type 1 prop_type_Property type 2 ... neighborhood_Neighborhood 14 neighborhood_Neighborhood 15 neighborhood_Neighborhood 16 neighborhood_Neighborhood 17 neighborhood_Neighborhood 18 neighborhood_Neighborhood 19 neighborhood_Neighborhood 20 neighborhood_Neighborhood 21 neighborhood_Neighborhood 4 neighborhood_Neighborhood 9
0 1 140 3 11 232 30 4 0.133333 1 0 ... 1 0 0 0 0 0 0 0 0 0
2 3 95 2 16 172 29 1 0.034483 0 1 ... 0 0 1 0 0 0 0 0 0 0
3 4 90 2 19 472 28 27 0.964286 0 1 ... 0 0 0 0 0 0 0 0 0 0
5 6 89 2 10 886 28 88 3.142857 0 1 ... 0 0 0 0 0 0 0 0 0 0
6 7 85 1 11 58 24 2 0.083333 0 1 ... 0 0 0 0 0 0 0 0 0 0
7 8 120 2 13 685 23 28 1.217391 1 0 ... 0 0 0 0 1 0 0 0 0 0
8 9 210 6 27 180 23 3 0.130435 1 0 ... 0 0 0 0 0 0 0 0 0 0
9 10 65 2 15 189 23 26 1.130435 0 0 ... 0 0 0 1 0 0 0 0 0 0
10 11 145 3 9 140 22 7 0.318182 1 0 ... 0 0 0 0 0 0 0 0 0 0
11 12 89 2 11 153 22 45 2.045455 0 1 ... 0 0 0 0 0 0 0 0 0 0
12 13 96 2 10 245 20 57 2.850000 0 1 ... 1 0 0 0 0 0 0 0 0 0
13 14 95 2 8 139 20 31 1.550000 0 1 ... 0 0 0 1 0 0 0 0 0 0
14 15 95 2 15 255 19 12 0.631579 0 1 ... 0 0 0 0 0 0 0 0 0 0
15 16 95 2 39 334 19 44 2.315789 0 1 ... 1 0 0 0 0 0 0 0 0 0
16 17 65 2 7 333 19 7 0.368421 0 1 ... 0 0 0 0 0 0 0 0 0 0
18 19 65 2 9 448 19 1 0.052632 0 1 ... 0 0 0 0 0 0 0 0 0 0
19 20 40 2 4 241 19 9 0.473684 0 0 ... 0 0 0 0 0 0 0 0 1 0
20 21 295 5 22 228 18 9 0.500000 1 0 ... 0 0 0 0 0 0 0 0 0 0
22 23 90 3 15 411 18 52 2.888889 0 1 ... 1 0 0 0 0 0 0 0 0 0
23 24 110 2 10 495 17 26 1.529412 1 0 ... 0 0 1 0 0 0 0 0 0 0
24 25 215 2 16 190 17 3 0.176471 1 0 ... 0 0 0 0 1 0 0 0 0 0
25 26 139 2 20 395 17 2 0.117647 1 0 ... 1 0 0 0 0 0 0 0 0 0
26 27 180 6 26 325 17 68 4.000000 1 0 ... 0 1 0 0 0 0 0 0 0 0
27 28 95 2 8 137 16 8 0.500000 0 1 ... 0 0 0 0 0 0 0 0 0 0
28 29 125 3 31 327 16 2 0.125000 1 0 ... 0 0 0 1 0 0 0 0 0 0
29 30 90 3 5 582 16 9 0.562500 0 1 ... 0 0 0 1 0 0 0 0 0 0
32 33 246 8 10 437 16 1 0.062500 1 0 ... 0 0 0 0 0 0 0 0 0 0
34 35 170 4 39 404 16 1 0.062500 1 0 ... 0 0 0 0 0 0 0 0 0 0
42 43 229 4 6 281 16 3 0.187500 1 0 ... 0 0 0 0 0 0 0 0 0 0
43 44 500 4 6 342 16 2 0.125000 1 0 ... 0 0 0 0 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82 83 565 8 24 1111 15 1 0.066667 1 0 ... 1 0 0 0 0 0 0 0 0 0
83 84 125 2 21 208 15 16 1.066667 1 0 ... 0 0 0 0 0 0 0 0 0 0
84 85 59 1 10 205 15 50 3.333333 0 1 ... 0 0 0 0 0 0 0 0 0 0
85 86 75 1 7 197 15 4 0.266667 0 1 ... 0 0 0 0 1 0 0 0 0 0
86 87 59 1 9 167 14 14 1.000000 0 1 ... 0 0 0 0 1 0 0 0 0 0
87 88 80 2 15 145 14 21 1.500000 1 0 ... 0 0 1 0 0 0 0 0 0 0
88 89 500 5 40 896 14 1 0.071429 1 0 ... 0 0 0 0 0 0 0 1 0 0
89 90 145 6 25 297 14 9 0.642857 1 0 ... 0 0 0 0 0 0 0 0 0 0
92 93 79 4 16 418 14 69 4.928571 0 1 ... 0 1 0 0 0 0 0 0 0 0
93 94 105 4 15 210 14 7 0.500000 1 0 ... 1 0 0 0 0 0 0 0 0 0
94 95 195 3 19 214 14 12 0.857143 1 0 ... 1 0 0 0 0 0 0 0 0 0
95 96 135 3 21 303 14 5 0.357143 1 0 ... 0 0 0 0 0 0 0 0 0 0
97 98 139 4 13 409 14 17 1.214286 1 0 ... 0 0 0 0 0 0 0 0 0 0
98 99 150 2 26 425 13 17 1.307692 1 0 ... 0 0 0 0 0 0 0 0 0 0
100 101 69 2 5 100 13 19 1.461538 1 0 ... 1 0 0 0 0 0 0 0 0 0
101 102 70 2 9 0 13 21 1.615385 0 1 ... 0 0 0 0 0 0 0 0 0 0
103 104 40 2 9 321 13 10 0.769231 1 0 ... 0 0 0 0 1 0 0 0 0 0
105 106 95 3 11 248 13 33 2.538462 1 0 ... 0 0 0 0 0 0 0 0 0 1
106 107 160 6 23 304 12 11 0.916667 1 0 ... 0 1 0 0 0 0 0 0 0 0
107 108 135 2 8 297 12 9 0.750000 1 0 ... 0 0 0 0 0 0 0 0 0 0
108 109 80 2 20 491 12 31 2.583333 0 1 ... 0 0 0 0 0 0 0 0 0 0
109 110 350 3 6 135 12 2 0.166667 1 0 ... 0 0 0 0 0 0 1 0 0 0
110 111 85 2 33 248 12 3 0.250000 0 1 ... 1 0 0 0 0 0 0 0 0 0
111 112 129 2 39 759 12 37 3.083333 1 0 ... 1 0 0 0 0 0 0 0 0 0
112 113 50 2 5 514 12 24 2.000000 0 1 ... 0 0 0 0 0 0 0 0 0 0
113 114 325 4 46 227 11 3 0.272727 1 0 ... 0 0 0 0 0 0 0 0 0 0
114 115 180 3 18 256 11 8 0.727273 1 0 ... 0 0 0 0 0 0 0 0 0 0
116 117 49 2 14 417 11 8 0.727273 0 1 ... 1 0 0 0 0 0 0 0 0 0
117 118 60 2 10 95 11 11 1.000000 0 1 ... 0 0 0 0 0 0 0 0 1 0
118 119 55 2 8 333 11 1 0.090909 0 1 ... 0 0 0 0 0 0 0 0 0 0

92 rows × 24 columns
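
The call above returns a new data frame and does not change established_properties, so the result has to be assigned before it can feed a model. A minimal sketch, assuming the goal is the X and y described in the next prompt; props_demo is a made-up three-row stand-in.

```python
import pandas as pd

# Hypothetical stand-in for `established_properties` (three illustrative rows).
props_demo = pd.DataFrame({
    'prop_id': [1, 3, 10],
    'prop_type': ['Property type 1', 'Property type 2', 'Property type 3'],
    'neighborhood': ['Neighborhood 14', 'Neighborhood 16', 'Neighborhood 17'],
    'price': [140, 95, 65],
    'number_of_bookings': [4, 1, 26],
    'booking_rate': [0.133333, 0.034483, 1.130435],
})

# `columns=` encodes only the named columns and drops the originals.
encoded = pd.get_dummies(props_demo, columns=['prop_type', 'neighborhood'])

# y is the booking rate; X is everything except the id and booking columns.
y = encoded['booking_rate']
X = encoded.drop(['prop_id', 'booking_rate', 'number_of_bookings'], axis=1)
print(X.columns.tolist())
```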

Create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [19]:
# @Chad/Ramesh -- its not clear to me what is supposed to be done here.  Are we supposed to graph the data?

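# sklearn.cross_validation was removed in newer scikit-learn; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split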
from IPython.core.pylabtools import figsize
from sklearn.linear_model import LinearRegression

In [20]:
x = established_properties[['price','person_capacity','picture_count','description_length','tenure_months']].values
y = established_properties['booking_rate'].values

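# Note: test_size=0.8 holds out 80% of the rows, leaving only about 18 of the 92 for training;
# test_size=0.2 is the more usual split. The split is random, so results change from run to run.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8)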

clf = LinearRegression()
clf.fit(x_train, y_train)


Out[20]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [21]:
ratePrediction = clf.predict(x_test)
sum_sq_model = np.sum((y_test - ratePrediction) ** 2)
sum_sq_model


Out[21]:
205.89953307564807

In [22]:
# Naive baseline: always predict the mean booking rate
sum_sq_naive = np.sum((y_test - y.mean()) ** 2)
sum_sq_naive


Out[22]:
92.993147553323894
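
Here the model's squared error (about 205.9) is larger than the naive baseline's (about 93.0), so on this split the regression does worse than always guessing the mean. The sketch below repeats the comparison on synthetic data with a more conventional split. The 0.2 test size, the fixed random_state and the training-set mean as baseline are my assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear signal (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 3))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, size=100)

# Hold out 20% for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Sum of squared errors of the model on the held-out rows.
sse_model = np.sum((y_test - model.predict(X_test)) ** 2)
# Baseline: predict the training mean for every held-out row.
sse_naive = np.sum((y_test - y_train.mean()) ** 2)
print(sse_model, sse_naive)
```

Using y_train.mean() keeps the baseline from seeing the test rows, which is the same restriction the model is under.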

In [23]:
fig, ax = plt.subplots(1, 1)

ax.scatter(ratePrediction, y_test)
ax.set_xlabel('Predicted booking rate')
ax.set_ylabel('Actual booking rate')

# Draw the ideal line
ax.plot(y, y, 'r')


Out[23]:
[<matplotlib.lines.Line2D at 0x10ffbc750>]

In [1]:
# a, b = np.arange(10).reshape((5, 2)), range(5)

# a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.33, random_state=42)

# print a_train
# print a_test
# print b_train
# print b_test

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [25]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

fit your model with your training sets


In [25]:

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score


In [34]:
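# Note: this scores on all rows (training and test together); clf.score(x_test, y_test) would measure held-out performance only
clf.score(x, y)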


Out[34]:
-0.91027066304371318

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

...type here...
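
As a starting point for the interpretation: LinearRegression.score returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of y from its mean. The sketch below checks that by hand on synthetic data and shows how a prediction that is worse than the mean gives a negative value, as in Out[34]. All data and variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a clear linear relationship (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)      # error left over after the model
ss_tot = np.sum((y - y.mean()) ** 2)  # error of always predicting the mean
r2_by_hand = 1 - ss_res / ss_tot

# A constant guess far from the data is worse than the mean, so R^2 < 0.
bad_pred = np.full_like(y, y.mean() + 10)
r2_bad = 1 - np.sum((y - bad_pred) ** 2) / ss_tot

print(r2_by_hand, model.score(X, y), r2_bad)
```

A score of 1 means perfect predictions, 0 means no better than the mean, and a negative score means worse than the mean.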

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [ ]: