Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and to begin our exploration of regression.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
bookings=pd.read_csv('../data/bookings.csv')

In [3]:
bookings.info()
bookings.head(5)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null object
dtypes: int64(1), object(1)
memory usage: 142.4+ KB
Out[3]:
prop_id booking_date
0 9 2011-06-17
1 13 2011-08-12
2 21 2011-06-20
3 28 2011-05-05
4 29 2011-11-17

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [4]:
listings=pd.read_csv('../data/listings.csv')

In [5]:
listings.info()
listings.head(5)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB
Out[5]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 Property type 1 Neighborhood 14 140 3 11 232 30
1 2 Property type 1 Neighborhood 14 95 2 3 37 29
2 3 Property type 2 Neighborhood 16 95 2 16 172 29
3 4 Property type 2 Neighborhood 13 90 2 19 472 28
4 5 Property type 1 Neighborhood 15 125 5 21 442 28

What are the mean, median, and standard deviation of price, person capacity, picture count, description length, and tenure of the properties?


In [6]:
listings.price.mean()


Out[6]:
187.80637254901961

In [7]:
listings.mean(axis=0)


Out[7]:
prop_id               204.500000
price                 187.806373
person_capacity         2.997549
picture_count          14.389706
description_length    309.159314
tenure_months           8.487745
dtype: float64

In [8]:
listings.median(axis=0)


Out[8]:
prop_id               204.5
price                 125.0
person_capacity         2.0
picture_count          12.0
description_length    250.0
tenure_months           7.0
dtype: float64

In [9]:
listings.std(axis=0)


Out[9]:
prop_id               117.923704
price                 353.050858
person_capacity         1.594676
picture_count          10.477428
description_length    228.021684
tenure_months           5.872088
dtype: float64
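
The three cells above can also be collapsed into a single call; a sketch using DataFrame.agg (available in newer pandas versions than the one used here), where num_cols is just a convenience name for the numeric columns of interest:

num_cols = ['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']
listings[num_cols].agg(['mean', 'median', 'std'])  # one row per statistic, one column per variable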

What are the mean price, person capacity, picture count, description length, and tenure of the properties grouped by property type?


In [10]:
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Out[10]:
price person_capacity picture_count description_length tenure_months
prop_type
Property type 1 237.085502 3.516729 14.695167 313.171004 8.464684
Property type 2 93.288889 2.000000 13.948148 304.851852 8.377778
Property type 3 63.750000 1.750000 8.750000 184.750000 13.750000

Same as above, but grouped by property type within each neighborhood?


In [11]:
listings.groupby(['neighborhood', 'prop_type'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Out[11]:
price person_capacity picture_count description_length tenure_months
neighborhood prop_type
Neighborhood 1 Property type 1 85.000000 2.000000 26.000000 209.000000 6.000000
Neighborhood 10 Property type 1 142.500000 3.500000 13.333333 391.000000 3.833333
Property type 2 137.500000 2.000000 20.000000 126.000000 3.500000
Neighborhood 11 Property type 1 159.428571 3.214286 9.928571 379.000000 9.642857
Property type 2 78.750000 2.000000 16.750000 161.250000 11.250000
Property type 3 75.000000 2.000000 15.000000 196.000000 8.000000
Neighborhood 12 Property type 1 365.615385 3.435897 10.820513 267.205128 7.897436
Property type 2 96.894737 1.947368 10.473684 244.526316 9.842105
Neighborhood 13 Property type 1 241.897959 4.061224 15.653061 290.408163 9.122449
Property type 2 81.130435 1.826087 16.695652 418.565217 9.739130
Neighborhood 14 Property type 1 164.676471 3.205882 14.764706 317.205882 8.441176
Property type 2 83.809524 1.857143 15.904762 348.619048 8.714286
Property type 3 75.000000 1.000000 1.000000 113.000000 5.000000
Neighborhood 15 Property type 1 178.880000 3.720000 14.320000 321.760000 9.320000
Property type 2 95.000000 2.266667 11.733333 301.733333 8.200000
Neighborhood 16 Property type 1 158.928571 2.928571 21.642857 310.714286 7.071429
Property type 2 83.625000 2.062500 15.375000 246.250000 6.687500
Neighborhood 17 Property type 1 189.869565 3.521739 16.086957 317.347826 9.869565
Property type 2 102.454545 2.000000 15.454545 308.272727 7.181818
Property type 3 65.000000 2.000000 15.000000 189.000000 23.000000
Neighborhood 18 Property type 1 173.590909 2.954545 16.090909 369.227273 8.227273
Property type 2 120.666667 2.222222 12.333333 297.777778 9.222222
Neighborhood 19 Property type 1 222.375000 3.625000 11.000000 254.500000 6.500000
Property type 2 88.875000 2.000000 15.125000 383.375000 5.500000
Neighborhood 2 Property type 1 250.000000 6.000000 8.000000 423.000000 6.000000
Neighborhood 20 Property type 1 804.333333 2.777778 9.444444 223.555556 9.666667
Property type 2 60.000000 1.000000 3.000000 101.000000 6.000000
Neighborhood 21 Property type 1 362.500000 4.250000 49.000000 306.250000 14.750000
Neighborhood 22 Property type 1 225.000000 3.000000 19.000000 500.000000 9.000000
Neighborhood 3 Property type 2 60.000000 2.000000 7.000000 264.000000 9.000000
Neighborhood 4 Property type 2 60.000000 2.000000 10.000000 95.000000 11.000000
Property type 3 40.000000 2.000000 4.000000 241.000000 19.000000
Neighborhood 5 Property type 1 194.500000 2.500000 8.500000 266.500000 11.500000
Neighborhood 6 Property type 1 146.000000 3.333333 12.666667 290.666667 4.000000
Neighborhood 7 Property type 1 161.000000 3.666667 14.333333 343.000000 5.333333
Property type 2 100.000000 2.000000 3.000000 148.000000 2.000000
Neighborhood 8 Property type 1 174.750000 5.000000 11.000000 300.000000 6.750000
Property type 2 350.000000 4.000000 5.000000 223.000000 3.000000
Neighborhood 9 Property type 1 151.142857 4.285714 13.428571 471.428571 5.714286
Property type 2 110.000000 2.000000 3.500000 114.500000 9.000000
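
The same table can be produced with pivot_table, which some find more readable; a sketch (aggfunc='mean' is pandas' default, written out here for clarity):

listings.pivot_table(
    index=['neighborhood', 'prop_type'],
    values=['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months'],
    aggfunc='mean',
)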

Plot daily bookings:


In [12]:
bookings.booking_date=pd.to_datetime(bookings.booking_date)
print(dir(bookings.booking_date[0]))


['__add__', '__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pyx_vtable__', '__qualname__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rsub__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__weakref__', '_date_repr', '_get_field', '_get_start_end_field', '_has_time_component', '_repr_base', '_time_repr', 'asm8', 'astimezone', 'combine', 'ctime', 'date', 'day', 'dayofweek', 'dayofyear', 'dst', 'freq', 'freqstr', 'fromordinal', 'fromtimestamp', 'hour', 'is_month_end', 'is_month_start', 'is_quarter_end', 'is_quarter_start', 'is_year_end', 'is_year_start', 'isocalendar', 'isoformat', 'isoweekday', 'max', 'microsecond', 'min', 'minute', 'month', 'nanosecond', 'now', 'offset', 'quarter', 'replace', 'resolution', 'second', 'strftime', 'strptime', 'time', 'timetuple', 'timetz', 'to_datetime', 'to_julian_date', 'to_period', 'to_pydatetime', 'today', 'toordinal', 'tz', 'tz_convert', 'tz_localize', 'tzinfo', 'tzname', 'utcfromtimestamp', 'utcnow', 'utcoffset', 'utctimetuple', 'value', 'week', 'weekday', 'weekofyear', 'year']

In [13]:
bookings.booking_date.value_counts().plot()


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x109e5b9d0>
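
Note that value_counts() orders by count, not by date, so the line plot above connects dates out of chronological order. Sorting the index first gives a proper time series; a sketch:

daily = bookings.booking_date.value_counts().sort_index()  # counts indexed by date, in date order
daily.plot()
plt.ylabel('bookings per day')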

In [14]:
bookings.info()
bookings.booking_date.head(5)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)
memory usage: 142.4 KB
Out[14]:
0   2011-06-17
1   2011-08-12
2   2011-06-20
3   2011-05-05
4   2011-11-17
Name: booking_date, dtype: datetime64[ns]

Plot the daily bookings per neighborhood (provide a legend)


In [15]:
listMerge = listings.merge(bookings, on='prop_id')
listMerge.groupby(['neighborhood','booking_date'])['prop_id'].agg(['count']).unstack(0).plot()


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x109f71a10>
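
If the legend crowds the axes, the same plot can be built with an explicit legend placement; a sketch (the figsize and legend location are arbitrary choices):

per_hood = listMerge.groupby(['booking_date', 'neighborhood']).size().unstack('neighborhood')
ax = per_hood.plot(figsize=(12, 6))
ax.legend(loc='upper left', ncol=2, title='neighborhood')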

Part 2 - Develop a data set


In [17]:
listings.columns


Out[17]:
Index([u'prop_id', u'prop_type', u'neighborhood', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months'], dtype='object')

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [18]:
bookings.head(5)


Out[18]:
prop_id booking_date
0 9 2011-06-17
1 13 2011-08-12
2 21 2011-06-20
3 28 2011-05-05
4 29 2011-11-17

In [19]:
book_by_prop=bookings.groupby('prop_id')[['prop_id']].count()
book_by_prop.head()


Out[19]:
prop_id
prop_id
1 4
3 1
4 27
6 88
7 2

In [20]:
book_by_prop.rename(columns={'prop_id':'number_of_bookings'}, inplace=True)

In [21]:
book_by_prop.reset_index(inplace=True)

In [22]:
book_by_prop.info()
book_by_prop.head(10)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 328 entries, 0 to 327
Data columns (total 2 columns):
prop_id               328 non-null int64
number_of_bookings    328 non-null int64
dtypes: int64(2)
memory usage: 7.7 KB
Out[22]:
prop_id number_of_bookings
0 1 4
1 3 1
2 4 27
3 6 88
4 7 2
5 8 28
6 9 3
7 10 26
8 11 7
9 12 45

In [23]:
listings=listings.merge(book_by_prop, on='prop_id', how='left')

In [24]:
listings.fillna(0.0, inplace=True)

In [25]:
listings.head(10)


Out[25]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 0
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 0
5 6 Property type 2 Neighborhood 13 89 2 10 886 28 88
6 7 Property type 2 Neighborhood 13 85 1 11 58 24 2
7 8 Property type 1 Neighborhood 18 120 2 13 685 23 28
8 9 Property type 1 Neighborhood 13 210 6 27 180 23 3
9 10 Property type 3 Neighborhood 17 65 2 15 189 23 26

In [26]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 9 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
number_of_bookings    408 non-null float64
dtypes: float64(1), int64(6), object(2)
memory usage: 31.9+ KB

In [27]:
listings['booking_rate']=listings.number_of_bookings/listings.tenure_months
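
For reference, the same two columns can be built more compactly with value_counts and map; a sketch, assuming it is applied to the original listings frame before the merge above:

counts = bookings['prop_id'].value_counts()  # number of bookings per property
listings['number_of_bookings'] = listings['prop_id'].map(counts).fillna(0)
listings['booking_rate'] = listings['number_of_bookings'] / listings['tenure_months']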

We only want to analyze well-established properties, so let's filter out any properties that have a tenure of 10 months or less.


In [28]:
listings=listings[listings.tenure_months>10]

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category).


In [29]:
pd.get_dummies(listings.prop_type)


Out[29]:
Property type 1 Property type 2 Property type 3
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 1 0 0
5 0 1 0
6 0 1 0
7 1 0 0
8 1 0 0
9 0 0 1
10 1 0 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 1 0
15 0 1 0
16 0 1 0
17 0 1 0
18 0 1 0
19 0 0 1
20 1 0 0
21 1 0 0
22 0 1 0
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 0 1 0
28 1 0 0
29 0 1 0
... ... ... ...
90 1 0 0
91 0 1 0
92 0 1 0
93 1 0 0
94 1 0 0
95 1 0 0
96 1 0 0
97 1 0 0
98 1 0 0
99 0 1 0
100 1 0 0
101 0 1 0
102 1 0 0
103 1 0 0
104 1 0 0
105 1 0 0
106 1 0 0
107 1 0 0
108 0 1 0
109 1 0 0
110 0 1 0
111 1 0 0
112 0 1 0
113 1 0 0
114 1 0 0
115 1 0 0
116 0 1 0
117 0 1 0
118 0 1 0
119 0 1 0

120 rows × 3 columns


In [30]:
pd.get_dummies(listings.neighborhood)


Out[30]:
Neighborhood 11 Neighborhood 12 Neighborhood 13 Neighborhood 14 Neighborhood 15 Neighborhood 16 Neighborhood 17 Neighborhood 18 Neighborhood 19 Neighborhood 20 Neighborhood 21 Neighborhood 4 Neighborhood 5 Neighborhood 9
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 0 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 1 0 0 0 0 0 0
8 0 0 1 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 1 0 0 0 0 0 0 0
10 0 0 1 0 0 0 0 0 0 0 0 0 0 0
11 0 0 1 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 1 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 1 0 0 0 0 0 0 0
14 1 0 0 0 0 0 0 0 0 0 0 0 0 0
15 0 0 0 1 0 0 0 0 0 0 0 0 0 0
16 0 1 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 0 1 0 0 0 0 0 0 0 0 0 0
18 0 1 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 1 0 0
20 0 0 1 0 0 0 0 0 0 0 0 0 0 0
21 0 1 0 0 0 0 0 0 0 0 0 0 0 0
22 0 0 0 1 0 0 0 0 0 0 0 0 0 0
23 0 0 0 0 0 1 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 1 0 0 0 0 0 0
25 0 0 0 1 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 1 0 0 0 0 0 0 0 0 0
27 0 1 0 0 0 0 0 0 0 0 0 0 0 0
28 0 0 0 0 0 0 1 0 0 0 0 0 0 0
29 0 0 0 0 0 0 1 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
90 0 0 0 0 0 0 1 0 0 0 0 0 0 0
91 0 1 0 0 0 0 0 0 0 0 0 0 0 0
92 0 0 0 0 1 0 0 0 0 0 0 0 0 0
93 0 0 0 1 0 0 0 0 0 0 0 0 0 0
94 0 0 0 1 0 0 0 0 0 0 0 0 0 0
95 1 0 0 0 0 0 0 0 0 0 0 0 0 0
96 0 0 0 0 0 0 0 1 0 0 0 0 0 0
97 0 0 1 0 0 0 0 0 0 0 0 0 0 0
98 0 1 0 0 0 0 0 0 0 0 0 0 0 0
99 0 0 0 0 1 0 0 0 0 0 0 0 0 0
100 0 0 0 1 0 0 0 0 0 0 0 0 0 0
101 1 0 0 0 0 0 0 0 0 0 0 0 0 0
102 0 1 0 0 0 0 0 0 0 0 0 0 0 0
103 0 0 0 0 0 0 0 1 0 0 0 0 0 0
104 0 0 0 0 0 0 0 0 0 1 0 0 0 0
105 0 0 0 0 0 0 0 0 0 0 0 0 0 1
106 0 0 0 0 1 0 0 0 0 0 0 0 0 0
107 0 1 0 0 0 0 0 0 0 0 0 0 0 0
108 0 0 1 0 0 0 0 0 0 0 0 0 0 0
109 0 0 0 0 0 0 0 0 0 1 0 0 0 0
110 0 0 0 1 0 0 0 0 0 0 0 0 0 0
111 0 0 0 1 0 0 0 0 0 0 0 0 0 0
112 0 0 1 0 0 0 0 0 0 0 0 0 0 0
113 0 0 1 0 0 0 0 0 0 0 0 0 0 0
114 0 0 1 0 0 0 0 0 0 0 0 0 0 0
115 0 0 0 0 0 0 1 0 0 0 0 0 0 0
116 0 0 0 1 0 0 0 0 0 0 0 0 0 0
117 0 0 0 0 0 0 0 0 0 0 0 1 0 0
118 0 1 0 0 0 0 0 0 0 0 0 0 0 0
119 0 0 0 0 0 0 0 0 0 0 0 0 0 1

120 rows × 14 columns
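
The two cells above only display the dummies; to actually attach them to the data set, one option is to concatenate them onto listings. A sketch, where listings_d is just a hypothetical name (newer pandas can do the same in one call with pd.get_dummies(listings, columns=['prop_type', 'neighborhood'])):

dummies = pd.concat(
    [pd.get_dummies(listings['prop_type']), pd.get_dummies(listings['neighborhood'])],
    axis=1,
)
listings_d = pd.concat([listings, dummies], axis=1)  # original columns plus one indicator column per category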

Create test and training sets for your regressors and target.

The target (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [31]:
listings.booking_rate.hist()


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a5c8710>

In [33]:
np.log(listings.booking_rate)


Out[33]:
0    -2.014903
1         -inf
2    -3.367296
3    -0.036368
4         -inf
5     1.145132
6    -2.484907
7     0.196710
8    -2.036882
9     0.122602
10   -1.145132
11    0.715620
12    1.047319
13    0.438255
14   -0.459532
...
105    0.931558
106   -0.087011
107   -0.287682
108    0.949081
109   -1.791759
110   -1.386294
111    1.126011
112    0.693147
113   -1.299283
114   -0.318454
115        -inf
116   -0.318454
117    0.000000
118   -2.397895
119        -inf
Name: booking_rate, Length: 120, dtype: float64
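
The -inf values come from listings with zero bookings (log of 0). If a log transform of the target is wanted, np.log1p avoids this; a sketch:

np.log1p(listings.booking_rate)  # log(1 + rate), finite even when the booking rate is 0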

In [34]:
from sklearn.cross_validation import train_test_split
feature_cols = ['price', 'tenure_months','person_capacity','description_length','picture_count']
a, b = listings[feature_cols], listings.booking_rate
a_train, a_test, b_train, b_test=train_test_split(a,b, test_size=0.33)
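
An alternative construction closer to the drop-labels hint, using the dummified frame from the sketch above (listings_d is hypothetical; random_state is an arbitrary choice added for reproducibility; note that in newer scikit-learn, train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation):

X = listings_d.drop(['prop_id', 'prop_type', 'neighborhood', 'booking_rate', 'number_of_bookings'], axis=1)
y = listings_d['booking_rate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)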

In [35]:
listings.info()
a_train


<class 'pandas.core.frame.DataFrame'>
Int64Index: 120 entries, 0 to 119
Data columns (total 10 columns):
prop_id               120 non-null int64
prop_type             120 non-null object
neighborhood          120 non-null object
price                 120 non-null int64
person_capacity       120 non-null int64
picture_count         120 non-null int64
description_length    120 non-null int64
tenure_months         120 non-null int64
number_of_bookings    120 non-null float64
booking_rate          120 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 10.3+ KB
Out[35]:
array([[125,  16,   3, 327,  31],
       [ 65,  19,   2, 333,   7],
       [ 40,  19,   2, 241,   4],
       [150,  13,   2,  83,   6],
       [215,  17,   2, 190,  16],
       [233,  16,   8, 404,  12],
       [285,  16,   5, 241,   6],
       [200,  15,   4, 241,   7],
       [100,  11,   1, 212,   4],
       [326,  16,   6, 301,   6],
       [410,  16,   8, 305,  18],
       [125,  11,   2,  89,   9],
       [250,  16,   3, 289,   6],
       [350,  16,   4, 102,   5],
       [175,  16,   2, 255,   6],
       [149,  15,   3, 264,  10],
       [130,  16,   2, 240,   6],
       [230,  16,   3, 283,   6],
       [320,  16,   6, 341,   6],
       [180,  17,   6, 325,  26],
       [296,  16,   3, 283,   6],
       [ 89,  22,   2, 153,  11],
       [234,  16,   7, 361,  12],
       [ 59,  15,   1, 205,  10],
       [115,  14,   2, 340,   4],
       [294,  16,   6, 287,   5],
       [ 95,  15,   2, 308,  26],
       [200,  16,   2, 288,   6],
       [120,  23,   2, 685,  13],
       [295,  18,   5, 228,  22],
       [125,  15,   2, 208,  21],
       [ 80,  12,   2, 491,  20],
       [195,  14,   3, 214,  19],
       [ 95,  16,   2, 137,   8],
       [287,  16,   8, 309,  11],
       [164,  16,   2, 351,  11],
       [350,  12,   3, 135,   6],
       [ 49,  11,   2, 417,  14],
       [275,  15,   8, 457,  19],
       [264,  16,   4, 576,  12],
       [229,  16,   4, 281,   6],
       [135,  12,   2, 297,   8],
       [ 90,  28,   2, 472,  19],
       [ 95,  19,   2, 334,  39],
       [229,  16,   4, 388,   5],
       [105,  14,   4, 210,  15],
       [ 95,  15,   3, 315,   6],
       [145,  22,   3, 140,   9],
       [ 95,  29,   2, 172,  16],
       [ 90,  18,   3, 411,  15],
       [246,  16,   8, 437,  10],
       [139,  17,   2, 395,  20],
       [170,  16,   4, 404,  39],
       [ 75,  15,   1, 197,   7],
       [300,  15,   4, 116,  63],
       [300,  15,   4,  84,  22],
       [350,  15,   4, 129,  71],
       [ 70,  13,   1, 508,   4],
       [199,  16,   2, 199,   5],
       [150,  13,   2, 425,  26],
       [500,  16,   4, 342,   6],
       [250,  16,   6, 243,   6],
       [ 40,  13,   2, 321,   9],
       [125,  14,   2,  51,   1],
       [145,  14,   6, 297,  25],
       [250,  16,   3, 279,   6],
       [293,  16,   6, 246,  11],
       [210,  23,   6, 180,  27],
       [ 55,  11,   2, 333,   8],
       [229,  16,   4, 388,   6],
       [ 95,  13,   3, 248,  11],
       [ 96,  20,   2, 245,  10],
       [129,  12,   2, 759,  39],
       [100,  19,   2,  39,   5],
       [ 65,  23,   2, 189,  15],
       [160,  12,   6, 304,  23],
       [180,  11,   3, 256,  18],
       [199,  16,   3, 266,   6],
       [ 95,  19,   2, 255,  15],
       [125,  28,   5, 442,  21]])

In [36]:
b_train.reshape(-1,1)


Out[36]:
array([[ 0.125     ],
       [ 0.36842105],
       [ 0.47368421],
       [ 0.        ],
       [ 0.17647059],
       [ 0.        ],
       [ 0.25      ],
       [ 0.        ],
       [ 0.        ],
       [ 0.5       ],
       [ 0.        ],
       [ 0.        ],
       [ 0.0625    ],
       [ 0.        ],
       [ 0.125     ],
       [ 0.13333333],
       [ 0.4375    ],
       [ 0.0625    ],
       [ 0.1875    ],
       [ 4.        ],
       [ 0.125     ],
       [ 2.04545455],
       [ 0.        ],
       [ 3.33333333],
       [ 0.        ],
       [ 0.125     ],
       [ 1.86666667],
       [ 0.125     ],
       [ 1.2173913 ],
       [ 0.5       ],
       [ 1.06666667],
       [ 2.58333333],
       [ 0.85714286],
       [ 0.5       ],
       [ 0.        ],
       [ 0.        ],
       [ 0.16666667],
       [ 0.72727273],
       [ 0.13333333],
       [ 0.        ],
       [ 0.1875    ],
       [ 0.75      ],
       [ 0.96428571],
       [ 2.31578947],
       [ 0.25      ],
       [ 0.5       ],
       [ 0.        ],
       [ 0.31818182],
       [ 0.03448276],
       [ 2.88888889],
       [ 0.0625    ],
       [ 0.11764706],
       [ 0.0625    ],
       [ 0.26666667],
       [ 0.13333333],
       [ 0.26666667],
       [ 0.33333333],
       [ 0.        ],
       [ 0.125     ],
       [ 1.30769231],
       [ 0.125     ],
       [ 0.125     ],
       [ 0.76923077],
       [ 0.        ],
       [ 0.64285714],
       [ 0.125     ],
       [ 0.        ],
       [ 0.13043478],
       [ 0.09090909],
       [ 0.0625    ],
       [ 2.53846154],
       [ 2.85      ],
       [ 3.08333333],
       [ 0.        ],
       [ 1.13043478],
       [ 0.91666667],
       [ 0.72727273],
       [ 0.125     ],
       [ 0.63157895],
       [ 0.        ]])

In [37]:
b_train.shape


Out[37]:
(80,)

In [38]:
# Features to include: price, person capacity, picture count, description length, and tenure months (captured in feature_cols above)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [39]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

In [40]:
#Linear Regression
clf = LinearRegression()
clf.fit(a_train, b_train)


Out[40]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
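
To see which features the model leans on, the fitted coefficients can be paired with their column names; a sketch:

print(clf.intercept_)
pd.Series(clf.coef_, index=feature_cols)  # one coefficient per feature, in feature_cols order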

In [41]:
b_pred = clf.predict(a_test)

In [42]:
a_test
print(b_pred[0], a_test[0])


0.814508380483 [ 80  14   2 145  15]

In [43]:
# Let's compute sum of Errors between Actual and Predicted
# Again, more on this next week - I just want to show how these tools work together
sum_sq_model = np.sum((b_test - b_pred) ** 2)
sum_sq_model


Out[43]:
106.60999687663644

In [44]:
# Compare with the base naive model where we say predicted value is just the mean value
sum_sq_naive = np.sum((b_test - b.mean()) ** 2)
sum_sq_naive


Out[44]:
64.933464984311186

Evaluate your fitted model on your test set


In [45]:
clf.score(a_test,b_test)


Out[45]:
-0.6709818998178676
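
For comparison, the same number can be computed by hand as R² = 1 - SS_res / SS_tot. It differs slightly from the naive comparison above because score() centers on the test-set mean b_test.mean(), not the full-sample mean b.mean(). A sketch:

ss_res = np.sum((b_test - b_pred) ** 2)          # squared error of the model
ss_tot = np.sum((b_test - b_test.mean()) ** 2)   # squared error of predicting the test-set mean
1 - ss_res / ss_tot                              # matches clf.score(a_test, b_test)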

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

The score method returns the coefficient of determination R² of the model's predictions on the data passed to it (here the held-out test set). An R² of about -0.67 means the model explains none of the variance in booking_rate and in fact predicts worse than simply using the mean of the test targets, so price, person capacity, picture count, description length, and tenure alone make a poor linear model of booking rate.

Optional - Iterate

Create an alternative target variable (e.g. monthly revenue) and use the same modeling pattern from Part 3 to model it.
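
A sketch of one way to do this, taking monthly revenue to be price * booking_rate (that definition is an assumption, not part of the assignment):

listings['monthly_revenue'] = listings['price'] * listings['booking_rate']  # assumed definition
Xr, yr = listings[feature_cols], listings['monthly_revenue']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.33)
rev_model = LinearRegression().fit(Xr_train, yr_train)
rev_model.score(Xr_test, yr_test)  # note that price appears on both sides, which inflates this score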


In [ ]: