Homework 1 - Data Analysis and Regression

In this assignment, your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ are two data files, bookings.csv and listings.csv. The objective is to practice data munging and to begin our exploration of regression.


In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

In [3]:
pd.__version__


Out[3]:
'0.15.1'

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [4]:
# Let's explore the Datasets
bookings = pd.read_csv('../data/bookings.csv', parse_dates=['booking_date'])
listings = pd.read_csv('../data/listings.csv')

In [5]:
bookings.tail()


Out[5]:
prop_id booking_date
6071 408 2011-06-02
6072 408 2011-08-22
6073 408 2011-07-24
6074 408 2011-01-12
6075 408 2011-09-08

In [6]:
bookings.set_index('booking_date', inplace=True)

In [7]:
# Add a constant counter column so per-group counts can be aggregated later
bookings['number_of_bookings'] = 1

In [8]:
help(bookings.resample)


Help on method resample in module pandas.core.generic:

resample(self, rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0) method of pandas.core.frame.DataFrame instance
    Convenience method for frequency conversion and resampling of regular
    time-series data.
    
    Parameters
    ----------
    rule : string
        the offset string or object representing target conversion
    how : string
        method for down- or re-sampling, default to 'mean' for
        downsampling
    axis : int, optional, default 0
    fill_method : string, default None
        fill_method for upsampling
    closed : {'right', 'left'}
        Which side of bin interval is closed
    label : {'right', 'left'}
        Which bin edge label to label bucket with
    convention : {'start', 'end', 's', 'e'}
    kind : "period"/"timestamp"
    loffset : timedelta
        Adjust the resampled time labels
    limit : int, default None
        Maximum size gap to when reindexing with fill_method
    base : int, default 0
        For frequencies that evenly subdivide 1 day, the "origin" of the
        aggregated intervals. For example, for '5min' frequency, base could
        range from 0 through 4. Defaults to 0


In [9]:
bookings.resample('M', how='count').number_of_bookings.plot()


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1094dcf10>
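The `how=` keyword matches the pandas 0.15 API shown above; in later pandas versions it was removed, and the same monthly count is written as a chained aggregation. A minimal sketch on synthetic dates (values illustrative):

```python
import pandas as pd

# Synthetic daily booking events across two months
idx = pd.to_datetime(['2011-01-05', '2011-01-20', '2011-02-03'])
df = pd.DataFrame({'number_of_bookings': [1, 1, 1]}, index=idx)

# Modern equivalent of resample('M', how='count')
# (recent pandas prefers the 'ME' alias over 'M')
monthly = df.resample('M').count()
print(monthly['number_of_bookings'].tolist())  # [2, 1]
```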

In [10]:
bookings.resample('D')


Out[10]:
prop_id number_of_bookings
booking_date
2011-01-01 183.181818 1
2011-01-02 169.222222 1
2011-01-03 170.300000 1
2011-01-04 223.875000 1
2011-01-05 162.200000 1
2011-01-06 204.285714 1
2011-01-07 144.571429 1
2011-01-08 230.833333 1
2011-01-09 151.400000 1
2011-01-10 227.928571 1
2011-01-11 191.600000 1
2011-01-12 192.450000 1
2011-01-13 185.350000 1
2011-01-14 209.857143 1
2011-01-15 196.727273 1
2011-01-16 190.909091 1
2011-01-17 219.190476 1
2011-01-18 234.000000 1
2011-01-19 221.285714 1
2011-01-20 241.076923 1
2011-01-21 206.809524 1
2011-01-22 228.000000 1
2011-01-23 143.142857 1
2011-01-24 281.142857 1
2011-01-25 223.062500 1
2011-01-26 166.909091 1
2011-01-27 199.750000 1
2011-01-28 232.800000 1
2011-01-29 234.700000 1
2011-01-30 164.444444 1
... ... ...
2011-12-02 302.444444 1
2011-12-03 252.250000 1
2011-12-04 191.090909 1
2011-12-05 237.916667 1
2011-12-06 206.687500 1
2011-12-07 199.000000 1
2011-12-08 253.833333 1
2011-12-09 233.571429 1
2011-12-10 174.400000 1
2011-12-11 258.166667 1
2011-12-12 252.933333 1
2011-12-13 207.285714 1
2011-12-14 205.083333 1
2011-12-15 161.125000 1
2011-12-16 173.882353 1
2011-12-17 181.692308 1
2011-12-18 235.727273 1
2011-12-19 243.000000 1
2011-12-20 149.375000 1
2011-12-21 197.000000 1
2011-12-22 226.333333 1
2011-12-23 251.625000 1
2011-12-24 263.200000 1
2011-12-25 237.000000 1
2011-12-26 198.454545 1
2011-12-27 202.777778 1
2011-12-28 200.857143 1
2011-12-29 154.714286 1
2011-12-30 215.071429 1
2011-12-31 198.100000 1

365 rows × 2 columns


In [11]:
listings.tail()


Out[11]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
403 404 Property type 2 Neighborhood 14 100 1 8 235 1
404 405 Property type 2 Neighborhood 13 85 2 27 1048 1
405 406 Property type 1 Neighborhood 9 70 3 18 153 1
406 407 Property type 1 Neighborhood 13 129 2 13 370 1
407 408 Property type 1 Neighborhood 14 100 3 21 707 1

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [12]:
listings.describe()


Out[12]:
prop_id price person_capacity picture_count description_length tenure_months
count 408.000000 408.000000 408.000000 408.000000 408.000000 408.000000
mean 204.500000 187.806373 2.997549 14.389706 309.159314 8.487745
std 117.923704 353.050858 1.594676 10.477428 228.021684 5.872088
min 1.000000 39.000000 1.000000 1.000000 0.000000 1.000000
25% 102.750000 90.000000 2.000000 6.000000 179.000000 4.000000
50% 204.500000 125.000000 2.000000 12.000000 250.000000 7.000000
75% 306.250000 199.000000 4.000000 20.000000 389.500000 13.000000
max 408.000000 5000.000000 10.000000 71.000000 1969.000000 30.000000
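`describe()` reports the median as the 50% row; the three requested statistics can also be pulled out directly with `agg`. A sketch on toy values (the real frame has more columns):

```python
import pandas as pd

# Toy stand-in for the listings frame (illustrative values)
toy = pd.DataFrame({'price': [100, 200, 300],
                    'person_capacity': [2, 2, 4]})

# One row per statistic, one column per variable
stats = toy.agg(['mean', 'median', 'std'])
print(stats.loc['median', 'price'])  # 200.0
```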

What are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [13]:
len([1, 2, 3])


Out[13]:
3

In [14]:
[1, 2, 3].__len__()


Out[14]:
3

In [15]:
class MyClass(object):
    def __init__(self, value):
        self.value = value
        
    def __len__(self):
        return self.value

In [16]:
myobj = MyClass(5)

In [17]:
listings.groupby(['prop_type']).agg(['mean', 'count'])


Out[17]:
prop_id price person_capacity picture_count description_length tenure_months
mean count mean count mean count mean count mean count mean count
prop_type
Property type 1 204.754647 269 237.085502 269 3.516729 269 14.695167 269 313.171004 269 8.464684 269
Property type 2 206.392593 135 93.288889 135 2.000000 135 13.948148 135 304.851852 135 8.377778 135
Property type 3 123.500000 4 63.750000 4 1.750000 4 8.750000 4 184.750000 4 13.750000 4

Same, but by property type per neighborhood?


In [18]:
listings.head(2)


Out[18]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 Property type 1 Neighborhood 14 140 3 11 232 30
1 2 Property type 1 Neighborhood 14 95 2 3 37 29

In [19]:
pd.pivot_table(listings, values='person_capacity', index='neighborhood', columns='prop_type').head(2)


Out[19]:
prop_type Property type 1 Property type 2 Property type 3
neighborhood
Neighborhood 1 2.0 NaN NaN
Neighborhood 10 3.5 2 NaN

In [20]:
group_cols = ['neighborhood', 'prop_type']
agg_cols = ['person_capacity', 'price']
listings.groupby(group_cols)[agg_cols].agg(['sum', 'count']).unstack(level='prop_type')
#listings.groupby(group_cols)[agg_cols].agg(['sum', 'count']).unstack(1)


Out[20]:
person_capacity price
sum count sum count
prop_type Property type 1 Property type 2 Property type 3 Property type 1 Property type 2 Property type 3 Property type 1 Property type 2 Property type 3 Property type 1 Property type 2 Property type 3
neighborhood
Neighborhood 1 2 NaN NaN 1 NaN NaN 85 NaN NaN 1 NaN NaN
Neighborhood 10 21 4 NaN 6 2 NaN 855 275 NaN 6 2 NaN
Neighborhood 11 45 8 2 14 4 1 2232 315 75 14 4 1
Neighborhood 12 134 37 NaN 39 19 NaN 14259 1841 NaN 39 19 NaN
Neighborhood 13 199 42 NaN 49 23 NaN 11853 1866 NaN 49 23 NaN
Neighborhood 14 109 39 1 34 21 1 5599 1760 75 34 21 1
Neighborhood 15 93 34 NaN 25 15 NaN 4472 1425 NaN 25 15 NaN
Neighborhood 16 41 33 NaN 14 16 NaN 2225 1338 NaN 14 16 NaN
Neighborhood 17 81 22 2 23 11 1 4367 1127 65 23 11 1
Neighborhood 18 65 20 NaN 22 9 NaN 3819 1086 NaN 22 9 NaN
Neighborhood 19 29 16 NaN 8 8 NaN 1779 711 NaN 8 8 NaN
Neighborhood 2 6 NaN NaN 1 NaN NaN 250 NaN NaN 1 NaN NaN
Neighborhood 20 25 1 NaN 9 1 NaN 7239 60 NaN 9 1 NaN
Neighborhood 21 17 NaN NaN 4 NaN NaN 1450 NaN NaN 4 NaN NaN
Neighborhood 22 3 NaN NaN 1 NaN NaN 225 NaN NaN 1 NaN NaN
Neighborhood 3 NaN 2 NaN NaN 1 NaN NaN 60 NaN NaN 1 NaN
Neighborhood 4 NaN 2 2 NaN 1 1 NaN 60 40 NaN 1 1
Neighborhood 5 5 NaN NaN 2 NaN NaN 389 NaN NaN 2 NaN NaN
Neighborhood 6 10 NaN NaN 3 NaN NaN 438 NaN NaN 3 NaN NaN
Neighborhood 7 11 2 NaN 3 1 NaN 483 100 NaN 3 1 NaN
Neighborhood 8 20 4 NaN 4 1 NaN 699 350 NaN 4 1 NaN
Neighborhood 9 30 4 NaN 7 2 NaN 1058 220 NaN 7 2 NaN

In [21]:
listings.groupby(['prop_type', 'neighborhood']).agg(['mean', 'count'])


Out[21]:
prop_id price person_capacity picture_count description_length tenure_months
mean count mean count mean count mean count mean count mean count
prop_type neighborhood
Property type 1 Neighborhood 1 235.000000 1 85.000000 1 2.000000 1 26.000000 1 209.000000 1 6.000000 1
Neighborhood 10 307.500000 6 142.500000 6 3.500000 6 13.333333 6 391.000000 6 3.833333 6
Neighborhood 11 174.000000 14 159.428571 14 3.214286 14 9.928571 14 379.000000 14 9.642857 14
Neighborhood 12 211.307692 39 365.615385 39 3.435897 39 10.820513 39 267.205128 39 7.897436 39
Neighborhood 13 190.142857 49 241.897959 49 4.061224 49 15.653061 49 290.408163 49 9.122449 49
Neighborhood 14 220.764706 34 164.676471 34 3.205882 34 14.764706 34 317.205882 34 8.441176 34
Neighborhood 15 191.560000 25 178.880000 25 3.720000 25 14.320000 25 321.760000 25 9.320000 25
Neighborhood 16 233.000000 14 158.928571 14 2.928571 14 21.642857 14 310.714286 14 7.071429 14
Neighborhood 17 166.043478 23 189.869565 23 3.521739 23 16.086957 23 317.347826 23 9.869565 23
Neighborhood 18 210.000000 22 173.590909 22 2.954545 22 16.090909 22 369.227273 22 8.227273 22
Neighborhood 19 253.250000 8 222.375000 8 3.625000 8 11.000000 8 254.500000 8 6.500000 8
Neighborhood 2 244.000000 1 250.000000 1 6.000000 1 8.000000 1 423.000000 1 6.000000 1
Neighborhood 20 174.111111 9 804.333333 9 2.777778 9 9.444444 9 223.555556 9 9.666667 9
Neighborhood 21 79.250000 4 362.500000 4 4.250000 4 49.000000 4 306.250000 4 14.750000 4
Neighborhood 22 162.000000 1 225.000000 1 3.000000 1 19.000000 1 500.000000 1 9.000000 1
Neighborhood 5 132.500000 2 194.500000 2 2.500000 2 8.500000 2 266.500000 2 11.500000 2
Neighborhood 6 291.333333 3 146.000000 3 3.333333 3 12.666667 3 290.666667 3 4.000000 3
Neighborhood 7 273.333333 3 161.000000 3 3.666667 3 14.333333 3 343.000000 3 5.333333 3
Neighborhood 8 218.250000 4 174.750000 4 5.000000 4 11.000000 4 300.000000 4 6.750000 4
Neighborhood 9 265.857143 7 151.142857 7 4.285714 7 13.428571 7 471.428571 7 5.714286 7
Property type 2 Neighborhood 10 327.000000 2 137.500000 2 2.000000 2 20.000000 2 126.000000 2 3.500000 2
Neighborhood 11 146.250000 4 78.750000 4 2.000000 4 16.750000 4 161.250000 4 11.250000 4
Neighborhood 12 164.263158 19 96.894737 19 1.947368 19 10.473684 19 244.526316 19 9.842105 19
Neighborhood 13 199.000000 23 81.130435 23 1.826087 23 16.695652 23 418.565217 23 9.739130 23
Neighborhood 14 195.047619 21 83.809524 21 1.857143 21 15.904762 21 348.619048 21 8.714286 21
Neighborhood 15 194.666667 15 95.000000 15 2.266667 15 11.733333 15 301.733333 15 8.200000 15
Neighborhood 16 251.562500 16 83.625000 16 2.062500 16 15.375000 16 246.250000 16 6.687500 16
Neighborhood 17 242.181818 11 102.454545 11 2.000000 11 15.454545 11 308.272727 11 7.181818 11
Neighborhood 18 179.333333 9 120.666667 9 2.222222 9 12.333333 9 297.777778 9 9.222222 9
Neighborhood 19 256.750000 8 88.875000 8 2.000000 8 15.125000 8 383.375000 8 5.500000 8
Neighborhood 20 230.000000 1 60.000000 1 1.000000 1 3.000000 1 101.000000 1 6.000000 1
Neighborhood 3 166.000000 1 60.000000 1 2.000000 1 7.000000 1 264.000000 1 9.000000 1
Neighborhood 4 118.000000 1 60.000000 1 2.000000 1 10.000000 1 95.000000 1 11.000000 1
Neighborhood 7 365.000000 1 100.000000 1 2.000000 1 3.000000 1 148.000000 1 2.000000 1
Neighborhood 8 343.000000 1 350.000000 1 4.000000 1 5.000000 1 223.000000 1 3.000000 1
Neighborhood 9 165.500000 2 110.000000 2 2.000000 2 3.500000 2 114.500000 2 9.000000 2
Property type 3 Neighborhood 11 178.000000 1 75.000000 1 2.000000 1 15.000000 1 196.000000 1 8.000000 1
Neighborhood 14 286.000000 1 75.000000 1 1.000000 1 1.000000 1 113.000000 1 5.000000 1
Neighborhood 17 10.000000 1 65.000000 1 2.000000 1 15.000000 1 189.000000 1 23.000000 1
Neighborhood 4 20.000000 1 40.000000 1 2.000000 1 4.000000 1 241.000000 1 19.000000 1

Plot daily bookings:


In [34]:
bookings = bookings.reset_index()
bookings.groupby('booking_date').count()


Out[34]:
prop_id number_of_bookings
booking_date
2011-01-01 11 11
2011-01-02 9 9
2011-01-03 10 10
2011-01-04 8 8
2011-01-05 15 15
2011-01-06 14 14
2011-01-07 14 14
2011-01-08 6 6
2011-01-09 10 10
2011-01-10 14 14
2011-01-11 15 15
2011-01-12 20 20
2011-01-13 20 20
2011-01-14 14 14
2011-01-15 11 11
2011-01-16 11 11
2011-01-17 21 21
2011-01-18 7 7
2011-01-19 14 14
2011-01-20 13 13
2011-01-21 21 21
2011-01-22 11 11
2011-01-23 14 14
2011-01-24 14 14
2011-01-25 16 16
2011-01-26 11 11
2011-01-27 16 16
2011-01-28 15 15
2011-01-29 10 10
2011-01-30 18 18
... ... ...
2011-12-02 9 9
2011-12-03 4 4
2011-12-04 11 11
2011-12-05 12 12
2011-12-06 16 16
2011-12-07 14 14
2011-12-08 18 18
2011-12-09 7 7
2011-12-10 15 15
2011-12-11 6 6
2011-12-12 15 15
2011-12-13 14 14
2011-12-14 12 12
2011-12-15 8 8
2011-12-16 17 17
2011-12-17 13 13
2011-12-18 11 11
2011-12-19 9 9
2011-12-20 8 8
2011-12-21 8 8
2011-12-22 3 3
2011-12-23 8 8
2011-12-24 5 5
2011-12-25 2 2
2011-12-26 11 11
2011-12-27 9 9
2011-12-28 14 14
2011-12-29 14 14
2011-12-30 14 14
2011-12-31 10 10

365 rows × 2 columns


In [35]:
# Plot daily bookings: count bookings per day and draw one bar per day
#grid_plot = sns.FacetGrid(bookings, row='booking_date', col='prop_id')
#grid_plot.map(sns.regplot, 'booking_date', color='.3', fit_reg=False, x_jitter=.1)

bookings.groupby('booking_date').count().plot(kind='bar')


Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a0f1b90>

Plot the daily bookings per neighborhood (provide a legend)


In [36]:
bookings.head()


Out[36]:
booking_date prop_id number_of_bookings
0 2011-06-17 9 1
1 2011-08-12 13 1
2 2011-06-20 21 1
3 2011-05-05 28 1
4 2011-11-17 29 1

In [40]:
# First merge bookings with listings so each booking carries its neighborhood
merged_bookings_listings = pd.merge(bookings, listings, on='prop_id')
merged_bookings_listings.head()
merged_bookings_listings.groupby('neighborhood')['neighborhood'].count().plot(kind='bar')


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d60c290>
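The bar chart above counts total bookings per neighborhood rather than daily bookings. One way to get a daily series per neighborhood, with a legend, is to group on both keys and unstack so each neighborhood becomes its own line. A sketch on toy data (column names follow the merged frame):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for the sketch
import matplotlib.pyplot as plt

# Toy merged frame: one row per booking
merged = pd.DataFrame({
    'booking_date': pd.to_datetime(['2011-01-01', '2011-01-01', '2011-01-02']),
    'neighborhood': ['Neighborhood 1', 'Neighborhood 2', 'Neighborhood 1'],
})

# Rows: dates, columns: neighborhoods, values: daily counts
daily = (merged.groupby(['booking_date', 'neighborhood'])
               .size()
               .unstack('neighborhood')
               .fillna(0))
ax = daily.plot(legend=True)  # one line per neighborhood, legend included
print(daily.loc['2011-01-01', 'Neighborhood 1'])  # 1.0
```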

Part 2 - Develop a data set


In [ ]:

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [41]:
b = bookings.groupby('prop_id').count()

In [42]:
b.head()


Out[42]:
booking_date number_of_bookings
prop_id
1 4 4
3 1 1
4 27 27
6 88 88
7 2 2

In [43]:
c = b.reset_index()

In [44]:
c.head()


Out[44]:
prop_id booking_date number_of_bookings
0 1 4 4
1 3 1 1
2 4 27 27
3 6 88 88
4 7 2 2

In [54]:
number_of_bookings = bookings.groupby('prop_id').count().reset_index()
number_of_bookings.rename(columns={'booking_date': 'number_of_bookings2'}, inplace=True)
number_of_bookings.head()
listings2 = pd.merge(listings, number_of_bookings, how='left', on='prop_id')
listings2.number_of_bookings.fillna(0, inplace=True)
listings2.head()


Out[54]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings2 number_of_bookings
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 4
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 NaN 0
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1 1
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27 27
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 NaN 0

In [55]:
# Count bookings per property, merge the counts onto listings, then derive the booking rate
number_of_bookings = bookings.groupby('prop_id').count().reset_index()
number_of_bookings.rename(columns={'booking_date': 'booking_date_old'}, inplace=True)
listings2 = pd.merge(listings, number_of_bookings, how='left', on='prop_id')
listings2.number_of_bookings.fillna(0, inplace=True)
# Alternative way: listings2['number_of_bookings'] = listings2.number_of_bookings.fillna(0)
listings2['booking_rate'] = listings2.number_of_bookings / listings2.tenure_months

listings2.tail()


Out[55]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months booking_date_old number_of_bookings booking_rate
403 404 Property type 2 Neighborhood 14 100 1 8 235 1 3 3 3
404 405 Property type 2 Neighborhood 13 85 2 27 1048 1 19 19 19
405 406 Property type 1 Neighborhood 9 70 3 18 153 1 19 19 19
406 407 Property type 1 Neighborhood 13 129 2 13 370 1 15 15 15
407 408 Property type 1 Neighborhood 14 100 3 21 707 1 54 54 54
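The merge-based construction above works; for reference, a more compact route to the same `number_of_bookings` and `booking_rate` columns maps `value_counts` of `prop_id` straight onto the listings. A sketch on toy frames (values illustrative):

```python
import pandas as pd

# Toy stand-ins for the homework frames
bookings = pd.DataFrame({'prop_id': [1, 1, 2]})
listings = pd.DataFrame({'prop_id': [1, 2, 3], 'tenure_months': [10, 5, 4]})

# Bookings per property; properties with no bookings get 0
counts = bookings['prop_id'].value_counts()
listings['number_of_bookings'] = listings['prop_id'].map(counts).fillna(0)
listings['booking_rate'] = listings['number_of_bookings'] / listings['tenure_months']
print(listings['booking_rate'].tolist())  # [0.2, 0.2, 0.0]
```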

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [57]:
listings2[listings2.tenure_months > 9].tail()


Out[57]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months booking_date_old number_of_bookings booking_rate
139 140 Property type 1 Neighborhood 12 200 4 18 125 10 10 10 1.0
140 141 Property type 2 Neighborhood 12 45 2 36 281 10 1 1 0.1
141 142 Property type 2 Neighborhood 15 96 2 9 138 10 48 48 4.8
142 143 Property type 2 Neighborhood 15 58 2 7 135 10 21 21 2.1
143 144 Property type 2 Neighborhood 14 100 2 5 35 10 NaN 0 0.0

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. (After applying this function correctly, there should be one column for every prop_type and one column for every neighborhood category.)


In [70]:
# get_dummies converted each string column into multiple columns, one per unique value in the original column.
# Note this applies only to the string columns; the integer columns did not change.
listings3 = pd.get_dummies(listings2)
#pd.get_dummies(listings2[['prop_type', 'neighborhood', 'price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months', 'booking_rate']])

In [72]:
listings3.columns


Out[72]:
Index([u'prop_id', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months', u'booking_date_old', u'number_of_bookings', u'booking_rate', u'prop_type_Property type 1', u'prop_type_Property type 2', u'prop_type_Property type 3', u'neighborhood_Neighborhood 1', u'neighborhood_Neighborhood 10', u'neighborhood_Neighborhood 11', u'neighborhood_Neighborhood 12', u'neighborhood_Neighborhood 13', u'neighborhood_Neighborhood 14', u'neighborhood_Neighborhood 15', u'neighborhood_Neighborhood 16', u'neighborhood_Neighborhood 17', u'neighborhood_Neighborhood 18', u'neighborhood_Neighborhood 19', u'neighborhood_Neighborhood 2', u'neighborhood_Neighborhood 20', u'neighborhood_Neighborhood 21', u'neighborhood_Neighborhood 22', u'neighborhood_Neighborhood 3', u'neighborhood_Neighborhood 4', u'neighborhood_Neighborhood 5', u'neighborhood_Neighborhood 6', u'neighborhood_Neighborhood 7', u'neighborhood_Neighborhood 8', u'neighborhood_Neighborhood 9'], dtype='object')

In [115]:
listings3.head()


Out[115]:
prop_id price person_capacity picture_count description_length tenure_months booking_date_old number_of_bookings booking_rate prop_type_Property type 1 ... neighborhood_Neighborhood 20 neighborhood_Neighborhood 21 neighborhood_Neighborhood 22 neighborhood_Neighborhood 3 neighborhood_Neighborhood 4 neighborhood_Neighborhood 5 neighborhood_Neighborhood 6 neighborhood_Neighborhood 7 neighborhood_Neighborhood 8 neighborhood_Neighborhood 9
0 1 140 3 11 232 30 4 4 0.133333 1 ... 0 0 0 0 0 0 0 0 0 0
1 2 95 2 3 37 29 NaN 0 0.000000 1 ... 0 0 0 0 0 0 0 0 0 0
2 3 95 2 16 172 29 1 1 0.034483 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 90 2 19 472 28 27 27 0.964286 0 ... 0 0 0 0 0 0 0 0 0 0
4 5 125 5 21 442 28 NaN 0 0.000000 1 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 34 columns
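On a toy frame, `get_dummies` behaves like this: only the string columns are encoded, and numeric columns pass through unchanged. A minimal sketch (column values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'prop_type': ['Type 1', 'Type 2', 'Type 1'],
                    'price': [100, 90, 80]})

# One indicator column per unique prop_type value; price is untouched
dummies = pd.get_dummies(toy)
print(sorted(dummies.columns))   # ['price', 'prop_type_Type 1', 'prop_type_Type 2']
print(int(dummies['prop_type_Type 1'].sum()))  # 2
```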

Create test and training sets for your regressors and predictors.

The predictor (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [74]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    listings3[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']],
    listings3.booking_rate, random_state=12, test_size=0.2)
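The split above hand-picks a few numeric columns; the instructions ask to drop the excluded columns instead, which also keeps the dummy columns as regressors. A sketch on a toy frame (note that in modern scikit-learn `train_test_split` lives in `sklearn.model_selection` rather than `sklearn.cross_validation`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for listings3
df = pd.DataFrame({'prop_id': range(10),
                   'price': range(10),
                   'booking_rate': [0.1] * 10})

# Everything except the excluded columns becomes a regressor
X = df.drop(['prop_id', 'booking_rate'], axis=1)
y = df['booking_rate']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12, test_size=0.2)
print(len(X_train), len(X_test))  # 8 2
```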

In [ ]:

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [93]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

fit your model with your training sets


In [94]:
lr.fit(X_train, y_train)


Out[94]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [95]:
lr.score(X_test, y_test)


Out[95]:
0.1256213709762638

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

The score method returns the coefficient of determination R^2 of the prediction. Our score was 0.1256, which is quite low (the best possible value is 1.0). This suggests that the regressors in X_test do not strongly predict y_test.
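Concretely, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. Computed by hand on toy numbers:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.98
```

A model that always predicts the mean of y_true would score 0; a perfect fit scores 1.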

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern from Part 3 to model it.


In [120]:
X_train, X_test, y_train, y_test = train_test_split(
    listings3[['price', 'picture_count', 'tenure_months']],
    listings3.booking_rate, random_state=12, test_size=0.2)

lr = LinearRegression()

lr.fit(X_train, y_train)

lr.score(X_test, y_test)


# I tried adding and removing additional fields, and nothing seemed to significantly increase the coefficient.

# Not sure how to create monthly revenue.


Out[120]:
0.15289924643736186

In [ ]:


In [121]:
# Is it possible to plot the final regression in the homework?

# What does random_state=12 mean?

# How do I create a "monthly revenue" predictor in HW1?

# How could I identify which of the X_test inputs are the most important in determining y?
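On the last two questions: `random_state=12` just seeds the shuffle inside `train_test_split` so the split is reproducible, and a fitted `LinearRegression` exposes one weight per regressor in `lr.coef_`. Comparing those weights (ideally after standardizing the features) is one rough way to gauge relative importance. A sketch on synthetic data where the true weights are known:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 3 * X[:, 0] + 0.1 * X[:, 1]  # first feature dominates by construction

lr = LinearRegression().fit(X, y)
# The fitted coefficients recover the generating weights,
# so the first regressor is clearly the more important one
print(np.round(lr.coef_, 2))
```

Because the toy features here share the same scale, the coefficients are directly comparable; on real data, standardize the columns first or the units distort the comparison.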

In [ ]:


In [107]:
##Optional - can we plot this info?##

In [106]:
from sklearn.preprocessing import PolynomialFeatures


def f(x):
    return np.sin(2 * np.pi * x)

# generate points used to plot
x_plot = np.linspace(0, 1, 100)

def plot_approximation(est, ax, label=None):
    """Plot the approximation of ``est`` on axis ``ax``. """
    ax.plot(x_plot, f(x_plot), label='ground truth', color='green')
    ax.scatter(X_train, y_train)
    ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
    ax.set_ylim((-2, 2))
    ax.set_xlim((0, 1))
    ax.set_ylabel('y')
    ax.set_xlabel('x')
    ax.legend(loc='upper right',frameon=True)

In [101]:
from sklearn.pipeline import make_pipeline  # make_pipeline was not imported above

fig, ax = plt.subplots(1, 1)
degree = 1
lr = make_pipeline(PolynomialFeatures(degree), LinearRegression())
plot_approximation(lr, ax, label='1')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-101-cd4fe06f7b2d> in <module>()
      2 degree = 1
      3 lr = make_pipeline(PolynomialFeatures(degree), LinearRegression())
----> 4 plot_approximation(lr, ax, label='1')

<ipython-input-99-4bc76fe74d65> in plot_approximation(est, ax, label)
     11     """Plot the approximation of ``est`` on axis ``ax``. """
     12     ax.plot(x_plot, f(x_plot), label='ground truth', color='green')
---> 13     ax.scatter(X_train, y_train)
     14     ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
     15     ax.set_ylim((-2, 2))

/Users/johnfohr/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in scatter(self, x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, verts, **kwargs)
   3575         y = np.ma.ravel(y)
   3576         if x.size != y.size:
-> 3577             raise ValueError("x and y must be the same size")
   3578 
   3579         s = np.ma.ravel(s)  # This doesn't have to match x, y in size.

ValueError: x and y must be the same size

The scatter call fails because X_train holds several feature columns while y_train is a single column; the plot_approximation helper assumes a single one-dimensional regressor.