Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [72]:
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt

In [73]:
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)

%matplotlib inline

In [310]:
# load bookings twice: once indexed by booking_date, once with the default index
bookingsdate = pd.read_csv('../data/bookings.csv', index_col='booking_date')
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [447]:
# note: bookings has a RangeIndex while neighbor11 is indexed by date, so this
# assignment aligns nothing and fills NaT (visible below); converting
# neighbor11.index directly would avoid that
neighbor11['booking_date'] = pd.to_datetime(bookings['booking_date'])
alldates['booking_date'] = pd.to_datetime(alldates['booking_date'])
allbooking = pd.merge(neighbor11, alldates, on='booking_date')
neighbor11


Out[447]:
neighborhood booking_date
booking_date
2011-01-02 1 NaT
2011-01-05 1 NaT
2011-01-07 1 NaT
2011-01-08 1 NaT
2011-01-09 1 NaT
... ... ...
2011-12-20 1 NaT
2011-12-23 1 NaT
2011-12-24 2 NaT
2011-12-26 1 NaT
2011-12-28 1 NaT

214 rows × 2 columns


In [378]:
dates = pd.date_range('1/1/2011', periods=365)
alldates = pd.DataFrame(index=dates)

# alldatesbook = pd.concat([alldates, bookingsdate])
# alldatesbook
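The merge-and-fillna round trip below can be avoided: `reindex` places a sparse series of daily counts onto a full-year index directly. A minimal sketch on synthetic data (the dates and counts here are made up for illustration):

```python
import pandas as pd

# sparse daily booking counts (only days that had bookings appear)
counts = pd.Series([1, 2, 1],
                   index=pd.to_datetime(['2011-01-02', '2011-01-05', '2011-12-28']))

# full-year index; missing days become 0 instead of being dropped
full_year = pd.date_range('2011-01-01', periods=365)
daily = counts.reindex(full_year, fill_value=0)

print(len(daily))   # 365
print(daily.sum())  # 4
```

Plotting `daily` then shows the whole year, with zero-booking days on the axis rather than skipped.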

In [ ]:


In [76]:
listings


Out[76]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 Property type 1 Neighborhood 14 140 3 11 232 30
1 2 Property type 1 Neighborhood 14 95 2 3 37 29
2 3 Property type 2 Neighborhood 16 95 2 16 172 29
3 4 Property type 2 Neighborhood 13 90 2 19 472 28
4 5 Property type 1 Neighborhood 15 125 5 21 442 28
... ... ... ... ... ... ... ... ...
403 404 Property type 2 Neighborhood 14 100 1 8 235 1
404 405 Property type 2 Neighborhood 13 85 2 27 1048 1
405 406 Property type 1 Neighborhood 9 70 3 18 153 1
406 407 Property type 1 Neighborhood 13 129 2 13 370 1
407 408 Property type 1 Neighborhood 14 100 3 21 707 1

408 rows × 8 columns



What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [274]:
listings.describe()


Out[274]:
prop_id price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
count 408.0 408.0 408.0 408.0 408.0 408.0 327.0 327.0
mean 204.5 187.8 3.0 14.4 309.2 8.5 18.4 4.0
std 117.9 353.1 1.6 10.5 228.0 5.9 20.4 6.2
min 1.0 39.0 1.0 1.0 0.0 1.0 1.0 0.0
25% 102.8 90.0 2.0 6.0 179.0 4.0 3.0 0.4
50% 204.5 125.0 2.0 12.0 250.0 7.0 9.0 1.6
75% 306.2 199.0 4.0 20.0 389.5 13.0 29.5 4.9
max 408.0 5000.0 10.0 71.0 1969.0 30.0 109.0 52.0

In [275]:
# or alternatively...

listingstats=listings.groupby('prop_id').agg(['count', 'mean', 'median', 'min', 'max'])
listingstats[['price','person_capacity','picture_count','description_length','tenure_months']]


Out[275]:
price person_capacity ... description_length tenure_months
count mean median min max count mean median min max ... count mean median min max count mean median min max
prop_id
1 1 140 140 140 140 1 3 3 3 3 ... 1 232 232 232 232 1 30 30 30 30
2 1 95 95 95 95 1 2 2 2 2 ... 1 37 37 37 37 1 29 29 29 29
3 1 95 95 95 95 1 2 2 2 2 ... 1 172 172 172 172 1 29 29 29 29
4 1 90 90 90 90 1 2 2 2 2 ... 1 472 472 472 472 1 28 28 28 28
5 1 125 125 125 125 1 5 5 5 5 ... 1 442 442 442 442 1 28 28 28 28
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
404 1 100 100 100 100 1 1 1 1 1 ... 1 235 235 235 235 1 1 1 1 1
405 1 85 85 85 85 1 2 2 2 2 ... 1 1048 1048 1048 1048 1 1 1 1 1
406 1 70 70 70 70 1 3 3 3 3 ... 1 153 153 153 153 1 1 1 1 1
407 1 129 129 129 129 1 2 2 2 2 ... 1 370 370 370 370 1 1 1 1 1
408 1 100 100 100 100 1 3 3 3 3 ... 1 707 707 707 707 1 1 1 1 1

408 rows × 25 columns

What are the mean price, person capacity, picture count, description length and tenure of the properties, grouped by property type?


In [276]:
typestats=listings.groupby('prop_type').agg(['count', 'mean', 'median', 'min', 'max'])
typestats[['price','person_capacity','picture_count','description_length','tenure_months']]


Out[276]:
price person_capacity ... description_length tenure_months
count mean median min max count mean median min max ... count mean median min max count mean median min max
prop_type
Property type 1 269 237.1 150 40 5000 269 3.5 3 1 10 ... 269 313.2 266.0 17 1719 269 8.5 7.0 1 30
Property type 2 135 93.3 89 39 350 135 2.0 2 1 6 ... 135 304.9 239.0 0 1969 135 8.4 7.0 1 29
Property type 3 4 63.8 70 40 75 4 1.8 2 1 2 ... 4 184.8 192.5 113 241 4 13.8 13.5 5 23

3 rows × 25 columns

Same, but by property type per neighborhood?


In [16]:
neighborhoodstats=listings.groupby(['prop_type','neighborhood']).agg(['count', 'mean', 'median', 'min', 'max'])
neighborhoodstats[['price','person_capacity','picture_count','description_length','tenure_months']]


Out[16]:
price person_capacity ... description_length tenure_months
count mean median min max count mean median min max ... count mean median min max count mean median min max
prop_type neighborhood
Property type 1 Neighborhood 1 1 85.0 85.0 85 85 1 2.0 2 2 2 ... 1 209.0 209.0 209 209 1 6.0 6.0 6 6
Neighborhood 10 6 142.5 137.5 90 205 6 3.5 4 2 5 ... 6 391.0 425.5 160 537 6 3.8 4.5 1 5
Neighborhood 11 14 159.4 130.0 95 319 14 3.2 3 2 6 ... 14 379.0 295.5 82 1719 14 9.6 9.5 1 16
Neighborhood 12 39 365.6 150.0 60 3050 39 3.4 3 1 8 ... 39 267.2 251.0 45 670 39 7.9 6.0 1 18
Neighborhood 13 49 241.9 180.0 110 2394 49 4.1 4 2 8 ... 49 290.4 256.0 74 716 49 9.1 8.0 1 23
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Property type 2 Neighborhood 9 2 110.0 110.0 100 120 2 2.0 2 1 3 ... 2 114.5 114.5 17 212 2 9.0 9.0 7 11
Property type 3 Neighborhood 11 1 75.0 75.0 75 75 1 2.0 2 2 2 ... 1 196.0 196.0 196 196 1 8.0 8.0 8 8
Neighborhood 14 1 75.0 75.0 75 75 1 1.0 1 1 1 ... 1 113.0 113.0 113 113 1 5.0 5.0 5 5
Neighborhood 17 1 65.0 65.0 65 65 1 2.0 2 2 2 ... 1 189.0 189.0 189 189 1 23.0 23.0 23 23
Neighborhood 4 1 40.0 40.0 40 40 1 2.0 2 2 2 ... 1 241.0 241.0 241 241 1 19.0 19.0 19 19

40 rows × 25 columns

Plot daily bookings:


In [551]:
#plotting timeseries of all bookings
dailystats=bookings.groupby(['booking_date']).agg('count')

dailystats.plot(figsize=(20,10))


Out[551]:
<matplotlib.axes._subplots.AxesSubplot at 0x115a20050>

Plot the daily bookings per neighborhood (provide a legend)


In [277]:
joineddf=bookings.merge(listings, on="prop_id")
joineddf


Out[277]:
prop_id booking_date prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 188 2011-01-01 Property type 1 Neighborhood 9 95 3 10 247 8 16 2.0
1 188 2011-01-04 Property type 1 Neighborhood 9 95 3 10 247 8 16 2.0
2 188 2011-01-05 Property type 1 Neighborhood 9 95 3 10 247 8 16 2.0
3 188 2011-01-11 Property type 1 Neighborhood 9 95 3 10 247 8 16 2.0
4 188 2011-01-14 Property type 1 Neighborhood 9 95 3 10 247 8 16 2.0
... ... ... ... ... ... ... ... ... ... ... ...
6071 153 2011-12-07 Property type 2 Neighborhood 18 95 2 22 174 9 6 0.7
6072 153 2011-12-28 Property type 2 Neighborhood 18 95 2 22 174 9 6 0.7
6073 403 2011-12-19 Property type 1 Neighborhood 19 100 4 2 69 1 5 5.0
6074 249 2011-12-19 Property type 1 Neighborhood 14 75 2 9 59 6 NaN NaN
6075 64 2011-12-29 Property type 1 Neighborhood 20 250 3 6 289 16 2 0.1

6076 rows × 11 columns


In [928]:
neighborhoods=joineddf['neighborhood'].unique()
len(neighborhoods)


Out[928]:
21

In [491]:
# count bookings per neighborhood per day (a Series with a two-level index)
bookingsneighbor = joineddf.groupby(['neighborhood', 'booking_date'])['neighborhood'].agg('count')
neighbor11 = pd.DataFrame(bookingsneighbor['Neighborhood 11'])
alldates['booking_date'] = dates
neighbor11['booking_date'] = neighbor11.index
neighbor11['booking_date'] = neighbor11['booking_date'].astype(str)
type(neighbor11['booking_date'][2])
#pd.merge(neighbor11, alldates, on='booking_date')
bookingsneighborpd = pd.DataFrame(bookingsneighbor)


Out[491]:
str

In [946]:
color = ['red', 'blue', 'green', 'yellow', 'orange'] * 4

In [948]:
# Looking at other students' plots, I see I may have interpreted "daily plot"
# differently: a bar plot over days of the week may have been expected.
# I had trouble generating a meaningful legend and a distinct color for each
# line, but I'm happy I figured out how to plot a sparse data series; initially
# I was plotting only the days where there was a booking instead of the entire
# year. I do see, however, that this plot is not particularly helpful.

neighborhooddfs = []
for hood in neighborhoods:
    neighborhooddfs.append(pd.DataFrame(bookingsneighbor[hood]))

# left-join each neighborhood's sparse daily counts onto the full year of dates
alldatesneighborhooddfs = []
for hooddf in neighborhooddfs:
    hooddf['booking_date'] = hooddf.index
    hooddf['booking_date'] = hooddf['booking_date'].astype(str)
    alldatesneighborhooddfs.append(pd.merge(alldates, hooddf, on='booking_date', how='left'))

# days with no bookings become 0 instead of NaN
for i in range(len(alldatesneighborhooddfs)):
    alldatesneighborhooddfs[i] = alldatesneighborhooddfs[i].fillna(0)

# plot the first neighborhood, then draw the rest onto the same axes
ax = alldatesneighborhooddfs[0].plot(figsize=(20, 10), legend=True)
for hooddf in alldatesneighborhooddfs[1:]:
    hooddf.plot(figsize=(20, 10), ax=ax)


In [ ]:
sns.barplot

In [570]:
# didn't use this: one DataFrame per neighborhood, built in a loop instead of
# twenty near-identical assignments
neighbor_dfs = {hood: pd.DataFrame(bookingsneighbor[hood]) for hood in neighborhoods}

In [643]:
# alldates['booking_date']= alldates['booking_date'].map(lambda datetime: str(alldates['booking_date'][datetime].split(" ")[0]))

# alldates

# # type(alldates['strtime'][2])

# allbooking=pd.merge(alldates,neighbor11,on='booking_date',how='left')
# allbooking=allbooking.fillna(0)
# allbooking.plot(figsize = (20,10),legend=True)
# # # type(alldates['strtime'][2])

# # datetime=pd.Timestamp('2011-01-06')
# # str(alldates['booking_date'][datetime])

In [644]:
# ax=bookingsneighbor['Neighborhood 1'].plot(figsize = (20,10), legend=True,xticks=None)

In [645]:
# bookingsneighbor=joineddf.groupby(['neighborhood','booking_date'])['neighborhood'].agg('count')

# ax=bookingsneighbor['Neighborhood 11'].plot(by='booking_date', figsize = (20,10), legend=True, ax=ax)
# #bookingsneighbor['Neighborhood 1'].plot(by='booking_date', figsize = (20,10), legend=True, ax=ax)

In [ ]:


Part 2 - Develop a data set

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [649]:
# adding columns for number of bookings and booking rate
# note: numbook is indexed by prop_id (1-based) while listings keeps its 0-based
# RangeIndex, so this index-aligned assignment shifts each count up one row
numbook = joineddf.groupby(['prop_id']).agg('count')
listings['number_of_bookings'] = numbook['booking_date']
listings['booking_rate'] = listings['number_of_bookings'] / listings['tenure_months']
listings = listings.fillna(0)
listings


Out[649]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 0 0.0
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 4 0.1
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 0 0.0
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 1 0.0
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 27 1.0
... ... ... ... ... ... ... ... ... ... ...
403 404 Property type 2 Neighborhood 14 100 1 8 235 1 1 1.0
404 405 Property type 2 Neighborhood 13 85 2 27 1048 1 3 3.0
405 406 Property type 1 Neighborhood 9 70 3 18 153 1 19 19.0
406 407 Property type 1 Neighborhood 13 129 2 13 370 1 19 19.0
407 408 Property type 1 Neighborhood 14 100 3 21 707 1 15 15.0

408 rows × 10 columns
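One caveat with the index-aligned assignment above: the groupby result is indexed by prop_id (1-based) while listings uses the default 0-based RangeIndex, so counts can land one row off. Mapping on the prop_id values sidesteps that. A sketch on toy data (the frames here are invented, with the real column names):

```python
import pandas as pd

listings_demo = pd.DataFrame({'prop_id': [1, 2, 3], 'tenure_months': [10, 5, 2]})
bookings_demo = pd.DataFrame({'prop_id': [1, 1, 3]})

# count bookings per property, then align by prop_id value, not by row position
counts = bookings_demo['prop_id'].value_counts()
listings_demo['number_of_bookings'] = listings_demo['prop_id'].map(counts).fillna(0)
listings_demo['booking_rate'] = (listings_demo['number_of_bookings']
                                 / listings_demo['tenure_months'])
```

Here property 2 (no bookings) correctly gets 0 rather than inheriting a neighbor's count.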

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [661]:
# keeping well-established properties (tenure strictly greater than 10 months)
established = listings[listings['tenure_months'] > 10]
established = established.fillna(0)
established


Out[661]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 0 0.0
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 4 0.1
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 0 0.0
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 1 0.0
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 27 1.0
... ... ... ... ... ... ... ... ... ... ...
115 116 Property type 1 Neighborhood 17 125 2 9 89 11 8 0.7
116 117 Property type 2 Neighborhood 14 49 2 14 417 11 0 0.0
117 118 Property type 2 Neighborhood 4 60 2 10 95 11 8 0.7
118 119 Property type 2 Neighborhood 12 55 2 8 333 11 11 1.0
119 120 Property type 2 Neighborhood 9 100 1 4 212 11 1 0.1

120 rows × 10 columns

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. (After applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.)


In [662]:
#creating dummy variables for prop type
listings_dum_prop=pd.get_dummies(listings['prop_type'])
listings_dum_prop


Out[662]:
Property type 1 Property type 2 Property type 3
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 1 0 0
... ... ... ...
403 0 1 0
404 0 1 0
405 1 0 0
406 1 0 0
407 1 0 0

408 rows × 3 columns


In [663]:
#creating dummy variables for neighborhood
listings_dum_neig=pd.get_dummies(listings['neighborhood'])
listings_dum_neig


Out[663]:
Neighborhood 1 Neighborhood 10 Neighborhood 11 Neighborhood 12 Neighborhood 13 Neighborhood 14 Neighborhood 15 Neighborhood 16 Neighborhood 17 Neighborhood 18 ... Neighborhood 20 Neighborhood 21 Neighborhood 22 Neighborhood 3 Neighborhood 4 Neighborhood 5 Neighborhood 6 Neighborhood 7 Neighborhood 8 Neighborhood 9
0 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
403 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
404 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
405 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
406 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
407 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

408 rows × 22 columns
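The two dummy frames still need to be joined back onto the numeric columns before modeling; `pd.concat` along axis=1 does that. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'prop_type': ['Property type 1', 'Property type 2'],
                   'price': [140, 95]})

# one boolean column per category, concatenated next to the numeric columns
dummies = pd.get_dummies(df['prop_type'])
model_df = pd.concat([df.drop(columns=['prop_type']), dummies], axis=1)

print(list(model_df.columns))
# ['price', 'Property type 1', 'Property type 2']
```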

Create test and training sets for your regressors and predictors.

The predictor (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
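Rather than splitting each column separately, passing one X frame and one y series to train_test_split keeps all rows aligned in a single call. A self-contained sketch on invented data (note: newer scikit-learn exposes train_test_split from sklearn.model_selection instead of sklearn.cross_validation):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

demo = pd.DataFrame({'price': [100, 200, 300, 400],
                     'picture_count': [5, 10, 15, 20],
                     'booking_rate': [0.5, 1.0, 1.5, 2.0]})

X = demo.drop(columns=['booking_rate'])  # regressors
y = demo['booking_rate']                 # predictor

# one split keeps X rows and y values paired up
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)  # (3, 2) (1, 2)
```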


In [666]:
# note: in newer scikit-learn versions this lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split

In [671]:
listings.columns


Out[671]:
Index([u'prop_id', u'prop_type', u'neighborhood', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months', u'number_of_bookings', u'booking_rate'], dtype='object')

In [ ]:


In [809]:
bookrate_train, bookrate_test, piccount_train, piccount_test, \
    price_train, price_test, desclen_train, desclen_test, \
    tenure_train, tenure_test = train_test_split(
        listings['booking_rate'], listings['picture_count'], listings['price'],
        listings['description_length'], listings['tenure_months'],
        test_size=0.33, random_state=42)
#bookrate_train, bookrate_test, price_train, price_test = train_test_split(listings['booking_rate'], listings['price'], test_size=0.25)

In [868]:
# creating my X training set for fitting:
# one [picture_count, price, description_length, tenure] row per listing
everything_train = [list(row) for row in
                    zip(piccount_train, price_train, desclen_train, tenure_train)]

In [882]:
# creating my X testing set
everything_test = [list(row) for row in
                   zip(piccount_test, price_test, desclen_test, tenure_test)]

In [837]:
# didn't use this
pictraining =[]
for pic in piccount_train:
    pictraining.append([pic.astype(float)])

In [838]:
# didn't use this
tenuretraining =[]
for tenure in tenure_train:
    tenuretraining.append([tenure.astype(float)])

In [839]:
# didn't use this
descriptraining =[]
for description in desclen_train:
    descriptraining.append([description.astype(float)])

In [840]:
# didn't use this
price_train
pricetraining =[]
for price in price_train:
    pricetraining.append([price.astype(float)])

In [927]:
# creating my Y training set for fitting
booktraining = [[rate] for rate in bookrate_train]

In [842]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

from sklearn.cross_validation import train_test_split
import matplotlib.pylab as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from IPython.core.pylabtools import figsize
figsize(5,5)
plt.style.use('fivethirtyeight')

In [835]:
def plot_approximation(est, ax, label=None):
    """Plot the approximation of ``est`` on axis ``ax``. """
    #ax.plot(x_plot, f(x_plot), label='ground truth', color='green')
    ax.scatter( pictraining, booktraining, label='training data', color='red')
    ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='blue', label=label)
    ax.set_ylim((0, 100))
    ax.set_xlim((0, 65))
    ax.set_ylabel('y')
    ax.set_xlabel('x')
    ax.legend(loc='upper right',frameon=True)

In [834]:
piccount_train.max()


Out[834]:
60

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [887]:
#below fitting my model with my train sets

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

degree = 1
est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
est.fit(everything_train, booktraining)


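Since the pipeline already takes a degree parameter, higher-order fits are one line away. A self-contained sketch on synthetic data (the quadratic ground truth here is invented to make the point; on the actual listings data higher degrees may just overfit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
x = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 2 * x.ravel() ** 2 + rng.normal(0, 1, 100)  # quadratic ground truth + noise

# same make_pipeline pattern, swept over degrees
scores = {}
for degree in (1, 2):
    est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    est.fit(x, y)
    scores[degree] = est.score(x, y)

print(scores)  # the degree-2 fit should score higher on quadratic data
```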

In [863]:
# sett=[pictraining,descriptraining,pricetraining,tenuretraining]



In [865]:
#making some plots that I did not end up using but I wanted to retain this code

# for lst in sett:
#     fig,ax = plt.subplots(1,1)
#     degree = 1
#     est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
#     est.fit(lst, booktraining)
#     ax.scatter( lst, booktraining, label='training data', color='red')
#     ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='blue',label='fit')
#     ax.set_ylim((0, 55))
#     ax.set_xlim((0, max(map(max,lst))))
#     ax.set_ylabel('y')
#     ax.set_xlabel('x')
#     ax.legend(loc='upper right',frameon=True)


Evaluate your model against your test sets


In [898]:
# creating my prediction from my X test set
prediction = est.predict(everything_test)
# plotting this prediction against my Y test set; if this were a good model,
# the points would fall along a straight line
fig, ax = plt.subplots(1, 1)
ax.scatter(prediction, bookrate_test, label='prediction', color='red')


Out[898]:
<matplotlib.collections.PathCollection at 0x11e8c9890>

In [889]:
#reporting my score, this model isn't very good
score = est.score(everything_test,bookrate_test)
score


Out[889]:
0.12420820100158525

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

The score method calculates the R² (coefficient of determination) value, and an R² of 0.12 tells me the model is not particularly good at predicting booking rate from the parameters we fed it.
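For a regressor, score returns R², defined as 1 − SS_res/SS_tot. A quick hand check of that identity on toy numbers (invented for illustration, not the notebook's data):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(r2)  # 0.98
```

An R² of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than the mean.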

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern as in Part 3 to model it.
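One way to derive such a predictor (an assumption for illustration, not part of the provided data): approximate monthly revenue as nightly price times bookings per month, then feed it to the same pipeline as booking_rate. A sketch on an invented two-row frame:

```python
import pandas as pd

demo = pd.DataFrame({'price': [100, 95],
                     'number_of_bookings': [30, 4],
                     'tenure_months': [30, 29]})

# bookings per month times nightly price approximates monthly revenue
demo['booking_rate'] = demo['number_of_bookings'] / demo['tenure_months']
demo['monthly_revenue'] = demo['price'] * demo['booking_rate']

print(demo['monthly_revenue'].tolist())
```

This assumes one night per booking; multi-night stays would need booking lengths that bookings.csv does not record.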


In [ ]:


In [ ]:


In [ ]:


In [ ]: