Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [3]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

# Limit rows displayed in notebook
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)

In [4]:
pd.__version__


Out[4]:
'0.14.1'

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [5]:
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')

In [6]:
listings.head(5)


Out[6]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 Property type 1 Neighborhood 14 140 3 11 232 30
1 2 Property type 1 Neighborhood 14 95 2 3 37 29
2 3 Property type 2 Neighborhood 16 95 2 16 172 29
3 4 Property type 2 Neighborhood 13 90 2 19 472 28
4 5 Property type 1 Neighborhood 15 125 5 21 442 28

In [7]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)

In [8]:
bookings.head(5)


Out[8]:
prop_id booking_date
0 9 2011-06-17
1 13 2011-08-12
2 21 2011-06-20
3 28 2011-05-05
4 29 2011-11-17

In [9]:
bookings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null object
dtypes: int64(1), object(1)

In [10]:
listings.groupby('prop_type').prop_id.count()


Out[10]:
prop_type
Property type 1    269
Property type 2    135
Property type 3      4
Name: prop_id, dtype: int64

In [11]:
bookings.groupby('prop_id').count()


Out[11]:
booking_date
prop_id
1 4
3 1
4 27
6 88
7 2
... ...
404 3
405 19
406 19
407 15
408 54

328 rows × 1 columns

What are the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [12]:
print 'price', listings.price.mean(), '/', listings.price.median(), '/', listings.price.std()
print 'capacity', listings.person_capacity.mean(), '/', listings.person_capacity.median(), '/', listings.person_capacity.std()
print 'picture count', listings.picture_count.mean(), '/', listings.picture_count.median(), '/', listings.picture_count.std()
print 'desc length', listings.description_length.mean(), '/', listings.description_length.median(), '/', listings.description_length.std()
print 'tenure, in months', listings.tenure_months.mean(), '/', listings.tenure_months.median(), '/', listings.tenure_months.std()


price 187.806372549 / 125.0 / 353.050858039
capacity 2.99754901961 / 2.0 / 1.59467599246
picture count 14.3897058824 / 12.0 / 10.4774282678
desc length 309.159313725 / 250.0 / 228.021684411
tenure, in months 8.48774509804 / 7.0 / 5.87208837893
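The same three statistics can be collected in one table instead of five print statements. The block below is a minimal sketch, assuming a recent pandas release (not the 0.14.1 shown above) and using a small hypothetical stand-in for listings.csv built from the first five rows displayed earlier.

```python
import pandas as pd

# Hypothetical stand-in for listings.csv (first five rows shown above).
listings = pd.DataFrame({
    'price': [140, 95, 95, 90, 125],
    'person_capacity': [3, 2, 2, 2, 5],
    'picture_count': [11, 3, 16, 19, 21],
    'description_length': [232, 37, 172, 472, 442],
    'tenure_months': [30, 29, 29, 28, 28],
})

numeric_cols = ['price', 'person_capacity', 'picture_count',
                'description_length', 'tenure_months']

# One row per statistic, one column per variable.
summary = listings[numeric_cols].agg(['mean', 'median', 'std'])
print(summary)
```

On the real file, the same call on the full listings frame should reproduce the numbers printed above.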

What are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [13]:
print 'price by', listings.groupby('prop_type').price.mean()
print 'capacity by', listings.groupby('prop_type').person_capacity.mean()
print 'pictures by', listings.groupby('prop_type').picture_count.mean()
print 'description length by', listings.groupby('prop_type').description_length.mean()
print 'tenure by', listings.groupby('prop_type').tenure_months.mean()


price by prop_type
Property type 1    237.1
Property type 2     93.3
Property type 3     63.8
Name: price, dtype: float64
capacity by prop_type
Property type 1    3.5
Property type 2    2.0
Property type 3    1.8
Name: person_capacity, dtype: float64
pictures by prop_type
Property type 1    14.7
Property type 2    13.9
Property type 3     8.8
Name: picture_count, dtype: float64
description length by prop_type
Property type 1    313.2
Property type 2    304.9
Property type 3    184.8
Name: description_length, dtype: float64
tenure by prop_type
Property type 1     8.5
Property type 2     8.4
Property type 3    13.8
Name: tenure_months, dtype: float64
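The five grouped means can also come from a single groupby call. This is a sketch under the same assumptions as before: a recent pandas release and a hypothetical five-row stand-in for the listings frame.

```python
import pandas as pd

# Hypothetical stand-in for listings.csv (first five rows shown above).
listings = pd.DataFrame({
    'prop_type': ['Property type 1', 'Property type 1', 'Property type 2',
                  'Property type 2', 'Property type 1'],
    'price': [140, 95, 95, 90, 125],
    'person_capacity': [3, 2, 2, 2, 5],
    'picture_count': [11, 3, 16, 19, 21],
    'description_length': [232, 37, 172, 472, 442],
    'tenure_months': [30, 29, 29, 28, 28],
})

numeric_cols = ['price', 'person_capacity', 'picture_count',
                'description_length', 'tenure_months']

# One row per property type, one column per variable.
by_type = listings.groupby('prop_type')[numeric_cols].mean()
print(by_type)
```

Passing a list such as ['neighborhood', 'prop_type'] to groupby would give the per-neighborhood breakdown in the same shape.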

Same, but by property type per neighborhood?


In [14]:
neighborhood_price = listings.groupby(['neighborhood','prop_type']).price.mean()
neighborhood_price.unstack(1)


Out[14]:
prop_type Property type 1 Property type 2 Property type 3
neighborhood
Neighborhood 1 85.0 NaN NaN
Neighborhood 10 142.5 137.5 NaN
Neighborhood 11 159.4 78.8 75
Neighborhood 12 365.6 96.9 NaN
Neighborhood 13 241.9 81.1 NaN
... ... ... ...
Neighborhood 5 194.5 NaN NaN
Neighborhood 6 146.0 NaN NaN
Neighborhood 7 161.0 100.0 NaN
Neighborhood 8 174.8 350.0 NaN
Neighborhood 9 151.1 110.0 NaN

22 rows × 3 columns

Plot daily bookings:


In [15]:
bookings.booking_date = pd.to_datetime(bookings.booking_date)
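# Note: attribute-style assignment stores a plain attribute on the object, not a new column;
# bookings['book_date'] = ... would be needed to add a column to the data frame.
bookings.book_date = bookings.booking_date.map(lambda x: x.date())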

bookings.book_date[0]

bookings_by_date = bookings.groupby('booking_date').count()

In [16]:
bookings_by_date.head(5)


Out[16]:
prop_id
booking_date
2011-01-01 11
2011-01-02 9
2011-01-03 10
2011-01-04 8
2011-01-05 15

In [17]:
bookings_by_date.tail(5)


Out[17]:
prop_id
booking_date
2011-12-27 9
2011-12-28 14
2011-12-29 14
2011-12-30 14
2011-12-31 10

In [18]:
bookings_by_date.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 365 entries, 2011-01-01 00:00:00 to 2011-12-31 00:00:00
Data columns (total 1 columns):
prop_id    365 non-null int64
dtypes: int64(1)

In [20]:
bookings_by_date.hist()


Out[20]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000163D7B00>]], dtype=object)

In [150]:
# wait maybe we aren't supposed to build a histogram -- we may want a line chart / time series instead?
bookings_by_date.unstack()
bookings_by_date.rename(columns={'prop_id':'bookings'})


Out[150]:
bookings
booking_date
2011-01-01 11
2011-01-02 9
2011-01-03 10
2011-01-04 8
2011-01-05 15
... ...
2011-12-27 9
2011-12-28 14
2011-12-29 14
2011-12-30 14
2011-12-31 10

365 rows × 1 columns


In [148]:
bookings_by_date.plot()


Out[148]:
<matplotlib.axes._subplots.AxesSubplot at 0x2331d208>

Plot the daily bookings per neighborhood (provide a legend)


In [22]:
# merge listings data into bookings (by prop_id)
bookinginfo = bookings.merge(listings)

bookinginfo.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
booking_date          6076 non-null datetime64[ns]
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
dtypes: datetime64[ns](1), int64(6), object(2)

In [25]:
bookinginfo.head(5)


Out[25]:
prop_id booking_date prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 9 2011-06-17 Property type 1 Neighborhood 13 210 6 27 180 23
1 9 2011-07-23 Property type 1 Neighborhood 13 210 6 27 180 23
2 9 2011-10-28 Property type 1 Neighborhood 13 210 6 27 180 23
3 13 2011-08-12 Property type 2 Neighborhood 14 96 2 10 245 20
4 13 2011-03-17 Property type 2 Neighborhood 14 96 2 10 245 20

In [251]:
neighbookinfo = bookinginfo.groupby(bookinginfo.booking_date)

neighbookinfo.head(5)


Out[251]:
prop_id booking_date prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 9 2011-06-17 Property type 1 Neighborhood 13 210 6 27 180 23
1 9 2011-07-23 Property type 1 Neighborhood 13 210 6 27 180 23
2 9 2011-10-28 Property type 1 Neighborhood 13 210 6 27 180 23
3 13 2011-08-12 Property type 2 Neighborhood 14 96 2 10 245 20
4 13 2011-03-17 Property type 2 Neighborhood 14 96 2 10 245 20
... ... ... ... ... ... ... ... ... ...
5263 326 2011-02-13 Property type 1 Neighborhood 12 120 3 20 395 3
5335 329 2011-09-12 Property type 2 Neighborhood 13 95 2 20 212 3
5405 334 2011-01-18 Property type 1 Neighborhood 15 185 3 28 101 3
5420 334 2011-12-03 Property type 1 Neighborhood 15 185 3 28 101 3
5479 342 2011-12-03 Property type 2 Neighborhood 16 53 1 2 191 3

1819 rows × 9 columns


In [265]:
neigh_new = bookinginfo[['booking_date','neighborhood']]

In [267]:
neigh_new.head(5)


Out[267]:
booking_date neighborhood
0 2011-06-17 Neighborhood 13
1 2011-07-23 Neighborhood 13
2 2011-10-28 Neighborhood 13
3 2011-08-12 Neighborhood 14
4 2011-03-17 Neighborhood 14

In [282]:
neigh_new.groupby(neigh_new.booking_date).count()

neigh_new.plot(x=neigh_new.booking_date, y=neigh_new.count(), by=neigh_new.neighborhood)
# I just can't seem to quite wrangle this dataframe into useable form


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-282-17644444d34f> in <module>()
      1 neigh_new.groupby(neigh_new.booking_date).count()
      2 
----> 3 neigh_new.plot(x=neigh_new.booking_date, y=neigh_new.count(), by=neigh_new.neighborhood)

C:\Anaconda\lib\site-packages\pandas\tools\plotting.pyc in plot_frame(frame, x, y, subplots, sharex, sharey, use_index, figsize, grid, legend, rot, ax, style, title, xlim, ylim, logx, logy, xticks, yticks, kind, sort_columns, fontsize, secondary_y, **kwds)
   2135             label = x if x is not None else frame.index.name
   2136             label = kwds.pop('label', label)
-> 2137             ser = frame[y]
   2138             ser.index.name = label
   2139 

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1670         if isinstance(key, (Series, np.ndarray, list)):
   1671             # either boolean or fancy integer index
-> 1672             return self._getitem_array(key)
   1673         elif isinstance(key, DataFrame):
   1674             return self._getitem_frame(key)

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
   1715         else:
   1716             indexer = self.ix._convert_to_indexer(key, axis=1)
-> 1717             return self.take(indexer, axis=1, convert=True)
   1718 
   1719     def _getitem_multilevel(self, key):

C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in take(self, indices, axis, convert, is_copy)
   1233         new_data = self._data.take(indices,
   1234                                    axis=self._get_block_manager_axis(axis),
-> 1235                                    convert=True, verify=True)
   1236         result = self._constructor(new_data).__finalize__(self)
   1237 

C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in take(self, indexer, axis, verify, convert)
   2968         n = self.shape[axis]
   2969         if convert:
-> 2970             indexer = _maybe_convert_indices(indexer, n)
   2971 
   2972         if verify:

C:\Anaconda\lib\site-packages\pandas\core\indexing.pyc in _maybe_convert_indices(indices, n)
   1630     mask = (indices >= n) | (indices < 0)
   1631     if mask.any():
-> 1632         raise IndexError("indices are out-of-bounds")
   1633     return indices
   1634 

IndexError: indices are out-of-bounds
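The traceback suggests that the Series passed as y was treated as a positional indexer into the columns. One way to get the frame into plottable form is to count rows per (date, neighborhood) pair and then pivot neighborhoods into columns. The block below is a minimal sketch, assuming a recent pandas release and a hypothetical miniature version of bookinginfo.

```python
import pandas as pd

# Hypothetical miniature stand-in for the merged bookinginfo frame.
bookinginfo = pd.DataFrame({
    'booking_date': pd.to_datetime(['2011-06-17', '2011-06-17', '2011-06-17',
                                    '2011-06-18', '2011-06-18']),
    'neighborhood': ['Neighborhood 13', 'Neighborhood 13', 'Neighborhood 14',
                     'Neighborhood 14', 'Neighborhood 14'],
})

# Count bookings per (date, neighborhood), then move neighborhoods into
# columns; days with no booking in a neighborhood become 0.
daily_by_neighborhood = (
    bookinginfo.groupby(['booking_date', 'neighborhood'])
    .size()
    .unstack(fill_value=0)
)
print(daily_by_neighborhood)
```

Calling daily_by_neighborhood.plot(legend=True) on the result should then draw one line per neighborhood with a legend.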

Part 2 - Develop a data set


In [50]:
# group bookings dataframe by prop_id and merge into listings
# calculate booking_rate
bookrate = bookings.groupby('prop_id').count()

bookrate.rename(columns={'booking_date':'times_booked'}, inplace=True)
bookrate.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 328 entries, 1 to 408
Data columns (total 1 columns):
times_booked    328 non-null int64
dtypes: int64(1)

In [46]:
bookrate.head(5)


Out[46]:
times_booked
prop_id
1 4
3 1
4 27
6 88
7 2

In [62]:
bookrate.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 328 entries, 1 to 408
Data columns (total 1 columns):
times_booked    328 non-null int64
dtypes: int64(1)

In [63]:
bookrate.reset_index(inplace=True)

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [66]:
listinginfo = listings.merge(bookrate, on='prop_id', how='left')

In [74]:
listinginfo['booking_rate'] = (listinginfo.times_booked / listinginfo.tenure_months)

In [75]:
listinginfo.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 10 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
times_booked          328 non-null float64
booking_rate          328 non-null float64
dtypes: float64(2), int64(6), object(2)
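The left merge leaves 80 listings with no value for times_booked and booking_rate. A minimal sketch of one way to handle this at the source, assuming that a listing absent from bookings.csv simply had zero bookings (an assumption, not something the data files state), and using hypothetical three-row stand-ins for the two frames:

```python
import pandas as pd

# Hypothetical stand-ins: prop_id 2 never appears in the bookings data.
listings = pd.DataFrame({'prop_id': [1, 2, 3], 'tenure_months': [30, 29, 29]})
bookrate = pd.DataFrame({'prop_id': [1, 3], 'times_booked': [4, 1]})

listinginfo = listings.merge(bookrate, on='prop_id', how='left')

# Assumption: no matching booking rows means zero bookings.
listinginfo['times_booked'] = listinginfo['times_booked'].fillna(0).astype(int)
listinginfo['booking_rate'] = (
    listinginfo['times_booked'] / listinginfo['tenure_months']
)
print(listinginfo)
```

Filling before the train/test split keeps the training and test targets consistent, rather than patching only one of them afterwards.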

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [224]:
srlistings = listinginfo[listinginfo.tenure_months>=10]

srlistings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 10 columns):
prop_id               144 non-null int64
prop_type             144 non-null object
neighborhood          144 non-null object
price                 144 non-null int64
person_capacity       144 non-null int64
picture_count         144 non-null int64
description_length    144 non-null int64
tenure_months         144 non-null int64
times_booked          113 non-null float64
booking_rate          113 non-null float64
dtypes: float64(2), int64(6), object(2)

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. After applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [225]:
srlistings.groupby('prop_type').prop_id.count()


Out[225]:
prop_type
Property type 1    98
Property type 2    44
Property type 3     2
Name: prop_id, dtype: int64

In [226]:
# need to construct dummy variables in pandas 0.14.1
prop_type_dummies = pd.get_dummies(srlistings.prop_type)

neighborhood_dummies = pd.get_dummies(srlistings.neighborhood)

srlistings = srlistings.merge(prop_type_dummies,left_index=True, right_index=True)
srlistings = srlistings.merge(neighborhood_dummies, left_index=True, right_index=True)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 29 columns):
prop_id               144 non-null int64
prop_type             144 non-null object
neighborhood          144 non-null object
price                 144 non-null int64
person_capacity       144 non-null int64
picture_count         144 non-null int64
description_length    144 non-null int64
tenure_months         144 non-null int64
times_booked          113 non-null float64
booking_rate          113 non-null float64
Property type 1       144 non-null float64
Property type 2       144 non-null float64
Property type 3       144 non-null float64
Neighborhood 11       144 non-null float64
Neighborhood 12       144 non-null float64
Neighborhood 13       144 non-null float64
Neighborhood 14       144 non-null float64
Neighborhood 15       144 non-null float64
Neighborhood 16       144 non-null float64
Neighborhood 17       144 non-null float64
Neighborhood 18       144 non-null float64
Neighborhood 19       144 non-null float64
Neighborhood 20       144 non-null float64
Neighborhood 21       144 non-null float64
Neighborhood 4        144 non-null float64
Neighborhood 5        144 non-null float64
Neighborhood 7        144 non-null float64
Neighborhood 8        144 non-null float64
Neighborhood 9        144 non-null float64
dtypes: float64(21), int64(6), object(2)
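In recent pandas releases, get_dummies can encode several columns of a frame in one call and drop the originals, which replaces the two merges above. This is a sketch, assuming a recent pandas (not 0.14.1) and a hypothetical three-row stand-in for srlistings; note that the resulting column names carry a prefix, unlike the ones shown above.

```python
import pandas as pd

# Hypothetical miniature stand-in for srlistings.
srlistings = pd.DataFrame({
    'prop_type': ['Property type 1', 'Property type 2', 'Property type 1'],
    'neighborhood': ['Neighborhood 14', 'Neighborhood 16', 'Neighborhood 15'],
    'price': [140, 95, 125],
})

# Encode both categorical columns; the original columns are removed and
# new ones are named '<column>_<category>'.
encoded = pd.get_dummies(srlistings, columns=['prop_type', 'neighborhood'])
print(encoded.columns.tolist())
```

If one category per variable should be left out as a baseline, get_dummies also accepts drop_first=True.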

In [230]:
srlistings.drop(['prop_id','prop_type','neighborhood'], axis=1, inplace=True)

Create test and training sets for your regressors and response.

The response (y) is booking_rate; the regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [158]:
from sklearn.cross_validation import train_test_split

In [233]:
#modellistings = srlistings.ix[::,['price','person_capacity','picture_count','description_length','tenure_months']]

modellistings = srlistings.drop(['times_booked','booking_rate'], axis=1)

In [234]:
modellistings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 24 columns):
price                 144 non-null int64
person_capacity       144 non-null int64
picture_count         144 non-null int64
description_length    144 non-null int64
tenure_months         144 non-null int64
Property type 1       144 non-null float64
Property type 2       144 non-null float64
Property type 3       144 non-null float64
Neighborhood 11       144 non-null float64
Neighborhood 12       144 non-null float64
Neighborhood 13       144 non-null float64
Neighborhood 14       144 non-null float64
Neighborhood 15       144 non-null float64
Neighborhood 16       144 non-null float64
Neighborhood 17       144 non-null float64
Neighborhood 18       144 non-null float64
Neighborhood 19       144 non-null float64
Neighborhood 20       144 non-null float64
Neighborhood 21       144 non-null float64
Neighborhood 4        144 non-null float64
Neighborhood 5        144 non-null float64
Neighborhood 7        144 non-null float64
Neighborhood 8        144 non-null float64
Neighborhood 9        144 non-null float64
dtypes: float64(19), int64(5)

In [235]:
X_train, X_test, y_train, y_test = train_test_split(modellistings,srlistings.booking_rate, test_size=.3)

In [236]:
modellistings.sort_index(by='tenure_months')


Out[236]:
price person_capacity picture_count description_length tenure_months Property type 1 Property type 2 Property type 3 Neighborhood 11 Neighborhood 12 ... Neighborhood 17 Neighborhood 18 Neighborhood 19 Neighborhood 20 Neighborhood 21 Neighborhood 4 Neighborhood 5 Neighborhood 7 Neighborhood 8 Neighborhood 9
143 100 2 5 35 10 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
120 69 3 15 418 10 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
121 125 2 24 756 10 1 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
122 63 1 6 575 10 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
123 135 2 19 538 10 1 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4 125 5 21 442 28 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 90 2 19 472 28 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 95 2 16 172 29 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 95 2 3 37 29 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
0 140 3 11 232 30 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

144 rows × 24 columns


In [237]:
y_train = np.nan_to_num(y_train)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [238]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [170]:
#import matplotlib.pylab as plt
#from sklearn.preprocessing import PolynomialFeatures
#from sklearn.pipeline import make_pipeline
#from IPython.core.pylabtools import figsize
#figsize(5,5)
#plt.style.use('fivethirtyeight')

Fit your model with your training sets


In [239]:
lr.fit(X_train,y_train)

print('Coefficients: \n', lr.coef_)


('Coefficients: \n', array([ -5.15980915e-04,   1.57102967e-02,   2.04011171e-02,
         1.53288879e-03,  -7.88203222e-02,  -9.02142932e-01,
         3.00686525e-01,   6.01456408e-01,   3.54698057e-01,
        -6.07957700e-01,   1.82638734e-01,   1.26453752e-02,
         4.18162607e-01,  -4.43934253e-01,  -1.42811465e-01,
         3.64039496e-01,   1.00859318e+00,  -2.77319193e-01,
        -5.13646903e-01,  -6.74803004e-01,  -3.37308868e-01,
        -5.87278460e-01,  -4.40710666e-01,   1.68499307e+00]))

In [243]:
modellistings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 24 columns):
price                 144 non-null int64
person_capacity       144 non-null int64
picture_count         144 non-null int64
description_length    144 non-null int64
tenure_months         144 non-null int64
Property type 1       144 non-null float64
Property type 2       144 non-null float64
Property type 3       144 non-null float64
Neighborhood 11       144 non-null float64
Neighborhood 12       144 non-null float64
Neighborhood 13       144 non-null float64
Neighborhood 14       144 non-null float64
Neighborhood 15       144 non-null float64
Neighborhood 16       144 non-null float64
Neighborhood 17       144 non-null float64
Neighborhood 18       144 non-null float64
Neighborhood 19       144 non-null float64
Neighborhood 20       144 non-null float64
Neighborhood 21       144 non-null float64
Neighborhood 4        144 non-null float64
Neighborhood 5        144 non-null float64
Neighborhood 7        144 non-null float64
Neighborhood 8        144 non-null float64
Neighborhood 9        144 non-null float64
dtypes: float64(19), int64(5)

In [240]:
lr.predict(X_test)


Out[240]:
array([-0.06599738,  2.32405604,  1.18787317, -0.23328787,  1.09711999,
       -0.16546032,  0.73572247,  0.5511452 ,  1.46521969,  0.3648306 ,
        0.78192198,  0.14247383,  1.18942348,  1.11047552,  2.10705106,
        1.3818424 ,  2.29948621,  0.48172804,  2.77998107,  0.87298155,
        1.20971872,  1.70067505,  0.7009982 , -0.37032488,  1.16190267,
        0.6640804 ,  1.30046839,  0.11121503,  1.56715565,  3.38038782,
        0.171327  ,  0.70128567,  0.32396861,  4.43545202,  2.02793682,
        1.26352596,  1.6955839 ,  1.40814721, -0.04009127,  0.95441145,
        1.97966647,  1.20869839,  1.8460597 ,  0.82481806])

In [241]:
lr.score(X_train, y_train, sample_weight=None)


Out[241]:
0.40364920996567188

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

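The score method returns R^2, the coefficient of determination: the fraction of the variance in booking rate that the regressors explain. Here it is about 0.40 (roughly 40%), and because it was computed on the training set it likely overstates how well the model would do on unseen listings.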

  • Other factors than price / size / pics / desc / tenure may be important (e.g. rating, responsiveness).
  • Some of the selected factors (e.g. tenure) may be unimportant and thus add noise to the model.
  • Still need to add category, neighborhood dummies.
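To make the meaning of the score concrete, R^2 can be computed by hand as 1 - SS_res / SS_tot. The helper below is an illustrative sketch (the name r_squared is hypothetical, not part of scikit-learn) using NumPy only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                     # perfect predictions -> 1.0
print(r_squared(y, np.full(4, y.mean())))  # always predicting the mean -> 0.0
```

A model that predicts worse than the mean gets a negative value, so the score is not bounded below by zero.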

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [ ]:
# Simplify the model: remove neighborhoods, tenure

In [246]:
newmodellistings = modellistings.ix[::,['price','person_capacity','picture_count','description_length','Property type 1','Property type 2','Property type 3']]
X_train, X_test, y_train, y_test = train_test_split(newmodellistings,srlistings.booking_rate, test_size=.3)

In [248]:
y_train = np.nan_to_num(y_train)

lr.fit(X_train,y_train)

print('Coefficients: \n', lr.coef_)


('Coefficients: \n', array([-0.00087834,  0.00581314,  0.02427057,  0.00131916, -0.4251979 ,
        0.66151885, -0.23632095]))

In [249]:
lr.score(X_train, y_train, sample_weight=None)


Out[249]:
0.23673363337564579

In [ ]:
# Maybe use a lasso or something to constrain the list of terms? With only 144 data points, 24 columns seems excessive
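A rough sketch of the lasso idea follows. It assumes a recent scikit-learn (where train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation) and runs on synthetic data with the same shape as the model matrix, 144 rows by 24 columns. The alpha value is an arbitrary choice for illustration, not a tuned setting.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Hypothetical data: 144 rows, 24 features, only the first two matter.
X = rng.normal(size=(144, 24))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=144)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lasso = Lasso(alpha=0.1)  # alpha chosen arbitrarily for this sketch
lasso.fit(X_train, y_train)

# The L1 penalty can shrink uninformative coefficients to exactly zero.
print('non-zero coefficients:', int(np.sum(lasso.coef_ != 0)))
print('train R^2:', lasso.score(X_train, y_train))
print('test R^2:', lasso.score(X_test, y_test))
```

Scoring on X_test as well as X_train gives a more honest picture than the training score alone. On the real listings the features sit on very different scales, so standardizing them before fitting a penalized model is probably worthwhile.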