Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ are two data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [172]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)

Part 1 - Data exploration

First, create two data frames, listings and bookings, from their respective data files.


In [22]:
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')
listings.info()
listings.columns
# Clean up listings: strip the "Property type " prefix from prop_type and the
# "Neighborhood " prefix from neighborhood, then convert both columns to int
listings['prop_type'] = listings['prop_type'].map(lambda x: x.replace("Property type ", "")).astype(int)
listings['neighborhood'] = listings['neighborhood'].map(lambda x: x.replace("Neighborhood ", "")).astype(int)
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null int64
neighborhood          408 non-null int64
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(8)
memory usage: 28.7 KB
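
As an aside, an equivalent cleanup uses pandas' vectorized string accessor in place of map/lambda. This assumes the raw columns are still strings, so it would have to run before the conversion above:

In [ ]:
# same result as the map(lambda ...) version above
listings['prop_type'] = listings['prop_type'].str.replace('Property type ', '').astype(int)
listings['neighborhood'] = listings['neighborhood'].str.replace('Neighborhood ', '').astype(int)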

What are the mean, median, and standard deviation of price, person capacity, picture count, description length, and tenure of the properties?


In [23]:
print "mean :"
print listings.drop('prop_id',1).mean()
print "\nmedian :"
print listings.drop('prop_id',1).median()
print "\nstandard deviation :"
print listings.drop('prop_id',1).std()


mean :
prop_type               1.4
neighborhood           14.1
price                 187.8
person_capacity         3.0
picture_count          14.4
description_length    309.2
tenure_months           8.5
dtype: float64

median :
prop_type               1
neighborhood           14
price                 125
person_capacity         2
picture_count          12
description_length    250
tenure_months           7
dtype: float64

standard deviation :
prop_type               0.5
neighborhood            3.2
price                 353.1
person_capacity         1.6
picture_count          10.5
description_length    228.0
tenure_months           5.9
dtype: float64
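
As a cross-check, describe() reports all of these statistics (the median appears as the 50% quantile) in a single call:

In [ ]:
# count, mean, std, min, quartiles (50% = median), and max for every numeric column
listings.drop('prop_id', axis=1).describe()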

What are the mean price, person capacity, picture count, description length, and tenure of the properties, grouped by property type?


In [24]:
groupedListings = listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].agg(['mean'])
groupedListings


Out[24]:
           price  person_capacity  picture_count  description_length  tenure_months
            mean             mean           mean                mean           mean
prop_type
1          237.1              3.5           14.7               313.2            8.5
2           93.3              2.0           13.9               304.9            8.4
3           63.8              1.8            8.8               184.8           13.8

Same, but by property type per neighborhood?


In [25]:
groupedListings = listings.groupby(['prop_type', 'neighborhood'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].agg(['mean'])
groupedListings


Out[25]:
                        price  person_capacity  picture_count  description_length  tenure_months
                         mean             mean           mean                mean           mean
prop_type neighborhood
1         1              85.0              2.0           26.0               209.0            6.0
          2             250.0              6.0            8.0               423.0            6.0
          5             194.5              2.5            8.5               266.5           11.5
          6             146.0              3.3           12.7               290.7            4.0
          7             161.0              3.7           14.3               343.0            5.3
...                       ...              ...            ...                 ...            ...
2         20             60.0              1.0            3.0               101.0            6.0
3         4              40.0              2.0            4.0               241.0           19.0
          11             75.0              2.0           15.0               196.0            8.0
          14             75.0              1.0            1.0               113.0            5.0
          17             65.0              2.0           15.0               189.0           23.0

40 rows × 5 columns

Plot daily bookings:


In [206]:
groupedBookings = bookings.groupby('booking_date').agg(['count'])
groupedBookings.hist()
# a line plot over time would be cleaner: groupedBookings.plot()


Out[206]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10dcf9910>]], dtype=object)
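
A cleaner variant, assuming booking_date holds ISO date strings that pd.to_datetime can parse: convert to real dates and plot the counts as a time series rather than a histogram.

In [ ]:
# bookings per calendar day, in chronological order
daily = pd.to_datetime(bookings['booking_date']).value_counts().sort_index()
daily.plot()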

Plot the daily bookings per neighborhood (provide a legend)


In [27]:
combined = bookings.merge(listings)
# count bookings per day per neighborhood; unstacking gives one column (and one
# plotted line) per neighborhood, and DataFrame.plot() adds the legend for us
dailyByNeighborhood = combined.groupby(['booking_date', 'neighborhood']).size().unstack('neighborhood')
dailyByNeighborhood.plot()


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c43f210>

Part 2 - Develop a data set


In [28]:
bookings


Out[28]:
prop_id booking_date
0 9 2011-06-17
1 13 2011-08-12
2 21 2011-06-20
3 28 2011-05-05
4 29 2011-11-17
... ... ...
6071 408 2011-06-02
6072 408 2011-08-22
6073 408 2011-07-24
6074 408 2011-01-12
6075 408 2011-09-08

6076 rows × 2 columns


In [29]:
listings


Out[29]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 1 14 140 3 11 232 30
1 2 1 14 95 2 3 37 29
2 3 2 16 95 2 16 172 29
3 4 2 13 90 2 19 472 28
4 5 1 15 125 5 21 442 28
... ... ... ... ... ... ... ... ...
403 404 2 14 100 1 8 235 1
404 405 2 13 85 2 27 1048 1
405 406 1 9 70 3 18 153 1
406 407 1 13 129 2 13 370 1
407 408 1 14 100 3 21 707 1

408 rows × 8 columns


In [30]:
# count bookings per property (count() tallies non-null rows per column)
groupedProperty = bookings.groupby('prop_id').count()
groupedProperty.reset_index(inplace=True)

In [31]:
groupedProperty.rename(columns = {'booking_date' : 'bookings'}, inplace=True)
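
Equivalently, size() counts rows without relying on any particular column, which makes the rename unnecessary:

In [ ]:
# one row per property with its booking count
groupedProperty = bookings.groupby('prop_id').size().reset_index(name='bookings')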

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [32]:
bookingListings = listings.merge(groupedProperty)
bookingListings


Out[32]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months bookings
0 1 1 14 140 3 11 232 30 4
1 3 2 16 95 2 16 172 29 1
2 4 2 13 90 2 19 472 28 27
3 6 2 13 89 2 10 886 28 88
4 7 2 13 85 1 11 58 24 2
... ... ... ... ... ... ... ... ... ...
323 404 2 14 100 1 8 235 1 3
324 405 2 13 85 2 27 1048 1 19
325 406 1 9 70 3 18 153 1 19
326 407 1 13 129 2 13 370 1 15
327 408 1 14 100 3 21 707 1 54

328 rows × 9 columns
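
One caveat: merge() defaults to an inner join, so listings that never appear in bookings are dropped here (408 rows shrink to 328). If we wanted to keep them with zero bookings instead, a left join would do it:

In [ ]:
# keep all 408 listings; unbooked properties get a booking count of 0
allListings = listings.merge(groupedProperty, how='left')
allListings['bookings'] = allListings['bookings'].fillna(0).astype(int)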


In [33]:
bookingListings['booking_rate'] = bookingListings['bookings'] / bookingListings['tenure_months']
bookingListings


Out[33]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months bookings booking_rate
0 1 1 14 140 3 11 232 30 4 0.1
1 3 2 16 95 2 16 172 29 1 0.0
2 4 2 13 90 2 19 472 28 27 1.0
3 6 2 13 89 2 10 886 28 88 3.1
4 7 2 13 85 1 11 58 24 2 0.1
... ... ... ... ... ... ... ... ... ... ...
323 404 2 14 100 1 8 235 1 3 3.0
324 405 2 13 85 2 27 1048 1 19 19.0
325 406 1 9 70 3 18 153 1 19 19.0
326 407 1 13 129 2 13 370 1 15 15.0
327 408 1 14 100 3 21 707 1 54 54.0

328 rows × 10 columns

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [34]:
establishedProperties = bookingListings[bookingListings['tenure_months'] >= 10]
establishedProperties


Out[34]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months bookings booking_rate
0 1 1 14 140 3 11 232 30 4 0.1
1 3 2 16 95 2 16 172 29 1 0.0
2 4 2 13 90 2 19 472 28 27 1.0
3 6 2 13 89 2 10 886 28 88 3.1
4 7 2 13 85 1 11 58 24 2 0.1
... ... ... ... ... ... ... ... ... ... ...
108 139 2 19 85 3 35 852 10 39 3.9
109 140 1 12 200 4 18 125 10 10 1.0
110 141 2 12 45 2 36 281 10 1 0.1
111 142 2 15 96 2 9 138 10 48 4.8
112 143 2 15 58 2 7 135 10 21 2.1

113 rows × 10 columns

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each of these columns of categorical data into many columns of boolean values. (After applying this function correctly there should be one column for every prop_type and one column for every neighborhood category.)


In [163]:
establishedProperties.info()
establishedProperties.prop_type.value_counts()
full_table = pd.get_dummies(establishedProperties, columns=['prop_type', 'neighborhood'])


<class 'pandas.core.frame.DataFrame'>
Int64Index: 113 entries, 0 to 112
Data columns (total 10 columns):
prop_id               113 non-null int64
prop_type             113 non-null int64
neighborhood          113 non-null int64
price                 113 non-null int64
person_capacity       113 non-null int64
picture_count         113 non-null int64
description_length    113 non-null int64
tenure_months         113 non-null int64
bookings              113 non-null int64
booking_rate          113 non-null float64
dtypes: float64(1), int64(9)
memory usage: 9.7 KB

In [164]:
pd.__version__  # sanity check: the columns= keyword to get_dummies was added in pandas 0.15
full_table


Out[164]:
prop_id price person_capacity picture_count description_length tenure_months bookings booking_rate prop_type_1 prop_type_2 ... neighborhood_12 neighborhood_13 neighborhood_14 neighborhood_15 neighborhood_16 neighborhood_17 neighborhood_18 neighborhood_19 neighborhood_20 neighborhood_21
0 1 140 3 11 232 30 4 0.1 1 0 ... 0 0 1 0 0 0 0 0 0 0
1 3 95 2 16 172 29 1 0.0 0 1 ... 0 0 0 0 1 0 0 0 0 0
2 4 90 2 19 472 28 27 1.0 0 1 ... 0 1 0 0 0 0 0 0 0 0
3 6 89 2 10 886 28 88 3.1 0 1 ... 0 1 0 0 0 0 0 0 0 0
4 7 85 1 11 58 24 2 0.1 0 1 ... 0 1 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
108 139 85 3 35 852 10 39 3.9 0 1 ... 0 0 0 0 0 0 0 1 0 0
109 140 200 4 18 125 10 10 1.0 1 0 ... 1 0 0 0 0 0 0 0 0 0
110 141 45 2 36 281 10 1 0.1 0 1 ... 1 0 0 0 0 0 0 0 0 0
111 142 96 2 9 138 10 48 4.8 0 1 ... 0 0 0 1 0 0 0 0 0 0
112 143 58 2 7 135 10 21 2.1 0 1 ... 0 0 0 1 0 0 0 0 0 0

113 rows × 25 columns
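
A caveat worth knowing (not required by the assignment): with fit_intercept=True, keeping every dummy level makes the design matrix perfectly collinear (the "dummy variable trap"). Newer versions of pandas (0.18+, if memory serves) can drop one level per category:

In [ ]:
# requires a newer pandas than the one used above
full_table = pd.get_dummies(establishedProperties,
                            columns=['prop_type', 'neighborhood'],
                            drop_first=True)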

Create test and training sets for your regressors and predictors.

The target (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [194]:
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in sklearn >= 0.18

full_table = pd.get_dummies(establishedProperties, columns=['prop_type', 'neighborhood'])
features = full_table.drop(['prop_id', 'bookings', 'booking_rate'], axis=1)

full_table['booking_rate'].hist()
# booking_rate is heavily right-skewed, so model its log instead
full_table['log_booking_rate'] = np.log(full_table['booking_rate'])
full_table.log_booking_rate.hist()
target = full_table['log_booking_rate'].values
target


Out[194]:
array([-2.01490302, -3.36729583, -0.03636764,  1.1451323 , -2.48490665,
        0.19671029, -2.03688193,  0.12260232, -1.1451323 ,  0.71562004,
        1.04731899,  0.43825493, -0.45953233,  0.83975065, -0.99852883,
       -2.94443898, -0.7472144 , -0.69314718,  1.06087196,  0.42488319,
       -1.73460106, -2.14006616,  1.38629436, -0.69314718, -2.07944154,
       -0.57536414, -2.77258872, -2.77258872, -1.67397643, -2.07944154,
       -2.07944154, -2.07944154, -2.07944154, -2.07944154, -1.67397643,
       -2.77258872, -1.67397643, -2.07944154, -0.69314718, -1.67397643,
       -2.07944154, -2.07944154, -1.38629436, -1.67397643, -2.77258872,
       -1.38629436, -2.07944154, -2.77258872, -0.82667857, -1.67397643,
       -0.82667857, -0.14310084,  1.5114575 ,  0.62415431, -1.09861229,
       -2.01490302, -2.7080502 , -1.32175584, -2.01490302, -2.01490302,
        1.40282366, -1.60943791, -2.7080502 ,  0.06453852,  1.2039728 ,
       -1.32175584,  0.        ,  0.40546511, -2.63905733, -0.44183275,
        1.59504917, -0.69314718, -0.15415068, -1.02961942,  0.19415601,
        0.26826399,  0.37948962,  0.47957308, -0.26236426,  0.9315582 ,
       -0.08701138, -0.28768207,  0.94908055, -1.79175947, -1.38629436,
        1.12601126,  0.69314718, -1.29928298, -0.31845373, -0.31845373,
        0.        , -2.39789527,  1.19392247,  1.33500107,  1.16315081,
        1.28093385, -0.91629073,  1.09861229,  1.41098697,  1.22377543,
        0.64185389,  1.90210753,  0.69314718,  0.78845736,  1.30833282,
        1.5260563 , -2.30258509,  1.30833282,  1.36097655,  0.        ,
       -2.30258509,  1.56861592,  0.74193734])

In [192]:
a_train, a_test, b_train, b_test = train_test_split(features, target, test_size=0.2, random_state=3)
print("a_train = {}".format(a_train.shape))
print("a_test = {}".format(a_test.shape))
print("b_train = {}".format(b_train.shape))
print("b_test = {}".format(b_test.shape))


a_train = (90, 22)
a_test = (23, 22)
b_train = (90,)
b_test = (23,)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [195]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(a_train, b_train)
lr.get_params()


Out[195]:
{'copy_X': True, 'fit_intercept': True, 'normalize': False}
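
To see what the model actually learned, it helps to pair each regressor with its fitted weight; a quick sketch using the features DataFrame built above:

In [ ]:
# learned weight for each regressor, plus the intercept
for name, coef in zip(features.columns, lr.coef_):
    print("{:>22}: {: .4f}".format(name, coef))
print("intercept: {:.4f}".format(lr.intercept_))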

Use your model to generate predictions for the test set


In [189]:
a_predictions = lr.predict(a_test)
print(a_predictions)
print(a_test)


[ -1.60326324  -1.4389137   -1.58931676   0.40861997  -0.55769854
   1.00780799  -0.47199401   0.68918824   0.85993268  -1.26929308
   0.89657448  -1.21783828   0.86832012   0.35317491  -0.53262232
  -0.93520844  -0.85092676  -0.1635407    0.7175301    0.93482118
 -13.91401022   0.19875565  -1.08584682]
[[  2.85000000e+02   5.00000000e+00   6.00000000e+00   2.41000000e+02
    1.60000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  1.20000000e+02   2.00000000e+00   1.30000000e+01   6.85000000e+02
    2.30000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.10000000e+02   6.00000000e+00   2.70000000e+01   1.80000000e+02
    2.30000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  1.10000000e+02   3.00000000e+00   2.90000000e+01   4.37000000e+02
    1.00000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.75000000e+02   8.00000000e+00   1.90000000e+01   4.57000000e+02
    1.50000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  6.30000000e+01   1.00000000e+00   6.00000000e+00   5.75000000e+02
    1.00000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  1.35000000e+02   2.00000000e+00   1.90000000e+01   5.38000000e+02
    1.00000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  9.00000000e+01   3.00000000e+00   5.00000000e+00   5.82000000e+02
    1.60000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  6.00000000e+01   3.00000000e+00   1.20000000e+01   2.73000000e+02
    1.50000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.00000000e+02   2.00000000e+00   6.00000000e+00   2.88000000e+02
    1.60000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  8.00000000e+01   2.00000000e+00   2.00000000e+01   4.91000000e+02
    1.20000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  4.00000000e+02   8.00000000e+00   1.80000000e+01   3.21000000e+02
    1.00000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  4.50000000e+01   2.00000000e+00   3.60000000e+01   2.81000000e+02
    1.00000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  5.50000000e+01   2.00000000e+00   8.00000000e+00   3.33000000e+02
    1.10000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  9.50000000e+01   2.00000000e+00   8.00000000e+00   1.37000000e+02
    1.60000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.29000000e+02   4.00000000e+00   5.00000000e+00   3.88000000e+02
    1.60000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  8.50000000e+01   1.00000000e+00   1.10000000e+01   5.80000000e+01
    2.40000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  9.50000000e+01   2.00000000e+00   1.50000000e+01   2.55000000e+02
    1.90000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  9.90000000e+01   2.00000000e+00   2.00000000e+01   1.43000000e+02
    1.00000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  8.50000000e+01   2.00000000e+00   3.30000000e+01   2.48000000e+02
    1.20000000e+01   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.39400000e+03   8.00000000e+00   1.00000000e+01   3.63000000e+02
    1.60000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  1.00000000e+02   3.00000000e+00   2.30000000e+01   1.23000000e+02
    1.00000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  2.45000000e+02   5.00000000e+00   1.70000000e+01   2.82000000e+02
    1.00000000e+01   1.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]]

In [190]:
lr.score(a_train, b_train)


Out[190]:
0.544217540707576

In [191]:
lr.score(a_test, b_test)


Out[191]:
-3.1774494576158769

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

score() returns the coefficient of determination, R², of the predictions: 1.0 is a perfect fit, 0 means the model does no better than always predicting the mean of the targets, and negative values are worse than that baseline.

Our model fits the training data moderately well (R² ≈ 0.54) but does very poorly on the test set (R² ≈ −3.18), which tells us the model is badly overfit and does not generalize to unseen listings.
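
For regressors, score() computes R² = 1 − SS_res/SS_tot; a quick sketch verifying the test-set value by hand, using the a_test/b_test arrays from above:

In [ ]:
from sklearn.metrics import r2_score

# R^2 = 1 - SS_res / SS_tot on the test set
b_pred = lr.predict(a_test)
ss_res = ((b_test - b_pred) ** 2).sum()
ss_tot = ((b_test - b_test.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)        # matches lr.score(a_test, b_test)
print(r2_score(b_test, b_pred))   # same value via sklearn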

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern as in Part 3 to model it.
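
A minimal sketch, assuming each row of bookings represents one booked night, so that monthly revenue ≈ price × bookings / tenure_months (a hypothetical definition; the data does not define revenue):

In [ ]:
# hypothetical target: average revenue per month of tenure
full_table['monthly_revenue'] = (full_table['price'] * full_table['bookings']
                                 / full_table['tenure_months'])

X = full_table.drop(['prop_id', 'bookings', 'booking_rate',
                     'log_booking_rate', 'monthly_revenue'], axis=1)
y = full_table['monthly_revenue'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
lr2 = LinearRegression()
lr2.fit(X_train, y_train)
print(lr2.score(X_train, y_train))
print(lr2.score(X_test, y_test))

Note that price appears both in X and in the definition of this target, so a good fit here would be partly mechanical; interpret the scores with care.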


In [ ]:


In [ ]: