In this assignment your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and to begin our exploration of regression.
In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
# This enables inline plots
%matplotlib inline
# Limit rows displayed in notebook
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)
In [3]:
bookings = pd.read_csv('../data/bookings.csv', parse_dates=['booking_date'])  # parse_dates=True only tries to parse the index; name the column explicitly
listings = pd.read_csv('../data/listings.csv')
In [4]:
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe().loc[['mean', '50%', 'std']]
Out[4]:
In [5]:
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()
Out[5]:
In [6]:
listings.groupby(['neighborhood','prop_type'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()
Out[6]:
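For a side-by-side view, a small sketch (not part of the assignment code): unstack() pivots prop_type out of the row index into columns, one metric at a time.

listings.groupby(['neighborhood', 'prop_type'])['price'].mean().unstack()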
In [7]:
lb = listings.merge(bookings)
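A quick note as a sketch: merge() with no arguments joins on whatever columns the two frames share (here, presumably just prop_id) and performs an inner join by default, so listings with no bookings are dropped. Making that explicit:

lb = listings.merge(bookings, on='prop_id', how='inner')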
In [8]:
lb.booking_date = pd.to_datetime(lb.booking_date)  # a no-op if parse_dates already converted it
lb['day'] = lb.booking_date.dt.dayofweek
# interpreting 'daily' as day of the week: Monday, Tuesday, Wednesday...
day_names = {0: 'M', 1: 'Tu', 2: 'W', 3: 'Th', 4: 'F', 5: 'Sa', 6: 'Su'}
lb['day_name'] = lb['day'].map(day_names)
lb.head()
# grouping by the numeric day keeps the bars in Monday-to-Sunday order
lb.groupby('day').day_name.value_counts().plot(kind='bar');
Out[8]:
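As an alternative to relying on the numeric day grouping, a sketch (assuming the day_name column above): value_counts() sorts by frequency, so reindex() is one way to impose an explicit weekday order.

day_order = ['M', 'Tu', 'W', 'Th', 'F', 'Sa', 'Su']
lb['day_name'].value_counts().reindex(day_order).plot(kind='bar');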
In [9]:
#lb.groupby(['neighborhood','day']).day_name.value_counts().plot(kind='bar', figsize=(15,10), legend=True);
# seaborn can build the grouped count plot directly
sns.factorplot("neighborhood", hue="day_name", data=lb, kind="bar", palette="Greens_d", size=30);
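In older seaborn, factorplot without a y variable plots category counts; a sketch of the equivalent in newer versions (assuming seaborn >= 0.9, where factorplot became catplot):

sns.catplot(x="neighborhood", hue="day_name", data=lb, kind="count", palette="Greens_d", height=6, aspect=2);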
In [10]:
# reset the index so prop_id and booking_date become columns rather than the index
count_bookings = bookings.groupby('prop_id').count().reset_index()
lb_new = listings.merge(count_bookings, on='prop_id')
lb_new.rename(columns={'booking_date': 'number_of_bookings'}, inplace=True)
lb_new['booking_rate'] = lb_new.number_of_bookings / lb_new.tenure_months
lb_new
Out[10]:
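A sketch of an equivalent count using size(), which avoids the per-column count and the rename (same bookings frame assumed):

count_bookings = bookings.groupby('prop_id').size().reset_index(name='number_of_bookings')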
In [11]:
lb_new2 = lb_new[lb_new['tenure_months'] >= 10]
lb_new2
Out[11]:
prop_type and neighborhood are categorical variables; use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. (After applying this function correctly there should be one column for every prop_type category and one column for every neighborhood category.)
In [14]:
pt_dummy = pd.get_dummies(lb_new2['prop_type'])
n_dummy = pd.get_dummies(lb_new2['neighborhood'])
lb_new3 = lb_new2.join(pt_dummy)
lb_new3 = lb_new3.join(n_dummy)
#pd.set_option('display.max_columns', 100)
#lb_new3
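A sketch, assuming a reasonably recent pandas: get_dummies can encode both categorical columns in one call, keeping all other columns and prefixing each dummy with its source column name.

lb_new3_alt = pd.get_dummies(lb_new2, columns=['prop_type', 'neighborhood'])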
The predictor (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
In [15]:
# note: in scikit-learn >= 0.18 this lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
In [46]:
# remove columns that are neither regressors nor the predictor
lb_4 = lb_new3.drop(['prop_id', 'prop_type', 'neighborhood', 'number_of_bookings'], axis=1)
# rel_features holds only the feature columns
rel_features = lb_4.drop(['booking_rate'], axis=1).values
target = lb_4['booking_rate'].values
target = np.log(target)  # log-transform the skewed booking rate
# note: train_size=.3 trains on only 30% of the rows; test_size=.3 (70% train) is more conventional
feat_train, feat_test, tar_train, tar_test = train_test_split(rel_features, target, train_size=.3, random_state=9)
In [47]:
lb_4
Out[47]:
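A sketch of the modern equivalent, assuming scikit-learn >= 0.18 (where train_test_split moved to model_selection) and the more conventional 70/30 split:

from sklearn.model_selection import train_test_split
feat_train, feat_test, tar_train, tar_test = train_test_split(
    rel_features, target, test_size=0.3, random_state=9)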
In [48]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
In [49]:
lr.fit(feat_train, tar_train)
Out[49]:
In [50]:
tar_pred = lr.predict(feat_test)
In [51]:
# residual sum of squares for the model's predictions
sum_sq_model = np.sum((tar_test - tar_pred) ** 2)
sum_sq_model
Out[51]:
In [52]:
# total sum of squares: the error of a naive model that always predicts the mean
sum_sq_naive = np.sum((tar_test - tar_test.mean()) ** 2)
sum_sq_naive
Out[52]:
In [53]:
fig, ax = plt.subplots(1, 1)
ax.scatter(tar_pred, tar_test)
# draw the ideal y = x line
ax.plot(target, target, 'r')
ax.set_xlabel('predicted log booking rate')
ax.set_ylabel('actual log booking rate')
Out[53]:
In [54]:
# compute R^2 by hand before using the built-in score method
1 - (sum_sq_model/sum_sq_naive)
Out[54]:
In [55]:
#lr.score(rel_features, target, sample_weight=None)
lr.score(feat_test, tar_test, sample_weight=None)
Out[55]:
It seems that the score method provides the coefficient of determination, R^2 = 1 - (residual sum of squares / total sum of squares). Because it compares predictions on the test features against the test targets, its value varies with what ends up in the test set. The score (coefficient of determination) is generally an indicator of how good your model is.
For this model I get fairly low numbers, which is not great, and they vary from run to run, so the model isn't particularly consistent (though I don't know how much variance is arguably acceptable). The values are negative pretty often when I re-run the split of the data, which means the model predicts worse than simply using the mean of the test targets, which is fairly terrible. This continued even after I took the log of the target data, so I'm really not sure where I'm going wrong. Some people in the class split the feature training set up by feature and then used loops to append the data back together; their training set is structured differently from mine, and that appeared to work well for them.
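To quantify how much the score moves across splits, a sketch using cross-validation with the same older API imported above (in newer scikit-learn, cross_val_score lives in sklearn.model_selection):

from sklearn.cross_validation import cross_val_score
scores = cross_val_score(LinearRegression(), rel_features, target, scoring='r2', cv=5)
print(scores, scores.mean())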
In [ ]:
# price*booking rate
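A sketch of the idea noted above, with revenue_proxy as a hypothetical column name: price times booking_rate gives only a rough revenue-per-month proxy, since it ignores nights per booking.

lb_new['revenue_proxy'] = lb_new['price'] * lb_new['booking_rate']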