#Homework 1 - Data Analysis and Regression In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.



In [331]:

    
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline



In [332]:

    
pd.set_option('display.max_row', None)
pd.set_option('display.precision', 3)
pd.set_option('display.max_column', None)

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [333]:

    
bookings = pd.read_csv('../data/bookings.csv', parse_dates = ['booking_date'])
bookings.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)



In [334]:

    
listings = pd.read_csv('../data/listings.csv')



In [335]:

    
merged = listings.merge(bookings)
merged.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
booking_date          6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(6), object(2)

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [336]:

    
listings[['price', 'person_capacity',
          'picture_count', 'description_length',
          'tenure_months']].describe().loc[['mean', '50%', 'std']]









    Out[336]:






  
    
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
  
  
    
      mean
       187.81
       3.00
       14.39
       309.16
       8.49
    
    
      50%
       125.00
       2.00
       12.00
       250.00
       7.00
    
    
      std
       353.05
       1.59
       10.48
       228.02
       5.87

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [337]:

    
listings.groupby('prop_type')['price',
                              'person_capacity', 'picture_count',
                              'description_length', 'tenure_months'].mean()









    Out[337]:






  
    
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      prop_type
      
      
      
      
      
    
  
  
    
      Property type 1
       237.09
       3.52
       14.70
       313.17
        8.46
    
    
      Property type 2
        93.29
       2.00
       13.95
       304.85
        8.38
    
    
      Property type 3
        63.75
       1.75
        8.75
       184.75
       13.75

Same, but by property type per neighborhood?



In [338]:

    
listings.groupby(['neighborhood', 'prop_type'])['price',
                                                'person_capacity', 'picture_count',
                                                'description_length', 'tenure_months'].mean()









    Out[338]:






  
    
      
      
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      neighborhood
      prop_type
      
      
      
      
      
    
  
  
    
      Neighborhood 1
      Property type 1
        85.00
       2.00
       26.00
       209.00
        6.00
    
    
      Neighborhood 10
      Property type 1
       142.50
       3.50
       13.33
       391.00
        3.83
    
    
      Property type 2
       137.50
       2.00
       20.00
       126.00
        3.50
    
    
      Neighborhood 11
      Property type 1
       159.43
       3.21
        9.93
       379.00
        9.64
    
    
      Property type 2
        78.75
       2.00
       16.75
       161.25
       11.25
    
    
      Property type 3
        75.00
       2.00
       15.00
       196.00
        8.00
    
    
      Neighborhood 12
      Property type 1
       365.62
       3.44
       10.82
       267.21
        7.90
    
    
      Property type 2
        96.89
       1.95
       10.47
       244.53
        9.84
    
    
      Neighborhood 13
      Property type 1
       241.90
       4.06
       15.65
       290.41
        9.12
    
    
      Property type 2
        81.13
       1.83
       16.70
       418.57
        9.74
    
    
      Neighborhood 14
      Property type 1
       164.68
       3.21
       14.76
       317.21
        8.44
    
    
      Property type 2
        83.81
       1.86
       15.90
       348.62
        8.71
    
    
      Property type 3
        75.00
       1.00
        1.00
       113.00
        5.00
    
    
      Neighborhood 15
      Property type 1
       178.88
       3.72
       14.32
       321.76
        9.32
    
    
      Property type 2
        95.00
       2.27
       11.73
       301.73
        8.20
    
    
      Neighborhood 16
      Property type 1
       158.93
       2.93
       21.64
       310.71
        7.07
    
    
      Property type 2
        83.62
       2.06
       15.38
       246.25
        6.69
    
    
      Neighborhood 17
      Property type 1
       189.87
       3.52
       16.09
       317.35
        9.87
    
    
      Property type 2
       102.45
       2.00
       15.45
       308.27
        7.18
    
    
      Property type 3
        65.00
       2.00
       15.00
       189.00
       23.00
    
    
      Neighborhood 18
      Property type 1
       173.59
       2.95
       16.09
       369.23
        8.23
    
    
      Property type 2
       120.67
       2.22
       12.33
       297.78
        9.22
    
    
      Neighborhood 19
      Property type 1
       222.38
       3.62
       11.00
       254.50
        6.50
    
    
      Property type 2
        88.88
       2.00
       15.12
       383.38
        5.50
    
    
      Neighborhood 2
      Property type 1
       250.00
       6.00
        8.00
       423.00
        6.00
    
    
      Neighborhood 20
      Property type 1
       804.33
       2.78
        9.44
       223.56
        9.67
    
    
      Property type 2
        60.00
       1.00
        3.00
       101.00
        6.00
    
    
      Neighborhood 21
      Property type 1
       362.50
       4.25
       49.00
       306.25
       14.75
    
    
      Neighborhood 22
      Property type 1
       225.00
       3.00
       19.00
       500.00
        9.00
    
    
      Neighborhood 3
      Property type 2
        60.00
       2.00
        7.00
       264.00
        9.00
    
    
      Neighborhood 4
      Property type 2
        60.00
       2.00
       10.00
        95.00
       11.00
    
    
      Property type 3
        40.00
       2.00
        4.00
       241.00
       19.00
    
    
      Neighborhood 5
      Property type 1
       194.50
       2.50
        8.50
       266.50
       11.50
    
    
      Neighborhood 6
      Property type 1
       146.00
       3.33
       12.67
       290.67
        4.00
    
    
      Neighborhood 7
      Property type 1
       161.00
       3.67
       14.33
       343.00
        5.33
    
    
      Property type 2
       100.00
       2.00
        3.00
       148.00
        2.00
    
    
      Neighborhood 8
      Property type 1
       174.75
       5.00
       11.00
       300.00
        6.75
    
    
      Property type 2
       350.00
       4.00
        5.00
       223.00
        3.00
    
    
      Neighborhood 9
      Property type 1
       151.14
       4.29
       13.43
       471.43
        5.71
    
    
      Property type 2
       110.00
       2.00
        3.50
       114.50
        9.00

Plot daily bookings:



In [339]:

    
merged.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
booking_date          6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(6), object(2)



In [340]:

    
merged.booking_date = pd.to_datetime(merged.booking_date)



In [341]:

    
merged.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
booking_date          6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(6), object(2)



In [342]:

    
merged['day_of_week'] = merged.booking_date.map(lambda x: x.dayofweek)



In [343]:

    
def day(x):
    day_of_week = {}
    day_of_week[0] = 'Mon'
    day_of_week[1] = 'Tues'
    day_of_week[2] = 'Wed'
    day_of_week[3] = 'Thurs'
    day_of_week[4] = 'Fri'
    day_of_week[5] = 'Sat'
    day_of_week[6] = 'Sun'
    return day_of_week[x]

merged['day_name'] = merged['day_of_week'].map(day)



In [344]:

    
merged.groupby(['day_of_week'])['day_name'].value_counts().plot(kind = 'bar')









    Out[344]:





<matplotlib.axes._subplots.AxesSubplot at 0x110b7e5d0>

Plot the daily bookings per neighborhood (provide a legend)



In [345]:

    
sns.factorplot("neighborhood", hue="day_name", data=merged, palette="Purples_d", size = 25, legend = True);



In [345]:

Part 2 - Develop a data set

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [395]:

    
bookings_ct = bookings.groupby('prop_id').count().reset_index()
merged2 = listings.merge(bookings_ct, on = 'prop_id')
merged2.head()









    Out[395]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      booking_date
    
  
  
    
      0
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
        4
    
    
      1
       3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
        1
    
    
      2
       4
       Property type 2
       Neighborhood 13
        90
       2
       19
       472
       28
       27
    
    
      3
       6
       Property type 2
       Neighborhood 13
        89
       2
       10
       886
       28
       88
    
    
      4
       7
       Property type 2
       Neighborhood 13
        85
       1
       11
        58
       24
        2



In [396]:

    
merged2.rename(columns = {'booking_date' : 'number_of_bookings'}, inplace = True)
merged2['booking_rate'] = merged2.number_of_bookings/merged2.tenure_months
merged2.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 328 entries, 0 to 327
Data columns (total 10 columns):
prop_id               328 non-null int64
prop_type             328 non-null object
neighborhood          328 non-null object
price                 328 non-null int64
person_capacity       328 non-null int64
picture_count         328 non-null int64
description_length    328 non-null int64
tenure_months         328 non-null int64
number_of_bookings    328 non-null int64
booking_rate          328 non-null float64
dtypes: float64(1), int64(7), object(2)

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months



In [397]:

    
merged2 = merged2[merged2.tenure_months  > 9]
merged2.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 113 entries, 0 to 112
Data columns (total 10 columns):
prop_id               113 non-null int64
prop_type             113 non-null object
neighborhood          113 non-null object
price                 113 non-null int64
person_capacity       113 non-null int64
picture_count         113 non-null int64
description_length    113 non-null int64
tenure_months         113 non-null int64
number_of_bookings    113 non-null int64
booking_rate          113 non-null float64
dtypes: float64(1), int64(7), object(2)

`prop_type` and `neighborhood` are categorical variables, use `get_dummies()` (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [398]:

    
dummy_prop = pd.get_dummies(merged2['prop_type'])
dummy_neigh = pd.get_dummies(merged2['neighborhood'])
merged2_prop = merged2.join(dummy_prop)



In [399]:

    
merged3 = merged2_prop.join(dummy_neigh)

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [400]:

    
from sklearn.cross_validation import train_test_split



In [401]:

    
merged_final = merged3.drop(['prop_id', 'prop_type', 'neighborhood', 'number_of_bookings'], axis = 1)
features = merged_final.drop(['booking_rate'], axis = 1).values
target = merged_final['booking_rate'].values



In [401]:



In [402]:

    
features.shape









    Out[402]:





(113, 22)



In [403]:

    
target = np.log(target)



In [404]:

    
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size = .2, random_state = (9))

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [419]:

    
from sklearn.linear_model import LinearRegression
lr = LinearRegression()



In [420]:

    
lr.fit(feature_train, target_train )









    Out[420]:





LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

fit your model with your test sets



In [421]:

    
target_pred = lr.predict(feature_test)



In [422]:

    
# target_pred[target_pred == target_pred.min()] = target_pred.mean()

target_pred[target_pred < -4] = target_pred.mean()

target_pred









    Out[422]:





array([ 0.29523577, -2.04599243, -1.15709947, -1.87976958, -0.65496419,
       -0.1700946 , -0.73483983, -1.12476064,  0.48094131, -0.24461875,
       -0.21776611,  0.2702393 , -1.06444888, -1.18860123, -0.3414003 ,
       -0.76560807,  0.68593991, -0.20624223, -0.07908219, -1.11982075,
       -0.38995087,  0.64780865,  0.29358936])

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score



In [423]:

    
sum_sq_model = np.sum((target_test - target_pred) ** 2)
sum_sq_model









    Out[423]:





43.863998813931239



In [424]:

    
sum_sq_naive = np.sum((target_test - target_test.mean()) ** 2)
sum_sq_naive









    Out[424]:





44.793393920738836



In [425]:

    
1 - sum_sq_model / sum_sq_naive









    Out[425]:





0.020748486003363542



In [426]:

    
lr.score(feature_test, target_test, sample_weight = None)









    Out[426]:





0.020748486003363542



In [388]:

    
fig, ax = plt.subplots(1,1)

ax.scatter(target_pred, target_test)
ax.plot(target, target, 'r')









    Out[388]:





[<matplotlib.lines.Line2D at 0x1108f5e50>]

Interpret the results of the above model:

What does the score method do?
What does this tell us about our model?

The score method is an indicator of how well the model is performing. It takes the squared difference between the target test y value and the predicted 'y' value. It then takes 1 and subtracts the squared difference. If value is 1 it means that the squared difference is 0, which would reflex prefectly predicted 'y' values. If the value is below 0 it means that the predicted values are worse than the just placing the mean of 'y' for every predicted value.

The model comes up with a generally low or negative number, varying each time because train_test_split takes a different sample each time it runs. That means that the prediction is generally slightly better or slightly worse than the mean depending on the target_test and feature_test sample.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to



In [191]:

    
merged_final['monthly_rev'] = merged_final.booking_rate * merged_final.price



In [192]:

    
merged_final.monthly_rev.hist(bins = 50);



In [193]:

    
features1 = merged_final.drop(['booking_rate', 'monthly_rev'], axis = 1).values



In [194]:

    
target1 = merged_final['monthly_rev']



In [195]:

    
target1 = np.log(target1)



In [196]:

    
features_train1, features_test1, target_train1, target_test1 = train_test_split(features1, target1, test_size = .2)



In [197]:

    
lr.fit(features_train1, target_train1)









    Out[197]:





LinearRegression(copy_X=True, fit_intercept=True, normalize=False)



In [198]:

    
features_train1









    Out[198]:





array([[  49.,    2.,   14., ...,    0.,    0.,    0.],
       [ 350.,    4.,   71., ...,    0.,    0.,    0.],
       [ 175.,    3.,   24., ...,    0.,    0.,    0.],
       ..., 
       [ 180.,    6.,   26., ...,    0.,    0.,    0.],
       [ 110.,    3.,   29., ...,    0.,    0.,    0.],
       [  65.,    2.,   15., ...,    0.,    0.,    0.]])



In [199]:

    
target_pred1 = lr.predict(features_test1)



In [199]:



In [200]:

    
sum_sq_model = np.sum((target_test1 - target_pred1) ** 2)
sum_sq_model









    Out[200]:





27.479035852227003



In [201]:

    
sum_sq_naive = np.sum((target_test1 - target_test1.mean()) ** 2)
sum_sq_naive









    Out[201]:





21.537077221694485



In [202]:

    
1 - sum_sq_model / sum_sq_naive









    Out[202]:





-0.27589438294566415



In [203]:

    
lr.score(features_test1, target_test1, sample_weight = None)









    Out[203]:





-0.27589438294566415



In [204]:

    
fig, ax = plt.subplots(1,1)

ax.scatter(target_pred1, target_test1)
ax.plot(target1, target1, 'r')









    Out[204]:





[<matplotlib.lines.Line2D at 0x10f78b890>]



In [43]:



In [43]:



In [ ]:

	price	person_capacity	picture_count	description_length	tenure_months
mean	187.81	3.00	14.39	309.16	8.49
50%	125.00	2.00	12.00	250.00	7.00
std	353.05	1.59	10.48	228.02	5.87

	price	person_capacity	picture_count	description_length	tenure_months
prop_type
Property type 1	237.09	3.52	14.70	313.17	8.46
Property type 2	93.29	2.00	13.95	304.85	8.38
Property type 3	63.75	1.75	8.75	184.75	13.75

		price	person_capacity	picture_count	description_length	tenure_months
neighborhood	prop_type
Neighborhood 1	Property type 1	85.00	2.00	26.00	209.00	6.00
Neighborhood 10	Property type 1	142.50	3.50	13.33	391.00	3.83
Neighborhood 10	Property type 2	137.50	2.00	20.00	126.00	3.50
Neighborhood 11	Property type 1	159.43	3.21	9.93	379.00	9.64
	Property type 2	78.75	2.00	16.75	161.25	11.25
	Property type 3	75.00	2.00	15.00	196.00	8.00
Neighborhood 12	Property type 1	365.62	3.44	10.82	267.21	7.90
Neighborhood 12	Property type 2	96.89	1.95	10.47	244.53	9.84
Neighborhood 13	Property type 1	241.90	4.06	15.65	290.41	9.12
Neighborhood 13	Property type 2	81.13	1.83	16.70	418.57	9.74
Neighborhood 14	Property type 1	164.68	3.21	14.76	317.21	8.44
	Property type 2	83.81	1.86	15.90	348.62	8.71
	Property type 3	75.00	1.00	1.00	113.00	5.00
Neighborhood 15	Property type 1	178.88	3.72	14.32	321.76	9.32
Neighborhood 15	Property type 2	95.00	2.27	11.73	301.73	8.20
Neighborhood 16	Property type 1	158.93	2.93	21.64	310.71	7.07
Neighborhood 16	Property type 2	83.62	2.06	15.38	246.25	6.69
Neighborhood 17	Property type 1	189.87	3.52	16.09	317.35	9.87
	Property type 2	102.45	2.00	15.45	308.27	7.18
	Property type 3	65.00	2.00	15.00	189.00	23.00
Neighborhood 18	Property type 1	173.59	2.95	16.09	369.23	8.23
Neighborhood 18	Property type 2	120.67	2.22	12.33	297.78	9.22
Neighborhood 19	Property type 1	222.38	3.62	11.00	254.50	6.50
Neighborhood 19	Property type 2	88.88	2.00	15.12	383.38	5.50
Neighborhood 2	Property type 1	250.00	6.00	8.00	423.00	6.00
Neighborhood 20	Property type 1	804.33	2.78	9.44	223.56	9.67
Neighborhood 20	Property type 2	60.00	1.00	3.00	101.00	6.00
Neighborhood 21	Property type 1	362.50	4.25	49.00	306.25	14.75
Neighborhood 22	Property type 1	225.00	3.00	19.00	500.00	9.00
Neighborhood 3	Property type 2	60.00	2.00	7.00	264.00	9.00
Neighborhood 4	Property type 2	60.00	2.00	10.00	95.00	11.00
Neighborhood 4	Property type 3	40.00	2.00	4.00	241.00	19.00
Neighborhood 5	Property type 1	194.50	2.50	8.50	266.50	11.50
Neighborhood 6	Property type 1	146.00	3.33	12.67	290.67	4.00
Neighborhood 7	Property type 1	161.00	3.67	14.33	343.00	5.33
Neighborhood 7	Property type 2	100.00	2.00	3.00	148.00	2.00
Neighborhood 8	Property type 1	174.75	5.00	11.00	300.00	6.75
Neighborhood 8	Property type 2	350.00	4.00	5.00	223.00	3.00
Neighborhood 9	Property type 1	151.14	4.29	13.43	471.43	5.71
Neighborhood 9	Property type 2	110.00	2.00	3.50	114.50	9.00