Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

# This enables inline Plots
%matplotlib inline

# Limit rows displayed in notebook
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)

Part 1 - Data exploration

First, create two data frames, listings and bookings, from their respective data files.


In [3]:
bookings = pd.read_csv('../data/bookings.csv', parse_dates = ['booking_date'])  # parse_dates=True only parses the index; name the column instead
listings = pd.read_csv('../data/listings.csv')

What are the mean, median, and standard deviation of price, person capacity, picture count, description length, and tenure of the properties?


In [4]:
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe().loc[['mean', '50%', 'std']]


Out[4]:
price person_capacity picture_count description_length tenure_months
mean 187.8 3.0 14.4 309.2 8.5
50% 125.0 2.0 12.0 250.0 7.0
std 353.1 1.6 10.5 228.0 5.9
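
An equivalent route, assuming pandas 0.20+ where DataFrame.agg accepts a list of function names, reports the median by name rather than as the 50% row:

cols = ['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']
listings[cols].agg(['mean', 'median', 'std'])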

What are the mean price, person capacity, picture count, description length, and tenure of the properties, grouped by property type?


In [5]:
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Out[5]:
price person_capacity picture_count description_length tenure_months
prop_type
Property type 1 237.1 3.5 14.7 313.2 8.5
Property type 2 93.3 2.0 13.9 304.9 8.4
Property type 3 63.8 1.8 8.8 184.8 13.8

Same, but by property type per neighborhood?


In [6]:
listings.groupby(['neighborhood','prop_type'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()


Out[6]:
price person_capacity picture_count description_length tenure_months
neighborhood prop_type
Neighborhood 1 Property type 1 85.0 2.0 26.0 209.0 6.0
Neighborhood 10 Property type 1 142.5 3.5 13.3 391.0 3.8
Property type 2 137.5 2.0 20.0 126.0 3.5
Neighborhood 11 Property type 1 159.4 3.2 9.9 379.0 9.6
Property type 2 78.8 2.0 16.8 161.2 11.2
... ... ... ... ... ... ...
Neighborhood 7 Property type 2 100.0 2.0 3.0 148.0 2.0
Neighborhood 8 Property type 1 174.8 5.0 11.0 300.0 6.8
Property type 2 350.0 4.0 5.0 223.0 3.0
Neighborhood 9 Property type 1 151.1 4.3 13.4 471.4 5.7
Property type 2 110.0 2.0 3.5 114.5 9.0

40 rows × 5 columns


In [7]:
lb = listings.merge(bookings)  # inner join on the shared prop_id column

Plot daily bookings:


In [8]:
lb['booking_date'] = pd.to_datetime(lb['booking_date'])  # no-op if already parsed at read time
lb['day'] = lb['booking_date'].dt.dayofweek  # 0 = Monday ... 6 = Sunday

# Interpreting 'daily' as Monday, Tuesday, Wednesday, ...
day_names = {0: 'M', 1: 'Tu', 2: 'W', 3: 'Th', 4: 'F', 5: 'Sa', 6: 'Su'}
lb['day_name'] = lb['day'].map(day_names)

lb.head()

# Grouping by the numeric day already puts the bars in weekday order;
# see the reindex sketch below for ordering a bare value_counts()
lb.groupby('day').day_name.value_counts().plot(kind = 'bar');


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bb2d250>
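
To answer the inline question about ordering: value_counts() sorts by count, but reindex() with an explicit weekday list restores calendar order. A minimal sketch using the day_name column built above:

order = ['M', 'Tu', 'W', 'Th', 'F', 'Sa', 'Su']
lb['day_name'].value_counts().reindex(order).plot(kind = 'bar');  # bars now run Monday through Sunday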

Plot the daily bookings per neighborhood (provide a legend)


In [9]:
#lb.groupby(['neighborhood','day']).day_name.value_counts().plot(kind = 'bar', figsize = (15,10), legend = True);

# factorplot draws grouped bars with a legend for the hue variable;
# hue_order fixes the weekday order in the bars and the legend
# (in seaborn 0.9+ factorplot is renamed catplot and size is renamed height)
sns.factorplot("neighborhood", hue = "day_name", hue_order = ['M', 'Tu', 'W', 'Th', 'F', 'Sa', 'Su'],
               data = lb, kind = "bar", palette = "Greens_d", size = 30);  # size is the facet height in inches; 30 is very large


Part 2 - Develop a data set


Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [10]:
count_bookings = bookings.groupby('prop_id').count().reset_index()  # reset_index so prop_id and booking_date are both columns rather than the index
lb_new = listings.merge(count_bookings, on = 'prop_id')  # inner join: listings with zero bookings are dropped (see the left-merge sketch below)
lb_new.rename(columns = {'booking_date': 'number_of_bookings'}, inplace = True)
lb_new['booking_rate'] = lb_new.number_of_bookings / lb_new.tenure_months
lb_new


Out[10]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 0.1
1 3 Property type 2 Neighborhood 16 95 2 16 172 29 1 0.0
2 4 Property type 2 Neighborhood 13 90 2 19 472 28 27 1.0
3 6 Property type 2 Neighborhood 13 89 2 10 886 28 88 3.1
4 7 Property type 2 Neighborhood 13 85 1 11 58 24 2 0.1
... ... ... ... ... ... ... ... ... ... ...
323 404 Property type 2 Neighborhood 14 100 1 8 235 1 3 3.0
324 405 Property type 2 Neighborhood 13 85 2 27 1048 1 19 19.0
325 406 Property type 1 Neighborhood 9 70 3 18 153 1 19 19.0
326 407 Property type 1 Neighborhood 13 129 2 13 370 1 15 15.0
327 408 Property type 1 Neighborhood 14 100 3 21 707 1 54 54.0

328 rows × 10 columns
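
Note that the inner merge above silently drops listings with no bookings at all. If those should count as a booking rate of zero, a left merge keeps them; a sketch under that assumption (lb_all is a hypothetical alternative, the notebook continues with lb_new):

count_bookings = bookings.groupby('prop_id').size().rename('number_of_bookings').reset_index()
lb_all = listings.merge(count_bookings, on = 'prop_id', how = 'left')  # keep every listing
lb_all['number_of_bookings'] = lb_all['number_of_bookings'].fillna(0).astype(int)  # no bookings -> 0
lb_all['booking_rate'] = lb_all['number_of_bookings'] / lb_all['tenure_months']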

We only want to analyze well-established properties, so let's filter out any properties that have a tenure of less than 10 months.


In [11]:
lb_new2 = lb_new[lb_new['tenure_months'] >= 10]
lb_new2


Out[11]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 0.1
1 3 Property type 2 Neighborhood 16 95 2 16 172 29 1 0.0
2 4 Property type 2 Neighborhood 13 90 2 19 472 28 27 1.0
3 6 Property type 2 Neighborhood 13 89 2 10 886 28 88 3.1
4 7 Property type 2 Neighborhood 13 85 1 11 58 24 2 0.1
... ... ... ... ... ... ... ... ... ... ...
108 139 Property type 2 Neighborhood 19 85 3 35 852 10 39 3.9
109 140 Property type 1 Neighborhood 12 200 4 18 125 10 10 1.0
110 141 Property type 2 Neighborhood 12 45 2 36 281 10 1 0.1
111 142 Property type 2 Neighborhood 15 96 2 9 138 10 48 4.8
112 143 Property type 2 Neighborhood 15 58 2 7 135 10 21 2.1

113 rows × 10 columns

prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each of these columns of categorical data into many columns of boolean values; after applying this function correctly there should be one column for every prop_type and one column for every neighborhood category.


In [14]:
pt_dummy = pd.get_dummies(lb_new2['prop_type'])      # one boolean column per property type
n_dummy = pd.get_dummies(lb_new2['neighborhood'])    # one boolean column per neighborhood
lb_new3 = lb_new2.join(pt_dummy)
lb_new3 = lb_new3.join(n_dummy)
#pd.set_option('display.max_columns', 100)
#lb_new3
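
As an aside, get_dummies can also take the whole frame plus a columns list, expanding just those columns in place and avoiding the two joins. A sketch equivalent to the cell above, up to a prop_type_/neighborhood_ prefix on the new column names:

lb_new3_alt = pd.get_dummies(lb_new2, columns = ['prop_type', 'neighborhood'])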

create test and training sets for your regressors and target

The target (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [15]:
from sklearn.model_selection import train_test_split  # lived in sklearn.cross_validation before scikit-learn 0.18

In [46]:
#remove columns that are neither regressors nor the target
lb_4 = lb_new3.drop(['prop_id', 'prop_type', 'neighborhood', 'number_of_bookings'], axis = 1)

# keep only the feature columns
rel_features = lb_4.drop(['booking_rate'], axis = 1).values

target = lb_4['booking_rate'].values
target = np.log(target)  # safe here: the inner merge above guarantees booking_rate > 0

# note train_size = .3 trains on only 30% of an already small sample (113 rows)
feat_train, feat_test, tar_train, tar_test = train_test_split(rel_features, target, train_size = .3, random_state = 9)

In [47]:
lb_4


Out[47]:
price person_capacity picture_count description_length tenure_months booking_rate Property type 1 Property type 2 Property type 3 Neighborhood 11 ... Neighborhood 15 Neighborhood 16 Neighborhood 17 Neighborhood 18 Neighborhood 19 Neighborhood 20 Neighborhood 21 Neighborhood 4 Neighborhood 8 Neighborhood 9
0 140 3 11 232 30 0.1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 95 2 16 172 29 0.0 0 1 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 90 2 19 472 28 1.0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 89 2 10 886 28 3.1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 85 1 11 58 24 0.1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
108 85 3 35 852 10 3.9 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
109 200 4 18 125 10 1.0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
110 45 2 36 281 10 0.1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
111 96 2 9 138 10 4.8 0 1 0 0 ... 1 0 0 0 0 0 0 0 0 0
112 58 2 7 135 10 2.1 0 1 0 0 ... 1 0 0 0 0 0 0 0 0 0

113 rows × 23 columns


Part 3 - Model booking_rate

Create a linear regression model of your listings


In [48]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

fit your model with your training sets


In [49]:
lr.fit(feat_train, tar_train)


Out[49]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [50]:
tar_pred = lr.predict(feat_test)

In [51]:
sum_sq_model = np.sum((tar_test - tar_pred) ** 2)  # residual sum of squares
sum_sq_model


Out[51]:
625.02919824797573

In [52]:
sum_sq_naive = np.sum((tar_test - tar_test.mean()) ** 2)  # total sum of squares around the test mean
sum_sq_naive


Out[52]:
144.20787136292714

In [53]:
fig, ax = plt.subplots(1, 1)

ax.scatter(tar_pred, tar_test)

# Draw the ideal y = x line (plotting target against itself just to span the data range)
ax.plot(target, target, 'r')


Out[53]:
[<matplotlib.lines.Line2D at 0x10e158610>]

In [54]:
# R^2 computed by hand, to compare against score() below
1 - (sum_sq_model/sum_sq_naive)


Out[54]:
-3.3342238696178264

In [55]:
#lr.score(rel_features, target, sample_weight=None)
lr.score(feat_test, tar_test, sample_weight=None)


Out[55]:
-3.3342238696178264

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

It seems that the score method returns the coefficient of determination, R² = 1 - (residual sum of squares / total sum of squares). It takes the test feature set and the test target set, predicts from the features, and compares those predictions against the actual values, so the result varies with what ends up in the test set. The coefficient of determination is generally an indicator of how well the model explains the variance in the target.
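
Written out, with y_i the test targets, ŷ_i the predictions, and ȳ the mean of the test targets:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$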

For this model I get fairly low, and often negative, scores. A negative R² means the model predicts worse than simply using the mean of the test targets, which is fairly terrible. The scores also vary noticeably when I re-run the split, so the estimate itself is not very stable (though I don't know how much variance is arguably acceptable). This persisted even after I took the log of the target data, so I'm really not sure where I'm going wrong with that one. After speaking with some people in the class: they split the feature training set up by feature and then set up loops to append the data together, so their training set is structured differently from mine, and that appeared to work well for them.
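One way to address the split-to-split variance noted above is cross-validation, which averages the score over several splits. A sketch, assuming scikit-learn 0.18+ (the function lives in sklearn.cross_validation in earlier releases):

from sklearn.model_selection import cross_val_score

# average R^2 and its spread across 5 different train/test splits
scores = cross_val_score(LinearRegression(), rel_features, target, cv = 5, scoring = 'r2')
print(scores.mean(), scores.std())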

Optional - Iterate

Create an alternative target (e.g. monthly revenue) and use the same modeling pattern as in Part 3 to model it.


In [ ]:
# price*booking rate
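
A sketch of one such iteration, reusing the Part 3 pattern with monthly revenue (price × booking rate) as the target; it assumes the lb_4 frame from Part 2 and is illustrative rather than a finished model:

# hypothetical iteration: predict monthly revenue instead of booking rate
lb_4['monthly_revenue'] = lb_4['price'] * lb_4['booking_rate']

X = lb_4.drop(['booking_rate', 'monthly_revenue'], axis = 1).values  # drop both targets from the features
y = lb_4['monthly_revenue'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 9)
lr2 = LinearRegression().fit(X_train, y_train)
lr2.score(X_test, y_test)  # R^2 on the held-out set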
