Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [483]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [484]:
#Extracting Data from files and storing in dataframe
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')

In [488]:
#Identifying the columns in the listings dataframe
listings.head()


Out[488]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 Property type 1 Neighborhood 14 140 3 11 232 30
1 2 Property type 1 Neighborhood 14 95 2 3 37 29
2 3 Property type 2 Neighborhood 16 95 2 16 172 29
3 4 Property type 2 Neighborhood 13 90 2 19 472 28
4 5 Property type 1 Neighborhood 15 125 5 21 442 28

5 rows × 8 columns


In [489]:
#Identifying the columns in the bookings dataframe
bookings.head()


Out[489]:
prop_id booking_date
0 9 2011-06-17
1 13 2011-08-12
2 21 2011-06-20
3 28 2011-05-05
4 29 2011-11-17

5 rows × 2 columns

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [490]:
#Mean, Standard Deviation of price, person capacity, picture count, description length and tenure of the properties 
listings.iloc[: , 3:8].describe()


Out[490]:
price person_capacity picture_count description_length tenure_months
count 408.000000 408.000000 408.000000 408.000000 408.000000
mean 187.806373 2.997549 14.389706 309.159314 8.487745
std 353.050858 1.594676 10.477428 228.021684 5.872088
min 39.000000 1.000000 1.000000 0.000000 1.000000
25% 90.000000 2.000000 6.000000 179.000000 4.000000
50% 125.000000 2.000000 12.000000 250.000000 7.000000
75% 199.000000 4.000000 20.000000 389.500000 13.000000
max 5000.000000 10.000000 71.000000 1969.000000 30.000000

8 rows × 5 columns


In [480]:
#Median of the above columns
listings.iloc[: , 3:8].median()


Out[480]:
price                 125
person_capacity         2
picture_count          12
description_length    250
tenure_months           7
dtype: float64

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [491]:
# Mean Price, person capacity, picture count, description length, tenure months grouped by property type
listings.groupby(["prop_type"])["price","person_capacity","picture_count","description_length","tenure_months"].mean()


Out[491]:
price person_capacity picture_count description_length tenure_months
prop_type
Property type 1 237.085502 3.516729 14.695167 313.171004 8.464684
Property type 2 93.288889 2.000000 13.948148 304.851852 8.377778
Property type 3 63.750000 1.750000 8.750000 184.750000 13.750000

3 rows × 5 columns

Same, but by property type per neighborhood?


In [492]:
listings.groupby(["prop_type","neighborhood"])["price","person_capacity","picture_count","description_length","tenure_months"].mean()


Out[492]:
price person_capacity picture_count description_length tenure_months
prop_type neighborhood
Property type 1 Neighborhood 1 85.000000 2.000000 26.000000 209.000000 6.000000
Neighborhood 10 142.500000 3.500000 13.333333 391.000000 3.833333
Neighborhood 11 159.428571 3.214286 9.928571 379.000000 9.642857
Neighborhood 12 365.615385 3.435897 10.820513 267.205128 7.897436
Neighborhood 13 241.897959 4.061224 15.653061 290.408163 9.122449
Neighborhood 14 164.676471 3.205882 14.764706 317.205882 8.441176
Neighborhood 15 178.880000 3.720000 14.320000 321.760000 9.320000
Neighborhood 16 158.928571 2.928571 21.642857 310.714286 7.071429
Neighborhood 17 189.869565 3.521739 16.086957 317.347826 9.869565
Neighborhood 18 173.590909 2.954545 16.090909 369.227273 8.227273
Neighborhood 19 222.375000 3.625000 11.000000 254.500000 6.500000
Neighborhood 2 250.000000 6.000000 8.000000 423.000000 6.000000
Neighborhood 20 804.333333 2.777778 9.444444 223.555556 9.666667
Neighborhood 21 362.500000 4.250000 49.000000 306.250000 14.750000
Neighborhood 22 225.000000 3.000000 19.000000 500.000000 9.000000
Neighborhood 5 194.500000 2.500000 8.500000 266.500000 11.500000
Neighborhood 6 146.000000 3.333333 12.666667 290.666667 4.000000
Neighborhood 7 161.000000 3.666667 14.333333 343.000000 5.333333
Neighborhood 8 174.750000 5.000000 11.000000 300.000000 6.750000
Neighborhood 9 151.142857 4.285714 13.428571 471.428571 5.714286
Property type 2 Neighborhood 10 137.500000 2.000000 20.000000 126.000000 3.500000
Neighborhood 11 78.750000 2.000000 16.750000 161.250000 11.250000
Neighborhood 12 96.894737 1.947368 10.473684 244.526316 9.842105
Neighborhood 13 81.130435 1.826087 16.695652 418.565217 9.739130
Neighborhood 14 83.809524 1.857143 15.904762 348.619048 8.714286
Neighborhood 15 95.000000 2.266667 11.733333 301.733333 8.200000
Neighborhood 16 83.625000 2.062500 15.375000 246.250000 6.687500
Neighborhood 17 102.454545 2.000000 15.454545 308.272727 7.181818
Neighborhood 18 120.666667 2.222222 12.333333 297.777778 9.222222
Neighborhood 19 88.875000 2.000000 15.125000 383.375000 5.500000
Neighborhood 20 60.000000 1.000000 3.000000 101.000000 6.000000
Neighborhood 3 60.000000 2.000000 7.000000 264.000000 9.000000
Neighborhood 4 60.000000 2.000000 10.000000 95.000000 11.000000
Neighborhood 7 100.000000 2.000000 3.000000 148.000000 2.000000
Neighborhood 8 350.000000 4.000000 5.000000 223.000000 3.000000
Neighborhood 9 110.000000 2.000000 3.500000 114.500000 9.000000
Property type 3 Neighborhood 11 75.000000 2.000000 15.000000 196.000000 8.000000
Neighborhood 14 75.000000 1.000000 1.000000 113.000000 5.000000
Neighborhood 17 65.000000 2.000000 15.000000 189.000000 23.000000
Neighborhood 4 40.000000 2.000000 4.000000 241.000000 19.000000

40 rows × 5 columns

Plot daily bookings:


In [590]:
#Line Plot using valu_counts.
bookings.booking_date.value_counts().plot(kind='line')


Out[590]:
<matplotlib.axes.AxesSubplot at 0x1265ca350>

In [494]:
#line plot using group by
bookings.groupby("booking_date")["booking_date"].count().plot(kind = 'line')


Out[494]:
<matplotlib.axes.AxesSubplot at 0x12ece6f50>

Plot the daily bookings per neighborhood (provide a legend)


In [495]:
#listmerge.groupby(["neighborhood","booking_date"])"prop_id"].agg(['count']).unstack(0)

Part 2 - Develop a data set


In [496]:
#Grouping the bookings data in order to get the counts per property
nbookings = bookings.groupby("prop_id")[["prop_id"]].count()

#renaming the count column to number_of_bookings as required in the question
nbookings.rename(columns = {"booking_date" :  "number_of_bookings"}, inplace = True)

#dropping redundant columns
nbookings.drop("prop_id", axis=1, inplace=True)  # Axis 0 is rows, Axis 1 is columns

In [509]:
#nbookings

In [499]:
#indexing the listings to prop_id. In order to do that, creating a column to replicate the prop_id. Used this to join with same index in bookings
listings["prop_index"]=listings["prop_id"]
listings.set_index(["prop_index"], inplace=True)

In [507]:
#listings.prop_id

In [503]:
#joining bookings and listings on index. Here index is prop_id
listings = listings.join(nbookings)

In [510]:
#listings.head(100)

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [511]:
#adding new calculated column
listings["booking_rate"] = listings["number_of_bookings"]/listings["tenure_months"]

In [513]:
listings.head(1)


Out[513]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
prop_index
1 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 0.133333

1 rows × 10 columns

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [473]:
#moving the well established properties to another dataframe
listings_we = listings[listings.tenure_months >= 10]

prop_type and neighborhood are categorical variables, use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [515]:
#seperating the prop_type into booloean values
dummies_prop_type = pd.get_dummies(listings_we["prop_type"])

In [516]:
#seperating the neighborhood into booloean values
dummies_neighborhood = pd.get_dummies(listings_we["neighborhood"])

In [589]:
dummies_neighborhood.info()
dummies_prop_type.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 1 to 144
Data columns (total 16 columns):
Neighborhood 11    144 non-null float64
Neighborhood 12    144 non-null float64
Neighborhood 13    144 non-null float64
Neighborhood 14    144 non-null float64
Neighborhood 15    144 non-null float64
Neighborhood 16    144 non-null float64
Neighborhood 17    144 non-null float64
Neighborhood 18    144 non-null float64
Neighborhood 19    144 non-null float64
Neighborhood 20    144 non-null float64
Neighborhood 21    144 non-null float64
Neighborhood 4     144 non-null float64
Neighborhood 5     144 non-null float64
Neighborhood 7     144 non-null float64
Neighborhood 8     144 non-null float64
Neighborhood 9     144 non-null float64
dtypes: float64(16)<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 1 to 144
Data columns (total 3 columns):
Property type 1    144 non-null float64
Property type 2    144 non-null float64
Property type 3    144 non-null float64
dtypes: float64(3)

In [530]:
#joining the prop_type boolean columns to the main dataframe
listings_we = listings_we.join(dummies_prop_type)

In [531]:
#joining the neighborhood booloean values to the main dataframe
listings_we = listings_we.join(dummies_neighborhood)

In [533]:
listings_we.columns
#columns with null values
#booking_rate, number_of_bookings


Out[533]:
Index([u'prop_id', u'prop_type', u'neighborhood', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months', u'number_of_bookings', u'booking_rate', u'Property type 1', u'Property type 2', u'Property type 3', u'Neighborhood 11', u'Neighborhood 12', u'Neighborhood 13', u'Neighborhood 14', u'Neighborhood 15', u'Neighborhood 16', u'Neighborhood 17', u'Neighborhood 18', u'Neighborhood 19', u'Neighborhood 20', u'Neighborhood 21', u'Neighborhood 4', u'Neighborhood 5', u'Neighborhood 7', u'Neighborhood 8', u'Neighborhood 9'], dtype='object')

In [534]:
#number of rows with null values in booking_rate
len(listings_we[pd.isnull(listings_we.booking_rate)])
#len(listings_we)
#len(listings)


Out[534]:
31

In [535]:
#Since booking rate is going to be our predictor variable, let us not consider the rows with null(NaN) 
#booking rate in our train/test sets
listings_we = listings_we[listings_we.booking_rate.notnull()]

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [536]:
from sklearn.cross_validation import train_test_split

In [537]:
#features is X and target is y
features = listings_we[["price","person_capacity","picture_count","description_length","tenure_months","Property type 1","Property type 2", u'Property type 3', u'Neighborhood 11', u'Neighborhood 12', u'Neighborhood 13', u'Neighborhood 14', u'Neighborhood 15', u'Neighborhood 16', u'Neighborhood 17', u'Neighborhood 18', u'Neighborhood 19', u'Neighborhood 20', u'Neighborhood 21', u'Neighborhood 4', u'Neighborhood 5', u'Neighborhood 9']].values
target = listings_we["booking_rate"].values

In [538]:
#X.shape gives the rows,column count of X
features.shape


Out[538]:
(113, 22)

In [539]:
#y.shape gives the target value count
target.shape


Out[539]:
(113,)

In [540]:
#values in target
target


Out[540]:
array([ 0.13333333,  0.03448276,  0.96428571,  3.14285714,  0.08333333,
        1.2173913 ,  0.13043478,  1.13043478,  0.31818182,  2.04545455,
        2.85      ,  1.55      ,  0.63157895,  2.31578947,  0.36842105,
        0.05263158,  0.47368421,  0.5       ,  2.88888889,  1.52941176,
        0.17647059,  0.11764706,  4.        ,  0.5       ,  0.125     ,
        0.5625    ,  0.0625    ,  0.0625    ,  0.1875    ,  0.125     ,
        0.125     ,  0.125     ,  0.125     ,  0.125     ,  0.1875    ,
        0.0625    ,  0.1875    ,  0.125     ,  0.5       ,  0.1875    ,
        0.125     ,  0.125     ,  0.25      ,  0.1875    ,  0.0625    ,
        0.25      ,  0.125     ,  0.0625    ,  0.4375    ,  0.1875    ,
        0.4375    ,  0.86666667,  4.53333333,  1.86666667,  0.33333333,
        0.13333333,  0.06666667,  0.26666667,  0.13333333,  0.13333333,
        4.06666667,  0.2       ,  0.06666667,  1.06666667,  3.33333333,
        0.26666667,  1.        ,  1.5       ,  0.07142857,  0.64285714,
        4.92857143,  0.5       ,  0.85714286,  0.35714286,  1.21428571,
        1.30769231,  1.46153846,  1.61538462,  0.76923077,  2.53846154,
        0.91666667,  0.75      ,  2.58333333,  0.16666667,  0.25      ,
        3.08333333,  2.        ,  0.27272727,  0.72727273,  0.72727273,
        1.        ,  0.09090909,  3.3       ,  3.8       ,  3.2       ,
        3.6       ,  0.4       ,  3.        ,  4.1       ,  3.4       ,
        1.9       ,  6.7       ,  2.        ,  2.2       ,  3.7       ,
        4.6       ,  0.1       ,  3.7       ,  3.9       ,  1.        ,
        0.1       ,  4.8       ,  2.1       ])

In [562]:
#train test split
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.33)

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [563]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

fit your model with your test sets


In [564]:
lr.fit(features_train, target_train)


Out[564]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [565]:
#predicting y from X_test using the linear model created from train data
target_pred = lr.predict(features_test)

In [566]:
#square sum error
sum_sq_model = np.sum((target_test - target_pred) ** 2)
sum_sq_model


Out[566]:
101.6471710188284

In [567]:
lr.score(features_test,target_test)
#lr.score(features_train,target_train)
#features_train.shape


Out[567]:
0.14628053517557793

In [568]:
#plotting original test Vs predicted
fig, ax = plt.subplots(1, 1)

ax.scatter(target_pred, target_test)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

# Draw the ideal line
ax.plot(target, target, 'r')


Out[568]:
[<matplotlib.lines.Line2D at 0x12f948d90>]

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

The score method gives us the R-square value. It returns the R^2 of self.predict(X) wrt. y. Best possible score is 1.0, lower values are worse. Since the score in this case are low, this tells us that the model is not good


In [568]:

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [569]:
#space to understand which columns is correlating with booking_rate
from pandas.tools.plotting import scatter_matrix
#scat = scatter_matrix(listings_we[["price","person_capacity","picture_count","description_length","tenure_months","Property type 1","Property type 2", u'Property type 3', u'Neighborhood 11', u'Neighborhood 12', u'Neighborhood 13', u'Neighborhood 14', u'Neighborhood 15', u'Neighborhood 16', u'Neighborhood 17', u'Neighborhood 18', u'Neighborhood 19', u'Neighborhood 20', u'Neighborhood 21', u'Neighborhood 4', u'Neighborhood 5', u'Neighborhood 9']])
scat = scatter_matrix(listings_we[["booking_rate","person_capacity"]])
#,"person_capacity","picture_count","description_length","tenure_months","Property type 1","Property type 2", u'Property type 3', u'Neighborhood 11', u'Neighborhood 12', u'Neighborhood 13', u'Neighborhood 14', u'Neighborhood 15', u'Neighborhood 16', u'Neighborhood 17', u'Neighborhood 18', u'Neighborhood 19', u'Neighborhood 20', u'Neighborhood 21', u'Neighborhood 4', u'Neighborhood 5', u'Neighborhood 9']])



In [576]:
#Taking log of booking rate. Histogram looks more normally distributed than just the booking rate. See below two histograms
booking_rate_log = np.log(listings_we.booking_rate)
#listings_we.booking_rate.hist()
booking_rate_log.hist()


Out[576]:
<matplotlib.axes.AxesSubplot at 0x128c3f910>

In [571]:
#original boking rate
listings_we["booking_rate"].hist()


Out[571]:
<matplotlib.axes.AxesSubplot at 0x129feb190>

In [585]:
#since taking the log gives negative results, shifting the data by 5 points.
listings_we["booking_rate_log"] = np.log(listings_we.booking_rate)+5

In [586]:
#below histogram (log + 5) is ore normal and on the positive axis
listings_we["booking_rate_log"].hist()


Out[586]:
<matplotlib.axes.AxesSubplot at 0x129dee090>

In [587]:
#same steps with log
lr_log = LinearRegression()
features_log = listings_we[["price","person_capacity","picture_count","description_length","tenure_months","Property type 1","Property type 2", u'Property type 3', u'Neighborhood 11', u'Neighborhood 12', u'Neighborhood 13', u'Neighborhood 14', u'Neighborhood 15', u'Neighborhood 16', u'Neighborhood 17', u'Neighborhood 18', u'Neighborhood 19', u'Neighborhood 20', u'Neighborhood 21', u'Neighborhood 4', u'Neighborhood 5', u'Neighborhood 9']].values
target_log = listings_we["booking_rate_log"].values
features_log_train, features_log_test, target_log_train, target_log_test = train_test_split(features_log, target_log, test_size=0.33)
lr_log.fit(features_log_train, target_log_train)
target_log_pred = lr_log.predict(features_log_test)
sum_sq_model_log = np.sum((target_log_test - target_log_pred) ** 2)
print sum_sq_model_log
print lr_log.score(features_log_test,target_log_test)


58.8595012474
0.323705291529

In [588]:
#plotting graph same as above
fig, ax = plt.subplots(1, 1)

ax.scatter(target_log_pred, target_log_test)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

# Draw the ideal line
ax.plot(target, target, 'r')


Out[588]:
[<matplotlib.lines.Line2D at 0x128f42fd0>]

In [464]:
#space for trying other transformations
booking_rate_arcsin = np.(listings_we.booking_rate)

In [465]:
booking_rate_arcsin.hist()


Out[465]:
<matplotlib.axes.AxesSubplot at 0x1298c7bd0>

In [ ]: