Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [1]:
import pandas as pd
import numpy
import scipy
import matplotlib.pyplot as plt

%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [2]:
bookings = pd.read_csv("../data/bookings.csv")
listings = pd.read_csv('../data/listings.csv')

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [3]:
listings.describe()
#or
#listings.median()
#listings.mean()
#listings.std()


Out[3]:
prop_id price person_capacity picture_count description_length tenure_months
count 408.000000 408.000000 408.000000 408.000000 408.000000 408.000000
mean 204.500000 187.806373 2.997549 14.389706 309.159314 8.487745
std 117.923704 353.050858 1.594676 10.477428 228.021684 5.872088
min 1.000000 39.000000 1.000000 1.000000 0.000000 1.000000
25% 102.750000 90.000000 2.000000 6.000000 179.000000 4.000000
50% 204.500000 125.000000 2.000000 12.000000 250.000000 7.000000
75% 306.250000 199.000000 4.000000 20.000000 389.500000 13.000000
max 408.000000 5000.000000 10.000000 71.000000 1969.000000 30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [4]:
listings.groupby('prop_type').mean()


Out[4]:
prop_id price person_capacity picture_count description_length tenure_months
prop_type
Property type 1 204.754647 237.085502 3.516729 14.695167 313.171004 8.464684
Property type 2 206.392593 93.288889 2.000000 13.948148 304.851852 8.377778
Property type 3 123.500000 63.750000 1.750000 8.750000 184.750000 13.750000

Same, but by property type per neighborhood?


In [5]:
listings.groupby(['neighborhood', 'prop_type']).mean()


Out[5]:
prop_id price person_capacity picture_count description_length tenure_months
neighborhood prop_type
Neighborhood 1 Property type 1 235.000000 85.000000 2.000000 26.000000 209.000000 6.000000
Neighborhood 10 Property type 1 307.500000 142.500000 3.500000 13.333333 391.000000 3.833333
Property type 2 327.000000 137.500000 2.000000 20.000000 126.000000 3.500000
Neighborhood 11 Property type 1 174.000000 159.428571 3.214286 9.928571 379.000000 9.642857
Property type 2 146.250000 78.750000 2.000000 16.750000 161.250000 11.250000
Property type 3 178.000000 75.000000 2.000000 15.000000 196.000000 8.000000
Neighborhood 12 Property type 1 211.307692 365.615385 3.435897 10.820513 267.205128 7.897436
Property type 2 164.263158 96.894737 1.947368 10.473684 244.526316 9.842105
Neighborhood 13 Property type 1 190.142857 241.897959 4.061224 15.653061 290.408163 9.122449
Property type 2 199.000000 81.130435 1.826087 16.695652 418.565217 9.739130
Neighborhood 14 Property type 1 220.764706 164.676471 3.205882 14.764706 317.205882 8.441176
Property type 2 195.047619 83.809524 1.857143 15.904762 348.619048 8.714286
Property type 3 286.000000 75.000000 1.000000 1.000000 113.000000 5.000000
Neighborhood 15 Property type 1 191.560000 178.880000 3.720000 14.320000 321.760000 9.320000
Property type 2 194.666667 95.000000 2.266667 11.733333 301.733333 8.200000
Neighborhood 16 Property type 1 233.000000 158.928571 2.928571 21.642857 310.714286 7.071429
Property type 2 251.562500 83.625000 2.062500 15.375000 246.250000 6.687500
Neighborhood 17 Property type 1 166.043478 189.869565 3.521739 16.086957 317.347826 9.869565
Property type 2 242.181818 102.454545 2.000000 15.454545 308.272727 7.181818
Property type 3 10.000000 65.000000 2.000000 15.000000 189.000000 23.000000
Neighborhood 18 Property type 1 210.000000 173.590909 2.954545 16.090909 369.227273 8.227273
Property type 2 179.333333 120.666667 2.222222 12.333333 297.777778 9.222222
Neighborhood 19 Property type 1 253.250000 222.375000 3.625000 11.000000 254.500000 6.500000
Property type 2 256.750000 88.875000 2.000000 15.125000 383.375000 5.500000
Neighborhood 2 Property type 1 244.000000 250.000000 6.000000 8.000000 423.000000 6.000000
Neighborhood 20 Property type 1 174.111111 804.333333 2.777778 9.444444 223.555556 9.666667
Property type 2 230.000000 60.000000 1.000000 3.000000 101.000000 6.000000
Neighborhood 21 Property type 1 79.250000 362.500000 4.250000 49.000000 306.250000 14.750000
Neighborhood 22 Property type 1 162.000000 225.000000 3.000000 19.000000 500.000000 9.000000
Neighborhood 3 Property type 2 166.000000 60.000000 2.000000 7.000000 264.000000 9.000000
Neighborhood 4 Property type 2 118.000000 60.000000 2.000000 10.000000 95.000000 11.000000
Property type 3 20.000000 40.000000 2.000000 4.000000 241.000000 19.000000
Neighborhood 5 Property type 1 132.500000 194.500000 2.500000 8.500000 266.500000 11.500000
Neighborhood 6 Property type 1 291.333333 146.000000 3.333333 12.666667 290.666667 4.000000
Neighborhood 7 Property type 1 273.333333 161.000000 3.666667 14.333333 343.000000 5.333333
Property type 2 365.000000 100.000000 2.000000 3.000000 148.000000 2.000000
Neighborhood 8 Property type 1 218.250000 174.750000 5.000000 11.000000 300.000000 6.750000
Property type 2 343.000000 350.000000 4.000000 5.000000 223.000000 3.000000
Neighborhood 9 Property type 1 265.857143 151.142857 4.285714 13.428571 471.428571 5.714286
Property type 2 165.500000 110.000000 2.000000 3.500000 114.500000 9.000000

Plot daily bookings:


In [6]:
with plt.style.context('fivethirtyeight'):
    #bookings.sort('booking_date', inplace = True)
    ax = bookings['booking_date'].value_counts().sort_index().plot(figsize = (12,8))
    ax.set_ylabel("Number of bookings")
    ax.set_xlabel('Date')


Plot the daily bookings per neighborhood (provide a legend)


In [7]:
#i need to copy all of the booking data to a new df
#I need to get the neighborhood id from the listings data and add that column by matching the ID
#now i need to add the count per neighborhood per day
#then I need to plot that
bookingsNeighborhood = bookings.merge(listings[['prop_id', 'neighborhood']], on='prop_id')

Part 2 - Develop a data set


In [7]:

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [8]:
listings2 = listings
listings2["number_of_bookings"] = bookings.groupby('prop_id')[['prop_id']].count()
listings2["booking_rate"] = listings2['number_of_bookings']/listings2['tenure_months']

#[['col_name']] returns a data frame, just [] would return a series

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [9]:
listings3 = listings2[listings2['tenure_months'] >=10]

prop_type and neighborhood are categorical variables, use get_dummies()http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [10]:
listings4 = listings3

for column in ['neighborhood','prop_type']:
    dummies = pd.get_dummies(listings4[column])
    listings4[dummies.columns] = dummies

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [11]:
from sklearn.cross_validation import train_test_split

In [15]:
#remove garbage
listings5 = listings4
#listings5.replace([numpy.inf, -numpy.inf], numpy.nan)
#istings5.dropna(axis=0)
#listings5.dropna(subset=['number_of_bookings'],how='any', inplace = True)
listings5 = listings5.dropna()
#listings5.info()

In [132]:
x_axis = listings5[['price', 'person_capacity',
           'picture_count', 'description_length', 'tenure_months', 'Property type 2', 'Property type 3',
           'Neighborhood 18', 'Neighborhood 19', 'Neighborhood 20',
           'Neighborhood 21', 'Neighborhood 4', 'Neighborhood 5',
           'Neighborhood 7', 'Neighborhood 8', 'Neighborhood 9',]]

y_axis = listings5['booking_rate']

x_train, x_test, y_train, y_test = train_test_split(x_axis, y_axis)

In [53]:
listings5['Neighborhood 18'].value_counts()


Out[53]:
0    79
1    34
dtype: int64

In [ ]:

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [95]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

fit your model with your test sets


In [133]:
model = lr.fit(x_test, y_test, n_jobs = 10)

In [134]:
model.score(x_test, y_test)


Out[134]:
0.62463395370380381

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

It returns the R^2 valued which measure of how close the data fits the regression. It is the coefficient of determination.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [ ]: