Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.



In [1]:

    
import pandas as pd
import numpy
import scipy
import matplotlib.pyplot as plt

%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [2]:

    
bookings = pd.read_csv("../data/bookings.csv")
listings = pd.read_csv('../data/listings.csv')

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [3]:

    
listings.describe()
#or
#listings.median()
#listings.mean()
#listings.std()









    Out[3]:






  
    
      
      prop_id
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
  
  
    
      count
       408.000000
        408.000000
       408.000000
       408.000000
        408.000000
       408.000000
    
    
      mean
       204.500000
        187.806373
         2.997549
        14.389706
        309.159314
         8.487745
    
    
      std
       117.923704
        353.050858
         1.594676
        10.477428
        228.021684
         5.872088
    
    
      min
         1.000000
         39.000000
         1.000000
         1.000000
          0.000000
         1.000000
    
    
      25%
       102.750000
         90.000000
         2.000000
         6.000000
        179.000000
         4.000000
    
    
      50%
       204.500000
        125.000000
         2.000000
        12.000000
        250.000000
         7.000000
    
    
      75%
       306.250000
        199.000000
         4.000000
        20.000000
        389.500000
        13.000000
    
    
      max
       408.000000
       5000.000000
        10.000000
        71.000000
       1969.000000
        30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [4]:

    
listings.groupby('prop_type').mean()









    Out[4]:






  
    
      
      prop_id
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      prop_type
      
      
      
      
      
      
    
  
  
    
      Property type 1
       204.754647
       237.085502
       3.516729
       14.695167
       313.171004
        8.464684
    
    
      Property type 2
       206.392593
        93.288889
       2.000000
       13.948148
       304.851852
        8.377778
    
    
      Property type 3
       123.500000
        63.750000
       1.750000
        8.750000
       184.750000
       13.750000

Same, but by property type per neighborhood?



In [5]:

    
listings.groupby(['neighborhood', 'prop_type']).mean()









    Out[5]:






  
    
      
      
      prop_id
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
    
      neighborhood
      prop_type
      
      
      
      
      
      
    
  
  
    
      Neighborhood 1
      Property type 1
       235.000000
        85.000000
       2.000000
       26.000000
       209.000000
        6.000000
    
    
      Neighborhood 10
      Property type 1
       307.500000
       142.500000
       3.500000
       13.333333
       391.000000
        3.833333
    
    
      Property type 2
       327.000000
       137.500000
       2.000000
       20.000000
       126.000000
        3.500000
    
    
      Neighborhood 11
      Property type 1
       174.000000
       159.428571
       3.214286
        9.928571
       379.000000
        9.642857
    
    
      Property type 2
       146.250000
        78.750000
       2.000000
       16.750000
       161.250000
       11.250000
    
    
      Property type 3
       178.000000
        75.000000
       2.000000
       15.000000
       196.000000
        8.000000
    
    
      Neighborhood 12
      Property type 1
       211.307692
       365.615385
       3.435897
       10.820513
       267.205128
        7.897436
    
    
      Property type 2
       164.263158
        96.894737
       1.947368
       10.473684
       244.526316
        9.842105
    
    
      Neighborhood 13
      Property type 1
       190.142857
       241.897959
       4.061224
       15.653061
       290.408163
        9.122449
    
    
      Property type 2
       199.000000
        81.130435
       1.826087
       16.695652
       418.565217
        9.739130
    
    
      Neighborhood 14
      Property type 1
       220.764706
       164.676471
       3.205882
       14.764706
       317.205882
        8.441176
    
    
      Property type 2
       195.047619
        83.809524
       1.857143
       15.904762
       348.619048
        8.714286
    
    
      Property type 3
       286.000000
        75.000000
       1.000000
        1.000000
       113.000000
        5.000000
    
    
      Neighborhood 15
      Property type 1
       191.560000
       178.880000
       3.720000
       14.320000
       321.760000
        9.320000
    
    
      Property type 2
       194.666667
        95.000000
       2.266667
       11.733333
       301.733333
        8.200000
    
    
      Neighborhood 16
      Property type 1
       233.000000
       158.928571
       2.928571
       21.642857
       310.714286
        7.071429
    
    
      Property type 2
       251.562500
        83.625000
       2.062500
       15.375000
       246.250000
        6.687500
    
    
      Neighborhood 17
      Property type 1
       166.043478
       189.869565
       3.521739
       16.086957
       317.347826
        9.869565
    
    
      Property type 2
       242.181818
       102.454545
       2.000000
       15.454545
       308.272727
        7.181818
    
    
      Property type 3
        10.000000
        65.000000
       2.000000
       15.000000
       189.000000
       23.000000
    
    
      Neighborhood 18
      Property type 1
       210.000000
       173.590909
       2.954545
       16.090909
       369.227273
        8.227273
    
    
      Property type 2
       179.333333
       120.666667
       2.222222
       12.333333
       297.777778
        9.222222
    
    
      Neighborhood 19
      Property type 1
       253.250000
       222.375000
       3.625000
       11.000000
       254.500000
        6.500000
    
    
      Property type 2
       256.750000
        88.875000
       2.000000
       15.125000
       383.375000
        5.500000
    
    
      Neighborhood 2
      Property type 1
       244.000000
       250.000000
       6.000000
        8.000000
       423.000000
        6.000000
    
    
      Neighborhood 20
      Property type 1
       174.111111
       804.333333
       2.777778
        9.444444
       223.555556
        9.666667
    
    
      Property type 2
       230.000000
        60.000000
       1.000000
        3.000000
       101.000000
        6.000000
    
    
      Neighborhood 21
      Property type 1
        79.250000
       362.500000
       4.250000
       49.000000
       306.250000
       14.750000
    
    
      Neighborhood 22
      Property type 1
       162.000000
       225.000000
       3.000000
       19.000000
       500.000000
        9.000000
    
    
      Neighborhood 3
      Property type 2
       166.000000
        60.000000
       2.000000
        7.000000
       264.000000
        9.000000
    
    
      Neighborhood 4
      Property type 2
       118.000000
        60.000000
       2.000000
       10.000000
        95.000000
       11.000000
    
    
      Property type 3
        20.000000
        40.000000
       2.000000
        4.000000
       241.000000
       19.000000
    
    
      Neighborhood 5
      Property type 1
       132.500000
       194.500000
       2.500000
        8.500000
       266.500000
       11.500000
    
    
      Neighborhood 6
      Property type 1
       291.333333
       146.000000
       3.333333
       12.666667
       290.666667
        4.000000
    
    
      Neighborhood 7
      Property type 1
       273.333333
       161.000000
       3.666667
       14.333333
       343.000000
        5.333333
    
    
      Property type 2
       365.000000
       100.000000
       2.000000
        3.000000
       148.000000
        2.000000
    
    
      Neighborhood 8
      Property type 1
       218.250000
       174.750000
       5.000000
       11.000000
       300.000000
        6.750000
    
    
      Property type 2
       343.000000
       350.000000
       4.000000
        5.000000
       223.000000
        3.000000
    
    
      Neighborhood 9
      Property type 1
       265.857143
       151.142857
       4.285714
       13.428571
       471.428571
        5.714286
    
    
      Property type 2
       165.500000
       110.000000
       2.000000
        3.500000
       114.500000
        9.000000

Plot daily bookings:



In [6]:

    
with plt.style.context('fivethirtyeight'):
    #bookings.sort('booking_date', inplace = True)
    ax = bookings['booking_date'].value_counts().sort_index().plot(figsize = (12,8))
    ax.set_ylabel("Number of bookings")
    ax.set_xlabel('Date')

Plot the daily bookings per neighborhood (provide a legend)



In [7]:

    
#i need to copy all of the booking data to a new df
#I need to get the neighborhood id from the listings data and add that column by matching the ID
#now i need to add the count per neighborhood per day
#then I need to plot that
bookingsNeighborhood = bookings.merge(listings[['prop_id', 'neighborhood']], on='prop_id')

Part 2 - Develop a data set



In [7]:

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [8]:

    
listings2 = listings
listings2["number_of_bookings"] = bookings.groupby('prop_id')[['prop_id']].count()
listings2["booking_rate"] = listings2['number_of_bookings']/listings2['tenure_months']

#[['col_name']] returns a data frame, just [] would return a series

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months



In [9]:

    
listings3 = listings2[listings2['tenure_months'] >=10]

`prop_type` and `neighborhood` are categorical variables, use `get_dummies()`http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [10]:

    
listings4 = listings3

for column in ['neighborhood','prop_type']:
    dummies = pd.get_dummies(listings4[column])
    listings4[dummies.columns] = dummies

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [11]:

    
from sklearn.cross_validation import train_test_split



In [15]:

    
#remove garbage
listings5 = listings4
#listings5.replace([numpy.inf, -numpy.inf], numpy.nan)
#istings5.dropna(axis=0)
#listings5.dropna(subset=['number_of_bookings'],how='any', inplace = True)
listings5 = listings5.dropna()
#listings5.info()



In [132]:

    
x_axis = listings5[['price', 'person_capacity',
           'picture_count', 'description_length', 'tenure_months', 'Property type 2', 'Property type 3',
           'Neighborhood 18', 'Neighborhood 19', 'Neighborhood 20',
           'Neighborhood 21', 'Neighborhood 4', 'Neighborhood 5',
           'Neighborhood 7', 'Neighborhood 8', 'Neighborhood 9',]]

y_axis = listings5['booking_rate']

x_train, x_test, y_train, y_test = train_test_split(x_axis, y_axis)



In [53]:

    
listings5['Neighborhood 18'].value_counts()









    Out[53]:





0    79
1    34
dtype: int64



In [ ]:

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [95]:

    
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

fit your model with your test sets



In [133]:

    
model = lr.fit(x_test, y_test, n_jobs = 10)

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score



In [134]:

    
model.score(x_test, y_test)









    Out[134]:





0.62463395370380381

Interpret the results of the above model:

What does the score method do?
What does this tell us about our model?

It returns the R^2 valued which measure of how close the data fits the regression. It is the coefficient of determination.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to



In [ ]:

	prop_id	price	person_capacity	picture_count	description_length	tenure_months
count	408.000000	408.000000	408.000000	408.000000	408.000000	408.000000
mean	204.500000	187.806373	2.997549	14.389706	309.159314	8.487745
std	117.923704	353.050858	1.594676	10.477428	228.021684	5.872088
min	1.000000	39.000000	1.000000	1.000000	0.000000	1.000000
25%	102.750000	90.000000	2.000000	6.000000	179.000000	4.000000
50%	204.500000	125.000000	2.000000	12.000000	250.000000	7.000000
75%	306.250000	199.000000	4.000000	20.000000	389.500000	13.000000
max	408.000000	5000.000000	10.000000	71.000000	1969.000000	30.000000

	prop_id	price	person_capacity	picture_count	description_length	tenure_months
prop_type
Property type 1	204.754647	237.085502	3.516729	14.695167	313.171004	8.464684
Property type 2	206.392593	93.288889	2.000000	13.948148	304.851852	8.377778
Property type 3	123.500000	63.750000	1.750000	8.750000	184.750000	13.750000

		prop_id	price	person_capacity	picture_count	description_length	tenure_months
neighborhood	prop_type
Neighborhood 1	Property type 1	235.000000	85.000000	2.000000	26.000000	209.000000	6.000000
Neighborhood 10	Property type 1	307.500000	142.500000	3.500000	13.333333	391.000000	3.833333
Neighborhood 10	Property type 2	327.000000	137.500000	2.000000	20.000000	126.000000	3.500000
Neighborhood 11	Property type 1	174.000000	159.428571	3.214286	9.928571	379.000000	9.642857
	Property type 2	146.250000	78.750000	2.000000	16.750000	161.250000	11.250000
	Property type 3	178.000000	75.000000	2.000000	15.000000	196.000000	8.000000
Neighborhood 12	Property type 1	211.307692	365.615385	3.435897	10.820513	267.205128	7.897436
Neighborhood 12	Property type 2	164.263158	96.894737	1.947368	10.473684	244.526316	9.842105
Neighborhood 13	Property type 1	190.142857	241.897959	4.061224	15.653061	290.408163	9.122449
Neighborhood 13	Property type 2	199.000000	81.130435	1.826087	16.695652	418.565217	9.739130
Neighborhood 14	Property type 1	220.764706	164.676471	3.205882	14.764706	317.205882	8.441176
	Property type 2	195.047619	83.809524	1.857143	15.904762	348.619048	8.714286
	Property type 3	286.000000	75.000000	1.000000	1.000000	113.000000	5.000000
Neighborhood 15	Property type 1	191.560000	178.880000	3.720000	14.320000	321.760000	9.320000
Neighborhood 15	Property type 2	194.666667	95.000000	2.266667	11.733333	301.733333	8.200000
Neighborhood 16	Property type 1	233.000000	158.928571	2.928571	21.642857	310.714286	7.071429
Neighborhood 16	Property type 2	251.562500	83.625000	2.062500	15.375000	246.250000	6.687500
Neighborhood 17	Property type 1	166.043478	189.869565	3.521739	16.086957	317.347826	9.869565
	Property type 2	242.181818	102.454545	2.000000	15.454545	308.272727	7.181818
	Property type 3	10.000000	65.000000	2.000000	15.000000	189.000000	23.000000
Neighborhood 18	Property type 1	210.000000	173.590909	2.954545	16.090909	369.227273	8.227273
Neighborhood 18	Property type 2	179.333333	120.666667	2.222222	12.333333	297.777778	9.222222
Neighborhood 19	Property type 1	253.250000	222.375000	3.625000	11.000000	254.500000	6.500000
Neighborhood 19	Property type 2	256.750000	88.875000	2.000000	15.125000	383.375000	5.500000
Neighborhood 2	Property type 1	244.000000	250.000000	6.000000	8.000000	423.000000	6.000000
Neighborhood 20	Property type 1	174.111111	804.333333	2.777778	9.444444	223.555556	9.666667
Neighborhood 20	Property type 2	230.000000	60.000000	1.000000	3.000000	101.000000	6.000000
Neighborhood 21	Property type 1	79.250000	362.500000	4.250000	49.000000	306.250000	14.750000
Neighborhood 22	Property type 1	162.000000	225.000000	3.000000	19.000000	500.000000	9.000000
Neighborhood 3	Property type 2	166.000000	60.000000	2.000000	7.000000	264.000000	9.000000
Neighborhood 4	Property type 2	118.000000	60.000000	2.000000	10.000000	95.000000	11.000000
Neighborhood 4	Property type 3	20.000000	40.000000	2.000000	4.000000	241.000000	19.000000
Neighborhood 5	Property type 1	132.500000	194.500000	2.500000	8.500000	266.500000	11.500000
Neighborhood 6	Property type 1	291.333333	146.000000	3.333333	12.666667	290.666667	4.000000
Neighborhood 7	Property type 1	273.333333	161.000000	3.666667	14.333333	343.000000	5.333333
Neighborhood 7	Property type 2	365.000000	100.000000	2.000000	3.000000	148.000000	2.000000
Neighborhood 8	Property type 1	218.250000	174.750000	5.000000	11.000000	300.000000	6.750000
Neighborhood 8	Property type 2	343.000000	350.000000	4.000000	5.000000	223.000000	3.000000
Neighborhood 9	Property type 1	265.857143	151.142857	4.285714	13.428571	471.428571	5.714286
Neighborhood 9	Property type 2	165.500000	110.000000	2.000000	3.500000	114.500000	9.000000