Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.



In [72]:

    
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt



In [73]:

    
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)

%matplotlib inline



In [310]:

    
bookingsdate = pd.read_csv('../data/bookings.csv',index_col='booking_date')
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')

Part 1 - Data exploration

First, create 2 data frames: `listings` and `bookings` from their respective data files



In [447]:

    
neighbor11['booking_date']=pd.to_datetime(bookings['booking_date'])
alldates['booking_date']=pd.to_datetime(alldates['booking_date'])
allbooking=pd.merge(neighbor11,alldates,on='booking_date')
neighbor11









    Out[447]:






  
    
      
      neighborhood
      booking_date
    
    
      booking_date
      
      
    
  
  
    
      2011-01-02
       1
      NaT
    
    
      2011-01-05
       1
      NaT
    
    
      2011-01-07
       1
      NaT
    
    
      2011-01-08
       1
      NaT
    
    
      2011-01-09
       1
      NaT
    
    
      ...
      ...
      ...
    
    
      2011-12-20
       1
      NaT
    
    
      2011-12-23
       1
      NaT
    
    
      2011-12-24
       2
      NaT
    
    
      2011-12-26
       1
      NaT
    
    
      2011-12-28
       1
      NaT
    
  

214 rows × 2 columns



In [378]:

    
dates = pd.date_range('1/1/2011', periods=365)
alldates=pd.DataFrame(index=dates)

#alldatesbook=pd.concat([alldates,bookingsdate])
#alldatesbook









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-378-10f56fead1a2> in <module>()
      1 dates = pd.date_range('1/1/2011', periods=365)
----> 2 dates.head()
      3 alldates=pd.DataFrame(index=dates)
      4 
      5 #alldatesbook=pd.concat([alldates,bookingsdate])

AttributeError: 'DatetimeIndex' object has no attribute 'head'



In [ ]:



In [76]:

    
listings









    Out[76]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
    
  
  
    
      0  
         1
       Property type 1
       Neighborhood 14
       140
       3
       11
        232
       30
    
    
      1  
         2
       Property type 1
       Neighborhood 14
        95
       2
        3
         37
       29
    
    
      2  
         3
       Property type 2
       Neighborhood 16
        95
       2
       16
        172
       29
    
    
      3  
         4
       Property type 2
       Neighborhood 13
        90
       2
       19
        472
       28
    
    
      4  
         5
       Property type 1
       Neighborhood 15
       125
       5
       21
        442
       28
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      403
       404
       Property type 2
       Neighborhood 14
       100
       1
        8
        235
        1
    
    
      404
       405
       Property type 2
       Neighborhood 13
        85
       2
       27
       1048
        1
    
    
      405
       406
       Property type 1
        Neighborhood 9
        70
       3
       18
        153
        1
    
    
      406
       407
       Property type 1
       Neighborhood 13
       129
       2
       13
        370
        1
    
    
      407
       408
       Property type 1
       Neighborhood 14
       100
       3
       21
        707
        1
    
  

408 rows × 8 columns



In [76]:

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?



In [274]:

    
listings.describe()









    Out[274]:






  
    
      
      prop_id
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      count
       408.0
        408.0
       408.0
       408.0
        408.0
       408.0
       327.0
       327.0
    
    
      mean
       204.5
        187.8
         3.0
        14.4
        309.2
         8.5
        18.4
         4.0
    
    
      std
       117.9
        353.1
         1.6
        10.5
        228.0
         5.9
        20.4
         6.2
    
    
      min
         1.0
         39.0
         1.0
         1.0
          0.0
         1.0
         1.0
         0.0
    
    
      25%
       102.8
         90.0
         2.0
         6.0
        179.0
         4.0
         3.0
         0.4
    
    
      50%
       204.5
        125.0
         2.0
        12.0
        250.0
         7.0
         9.0
         1.6
    
    
      75%
       306.2
        199.0
         4.0
        20.0
        389.5
        13.0
        29.5
         4.9
    
    
      max
       408.0
       5000.0
        10.0
        71.0
       1969.0
        30.0
       109.0
        52.0



In [275]:

    
# or alternatively...

listingstats=listings.groupby('prop_id').agg(['count', 'mean', 'median', 'min', 'max'])
listingstats[['price','person_capacity','picture_count','description_length','tenure_months']]









    Out[275]:






  
    
      
      price
      person_capacity
      ...
      description_length
      tenure_months
    
    
      
      count
      mean
      median
      min
      max
      count
      mean
      median
      min
      max
      ...
      count
      mean
      median
      min
      max
      count
      mean
      median
      min
      max
    
    
      prop_id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1  
       1
       140
       140
       140
       140
       1
       3
       3
       3
       3
      ...
       1
        232
        232
        232
        232
       1
       30
       30
       30
       30
    
    
      2  
       1
        95
        95
        95
        95
       1
       2
       2
       2
       2
      ...
       1
         37
         37
         37
         37
       1
       29
       29
       29
       29
    
    
      3  
       1
        95
        95
        95
        95
       1
       2
       2
       2
       2
      ...
       1
        172
        172
        172
        172
       1
       29
       29
       29
       29
    
    
      4  
       1
        90
        90
        90
        90
       1
       2
       2
       2
       2
      ...
       1
        472
        472
        472
        472
       1
       28
       28
       28
       28
    
    
      5  
       1
       125
       125
       125
       125
       1
       5
       5
       5
       5
      ...
       1
        442
        442
        442
        442
       1
       28
       28
       28
       28
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      404
       1
       100
       100
       100
       100
       1
       1
       1
       1
       1
      ...
       1
        235
        235
        235
        235
       1
        1
        1
        1
        1
    
    
      405
       1
        85
        85
        85
        85
       1
       2
       2
       2
       2
      ...
       1
       1048
       1048
       1048
       1048
       1
        1
        1
        1
        1
    
    
      406
       1
        70
        70
        70
        70
       1
       3
       3
       3
       3
      ...
       1
        153
        153
        153
        153
       1
        1
        1
        1
        1
    
    
      407
       1
       129
       129
       129
       129
       1
       2
       2
       2
       2
      ...
       1
        370
        370
        370
        370
       1
        1
        1
        1
        1
    
    
      408
       1
       100
       100
       100
       100
       1
       3
       3
       3
       3
      ...
       1
        707
        707
        707
        707
       1
        1
        1
        1
        1
    
  

408 rows × 25 columns

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?



In [276]:

    
typestats=listings.groupby('prop_type').agg(['count', 'mean', 'median', 'min', 'max'])
typestats[['price','person_capacity','picture_count','description_length','tenure_months']]









    Out[276]:






  
    
      
      price
      person_capacity
      ...
      description_length
      tenure_months
    
    
      
      count
      mean
      median
      min
      max
      count
      mean
      median
      min
      max
      ...
      count
      mean
      median
      min
      max
      count
      mean
      median
      min
      max
    
    
      prop_type
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Property type 1
       269
       237.1
       150
       40
       5000
       269
       3.5
       3
       1
       10
      ...
       269
       313.2
       266.0
        17
       1719
       269
        8.5
        7.0
       1
       30
    
    
      Property type 2
       135
        93.3
        89
       39
        350
       135
       2.0
       2
       1
        6
      ...
       135
       304.9
       239.0
         0
       1969
       135
        8.4
        7.0
       1
       29
    
    
      Property type 3
         4
        63.8
        70
       40
         75
         4
       1.8
       2
       1
        2
      ...
         4
       184.8
       192.5
       113
        241
         4
       13.8
       13.5
       5
       23
    
  

3 rows × 25 columns

Same, but by property type per neighborhood?



In [16]:

    
neighborhoodstats=listings.groupby(['prop_type','neighborhood']).agg(['count', 'mean', 'median', 'min', 'max'])
neighborhoodstats[['price','person_capacity','picture_count','description_length','tenure_months']]









    Out[16]:






  
    
      
      
      price
      person_capacity
      ...
      description_length
      tenure_months
    
    
      
      
      count
      mean
      median
      min
      max
      count
      mean
      median
      min
      max
      ...
      count
      mean
      median
      min
      max
      count
      mean
      median
      min
      max
    
    
      prop_type
      neighborhood
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Property type 1
      Neighborhood 1
        1
        85.0
        85.0
        85
         85
        1
       2.0
       2
       2
       2
      ...
        1
       209.0
       209.0
       209
        209
        1
        6.0
        6.0
        6
        6
    
    
      Neighborhood 10
        6
       142.5
       137.5
        90
        205
        6
       3.5
       4
       2
       5
      ...
        6
       391.0
       425.5
       160
        537
        6
        3.8
        4.5
        1
        5
    
    
      Neighborhood 11
       14
       159.4
       130.0
        95
        319
       14
       3.2
       3
       2
       6
      ...
       14
       379.0
       295.5
        82
       1719
       14
        9.6
        9.5
        1
       16
    
    
      Neighborhood 12
       39
       365.6
       150.0
        60
       3050
       39
       3.4
       3
       1
       8
      ...
       39
       267.2
       251.0
        45
        670
       39
        7.9
        6.0
        1
       18
    
    
      Neighborhood 13
       49
       241.9
       180.0
       110
       2394
       49
       4.1
       4
       2
       8
      ...
       49
       290.4
       256.0
        74
        716
       49
        9.1
        8.0
        1
       23
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      Property type 2
      Neighborhood 9
        2
       110.0
       110.0
       100
        120
        2
       2.0
       2
       1
       3
      ...
        2
       114.5
       114.5
        17
        212
        2
        9.0
        9.0
        7
       11
    
    
      Property type 3
      Neighborhood 11
        1
        75.0
        75.0
        75
         75
        1
       2.0
       2
       2
       2
      ...
        1
       196.0
       196.0
       196
        196
        1
        8.0
        8.0
        8
        8
    
    
      Neighborhood 14
        1
        75.0
        75.0
        75
         75
        1
       1.0
       1
       1
       1
      ...
        1
       113.0
       113.0
       113
        113
        1
        5.0
        5.0
        5
        5
    
    
      Neighborhood 17
        1
        65.0
        65.0
        65
         65
        1
       2.0
       2
       2
       2
      ...
        1
       189.0
       189.0
       189
        189
        1
       23.0
       23.0
       23
       23
    
    
      Neighborhood 4
        1
        40.0
        40.0
        40
         40
        1
       2.0
       2
       2
       2
      ...
        1
       241.0
       241.0
       241
        241
        1
       19.0
       19.0
       19
       19
    
  

40 rows × 25 columns

Plot daily bookings:



In [551]:

    
#plotting timeseries of all bookings
dailystats=bookings.groupby(['booking_date']).agg('count')

dailystats.plot(figsize=(20,10))









    Out[551]:





<matplotlib.axes._subplots.AxesSubplot at 0x115a20050>

Plot the daily bookings per neighborhood (provide a legend)



In [277]:

    
joineddf=bookings.merge(listings, on="prop_id")
joineddf









    Out[277]:






  
    
      
      prop_id
      booking_date
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      0   
       188
       2011-01-01
       Property type 1
        Neighborhood 9
        95
       3
       10
       247
        8
       16
       2.0
    
    
      1   
       188
       2011-01-04
       Property type 1
        Neighborhood 9
        95
       3
       10
       247
        8
       16
       2.0
    
    
      2   
       188
       2011-01-05
       Property type 1
        Neighborhood 9
        95
       3
       10
       247
        8
       16
       2.0
    
    
      3   
       188
       2011-01-11
       Property type 1
        Neighborhood 9
        95
       3
       10
       247
        8
       16
       2.0
    
    
      4   
       188
       2011-01-14
       Property type 1
        Neighborhood 9
        95
       3
       10
       247
        8
       16
       2.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      6071
       153
       2011-12-07
       Property type 2
       Neighborhood 18
        95
       2
       22
       174
        9
        6
       0.7
    
    
      6072
       153
       2011-12-28
       Property type 2
       Neighborhood 18
        95
       2
       22
       174
        9
        6
       0.7
    
    
      6073
       403
       2011-12-19
       Property type 1
       Neighborhood 19
       100
       4
        2
        69
        1
        5
       5.0
    
    
      6074
       249
       2011-12-19
       Property type 1
       Neighborhood 14
        75
       2
        9
        59
        6
      NaN
       NaN
    
    
      6075
        64
       2011-12-29
       Property type 1
       Neighborhood 20
       250
       3
        6
       289
       16
        2
       0.1
    
  

6076 rows × 11 columns



In [928]:

    
neighborhoods=joineddf['neighborhood'].unique()
len(neighborhoods)









    Out[928]:





21



In [491]:

    
bookingsneighbor['count']=joineddf.groupby(['neighborhood','booking_date'])['neighborhood'].agg('count')
neighbor11=pd.DataFrame(bookingsneighbor['Neighborhood 11'])
alldates['booking_date']=dates
neighbor11['booking_date'] = neighbor11.index
neighbor11['booking_date'] =neighbor11['booking_date'].astype(str)
type(neighbor11['booking_date'][2])
#pd.merge(neighbor11,alldates,on='booking_date')
bookingsneighborpd=pd.DataFrame(bookingsneighbor)









    Out[491]:





str



In [946]:

    
color=['red','blue','green','yellow','orange','red','blue','green','yellow','orange','red','blue','green','yellow','orange','red','blue','green','yellow','orange']



In [948]:

    
#looking at other students plots I see I may have interpreted daily plot differently and you may have expected a bar plot with days of the week
#I'm having trouble generating a meaningful legend and a random color for each line but I'm pretty happy I figured out how to
# plot a sparse data series because initally I was plotting only the days wheres there was a booking instead of the entire year.
#however I see  how this plot would not be particularly helpful.


import random
neighborhooddfs = []
for hood in neighborhoods:
    neighborhooddfs.append(pd.DataFrame(bookingsneighbor[hood]))
    
alldatesneighborhooddfs=[]
for hooddf in neighborhooddfs:
    hooddf['booking_date'] = hooddf.index
    hooddf['booking_date'] = hooddf['booking_date'].astype(str)
    alldatesneighborhooddfs.append(pd.merge(alldates,hooddf,on='booking_date',how='left'))

alldatesneighborhooddfs[1]=alldatesneighborhooddfs[1].fillna(0)
    

ax=alldatesneighborhooddfs[1].plot(by='booking_date', figsize = (20,10), legend=True)

for i in range(len(alldatesneighborhooddfs)):
    alldatesneighborhooddfs[i]=alldatesneighborhooddfs[i].fillna(0)


for i in range(len(alldatesneighborhooddfs[1:])):
    alldatesneighborhooddfs[i].fillna(0)
    alldatesneighborhooddfs[i].plot(by='booking_date', figsize = (20,10), ax=ax)

#for i in range(len(neighborhoods)):
#   pd.DataFrame(bookingsneighbor[i])
    
# list=["thing", "other thing", "third thing"]    
# for thing in list:
#     print thing

# for i in len(list):
#     print list[i]



In [ ]:

    
seaborn.barplot



In [570]:

    
#didnt use this
neighbor14=pd.DataFrame(bookingsneighbor['Neighborhood 14'])
neighbor16=pd.DataFrame(bookingsneighbor['Neighborhood 16'])
neighbor13=pd.DataFrame(bookingsneighbor['Neighborhood 13'])
neighbor15=pd.DataFrame(bookingsneighbor['Neighborhood 15'])
neighbor18=pd.DataFrame(bookingsneighbor['Neighborhood 18'])
neighbor17=pd.DataFrame(bookingsneighbor['Neighborhood 17'])
neighbor12=pd.DataFrame(bookingsneighbor['Neighborhood 12'])
neighbor4=pd.DataFrame(bookingsneighbor['Neighborhood 4'])
neighbor19=pd.DataFrame(bookingsneighbor['Neighborhood 19'])
neighbor5=pd.DataFrame(bookingsneighbor['Neighborhood 5'])
neighbor20=pd.DataFrame(bookingsneighbor['Neighborhood 20'])
neighbor21=pd.DataFrame(bookingsneighbor['Neighborhood 21'])
neighbor9=pd.DataFrame(bookingsneighbor['Neighborhood 9'])
neighbor7=pd.DataFrame(bookingsneighbor['Neighborhood 7'])
neighbor8=pd.DataFrame(bookingsneighbor['Neighborhood 8'])
neighbor22=pd.DataFrame(bookingsneighbor['Neighborhood 22'])
neighbor3=pd.DataFrame(bookingsneighbor['Neighborhood 3'])
neighbor1=pd.DataFrame(bookingsneighbor['Neighborhood 1'])
neighbor10=pd.DataFrame(bookingsneighbor['Neighborhood 10'])
neighbor6=pd.DataFrame(bookingsneighbor['Neighborhood 6'])




#for i in range(len(bookingsneighbor['neighborhood'])):
#    pd.DataFrame(bookingsneighbor[i])



In [643]:

    
# alldates['booking_date']= alldates['booking_date'].map(lambda datetime: str(alldates['booking_date'][datetime].split(" ")[0]))

# alldates

# # type(alldates['strtime'][2])

# allbooking=pd.merge(alldates,neighbor11,on='booking_date',how='left')
# allbooking=allbooking.fillna(0)
# allbooking.plot(figsize = (20,10),legend=True)
# # # type(alldates['strtime'][2])

# # datetime=pd.Timestamp('2011-01-06')
# # str(alldates['booking_date'][datetime])



In [644]:

    
# ax=bookingsneighbor['Neighborhood 1'].plot(figsize = (20,10), legend=True,xticks=None)



In [645]:

    
# bookingsneighbor=joineddf.groupby(['neighborhood','booking_date'])['neighborhood'].agg('count')

# ax=bookingsneighbor['Neighborhood 11'].plot(by='booking_date', figsize = (20,10), legend=True, ax=ax)
# #bookingsneighbor['Neighborhood 1'].plot(by='booking_date', figsize = (20,10), legend=True, ax=ax)



In [ ]:



In [163]:

    
##Part 2 - Develop a data set









    Out[163]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
    
  
  
    
      0
       1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
      NaN
    
    
      1
       2
       Property type 1
       Neighborhood 14
        95
       2
        3
        37
       29
        4
    
    
      2
       3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
      NaN
    
    
      3
       4
       Property type 2
       Neighborhood 13
        90
       2
       19
       472
       28
        1
    
    
      4
       5
       Property type 1
       Neighborhood 15
       125
       5
       21
       442
       28
       27

Add the columns `number_of_bookings` and `booking_rate` (number_of_bookings/tenure_months) to your `listings` data frame



In [649]:

    
#adding columns for number of bookings and rate
numbook=joineddf.groupby(['prop_id']).agg('count')
listings['number_of_bookings']=numbook['booking_date']
listings['booking_rate']=listings['number_of_bookings']/listings['tenure_months']
listings=listings.fillna(0)
listings









    Out[649]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      0  
         1
       Property type 1
       Neighborhood 14
       140
       3
       11
        232
       30
        0
        0.0
    
    
      1  
         2
       Property type 1
       Neighborhood 14
        95
       2
        3
         37
       29
        4
        0.1
    
    
      2  
         3
       Property type 2
       Neighborhood 16
        95
       2
       16
        172
       29
        0
        0.0
    
    
      3  
         4
       Property type 2
       Neighborhood 13
        90
       2
       19
        472
       28
        1
        0.0
    
    
      4  
         5
       Property type 1
       Neighborhood 15
       125
       5
       21
        442
       28
       27
        1.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      403
       404
       Property type 2
       Neighborhood 14
       100
       1
        8
        235
        1
        1
        1.0
    
    
      404
       405
       Property type 2
       Neighborhood 13
        85
       2
       27
       1048
        1
        3
        3.0
    
    
      405
       406
       Property type 1
        Neighborhood 9
        70
       3
       18
        153
        1
       19
       19.0
    
    
      406
       407
       Property type 1
       Neighborhood 13
       129
       2
       13
        370
        1
       19
       19.0
    
    
      407
       408
       Property type 1
       Neighborhood 14
       100
       3
       21
        707
        1
       15
       15.0
    
  

408 rows × 10 columns

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months



In [661]:

    
#filtering well established properties
established = listings[listings['tenure_months'] > 10]
established=established.fillna(0)
established









    Out[661]:






  
    
      
      prop_id
      prop_type
      neighborhood
      price
      person_capacity
      picture_count
      description_length
      tenure_months
      number_of_bookings
      booking_rate
    
  
  
    
      0  
         1
       Property type 1
       Neighborhood 14
       140
       3
       11
       232
       30
        0
       0.0
    
    
      1  
         2
       Property type 1
       Neighborhood 14
        95
       2
        3
        37
       29
        4
       0.1
    
    
      2  
         3
       Property type 2
       Neighborhood 16
        95
       2
       16
       172
       29
        0
       0.0
    
    
      3  
         4
       Property type 2
       Neighborhood 13
        90
       2
       19
       472
       28
        1
       0.0
    
    
      4  
         5
       Property type 1
       Neighborhood 15
       125
       5
       21
       442
       28
       27
       1.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      115
       116
       Property type 1
       Neighborhood 17
       125
       2
        9
        89
       11
        8
       0.7
    
    
      116
       117
       Property type 2
       Neighborhood 14
        49
       2
       14
       417
       11
        0
       0.0
    
    
      117
       118
       Property type 2
        Neighborhood 4
        60
       2
       10
        95
       11
        8
       0.7
    
    
      118
       119
       Property type 2
       Neighborhood 12
        55
       2
        8
       333
       11
       11
       1.0
    
    
      119
       120
       Property type 2
        Neighborhood 9
       100
       1
        4
       212
       11
        1
       0.1
    
  

120 rows × 10 columns

`prop_type` and `neighborhood` are categorical variables, use `get_dummies()` (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.



In [662]:

    
#creating dummy variables for prop type
listings_dum_prop=pd.get_dummies(listings['prop_type'])
listings_dum_prop









    Out[662]:






  
    
      
      Property type 1
      Property type 2
      Property type 3
    
  
  
    
      0  
       1
       0
       0
    
    
      1  
       1
       0
       0
    
    
      2  
       0
       1
       0
    
    
      3  
       0
       1
       0
    
    
      4  
       1
       0
       0
    
    
      ...
      ...
      ...
      ...
    
    
      403
       0
       1
       0
    
    
      404
       0
       1
       0
    
    
      405
       1
       0
       0
    
    
      406
       1
       0
       0
    
    
      407
       1
       0
       0
    
  

408 rows × 3 columns



In [663]:

    
#creating dummy variables for neighborhood
listings_dum_neig=pd.get_dummies(listings['neighborhood'])
listings_dum_neig









    Out[663]:






  
    
      
      Neighborhood 1
      Neighborhood 10
      Neighborhood 11
      Neighborhood 12
      Neighborhood 13
      Neighborhood 14
      Neighborhood 15
      Neighborhood 16
      Neighborhood 17
      Neighborhood 18
      ...
      Neighborhood 20
      Neighborhood 21
      Neighborhood 22
      Neighborhood 3
      Neighborhood 4
      Neighborhood 5
      Neighborhood 6
      Neighborhood 7
      Neighborhood 8
      Neighborhood 9
    
  
  
    
      0  
       0
       0
       0
       0
       0
       1
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      1  
       0
       0
       0
       0
       0
       1
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      2  
       0
       0
       0
       0
       0
       0
       0
       1
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      3  
       0
       0
       0
       0
       1
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      4  
       0
       0
       0
       0
       0
       0
       1
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      403
       0
       0
       0
       0
       0
       1
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      404
       0
       0
       0
       0
       1
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      405
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       1
    
    
      406
       0
       0
       0
       0
       1
       0
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
    
      407
       0
       0
       0
       0
       0
       1
       0
       0
       0
       0
      ...
       0
       0
       0
       0
       0
       0
       0
       0
       0
       0
    
  

408 rows × 22 columns

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis



In [666]:

    
from sklearn.cross_validation import train_test_split



In [671]:

    
listings.columns









    Out[671]:





Index([u'prop_id', u'prop_type', u'neighborhood', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months', u'number_of_bookings', u'booking_rate'], dtype='object')



In [ ]:



In [809]:

    
bookrate_train, bookrate_test, piccount_train, piccount_test, price_train, price_test, desclen_train, desclen_test, tenure_train, tenure_test = train_test_split(listings['booking_rate'],listings['picture_count'], listings['price'], listings['description_length'], listings['tenure_months'],test_size=0.33, random_state=42)
#bookrate_train, bookrate_test, price_train, price_test= train_test_split(listings['booking_rate'],listings['price'],test_size=0.25)



In [868]:

    
#creating my X training set for fitting
everything_train=[]
for everything in range(len(price_train)):
    everything = [piccount_train[everything],price_train[everything], desclen_train[everything], tenure_train[everything]]
    everything_train.append(everything)



In [882]:

    
#creating my X testing set
everything_test=[]
for everything in range(len(price_test)):
    everything = [piccount_test[everything],price_test[everything], desclen_test[everything], tenure_test[everything]]
    everything_test.append(everything)



In [837]:

    
# didnt use this
pictraining =[]
for pic in piccount_train:
    pictraining.append([pic.astype(float)])



In [838]:

    
# didnt use this
tenuretraining =[]
for tenure in tenure_train:
    tenuretraining.append([tenure.astype(float)])



In [839]:

    
# didnt use this
descriptraining =[]
for description in desclen_train:
    descriptraining.append([description.astype(float)])



In [840]:

    
# didnt use this
price_train
pricetraining =[]
for price in price_train:
    pricetraining.append([price.astype(float)])



In [927]:

    
#creating my Y training set for fitting
bookrate_train
booktraining =[]
for rate in bookrate_train:
    booktraining.append([rate])



In [842]:

    
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

from sklearn.cross_validation import train_test_split
import matplotlib.pylab as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from IPython.core.pylabtools import figsize
figsize(5,5)
plt.style.use('fivethirtyeight')



In [835]:

    
def plot_approximation(est, ax, label=None):
    """Plot the approximation of ``est`` on axis ``ax``. """
    #ax.plot(x_plot, f(x_plot), label='ground truth', color='green')
    ax.scatter( pictraining, booktraining, label='training data', color='red')
    ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='blue', label=label)
    ax.set_ylim((0, 100))
    ax.set_xlim((0, 65))
    ax.set_ylabel('y')
    ax.set_xlabel('x')
    ax.legend(loc='upper right',frameon=True)



In [834]:

    
piccount_train.max()









    Out[834]:





60

Part 3 - Model `booking_rate`

Create a linear regression model of your listings



In [887]:

    
#below fitting my model with my train sets

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

degree = 1
est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
est.fit(everything_train, booktraining)









    Out[887]:





0.12420820100158525



In [863]:

    
# sett=[pictraining,descriptraining,pricetraining,tenuretraining]









    Out[863]:





60.0



In [865]:

    
#making some plots that I did not end up using but I wanted to retain this code

# for lst in sett:
#     fig,ax = plt.subplots(1,1)
#     degree = 1
#     est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
#     est.fit(lst, booktraining)
#     ax.scatter( lst, booktraining, label='training data', color='red')
#     ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='blue',label='fit')
#     ax.set_ylim((0, 55))
#     ax.set_xlim((0, max(map(max,lst))))
#     ax.set_ylabel('y')
#     ax.set_xlabel('x')
#     ax.legend(loc='upper right',frameon=True)

fit your model with your test sets



In [898]:

    
#creating my prediction from my X test set
prediction=est.predict(everything_test)
#plotting this prediction against my Y test set, if this was a good model I would expect a line
fig,ax = plt.subplots(1,1)
ax.scatter( prediction, bookrate_test, label='prediction', color='red')









    Out[898]:





<matplotlib.collections.PathCollection at 0x11e8c9890>

report the score

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score



In [889]:

    
#reporting my score, this model isn't very good
score = est.score(everything_test,bookrate_test)
score









    Out[889]:





0.12420820100158525

Interpret the results of the above model:

What does the score method do?
What does this tell us about our model?

I think the score method is calculating the r sqaured value and an r sqaured of 0.12 is telling me the model is not particularly good at predicting booking rate from the parameters we fed it.

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to



In [ ]:



In [ ]:



In [ ]:



In [ ]:

	neighborhood	booking_date
booking_date
2011-01-02	1	NaT
2011-01-05	1	NaT
2011-01-07	1	NaT
2011-01-08	1	NaT
2011-01-09	1	NaT
...	...	...
2011-12-20	1	NaT
2011-12-23	1	NaT
2011-12-24	2	NaT
2011-12-26	1	NaT
2011-12-28	1	NaT

	prop_id	prop_type	neighborhood	price	person_capacity	picture_count	description_length	tenure_months
0	1	Property type 1	Neighborhood 14	140	3	11	232	30
1	2	Property type 1	Neighborhood 14	95	2	3	37	29
2	3	Property type 2	Neighborhood 16	95	2	16	172	29
3	4	Property type 2	Neighborhood 13	90	2	19	472	28
4	5	Property type 1	Neighborhood 15	125	5	21	442	28
...	...	...	...	...	...	...	...	...
403	404	Property type 2	Neighborhood 14	100	1	8	235	1
404	405	Property type 2	Neighborhood 13	85	2	27	1048	1
405	406	Property type 1	Neighborhood 9	70	3	18	153	1
406	407	Property type 1	Neighborhood 13	129	2	13	370	1
407	408	Property type 1	Neighborhood 14	100	3	21	707	1

	prop_id	price	person_capacity	picture_count	description_length	tenure_months	number_of_bookings	booking_rate
count	408.0	408.0	408.0	408.0	408.0	408.0	327.0	327.0
mean	204.5	187.8	3.0	14.4	309.2	8.5	18.4	4.0
std	117.9	353.1	1.6	10.5	228.0	5.9	20.4	6.2
min	1.0	39.0	1.0	1.0	0.0	1.0	1.0	0.0
25%	102.8	90.0	2.0	6.0	179.0	4.0	3.0	0.4
50%	204.5	125.0	2.0	12.0	250.0	7.0	9.0	1.6
75%	306.2	199.0	4.0	20.0	389.5	13.0	29.5	4.9
max	408.0	5000.0	10.0	71.0	1969.0	30.0	109.0	52.0

	price					person_capacity					...	description_length					tenure_months
	count	mean	median	min	max	count	mean	median	min	max	...	count	mean	median	min	max	count	mean	median	min	max
prop_id
1	1	140	140	140	140	1	3	3	3	3	...	1	232	232	232	232	1	30	30	30	30
2	1	95	95	95	95	1	2	2	2	2	...	1	37	37	37	37	1	29	29	29	29
3	1	95	95	95	95	1	2	2	2	2	...	1	172	172	172	172	1	29	29	29	29
4	1	90	90	90	90	1	2	2	2	2	...	1	472	472	472	472	1	28	28	28	28
5	1	125	125	125	125	1	5	5	5	5	...	1	442	442	442	442	1	28	28	28	28
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
404	1	100	100	100	100	1	1	1	1	1	...	1	235	235	235	235	1	1	1	1	1
405	1	85	85	85	85	1	2	2	2	2	...	1	1048	1048	1048	1048	1	1	1	1	1
406	1	70	70	70	70	1	3	3	3	3	...	1	153	153	153	153	1	1	1	1	1
407	1	129	129	129	129	1	2	2	2	2	...	1	370	370	370	370	1	1	1	1	1
408	1	100	100	100	100	1	3	3	3	3	...	1	707	707	707	707	1	1	1	1	1

	prop_id	booking_date	prop_type	neighborhood	price	person_capacity	picture_count	description_length	tenure_months	number_of_bookings	booking_rate
0	188	2011-01-01	Property type 1	Neighborhood 9	95	3	10	247	8	16	2.0
1	188	2011-01-04	Property type 1	Neighborhood 9	95	3	10	247	8	16	2.0
2	188	2011-01-05	Property type 1	Neighborhood 9	95	3	10	247	8	16	2.0
3	188	2011-01-11	Property type 1	Neighborhood 9	95	3	10	247	8	16	2.0
4	188	2011-01-14	Property type 1	Neighborhood 9	95	3	10	247	8	16	2.0
...	...	...	...	...	...	...	...	...	...	...	...
6071	153	2011-12-07	Property type 2	Neighborhood 18	95	2	22	174	9	6	0.7
6072	153	2011-12-28	Property type 2	Neighborhood 18	95	2	22	174	9	6	0.7
6073	403	2011-12-19	Property type 1	Neighborhood 19	100	4	2	69	1	5	5.0
6074	249	2011-12-19	Property type 1	Neighborhood 14	75	2	9	59	6	NaN	NaN
6075	64	2011-12-29	Property type 1	Neighborhood 20	250	3	6	289	16	2	0.1

	Property type 1	Property type 2	Property type 3
0	1	0	0
1	1	0	0
2	0	1	0
3	0	1	0
4	1	0	0
...	...	...	...
403	0	1	0
404	0	1	0
405	1	0	0
406	1	0	0
407	1	0	0

	Neighborhood 1	Neighborhood 10	Neighborhood 11	Neighborhood 12	Neighborhood 13	Neighborhood 14	Neighborhood 15	Neighborhood 16	Neighborhood 17	Neighborhood 18	...	Neighborhood 20	Neighborhood 21	Neighborhood 22	Neighborhood 3	Neighborhood 4	Neighborhood 5	Neighborhood 6	Neighborhood 7	Neighborhood 8	Neighborhood 9
0	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
403	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
404	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
405	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	1
406	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
407	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

	Property type 1	Property type 2	Property type 3
0	1	0	0
1	1	0	0
2	0	1	0
3	0	1	0
4	1	0	0
...	...	...	...
403	0	1	0
404	0	1	0
405	1	0	0
406	1	0	0
407	1	0	0

	Neighborhood 1	Neighborhood 10	Neighborhood 11	Neighborhood 12	Neighborhood 13	Neighborhood 14	Neighborhood 15	Neighborhood 16	Neighborhood 17	Neighborhood 18	...	Neighborhood 20	Neighborhood 21	Neighborhood 22	Neighborhood 3	Neighborhood 4	Neighborhood 5	Neighborhood 6	Neighborhood 7	Neighborhood 8	Neighborhood 9
0	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
403	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
404	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
405	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	1
406	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
407	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0