Making Inferences: Do AirBnBs Cause Rents to Increase?

Time Series in pandas


In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# draw plots in notebook
%matplotlib inline

In [2]:
# make plots SVG (higher quality)
%config InlineBackend.figure_format = 'svg'

In [3]:
# parsing dates costs extra time/compute, but we know we need real datetimes here
df = pd.read_csv('data/sf_listings.csv', parse_dates=['last_review'], infer_datetime_format=True)
df_reviews = pd.read_csv('data/reviews.csv', parse_dates=['date'], infer_datetime_format=True)

In [4]:
df_reviews.date[0]


Out[4]:
Timestamp('2014-02-27 00:00:00')
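
A minimal sketch of what parse_dates buys us, assuming a tiny in-memory CSV (sample_csv is a made-up stand-in for data/reviews.csv): the column comes back as real datetimes, so date arithmetic works without any string handling.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for data/reviews.csv
sample_csv = io.StringIO(
    "listing_id,date\n"
    "1994427,2014-02-27\n"
    "1994427,2015-10-07\n"
)

reviews = pd.read_csv(sample_csv, parse_dates=['date'])

# Each value is a pandas Timestamp, so subtraction yields a Timedelta
first, second = reviews['date'].iloc[0], reviews['date'].iloc[1]
gap_days = (second - first).days
print(first, gap_days)
```

Without parse_dates the column would be plain strings, and the subtraction above would raise an error.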

In [5]:
df.head()


Out[5]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 1162609 Lovely One Bedroom Apartment 6368122 Taylor NaN Seacliff 37.785217 -122.488655 Entire home/apt 350 4 8 2015-09-17 0.28 1 90
1 6032828 Historic Seacliff Home 30384615 Patricia NaN Seacliff 37.783658 -122.489398 Entire home/apt 300 1 0 NaT NaN 1 173
2 6938818 Best Secret in Town 36381578 Harris NaN Seacliff 37.781505 -122.504754 Private room 119 1 10 2015-10-08 2.33 1 341
3 8087607 Single Room Beautiful Beach Condo 3264449 Keith NaN Seacliff 37.775318 -122.511621 Private room 79 1 0 NaT NaN 1 40
4 4781448 3 Bd 2.5 Ba Full Flat Condo w Views 13112558 Pam NaN Seacliff 37.781797 -122.492492 Entire home/apt 695 2 1 2015-08-11 0.36 1 23

In [6]:
# display general diagnostic info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7029 entries, 0 to 7028
Data columns (total 16 columns):
id                                7029 non-null int64
name                              7029 non-null object
host_id                           7029 non-null int64
host_name                         6504 non-null object
neighbourhood_group               0 non-null float64
neighbourhood                     7029 non-null object
latitude                          7029 non-null float64
longitude                         7029 non-null float64
room_type                         7029 non-null object
price                             7029 non-null int64
minimum_nights                    7029 non-null int64
number_of_reviews                 7029 non-null int64
last_review                       5698 non-null datetime64[ns]
reviews_per_month                 5698 non-null float64
calculated_host_listings_count    7029 non-null int64
availability_365                  7029 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(7), object(4)
memory usage: 878.7+ KB

In [7]:
df_reviews.head()


Out[7]:
listing_id date
0 1994427 2014-02-27
1 1994427 2015-10-07
2 1994427 2015-10-12
3 1994427 2015-10-17
4 1994427 2015-10-26

In [8]:
# index DataFrame on listing_id in order to join datasets
reindexed_df = df_reviews.set_index('listing_id')
reindexed_df.head()


Out[8]:
date
listing_id
1994427 2014-02-27
1994427 2015-10-07
1994427 2015-10-12
1994427 2015-10-17
1994427 2015-10-26

In [9]:
# remember the original id in a column to group on
df['listing_id'] = df['id']
df_listing = df.set_index('id')
df_listing.head()


Out[9]:
name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 listing_id
id
1162609 Lovely One Bedroom Apartment 6368122 Taylor NaN Seacliff 37.785217 -122.488655 Entire home/apt 350 4 8 2015-09-17 0.28 1 90 1162609
6032828 Historic Seacliff Home 30384615 Patricia NaN Seacliff 37.783658 -122.489398 Entire home/apt 300 1 0 NaT NaN 1 173 6032828
6938818 Best Secret in Town 36381578 Harris NaN Seacliff 37.781505 -122.504754 Private room 119 1 10 2015-10-08 2.33 1 341 6938818
8087607 Single Room Beautiful Beach Condo 3264449 Keith NaN Seacliff 37.775318 -122.511621 Private room 79 1 0 NaT NaN 1 40 8087607
4781448 3 Bd 2.5 Ba Full Flat Condo w Views 13112558 Pam NaN Seacliff 37.781797 -122.492492 Entire home/apt 695 2 1 2015-08-11 0.36 1 23 4781448

Star Schema (facts vs. dimensions)

In our case, the individual review events are the facts and listings themselves are the dimensions.
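
As a rough illustration of the fact/dimension split (toy values, not the real dataset), joining a dimension table to a fact table on the listing id repeats the listing attributes once per review event:

```python
import pandas as pd

# Dimension table: one row per listing (hypothetical values)
listings = pd.DataFrame(
    {'name': ['Garden Unit', 'Beach Condo'], 'price': [170, 79]},
    index=pd.Index([958, 8087607], name='id'),
)

# Fact table: one row per review event, keyed by listing_id
reviews = pd.DataFrame(
    {'date': pd.to_datetime(['2009-07-23', '2009-08-03', '2009-09-27'])},
    index=pd.Index([958, 958, 958], name='listing_id'),
)

# Left join on the index: listing attributes repeat once per review,
# and a listing with no reviews keeps a single row with date = NaT
joined = listings.join(reviews)
print(joined)
```

Note that join defaults to a left join, so listings without any reviews stay in the result with a missing date.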


In [10]:
# join the listing information with the review information
review_timeseries = df_listing.join(reindexed_df)

print(review_timeseries.columns)
review_timeseries.head()


Index([u'name', u'host_id', u'host_name', u'neighbourhood_group',
       u'neighbourhood', u'latitude', u'longitude', u'room_type', u'price',
       u'minimum_nights', u'number_of_reviews', u'last_review',
       u'reviews_per_month', u'calculated_host_listings_count',
       u'availability_365', u'listing_id', u'date'],
      dtype='object')
Out[10]:
name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 listing_id date
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2009-07-23
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2009-08-03
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2009-09-27
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2009-11-05
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2010-02-13

In [11]:
# nothing new/interesting here...
review_timeseries.groupby('listing_id').count()['name'].hist(bins=100, figsize=(12,6));



In [12]:
# causes python to crash, let's see if there is a better way
# review_timeseries.groupby(['neighbourhood','date']).count()

In [13]:
# let's try a cross-tabulation (a pivot table of counts)...
reviews_over_time = pd.crosstab(review_timeseries.date, review_timeseries.neighbourhood)
reviews_over_time.head()


Out[13]:
neighbourhood Bayview Bernal Heights Castro/Upper Market Chinatown Crocker Amazon Diamond Heights Downtown/Civic Center Excelsior Financial District Glen Park ... Presidio Presidio Heights Russian Hill Seacliff South of Market Treasure Island/YBI Twin Peaks Visitacion Valley West of Twin Peaks Western Addition
date
2009-03-29 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-05-03 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-05-23 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-06-12 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-07-15 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 37 columns
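
To make the crosstab step concrete, here is a small sketch on made-up events. It also shows a groupby/size/unstack formulation that produces the same table on this toy data; whether that variant would avoid the crash seen above on the full dataset is not something this sketch tests.

```python
import pandas as pd

# Hypothetical review events: one row per (date, neighbourhood) occurrence
events = pd.DataFrame({
    'date': pd.to_datetime(['2009-05-03', '2009-05-03',
                            '2009-05-23', '2009-06-12']),
    'neighbourhood': ['Bernal Heights', 'Mission',
                      'Bernal Heights', 'Mission'],
})

# Rows are dates, columns are neighbourhoods, cells are event counts
by_crosstab = pd.crosstab(events['date'], events['neighbourhood'])

# Same counts via groupby: size() counts rows per group,
# unstack() pivots neighbourhood into columns, missing pairs become 0
by_groupby = (events.groupby(['date', 'neighbourhood'])
                    .size()
                    .unstack(fill_value=0))

print(by_crosstab)
```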


In [14]:
# let's look at some particular neighborhoods
neighborhoods = df.neighbourhood.unique()
print(neighborhoods)


['Seacliff' 'Haight Ashbury' 'Outer Mission' 'Downtown/Civic Center'
 'Diamond Heights' 'Lakeshore' 'Russian Hill' 'Noe Valley' 'Inner Sunset'
 'Treasure Island/YBI' 'Outer Richmond' 'Crocker Amazon' 'Excelsior'
 'Parkside' 'Financial District' 'Ocean View' 'Mission'
 'West of Twin Peaks' 'Inner Richmond' 'Marina' 'Bayview'
 'Visitacion Valley' 'Pacific Heights' 'Presidio Heights' 'South of Market'
 'Glen Park' 'Potrero Hill' 'Castro/Upper Market' 'Twin Peaks'
 'Bernal Heights' 'Chinatown' 'North Beach' 'Presidio' 'Nob Hill'
 'Outer Sunset' 'Western Addition' 'Golden Gate Park']

In [16]:
# a little noisy
reviews_over_time[['Mission', 'South of Market', 'Noe Valley']].plot(figsize=(12,6))


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d5d2d10>

In [17]:
# smooth by resampling by month
reviews_over_time.resample('M').mean()[['Mission', 'South of Market', 'Noe Valley']].plot(figsize=(12,6))


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d262990>

Pandas Resample String convention
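
A short sketch of how the resample rule string changes the result, on made-up daily counts. 'MS' (month start) and 'W' (weekly) are used here; 'MS' is my substitution for the notebook's 'M' because it is accepted across pandas versions. Also note that mean() and sum() answer different questions: average reviews per observed day versus total reviews in the period.

```python
import pandas as pd

# Hypothetical daily review counts for one neighbourhood
daily = pd.Series(
    [1, 0, 2, 3, 1, 4],
    index=pd.to_datetime(['2015-08-01', '2015-08-15', '2015-08-30',
                          '2015-09-02', '2015-09-10', '2015-09-28']),
)

monthly_mean = daily.resample('MS').mean()   # average per observed day
monthly_total = daily.resample('MS').sum()   # total per calendar month
weekly_total = daily.resample('W').sum()     # total per week

print(monthly_total)
```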

Interesting things of note:

  • Each neighborhood has an activity spike in Fall 2014 and Summer 2015.
  • Likely a late-summer vacation surge, since that is when SF has its nicest weather :)
    • It is periodic and the magnitude of the increase is itself increasing (good news for AirBnB!)...

Using the following functions, find which columns correlate with increased activity (# of reviews and reviews per month):


In [ ]:
# Exercise 1 Solution
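
One possible starting point for the exercise, not the official solution: DataFrame.corr() gives pairwise Pearson correlations between numeric columns. The sketch below uses only the five rows shown in df.head() above, so the numbers themselves mean very little; on the real data you would call the same method on df.

```python
import pandas as pd

# The five rows printed by df.head() above (numeric columns only)
toy = pd.DataFrame({
    'price': [350, 300, 119, 79, 695],
    'minimum_nights': [4, 1, 1, 1, 2],
    'availability_365': [90, 173, 341, 40, 23],
    'number_of_reviews': [8, 0, 10, 0, 1],
})

# Correlation of every other column with number_of_reviews, strongest first
corr_with_reviews = (toy.corr()['number_of_reviews']
                        .drop('number_of_reviews')
                        .sort_values(ascending=False))
print(corr_with_reviews)
```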

Correlation vs. Regression

Actually, the two are very similar mathematically...

At least inferentially: http://stats.stackexchange.com/questions/2125/whats-the-difference-between-correlation-and-simple-linear-regression
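
The connection can be checked numerically. In simple (one-variable) linear regression the fitted slope equals r * std(y) / std(x), and the R^2 of the fit equals r squared. A small sketch on synthetic data (the data and seed are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)

# Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# Ordinary least squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# The slope is the correlation rescaled by the two standard deviations
slope_from_r = r * y.std() / x.std()

# R^2 of the fitted line equals the squared correlation
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

print(slope, slope_from_r, r_squared, r ** 2)
```

So correlation is unit-free, while the regression slope carries the units of y per unit of x.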

Introduction to Machine Learning

slideshow!

Model Evaluation: Finding Under (or over) valued Listings


In [18]:
from sklearn import linear_model

In [31]:
features = df[['host_name', 'neighbourhood', 'room_type', 'minimum_nights','number_of_reviews', \
               'calculated_host_listings_count', 'availability_365']]
labels = df['price']

In [20]:
# no price!
features.head()


Out[20]:
host_name neighbourhood room_type minimum_nights number_of_reviews calculated_host_listings_count availability_365
0 Taylor Seacliff Entire home/apt 4 8 1 90
1 Patricia Seacliff Entire home/apt 1 0 1 173
2 Harris Seacliff Private room 1 10 1 341
3 Keith Seacliff Private room 1 0 1 40
4 Pam Seacliff Entire home/apt 2 1 1 23

In [21]:
# Categorical -> One Hot Encoding
# http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
dummies = pd.get_dummies(features)

# sklearn likes matrices
feature_matrix = dummies.as_matrix()
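
A tiny sketch of what get_dummies does, on a made-up frame: numeric columns pass through untouched and each distinct category value becomes its own 0/1 column. This is also why a column with many distinct values (like host_name) blows up the feature count.

```python
import pandas as pd

toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Private room'],
    'minimum_nights': [4, 1, 1],
})

# One new column per distinct room_type; minimum_nights is left as-is
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# Plain float array for sklearn
toy_matrix = encoded.values.astype(float)
```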

In [22]:
labels.as_matrix()


Out[22]:
array([ 350,  300,  119, ...,   80, 2500,   72])

In [23]:
feature_matrix


Out[23]:
array([[  4.,   8.,   1., ...,   1.,   0.,   0.],
       [  1.,   0.,   1., ...,   1.,   0.,   0.],
       [  1.,  10.,   1., ...,   0.,   1.,   0.],
       ..., 
       [  1.,   0.,   1., ...,   0.,   1.,   0.],
       [  1.,   0.,   1., ...,   1.,   0.,   0.],
       [  2.,   3.,   2., ...,   1.,   0.,   0.]])

In [24]:
# Initialize and Fit sklearn model

model = linear_model.LinearRegression()
clf = model.fit(feature_matrix, labels.as_matrix())

In [25]:
# How well did we do?
clf.score(feature_matrix, labels.as_matrix())


Out[25]:
0.48736577956996263

In [27]:
print("There are {0} features...".format(len(clf.coef_)))
clf.coef_


There are 2442 features...
Out[27]:
array([  -0.55304055,   -0.89869096,   -2.23493976, ...,  126.97532449,
        -28.92829922,  -98.04702526])

In [32]:
# Remove the name column, we are probably overfitting...
no_name = features.copy()
no_name.pop('host_name')
no_names_feature_m = pd.get_dummies(no_name).as_matrix()

In [33]:
model = linear_model.LinearRegression(normalize=True)
clf = model.fit(no_names_feature_m, labels.as_matrix())

In [34]:
# Turns out the name feature is highly predictive...
# but not very useful: https://www.kaggle.com/wiki/Leakage
clf.score(no_names_feature_m, labels.as_matrix())


Out[34]:
0.08339169072234609
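
One way to read the drop from 0.487 to 0.083: one-hot encoding a near-unique identifier lets the model memorize individual rows. The sketch below is a deliberately extreme toy case (every training row has its own "host", prices are pure noise, and plain numpy least squares stands in for sklearn), so it illustrates the mechanism rather than reproducing the notebook's numbers.

```python
import numpy as np

rng = np.random.RandomState(42)
n_train, n_test = 50, 50

# Prices that depend on nothing but noise
y_train = 100 + rng.normal(scale=20, size=n_train)
y_test = 100 + rng.normal(scale=20, size=n_test)

# Intercept column plus a one-hot "host" column per training row;
# test rows belong to unseen hosts, so their dummies are all zero
X_train = np.hstack([np.ones((n_train, 1)), np.eye(n_train)])
X_test = np.hstack([np.ones((n_test, 1)), np.zeros((n_test, n_train))])

# Least-squares fit (minimum-norm solution, since X_train is rank deficient)
coef = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

def r2(X, y):
    """Coefficient of determination for the fitted coefficients."""
    resid = y - X.dot(coef)
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

train_r2, test_r2 = r2(X_train, y_train), r2(X_test, y_test)
print(train_r2, test_r2)
```

The training score is essentially perfect while the held-out score is no better than predicting a constant, which is why scoring on the data used for fitting can flatter a model.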

In [35]:
len(clf.coef_)


Out[35]:
44

In [37]:
# We need more and better features
df2 = pd.read_csv('data/listings_full.csv')
df2.columns


Out[37]:
Index([u'id', u'listing_url', u'scrape_id', u'last_scraped', u'name',
       u'summary', u'space', u'description', u'experiences_offered',
       u'neighborhood_overview', u'notes', u'transit', u'thumbnail_url',
       u'medium_url', u'picture_url', u'xl_picture_url', u'host_id',
       u'host_url', u'host_name', u'host_since', u'host_location',
       u'host_about', u'host_response_time', u'host_response_rate',
       u'host_acceptance_rate', u'host_is_superhost', u'host_thumbnail_url',
       u'host_picture_url', u'host_neighbourhood', u'host_listings_count',
       u'host_total_listings_count', u'host_verifications',
       u'host_has_profile_pic', u'host_identity_verified', u'street',
       u'neighbourhood', u'neighbourhood_cleansed',
       u'neighbourhood_group_cleansed', u'city', u'state', u'zipcode',
       u'market', u'smart_location', u'country_code', u'country', u'latitude',
       u'longitude', u'is_location_exact', u'property_type', u'room_type',
       u'accommodates', u'bathrooms', u'bedrooms', u'beds', u'bed_type',
       u'amenities', u'square_feet', u'price', u'weekly_price',
       u'monthly_price', u'security_deposit', u'cleaning_fee',
       u'guests_included', u'extra_people', u'minimum_nights',
       u'maximum_nights', u'calendar_updated', u'has_availability',
       u'availability_30', u'availability_60', u'availability_90',
       u'availability_365', u'calendar_last_scraped', u'number_of_reviews',
       u'first_review', u'last_review', u'review_scores_rating',
       u'review_scores_accuracy', u'review_scores_cleanliness',
       u'review_scores_checkin', u'review_scores_communication',
       u'review_scores_location', u'review_scores_value', u'requires_license',
       u'license', u'jurisdiction_names', u'instant_bookable',
       u'cancellation_policy', u'require_guest_profile_picture',
       u'require_guest_phone_verification', u'calculated_host_listings_count',
       u'reviews_per_month'],
      dtype='object')

In [38]:
df2.head()


Out[38]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview ... review_scores_value requires_license license jurisdiction_names instant_bookable cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count reviews_per_month
0 1162609 https://www.airbnb.com/rooms/1162609 20151101181058 2015-11-02 Lovely One Bedroom Apartment This is a big well designed space. Our neighbo... This is a great one bedroom apartment. It is a... This is a big well designed space. Our neighbo... none NaN ... 9.0 t NaN SAN FRANCISCO f strict f f 1 0.28
1 6032828 https://www.airbnb.com/rooms/6032828 20151101181058 2015-11-01 Historic Seacliff Home Beautiful charmng multi- bedroom home in prest... NaN Beautiful charmng multi- bedroom home in prest... none NaN ... NaN t NaN SAN FRANCISCO f flexible f f 1 NaN
2 6938818 https://www.airbnb.com/rooms/6938818 20151101181058 2015-11-02 Best Secret in Town Convenience is the key for your travel. Walkin... The room is spacious and it is on the top leve... Convenience is the key for your travel. Walkin... none It is in the city and close to everything. Par... ... 9.0 t S. F. Short-Term Residential Rental Registrati... SAN FRANCISCO f strict f f 1 2.33
3 8087607 https://www.airbnb.com/rooms/8087607 20151101181058 2015-11-02 Single Room Beautiful Beach Condo When I travel, my quiet beach side condo makes... NaN When I travel, my quiet beach side condo makes... none NaN ... NaN t NaN SAN FRANCISCO f flexible f f 1 NaN
4 4781448 https://www.airbnb.com/rooms/4781448 20151101181058 2015-11-02 3 Bd 2.5 Ba Full Flat Condo w Views Our family-friendly condo is located in San Fr... Open concept living, dining area has great cit... Our family-friendly condo is located in San Fr... none We are located across the street from Lincoln ... ... 10.0 t NaN SAN FRANCISCO t strict f f 1 0.36

5 rows × 92 columns


In [39]:
# get a snapshot of some of the columns in the center of the matrix
df2.iloc[1:5, 40:60]


Out[39]:
zipcode market smart_location country_code country latitude longitude is_location_exact property_type room_type accommodates bathrooms bedrooms beds bed_type amenities square_feet price weekly_price monthly_price
1 94121.0 San Francisco San Francisco, CA US United States 37.783658 -122.489398 t House Entire home/apt 6 2.0 4.0 2.0 Real Bed {TV,"Cable TV",Internet,"Wireless Internet",Ki... NaN $300.00 NaN NaN
2 94121.0 San Francisco San Francisco, CA US United States 37.781505 -122.504754 t House Private room 2 1.0 1.0 1.0 Futon {TV,Internet,"Wireless Internet",Kitchen,Heati... NaN $119.00 $700.00 NaN
3 94121.0 San Francisco San Francisco, CA US United States 37.775318 -122.511621 t Apartment Private room 2 1.0 1.0 1.0 Real Bed {TV,"Cable TV",Internet,"Wireless Internet",Ki... NaN $79.00 NaN NaN
4 94121.0 San Francisco San Francisco, CA US United States 37.781797 -122.492492 t Apartment Entire home/apt 8 2.5 3.0 4.0 Real Bed {TV,"Cable TV",Internet,"Wireless Internet",Ki... NaN $695.00 NaN NaN

In [40]:
# optimistically, let's just use a few key features to start. Remember Occam's razor...
select_features = df2[['host_has_profile_pic' ,'host_identity_verified', 'host_listings_count','host_response_time', 'host_acceptance_rate', 'host_is_superhost', 'transit', 'neighbourhood_cleansed','is_location_exact', 'property_type', 'room_type', 'accommodates','bathrooms','bedrooms','beds']]

In [41]:
select_features.head()


Out[41]:
host_has_profile_pic host_identity_verified host_listings_count host_response_time host_acceptance_rate host_is_superhost transit neighbourhood_cleansed is_location_exact property_type room_type accommodates bathrooms bedrooms beds
0 t t 1.0 within an hour 100% f NaN Seacliff t Apartment Entire home/apt 2 1.0 1.0 1.0
1 t f 1.0 NaN NaN f NaN Seacliff t House Entire home/apt 6 2.0 4.0 2.0
2 t t 1.0 within a few hours 75% f If you are on budget, the public transportatio... Seacliff t House Private room 2 1.0 1.0 1.0
3 t t 1.0 within an hour 0% f NaN Seacliff t Apartment Private room 2 1.0 1.0 1.0
4 t t 2.0 within a few hours 100% f Easy bus access to the entire city with 3 bus ... Seacliff t Apartment Entire home/apt 8 2.5 3.0 4.0

In [43]:
# more feature engineering: fill in missing data, since it will break our model
select_features = select_features.fillna({'host_response_time': 'NA', 'host_acceptance_rate': '-1%'})
select_features.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7029 entries, 0 to 7028
Data columns (total 15 columns):
host_has_profile_pic      7028 non-null object
host_identity_verified    7028 non-null object
host_listings_count       7028 non-null float64
host_response_time        7029 non-null object
host_acceptance_rate      7029 non-null object
host_is_superhost         7028 non-null object
transit                   4571 non-null object
neighbourhood_cleansed    7029 non-null object
is_location_exact         7029 non-null object
property_type             7024 non-null object
room_type                 7029 non-null object
accommodates              7029 non-null int64
bathrooms                 6974 non-null float64
bedrooms                  7020 non-null float64
beds                      7017 non-null float64
dtypes: float64(4), int64(1), object(10)
memory usage: 823.8+ KB

In [44]:
# convert the percentage as a string into a float
select_features.host_acceptance_rate = select_features.host_acceptance_rate.str.strip('%').astype(float) / 100

In [45]:
# Binarize transit column. Note: isnull() makes True mean the listing has NO transit description
select_features.transit = select_features.transit.isnull()

In [46]:
select_features.transit


Out[46]:
0        True
1        True
2       False
3        True
4       False
5       False
6       False
7       False
8       False
9       False
10       True
11      False
12       True
13      False
14      False
15      False
16       True
17       True
18      False
19       True
20       True
21      False
22      False
23       True
24       True
25       True
26      False
27      False
28       True
29      False
        ...  
6999    False
7000    False
7001    False
7002    False
7003    False
7004     True
7005    False
7006    False
7007    False
7008    False
7009    False
7010    False
7011    False
7012    False
7013    False
7014    False
7015    False
7016    False
7017    False
7018     True
7019     True
7020    False
7021    False
7022    False
7023    False
7024    False
7025    False
7026     True
7027     True
7028    False
Name: transit, dtype: bool
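
To double check which way the flag points, here is a sketch on a made-up Series: isnull() is True where the description is missing, and notnull() is its inverse. For a linear model either encoding carries the same information (only the coefficient's sign flips), but the label matters when reading the coefficients.

```python
import numpy as np
import pandas as pd

# Hypothetical transit descriptions, two of them missing
transit = pd.Series([np.nan, 'Easy bus access', np.nan,
                     '10 minute walk to BART'])

missing_description = transit.isnull()   # True where nothing was written
has_description = transit.notnull()      # True where a description exists

print(missing_description.tolist())
```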

In [47]:
# One last fill in case we missed any nulls
dummies = pd.get_dummies(select_features).fillna(0)
feature_matrix = dummies.as_matrix()

In [48]:
# Price as a currency string -> price as a float
labels = df2.price.str.strip('$').str.replace(',', '').astype(float)
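
The two string-to-number conversions used in this section, sketched on made-up values. The '-1%' entry mirrors the sentinel used by the fillna call earlier; after conversion it becomes -0.01.

```python
import pandas as pd

prices = pd.Series(['$300.00', '$1,250.00', '$79.00'])
rates = pd.Series(['100%', '75%', '-1%'])

# Drop the currency symbol and thousands separator, then cast
price_float = prices.str.strip('$').str.replace(',', '').astype(float)

# Drop the percent sign and rescale to a 0-1 fraction
rate_float = rates.str.strip('%').astype(float) / 100

print(price_float.tolist(), rate_float.tolist())
```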

In [49]:
# initialize model again
model = linear_model.LinearRegression(normalize=True)
clf = model.fit(feature_matrix, labels)

In [50]:
# much better than 0.083! (still a modest R^2, and measured on the same data we trained on)
clf.score(feature_matrix, labels)


Out[50]:
0.2237815899681127

In [52]:
# 79 coefficients: between the 44 of the underfit model and the 2442 of the overfit one
len(clf.coef_)


Out[52]:
79

Making Predictions: How should I price my Listing?!?


In [53]:
# Predict what we should price the listing at row 1100, given its features
# (slice rather than index so predict receives a 2-D array with one row)
clf.predict(feature_matrix[1100:1101])


Out[53]:
array([ 479.5])

In [55]:
# Looks like it is overpriced...
df2.iloc[1100].price


Out[55]:
'$550.00'

In [56]:
# And it shows... there are only 2 reviews per month
df2.iloc[1100]


Out[56]:
id                                                                            8700433
listing_url                                      https://www.airbnb.com/rooms/8700433
scrape_id                                                              20151101181058
last_scraped                                                               2015-11-02
name                                              Modern Victorian Home in Noe Valley
summary                             This comfortable and stylish Victorian home ha...
space                               Our rental space is the upper floor of our hom...
description                         This comfortable and stylish Victorian home ha...
experiences_offered                                                              none
neighborhood_overview               Noe Valley is a neighborhood located between t...
notes                               Two night minimum for Fri. and Sat. night unle...
transit                             It is a 10-15 minute walk to the 24th street s...
thumbnail_url                       https://a0.muscache.com/ac/pictures/110465388/...
medium_url                          https://a0.muscache.com/im/pictures/110465388/...
picture_url                         https://a0.muscache.com/ac/pictures/110465388/...
xl_picture_url                      https://a0.muscache.com/ac/pictures/110465388/...
host_id                                                                        801578
host_url                                     https://www.airbnb.com/users/show/801578
host_name                                                                        Mark
host_since                                                                 2011-07-10
host_location                                San Francisco, California, United States
host_about                          Nice guy, serial technology entrepreneur, occa...
host_response_time                                                 within a few hours
host_response_rate                                                               100%
host_acceptance_rate                                                              98%
host_is_superhost                                                                   f
host_thumbnail_url                  https://a0.muscache.com/ac/users/801578/profil...
host_picture_url                    https://a0.muscache.com/ac/users/801578/profil...
host_neighbourhood                                                  Island of Hawaiʻi
host_listings_count                                                                 3
                                                          ...                        
guests_included                                                                     1
extra_people                                                                    $0.00
minimum_nights                                                                      1
maximum_nights                                                                   1125
calendar_updated                                                           2 days ago
has_availability                                                                    t
availability_30                                                                    13
availability_60                                                                    42
availability_90                                                                    72
availability_365                                                                  347
calendar_last_scraped                                                      2015-11-01
number_of_reviews                                                                   2
first_review                                                               2015-10-19
last_review                                                                2015-10-24
review_scores_rating                                                               90
review_scores_accuracy                                                             10
review_scores_cleanliness                                                           9
review_scores_checkin                                                               9
review_scores_communication                                                        10
review_scores_location                                                             10
review_scores_value                                                                 9
requires_license                                                                    t
license                                                                           NaN
jurisdiction_names                                                      SAN FRANCISCO
instant_bookable                                                                    f
cancellation_policy                                                            strict
require_guest_profile_picture                                                       f
require_guest_phone_verification                                                    f
calculated_host_listings_count                                                      1
reviews_per_month                                                                   2
Name: 1100, dtype: object

In [57]:
# Whereas the top listings have 10+ reviews per month
df2.sort_values('reviews_per_month', ascending=False).head()


Out[57]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview ... review_scores_value requires_license license jurisdiction_names instant_bookable cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count reviews_per_month
1271 6437264 https://www.airbnb.com/rooms/6437264 20151101181058 2015-11-02 Apartment + Loft overlooking SF Unique sparsely furnished apartment in Noe val... This is a very sparsely furnished main floor a... Unique sparsely furnished apartment in Noe val... none You are in a very safe and desirable neighborh... ... 9.0 t NaN SAN FRANCISCO f flexible f f 1 11.52
1746 6629798 https://www.airbnb.com/rooms/6629798 20151101181058 2015-11-02 Room w/ private bath + free parking One bedroom + private bathroom (which has a sk... -Free Fast WiFi (105Mbps) One bedroom + private bathroom (which has a sk... none is a safe, quiet and sunny neighborhood locate... ... 9.0 t NaN SAN FRANCISCO t moderate f f 1 11.25
3038 8118808 https://www.airbnb.com/rooms/8118808 20151101181058 2015-11-02 Master Suite in a beautiful house It is one block walking to Forest Hill station... NaN It is one block walking to Forest Hill station... none NaN ... 9.0 t NaN SAN FRANCISCO t flexible f f 4 10.59
2966 7498479 https://www.airbnb.com/rooms/7498479 20151101181058 2015-11-01 Bluescreen Room: Mission SF The Bluescreen Room is in the heart of the Mis... The Bluescreen Room *was* a fully functional g... The Bluescreen Room is in the heart of the Mis... none You are 1 block away from the Valencia Corrido... ... 9.0 t STR-0000652 SAN FRANCISCO f flexible f f 2 10.42
1097 7129539 https://www.airbnb.com/rooms/7129539 20151101181058 2015-11-02 Small, private Noe Valley apartment This is a sparsely furnished, small, cozy, sec... This completely private apartment suits up to ... This is a sparsely furnished, small, cozy, sec... none What's great about Noe Valley will speak for i... ... 9.0 t NaN SAN FRANCISCO f moderate f f 1 10.26

5 rows × 92 columns


In [58]:
# Zip together our column names with our beta coefficients
# (wrapped in list() so the pairs can be sorted more than once)
coefficients = list(zip(dummies.columns, clf.coef_))

In [59]:
# Largest coefficients (the ~1e14 magnitudes suggest collinear dummy columns, not real effects)
sorted(coefficients, key=lambda coef: coef[1], reverse=True)[:10]


Out[59]:
[('room_type_Entire home/apt', 649630273189155.75),
 ('room_type_Private room', 649630273189092.38),
 ('room_type_Shared room', 649630273188997.0),
 ('host_is_superhost_t', 410651297312091.06),
 ('host_is_superhost_f', 410651297312067.56),
 ('is_location_exact_t', 380381413039120.19),
 ('is_location_exact_f', 380381413039088.94),
 ('property_type_Boat', 1661.2929738392058),
 ('property_type_Treehouse', 265.29860088537487),
 ('property_type_Yurt', 251.91169252271425)]

In [60]:
# Most negative coefficients
sorted(coefficients, key=lambda coef: coef[1])[:10]


Out[60]:
[('neighbourhood_cleansed_Visitacion Valley', -1826685830349805.2),
 ('neighbourhood_cleansed_Crocker Amazon', -1826685830349795.8),
 ('neighbourhood_cleansed_Excelsior', -1826685830349777.8),
 ('neighbourhood_cleansed_Treasure Island/YBI', -1826685830349775.2),
 ('neighbourhood_cleansed_Parkside', -1826685830349771.0),
 ('neighbourhood_cleansed_Ocean View', -1826685830349769.0),
 ('neighbourhood_cleansed_Outer Mission', -1826685830349767.8),
 ('neighbourhood_cleansed_Bayview', -1826685830349767.5),
 ('neighbourhood_cleansed_Diamond Heights', -1826685830349763.5),
 ('neighbourhood_cleansed_Glen Park', -1826685830349759.2)]
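
The enormous, nearly identical coefficients within each dummy group are a typical symptom of perfectly collinear columns: when every level of a category gets its own dummy and the model also has an intercept, the dummies sum to the intercept column. A sketch on toy data, using get_dummies' drop_first option as one common remedy (not something the notebook itself does):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'room_type': ['Entire home/apt', 'Private room',
                                  'Shared room', 'Private room']})

# All three levels encoded vs. the first level dropped as the baseline
full = pd.get_dummies(toy).values.astype(float)
reduced = pd.get_dummies(toy, drop_first=True).values.astype(float)

def with_intercept(X):
    """Prepend the column of ones that LinearRegression fits implicitly."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Rank below the column count means the coefficients are not uniquely determined
rank_full = np.linalg.matrix_rank(with_intercept(full))
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))

print(with_intercept(full).shape[1], rank_full)
print(with_intercept(reduced).shape[1], rank_reduced)
```

With the baseline level dropped, each remaining coefficient reads as the price difference relative to that baseline, which makes a ranking of coefficients interpretable.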

In [ ]: