Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [229]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

pd.set_option('display.max_rows', 10)

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [230]:
bookings = pd.read_csv('../data/bookings.csv', parse_dates = ['booking_date'])
listings = pd.read_csv('../data/listings.csv')

In [231]:
print bookings


      prop_id booking_date
0           9   2011-06-17
1          13   2011-08-12
2          21   2011-06-20
3          28   2011-05-05
4          29   2011-11-17
...       ...          ...
6071      408   2011-06-02
6072      408   2011-08-22
6073      408   2011-07-24
6074      408   2011-01-12
6075      408   2011-09-08

[6076 rows x 2 columns]

In [232]:
print listings


     prop_id        prop_type     neighborhood  price  person_capacity  \
0          1  Property type 1  Neighborhood 14    140                3   
1          2  Property type 1  Neighborhood 14     95                2   
2          3  Property type 2  Neighborhood 16     95                2   
3          4  Property type 2  Neighborhood 13     90                2   
4          5  Property type 1  Neighborhood 15    125                5   
..       ...              ...              ...    ...              ...   
403      404  Property type 2  Neighborhood 14    100                1   
404      405  Property type 2  Neighborhood 13     85                2   
405      406  Property type 1   Neighborhood 9     70                3   
406      407  Property type 1  Neighborhood 13    129                2   
407      408  Property type 1  Neighborhood 14    100                3   

     picture_count  description_length  tenure_months  
0               11                 232             30  
1                3                  37             29  
2               16                 172             29  
3               19                 472             28  
4               21                 442             28  
..             ...                 ...            ...  
403              8                 235              1  
404             27                1048              1  
405             18                 153              1  
406             13                 370              1  
407             21                 707              1  

[408 rows x 8 columns]

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [233]:
bookings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)

In [234]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)

In [235]:
bookings.columns


Out[235]:
Index([u'prop_id', u'booking_date'], dtype='object')

In [236]:
listings.columns


Out[236]:
Index([u'prop_id', u'prop_type', u'neighborhood', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months'], dtype='object')

In [237]:
# for col in ['neighborhood']:
#    listings[col] = listings[col].map(lambda x: x.replace('Neighborhood ', '')).astype(int)
#for col in ['prop_type']:
#    listings[col] = listings[col].map(lambda x: x.replace('Property type ', '')).astype(int)

#print listings

In [238]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)

In [239]:
listings.price.std()


Out[239]:
353.05085803923384

In [240]:
# listings.groupby('prop_id').price.mean()

In [241]:
listings[['price', 'person_capacity', 'picture_count', 'description_length']].describe()


Out[241]:
price person_capacity picture_count description_length
count 408.000000 408.000000 408.000000 408.000000
mean 187.806373 2.997549 14.389706 309.159314
std 353.050858 1.594676 10.477428 228.021684
min 39.000000 1.000000 1.000000 0.000000
25% 90.000000 2.000000 6.000000 179.000000
50% 125.000000 2.000000 12.000000 250.000000
75% 199.000000 4.000000 20.000000 389.500000
max 5000.000000 10.000000 71.000000 1969.000000

In [242]:
fig, ax = plt.subplots(1,2)
listings.boxplot('price', ax=ax[0])
listings.price.plot(ax=ax[1])

ax[0].set_ylim([0, 500])
listings[listings.price > 1000]


Out[242]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
48 49 Property type 1 Neighborhood 13 2394 8 10 363 16
196 197 Property type 1 Neighborhood 12 3050 4 6 227 7
289 290 Property type 1 Neighborhood 12 2650 4 6 232 5
290 291 Property type 1 Neighborhood 12 2350 1 6 219 5
340 341 Property type 1 Neighborhood 20 5000 4 6 145 3

In [243]:
grouped = listings.groupby('prop_type')
grouped['price']


Out[243]:
<pandas.core.groupby.SeriesGroupBy object at 0x00000000182FA1D0>

In [244]:
# grouped['price'].isnull() == False].describe()
# grouped[grouped['price'].isnull() == False].describe()

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [245]:
grouped = listings.groupby('prop_type')

grouped[['price', 'person_capacity', 'picture_count', 'description_length']].agg([np.mean, np.std, np.median])


Out[245]:
price person_capacity picture_count description_length
mean std median mean std median mean std median mean std median
prop_type
Property type 1 237.085502 425.710534 150 3.516729 1.644955 3 14.695167 10.623651 12.0 313.171004 214.769141 266.0
Property type 2 93.288889 42.261246 89 2.000000 0.846415 2 13.948148 10.255191 11.0 304.851852 255.135332 239.0
Property type 3 63.750000 16.520190 70 1.750000 0.500000 2 8.750000 7.320064 9.5 184.750000 53.093471 192.5

In [246]:
grouped.prop_type.unique()


Out[246]:
prop_type
Property type 1    [Property type 1]
Property type 2    [Property type 2]
Property type 3    [Property type 3]
Name: prop_type, dtype: object

Same, but by property type per neighborhood?


In [247]:
naybor = listings.groupby(['prop_type', 'neighborhood'])['price', 'person_capacity', 'picture_count', 'description_length']

In [248]:
naybor.agg([np.mean, np.std, np.median])


Out[248]:
price person_capacity picture_count description_length
mean std median mean std median mean std median mean std median
prop_type neighborhood
Property type 1 Neighborhood 1 85.000000 NaN 85.0 2.000000 NaN 2 26.000000 NaN 26.0 209.000000 NaN 209.0
Neighborhood 10 142.500000 36.979724 137.5 3.500000 1.224745 4 13.333333 8.571270 12.0 391.000000 146.929915 425.5
Neighborhood 11 159.428571 70.962302 130.0 3.214286 1.311404 3 9.928571 5.928605 8.0 379.000000 396.956111 295.5
Neighborhood 12 365.615385 686.086484 150.0 3.435897 1.874972 3 10.820513 7.118810 8.0 267.205128 137.867820 251.0
Neighborhood 13 241.897959 320.997874 180.0 4.061224 1.586508 4 15.653061 9.828850 13.0 290.408163 124.417965 256.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Property type 2 Neighborhood 9 110.000000 14.142136 110.0 2.000000 1.414214 2 3.500000 0.707107 3.5 114.500000 137.885822 114.5
Property type 3 Neighborhood 11 75.000000 NaN 75.0 2.000000 NaN 2 15.000000 NaN 15.0 196.000000 NaN 196.0
Neighborhood 14 75.000000 NaN 75.0 1.000000 NaN 1 1.000000 NaN 1.0 113.000000 NaN 113.0
Neighborhood 17 65.000000 NaN 65.0 2.000000 NaN 2 15.000000 NaN 15.0 189.000000 NaN 189.0
Neighborhood 4 40.000000 NaN 40.0 2.000000 NaN 2 4.000000 NaN 4.0 241.000000 NaN 241.0

40 rows × 12 columns

Plot daily bookings:


In [249]:
bookings.plot('booking_date', 'prop_id').fig_size = (50, 20)
plt.suptitle('Daily Bookings')
plt.xlabel('Booking Date')
plt.ylabel('# of Bookings')


Out[249]:
<matplotlib.text.Text at 0x1736ebe0>

Plot the daily bookings per neighborhood (provide a legend)


In [250]:
n_listings = listings.merge(bookings, on = 'prop_id', how = 'left')

In [251]:
n_listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6156 entries, 0 to 6155
Data columns (total 9 columns):
prop_id               6156 non-null int64
prop_type             6156 non-null object
neighborhood          6156 non-null object
price                 6156 non-null int64
person_capacity       6156 non-null int64
picture_count         6156 non-null int64
description_length    6156 non-null int64
tenure_months         6156 non-null int64
booking_date          6076 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(6), object(2)

In [252]:
#n_listings.booking_date = pd.to_datetime(n_listings.booking_date)

In [253]:
n_listings.booking_date.map(lambda x: x.dayofweek)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-253-23c2edc35118> in <module>()
----> 1 n_listings.booking_date.map(lambda x: x.dayofweek)

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in map(self, arg, na_action)
   1953                                      index=self.index).__finalize__(self)
   1954         else:
-> 1955             mapped = map_f(values, arg)
   1956             return self._constructor(mapped,
   1957                                      index=self.index).__finalize__(self)

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\lib.pyd in pandas.lib.map_infer (pandas\lib.c:53184)()

<ipython-input-253-23c2edc35118> in <lambda>(x)
----> 1 n_listings.booking_date.map(lambda x: x.dayofweek)

AttributeError: 'NaTType' object has no attribute 'dayofweek'

In [253]:


In [253]:


In [254]:
sns.factorplot("neighborhood", hue ="day_name", data = n_listings, palette = "Purples_d", size = 25, legend = True);


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-254-585a63fc9d8c> in <module>()
----> 1 sns.factorplot("neighborhood", hue ="day_name", data = n_listings, palette = "Purples_d", size = 25, legend = True);

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\seaborn\linearmodels.pyc in factorplot(x, y, hue, data, row, col, col_wrap, estimator, ci, n_boot, units, x_order, hue_order, col_order, row_order, kind, markers, linestyles, dodge, join, hline, size, aspect, palette, legend, legend_out, dropna, sharex, sharey, margin_titles)
    900     cols = [a for a in [x, y, hue, col, row, units] if a is not None]
    901     cols = pd.unique(cols).tolist()
--> 902     data = data[cols]
    903 
    904     facet_hue = hue if hue in [row, col] else None

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1670         if isinstance(key, (Series, np.ndarray, list)):
   1671             # either boolean or fancy integer index
-> 1672             return self._getitem_array(key)
   1673         elif isinstance(key, DataFrame):
   1674             return self._getitem_frame(key)

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
   1714             return self.take(indexer, axis=0, convert=False)
   1715         else:
-> 1716             indexer = self.ix._convert_to_indexer(key, axis=1)
   1717             return self.take(indexer, axis=1, convert=True)
   1718 

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1083                     if isinstance(obj, tuple) and is_setter:
   1084                         return {'key': obj}
-> 1085                     raise KeyError('%s not in index' % objarr[mask])
   1086 
   1087                 return indexer

KeyError: "['day_name'] not in index"

In [255]:
fig, ax = bookings.plot('booking_date', 'neighborhood').fig_size = (50, 20)
plt.suptitle('Daily Bookings')
ax.set_xlabel('Neighborhood')
ax.set_ylabel('# of Bookings')
ax.legend()


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-255-fdef08f7f71c> in <module>()
----> 1 fig, ax = bookings.plot('booking_date', 'neighborhood').fig_size = (50, 20)
      2 plt.suptitle('Daily Bookings')
      3 ax.set_xlabel('Neighborhood')
      4 ax.set_ylabel('# of Bookings')
      5 ax.legend()

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\tools\plotting.pyc in plot_frame(frame, x, y, subplots, sharex, sharey, use_index, figsize, grid, legend, rot, ax, style, title, xlim, ylim, logx, logy, xticks, yticks, kind, sort_columns, fontsize, secondary_y, **kwds)
   2135             label = x if x is not None else frame.index.name
   2136             label = kwds.pop('label', label)
-> 2137             ser = frame[y]
   2138             ser.index.name = label
   2139 

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1676             return self._getitem_multilevel(key)
   1677         else:
-> 1678             return self._getitem_column(key)
   1679 
   1680     def _getitem_column(self, key):

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
   1683         # get column
   1684         if self.columns.is_unique:
-> 1685             return self._get_item_cache(key)
   1686 
   1687         # duplicate columns & possible reduce dimensionaility

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
   1050         res = cache.get(item)
   1051         if res is None:
-> 1052             values = self._data.get(item)
   1053             res = self._box_item_values(item, values)
   1054             cache[item] = res

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
   2563 
   2564             if not isnull(item):
-> 2565                 loc = self.items.get_loc(item)
   2566             else:
   2567                 indexer = np.arange(len(self.items))[isnull(self.items)]

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\index.pyc in get_loc(self, key)
   1179         loc : int if unique index, possibly slice or mask if not
   1180         """
-> 1181         return self._engine.get_loc(_values_from_object(key))
   1182 
   1183     def get_value(self, series, key):

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_loc (pandas\index.c:3656)()

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_loc (pandas\index.c:3534)()

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\hashtable.pyd in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11911)()

C:\Users\selah\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\hashtable.pyd in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11864)()

KeyError: 'neighborhood'

Part 2 - Develop a data set


In [256]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [257]:
total_book = bookings.groupby('prop_id').count().reset_index()

total_book.rename(columns = {'booking_date' : 'number_of_bookings'}, inplace = True)

total_book


# listings['e'] = Series(np.random.randn(sLength), index=df1.index)


Out[257]:
prop_id number_of_bookings
0 1 4
1 3 1
2 4 27
3 6 88
4 7 2
... ... ...
323 404 3
324 405 19
325 406 19
326 407 15
327 408 54

328 rows × 2 columns


In [258]:
listings = listings.merge(total_book, on = 'prop_id', how = 'left')

In [259]:
listings.number_of_bookings.fillna(0, inplace = True)

In [260]:
listings.head()


Out[260]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 0
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 0

In [261]:
listings['booking_rate'] = listings['number_of_bookings']/listings['tenure_months']

listings


Out[261]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 4 0.133333
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 0 0.000000
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 1 0.034483
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 27 0.964286
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 0 0.000000
... ... ... ... ... ... ... ... ... ... ...
403 404 Property type 2 Neighborhood 14 100 1 8 235 1 3 3.000000
404 405 Property type 2 Neighborhood 13 85 2 27 1048 1 19 19.000000
405 406 Property type 1 Neighborhood 9 70 3 18 153 1 19 19.000000
406 407 Property type 1 Neighborhood 13 129 2 13 370 1 15 15.000000
407 408 Property type 1 Neighborhood 14 100 3 21 707 1 54 54.000000

408 rows × 10 columns

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [268]:
listings = listings[listings.tenure_months >= 10]
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 10 columns):
prop_id               144 non-null int64
prop_type             144 non-null object
neighborhood          144 non-null object
price                 144 non-null int64
person_capacity       144 non-null int64
picture_count         144 non-null int64
description_length    144 non-null int64
tenure_months         144 non-null int64
number_of_bookings    144 non-null float64
booking_rate          144 non-null float64
dtypes: float64(2), int64(6), object(2)

prop_type and neighborhood are categorical variables, use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [273]:
dummy_prop = pd.get_dummies(listings['prop_type'])
dummy_nay = pd.get_dummies(listings['neighborhood'])
listings_prop = listings.join([dummy_prop, dummy_nay])

In [299]:
listings_prop.booking_rate.fillna(0, inplace = True)

In [300]:
listings_prop.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 29 columns):
prop_id               144 non-null int64
prop_type             144 non-null object
neighborhood          144 non-null object
price                 144 non-null int64
person_capacity       144 non-null int64
picture_count         144 non-null int64
description_length    144 non-null int64
tenure_months         144 non-null int64
number_of_bookings    144 non-null float64
booking_rate          144 non-null float64
Property type 1       144 non-null float64
Property type 2       144 non-null float64
Property type 3       144 non-null float64
Neighborhood 11       144 non-null float64
Neighborhood 12       144 non-null float64
Neighborhood 13       144 non-null float64
Neighborhood 14       144 non-null float64
Neighborhood 15       144 non-null float64
Neighborhood 16       144 non-null float64
Neighborhood 17       144 non-null float64
Neighborhood 18       144 non-null float64
Neighborhood 19       144 non-null float64
Neighborhood 20       144 non-null float64
Neighborhood 21       144 non-null float64
Neighborhood 4        144 non-null float64
Neighborhood 5        144 non-null float64
Neighborhood 7        144 non-null float64
Neighborhood 8        144 non-null float64
Neighborhood 9        144 non-null float64
dtypes: float64(21), int64(6), object(2)

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [301]:
from sklearn.cross_validation import train_test_split

In [302]:
merged_final = listings_prop.drop(['prop_id', 'prop_type', 'neighborhood', 'number_of_bookings'], axis = 1)
features = merged_final.drop(['booking_rate'], axis = 1).values
target = merged_final['booking_rate'].values

In [304]:



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-304-2f710f4ceb78> in <module>()
----> 1 target.fillna(0)

AttributeError: 'numpy.ndarray' object has no attribute 'fillna'

In [305]:
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.3)

In [306]:
target = np.log(target)

In [306]:

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [307]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [308]:
lr.fit(feature_train, target_train)


Out[308]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

fit your model with your test sets


In [ ]:


In [ ]:

Interpret the results of the above model:

  • What does the score method do?
  • What does this tell us about our model?

...type here...

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [ ]: