Homework 1 - Data Analysis and Regression

In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.


In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

Part 1 - Data exploration

First, create 2 data frames: listings and bookings from their respective data files


In [3]:
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')

In [4]:
bookings.head()


Out[4]:
prop_id booking_date
0 9 2011-06-17
1 13 2011-08-12
2 21 2011-06-20
3 28 2011-05-05
4 29 2011-11-17

In [5]:
listings.head()


Out[5]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months
0 1 Property type 1 Neighborhood 14 140 3 11 232 30
1 2 Property type 1 Neighborhood 14 95 2 3 37 29
2 3 Property type 2 Neighborhood 16 95 2 16 172 29
3 4 Property type 2 Neighborhood 13 90 2 19 472 28
4 5 Property type 1 Neighborhood 15 125 5 21 442 28

In [6]:
bookings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 2 columns):
prop_id         6076 non-null int64
booking_date    6076 non-null object
dtypes: int64(1), object(1)
memory usage: 142.4+ KB

In [7]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB

What is the mean, median and standard deviation of price, person capacity, picture count, description length and tenure of the properties?


In [8]:
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()


Out[8]:
price person_capacity picture_count description_length tenure_months
count 408.000000 408.000000 408.000000 408.000000 408.000000
mean 187.806373 2.997549 14.389706 309.159314 8.487745
std 353.050858 1.594676 10.477428 228.021684 5.872088
min 39.000000 1.000000 1.000000 0.000000 1.000000
25% 90.000000 2.000000 6.000000 179.000000 4.000000
50% 125.000000 2.000000 12.000000 250.000000 7.000000
75% 199.000000 4.000000 20.000000 389.500000 13.000000
max 5000.000000 10.000000 71.000000 1969.000000 30.000000

What what are the mean price, person capacity, picture count, description length and tenure of the properties grouped by property type?


In [9]:
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()


Out[9]:
price person_capacity picture_count description_length tenure_months
prop_type
Property type 1 count 269.000000 269.000000 269.000000 269.000000 269.000000
mean 237.085502 3.516729 14.695167 313.171004 8.464684
std 425.710534 1.644955 10.623651 214.769141 5.773367
min 40.000000 1.000000 1.000000 17.000000 1.000000
25% 120.000000 2.000000 6.000000 193.000000 4.000000
50% 150.000000 3.000000 12.000000 266.000000 7.000000
75% 229.000000 4.000000 20.000000 388.000000 14.000000
max 5000.000000 10.000000 71.000000 1719.000000 30.000000
Property type 2 count 135.000000 135.000000 135.000000 135.000000 135.000000
mean 93.288889 2.000000 13.948148 304.851852 8.377778
std 42.261246 0.846415 10.255191 255.135332 5.963654
min 39.000000 1.000000 1.000000 0.000000 1.000000
25% 69.000000 2.000000 6.500000 150.500000 4.000000
50% 89.000000 2.000000 11.000000 239.000000 7.000000
75% 99.000000 2.000000 19.500000 402.500000 10.500000
max 350.000000 6.000000 56.000000 1969.000000 29.000000
Property type 3 count 4.000000 4.000000 4.000000 4.000000 4.000000
mean 63.750000 1.750000 8.750000 184.750000 13.750000
std 16.520190 0.500000 7.320064 53.093471 8.616844
min 40.000000 1.000000 1.000000 113.000000 5.000000
25% 58.750000 1.750000 3.250000 170.000000 7.250000
50% 70.000000 2.000000 9.500000 192.500000 13.500000
75% 75.000000 2.000000 15.000000 207.250000 20.000000
max 75.000000 2.000000 15.000000 241.000000 23.000000

Same, but by property type per neighborhood?


In [10]:
listings.groupby(['prop_type', 'neighborhood'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()


Out[10]:
price person_capacity picture_count description_length tenure_months
prop_type neighborhood
Property type 1 Neighborhood 1 count 1.000000 1.000000 1.000000 1.000000 1.000000
mean 85.000000 2.000000 26.000000 209.000000 6.000000
std NaN NaN NaN NaN NaN
min 85.000000 2.000000 26.000000 209.000000 6.000000
25% 85.000000 2.000000 26.000000 209.000000 6.000000
50% 85.000000 2.000000 26.000000 209.000000 6.000000
75% 85.000000 2.000000 26.000000 209.000000 6.000000
max 85.000000 2.000000 26.000000 209.000000 6.000000
Neighborhood 10 count 6.000000 6.000000 6.000000 6.000000 6.000000
mean 142.500000 3.500000 13.333333 391.000000 3.833333
std 36.979724 1.224745 8.571270 146.929915 1.602082
min 90.000000 2.000000 4.000000 160.000000 1.000000
25% 135.000000 2.500000 6.750000 312.250000 3.250000
50% 137.500000 4.000000 12.000000 425.500000 4.500000
75% 147.500000 4.000000 20.250000 499.000000 5.000000
max 205.000000 5.000000 24.000000 537.000000 5.000000
Neighborhood 11 count 14.000000 14.000000 14.000000 14.000000 14.000000
mean 159.428571 3.214286 9.928571 379.000000 9.642857
std 70.962302 1.311404 5.928605 396.956111 5.212517
min 95.000000 2.000000 5.000000 82.000000 1.000000
25% 103.750000 2.000000 6.000000 242.500000 6.000000
50% 130.000000 3.000000 8.000000 295.500000 9.500000
75% 196.500000 3.750000 9.750000 373.000000 15.500000
max 319.000000 6.000000 23.000000 1719.000000 16.000000
Neighborhood 12 count 39.000000 39.000000 39.000000 39.000000 39.000000
mean 365.615385 3.435897 10.820513 267.205128 7.897436
std 686.086484 1.874972 7.118810 137.867820 5.290482
min 60.000000 1.000000 1.000000 45.000000 1.000000
25% 125.000000 2.000000 6.000000 182.000000 3.000000
50% 150.000000 3.000000 8.000000 251.000000 6.000000
... ... ... ... ... ... ... ...
Property type 3 Neighborhood 11 std NaN NaN NaN NaN NaN
min 75.000000 2.000000 15.000000 196.000000 8.000000
25% 75.000000 2.000000 15.000000 196.000000 8.000000
50% 75.000000 2.000000 15.000000 196.000000 8.000000
75% 75.000000 2.000000 15.000000 196.000000 8.000000
max 75.000000 2.000000 15.000000 196.000000 8.000000
Neighborhood 14 count 1.000000 1.000000 1.000000 1.000000 1.000000
mean 75.000000 1.000000 1.000000 113.000000 5.000000
std NaN NaN NaN NaN NaN
min 75.000000 1.000000 1.000000 113.000000 5.000000
25% 75.000000 1.000000 1.000000 113.000000 5.000000
50% 75.000000 1.000000 1.000000 113.000000 5.000000
75% 75.000000 1.000000 1.000000 113.000000 5.000000
max 75.000000 1.000000 1.000000 113.000000 5.000000
Neighborhood 17 count 1.000000 1.000000 1.000000 1.000000 1.000000
mean 65.000000 2.000000 15.000000 189.000000 23.000000
std NaN NaN NaN NaN NaN
min 65.000000 2.000000 15.000000 189.000000 23.000000
25% 65.000000 2.000000 15.000000 189.000000 23.000000
50% 65.000000 2.000000 15.000000 189.000000 23.000000
75% 65.000000 2.000000 15.000000 189.000000 23.000000
max 65.000000 2.000000 15.000000 189.000000 23.000000
Neighborhood 4 count 1.000000 1.000000 1.000000 1.000000 1.000000
mean 40.000000 2.000000 4.000000 241.000000 19.000000
std NaN NaN NaN NaN NaN
min 40.000000 2.000000 4.000000 241.000000 19.000000
25% 40.000000 2.000000 4.000000 241.000000 19.000000
50% 40.000000 2.000000 4.000000 241.000000 19.000000
75% 40.000000 2.000000 4.000000 241.000000 19.000000
max 40.000000 2.000000 4.000000 241.000000 19.000000

320 rows × 5 columns

Plot daily bookings:


In [11]:
listings.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 408 entries, 0 to 407
Data columns (total 8 columns):
prop_id               408 non-null int64
prop_type             408 non-null object
neighborhood          408 non-null object
price                 408 non-null int64
person_capacity       408 non-null int64
picture_count         408 non-null int64
description_length    408 non-null int64
tenure_months         408 non-null int64
dtypes: int64(6), object(2)
memory usage: 28.7+ KB

In [12]:
# merge
mrgd = bookings.merge(listings, on='prop_id')

# convert to datetime
mrgd['booking_date'] = pd.to_datetime(mrgd.booking_date)

In [13]:
mrgd.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6076 entries, 0 to 6075
Data columns (total 9 columns):
prop_id               6076 non-null int64
booking_date          6076 non-null datetime64[ns]
prop_type             6076 non-null object
neighborhood          6076 non-null object
price                 6076 non-null int64
person_capacity       6076 non-null int64
picture_count         6076 non-null int64
description_length    6076 non-null int64
tenure_months         6076 non-null int64
dtypes: datetime64[ns](1), int64(6), object(2)
memory usage: 474.7+ KB

In [14]:
count = mrgd.booking_date.value_counts()

In [15]:
fig, ax = plt.subplots(1,1)
ax.plot_date(count.index, count)


Out[15]:
[<matplotlib.lines.Line2D at 0x103334f50>]

In [16]:
count.plot()


Out[16]:
<matplotlib.axes.AxesSubplot at 0x107e3a2d0>

In [17]:
reset = count.sort_index().reset_index()

In [18]:
print '*'*80
print '''Lets regress ! I'm going to use np.polyfit. It minimizes the minimized squared error'''
print '*'*80
from IPython.display import Image
Image(url='http://docs.scipy.org/doc/numpy/_images/math/3e125ffecb04217922f4cb0a04367165ebf64864.png')


********************************************************************************
Lets regress ! I'm going to use np.polyfit. It minimizes the minimized squared error
********************************************************************************
Out[18]:

In [19]:
for i in range(1,4):
    z = np.polyfit(reset.index, reset[0], i)
    p = np.poly1d(z)
    
    print "Polynomial (" + str(i) + ")", '\n', p, '\n'
    xp = np.linspace(1, max(reset.index), max(reset[0]))
    plt.plot(reset.index, reset[0], '.', xp, p(xp), '-')


Polynomial (1) 
 
-0.008861 x + 18.26 

Polynomial (2) 
           2
-0.000285 x + 0.09488 x + 11.98 

Polynomial (3) 
            3            2
-9.708e-08 x - 0.000232 x + 0.08717 x + 12.22 


In [20]:
help('pandas.stats.moments.rolling_mean')


Help on function roll_mean in pandas.stats.moments:

pandas.stats.moments.rolling_mean = roll_mean(arg, window, min_periods=None, freq=None, center=False, how=None, **kwargs)
    Moving mean.
    
    Parameters
    ----------
    arg : Series, DataFrame
    window : int
        Size of the moving window. This is the number of observations used for
        calculating the statistic.
    min_periods : int, default None
        Minimum number of observations in window required to have a value
        (otherwise result is NA).
    freq : string or DateOffset object, optional (default None)
        Frequency to conform the data to before computing the statistic. Specified
        as a frequency string or DateOffset object.
    center : boolean, default False
        Set the labels at the center of the window.
    how : string, default 'None'
        Method for down- or re-sampling
    
    Returns
    -------
    y : type of input argument
    
    Notes
    -----
    By default, the result is set to the right edge of the window. This can be
    changed to the center of the window by setting ``center=True``.
    
    The `freq` keyword is used to conform time series data to a specified
    frequency by resampling the data. This is done with the default parameters
    of :meth:`~pandas.Series.resample` (i.e. using the `mean`).


In [21]:
print "I think this one seems to fit pretty well. Let's add an exponential weighted moving average, just for fun."
print """
1. Raw Data
2. Polyfit
3. 30 day exponential weighted moving average
4. 14 day exponential weighted moving average
"""
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row')
from pandas.stats.moments import ewma

z = np.polyfit(reset.index, reset[0], 2)
p = np.poly1d(z)

print "Polynomial (" + str(i) + ")", '\n', p, '\n'
xp = np.linspace(1, max(reset.index), max(reset[0]))

# Green is our polyfit
ax1.plot(reset.index, reset[0], '.')
ax2.plot(xp, p(xp), 'g-')

# Here is the exponential moving average, in red, for 1 months span
moving_ema_30 = ewma(reset[0], span=30)
ax3.plot(moving_ema_30)

moving_ema_2w = ewma(reset[0], span=14)
ax4.plot(moving_ema_2w, 'r-')


I think this one seems to fit pretty well. Let's add an exponential weighted moving average, just for fun.

1. Raw Data
2. Polyfit
3. 30 day exponential weighted moving average
4. 14 day exponential weighted moving average

Polynomial (3) 
           2
-0.000285 x + 0.09488 x + 11.98 

Out[21]:
[<matplotlib.lines.Line2D at 0x107f41f90>]

In [22]:
reset.head()


Out[22]:
index 0
0 2011-01-01 11
1 2011-01-02 9
2 2011-01-03 10
3 2011-01-04 8
4 2011-01-05 15

Plot the daily bookings per neighborhood (provide a legend)


In [23]:
# smaller frame
nei = mrgd[['neighborhood', 'booking_date']]
# sort
nei = nei.sort(columns='booking_date')
# group by and count
neidate = nei.groupby('booking_date').neighborhood.value_counts()
# table that sucker
neidate = neidate.unstack()
neidate.plot()


Out[23]:
<matplotlib.axes.AxesSubplot at 0x107f90ad0>

In [24]:
# Bleh, thats hard to read.

# How random is it ?
from pandas.tools.plotting import lag_plot
lag_plot(neidate)


Out[24]:
<matplotlib.axes.AxesSubplot at 0x108183510>

In [25]:
from pandas.tools.plotting import autocorrelation_plot
autocorrelation_plot(neidate)

# I guess pretty random.


Out[25]:
<matplotlib.axes.AxesSubplot at 0x1081f85d0>

In [26]:
chunk = pd.DataFrame(neidate)
maxnei = range(len(chunk.columns))
for i in range(20):
    plt.subplots(1,1)
    plt.title(str(chunk.columns[i]))
    chunk.ix[:,i].plot()



In [27]:
# I still feel like there is a better way of plotting those...

Part 2 - Develop a data set

Add the columns number_of_bookings and booking_rate (number_of_bookings/tenure_months) to your listings data frame


In [28]:
l2 = listings.copy()
l2['number_of_bookings'] = bookings.groupby('prop_id').count()
l2['booking_rate'] = l2.number_of_bookings/l2.tenure_months
l2.head()


Out[28]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 NaN NaN
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 4 0.137931
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 NaN NaN
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 1 0.035714
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 27 0.964286

We only want to analyze well established properties, so let's filter out any properties that have a tenure less than 10 months


In [29]:
l3 = l2[l2.tenure_months>10]
l3.head()


Out[29]:
prop_id prop_type neighborhood price person_capacity picture_count description_length tenure_months number_of_bookings booking_rate
0 1 Property type 1 Neighborhood 14 140 3 11 232 30 NaN NaN
1 2 Property type 1 Neighborhood 14 95 2 3 37 29 4 0.137931
2 3 Property type 2 Neighborhood 16 95 2 16 172 29 NaN NaN
3 4 Property type 2 Neighborhood 13 90 2 19 472 28 1 0.035714
4 5 Property type 1 Neighborhood 15 125 5 21 442 28 27 0.964286

prop_type and neighborhood are categorical variables, use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform this column of categorical data to many columns of boolean values (after applying this function correctly there should be 1 column for every prop_type and 1 column for every neighborhood category.


In [30]:
l4 = pd.get_dummies(l3)
l4.columns
l4.head()
l4 = l4.fillna(0) # THIS IS A HUGE ASSUPTION : Is it ok to fill NaN's with a zero ?

create test and training sets for your regressors and predictors

predictor (y) is booking_rate, regressors (X) are everything else, except prop_id,booking_rate,prop_type,neighborhood and number_of_bookings
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis


In [31]:
from sklearn.cross_validation import train_test_split

In [32]:
training_data = l4.drop(['booking_rate'], axis=1)
training_data = training_data.drop(['number_of_bookings'], axis=1)
test_data = l4.booking_rate
train_a, test_a, train_b, test_b = train_test_split(training_data, test_data, test_size=.2)

In [33]:
test_data.head()
training_data.columns


Out[33]:
Index([u'prop_id', u'price', u'person_capacity', u'picture_count', u'description_length', u'tenure_months', u'prop_type_Property type 1', u'prop_type_Property type 2', u'prop_type_Property type 3', u'neighborhood_Neighborhood 11', u'neighborhood_Neighborhood 12', u'neighborhood_Neighborhood 13', u'neighborhood_Neighborhood 14', u'neighborhood_Neighborhood 15', u'neighborhood_Neighborhood 16', u'neighborhood_Neighborhood 17', u'neighborhood_Neighborhood 18', u'neighborhood_Neighborhood 19', u'neighborhood_Neighborhood 20', u'neighborhood_Neighborhood 21', u'neighborhood_Neighborhood 4', u'neighborhood_Neighborhood 5', u'neighborhood_Neighborhood 9'], dtype='object')

Part 3 - Model booking_rate

Create a linear regression model of your listings


In [34]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_a, train_b)


Out[34]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

fit your model with your test sets


In [35]:
# Mean square error
print np.mean((lr.predict(test_a)-test_b)**2)
# Score
print lr.score(test_a, test_b)


1.1945578579
-0.0413719506008

In [36]:
lr.fit(test_a, test_b)
# Mean square error
print np.mean((lr.predict(test_a)-test_b)**2)
# Score
print lr.score(test_a, test_b)


0.324432180458
0.717171863744

In [37]:
train = lr.score(train_a, train_b)
test = lr.score(test_a, test_b)
print train, test


-1.15374927847 0.717171863744

What does the score method do?

Returns the coefficient of determination R^2 of the prediction. The coefficient R^2 is defined as (1 - u/v), where u is the regression sum of squares ((y_true - y_pred) ** 2).sum() and v is the residual sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.

So what actually is that?

R^2, or the coefficient of determination, is a way of "how well data fits a statistical model" (Wikipedia). R^2 can be defined in multiple ways - but here it is defined in a few parts.

First we have: (1 - u/ v)

u = regression sum of squares

((y_true - ypred) ** 2 )).sum()

summation of squared errors: the error between the predicted value and the true value is squared (to alleviate the possibility of positive and negative answers cancelling each other out), then they are added together.

and

v = residual sum of squares

((y_true - y_true.mean()) ** 2).sum()

residual sum of squares: the error between the predicted value and the mean of the true value is squared (to alleviate the possibility of positive and negative answers cancelling each other out), then they are added together.

What does this tell us about our model ?

We'll, the close the score is to 1.0 the better. That means that we did pretty good if our training set is at .95 ! When we add more data, from the test set, we see this explanation of variance go up. Which is good.

Just kidding !! I had left number_of_bookings in there so that is why it was getting such a great score. So now, I'm getting values much lower, but they are real (I think?).

Basically, the score is telling us the goodness of fit. Now that we have a much smaller score, we know that our model only explains a smaller part of the variance.

test size: .30 gives a r^2 of 0.717171863744


In [37]:

Optional - Iterate

Create an alternative predictor (e.g. monthly revenue) and use the same modeling pattern in Part 3 to


In [34]:


In [ ]: