In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and begin our exploration of regression.
In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# This enables inline Plots
%matplotlib inline
In [3]:
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
In [4]:
bookings.head()
Out[4]:
In [5]:
listings.head()
Out[5]:
In [6]:
bookings.info()
In [7]:
listings.info()
In [8]:
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()
Out[8]:
In [9]:
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()
Out[9]:
In [10]:
listings.groupby(['prop_type', 'neighborhood'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe()
Out[10]:
In [11]:
listings.info()
In [12]:
# merge
mrgd = bookings.merge(listings, on='prop_id')
# convert to datetime
mrgd['booking_date'] = pd.to_datetime(mrgd.booking_date)
In [13]:
mrgd.info()
In [14]:
count = mrgd.booking_date.value_counts()
In [15]:
fig, ax = plt.subplots(1,1)
ax.plot_date(count.index, count)
Out[15]:
In [16]:
count.plot()
Out[16]:
In [17]:
reset = count.sort_index().reset_index()
In [18]:
print '*'*80
print '''Let's regress! I'm going to use np.polyfit, which minimizes the sum of squared errors.'''
print '*'*80
from IPython.display import Image
Image(url='http://docs.scipy.org/doc/numpy/_images/math/3e125ffecb04217922f4cb0a04367165ebf64864.png')
Out[18]:
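The formula behind np.polyfit is ordinary least squares: for a degree-k polynomial p, it picks the coefficients that minimize E = sum_j (p(x_j) - y_j)^2. A minimal sketch on made-up data (not the Airbnb series) of that objective:

import numpy as np

x = np.arange(10, dtype=float)
y = 3 * x**2 - 2 * x + 1 + np.random.randn(10)   # noisy quadratic, purely illustrative
coeffs = np.polyfit(x, y, 2)                      # least-squares fit of a degree-2 polynomial
p = np.poly1d(coeffs)
residual_ss = np.sum((p(x) - y) ** 2)             # this is the E that polyfit minimizes
print('coefficients: {}'.format(coeffs))
print('residual sum of squares: {:.3f}'.format(residual_ss))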
In [19]:
for i in range(1, 4):
    z = np.polyfit(reset.index, reset[0], i)
    p = np.poly1d(z)
    print "Polynomial (" + str(i) + ")", '\n', p, '\n'
    xp = np.linspace(1, max(reset.index), max(reset[0]))
    plt.plot(reset.index, reset[0], '.', xp, p(xp), '-')
In [20]:
help('pandas.stats.moments.rolling_mean')
In [21]:
print "I think this one seems to fit pretty well. Let's add an exponential weighted moving average, just for fun."
print """
1. Raw Data
2. Polyfit
3. 30 day exponential weighted moving average
4. 14 day exponential weighted moving average
"""
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row')
from pandas.stats.moments import ewma
z = np.polyfit(reset.index, reset[0], 2)
p = np.poly1d(z)
print "Polynomial (" + str(i) + ")", '\n', p, '\n'
xp = np.linspace(1, max(reset.index), max(reset[0]))
# Green is our polyfit
ax1.plot(reset.index, reset[0], '.')
ax2.plot(xp, p(xp), 'g-')
# Here is the exponential moving average, in red, for a 1-month span
moving_ema_30 = ewma(reset[0], span=30)
ax3.plot(moving_ema_30)
moving_ema_2w = ewma(reset[0], span=14)
ax4.plot(moving_ema_2w, 'r-')
Out[21]:
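For reference, span=N in ewma corresponds to a decay factor alpha = 2/(N+1), so the 14-day average reacts faster than the 30-day one. A minimal sketch on a toy series, using the newer Series.ewm() API since pandas.stats.moments.ewma has since been removed from pandas:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(100).cumsum())   # toy series, not the booking counts
ema_30 = s.ewm(span=30).mean()                 # same idea as ewma(s, span=30)
ema_14 = s.ewm(span=14).mean()
print('alpha for span=30: {:.4f}'.format(2.0 / (30 + 1)))
print('alpha for span=14: {:.4f}'.format(2.0 / (14 + 1)))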
In [22]:
reset.head()
Out[22]:
In [23]:
# smaller frame
nei = mrgd[['neighborhood', 'booking_date']]
# sort
nei = nei.sort(columns='booking_date')
# group by and count
neidate = nei.groupby('booking_date').neighborhood.value_counts()
# table that sucker
neidate = neidate.unstack()
neidate.plot()
Out[23]:
In [24]:
# Bleh, that's hard to read.
# How random is it ?
from pandas.tools.plotting import lag_plot
lag_plot(neidate)
Out[24]:
In [25]:
from pandas.tools.plotting import autocorrelation_plot
autocorrelation_plot(neidate)
# I guess pretty random.
Out[25]:
In [26]:
chunk = pd.DataFrame(neidate)
maxnei = range(len(chunk.columns))
for i in range(20):
    plt.subplots(1, 1)
    plt.title(str(chunk.columns[i]))
    chunk.ix[:, i].plot()
In [27]:
# I still feel like there is a better way of plotting those...
In [28]:
l2 = listings.copy()
# bookings per property (note: this relies on the groupby's prop_id index lining up with l2's index)
l2['number_of_bookings'] = bookings.groupby('prop_id').count()
l2['booking_rate'] = l2.number_of_bookings / l2.tenure_months
l2.head()
Out[28]:
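The column assignment above depends on that index alignment working out. A more defensive sketch, assuming prop_id is the intended join key, maps the counts onto prop_id explicitly:

booking_counts = bookings.groupby('prop_id').size()              # bookings per property
l2_alt = listings.copy()
l2_alt['number_of_bookings'] = l2_alt['prop_id'].map(booking_counts)
l2_alt['booking_rate'] = l2_alt['number_of_bookings'] / l2_alt['tenure_months']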
In [29]:
l3 = l2[l2.tenure_months>10]
l3.head()
Out[29]:
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values (after applying this function correctly, there should be one column for every prop_type and one column for every neighborhood category).
In [30]:
l4 = pd.get_dummies(l3)
l4.columns
l4.head()
l4 = l4.fillna(0)  # THIS IS A HUGE ASSUMPTION: is it OK to fill NaNs with zero?
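By default pd.get_dummies(l3) encodes every object-dtype column it finds. To be explicit about which columns get expanded, the columns argument can be passed (a sketch, assuming a pandas version new enough to support it):

l4_explicit = pd.get_dummies(l3, columns=['prop_type', 'neighborhood'])
print(l4_explicit.columns)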
The target (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
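A minimal sketch of the split those instructions describe (prop_type and neighborhood are already gone as raw columns after get_dummies, so only the remaining ones need dropping; newer scikit-learn moved train_test_split to sklearn.model_selection):

from sklearn.cross_validation import train_test_split

y = l4['booking_rate']
X = l4.drop(['prop_id', 'booking_rate', 'number_of_bookings'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)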
In [31]:
from sklearn.cross_validation import train_test_split
In [32]:
# Regressors (X): drop the target and number_of_bookings (which would give the answer away)
training_data = l4.drop(['booking_rate', 'number_of_bookings'], axis=1)
# Target (y)
test_data = l4.booking_rate
train_a, test_a, train_b, test_b = train_test_split(training_data, test_data, test_size=.2)
In [33]:
test_data.head()
training_data.columns
Out[33]:
In [34]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_a, train_b)
Out[34]:
In [35]:
# Mean square error
print np.mean((lr.predict(test_a)-test_b)**2)
# Score
print lr.score(test_a, test_b)
In [36]:
# Re-fit on the test split (note: the model is then scored on the same data it was fit to)
lr.fit(test_a, test_b)
# Mean squared error
print np.mean((lr.predict(test_a)-test_b)**2)
# Score
print lr.score(test_a, test_b)
In [37]:
train = lr.score(train_a, train_b)
test = lr.score(test_a, test_b)
print train, test
Returns the coefficient of determination R^2 of the prediction. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.
R^2, or the coefficient of determination, measures "how well data fits a statistical model" (Wikipedia). R^2 can be defined in multiple ways, but here it is built from a few parts.
First we have: (1 - u/v), where
u = the residual sum of squares
((y_true - y_pred) ** 2).sum()
the error between each predicted value and the corresponding true value is squared (so that positive and negative errors cannot cancel each other out), then the squares are added together.
and
v = the total sum of squares
((y_true - y_true.mean()) ** 2).sum()
the difference between each true value and the mean of the true values is squared (again so that positive and negative differences cannot cancel), then the squares are added together.
Well, the closer the score is to 1.0 the better. That means we did pretty well if our training set is at .95! When we add more data, from the test set, we see this explained variance go up, which is good.
Just kidding! I had left number_of_bookings in there, which is why it was getting such a great score. Now I'm getting much lower values, but they are real (I think?).
Basically, the score tells us the goodness of fit. Now that we have a much smaller score, we know that our model only explains a smaller part of the variance.
A test size of .30 gives an R^2 of 0.717171863744.
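To make the u/v definition concrete, here is a sketch that computes R^2 by hand from the fitted model and compares it to lr.score, using the same held-out split as above:

pred = lr.predict(test_a)
u = ((test_b - pred) ** 2).sum()               # sum of squared residuals
v = ((test_b - test_b.mean()) ** 2).sum()      # total sum of squares around the mean
print('manual R^2: {:.4f}'.format(1 - u / v))
print('lr.score  : {:.4f}'.format(lr.score(test_a, test_b)))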