In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files are provided in hw/data/: bookings.csv and listings.csv. The objective is to practice data munging and to begin our exploration of regression.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from __future__ import print_function
# This enables inline plots
%matplotlib inline
In [3]:
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
bookings.info()
listings.info()
In [4]:
listings.describe()
In [5]:
listings.groupby('prop_type').describe()
In [6]:
listings.groupby(['prop_type','neighborhood']).describe()
In [7]:
#Convert bookings.booking_date from object to datetime
bookings['booking_date'] = pd.to_datetime(bookings['booking_date'])
#bookings.info()
daily_totals = bookings.groupby('booking_date').count()
daily_totals.rename(columns={'prop_id':'Daily Totals'},inplace=True)
#plot daily bookings by date
plt.style.use('fivethirtyeight')
ax = daily_totals.plot(kind='line')
ax.set_title('Daily Bookings')
ax.set_xlabel('days')
ax.set_ylabel('# of bookings')
ax.legend()
In [8]:
combined_df = pd.merge(bookings, listings, on='prop_id')
neighborhood_daily_totals = combined_df.groupby(['booking_date','neighborhood']).count()
#neighborhood_daily_totals.info()
neighborhood_daily_totals.head()
plt.style.use('fivethirtyeight')
#plotting this grouped frame directly won't give one line per neighborhood; see the sketch below
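To actually draw one line per neighborhood, the grouped counts need to be pivoted so that each neighborhood becomes its own column. A minimal sketch, assuming the combined_df built above:

#count bookings per (date, neighborhood) pair, then unstack the
#neighborhoods into columns so each one gets its own line
per_hood = combined_df.groupby(['booking_date', 'neighborhood']).size().unstack('neighborhood')
ax = per_hood.plot(kind='line')
ax.set_title('Daily Bookings by Neighborhood')
ax.set_xlabel('days')
ax.set_ylabel('# of bookings')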
In [9]:
#create a table that counts bookings for each prop_id
bookings_by_prop_id = bookings.groupby('prop_id').count()
#bookings_by_prop_id.head()
#Set prop_id as the index on the listings DF
new_listings = listings.set_index(['prop_id'])
#Join tables; listings with no bookings get NaN counts
new_listings = new_listings.join(bookings_by_prop_id)
new_listings.rename(columns={'booking_date':'number_of_bookings'},inplace=True)
#Replace all NaNs (listings with no bookings) with 0s
new_listings.number_of_bookings = new_listings.number_of_bookings.fillna(0)
In [10]:
#bookings per month of tenure
new_listings['booking_rate'] = new_listings['number_of_bookings'] / new_listings['tenure_months']
In [11]:
model_data = new_listings[new_listings.tenure_months>10]
model_data.info()
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. (After applying this function correctly there should be one column for every prop_type category and one column for every neighborhood category.)
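For instance, a toy column with three categories expands into one boolean column per category (hypothetical values, just to show the shape):

pd.get_dummies(pd.Series(['apt', 'house', 'apt', 'condo']))
#   apt  condo  house
#0    1      0      0
#1    0      0      1
#2    1      0      0
#3    0      1      0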
In [12]:
model_data = new_listings[new_listings.tenure_months>10]
prop_type_dummies = pd.get_dummies(model_data.prop_type)
neighborhood_dummies = pd.get_dummies(model_data.neighborhood)
model_data = model_data.join(prop_type_dummies)
model_data = model_data.join(neighborhood_dummies)
The target (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings. See the drop() sketch after the two links below.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
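The cell below filters columns with a list comprehension; per the pandas doc linked above, an equivalent approach (a sketch) is to drop the excluded labels from the column axis. Note that prop_id is already the index of model_data, so it is not among the columns to drop:

#drop the excluded labels along axis=1 (columns); what remains is X
model_data_x = model_data.drop(
    ['booking_rate', 'prop_type', 'neighborhood', 'number_of_bookings'],
    axis=1)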
In [13]:
from sklearn.cross_validation import train_test_split
#note: newer scikit-learn moved this to sklearn.model_selection
In [14]:
#Target (y)
model_data_y = model_data.booking_rate
#Regressors (X): everything except the excluded columns
excluded_data = ['booking_rate','prop_type','neighborhood','number_of_bookings','prop_id']
cols = [col for col in model_data.columns if col not in excluded_data]
model_data_x = model_data[cols]
#model_data_x.columns
#note: test_size=0.8 holds out 80% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(model_data_x, model_data_y, test_size=0.8)
In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
#####################
#Ordinary Least Squares
regr = LinearRegression()
regr.fit(X_train, y_train)
# The coefficients
print('Results from Ordinary Least Squares \n')
print('Coefficients: \n', regr.coef_)
# The mean squared error on the test set
print("Mean squared error: %.2f"
      % np.mean((regr.predict(X_test) - y_test) ** 2))
# Explained variance score (R^2): 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))
print("\n \n statsmodels package OLS Results \n")
#note: sm.OLS fits no intercept unless a constant column is added
res = sm.OLS(y_train, X_train).fit() #create a model using statsmodels
print(res.params)
print(res.bse)
print(res.summary())
######################
#Ridge Regression
from sklearn import linear_model
clf = linear_model.Ridge(alpha=0.5)
clf.fit(X_train, y_train)
# The coefficients
print('\n \n Results from Ridge Regression \n')
print('Coefficients: \n', clf.coef_)
# The mean squared error on the test set
print("Mean squared error: %.2f"
      % np.mean((clf.predict(X_test) - y_test) ** 2))
# Explained variance score (R^2): 1 is perfect prediction
print('Variance score: %.2f' % clf.score(X_test, y_test))
######################
#Lasso
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X_train, y_train)
# The coefficients
print('\n \n Results from Lasso \n')
print('Coefficients: \n', clf.coef_)
# The mean squared error on the test set
print("Mean squared error: %.2f"
      % np.mean((clf.predict(X_test) - y_test) ** 2))
# Explained variance score (R^2): 1 is perfect prediction
print('Variance score: %.2f' % clf.score(X_test, y_test))
In [19]:
degree = 1
est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
est.fit(X_train, y_train)
pred = est.predict(X_test)
In [21]:
print("Score from OLS: ", est.score(X_test,y_test))
The score compares the fit of our model against the fit of a horizontal line at the mean of y. Generally this score falls between 0 and 1; in our case it is below 0, which says a horizontal line fits the test data better than our model does. This is probably due to the number of predictors we have. One suggestion is to use the lasso to determine which predictors are unnecessary and thereby simplify the model, as sketched below.
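Formally, the score here is R^2 = 1 - (residual sum of squares) / (total sum of squares about the mean of y), so it goes negative exactly when the model's squared test error exceeds that of always predicting the mean. A minimal sketch of lasso-based predictor screening (assuming the X_train, y_train, and model_data_x defined above; alpha=0.1 is an arbitrary choice):

from sklearn import linear_model
#features whose lasso coefficient shrinks to exactly zero are
#candidates to drop from the model
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
zeroed = [col for col, coef in zip(model_data_x.columns, lasso.coef_) if coef == 0]
kept = [col for col, coef in zip(model_data_x.columns, lasso.coef_) if coef != 0]
print('candidates to drop:', zeroed)
print('kept:', kept)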
Interestingly, statsmodels and scikit-learn gave slightly different coefficients. statsmodels prints far more information by default than scikit-learn, and its summary shows that many of the variables are not significant; a next step would be to try transforming the data. Also interesting: the R^2 from statsmodels was 0.67, but that is computed on the training set, which (together with the negative test score) suggests the model does not generalize.
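A likely cause of the coefficient mismatch (my reading, worth verifying): sm.OLS fits no intercept unless a constant column is added explicitly, while scikit-learn's LinearRegression fits one by default. A sketch of a like-for-like comparison:

#add an explicit constant column so statsmodels also fits an intercept,
#matching LinearRegression's default fit_intercept=True
X_train_const = sm.add_constant(X_train)
res = sm.OLS(y_train, X_train_const).fit()
print(res.summary())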
I am curious how to display predictor names alongside the coefficients; the train_test_split call returned four unlabeled arrays.
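Since the columns of model_data_x keep their order through the split, one way to recover the labels (a sketch) is to zip the coefficient array back up with those column names:

#label each fitted coefficient with the column it came from
coef_by_name = pd.Series(regr.coef_, index=model_data_x.columns)
print(coef_by_name.sort_values())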
In [225]:
#Partial work; needs more brainstorming
print("Currently Incomplete")
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
bookings['booking_date'] = pd.to_datetime(bookings['booking_date'])
bookings.set_index(pd.DatetimeIndex(bookings['booking_date']), inplace=True)
bookings.rename(columns={'booking_date':'Monthly_Totals'}, inplace=True)
bookings.info()
#count bookings per property per calendar month
#(newer pandas drops the how= argument: .resample("M").count())
bookings = bookings.groupby('prop_id').resample("M", how="count")
bookings.head()
In [237]:
listings.set_index(listings.prop_id, inplace=True)
#join listings onto the prop_id level of bookings' (prop_id, month) index
new_data = bookings.join(listings)
#monthly revenue proxy: bookings that month times the nightly price
new_data['monthly_revenue'] = new_data.Monthly_Totals * new_data.price
In [ ]:
#Target (y)
new_data_y = new_data.monthly_revenue
#Regressors (X): everything except the excluded columns
excluded_data = ['monthly_revenue','prop_type','neighborhood','Monthly_Totals','prop_id']
cols = [col for col in new_data.columns if col not in excluded_data]
new_data_x = new_data[cols]
#new_data_x.columns
X_train, X_test, y_train, y_test = train_test_split(new_data_x, new_data_y, test_size=0.8)