In this assignment your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and to begin our exploration of regression.
In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
# This enables inline plots
%matplotlib inline
# Limit rows displayed in notebook
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)
In [3]:
bookings = pd.read_csv('../data/bookings.csv', parse_dates=['booking_date'])  # parse_dates=True only tries to parse the index; name the column explicitly
listings = pd.read_csv('../data/listings.csv')
In [4]:
listings[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].describe().loc[['mean', '50%', 'std']]
Out[4]:
In [5]:
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()
Out[5]:
In [6]:
listings.groupby(['neighborhood','prop_type'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()
Out[6]:
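For a side-by-side view, a small sketch (not part of the assignment code): unstack() pivots prop_type out of the row index into columns, one metric at a time.

listings.groupby(['neighborhood', 'prop_type'])['price'].mean().unstack()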
In [7]:
lb = listings.merge(bookings)
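A quick note as a sketch: merge() with no arguments joins on whatever columns the two frames share (here, presumably just prop_id) and performs an inner join by default, so listings with no bookings are dropped. Making that explicit:

lb = listings.merge(bookings, on='prop_id', how='inner')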
In [8]:
lb.booking_date = pd.to_datetime(lb.booking_date)  # a no-op if parse_dates already converted it
lb['day'] = lb.booking_date.dt.dayofweek
# interpreting 'daily' as day of the week: Monday, Tuesday, Wednesday...
day_names = {0: 'M', 1: 'Tu', 2: 'W', 3: 'Th', 4: 'F', 5: 'Sa', 6: 'Su'}
lb['day_name'] = lb['day'].map(day_names)
lb.head()
# grouping by the numeric day keeps the bars in Monday-to-Sunday order
lb.groupby('day').day_name.value_counts().plot(kind='bar');
Out[8]:
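As an alternative to relying on the numeric day grouping, a sketch (assuming the day_name column above): value_counts() sorts by frequency, so reindex() is one way to impose an explicit weekday order.

day_order = ['M', 'Tu', 'W', 'Th', 'F', 'Sa', 'Su']
lb['day_name'].value_counts().reindex(day_order).plot(kind='bar');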
In [9]:
#lb.groupby(['neighborhood','day']).day_name.value_counts().plot(kind='bar', figsize=(15,10), legend=True);
# seaborn can build the grouped count plot directly
sns.factorplot("neighborhood", hue="day_name", data=lb, kind="bar", palette="Greens_d", size=30);
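In older seaborn, factorplot without a y variable plots category counts; a sketch of the equivalent in newer versions (assuming seaborn >= 0.9, where factorplot became catplot):

sns.catplot(x="neighborhood", hue="day_name", data=lb, kind="count", palette="Greens_d", height=6, aspect=2);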
In [10]:
# reset the index so prop_id and booking_date become columns rather than the index
count_bookings = bookings.groupby('prop_id').count().reset_index()
lb_new = listings.merge(count_bookings, on='prop_id')
lb_new.rename(columns={'booking_date': 'number_of_bookings'}, inplace=True)
lb_new['booking_rate'] = lb_new.number_of_bookings / lb_new.tenure_months
lb_new
Out[10]:
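A sketch of an equivalent count using size(), which avoids the per-column count and the rename (same bookings frame assumed):

count_bookings = bookings.groupby('prop_id').size().reset_index(name='number_of_bookings')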
In [11]:
lb_new2 = lb_new[lb_new['tenure_months'] >= 10]
lb_new2
Out[11]:
prop_type and neighborhood are categorical variables; use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values. (After applying this function correctly there should be one column for every prop_type category and one column for every neighborhood category.)
In [14]:
pt_dummy = pd.get_dummies(lb_new2['prop_type'])
n_dummy = pd.get_dummies(lb_new2['neighborhood'])
lb_new3 = lb_new2.join(pt_dummy)
lb_new3 = lb_new3.join(n_dummy)
#pd.set_option('display.max_columns', 100)
#lb_new3
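A sketch, assuming a reasonably recent pandas: get_dummies can encode both categorical columns in one call, keeping all other columns and prefixing each dummy with its source column name.

lb_new3_alt = pd.get_dummies(lb_new2, columns=['prop_type', 'neighborhood'])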
The predictor (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
In [15]:
# note: in scikit-learn >= 0.18 this lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
In [46]:
# remove columns that are neither regressors nor the predictor
lb_4 = lb_new3.drop(['prop_id', 'prop_type', 'neighborhood', 'number_of_bookings'], axis=1)
# rel_features holds only the feature columns
rel_features = lb_4.drop(['booking_rate'], axis=1).values
target = lb_4['booking_rate'].values
target = np.log(target)  # log-transform the skewed booking rate
# note: train_size=.3 trains on only 30% of the rows; test_size=.3 (70% train) is more conventional
feat_train, feat_test, tar_train, tar_test = train_test_split(rel_features, target, train_size=.3, random_state=9)
In [47]:
lb_4
Out[47]:
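A sketch of the modern equivalent, assuming scikit-learn >= 0.18 (where train_test_split moved to model_selection) and the more conventional 70/30 split:

from sklearn.model_selection import train_test_split
feat_train, feat_test, tar_train, tar_test = train_test_split(
    rel_features, target, test_size=0.3, random_state=9)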
In [48]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
In [49]:
lr.fit(feat_train, tar_train)
Out[49]:
In [50]:
tar_pred = lr.predict(feat_test)
In [51]:
# residual sum of squares for the model's predictions
sum_sq_model = np.sum((tar_test - tar_pred) ** 2)
sum_sq_model
Out[51]:
In [52]:
# total sum of squares: the error of a naive model that always predicts the mean
sum_sq_naive = np.sum((tar_test - tar_test.mean()) ** 2)
sum_sq_naive
Out[52]:
In [53]:
fig, ax = plt.subplots(1, 1)
ax.scatter(tar_pred, tar_test)
# draw the ideal y = x line
ax.plot(target, target, 'r')
ax.set_xlabel('predicted log booking rate')
ax.set_ylabel('actual log booking rate')
Out[53]:
In [54]:
# compute R^2 by hand before using the built-in score method
1 - (sum_sq_model/sum_sq_naive)
Out[54]:
In [55]:
#lr.score(rel_features, target, sample_weight=None)
lr.score(feat_test, tar_test, sample_weight=None)
Out[55]:
It seems that the score method provides the coefficient of determination, R^2 = 1 - (residual sum of squares / total sum of squares). Because it compares predictions on the test features against the test targets, its value varies with what ends up in the test set. The score (coefficient of determination) is generally an indicator of how good your model is.
For this model I get fairly low numbers, which is not great, and they vary from run to run, so the model isn't particularly consistent (though I don't know how much variance is arguably acceptable). The values are negative pretty often when I re-run the split of the data, which means the model predicts worse than simply using the mean of the test targets, which is fairly terrible. This continued even after I took the log of the target data, so I'm really not sure where I'm going wrong. Some people in the class split the feature training set up by feature and then used loops to append the data back together; their training set is structured differently from mine, and that appeared to work well for them.
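To quantify how much the score moves across splits, a sketch using cross-validation with the same older API imported above (in newer scikit-learn, cross_val_score lives in sklearn.model_selection):

from sklearn.cross_validation import cross_val_score
scores = cross_val_score(LinearRegression(), rel_features, target, scoring='r2', cv=5)
print(scores, scores.mean())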
In [ ]:
# price*booking rate
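A sketch of the idea noted above, with revenue_proxy as a hypothetical column name: price times booking_rate gives only a rough revenue-per-month proxy, since it ignores nights per booking.

lb_new['revenue_proxy'] = lb_new['price'] * lb_new['booking_rate']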