In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and to begin our exploration of regression.
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
bookings=pd.read_csv('../data/bookings.csv')
In [3]:
bookings.info()
bookings.head(5)
Out[3]:
In [4]:
listings=pd.read_csv('../data/listings.csv')
In [5]:
listings.info()
listings.head(5)
Out[5]:
In [6]:
listings.price.mean()
Out[6]:
In [7]:
listings.mean(axis=0, numeric_only=True)
Out[7]:
In [8]:
listings.median(axis=0, numeric_only=True)
Out[8]:
In [9]:
listings.std(axis=0, numeric_only=True)
Out[9]:
In [10]:
listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()
Out[10]:
In [11]:
listings.groupby(['neighborhood', 'prop_type'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].mean()
Out[11]:
In [12]:
bookings.booking_date = pd.to_datetime(bookings.booking_date)
print(dir(bookings.booking_date[0]))
In [13]:
bookings.booking_date.value_counts().sort_index().plot()
Out[13]:
In [14]:
bookings.info()
bookings.booking_date.head(5)
Out[14]:
In [15]:
listMerge = listings.merge(bookings, on='prop_id')
listMerge.groupby(['neighborhood','booking_date'])['prop_id'].agg(['count']).unstack(0).plot()
Out[15]:
In [17]:
listings.columns
Out[17]:
In [18]:
bookings.head(5)
Out[18]:
In [19]:
book_by_prop=bookings.groupby('prop_id')[['prop_id']].count()
book_by_prop.head()
Out[19]:
In [20]:
book_by_prop.rename(columns={'prop_id':'number_of_bookings'}, inplace=True)
In [21]:
book_by_prop.reset_index(inplace=True)
In [22]:
book_by_prop.info()
book_by_prop.head(10)
Out[22]:
In [23]:
listings=listings.merge(book_by_prop, on='prop_id', how='left')
In [24]:
listings.fillna(0.0, inplace=True)
In [25]:
listings.head(10)
Out[25]:
In [26]:
listings.info()
In [27]:
listings['booking_rate']=listings.number_of_bookings/listings.tenure_months
In [28]:
listings=listings[listings.tenure_months>10]
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values (after applying the function correctly there should be one column for every prop_type category and one column for every neighborhood category). A sketch of joining the dummy columns back onto listings follows the two cells below.
In [29]:
pd.get_dummies(listings.prop_type)
Out[29]:
In [30]:
pd.get_dummies(listings.neighborhood)
Out[30]:
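The dummy frames above are not yet attached to listings. A minimal sketch of one way to do that with pd.concat; the names prop_dummies, hood_dummies, and listings_encoded are illustrative, not part of the assignment:
# One-hot encode both categorical columns and concatenate the result onto listings.
prop_dummies = pd.get_dummies(listings.prop_type, prefix='prop_type')
hood_dummies = pd.get_dummies(listings.neighborhood, prefix='neighborhood')
listings_encoded = pd.concat([listings, prop_dummies, hood_dummies], axis=1)
listings_encoded.head()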
The target (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood, and number_of_bookings (a sketch follows the two links below).
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
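Putting the two references together, a minimal sketch, assuming the listings_encoded frame from the get_dummies sketch above and the modern sklearn.model_selection location of train_test_split; the random_state value is an arbitrary choice for reproducibility:
from sklearn.model_selection import train_test_split

# Drop the target plus the excluded columns; every remaining column becomes a regressor.
excluded = ['prop_id', 'booking_rate', 'prop_type', 'neighborhood', 'number_of_bookings']
X = listings_encoded.drop(columns=excluded)
y = listings_encoded.booking_rate

# Hold out a third of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)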
In [31]:
listings.booking_rate.hist()
Out[31]:
In [33]:
np.log(listings.booking_rate)
Out[33]:
In [34]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in current scikit-learn
feature_cols = ['price', 'tenure_months', 'person_capacity', 'description_length', 'picture_count']
a, b = listings[feature_cols], listings.booking_rate
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.33)
In [35]:
listings.info()
a_train
Out[35]:
In [36]:
b_train.values.reshape(-1, 1)
Out[36]:
In [37]:
b_train.shape
Out[37]:
In [38]:
#need to include price, person capacity, picture count, description length, and tenure months
In [39]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
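make_pipeline and PolynomialFeatures are imported but never used below; a minimal sketch of how they could be chained on the same features (degree 2 is an arbitrary illustrative choice):
# Expand the features with degree-2 polynomial terms, then fit an ordinary linear model on them.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(a_train, b_train)
poly_model.score(a_test, b_test)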
In [40]:
#Linear Regression
clf = LinearRegression()
clf.fit(a_train, b_train)
Out[40]:
In [41]:
b_pred = clf.predict(a_test)
In [42]:
a_test
print(b_pred[0], a_test.iloc[0])
In [43]:
# Let's compute sum of Errors between Actual and Predicted
# Again, more on this next week - I just want to show how these tools work together
sum_sq_model = np.sum((b_test - b_pred) ** 2)
sum_sq_model
Out[43]:
In [44]:
# Compare with the base naive model where we say predicted value is just the mean value
sum_sq_naive = np.sum((b_test - b.mean()) ** 2)
sum_sq_naive
Out[44]:
In [45]:
clf.score(a_test,b_test)
Out[45]:
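The score above is R². It ties back to the two sums of squares computed earlier; a quick check, noting that sum_sq_naive used b.mean() over all of listings rather than the test-set mean, so this only approximates clf.score:
# R^2 is 1 minus the ratio of the model's squared error to the mean-only baseline's squared error.
1 - sum_sq_model / sum_sq_naive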