In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files are provided in hw/data/: bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_rows', 10)
In [3]:
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')
#bookings[bookings.prop_id==1]
In [4]:
listings[['price','person_capacity','picture_count','description_length','tenure_months']].mean()
Out[4]:
In [5]:
listings[['price','person_capacity','picture_count','description_length','tenure_months']].median()
Out[5]:
In [6]:
listings[['price','person_capacity','picture_count','description_length','tenure_months']].std()
Out[6]:
In [7]:
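# Select several columns with a list (double brackets); a bare tuple of names is rejected by recent pandas
listings.groupby('prop_type')[['price','person_capacity','picture_count','description_length','tenure_months']].agg(['mean'])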
Out[7]:
In [8]:
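# Select several columns with a list (double brackets); a bare tuple of names is rejected by recent pandas
listings.groupby(['prop_type','neighborhood'])[['price','person_capacity','picture_count','description_length','tenure_months']].agg(['mean'])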
Out[8]:
In [9]:
bookings.booking_date.value_counts().sort_index().plot()
Out[9]:
In [10]:
booking_all = pd.merge(listings, bookings, how='outer', on='prop_id')
# Count bookings per day within each neighborhood and draw one line per neighborhood
booking_all.groupby(['booking_date','neighborhood']).size().unstack('neighborhood').plot()
Out[10]:
In [11]:
number_of_bookings = bookings.groupby('prop_id').count().reset_index()
number_of_bookings.rename(columns={'booking_date':'number_of_bookings'}, inplace = True)
listings2 = pd.merge(listings, number_of_bookings, how='left', on='prop_id')
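# Listings with no bookings get 0. Assign the result back: fillna(..., inplace=True) on a column
# reached through attribute access may not update listings2 in recent pandas.
listings2['number_of_bookings'] = listings2.number_of_bookings.fillna(0)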
listings2['booking_rate'] = listings2.number_of_bookings/listings2.tenure_months
listings2
Out[11]:
In [12]:
listings2[listings2.tenure_months > 10]
Out[12]:
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each column of categorical data into many columns of boolean values (after applying this function correctly, there should be one column for every prop_type category and one column for every neighborhood category).
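As a quick illustration of what get_dummies() does, here is a minimal sketch on a made-up three-row table. The column names mirror listings.csv, but the category labels and prices are invented for the example and are not taken from the real data.

```python
import pandas as pd

# Toy data: column names follow listings.csv, values are hypothetical.
toy = pd.DataFrame({
    'prop_type': ['Property type 1', 'Property type 2', 'Property type 1'],
    'neighborhood': ['Neighborhood 3', 'Neighborhood 3', 'Neighborhood 7'],
    'price': [100, 250, 80],
})

# String columns are expanded into one indicator column per category,
# named <column>_<category>; numeric columns such as price pass through unchanged.
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
```

Each row has exactly one indicator set within the prop_type group and one within the neighborhood group, which is an easy sanity check to run on the real table as well.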
In [13]:
listings2.info()
In [14]:
pd.get_dummies(listings2[['prop_type', 'neighborhood', 'tenure_months']])
Out[14]:
In [15]:
#Best way
features_cols = [col for col in listings2.columns if col not in ['prop_id', 'booking_rate', 'number_of_bookings']]
features = pd.get_dummies(listings2[features_cols])
In [16]:
#Other way
#prop_type_dummies = pd.get_dummies(listings2.prop_type)
#neighborhood_dummies = pd.get_dummies(listings2.neighborhood)
#The dummy frames share listings2's index, so join on the index and chain both joins
#features1 = listings2.join(prop_type_dummies).join(neighborhood_dummies)
#features1
The response (y) is booking_rate; the regressors (X) are everything else except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings (prop_type and neighborhood enter only through their dummy columns).
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
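The two links above cover the pieces needed here: dropping columns to build X, and splitting into train and test sets. The sketch below shows both on a made-up table; the frame `toy` and its values are hypothetical. Note that current scikit-learn releases expose `train_test_split` from `sklearn.model_selection`; the older `sklearn.cross_validation` module was removed in version 0.20.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the listings table (10 rows).
toy = pd.DataFrame({
    'prop_id': np.arange(1, 11),
    'price': np.arange(10) * 10 + 50,
    'person_capacity': np.arange(10) % 4 + 1,
    'booking_rate': np.linspace(0.0, 1.0, 10),
})

# y is the response; X is everything that is neither an identifier nor the response.
toy_y = toy['booking_rate']
toy_X = toy.drop(columns=['prop_id', 'booking_rate'])

# Hold out 20% of the rows; random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)
```

Because X and y are split together, the row labels of `X_tr` and `y_tr` stay aligned, which is what the fit step relies on.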
In [17]:
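# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split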
y = listings2.booking_rate
X = features
X_train, X_test, y_train, y_test = train_test_split(X , y, test_size=0.2, random_state = 42)
In [18]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Fit the linear regression on the training split
lr.fit(X_train, y_train)
Out[18]:
In [19]:
lr.score(X_test, y_test)
Out[19]:
score() returns the coefficient of determination (R^2) of the fitted model on the test set, not a correlation coefficient. The value here is low but positive, which means the columns in the features table explain only a small share of the variance in booking_rate.
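For reference, the number that `LinearRegression.score` reports is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of y from its mean. The sketch below checks this by hand on synthetic data; the arrays `X_toy` and `y_toy` are invented for the example and have nothing to do with the Airbnb files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends on the first feature plus noise (hypothetical values).
rng = np.random.RandomState(0)
X_toy = rng.rand(50, 2)
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=50)

# Fit on the first 40 rows, evaluate on the remaining 10.
model = LinearRegression().fit(X_toy[:40], y_toy[:40])
X_hold, y_hold = X_toy[40:], y_toy[40:]

# R^2 computed from its definition.
residual_ss = np.sum((y_hold - model.predict(X_hold)) ** 2)
total_ss = np.sum((y_hold - y_hold.mean()) ** 2)
r2_manual = 1.0 - residual_ss / total_ss

# R^2 as reported by scikit-learn.
r2_sklearn = model.score(X_hold, y_hold)
print(r2_manual, r2_sklearn)
```

A value near 1 means the model accounts for most of the variation in the held-out y; a value near 0 means it does little better than always predicting the mean, and the value can be negative on held-out data.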
In [20]:
# Alternative response: monthly revenue = booking_rate * price
month_rev = listings2['booking_rate']*listings2['price']
A_train, A_test, b_train, b_test = train_test_split(features, month_rev, test_size=0.2, random_state = 42)
lr2 = LinearRegression()
lr2.fit(A_train, b_train)
lr2.score(A_test, b_test)
Out[20]: