In this assignment your challenge is to do some basic analysis for Airbnb. Provided in hw/data/ there are 2 data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.
In [1]:
import pandas as pd
import numpy
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
bookings = pd.read_csv("../data/bookings.csv")
listings = pd.read_csv('../data/listings.csv')
In [3]:
listings.describe()
#or
#listings.median()
#listings.mean()
#listings.std()
Out[3]:
In [4]:
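# numeric_only=True skips the string column (neighborhood), which recent pandas versions refuse to average
listings.groupby('prop_type').mean(numeric_only=True)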
Out[4]:
In [5]:
listings.groupby(['neighborhood', 'prop_type']).mean()
Out[5]:
In [6]:
with plt.style.context('fivethirtyeight'):
    # count bookings per date and plot them in chronological order
    ax = bookings['booking_date'].value_counts().sort_index().plot(figsize = (12,8))
    ax.set_ylabel("Number of bookings")
    ax.set_xlabel('Date')
In [7]:
#Plan:
#1. copy all of the booking data to a new df
#2. get the neighborhood from the listings data and add it as a column by matching on prop_id
#3. count the bookings per neighborhood per day
#4. plot those counts
bookingsNeighborhood = bookings.merge(listings[['prop_id', 'neighborhood']], on='prop_id')
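The remaining steps in the comments above (count per neighborhood per day, then plot) could be sketched roughly as follows. This is a minimal sketch on made-up toy data, not the notebook's actual result: the dates and values are invented, and `bookings_neighborhood` and `daily_counts` are hypothetical names introduced here.

```python
import pandas as pd

# Toy stand-ins for bookings.csv and listings.csv (all values are made up).
bookings = pd.DataFrame({
    "prop_id": [1, 1, 2, 3, 3, 3],
    "booking_date": ["2011-06-17", "2011-06-18", "2011-06-17",
                     "2011-06-17", "2011-06-18", "2011-06-18"],
})
listings = pd.DataFrame({
    "prop_id": [1, 2, 3],
    "neighborhood": ["Neighborhood 4", "Neighborhood 4", "Neighborhood 9"],
})

# Attach each booking's neighborhood by matching on prop_id.
bookings_neighborhood = bookings.merge(
    listings[["prop_id", "neighborhood"]], on="prop_id"
)

# Count bookings per day and neighborhood, then pivot so that each
# neighborhood becomes its own column (missing combinations become 0).
daily_counts = (
    bookings_neighborhood
    .groupby(["booking_date", "neighborhood"])
    .size()
    .unstack(fill_value=0)
)
print(daily_counts)

# daily_counts.plot(figsize=(12, 8)) would then draw one line per neighborhood.
```

Pivoting with `unstack` keeps one row per date, so a single `.plot()` call can draw every neighborhood on the same axes.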
In [8]:
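listings2 = listings.copy()  #copy, so the original listings frame is not modified
#count bookings per property, then align the counts on prop_id rather than on the row index
bookings_per_prop = bookings.groupby('prop_id')['prop_id'].count()
listings2["number_of_bookings"] = listings2['prop_id'].map(bookings_per_prop)
listings2["booking_rate"] = listings2['number_of_bookings']/listings2['tenure_months']
#[['col_name']] returns a data frame, just ['col_name'] returns a series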
In [9]:
listings3 = listings2[listings2['tenure_months'] >=10]
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each column of categorical data into many columns of boolean values. After applying this function correctly there should be 1 column for every prop_type category and 1 column for every neighborhood category.
In [10]:
listings4 = listings3.copy()  #copy, so adding columns does not touch listings3
for column in ['neighborhood','prop_type']:
    #one indicator column per category value
    dummies = pd.get_dummies(listings4[column])
    listings4[dummies.columns] = dummies
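To see what the encoding produces, here is a minimal sketch on a made-up three-row frame. The frame `toy` and the name `encoded` are hypothetical, and `pd.concat` is used here as one alternative way to join the indicator columns back on; it is not the notebook's own approach.

```python
import pandas as pd

# Toy listings frame; the rows are made up for illustration.
toy = pd.DataFrame({
    "prop_id": [1, 2, 3],
    "prop_type": ["Property type 1", "Property type 2", "Property type 1"],
    "neighborhood": ["Neighborhood 4", "Neighborhood 9", "Neighborhood 4"],
})

# Build the indicator columns for both categorical columns and join them
# back onto the original frame in a single step.
dummies = [pd.get_dummies(toy[column]) for column in ["neighborhood", "prop_type"]]
encoded = pd.concat([toy] + dummies, axis=1)

print(list(encoded.columns))
```

Each row has exactly one true value among the neighborhood indicators and one among the property type indicators.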
The response (y) is booking_rate. The regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
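One way to build X by exclusion, rather than by listing every regressor, is `DataFrame.drop`. The following is a minimal sketch on a made-up frame; `toy`, `excluded`, `X` and `y` are hypothetical names, and the values are invented.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the listings data (values made up).
toy = pd.DataFrame({
    "prop_id": [1, 2, 3, 4],
    "prop_type": ["Property type 1", "Property type 2",
                  "Property type 1", "Property type 2"],
    "neighborhood": ["Neighborhood 4", "Neighborhood 9",
                     "Neighborhood 4", "Neighborhood 9"],
    "price": [100, 150, 80, 120],
    "tenure_months": [12, 20, 15, 30],
    "number_of_bookings": [6, 5, 3, 15],
    "booking_rate": [0.5, 0.25, 0.2, 0.5],
    "Property type 2": [0, 1, 0, 1],
})

# Columns that must not be used as regressors.
excluded = ["prop_id", "booking_rate", "prop_type",
            "neighborhood", "number_of_bookings"]

X = toy.drop(columns=excluded)  # returns a new frame; toy is left unchanged
y = toy["booking_rate"]

print(list(X.columns))
```

Dropping by name means any dummy columns added later are picked up automatically, at the cost of being less explicit about what goes into the model.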
In [11]:
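#sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split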
In [15]:
#remove garbage
listings5 = listings4
#listings5.replace([numpy.inf, -numpy.inf], numpy.nan)
#listings5.dropna(axis=0)
#listings5.dropna(subset=['number_of_bookings'],how='any', inplace = True)
listings5 = listings5.dropna()
#listings5.info()
In [132]:
x_axis = listings5[['price', 'person_capacity',
'picture_count', 'description_length', 'tenure_months', 'Property type 2', 'Property type 3',
'Neighborhood 18', 'Neighborhood 19', 'Neighborhood 20',
'Neighborhood 21', 'Neighborhood 4', 'Neighborhood 5',
'Neighborhood 7', 'Neighborhood 8', 'Neighborhood 9',]]
y_axis = listings5['booking_rate']
x_train, x_test, y_train, y_test = train_test_split(x_axis, y_axis)
In [53]:
listings5['Neighborhood 18'].value_counts()
Out[53]:
In [95]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
In [133]:
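#fit on the training split only, so the test split stays unseen for scoring
#(n_jobs is a LinearRegression() constructor argument, not a fit() argument)
model = lr.fit(x_train, y_train)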
In [134]:
model.score(x_test, y_test)
Out[134]:
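score() returns the R^2 value, also called the coefficient of determination, which measures how closely the regression fits the data. A value of 1 is a perfect fit, 0 is no better than always predicting the mean of y, and the value can be negative when the model does worse than that.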
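The quantity that score() reports can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch on made-up data, using `numpy.polyfit` for a one-variable least squares fit instead of the notebook's LinearRegression model; all names and numbers are illustrative.

```python
import numpy as np

# Made-up data: y is roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Ordinary least squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r_squared)
```

Because the toy data is almost perfectly linear, the value comes out close to 1. On noisier data, such as the booking rates, it would be lower.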