In this assignment your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and to begin our exploration of regression.
In [172]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns # for pretty layout of plots
import matplotlib.pyplot as plt
# This enables inline plots
%matplotlib inline
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)
In [22]:
listings = pd.read_csv('../data/listings.csv')
bookings = pd.read_csv('../data/bookings.csv')
listings.info()
listings.columns
# Clean up listings: strip the "Property type " and "Neighborhood " prefixes
# and convert both columns to integer codes.
listings['prop_type'] = listings['prop_type'].str.replace('Property type ', '').astype(int)
listings['neighborhood'] = listings['neighborhood'].str.replace('Neighborhood ', '').astype(int)
listings.info()
In [23]:
print("mean:")
print(listings.drop('prop_id', axis=1).mean())
print("\nmedian:")
print(listings.drop('prop_id', axis=1).median())
print("\nstandard deviation:")
print(listings.drop('prop_id', axis=1).std())
In [24]:
groupedListings = listings.groupby('prop_type')[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].agg(['mean'])
groupedListings
Out[24]:
In [25]:
groupedListings = listings.groupby(['prop_type', 'neighborhood'])[['price', 'person_capacity', 'picture_count', 'description_length', 'tenure_months']].agg(['mean'])
groupedListings
#bookings
Out[25]:
In [206]:
groupedBookings = bookings.groupby('booking_date').agg(['count'])
groupedBookings.hist()
#groupedBookings.plot()  # a line plot is a cleaner view of bookings over time
Out[206]:
In [27]:
combined = bookings.merge(listings, on='prop_id')  # inner join on the shared key
combined
groupedCombined = combined.groupby('neighborhood').agg(['count'])
groupedCombined.plot()
#TODO - add a legend
Out[27]:
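Note that `merge` defaults to an inner join on the shared column(s), so any listing with no bookings drops out of the combined table. A toy sketch (hypothetical values, not the assignment data):

```python
import pandas as pd

bookings = pd.DataFrame({'prop_id': [1, 1, 2], 'booking_date': ['d1', 'd2', 'd3']})
listings = pd.DataFrame({'prop_id': [1, 2, 3], 'price': [100, 80, 120]})

# Default merge is an inner join on the shared column(s),
# so prop_id 3 (which has no bookings) disappears from the result.
combined = bookings.merge(listings)
print(sorted(combined['prop_id'].unique().tolist()))  # [1, 2]
```
Pass `how='left'` (or `how='outer'`) if zero-booking listings should be kept.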
In [28]:
bookings
Out[28]:
In [29]:
listings
Out[29]:
In [30]:
groupedProperty = bookings.groupby(['prop_id']).count()
groupedProperty.reset_index(inplace=True)
In [31]:
groupedProperty.rename(columns = {'booking_date' : 'bookings'}, inplace=True)
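The count-and-rename pattern above can be sketched on toy data (hypothetical values):

```python
import pandas as pd

bookings = pd.DataFrame({'prop_id': [1, 1, 2], 'booking_date': ['d1', 'd2', 'd3']})

# count() tallies non-null values per group; reset_index turns the group
# key back into a column, and rename gives the count a meaningful label.
per_prop = bookings.groupby('prop_id').count().reset_index()
per_prop = per_prop.rename(columns={'booking_date': 'bookings'})
print(per_prop['bookings'].tolist())  # [2, 1]
```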
In [32]:
bookingListings = listings.merge(groupedProperty)
bookingListings
Out[32]:
In [33]:
bookingListings['booking_rate'] = bookingListings['bookings'] / bookingListings['tenure_months']
bookingListings
Out[33]:
In [34]:
establishedProperties = bookingListings[bookingListings['tenure_months'] > 9]
establishedProperties
Out[34]:
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform these columns of categorical data into many columns of boolean values (after applying this function correctly there should be one column for every prop_type and one column for every neighborhood category).
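As a minimal sketch of what get_dummies does (toy column names and values, not the assignment data):

```python
import pandas as pd

df = pd.DataFrame({'prop_type': [1, 2, 1], 'price': [100, 80, 120]})

# One indicator column appears per distinct category value;
# non-categorical columns (price) pass through unchanged.
dummies = pd.get_dummies(df, columns=['prop_type'])
print(sorted(dummies.columns))  # ['price', 'prop_type_1', 'prop_type_2']
```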
In [163]:
establishedProperties.info()
establishedProperties.prop_type.value_counts()
full_table = pd.get_dummies(establishedProperties, columns=['prop_type', 'neighborhood'])
In [164]:
pd.__version__
full_table
Out[164]:
The predictor (y) is booking_rate; the regressors (X) are everything else, except prop_id, booking_rate, prop_type, neighborhood, and the number of bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
In [194]:
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
full_table = pd.get_dummies(establishedProperties, columns=['prop_type', 'neighborhood'])
features = full_table.drop(['prop_id', 'bookings', 'booking_rate'], axis=1)
full_table['booking_rate'].hist()
# Take the log of booking_rate to reduce skew before regressing.
full_table['log'] = np.log(full_table['booking_rate'])
full_table['log'].hist()
targetDF = full_table['log']
targetDF
target = targetDF.values
target
Out[194]:
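One caveat with the log transform above: if any booking_rate is exactly zero, np.log produces -inf, which most estimators reject. np.log1p (log of 1 + x) is a common alternative that keeps zeros finite; a small sketch:

```python
import numpy as np

rates = np.array([0.0, 0.5, 2.0])

# np.log maps zero to -inf; np.log1p computes log(1 + x),
# so a zero rate stays finite (and maps to 0).
with np.errstate(divide='ignore'):
    raw = np.log(rates)
shifted = np.log1p(rates)
print(np.isinf(raw[0]), shifted[0])  # True 0.0
```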
In [192]:
a_train, a_test, b_train, b_test = train_test_split(features, target, test_size=0.2, random_state=3)
print("a_train = {}".format(a_train.shape))
print("a_test = {}".format(a_test.shape))
print("b_train = {}".format(b_train.shape))
print("b_test = {}".format(b_test.shape))
In [195]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(a_train, b_train)
lr.get_params()
Out[195]:
In [189]:
a_predictions = lr.predict(a_test)
print(a_predictions)
print(a_test)
In [190]:
lr.score(a_train, b_train)
Out[190]:
In [191]:
lr.score(a_test, b_test)
Out[191]:
score() returns the coefficient of determination (R²), which measures how well the predictions match the observed values.
A value close to 1 indicates a near-perfect model; 0 means the model does no better than predicting the mean, and it can go negative on held-out data.
Our model does a mediocre job on the training data, and the test-set score is very poor, so it generalizes badly.
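For context, LinearRegression.score() is exactly R², identical to sklearn.metrics.r2_score; a small synthetic check (toy data, not the assignment's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # strong linear signal + small noise

lr = LinearRegression().fit(X, y)
# score() returns R^2: 1.0 is a perfect fit, 0.0 is no better
# than predicting the mean of y.
assert np.isclose(lr.score(X, y), r2_score(y, lr.predict(X)))
print(round(lr.score(X, y), 2))
```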