In this assignment, your challenge is to do some basic analysis for Airbnb. The hw/data/ directory contains two data files, bookings.csv and listings.csv. The objective is to practice data munging and begin our exploration of regression.
In [48]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns # for pretty layout of plots
import matplotlib.pyplot as plt
# This enables inline Plots
%matplotlib inline
In [49]:
bookings = pd.read_csv('../data/bookings.csv', delimiter=",")
listings = pd.read_csv('../data/listings.csv', delimiter=",")
bookings.set_index(['prop_id'], inplace=True)
listings.set_index(['prop_id'], inplace=True)
In [50]:
listings.info()
In [51]:
listings_by_prop_type = listings.groupby('prop_type')
listings_by_prop_type.describe()
Out[51]:
In [52]:
# Group first by property type, then by neighborhood within each type
listings_by_prop_type_then_neighborhood = listings.groupby(['prop_type', 'neighborhood'])
listings_by_prop_type_then_neighborhood.describe()
Out[52]:
In [59]:
bookings.booking_date = pd.to_datetime(bookings.booking_date)
In [ ]:
In [ ]:
In [ ]:
In [ ]:
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each column of categorical data into many columns of boolean values. After applying this function correctly, there should be one column for every prop_type category and one column for every neighborhood category.
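As a rough illustration of what get_dummies() produces, here is a minimal sketch on a made-up frame. The name `toy` and every value in it are hypothetical and are not taken from listings.csv; only the column names prop_type and neighborhood come from the assignment.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    "prop_type": ["Property type 1", "Property type 2", "Property type 1"],
    "neighborhood": ["Neighborhood A", "Neighborhood B", "Neighborhood B"],
    "price": [100, 150, 120],
})

# One indicator column per distinct category, prefixed with the source column name.
dummies = pd.get_dummies(toy[["prop_type", "neighborhood"]])

# Attach the indicator columns to the original frame.
toy_encoded = pd.concat([toy, dummies], axis=1)
print(toy_encoded.columns.tolist())
```

Each row has exactly one "on" indicator among the prop_type columns and exactly one among the neighborhood columns. The original categorical columns remain in `toy_encoded` here, so they would still need to be dropped before fitting a model.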
In [ ]:
The response (y) is booking_rate. The regressors (X) are all the other columns, except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
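The drop-then-split pattern described above might look like the following sketch. It runs on a hypothetical frame called `toy` with invented values, and it assumes a recent scikit-learn, where train_test_split is imported from sklearn.model_selection. The 80/20 split and the random_state value are arbitrary choices, not something the assignment specifies.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, invented purely for illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "prop_id": range(10),
    "price": rng.integers(50, 300, size=10),
    "number_of_bookings": rng.integers(0, 20, size=10),
    "booking_rate": rng.random(10),
})

# Response: the column we want to predict.
y = toy["booking_rate"]

# Regressors: everything except the response and the excluded columns.
X = toy.drop(["prop_id", "booking_rate", "number_of_bookings"], axis=1)

# Hold out 20% of the rows for testing (an arbitrary choice for this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

If prop_id has been made the index, as in the set_index() calls earlier in this notebook, it is no longer a column and should be left out of the drop() list.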
In [ ]:
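# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split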
In [ ]:
In [ ]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
In [ ]:
In [ ]:
...type here...
In [ ]: