In this assignment, your challenge is to do some basic analysis for Airbnb. Two data files, bookings.csv and listings.csv, are provided in hw/data/. The objective is to practice data munging and begin our exploration of regression.
In [2]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns # for pretty layout of plots
import matplotlib.pyplot as plt
# This enables inline Plots
%matplotlib inline
In [3]:
# Load airbnb data from data folder under hw/data/
bookings = pd.read_csv('../data/bookings.csv')
listings = pd.read_csv('../data/listings.csv')
bookings.head(5)
listings.head(5)
Out[3]:
In [4]:
# Describe the dataset - This gives you a summary of numerical columns
listings.describe()
Out[4]:
In [5]:
# There are 3 different property types. Group by prop_type first and then describe.
listings_prop_type = listings.groupby('prop_type')
listings_prop_type.describe()
Out[5]:
In [6]:
# There are 3 different property types and 22 neighborhoods. Group by prop_type and neighborhood first and then describe.
listings_prop_type = listings.groupby(['prop_type', 'neighborhood'])
listings_prop_type.describe()
Out[6]:
In [88]:
# booking_date is imported as "object" (i.e. strings), so convert it to datetime
bookings.booking_date = pd.to_datetime(bookings.booking_date)
# count the number of bookings on each date (this yields a Series indexed by date)
plot_bookings = bookings['booking_date'].value_counts()
# sort by date; sort_index() returns a new Series, so the result must be assigned
plot_bookings = plot_bookings.sort_index()
plot_bookings.plot()
Out[88]:
In [70]:
###Plot the daily bookings per neighborhood (provide a legend)
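One possible way to approach this task is sketched below. It is not the official solution: the tiny bookings and listings frames are made-up stand-ins for the CSV files, and it assumes that prop_id is the key shared by both files and that booking_date has already been converted to datetime.

```python
import pandas as pd

# Made-up stand-ins for bookings.csv and listings.csv (illustrative values only).
bookings = pd.DataFrame({
    "prop_id": [1, 1, 2, 3, 3],
    "booking_date": pd.to_datetime(
        ["2011-01-01", "2011-01-02", "2011-01-01", "2011-01-02", "2011-01-02"]
    ),
})
listings = pd.DataFrame({
    "prop_id": [1, 2, 3],
    "neighborhood": ["Neighborhood 1", "Neighborhood 2", "Neighborhood 2"],
})

# Attach each booking's neighborhood via the shared prop_id key (assumed join key).
merged = bookings.merge(
    listings[["prop_id", "neighborhood"]], on="prop_id", how="left"
)

# Count bookings per (date, neighborhood), then move neighborhoods into columns
# so that each neighborhood becomes one line in a plot.
daily_by_neighborhood = (
    merged.groupby(["booking_date", "neighborhood"])
    .size()
    .unstack(fill_value=0)
    .sort_index()
)
```

Calling daily_by_neighborhood.plot(legend=True) on the resulting table draws one line per neighborhood, with the column names used as legend entries.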
In [91]:
listings.neighborhood.value_counts()
Out[91]:
In [36]:
bookings_by_prop = bookings.groupby('prop_id')[['prop_id']].count()
bookings_by_prop.rename(columns={'prop_id': 'number_of_bookings'}, inplace=True)
bookings_by_prop.reset_index(inplace=True)
bookings_by_prop.head()
# merge() returns a new DataFrame, so assign it back to keep number_of_bookings
listings = listings.merge(bookings_by_prop, on='prop_id', how='left')
#def get_booking_rate(numb):
#return numb
#listings['booking_rate'] = listings['number_of_bookings'].map(get_booking_rate)
Out[36]:
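The commented-out stub in the cell above leaves booking_rate undefined. A minimal sketch of one way to fill it in follows, under the assumption (not stated in this notebook) that the booking rate means bookings per month of tenure, i.e. number_of_bookings / tenure_months. The small frames are made-up stand-ins for the real data.

```python
import pandas as pd

# Made-up stand-ins for the real listings and bookings data.
listings = pd.DataFrame({
    "prop_id": [1, 2, 3],
    "tenure_months": [10, 4, 20],
})
bookings = pd.DataFrame({"prop_id": [1, 1, 1, 1, 1, 3, 3]})

# Count bookings per property as a two-column frame: prop_id, number_of_bookings.
number_of_bookings = (
    bookings.groupby("prop_id").size().rename("number_of_bookings").reset_index()
)

# Left merge keeps listings that were never booked; their count becomes NaN,
# so replace it with 0.
listings = listings.merge(number_of_bookings, on="prop_id", how="left")
listings["number_of_bookings"] = listings["number_of_bookings"].fillna(0)

# ASSUMPTION: booking_rate is bookings per month of tenure.
listings["booking_rate"] = (
    listings["number_of_bookings"] / listings["tenure_months"]
)
```

A vectorized division like this avoids the need for a per-row helper passed to map().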
In [35]:
listings_filtered = listings[listings.tenure_months >= 10]
listings_filtered.describe()
Out[35]:
prop_type and neighborhood are categorical variables. Use get_dummies() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to transform each of these columns of categorical data into many columns of boolean values. After applying this function correctly, there should be one column for every prop_type and one column for every neighborhood category.
In [81]:
# one indicator column per neighborhood
dummies1 = pd.get_dummies(listings['neighborhood'])
# one indicator column per property type
dummies2 = pd.get_dummies(listings['prop_type'])
dummies2
Out[81]:
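The separate dummy frames still have to be attached to the listings table before modeling. As a hedged alternative, get_dummies() can encode both columns in place through its columns argument. The sketch below uses a made-up three-row listings frame, and the category labels are illustrative only.

```python
import pandas as pd

# Made-up stand-in for the listings table (labels are illustrative only).
listings = pd.DataFrame({
    "prop_id": [1, 2, 3],
    "prop_type": ["Property type 1", "Property type 2", "Property type 1"],
    "neighborhood": ["Neighborhood 1", "Neighborhood 2", "Neighborhood 2"],
    "tenure_months": [12, 15, 30],
})

# Replace the two categorical columns with one indicator column per category.
# New columns are prefixed with the source column name, e.g.
# "prop_type_Property type 1".
encoded = pd.get_dummies(listings, columns=["prop_type", "neighborhood"])
```

The original prop_type and neighborhood columns are dropped from the result, which is convenient here because they are excluded from the regressors anyway.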
The response (y) is booking_rate; the regressors (X) are all other columns except prop_id, booking_rate, prop_type, neighborhood and number_of_bookings.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
http://pandas.pydata.org/pandas-docs/stable/basics.html#dropping-labels-from-an-axis
In [ ]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
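A minimal sketch of building X and y and splitting them is shown below. The listings frame is randomly generated stand-in data, booking_rate is assumed to be number_of_bookings / tenure_months, and the 25% test size and random_state are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for the prepared listings table.
rng = np.random.default_rng(0)
n = 20
listings = pd.DataFrame({
    "prop_id": np.arange(n),
    "prop_type": ["Property type 1"] * n,
    "neighborhood": ["Neighborhood 1"] * n,
    "tenure_months": rng.integers(10, 40, size=n),
    "number_of_bookings": rng.integers(0, 30, size=n),
})
# ASSUMPTION: booking_rate is bookings per month of tenure.
listings["booking_rate"] = (
    listings["number_of_bookings"] / listings["tenure_months"]
)

# The response is booking_rate; drop it and the excluded columns from X.
y = listings["booking_rate"]
X = listings.drop(
    columns=["prop_id", "booking_rate", "prop_type", "neighborhood",
             "number_of_bookings"]
)

# Hold out 25% of the rows for testing (arbitrary choice for this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```

Fixing random_state makes the split reproducible between runs.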
In [ ]:
In [ ]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
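For illustration, the sketch below fits and scores a LinearRegression on synthetic, noise-free data whose true coefficients are known, so the fitted values can be checked by eye. The data, the coefficients and the split parameters are all made up and stand in for the Airbnb X and y.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3*x0 - 2*x1 + 1 with no noise (made up for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows only.
lr = LinearRegression()
lr.fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) on the given data.
r2_test = lr.score(X_test, y_test)
```

After fitting, lr.coef_ and lr.intercept_ hold the estimated slopes and intercept. On real data, comparing the R^2 on the training rows with the R^2 on the held-out rows gives a first indication of overfitting.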
In [ ]:
In [ ]:
...type here...
In [ ]: