The housing market is known to be very volatile and is well regarded to be dependant on several factors, such as supply and demand,location, as well as demographics of the local region. New York City has one of the highest median rent in the U.S., but also one of the highest crime rates in the country. The goal of my final project for Data Bootcamp is to understand the relationship between crime rate and rent prices in New York City, and see if rent in different areas of manhattan is actually affected by the crime rate.
Using property data scraped from the Airbnb, as well as public government data on recent crime activity, I hope to be able to quantify and visualize the impact of crime activity on rent prices.
Contents:
The data has been extracted from 4 sources :
Airbnb's data between January 1st and January 3rd, 2015: http://data.beta.nyc//dataset/e91bde48-dbd0-4724-a433-a34bf0140e66/resource/9d64399b-36d6-40a9-b0bb-f26ae0d9c53f/download/listings.csv
New York Police Department data for 2015: http://data.beta.nyc//dataset/c874f72e-23a6-404f-ad83-fdd2aaa104bb/resource/3377cc50-044b-4806-b062-7b0c14351457/download/vol2351.csv
Department of Health, Neighborhoods, Boroughs and Zipcodes: https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm
Department of Education, Zipcodes and Precincts: https://data.cityofnewyork.us/api/views/u553-m549/rows.csv?accessType=DOWNLOAD
At first, I wanted to use Zillow's data for this project, as it seemed very promising. After taking a deeper look into it, I figured that it didn't have good data by areas of the city. Given that, I switched to the Airbnb API, as it had much more detailed data by different zip codes. Also, since Airbnb provides mostly short-term rentals, I believed it was more sensitive to changes in the given area.
The NYPD releases weekly data on the crimes of each Precinct in the city. Fortunetely, each weeks data also has the cumulative numbers for the year. So we will use the data for the last week of 2016 and then only use the YearToDate columns, so we can get the full picture for the year.
The Neighborhood and Borough Data from Airbnb was not accurate. Apartment owners are free to write their Neighborhood and Borough how ever they want. There were 180 Boroughs listed in the data, while in fact, we know there are only 5. To normalize this data, we used data from the health department and matched each zipcode to its correct neighborhood and Borough.
The NYPD aggregates its data by Precinct, which is basically arbitrary areas of the city. So I had to use a dataset from the department of education that had school information, but also the zipcode and precinct that each school is located in. This is not completely accurate, becuase a precinct might have more that one zipcode contained in it, but it was the best way I found to match the housing data to the crime date.
In [2]:
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for pandas
import csv
import re
import math
import urllib.request as ul
from bs4 import BeautifulSoup
%matplotlib inline
#check versions
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
# print('Today: ', dt.date.today())
In [3]:
url = 'http://data.beta.nyc//dataset/e91bde48-dbd0-4724-a433-a34bf0140e66/resource/9d64399b-36d6-40a9-b0bb-f26ae0d9c53f/download/listings.csv'
Rent = pd.read_csv(url, sep=',') # tab (\t) separated values
# print('Variable dtypes:\n', Rent.dtypes, sep=',')
Rent.head(2)
Out[3]:
In [4]:
# list of all variables
Rent.columns.values
Out[4]:
In [5]:
# Select Only relevant data
Rent = Rent[['name', 'neighbourhood_cleansed', 'city', 'zipcode', 'latitude', 'longitude',
'property_type', 'room_type', 'accommodates','bathrooms', 'bedrooms', 'beds', 'square_feet', 'price',
'weekly_price', 'monthly_price', 'review_scores_value']]
In [6]:
Rent.head(2)
Out[6]:
In [7]:
nrow = len(Rent.index)
nrow
Out[7]:
Since the only reliable information, in terms of location, from Airbnb is the zipcode, we need to import a list of zipcodes and neighborhoods to normalize the data
In [9]:
url = "https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm"
url_response=ul.urlopen(url, timeout=5)
soup = BeautifulSoup(url_response,'lxml')
#table we are looking into
table = soup.find_all('table')[0]
#borough: list of boroughs, neighborhood: list of neighborhoods, zip_code: list of zip codes
borough = list()
neighborhood = list()
zip_code = list()
zip_code_unique = list()
length = list()
#patterns for regular expressions - found printing table.find_all('tr')
pattern_borough = '"header1.+>(.+)</td>'
pattern_neighborhood = '"header2">\s(.+)</td>'
pattern_zip = '"header3">\s(.+)</td>'
for line in table.find_all('td'):
regex_borough = re.compile(pattern_borough)
borough.extend(regex_borough.findall(str(line)))
regex_neighborhood = re.compile(pattern_neighborhood)
neighborhood.extend(regex_neighborhood.findall(str(line)))
if len(neighborhood) > len(borough):
borough.append(borough[-1])
regex_zip = re.compile(pattern_zip)
zip_code.extend(regex_zip.findall(str(line)))
#print(borough)
#print(neighborhood)
#print(zip_code)
#code to split the zip codes
for i in range(len(zip_code)):
regex_zip2 = re.compile('(\s*[,]\s*)')
zip_ext = regex_zip.sub(',', zip_code[i]).split(',')
#list of unique zip codes
zip_code_unique.extend(zip_ext)
#number of split realized to get the corresponding borough and neighborhood
length.append(len(zip_ext))
#code to duplicate the borough and neighborhood so that one given index gives both information
#must have a reverse loop to insert at the appropriate location
for i in range(len(borough))[::-1]:
j = 0
#j is a counter to insert the same borough/neighborhood at the given index
while length[i] - j > 1:
borough.insert(i, borough[i])
neighborhood.insert(i, neighborhood[i])
j += 1
In [10]:
#cleaning the zip_code_unique list to avoid " "
for i in range(len(zip_code_unique)):
if len(zip_code_unique[i]) == 6:
zip_code_unique[i] = zip_code_unique[i][1:6]
In [11]:
# list of indices where zipcode is missing
non_missing_zip = [ind for ind in range(nrow) if pd.isnull(Rent.zipcode)[ind] == False]
In [12]:
# Keep only rows where non_missing is True
Rent = Rent.ix[non_missing_zip,]
In [13]:
# reset indices
Rent = Rent.reset_index()
In [14]:
nrow = len(Rent.zipcode)
nrow
Out[14]:
In [15]:
# Create new list and insert Borough and Neighborhood data based on index
borough_table = list()
neighborhood_table = list()
for i in range(nrow):
if Rent.zipcode[i] in zip_code_unique:
borough_table.append(borough[zip_code_unique.index(Rent.zipcode[i])])
neighborhood_table.append(neighborhood[zip_code_unique.index(Rent.zipcode[i])])
else:
borough_table.append("N/A")
neighborhood_table.append("N/A")
In [16]:
Rent['borough'] = borough_table
Rent['Neighborhood'] = neighborhood_table
In [17]:
# Drop rows that have borough or Neighborhood N/A
# Remove row if borough is N/A
Rent = Rent.loc[Rent['borough'] != 'N/A']
In [18]:
# reset indices
Rent = Rent.reset_index()
In [19]:
# reset nrow
nrow = len(Rent.index)
nrow
Out[19]:
In [20]:
day_price = []
for i in range(nrow):
# remove commas ans $ sign, convert into float
day_price.append((float(Rent.price[i].replace(',','')[1:])))
# Drop old price column
Rent = Rent.drop(['price'], axis = 1)
# Add new column with calculated data
Rent['day_price'] = day_price
In [21]:
Rent.head(2)
Out[21]:
In [22]:
# Filter table based on room_type Column
Rent_Private = Rent.loc[Rent['room_type'] == 'Private room']
# Reset index in the new table
Rent_Private.reset_index(inplace=True, drop = True)
# Remove outliers to keep data clean. I set ouliers as any private room above $300
Rent_Private = Rent_Private.loc[Rent_Private['day_price'] < 300]
Rent_Private.head(2)
#Rent_Private.dtypes
Out[22]:
In [23]:
url = 'http://data.beta.nyc//dataset/c874f72e-23a6-404f-ad83-fdd2aaa104bb/resource/3377cc50-044b-4806-b062-7b0c14351457/download/vol2351.csv'
Crime = pd.read_csv(url, sep=',', thousands=',') # tab (\t) separated values
#print('Variable dtypes:\n', Crime.dtypes, sep=',')
Crime.head()
Out[23]:
In [24]:
# list of all variables
Crime.columns.values
Out[24]:
In [25]:
# Select only relevant data
Crime = Crime [['Crime', 'YTD_Current', 'YTD_Prior', 'Precinct']]
In [26]:
# Remove non-numeric characters from Precinct column
Crime['Precinct'].replace(regex=True,inplace=True,to_replace=r'\D',value=r'')
Crime.head()
Out[26]:
In [27]:
# update new number rows
nrow = len(Crime.index)
In [28]:
# Precinct dtype should be str
Crime.Precinct = Crime.Precinct.astype(np.str)
# Remove commas and periods fron number of crimes
Crime.YTD_Current.append(Crime.YTD_Current.replace(",",""))
Crime.YTD_Prior.append(Crime.YTD_Prior.replace(",",""))
Crime.head()
Out[28]:
In [29]:
# Using school data that has both Zipcodes and Precincts in the same table
url = 'https://data.cityofnewyork.us/api/views/u553-m549/rows.csv?accessType=DOWNLOAD'
Precinct = pd.read_csv(url, sep = ',')
#print('Variable dtypes:\n', Precinct.dtypes, sep=',')
In [30]:
# Select only relevant data
Precinct = Precinct [['Zip', 'precinct']]
In [31]:
# Update Column name
Precinct = Precinct.rename(columns = {'precinct':'Precinct'})
In [32]:
# Remove Duplicates from the list
Precinct = Precinct.drop_duplicates (['Precinct'])
In [33]:
# Remove any trailing spaces
Precinct.Precinct = Precinct.Precinct.replace(" ", "")
Precinct.Zip = Precinct.Zip.astype(str)
In [34]:
# Make sure Precinct dtype is str
Precinct.Precinct = Precinct.Precinct.astype(np.str)
#Precinct.dtypes
In [35]:
Precinct.head()
Out[35]:
In [36]:
# reset indices
Precinct = Precinct.reset_index()
In [37]:
nrow = len(Precinct.Zip)
nrow
Out[37]:
In [38]:
# Create new list and insert Borough and Neighborhood data based on index
borough_table = list()
neighborhood_table = list()
for i in range(nrow):
if Precinct.Zip[i] in zip_code_unique:
borough_table.append(borough[zip_code_unique.index(Precinct.Zip[i])])
neighborhood_table.append(neighborhood[zip_code_unique.index(Precinct.Zip[i])])
else:
borough_table.append("N/A")
neighborhood_table.append("N/A")
In [39]:
Precinct['borough'] = borough_table
Precinct['Neighborhood'] = neighborhood_table
In [40]:
# Drop rows that have borough or Neighborhood N/A
# Remove row if borough is N/A
Precinct = Precinct.loc[Rent['borough'] != 'N/A']
In [41]:
Precinct.head()
Out[41]:
In [42]:
# Merge the two Dataframes to include Zipcode
Combo = pd.merge(Crime, Precinct, # left and right df's
how='inner', # Inner Merge
left_on='Precinct', # link with this variable/column
right_on='Precinct', # link with this variable/column
indicator = True)
Combo.dtypes
Out[42]:
In [43]:
Combo.head()
Out[43]:
In [44]:
# Update Crime df with data from Combo and remove indicator
Crime = Combo [['Crime', 'YTD_Current', 'YTD_Prior', 'Precinct', 'Zip', 'borough','Neighborhood']]
In [45]:
Crime.head()
Out[45]:
In [46]:
# update new number rows
nrow = len(Crime.index)
In [47]:
# list of indices where zipcode is missing
non_missing_zip = [ind for ind in range(nrow) if pd.isnull(Crime.Zip)[ind] == False]
# list of indices where Precinct is missing
non_missing_Pre = [ind for ind in range(nrow) if pd.isnull(Crime.Precinct)[ind] == False]
In [48]:
# Keep only rows where non_missing is True
Crime = Crime.ix[non_missing_zip,]
Crime = Crime.ix[non_missing_Pre,]
In [49]:
# Filter table based on Crime Column
Crime_Total = Crime.loc[Crime['Crime'] == 'TOTAL']
# Reset index in the new table
Crime_Total.reset_index(inplace=True, drop = True)
Crime_Total.head()
#Crime_Total.dtypes
Out[49]:
After cleaning the data we finished with four dataframes: Rent, Rent_Private, Crime and Crime_Total. We will use these three to produce visual aids to better understand how they relate.
In [50]:
# Import seaborn package
import seaborn as sns
In [116]:
sns.set(style="whitegrid",
color_codes=True)
# Manually set color palette
col = ["#9b59b6", "#3498db", "#e74c3c", "#34495e", "#2ecc71"]
# Plot graph
sns.boxplot(x="borough",
y="day_price",
data=Rent_Private,
palette=col)
# Add title and labels
plt.title('Airbnb Rentals for Private Rooms \n 2015 - New York City')
plt.xlabel('Neighborhood')
plt.ylabel('Airbnb Price per Day')
Out[116]:
Not suprisingly, Manhattan has the most expensive rentals in NYC.
In [115]:
# Plot graph
g = sns.stripplot(x="Neighborhood",
y="day_price",
data=Rent_Private[Rent_Private['borough'] == 'Manhattan'],
palette='YlGnBu',
jitter=True);
# rotate labels 90 degrees
for item in g.get_xticklabels():
item.set_rotation(90)
# Add title and labels
plt.title('Airbnb Rentals for Private Rooms \n 2015 - Manhattan')
plt.xlabel('Neighborhood')
plt.ylabel('Airbnb Price per Day')
Out[115]:
Within Manhattan, Chelseas seems to have the most expensive rentals
In [108]:
# Plot bar chart with total rentals available by Borough
sns.countplot(x="borough",
data=Rent_Private,
palette='dark')
# Add title and lables
plt.title('Total Airbnb Rentals Available \n 2015 - New York')
plt.xlabel('Borough')
plt.ylabel('Rooms Available')
Out[108]:
In [54]:
Price_Neighborhood=Rent_Private.groupby(by=['Neighborhood'],as_index=False)['day_price'].mean()
Price_Neighborhood.head(2)
Out[54]:
In [55]:
# New table with total number of rentals available
Total_Neighborhood=Rent_Private.groupby(by=['Neighborhood'],as_index=False)['day_price'].count()
Total_Neighborhood.rename(columns={'day_price': 'Total'}, inplace=True)
Total_Neighborhood.head(2)
Out[55]:
In [56]:
Crime_Neighborhood=Crime_Total.groupby(by=['Neighborhood'],as_index=False)['YTD_Current'].sum()
Crime_Neighborhood.head(2)
Out[56]:
In [102]:
# Create New table with merged data from three dataframes
Combined = pd.merge(pd.merge(Price_Neighborhood,Crime_Neighborhood,how='inner',on='Neighborhood'),
Total_Neighborhood,how='inner',on='Neighborhood')
labels=Combined['Neighborhood'].tolist()
Combined.head(2)
Out[102]:
In [101]:
sns.set(style="whitegrid",
color_codes=True)
# Plot graph
g = sns.lmplot('YTD_Current', 'day_price',
data=Combined,
fit_reg=False,
scatter=True,
palette="Red",
size = 6,
aspect = 1.5,
scatter_kws={"marker": "D", "s": Combined.Total,"color": "steelblue"})
# Annotate some Neighborhoods based on their data coordinates
plt.annotate('Lower East Side', xy=(3123,107), xytext = (-20, 10), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'grey', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('Chelsea', xy=(2335,127), xytext = (40, 20), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('Gramercy Park', xy=(1811,124), xytext = (-15, 12), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'grey', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('Greenpoint', xy=(960,88.5), xytext = (-30, 10), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellowgreen', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('Northwest Brooklyn', xy=(4700,98), xytext = (-20, 10), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'indianred', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('Upper East Side', xy=(1853,116), xytext = (-15, -20), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'grey', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('Upper West Side', xy=(821,109), xytext = (-15, 10), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'grey', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('Central Bronx', xy=(3354,55), xytext = (-15, 10), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'grey', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('Lower Manhattan', xy=(1336,120), xytext = (-15, 8), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'grey', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.annotate('South Shore', xy=(294,64), xytext = (-15, 10), xycoords = 'data',
textcoords = 'offset points', ha = 'right', va = 'bottom', fontsize=8,
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'grey', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
# Add title and labels
plt.title('Daily prices of appartments and number of crimes \n 2015 - New York City')
plt.xlabel('Total Crimes')
plt.ylabel('Airbnb Average Daily Price')
Out[101]:
We can point out a few outliers from this graph:
1) Northwest Brooklyn has a high crime rate and also a relatevily high average daily price
2) Chelsea has the highest average daily price, and also a median cime rate
3) Greenpoint seems like a good neighborhood. Low crime rate and also a low daily price
As said in the introduction, rent prices is a function of several factors. Undoubtly, crime is one of those factors. Even though we were not able to see a specific pattern in the Airbnb prices with relation to crime rates, I think we were able to point out a few outliers. Unfortunetely, the neighborhoods areas are not the same size, so looking at total crimes is not a fair assesment when comparing different areas. We would probably need to look at total crime divided by population to get a more accurate picture. For example, according to this data, Greenpoint neighborhood has a very low crime rate and a reasonably affordable rent. In contrast, the area called Northwest Brooklyn (Which is actually the area of Brooklyn Hights, Park Slope, Bed-Stuy, Williamsberg, and Redhook) has a high crime rate. There areas are actually very close to each other, but Northwest Brookly is a much larger area.