In [52]:
"""
author: mikezawitkowski
created on 7/17/2016
"""
from __future__ import division, print_function
import pandas as pd
%matplotlib inline
import seaborn as sns
At the time this was created, there was a lot of press about fires in the Mission District, and gossip that perhaps landlords or some arsonist were setting fires to older properties for financial gain. This notebook captures some initial thoughts about that.
This exploration troubles me: I don't see much upside to producing it, but I see quite a few downsides if I get it wrong.
This seems to be a very politically charged topic at the moment, and a lot of people are making claims and getting carried away with facts that may or may not be true.
I'm not saying that one side or the other is more right or wrong, but I'm confident that in the end the data will prevail.
In the meantime, as part of this exploration, I was curious whether I could verify some of the claims being put forth, figure out whether there are other explanations, and just wrap my head around the problem and the data being used.
In [2]:
query_url = 'https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset={}&$limit={}'
offset = 0
limit = 1000000
df = pd.read_json(query_url.format(offset, limit))
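The `$offset`/`$limit` parameters in the URL above exist because Socrata-style endpoints page their results; here the limit is simply set high enough to grab everything in one request. A paging loop is the more robust pattern when the row count is unknown. A minimal sketch, with a hypothetical `fetch_page(offset, limit)` callable standing in for the actual HTTP request:

```python
import pandas as pd

def fetch_all(fetch_page, page_size=1000):
    """Page through an offset/limit endpoint until a short page signals the end.

    fetch_page(offset, limit) is a hypothetical callable returning a list of
    record dicts; in practice it would wrap an HTTP GET on the JSON URL.
    """
    frames = []
    offset = 0
    while True:
        records = fetch_page(offset, page_size)
        if not records:
            break
        frames.append(pd.DataFrame(records))
        if len(records) < page_size:  # short page: no more data left
            break
        offset += page_size
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# toy stand-in for the API: 2,500 fake rows served in pages of 1,000
fake_rows = [{"incident_number": i} for i in range(2500)]
fake_fetch = lambda off, lim: fake_rows[off:off + lim]
df_all = fetch_all(fake_fetch, page_size=1000)
```

The short-page check avoids one extra empty request at the end when the final page is partially full.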
In [3]:
cols_to_drop = ["automatic_extinguishing_sytem_failure_reason",
"automatic_extinguishing_sytem_type",
"battalion",
"box",
"call_number",
"detector_effectiveness",
"detector_failure_reason",
"ems_personnel",
"ems_units",
"exposure_number",
"first_unit_on_scene",
"ignition_factor_secondary",
"mutual_aid",
"no_flame_spead",
"other_personnel",
"other_units",
"station_area",
"supervisor_district"]
df = df.drop(cols_to_drop, axis=1)
In [4]:
for col in df.columns:
    if 'dttm' in col:
        df[col] = pd.to_datetime(df[col])
In [5]:
df.alarm_dttm.min()
Out[5]:
In [6]:
df.estimated_property_loss.value_counts(dropna=False)
Out[6]:
In [7]:
df.shape
Out[7]:
In [8]:
# So we have 100,000 rows of data, going all the way back to February 10, 2013.
# One theory is that there's a correlation between year and cost, especially in the Mission.
df.estimated_property_loss.isnull().sum()
Out[8]:
In [9]:
# of the 100,000 rows, 96,335 are null
96335 / float(df.shape[0])
Out[9]:
In [10]:
# wow, so where are these companies getting their data about the costs associated with fires?
# it's not from the sfgov website. we'll need to table that and come back later.
In [26]:
df['year'] = df.alarm_dttm.dt.year
In [11]:
temp_df = df[df.estimated_property_loss.notnull()]
In [12]:
temp_df.shape
Out[12]:
In [14]:
temp_df.groupby('year').sum()['estimated_property_loss']
Out[14]:
According to Wikipedia, the Mission District falls into two zipcodes: 94103 and 94110.
So let's look at just those zipcodes with the same grouping as above
In [15]:
mask = ((temp_df.zipcode.notnull()) & (temp_df.zipcode.isin([94103, 94110])))
temp_df[mask].groupby('year').sum()['estimated_property_loss']
Out[15]:
In [16]:
# So based on the above data, yes, 2015 fire losses for those two zipcodes doubled,
# and we can look into why, but could it be a symptom of population growth?
In [17]:
# this article http://sf.curbed.com/2016/7/1/12073544/mission-fires-arson-campos
# said that there were 2,788 blazes... but that's wrong, it's 2,788 units impacted.
# One blaze could impact multiple units
#
# This infographic shows number of units impacted by fire by neighborhood,
# but isn't this seriously misleading? https://infogr.am/sf_fires_by_zip-3
#
# Ok, no seriously, I'm setting aside this mission research, because the upside for getting it right is low
# but the downside for getting it wrong is very impactful. Not the sort of press we want
# TODO: check this out and compare it to the data set
# https://celestelecomptedotcom.files.wordpress.com/2015/04/15-04-05_wfs-greater-alarms-01-01-01-04-05-15.pdf
Just reading through the various articles, it seems that there's quite a bit of misinformation, and misuse of the dataset that is available for estimating fires. sf.curbed.com is saying there were 2,788 blazes in the Mission district over the full time period, but actually 2,788 is the number of units impacted. It could simply be a consequence of higher population density in that area, or the age of the buildings; there are a lot of reasons fires could be higher in the Mission than in other parts of the city.
However, I see a huge glaring problem in trying to make estimates regarding property damage values: over 90% of the data points and calls for service to the fire department have no damage estimate listed. Yes, it is true that from 2014 to 2015 the estimated property loss doubled, but let's take a little closer look, shall we?
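Because so many incidents lack a loss estimate, any year-over-year comparison is sensitive to how often losses get recorded, not just how large they are. One way to check this is to compute the null share overall and the recorded share per year. A minimal sketch on a toy frame (the column names match the real data; the rows are made up):

```python
import pandas as pd

# toy data standing in for the real incident records
toy = pd.DataFrame({
    "year": [2014, 2014, 2014, 2015, 2015, 2015],
    "estimated_property_loss": [None, None, 5000.0, None, 12000.0, 8000.0],
})

# overall fraction of incidents with no loss estimate
overall_null_share = toy.estimated_property_loss.isnull().mean()

# fraction of incidents per year that actually have a loss estimate
recorded_share = toy.groupby("year")["estimated_property_loss"].apply(
    lambda s: s.notnull().mean()
)
```

If `recorded_share` rises over time, a jump in total estimated losses may partly reflect better record-keeping rather than more destructive fires.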
In [25]:
mask = ((temp_df.zipcode.notnull()) &
(temp_df.zipcode.isin([94103, 94110])) &
(temp_df.year == 2014))
temp_df[mask].groupby('year').sum()['estimated_property_loss']
Out[25]:
https://celestelecompte.com/2015/04/25/open-data-fire-incident-report-san-francisco-2004-2015/
I noticed a quote from that original letter:
IMPORTANT – PLEASE NOTE: Entries contained in the attached report (including all monetary fire loss estimates) are intended for the sole use of the State Fire Marshal. Estimations and evaluations represent “most likely” and “most probable” cause and effect. Any representation as to the validity or accuracy of reported conditions (including all monetary fire loss estimates) outside the State Fire Marshal’s office is neither intended nor implied.
When this data was requested, the response letter was explicit that the estimates were for internal use only and not validated for outside use, and yet here we are using those estimates to claim that the cost of fires has gotten out of control.
So what do we do? We get all up in arms about a chart that somebody made about how the financial numbers are so much higher for the Mission:
https://infogr.am/YCxOktys5EEYfx8r
In case you missed it, that link to the infogr.am is titled "Financial Losses: Dramatic Increase in the Mission"
Wikipedia gives me two zipcodes, and using that, I'm able to get a rough guess of the same doubling effect of costs.
This other document has a different, more specific definition of the Mission:
The Mission District is defined for purposes of this report as the area bounded roughly by Market Street, Valencia Street, Cesar Chavez Street, U.S. 101, 23rd Street, Hampshire Street, 17th Street, Vermont Street, Division Street, and 11th Street. These boundaries correspond to Census tracts 177, 201, 208, 209, 228.01, 228.03, 228.09, 229.02, and 229.03.
In [27]:
mask = ((df.estimated_property_loss.notnull()))
df[mask].groupby('year').sum()['estimated_property_loss']
Out[27]:
In [28]:
# So based on the above data, yes, 2015 fire losses for those two zipcodes doubled,
# and we can look into why, but could it be a symptom of population growth?
# According to the document and report mentioned above, the population actually shrank. OK...
# But the data being looked at covers a HUGE period: there was a census report in 2000, and then
# a large 2009-2013 bucket. The change reported was a 9% decrease, not exactly a huge boom.
# My next theory is that the reason the cost increased is simply that they got better
# about capturing records for certain areas.
In [29]:
# Let's try a little experiment
# let's look at which fire areas are better at keeping records, shall we?
df['loss_recorded'] = 0
In [30]:
mask = ((df.estimated_property_loss.notnull()))
df.loc[mask, 'loss_recorded'] = 1
In [41]:
mask = ((df.zipcode.notnull()))
zipgroup = df[mask].groupby(['zipcode'])
In [66]:
zipgroup.mean()['loss_recorded'].plot(kind='barh')
Out[66]:
In [71]:
# the chart above shows the likelihood that the estimated_property_loss value
# is recorded, by zipcode.
# The Mission District falls within zipcodes 94103 and 94110
#
zipgroup.mean()['loss_recorded'][94103]
Out[71]:
In [72]:
zipgroup.mean()['loss_recorded'][94110]
Out[72]:
In [74]:
mask = ((df.estimated_property_loss.notnull()) &
(df.zipcode == 94110))
sns.distplot(df[mask].estimated_property_loss)
Out[74]:
In [75]:
mask = ((df.estimated_property_loss.notnull()) &
(df.zipcode == 94103))
sns.distplot(df[mask].estimated_property_loss)
Out[75]:
In [79]:
df['estimated_property_loss'] = pd.to_numeric(df['estimated_property_loss'])
In [84]:
df['estimated_property_loss'] = df['estimated_property_loss'].fillna(0)
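One side effect worth keeping in mind before calling describe(): filling the missing losses with 0 changes the summary statistics, because the many unrecorded incidents now count as zero-loss fires and drag the mean down. A toy illustration:

```python
import pandas as pd

# toy loss column: three recorded losses and three missing values
losses = pd.Series([1000.0, 3000.0, 5000.0, None, None, None])

mean_recorded = losses.mean()          # NaNs ignored: (1000+3000+5000)/3
mean_filled = losses.fillna(0).mean()  # NaNs count as 0: 9000/6
```

Whether the filled or unfilled mean is the right one depends on whether an unrecorded loss really means "no loss" or just "not measured"; given the State Fire Marshal letter above, it is probably the latter.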
In [87]:
df.info()
In [89]:
mask = ((df.estimated_property_loss.notnull()) &
(df.zipcode == 94103))
df[mask].estimated_property_loss.value_counts(dropna=False, normalize=True, bins=50)
Out[89]:
In [92]:
df['month'] = df.alarm_dttm.dt.month
In [95]:
mask = ((df.month == 6) & (df.year == 2016))
df[mask].describe()
Out[95]:
In [96]:
df.describe()
Out[96]:
In [97]:
df.alarm_dttm.min()
Out[97]:
In [98]:
df.alarm_dttm.max()
Out[98]:
In [ ]:
# What is odd is that civilian fire fatalities have a max value of 1, which raises the concern
# that the dataset is inaccurate and needs to be cleaned more carefully before we proceed.