In the last notebook (0.6), there was not a strong correlation between service calls for actual fires and the temperature.
In this next analysis, we'll look at ALL service calls, including false alarms and medical emergencies, and attempt to programmatically access the NOAA dataset.
This notebook is querying the "Fire Incidents" dataset:
Fire Incidents includes a summary of each (non-medical) incident to which the SF Fire Department responded. Each incident record includes the call number, incident number, address, number and type of each unit responding, call type (as determined by dispatch), prime situation (field observation), actions taken, and property loss.
In [237]:
from __future__ import division, print_function
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
First we will fetch the fire data, the most recent 100,000 rows. We will also clean it by dropping columns and converting date-related columns to datetime objects.
IMPORTANT TO NOTE: the resource identifier and link used are described as follows:
Fire Incidents includes a summary of each (non-medical) incident to which the SF Fire Department responded. Each incident record includes the call number, incident number, address, number and type of each unit responding, call type (as determined by dispatch), prime situation (field observation), actions taken, and property loss.
That means the data includes only non-medical calls to the fire department. If we want to look at ALL calls to the Fire Department, that's a different dataset.
In [238]:
query_url = 'https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset={}&$limit={}'
offset = 0
limit = 100000
fdf = pd.read_json(query_url.format(offset, limit))
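The `$offset`/`$limit` parameters implement Socrata paging. A hedged sketch of how one might walk multiple pages (URL construction only here; the actual fetch-and-concatenate step is left commented out to avoid repeated network calls):

```python
# Sketch: build paged Socrata query URLs against the same wbb6-uh78 endpoint.
# The page_urls helper is an assumption for illustration, not part of the API.
query_url = ('https://data.sfgov.org/resource/wbb6-uh78.json'
             '?$order=close_dttm%20DESC&$offset={}&$limit={}')

def page_urls(n_pages, page_size=100000):
    """Return the query URLs for the first n_pages pages."""
    return [query_url.format(i * page_size, page_size) for i in range(n_pages)]

urls = page_urls(3)
# frames = [pd.read_json(u) for u in urls]  # one request per page
# fdf = pd.concat(frames, ignore_index=True)
```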
In [239]:
def clean_fire_data(df):
    cols_to_drop = ["automatic_extinguishing_sytem_failure_reason",
                    "automatic_extinguishing_sytem_type",
                    "battalion",
                    "box",
                    "call_number",
                    "detector_effectiveness",
                    "detector_failure_reason",
                    "ems_personnel",
                    "ems_units",
                    "exposure_number",
                    "first_unit_on_scene",
                    "ignition_factor_secondary",
                    "mutual_aid",
                    "no_flame_spead",
                    "other_personnel",
                    "other_units",
                    "station_area",
                    "supervisor_district"]
    df = df.drop(cols_to_drop, axis=1)
    for col in df.columns:
        if 'dttm' in col:
            df[col] = pd.to_datetime(df[col])
    return df
In [240]:
fdf = clean_fire_data(fdf)
In [241]:
print(fdf.alarm_dttm.min())
print(fdf.alarm_dttm.max())
In [242]:
# we'll do the same thing with the weather csv, do a little exploration on it too
tdf = pd.read_table('../../data/external/noaa/5991787088242dat.txt', delim_whitespace=True)
In [243]:
# for col in tdf.columns:
# print(col)
In [244]:
def clean_noaa_download(df):
    # we need to fix the YR--MODAHRMN column
    # The Ferry Building data sucks. I'll look at the SF Airport data
    # df = df[df.USAF == 998479]  # only getting the Ferry Building data
    df = df[df.USAF == 724940]  # SFO data
    mask = df['TEMP'] == '****'
    df.loc[mask, 'TEMP'] = np.nan
    df.TEMP = pd.to_numeric(df.TEMP, errors='coerce')
    df.DIR = pd.to_numeric(df.DIR, errors='coerce')
    df.loc[df.SLP == '******', 'SLP'] = np.nan
    df.SLP = pd.to_numeric(df.SLP, errors='coerce')
    # df.PCP24 = pd.to_numeric(df.PCP24, errors='coerce')
    # df.DEWP = pd.to_numeric(df.DEWP, errors='coerce')
    # YR--MODAHRMN = YEAR-MONTH-DAY-HOUR-MINUTE IN GREENWICH MEAN TIME (GMT)
    df['dttm'] = pd.to_datetime(df['YR--MODAHRMN'], format='%Y%m%d%H%M')
    cols_to_keep = [  # 'USAF',
                    # "WBAN",
                    # "YR--MODAHRMN",
                    # "DIR",
                    # "SPD",
                    # "GUS",
                    # "CLG",
                    # "SKC",
                    # "L",
                    # "M",
                    # "H",
                    # "VSB",
                    # "MW",
                    # "MW.1",
                    # "MW.2",
                    # "MW.3",
                    # "AW",
                    # "AW.1",
                    # "AW.2",
                    # "AW.3",
                    # "W",
                    "TEMP",
                    # "DEWP",
                    # "SLP",
                    # "ALT",
                    # "STP",
                    # "MAX",
                    # "MIN",
                    # "PCP01",
                    # "PCP06",
                    # "PCP24",
                    # "PCPXX",
                    # "SD",
                    'dttm']
    df = df[cols_to_keep]
    return df
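The `YR--MODAHRMN` parsing above can be checked on a toy value. The column packs year, month, day, hour, and minute (in GMT) into one number, so we parse it as a string with an explicit format:

```python
import pandas as pd

# A toy YR--MODAHRMN value: 2014-12-11 14:30 GMT packed as YYYYMMDDHHMM
raw = pd.Series(['201412111430'])
parsed = pd.to_datetime(raw, format='%Y%m%d%H%M')
print(parsed.iloc[0])  # 2014-12-11 14:30:00
```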
In [245]:
tdf = clean_noaa_download(tdf)
In [246]:
tdf.describe()
Out[246]:
In [247]:
tdf.shape
Out[247]:
In [248]:
tdf.head()
Out[248]:
In [249]:
# tdf.DEWP.value_counts(dropna=False)
In [250]:
tdf.info()
In [251]:
# sns.pairplot?
In [252]:
# with open('../../config.json', 'r') as fh:
# weather_api_key = json.load(fh)['ncdc']['token']
In [253]:
sns.distplot(tdf[tdf.TEMP.notnull()].TEMP);
In [254]:
tdf.head()
Out[254]:
In [255]:
tdf = tdf.set_index('dttm')
In [256]:
gdf = fdf.rename(columns={'alarm_dttm':'dttm'}).set_index('dttm').groupby(pd.TimeGrouper('H')).count()['incident_number']
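The hourly grouping above can be sketched on toy data; `resample('H')` is equivalent to `groupby(pd.TimeGrouper('H'))` (and replaces it in newer pandas):

```python
import pandas as pd

# Toy incident log: three alarms in hour 00, one in hour 02
toy = pd.DataFrame({
    'dttm': pd.to_datetime(['2015-02-06 00:05', '2015-02-06 00:40',
                            '2015-02-06 00:59', '2015-02-06 02:10']),
    'incident_number': [1, 2, 3, 4],
})
# Count incidents per hour; empty hours appear as 0
counts = toy.set_index('dttm').resample('H').count()['incident_number']
print(counts)
```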
In [257]:
gdf.head()
Out[257]:
In [258]:
gdf.tail()
Out[258]:
In [259]:
gdf.describe()
Out[259]:
In [260]:
tdf.head()
Out[260]:
In [261]:
tdf.shape
Out[261]:
In [262]:
ndf = gdf.reset_index().merge(tdf.reset_index(), on='dttm')
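Note that `merge` defaults to an inner join, so any hour missing from either the incident counts or the weather table is silently dropped. A toy sketch of that behavior:

```python
import pandas as pd

left = pd.DataFrame({'dttm': pd.to_datetime(['2015-02-06 00:00',
                                             '2015-02-06 01:00']),
                     'incident_number': [3, 22]})
right = pd.DataFrame({'dttm': pd.to_datetime(['2015-02-06 01:00',
                                              '2015-02-06 02:00']),
                      'TEMP': [64.0, 60.0]})
# Inner join: only the 01:00 hour appears in both frames
merged = left.merge(right, on='dttm')
print(merged)
```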
In [263]:
sns.distplot(ndf.incident_number, hist=True, rug=True);
In [264]:
sns.distplot(ndf.TEMP, hist=True, rug=True);
In [265]:
# sns.jointplot(x="TEMP", y="incident_count", data=ndf, kind='reg');
sns.pairplot(ndf, kind='reg')
Out[265]:
In [266]:
sns.jointplot(x='TEMP', y='incident_number', data=ndf, kind='reg')
Out[266]:
In [267]:
ndf.corr()
Out[267]:
There is clearly not a strong correlation between the number of incidents in a given hour, and the temperature of that hour. We cannot as of yet reject the null hypothesis, which is that there is no significant correlation between temperature and fire department calls for service.
We can, however, change the experiment, and see if the change in temperature earlier in the day will impact the number of alarms, and the type of alarms, that occur later on in the day.
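That lagged-effect experiment could be sketched by shifting the temperature series before correlating. This is a hedged sketch on fabricated data where the calls track the temperature three hours earlier exactly; `shift(3)` stands in for "temperature three hours ago":

```python
import pandas as pd

# Fabricated hourly frame: calls are constructed to equal TEMP three
# hours earlier divided by 10, so the lagged correlation is perfect.
toy = pd.DataFrame({
    'TEMP':  [50, 55, 60, 65, 60, 55, 50, 45],
    'calls': [0, 0, 0, 5.0, 5.5, 6.0, 6.5, 6.0],
})
toy['TEMP_3h_ago'] = toy['TEMP'].shift(3)
lagged_corr = toy[['TEMP_3h_ago', 'calls']].dropna().corr()
print(lagged_corr)
```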
One thing I do find interesting is that outlier in the previous chart.
In [268]:
ndf.describe()
Out[268]:
When I get the descriptive analytics, I see that the average number of calls in a single hour is 3.2 and yet the max is 22 (it's in the table above, under 'max' of 'incident_number'). That's a big enough difference to be interesting!
We will come back to the general value of outliers, but first let's do some broader exploration of our dataset, to visually see what patterns are unusual.
In [269]:
ndf[['incident_number','TEMP','dttm']].set_index('dttm').plot(figsize=(20,10), kind='line')
Out[269]:
In [292]:
ndf[ndf.incident_number == ndf.incident_number.describe()['max']]
Out[292]:
The above chart is pretty straightforward. When we created ndf, it was simply a record of the weather observations we had for each hour in San Francisco, and then we overlaid the total count of incidents (calls to the fire department) for that hour. I find it interesting that there are no meaningful seasonal patterns in the incident data, the way we see with the temperature, but that could simply mean we need to zoom out more and do groupings and averages by week or month.
You can see that in this view, which is hourly, there is a big spike on February 06 2015. The temperature was at the 75th percentile at 64 degrees, but the incident count jumped to 22 within a single hour, compared to the mean of 3.2 and the median of 3.0 incidents per hour. Let's dig into this outlier for a moment, and also check the news to find out why.
In [295]:
mintime = pd.Timestamp("Feb 06 2015")
maxtime = pd.Timestamp("Feb 07 2015")
mask = ((fdf.alarm_dttm >= mintime) & (fdf.alarm_dttm < maxtime))
fdf[mask].primary_situation.value_counts().head()
Out[295]:
Aha! So the power lines, electrical wiring issues, and severe weather or natural disaster calls suggest this is a weather-related event that caused the outlier.
A quick Google search for "February 06 2015 Power" reveals this news story:
Nearly 15,000 PG&E customers in the Bay Area were without power Friday afternoon because of the high winds and rain passing through the area, PG&E officials said.
Let's keep exploring before we circle back to this.
This time, instead of a count of incidents by the hour, let's look at a grouping of incidents by day (for now we'll ignore the temperature).
In [308]:
fdf.rename(columns={'alarm_dttm':'dttm'}).set_index('dttm').groupby(pd.TimeGrouper('D')).count()['incident_number'].plot(figsize=(20,10))
Out[308]:
In the above chart, there is a HUGE spike and a number of smaller spikes during the month of December 2014. It turns out this spike occurred on December 11 2014. So we'll look into that more...
In [318]:
mintime = pd.Timestamp("December 11 2014")
maxtime = pd.Timestamp("December 12 2014")
mask = ((fdf.alarm_dttm >= mintime) & (fdf.alarm_dttm < maxtime))
fdf[mask].primary_situation.value_counts(dropna=False).head()
Out[318]:
Further exploration reveals that the biggest spike in early December 2014 was caused by another big storm. You can read about it here
I'm not seeing a strong correlation between the raw number of incidents and the ambient temperature. This could simply be because I need to look at low temperatures separately from high temperatures, or do a better job of segmenting the data by zip code, etc. There's still work to be done to clean and munge that data to try and reject our null hypothesis.
First though, let's take a look at an aggregation by month.
In [309]:
ndf['month'] = ndf.dttm.dt.month
In [311]:
sns.boxplot(x='month', y='incident_number', data=ndf)
Out[311]:
In [310]:
sns.violinplot(x='month', y='incident_number', data=ndf)
Out[310]:
Based on the above box and violin plots, I find the outliers of December and February interesting, but also it's interesting how even the medians are across most months. The medians visually appear unchanged across all months, except for a higher value in December and a lower value in March.
In [312]:
ndf['year'] = ndf.dttm.dt.year
In [314]:
sns.boxplot(x='year', y='incident_number', data=ndf[ndf.month == 12])
Out[314]:
In [313]:
sns.violinplot(x='year', y='incident_number', data=ndf[ndf.month == 12])
Out[313]:
Looking at the years, it seems 2014 was elevated in many ways, and the kde plots that shape the violin plot are interesting in how each year takes on a different distribution. Not sure yet what that means, but another item to circle back and explore, especially when we have more years to work with.
So far it seems that the main cause of a large number of calls to the fire department is precipitation. It's long been a joke that Californians don't know how to drive in the rain, particularly when there are long dry spells between storms. The roads accumulate oils, dust, and debris, and a sudden downpour creates road surfaces that are both slick and jagged. The slick surface from water and oil can cause a vehicle to spin out, as can a tire puncture from newly exposed debris.
At least, theoretically, that's what we hear about driving in California in the winter. What we see so far, however, are massive storms dropping several inches of rain, flooding roads, and blowing down trees that knock out power.
I found it interesting that on December 11 2014, the most common cause of calls was an alarm system being activated due to malfunction (43 incidents), and that's not counting 17 calls that were categorized as "false alarm or false call, other".
The second most common call that day was related to down power lines, and the third most common was surprising to me: there were 19 incidents where the SFFD was contacted to remove victim(s) from stalled elevators.
Originally I began exploring this data looking for ambient temperature as having a correlation with behavior changes, either as people began turning on heating systems in the winter, or becoming more social and careless with open flame in the summer. I do not see that happening (yet) with the fire data, but instead there seems to be a very large burden on the fire department when a large storm hits the SF Bay Area. So that's what we'll probe on next: what variable does the weather service provide that can be used as a predictor for an increased need for resources with the fire department?
In [288]:
# TODO NEXT:
# since the goal of this analysis would be to predict the future number of calls, I'll have to get trickier and
# do a moving average of timeshift
# but first, do a simple graph where tdf is simply PRECIPITATION
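The moving-average / time-shift idea in the TODO above could be sketched like this (a hedged sketch on a toy hourly series; the 3-hour window and 1-hour shift are placeholder choices, not settled parameters):

```python
import pandas as pd

hours = pd.date_range('2015-02-06', periods=6, freq='H')
precip = pd.Series([0.0, 0.1, 0.4, 0.2, 0.0, 0.0], index=hours)

# 3-hour moving average of precipitation...
precip_ma = precip.rolling(window=3, min_periods=1).mean()
# ...shifted forward one hour, so each row carries recent-past
# precipitation, usable as a predictor for that hour's call volume
feature = precip_ma.shift(1)
print(feature)
```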
We will grab the NOAA data again, but this time we will keep only the precipitation columns.
In [329]:
def get_noaa_prec_dataframe():
    df = pd.read_table('../../data/external/noaa/5991787088242dat.txt', delim_whitespace=True)
    # The Ferry Building data sucks. I'll look at the SF Airport data
    df = df[df.USAF == 724940]  # SFO data
    # YR--MODAHRMN = YEAR-MONTH-DAY-HOUR-MINUTE IN GREENWICH MEAN TIME (GMT)
    df['dttm'] = pd.to_datetime(df['YR--MODAHRMN'], format='%Y%m%d%H%M')
    cols_to_keep = [  # 'USAF',
                    "PCP01",
                    "PCP06",
                    "PCP24",
                    'dttm']
    df = df[cols_to_keep]
    for col in df.columns:
        if 'PCP' in col:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    return df
In [330]:
tdf = get_noaa_prec_dataframe()
In [331]:
tdf.info()
In [332]:
tdf.describe()
Out[332]:
Yikes! The non-null values in this table are very sparse. At least PCP01 has a sizeable amount, in the neighborhood of 30K non-null data points. However, I think it's worth looking at PCP06 and PCP24 as well.
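The sparsity can also be quantified directly rather than plotted (a sketch on a toy frame; the real columns are PCP01/PCP06/PCP24):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'PCP01': [0.0, np.nan, 0.1, np.nan],
    'PCP24': [np.nan, np.nan, 1.2, np.nan],
})
non_null = toy.notnull().sum()   # non-null count per column
frac = toy.notnull().mean()      # fraction of usable rows per column
print(non_null)
print(frac)
```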
In [337]:
sns.distplot(tdf.PCP01.notnull())
Out[337]:
In [339]:
sns.distplot(tdf.PCP06.notnull())
Out[339]:
In [338]:
sns.distplot(tdf.PCP24.notnull())
Out[338]:
Because I'm a big fan of looking at statistical outliers to uncover hidden trends worth probing, we will use this definition of an outlier from Wolfram Alpha:
A convenient definition of an outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.
So let's quickly figure out what those boundaries are for incident_number, but also for TEMP.
In [320]:
def outlier_range(s):
    """
    Given a pandas Series, return the lower and upper boundaries
    for outliers, defined as 1.5 times the interquartile range
    above the third quartile or below the first quartile."""
    first_quartile = s.quantile(0.25)
    third_quartile = s.quantile(0.75)
    iqr = third_quartile - first_quartile
    return first_quartile - 1.5 * iqr, third_quartile + 1.5 * iqr
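As a self-contained sketch of that IQR definition on a plain pandas Series (the toy values are fabricated so the boundaries are easy to check by hand):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 22])  # 22 is the lone outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(lower, upper, list(outliers))  # -1.0 7.0 [22]
```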