This notebook uses a new set of cleaned interim data provided by Satya, and looks for correlations with weather variables such as ambient temperature.
The main question we're trying to answer is whether there's a correlation between weather data and FD service calls.
First we'll look at some simple correlations, just to see if there's a relationship in general between temperature and the number of calls. The null hypothesis is that there is no relationship.
After exploring that, we'll look at other ways to examine these variables, such as rate of change:
The alternative hypothesis is that the number of FD service calls increases significantly after the temperature becomes an outlier (hot or cold) for that time of day and day of week, compared with the same time of day and day of week in prior periods.
The null hypothesis is that there is no significant correlation between temperature outliers and service calls.
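As a rough sketch of how that outlier test could be set up (assuming an hourly dataframe with a datetime index, a TEMP column, and an incident_count column like the one built later in this notebook; the z-score threshold is an arbitrary choice):
In [ ]:
def flag_temp_outliers(hourly_df, z_thresh=2.0):
    # z-score each temperature against its (day of week, hour of day) group
    grp = hourly_df.groupby([hourly_df.index.dayofweek, hourly_df.index.hour])['TEMP']
    z = (hourly_df['TEMP'] - grp.transform('mean')) / grp.transform('std')
    return z.abs() > z_thresh

# then compare mean call volume during outlier hours vs. typical hours, e.g.:
# outliers = flag_temp_outliers(hourly_df)
# hourly_df.groupby(outliers)['incident_count'].mean()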
In [338]:
from __future__ import division, print_function
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [21]:
# we'll use the fresh download of the fire call csv, and do a little exploration
cols_to_use = ['Incident.Date',
               'Call.Number',
               'Zipcode',
               'Alarm.DtTm',
               'Location',
               'Incident.Number',
               'Close.DtTm',
               'Action.Taken.Primary',
               'type',
               'Address',
               'Neighborhood..District']
fdf = pd.read_csv('../../data/interim/Clean_Fire_Incidents.csv', usecols=cols_to_use)
In [26]:
# rename the columns so that it's easier to use built-in dot notation functions
fdf = fdf.rename(columns={'Incident.Number': 'incident_number',
                          'Incident.Date': 'incident_date',
                          'Call.Number': 'call_number',
                          'Zipcode': 'zipcode',
                          'Alarm.DtTm': 'alarm_dttm',
                          'Location': 'location',
                          'Close.DtTm': 'close_dttm',
                          'Action.Taken.Primary': 'action_taken_primary',
                          'type': 'call_type',
                          'Address': 'address',
                          'Neighborhood..District': 'neighborhood_district'})
In [29]:
fdf.incident_date = pd.to_datetime(fdf.incident_date)
In [30]:
fdf.head()
Out[30]:
In [31]:
fdf.info()
In [23]:
# based on above we have some typing to do
for col in fdf.columns:
    if 'dttm' in col.lower():
        fdf[col] = pd.to_datetime(fdf[col])
In [33]:
fdf.loc[0]
Out[33]:
action_taken_primary
Those values could be split, parsed, and mapped to an index, making the field a simple int. Something to try.
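As a rough sketch of that idea (just one way to do the mapping, not something run here), pandas can factorize the strings into integer codes plus a lookup array:
In [ ]:
# map each distinct action_taken_primary value to an integer code;
# `labels` keeps the original strings so the codes can be decoded later
codes, labels = pd.factorize(fdf['action_taken_primary'])
fdf['action_code'] = codes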
In [159]:
# we'll do the same thing with the weather csv, do a little exploration on it too
tdf = pd.read_table('../../data/external/noaa/5991787088242dat.txt', delim_whitespace=True)
In [160]:
# TODO: columns 4 and 21 have mixed dtypes, so that must be fixed. It looks like TEMP uses **** instead of a NaN value
In [161]:
# we need to fix the YR--MODAHRMN column
# YR--MODAHRMN = YEAR-MONTH-DAY-HOUR-MINUTE IN GREENWICH MEAN TIME (GMT)
tdf['data_time'] = pd.to_datetime(tdf['YR--MODAHRMN'], format='%Y%m%d%H%M')
In [162]:
tdf['data_time'].min() # kind of a bummer that the entire dataset only goes back to 2013. Why did we do that??
# TODO: requery NOAA for records going back to 2000.
Out[162]:
In [163]:
# since column 21 is TEMP and that is really important to us, we'll fix that one first
mask = tdf['TEMP'] == '****'
tdf.loc[mask, 'TEMP'] = np.nan
In [164]:
tdf.TEMP = pd.to_numeric(tdf.TEMP)
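A one-step alternative (just a note, not what was run above) is to let pandas coerce the '****' sentinels to NaN directly:
In [ ]:
# any non-numeric value, including '****', becomes NaN
tdf['TEMP'] = pd.to_numeric(tdf['TEMP'], errors='coerce')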
In [165]:
tdf.info()
In [166]:
tdf.head()
Out[166]:
In [167]:
# to make this a little easier to work with let's grab the station ids and limit our data points to only 1 station,
# perhaps the one with the fewest null values?
tdf.groupby('USAF').count()['TEMP']
Out[167]:
In [168]:
# 724940 is at SF International Airport, which is a little too far south for my taste. Let's see if we can get closer
# oh, it occurs to me that if the temp is absent, we should drop it from our dataset.
tdf = tdf[tdf.TEMP.notnull()]
tdf = tdf[tdf.USAF != 724940]
In [169]:
# the second-most records come from 998011, which is near Fairfield. We should get rid of those too
tdf = tdf[tdf.USAF != 998011]
In [170]:
tdf.groupby('USAF').count()['TEMP'].sort_values(ascending=False)
Out[170]:
In [171]:
# 992640 is from 18 nautical miles off the coast of SF, so not useful, we'll nix that too
tdf = tdf[tdf.USAF != 992640]
In [172]:
# 994016 is in the water off of Fort Mason, so a little better; we'll hold onto that one
# 998479 is pier 1 near the ferry building
# This is disappointing because it means we don't have any readings from inland in the city:
# we only have 2 station locations that are close enough to SF to give a good idea of what's going on
tdf.groupby('USAF').count()['TEMP'].sort_values(ascending=False)
Out[172]:
In [173]:
tdf.iloc[-1]
Out[173]:
In [174]:
tdf.iloc[-1].head()
Out[174]:
In [175]:
tdf.iloc[0].head()
Out[175]:
In [176]:
# of all the station locations, let's just start with the ferry building and see if there are correlations there
tdf.groupby('USAF').min()['data_time'].sort_values(ascending=False)
Out[176]:
In [177]:
tdf.groupby('USAF').max()['data_time'].sort_values(ascending=False)
Out[177]:
In [178]:
# since the min and max values are the same for both weather stations, I'll take the ferry building for now.
tdf = tdf[tdf.USAF == 998479]
In [179]:
tdf.shape
Out[179]:
In [180]:
sns.distplot(tdf.TEMP);
In [181]:
tdf.describe()
Out[181]:
In [182]:
tdf.tail()
Out[182]:
In [183]:
fdf.columns
Out[183]:
In [291]:
# now let's try to map both dataframes together, and run groupby functions and correlations
# in the interest of time we'll just focus on the TEMP field and go from there
# We could also make a mapping function to map the temperature for each call to the closest datetime value
# (a sketch of that follows this cell); it appears the granularity of the weather data is hourly.
# one simple way of dealing with this is to create a new dataframe, df,
# whose index is the datetime by the hour,
# then do a groupby for the number of incidents in that hour.
# so each item will have datetime, temperature, incident count
df = fdf.set_index('alarm_dttm').groupby(pd.TimeGrouper('H')).count()['incident_number']
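The "closest datetime value" mapping mentioned in the comment above isn't what's done here, but as a rough sketch (assuming a pandas version with pd.merge_asof and that both datetime columns are sorted), the nearest hourly temperature could be attached to each individual call:
In [ ]:
# attach the nearest weather reading (within an hour) to each call
calls_with_temp = pd.merge_asof(
    fdf.sort_values('alarm_dttm'),
    tdf[['data_time', 'TEMP']].sort_values('data_time'),
    left_on='alarm_dttm', right_on='data_time',
    direction='nearest', tolerance=pd.Timedelta('1H'))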
In [292]:
df = df.reset_index()
In [293]:
df.head()
Out[293]:
In [294]:
df.min()
Out[294]:
In [295]:
# I realized that if I don't have temperature data for 2003, I need to drop those rows, so let's limit df
# to only those dates where we have a temp value
tdf.USAF.value_counts()
Out[295]:
In [296]:
tdf.iloc[-1]
Out[296]:
In [297]:
temp_df = tdf[['data_time','TEMP']]
temp_df.head()
Out[297]:
In [298]:
df = df.rename(columns={'alarm_dttm':'dttm'})
In [299]:
temp_df = temp_df.rename(columns={'data_time':'dttm'})
In [300]:
df = df.merge(temp_df)
In [301]:
df.head()
Out[301]:
In [302]:
df.info()
In [303]:
df = df.rename(columns={'incident_number':'incident_count'})
In [304]:
sns.distplot(df.incident_count)
Out[304]:
In [305]:
sns.distplot(df.TEMP)
Out[305]:
In [306]:
sns.jointplot(x="TEMP", y="incident_count", data=df);
In [307]:
df = df.set_index('dttm')
In [308]:
ndf = df.groupby(pd.TimeGrouper('D')).mean()
In [309]:
sns.jointplot(x="TEMP", y="incident_count", data=ndf);
In [310]:
with sns.axes_style("white"):
sns.jointplot(x="TEMP", y="incident_count", data=ndf, kind='hex', color='k');
In [311]:
# TODO: look only at when the incident count is high
# TODO: did it wrong! rebuild with 'D' separately and merge
df = fdf.set_index('alarm_dttm').groupby(pd.TimeGrouper('D')).count()['incident_number']
df = df.reset_index()
In [312]:
temp_df = tdf[['data_time','TEMP']]
temp_df = temp_df.set_index('data_time').groupby(pd.TimeGrouper('d')).mean()['TEMP']
temp_df.head()
Out[312]:
In [313]:
temp_df = temp_df.reset_index().rename(columns={'data_time':'dttm'})
In [314]:
temp_df.head()
Out[314]:
In [315]:
df.head()
Out[315]:
In [316]:
df = df.rename(columns={'incident_number':'incident_count'})
In [317]:
df = df.rename(columns={'alarm_dttm':'dttm'})
In [318]:
df = df.merge(temp_df)
In [319]:
df.head()
Out[319]:
In [320]:
sns.jointplot(x="TEMP", y="incident_count", data=df);
In [321]:
sns.jointplot(x="TEMP", y="incident_count", data=df, kind='reg');
In [322]:
with sns.axes_style("white"):
sns.jointplot(x="TEMP", y="incident_count", data=ndf, kind='hex', color='k');
In [323]:
# TODO: explore the extreme outliers, what's up with them?
In [324]:
sns.jointplot(x="TEMP", y="incident_count", data=ndf, kind='kde', color='k');
In [325]:
df.head()
Out[325]:
In [326]:
df.describe()
Out[326]:
In [327]:
df[:365].set_index('dttm').TEMP.plot()
Out[327]:
In [328]:
df[:365].set_index('dttm').incident_count.plot()
Out[328]:
In [330]:
df.shape
Out[330]:
In [331]:
df.incident_count.sum()
Out[331]:
In [332]:
df.set_index('dttm').plot(kind='scatter', y='incident_count', x='TEMP')
Out[332]:
In [335]:
df.set_index('dttm').plot(kind='hexbin', y='incident_count', x='TEMP')
Out[335]:
In [334]:
# nothing is jumping out here, but check the pearsonr
# TODO: rolling mean and rolling std
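A quick sketch of those two checks (assuming scipy is available and that the daily df above has no remaining NaNs in TEMP or incident_count):
In [ ]:
from scipy import stats

# Pearson correlation between daily mean temperature and daily call count
r, p = stats.pearsonr(df['TEMP'], df['incident_count'])
print(r, p)

# 7-day rolling mean and std of call counts to smooth out day-of-week effects
rolling = df.set_index('dttm')['incident_count'].rolling(window=7)
rolling.mean().plot()
rolling.std().plot();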
In aggregate, when looking at daily averages, and also when looking at hourly data, there is no strong evidence that would suggest we can reject the null hypothesis.
Next steps: try some other variables from NOAA to see if there are correlations there. Make a master dataframe with all data by the hour from fdf and merge it with all numerical data from NOAA.
Do a proper cleaning and filling of NA values with averages and zeros, instead of just omitting them from the dataset.
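A rough sketch of that master-dataframe step (the numeric-column selection and the fill strategy here are assumptions about how it might be done, not code that was run):
In [ ]:
# hourly incident counts from the fire data
hourly_calls = (fdf.set_index('alarm_dttm')
                   .groupby(pd.TimeGrouper('H'))['incident_number']
                   .count()
                   .rename('incident_count'))

# hourly means of every numeric NOAA column
hourly_weather = (tdf.set_index('data_time')
                     .select_dtypes(include=[np.number])
                     .groupby(pd.TimeGrouper('H'))
                     .mean())

master = hourly_weather.join(hourly_calls, how='inner')

# missing call counts mean no calls that hour; fill weather gaps with column means
master['incident_count'] = master['incident_count'].fillna(0)
master = master.fillna(master.mean())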
In [ ]: