2014 NFIRS FireCARES data statistics and loading

Overview

The purpose of this collection of scripts is to load the NFIRS yearly data into a dump of the existing FireCARES database. The 18 individual incident and ancillary tables (fireincident, hazchem, hazmat, etc.) are loaded into temporary *_2014 tables and then appended to the master tables of the same names. The bulk of this notebook deals w/ matching addresses in the geocoding results to the new incidents in "incidentaddress_2014" and appending those to the "incidentaddress" table. The incident_address_2014_aa/ab/ac/ad/ae tables contain the geocoding information loaded from shapefiles and are used to augment the incidentaddress_2014 table's records with geometries.

Assumptions

  • You have jupyter>=1.0.0, psycopg2, and their dependencies installed
  • See append_nfirs_yearly_data.sh for more information / prereqs and initial data loading
  • Database has been restored as "nfirs_2014"
  • All of the Incident_address_2014_a*.shp data has been loaded into the database as "incident_address_2014_aa/ab/ac/ad/ae" tables
  • Dates have been converted to a Postgres-parseable date format (MM/DD/YYYY) in the following files (a sketch of this cleanup follows this list):
    • fireincident.txt
    • incidentaddress.txt
    • arsonjuvsub.txt
    • basicincident.txt
    • civiliancasualty.txt
    • ffcasualty.txt
  • Null (\000) characters have been stripped from incidentaddress.txt and fdheader.txt
  • RISK_FACT1 codes in arsonjuvsub.txt have been replaced with their single-character equivalents (e.g. "U", "1", etc.)
  • The 18 ancillary tables have been loaded as [FILENAME w/o extension]_2014 and include data from:
    • fireincident.txt => maps to fireincident_2014
    • hazchem.txt
    • hazmat.txt
    • hazmatequipinvolved.txt
    • hazmobprop.txt
    • incidentaddress.txt
    • wildlands.txt
    • arson.txt
    • arsonagencyreferal.txt
    • arsonjuvsub.txt
    • basicaid.txt
    • basicincident.txt
    • civiliancasualty.txt
    • codelookup.txt
    • ems.txt
    • fdheader.txt
    • ffcasualty.txt
    • ffequipfail.txt
  • An id serial primary key has been added to the "incidentaddress_2014" table
  • "incidentaddress" table has source column added

The rest continues from (as of this writing) line 101 of append_nfirs_yearly_data.sh


In [1]:
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=nfirs_2014")
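
Note that psycopg2 connections start in transactional mode, so nothing below persists until conn.commit() runs. For reference, a single attribute flips the connection to autocommit instead (not used in this notebook):

conn.autocommit = True  # commit each statement immediately; skips explicit conn.commit()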

In [37]:
# Counts of the geolocated incident addresses

pd.read_sql_query("""
select 'AA' as table, count(1) from incident_address_2014_aa union
select 'AB' as table, count(1) from incident_address_2014_ab union  -- ab's .dbf was corrupted at record 494357
select 'AC' as table, count(1) from incident_address_2014_ac union
select 'AD' as table, count(1) from incident_address_2014_ad union
select 'AE' as table, count(1) from incident_address_2014_ae""", conn)


Out[37]:
table count
0 AD 500000
1 AC 499999
2 AB 494356
3 AA 499999
4 AE 116747

In [202]:
# Determine consistency between the aa/ab/ac/ad/ae tables and the destination tables
def same_cols():
    def cols(table):
        # Build a set of (column_name, data_type) pairs for a table
        q = "select column_name, data_type from information_schema.columns where table_name='%s'" % table
        df = pd.read_sql_query(q, conn)
        return set((r['column_name'], r['data_type']) for r in df.to_dict('records'))

    suffixes = ['aa', 'ab', 'ac', 'ad', 'ae']
    src = dict((s, cols('incident_address_2014_%s' % s)) for s in suffixes)
    dest_cols = cols('incidentaddress_2014')
    final_cols = cols('incidentaddress')

    union = src['aa'] | src['ab'] | src['ac'] | src['ad'] | src['ae']
    for s in suffixes:
        print '%s col diffs: ' % s.upper()
        print union - src[s]

    dest_union = dest_cols | final_cols
    print dest_union - dest_cols  # columns that incidentaddress_2014 is missing vs incidentaddress
    print 'Dest (incidentaddress_2014) col diffs: ' # Should only have "id" as a differing column
    print dest_union - final_cols

same_cols()
# aa's num_mile was numeric before this was run


AA col diffs: 
set([('full_stree', 'character varying'), ('num_mile', 'numeric'), ('inc_date', 'date')])
AB col diffs: 
set([('street_add', 'character varying'), ('y', 'numeric'), ('inc_date', 'numeric'), ('x', 'numeric'), ('num_mile', 'numeric')])
AC col diffs: 
set([('street_add', 'character varying'), ('num_mile', 'character varying'), ('y', 'numeric'), ('inc_date', 'numeric'), ('x', 'numeric'), ('zip4', 'character varying')])
AD col diffs: 
set([('street_add', 'character varying'), ('num_mile', 'character varying'), ('y', 'numeric'), ('inc_date', 'numeric'), ('x', 'numeric'), ('zip4', 'character varying')])
AE col diffs: 
set([('street_add', 'character varying'), ('y', 'numeric'), ('inc_date', 'numeric'), ('x', 'numeric'), ('num_mile', 'numeric')])
set([])
Dest (incidentaddress_2014) col diffs: 
set([('id', 'integer')])

In [204]:
# Change num_mile in aa/ac/ad to be a string vs number
for table in ['aa', 'ac', 'ad']:
    with conn.cursor() as cursor:
        cursor.execute("alter table incident_address_2014_%s alter column num_mile type character varying(8);" % table)

In [186]:
# Per 2013 data, the zip5 was unreliable
pd.read_sql_query("""
select 'AA' as table, count(1) from incident_address_2014_aa where zip5 = 0 union
select 'AB' as table, count(1) from incident_address_2014_ab where zip5 = 0 union
select 'AC' as table, count(1) from incident_address_2014_ac where zip5 = 0 union
select 'AD' as table, count(1) from incident_address_2014_ad where zip5 = 0 union
select 'AE' as table, count(1) from incident_address_2014_ae where zip5 = 0""", conn)
# and it looks to be the same in 2014's data, lots of zeros


Out[186]:
table count
0 AB 228
1 AA 22426
2 AE 92
3 AC 367
4 AD 722

In [187]:
# "Postal" might be a better option
pd.read_sql_query("""
select 'AA' as table, count(1) from incident_address_2014_aa where postal is null or length(postal) != 5 union
select 'AB' as table, count(1) from incident_address_2014_ab where postal is null or length(postal) != 5 union
select 'AC' as table, count(1) from incident_address_2014_ac where postal is null or length(postal) != 5 union
select 'AD' as table, count(1) from incident_address_2014_ad where postal is null or length(postal) != 5 union
select 'AE' as table, count(1) from incident_address_2014_ae where postal is null or length(postal) != 5""", conn)
# About the same...overall


Out[187]:
table count
0 AA 8942
1 AE 261
2 AB 844
3 AC 1742
4 AD 1088

In [198]:
# Ok, so use "postal" for zip5 IF zip5 is zero
with conn.cursor() as cursor:
    cursor.execute("""update incident_address_2014_aa set zip5 = to_number(postal, '99999') where zip5 = 0;""")
    print cursor.rowcount
    cursor.execute("""update incident_address_2014_ab set zip5 = to_number(postal, '99999') where zip5 = 0;""")
    print cursor.rowcount
    cursor.execute("""update incident_address_2014_ac set zip5 = to_number(postal, '99999') where zip5 = 0;""")
    print cursor.rowcount
    cursor.execute("""update incident_address_2014_ad set zip5 = to_number(postal, '99999') where zip5 = 0;""")
    print cursor.rowcount
    cursor.execute("""update incident_address_2014_ae set zip5 = to_number(postal, '99999') where zip5 = 0;""")
    print cursor.rowcount
    
# Already ran aa/ab, counts for ab was 228, missed aa


0
0
367
722
92
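
Since to_number() can error out on postal values that aren't plain digits, a more defensive variant of the update above would filter on a regex first. A hypothetical alternative, not what was run:

for table in ['aa', 'ab', 'ac', 'ad', 'ae']:
    with conn.cursor() as cursor:
        # only promote postal values that look like real 5-digit zips
        cursor.execute("update incident_address_2014_%s set zip5 = to_number(postal, '99999') "
                       "where zip5 = 0 and postal ~ '^\\d{5}$';" % table)
        print cursor.rowcount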

In [205]:
# Even-out some of the nullable fields: NULL never compares equal in the
# address-matching join later, so normalize NULLs/placeholders to empty strings
# and uppercase the street parts
stmts = [
    "update incident_address_2014_%s set street_pre = '' where street_pre is null;",
    "update incident_address_2014_%s set streettype = '' where streettype is null;",
    "update incident_address_2014_%s set streetsuf = '' where streetsuf is null;",
    "update incident_address_2014_%s set streetname = '' where streetname is null;",
    "update incident_address_2014_%s set num_mile = '' where num_mile = '0';",
    "update incident_address_2014_%s set apt_no = '' where apt_no = '0';",
    "update incident_address_2014_%s set city_1 = '' where city_1 is null;",
    "update incident_address_2014_%s set state_id = '' where state_id is null;",
    "update incident_address_2014_%s set streetname = upper(streetname);",
    "update incident_address_2014_%s set streettype = upper(streettype);",
    "update incident_address_2014_%s set streetsuf = upper(streetsuf);",
    "update incident_address_2014_%s set street_pre = upper(street_pre);",
]
for table in ['aa', 'ab', 'ac', 'ad', 'ae']:
    with conn.cursor() as cursor:
        print 'Updating %s' % table
        for stmt in stmts:
            cursor.execute(stmt % table)
            print cursor.rowcount
conn.commit()


Updating aa
0
0
462634
5867
120798
106
143
75
499999
499999
499999
499999
Updating ab
378307
123206
470898
6770
1377
122
36
32
494356
494356
494356
494356
Updating ac
451902
216651
486191
4089
98079
771
72
69
499999
499999
499999
499999
Updating ad
391965
133391
479133
7818
110751
28
125
113
500000
500000
500000
500000
Updating ae
89317
32882
103825
1181
2543
24
27
27
116747
116747
116747
116747
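
The normalization above matters because SQL equality against NULL yields NULL rather than true, so any NULL street part would silently drop its row from the address-matching join used later. A quick illustration:

pd.read_sql_query("select (null = null) as null_eq, ('' = '') as empty_eq", conn)
# null_eq comes back as None (SQL "unknown"); empty_eq comes back as True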

In [267]:
# Looks like we're good on the loc_type being the same set across all of the tables
import numpy as np
aa = pd.read_sql_query("select distinct loc_type from incident_address_2014_aa;", conn)
ab = pd.read_sql_query("select distinct loc_type from incident_address_2014_ab;", conn)
ac = pd.read_sql_query("select distinct loc_type from incident_address_2014_ac;", conn)
ad = pd.read_sql_query("select distinct loc_type from incident_address_2014_ad;", conn)
ae = pd.read_sql_query("select distinct loc_type from incident_address_2014_ae;", conn)

(np.sort(aa.transpose()) == np.sort(ab.transpose())).all() and \
(np.sort(aa.transpose()) == np.sort(ac.transpose())).all() and \
(np.sort(aa.transpose()) == np.sort(ad.transpose())).all() and \
(np.sort(aa.transpose()) == np.sort(ae.transpose())).all()


Out[267]:
True

In [268]:
# Ugh, num_mile's type is still inconsistent across these tables

tables = ['aa', 'ab', 'ac', 'ad', 'ae']

for table in tables:
    with conn.cursor() as cursor:
        print "Updating %s" % table
        cursor.execute("alter table incident_address_2014_%s alter column num_mile type character varying(254);" % table)
conn.commit()


Updating aa
Updating ab
Updating ac
Updating ad
Updating ae

In [270]:
# So we know the exact source of the geocoded address after merging these tables together
# (the column default back-fills every existing row with its source table's suffix)
for table in tables:
    with conn.cursor() as cursor:
        print "Updating %s" % table
        cursor.execute("alter table incident_address_2014_%s add column source character varying (8) default '%s';" % (table, table))
conn.commit()


Updating aa
Updating ab
Updating ac
Updating ad
Updating ae

In [271]:
same_cols()


AA col diffs: 
set([('full_stree', 'character varying'), ('inc_date', 'date')])
AB col diffs: 
set([('street_add', 'character varying'), ('y', 'numeric'), ('inc_date', 'numeric'), ('x', 'numeric')])
AC col diffs: 
set([('street_add', 'character varying'), ('zip4', 'character varying'), ('inc_date', 'numeric'), ('x', 'numeric'), ('y', 'numeric')])
AD col diffs: 
set([('street_add', 'character varying'), ('zip4', 'character varying'), ('inc_date', 'numeric'), ('x', 'numeric'), ('y', 'numeric')])
AE col diffs: 
set([('street_add', 'character varying'), ('y', 'numeric'), ('inc_date', 'numeric'), ('x', 'numeric')])
set([])
Dest (incidentaddress_2014) col diffs: 
set([('id', 'integer')])

In [276]:
# Make aa/ab/ac/ad/ae consistent

with conn.cursor() as cursor:
    cursor.execute("alter table incident_address_2014_aa drop column x, drop column y, drop column street_add, drop column inc_date, drop column zip4")
    cursor.execute("alter table incident_address_2014_ab drop column full_stree, drop column inc_date, drop column zip4")
    cursor.execute("alter table incident_address_2014_ac drop column full_stree, drop column inc_date")
    cursor.execute("alter table incident_address_2014_ad drop column full_stree, drop column inc_date")
    cursor.execute("alter table incident_address_2014_ae drop column full_stree, drop column inc_date, drop column zip4")
conn.commit()

In [277]:
same_cols()


AA col diffs: 
set([])
AB col diffs: 
set([])
AC col diffs: 
set([])
AD col diffs: 
set([])
AE col diffs: 
set([])
set([])
Dest (incidentaddress_2014) col diffs: 
set([('id', 'integer')])

In [285]:
# Create and stage the consolidated table w/ indexes (cloned column-for-column from
# incident_address_2014_aa with no rows, so the positional COPY below lines up)
with conn.cursor() as cursor:
    cursor.execute("create table address_to_geo_2014 as table incident_address_2014_aa with no data;")
    cursor.execute("create index on address_to_geo_2014 (num_mile, upper(apt_no), upper(city_1), zip5, upper(street_pre), loc_type, upper(streetname), upper(streettype), upper(streetsuf), upper(state_id));")
    cursor.execute("create index on address_to_geo_2014 (state_id);")
    cursor.execute("create index on address_to_geo_2014 (source);")
    cursor.execute("create index on address_to_geo_2014 (ogc_fid)")

conn.commit()

In [290]:
# Dump and load: spelling out ALL of the cols in the source and dest for an
# insert...select is tedious; since the tables are now column-identical, COPY
# can move the rows positionally instead
for table in tables:
    with conn.cursor() as cursor:
        print "Dumping %s" % table
        with open("/tmp/%s.dump" % table, "w") as dump:
            cursor.copy_to(dump, "incident_address_2014_%s" % table)

for table in tables:
    with conn.cursor() as cursor:
        print "Loading %s" % table
        with open("/tmp/%s.dump" % table) as dump:
            cursor.copy_from(dump, "address_to_geo_2014")
conn.commit()


Dumping aa
Dumping ab
Dumping ac
Dumping ad
Dumping ae
Loading aa
Loading ab
Loading ac
Loading ad
Loading ae

In [304]:
# Redundant, but a habit
tot = 0
for table in tables:
    with conn.cursor() as cursor:
        cursor.execute("select count(1) from incident_address_2014_%s" % table)
        tot = tot + cursor.fetchone()[0]

with conn.cursor() as cursor:
    cursor.execute("select count(1) from address_to_geo_2014")
    assert tot == cursor.fetchone()[0]

In [306]:
# Need to change the zip5 datatype to match incidentaddress_2014; lpad restores
# the leading zeros that were lost while zip5 was numeric
with conn.cursor() as cursor:
    cursor.execute("alter table address_to_geo_2014 alter column zip5 type character varying(5);")
    cursor.execute("update address_to_geo_2014 set zip5 = lpad(zip5, 5, '0')")
    print cursor.rowcount
conn.commit()


2111101

In [308]:
# Need to change the loc_type datatype to match incidentaddress_2014
with conn.cursor() as cursor:
    cursor.execute("alter table address_to_geo_2014 alter column loc_type type character varying(1);")
    print cursor.rowcount  # -1, as expected: DDL statements don't report a rowcount
conn.commit()


-1

In [311]:
# Ugh, forgot to update the apt_no
with conn.cursor() as cursor:
    cursor.execute("update address_to_geo_2014 set apt_no = '' where apt_no is null")
    print cursor.rowcount
conn.commit()


1959751

In [ ]:
# Just so we can track the source shapefile for the geocoding in case of potential issues
with conn.cursor() as cursor:
    cursor.execute("alter table incidentaddress_2014 add column source character varying (8);")
conn.commit()

In [322]:
# Let's see how many incidents there are in each state for 2014
from pretty import pprint
with conn.cursor() as cursor:
    cursor.execute("select state_id, count(state_id) from incidentaddress_2014 group by state_id order by state_id;")
    pprint(cursor.fetchall())
conn.commit()


[('', 316L),
 ('AK', 5479L),
 ('AL', 22411L),
 ('AR', 32438L),
 ('AS', 1L),
 ('AZ', 23583L),
 ('CA', 121490L),
 ('CO', 28841L),
 ('CT', 31352L),
 ('CZ', 14L),
 ('DC', 2852L),
 ('DE', 7722L),
 ('FL', 111954L),
 ('FM', 4L),
 ('GA', 78479L),
 ('GU', 3L),
 ('HI', 4643L),
 ('IA', 19825L),
 ('ID', 12369L),
 ('IL', 110442L),
 ('IN', 9042L),
 ('KS', 33279L),
 ('KY', 34278L),
 ('LA', 37369L),
 ('MA', 71458L),
 ('MD', 42167L),
 ('ME', 13142L),
 ('MH', 6L),
 ('MI', 68991L),
 ('MN', 29437L),
 ('MO', 50266L),
 ('MP', 2L),
 ('MS', 26934L),
 ('MT', 6569L),
 ('NA', 67L),
 ('NC', 93999L),
 ('ND', 3335L),
 ('NE', 8427L),
 ('NH', 17134L),
 ('NJ', 80926L),
 ('NM', 15651L),
 ('NV', 12967L),
 ('NY', 169930L),
 ('OH', 106017L),
 ('OK', 33544L),
 ('OO', 10945L),
 ('OR', 23105L),
 ('PA', 66202L),
 ('PR', 1L),
 ('RI', 5736L),
 ('SC', 44799L),
 ('SD', 4089L),
 ('TN', 46226L),
 ('TR', 3L),
 ('TX', 166568L),
 ('UM', 3L),
 ('UT', 12183L),
 ('VA', 55501L),
 ('VI', 27L),
 ('VT', 7107L),
 ('WA', 37057L),
 ('WI', 32502L),
 ('WV', 22887L),
 ('WY', 2650L)]


In [3]:
# Create some indexes so this query doesn't take years to complete
with conn.cursor() as cursor:
    cursor.execute("""
create index on incidentaddress_2014 (num_mile, upper(apt_no), upper(city), zip5, upper(street_pre),
    loc_type, upper(streetname), upper(streettype), upper(streetsuf), upper(state_id));""")
    
    cursor.execute("""
create index on address_to_geo_2014 (num_mile, upper(apt_no), upper(city_1), zip5, upper(street_pre), loc_type, upper(streetname), upper(streettype), upper(streetsuf), upper(state_id));
create index on address_to_geo_2014 (state_id);
create index on address_to_geo_2014 (source);""")

conn.commit()

In [4]:
# Execute the matching of the geocoded addresses to the 2014 incident addresses, updating the geometry associated w/ each incident

with conn.cursor() as cursor:
    cursor.execute("""
update incidentaddress_2014 as ia set geom = res.wkb_geometry, source = res.source
from (
    select ia.id, aa.wkb_geometry, aa.source from address_to_geo_2014 aa inner join incidentaddress_2014 ia on (
    aa.num_mile = ia.num_mile and
    upper(aa.apt_no) = upper(ia.apt_no) and
    upper(aa.city_1) = upper(ia.city) and
    aa.zip5 = ia.zip5 and
    upper(aa.street_pre) = upper(ia.street_pre) and
    aa.loc_type = ia.loc_type and
    upper(aa.streetname) = upper(ia.streetname) and
    upper(aa.streettype) = upper(ia.streettype) and
    upper(aa.streetsuf) = upper(ia.streetsuf) and
    upper(aa.state_id) = upper(ia.state_id)
) where aa.num_mile != '' and score != 0) as res
where ia.id = res.id
""")
    print cursor.rowcount
conn.commit()


1579373

In [18]:
# Now let's see how many incidents there are in each state for 2014 that are geocoded vs not geocoded
from decimal import *
from pretty import pprint

with conn.cursor() as cursor:
    cursor.execute("""
select (100.0 * sum(case when geom is null then 0 else 1 end) / count(1)) as percent_with_geom,
    sum(case when geom is null then 0 else 1 end) as with_geom, state_id,
    count(state_id) as total_incidents
from incidentaddress_2014 group by state_id order by percent_with_geom desc;
""")
    print '% | Has geom | State | Total rows'
    for row in cursor.fetchall():
        print '{} | {} | {} | {}'.format(row[0], row[1], row[2], row[3])
    
    cursor.execute("""
select sum(case when geom is null then 0 else 1 end) as matches,
    (100.0 * sum(case when geom is null then 0 else 1 end) / count(1)) as percent,
    count(1) as total
from incidentaddress_2014 order by percent desc;""")
    row = cursor.fetchone()
    print '# geocode matches: {}, Percent geocoded: {}, Total {}'.format(row[0], row[1], row[2])
conn.commit()


% | Has geom | State | Total rows
100.0000000000000000 | 2 | MP | 2
100.0000000000000000 | 1 | PR | 1
87.0118549511854951 | 4991 | RI | 5736
85.6591865357643759 | 2443 | DC | 2852
84.9580727619155418 | 90070 | OH | 106017
83.6550344259400930 | 142155 | NY | 169930
83.4364421617835882 | 14296 | NH | 17134
82.8301752555315768 | 34927 | MD | 42167
82.7683308886112167 | 91411 | IL | 110442
82.7382418123614873 | 10080 | UT | 12183
82.1075517206237049 | 24170 | MN | 29437
82.0735623498934075 | 30414 | WA | 37057
82.0090122869377816 | 58602 | MA | 71458
81.7176793056190041 | 8944 | OO | 10945
81.1706811706811707 | 6268 | DE | 7722
80.7551144161108097 | 75909 | NC | 93999
78.9861286254728878 | 15659 | IA | 19825
78.9305851487875230 | 54455 | MI | 68991
78.5714285714285714 | 11 | CZ | 14
78.1289760932163664 | 35001 | SC | 44799
77.9928388486091821 | 61208 | GA | 78479
77.2948462729484627 | 6989 | IN | 9042
77.2868957316084395 | 42895 | VA | 55501
77.2565495464154781 | 28870 | LA | 37369
76.8803131453820662 | 128058 | TX | 166568
76.6869746279015964 | 9944 | NV | 12967
76.4040361318389173 | 50581 | PA | 66202
75.2969048058581010 | 24473 | WI | 32502
75.0097381519151699 | 17331 | OR | 23105
74.0222188403591539 | 9728 | ME | 13142
73.9848750710525665 | 59873 | NJ | 80926
73.7145955600918602 | 23111 | CT | 31352
73.7124176000857495 | 82524 | FL | 111954
73.0790290198941413 | 4004 | AK | 5479
72.5715050644772656 | 16264 | AL | 22411
71.9342528321163496 | 23939 | KS | 33279
71.5231788079470199 | 20628 | CO | 28841
70.1996927803379416 | 8683 | ID | 12369
69.8957296759399472 | 32310 | TN | 46226
67.6510884457299437 | 4444 | MT | 6569
67.4348481585181994 | 3131 | HI | 4643
67.0254247403811722 | 33691 | MO | 50266
66.6666666666666667 | 18 | VI | 27
66.2264150943396226 | 1755 | WY | 2650
65.8597365610537558 | 5550 | NE | 8427
65.3440270156184044 | 4644 | VT | 7107
64.2767708084903410 | 21561 | OK | 33544
63.5103188294677656 | 9940 | NM | 15651
63.0155989888402491 | 20441 | AR | 32438
62.8785607196401799 | 2097 | ND | 3335
60.5108784436028811 | 16298 | MS | 26934
59.8167921115584340 | 20504 | KY | 34278
56.9350803544926430 | 13427 | AZ | 23583
50.0122279285888970 | 2045 | SD | 4089
47.9693801959008972 | 58278 | CA | 121490
47.7611940298507463 | 32 | NA | 67
44.9643902652160615 | 10291 | WV | 22887
33.3333333333333333 | 1 | GU | 3
33.3333333333333333 | 1 | UM | 3
25.0000000000000000 | 1 | FM | 4
16.6666666666666667 | 1 | MH | 6
0E-20 | 0 | AS | 1
0E-20 | 0 | TR | 3
0E-20 | 0 |  | 316
# geocode matches: 1579373, Percent geocoded: 74.6132507159574177, Total 2116746

In [31]:
# Just for sanity's sake, make sure that the column types are the same between the 2014 and base data

addl_tables = ['arson', 'arsonagencyreferal', 'arsonjuvsub', 'basicaid', 'basicincident', 'civiliancasualty', 'codelookup', 'ems', 'fdheader', 'ffcasualty', 'ffequipfail', 'fireincident', 'hazchem', 'hazmat', 'hazmatequipinvolved', 'hazmobprop', 'incidentaddress', 'wildlands']

for table in addl_tables:
    t = pd.read_sql_query("select column_name, data_type from information_schema.columns where table_name='%s'" % table, conn)
    t_2014 = pd.read_sql_query("select column_name, data_type from information_schema.columns where table_name='%s_2014'" % table, conn)
    t_cols = set((r['column_name'], r['data_type']) for r in t.to_dict('records'))
    t2014_cols = set((r['column_name'], r['data_type']) for r in t_2014.to_dict('records'))
    print 'Table: {}, col diffs: {}'.format(table, t2014_cols - t_cols)


Table: arson, col diffs: set([])
Table: arsonagencyreferal, col diffs: set([])
Table: arsonjuvsub, col diffs: set([])
Table: basicaid, col diffs: set([])
Table: basicincident, col diffs: set([])
Table: civiliancasualty, col diffs: set([])
Table: codelookup, col diffs: set([])
Table: ems, col diffs: set([])
Table: fdheader, col diffs: set([])
Table: ffcasualty, col diffs: set([])
Table: ffequipfail, col diffs: set([])
Table: fireincident, col diffs: set([])
Table: hazchem, col diffs: set([])
Table: hazmat, col diffs: set([])
Table: hazmatequipinvolved, col diffs: set([])
Table: hazmobprop, col diffs: set([])
Table: incidentaddress, col diffs: set([])
Table: wildlands, col diffs: set([])

In [32]:
# Get some pre-load counts of the base tables

with conn.cursor() as cursor:
    for t in addl_tables:
        cursor.execute("select '{table}', count(1) from {table};".format(table=t))
        print cursor.fetchone()
conn.commit()


('arson', 566083L)
('arsonagencyreferal', 310659L)
('arsonjuvsub', 500459L)
('basicaid', 3280569L)
('basicincident', 38518740L)
('civiliancasualty', 140780L)
('codelookup', 6622L)
('ems', 1291244L)
('fdheader', 39385L)
('ffcasualty', 109327L)
('ffequipfail', 35214L)
('fireincident', 7155944L)
('hazchem', 89683L)
('hazmat', 121318L)
('hazmatequipinvolved', 26399L)
('hazmobprop', 46351L)
('incidentaddress', 38518754L)
('wildlands', 625323L)

In [43]:
# Update the fdheader table first

with conn.cursor() as cursor:
    cursor.execute("select count(1) from fdheader_2014;")
    print 'fdheader_2014 count: %s' % cursor.fetchone()[0]
    cursor.execute("select count(1) from fdheader;")
    pre = cursor.fetchone()[0]
    cursor.execute("""INSERT INTO fdheader(
            state, fdid, fd_name, fd_str_no, fd_str_pre, fd_street, fd_str_typ,
            fd_str_suf, fd_city, fd_zip, fd_phone, fd_fax, fd_email, fd_fip_cty,
            no_station, no_pd_ff, no_vol_ff, no_vol_pdc)
    (SELECT distinct on (state, fdid) state, fdid, fd_name, fd_str_no, fd_str_pre, fd_street, fd_str_typ,
       fd_str_suf, fd_city, fd_zip, fd_phone, fd_fax, fd_email, fd_fip_cty,
       no_station, no_pd_ff, no_vol_ff, no_vol_pdc
  FROM fdheader_2014 where (state, fdid) not in (select state, fdid from fdheader));""")
    inserted = cursor.rowcount
    cursor.execute("select count(1) from fdheader;")
    post = cursor.fetchone()[0]
    print 'FDHeader Pre: {}, Post: {}, Insert count: {}, Post - pre: {}'.format(pre, post, inserted, post - pre)
conn.commit()


fdheader_2014 count: 38096
FDHeader Pre: 39385, Post: 39494, Insert count: 109, Post - pre: 109

In [55]:
# Update the "codeloop" (appears to be a static table) w/ the 2014 data (truncate and load)
with conn.cursor() as cursor:
    cursor.execute("delete from codelookup;")
    print 'Deleted: %s' % cursor.rowcount
    cursor.execute("insert into codelookup select * from codelookup_2014;")
    print 'Inserted: %s' % cursor.rowcount
conn.commit()


Deleted: 6622
Inserted: 6622

In [ ]:
# Insert the 2014 data into the foundation data
def load_2014(table, commit=True):
    with conn.cursor() as cursor:
        cursor.execute("select count(1) from %s;" % table)
        pre = cursor.fetchone()[0]
        cursor.execute("insert into %s select * from %s_2014;" % (table, table))
        inserted = cursor.rowcount
        cursor.execute("select count(1) from %s;" % table)
        post = cursor.fetchone()[0]

        print 'Table: {}, Pre: {}, Post: {}, Insert count: {}, Post - pre: {}'.format(table, pre, post, inserted, post - pre)
    if commit:
        conn.commit()
        print 'COMMITTED'
    else:
        conn.rollback()
        print 'ROLLBACK'
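
Passing commit=False gives a dry run: the pre/post counts print from inside the transaction and the insert is then rolled back. Hypothetical usage:

load_2014('arson', commit=False)  # prints the counts, then ROLLBACK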

In [8]:
# Load EVERYTHING except for basicincident, fireincident, incidentaddress
for table in ['arson', 'arsonagencyreferal', 'arsonjuvsub', 'basicaid', 'civiliancasualty', 'ems', 'ffcasualty',
              'ffequipfail', 'hazchem', 'hazmat', 'hazmatequipinvolved', 'hazmobprop', 'wildlands']:
    load_2014(table)


Table: arson, Pre: 566083, Post: 605099, Insert count: 39016, Post - pre: 39016
Table: arsonagencyreferal, Pre: 310659, Post: 318630, Insert count: 7971, Post - pre: 7971
Table: arsonjuvsub, Pre: 500459, Post: 505154, Insert count: 4695, Post - pre: 4695
Table: basicaid, Pre: 3280569, Post: 3532902, Insert count: 252333, Post - pre: 252333
Table: civiliancasualty, Pre: 140780, Post: 152166, Insert count: 11386, Post - pre: 11386
Table: ems, Pre: 1291244, Post: 1294841, Insert count: 3597, Post - pre: 3597
Table: ffcasualty, Pre: 109327, Post: 118027, Insert count: 8700, Post - pre: 8700
Table: ffequipfail, Pre: 35214, Post: 35588, Insert count: 374, Post - pre: 374
Table: hazchem, Pre: 89683, Post: 98271, Insert count: 8588, Post - pre: 8588
Table: hazmat, Pre: 121318, Post: 133177, Insert count: 11859, Post - pre: 11859
Table: hazmatequipinvolved, Pre: 26399, Post: 38258, Insert count: 11859, Post - pre: 11859
Table: hazmobprop, Pre: 46351, Post: 58210, Insert count: 11859, Post - pre: 11859
Table: wildlands, Pre: 625323, Post: 676124, Insert count: 50801, Post - pre: 50801

In [17]:
# There must have been an error with the MO 06101 entry: "basicincident_2014" has 101 references to (MO, 06101), which exists in neither fdheader table; fdheader records the department as (MO, 6101)

with conn.cursor() as cursor:
    cursor.execute("""select count(state), state, fdid from basicincident_2014 bi
where (state, fdid) not in (select state, fdid from fdheader)
and (state, fdid) not in (select state, fdid from fdheader_2014)
group by state, fdid""")
    print cursor.fetchall()
    
    cursor.execute("select count(1) from basicincident where state = 'MO' and fdid = '6101'")
    print cursor.fetchall()
    
    cursor.execute("select count(1) from basicincident_2014 where state = 'MO' and fdid = '6101'")
    print cursor.fetchall()


[(101L, 'MO', '06101')]
[(867L,)]
[(0L,)]

In [19]:
# Going to do a one-off update of (MO,06101) => (MO,6101) since that's the only offender
with conn.cursor() as cursor:
    cursor.execute("update basicincident_2014 set fdid = '6101' where state = 'MO' and fdid = '06101'")
    print cursor.rowcount
    conn.commit()


101

In [21]:
with conn.cursor() as cursor:
    cursor.execute("""select 'fireincident_2014', count(1) from fireincident_2014
union select 'basicincident_2014', count(1) from basicincident_2014
union select 'incidentaddress_2014', count(1) from incidentaddress_2014""")
    print cursor.fetchall()


[('fireincident_2014', 596521L), ('basicincident_2014', 2116746L), ('incidentaddress_2014', 2116746L)]

In [22]:
load_2014('basicincident', commit=True)


Table: basicincident, Pre: 38518740, Post: 40635486, Insert count: 2116746, Post - pre: 2116746

In [25]:
# Looks like fireincident_2014 has the same issue w/ that MO 6101 fire department

with conn.cursor() as cursor:
    cursor.execute("""select count(state), state, fdid from fireincident_2014 bi
where (state, fdid) not in (select state, fdid from fdheader)
and (state, fdid) not in (select state, fdid from fdheader_2014)
group by state, fdid""")
    print cursor.fetchall()
    
    cursor.execute("select count(1) from fireincident where state = 'MO' and fdid = '6101'")
    print cursor.fetchall()
    
    cursor.execute("select count(1) from fireincident_2014 where state = 'MO' and fdid = '6101'")
    print cursor.fetchall()


[(21L, 'MO', '06101')]
[(268L,)]
[(0L,)]

In [27]:
# Going to do a one-off update of (MO,06101) => (MO,6101) since that's the only offender
with conn.cursor() as cursor:
    cursor.execute("update fireincident_2014 set fdid = '6101' where state = 'MO' and fdid = '06101'")
    print cursor.rowcount
    conn.commit()


21

In [29]:
load_2014('fireincident', commit=True)


Table: fireincident, Pre: 7155944, Post: 7752465, Insert count: 596521, Post - pre: 596521

In [31]:
# Loading all of the 2014 incident addresses into the master "incidentaddress" table
with conn.cursor() as cursor:
    cursor.execute("select count(1) from incidentaddress_2014;")
    print 'incidentaddress_2014 count: %s' % cursor.fetchone()[0]
    cursor.execute("select count(1) from incidentaddress;")
    pre = cursor.fetchone()[0]
    cursor.execute("""INSERT INTO incidentaddress(
            state, fdid, inc_date, inc_no, exp_no, loc_type, num_mile, street_pre,
            streetname, streettype, streetsuf, apt_no, city, state_id, zip5,
            zip4, x_street, addid, addid_try, geom, bkgpidfp00, bkgpidfp10)
    (SELECT state, fdid, inc_date, inc_no, exp_no, loc_type, num_mile, street_pre,
       streetname, streettype, streetsuf, apt_no, city, state_id, zip5,
       zip4, x_street, addid, addid_try, geom, bkgpidfp00, bkgpidfp10
  FROM incidentaddress_2014);""")
    inserted = cursor.rowcount
    cursor.execute("select count(1) from incidentaddress;")
    post = cursor.fetchone()[0]
    print 'Incident address Pre: {}, Post: {}, Insert count: {}, Post - pre: {}'.format(pre, post, inserted, post - pre)
conn.commit()


incidentaddress_2014 count: 2116746
Incident address Pre: 38518754, Post: 40635500, Insert count: 2116746, Post - pre: 2116746

In [69]:
# Start moving things over to a separate schema
with conn.cursor() as cursor:
    cursor.execute("create schema if not exists geocoding_2014;")
conn.commit()

In [35]:
with conn.cursor() as cursor:
    cursor.execute("alter table address_to_geo_2014 set schema geocoding_2014;")
conn.commit()

In [36]:
tables = ["incident_address_2014_aa", "incident_address_2014_ab", "incident_address_2014_ac", "incident_address_2014_ad",
          "incident_address_2014_ae"]
with conn.cursor() as cursor:
    for table in tables:
        cursor.execute("alter table %s set schema geocoding_2014;" % table)
conn.commit()

In [37]:
# We should be able to drop all of the 2014 tables now

tables_2014 = ['arson_2014', 'arsonagencyreferal_2014', 'arsonjuvsub_2014', 'basicaid_2014', 'basicincident_2014',
               'civiliancasualty_2014', 'codelookup_2014', 'ems_2014', 'fdheader_2014', 'ffcasualty_2014',
               'ffequipfail_2014', 'fireincident_2014', 'hazchem_2014', 'hazmat_2014', 'hazmatequipinvolved_2014',
               'hazmobprop_2014', 'incidentaddress_2014', 'wildlands_2014']
with conn.cursor() as cursor:
    for table in tables_2014:
        cursor.execute("drop table %s;" % table)
conn.commit()

[x] arson_2014 -> arson
[x] arsonagencyreferal_2014 -> arsonagencyreferal
[x] arsonjuvsub_2014 -> arsonjuvsub
[x] basicaid_2014 -> basicaid
[x] basicincident_2014 -> basicincident
[x] civiliancasualty_2014 -> civiliancasualty
[x] codelookup_2014 -> codelookup
[x] ems_2014 -> ems
[x] fdheader_2014 -> fdheader
[x] ffcasualty_2014 -> ffcasualty
[x] ffequipfail_2014 -> ffequipfail
[x] fireincident_2014 -> fireincident
[x] hazchem_2014 -> hazchem
[x] hazmat_2014 -> hazmat
[x] hazmatequipinvolved_2014 -> hazmatequipinvolved
[x] hazmobprop_2014 -> hazmobprop
[x] incidentaddress_2014 -> incidentaddress
[x] wildlands_2014 -> wildlands

Dumping the db / compressing (-O/-x omit ownership and privilege statements so the dump restores cleanly elsewhere)

pg_dump -O -x nfirs_2014 > nfirs_2014.sql
mv nfirs_2014.sql nfirs_2014_addresses_matched.sql
gzip nfirs_2014_addresses_matched.sql
