Exploring extracting data from the EPA web service

https://www.epa.gov/enviro/web-services

I am using the Python Requests (http://docs.python-requests.org/en/master/) library to pull quantitative data from the EPA web service. The REST service itself is reasonably well documented at https://www.epa.gov/enviro/web-services


In [70]:
import requests
import io
import pandas
from itertools import chain

Example REST query:


In [17]:
url='https://iaspub.epa.gov/enviro/efservice/tri_facility/state_abbr/VA/count/json'

requests.get(url).text


Out[17]:
'[{"TOTALQUERYRESULTS":1146}]'
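The response is a one-element JSON list, so the count can be pulled out directly. A minimal sketch, using the literal payload shown above rather than a live request:

```python
import json

# The count endpoint returns a one-element JSON list
payload = '[{"TOTALQUERYRESULTS":1146}]'
count = json.loads(payload)[0]['TOTALQUERYRESULTS']
print(count)  # 1146
```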

I can use this to extract the table data:


In [18]:
def makeurl(tablename,start,end):
    return "https://iaspub.epa.gov/enviro/efservice/{tablename}/JSON/rows/{start}:{end}".format_map(locals())

url = makeurl('t_design_for_environment', 1, 19)
out = requests.get(url)

In [19]:
pandas.DataFrame(out.json())


Out[19]:
CATEGORY CITY COMPANY_NAME PARTNERSINCE PRODUCT_NAME SECTOR STATE UPCS
0 Consumer Product El Segundo Fresh %26 Easy Neighborhood Market Inc. 2010 Green Things Glass Cleaner Window/Glass Cleaners California None
1 Consumer Product Fredonia Guy %26 O%27Neill, Inc. 2009 Green %26 Clean Green Glass Cleaner Window/Glass Cleaners Wisconsin None
2 Consumer Product Newport Beach ICB-PTY, Ltd. 2010 Earth Rescue Glass %26 Window Cleaner Window/Glass Cleaners California None
3 Consumer Product Cleveland Kinzua Environmental, Inc. 2008 WOW Glass Cleaner RTU Window/Glass Cleaners Ohio None
4 Consumer Product Franklin pHurity, Inc. 2008 Glass %26 High Gloss Surface Cleaner Window/Glass Cleaners Michigan None
5 Consumer Product Franklin pHurity, Inc. 2008 Glass %26 High Gloss Surface Cleaner Free %26 ... Window/Glass Cleaners Michigan None
6 Consumer Product Franklin pHurity, Inc. 2008 Glass Cleaner Window/Glass Cleaners Michigan None
7 Consumer Product Roseville PRIDE Industries 2009 PRIDEClean Glass %26 Mirror Cleaner Window/Glass Cleaners California 782900230236, 782900013013
8 Consumer Product Hempstead Professional Sanitary Industries 2016 ProSan Glass Master Glass %26 Window Cleaner Window/Glass Cleaners New York None
9 Consumer Product Pleasanton Safeway Inc. 2012 Bright Green Glass %26 Surface Cleaner Window/Glass Cleaners California 843560000697, 079893200071
10 Consumer Product Oakland Serenade Solutions LLC 2015 Serenade Key 64 Green Glass %26 Surface Concen... Window/Glass Cleaners California 1818SYM
11 Consumer Product Oakland Serenade Solutions LLC 2015 Serenade Key Green Glass %26 Surface Cleaner Window/Glass Cleaners California 1819SYM
12 Consumer Product Framingham Staples Contract %26 Commercial, Inc. 2004 Sustainable Earth Glass Cleaner (RTU) Window/Glass Cleaners Massachusetts None
13 Consumer Product Rochester Wegmans 2014 Glass Cleaner Window/Glass Cleaners New York 77890347843
14 Consumer Product Wood Dale Earth Friendly Products 2008 Furniture Polish Wood Cleaners Illinois 749174097316, 749174973139, 749174497314
15 Consumer Product Wood Dale Earth Friendly Products 2008 WaveJet Rinse Aid Rinse Aids Illinois 74917497477
16 Consumer Product Istanbul Friendly Organic LLC-BST Ltd. 2013 Rinse Aid Rinse Aids None 8.68009E+12
17 Consumer Product Santa Monica Honest Company 2012 Honest Rinse Aid Rinse Aids California None
18 Consumer Product Newport Beach ICB-PTY, Ltd. 2010 Earth Rescue Auto Dishwasher Rinse Aid Rinse Aids California None

Before we go further, we'll want to know the size of each table:


In [20]:
def table_count(tablename):
    url = "https://iaspub.epa.gov/enviro/efservice/{tablename}/COUNT/JSON".format_map(locals())
    print(url)
    return requests.get(url).json()[0]['TOTALQUERYRESULTS']

In [21]:
table_count('erm_project')


https://iaspub.epa.gov/enviro/efservice/erm_project/COUNT/JSON
Out[21]:
1052

In [22]:
table_count('t_design_for_environment')


https://iaspub.epa.gov/enviro/efservice/t_design_for_environment/COUNT/JSON
Out[22]:
2502

I still need a full list of tables; for now I'm focusing on the ones listed at https://www.epa.gov/enviro/radnet-model

We're building a complete list at https://docs.google.com/spreadsheets/d/1LDDH-qxJunBqqkS1EfG2mhwgwFi7PylXtz3GYsGjDzA/edit#gid=52614242


In [31]:
tablenames=[
    'ERM_RESULT',
    'ERM_ANALYSIS',
    'ERM_COUNT',
    'ERM_SAMPLE',
    'ERM_MATRIX',
    'ERM_LOCATION',
    'ERM_PROJECT',
    'ERM_STUDY',
    'ERM_ANALYTE',
    'ERM_ANA_PROC',
    'ERM_DET_TYPE'
]

# there are many more to be added to this list.

In [38]:
def table_data(tablename, start=0, end=100):
    url = makeurl(tablename, start, end)
    print(url)
    data = requests.get(url).json()
    return pandas.DataFrame(data)

In [39]:
erm_result=table_data('ERM_RESULT')


https://iaspub.epa.gov/enviro/efservice/ERM_RESULT/JSON/rows/0:100

In [105]:
erm_result.head()


Out[105]:
ANALYTE_ID ANA_NUM CSU CSU_IN_SI DETECTABLE MDC MDC_IN_SI RESULT_AMOUNT RESULT_DATE RESULT_ID RESULT_IN_SI RESULT_UNIT SI_UNIT
0 BA140 642 4.00 0.15 N None None -6.00 13-OCT-78 1 -0.222 PCI/L BQ/L
1 CS137 642 3.50 0.13 N None None 3.00 13-OCT-78 2 0.110 PCI/L BQ/L
2 I131 642 3.50 0.13 N None None 1.00 13-OCT-78 3 0.040 PCI/L BQ/L
3 K 642 0.06 0.06 Y None None 1.46 13-OCT-78 4 1.460 G/L G/L
4 H3 643 100.00 3.70 Y None None 400.00 04-OCT-78 1 14.800 PCI/L BQ/L

In [41]:
erm_analyte=table_data('ERM_ANALYTE')


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYTE/JSON/rows/0:100

In [106]:
erm_analyte.head()


Out[106]:
ANALYTE_ID ANALYTE_NAME ANA_TYPE CRS_ID HALF_LIFE HALF_LIFE_TIME_UNIT
0 AC228 Actinium-228 R None 6.130 H
1 ALPHA Gross Alpha R None NaN None
2 AM241 Americium-241 R None 432.200 Y
3 BA133 Barium-133 R None 10.500 Y
4 BA140 Barium-140 R None 12.789 D

Performance profiling

Are there limits to how much data we can pull per request? And how fast can we retrieve it?

Let's try a reasonably big table: ERM_ANALYSIS


In [37]:
table_count('ERM_ANALYSIS')


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/COUNT/JSON
Out[37]:
320561

In [48]:
%%time

erm_analysis = table_data('ERM_ANALYSIS',0,10)
print(len(erm_analysis),"rows retrieved")


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:10
11 rows retrieved
CPU times: user 16 ms, sys: 4 ms, total: 20 ms
Wall time: 453 ms

In [49]:
erm_analysis.head()


Out[49]:
ANA_COMMENT ANA_NUM ANA_PROC_NUM ANA_SIZE ANA_SIZE_2 ANA_UNIT ANA_UNIT_2 SAMP_NUM
0 None 642 9 3.5 None L None 230
1 None 643 10 NaN None None None 231
2 None 644 10 NaN None None None 232
3 None 645 10 NaN None None None 233
4 None 646 10 NaN None None None 234

In [52]:
%%time

erm_analysis = table_data('ERM_ANALYSIS',0,100)
print(len(erm_analysis),"rows retrieved")


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:100
101 rows retrieved
CPU times: user 16 ms, sys: 4 ms, total: 20 ms
Wall time: 1.43 s

In [53]:
%%time

erm_analysis = table_data('ERM_ANALYSIS',0,200)
print(len(erm_analysis),"rows retrieved")


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:200
201 rows retrieved
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 1.97 s

In [54]:
%%time

erm_analysis = table_data('ERM_ANALYSIS',0,400)
print(len(erm_analysis),"rows retrieved")


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:400
401 rows retrieved
CPU times: user 16 ms, sys: 4 ms, total: 20 ms
Wall time: 4.17 s

In [55]:
%%time

erm_analysis = table_data('ERM_ANALYSIS',0,800)
print(len(erm_analysis),"rows retrieved")


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:800
801 rows retrieved
CPU times: user 28 ms, sys: 0 ns, total: 28 ms
Wall time: 33.8 s

In [50]:
%%time

erm_analysis = table_data('ERM_ANALYSIS',0,1000)
print(len(erm_analysis),"rows retrieved")


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:1000
1001 rows retrieved
CPU times: user 24 ms, sys: 4 ms, total: 28 ms
Wall time: 57.5 s

In [57]:
%%time

erm_analysis = table_data('ERM_ANALYSIS',0,500)
print(len(erm_analysis),"rows retrieved")


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:500
501 rows retrieved
CPU times: user 16 ms, sys: 4 ms, total: 20 ms
Wall time: 6.28 s

In [58]:
%%time

erm_analysis = table_data('ERM_ANALYSIS',0,600)
print(len(erm_analysis),"rows retrieved")


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:600
601 rows retrieved
CPU times: user 20 ms, sys: 4 ms, total: 24 ms
Wall time: 12.8 s

Retrieval time climbs sharply once we go past about 500 rows per request: 500 rows took 6.3 s, but 800 rows took 33.8 s and 1000 rows took 57.5 s.

So let's write a function that retrieves a table iteratively, in chunks:
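The manual sweep above could also be automated. A minimal sketch: time an arbitrary fetch callable for several request sizes, where `fetch` is a hypothetical wrapper such as `lambda n: table_data('ERM_ANALYSIS', 0, n)`:

```python
import time

def profile_sizes(fetch, sizes):
    """Time fetch(n) for each n in sizes; return {n: seconds}.

    `fetch` is any callable that retrieves n rows -- e.g. a wrapper
    around table_data above (hypothetical usage, untested against
    the live service)."""
    timings = {}
    for n in sizes:
        t0 = time.perf_counter()
        fetch(n)
        timings[n] = time.perf_counter() - t0
    return timings
```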


In [117]:
def collect(tablename, rowcount=100, limit=1000):
    '''Retrieve up to `limit` rows of `tablename`, `rowcount` rows
       per request.  The API understands start/end to be INCLUSIVE.'''
    count = table_count(tablename)
    def inner():
        start = 0
        end = rowcount - 1
        while start <= count:
            end = min(end, limit - 1)   # don't overshoot the overall limit
            url = makeurl(tablename, start, end)
            print(url)
            yield requests.get(url).json()

            start += rowcount
            end += rowcount
            if start > limit - 1:
                return
    return pandas.DataFrame(list(chain.from_iterable(inner())))

In [118]:
erm_analyte=collect('ERM_ANALYTE',rowcount=20,limit=35)


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYTE/COUNT/JSON
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYTE/JSON/rows/0:19
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYTE/JSON/rows/20:34

In [119]:
len(erm_analyte)


Out[119]:
35

Let's compare this with our previous method:


In [83]:
erm_analyte_=table_data('ERM_ANALYTE')


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYTE/JSON/rows/0:100

In [90]:
erm_analyte.head()


Out[90]:
ANALYTE_ID ANALYTE_NAME ANA_TYPE CRS_ID HALF_LIFE HALF_LIFE_TIME_UNIT
0 AC228 Actinium-228 R None 6.130 H
1 ALPHA Gross Alpha R None NaN None
2 AM241 Americium-241 R None 432.200 Y
3 BA133 Barium-133 R None 10.500 Y
4 BA140 Barium-140 R None 12.789 D

In [93]:
erm_analyte_.head()


Out[93]:
ANALYTE_ID ANALYTE_NAME ANA_TYPE CRS_ID HALF_LIFE HALF_LIFE_TIME_UNIT
0 AC228 Actinium-228 R None 6.130 H
1 ALPHA Gross Alpha R None NaN None
2 AM241 Americium-241 R None 432.200 Y
3 BA133 Barium-133 R None 10.500 Y
4 BA140 Barium-140 R None 12.789 D

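Eyeballing `head()` works here, but the check can be made mechanical. A small helper (my own addition, not part of the notebook's API) that compares two DataFrames while ignoring row index and column order, since the same rows fetched via different request patterns may not line up exactly:

```python
import pandas

def frames_match(a, b):
    """True if two DataFrames hold the same data, ignoring
    row index and column order."""
    a = a.sort_index(axis=1).reset_index(drop=True)
    b = b.sort_index(axis=1).reset_index(drop=True)
    return a.equals(b)
```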
So we can use this technique to extract a table piecemeal. Let's try it on a bigger table with a larger stride.

The ERM_ANALYSIS table is significantly bigger than ERM_ANALYTE:


In [107]:
table_count('ERM_ANALYSIS')


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/COUNT/JSON
Out[107]:
320561

In [120]:
%%time
erm_analysis=collect('ERM_ANALYSIS',rowcount=100,limit=1000)


https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/COUNT/JSON
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/0:99
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/100:199
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/200:299
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/300:399
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/400:499
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/500:599
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/600:699
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/700:799
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/800:899
https://iaspub.epa.gov/enviro/efservice/ERM_ANALYSIS/JSON/rows/900:999
CPU times: user 144 ms, sys: 40 ms, total: 184 ms
Wall time: 14.5 s

That is significantly faster than the 57.5 seconds it took to get the same 1000 rows in one request.
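Since the chunks are independent, they could also be fetched concurrently. A sketch (my own extension, untested against the service's rate limits), where `fetch_chunk(start, end)` is a hypothetical callable returning a list of row dicts, e.g. `lambda s, e: requests.get(makeurl(tablename, s, e)).json()`:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def collect_parallel(fetch_chunk, count, rowcount=100, max_workers=4):
    """Fetch inclusive [start, end] chunks concurrently and
    concatenate the rows, preserving order."""
    ranges = [(s, min(s + rowcount - 1, count - 1))
              for s in range(0, count, rowcount)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = pool.map(lambda r: fetch_chunk(*r), ranges)
        return list(chain.from_iterable(chunks))
```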


In [ ]: