I am using the Python Requests library (http://docs.python-requests.org/en/master/) to scrape quantitative data from the EPA web service. The REST service itself is reasonably well documented at https://www.epa.gov/enviro/web-services
In [70]:
import requests
import io
import pandas
from itertools import chain
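Before querying, it helps to spell out the URL grammar: an Envirofacts query is a series of path segments, a table name, optional column/value filter pairs, an optional inclusive row range, and an output format. Here is a minimal sketch of a general URL builder under that reading of the docs; the `efquery` helper and its signature are my own invention, not part of the service:
In [ ]:
# Hypothetical helper (my own name, not an EPA API): build an Envirofacts
# query URL from path segments in the order the docs describe.
EF_BASE = 'https://iaspub.epa.gov/enviro/efservice'

def efquery(table, filters=(), rows=None, fmt='JSON'):
    parts = [EF_BASE, table]
    for column, value in filters:       # e.g. ('state_abbr', 'VA')
        parts += [column, str(value)]
    if rows is not None:                # (start, end) -- the range is inclusive
        parts += ['rows', '{}:{}'.format(*rows)]
    parts.append(fmt)
    return '/'.join(parts)

# efquery('tri_facility', filters=[('state_abbr', 'VA')], fmt='count/json')
# reproduces the example query below.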
Example REST query:
In [17]:
url='https://iaspub.epa.gov/enviro/efservice/tri_facility/state_abbr/VA/count/json'
requests.get(url).text
Out[17]:
I can use this to extract the table data:
In [18]:
def makeurl(tablename, start, end):
    # start:end row ranges are inclusive
    return "https://iaspub.epa.gov/enviro/efservice/{tablename}/JSON/rows/{start}:{end}".format_map(locals())

url = makeurl('t_design_for_environment', 1, 19)
out = requests.get(url)
In [19]:
pandas.DataFrame(out.json())
Out[19]:
Before we go further, we'll want to know the size of each table:
In [20]:
def table_count(tablename):
    url = "https://iaspub.epa.gov/enviro/efservice/{tablename}/COUNT/JSON".format_map(locals())
    print(url)
    return requests.get(url).json()[0]['TOTALQUERYRESULTS']
In [21]:
table_count('erm_project')
Out[21]:
In [22]:
table_count('t_design_for_environment')
Out[22]:
Someone needs to provide me with a full list of tables, but for now I'm focusing on the ones listed at https://www.epa.gov/enviro/radnet-model
We're building a complete list at https://docs.google.com/spreadsheets/d/1LDDH-qxJunBqqkS1EfG2mhwgwFi7PylXtz3GYsGjDzA/edit#gid=52614242
In [31]:
tablenames = [
    'ERM_RESULT',
    'ERM_ANALYSIS',
    'ERM_COUNT',
    'ERM_SAMPLE',
    'ERM_MATRIX',
    'ERM_LOCATION',
    'ERM_PROJECT',
    'ERM_STUDY',
    'ERM_ANALYTE',
    'ERM_ANA_PROC',
    'ERM_DET_TYPE',
]
# there are many more to be added to this list.
In [38]:
def table_data(tablename, start=0, end=100):
    url = makeurl(tablename, start, end)
    print(url)
    data = requests.get(url).json()
    return pandas.DataFrame(data)
In [39]:
erm_result=table_data('ERM_RESULT')
In [105]:
erm_result.head()
Out[105]:
In [41]:
erm_analyte=table_data('ERM_ANALYTE')
In [106]:
erm_analyte.head()
Out[106]:
In [37]:
table_count('ERM_ANALYSIS')
Out[37]:
In [48]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,10)
print(len(erm_analysis),"rows retrieved")
In [49]:
erm_analysis.head()
Out[49]:
In [52]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,100)
print(len(erm_analysis),"rows retrieved")
In [53]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,200)
print(len(erm_analysis),"rows retrieved")
In [54]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,400)
print(len(erm_analysis),"rows retrieved")
In [55]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,800)
print(len(erm_analysis),"rows retrieved")
In [50]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,1000)
print(len(erm_analysis),"rows retrieved")
In [57]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,500)
print(len(erm_analysis),"rows retrieved")
In [58]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,600)
print(len(erm_analysis),"rows retrieved")
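Rather than re-running `%%time` by hand for each stride, a small loop can time the candidate chunk sizes in one pass. A rough sketch, reusing `table_data` from above (timings will vary with server load, so treat any single run as a guide, not a measurement):
In [ ]:
import time

# Time table_data() across candidate chunk sizes; rows/s gives a crude
# throughput figure for choosing a stride.
for size in (10, 100, 200, 400, 800, 1000):
    t0 = time.perf_counter()
    chunk = table_data('ERM_ANALYSIS', 0, size)
    elapsed = time.perf_counter() - t0
    print('{:>5} rows in {:6.2f}s ({:.0f} rows/s)'.format(
        len(chunk), elapsed, len(chunk) / elapsed))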
So let's write an iterative retrieval function.
In [117]:
def collect(tablename, rowcount=100, limit=1000):
    '''
    Fetch `tablename` in chunks of `rowcount` rows, stopping after `limit` rows.
    The API understands start/end to be INCLUSIVE.
    '''
    count = table_count(tablename)

    def inner():
        start = 0
        end = rowcount - 1
        while start < count:           # rows are indexed 0..count-1
            end = min(end, limit - 1)  # don't read past the overall limit
            url = makeurl(tablename, start, end)
            print(url)
            yield requests.get(url).json()
            start += rowcount
            end += rowcount
            if start > limit - 1:
                return

    return pandas.DataFrame(list(chain.from_iterable(inner())))
In [118]:
erm_analyte=collect('ERM_ANALYTE',rowcount=20,limit=35)
In [119]:
len(erm_analyte)
Out[119]:
Let's compare this with our previous method:
In [83]:
erm_analyte_=table_data('ERM_ANALYTE')
In [90]:
erm_analyte.head()
Out[90]:
In [93]:
erm_analyte_.head()
Out[93]:
So we can use this technique to extract a table piecemeal. Let's try it on a bigger table with a larger stride: ERM_ANALYSIS is significantly bigger than ERM_ANALYTE.
In [107]:
table_count('ERM_ANALYSIS')
Out[107]:
In [120]:
%%time
erm_analysis=collect('ERM_ANALYSIS',rowcount=100,limit=1000)
This is significantly faster than the 57 seconds it took to fetch the data in one go.
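One further tweak worth trying: `collect` opens a fresh connection for every chunk, and a `requests.Session` would reuse the underlying connection across requests. A sketch of that variant, with the same inclusive-range chunking as `collect` (untested against the service, so the speedup is a guess):
In [ ]:
# Variant of collect() that reuses one HTTP connection across chunks
# via requests.Session; the chunking logic is otherwise the same.
def collect_pooled(tablename, rowcount=100, limit=1000):
    count = table_count(tablename)
    session = requests.Session()

    def inner():
        start = 0
        while start < min(count, limit):
            end = min(start + rowcount, limit) - 1  # row ranges are inclusive
            yield session.get(makeurl(tablename, start, end)).json()
            start += rowcount

    return pandas.DataFrame(list(chain.from_iterable(inner())))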