I am using the Python Requests library (http://docs.python-requests.org/en/master/) to scrape quantitative data from the EPA web service. The REST service itself is reasonably well documented at https://www.epa.gov/enviro/web-services
In [70]:
import requests
import io
import pandas
from itertools import chain
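Before querying, it helps to spell out the URL grammar: an Envirofacts query is a series of path segments, a table name, optional column/value filter pairs, an optional inclusive row range, and an output format. Here is a minimal sketch of a general URL builder under that reading of the docs; the `efquery` helper and its signature are my own invention, not part of the service:
In [ ]:
# Hypothetical helper (my own name, not an EPA API): build an Envirofacts
# query URL from path segments in the order the docs describe.
EF_BASE = 'https://iaspub.epa.gov/enviro/efservice'

def efquery(table, filters=(), rows=None, fmt='JSON'):
    parts = [EF_BASE, table]
    for column, value in filters:       # e.g. ('state_abbr', 'VA')
        parts += [column, str(value)]
    if rows is not None:                # (start, end) -- the range is inclusive
        parts += ['rows', '{}:{}'.format(*rows)]
    parts.append(fmt)
    return '/'.join(parts)

# efquery('tri_facility', filters=[('state_abbr', 'VA')], fmt='count/json')
# reproduces the example query below.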
Example REST query:
In [17]:
url='https://iaspub.epa.gov/enviro/efservice/tri_facility/state_abbr/VA/count/json'
requests.get(url).text
Out[17]:
I can use this to extract the table data:
In [18]:
def makeurl(tablename, start, end):
    # start:end row ranges are inclusive
    return "https://iaspub.epa.gov/enviro/efservice/{tablename}/JSON/rows/{start}:{end}".format_map(locals())

url = makeurl('t_design_for_environment', 1, 19)
out = requests.get(url)
In [19]:
pandas.DataFrame(out.json())
Out[19]:
Before we go further, we'll want to know the size of each table:
In [20]:
def table_count(tablename):
    url = "https://iaspub.epa.gov/enviro/efservice/{tablename}/COUNT/JSON".format_map(locals())
    print(url)
    return requests.get(url).json()[0]['TOTALQUERYRESULTS']
In [21]:
table_count('erm_project')
Out[21]:
In [22]:
table_count('t_design_for_environment')
Out[22]:
Someone needs to provide me with a full list of tables, but for now I'm focusing on the ones listed at https://www.epa.gov/enviro/radnet-model
We're building a complete list at https://docs.google.com/spreadsheets/d/1LDDH-qxJunBqqkS1EfG2mhwgwFi7PylXtz3GYsGjDzA/edit#gid=52614242
In [31]:
tablenames = [
    'ERM_RESULT',
    'ERM_ANALYSIS',
    'ERM_COUNT',
    'ERM_SAMPLE',
    'ERM_MATRIX',
    'ERM_LOCATION',
    'ERM_PROJECT',
    'ERM_STUDY',
    'ERM_ANALYTE',
    'ERM_ANA_PROC',
    'ERM_DET_TYPE',
]
# there are many more to be added to this list.
In [38]:
def table_data(tablename, start=0, end=100):
    url = makeurl(tablename, start, end)
    print(url)
    data = requests.get(url).json()
    return pandas.DataFrame(data)
In [39]:
erm_result=table_data('ERM_RESULT')
In [105]:
erm_result.head()
Out[105]:
In [41]:
erm_analyte=table_data('ERM_ANALYTE')
In [106]:
erm_analyte.head()
Out[106]:
In [37]:
table_count('ERM_ANALYSIS')
Out[37]:
In [48]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,10)
print(len(erm_analysis),"rows retrieved")
In [49]:
erm_analysis.head()
Out[49]:
In [52]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,100)
print(len(erm_analysis),"rows retrieved")
In [53]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,200)
print(len(erm_analysis),"rows retrieved")
In [54]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,400)
print(len(erm_analysis),"rows retrieved")
In [55]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,800)
print(len(erm_analysis),"rows retrieved")
In [50]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,1000)
print(len(erm_analysis),"rows retrieved")
In [57]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,500)
print(len(erm_analysis),"rows retrieved")
In [58]:
%%time
erm_analysis = table_data('ERM_ANALYSIS',0,600)
print(len(erm_analysis),"rows retrieved")
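Rather than re-running `%%time` by hand for each stride, a small loop can time the candidate chunk sizes in one pass. A rough sketch, reusing `table_data` from above (timings will vary with server load, so treat any single run as a guide, not a measurement):
In [ ]:
import time

# Time table_data() across candidate chunk sizes; rows/s gives a crude
# throughput figure for choosing a stride.
for size in (10, 100, 200, 400, 800, 1000):
    t0 = time.perf_counter()
    chunk = table_data('ERM_ANALYSIS', 0, size)
    elapsed = time.perf_counter() - t0
    print('{:>5} rows in {:6.2f}s ({:.0f} rows/s)'.format(
        len(chunk), elapsed, len(chunk) / elapsed))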
So let's write an iterative retrieval function.
In [117]:
def collect(tablename, rowcount=100, limit=1000):
    '''
    Fetch `tablename` in chunks of `rowcount` rows, stopping after `limit` rows.
    The API understands start/end to be INCLUSIVE.
    '''
    count = table_count(tablename)

    def inner():
        start = 0
        end = rowcount - 1
        while start < count:           # rows are indexed 0..count-1
            end = min(end, limit - 1)  # don't read past the overall limit
            url = makeurl(tablename, start, end)
            print(url)
            yield requests.get(url).json()
            start += rowcount
            end += rowcount
            if start > limit - 1:
                return

    return pandas.DataFrame(list(chain.from_iterable(inner())))
In [118]:
erm_analyte=collect('ERM_ANALYTE',rowcount=20,limit=35)
In [119]:
len(erm_analyte)
Out[119]:
Let's compare this with our previous method:
In [83]:
erm_analyte_=table_data('ERM_ANALYTE')
In [90]:
erm_analyte.head()
Out[90]:
In [93]:
erm_analyte_.head()
Out[93]:
So we can use this technique to extract a table piecemeal. Let's try it on a bigger table with a larger stride: ERM_ANALYSIS is significantly bigger than ERM_ANALYTE.
In [107]:
table_count('ERM_ANALYSIS')
Out[107]:
In [120]:
%%time
erm_analysis=collect('ERM_ANALYSIS',rowcount=100,limit=1000)
This is significantly faster than the 57 seconds it took to fetch the data in one go.
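One further tweak worth trying: `collect` opens a fresh connection for every chunk, and a `requests.Session` would reuse the underlying connection across requests. A sketch of that variant, with the same inclusive-range chunking as `collect` (untested against the service, so the speedup is a guess):
In [ ]:
# Variant of collect() that reuses one HTTP connection across chunks
# via requests.Session; the chunking logic is otherwise the same.
def collect_pooled(tablename, rowcount=100, limit=1000):
    count = table_count(tablename)
    session = requests.Session()

    def inner():
        start = 0
        while start < min(count, limit):
            end = min(start + rowcount, limit) - 1  # row ranges are inclusive
            yield session.get(makeurl(tablename, start, end)).json()
            start += rowcount

    return pandas.DataFrame(list(chain.from_iterable(inner())))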