Estimate how many TB of data are served by IOOS

Estimate dataset size from the OPeNDAP DDS. Here we use regular expressions to parse the DDS, extracting the bit width of each numeric variable (32- or 64-bit Int or Float) and multiplying it by the variable's shape. This gives the size in memory, not on disk, since the data may be compressed on the server; but the in-memory size is in some sense a truer measure of the quantity of data made available by the service.


In [47]:
from __future__ import print_function
from functools import reduce  # reduce is not a builtin in Python 3
from owslib.csw import CatalogueServiceWeb
from owslib import fes
import pandas as pd
import datetime as dt
import requests
import random
import re
import time

In [48]:
def service_urls(records, service_string='urn:x-esri:specification:ServiceType:odp:url'):
    """
    Get all URLs matching a specific ServiceType.

    Unfortunately these schemes differ between CSW-ISO services.
    For example, OPeNDAP is specified as:
    NODC geoportal: 'urn:x-esri:specification:ServiceType:OPeNDAP'
    NGDC geoportal: 'urn:x-esri:specification:ServiceType:odp:url'
    """
    urls = []
    for key, rec in records.items():
        # Create a generator and iterate until a match is found;
        # if no reference matches, fall back to the default (None).
        url = next((d['url'] for d in rec.references if d['scheme'] == service_string), None)
        if url is not None:
            urls.append(url)
    return urls
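To see what `service_urls` does without a live CSW connection, here is a sketch using a hypothetical stand-in for `csw.records` (the `FakeRecord` class and example URLs are made up; real records expose the same `.references` list of scheme/url dicts):

```python
class FakeRecord:
    """Minimal stand-in for an owslib CSW record (hypothetical)."""
    def __init__(self, references):
        self.references = references

records = {
    'rec1': FakeRecord([
        {'scheme': 'urn:x-esri:specification:ServiceType:odp:url',
         'url': 'http://example.org/thredds/dodsC/a.nc'},
        {'scheme': 'urn:x-esri:specification:ServiceType:sos:url',
         'url': 'http://example.org/thredds/sos/a.nc'}]),
    'rec2': FakeRecord([
        {'scheme': 'urn:x-esri:specification:ServiceType:download:url',
         'url': 'http://example.org/thredds/fileServer/b.nc'}]),
}

def service_urls(records, service_string='urn:x-esri:specification:ServiceType:odp:url'):
    urls = []
    for key, rec in records.items():
        # First reference whose scheme matches, else None
        url = next((d['url'] for d in rec.references
                    if d['scheme'] == service_string), None)
        if url is not None:
            urls.append(url)
    return urls

print(service_urls(records))  # only rec1 carries an odp:url reference
```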

Find OpenDAP endpoints from NGDC CSW


In [49]:
endpoint = 'http://www.ngdc.noaa.gov/geoportal/csw'  # NGDC/IOOS Geoportal
dap_timeout = 4   # timeout for DAP response (seconds)
csw_timeout = 60  # timeout for CSW response (seconds)
csw = CatalogueServiceWeb(endpoint, timeout=csw_timeout)
csw.version


Out[49]:
'2.0.2'

In [50]:
[op.name for op in csw.operations]


Out[50]:
['GetCapabilities',
 'DescribeRecord',
 'GetRecords',
 'GetRecordById',
 'Transaction']

In [51]:
csw.get_operation_by_name('GetRecords').constraints


Out[51]:
[Constraint: SupportedCommonQueryables - ['Subject', 'Title', 'Abstract', 'AnyText', 'Format', 'Identifier', 'Modified', 'Type', 'BoundingBox'],
 Constraint: SupportedISOQueryables - ['apiso:Subject', 'apiso:Title', 'apiso:Abstract', 'apiso:AnyText', 'apiso:Format', 'apiso:Identifier', 'apiso:Modified', 'apiso:Type', 'apiso:BoundingBox', 'apiso:CRS.Authority', 'apiso:CRS.ID', 'apiso:CRS.Version', 'apiso:RevisionDate', 'apiso:AlternateTitle', 'apiso:CreationDate', 'apiso:PublicationDate', 'apiso:OrganizationName', 'apiso:HasSecurityConstraints', 'apiso:Language', 'apiso:ResourceIdentifier', 'apiso:ParentIdentifier', 'apiso:KeywordType', 'apiso:TopicCategory', 'apiso:ResourceLanguage', 'apiso:GeographicDescriptionCode', 'apiso:Denominator', 'apiso:DistanceValue', 'apiso:DistanceUOM', 'apiso:TempExtent_begin', 'apiso:TempExtent_end', 'apiso:ServiceType', 'apiso:ServiceTypeVersion', 'apiso:Operation', 'apiso:OperatesOn', 'apiso:OperatesOnIdentifier', 'apiso:OperatesOnName', 'apiso:CouplingType'],
 Constraint: AdditionalQueryables - ['apiso:Degree', 'apiso:AccessConstraints', 'apiso:OtherConstraints', 'apiso:Classification', 'apiso:ConditionApplyingToAccessAndUse', 'apiso:Lineage', 'apiso:ResponsiblePartyRole', 'apiso:ResponsiblePartyName', 'apiso:SpecificationTitle', 'apiso:SpecificationDate', 'apiso:SpecificationDateType']]

In [52]:
for oper in csw.operations:
    print(oper.name)


GetCapabilities
DescribeRecord
GetRecords
GetRecordById
Transaction

Since the supported ISO queryables contain apiso:ServiceType, we can use CSW to find all datasets with services that contain the string "dap"


In [54]:
try:
    csw.get_operation_by_name('GetDomain')
    csw.getdomain('apiso:ServiceType', 'property')
    print(csw.results['values'])
except:
    print('GetDomain not supported')


GetDomain not supported

Since this CSW service doesn't provide us a list of potential values for apiso:ServiceType, we guess opendap, which seems to work:


In [55]:
val = 'opendap'
service_type = fes.PropertyIsLike(propertyname='apiso:ServiceType',
                                  literal=('*%s*' % val),
                                  escapeChar='\\', wildCard='*', singleChar='?')
filter_list = [service_type]

In [56]:
csw.getrecords2(constraints=filter_list,maxrecords=10000,esn='full')
len(csw.records.keys())


Out[56]:
1995

By printing out the references from a random record, we see that for this CSW the DAP URL is identified by the scheme urn:x-esri:specification:ServiceType:odp:url.


In [57]:
choice=random.choice(list(csw.records.keys()))
print(choice)
csw.records[choice].references


id_enp.cwaf1.met
Out[57]:
[{'scheme': 'urn:x-esri:specification:ServiceType:distribution:url',
  'url': 'http://tds.secoora.org/thredds/dodsC/enp.cwaf1.met.nc.html'},
 {'scheme': 'urn:x-esri:specification:ServiceType:distribution:url',
  'url': 'http://www.ncdc.noaa.gov/oa/wct/wct-jnlp-beta.php?singlefile=http://tds.secoora.org/thredds/dodsC/enp.cwaf1.met.nc'},
 {'scheme': 'urn:x-esri:specification:ServiceType:sos:url',
  'url': 'http://tds.secoora.org/thredds/sos/enp.cwaf1.met.nc?service=SOS&version=1.0.0&request=GetCapabilities'},
 {'scheme': 'urn:x-esri:specification:ServiceType:odp:url',
  'url': 'http://tds.secoora.org/thredds/dodsC/enp.cwaf1.met.nc'},
 {'scheme': 'urn:x-esri:specification:ServiceType:download:url',
  'url': 'http://tds.secoora.org/thredds/dodsC/enp.cwaf1.met.nc.html'}]

Get all the OPeNDAP endpoints


In [58]:
dap_urls = service_urls(csw.records,service_string='urn:x-esri:specification:ServiceType:odp:url')
len(dap_urls)


Out[58]:
1885

In [59]:
def calc_dsize(txt):
    '''
    Estimate dataset size (in megabytes) from the OPeNDAP DDS.
    Approximate method: multiply each 32|64 bit Int|Float variable's
    bit width by its shape, then sum over variables.
    '''
    # Split the OPeNDAP DDS on ';' characters (one variable declaration per chunk)
    parts = re.split(';', txt)
    '''
    Use a regex to find numbers following Float or Int (e.g. Float32, Int64)
    and numbers immediately preceding a "]".  The idea is that in a line like:

    Float32 Total_precipitation_surface_6_Hour_Accumulation[time2 = 74][y = 303][x = 491];

    we want only the numbers that are not part of a variable or dimension name
    (we want [32, 74, 303, 491], *not* [32, 6, 2, 74, 303, 491]).
    '''
    m = re.compile(r'\d+(?=])|(?<=Float)\d+|(?<=Int)\d+')
    dsize = 0
    for var in parts:
        c = list(map(int, m.findall(var)))
        if len(c) >= 2:
            vsize = reduce(lambda x, y: x * y, c)
            dsize += vsize

    return dsize/1.0e6/8.   # bits -> megabytes
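As a quick sanity check on the example line from the docstring (restating `calc_dsize` here so the snippet is self-contained): 32 x 74 x 303 x 491 bits = 352,294,464 bits, or about 44.04 MB.

```python
import re
from functools import reduce

def calc_dsize(txt):
    # Bit widths (after Float/Int) and dimension lengths (before "]")
    m = re.compile(r'\d+(?=])|(?<=Float)\d+|(?<=Int)\d+')
    dsize = 0
    for var in re.split(';', txt):
        c = list(map(int, m.findall(var)))
        if len(c) >= 2:
            dsize += reduce(lambda x, y: x * y, c)
    return dsize/1.0e6/8.  # bits -> megabytes

line = 'Float32 Total_precipitation_surface_6_Hour_Accumulation[time2 = 74][y = 303][x = 491];'
print(calc_dsize(line))  # -> 44.036808
```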

In [60]:
def tot_dsize(url, dap_timeout=2):
    dds = url + '.dds'
    tot = 0
    try:
        response = requests.get(dds, verify=True, timeout=dap_timeout)
    except requests.exceptions.RequestException:
        return tot, -1
    if response.status_code == 200:
        # calculate the total size over all variables:
        tot = calc_dsize(response.text)
        # calculate the size of the MAPS (coordinate) variables and
        # subtract it, since they duplicate arrays already counted:
        maps = re.compile('MAPS:(.*?)}', re.MULTILINE | re.DOTALL)
        map_text = ''.join(maps.findall(response.text))
        if map_text:
            map_tot = calc_dsize(map_text)
            tot -= map_tot

    return tot, response.status_code
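The MAPS subtraction can be checked offline against a synthetic Grid DDS (the dataset and variable names below are invented): coordinate variables listed under MAPS are also counted in the first pass, so removing them leaves only the 10 x 100 x 200 Float32 data array, 0.8 MB.

```python
import re
from functools import reduce

def calc_dsize(txt):  # restated from above for self-containment
    m = re.compile(r'\d+(?=])|(?<=Float)\d+|(?<=Int)\d+')
    dsize = 0
    for var in re.split(';', txt):
        c = list(map(int, m.findall(var)))
        if len(c) >= 2:
            dsize += reduce(lambda x, y: x * y, c)
    return dsize/1.0e6/8.

# Synthetic Grid declaration: ARRAY holds the data; the MAPS block
# repeats the coordinate variables, which we don't want to double count.
dds = """Dataset {
  Grid {
    ARRAY:
      Float32 temp[time = 10][lat = 100][lon = 200];
    MAPS:
      Float64 time[time = 10];
      Float32 lat[lat = 100];
      Float32 lon[lon = 200];
  } temp;
} example;"""

tot = calc_dsize(dds)
maps = re.compile('MAPS:(.*?)}', re.MULTILINE | re.DOTALL)
map_text = ''.join(maps.findall(dds))
tot -= calc_dsize(map_text)
print(tot)  # -> 0.8 (MB): just the data array
```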

In [61]:
time0 = time.time()
good_data=[]
bad_data=[]
count=0
for url in dap_urls:
    count += 1
    dtot, status_code = tot_dsize(url,dap_timeout=dap_timeout)
    if status_code==200:
        good_data.append([url,dtot])
        print('[{}]Good:{},{}'.format(count,url,dtot), end='\r')
    else:
        bad_data.append([url,status_code])
        print('[{}]Fail:{},{}'.format(count,url,status_code), end='\r')
    
print('Elapsed time={} minutes'.format((time.time()-time0)/60.))


Elapsed time=20.0452855984 minutes

In [62]:
print('Elapsed time={} minutes'.format((time.time()-time0)/60.))


Elapsed time=20.0453634818 minutes

In [63]:
len(bad_data)


Out[63]:
248

In [64]:
bad_data[0][0]


Out[64]:
'http://www.neracoos.org/thredds/dodsC/UMO/DSG/SOS/A01/Doppler/HistoricRealtime/Agg.ncml'

So how much data are we serving?


In [65]:
total = 0
for ds in good_data:
    total += ds[1]

print('{} terabytes'.format(total/1.e6))


53.2175702391 terabytes

In [66]:
url=[]
size=[]
for item in good_data:
    url.append(item[0])
    size.append(item[1])

In [67]:
d={}
d['url']=url
d['size']=size

In [68]:
good = pd.DataFrame(d)

In [69]:
good.head()


Out[69]:
size url
0 161.589067 http://oos.soest.hawaii.edu/thredds/dodsC/paci...
1 12.899552 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
2 14.328960 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
3 274.258484 http://thredds.coastal.ufl.edu:8080/thredds/do...
4 721.913524 http://thredds.coastal.ufl.edu:8080/thredds/do...

In [70]:
good = good.sort_values(by='size', ascending=False)

In [71]:
good.head()


Out[71]:
size url
1037 8075141.971392 http://ecowatch.ncddc.noaa.gov/thredds/dodsC/n...
967 7494709.539768 http://geoport.whoi.edu/thredds/dodsC/coawst_4...
1143 4567598.362696 http://oos.soest.hawaii.edu/thredds/dodsC/paci...
1038 3432967.640784 http://ecowatch.ncddc.noaa.gov/thredds/dodsC/n...
1049 3077189.375788 http://ecowatch.ncddc.noaa.gov/thredds/dodsC/n...
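A natural follow-up (not part of the original run) is to total the sizes per server rather than per dataset. The sketch below uses a synthetic stand-in for the `good` DataFrame, and the `host` column is an invention of this example; it groups on the URL's hostname with `urlparse`.

```python
import pandas as pd
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

# Synthetic stand-in for the `good` DataFrame of (url, size-in-MB) rows
good = pd.DataFrame({
    'url': ['http://a.org/thredds/dodsC/x', 'http://a.org/thredds/dodsC/y',
            'http://b.org/thredds/dodsC/z'],
    'size': [100.0, 50.0, 25.0]})

# Extract the hostname and sum dataset sizes per server
good['host'] = good['url'].map(lambda u: urlparse(u).netloc)
by_host = good.groupby('host')['size'].sum().sort_values(ascending=False)
print(by_host)  # a.org totals 150.0 MB, b.org 25.0 MB
```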

In [72]:
url=[]
code=[]
for item in bad_data:
    url.append(item[0])
    code.append(item[1])

In [73]:
d={}
d['url']=url
d['code']=code
bad = pd.DataFrame(d)

In [74]:
bad.head()


Out[74]:
code url
0 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
1 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
2 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
3 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
4 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...

In [75]:
cd /usgs/data2/notebook/system-test/Theme_1_Baseline


/usgs/data2/notebook/system-test/Theme_1_Baseline

In [76]:
td = dt.datetime.today().strftime('%Y-%m-%d')

In [77]:
bad.to_csv('bad'+td+'.csv')

In [78]:
good.to_csv('good'+td+'.csv')

In [79]:
bad = bad.sort_values(by=['url', 'code'], ascending=[False, False])

In [80]:
bad = pd.read_csv('bad'+td+'.csv',index_col=0)
good = pd.read_csv('good'+td+'.csv',index_col=0)

In [81]:
bad.head()


Out[81]:
code url
0 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
1 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
2 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
3 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
4 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...

In [82]:
recs = bad[bad['url'].str.contains('neracoos')]
print(len(recs))


11

In [83]:
recs = bad[bad['url'].str.contains('ucar')]
print(len(recs))


7

In [84]:
recs = bad[bad['url'].str.contains('tamu')]
print(len(recs))


1

In [85]:
recs = bad[bad['url'].str.contains('axiom')]
print(len(recs))


78

In [86]:
recs = bad[bad['url'].str.contains('caricoos')]
print(len(recs))


0

In [87]:
recs = bad[bad['url'].str.contains('secoora')]
print(len(recs))


3

In [88]:
recs = bad[bad['url'].str.contains('nanoos')]
print(len(recs))


0
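The repeated per-provider counts above can be collapsed into a single loop. This sketch uses a small synthetic `bad` DataFrame in place of the real one (the URLs are made up):

```python
import pandas as pd

# Synthetic stand-in for the `bad` DataFrame of failed endpoints
bad = pd.DataFrame({
    'url': ['http://www.neracoos.org/thredds/dodsC/a',
            'http://thredds.axiomalaska.com/thredds/dodsC/b',
            'http://thredds.axiomalaska.com/thredds/dodsC/c',
            'http://tds.secoora.org/thredds/dodsC/d'],
    'code': [404, -1, -1, 404]})

providers = ['neracoos', 'ucar', 'tamu', 'axiom', 'caricoos', 'secoora', 'nanoos']
# Count failed endpoints whose URL mentions each provider substring
counts = {p: len(bad[bad['url'].str.contains(p)]) for p in providers}
for p in providers:
    print(p, counts[p])
```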

In [89]:
recs = bad[bad['url'].str.contains('axiom')]  # reselect: recs last held the nanoos rows
recs.to_csv('axiom.csv')

In [90]:
!git add *.csv

In [91]:
!git commit -m 'new csv'


[master 53a2493] new csv
 2 files changed, 1887 insertions(+)
 create mode 100644 Theme_1_Baseline/bad2014-12-02.csv
 create mode 100644 Theme_1_Baseline/good2014-12-02.csv

In [92]:
!git push


ssh: /home/usgs/anaconda/lib/libcrypto.so.1.0.0: no version information available (required by ssh)
Counting objects: 7, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 28.61 KiB, done.
Total 5 (delta 2), reused 0 (delta 0)
To git@github.com:rsignell-usgs/system-test.git
   6238558..53a2493  master -> master

In [92]: