Estimate how many TB of data are served by IOOS

Estimate dataset size from the OPeNDAP DDS. Here we use regular expressions to parse the DDS and multiply the variable sizes (32- or 64-bit Int or Float) by their shapes. This represents the size in memory, not on disk, since the data could be compressed. But the size in memory is in some sense a truer representation of the quantity of data made available by the service.
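

As a back-of-the-envelope check of the arithmetic, one Float32 variable with shape 74 x 303 x 491 (the example DDS line quoted in the parsing code below) works out to roughly 44 MB in memory:


In [ ]:
# Worked example (hypothetical grid variable, same shape as the DDS line quoted in
# calc_dsize below): bit width times the product of the dimension lengths
nbits = 32 * 74 * 303 * 491
print(nbits / 8. / 1.0e6)   # ~44 megabytes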


In [4]:
from owslib.csw import CatalogueServiceWeb
from owslib import fes
import pandas as pd
import datetime as dt
import requests
import re
import time

In [5]:
def service_urls(records,service_string='urn:x-esri:specification:ServiceType:odp:url'):
    """
    Get all URLs matching a specific ServiceType 
 
    Unfortunately these seem to differ between different CSW-ISO services.
    For example, OpenDAP is specified:
    NODC geoportal: 'urn:x-esri:specification:ServiceType:OPeNDAP'
    NGDC geoportal: 'urn:x-esri:specification:ServiceType:odp:url'
    """

    urls=[]
    for key,rec in records.iteritems():
        # create a generator expression and iterate through it until a match is found;
        # if no match is found, return the default value (here None)
        url = next((d['url'] for d in rec.references if d['scheme'] == service_string), None)
        if url is not None:
            urls.append(url)
    return urls

Find OpenDAP endpoints from NGDC CSW


In [6]:
endpoint = 'http://www.ngdc.noaa.gov/geoportal/csw' #  NGDC/IOOS Geoportal
csw = CatalogueServiceWeb(endpoint,timeout=60)
csw.version


Out[6]:
'2.0.2'

In [7]:
[op.name for op in csw.operations]


Out[7]:
['GetCapabilities',
 'DescribeRecord',
 'GetRecords',
 'GetRecordById',
 'Transaction']

In [12]:
for oper in csw.operations:
    if oper.name == 'GetRecords':
        print oper.constraints


[Constraint: SupportedCommonQueryables - ['Subject', 'Title', 'Abstract', 'AnyText', 'Format', 'Identifier', 'Modified', 'Type', 'BoundingBox'], Constraint: SupportedISOQueryables - ['apiso:Subject', 'apiso:Title', 'apiso:Abstract', 'apiso:AnyText', 'apiso:Format', 'apiso:Identifier', 'apiso:Modified', 'apiso:Type', 'apiso:BoundingBox', 'apiso:CRS.Authority', 'apiso:CRS.ID', 'apiso:CRS.Version', 'apiso:RevisionDate', 'apiso:AlternateTitle', 'apiso:CreationDate', 'apiso:PublicationDate', 'apiso:OrganizationName', 'apiso:HasSecurityConstraints', 'apiso:Language', 'apiso:ResourceIdentifier', 'apiso:ParentIdentifier', 'apiso:KeywordType', 'apiso:TopicCategory', 'apiso:ResourceLanguage', 'apiso:GeographicDescriptionCode', 'apiso:Denominator', 'apiso:DistanceValue', 'apiso:DistanceUOM', 'apiso:TempExtent_begin', 'apiso:TempExtent_end', 'apiso:ServiceType', 'apiso:ServiceTypeVersion', 'apiso:Operation', 'apiso:OperatesOn', 'apiso:OperatesOnIdentifier', 'apiso:OperatesOnName', 'apiso:CouplingType'], Constraint: AdditionalQueryables - ['apiso:Degree', 'apiso:AccessConstraints', 'apiso:OtherConstraints', 'apiso:Classification', 'apiso:ConditionApplyingToAccessAndUse', 'apiso:Lineage', 'apiso:ResponsiblePartyRole', 'apiso:ResponsiblePartyName', 'apiso:SpecificationTitle', 'apiso:SpecificationDate', 'apiso:SpecificationDateType']]

Since the supported ISO queryables include apiso:ServiceType, we can use CSW to find all datasets whose services contain the string "dap".


In [4]:
val = 'dap'
service_type = fes.PropertyIsLike(propertyname='apiso:ServiceType',literal=('*%s*' % val),
                        escapeChar='\\',wildCard='*',singleChar='?')
filter_list = [ service_type]

In [13]:
csw.getrecords2(constraints=filter_list,maxrecords=10000,esn='full')
len(csw.records.keys())


Out[13]:
2785

By printing out the references from a random record, we see that for this CSW the DAP URL is identified by the scheme urn:x-esri:specification:ServiceType:odp:url.


In [14]:
import random
choice=random.choice(list(csw.records.keys()))
print choice
csw.records[choice].references


CDIP_Archive/196p1/196p1_d01.nc
Out[14]:
[{'scheme': 'urn:x-esri:specification:ServiceType:distribution:url',
  'url': 'http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/196p1/196p1_d01.nc.html'},
 {'scheme': 'urn:x-esri:specification:ServiceType:distribution:url',
  'url': 'http://www.ncdc.noaa.gov/oa/wct/wct-jnlp-beta.php?singlefile=http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/196p1/196p1_d01.nc'},
 {'scheme': 'urn:x-esri:specification:ServiceType:sos:url',
  'url': 'http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/196p1/196p1_d01.nc?service=SOS&version=1.0.0&request=GetCapabilities'},
 {'scheme': 'urn:x-esri:specification:ServiceType:odp:url',
  'url': 'http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/196p1/196p1_d01.nc'},
 {'scheme': 'urn:x-esri:specification:ServiceType:download:url',
  'url': 'http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/196p1/196p1_d01.nc.html'}]

Get all the DAP endpoints


In [15]:
dap_urls = service_urls(csw.records,service_string='urn:x-esri:specification:ServiceType:odp:url')
len(dap_urls)


Out[15]:
2686

In [17]:
def calc_dsize(txt):
    ''' 
    Calculate dataset size from the OPeNDAP DDS. 
    Approx method: Multiply 32|64 bit Int|Float variables by their shape.
    '''
    # split the OPeNDAP DDS into one chunk per variable declaration
    chunks = re.split(';',txt)
    '''
    Use regex to find numbers following Float or Int (e.g. Float32, Int64)
    and also numbers immediately preceding a "]".  The idea is that in a line like:
    
    Float32 Total_precipitation_surface_6_Hour_Accumulation[time2 = 74][y = 303][x = 491];
           
    we want to find only the numbers that are not part of a variable or dimension name
    (want to return [32, 74, 303, 491], *not* [32, 6, 2, 74, 303, 491])
    '''
    m = re.compile(r'\d+(?=])|(?<=Float)\d+|(?<=Int)\d+')
    dsize=0
    for var in chunks:
        c = map(int,m.findall(var))
        if len(c)>=2:
            # product of the bit width and all dimension lengths = bits for this variable
            vsize = reduce(lambda x,y: x*y,c)
            dsize += vsize
    
    return dsize/1.0e6/8.   # convert bits to megabytes
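
A quick sanity check of calc_dsize on a hand-written DDS fragment (not fetched from a real endpoint) adds a small Float64 coordinate variable to the example above:


In [ ]:
sample_dds = """Dataset {
    Float64 time[time = 100];
    Float32 Total_precipitation_surface_6_Hour_Accumulation[time2 = 74][y = 303][x = 491];
} example;"""
print(calc_dsize(sample_dds))   # ~44.04 MB: (64*100 + 32*74*303*491) bits, converted to megabytes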

In [25]:
def tot_dsize(url,timeout=10):
    dds = url + '.dds'
    tot = 0
    try:
        response = requests.get(dds,verify=True, timeout=timeout)
    except requests.exceptions.RequestException:
        # connection error or timeout: flag the dataset with a status code of -1
        return tot, -1
    if response.status_code==200:
        # calculate the total size for all variables:
        tot = calc_dsize(response.text)
        # calculate the size of the MAPS (coordinate) variables inside Grid
        # declarations and subtract it from the total:
        maps = re.compile('MAPS:(.*?)}',re.MULTILINE | re.DOTALL)
        map_text = ''.join(maps.findall(response.text))
        if map_text:
            map_tot = calc_dsize(map_text)
            tot -= map_tot
    
    return tot,response.status_code
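
Before looping over all of the endpoints, we can spot-check tot_dsize on a single URL (here the CDIP OPeNDAP endpoint from the random record above); the size comes back in megabytes with status 200, assuming the endpoint is still responding:


In [ ]:
# Spot check on one known OPeNDAP endpoint (the URL from the record printed earlier).
# A status of -1 would indicate a connection error or timeout.
mb, status = tot_dsize('http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/196p1/196p1_d01.nc')
print('{} MB, status {}'.format(mb, status))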

In [ ]:
from __future__ import print_function
time0 = time.time()
good_data=[]
bad_data=[]
count=0
for url in dap_urls:
    count += 1
    dtot, status_code = tot_dsize(url,timeout=2)
    if status_code==200:
        good_data.append([url,dtot])
        print('[{}]Good:{},{}'.format(count,url,dtot), end='\r')
    else:
        bad_data.append([url,status_code])
        print('[{}]Fail:{},{}'.format(count,url,status_code), end='\r')
    
print('Elapsed time={} minutes'.format((time.time()-time0)/60.))

In [34]:
print('Elapsed time={} minutes'.format((time.time()-time0)/60.))


Elapsed time=33.3200285832 minutes

In [22]:
len(good_data)


Out[22]:
1547

In [23]:
len(bad_data)


Out[23]:
1139

In [35]:
bad_data[0][0]


Out[35]:
'http://www.neracoos.org/thredds/dodsC/UMO/DSG/SOS/A01/Doppler/HistoricRealtime/Agg.ncml'

Loop through the datasets that failed with the 2-second timeout to see if any of them work with a 10-second timeout.


In [ ]:
time0 = time.time()
good_data2=[]
bad_data2=[]
count=0
for item in bad_data:
    url = item[0]
    count += 1
    dtot, status_code = tot_dsize(url,timeout=10)
    if status_code==200:
        good_data2.append([url,dtot])
        print('[{}]Good:{},{}'.format(count,url,dtot), end='\r')
    else:
        bad_data2.append([url,status_code])
        print('[{}]Fail:{},{}'.format(count,url,status_code), end='\r')
    
print('Elapsed time={} minutes'.format((time.time()-time0)/60.))


Elapsed time=96.6383475343 minutes

Yikes, that took forever with the 10-second timeout. How many more datasets did we get?


In [87]:
len(bad_data)-len(bad_data2)


Out[87]:
46

So how much data are we serving?


In [52]:
total = 0
for ds in good_data:
    total += ds[1]
    
print('{} terabytes'.format(total/1.e6))


39.2907689704 terabytes

How much more data do we get if we allow a 10-second timeout instead of 2 seconds?


In [53]:
total = 0
for ds in good_data2:
    total += ds[1]
    
print('{} terabytes'.format(total/1.e6))


0.37659873907 terabytes

In [59]:
url=[]
size=[]
for item in good_data:
    url.append(item[0])
    size.append(item[1])

In [55]:
d={}
d['url']=url
d['size']=size

In [88]:
good = pd.DataFrame(d)

In [89]:
good.head()


Out[89]:
size url
0 160.935373 http://oos.soest.hawaii.edu/thredds/dodsC/paci...
1 12.585632 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
2 14.015040 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
3 2.227600 http://tds.secoora.org/thredds/dodsC/cormp.ilm...
4 0.042984 http://tds.secoora.org/thredds/dodsC/enp.bdvf1...

In [71]:
df2 = good.sort(['size'], ascending=False)

In [76]:
df2.head()


Out[76]:
size url
1007 8077171.411440 http://ecowatch.ncddc.noaa.gov/thredds/dodsC/n...
190 3911780.956856 http://oos.soest.hawaii.edu/thredds/dodsC/paci...
1019 3085310.257180 http://ecowatch.ncddc.noaa.gov/thredds/dodsC/n...
128 1498433.147160 http://ecowatch.ncddc.noaa.gov/thredds/dodsC/h...
129 1479018.284432 http://ecowatch.ncddc.noaa.gov/thredds/dodsC/h...

In [91]:
url=[]
code=[]
for item in bad_data:
    url.append(item[0])
    code.append(item[1])

In [92]:
d={}
d['url']=url
d['code']=code
bad = pd.DataFrame(d)

In [93]:
bad.head()


Out[93]:
code url
0 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
1 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
2 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
3 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
4 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...

In [102]:
bad.to_csv('bad.csv')

In [103]:
good.to_csv('good.csv')

In [12]:
cd /usgs/data2/notebook/system-test/Theme_1_Baseline


/usgs/data2/notebook/system-test/Theme_1_Baseline

In [15]:
bad = pd.read_csv('bad.csv',index_col=0)
good = pd.read_csv('good.csv',index_col=0)

In [35]:
bad_sorted = bad.sort(['url','code'], ascending=[False, False])

In [36]:
bad_sorted.to_csv('bad.csv')

In [33]:
bad_sorted


Out[33]:
code url
1058 -1 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
8 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
1057 -1 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
7 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
6 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
5 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
4 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
1056 -1 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
3 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
2 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
1055 -1 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
1 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
1054 -1 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
0 404 http://www.neracoos.org/thredds/dodsC/UMO/DSG/...
334 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
392 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
208 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
333 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
332 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
331 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
406 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
391 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
207 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
206 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
205 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
370 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
369 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
204 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
203 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
936 404 http://thredds.ucar.edu/thredds/dodsC/grib/Uni...
... ... ...
476 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
475 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
474 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
473 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
472 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
471 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
470 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
469 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
468 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
467 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
466 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
465 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
464 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
463 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
462 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
461 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
460 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
459 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
458 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
457 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
456 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
455 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
454 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
453 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
452 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
451 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
450 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
449 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
448 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...
447 -1 http://barataria.tamu.edu/thredds/dodsC/nam_go...

1139 rows × 2 columns


In [48]:
recs = bad[bad['url'].str.contains('neracoos')]
print len(recs)


14

In [52]:
recs = bad[bad['url'].str.contains('ucar')]
print len(recs)


402

In [54]:
recs = bad[bad['url'].str.contains('tamu')]
print len(recs)


480

In [64]:
recs['url'][927]


Out[64]:
'http://barataria.tamu.edu/thredds/dodsC/nam_gom_monthly/vgrd/nam_vgrd_gom_201312.nc'

In [66]:
recs['url'][447]


Out[66]:
'http://barataria.tamu.edu/thredds/dodsC/nam_gom_monthly/dswrf/nam_dswrf_gom_200901.nc'

In [67]:
recs = bad[bad['url'].str.contains('axiom')]
print len(recs)


85

In [74]:
recs.to_csv('axiom.csv')
