In [ ]:
!jupyter nbconvert --to='slides' asdf\ tutorial.ipynb --post serve


[NbConvertApp] Converting notebook asdf tutorial.ipynb to slides
[NbConvertApp] Writing 315966 bytes to asdf tutorial.slides.html
[NbConvertApp] Redirecting reveal.js requests to https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.1.0
Serving your slides at http://127.0.0.1:8000/asdf tutorial.slides.html
Use Control-C to stop this server
WARNING:tornado.access:404 GET /custom.css (127.0.0.1) 1.75ms
WARNING:tornado.access:404 GET /custom.css (127.0.0.1) 0.69ms

Introduction to ASDF

Paper - ASDF: A new data format for astronomy http://dx.doi.org/10.1016/j.ascom.2015.06.004

Why another data format?

Shortcomings of FITS are becoming burdensome:

  • restrictions on keyword name length led to restricting the order of polynomials used (not enough characters for more than one digit)
  • namespace collisions led to a lack of support for concatenating transformations
  • lack of a simple grouping structure
  • conventions for workarounds have been implemented, but these are not standardized

What about alternatives?

VOTable

Strengths: XML, can leverage corresponding reader and validation libraries.

Weaknesses

  • No intrinsic support for efficiently handling binary data
  • Can't memory-map files too large to fit in memory
  • Can refer to FITS, but doesn't solve FITS's problems
  • Does not allow additional structured metadata beyond the description of tabular data (free text wouldn't be sufficient to describe World Coordinate System transformations)

HDF5

Strengths

  • Flexible and used by a much broader community; perhaps the strongest alternative to FITS

Weaknesses (relative to FITS)

  • Entirely binary - FITS headers are at least human-readable
  • All inspection must be done through HDF5 software
  • FITS is largely self-documenting, while the HDF5 spec is lengthy and complex (125 pages)
  • Complexity results in essentially one implementation; others would be more likely to deviate from the spec
  • The HDF format kept changing, so it was less appropriate as an archival format
  • Does not support simpler, text-based data files
  • The HDF5 Abstract Data Model was not flexible enough to represent a generalized WCS
  • Arbitrary restrictions on nesting of data structures
    • Data types do not include a variable-length mapping datatype (analogous to a dict or JS object)
    • Groups cannot be nested in variable-length arrays, only within each other
    • Compound datatypes cannot contain other compound types or variable-length arrays

Problems with others

  • CDF
    • purely binary
    • no support for grouping or hierarchical structures
    • no references
  • netCDF
    • also purely binary
    • does not support references or compression
    • v3 does not support hierarchical grouping
    • v4 is layered on HDF5, so it inherits most of HDF5's issues, though with fewer features and a simpler API
  • Starlink - does not support 64-bit dimension sizes, table structures organized by row, or compression
  • XDF - like VOTable, suffers from an inability to store raw binary; development halted in 2006
  • FITSXML - embeds FITS in XML, but shares similar restrictions w.r.t. grouping and metadata

ASDF Design goals

Main focus is on data interchange and archive suitability while maintaining the level of efficiency of FITS. Throughput is not a priority - the expectation is that tools will convert from ASDF to HDF5 where necessary.

  1. Structure

    1. Intrinsic hierarchical structure - syntax makes structure apparent
    2. Human-readable
    3. Based on an existing standard
    4. Support for references - can refer to same object from multiple locations in the file
    5. No arbitrary limits where possible
    6. Efficient updating
  2. Qualities for numerical data

    1. Support both text and binary, where binary may be stored raw; text should be human-editable
    2. Machine independent (endianness)
    3. Structured - binary data fits inside the structured hierarchy
    4. Multiple binary sections possible
    5. n-dimensional arrays and tables
    6. Intrinsic support for reading/writing streams
  3. Interoperability
    1. Explicit versioning for both the format and the individual structures within it
    2. Explicit extensibility - domain-specific extensions without interference
    3. Validation - support a schema language that enables semantic validation of the type and structure of objects; leverage existing validation toolsets; enable schema validation for domain-specific extensions
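The endianness goal above is the same explicitness that NumPy already offers for in-memory arrays; ASDF simply records the byte order in the file metadata. A small plain-NumPy illustration (not ASDF API) of why recording byte order matters:

```python
import numpy as np

# '<f8' is explicitly little-endian float64; '>f8' is big-endian.
a = np.arange(1, 5, dtype='<f8')
b = a.astype('>f8')   # same values, byte-swapped representation

# The logical values are identical even though the raw bytes differ,
# so a reader must know the byte order to interpret the bytes:
assert np.array_equal(a, b)
assert a.tobytes() != b.tobytes()
```

This is why the ndarray metadata later in this tutorial carries an explicit `byteorder: little` field.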

ASDF Structure

One-line header indicating asdf and version

YAML "tree": YAML Ain't Markup Language (YAML)

  • natively supports strings, numbers, booleans, null
  • arbitrary nesting of mappings, sequences
  • Similar to JSON, but the structure is more concise and human-readable - can use indentation and/or delimiters
  • has explicit type designations (YAML tags)
    • example: !wcs/steps -> validation performed against schema definition for that tag
  • Supported by major programming languages
  • Supports references
    • "anchors" (&label) attach a label to an object
    • "aliases" (*label) refer back to the anchored object
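A minimal sketch of anchors and aliases using PyYAML (PyYAML is an assumption here for illustration; ASDF handles alias resolution internally):

```python
import yaml

# &cal anchors the mapping; *cal refers back to it, so the same
# object appears in two places without being duplicated in the text.
doc = """
calibration: &cal {gain: 2.0, offset: 0.1}
raw_frame: {calib: *cal}
"""
data = yaml.safe_load(doc)

# PyYAML resolves the alias to the very same Python object:
assert data['raw_frame']['calib'] is data['calibration']
```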

Binary Sections: defined by supporting YAML schemas

Motivation

Kameleon is known as a data standard for the space weather community. Operationally, that means we

  • write conversion software that puts the scientists' model results into CDF or HDF5 format
  • provide tools for cracking open and interpolating from the resulting files
  • host model results for the public (read: scientists') consumption

Problems with our current standard

At the time the CCMC started, we chose to provide a common interface into the model results and to add our own metadata to the hosted results. There are a few problems with this approach

  1. This approach is very labor-intensive - unless the Kameleon developers add the model, it doesn't get supported.
  2. The demand for bringing in new models is growing, and we are constantly being asked to support more of them.
  3. The CDF/HDF5 formats require the user to have specialized software just to view metadata.
  4. While CDF/HDF5 are highly optimized for heavyweight models, they are overkill for many others.

Meanwhile.. IDL ROR Vis

For years, Lutz has been developing IDL readers and interpolators for all models currently in the Runs-on-Request system. Lutz's scripts convert the model results into a general N-D data structure, which his interpolators understand...

Idea: use Lutz's format, but store results in ASDF.

Defining a format?

Our goal is to make it easy to bring space weather models into kameleon. We do this by either:

  • defining a reader/interpolator in code, or
  • explicitly defining a file format (ASDF)

Since Lutz has already created a format for reading the models into IDL, we are targeting a file format for him to write to.

Why ASDF

https://asdf.readthedocs.io/en/latest/

  • A flexible file format for scientific data
  • Uses YAML to store metadata and provenance info
  • Arrays stored in the same file or separately
  • On load, multiple files are treated as one

Hello World


In [2]:
mkdir -p hello_world

In [3]:
from asdf import AsdfFile

# Make the tree structure, and create an AsdfFile from it.
tree = {'hello': 'world'}
ff = AsdfFile(tree)
ff.write_to("hello_world/test.asdf")

# You can also make the AsdfFile first, and modify its tree directly:
ff = AsdfFile()
ff.tree['hello'] = 'world'
ff.write_to("hello_world/test_hello_world.asdf")

In [4]:
cat hello_world/test.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
hello: world
...

In [17]:
cat hello_world/test_hello_world.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
hello: world
...

Storing Arrays


In [ ]:
mkdir -p array_storage

In [ ]:
from asdf import AsdfFile
import numpy as np

tree = {'my_array': np.random.rand(8, 8)}
ff = AsdfFile(tree)
ff.write_to("array_storage/test_arrays.asdf")

In [18]:
cat array_storage/test_arrays.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
my_array: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [8, 8]
...
[binary block data]
#ASDF BLOCK INDEX
%YAML 1.1
--- [353]
...

Schema validation

ASDF validates trees against schemas, preventing files from being created or saved with data a schema does not allow.


In [ ]:
from asdf import AsdfFile, ValidationError

tree = {'data': 'Not an array'}

# A schema that requires 'data' to be an ndarray would reject this tree:
try:
    AsdfFile(tree)
except ValidationError:
    print('data needs an array!')
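ASDF's validation leverages existing JSON Schema tooling. A standalone sketch of the same idea using the `jsonschema` package (the schema below is a hypothetical JSON Schema, not an actual ASDF schema):

```python
import jsonschema

# Hypothetical schema: the 'data' key, if present, must be an array.
schema = {
    'type': 'object',
    'properties': {'data': {'type': 'array'}},
}

jsonschema.validate({'data': [1, 2, 3]}, schema)   # passes silently

try:
    jsonschema.validate({'data': 'Not an array'}, schema)
    raised = False
except jsonschema.ValidationError:
    raised = True
assert raised
```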

Data Sharing

Overlap between data sets can be stored without duplication.


In [ ]:
mkdir -p data_sharing

In [ ]:
from asdf import AsdfFile
import numpy as np

my_array = np.random.rand(8, 8)
subset = my_array[2:4,3:6]
tree = {
    'my_array': my_array,
    'subset':   subset
}
ff = AsdfFile(tree)
ff.write_to("data_sharing/test_overlap.asdf")

In [20]:
cat data_sharing/test_overlap.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
my_array: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [8, 8]
subset: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [2, 3]
  offset: 152
  strides: [64, 8]
...
[binary block data]
#ASDF BLOCK INDEX
%YAML 1.1
--- [482]
...
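The `offset: 152` and `strides: [64, 8]` recorded for the subset can be derived from NumPy's own view metadata. A sketch independent of asdf (using a deterministic array in place of the random one):

```python
import numpy as np

a = np.arange(64, dtype=np.float64).reshape(8, 8)
subset = a[2:4, 3:6]

# The subset is a view into the parent buffer, not a copy:
assert np.shares_memory(a, subset)

# Byte offset of element (2, 3) in the parent: (2*8 + 3) * 8 bytes = 152.
offset = subset.__array_interface__['data'][0] - a.__array_interface__['data'][0]
assert offset == 152

# Row stride: 8 elements * 8 bytes = 64; column stride: 8 bytes.
assert subset.strides == (64, 8)
```

This is exactly why both arrays can point at `source: 0` - the subset is just an offset-and-strides view into the same block.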

Streaming Data


In [ ]:
mkdir -p streaming_data

In [31]:
from asdf import AsdfFile, Stream
import numpy as np

tree = {
    # Each "row" of data will have 128 entries.
    'my_stream': Stream([128], np.float64)
}

ff = AsdfFile(tree)
with open('streaming_data/stream_test.asdf', 'wb') as fd:
    ff.write_to(fd)
    # Write 10 rows of data, one row at a time.  ``write``
    # expects the raw binary bytes, not an array, so we use
    # ``tobytes()`` (``tostring()`` in older NumPy).
    for i in range(10):
        fd.write(np.array([i] * 128, np.float64).tobytes())
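Conceptually, a reader reassembles the stream by dividing the trailing block's byte count by the row size - the row count is not stored in the tree. A plain-NumPy sketch of that reassembly (no asdf involved):

```python
import io
import numpy as np

# Simulate the streamed block: 10 rows of 128 float64 values.
buf = io.BytesIO()
for i in range(10):
    buf.write(np.full(128, i, dtype=np.float64).tobytes())

# Row count is inferred from the total length: reshape(-1, 128).
data = np.frombuffer(buf.getvalue(), dtype=np.float64).reshape(-1, 128)
assert data.shape == (10, 128)
assert data[3, 0] == 3.0
```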

In [32]:
cat streaming_data/stream_test.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
my_stream: !core/ndarray-1.0.0
  source: -1
  datatype: float64
  byteorder: little
  shape: ['*', 128]
...
[binary block data: 10 rows of 128 float64 values]

Exploded Storage

Data can optionally be stored inline in the YAML tree (if reasonably small) or exploded into separate files (for large binary data).


In [ ]:
mkdir -p exploded_data

In [ ]:
from asdf import AsdfFile
import numpy as np

my_array = np.random.rand(3, 4)
tree = {'my_array': my_array}

my_big_array = np.random.rand(8, 8)
tree['my_big_array'] = my_big_array

ff = AsdfFile(tree)
ff.set_array_storage(my_array, 'inline')
ff.set_array_storage(my_big_array, 'external')
ff.write_to("exploded_data/test_exploded.asdf")

# Or for every block:
# ff.write_to("test.asdf", all_array_storage='external')

In [34]:
ls exploded_data/


test_exploded.asdf      test_exploded0000.asdf

In [35]:
cat exploded_data/test_exploded.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
my_array: !core/ndarray-1.0.0
  data:
  - [0.5692266677107335, 0.15130631802177097, 0.47130375299595606, 0.41078261676298844]
  - [0.6172845590606234, 0.38521795030966643, 0.2380731102351069, 0.5642219882253369]
  - [0.4135472637118681, 0.6735411057601441, 0.46484849720818966, 0.23949691397551498]
  datatype: float64
  shape: [3, 4]
my_big_array: !core/ndarray-1.0.0
  source: test_exploded0000.asdf
  datatype: float64
  byteorder: little
  shape: [8, 8]
...

In [36]:
cat exploded_data/test_exploded0000.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
...
[binary block data]
#ASDF BLOCK INDEX
%YAML 1.1
--- [255]
...

Data Provenance


In [ ]:
mkdir -p provenance

In [37]:
from asdf import AsdfFile
import numpy as np

tree = {
    'some_random_data': np.random.rand(5, 5)
}

ff = AsdfFile(tree)
ff.add_history_entry(
    u"Initial random numbers",
    {u'name': u'asdf examples',
     u'author': u'John Q. Public',
     u'homepage': u'http://github.com/spacetelescope/asdf',
     u'version': u'0.1',
    u'spase_dict': {u'resource_id': 5}})
ff.write_to('provenance/provenance.asdf')

In [38]:
cat provenance/provenance.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
history:
- !core/history_entry-1.0.0
  description: Initial random numbers
  software: !core/software-1.0.0
    author: John Q. Public
    homepage: http://github.com/spacetelescope/asdf
    name: asdf examples
    spase_dict: {resource_id: 5}
    version: '0.1'
  time: 2017-04-03 13:32:22.302293
some_random_data: !core/ndarray-1.0.0
  source: 0
  datatype: float64
  byteorder: little
  shape: [5, 5]
...
[binary block data]
#ASDF BLOCK INDEX
%YAML 1.1
--- [659]
...

Compression


In [ ]:
mkdir -p compression

In [ ]:
from asdf import AsdfFile
import numpy as np
x = np.linspace(-20, 20, 30)
y = np.linspace(-30, 30, 50)
xx,yy = np.meshgrid(x,y)
tree = dict(variables = dict(x = xx,
                             y = yy
                            )
           )
ff = AsdfFile(tree)
ff.write_to("compression/uncompressed_data.asdf", all_array_compression=None)
ff.write_to("compression/compressed_data.asdf", all_array_compression='bzp2')

In [ ]:
import os
print('uncompressed:', os.path.getsize("compression/uncompressed_data.asdf"), 'bytes')
print('compressed (bz2):', os.path.getsize("compression/compressed_data.asdf"), 'bytes')
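The size difference can be previewed without asdf at all: the meshgrid arrays are highly repetitive, so a general-purpose compressor shrinks them dramatically. A sketch using the stdlib `bz2` module (an illustration of the ratio, not asdf's code path):

```python
import bz2
import numpy as np

x = np.linspace(-20, 20, 30)
y = np.linspace(-30, 30, 50)
xx, yy = np.meshgrid(x, y)

# Each meshgrid array repeats the same row/column of values,
# which is ideal input for bzip2:
raw = xx.tobytes() + yy.tobytes()
compressed = bz2.compress(raw)
assert len(compressed) < len(raw)
```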

Custom types

Example: Astropy Time


In [ ]:
mkdir -p time

In [48]:
from asdf import AsdfFile

from astropy.time import Time 

astrot = Time('2016-10-3')

from asdf.tags.time import TimeType

tree = {'my_time': astrot}
ff = AsdfFile(tree)

ff.write_to("time/test_time.asdf")

ff.close()

In [49]:
cat time/test_time.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
my_time: !time/time-1.0.0 '2016-10-03 00:00:00.000'
...

Verify that the loaded time matches the astropy type


In [53]:
sample_time =  AsdfFile.open('time/test_time.asdf')

my_time = sample_time.tree['my_time']

type(my_time) == type(astrot)


Out[53]:
True

Units

ASDF is capable of storing scientific units via astropy schemas.


In [ ]:
mkdir -p units

In [65]:
import numpy as np
from astropy import units as u

rho_unit = u.kg*u.cm**-3
density = np.linspace(0, 11, 5)*rho_unit
density.unit


Out[65]:
$\mathrm{\frac{kg}{cm^{3}}}$

In [71]:
from asdf import AsdfFile

tree = dict(variables=dict(density = dict(data=density.value, unit = density.unit)))
           
ff = AsdfFile(tree)
ff.set_array_storage(density, 'inline')
ff.write_to("units/units_test.asdf", all_array_compression=None)
ff.close()

In [72]:
cat units/units_test.asdf


#ASDF 1.0.0
#ASDF_STANDARD 1.0.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
asdf_library: !core/software-1.0.0 {author: Space Telescope Science Institute, homepage: 'http://github.com/spacetelescope/asdf',
  name: asdf, version: 1.1.0}
variables:
  density:
    data: !core/ndarray-1.0.0
      data: [0.0, 2.75, 5.5, 8.25, 11.0]
      datatype: float64
      shape: [5]
    unit: !unit/unit-1.0.0 'cm-3 kg'
...

Verify that the variables load - density comes back as a dictionary


In [75]:
units_file =  AsdfFile.open('units/units_test.asdf')
rho = units_file.tree['variables']['density']
rho


Out[75]:
{u'data': array([  0.  ,   2.75,   5.5 ,   8.25,  11.  ]),
 u'unit': Unit("kg / cm3")}

It would be nice to define a schema for quantities, so that they would load as astropy arrays with units!

Extending ASDF schema

ASDF is designed to be extensible so outside teams can add their own types and structures while retaining compatibility with tools that don’t understand those conventions.

https://github.com/STScI-JWST/jwst/tree/master/jwst/datamodels/schemas
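A hypothetical sketch of what such an extension schema could look like (the id, tag, and property names here are made up for illustration; the $ref targets are the core ASDF schemas):

```yaml
%YAML 1.1
---
$schema: "http://stsci.edu/schemas/yaml-schema/draft-01"
id: "http://example.org/schemas/quantity-1.0.0"
tag: "tag:example.org:example/quantity-1.0.0"
title: Hypothetical quantity type (array + unit)
type: object
properties:
  value:
    $ref: "tag:stsci.edu:asdf/core/ndarray-1.0.0"
  unit:
    $ref: "tag:stsci.edu:asdf/unit/unit-1.0.0"
required: [value, unit]
```

Objects tagged with such a custom tag would validate against this schema, while tools that don't know the tag could still read the underlying ndarray and unit string.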