Developing Scientific Workflows in Pycroscopy - Part 0: Data Format

Suhas Somnath

8/8/2017

This set of notebooks will serve as examples for developing end-to-end workflows for and using pycroscopy.

This preliminary document goes over the pycroscopy data format.

Why should you care?

The quest for understanding more about samples has necessitated the development of a multitude of microscopes, each capable of numerous measurement modalities.

Typically, each commercial microscope generates data files in a proprietary format defined by the instrument manufacturer. The proprietary nature of these data formats impedes scientific progress in the following ways:

  1. Researchers find it challenging to extract data from these files
  2. Data acquired from different instruments are difficult to correlate
  3. Results cannot be stored back into the same file
  4. Files ranging from a few kilobytes to several gigabytes of data are difficult to accommodate
  5. Each format requires its own version of the analysis routines

Future concerns:

  1. Several fields are moving towards the open science paradigm, which will require journals and researchers to support journal papers with data and analysis software
  2. US Federal agencies that support scientific research require curation of datasets in a clear and organized manner

To solve these and many more problems, we have developed an instrument-agnostic data format that can be used to represent data of any size, dimensionality, or complexity from any instrument.

Pycroscopy data format

Regardless of origin, modality or complexity, imaging data have one thing in common:

  • The same measurement is performed at multiple spatial locations

The data format in pycroscopy is based on this one simple ground truth. The data always has some spatial dimensions (X, Y, Z) and some spectroscopic dimensions (time, frequency, intensity, wavelength, temperature, cycle, voltage, etc.). In pycroscopy, the spatial dimensions are collapsed onto a single dimension and the spectroscopic dimensions are flattened onto the other dimension. Thus, all data are stored as two dimensional grids. Here are some examples of how some familiar data can be represented using this paradigm:

  • Grayscale photographs: A single value (intensity) is recorded at each pixel in a two dimensional grid. Thus, there are two spatial dimensions (X, Y) and one spectroscopic dimension ("Intensity"). The data can be represented as an N x 1 matrix, where N is the product of the number of rows and columns of pixels. The second axis has a size of 1 since we only record one value (intensity) at each location. The positions will be arranged as row0-col0, row0-col1, ..., row0-colN, row1-col0, ...
    • In the case of a color image, the data would be of shape N x 3, where the red, green, and blue intensity values would be stored separately.
  • A single Raman spectrum: In this case, the measurement is recorded at a single location. At this position, data is recorded as a function of a single (spectroscopic) variable such as wavelength. Thus, this data is represented as a 1 x P matrix, where P is the number of points in the spectrum.
  • Scanning Tunneling Spectroscopy or IV spectroscopy: The current (a 1D array of size P) is recorded as a function of voltage at each position in a two dimensional grid of points (two spatial dimensions). Thus, the data would be represented as an N x P matrix, where N is the product of the number of rows and columns in the grid and P is the number of spectroscopic points recorded.
    • If the same voltage sweep were performed twice at each location, the data would be represented as N x (2 * P). The data is still saved as a long (2 * P) 1D array at each location. The spectroscopic dimensions would change from just ['Voltage'] to ['Voltage', 'Cycle'], where the second spectroscopic dimension would account for repetitions of this bias sweep.
      • The spectroscopic data would be stored in the order in which it was recorded: volt_0-cycle_0, volt_1-cycle_0, ..., volt_P-1-cycle_0, volt_0-cycle_1, ..., volt_P-1-cycle_1, just like the positions.
    • Now, if the bias was swept thrice from -1 to +1 V and then thrice again from -2 to +2 V, the data becomes N x (2 * 3 * P). The data now has two position dimensions (X, Y) and three spectroscopic dimensions ['Voltage', 'Cycle', 'Step']. The data is still saved as a (P * 2 * 3) 1D array at each location.
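The flattening described in the examples above can be sketched with plain NumPy. This is only an illustration and not pycroscopy code: the array sizes, the variable names, and the random data are all assumptions. It also assumes that the N-dimensional array is ordered as [row, column, step, cycle, voltage], so that a C-ordered reshape makes voltage the fastest-varying spectroscopic dimension, as in the volt_0-cycle_0, volt_1-cycle_0, ... ordering above.

```python
import numpy as np

# Hypothetical acquisition sizes (illustrative only)
num_rows, num_cols = 4, 5                     # spatial grid: rows (Y) x columns (X)
num_steps, num_cycles, num_volts = 2, 3, 7    # 'Step', 'Cycle', 'Voltage'

# Fake N-dimensional measurement ordered as [row, col, step, cycle, voltage],
# so that voltage varies fastest and step varies slowest at each position.
rng = np.random.default_rng(seed=0)
data_nd = rng.normal(size=(num_rows, num_cols, num_steps, num_cycles, num_volts))

# Collapse all spatial dimensions onto axis 0 and
# all spectroscopic dimensions onto axis 1.
num_positions = num_rows * num_cols
num_spec_points = num_steps * num_cycles * num_volts
data_2d = data_nd.reshape(num_positions, num_spec_points)

print(data_2d.shape)  # (20, 42)

# Locate one measurement in the flattened grid:
# position (row=1, col=2) and spectroscopic point (step=1, cycle=0, volt=3).
flat_row = 1 * num_cols + 2
flat_col = (1 * num_cycles + 0) * num_volts + 3
assert data_2d[flat_row, flat_col] == data_nd[1, 2, 1, 0, 3]

# A grayscale image is the simplest case: N positions x 1 value.
image = rng.random((num_rows, num_cols))
image_2d = image.reshape(-1, 1)
print(image_2d.shape)  # (20, 1)
```

Because the reshape only reinterprets the existing ordering, no measurement is moved or lost. The flattened grid can be turned back into its original form as long as the original sizes are known.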

Making sense of such flattened datasets:

Each main dataset is always accompanied by four ancillary datasets:

  • the position value and index of each spatial location (row)
  • the spectroscopic value and index of any column in the dataset

In addition to serving as a legend or the key, these ancillary datasets are necessary for explaining:

  • the original dimensionality of the dataset
  • how to reshape the data back to its N dimensional form
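As a rough sketch of how index datasets can play this role, the NumPy snippet below builds index tables for the positions and the spectroscopic points and then uses them to recover the N dimensional shape. The helper names `index_table` and `reshape_to_nd` are hypothetical and are not pycroscopy functions. The layout (one row per flattened element, one column per dimension, slowest-varying dimension first) is likewise an assumption made for this illustration.

```python
import numpy as np

def index_table(dim_sizes):
    """Index of every flattened element along each dimension.

    dim_sizes is ordered from the slowest- to the fastest-varying dimension.
    Returns an integer array of shape (prod(dim_sizes), len(dim_sizes)).
    """
    flat = np.arange(int(np.prod(dim_sizes)))
    return np.column_stack(np.unravel_index(flat, dim_sizes))

def reshape_to_nd(data_2d, pos_inds, spec_inds):
    """Use the index tables to infer each dimension's size and reshape."""
    pos_sizes = tuple(int(s) for s in pos_inds.max(axis=0) + 1)
    spec_sizes = tuple(int(s) for s in spec_inds.max(axis=0) + 1)
    return data_2d.reshape(pos_sizes + spec_sizes)

# Hypothetical sizes: 3 rows x 4 columns, and 2 steps x 3 cycles x 5 voltages
pos_inds = index_table((3, 4))        # shape (12, 2)
spec_inds = index_table((2, 3, 5))    # shape (30, 3)

print(pos_inds[:5].tolist())  # [[0, 0], [0, 1], [0, 2], [0, 3], [1, 0]]

# The "value" datasets map each index to a physical quantity,
# for example the voltage applied at every spectroscopic point.
volt_axis = np.linspace(-1, 1, 5)
volt_vals = volt_axis[spec_inds[:, 2]]

# Stand-in for a flattened main dataset
data_2d = np.arange(12 * 30).reshape(12, 30)
data_nd = reshape_to_nd(data_2d, pos_inds, spec_inds)
print(data_nd.shape)  # (3, 4, 2, 3, 5)
```

The size of each dimension is inferred from the largest index in each column, which is why the main dataset can remain two dimensional without losing information about its original form.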

From the IV Spectroscopy example with [X, Y] x [Voltage, Cycle, Step]:

  • The position datasets would be of shape N x 2: N total positions and two spatial dimensions.
    • The position indices dataset may start like:

    * 0, 0
    * 0, 1
    * ....
    * 0, N/2
    * 1, 0 ....
    would be structured exactly

Channels

The pycroscopy data format also allows multiple channels of information to be recorded as separate datasets in the same file. For example, one channel could be a spectrum (1D array) collected at each location on a 2D grid while another could be the temperature (single value) recorded by another sensor at the same spatial positions.

My hope is that this notebook will serve as a comprehensive example for:

  1. Data Access

    1. Loading, reading, writing, and manipulating HDF5 / H5 files.
  2. Visualization

    1. Visualizing results of analyses and processing using pycroscopy functions
    2. Developing simple interactive visualizers

Among the numerous benefits of HDF5 files are that these files:

  • are readily compatible with high-performance computing facilities
  • scale very efficiently from a few kilobytes to several terabytes
  • can be read and modified using many languages, including Python, Matlab, C/C++, Java, Fortran, Igor Pro, etc.
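As a minimal sketch of writing and reading such a file from Python, the snippet below uses the h5py package to store two channels in one HDF5 file and read them back. The group and dataset names ("Channel_000", "Raw_Data"), the attribute names, the file name, and the random data are assumptions chosen for illustration. This is not a description of how pycroscopy itself lays out its files.

```python
import os
import tempfile

import h5py
import numpy as np

num_positions, num_spec_points = 6, 10   # hypothetical sizes

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "example.h5")

    # Write: one group per channel, each holding a 2D (positions x points) dataset
    with h5py.File(file_path, "w") as h5_file:
        chan_0 = h5_file.create_group("Channel_000")
        spectra = chan_0.create_dataset(
            "Raw_Data", data=np.random.rand(num_positions, num_spec_points))
        spectra.attrs["quantity"] = "Current"
        spectra.attrs["units"] = "A"

        # Second channel: a single temperature value at the same positions
        chan_1 = h5_file.create_group("Channel_001")
        temperature = chan_1.create_dataset(
            "Raw_Data", data=np.random.rand(num_positions, 1))
        temperature.attrs["quantity"] = "Temperature"
        temperature.attrs["units"] = "K"

    # Read: list the channels and load only the first spectrum from disk
    with h5py.File(file_path, "r") as h5_file:
        shapes = {name: h5_file[name]["Raw_Data"].shape for name in h5_file}
        first_spectrum = h5_file["Channel_000/Raw_Data"][0, :]

print(shapes)
print(first_spectrum.shape)
```

Slicing the dataset object reads only the requested portion from disk, which is one reason the same access pattern remains practical for files much larger than memory.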
