Vaex uses HDF5 (Hierarchical Data Format) for storing data. You can think of an HDF5 file as a file system in which the 'files' are N-dimensional arrays, or as the binary equivalent of an XML file. Because it is almost like a filesystem, you can store data anywhere, for instance under '/mydata/somearray'.
For vaex we based our layout on VOTable; any recommendations, comments or requests to standardize are welcome.
In vaex, every column is stored under /data, which can be verified using the h5ls tool:
$ h5ls data/helmi-dezeeuw-2000-10p.hdf5
data Group
All columns are stored under this group, and can be listed:
$ h5ls data/helmi-dezeeuw-2000-10p.hdf5/data
E Dataset {330000}
FeH Dataset {330000}
L Dataset {330000}
Lz Dataset {330000}
random_index Dataset {330000}
vx Dataset {330000}
vy Dataset {330000}
vz Dataset {330000}
x Dataset {330000}
y Dataset {330000}
z Dataset {330000}
If for some reason you don't want to use vaex but still want to access the data from Python, you could do something like this:
import h5py
import numpy as np
h5file = h5py.File("/Users/users/breddels/src/vaex/data/helmi-dezeeuw-2000-10p.hdf5", "r")
FeH = h5file["/data/FeH"]
# FeH is an h5py Dataset; it behaves much like a NumPy array (with some extras)
# FeH[:] reads the whole column into memory as a NumPy array
print("mean FeH", np.mean(FeH[:]), "length", len(FeH))
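The column listing that h5ls gives can also be reproduced from Python. The following is a minimal sketch, assuming only that every column is a one-dimensional dataset directly under /data as described above. The helper name `list_columns` and the throwaway demo file are hypothetical and only serve to make the example runnable without the real data file.

```python
import os
import tempfile

import h5py
import numpy as np


def list_columns(path):
    """Return {column name: length} for every dataset stored under /data."""
    with h5py.File(path, "r") as h5file:
        return {
            name: dataset.shape[0]
            for name, dataset in h5file["/data"].items()
            if isinstance(dataset, h5py.Dataset)
        }


# Build a small throwaway file (made-up values, not the real dataset).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(demo_path, "w") as h5file:
    # Intermediate groups such as /data are created automatically.
    h5file.create_dataset("/data/x", data=np.arange(5, dtype="f8"))
    h5file.create_dataset("/data/FeH", data=np.zeros(5))

columns = list_columns(demo_path)
for name, length in columns.items():
    print(name, "Dataset {%d}" % length)
```

Pointing `list_columns` at a real vaex file should give the same names and lengths as the h5ls output shown earlier.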
More information about a column can be found using:
$ h5ls -v data/helmi-dezeeuw-2000-10p.hdf5/data/FeH
Opened "data/helmi-dezeeuw-2000-10p.hdf5" with sec2 driver.
FeH Dataset {330000/330000}
Attribute: ucd scalar
Type: variable-length null-terminated ASCII string
Data: "phys.abund.fe"
Attribute: unit scalar
Type: variable-length null-terminated ASCII string
Data: "dex"
Location: 1:2644064
Links: 1
Storage: 2640000 logical bytes, 2640000 allocated bytes, 100.00% utilization
Type: native double
Here we see that, similar to VOTable, the column has a ucd attribute, which describes what the column represents, and a unit attribute, which gives its units.
These can be accessed using h5py as well:
print(FeH.attrs["ucd"], FeH.attrs["unit"])
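Not every column necessarily carries both attributes, so it can be convenient to collect them defensively for all columns at once. The sketch below assumes the layout described above (datasets under /data, with optional `ucd` and `unit` string attributes). The helper names `read_metadata` and `_as_text` are hypothetical, and the demo file is made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def _as_text(value):
    """Decode bytes to str; leave str and None unchanged."""
    if isinstance(value, bytes):
        return value.decode("ascii")
    return value


def read_metadata(path):
    """Return {column name: {"ucd": ..., "unit": ...}}, with None if absent."""
    metadata = {}
    with h5py.File(path, "r") as h5file:
        for name, dataset in h5file["/data"].items():
            metadata[name] = {
                key: _as_text(dataset.attrs.get(key)) for key in ("ucd", "unit")
            }
    return metadata


# Throwaway demo file: one column with attributes, one without.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_attrs.hdf5")
with h5py.File(demo_path, "w") as h5file:
    feh = h5file.create_dataset("/data/FeH", data=np.zeros(3))
    feh.attrs["ucd"] = "phys.abund.fe"
    feh.attrs["unit"] = "dex"
    h5file.create_dataset("/data/x", data=np.ones(3))

metadata = read_metadata(demo_path)
print(metadata["FeH"]["ucd"], metadata["FeH"]["unit"])
```

Using `attrs.get` rather than indexing avoids a KeyError for columns that have no ucd or unit.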
A further restriction is that the first character of a column name must be an underscore (_) or an ASCII letter (a-z or A-Z); the following characters may also include digits.
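The naming rule can be checked with a regular expression. This is a minimal sketch of the rule as stated here, not a validator taken from vaex itself; the function name `is_valid_column_name` is hypothetical.

```python
import re

# First character: underscore or ASCII letter; the rest may also be digits.
COLUMN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_column_name(name):
    """Return True if name satisfies the column naming rule."""
    return COLUMN_NAME.fullmatch(name) is not None


for candidate in ["FeH", "_tmp1", "2mass", "v-x"]:
    print(candidate, is_valid_column_name(candidate))
```

`fullmatch` is used instead of `match` so that trailing characters, including a trailing newline, cause the name to be rejected.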
For completeness, the layout is as follows