This command creates the sample data files used in the rest of this example. These files contain no real data, but they have the same structure as European XFEL's HDF5 data files.
In [1]:
!python3 -m karabo_data.tests.make_examples
In [2]:
!h5ls fxe_control_example.h5
In [3]:
from karabo_data import H5File
f = H5File('fxe_control_example.h5')
In [4]:
f.control_sources
Out[4]:
In [5]:
f.instrument_sources
Out[5]:
In [6]:
for tid, data in f.trains():
    print("Processing train", tid)
    print("beam iyPos:", data['SA1_XTD2_XGM/DOOCS/MAIN']['beamPosition.iyPos.value'])
    break
In [7]:
tid, data = f.train_from_id(10005)
data['FXE_XAD_GEC/CAM/CAMERA:daqOutput']['data.image.dims']
Out[7]:
These are just a few of the ways to access data. The attributes and methods described below for run directories also work with individual files. We expect that it will normally make sense to access a run directory as a single object, rather than working with the files separately.
In [8]:
!ls fxe_example_run/
In [9]:
from karabo_data import RunDirectory
run = RunDirectory('fxe_example_run/')
In [10]:
run.files[:3] # The objects for the individual files (see above)
Out[10]:
What devices were recording in this run?
Control devices provide slow data, recorded once per train. Instrument devices include detectors, as well as other data sources such as cameras; they can have more than one reading per train.
In [11]:
run.control_sources
Out[11]:
In [12]:
run.instrument_sources
Out[12]:
Which trains are in this run?
In [13]:
print(run.train_ids[:10])
See the available keys for a given source:
In [14]:
run.keys_for_source('SPB_XTD9_XGM/DOOCS/MAIN:output')
Out[14]:
This collects data from across files, including detector data:
In [15]:
for tid, data in run.trains():
    print("Processing train", tid)
    print("Detector data module 0 shape:", data['FXE_DET_LPD1M-1/DET/0CH0:xtdf']['image.data'].shape)
    break  # Stop after the first train to keep the demo short
Train IDs are meant to be globally unique (although there were some glitches with this in the past). A train index, by contrast, is only meaningful within this run.
In [16]:
tid, data = run.train_from_id(10005)
tid, data = run.train_from_index(5)
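The difference between the two lookups can be pictured with a plain Python list. This is only a sketch: it assumes the run's train IDs are the contiguous range 10000 to 10479, as in this example data, and the `train_ids` list here is a stand-in rather than the real `run.train_ids`. Real runs may have gaps, so an index cannot in general be converted to a train ID by adding an offset.

```python
# Stand-in for run.train_ids, assuming the contiguous range of the example data
train_ids = list(range(10000, 10480))

# A train index is a position in this run's list of trains...
tid_from_index = train_ids[5]

# ...while a train ID is the label stored with the data itself.
index_from_tid = train_ids.index(10005)

print(tid_from_index, index_from_tid)  # 10005 5
```

Under that assumption, the two calls in the cell above refer to the same train.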
Data which holds a single number per train (or per pulse) can be extracted as series (individual columns) and dataframes (tables) for pandas, a widely used tool for data manipulation.
karabo_data chains sequence files, which contain successive data from the same source. In this example, trains 10000–10399 are in one sequence file (...DA01-S00000.h5), and 10400–10479 are in another (...DA01-S00001.h5). They are concatenated into one series:
In [17]:
ixPos = run.get_series('SA1_XTD2_XGM/DOOCS/MAIN', 'beamPosition.ixPos.value')
ixPos.tail(10)
Out[17]:
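As a rough mental model of the chaining of sequence files (not karabo_data's actual implementation), the train IDs of the two files can be joined end to end. The dictionary below is a hypothetical stand-in for the files; its keys reuse the abbreviated file names from the text and no real files are opened.

```python
from itertools import chain

# Hypothetical stand-ins: train IDs held by each sequence file
sequence_files = {
    "...DA01-S00000.h5": range(10000, 10400),
    "...DA01-S00001.h5": range(10400, 10480),
}

# Concatenate the per-file train IDs in sequence order
all_train_ids = list(chain.from_iterable(sequence_files.values()))

print(len(all_train_ids), all_train_ids[0], all_train_ids[-1])  # 480 10000 10479
```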
To extract a dataframe, you can select interesting data fields with glob syntax, as often used for selecting files on Unix platforms.
[abc]: one character, a/b/c
?: any one character
*: any sequence of characters
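The glob rules above can be tried out with Python's standard fnmatch module. This is only an illustration of the pattern syntax: karabo_data's own matching may differ in detail, the key names are shortened, and the izPos key is made up here to show a non-match.

```python
from fnmatch import fnmatchcase

# Source names taken from this example
sources = [
    "SA1_XTD2_XGM/DOOCS/MAIN",
    "SPB_XTD9_XGM/DOOCS/MAIN",
    "FXE_XAD_GEC/CAM/CAMERA",
]
# Shortened key names; "beamPosition.izPos" is hypothetical
keys = ["beamPosition.ixPos", "beamPosition.iyPos", "beamPosition.izPos"]

# * matches any sequence of characters, including "/"
matched_sources = [s for s in sources if fnmatchcase(s, "*_XGM/*")]

# [xy] matches exactly one character, either x or y
matched_keys = [k for k in keys if fnmatchcase(k, "*.i[xy]Pos")]

# ? matches any one character, so the hypothetical izPos matches too
matched_any = [k for k in keys if fnmatchcase(k, "*.i?Pos")]

print(matched_sources)  # the two XGM sources
print(matched_keys)     # ixPos and iyPos only
print(matched_any)      # all three keys
```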
In [18]:
run.get_dataframe(fields=[("*_XGM/*", "*.i[xy]Pos")])
Out[18]:
Data with extra dimensions can be handled as xarray labelled arrays. These wrap NumPy arrays and add indexes, which can be used to align the arrays and select data.
In [19]:
xtd2_intensity = run.get_array('SA1_XTD2_XGM/DOOCS/MAIN:output', 'data.intensityTD', extra_dims=['pulseID'])
xtd2_intensity
Out[19]:
Here's a brief example of using xarray to align the data and select by train ID. See the examples in the xarray docs for more on what it can do.
In this example data, all the data sources have the same range of train IDs, so aligning them doesn't change anything. In real data, devices may miss some trains that other devices did record.
In [20]:
import xarray as xr
xtd9_intensity = run.get_array('SPB_XTD9_XGM/DOOCS/MAIN:output', 'data.intensityTD', extra_dims=['pulseID'])
# Align two arrays, keep only trains which they both have data for:
xtd2_intensity, xtd9_intensity = xr.align(xtd2_intensity, xtd9_intensity, join='inner')
# Select data for a single train by train ID:
xtd2_intensity.sel(trainId=10004)
# Select data from a range of train IDs.
# This includes the end value, unlike normal Python indexing
xtd2_intensity.loc[10004:10006]
Out[20]:
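To picture what an inner join does when devices miss trains, here is a sketch with plain dictionaries instead of xarray. The readings and the missing trains are invented for illustration; only the idea of keeping trains present in both sources is taken from the example above.

```python
# Hypothetical per-train readings, keyed by train ID.
# Device A missed train 10004; device B missed train 10002.
device_a = {10000: 1.0, 10001: 1.1, 10002: 1.2, 10003: 1.3}
device_b = {10000: 5.0, 10001: 5.1, 10003: 5.3, 10004: 5.4}

# Inner join: keep only the train IDs that both devices recorded
common_trains = sorted(device_a.keys() & device_b.keys())
aligned_a = [device_a[t] for t in common_trains]
aligned_b = [device_b[t] for t in common_trains]

# Label-based range selection that includes the end value
selected = [t for t in common_trains if 10001 <= t <= 10003]

print(common_trains)  # [10000, 10001, 10003]
print(selected)       # [10001, 10003]
```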
You can also specify a region of interest from an array to load only part of the data:
In [21]:
from karabo_data import by_index
# Select the first 5 trains in this run:
sel = run.select_trains(by_index[:5])
# Get the whole of this array:
arr = sel.get_array('FXE_XAD_GEC/CAM/CAMERA:daqOutput', 'data.image.pixels')
print("Whole array shape:", arr.shape)
# Get a region of interest
arr2 = sel.get_array('FXE_XAD_GEC/CAM/CAMERA:daqOutput', 'data.image.pixels', roi=by_index[100:200, :512])
print("ROI array shape:", arr2.shape)
karabo_data provides a few ways to get general information about what's in data files. First, from Python code:
In [22]:
run.info()
In [23]:
run.detector_info('FXE_DET_LPD1M-1/DET/0CH0:xtdf')
Out[23]:
The lsxfel command provides similar information at the command line:
In [24]:
!lsxfel fxe_example_run/RAW-R0450-LPD00-S00000.h5
In [25]:
!lsxfel fxe_example_run/RAW-R0450-DA01-S00000.h5
In [26]:
!lsxfel fxe_example_run