
Intro

  • Particle accelerator: particles travelling close to the speed of light. A synchrotron.
  • Data analysis during and after run.
  • Many users, time constraints.
  • Java-based code, Generic Data Acquisition (GDA)
    • But want scientists to script/extend it.
    • Jython.
  • Ever-increasing throughput of data.
    • 2007: 10MB/s
    • 2009: 60MB/s
    • 2011: 150MB/s
    • 2013: 600MB/s
    • 2015: 6000MB/s
    • doubling every 7.5 months!
    • peaks at 1TB/day right now

Data storage

  • 1PB near-line, 0.5PB on-line.
  • 200M+ files.
  • High-performance parallel file systems hate lots of small files.
  • So they moved to fewer, bigger files stored as HDF5, but HDF5 does not handle ASCII files well (sketch after this list).
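
A minimal sketch of the "fewer, bigger files" idea, assuming h5py and a hypothetical file layout; this is not Diamond's actual code, just an illustration of consolidating many small per-frame arrays into one chunked HDF5 file.

    # Hedged sketch: consolidate many small per-frame arrays into a single
    # chunked HDF5 file, which parallel file systems handle far better than
    # millions of tiny files.  File and dataset names are hypothetical.
    import h5py
    import numpy as np

    frames = [np.random.rand(512, 512).astype("float32") for _ in range(100)]  # stand-in data

    with h5py.File("scan_0001.h5", "w") as f:
        dset = f.create_dataset(
            "entry/data",
            shape=(len(frames), 512, 512),
            dtype="float32",
            chunks=(1, 512, 512),   # one chunk per frame so frames can be read back individually
            compression="gzip",
        )
        for i, frame in enumerate(frames):
            dset[i] = frame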

Big data

  • Volume, variety, veracity, velocity

Tooling

  • Excel, MATLAB, etc. all assume the data fits on a laptop.
  • These tools do not scale to big data, at least not at a reasonable price.

Python

  • Free, easily distributable
  • Already used some via Jython
  • But how to spread it?
    • Extend their existing acquisition tools to spin off new analysis tools
  • Use PyDev and Eclipse heavily
    • Spun off as "Dawn" product
  • Use scisoftpy to tie PyDev with HDF5 storage.
    • Like matplotlib but inside Eclipse; can do e.g. line fitting (sketch after this list).
  • MATLAB/IDL require expensive support, but Python is easier to support.
    • Easier to create sustainable software that survives over time.
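
The line fitting mentioned above, as a hedged sketch using plain h5py and NumPy; this is not the scisoftpy/Dawn API, and the file and dataset names are hypothetical.

    # Hedged sketch: load a 1-D profile from HDF5 and fit a straight line to it.
    import h5py
    import numpy as np

    with h5py.File("scan_0001.h5", "r") as f:     # hypothetical file/dataset names
        y = f["entry/data"][0].mean(axis=0)       # collapse one frame to a 1-D profile
    x = np.arange(y.size)

    slope, intercept = np.polyfit(x, y, deg=1)    # least-squares line fit
    print("fitted line: y = %.3g * x + %.3g" % (slope, intercept))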

Optimization

  • Need a magnetic array to wiggle the beam as it travels.
  • But magnets have imperfections.
  • Use Python to optimize an objective function.
  • Originally the Python version was ~1000x slower than the Fortran, because it was a direct port.
  • Rewritten with NumPy: ~10x slower, but much cleaner than the Fortran.
  • The cleaner code made it obvious how to improve caching; then the speeds were the same.
  • Eventually the Python version was 100x faster.
  • Then instead of simulated annealing, used Artificial Immune systems
    • Global.
    • Slower.
    • Parallelization, very simple, embarrassingly parallel (sketch after this list).
        - numpy with threads.
        - 25 machines, 200 CPUs.
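
A minimal sketch of the embarrassingly parallel evaluation described above, assuming a hypothetical NumPy-based objective() for scoring one magnet ordering. NumPy's C routines release the GIL, so plain threads can keep several cores busy; the spread across 25 machines is not shown.

    # Hedged sketch: evaluate many candidate magnet orderings in parallel
    # with a thread pool over a NumPy objective function.
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    field_errors = np.random.randn(200)        # stand-in for measured magnet imperfections

    def objective(order):
        """Hypothetical cost: how far the cumulative field error wanders for one ordering."""
        trajectory = np.cumsum(field_errors[order])
        return float(np.abs(trajectory).max())

    candidates = [np.random.permutation(field_errors.size) for _ in range(1000)]

    with ThreadPoolExecutor(max_workers=8) as pool:
        costs = list(pool.map(objective, candidates))

    best = candidates[int(np.argmin(costs))]
    print("best cost:", min(costs))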

Data reduction / processing

  • A lot of pre-existing Fortran code, but it is single-core only.
  • Want to create a data pipeline that parallelises.
  • Python to glue together the data pipeline (sketch after this list).
    • Python is not so much used for the core processing itself, though DIALS is a project doing that.
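
A hedged sketch of the glue role: run an existing single-core Fortran executable over many input files in parallel from Python. The executable name "reduce_frame", its arguments, and the file layout are hypothetical.

    # Hedged sketch: Python as glue around a legacy single-core Fortran tool.
    from multiprocessing import Pool
    import subprocess
    import glob

    def reduce_one(path):
        out = path.replace(".h5", "_reduced.h5")
        subprocess.run(["reduce_frame", path, out], check=True)  # hypothetical Fortran binary
        return out

    if __name__ == "__main__":
        inputs = sorted(glob.glob("raw/*.h5"))
        with Pool(processes=8) as pool:        # one Fortran process per core
            outputs = pool.map(reduce_one, inputs)
        print(len(outputs), "files reduced")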

Tomography (Current Implementation)

  • Need large TIFFs to do the processing; single GPU, uses CUDA.
  • Store in HDF5, use Python to write out the TIFFs; multiple GPUs.
  • Going forward (sketch after this list):
    • zeromq as transport, blosc as compression
    • can still use CUDA via Python.
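
A hedged sketch of the proposed transport: read frames from HDF5 and push them, Blosc-compressed, over a ZeroMQ socket to downstream workers. The socket address, file name, and dataset path are assumptions.

    # Hedged sketch: stream Blosc-compressed frames over ZeroMQ (PUSH side).
    # Workers would connect a PULL socket, blosc.decompress() each payload
    # and rebuild the array with np.frombuffer().
    import blosc
    import h5py
    import numpy as np
    import zmq

    ctx = zmq.Context()
    sender = ctx.socket(zmq.PUSH)
    sender.bind("tcp://*:5555")                  # hypothetical endpoint

    with h5py.File("tomo_scan.h5", "r") as f:    # hypothetical file name
        frames = f["entry/data"]
        for i in range(frames.shape[0]):
            frame = np.ascontiguousarray(frames[i])
            payload = blosc.compress(frame.tobytes(), typesize=frame.dtype.itemsize)
            sender.send_multipart([str(i).encode(), payload])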

multiprocessing + MPI "profiling"

  • Drew a multi-process, multi-thread image of the profile as a time series.
  • Based on log messages, not fine grained like cProfile
  • But discovered h5py holds the GIL
  • google.visualization.DataTable() in JavaScript; Python parses the logs and jinja2 produces the HTML (sketch after this list).
  • Much better than grepping!
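
A hedged sketch of the log-to-HTML step: Python parses coarse START/END log messages and jinja2 renders a page that builds a google.visualization.DataTable for a timeline chart. The log format, file names, and column layout are assumptions.

    # Hedged sketch: coarse log lines ("<epoch> <process> <thread> START|END <task>")
    # rendered into an HTML timeline via jinja2 and Google Charts.
    import json
    from jinja2 import Template

    PAGE = Template("""<html><head>
    <script src="https://www.gstatic.com/charts/loader.js"></script>
    <script>
    google.charts.load('current', {packages: ['timeline']});
    google.charts.setOnLoadCallback(function () {
      var data = new google.visualization.DataTable();
      data.addColumn('string', 'Worker');
      data.addColumn('date', 'Start');
      data.addColumn('date', 'End');
      {{ rows }}.forEach(function (r) {
        data.addRow([r[0], new Date(r[1] * 1000), new Date(r[2] * 1000)]);
      });
      new google.visualization.Timeline(document.getElementById('chart')).draw(data);
    });
    </script></head><body><div id="chart"></div></body></html>""")

    def parse(lines):
        starts, rows = {}, []
        for line in lines:
            t, proc, thread, event, task = line.split()
            key = (proc, thread, task)
            if event == "START":
                starts[key] = float(t)
            else:
                rows.append(["%s/%s %s" % (proc, thread, task), starts.pop(key), float(t)])
        return rows

    if __name__ == "__main__":
        with open("run.log") as fh:              # hypothetical log file
            rows = parse(fh)
        with open("profile.html", "w") as out:
            out.write(PAGE.render(rows=json.dumps(rows)))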

Summary

  • Python is the best tool to let scientists become developers.
  • Rate of change of techniques and equipment is increasing.
    • Python supports this.

http://www.dawnsci.org http://www.diamond.ac.uk