
Intro

  • Particle accelerator: particles travelling close to the speed of light. A synchrotron.
  • Data analysis during and after run.
  • Many users, time constraints.
  • Java-based code, Generic Data Acquisition (GDA)
    • But want scientists to script/extend it.
    • Jython.
  • Ever-increasing throughput of data.
    • 2007: 10MB/s
    • 2009: 60MB/s
    • 2011: 150MB/s
    • 2013: 600MB/s
    • 2015: 6000MB/s
    • doubling every 7.5 months!
    • peaks at 1TB/day right now

Data storage

  • 1PB near-line, 0.5PB on-line.
  • 200M+ files.
  • High-performance parallel file systems hate lots of small files.
  • So they moved to fewer, bigger files stored as HDF5, but HDF5 does not handle ASCII files well (sketch after this list).
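
A minimal sketch of the "fewer, bigger files" idea, assuming h5py and a hypothetical file layout; this is not Diamond's actual code, just an illustration of consolidating many small per-frame arrays into one chunked HDF5 file.

    # Hedged sketch: consolidate many small per-frame arrays into a single
    # chunked HDF5 file, which parallel file systems handle far better than
    # millions of tiny files.  File and dataset names are hypothetical.
    import h5py
    import numpy as np

    frames = [np.random.rand(512, 512).astype("float32") for _ in range(100)]  # stand-in data

    with h5py.File("scan_0001.h5", "w") as f:
        dset = f.create_dataset(
            "entry/data",
            shape=(len(frames), 512, 512),
            dtype="float32",
            chunks=(1, 512, 512),   # one chunk per frame so frames can be read back individually
            compression="gzip",
        )
        for i, frame in enumerate(frames):
            dset[i] = frame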

Big data

  • Volume, variety, veracity, velocity

Tooling

  • Excel, MATLAB, etc. all assume the data fits on a laptop.
  • These tools do not scale to big data, at least not at a reasonable price.

Python

  • Free, easily distributable
  • Already used some via Jython
  • But how to spread it?
    • Extend their existing acquisition tools to spin off new analysis tools
  • Use PyDev and Eclipse heavily
    • Spun off as "Dawn" product
  • Use scisoftpy to tie PyDev with HDF5 storage.
    • Like matplotlib but inside Eclipse; can do e.g. line fitting (sketch after this list).
  • MATLAB/IDL require expensive support, but Python is easier to support.
    • Easier to create sustainable software that survives over time.
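
The line fitting mentioned above, as a hedged sketch using plain h5py and NumPy; this is not the scisoftpy/Dawn API, and the file and dataset names are hypothetical.

    # Hedged sketch: load a 1-D profile from HDF5 and fit a straight line to it.
    import h5py
    import numpy as np

    with h5py.File("scan_0001.h5", "r") as f:     # hypothetical file/dataset names
        y = f["entry/data"][0].mean(axis=0)       # collapse one frame to a 1-D profile
    x = np.arange(y.size)

    slope, intercept = np.polyfit(x, y, deg=1)    # least-squares line fit
    print("fitted line: y = %.3g * x + %.3g" % (slope, intercept))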

Optimization

  • Need a magnetic array to wiggle the beam as it travels.
  • But magnets have imperfections.
  • Use Python to optimize an objective function.
  • Originally the Python version was ~1000x slower than the Fortran, because it was a direct port.
  • Rewritten with NumPy: ~10x slower, but much cleaner than the Fortran.
  • The cleaner code made it obvious how to improve caching; then the speeds were the same.
  • Eventually the Python version was 100x faster.
  • Then instead of simulated annealing, used Artificial Immune systems
    • Global.
    • Slower.
    • Parallelization, very simple, embarrassingly parallel (sketch after this list).
        - numpy with threads.
        - 25 machines, 200 CPUs.
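
A minimal sketch of the embarrassingly parallel evaluation described above, assuming a hypothetical NumPy-based objective() for scoring one magnet ordering. NumPy's C routines release the GIL, so plain threads can keep several cores busy; the spread across 25 machines is not shown.

    # Hedged sketch: evaluate many candidate magnet orderings in parallel
    # with a thread pool over a NumPy objective function.
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    field_errors = np.random.randn(200)        # stand-in for measured magnet imperfections

    def objective(order):
        """Hypothetical cost: how far the cumulative field error wanders for one ordering."""
        trajectory = np.cumsum(field_errors[order])
        return float(np.abs(trajectory).max())

    candidates = [np.random.permutation(field_errors.size) for _ in range(1000)]

    with ThreadPoolExecutor(max_workers=8) as pool:
        costs = list(pool.map(objective, candidates))

    best = candidates[int(np.argmin(costs))]
    print("best cost:", min(costs))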

Data reduction / processing

  • A lot of pre-existing Fortran code, but it is single-core only.
  • Want to create a data pipeline that parallelises.
  • Python to glue together the data pipeline (sketch after this list).
    • Python is not so much used for the core processing itself, though DIALS is a project doing that.
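
A hedged sketch of the glue role: run an existing single-core Fortran executable over many input files in parallel from Python. The executable name "reduce_frame", its arguments, and the file layout are hypothetical.

    # Hedged sketch: Python as glue around a legacy single-core Fortran tool.
    from multiprocessing import Pool
    import subprocess
    import glob

    def reduce_one(path):
        out = path.replace(".h5", "_reduced.h5")
        subprocess.run(["reduce_frame", path, out], check=True)  # hypothetical Fortran binary
        return out

    if __name__ == "__main__":
        inputs = sorted(glob.glob("raw/*.h5"))
        with Pool(processes=8) as pool:        # one Fortran process per core
            outputs = pool.map(reduce_one, inputs)
        print(len(outputs), "files reduced")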

Tomography (Current Implementation)

  • Need large TIFFs to do the processing; single GPU, uses CUDA.
  • Store in HDF5, use Python to write out the TIFFs; multiple GPUs.
  • Going forward (sketch after this list):
    • zeromq as transport, blosc as compression
    • can still use CUDA via Python.
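
A hedged sketch of the proposed transport: read frames from HDF5 and push them, Blosc-compressed, over a ZeroMQ socket to downstream workers. The socket address, file name, and dataset path are assumptions.

    # Hedged sketch: stream Blosc-compressed frames over ZeroMQ (PUSH side).
    # Workers would connect a PULL socket, blosc.decompress() each payload
    # and rebuild the array with np.frombuffer().
    import blosc
    import h5py
    import numpy as np
    import zmq

    ctx = zmq.Context()
    sender = ctx.socket(zmq.PUSH)
    sender.bind("tcp://*:5555")                  # hypothetical endpoint

    with h5py.File("tomo_scan.h5", "r") as f:    # hypothetical file name
        frames = f["entry/data"]
        for i in range(frames.shape[0]):
            frame = np.ascontiguousarray(frames[i])
            payload = blosc.compress(frame.tobytes(), typesize=frame.dtype.itemsize)
            sender.send_multipart([str(i).encode(), payload])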

multiprocessing + MPI "profiling"

  • Drew a multi-process, multi-thread image of the profile as a time series.
  • Based on log messages, not fine grained like cProfile
  • But discovered h5py holds the GIL
  • google.visualization.DataTable() in JavaScript; Python parses the logs and jinja2 produces the HTML (sketch after this list).
  • Much better than grepping!
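
A hedged sketch of the log-to-HTML step: Python parses coarse START/END log messages and jinja2 renders a page that builds a google.visualization.DataTable for a timeline chart. The log format, file names, and column layout are assumptions.

    # Hedged sketch: coarse log lines ("<epoch> <process> <thread> START|END <task>")
    # rendered into an HTML timeline via jinja2 and Google Charts.
    import json
    from jinja2 import Template

    PAGE = Template("""<html><head>
    <script src="https://www.gstatic.com/charts/loader.js"></script>
    <script>
    google.charts.load('current', {packages: ['timeline']});
    google.charts.setOnLoadCallback(function () {
      var data = new google.visualization.DataTable();
      data.addColumn('string', 'Worker');
      data.addColumn('date', 'Start');
      data.addColumn('date', 'End');
      {{ rows }}.forEach(function (r) {
        data.addRow([r[0], new Date(r[1] * 1000), new Date(r[2] * 1000)]);
      });
      new google.visualization.Timeline(document.getElementById('chart')).draw(data);
    });
    </script></head><body><div id="chart"></div></body></html>""")

    def parse(lines):
        starts, rows = {}, []
        for line in lines:
            t, proc, thread, event, task = line.split()
            key = (proc, thread, task)
            if event == "START":
                starts[key] = float(t)
            else:
                rows.append(["%s/%s %s" % (proc, thread, task), starts.pop(key), float(t)])
        return rows

    if __name__ == "__main__":
        with open("run.log") as fh:              # hypothetical log file
            rows = parse(fh)
        with open("profile.html", "w") as out:
            out.write(PAGE.render(rows=json.dumps(rows)))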

Summary

  • Python is the best tool to let scientists become developers.
  • Rate of change of techniques and equipment is increasing.
    • Python supports this.

http://www.dawnsci.org http://www.diamond.ac.uk