Reproducibility and Containers

J. S. Oishi

What is Reproducibility?

Nomenclature

  • Can you redo a calculation for a paper in the future? Can another person in your group?
    • "repeatability"
  • Can someone download everything you did, run it, and get the same result?
    • "replicability"
  • Can someone read your paper, do a clean-room implementation (or use another code), access the data source, and reproduce your results independently?
    • "reproducibility"
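
What "the same result" means deserves a moment of care. The sketch below is an illustration, not part of any particular code: it shows that simply changing the order of a floating-point sum (something that can differ between BLAS or MPI implementations) changes the last bits of the answer. The helper `same_result` and its tolerance are hypothetical choices made for this example.

```python
import math

# Floating-point addition is not associative, so the order of operations
# changes the last bits of the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

def same_result(a, b, rel_tol=1e-12):
    """Hypothetical helper: compare two floats up to a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)

print(left == right)             # bitwise comparison
print(same_result(left, right))  # comparison up to a tolerance
```

In practice, checking a replicated result usually means agreeing on a tolerance rather than demanding bit-for-bit equality.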

Open-source is not enough

Our code, the Dedalus Project, relies on the stack below (arrows point to each dependency's own dependencies):

Stack

  • numpy --> BLAS --> vendor-provided (e.g. Accelerate, Intel MKL)? OpenBLAS?
  • scipy --> UMFPACK (for sparse linear algebra) --> BLAS
  • Python
  • HDF5
  • mpi4py --> MPI --> vendor-provided? Open MPI? MPICH?
  • and more
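
One small step toward repeatability is to record which versions of the stack a run actually used. The following is a minimal sketch, assuming a Python 3.8+ environment; the function name `describe_stack` is hypothetical, and `h5py` is an assumption (a common Python binding to HDF5, not named in the list above).

```python
import json
import platform
from importlib import metadata

# Package names follow the list above; "h5py" is an assumed stand-in
# for the HDF5 dependency.
PACKAGES = ["numpy", "scipy", "mpi4py", "h5py"]

def describe_stack(packages=PACKAGES):
    """Return a dict of interpreter, platform, and package versions."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # not installed in this environment
    return info

if __name__ == "__main__":
    # Save this output alongside your simulation data.
    print(json.dumps(describe_stack(), indent=2, sort_keys=True))
```

This records only Python-level versions. It does not say which BLAS or MPI library sits underneath; `numpy.show_config()` prints the BLAS/LAPACK build information for numpy, but the full picture still depends on the system.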

See install script

Also...shouldn't your results NOT rely on YOUR code?

Examples from this paper

Enter containers

Images vs Containers

An image is a blob that contains everything you need to run software:

  • an OS (Linux; your choice of distribution)
  • libraries, compilers, etc
  • the software you want to run

A container is a running instance of an image. It behaves as though it were its own computer (it isn't; it is NOT a virtual machine!) and isolates your program from the rest of the system.

Dockerfiles

An image is built from a Dockerfile. Here's a simple one:

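# Base image; pin a specific tag instead of "latest" for reproducible builds
FROM debian:latest

# Install basic tools
RUN apt-get -y update && apt-get install -y wget sudo

# Create a "dedalus" user (password "dedalus") with sudo rights
RUN useradd -ms /bin/bash dedalus && echo "dedalus:dedalus" | chpasswd && adduser dedalus sudo

# Run as that user and start a shell by default
USER dedalus
CMD ["/bin/bash"]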

Docker Hub

Just as GitHub hosts code, Docker Hub hosts images.

Unlike GitHub, Docker Hub is automatically connected to the Docker runtime:

(Please don't run this right now.)

docker run -it -p 8888:8888 -v $PWD/data:/data -w /data tensorflow/tensorflow:1.13.1-py3-jupyter bash

This command pulls the image from Docker Hub automatically if it is not already on your machine; no separate download step is needed.

