Docker and Jupyter Notebooks for Reproducible Research

Goal: To understand what Docker is and how it can be used with Jupyter notebooks for reproducible research.

Docker is a tool that creates high-performance, shareable, reproducible computational environments. Jupyter notebooks are tools for interactive analysis that interweave prose, code, and results. Together, Docker and Jupyter notebooks are best-of-breed methods to create research that is reproducible.


In [ ]:
#Imports for running this presentation live

from ipywidgets import interact, interactive
from IPython.display import clear_output, display, HTML, YouTubeVideo

import numpy as np

from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import cnames
from matplotlib import animation

%matplotlib inline

!docker info
!docker load -i busybox.dockerarchive.tar

The Problem

Even though computers are often considered deterministic, computational software is a rapidly evolving and changing landscape. Libraries are constantly adding new features and fixing issues.

Image source: http://www.michaelogawa.com/research/storylines/

Even libraries with the strictest backwards-compatibility policies can change in significant ways.

Image source: http://www.bonkersworld.net/backwards-compatibility/

A reproducible computational environment has a sufficiently consistent state for the computational task at hand.

For example, this can consist of

  • a similar CPU instruction set
  • libraries and executables available with a specific version and configuration options
  • a specific version of a given compiler
  • a specific version of a libc implementation
  • a specific version of the C++ standard library
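The items above can be recorded from inside a running notebook. The following is a minimal sketch, assuming Python 3.8 or newer; the helper name `environment_snapshot` and the package list are illustrative choices, not part of the original presentation.

```python
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Collect a few facts that identify the computational environment."""
    snapshot = {
        "machine": platform.machine(),          # CPU architecture, e.g. x86_64
        "system": platform.system(),            # operating system name
        "python": platform.python_version(),    # interpreter version
        "libc": " ".join(platform.libc_ver()),  # libc name and version, may be blank
        "packages": {},
    }
    for name in packages:
        try:
            snapshot["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot["packages"][name] = None   # package is not installed
    return snapshot


snapshot = environment_snapshot(["numpy", "matplotlib"])
for key, value in snapshot.items():
    print(key, "=", value)
```

Two machines that print different snapshots are not guaranteed to produce the same results, which is the gap a container image is meant to close.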

Close But Not Good Enough

Source code

Does not include:

  • Compiler
  • Hardware it was built on
  • How it is configured
  • Package dependencies
  • Run-time environment
  • How to run it

Package managers and distributions

  • There is no consensus on a single package manager
  • Packages become unsupported over time
  • What to do if a required library is not packaged?

Virtual machines (VMs)

  • Inefficient utilization of computational resources

Image source: http://time-az.com/images/2014/02/20140203carjam.jpg

Enter Linux Containers

Linux container systems, like Docker, are a new type of tool to easily build, ship, and run reproducible, binary applications.

It is "good enough" for a reproducible computational environment.

In this talk, we will introduce Docker from the perspective of a scientific research software engineer. We will

  • Generate an understanding of what Docker is by comparing it to existing technologies.
  • Give an introduction to basic Docker concepts.
  • Describe how Docker fits into the scientific analysis workflow with Jupyter notebooks.

Understanding Docker

Not just this cute whale thing

Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere.


In [ ]:
!docker run --rm busybox sh -c 'echo "Hello Docker World!"'

Docker is a combination of a:

  1. Sandboxed chroot
  2. Copy on write filesystem
  3. Distributed VCS for binaries

Sandboxed chroot

Docker works with images that consume minimal disk space and are versioned, archivable, and shareable. Executing applications in these images does not require dedicated resources and is high performance.

It works with containers as opposed to virtual machines (VMs).


In [ ]:
%time !docker run --rm busybox sh -c 'echo "Hello Docker World!"'

A Docker container is similar to running an application in a chroot, but it also sandboxes processes and the network stack with Linux kernel features:

  • Namespaces: isolate processes, networking, messaging (IPC), file systems, and hostnames
  • cgroups: group together CPU, memory, and I/O resources so they can be limited and accounted for
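On Linux, the namespaces a process belongs to are visible as symbolic links under `/proc/<pid>/ns`. The sketch below lists them; the function name `list_namespaces` is an illustrative choice, and on systems without that directory (for example Mac or Windows hosts) it simply returns an empty dictionary.

```python
import os


def list_namespaces(pid="self"):
    """Map namespace type to its identifier for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}  # not Linux, or no such process
    namespaces = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink whose target names the namespace instance
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            namespaces[name] = None  # entry not readable with our permissions
    return namespaces


for name, identifier in list_namespaces().items():
    print(name, identifier)
```

Two processes that share a namespace show the same link target. Running the same code on the host and inside a container should therefore show different identifiers for the namespaces that Docker isolates.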

Copy on Write Filesystem

Union file systems, or UnionFS, are file systems that operate by creating layers, making them very lightweight and fast while saving disk space.

Docker can make use of several storage drivers, not all of them true union file systems, including:

  • AUFS
  • btrfs
  • vfs
  • DeviceMapper
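The layering idea can be modeled in a few lines. This is only a toy model that treats each layer as a dictionary from path to content; real storage drivers work on files or blocks, and the names `base_layer`, `data_layer`, and `new_container` are invented for illustration.

```python
from collections import ChainMap

# Read-only image layers, listed from bottom to top
base_layer = {"/bin/sh": "busybox shell", "/etc/hostname": "base"}
data_layer = {"/Data/a.png": "image bytes"}


def new_container(*image_layers):
    """Stack a fresh writable layer on top of shared read-only layers."""
    # ChainMap reads from the first mapping that has the key
    # and sends every write to the first mapping only.
    return ChainMap({}, *reversed(image_layers))


container_a = new_container(base_layer, data_layer)
container_b = new_container(base_layer, data_layer)

container_a["/etc/hostname"] = "container-a"  # lands in container_a's own layer

print(container_a["/etc/hostname"])  # container-a
print(container_b["/etc/hostname"])  # base
print(base_layer["/etc/hostname"])   # base
print(container_a.maps[0])           # {'/etc/hostname': 'container-a'}
```

Both containers share the same lower layers, so starting a second container costs only one empty writable layer.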

Distributed VCS for binaries

Docker is like Git for binaries


In [ ]:
!docker search itk

  • Docker images are identified by a hex string ID or by tags
  • The interface is docker <subcommand>
  • docker push, docker pull, docker tag
  • docker export creates an archivable tarball of a container's filesystem; docker save does the same for an image
  • DockerHub is like GitHub
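The Git analogy rests on two ideas: content is stored under a hash of itself, and tags are movable names that point at a hash. The toy below illustrates only that relationship. `ToyRegistry` is a hypothetical class and does not reproduce how Docker computes image IDs or talks to a registry.

```python
import hashlib


class ToyRegistry:
    """Content-addressed store with human-readable tags (illustration only)."""

    def __init__(self):
        self.blobs = {}  # hex digest -> content
        self.tags = {}   # tag -> hex digest

    def push(self, tag, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs[digest] = content  # identical content is stored once
        self.tags[tag] = digest       # a tag is just a pointer to a digest
        return digest

    def pull(self, reference):
        # Accept either a tag or a full hex digest
        digest = self.tags.get(reference, reference)
        return self.blobs[digest]


registry = ToyRegistry()
first = registry.push("ls-data:latest", b"layer contents")
second = registry.push("ls-data:v1", b"layer contents")

print(first == second)       # True
print(len(registry.blobs))   # 1
print(registry.pull("ls-data:v1") == registry.pull(first))  # True
```

Pulling by digest always returns the same bytes, while a tag such as `latest` can later be pointed at something else, which is why pinning an exact identifier matters for reproducibility.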

Installing

Here's what you need:

  • Linux kernel with control groups and namespaces
  • Support for a layered filesystem (like AUFS)
  • Docker Daemon / Server (written in Go)

Linux

Windows and Mac

Docker Machine

  • easy install of

    • Git Bash
    • VirtualBox
    • Lightweight Linux distribution
    • Docker
  • Mac OSX users can use the Docker client from the Mac bash shell

  • Comes with busybox shell -> Write your Docker build.sh and run.sh in Bourne shell

Docker Concepts

Image

A read-only file system layer


In [ ]:
!docker images

Container

A modifiable image with processes running in memory, or an exited container with a modified filesystem


In [ ]:
!docker ps

In [ ]:
!docker run -d busybox sh -c 'sleep 3'

In [ ]:
!docker ps

In [ ]:
!docker ps -a

Volume

A directory within one or more containers that bypasses the Union File System

  • Data volumes are initialized when a container is created
  • Volumes can be shared and reused between containers
  • Changes to a data volume are made directly
  • Changes to a data volume will not be included when you update an image
  • Volumes persist until no containers use them
  • Host directories can also be mounted as data volumes

Why use a data volume?

  • Store and share data
  • Expose data or code from the host to the Docker computational environment
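Mounting a host directory as a data volume is done with the `-v host_path:container_path` option of `docker run`. The sketch below only assembles the command line so it can be inspected before running; the helper name `docker_run_with_volume` and the `Data` directory are assumptions made for illustration.

```python
import os
import shlex


def docker_run_with_volume(image, host_dir, container_dir, command, read_only=False):
    """Build a `docker run` argument list that mounts a host directory."""
    host_dir = os.path.abspath(host_dir)  # bind mounts expect an absolute host path
    mount = f"{host_dir}:{container_dir}"
    if read_only:
        mount += ":ro"  # the container can read but not modify the host data
    return ["docker", "run", "--rm", "-v", mount, image] + list(command)


argv = docker_run_with_volume("busybox", "Data", "/Data", ["ls", "/Data"], read_only=True)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where Docker is installed. Mounting input data read-only is a cheap way to guard raw data against accidental changes by the analysis.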

Dockerfile

A sequence of instructions to generate a Docker image


In [ ]:
!mkdir -p docker-ls-data
!cp $PWD/Data/*.png docker-ls-data/

In [ ]:
%%writefile docker-ls-data/Dockerfile

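# Start from the small busybox base image
FROM busybox
MAINTAINER Matt McCormick <matt.mccormick@kitware.com>
# Copy the PNG files from the build context into the image
RUN mkdir -p /Data
ADD *.png /Data/
# Declare /Data as a volume so it bypasses the layered filesystem
VOLUME /Data
# Default command: list the contents of /Data
CMD ["/bin/sh", "-c", "ls /Data"]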

In [ ]:
!docker build -t ls-data ./docker-ls-data

In [ ]:
!docker run --rm ls-data
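Because a Dockerfile is just an ordered list of instructions, it is easy to inspect programmatically. The following is a simplified sketch: `parse_dockerfile` is a hypothetical helper that skips blank lines and comments and does not handle line continuations or other parts of the real Dockerfile syntax.

```python
def parse_dockerfile(text):
    """Split Dockerfile text into (INSTRUCTION, arguments) pairs, in order."""
    instructions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, _, arguments = line.partition(" ")
        instructions.append((keyword.upper(), arguments.strip()))
    return instructions


dockerfile = """
FROM busybox
RUN mkdir -p /Data
ADD *.png /Data/
VOLUME /Data
CMD ["/bin/sh", "-c", "ls /Data"]
"""

steps = parse_dockerfile(dockerfile)
for keyword, arguments in steps:
    print(f"{keyword:8} {arguments}")
```

Each instruction is executed in sequence during `docker build`, so reading the list top to bottom gives the recipe for the image.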

Scientific Research with Docker Notebook

Graphical Applications and Docker

A portable Docker image assumes only that standard CPU, memory, disk, and network resources are available. An image that depends on local USB or video card devices will not run everywhere.

Choosing a base image

Recap and Next Steps

Docker is

  • Sandboxed chroot +
  • Incremental, copy on write filesystem +
  • Distributed VCS for binaries

Concepts

  • Image: A read-only file system layer
  • Container: A writable image with processes running in memory, or an exited container with a modified filesystem
  • Volume: A mounted directory that is not tracked as a filesystem layer
  • Dockerfile: A sequence of instructions to generate a Docker image

Scientific Python and Docker

  • Not for graphical applications, especially OpenGL
  • Reproducible computational environment for IPython notebook
  • Use with Linux-based packaging system of your choice

Learn more!

Docker vs. LXC

  • LXC is a set of tools and API to interact with Linux kernel namespaces, cgroups, etc.
  • LXC used to be the default execution environment for Docker
  • Docker provides LXC functionality, plus:
    • Portable deployment across machines
    • Application-centric
    • Automatic builds
    • Versioning
    • Component re-use
    • Sharing
    • Tool ecosystem

Docker vs. Rocket

  • Rocket is a container system like Docker developed by CoreOS
  • Rocket is not as mature
  • Rocket does not use a daemon/client system