The (Hitchhiker's) Guide to Making Research (and Support) more Reproducible with ...

Jupyter

Open source, interactive data science and scientific computing across over 40 programming languages. The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. Read more about Jupyter.

We are concerned, among other things (it is to be hoped), with increasing the capability of our NBIS staff, collaborators, and the wider public to reproduce/resell/recycle/hack our work. Like many others, I have used Jupyter and similar platforms for several reasons:

  • As a way to hand out research results to my collaborators and the public.
  • As a way to keep track of my work (internal notebooks).
  • To organize course material.
  • For cloud development on the go with Sage, Picloud, etc.
  • To provide a fast, OS-independent web GUI toolkit.
  • As a Star Trek bridge from which to run/edit scripts in multiple languages. If you are like me and have worked with some fifty programming languages over a 27-year timeframe, having to constantly change editors and displays kind of sucks.

Other common uses of Jupyter that I was less concerned with before include:

  • Documenting a program, pipeline or library.
  • Collaborative work.
  • Scalable computing (big data interface).
  • Batch running and scheduling.

We shall take those points one by one with some examples.

JupyterLab

Read here about the future of the Jupyter Notebook: https://blog.jupyter.org/2016/07/14/jupyter-lab-alpha/. Some key remarks:

  • Python and Julia remain central.
  • Separate file browser, console and code editor added to the notebook server.
  • The project moves out of Berkeley (partnership with Continuum Analytics, Tech at Bloomberg).

History

I don't know the early beginnings, but I first saw notebooks that can compute in an astronomy lab in the nineties, where they were using Mathematica notebooks for symbolic mathematics computations. Among mathematicians there is a tribe that grew to hate excessive symbolic writing, and they were the first enthusiasts. Two major developments have happened since: the ability to write code cells, and the web.

The name Jupyter is an agglutination of Julia, Python and R, and is pronounced like the planet. It originates at Berkeley (also home of Apache Spark) from a project, now independent, called IPython. IPython is an excellent interactive Python shell for scientific computing, and now provides one of the many kernels for running Python code in Jupyter.

Projects similar to Jupyter:

  • Mathematica has its own notebook. Somehow I could never afford it. Matlab, Octave, Sage and the like all integrate into Jupyter but often develop their own notebooks.
  • R:
    • R Markdown, often used for report generation with knitr.
    • Shiny is often viewed as something similar. Purely web application frameworks for Python, even more similar to Shiny, would include Flask and Django.
    • Other R possibilities range from the "R only" Jupyter clone RCloud to web publishing with RPubs. Taken together, they make up a decent Jupyter replacement. For R.
  • The latest efforts come, unsurprisingly, from big data, such as spark-notebook and Beaker.

Can you cook eggs with Jupyter? Yes, you need a frying pan and an oven...

Don't expect Jupyter to always work. It depends heavily on your ability to keep the latest update on your computer, wait for bugs to be solved, and circumvent problems. If you stay on the beaten track the user experience is fabulous, although limited.

Some known limitations:

  • Support for Python is the best; anything else is less than optimal.
  • Collaborative tools are new and not super fun.
  • Jupyter provides a non-native approach to parallel computing. Many parallel computing platforms have their own tooling.
  • You cannot open notebooks with a double click. This CAN be done with average OS admin skill, it is just not how things work (yet?) by default. You have to start a web server first. This is not an accessible Office replacement.
  • It won't replace a wiki. A wiki is much better at displaying collaborative content on the web. So our Confluence wiki pages are better left where they sit.
  • For great looks, Javascript and web development skills are essential.

In [12]:
# Something Lena asked
from IPython.display import Image
Image('https://blog.jupyter.org/content/images/2016/07/jlab-screenshot-nb-con-term-2.png')


Out[12]:

And now, the Hello World! Oh noo

To start using Jupyter, please follow this notebook that I use during my Python classes, or help yourself with Dr Google. THE point about Jupyter is that you can do this thing (called a code cell):


In [1]:
m = "Hello World!"
def f(message):
    print(message)

Each code cell communicates with the others because underneath, a notebook runs a single shell/kernel/language (IPython/Python in this particular case, but you can just as well use IRkernel/R). If you created a notebook with a kernel other than IPython you will have* to type the code in the appropriate language. Some languages (kernels), however, are very xenophilic, most notably Python and Julia, so from them you can call many other languages.

Note: *A different kernel can be set for a certain code cell.


In [2]:
m


Out[2]:
'Hello World!'

In [3]:
f(m)


Hello World!

The text cells are commonly typed in Markdown. They can also include HTML and LaTeX.

Static HTML table (rendered output not preserved in this export).

Static latex: $$c = \sqrt{a^2 + b^2}$$

Dynamic latex and html:


In [13]:
%%latex
\begin{align}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{align}


\begin{align} \nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\ \nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\ \nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\ \nabla \cdot \vec{\mathbf{B}} & = 0 \end{align}

In [4]:
from IPython.core.display import HTML
HTML('<iframe src=http://nbis.se/?useformat=mobile width=700 height=350>')


Out[4]:

In [5]:
%%HTML
<div style="background-color:cyan; border:solid black; width:300px; padding:20px;">
Value for 'foo': <input type="text" id="foo" value="bar"><br>
<button onclick="set_value()">Set Value</button>
</div>
<script type="text/Javascript">
    function set_value(){
        var var_value = document.getElementById('foo').value;
        var command = "foo = '" + var_value + "'";
        console.log("Executing Command: " + command);
        var kernel = IPython.notebook.kernel;
        kernel.execute(command);
    }
</script>
Here is how Javascript communicates with Python:



In [6]:
foo


Out[6]:
'bar'

The examples above use %, which is a way to invoke kernel commands known as "magics". Next is a plot example that uses a magic to load the matplotlib pylab interface and specify that we want an inline figure rather than a standalone GUI window for our plot.


In [8]:
%pylab inline
x = linspace(0, 3*pi, 500)
plot(x, sin(x**2))
title('adjustment to day-night cycle in northern sweden');


Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['f']
`%matplotlib` prevents importing * from pylab and numpy

In [9]:
%load_ext rpy2.ipython


/home/sergiun/programs/anaconda3/lib/python3.5/site-packages/rpy2/robjects/robject.py:6: UserWarning: Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/lib/R/library/stats/libs/stats.so':
  /home/sergiun/programs/anaconda3/bin/../lib/libgfortran.so.3: version `GFORTRAN_1.4' not found (required by /usr/lib/liblapack.so.3)

  rpy2.rinterface.initr()
/home/sergiun/programs/anaconda3/lib/python3.5/site-packages/rpy2/robjects/robject.py:6: UserWarning: During startup - 
  rpy2.rinterface.initr()
/home/sergiun/programs/anaconda3/lib/python3.5/site-packages/rpy2/robjects/robject.py:6: UserWarning: Warning message:

  rpy2.rinterface.initr()
/home/sergiun/programs/anaconda3/lib/python3.5/site-packages/rpy2/robjects/robject.py:6: UserWarning: package ‘stats’ in options("defaultPackages") was not found 

  rpy2.rinterface.initr()

In [10]:
%%R
plot_r <- function(x) {
    p <- plot(x);
    print(p);
}

In [11]:
print("Yellow R this is Python can you please plot this array for me?")
import numpy as np
x = np.random.rand(10)
print(x)
%Rpush x
%R plot_r(x)


Yellow R this is Python can you please plot this array for me?
[ 0.30469154  0.92585863  0.40339478  0.32991618  0.97750167  0.97810745
  0.06415171  0.28461408  0.00603493  0.66413359]
NULL

A way to hand out research results to collaborators and the public

A notebook can be displayed on the web provided you use a notebook-aware server. The reference rendering service is called nbviewer and can be configured to display public notebooks on a web server. It is widely used on GitHub, and a large part of Jupyter's popularity probably came from GitHub-hosted notebooks being publicly viewable.

FAQ: What is this Notebook Viewer? The IPython Notebook Viewer is a free web service that lets you share static HTML versions of hosted notebook files. If a notebook is publicly available, you should be able to view it by giving its URL to the Viewer. You can also directly browse collections of notebooks in public GitHub repositories, for example the IPython examples.
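As an illustration, here is a small sketch that rewrites a GitHub notebook URL into the corresponding nbviewer URL. The /github/<user>/<repo>/blob/<branch>/<path> addressing scheme is an assumption about how nbviewer maps GitHub content, and the helper function name is my own:

```python
def nbviewer_url(github_url):
    """Rewrite a GitHub notebook URL into an nbviewer rendering URL.

    Assumes nbviewer's /github/<user>/<repo>/blob/<branch>/<path> scheme.
    """
    prefix = "https://github.com/"
    if not github_url.startswith(prefix):
        raise ValueError("expected a github.com URL")
    return "https://nbviewer.jupyter.org/github/" + github_url[len(prefix):]

print(nbviewer_url(
    "https://github.com/grokkaine/biopycourse/blob/master/Syllabus.ipynb"))
```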

Here is a lengthy blog post about deploying the notebook on other clouds:

https://blog.ouseful.info/2014/12/12/seven-ways-of-running-ipython-notebooks/

Most often, a collaborator who is not computer savvy will need PDF or HTML conversion, which can be done in batch mode or interactively from the notebook's File menu.

For text processing, batch conversion into Markdown is also possible. Apart from this, notebooks can also be saved into the native kernel language, such as Python, Julia or R (untested). The native format of a notebook file, .ipynb, is JSON.
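Because .ipynb is plain JSON, any tool can read or even generate notebooks with nothing but the standard library. A minimal sketch (the skeleton below is illustrative; real notebooks written by Jupyter carry more metadata):

```python
import json

# A minimal notebook skeleton: top-level format version, metadata,
# and a list of cells (markdown or code).
minimal_nb = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Hello\n"]},
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "outputs": [], "source": ["print('Hello World!')\n"]},
    ],
}

# Round-trip through JSON, then count cells by type.
text = json.dumps(minimal_nb, indent=1)
reloaded = json.loads(text)
code_cells = [c for c in reloaded["cells"] if c["cell_type"] == "code"]
print(len(reloaded["cells"]), len(code_cells))  # prints: 2 1
```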

TODO: will link the future Andersson handouts here.

Keeping track of my work (internal notebooks)

This is how I use Jupyter most of the time. Taking log notes is not fun, so how about taking log notes while you document your ideas and program? It is possible to add a time log, although I personally don't like that and prefer to add a date when I feel like it.

It is easier to keep them private, because you want the ability to make annotations that would otherwise be difficult to explain. (For example, making a note about what a direct repeat is. You might have heard about it before, but now you just wanted to put it there and clarify it. A collaborator or a client, however, might think you are unqualified.)

While I do have a few projects where an integrated code editor is essential, I do a lot of the work in these notebook logs, assembling the code only when needed. They are especially good for exploratory studies. At the end I assemble a few presentation notebooks from the logs. Is this a good idea for reproducible research? In my opinion, publicly documenting every little detail is detrimental; I would not try to reproduce a piece of research that is detailed in too many files. But having the record, even if private, allows someone to get to the details when needed, and that too is reproducible research. I am guessing there is a trade-off somewhere between actual work and administration. Are all of Einstein's scribbled bits of mind-mapping paper being kept? Maybe they are, maybe they just aren't public. He had a weird take on what desk order means. What do you think?

Advice:

  • Do not keep long notebooks! Since each notebook is backed by a shell, it will get filled with data to the point where it gets hard to manage properly.
  • Keep important scripts and programs running independently. You can run anything from a notebook, and you can import your code easily.
  • Consider each notebook a task record rather than a time record. Example: installing and running program x is a task; it can be done on different days and you may refine it from time to time.
  • Code cells are run sequentially, so it is best to have a top-down approach. Although it is possible to insert cells at the top that use the notebook data, this is only useful for quick testing, and those cells should be deleted or moved.

Examples:

  • TODO: add link to Andersson project logs
  • TODO: add time log as an example, and consider other logging features

Organize course/training material

Well, if you have heard about IPython or Jupyter, it is probably because you have seen courses. My most recent is here: https://github.com/grokkaine/biopycourse/blob/master/Syllabus.ipynb

Here are some links:

My experience:

It is best if you use Docker and a cloud to provide a controlled experience. It is also good to use a single Python distribution. As with any interactive course, it works best if the class is not too big. I found that I can't take more than ten students without a serious drop in the quality of the teaching.

Cloud development on the go with Sage, AWS, Picloud, etc

Most clouds today can run Jupyter, either natively or with some configuration. Some Jupyter kernels are especially great in clouds, having parallel computing capabilities. For some, this has collaboration benefits too.

My use of this feature was mostly to access clouds from my mobile. Having a small kid, it is hard to open a laptop to do any work in the evenings or during weekends, so my Nexus 4 became a work tool.

Provide a fast, OS-independent web GUI

TODO: a widget example to plot spacer graphs.

Multiple language interface

I have shown how multiple languages can be used from the same notebook. In most cases, though, you would want each notebook coded in its kernel's native language.

Documenting a program, pipeline or library

Because of GitHub integration, this became a popular feature. Notebooks can be converted to reStructuredText markup (.rst), which feeds into a popular documentation generator called Sphinx; the popular Read the Docs service builds Sphinx documentation and can regenerate it after every commit.

I liked, for example, how cobrapy structured its documentation. Send me a mail if you have a special mention. Here is another, a metagenomics framework: http://pythonhosted.org/mgkit/index.html

Collaborative work

coLaboratory is hosted on the Jupyter site and currently offers collaborative support through Google Drive. There are other minor developments; if you have had some experience with one or another, please let me know.

You can always co-edit a git repo hosting notebooks, but hardcore collaborative edits (as in concurrent editing of the same code cell) will break the internal JSON structure of your document. Unless you have a git genius up your sleeve, better find alternatives. Content management is not in the plans either.
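To see why concurrent edits hurt, consider what a failed git merge does to the notebook file. A small sketch (the JSON fragment is illustrative; the conflict markers are what git writes into files it cannot merge):

```python
import json

# A fragment of an .ipynb file after a failed merge: git inserted its
# conflict markers directly into the JSON, making the file unparseable.
conflicted = """{
 "cells": [
  {"cell_type": "code",
<<<<<<< HEAD
   "source": ["x = 1\\n"]
=======
   "source": ["x = 2\\n"]
>>>>>>> feature-branch
  }
 ]
}"""

try:
    json.loads(conflicted)
    broken = False
except json.JSONDecodeError:
    broken = True

print(broken)  # prints: True -- the notebook can no longer be opened
```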

Worth mentioning:

  • JupyterHub. A Jupyter server spawning user authenticated instances.
  • Dashboards can display Jupyter notebooks as dynamic dashboards outside of the Jupyter Notebook server (NodeJS-based).

Scalable computing

Because the emergent big data architectures have their own notebook implementations, sometimes natively (examples were given above), this will be a hard test for Jupyter. The most recent success story for Jupyter was the integration of Scala and Spark among its kernels. Ultimately, Jupyter is powered by Berkeley, "love" and open source.

The main website has a small example of Jupyter-Spark integration, and other examples that you can test-run at will. https://try.jupyter.org/

Batch scheduling

Let's say you made a detailed notebook producing a number of plots, and you need to rerun it because you got new data. Notebooks can be re-run at any moment from a terminal or from any job scheduler. An example is here: https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/
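As a sketch, a crontab entry could re-execute a notebook nightly with nbconvert, saving the executed copy (plots included) back in place. The notebook path is hypothetical, and this assumes jupyter is on the PATH of the cron environment:

```shell
# Hypothetical crontab entry: every night at 02:00, re-run the notebook
# headlessly and overwrite it with the freshly executed version.
0 2 * * * jupyter nbconvert --to notebook --execute --inplace /home/me/analysis.ipynb
```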

