DevOps

Reproducible research

A large majority of published results today are not reproducible. Two of my colleagues are preparing a longer course on the subject.

https://bitbucket.org/scilifelab-lts/reproducible_research_example

  • Mention types of reproducibility
  • Get the course git link!

What is DevOps?

From the more restricted data science point of view, DevOps is less concerned with automating the production and delivery of software. For a researcher, the main interest is research reproducibility, but also data management and backup.

  • Do you need this?
  • Data scientist, data engineer, data analyst.
  • "Orchestration" and "data lake".

How to source your code

  • What is the difference between source code and a script?
  • Discussion on how to source Python code (see the sketch after this list).
  • Importance of design patterns.
  • Python style guide.
  • Software versioning milestones.
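
To make the script/source distinction concrete, here is a minimal sketch (the file name and the function are only illustrative): a script simply runs top to bottom, while sourced code is organised into importable, documented functions, with a __main__ guard so that the same file can still be run as a script.

# analysis.py -- sourced code: importable, documented, reusable
def mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

if __name__ == "__main__":
    # only executed when the file is run as a script (python analysis.py),
    # not when the module is imported
    print(mean([1, 2, 3, 4]))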

In [ ]:

Using source editors. What matters?

Editors for Python range from any simple raw text editor to the most complex IDE (integrated development environment).

In the first category I recommend Notepad and Notepad++ for Windows, Emacs for macOS and Linux, and nano, vim, and geany for Linux.

Among IDEs, Spyder is a simpler editor with an interface similar to MATLAB's and native integration of the IPython interpreter, and we will use it for the purposes of this class. A much more complex favorite of mine is PyCharm from JetBrains, which has a free community edition. The one I use most frequently is Atom, built by the GitHub community.

What matters:

  • Using the editor most appropriate to the complexity of the task.
  • Full-featured editors make it easier to write good code!
  • Syntax and style linting (see the example after this list).
  • Code refactoring.
  • Git/SVN integration.
  • Remote development.
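
As a concrete example of style linting, a source file can be checked against PEP 8 directly from the notebook or the shell. The choice of flake8 is my own and it has to be installed first; the file name refers to the sketch in the previous section and is only illustrative.

In [ ]:
# run a style/syntax linter on a source file (assumes flake8 is installed)
!pip install flake8
!flake8 analysis.py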

Task:

Create a 'src' folder inside your working directory. Use a raw text editor to write a hello world program inside it and run it from the command line. Now open the same file inside your favorite editor and run it in the interpreter embedded into it.

Now write a function called hello_world() and import it here as a module.
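
A minimal sketch of what the sourced file could look like (the file name matches the import attempted below; the exact body is an assumption):

# src/myfancymodule.py -- sketch of the module the task asks for
def hello_world():
    print("Hello, world!")

if __name__ == "__main__":
    # still runnable directly from the command line: python src/myfancymodule.py
    hello_world()

For the import below to succeed, the directory containing the file has to be on sys.path, which is what the next cells demonstrate.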


In [1]:
import sys
sys.path


Out[1]:
['',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python36.zip',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/lib-dynload',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages/IPython/extensions',
 '/home/sergiu/.ipython']

In [2]:
sys.path.append("/custom/path")

In [3]:
sys.path


Out[3]:
['',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python36.zip',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/lib-dynload',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages/IPython/extensions',
 '/home/sergiu/.ipython',
 '/custom/path']

In [4]:
import myfancymodule
myfancymodule.hello_world()


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-4-a46c63b11ef7> in <module>()
----> 1 import myfancymodule
      2 myfancymodule.hello_world()

ModuleNotFoundError: No module named 'myfancymodule'

Discussion:

  • Do you see a benefit to this approach?
  • How much of the code should be sourced and how much should be kept in a notebook?

Distributed version control using git

  • Git - an SVN for the distributed computing age.
  • Collaborative editing.
  • History: Subversion, Mercurial, Git, and Bitbucket.

Exercise

Let us now add the sourced code to our own git repositories!

git init
git status

stage: Now make a change to your source code and run git status again. To tell Git to start tracking changes made to your file, we first need to add it to the staging area by using git add.

git add your_file
# git add .
# git log
# git reset your_file
git status

commit, checkout: Notice how Git says changes to be committed? The files listed here are in the Staging Area, and they are not in the repository yet. We could add or remove files from the stage before we store them in the repository. To store our staged changes we run the commit command with a message describing what we've changed. Files can be changed back to how they were at the last commit by using the command:

git commit -m "I modified the hello function"
# git checkout -- your_file

push, origin, master: To push a local repo to the GitHub server we'll need to add a remote repository. This command takes a remote name and a repository URL. The push command tells Git where to put our commits when we're ready, and now we're ready. So let's push our local changes to our origin repo (on GitHub).

The name of our remote is "origin" and the default local branch name is "master". The -u tells Git to remember the parameters, so that next time we can simply run git push and Git will know what to do. Go ahead and push it!

git remote add origin https://github.com/yourusername/yourreponame.git
git push -u origin master
# git push is sufficient afterwards

pull, diff: Let's pretend some time has passed. We've invited other people to our GitHub project who have pulled our changes, made their own commits, and pushed them. We can check for changes on our GitHub repository and pull down any new changes by running pull. Let's take a look at what is different from our last commit by using the git diff command. In this case we want the diff of our most recent commit, which we can refer to using the HEAD pointer. diff can also be used on newly staged files.

git pull origin master
git diff HEAD
# git diff --staged

branching and merging: When developers are working on a feature or bug fix, they'll often create a copy (a branch) of their code that they can make separate commits to. When they're done, they merge this branch back into their main master branch. Why is this better? Because it avoids accidental modifications to unrelated files, and each branch keeps the changes belonging to one feature together and easy to track.

Now if you type git branch you'll see two local branches: a main branch named master and your new branch. You can switch branches using the git checkout command. After making your changes on the branch you just need to switch back to the master branch so you can copy (or merge) your changes from the branch back into the master branch. Go ahead and checkout the master branch:

git branch nameit
git checkout nameit
# git rm some_file
# etc..
# git commit -m "removed some_file"
git checkout master
git merge nameit
# delete the branch
git branch -d nameit

Development vs production

  • Why are they separated?
  • How does it impact reproducibility if development and production are not separated?
  • Raise the issue of different projects using the same directory. What should be done?
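
One common way to keep separate projects, and development versus production runs, from interfering with each other is to give each its own isolated Python environment. Below is a minimal sketch using the standard venv module; the environment name is an assumption, and conda environments serve the same purpose.

In [ ]:
import sys
import venv

# create an isolated environment for this project (run once per project)
venv.create("my_project_env", with_pip=True)

# sys.prefix shows which environment the current interpreter is running from
print(sys.prefix)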

When do we need containers? Using Docker.

  • Separating the environment from the operating system
  • Virtualization, VirtualBox, VMWare
  • Containers, Docker, Singularity, Vagrant
  • DockerHub
  • Back up the code, the data and the environment
$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
78445dd45222: Pull complete 
Digest: sha256:c5515758d4c5e1e838e9cd307f6c6a0d620b5e07e6f927b07d05f6d12a1ac8d7
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://cloud.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/engine/userguide/

The Dockerfile

# Use an official Python runtime as a base image
FROM python:2.7-slim

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
ADD . /app

# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

# Define environment variable
ENV NAME World

# Run app.py when the container launches
CMD ["python", "app.py"]

Build the image:

$ ls
Dockerfile      app.py          requirements.txt
$ docker build -t friendlyhello .
$ docker images

REPOSITORY            TAG                 IMAGE ID
friendlyhello         latest              326387cea398

Run the image:

$ docker run -d -p 4000:80 friendlyhello
$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED
1fa4ab2cf395        friendlyhello       "python app.py"     28 seconds ago
$ docker stop 1fa4ab2cf395

Publish on Docker Hub:

# signup at cloud.docker.com
docker tag friendlyhello username/repository:tag
docker push username/repository:tag

Now go to any machine in the Milky Way that has Docker installed and Internet access, and run:

docker run -p 4000:80 username/repository:tag

Speed: Profiling, IPython, JIT

The Python standard library contains the cProfile module for measuring how much time every Python function takes when the code runs. The pstats module allows reading the profiling results; an example follows the cProfile run below. Third-party profiling libraries include in particular line_profiler, for profiling code line by line, and memory_profiler, for profiling memory usage. All these tools are very powerful and extremely useful when optimizing code, but they may not be very easy to use at first.


In [7]:
%%writefile script.py
import numpy as np
import numpy.random as rdn

# uncomment for line_profiler
# @profile
def test():
    a = rdn.randn(100000)
    b = np.repeat(a, 100)

test()


Writing script.py

In [8]:
!python -m cProfile -o prof script.py
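
The resulting prof file can be inspected with the pstats module mentioned above; a minimal sketch that prints the ten entries with the largest cumulative time:

In [ ]:
import pstats

# load the output written by cProfile and show the 10 most expensive calls
p = pstats.Stats("prof")
p.sort_stats("cumulative").print_stats(10)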

In [ ]:
$ pip install ipython
$ ipython --version
0.13.1
$ pip install line-profiler
$ pip install psutil
$ pip install memory_profiler

In [9]:
%timeit?

In [ ]:
%run -t slow_functions.py

In [11]:
%time {1 for i in range(10*1000000)}
%timeit -n 1000 10*1000000


CPU times: user 560 ms, sys: 0 ns, total: 560 ms
Wall time: 563 ms
18.6 ns ± 0.281 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [ ]:
def foo(n):
    phrase = 'repeat me'
    pmul = phrase * n                               # build by string multiplication
    pjoi = ''.join([phrase for x in range(n)])      # build by joining a list
    pinc = ''
    for x in range(n):                              # build by repeated concatenation
        pinc += phrase
    del pmul, pjoi, pinc

In [14]:
# %load_ext line_profiler  # uncomment (requires line_profiler) before using %lprun
%lprun -f foo foo(100000)


ERROR:root:Line magic function `%lprun` not found.
  • %time & %timeit: See how long a script takes to run (one time, or averaged over a bunch of runs).
  • %prun: See how long it took each function in a script to run.
  • %lprun: See how long it took each line in a function to run.
  • %mprun & %memit: See how much memory a script uses (line-by-line, or averaged over a bunch of runs).
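
The memory magics follow the same pattern as %lprun and also require loading an extension first. A minimal sketch, assuming memory_profiler has been installed as above; note that %mprun, like %lprun, only profiles functions defined in a file, not in the notebook.

In [ ]:
%load_ext memory_profiler

# peak memory used by a single call (works for notebook-defined functions)
%memit foo(100000)

# line-by-line memory profile of a function defined in a file, e.g.:
# from script import test
# %mprun -f test test()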

In [ ]: