A large share of published results today are difficult or impossible to reproduce. Two of my colleagues are preparing a longer course on the subject.
From a more restricted data science point of view, DevOps is less concerned with automating the production of software. For a researcher, the main interest is research reproducibility, along with data management and backup.
Editors for Python range from any simple raw text editor to the most complex IDE (integrated development environment).
In the first category I recommend Notepad and Notepad++ for Windows, Emacs for macOS and Linux, and nano, vim, and geany for Linux.
Among IDEs, Spyder is a simpler editor with an interface similar to Matlab and native integration of the IPython interpreter; we will use it for this class. A much more complex favorite of mine is PyCharm from JetBrains, which has a community edition. The one I use most frequently is Atom, built by GitHub.
What matters:
Create a 'src' folder inside your working directory. Use a raw text editor to write a hello world program inside it and run it on the command line. Now open the same file in your favorite editor and run it in the interpreter embedded in that editor.
Now write a function called hello_world() and load it here using a module call.
import sys
import myfancymodule
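As a minimal sketch of this exercise, the snippet below writes a module file, makes its folder importable, and calls hello_world() from it. The module name myfancymodule comes from the import above; the temporary folder standing in for 'src' and the greeting text are assumptions made for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the 'src' folder: a temporary directory (assumption for this sketch).
src_dir = Path(tempfile.mkdtemp())

# Write the module file that defines hello_world().
(src_dir / "myfancymodule.py").write_text(
    "def hello_world():\n"
    "    return 'Hello, world!'\n"
)

# Make the folder visible to the import system, then import the module.
sys.path.insert(0, str(src_dir))
importlib.invalidate_caches()
import myfancymodule

greeting = myfancymodule.hello_world()
print(greeting)
```

In a real project you would create src/myfancymodule.py by hand and either start Python from inside src or add that folder to sys.path in the same way.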
Let us now add the source code to our own Git repositories!
git init
git status
stage: Now make a change to your source code and run git status again. To tell Git to start tracking changes made to your file, we first need to add it to the staging area by using git add.
git add your_file
# git add .
# git log
# git reset your_file
git status
commit, checkout: Notice how Git says "Changes to be committed"? The files listed there are in the staging area, and they are not in the repository yet. We can add or remove files from the stage before we store them in the repository. To store our staged changes we run the commit command with a message describing what we've changed. To change a file back to how it was at the last commit, use git checkout -- your_file.
git commit -m "I modified the hello function"
# git checkout -- your_file
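The three places a change can live (working directory, staging area, repository) can be pictured with a toy model. This is only a sketch and not how Git is implemented: the class ToyRepo and its method names are hypothetical, chosen to mirror git add, git reset, git commit and git checkout -- your_file.

```python
class ToyRepo:
    """Toy model of Git's three areas; the names are hypothetical, not Git's API."""

    def __init__(self):
        self.working = {}   # file name -> content on disk
        self.staged = {}    # files marked for the next commit
        self.commits = []   # list of (message, snapshot) pairs

    def add(self, name):
        # Like `git add name`: copy the current content into the staging area.
        self.staged[name] = self.working[name]

    def reset(self, name):
        # Like `git reset name`: unstage, leaving the working copy untouched.
        self.staged.pop(name, None)

    def commit(self, message):
        # Like `git commit -m message`: store staged files on top of the last snapshot.
        snapshot = dict(self.commits[-1][1]) if self.commits else {}
        snapshot.update(self.staged)
        self.commits.append((message, snapshot))
        self.staged = {}

    def checkout(self, name):
        # Like `git checkout -- name`: restore the file from the last commit.
        # (Real Git restores from the index, which matches the last commit
        # when nothing is staged.)
        self.working[name] = self.commits[-1][1][name]


repo = ToyRepo()
repo.working["hello.py"] = "print('hello')"
repo.add("hello.py")
repo.commit("I modified the hello function")
repo.working["hello.py"] = "print('oops')"   # an unwanted edit
repo.checkout("hello.py")                    # back to the committed version
print(repo.working["hello.py"])
```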
push, origin, master: To push a local repo to the GitHub server we first need to add a remote repository with git remote add, which takes a remote name and a repository URL. The push command then tells Git where to put our commits when we're ready, and now we're ready. So let's push our local changes to our origin repo (on GitHub).
The name of our remote is "origin" and the default local branch name is "master". The -u tells Git to remember the parameters, so that next time we can simply run git push and Git will know what to do. Go ahead and push it!
git remote add origin https://github.com/urreponame/urreponame.git
git push -u origin master
# git push is sufficient afterwards
pull: Let's pretend some time has passed. We've invited other people to our GitHub project, and they have pulled our changes, made their own commits, and pushed them. We can check for changes on our GitHub repository and pull down any new changes by running pull. Let's take a look at what is different from our last commit by using the git diff command. In this case we want the diff against our most recent commit, which we can refer to using the HEAD pointer. diff can also show changes that have been staged but not yet committed.
git pull origin master
git diff HEAD
# git diff --staged
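By default git diff prints changes in unified diff format: lines starting with - were removed and lines starting with + were added. Python's standard difflib module produces the same format, so it can serve as a small sketch of what to expect. The file name hello.py and both versions of its content are made up for illustration.

```python
import difflib

# Two versions of the same (hypothetical) file, as lists of lines.
old = ["def hello_world():\n", "    print('hello')\n"]
new = ["def hello_world():\n", "    print('hello, world')\n"]

# unified_diff yields header lines, a hunk marker, then the changed lines.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/hello.py", tofile="b/hello.py")
)
print("".join(diff_lines))
```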
branching and merging: When developers are working on a feature or bug they'll often create a copy (a branch) of their code that they can make separate commits to. When they're done they merge this branch back into the main master branch. Why is this better? Because it avoids accidental modifications to other files, and each branch keeps the changes for one task together, which makes them easier to track.
After creating a branch, git branch lists two local branches: the main branch named master and your new branch. You can switch branches using the git checkout command followed by the branch name.
git branch nameit
git checkout nameit
# git rm some_file
# etc..
# git commit -m "removed some_file"
git checkout master
git merge nameit
# delete the branch
git branch -d nameit
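The same branch, commit, merge, delete cycle can be scripted. The sketch below drives the git command line from Python in a throwaway repository; it assumes git is installed and on the PATH (it returns None otherwise). The helper names git and branch_and_merge_demo, the file names and the demo identity are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside repo and return its standard output."""
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
           "-c", "commit.gpgsign=false", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout


def branch_and_merge_demo():
    if shutil.which("git") is None:
        return None  # git is not available on this machine
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")
        (repo / "hello.py").write_text("print('hello')\n")
        git(repo, "add", "hello.py")
        git(repo, "commit", "-m", "initial commit")
        # The default branch may be called master or main depending on the setup.
        main_branch = git(repo, "rev-parse", "--abbrev-ref", "HEAD").strip()

        git(repo, "branch", "nameit")
        git(repo, "checkout", "nameit")
        (repo / "feature.py").write_text("print('new feature')\n")
        git(repo, "add", "feature.py")
        git(repo, "commit", "-m", "added feature.py")

        git(repo, "checkout", main_branch)
        before_merge = (repo / "feature.py").exists()  # not on the main branch yet
        git(repo, "merge", "nameit")
        after_merge = (repo / "feature.py").exists()   # now merged in
        git(repo, "branch", "-d", "nameit")
        return before_merge, after_merge, git(repo, "branch")


result = branch_and_merge_demo()
print(result)
```

The file created on the branch is absent from the main branch until the merge, which is the isolation that branching provides.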
$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
78445dd45222: Pull complete
Digest: sha256:c5515758d4c5e1e838e9cd307f6c6a0d620b5e07e6f927b07d05f6d12a1ac8d7
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
For more examples and ideas, visit:
# Use an official Python runtime as a base image
FROM python:2.7-slim
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
ADD . /app
# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Define environment variable (NAME=World is a placeholder value)
ENV NAME=World
# Run app.py when the container launches
CMD ["python", "app.py"]
Build the image:
$ ls
Dockerfile app.py requirements.txt
$ docker build -t friendlyhello .
$ docker images
friendlyhello latest 326387cea398
Run the image:
docker run -d -p 4000:80 friendlyhello
$ docker ps
1fa4ab2cf395 friendlyhello "python app.py" 28 seconds ago
$ docker stop 1fa4ab2cf395
Publish on Docker Hub:
# signup at cloud.docker.com
docker tag friendlyhello username/repository:tag
docker push username/repository:tag
Now go to any machine in the Milky Way that has Docker installed and Internet access, and run:
docker run -p 4000:80 username/repository:tag
The Python standard library contains the cProfile module, which measures the time spent in every Python function while the code runs. The pstats module reads the profiling results. Third-party profiling libraries include line_profiler, for profiling code line by line, and memory_profiler, for profiling memory usage. All these tools are powerful and extremely useful when optimizing code, but they may not be easy to use at first.
%%writefile script.py
import numpy as np
import numpy.random as rdn
# uncomment for line_profiler
# @profile
def test():
    a = rdn.randn(100000)
    b = np.repeat(a, 100)

# Call the function so the profiler has something to measure
test()
!python -m cProfile -o prof script.py
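Once a profile has been saved to a file, pstats can load and sort it. The following is a minimal sketch that uses only the standard library: it profiles a small workload in-process, dumps the results to a file as the -o option would, and prints the most expensive entries. The functions slow_sum and workload are hypothetical stand-ins for your own code.

```python
import cProfile
import io
import os
import pstats
import tempfile


def slow_sum(n):
    # Deliberately naive loop so that it shows up in the profile.
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    results = []
    for _ in range(50):
        results.append(slow_sum(10000))
    return results


# Collect profiling data while the workload runs.
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Save the results to a file, as `python -m cProfile -o prof script.py` would.
prof_path = os.path.join(tempfile.mkdtemp(), "prof")
profiler.dump_stats(prof_path)

# Read the file back with pstats and list the five most expensive entries.
report = io.StringIO()
stats = pstats.Stats(prof_path, stream=report)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Sorting by "cumulative" ranks functions by the time spent in them including their callees; "tottime" ranks by time spent in the function body alone.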
$ pip install ipython
$ ipython --version
$ pip install line-profiler
$ pip install psutil
$ pip install memory_profiler
%run -t slow_functions.py
%time {1 for i in range(10*1000000)}
%timeit -n 1000 10*1000000
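Outside IPython, the standard timeit module gives comparable measurements. This sketch is a rough equivalent of the two magics above, not a reproduction of IPython's exact reporting; the smaller range in the second statement is an assumption made to keep the run short.

```python
import timeit

# Rough equivalent of `%timeit -n 1000 10*1000000`:
# five measurements of 1000 executions each.
per_run = timeit.repeat("10*1000000", number=1000, repeat=5)
best = min(per_run) / 1000  # seconds per single execution, best of five

# Rough equivalent of `%time ...`: execute the statement exactly once.
once = timeit.timeit("{1 for i in range(10*1000)}", number=1)

print("best per loop: %.3e s, single run: %.3e s" % (best, once))
```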
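def foo(n):
    # Build the same long string three ways so that %lprun can compare them line by line
    phrase = 'repeat me'
    pmul = phrase * n
    pjoi = ''.join([phrase for x in range(n)])  # range works on both Python 2 and 3
    pinc = ''
    for x in range(n):
        pinc += phrase
    del pmul, pjoi, pinc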
# %load_ext line_profiler  # run this once per session before using %lprun
%lprun -f foo foo(100000)
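The three ways of building the string in foo can also be compared for memory. The sketch below uses tracemalloc from the standard library as a stand-in for memory_profiler, which is a third-party package; it is a different tool and reports Python-level allocations only. The function names are hypothetical.

```python
import tracemalloc

PHRASE = "repeat me"


def by_multiplication(n):
    return PHRASE * n


def by_join(n):
    return "".join([PHRASE for _ in range(n)])


def by_increment(n):
    out = ""
    for _ in range(n):
        out += PHRASE
    return out


def peak_memory(func, n):
    """Return (result length, peak bytes allocated) for one call of func(n)."""
    tracemalloc.start()
    result = func(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(result), peak


results = {}
for func in (by_multiplication, by_join, by_increment):
    results[func.__name__] = peak_memory(func, 100000)

for name, (length, peak) in results.items():
    print("%s: %d characters, peak %d bytes" % (name, length, peak))
```

All three strategies produce the same string; the peak figures show how much temporary memory each one needs along the way.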
