DevOps

Reproducible research

A large majority of published results today are not reproducible. Two of my colleagues are preparing a longer course on the subject.

https://bitbucket.org/scilifelab-lts/reproducible_research_example

Mention types of reproducibility
Get the course git link!

What is DevOps?

From a more restricted data science point of view, DevOps is less concerned with automating the production of software. For a researcher, the main interest is research reproducibility, along with data management and backup.

Do you need this?
Data scientist, data engineer, data analyst.
"Orchestration" and "data lake".

How to source your code

What is the difference between source code and a script?
Discussion on how to source Python code.
Importance of design patterns.
Python style guide.
Software versioning milestones.
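
One way to picture the difference between a script and sourced code: a script simply runs from top to bottom, while sourced code defines functions that other programs (or a notebook) can import and reuse. A minimal sketch, assuming a hypothetical file called greetings.py; the file name and the main() helper are illustrative and not part of the course material:

```python
"""greetings.py: a hypothetical module that can also be run as a script."""


def hello_world(name="world"):
    """Return the greeting instead of printing it, so callers can reuse it."""
    return f"Hello, {name}!"


def main():
    """Script entry point: the only place that prints."""
    print(hello_world())


# The guard is true when the file is run directly (python greetings.py)
# and false when the file is imported (import greetings).
if __name__ == "__main__":
    main()
```

Keeping the printing inside main() means that importing the file has no side effects, which is usually what you want from sourced code.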

Using source editors. What matters?

Editors for Python range from any simple raw text editor to the most complex IDE (integrated development environment).

In the first category I recommend Notepad and Notepad++ for Windows, Emacs for macOS and Linux, and nano, vim and Geany for Linux.

Among IDEs, Spyder is a simpler editor with an interface similar to MATLAB and native integration of the IPython interpreter, and we will use it for the purpose of this class. A much more complex favorite of mine is PyCharm from JetBrains, which has a Community Edition. The one I use most frequently is Atom, built by GitHub.

What matters:

Using the editor most appropriate to the complexity of the task.
Full-featured editors make it easier to write good code!
Syntax and style linting.
Code refactoring.
Git/svn integration.
Remote development.

Task:

Create a 'src' folder inside your working directory. Use a raw text editor to write a hello world program inside it and run it on the command line. Now open the same file in your favorite editor and run it in the interpreter embedded in it.

Now write a function called hello_world() and load it here by importing it as a module.



In [1]:

import sys
sys.path

Out[1]:

['',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python36.zip',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/lib-dynload',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages/IPython/extensions',
 '/home/sergiu/.ipython']



In [2]:

sys.path.append("/custom/path")

In [3]:

sys.path

Out[3]:

['',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python36.zip',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/lib-dynload',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg',
 '/home/sergiu/programs/miniconda3/envs/lts/lib/python3.6/site-packages/IPython/extensions',
 '/home/sergiu/.ipython',
 '/custom/path']



In [4]:

import myfancymodule
myfancymodule.hello_world()

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-4-a46c63b11ef7> in <module>()
----> 1 import myfancymodule
      2 myfancymodule.hello_world()

ModuleNotFoundError: No module named 'myfancymodule'
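
The import fails because "/custom/path" is only a placeholder: Python looks in every directory on sys.path for a file named myfancymodule.py and finds none. A minimal sketch of a sequence that works, assuming the module lives in a 'src' folder. To keep the example self-contained, the folder and the file are created in a temporary directory; in class you would point at your own 'src' folder instead:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Assumption: a throwaway directory stands in for your working directory.
workdir = Path(tempfile.mkdtemp())
src = workdir / "src"
src.mkdir()

# Write the module file that the import statement will look for.
(src / "myfancymodule.py").write_text(
    "def hello_world():\n"
    "    return 'Hello, world!'\n"
)

# Add the folder (not the file itself) to the module search path.
sys.path.append(str(src))
importlib.invalidate_caches()  # make sure the newly created file is noticed

import myfancymodule

greeting = myfancymodule.hello_world()
print(greeting)
```

If you edit the file while the interpreter is still running, importlib.reload(myfancymodule) picks up the changes without restarting the session.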

Discussion:

Do you see a benefit to this approach?
How much of the code should be sourced and how much should be kept on a notebook?

Distributed version control using git

Git: an SVN for the distributed computing age.
Collaborative editing.
History: Subversion, Mercurial, Git and Bitbucket.

Exercise

Let us now add the sourced code to our own git repositories!

git init
git status

stage: Now make a change to your source code and run git status again. To tell Git to start tracking changes made to your file, we first need to add it to the staging area by using git add.

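git add your_file
# git add .             # stage every change in the current directory
# git log               # show the commit history
# git reset your_file   # remove the file from the staging area again
git status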

commit, checkout: Notice how Git says "Changes to be committed"? The files listed there are in the staging area, and they are not in the repository yet. We can add or remove files from the stage before we store them in the repository. To store our staged changes we run the commit command with a message describing what we've changed. A file can also be restored to how it was at the last commit with the checkout command, shown commented out:

git commit -m "I modified the hello function"
# git checkout -- your_file
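
The stage and commit steps can also be scripted, which is handy for checking what each command changes. A sketch that drives the same git commands from Python through subprocess, assuming git is installed and on the PATH. The temporary repository, the file name hello.py, the demo identity and the helper names are illustrative only:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside repo and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=str(repo), check=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout


def stage_and_commit_demo():
    """Walk through init, add and commit in a throwaway repository."""
    if shutil.which("git") is None:  # git is not installed
        return None
    repo = Path(tempfile.mkdtemp())
    git(repo, "init")
    (repo / "hello.py").write_text("print('Hello, world!')\n")

    # --porcelain gives a short, stable status format: "?? " means untracked.
    untracked = git(repo, "status", "--porcelain")
    git(repo, "add", "hello.py")
    # After git add the file is staged: the line now starts with "A".
    staged = git(repo, "status", "--porcelain")

    # Identity is passed per command so the demo leaves your config untouched.
    git(repo, "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
        "-c", "commit.gpgsign=false",
        "commit", "-m", "Add hello world script")
    # After the commit there is nothing left to report.
    clean = git(repo, "status", "--porcelain")
    return untracked, staged, clean


print(stage_and_commit_demo())
```

The three status snapshots correspond to the three places a change can be: the working directory, the staging area and the repository.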

push, origin, master: To push a local repo to the GitHub server we first need to add a remote repository. The git remote add command takes a remote name and a repository URL. The push command then sends our commits to that remote. So let's push our local changes to our origin repo (on GitHub).

The name of our remote is "origin" and the default local branch name is "master" (newer Git setups and GitHub may use "main" instead). The -u option records origin/master as the upstream of the local branch, so that next time we can simply run git push and Git will know what to do. Go ahead and push it!

git remote add origin https://github.com/urreponame/urreponame.git
git push -u origin master

pull: Let's pretend some time has passed. We've invited other people to our GitHub project, and they have pulled our changes, made their own commits, and pushed them. We can check for changes on our GitHub repository and pull down any new changes by running pull. We can then look at how our working files differ from our most recent commit by using the git diff command; the most recent commit is referred to with the HEAD pointer. With the --staged option, diff shows the changes that have been staged but not yet committed.

git pull origin master
git diff HEAD
# git diff --staged


