4 Git and Pickle integration in ReproPhylo

This notebook demonstrates the interaction of ReproPhylo and of pickled ReproPhylo Project files with Git. In section 3 we disabled Git and saved the pickle file manually at the end of each sub section. However, ReproPhylo is designed to update the Project's pickle file automatically after time consuming steps and also to create a version control repository and record versions in real time. All of this will happen if we start a Project using the default setting git=True.

Once we start a Project this way, it can be the only version controlled Project in the current working directory. Any additional Project will have to be started with a different pickle name, and with git=False. Should it not be the case, helpful error messages will guide you through.

Also, once we started a Project, it can only be resumed with the command unpickle_pj. If we try to reconstruct the Project using the command pj = Project(...), another helpful error message will be raised.

4.1 The long version

4.1.1 Start a Project, read data, do alignment, show Git log

Start a Project
As we did in section 3, we start a Project, and provide a pickle file name. We do not, however, use git=False and therefore git is invoked, as the default behaviour.


In [2]:
from reprophylo import *
pj = Project('git_demo_files/loci_edited.csv', pickle='git_demo_files/git_demo')


/home/amir/Dropbox/python_modules/rpgit.py:93: UserWarning: Thanks to Stack-Overflow users Shane Geiger and Billy Jin for the git wrappers code
  warnings.warn('Thanks to Stack-Overflow users Shane Geiger and Billy Jin for the git wrappers code')
/home/amir/Dropbox/python_modules/rpgit.py:109: UserWarning: A git repository was created in /home/amir/Dropbox/ReproPhylo/Tutorial_files/Git.
  warnings.warn('A git repository was created in %s.'%repoDir)
/home/amir/Dropbox/python_modules/reprophylo.py:255: UserWarning: The new repository is called git_demo_files/git_demo.
  warnings.warn('The new repository is called %s.'%open(cwd + '/.git/description', 'r').read().rstrip())
DEBUG:Cloud:Log file (/home/amir/.picloud/cloud.log) opened

We get three warnings, which are only information messages.

  • The first massage includes credit for some code I got online.
  • The second gives us the location in which the repository will be mentained
  • The third gives us the name of the repository

Read data
We can move on to reading data and aligning some loci:


In [4]:
genbank = './git_demo_files/Tetillidae.gb'
pj.read_embl_genbank([genbank])

Do alignment


In [5]:
pj.extract_by_locus()
mafft = AlnConf(pj)
pj.align([mafft])


mafft 217511440955273.78_CDS_proteins_MT-CO1.fasta

So our data was split to bins according the the Locus objects in the Project, and all the loci were aligned with the default settings of Mafft.

Show last Git action (which was to commit the pickle with the alignment)
At this point, let's check what pickle and git did at the background, by asking for git info:


In [6]:
pj.last_git_log()


Sun Aug 30 18:21:15 2015
STDOUT:
[master 09df506] AlnConf named mafftDefault with ID 217511440955273.78 Loci: MT-CO1 Created on: Sun Aug 30 18:21:13 2015 Commands: MT-CO1: mafft 217511440955273.78_CDS_proteins_MT-CO1.fasta
 1 file changed, 0 insertions(+), 0 deletions(-)

STDERR:None
>>>>

The last git action was to commit the pickle file, after the sequence alignment was complete. The git message is the report we get when we print the used method (from pj.used_methods if you recall).

We can show the full log like this:


In [7]:
pj.show_commits()


commit 09df506f5a5a003f1665d5abf52d11fb66755a90
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:21:15 2015 +0100

    AlnConf named mafftDefault with ID 217511440955273.78
    Loci: MT-CO1
    Created on: Sun Aug 30 18:21:13 2015
    Commands:
    MT-CO1: mafft 217511440955273.78_CDS_proteins_MT-CO1.fasta
    
    Environment:
    Platform: Linux-3.13.0-40-generic-x86_64-with-Ubuntu-14.04-trusty
     Processor: x86_64
     Python build: defaultJun 22 2015 17:58:13
     Python compiler: GCC 4.8.2
     Python implementation: CPython
     Python version: 2.7.6
     ete2 version: 2.2rev1056
     biopython version: 1.64
     dendropy version: 3.12.0
     cloud version: 2.8.5
     reprophylo version 1.0
     User: amir-TECRA-W50-A
     Program and version: MAFFT v7.123b\nPal2Nal v14
     Program reference:Katoh
     Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.\nMikita Suyama
     David Torrents
     and Peer Bork (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.Nucleic Acids Res. 34
     W609-W612.
    execution time:
    1.39940595627
    
    ==============================
    Core Methods section sentence:
    ==============================
    The dataset(s) MT-CO1 were first aligned at the protein level using the program MAFFT v7.123b [1].
    The resulting alignments served as guides to codon-align the DNA sequences using Pal2Nal v14 [2].
    
    Reference:
    [1]Katoh, Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
    [2]Mikita Suyama, David Torrents, and Peer Bork (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.Nucleic Acids Res. 34, W609-W612.

commit 5d9e94d44f88128374f0470d4006f4e6cb1ed10c
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:20:25 2015 +0100

    1 genbank/embl data file(s) from Sun Aug 30 18:20:25 2015

commit 0423808e2ef5bec77da2fdb482b7466916546da7
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:13:11 2015 +0100

    Project object with the loci MT-CO1, from Sun Aug 30 18:13:11 2015

commit abeaa0b6195f41049a04f0dbabfe87c9bdece320
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:13:11 2015 +0100

    2 script file(s) from Sun Aug 30 18:13:11 2015

This output is the complete list of git actions since we first started the Project, with the oldest at the bottom. Each action has a commit hash, the author of the commit, the time it was made, and an indented commit message. If we look at the messages from bottom to top we can see that so far we have done the following:

  1. Saved relevant files that preexisted in the working directory when we started the git repository (2 script files, which are this notebook and its checkpoint)
  2. Saved a pickle file of a Project with a single gene (MT-CO1)
  3. Read a genbank file into the Project and updated the pickle file
  4. Ran sequence alignment for the MT-CO1 gene using Mafft

4.1.2 Revert to older Project version

In addition to logging our actions, git allows us to 'undo' and 'redo' them by reverting to previous versions of the pickle file.
For example, let's say we want to cancel our latest sequence alignment. Our current Project has one alignment in it:


In [8]:
pj.alignments.keys()


Out[8]:
['MT-CO1@mafftDefault']

To move back to when we had no alignments in the Project, we need the 'commit hash' from our commits log, of the action the preceded the sequence alignment. The hash is the long alphanumeric string at the top of each commit, just a few characters from it's start shoud do it.
When I was writing this notebook, the git hash of the action which preceded the sequence alignment (one before last) was 5d9e94d44f88128374f0470d4006f4e6cb1ed10c, but it will be something else for you. To move back to it I do:


In [9]:
pj = revert_pickle(pj, '5d9e94d4')


Git STDOUT: 
Git STDERR: 
/home/amir/Dropbox/python_modules/reprophylo.py:240: UserWarning: Git repository exists for this Project
  warnings.warn('Git repository exists for this Project')

We get no output or errors from git, which is what we expect. When we revert, ReproPhylo restarts the Project and it lets us know that a git repository already exists, and it will keep using it.
Lets see how many alignments the Project has now:


In [10]:
pj.alignments.keys()


Out[10]:
[]

Right. No alignments now. But wait, was this reversion a mistake? No problem. We can get our alignment back. The git hash for the alignment step is 09df506f5a5a003f1665d5abf52d11fb66755a90 (will be something else for you). Lets get it back:


In [11]:
pj = revert_pickle(pj, '09df506f5')
pj.alignments.keys()


Git STDOUT: 
Git STDERR: 
Out[11]:
['MT-CO1@mafftDefault']

OK! No git error messages, and we have our alignment back in pj.alignments.

4.1.3 Recovering from unintentional changes

Now lets do something stupid: We will make a new AlnConf object, with different run parameters, but without changing the name of the AlnConf object, thus overwriting the resulting alignment of the previous one. For this alignment step, this is not the end of the world, since it is very quick. However, this will work the same for long analyses, such as tree reconstruction or when there is a lot of data.


In [12]:
new_mafft = AlnConf(pj, cline_args=dict(localpair=True, maxiterate=1000))
pj.align([new_mafft])


mafft --localpair --maxiterate 1000 611281440957509.19_CDS_proteins_MT-CO1.fasta

Now, checking the used_methods dictionary, we realize the gravity of our mistake, as the new AlnConf is stored under the same key as the old one, which is now gone from both the used_methods and the alignment dictionaries:


In [13]:
print 'Alignments:'
print pj.alignments
print
print 'Used Methods:'
print pj.used_methods


Alignments:
{'MT-CO1@mafftDefault': <<class 'Bio.Align.MultipleSeqAlignment'> instance (92 records of length 1566, IUPACAmbiguousDNA()) at 7f239f2b2950>}

Used Methods:
{'mafftDefault': <reprophylo.AlnConf instance at 0x7f239f52f488>}

Checking the string representation of the AlnConf object, which has the same name as the old one, will confirm it shows the new command line, rather than the old one:


In [14]:
print pj.used_methods['mafftDefault']


AlnConf named mafftDefault with ID 611281440957509.19
Loci: MT-CO1 
Created on: Sun Aug 30 18:58:29 2015
Commands:
MT-CO1: mafft --localpair --maxiterate 1000 611281440957509.19_CDS_proteins_MT-CO1.fasta

Environment:
Platform: Linux-3.13.0-40-generic-x86_64-with-Ubuntu-14.04-trusty
 Processor: x86_64
 Python build: defaultJun 22 2015 17:58:13
 Python compiler: GCC 4.8.2
 Python implementation: CPython
 Python version: 2.7.6
 ete2 version: 2.2rev1056
 biopython version: 1.64
 dendropy version: 3.12.0
 cloud version: 2.8.5
 reprophylo version 1.0
 User: amir-TECRA-W50-A
 Program and version: MAFFT v7.123b\nPal2Nal v14
 Program reference:Katoh
 Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.\nMikita Suyama
 David Torrents
 and Peer Bork (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.Nucleic Acids Res. 34
 W609-W612.
execution time:
3.97148609161


==============================
Core Methods section sentence:
==============================
The dataset(s) MT-CO1 were first aligned at the protein level using the program MAFFT v7.123b [1].
The resulting alignments served as guides to codon-align the DNA sequences using Pal2Nal v14 [2].

Reference:
[1]Katoh, Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
[2]Mikita Suyama, David Torrents, and Peer Bork (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.Nucleic Acids Res. 34, W609-W612.

Thanks to the Git repository, it is possible to recover from this blunder. We can spot an old version that contains the original alignment step and revert to it.


In [15]:
pj.show_commits()


commit 1e11023bab07af3882b10bae65053301c0c16997
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:58:33 2015 +0100

    AlnConf named mafftDefault with ID 611281440957509.19
    Loci: MT-CO1
    Created on: Sun Aug 30 18:58:29 2015
    Commands:
    MT-CO1: mafft --localpair --maxiterate 1000 611281440957509.19_CDS_proteins_MT-CO1.fasta
    
    Environment:
    Platform: Linux-3.13.0-40-generic-x86_64-with-Ubuntu-14.04-trusty
     Processor: x86_64
     Python build: defaultJun 22 2015 17:58:13
     Python compiler: GCC 4.8.2
     Python implementation: CPython
     Python version: 2.7.6
     ete2 version: 2.2rev1056
     biopython version: 1.64
     dendropy version: 3.12.0
     cloud version: 2.8.5
     reprophylo version 1.0
     User: amir-TECRA-W50-A
     Program and version: MAFFT v7.123b\nPal2Nal v14
     Program reference:Katoh
     Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.\nMikita Suyama
     David Torrents
     and Peer Bork (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.Nucleic Acids Res. 34
     W609-W612.
    execution time:
    3.97148609161
    
    ==============================
    Core Methods section sentence:
    ==============================
    The dataset(s) MT-CO1 were first aligned at the protein level using the program MAFFT v7.123b [1].
    The resulting alignments served as guides to codon-align the DNA sequences using Pal2Nal v14 [2].
    
    Reference:
    [1]Katoh, Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
    [2]Mikita Suyama, David Torrents, and Peer Bork (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.Nucleic Acids Res. 34, W609-W612.

commit 809b314e27f5a3303f64a2ecf3a1556b4cd327bd
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:54:19 2015 +0100

    2 script file(s) from Sun Aug 30 18:54:19 2015

commit 39434c1c76a39b9a5f8cda246c714e30817a3138
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:49:46 2015 +0100

    2 script file(s) from Sun Aug 30 18:49:46 2015

commit 09df506f5a5a003f1665d5abf52d11fb66755a90
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:21:15 2015 +0100

    AlnConf named mafftDefault with ID 217511440955273.78
    Loci: MT-CO1
    Created on: Sun Aug 30 18:21:13 2015
    Commands:
    MT-CO1: mafft 217511440955273.78_CDS_proteins_MT-CO1.fasta
    
    Environment:
    Platform: Linux-3.13.0-40-generic-x86_64-with-Ubuntu-14.04-trusty
     Processor: x86_64
     Python build: defaultJun 22 2015 17:58:13
     Python compiler: GCC 4.8.2
     Python implementation: CPython
     Python version: 2.7.6
     ete2 version: 2.2rev1056
     biopython version: 1.64
     dendropy version: 3.12.0
     cloud version: 2.8.5
     reprophylo version 1.0
     User: amir-TECRA-W50-A
     Program and version: MAFFT v7.123b\nPal2Nal v14
     Program reference:Katoh
     Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.\nMikita Suyama
     David Torrents
     and Peer Bork (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.Nucleic Acids Res. 34
     W609-W612.
    execution time:
    1.39940595627
    
    ==============================
    Core Methods section sentence:
    ==============================
    The dataset(s) MT-CO1 were first aligned at the protein level using the program MAFFT v7.123b [1].
    The resulting alignments served as guides to codon-align the DNA sequences using Pal2Nal v14 [2].
    
    Reference:
    [1]Katoh, Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
    [2]Mikita Suyama, David Torrents, and Peer Bork (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.Nucleic Acids Res. 34, W609-W612.

commit 5d9e94d44f88128374f0470d4006f4e6cb1ed10c
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:20:25 2015 +0100

    1 genbank/embl data file(s) from Sun Aug 30 18:20:25 2015

commit 0423808e2ef5bec77da2fdb482b7466916546da7
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:13:11 2015 +0100

    Project object with the loci MT-CO1, from Sun Aug 30 18:13:11 2015

commit abeaa0b6195f41049a04f0dbabfe87c9bdece320
Author: Amir Szitenberg <szitenberg@gmail.com>
Date:   Sun Aug 30 18:13:11 2015 +0100

    2 script file(s) from Sun Aug 30 18:13:11 2015

The git log lists a sequence alignment at the top, the very last alignment we ran. But we want to revert to an earlier sequence alignment. If we scroll down the log we can find this earlier alignment and get its git hash. For me it is 09df506f5a5a003f1665d5abf52d11fb66755a90 but it will be something else for you.

Wait! before we revert, we need to grab hold of the new alignment and its used method, so that we can add them to the Project under a different method name, after we revert:


In [17]:
latest_alignment_object = pj.alignments['MT-CO1@mafftDefault']
latest_used_method = pj.used_methods['mafftDefault']

now we can revert:


In [18]:
pj = revert_pickle(pj, '09df506f5a')


Git STDOUT: 
Git STDERR: 

Good. Last step, we add the latest alignment and used method, but with a different name:


In [19]:
new_name = 'mafft_linsi'

# add the alignment to the Project
pj.alignments['MT-CO1@' + new_name] = latest_alignment_object

# Fix the used method name
latest_used_method.method_name = new_name

# Add the latest used method to the used_methods dict:
pj.used_methods[new_name] = latest_used_method

How many alignments and used methods are there now?


In [22]:
pj.alignments.keys()


Out[22]:
['MT-CO1@mafft_linsi', 'MT-CO1@mafftDefault']

In [23]:
pj.used_methods.keys()


Out[23]:
['mafftDefault', 'mafft_linsi']

Good. Now we have the Project, with the two alternative sequence alignments of the MT-CO1 gene. Nothing is lost, nothing had to be rerun, thanks to git.
We're not done!
The Project is automatically pickled when we

  • Read data
  • Read metadata
  • Run alignment, trimming or tree reconstruction

We have done nothing of those as our last step, so the pickle is not up to date. Let's save it:


In [24]:
pickle_pj(pj, 'git_demo_files/git_demo')


Out[24]:
'git_demo_files/git_demo'

OK, now we're done. We can turn the machine off. Next time we'll start as follows and carry on from where we stoped (git=True by default):


In [25]:
pj = unpickle_pj('git_demo_files/git_demo')

4.2 Possible error messages

If you are not using the Docker ReproPhylo distribution, and you are new to Git, you might get the following error when you start a new Project with pj=Project('loci_file',pickle='pikle_filename'):

RuntimeError: Git: set your email with '!git config --global user.email "your_email@example.com"' or disable git (the ! is needed in Jupyter Notebook. In a terminal, ommit it)

This is because git expects your email to be configured. To configure it, run the following in a terminal:

git config --global user.email "your_email@example.com"

Another possible error when you start a new Project with pj=Project('loci_file',pickle='pikle_filename'), as opposed to loading one with unpickle_pj or with revert_pickle, can arise because Project expects pickle to be a file name that does not yet exist. Otherwise, the following error will be raised,

IOError: Pickle git_demo_files/git_demo exists. If you want to keep using it do pj=unpickle_pj('git_demo_files/git_demo') instead.

to protect you from unintentionally deleting existing projects.

ReproPhylo also tries to make sure that an unpickled, reverted or new Project can identify its unique Git repository. This connection can be broken if a Git reporsitory already existed in the working directory, which does not belong to the current Project or if the pickle file was moved independently from the directory in which it is found. The Git repository is found in a directory called .git, which is a hidden directory. To view hidden files and folders in your file browser, click ctrt+H. If you want to move the Project to another location, the folder containing both the .git directory and the pickle file must be moved as one unit. Should the connection between a Project and its Git repository be broken, the following error wil be show:

RuntimeError: The Git repository in the CWD does not belong to this project. Either the pickle moved, or this is a preexsisting repo. Try one of the following: Delete the local .Git dir if you don't need it, move the pickle and the notebook to a new work dir, or if possible, move them back to their original location. You may also disable Git by with stop_git().

Note that even if the link between a repository and a project was broken, the pickle file still contains the full Project and is totally usable, by passing git=False, like this:

pj=unpickle_pj('my_pickle_file', git=False)

4.2 The short version


In [ ]:
# Show the last git action
pj.last_git_log()

# Show all the commits in the git repository
pj.show_commits()

# Revert to a previous commit
# Using a hash from the commits list
pj = revert_pickle(pj, '5d9e94d4')