This notebook demonstrates how ReproPhylo and pickled ReproPhylo Project files interact with Git. In section 3 we disabled Git and saved the pickle file manually at the end of each subsection. However, ReproPhylo is designed to update the Project's pickle file automatically after time-consuming steps, and also to create a version control repository and record versions in real time. All of this happens if we start a Project with the default setting, git=True.
Once we start a Project this way, it must be the only version-controlled Project in the current working directory. Any additional Project has to be started with a different pickle name and with git=False. If these conditions are not met, error messages will explain what to change.
Also, once a Project has been started, it can only be resumed with the command unpickle_pj. If we try to reconstruct the Project with pj = Project(...), an error message is raised that points us to unpickle_pj instead.
Start a Project
As we did in section 3, we start a Project and provide a pickle file name. This time, however, we do not pass git=False, so Git is invoked, which is the default behaviour.
In [2]:
from reprophylo import *
pj = Project('git_demo_files/loci_edited.csv', pickle='git_demo_files/git_demo')
We get three warnings, which are only information messages.
Read data
We can move on to reading data and aligning some loci:
In [4]:
genbank = './git_demo_files/Tetillidae.gb'
pj.read_embl_genbank([genbank])
Do alignment
In [5]:
pj.extract_by_locus()
mafft = AlnConf(pj)
pj.align([mafft])
So our data was split into bins according to the Locus objects in the Project, and all the loci were aligned with the default settings of Mafft.
Show last Git action (which was to commit the pickle with the alignment)
At this point, let's check what pickle and Git did in the background by asking for Git info:
In [6]:
pj.last_git_log()
The last Git action was to commit the pickle file after the sequence alignment was complete. The Git message is the report we get when we print the used method (from pj.used_methods, if you recall).
We can show the full log like this:
In [7]:
pj.show_commits()
This output is the complete list of Git actions since we first started the Project, with the oldest at the bottom. Each action has a commit hash, the author of the commit, the time it was made, and an indented commit message. If we look at the messages from bottom to top, we can see that so far we have done the following:
- Started a Project with a single gene (MT-CO1)
- Read the data into the Project and updated the pickle file
- Aligned the sequences and committed a new Project version
In addition to logging our actions, Git allows us to 'undo' and 'redo' them by reverting to previous versions of the pickle file.
For example, let's say we want to cancel our latest sequence alignment. Our current Project has one alignment in it:
In [8]:
pj.alignments.keys()
Out[8]:
To move back to when we had no alignments in the Project, we need the 'commit hash', from our commits log, of the action that preceded the sequence alignment. The hash is the long alphanumeric string at the top of each commit; the first few characters are enough.
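Picking the hash out of a long log by eye is error prone, so here is a minimal sketch of doing it in code. It assumes the log text follows the default `git log` layout (a `commit <hash>` line, then author, date and an indented message). The sample log below is hypothetical: the author, dates and messages are placeholders, and only the two hashes are the ones quoted in this notebook. The helpers `commit_hashes` and `resolve_prefix` are illustrative names, not part of ReproPhylo.

```python
import re

# Hypothetical log text in the default `git log` layout. Author, dates and
# messages are placeholders; the hashes are the two quoted in this notebook.
SAMPLE_LOG = """\
commit 09df506f5a5a003f1665d5abf52d11fb66755a90
Author: Your Name <your_email@example.com>
Date:   (placeholder)

    (message of the alignment step)

commit 5d9e94d44f88128374f0470d4006f4e6cb1ed10c
Author: Your Name <your_email@example.com>
Date:   (placeholder)

    (message of the step before the alignment)
"""


def commit_hashes(log_text):
    """Return the full 40-character hashes in log_text, newest first."""
    return re.findall(r'^commit ([0-9a-f]{40})$', log_text, flags=re.MULTILINE)


def resolve_prefix(prefix, hashes):
    """Return the single full hash that starts with prefix.

    Raise ValueError if the prefix matches no commit or more than one.
    """
    matches = [h for h in hashes if h.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError('prefix %r matches %d commits' % (prefix, len(matches)))
    return matches[0]


hashes = commit_hashes(SAMPLE_LOG)
full_hash = resolve_prefix('5d9e94d4', hashes)
print(full_hash)
```

A prefix that is too short can match more than one commit, which is why the sketch refuses ambiguous prefixes rather than picking one silently.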
When I was writing this notebook, the Git hash of the action that preceded the sequence alignment (one before last) was 5d9e94d44f88128374f0470d4006f4e6cb1ed10c, but it will be something else for you. To move back to it I do:
In [9]:
pj = revert_pickle(pj, '5d9e94d4')
We get no output or errors from Git, which is what we expect. When we revert, ReproPhylo restarts the Project, lets us know that a Git repository already exists, and keeps using it.
Let's see how many alignments the Project has now:
In [10]:
pj.alignments.keys()
Out[10]:
Right. No alignments now. But wait, was this reversion a mistake? No problem. We can get our alignment back. The Git hash for the alignment step is 09df506f5a5a003f1665d5abf52d11fb66755a90 (it will be something else for you). Let's get it back:
In [11]:
pj = revert_pickle(pj, '09df506f5')
pj.alignments.keys()
Out[11]:
OK! No Git error messages, and we have our alignment back in pj.alignments.
Now let's do something stupid: we will make a new AlnConf object with different run parameters but without changing its method name, so that its alignment overwrites the one produced by the previous AlnConf. For this alignment step it is not the end of the world, since it is very quick. However, the same thing would happen with long analyses, such as tree reconstruction, or when there is a lot of data.
In [12]:
new_mafft = AlnConf(pj, cline_args=dict(localpair=True, maxiterate=1000))
pj.align([new_mafft])
Now, checking the used_methods dictionary, we realize the gravity of our mistake: the new AlnConf is stored under the same key as the old one, which is now gone from both the used_methods and the alignments dictionaries:
In [13]:
# Written so that it prints the same under Python 2 and Python 3
print('Alignments:')
print(pj.alignments)
print('')
print('Used Methods:')
print(pj.used_methods)
Checking the string representation of the AlnConf object, which has the same name as the old one, confirms that it shows the new command line rather than the old one:
In [14]:
print pj.used_methods['mafftDefault']
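One way to avoid this kind of accident is to check whether a method name is already taken before running a new analysis. The sketch below uses a plain dictionary as a stand-in for pj.used_methods, and `check_method_name` is a hypothetical helper, not a ReproPhylo function. The names 'mafftDefault' and 'mafft_linsi' are the ones used in this notebook.

```python
def check_method_name(used_methods, method_name):
    """Return method_name if it is free, raise KeyError if it is taken."""
    if method_name in used_methods:
        raise KeyError('method name %r is already in use' % method_name)
    return method_name


# Stand-in for pj.used_methods after the first alignment
used_methods = {'mafftDefault': 'first alignment configuration'}

# A fresh name passes the check
free_name = check_method_name(used_methods, 'mafft_linsi')

# Reusing the existing name is refused instead of silently overwriting
try:
    check_method_name(used_methods, 'mafftDefault')
    collision = False
except KeyError:
    collision = True

print(free_name, collision)
```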
Thanks to the Git repository, it is possible to recover from this blunder. We can spot an old version that contains the original alignment step and revert to it.
In [15]:
pj.show_commits()
The Git log lists a sequence alignment at the top, the very last alignment we ran. But we want to revert to an earlier sequence alignment. If we scroll down the log we can find this earlier alignment and get its Git hash. For me it is 09df506f5a5a003f1665d5abf52d11fb66755a90, but it will be something else for you.
Wait! Before we revert, we need to grab hold of the new alignment and its used method, so that we can add them to the Project under a different method name after we revert:
In [17]:
latest_alignment_object = pj.alignments['MT-CO1@mafftDefault']
latest_used_method = pj.used_methods['mafftDefault']
Now we can revert:
In [18]:
pj = revert_pickle(pj, '09df506f5a')
Good. Last step, we add the latest alignment and used method, but with a different name:
In [19]:
new_name = 'mafft_linsi'
# add the alignment to the Project
pj.alignments['MT-CO1@' + new_name] = latest_alignment_object
# Fix the used method name
latest_used_method.method_name = new_name
# Add the latest used method to the used_methods dict:
pj.used_methods[new_name] = latest_used_method
How many alignments and used methods are there now?
In [22]:
pj.alignments.keys()
Out[22]:
In [23]:
pj.used_methods.keys()
Out[23]:
Good. Now we have the Project with the two alternative sequence alignments of the MT-CO1 gene. Nothing is lost and nothing had to be rerun, thanks to Git.
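The rescue we just performed follows a general pattern: stash the new results, go back to the old state, then add the stashed results under a new name. Here is a minimal sketch of that pattern with plain dictionaries standing in for pj.alignments and pj.used_methods, and a deep copy standing in for a Git commit. It is only an analogy for the workflow and does not reflect how ReproPhylo stores its data.

```python
import copy

# Plain dicts standing in for pj.alignments and pj.used_methods
project = {
    'alignments': {'MT-CO1@mafftDefault': 'old alignment'},
    'used_methods': {'mafftDefault': 'old settings'},
}

# A "commit": keep a deep copy of the state before the risky step
snapshot = copy.deepcopy(project)

# The blunder: a new run under the same name overwrites the old entries
project['alignments']['MT-CO1@mafftDefault'] = 'new alignment'
project['used_methods']['mafftDefault'] = 'new settings'

# 1. Stash the new results before reverting
stashed_alignment = project['alignments']['MT-CO1@mafftDefault']
stashed_method = project['used_methods']['mafftDefault']

# 2. "Revert" to the snapshot, which still holds the old results
project = copy.deepcopy(snapshot)

# 3. Re-add the stashed results under a new method name
project['alignments']['MT-CO1@mafft_linsi'] = stashed_alignment
project['used_methods']['mafft_linsi'] = stashed_method

print(sorted(project['alignments']))
print(sorted(project['used_methods']))
```

The order matters: stashing after the revert would be too late, because the revert replaces the object that held the new results.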
We're not done!
The Project is automatically pickled after time-consuming steps, such as the sequence alignment above. We have done none of those as our last step, so the pickle is not up to date. Let's save it:
In [24]:
pickle_pj(pj, 'git_demo_files/git_demo')
Out[24]:
OK, now we're done. We can turn the machine off. Next time we'll start as follows and carry on from where we stopped (git=True by default):
In [25]:
pj = unpickle_pj('git_demo_files/git_demo')
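If the idea of saving and resuming a session through a pickle file is new to you, the following sketch shows the round trip with Python's standard pickle module and a small dictionary standing in for a Project. It is not ReproPhylo's implementation, and `save_state` and `load_state` are hypothetical names; use pickle_pj and unpickle_pj for real Projects.

```python
import os
import pickle
import tempfile


def save_state(state, path):
    """Write state to path as a pickle and return the path."""
    with open(path, 'wb') as handle:
        pickle.dump(state, handle)
    return path


def load_state(path):
    """Read back whatever object was pickled at path."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# A small stand-in for the contents of a Project
state = {
    'alignments': ['MT-CO1@mafftDefault', 'MT-CO1@mafft_linsi'],
    'used_methods': ['mafftDefault', 'mafft_linsi'],
}

# Save in a temporary directory, then load it back as a new session would
workdir = tempfile.mkdtemp()
path = save_state(state, os.path.join(workdir, 'demo_pickle'))
restored = load_state(path)

print(restored == state)
```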
If you are not using the Docker ReproPhylo distribution, and you are new to Git, you might get the following error when you start a new Project with pj=Project('loci_file',pickle='pickle_filename'):
RuntimeError: Git: set your email with '!git config --global user.email "your_email@example.com"' or disable git (the ! is needed in Jupyter Notebook. In a terminal, ommit it)
This is because git expects your email to be configured. To configure it, run the following in a terminal:
git config --global user.email "your_email@example.com"
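To find out in advance whether this error is likely, we can ask Git for the configured email from Python. This is a minimal sketch with a hypothetical helper, `configured_git_email`. It only looks at the global setting, so an email set in some other Git configuration scope is not detected, and it returns None if Git is not installed.

```python
import os
import subprocess


def configured_git_email():
    """Return the global Git user.email, or None if unset or Git is missing."""
    try:
        with open(os.devnull, 'w') as devnull:
            out = subprocess.check_output(
                ['git', 'config', '--global', '--get', 'user.email'],
                stderr=devnull)
    except (OSError, subprocess.CalledProcessError):
        # OSError: the git executable was not found.
        # CalledProcessError: git exited with an error, e.g. the key is unset.
        return None
    email = out.decode('utf-8').strip()
    return email or None


email = configured_git_email()
if email is None:
    print('No global Git email found; set one before starting a Project.')
else:
    print('Git will record commits as: ' + email)
```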
Another possible error when you start a new Project with pj=Project('loci_file',pickle='pickle_filename'), as opposed to loading one with unpickle_pj or with revert_pickle, arises because Project expects pickle to be a file name that does not yet exist. If the file already exists, the following error is raised to protect you from unintentionally overwriting existing projects:
IOError: Pickle git_demo_files/git_demo exists. If you want to keep using it do pj=unpickle_pj('git_demo_files/git_demo') instead.
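The protection described here boils down to checking for the file before creating anything. The sketch below shows that kind of guard in isolation. `new_pickle_path` is a hypothetical helper and the error text is illustrative, so this is not ReproPhylo's actual code.

```python
import os
import tempfile


def new_pickle_path(path):
    """Return path if nothing exists there yet, otherwise raise IOError."""
    if os.path.exists(path):
        raise IOError('Pickle %s exists; load it instead of starting anew' % path)
    return path


workdir = tempfile.mkdtemp()
fresh = os.path.join(workdir, 'git_demo')

# Nothing is there yet, so the name is accepted
accepted = new_pickle_path(fresh)

# Simulate a pickle saved by an earlier session
open(fresh, 'wb').close()

# The same name is now refused
try:
    new_pickle_path(fresh)
    refused = False
except IOError:
    refused = True

print(accepted == fresh, refused)
```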
ReproPhylo also tries to make sure that an unpickled, reverted or new Project can identify its unique Git repository. This connection can be broken if a Git repository that does not belong to the current Project already existed in the working directory, or if the pickle file was moved independently of the directory in which it is found. The Git repository is kept in a hidden directory called .git. To view hidden files and folders in your file browser, press Ctrl+H. If you want to move the Project to another location, the folder containing both the .git directory and the pickle file must be moved as one unit. Should the connection between a Project and its Git repository be broken, the following error will be shown:
RuntimeError: The Git repository in the CWD does not belong to this project. Either the pickle moved, or this is a preexsisting repo. Try one of the following: Delete the local .Git dir if you don't need it, move the pickle and the notebook to a new work dir, or if possible, move them back to their original location. You may also disable Git by with stop_git().
Note that even if the link between a repository and a Project is broken, the pickle file still contains the full Project and is fully usable if you pass git=False, like this:
pj=unpickle_pj('my_pickle_file', git=False)
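Since the .git directory and the pickle file have to travel together, it can be safer to move the whole folder in code than by hand. The sketch below does that with shutil and checks that both are present before and after the move. The helpers and folder names are hypothetical, and the check only confirms that the files are there; it cannot tell whether the repository really belongs to the Project.

```python
import os
import shutil
import tempfile


def has_repo_and_pickle(project_dir, pickle_relpath):
    """True if project_dir holds both a .git directory and the pickle file."""
    return (os.path.isdir(os.path.join(project_dir, '.git')) and
            os.path.isfile(os.path.join(project_dir, pickle_relpath)))


def move_project_dir(src_dir, dst_dir, pickle_relpath):
    """Move the whole folder as one unit and re-check its contents."""
    if not has_repo_and_pickle(src_dir, pickle_relpath):
        raise ValueError('%s lacks the .git directory or the pickle' % src_dir)
    shutil.move(src_dir, dst_dir)
    return has_repo_and_pickle(dst_dir, pickle_relpath)


# Build an empty stand-in for a Project folder in a temporary location
root = tempfile.mkdtemp()
src = os.path.join(root, 'old_place')
os.makedirs(os.path.join(src, '.git'))
open(os.path.join(src, 'git_demo'), 'wb').close()

dst = os.path.join(root, 'new_place')
moved_ok = move_project_dir(src, dst, 'git_demo')

print(moved_ok)
```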
In [ ]:
# Show the last git action
pj.last_git_log()
# Show all the commits in the git repository
pj.show_commits()
# Revert to a previous commit
# Using a hash from the commits list
pj = revert_pickle(pj, '5d9e94d4')