See quick reference at the bottom
See full module reference section for full details
This part will show methods by which to read data into the ReproPhylo Project
GenBank or EMBL files should be the prefered way to read data from online databases because ReproPhylo can store all the associated metadata and make it available for steps such as tree annotation or even Bayestraits analysis. When we pass a GenBank file, only loci and feature types that match the loci we have passed upon creating the Project
will be retained, and the rest will be ignored. This is handy for multi-featured GenBank entries that contain any number of genes on top of the ones we are interested in. In this example, only cox1 CDSs will be read from entries of complete mitochondrial genomes. First we read the pickle file from the section 3.2:
In [1]:
from reprophylo import *
pj = unpickle_pj('outputs/my_project.pkpj', git=False)
Now we can add data to the project, by reading a list of one or more GenBank files:
In [2]:
input_filnames = ['data/Tetillidae1.gb', 'data/Tetillidae2.gb']
pj.read_embl_genbank(input_filnames)
When GenBank or EMBL files are read, the accession numbers are used as sequence ID in ReproPhylo. But when other file formats are used, it is difficult to predict whether a unique sequence ID is available in the sequence header. Therefore, ReproPhylo regards data read from other file formats as 'denovo' and creates denovo sequence IDs. For the same reason, there is no mechanism to prevent you from reading the same file twice, at the moment. All the information found in the original sequence header is retained and made available as metadata. ReproPhylo can handle any format that is compatible with the SeqIO module of Biopython. Reading prealigned sequences is done by a different dedicated method which will be discussed below.
In this example we read a fasta file with an unpublished sequence. We will specify the data type ('dna') and the file format. This means that DNA and protein files need to be read in two separate actions.
In [3]:
# This list can include one or more file names
denovo_sequence_filenames = ['data/Tetillidae_denovo_sequence.fasta']
pj.read_denovo(denovo_sequence_filenames, 'dna', format='fasta')
Out[3]:
This is how the 'denovo' record looks like if we ask to print it in GenBank format:
In [4]:
for r in pj.records:
if r.id == 'denovo0': print r.format('genbank')
The record was assigned the ID 'denovo0', and a 'source' feature was created, including the fasta header as the 'original_id' and 'original_desc' qualifiers. However, it has no feature to indicate what locus it is and it will be ignored down the line. It is now up to us to add such a feature. Note that for large scale data, such as Exonerate results, other methods apply and will be discussed later.
Here we only have one new sequence and we know its ID - 'denovo0' so it is easy enough to add a feature:
In [5]:
pj.add_feature_to_record('denovo0', 'CDS', qualifiers={'gene': 'cox1'})
Out[5]:
Feature 'denovo0_f0' was created.
Often we would want to assign gene names to a whole lot of sequences based on one name we recognize in the fasta header. We can create a dictionary that will specify the gene and feature type of each sequence:
feature_lookup = {'NIWA2850': ['CDS','cox1'], # If we had more sequences we would add them here: # 'Seq2': ['18S', 'rRNA'], # ... }
Now we can use this dictionary to create the feature:
for r in pj.records: source = r.features[0] quals = source.qualifiers if 'original_id' in quals and quals['original_id'][0] in feature_lookup: original_id = quals['original_id'][0] feature_type = feature_lookup[original_id][0] gene = feature_lookup[original_id][1] pj.add_feature_to_record(r.id, feature_type, qualifiers={'gene': gene})
The add_feature_to_record
method allows to limit the feature to just a part of the sequence and to add any number of qualifiers. Look it up in the module reference.
This is how the record looks now, with the new feature added:
In [6]:
for r in pj.records:
if r.id == 'denovo0': print r.format('genbank')
Through the qualifiers dictionary, we can also attempt to add a translation of the sequence. We can also define a location for the feature, as a subset of the whole sequence :
qualifiers={'gene': 'cox1', 'transl_table': 4, 'codon_start': 1, 'organism': 'Craniella microsigma'}
for record in pj.records: if 'denovo' in record.id: # New sequences are assigned with IDs starting
# with 'denovo'
pj.add_feature_to_record(record.id, 'CDS',
# The location is specified as a list
# of lists. Every sub-list is an exon
# and has the start, the end and the strand.
# The numbers are "real" positions and not
# machine. ie, counting starts from 1.
location=[[1,786,1],[1742,2092,1]],
qualifiers=qualifiers)
</pre>
transl_table
is the genetic code to use in order to translate the coding sequence into a protein. The number, 4 in this case, specify the table to use, out of the GenBank genetic code tables.
ReproPhylo allows to read prealigned sequences in any of the Biopython AlignIO compatible formats, as follows:
pj.read_alignment('Another_locus.nex', 'dna', 'CDS', 'ND5', format='nexus')This will place the alignment in the
Project.alignments
attribute (pj.alignments
in this case) and the unaligned sequences as records in Project.records
. There must be a Locus
object in pj.loci
, that is compatible with the character (dna) feature type (CDS) and the locus name (ND5) specified in the read_alignmnet
command. The records will be assigned ‘denovo’ IDs, and the nexus sequence names will be stored in the ‘original_id’ qualifiers. ‘original_desc’ qualifier remain empty in this case, because nexus files don’t have them.
Many published datasets are available in nexus format with charset
commands that describe the data partitions. ReproPhylo can read such a matrix, split the partitions into individual alignments and place them in Project.alignments
, and then put each sequence from each partition in Project.records
. This facilitates experimentation with the data composition. It is even possible to turn such a nexus file directly into a new Project
instance with all the information set up. To do that use the following command:
nexus_filename = 'data/some_supermatrix.nex' pj = pj_from_nexus_w_charset(nexus_filename, 'data', # path to write intermediate fasta file 'dna', # Character type ('dna' or 'prot') 'CDS', # Feature type (Any) project = True, # Will return a Project instance instead of a list # of fasta files per partition # if project will save it to this file: pickle = 'new_pickle_name', git = True) # Will start and manage repository
In [8]:
# Update the pickle file
pickle_pj(pj, 'outputs/my_project.pkpj')
Out[8]:
In [ ]:
# Read GenBank or embl files
input_filnames = ['file1', 'file2']
pj.read_embl_genbank(input_filnames)
# Read other formats
denovo_sequence_filenames = ['file1.fasta', 'file2.fasta']
pj.read_denovo(denovo_sequence_filenames, 'dna', format='fasta')
#Or
pj.read_denovo(denovo_sequence_filenames, 'prot', format='fasta')
# Add asequence feature to a record
pj.add_feature_to_record('someRecordID', 'CDS', qualifiers={'gene': 'cox1'})
# Or
qualifiers={'gene': 'cox1',
'transl_table': 4,
'codon_start': 1,
'organism': 'Craniella microsigma'}
pj.add_feature_to_record(someRecordID, 'CDS',
location=[[1,786,1],[1742,2092,1]],
qualifiers=qualifiers)
# Read a sequence alignment
pj.read_alignment('Another_locus.nex', 'dna', 'CDS', 'ND5', format='nexus')
# Read a Nexus alignment with a super-matrix and charset commands
nexus_filename = 'data/some_supermatrix.nex'
pj = pj_from_nexus_w_charset(nexus_filename,
'data',
'dna',
'CDS',
project = True,
pickle = 'new_pickle_name',
git = True)