This section starts with a Project
that already contains alignments:
In [1]:
from reprophylo import *
pj = unpickle_pj('./outputs/my_project.pkpj',
git=False)
If we call the keys of the pj.alignments
dictionary, we can see the names of the alignments it contains:
In [2]:
pj.alignments.keys()
Out[2]:
Like the sequence alignment phase, alignment trimming has its own configuration class, the TrimalConf
class. An object of this class will generate a command-line and the required input files for the program TrimAl, but will not execute the process (this is shown below). Once the process has been successfully executed, this TrimalConf
object is also stored in pj.used_methods
and it can be invoked as a report.
With TrimalConf
, instead of specifying loci names, we provide alignment names, as they appear in the keys of pj.alignments
In [3]:
gappyout = TrimalConf(pj, # The Project
method_name='gappyout', # Any unique string ('gappyout' is default)
program_name='trimal', # No alternatives in this ReproPhylo version
cmd='default', # the default is trimal. Change it here
# or in pj.defaults['trimal']
alns=['MT-CO1@mafftLinsi'], # 'all' by default
trimal_commands={'gappyout': True} # By default, the gappyout algorithm is used.
)
In this example, it is easy enough to copy and paste alignment names into a list and pass it to TrimalConf
. But this is more difficult if we want to fish out a subset of alignments from a very large list of alignments. In such cases, Python's list comprehension is very useful. Below I show two uses of list comprehension, but the more you feel comfortable with this approach, the better.
Getting locus names of rRNA loci
If you read the code line that follows very carefully, you will see it quite literally says "take the name of each Locus
found in pj.loci
if its feature type is rRNA, and put it in a list":
In [4]:
rRNA_locus_names = [locus.name for locus in pj.loci if locus.feature_type == 'rRNA']
print rRNA_locus_names
what we get is a list of names of our rRNA loci.
Getting alignment names that have locus names of rRNA loci
The following line says: "take the key of each alignment from the pj.alignments
dictionary if the first word before the '@' symbol is in the list of rRNA locus names, and put this key in a list":
In [5]:
rRNA_alignment_names = [key for key in pj.alignments.keys() if key.split('@')[0] in rRNA_locus_names]
print rRNA_alignment_names
We get a list of keys, of the rRNA loci alignments we produced on the previous section, and which are stored in the pj.alignments
dictionary. We can now pass this list to a new TrimalConf
instance that will only process rRNA locus alignments:
In [6]:
gt50 = TrimalConf(pj,
method_name='gt50',
alns = rRNA_alignment_names,
trimal_commands={'gt': 0.5} # This will keep positions with up to
# 50% gaps.
)
In [7]:
pj.trim([gappyout, gt50])
Once used, these objects are also placed in the pj.used_methods
dictionary, and they can be printed out for observation:
In [8]:
print pj.used_methods['gappyout']
pj.trimmed_alignments
dictionaryThe trimmed alignments themselves are stored in the pj.trimmed_alignments
dictionary, using keys that follow this pattern: locus_name@alignment_method_name@trimming_method_name
where alignment_method_name
is the name you have provided to your AlnConf
object and trimming_method_name
is the one you provided to your TrimalConf
object.
In [9]:
pj.trimmed_alignments
Out[9]:
MultipleSeqAlignment
objectA trimmed alignment can be easily accessed and manipulated with any of Biopython's AlignIO
tricks using the fta
Project
method:
In [10]:
print pj.fta('18s@muscleDefault@gt50')[:4,410:420].format('phylip-relaxed')
Trimmed alignment text files can be dumped in any AlignIO
format for usage in an external command line or GUI program. When writing to files, you can control the header of the sequence by, for example, adding the organism name of the gene name, or by replacing the feature ID with the record ID:
In [11]:
# record_id and source_organism are feature qualifiers in the SeqRecord object
# See section 3.4
files = pj.write_trimmed_alns(id=['record_id','source_organism'],
format='fasta')
files
Out[11]:
The files will always be written to the current working directory (where this notebook file is), and can immediately be moved programmatically to avoid clutter:
In [12]:
# make a new directory for your trimmed alignment files:
if not os.path.exists('trimmed_alignment_files'):
os.mkdir('trimmed_alignment_files')
# move the files there
for f in files:
os.rename(f, "./trimmed_alignment_files/%s"%f)
In [13]:
pj.show_aln('MT-CO1@mafftLinsi@gappyout',id=['source_organism'])
In [15]:
pickle_pj(pj, 'outputs/my_project.pkpj')
Out[15]:
In [ ]:
# Make a TrimalConf object
trimconf = TrimalConf(pj, **kwargs)
# Execute alignment process
pj.trim([trimconf])
# Show AlnConf description
print pj.used_methods['method_name']
# Fetch a MultipleSeqAlignment object
trim_aln_obj = pj.fta('locus_name@aln_method_name@trim_method_name')
# Write alignment text files
pj.write_trimmed_alns(id=['some_feature_qualifier'], format='fasta')
# the default feature qualifier is 'feature_id'
# 'fasta' is the default format
# View alignment in browser
pj.show_aln('locus_name@aln_method_name@trim_method_name',id=['some_feature_qualifier'])
In [ ]: