This section starts with a Project that already contains alignments:
In [1]:
from reprophylo import *
pj = unpickle_pj('./outputs/my_project.pkpj',
git=False)
If we call the keys of the pj.alignments dictionary, we can see the names of the alignments it contains:
In [2]:
pj.alignments.keys()
Out[2]:
Like the sequence alignment phase, alignment trimming has its own configuration class, the TrimalConf class. An object of this class will generate a command-line and the required input files for the program TrimAl, but will not execute the process (this is shown below). Once the process has been successfully executed, this TrimalConf object is also stored in pj.used_methods and it can be invoked as a report.
With TrimalConf, instead of specifying loci names, we provide alignment names, as they appear in the keys of pj.alignments
In [3]:
gappyout = TrimalConf(pj, # The Project
method_name='gappyout', # Any unique string ('gappyout' is default)
program_name='trimal', # No alternatives in this ReproPhylo version
cmd='default', # the default is trimal. Change it here
# or in pj.defaults['trimal']
alns=['MT-CO1@mafftLinsi'], # 'all' by default
trimal_commands={'gappyout': True} # By default, the gappyout algorithm is used.
)
In this example, it is easy enough to copy and paste alignment names into a list and pass it to TrimalConf. But this is more difficult if we want to fish out a subset of alignments from a very large list of alignments. In such cases, Python's list comprehension is very useful. Below I show two uses of list comprehension, but the more you feel comfortable with this approach, the better.
Getting locus names of rRNA loci
If you read the code line that follows very carefully, you will see it quite literally says "take the name of each Locus found in pj.loci if its feature type is rRNA, and put it in a list":
In [4]:
rRNA_locus_names = [locus.name for locus in pj.loci if locus.feature_type == 'rRNA']
print rRNA_locus_names
what we get is a list of names of our rRNA loci.
Getting alignment names that have locus names of rRNA loci
The following line says: "take the key of each alignment from the pj.alignments dictionary if the first word before the '@' symbol is in the list of rRNA locus names, and put this key in a list":
In [5]:
rRNA_alignment_names = [key for key in pj.alignments.keys() if key.split('@')[0] in rRNA_locus_names]
print rRNA_alignment_names
We get a list of keys, of the rRNA loci alignments we produced on the previous section, and which are stored in the pj.alignments dictionary. We can now pass this list to a new TrimalConf instance that will only process rRNA locus alignments:
In [6]:
gt50 = TrimalConf(pj,
method_name='gt50',
alns = rRNA_alignment_names,
trimal_commands={'gt': 0.5} # This will keep positions with up to
# 50% gaps.
)
In [7]:
pj.trim([gappyout, gt50])
Once used, these objects are also placed in the pj.used_methods dictionary, and they can be printed out for observation:
In [8]:
print pj.used_methods['gappyout']
pj.trimmed_alignments dictionaryThe trimmed alignments themselves are stored in the pj.trimmed_alignments dictionary, using keys that follow this pattern: locus_name@alignment_method_name@trimming_method_name where alignment_method_name is the name you have provided to your AlnConf object and trimming_method_name is the one you provided to your TrimalConf object.
In [9]:
pj.trimmed_alignments
Out[9]:
MultipleSeqAlignment objectA trimmed alignment can be easily accessed and manipulated with any of Biopython's AlignIO tricks using the fta Project method:
In [10]:
print pj.fta('18s@muscleDefault@gt50')[:4,410:420].format('phylip-relaxed')
Trimmed alignment text files can be dumped in any AlignIO format for usage in an external command line or GUI program. When writing to files, you can control the header of the sequence by, for example, adding the organism name of the gene name, or by replacing the feature ID with the record ID:
In [11]:
# record_id and source_organism are feature qualifiers in the SeqRecord object
# See section 3.4
files = pj.write_trimmed_alns(id=['record_id','source_organism'],
format='fasta')
files
Out[11]:
The files will always be written to the current working directory (where this notebook file is), and can immediately be moved programmatically to avoid clutter:
In [12]:
# make a new directory for your trimmed alignment files:
if not os.path.exists('trimmed_alignment_files'):
os.mkdir('trimmed_alignment_files')
# move the files there
for f in files:
os.rename(f, "./trimmed_alignment_files/%s"%f)
In [13]:
pj.show_aln('MT-CO1@mafftLinsi@gappyout',id=['source_organism'])
In [15]:
pickle_pj(pj, 'outputs/my_project.pkpj')
Out[15]:
In [ ]:
# Make a TrimalConf object
trimconf = TrimalConf(pj, **kwargs)
# Execute alignment process
pj.trim([trimconf])
# Show AlnConf description
print pj.used_methods['method_name']
# Fetch a MultipleSeqAlignment object
trim_aln_obj = pj.fta('locus_name@aln_method_name@trim_method_name')
# Write alignment text files
pj.write_trimmed_alns(id=['some_feature_qualifier'], format='fasta')
# the default feature qualifier is 'feature_id'
# 'fasta' is the default format
# View alignment in browser
pj.show_aln('locus_name@aln_method_name@trim_method_name',id=['some_feature_qualifier'])
In [ ]: