More complex sequences (like plasmids) carry many annotated pieces, and simple slicing quickly becomes tedious. sequence.DNA provides methods for accessing and modifying such sequences.
The following sequence is a plasmid that integrates at the S. cerevisiae HO locus via ends-out integration, inserting the GEV transactivator from McIsaac et al. 2011:
In [1]:
import coral as cor
In [2]:
pKL278 = cor.seqio.read_dna('./files_for_tutorial/maps/pMODKan-HO-pACT1GEV.ape')
Sequences have .name and .id attributes that are empty strings by default. By convention, you should fill them with appropriate strings for your use case - the name is a human-readable label, while the id should be a unique number or string.
In [3]:
pKL278.name # Raw GenBank name field - truncated due to GenBank specifications
Out[3]:
Large sequences have summary representations, which are useful for getting a general idea of which sequence you're manipulating:
In [4]:
pKL278 # The sequence representation - shows ~40 bases on each side.
Out[4]:
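To make the idea of a summary representation concrete, here is a minimal pure-Python sketch that shows only the two ends of a long string. The function name summarize and the ' ... ' separator are assumptions made for illustration; coral's own representation may be formatted differently.

```python
def summarize(seq, flank=40):
    """Return the first and last `flank` characters of a long sequence.

    Hypothetical helper for illustration only - not part of coral.
    """
    # Short sequences are returned unchanged: there is nothing to hide.
    if len(seq) <= 2 * flank:
        return seq
    return seq[:flank] + ' ... ' + seq[-flank:]


# A 100-base toy sequence, trimmed to 5 bases on each side.
toy = 'A' * 50 + 'C' * 50
print(summarize(toy, flank=5))
```

Keeping both ends visible is usually enough to recognize which construct you are holding without printing thousands of bases.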
Complex sequences usually have annotations to categorize functional or important elements. This plasmid has a lot of features - it's a yeast shuttle vector, so it has sequences for propagating in E. coli, sequences for integrating into the S. cerevisiae genome, sequences for selection after transformation, and an expression cassette (promoter, gene, terminator). In addition, it has common primer sites and annotated subsequences.
In [5]:
pKL278.features # Man that's way too many features
Out[5]:
With all of these features, manual slicing is inconvenient. The .extract() method makes it easy to isolate features from a complex sequence:
In [6]:
# The beta-lactamase coding sequence, essential for propagation in *E. coli* on Amp/Carb media.
# Note that it is transcribed in the direction of the bottom strand (right to left on this sequence)
bla = [f for f in pKL278.features if f.name == 'bla'][0]
pKL278.extract(bla)
Out[6]:
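The idea behind extracting a bottom-strand feature can be sketched in plain Python: slice out the annotated region, then reverse-complement it so it reads 5'->3'. Everything below is a stand-in written for this illustration - the Feature tuple, the strand encoding (0 = top, 1 = bottom), and the helpers extract and find_feature are assumptions, not coral's API, and this tutorial does not state how .extract() orients its result.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation; NOT coral's sequence.Feature.
Feature = namedtuple('Feature', ['name', 'start', 'stop', 'strand'])

COMPLEMENT = str.maketrans('ATGC', 'TACG')


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_feature(features, name):
    """Return the first feature with a matching name, or raise KeyError."""
    matches = [f for f in features if f.name == name]
    if not matches:
        raise KeyError(name)
    return matches[0]


def extract(seq, feature):
    """Slice out a feature, flipping it if it lies on the bottom strand."""
    region = seq[feature.start:feature.stop]
    if feature.strand == 1:  # assumption: 0 = top strand, 1 = bottom strand
        region = reverse_complement(region)
    return region


# Toy top-strand sequence with a tiny bottom-strand 'orf' at positions 4-10.
toy_plasmid = 'GGGGTTACATAAAA'
toy_features = [Feature('orf', 4, 10, 1)]
print(extract(toy_plasmid, find_feature(toy_features, 'orf')))
```

Raising a KeyError for a missing name gives a clearer message than indexing [0] on an empty list, which fails with an IndexError.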
The .features attribute is just a list of sequence.Feature objects - you can add or remove them at will using standard Python list methods (like .pop and .append). The use of sequence.Feature will be covered in a different tutorial.
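Because the annotations live in an ordinary list, the usual list idioms apply. The sketch below uses a hypothetical namedtuple in place of sequence.Feature, with made-up names and coordinates, so that it runs without coral; only the list operations themselves are the point.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation; NOT coral's sequence.Feature.
Feature = namedtuple('Feature', ['name', 'start', 'stop'])

# Made-up annotations, for illustration only.
features = [
    Feature('bla', 0, 861),
    Feature('ori', 900, 1500),
    Feature('M13-fwd', 1600, 1617),
]

# Drop every feature with a given name by rebuilding the list.
features = [f for f in features if f.name != 'M13-fwd']

# Remove a single feature by position with .pop(), keeping the removed item.
ori_index = next(i for i, f in enumerate(features) if f.name == 'ori')
removed = features.pop(ori_index)

# Add a new annotation with .append().
features.append(Feature('my-insert', 2000, 2500))

names = [f.name for f in features]
print(names)
```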
In addition, you can efficiently match patterns in your sequence using .locate(), which searches for a string on both the top and bottom strands, returning a tuple containing the indexes of the matches (top and bottom strands). In the following case, there are 8 matches for the top strand and 5 for the bottom strand. In the case of a palindromic query, only the top strand is reported.
In [7]:
pKL278.locate('atgcc') # All occurrences of the pattern atgcc on the top and bottom strands (both 5'->3')
Out[7]:
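A two-strand search of this kind can be sketched in plain Python by searching the sequence and then its reverse complement. This is not coral's implementation: the function name locate_both_strands is hypothetical, overlapping matches are counted, and bottom-strand indexes are reported in reverse-complement coordinates, which is an assumption that may not match what .locate() returns.

```python
import re

COMPLEMENT = str.maketrans('ATGC', 'TACG')


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def locate_both_strands(seq, pattern):
    """Return (top_indexes, bottom_indexes) for a pattern read 5'->3'.

    Hypothetical helper for illustration only - not part of coral.
    """
    seq, pattern = seq.upper(), pattern.upper()

    def find_all(text):
        # A lookahead matches at every start position, including overlaps.
        lookahead = '(?=' + re.escape(pattern) + ')'
        return [m.start() for m in re.finditer(lookahead, text)]

    top = find_all(seq)
    # A palindromic query matches both strands at the same sites,
    # so only the top-strand hits are reported.
    if pattern == reverse_complement(pattern):
        return top, []
    # Assumption: bottom indexes are positions in the reverse complement.
    return top, find_all(reverse_complement(seq))


print(locate_both_strands('ATGCCAAGGCATTT', 'atgcc'))
print(locate_both_strands('AAGAATTCAA', 'gaattc'))
```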
There are additional methods that can't be (easily) demonstrated in this tutorial.
The .ape() method will launch ApE with your sequence, provided the ApE executable can be found via your PATH environment variable. This enables some convenient analyses that are faster with a GUI, like simulating a digest or viewing the general layout of annotations.
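If you want to check up front whether an external program is reachable through PATH, the standard library can do it. The sketch below is a generic pattern, not how .ape() is implemented; the function name launch_viewer and the executable name 'ApE' are assumptions, and the actual name of the ApE executable depends on how it was installed on your system.

```python
import shutil
import subprocess


def launch_viewer(path, executable='ApE'):
    """Open `path` in an external viewer if the executable is on PATH.

    Hypothetical helper for illustration only - not part of coral.
    Returns True if the program was started, False if it was not found.
    """
    resolved = shutil.which(executable)  # None when nothing on PATH matches
    if resolved is None:
        return False
    # Start the GUI without waiting for it to exit.
    subprocess.Popen([resolved, path])
    return True


# With a name that is not on PATH, nothing is launched.
print(launch_viewer('example.ape', executable='no-such-viewer-on-this-path'))
```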