Tutorial Part 21: Introduction to Bioinformatics

So far in this tutorial, we've primarily worked on the problems of cheminformatics. We've been interested in seeing how we can use the techniques of machine learning to make predictions about the properties of molecules. In this tutorial, we're going to shift a bit and see how we can use classical computer science techniques and machine learning to tackle problems in bioinformatics.

For this, we're going to use the venerable biopython library to do some basic bioinformatics. A lot of the material in this notebook is adapted from the extensive official [Biopython tutorial]http://biopython.org/DIST/docs/tutorial/Tutorial.html). We strongly recommend checking out the official tutorial after you work through this notebook!

Colab

This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.

Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.


In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')


TensorFlow 1.x selected.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3477  100  3477    0     0  36600      0 --:--:-- --:--:-- --:--:-- 36600
add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing deepchem
done
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

deepchem-2.3.0 installation finished!
CPU times: user 2.91 s, sys: 622 ms, total: 3.54 s
Wall time: 2min 16s

We'll use pip to install biopython


In [2]:
!pip install biopython


Collecting biopython
  Downloading https://files.pythonhosted.org/packages/a8/66/134dbd5f885fc71493c61b6cf04c9ea08082da28da5ed07709b02857cbd0/biopython-1.77-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)
     |████████████████████████████████| 2.3MB 2.7MB/s 
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from biopython) (1.18.5)
Installing collected packages: biopython
Successfully installed biopython-1.77

In [3]:
import Bio
Bio.__version__


Out[3]:
'1.77'

In [4]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACATTG")
my_seq


Out[4]:
Seq('AGTACACATTG')

In [5]:
my_seq.complement()


Out[5]:
Seq('TCATGTGTAAC')

In [6]:
my_seq.reverse_complement()


Out[6]:
Seq('CAATGTGTACT')

Parsing Sequence Records

We're going to download a sample fasta file from the Biopython tutorial to use in some exercises. This file is a set of hits for a sequence (of lady slipper orcid genes).


In [7]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta


--2020-06-12 02:47:50--  https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76480 (75K) [text/plain]
Saving to: ‘ls_orchid.fasta’

ls_orchid.fasta     100%[===================>]  74.69K  --.-KB/s    in 0.03s   

2020-06-12 02:47:51 (2.36 MB/s) - ‘ls_orchid.fasta’ saved [76480/76480]

Let's take a look at what the contents of this file look like:


In [8]:
from Bio import SeqIO

for seq_record in SeqIO.parse('ls_orchid.fasta', 'fasta'):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))


gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT', SingleLetterAlphabet())
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA', SingleLetterAlphabet())
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC', SingleLetterAlphabet())
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT', SingleLetterAlphabet())
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA', SingleLetterAlphabet())
704
gi|2765649|emb|Z78524.1|CFZ78524
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC', SingleLetterAlphabet())
740
gi|2765648|emb|Z78523.1|CHZ78523
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTGCGGCAGGATCATTGTTGAGACAGCAG...AAG', SingleLetterAlphabet())
709
gi|2765647|emb|Z78522.1|CMZ78522
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...GAG', SingleLetterAlphabet())
700
gi|2765646|emb|Z78521.1|CCZ78521
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATATGATCGAGT...ACC', SingleLetterAlphabet())
726
gi|2765645|emb|Z78520.1|CSZ78520
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TTT', SingleLetterAlphabet())
753
gi|2765644|emb|Z78519.1|CPZ78519
Seq('ATATGATCGAGTGAATCTGGTGGACTTGTGGTTACTCAGCTCGCCATAGGCTTT...TTA', SingleLetterAlphabet())
699
gi|2765643|emb|Z78518.1|CRZ78518
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATAGTAG...TCC', SingleLetterAlphabet())
658
gi|2765642|emb|Z78517.1|CFZ78517
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...AGC', SingleLetterAlphabet())
752
gi|2765641|emb|Z78516.1|CPZ78516
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAT...TAA', SingleLetterAlphabet())
726
gi|2765640|emb|Z78515.1|MXZ78515
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGCTGAGACCGTAG...AGC', SingleLetterAlphabet())
765
gi|2765639|emb|Z78514.1|PSZ78514
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...CTA', SingleLetterAlphabet())
755
gi|2765638|emb|Z78513.1|PBZ78513
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GAG', SingleLetterAlphabet())
742
gi|2765637|emb|Z78512.1|PWZ78512
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...AGC', SingleLetterAlphabet())
762
gi|2765636|emb|Z78511.1|PEZ78511
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCCCCA...GGA', SingleLetterAlphabet())
745
gi|2765635|emb|Z78510.1|PCZ78510
Seq('CTAACCAGGGTTCCGAGGTGACCTTCGGGAGGATTCCTTTTTAAGCCCCCGAAA...TTA', SingleLetterAlphabet())
750
gi|2765634|emb|Z78509.1|PPZ78509
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GGA', SingleLetterAlphabet())
731
gi|2765633|emb|Z78508.1|PLZ78508
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TGA', SingleLetterAlphabet())
741
gi|2765632|emb|Z78507.1|PLZ78507
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCCCCA...TGA', SingleLetterAlphabet())
740
gi|2765631|emb|Z78506.1|PLZ78506
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...TGA', SingleLetterAlphabet())
727
gi|2765630|emb|Z78505.1|PSZ78505
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TTT', SingleLetterAlphabet())
711
gi|2765629|emb|Z78504.1|PKZ78504
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCGCAA...TAA', SingleLetterAlphabet())
743
gi|2765628|emb|Z78503.1|PCZ78503
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCCTTGTTGAGACCGCCA...TAA', SingleLetterAlphabet())
727
gi|2765627|emb|Z78502.1|PBZ78502
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGACCGCCA...CGC', SingleLetterAlphabet())
757
gi|2765626|emb|Z78501.1|PCZ78501
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...AGA', SingleLetterAlphabet())
770
gi|2765625|emb|Z78500.1|PWZ78500
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGCTCATTGTTGAGACCGCAA...AAG', SingleLetterAlphabet())
767
gi|2765624|emb|Z78499.1|PMZ78499
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAGGGATCATTGTTGAGATCGCAT...ACC', SingleLetterAlphabet())
759
gi|2765623|emb|Z78498.1|PMZ78498
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAAGGTCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
750
gi|2765622|emb|Z78497.1|PDZ78497
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
788
gi|2765621|emb|Z78496.1|PAZ78496
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC', SingleLetterAlphabet())
774
gi|2765620|emb|Z78495.1|PEZ78495
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GTG', SingleLetterAlphabet())
789
gi|2765619|emb|Z78494.1|PNZ78494
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGGTCGCAT...AAG', SingleLetterAlphabet())
688
gi|2765618|emb|Z78493.1|PGZ78493
Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...CCC', SingleLetterAlphabet())
719
gi|2765617|emb|Z78492.1|PBZ78492
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...ATA', SingleLetterAlphabet())
743
gi|2765616|emb|Z78491.1|PCZ78491
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC', SingleLetterAlphabet())
737
gi|2765615|emb|Z78490.1|PFZ78490
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
728
gi|2765614|emb|Z78489.1|PDZ78489
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGC', SingleLetterAlphabet())
740
gi|2765613|emb|Z78488.1|PTZ78488
Seq('CTGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGCAATAATTGATCGA...GCT', SingleLetterAlphabet())
696
gi|2765612|emb|Z78487.1|PHZ78487
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TAA', SingleLetterAlphabet())
732
gi|2765611|emb|Z78486.1|PBZ78486
Seq('CGTCACGAGGTTTCCGTAGGTGAATCTGCGGGAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
731
gi|2765610|emb|Z78485.1|PHZ78485
Seq('CTGAACCTGGTGTCCGAAGGTGAATCTGCGGATGGATCATTGTTGAGATATCAT...GTA', SingleLetterAlphabet())
735
gi|2765609|emb|Z78484.1|PCZ78484
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGGGGAAGGATCATTGTTGAGATCACAT...TTT', SingleLetterAlphabet())
720
gi|2765608|emb|Z78483.1|PVZ78483
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())
740
gi|2765607|emb|Z78482.1|PEZ78482
Seq('TCTACTGCAGTGACCGAGATTTGCCATCGAGCCTCCTGGGAGCTTTCTTGCTGG...GCA', SingleLetterAlphabet())
629
gi|2765606|emb|Z78481.1|PIZ78481
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
572
gi|2765605|emb|Z78480.1|PGZ78480
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
587
gi|2765604|emb|Z78479.1|PPZ78479
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGT', SingleLetterAlphabet())
700
gi|2765603|emb|Z78478.1|PVZ78478
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCAGTGTTGAGATCACAT...GGC', SingleLetterAlphabet())
636
gi|2765602|emb|Z78477.1|PVZ78477
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())
716
gi|2765601|emb|Z78476.1|PGZ78476
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())
592
gi|2765600|emb|Z78475.1|PSZ78475
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', SingleLetterAlphabet())
716
gi|2765599|emb|Z78474.1|PKZ78474
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGT...CTT', SingleLetterAlphabet())
733
gi|2765598|emb|Z78473.1|PSZ78473
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())
626
gi|2765597|emb|Z78472.1|PLZ78472
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
737
gi|2765596|emb|Z78471.1|PDZ78471
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
740
gi|2765595|emb|Z78470.1|PPZ78470
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT', SingleLetterAlphabet())
574
gi|2765594|emb|Z78469.1|PHZ78469
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT', SingleLetterAlphabet())
594
gi|2765593|emb|Z78468.1|PAZ78468
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...GTT', SingleLetterAlphabet())
610
gi|2765592|emb|Z78467.1|PSZ78467
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
730
gi|2765591|emb|Z78466.1|PPZ78466
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())
641
gi|2765590|emb|Z78465.1|PRZ78465
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())
702
gi|2765589|emb|Z78464.1|PGZ78464
Seq('CGTAACAAGGTTTCCGTAGGTGAGCGGAAGGGTCATTGTTGAGATCACATAATA...AGC', SingleLetterAlphabet())
733
gi|2765588|emb|Z78463.1|PGZ78463
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGTTCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
738
gi|2765587|emb|Z78462.1|PSZ78462
Seq('CGTCACGAGGTCTCCGGATGTGACCCTGCGGAAGGATCATTGTTGAGATCACAT...CAT', SingleLetterAlphabet())
736
gi|2765586|emb|Z78461.1|PWZ78461
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TAA', SingleLetterAlphabet())
732
gi|2765585|emb|Z78460.1|PCZ78460
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TTA', SingleLetterAlphabet())
745
gi|2765584|emb|Z78459.1|PDZ78459
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTT', SingleLetterAlphabet())
744
gi|2765583|emb|Z78458.1|PHZ78458
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTG', SingleLetterAlphabet())
738
gi|2765582|emb|Z78457.1|PCZ78457
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GAG', SingleLetterAlphabet())
739
gi|2765581|emb|Z78456.1|PTZ78456
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
740
gi|2765580|emb|Z78455.1|PJZ78455
Seq('CGTAACCAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAGATCACAT...GCA', SingleLetterAlphabet())
745
gi|2765579|emb|Z78454.1|PFZ78454
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AAC', SingleLetterAlphabet())
695
gi|2765578|emb|Z78453.1|PSZ78453
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())
745
gi|2765577|emb|Z78452.1|PBZ78452
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())
743
gi|2765576|emb|Z78451.1|PHZ78451
Seq('CGTAACAAGGTTTCCGTAGGTGTACCTCCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
730
gi|2765575|emb|Z78450.1|PPZ78450
Seq('GGAAGGATCATTGCTGATATCACATAATAATTGATCGAGTTAAGCTGGAGGATC...GAG', SingleLetterAlphabet())
706
gi|2765574|emb|Z78449.1|PMZ78449
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())
744
gi|2765573|emb|Z78448.1|PAZ78448
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())
742
gi|2765572|emb|Z78447.1|PVZ78447
Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
694
gi|2765571|emb|Z78446.1|PAZ78446
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())
712
gi|2765570|emb|Z78445.1|PUZ78445
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGT', SingleLetterAlphabet())
715
gi|2765569|emb|Z78444.1|PAZ78444
Seq('CGTAACAAGGTTTCCGTAGGGTGAACTGCGGAAGGATCATTGTTGAGATCACAT...ATT', SingleLetterAlphabet())
688
gi|2765568|emb|Z78443.1|PLZ78443
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())
784
gi|2765567|emb|Z78442.1|PBZ78442
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAATTGATCGAGT...AGT', SingleLetterAlphabet())
721
gi|2765566|emb|Z78441.1|PSZ78441
Seq('GGAAGGTCATTGCCGATATCACATAATAATTGATCGAGTTAATCTGGAGGATCT...GAG', SingleLetterAlphabet())
703
gi|2765565|emb|Z78440.1|PPZ78440
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTCCGGGAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())
744
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592

Sequence Objects

A large part of the biopython infrastructure deals with tools for handlings sequences. These could be DNA sequences, RNA sequences, amino acid sequences or even more exotic constructs. To tell biopython what type of sequence it's dealing with, you can specify the alphabet explicitly.


In [9]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("ACAGTAGAC", IUPAC.unambiguous_dna)
my_seq


Out[9]:
Seq('ACAGTAGAC', IUPACUnambiguousDNA())

In [10]:
my_seq.alphabet


Out[10]:
IUPACUnambiguousDNA()

If we want to code a protein sequence, we can do that just as easily.


In [11]:
my_prot = Seq("AAAAA", IUPAC.protein) # Alanine pentapeptide
my_prot


Out[11]:
Seq('AAAAA', IUPACProtein())

In [12]:
my_prot.alphabet


Out[12]:
IUPACProtein()

We can take the length of sequences and index into them like strings.


In [13]:
print(len(my_prot))


5

In [14]:
my_prot[0]


Out[14]:
'A'

You can also use slice notation on sequences to get subsequences.


In [15]:
my_prot[0:3]


Out[15]:
Seq('AAA', IUPACProtein())

You can concatenate sequences if they have the same type so this works.


In [16]:
my_prot + my_prot


Out[16]:
Seq('AAAAAAAAAA', IUPACProtein())

But this fails


In [17]:
my_prot + my_seq


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-0b91ff3c1125> in <module>()
----> 1 my_prot + my_seq

/usr/local/lib/python3.6/dist-packages/Bio/Seq.py in __add__(self, other)
    335             if not Alphabet._check_type_compatible([self.alphabet, other.alphabet]):
    336                 raise TypeError(
--> 337                     f"Incompatible alphabets {self.alphabet!r} and {other.alphabet!r}"
    338                 )
    339             # They should be the same sequence type (or one of them is generic)

TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()

Transcription

Transcription is the process by which a DNA sequence is converted into messenger RNA. Remember that this is part of the "central dogma" of biology in which DNA engenders messenger RNA which engenders proteins. Here's a nice representation of this cycle borrowed from a Khan academy lesson.

Note from the image above that DNA has two strands. The top strand is typically called the coding strand, and the bottom the template strand. The template strand is used for the actual transcription process of conversion into messenger RNA, but in bioinformatics, it's more common to work with the coding strand. Let's now see how we can execute a transcription computationally using Biopython.


In [18]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

coding_dna = Seq("ATGATCTCGTAA", IUPAC.unambiguous_dna)
coding_dna


Out[18]:
Seq('ATGATCTCGTAA', IUPACUnambiguousDNA())

In [19]:
template_dna = coding_dna.reverse_complement()
template_dna


Out[19]:
Seq('TTACGAGATCAT', IUPACUnambiguousDNA())

Note that these sequences match those in the image below. You might be confused about why the template_dna sequence is shown reversed. The reason is that by convention, the template strand is read in the reverse direction.

Let's now see how we can transcribe our coding_dna strand into messenger RNA. This will only swap 'T' for 'U' and change the alphabet for our object.


In [20]:
messenger_rna = coding_dna.transcribe()
messenger_rna


Out[20]:
Seq('AUGAUCUCGUAA', IUPACUnambiguousRNA())

We can also perform a "back-transcription" to recover the original coding strand from the messenger RNA.


In [21]:
messenger_rna.back_transcribe()


Out[21]:
Seq('ATGATCTCGTAA', IUPACUnambiguousDNA())

Translation

Translation is the next step in the process, whereby a messenger RNA is transformed into a protein sequence. Here's a beautiful diagram from Wikipedia#/media/File:Ribosome_mRNA_translation_en.svg) that lays out the basics of this process.

Note how 3 nucleotides at a time correspond to one new amino acid added to the growing protein chain. A set of 3 nucleotides which codes for a given amino acid is called a "codon." We can use the translate() method on the messenger rna to perform this transformation in code.

messenger_rna.translate()

The translation can also be performed directly from the coding sequence DNA


In [22]:
coding_dna.translate()


Out[22]:
Seq('MIS*', HasStopCodon(IUPACProtein(), '*'))

Let's now consider a longer genetic sequence that has some more interesting structure for us to look at.


In [23]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
coding_dna.translate()


Out[23]:
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In both of the sequences above, '*' represents the stop codon. A stop codon is a sequence of 3 nucleotides that turns off the protein machinery. In DNA, the stop codons are 'TGA', 'TAA', 'TAG'. Note that this latest sequence has multiple stop codons. It's possible to run the machinery up to the first stop codon and pause too.


In [24]:
coding_dna.translate(to_stop=True)


Out[24]:
Seq('MAIVMGR', IUPACProtein())

We're going to introduce a bit of terminology here. A complete coding sequence CDS is a nucleotide sequence of messenger RNA which is made of a whole number of codons (that is, the length of the sequence is a multiple of 3), starts with a "start codon" and ends with a "stop codon". A start codon is basically the opposite of a stop codon and is mostly commonly the sequence "AUG", but can be different (especially if you're dealing with something like bacterial DNA).

Let's see how we can translate a complete CDS of bacterial messenger RNA.


In [25]:
from Bio.Alphabet import generic_dna

gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
           "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
           "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
           "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
           "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
           generic_dna)
# We specify a "table" to use a different translation table for bacterial proteins
gene.translate(table="Bacterial")


Out[25]:
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [26]:
gene.translate(table="Bacterial", to_stop=True)


Out[26]:
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())

Handling Annotated Sequences

Sometimes it will be useful for us to be able to handle annotated sequences where there's richer annotations, as in GenBank or EMBL files. For these purposes, we'll want to use the SeqRecord class.


In [27]:
from Bio.SeqRecord import SeqRecord
help(SeqRecord)


Help on class SeqRecord in module Bio.SeqRecord:

class SeqRecord(builtins.object)
 |  A SeqRecord object holds a sequence and information about it.
 |  
 |  Main attributes:
 |   - id          - Identifier such as a locus tag (string)
 |   - seq         - The sequence itself (Seq object or similar)
 |  
 |  Additional attributes:
 |   - name        - Sequence name, e.g. gene name (string)
 |   - description - Additional text (string)
 |   - dbxrefs     - List of database cross references (list of strings)
 |   - features    - Any (sub)features defined (list of SeqFeature objects)
 |   - annotations - Further information about the whole sequence (dictionary).
 |     Most entries are strings, or lists of strings.
 |   - letter_annotations - Per letter/symbol annotation (restricted
 |     dictionary). This holds Python sequences (lists, strings
 |     or tuples) whose length matches that of the sequence.
 |     A typical use would be to hold a list of integers
 |     representing sequencing quality scores, or a string
 |     representing the secondary structure.
 |  
 |  You will typically use Bio.SeqIO to read in sequences from files as
 |  SeqRecord objects.  However, you may want to create your own SeqRecord
 |  objects directly (see the __init__ method for further details):
 |  
 |  >>> from Bio.Seq import Seq
 |  >>> from Bio.SeqRecord import SeqRecord
 |  >>> from Bio.Alphabet import IUPAC
 |  >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
 |  ...                         IUPAC.protein),
 |  ...                    id="YP_025292.1", name="HokC",
 |  ...                    description="toxic membrane protein")
 |  >>> print(record)
 |  ID: YP_025292.1
 |  Name: HokC
 |  Description: toxic membrane protein
 |  Number of features: 0
 |  Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
 |  
 |  If you want to save SeqRecord objects to a sequence file, use Bio.SeqIO
 |  for this.  For the special case where you want the SeqRecord turned into
 |  a string in a particular file format there is a format method which uses
 |  Bio.SeqIO internally:
 |  
 |  >>> print(record.format("fasta"))
 |  >YP_025292.1 toxic membrane protein
 |  MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
 |  <BLANKLINE>
 |  
 |  You can also do things like slicing a SeqRecord, checking its length, etc
 |  
 |  >>> len(record)
 |  44
 |  >>> edited = record[:10] + record[11:]
 |  >>> print(edited.seq)
 |  MKQHKAMIVAIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
 |  >>> print(record.seq)
 |  MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
 |  
 |  Methods defined here:
 |  
 |  __add__(self, other)
 |      Add another sequence or string to this sequence.
 |      
 |      The other sequence can be a SeqRecord object, a Seq object (or
 |      similar, e.g. a MutableSeq) or a plain Python string. If you add
 |      a plain string or a Seq (like) object, the new SeqRecord will simply
 |      have this appended to the existing data. However, any per letter
 |      annotation will be lost:
 |      
 |      >>> from Bio import SeqIO
 |      >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
 |      >>> print("%s %s" % (record.id, record.seq))
 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
 |      >>> print(list(record.letter_annotations))
 |      ['solexa_quality']
 |      
 |      >>> new = record + "ACT"
 |      >>> print("%s %s" % (new.id, new.seq))
 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNNACT
 |      >>> print(list(new.letter_annotations))
 |      []
 |      
 |      The new record will attempt to combine the annotation, but for any
 |      ambiguities (e.g. different names) it defaults to omitting that
 |      annotation.
 |      
 |      >>> from Bio import SeqIO
 |      >>> with open("GenBank/pBAD30.gb") as handle:
 |      ...     plasmid = SeqIO.read(handle, "gb")
 |      >>> print("%s %i" % (plasmid.id, len(plasmid)))
 |      pBAD30 4923
 |      
 |      Now let's cut the plasmid into two pieces, and join them back up the
 |      other way round (i.e. shift the starting point on this plasmid, have
 |      a look at the annotated features in the original file to see why this
 |      particular split point might make sense):
 |      
 |      >>> left = plasmid[:3765]
 |      >>> right = plasmid[3765:]
 |      >>> new = right + left
 |      >>> print("%s %i" % (new.id, len(new)))
 |      pBAD30 4923
 |      >>> str(new.seq) == str(right.seq + left.seq)
 |      True
 |      >>> len(new.features) == len(left.features) + len(right.features)
 |      True
 |      
 |      When we add the left and right SeqRecord objects, their annotation
 |      is all consistent, so it is all conserved in the new SeqRecord:
 |      
 |      >>> new.id == left.id == right.id == plasmid.id
 |      True
 |      >>> new.name == left.name == right.name == plasmid.name
 |      True
 |      >>> new.description == plasmid.description
 |      True
 |      >>> new.annotations == left.annotations == right.annotations
 |      True
 |      >>> new.letter_annotations == plasmid.letter_annotations
 |      True
 |      >>> new.dbxrefs == left.dbxrefs == right.dbxrefs
 |      True
 |      
 |      However, we should point out that when we sliced the SeqRecord,
 |      any annotations dictionary or dbxrefs list entries were lost.
 |      You can explicitly copy them like this:
 |      
 |      >>> new.annotations = plasmid.annotations.copy()
 |      >>> new.dbxrefs = plasmid.dbxrefs[:]
 |  
 |  __bool__(self)
 |      Boolean value of an instance of this class (True).
 |      
 |      This behaviour is for backwards compatibility, since until the
 |      __len__ method was added, a SeqRecord always evaluated as True.
 |      
 |      Note that in comparison, a Seq object will evaluate to False if it
 |      has a zero length sequence.
 |      
 |      WARNING: The SeqRecord may in future evaluate to False when its
 |      sequence is of zero length (in order to better match the Seq
 |      object behaviour)!
 |  
 |  __contains__(self, char)
 |      Implement the 'in' keyword, searches the sequence.
 |      
 |      e.g.
 |      
 |      >>> from Bio import SeqIO
 |      >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
 |      >>> "GAATTC" in record
 |      False
 |      >>> "AAA" in record
 |      True
 |      
 |      This essentially acts as a proxy for using "in" on the sequence:
 |      
 |      >>> "GAATTC" in record.seq
 |      False
 |      >>> "AAA" in record.seq
 |      True
 |      
 |      Note that you can also use Seq objects as the query,
 |      
 |      >>> from Bio.Seq import Seq
 |      >>> from Bio.Alphabet import generic_dna
 |      >>> Seq("AAA") in record
 |      True
 |      >>> Seq("AAA", generic_dna) in record
 |      True
 |      
 |      See also the Seq object's __contains__ method.
 |  
 |  __eq__(self, other)
 |      Define the equal-to operand (not implemented).
 |  
 |  __format__(self, format_spec)
 |      Return the record as a string in the specified file format.
 |      
 |      This method supports the Python format() function and f-strings.
 |      The format_spec should be a lower case string supported by
 |      Bio.SeqIO as a text output file format. Requesting a binary file
 |      format raises a ValueError. e.g.
 |      
 |      >>> from Bio.Seq import Seq
 |      >>> from Bio.SeqRecord import SeqRecord
 |      >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
 |      ...                    id="YP_025292.1", name="HokC",
 |      ...                    description="toxic membrane protein")
 |      ...
 |      >>> format(record, "fasta")
 |      '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'
 |      >>> print(f"Here is {record.id} in FASTA format:\n{record:fasta}")
 |      Here is YP_025292.1 in FASTA format:
 |      >YP_025292.1 toxic membrane protein
 |      MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
 |      <BLANKLINE>
 |      
 |      See also the SeqRecord's format() method.
 |  
 |  __ge__(self, other)
 |      Define the greater-than-or-equal-to operand (not implemented).
 |  
 |  __getitem__(self, index)
 |      Return a sub-sequence or an individual letter.
 |      
 |      Slicing, e.g. my_record[5:10], returns a new SeqRecord for
 |      that sub-sequence with some annotation preserved as follows:
 |      
 |      * The name, id and description are kept as-is.
 |      * Any per-letter-annotations are sliced to match the requested
 |        sub-sequence.
 |      * Unless a stride is used, all those features which fall fully
 |        within the subsequence are included (with their locations
 |        adjusted accordingly). If you want to preserve any truncated
 |        features (e.g. GenBank/EMBL source features), you must
 |        explicitly add them to the new SeqRecord yourself.
 |      * The annotations dictionary and the dbxrefs list are not used
 |        for the new SeqRecord, as in general they may not apply to the
 |        subsequence. If you want to preserve them, you must explicitly
 |        copy them to the new SeqRecord yourself.
 |      
 |      Using an integer index, e.g. my_record[5] is shorthand for
 |      extracting that letter from the sequence, my_record.seq[5].
 |      
 |      For example, consider this short protein and its secondary
 |      structure as encoded by the PDB (e.g. H for alpha helices),
 |      plus a simple feature for its histidine self phosphorylation
 |      site:
 |      
 |      >>> from Bio.Seq import Seq
 |      >>> from Bio.SeqRecord import SeqRecord
 |      >>> from Bio.SeqFeature import SeqFeature, FeatureLocation
 |      >>> from Bio.Alphabet import IUPAC
 |      >>> rec = SeqRecord(Seq("MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLAT"
 |      ...                     "EMMSEQDGYLAESINKDIEECNAIIEQFIDYLR",
 |      ...                     IUPAC.protein),
 |      ...                 id="1JOY", name="EnvZ",
 |      ...                 description="Homodimeric domain of EnvZ from E. coli")
 |      >>> rec.letter_annotations["secondary_structure"] = "  S  SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT  "
 |      >>> rec.features.append(SeqFeature(FeatureLocation(20, 21),
 |      ...                     type = "Site"))
 |      
 |      Now let's have a quick look at the full record,
 |      
 |      >>> print(rec)
 |      ID: 1JOY
 |      Name: EnvZ
 |      Description: Homodimeric domain of EnvZ from E. coli
 |      Number of features: 1
 |      Per letter annotation for: secondary_structure
 |      Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR', IUPACProtein())
 |      >>> rec.letter_annotations["secondary_structure"]
 |      '  S  SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT  '
 |      >>> print(rec.features[0].location)
 |      [20:21]
 |      
 |      Now let's take a sub sequence, here chosen as the first (fractured)
 |      alpha helix which includes the histidine phosphorylation site:
 |      
 |      >>> sub = rec[11:41]
 |      >>> print(sub)
 |      ID: 1JOY
 |      Name: EnvZ
 |      Description: Homodimeric domain of EnvZ from E. coli
 |      Number of features: 1
 |      Per letter annotation for: secondary_structure
 |      Seq('RTLLMAGVSHDLRTPLTRIRLATEMMSEQD', IUPACProtein())
 |      >>> sub.letter_annotations["secondary_structure"]
 |      'HHHHHTTTHHHHHHHHHHHHHHHHHHHHHH'
 |      >>> print(sub.features[0].location)
 |      [9:10]
 |      
 |      You can also of course omit the start or end values, for
 |      example to get the first ten letters only:
 |      
 |      >>> print(rec[:10])
 |      ID: 1JOY
 |      Name: EnvZ
 |      Description: Homodimeric domain of EnvZ from E. coli
 |      Number of features: 0
 |      Per letter annotation for: secondary_structure
 |      Seq('MAAGVKQLAD', IUPACProtein())
 |      
 |      Or for the last ten letters:
 |      
 |      >>> print(rec[-10:])
 |      ID: 1JOY
 |      Name: EnvZ
 |      Description: Homodimeric domain of EnvZ from E. coli
 |      Number of features: 0
 |      Per letter annotation for: secondary_structure
 |      Seq('IIEQFIDYLR', IUPACProtein())
 |      
 |      If you omit both, then you get a copy of the original record (although
 |      lacking the annotations and dbxrefs):
 |      
 |      >>> print(rec[:])
 |      ID: 1JOY
 |      Name: EnvZ
 |      Description: Homodimeric domain of EnvZ from E. coli
 |      Number of features: 1
 |      Per letter annotation for: secondary_structure
 |      Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR', IUPACProtein())
 |      
 |      Finally, indexing with a simple integer is shorthand for pulling out
 |      that letter from the sequence directly:
 |      
 |      >>> rec[5]
 |      'K'
 |      >>> rec.seq[5]
 |      'K'
 |  
 |  __gt__(self, other)
 |      Define the greater-than operand (not implemented).
 |  
 |  __init__(self, seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None)
 |      Create a SeqRecord.
 |      
 |      Arguments:
 |       - seq         - Sequence, required (Seq, MutableSeq or UnknownSeq)
 |       - id          - Sequence identifier, recommended (string)
 |       - name        - Sequence name, optional (string)
 |       - description - Sequence description, optional (string)
 |       - dbxrefs     - Database cross references, optional (list of strings)
 |       - features    - Any (sub)features, optional (list of SeqFeature objects)
 |       - annotations - Dictionary of annotations for the whole sequence
 |       - letter_annotations - Dictionary of per-letter-annotations, values
 |         should be strings, list or tuples of the same length as the full
 |         sequence.
 |      
 |      You will typically use Bio.SeqIO to read in sequences from files as
 |      SeqRecord objects.  However, you may want to create your own SeqRecord
 |      objects directly.
 |      
 |      Note that while an id is optional, we strongly recommend you supply a
 |      unique id string for each record.  This is especially important
 |      if you wish to write your sequences to a file.
 |      
 |      If you don't have the actual sequence, but you do know its length,
 |      then using the UnknownSeq object from Bio.Seq is appropriate.
 |      
 |      You can create a 'blank' SeqRecord object, and then populate the
 |      attributes later.
 |  
 |  __iter__(self)
 |      Iterate over the letters in the sequence.
 |      
 |      For example, using Bio.SeqIO to read in a protein FASTA file:
 |      
 |      >>> from Bio import SeqIO
 |      >>> record = SeqIO.read("Fasta/loveliesbleeding.pro", "fasta")
 |      >>> for amino in record:
 |      ...     print(amino)
 |      ...     if amino == "L": break
 |      X
 |      A
 |      G
 |      L
 |      >>> print(record.seq[3])
 |      L
 |      
 |      This is just a shortcut for iterating over the sequence directly:
 |      
 |      >>> for amino in record.seq:
 |      ...     print(amino)
 |      ...     if amino == "L": break
 |      X
 |      A
 |      G
 |      L
 |      >>> print(record.seq[3])
 |      L
 |      
 |      Note that this does not facilitate iteration together with any
 |      per-letter-annotation.  However, you can achieve that using the
 |      python zip function on the record (or its sequence) and the relevant
 |      per-letter-annotation:
 |      
 |      >>> from Bio import SeqIO
 |      >>> rec = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
 |      >>> print("%s %s" % (rec.id, rec.seq))
 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
 |      >>> print(list(rec.letter_annotations))
 |      ['solexa_quality']
 |      >>> for nuc, qual in zip(rec, rec.letter_annotations["solexa_quality"]):
 |      ...     if qual > 35:
 |      ...         print("%s %i" % (nuc, qual))
 |      A 40
 |      C 39
 |      G 38
 |      T 37
 |      A 36
 |      
 |      You may agree that using zip(rec.seq, ...) is more explicit than using
 |      zip(rec, ...) as shown above.
 |  
 |  __le___(self, other)
 |      Define the less-than-or-equal-to operand (not implemented).
 |  
 |  __len__(self)
 |      Return the length of the sequence.
 |      
 |      For example, using Bio.SeqIO to read in a FASTA nucleotide file:
 |      
 |      >>> from Bio import SeqIO
 |      >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
 |      >>> len(record)
 |      309
 |      >>> len(record.seq)
 |      309
 |  
 |  __lt__(self, other)
 |      Define the less-than operand (not implemented).
 |  
 |  __ne__(self, other)
 |      Define the not-equal-to operand (not implemented).
 |  
 |  __radd__(self, other)
 |      Add another sequence or string to this sequence (from the left).
 |      
 |      This method handles adding a Seq object (or similar, e.g. MutableSeq)
 |      or a plain Python string (on the left) to a SeqRecord (on the right).
 |      See the __add__ method for more details, but for example:
 |      
 |      >>> from Bio import SeqIO
 |      >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
 |      >>> print("%s %s" % (record.id, record.seq))
 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
 |      >>> print(list(record.letter_annotations))
 |      ['solexa_quality']
 |      
 |      >>> new = "ACT" + record
 |      >>> print("%s %s" % (new.id, new.seq))
 |      slxa_0001_1_0001_01 ACTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
 |      >>> print(list(new.letter_annotations))
 |      []
 |  
 |  __repr__(self)
 |      Return a concise summary of the record for debugging (string).
 |      
 |      The python built in function repr works by calling the object's ___repr__
 |      method.  e.g.
 |      
 |      >>> from Bio.Seq import Seq
 |      >>> from Bio.SeqRecord import SeqRecord
 |      >>> from Bio.Alphabet import generic_protein
 |      >>> rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT"
 |      ...                    +"GEMKEQTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQ"
 |      ...                    +"SGQDRYTTEVVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQ"
 |      ...                    +"PQQPQGGNQFSGGAQSRPQQSAPAAPSNEPPMDFDDDIPF",
 |      ...                    generic_protein),
 |      ...                 id="NP_418483.1", name="b4059",
 |      ...                 description="ssDNA-binding protein",
 |      ...                 dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])
 |      >>> print(repr(rec))
 |      SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])
 |      
 |      At the python prompt you can also use this shorthand:
 |      
 |      >>> rec
 |      SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])
 |      
 |      Note that long sequences are shown truncated. Also note that any
 |      annotations, letter_annotations and features are not shown (as they
 |      would lead to a very long string).
 |  
 |  __str__(self)
 |      Return a human readable summary of the record and its annotation (string).
 |      
 |      The python built in function str works by calling the object's ___str__
 |      method.  e.g.
 |      
 |      >>> from Bio.Seq import Seq
 |      >>> from Bio.SeqRecord import SeqRecord
 |      >>> from Bio.Alphabet import IUPAC
 |      >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
 |      ...                         IUPAC.protein),
 |      ...                    id="YP_025292.1", name="HokC",
 |      ...                    description="toxic membrane protein, small")
 |      >>> print(str(record))
 |      ID: YP_025292.1
 |      Name: HokC
 |      Description: toxic membrane protein, small
 |      Number of features: 0
 |      Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
 |      
 |      In this example you don't actually need to call str explicity, as the
 |      print command does this automatically:
 |      
 |      >>> print(record)
 |      ID: YP_025292.1
 |      Name: HokC
 |      Description: toxic membrane protein, small
 |      Number of features: 0
 |      Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
 |      
 |      Note that long sequences are shown truncated.
 |  
 |  format(self, format)
 |      Return the record as a string in the specified file format.
 |      
 |      The format should be a lower case string supported as an output
 |      format by Bio.SeqIO, which is used to turn the SeqRecord into a
 |      string.  e.g.
 |      
 |      >>> from Bio.Seq import Seq
 |      >>> from Bio.SeqRecord import SeqRecord
 |      >>> from Bio.Alphabet import IUPAC
 |      >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
 |      ...                         IUPAC.protein),
 |      ...                    id="YP_025292.1", name="HokC",
 |      ...                    description="toxic membrane protein")
 |      >>> record.format("fasta")
 |      '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'
 |      >>> print(record.format("fasta"))
 |      >YP_025292.1 toxic membrane protein
 |      MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
 |      <BLANKLINE>
 |      
 |      The Python print function automatically appends a new line, meaning
 |      in this example a blank line is shown.  If you look at the string
 |      representation you can see there is a trailing new line (shown as
 |      slash n) which is important when writing to a file or if
 |      concatenating multiple sequence strings together.
 |      
 |      Note that this method will NOT work on every possible file format
 |      supported by Bio.SeqIO (e.g. some are for multiple sequences only,
 |      and binary formats are not supported).
 |  
 |  lower(self)
 |      Return a copy of the record with a lower case sequence.
 |      
 |      All the annotation is preserved unchanged. e.g.
 |      
 |      >>> from Bio import SeqIO
 |      >>> record = SeqIO.read("Fasta/aster.pro", "fasta")
 |      >>> print(record.format("fasta"))
 |      >gi|3298468|dbj|BAA31520.1| SAMIPF
 |      GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVG
 |      VTNALVFEIVMTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI
 |      <BLANKLINE>
 |      >>> print(record.lower().format("fasta"))
 |      >gi|3298468|dbj|BAA31520.1| SAMIPF
 |      gghvnpavtfgafvggnitllrgivyiiaqllgstvaclllkfvtndmavgvfslsagvg
 |      vtnalvfeivmtfglvytvyataidpkkgslgtiapiaigfivgani
 |      <BLANKLINE>
 |      
 |      To take a more annotation rich example,
 |      
 |      >>> from Bio import SeqIO
 |      >>> old = SeqIO.read("EMBL/TRBG361.embl", "embl")
 |      >>> len(old.features)
 |      3
 |      >>> new = old.lower()
 |      >>> len(old.features) == len(new.features)
 |      True
 |      >>> old.annotations["organism"] == new.annotations["organism"]
 |      True
 |      >>> old.dbxrefs == new.dbxrefs
 |      True
 |  
 |  reverse_complement(self, id=False, name=False, description=False, features=True, annotations=False, letter_annotations=True, dbxrefs=False)
 |      Return new SeqRecord with reverse complement sequence.
 |      
 |      By default the new record does NOT preserve the sequence identifier,
 |      name, description, general annotation or database cross-references -
 |      these are unlikely to apply to the reversed sequence.
 |      
 |      You can specify the returned record's id, name and description as
 |      strings, or True to keep that of the parent, or False for a default.
 |      
 |      You can specify the returned record's features with a list of
 |      SeqFeature objects, or True to keep that of the parent, or False to
 |      omit them. The default is to keep the original features (with the
 |      strand and locations adjusted).
 |      
 |      You can also specify both the returned record's annotations and
 |      letter_annotations as dictionaries, True to keep that of the parent,
 |      or False to omit them. The default is to keep the original
 |      annotations (with the letter annotations reversed).
 |      
 |      To show what happens to the pre-letter annotations, consider an
 |      example Solexa variant FASTQ file with a single entry, which we'll
 |      read in as a SeqRecord:
 |      
 |      >>> from Bio import SeqIO
 |      >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
 |      >>> print("%s %s" % (record.id, record.seq))
 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
 |      >>> print(list(record.letter_annotations))
 |      ['solexa_quality']
 |      >>> print(record.letter_annotations["solexa_quality"])
 |      [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
 |      
 |      Now take the reverse complement, here we explicitly give a new
 |      identifier (the old identifier with a suffix):
 |      
 |      >>> rc_record = record.reverse_complement(id=record.id + "_rc")
 |      >>> print("%s %s" % (rc_record.id, rc_record.seq))
 |      slxa_0001_1_0001_01_rc NNNNNNACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
 |      
 |      Notice that the per-letter-annotations have also been reversed,
 |      although this may not be appropriate for all cases.
 |      
 |      >>> print(rc_record.letter_annotations["solexa_quality"])
 |      [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
 |      
 |      Now for the features, we need a different example. Parsing a GenBank
 |      file is probably the easiest way to get an nice example with features
 |      in it...
 |      
 |      >>> from Bio import SeqIO
 |      >>> with open("GenBank/pBAD30.gb") as handle:
 |      ...     plasmid = SeqIO.read(handle, "gb")
 |      >>> print("%s %i" % (plasmid.id, len(plasmid)))
 |      pBAD30 4923
 |      >>> plasmid.seq
 |      Seq('GCTAGCGGAGTGTATACTGGCTTACTATGTTGGCACTGATGAGGGTGTCAGTGA...ATG', IUPACAmbiguousDNA())
 |      >>> len(plasmid.features)
 |      13
 |      
 |      Now, let's take the reverse complement of this whole plasmid:
 |      
 |      >>> rc_plasmid = plasmid.reverse_complement(id=plasmid.id+"_rc")
 |      >>> print("%s %i" % (rc_plasmid.id, len(rc_plasmid)))
 |      pBAD30_rc 4923
 |      >>> rc_plasmid.seq
 |      Seq('CATGGGCAAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCA...AGC', IUPACAmbiguousDNA())
 |      >>> len(rc_plasmid.features)
 |      13
 |      
 |      Let's compare the first CDS feature - it has gone from being the
 |      second feature (index 1) to the second last feature (index -2), its
 |      strand has changed, and the location switched round.
 |      
 |      >>> print(plasmid.features[1])
 |      type: CDS
 |      location: [1081:1960](-)
 |      qualifiers:
 |          Key: label, Value: ['araC']
 |          Key: note, Value: ['araC regulator of the arabinose BAD promoter']
 |          Key: vntifkey, Value: ['4']
 |      <BLANKLINE>
 |      >>> print(rc_plasmid.features[-2])
 |      type: CDS
 |      location: [2963:3842](+)
 |      qualifiers:
 |          Key: label, Value: ['araC']
 |          Key: note, Value: ['araC regulator of the arabinose BAD promoter']
 |          Key: vntifkey, Value: ['4']
 |      <BLANKLINE>
 |      
 |      You can check this new location, based on the length of the plasmid:
 |      
 |      >>> len(plasmid) - 1081
 |      3842
 |      >>> len(plasmid) - 1960
 |      2963
 |      
 |      Note that if the SeqFeature annotation includes any strand specific
 |      information (e.g. base changes for a SNP), this information is not
 |      amended, and would need correction after the reverse complement.
 |      
 |      Note trying to reverse complement a protein SeqRecord raises an
 |      exception:
 |      
 |      >>> from Bio.SeqRecord import SeqRecord
 |      >>> from Bio.Seq import Seq
 |      >>> from Bio.Alphabet import IUPAC
 |      >>> protein_rec = SeqRecord(Seq("MAIVMGR", IUPAC.protein), id="Test")
 |      >>> protein_rec.reverse_complement()
 |      Traceback (most recent call last):
 |         ...
 |      ValueError: Proteins do not have complements!
 |      
 |      Also note you can reverse complement a SeqRecord using a MutableSeq:
 |      
 |      >>> from Bio.SeqRecord import SeqRecord
 |      >>> from Bio.Seq import MutableSeq
 |      >>> from Bio.Alphabet import generic_dna
 |      >>> rec = SeqRecord(MutableSeq("ACGT", generic_dna), id="Test")
 |      >>> rec.seq[0] = "T"
 |      >>> print("%s %s" % (rec.id, rec.seq))
 |      Test TCGT
 |      >>> rc = rec.reverse_complement(id=True)
 |      >>> print("%s %s" % (rc.id, rc.seq))
 |      Test ACGA
 |  
 |  translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None, id=False, name=False, description=False, features=False, annotations=False, letter_annotations=False, dbxrefs=False)
 |      Return new SeqRecord with translated sequence.
 |      
 |      This calls the record's .seq.translate() method (which describes
 |      the translation related arguments, like table for the genetic code),
 |      
 |      By default the new record does NOT preserve the sequence identifier,
 |      name, description, general annotation or database cross-references -
 |      these are unlikely to apply to the translated sequence.
 |      
 |      You can specify the returned record's id, name and description as
 |      strings, or True to keep that of the parent, or False for a default.
 |      
 |      You can specify the returned record's features with a list of
 |      SeqFeature objects, or False (default) to omit them.
 |      
 |      You can also specify both the returned record's annotations and
 |      letter_annotations as dictionaries, True to keep that of the parent
 |      (annotations only), or False (default) to omit them.
 |      
 |      e.g. Loading a FASTA gene and translating it,
 |      
 |      >>> from Bio import SeqIO
 |      >>> gene_record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
 |      >>> print(gene_record.format("fasta"))
 |      >gi|3176602|gb|U78617.1|LOU78617 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds
 |      CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCA
 |      AAACATGTGAAGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCG
 |      ACCTTAAGAGCTCCACATAGTTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCT
 |      TCATTGGTTATGGCAGTGGTCGTCAATGACAGCGATGAAGATGGAGATAGCCGTGACGCA
 |      GTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAGTTTGTCATAACACTACTCCG
 |      AGGTTTGTT
 |      <BLANKLINE>
 |      
 |      And now translating the record, specifying the new ID and description:
 |      
 |      >>> protein_record = gene_record.translate(table=11,
 |      ...                                        id="phya",
 |      ...                                        description="translation")
 |      >>> print(protein_record.format("fasta"))
 |      >phya translation
 |      QAARFLFMKNKVRMIVDCHAKHVKVLQDEKLPFDLTLCGSTLRAPHSCHLQYMANMDSIA
 |      SLVMAVVVNDSDEDGDSRDAVLPQKKKRLWGLVVCHNTTPRFV
 |      <BLANKLINE>
 |  
 |  upper(self)
 |      Return a copy of the record with an upper case sequence.
 |      
 |      All the annotation is preserved unchanged. e.g.
 |      
 |      >>> from Bio.Alphabet import generic_dna
 |      >>> from Bio.Seq import Seq
 |      >>> from Bio.SeqRecord import SeqRecord
 |      >>> record = SeqRecord(Seq("acgtACGT", generic_dna), id="Test",
 |      ...                    description = "Made up for this example")
 |      >>> record.letter_annotations["phred_quality"] = [1, 2, 3, 4, 5, 6, 7, 8]
 |      >>> print(record.upper().format("fastq"))
 |      @Test Made up for this example
 |      ACGTACGT
 |      +
 |      "#$%&'()
 |      <BLANKLINE>
 |      
 |      Naturally, there is a matching lower method:
 |      
 |      >>> print(record.lower().format("fastq"))
 |      @Test Made up for this example
 |      acgtacgt
 |      +
 |      "#$%&'()
 |      <BLANKLINE>
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  letter_annotations
 |      Dictionary of per-letter-annotation for the sequence.
 |      
 |      For example, this can hold quality scores used in FASTQ or QUAL files.
 |      Consider this example using Bio.SeqIO to read in an example Solexa
 |      variant FASTQ file as a SeqRecord:
 |      
 |      >>> from Bio import SeqIO
 |      >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
 |      >>> print("%s %s" % (record.id, record.seq))
 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
 |      >>> print(list(record.letter_annotations))
 |      ['solexa_quality']
 |      >>> print(record.letter_annotations["solexa_quality"])
 |      [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
 |      
 |      The letter_annotations get sliced automatically if you slice the
 |      parent SeqRecord, for example taking the last ten bases:
 |      
 |      >>> sub_record = record[-10:]
 |      >>> print("%s %s" % (sub_record.id, sub_record.seq))
 |      slxa_0001_1_0001_01 ACGTNNNNNN
 |      >>> print(sub_record.letter_annotations["solexa_quality"])
 |      [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
 |      
 |      Any python sequence (i.e. list, tuple or string) can be recorded in
 |      the SeqRecord's letter_annotations dictionary as long as the length
 |      matches that of the SeqRecord's sequence.  e.g.
 |      
 |      >>> len(sub_record.letter_annotations)
 |      1
 |      >>> sub_record.letter_annotations["dummy"] = "abcdefghij"
 |      >>> len(sub_record.letter_annotations)
 |      2
 |      
 |      You can delete entries from the letter_annotations dictionary as usual:
 |      
 |      >>> del sub_record.letter_annotations["solexa_quality"]
 |      >>> sub_record.letter_annotations
 |      {'dummy': 'abcdefghij'}
 |      
 |      You can completely clear the dictionary easily as follows:
 |      
 |      >>> sub_record.letter_annotations = {}
 |      >>> sub_record.letter_annotations
 |      {}
 |      
 |      Note that if replacing the record's sequence with a sequence of a
 |      different length you must first clear the letter_annotations dict.
 |  
 |  seq
 |      The sequence itself, as a Seq or MutableSeq object.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None

Let's write a bit of code involving SeqRecord and see how it comes out looking.


In [0]:
from Bio.SeqRecord import SeqRecord

simple_seq = Seq("GATC")
simple_seq_r = SeqRecord(simple_seq)

In [29]:
simple_seq_r.id = "AC12345"
simple_seq_r.description = "Made up sequence"
print(simple_seq_r.id)
print(simple_seq_r.description)


AC12345
Made up sequence

Let's now see how we can use SeqRecord to parse a large fasta file. We'll pull down a file hosted on the biopython site.


In [30]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna


--2020-06-12 02:48:39--  https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9853 (9.6K) [text/plain]
Saving to: ‘NC_005816.fna’

NC_005816.fna       100%[===================>]   9.62K  --.-KB/s    in 0s      

2020-06-12 02:48:39 (63.4 MB/s) - ‘NC_005816.fna’ saved [9853/9853]


In [31]:
from Bio import SeqIO

record = SeqIO.read("NC_005816.fna", "fasta")
record


Out[31]:
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])

Note how there's a number of annotations attached to the SeqRecord object!

Let's take a closer look.


In [32]:
record.id


Out[32]:
'gi|45478711|ref|NC_005816.1|'

In [33]:
record.name


Out[33]:
'gi|45478711|ref|NC_005816.1|'

In [34]:
record.description


Out[34]:
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

Let's now look at the same sequence, but downloaded from GenBank. We'll download the hosted file from the biopython tutorial website as before.


In [35]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb


--2020-06-12 02:48:55--  https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31838 (31K) [text/plain]
Saving to: ‘NC_005816.gb’

NC_005816.gb        100%[===================>]  31.09K  --.-KB/s    in 0.01s   

2020-06-12 02:48:56 (2.17 MB/s) - ‘NC_005816.gb’ saved [31838/31838]


In [36]:
from Bio import SeqIO

record = SeqIO.read("NC_005816.gb", "genbank")
record


Out[36]:
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])

SeqIO Objects

TODO(rbharath): Continue filling this up in future PRs.


In [0]: