Processing DNA Sequences in Batch

Convert multiple sequences in Genbank format into FASTA format.


In [1]:
!pip install biopython


Requirement already satisfied: biopython in /usr/local/lib/python3.6/dist-packages (1.73)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from biopython) (1.14.6)

In [2]:
import requests

url = 'https://raw.githubusercontent.com/Serulab/Py4Bio/master/samples/casein.gb'
input_file_name = url.split('/')[-1]
response = requests.get(url)
open(input_file_name, 'wb').write(response.content)


Out[2]:
31550

In [0]:
from Bio import SeqIO

with open(input_file_name, 'r') as input_handle:
    sequences = SeqIO.parse(input_handle, 'genbank')
    with open('casein.fasta', 'w') as output_handle:
        SeqIO.write(sequences, output_handle, 'fasta')

In [0]:
with open("casein.fasta") as output_handle:
    print(output_handle.read())

Split a FASTA file with multiple sequences into several files, one file per sequence.


In [0]:
# Using casein.fasta file generated in previous code
with open("casein.fasta") as input_handle:
    sequences = SeqIO.parse(input_handle, 'fasta')
    for sequence in sequences:
        SeqIO.write(sequence, sequence.name, "fasta")

In [16]:
!ls


casein.fasta  EU221557.1  EU221567.1  EU221570.1  EU221573.1
casein.gb     EU221562.1  EU221568.1  EU221571.1  FJ429671.1
EU221552.1    EU221565.1  EU221569.1  EU221572.1  sample_data

In [17]:
!cat FJ429671.1


>FJ429671.1 Bison bison alpha s1 casein (CSN1S1) gene, exon 12 and partial cds
TTTCTCATTATTTACTCCTGGGAAAGGGATACTATGATAGACGGTATCCACGAAATTGAC
AATATTTTTTCTTTTGAATAGGAACAGCTTCTCAGACTGAAAAAATACAAAGTACCCCAG
CTGGTAATATTTTATTATAATAATACAAAATTAAGTCTACAGAATTAAAATAATTAAATG
AAATTTACTTTGACTAAATTCTACATCAAATCATGTTAGAGCCTCCCAAATGATTCGTTA
TACTCTGGAA

Convert multiple sequences into a Python dictionary.


In [19]:
record_dict = SeqIO.to_dict(SeqIO.parse("casein.fasta", "fasta"))
print(record_dict["FJ429671.1"])  # use any record ID


ID: FJ429671.1
Name: FJ429671.1
Description: FJ429671.1 Bison bison alpha s1 casein (CSN1S1) gene, exon 12 and partial cds
Number of features: 0
Seq('TTTCTCATTATTTACTCCTGGGAAAGGGATACTATGATAGACGGTATCCACGAA...GAA', SingleLetterAlphabet())

Extracting sequences applying a filtering criteria like having a minimum length.


In [26]:
# Using casein.fasta file generated in the first code
MIN_SEQ_LENGTH = 400
sequences = SeqIO.parse("casein.fasta", 'fasta')
filtered = []
for total, sequence in enumerate(sequences):
    if len(sequence.seq) > MIN_SEQ_LENGTH:
      filtered.append(sequence)
      
print("total sequences: {}".format(total))
print("after filter sequences: {}".format(len(filtered)))


total sequences: 11
after filter sequences: 4