Processing DNA Sequences in Batch

Convert multiple sequences in Genbank format into FASTA format.



In [1]:

    
!pip install biopython









    



Requirement already satisfied: biopython in /usr/local/lib/python3.6/dist-packages (1.73)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from biopython) (1.14.6)



In [2]:

    
import requests

url = 'https://raw.githubusercontent.com/Serulab/Py4Bio/master/samples/casein.gb'
input_file_name = url.split('/')[-1]
response = requests.get(url)
open(input_file_name, 'wb').write(response.content)









    Out[2]:





31550



In [0]:

    
from Bio import SeqIO

with open(input_file_name, 'r') as input_handle:
    sequences = SeqIO.parse(input_handle, 'genbank')
    with open('casein.fasta', 'w') as output_handle:
        SeqIO.write(sequences, output_handle, 'fasta')



In [0]:

    
with open("casein.fasta") as output_handle:
    print(output_handle.read())

Split a FASTA file with multiple sequences into several files, one file per sequence.



In [0]:

    
# Using casein.fasta file generated in previous code
with open("casein.fasta") as input_handle:
    sequences = SeqIO.parse(input_handle, 'fasta')
    for sequence in sequences:
        SeqIO.write(sequence, sequence.name, "fasta")



In [16]:

    
!ls









    



casein.fasta  EU221557.1  EU221567.1  EU221570.1  EU221573.1
casein.gb     EU221562.1  EU221568.1  EU221571.1  FJ429671.1
EU221552.1    EU221565.1  EU221569.1  EU221572.1  sample_data



In [17]:

    
!cat FJ429671.1









    



>FJ429671.1 Bison bison alpha s1 casein (CSN1S1) gene, exon 12 and partial cds
TTTCTCATTATTTACTCCTGGGAAAGGGATACTATGATAGACGGTATCCACGAAATTGAC
AATATTTTTTCTTTTGAATAGGAACAGCTTCTCAGACTGAAAAAATACAAAGTACCCCAG
CTGGTAATATTTTATTATAATAATACAAAATTAAGTCTACAGAATTAAAATAATTAAATG
AAATTTACTTTGACTAAATTCTACATCAAATCATGTTAGAGCCTCCCAAATGATTCGTTA
TACTCTGGAA

Convert multiple sequences into a Python dictionary.



In [19]:

    
record_dict = SeqIO.to_dict(SeqIO.parse("casein.fasta", "fasta"))
print(record_dict["FJ429671.1"])  # use any record ID









    



ID: FJ429671.1
Name: FJ429671.1
Description: FJ429671.1 Bison bison alpha s1 casein (CSN1S1) gene, exon 12 and partial cds
Number of features: 0
Seq('TTTCTCATTATTTACTCCTGGGAAAGGGATACTATGATAGACGGTATCCACGAA...GAA', SingleLetterAlphabet())

Extracting sequences applying a filtering criteria like having a minimum length.



In [26]:

    
# Using casein.fasta file generated in the first code
MIN_SEQ_LENGTH = 400
sequences = SeqIO.parse("casein.fasta", 'fasta')
filtered = []
for total, sequence in enumerate(sequences):
    if len(sequence.seq) > MIN_SEQ_LENGTH:
      filtered.append(sequence)
      
print("total sequences: {}".format(total))
print("after filter sequences: {}".format(len(filtered)))









    



total sequences: 11
after filter sequences: 4