Python for Bioinformatics

This Jupyter notebook is intented to be used alongside the book Python for Bioinformatics

Chapter 20: Inferring Splicing Sites

Note: These scripts requires external files to be accesible. In order to do so, the following commands will download these files from Github and from Amazon.


In [18]:
!curl https://s3.amazonaws.com/py4bio/TAIR.tar.bz2 -o TAIR.tar.bz2
!mkdir samples
!tar xvfj TAIR.tar.bz2 -C samples
!curl https://s3.amazonaws.com/py4bio/ncbi-blast-2.6.0.tar.bz2 -o ncbi-blast-2.6.0.tar.bz2
!tar xvfj ncbi-blast-2.6.0.tar.bz2
!curl https://s3.amazonaws.com/py4bio/clustalw2 -o clustalw2
!chmod a+x clustalw2
!ls


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.8M  100 39.8M    0     0  24.5M      0  0:00:01  0:00:01 --:--:-- 24.5M
mkdir: cannot create directory 'samples': File exists
TAIR10_cds_20101214_updated
TAIR10_seq_20101214_updated
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 82.1M  100 82.1M    0     0  36.0M      0  0:00:02  0:00:02 --:--:-- 36.1M
ncbi-blast-2.6.0+/db/UniVec_Core.nhr
ncbi-blast-2.6.0+/ncbi_package_info
ncbi-blast-2.6.0+/db/UniVec_Core.nin
ncbi-blast-2.6.0+/db/
ncbi-blast-2.6.0+/bin/
ncbi-blast-2.6.0+/ChangeLog
ncbi-blast-2.6.0+/bin/makembindex
ncbi-blast-2.6.0+/db/UniVec_Core.nsq
ncbi-blast-2.6.0+/bin/blastn
ncbi-blast-2.6.0+/README
ncbi-blast-2.6.0+/LICENSE
ncbi-blast-2.6.0+/bin/blastdbcmd
ncbi-blast-2.6.0+/bin/blast_formatter
ncbi-blast-2.6.0+/db/TAIR10.nhr
ncbi-blast-2.6.0+/db/TAIR10.nin
ncbi-blast-2.6.0+/bin/makeblastdb
ncbi-blast-2.6.0+/bin/makeprofiledb
ncbi-blast-2.6.0+/bin/windowmasker
ncbi-blast-2.6.0+/db/TAIR10.nsq
ncbi-blast-2.6.0+/
ncbi-blast-2.6.0+/bin/dustmasker
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7327k  100 7327k    0     0  9037k      0 --:--:-- --:--:-- --:--:-- 9034k
AT.db			     anaconda3_410  input_sequence.fasta
TAIR.tar.bz2		     anaconda3_431  ncbi-blast-2.6.0+
TAIR10_cds_20101214_updated  clustalw2	    ncbi-blast-2.6.0.tar.bz2
TAIR10_seq_20101214_updated  foralig.txt    outfile.xml
anaconda2_410		     forblast	    samples

In [6]:
!conda install biopython -y


Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /home/nbcommon/anaconda3_431:

The following NEW packages will be INSTALLED:

    biopython: 1.68-np111py36_0

biopython-1.68 100% |################################| Time: 0:00:00   9.01 MB/s

Listing 20.1: makedb.py: Convert data for entering into an SQLite database

Note The following program had to be adapted to work on Jupyter Notebook.


In [7]:
import sqlite3
from Bio import SeqIO

SEQ_FILE = open('samples/TAIR10_seq_20101214_updated')
CDS_FILE = open('samples/TAIR10_cds_20101214_updated')
AT_DB_FILE = 'AT.db'

at_d = {}
# Get all sequences from TAIR sequences file.
for record in SeqIO.parse(SEQ_FILE, 'fasta'):
    sid = record.id
    seq = str(record.seq)
    at_d[sid] = [seq]
# Get all sequences from TAIR CDS file.
for record in SeqIO.parse(CDS_FILE, 'fasta'):
    sid = record.id
    seq = str(record.seq)
    at_d[sid].append(seq)
# Write to a CSV file only the entries of the dictionary that
# has data from both sources
conn = sqlite3.connect(AT_DB_FILE)
c = conn.cursor()
c.execute('create table seq(id, cds, full_seq)')
for seq_id in at_d:
    if len(at_d[seq_id])==2:
        # Write in this order: ID, CDS, FULL_SEQ.
        c.execute('INSERT INTO seq VALUES (?,?,?)',
                 ((seq_id, at_d[seq_id][1], at_d[seq_id][0])))
conn.commit()
conn.close()

Listing 20.2: estimateintrons.py: Estimate introns


In [20]:
#!/usr/bin/env python

import os
import sqlite3

from Bio import SeqIO, SeqRecord, Seq
from Bio.Align.Applications import ClustalwCommandline
from Bio.Blast import NCBIXML
from Bio.Blast.Applications import NcbiblastnCommandline as bn
from Bio import AlignIO

AT_DB_FILE = 'AT.db'
BLAST_EXE = 'ncbi-blast-2.6.0+/bin/blastn'
BLAST_DB = 'ncbi-blast-2.6.0+/db/TAIR10'
CLUSTALW_EXE = os.path.join(os.getcwd(), 'clustalw2')

input_sequence = """>XM_013747562.1 PREDICTED: Brassica oleracea var. oleracea
ATCTTTCGCGAGAGGTTCATTATTGTCCGGAAGAGGTGCTCATGTTTTGGTAAAGCGATCACAAGGTGTT
CGATACAATACCTGAGAGAGTTTCCACAGCTTTCTTCTGATTCTTACTCGGTTTGAGTGAGCTGGATCTT
CCACGACGAAGATGATGATCTTGGATGTTTGCAATGAGATTATAAAGATCCAGAAGCTAAGACGGGTTGT
CTCTTACGCTGGATTCTACTGCTTCACTGCAGCCCTCACATTCTTCTACACAAACAACACAACAAGAGCA
GGATTCTCCAGGGGAGATCAGTTTTATGCGTCTTACCCTGCGGGTACCGAACTTCTTACCGACACAGCTA
AGCTGTACAAAGCGGCGCTTGGGAATTGCTATGAATCTGAGGATTGGGGTCCTGTCGAGTTCTGCATAAT
GGCTAAGCATTTTGAGCGCCAGGGAAAGTCTCCATACGTTTACCACTCTCAATACATGGCTCACCTTCTT
TCACAAGGCCAACTTGATGGAAGTGGCTAGAGTCGTTGATGACTTGCAAGACAGCTCCTTTTTCAATCTG
TGTACCTAATCTTGTTATTGGAACTTCCTTCTTTACTCTTTTTCCGAATTTGTACGGCGATGGTATTTGA
GGTTACCACCAAGAAATATAAGAACATGTTCTGGTGTAGACAATGAATGTAATAAACACATAAGATCAGA
CCTTGATATGA
"""

with open('input_sequence.fasta', 'w') as in_seq:
    in_seq.write(input_sequence)

def allgaps(seq):
    """Return a list with tuples containing all gap positions
       and length. seq is a string."""
    gaps = []
    indash = False
    for i, c in enumerate(seq):
        if indash is False and c == '-':
            c_ini = i
            indash = True
            dashn = 0
        elif indash is True and c == '-':
            dashn += 1
        elif indash is True and c != '-':
            indash = False
            gaps.append((c_ini, dashn+1))
    return gaps

def iss(user_seq):
    """Infer Splicing Sites from a FASTA file full of EST
    sequences"""

    with open('forblast','w') as forblastfh:
        forblastfh.write(str(user_seq.seq))

    blastn_cline = bn(cmd=BLAST_EXE, query='forblast',
                      db=BLAST_DB, evalue='1e-10', outfmt=5,
                      num_descriptions='1',
                      num_alignments='1', out='outfile.xml')
    blastn_cline()
    b_record = NCBIXML.read(open('outfile.xml'))
    title = b_record.alignments[0].title
    sid = title[title.index(' ')+1 : title.index(' |')]
    # Polarity information of returned sequence.
    # 1 = normal, -1 = reverse.
    frame = b_record.alignments[0].hsps[0].frame[1]
    # Run the SQLite query
    conn = sqlite3.connect(AT_DB_FILE)
    c = conn.cursor()
    res_cur = c.execute('SELECT CDS, FULL_SEQ from seq '
                        'WHERE ID=?', (sid,))
    cds, full_seq = res_cur.fetchone()
    if cds=='':
        print('There is no matching CDS')
        exit()
    # Check sequence polarity.
    sidcds = '{0}-CDS'.format(sid)
    sidseq = '{0}-SEQ'.format(sid)
    if frame==1:
        seqCDS = SeqRecord.SeqRecord(Seq.Seq(cds),
                                     id = sidcds,
                                     name = '',
                                     description = '')
        fullseq = SeqRecord.SeqRecord(Seq.Seq(full_seq),
                                      id = sidseq,
                                      name='',
                                      description='')
    else:
        seqCDS = SeqRecord.SeqRecord(
            Seq.Seq(cds).reverse_complement(),
            id = sidcds, name='', description='')
        fullseq = SeqRecord.SeqRecord(
            Seq.Seq(full_seq).reverse_complement(),
            id = sidseq, name = '', description='')
    # A tuple with the user sequence and both AT sequences
    allseqs = (record, seqCDS, fullseq)
    with open('foralig.txt','w') as trifh:
        # Write the file with the three sequences
        SeqIO.write(allseqs, trifh, 'fasta')
    # Do the alignment:
    outfilename = '{0}.aln'.format(user_seq.id)
    cline = ClustalwCommandline(CLUSTALW_EXE,
                                infile = 'foralig.txt',
                                outfile = outfilename,
                                )
    cline()
    # Walk over all sequences and look for query sequence
    for seq in AlignIO.read(outfilename, 'clustal'):
        if user_seq.id in seq.id:
            seqstr = str(seq.seq)
            gaps = allgaps(seqstr.strip('-'))
            break
    print('Original sequence: {0}'.format(user_seq.id))
    print('\nBest match in AT CDS: {0}'.format(sid))
    acc = 0
    for i, gap in enumerate(gaps):
        print('Putative intron #{0}: Start at position {1}, '
              'length {2}'.format(i+1, gap[0]-acc, gap[1]))
        acc += gap[1]
    print('\n{0}'.format(seqstr.strip('-')))
    print('\nAlignment file: {0}\n'.format(outfilename))

description = 'Program to infer intron position based on ' \
              'Arabidopsis Thaliana genome'
with open('input_sequence.fasta', 'r') as seqhandle:
    records = SeqIO.parse(seqhandle, 'fasta')
    for record in records:
        iss(record)


Original sequence: XM_013747562.1

Best match in AT CDS: AT1G14990.1
Putative intron #1: Start at position 14, length 1
Putative intron #2: Start at position 53, length 5
Putative intron #3: Start at position 92, length 2
Putative intron #4: Start at position 143, length 2
Putative intron #5: Start at position 148, length 2
Putative intron #6: Start at position 169, length 1
Putative intron #7: Start at position 210, length 3
Putative intron #8: Start at position 232, length 3
Putative intron #9: Start at position 245, length 2
Putative intron #10: Start at position 352, length 95
Putative intron #11: Start at position 518, length 1
Putative intron #12: Start at position 533, length 2
Putative intron #13: Start at position 558, length 1
Putative intron #14: Start at position 642, length 4
Putative intron #15: Start at position 654, length 1
Putative intron #16: Start at position 668, length 2
Putative intron #17: Start at position 673, length 1

ATCTTTCGCGAGAG-GTTCATTATTGTCCGGAAGAGGTGCTCATGTTTTGGTAA-----AGCGATCACAAGGTGTTCGATACAATACCTGAGAGAGTT--TCCACAGCTTTCTTCTGATTCTTACTCGGTTTGAGTGAGCTGGATCTTCCA--CGACG--AAGATGATGATCTTGGATGTT-TGCAATGAGATTATAAAGATCCAGAAGCTAAGACGGGTTGT---CTCTTACGCTGGATTCTACTGC---TTCACTGCAGCCC--TCACATTCTTCTACACAAACAACACAACAAGAGCAGGATTCTCCAGGGGAGATCAGTTTTATGCGTCTTACCCTGCGGGTACCGAACTTCTTACCGACACAGCTAAG-----------------------------------------------------------------------------------------------CTGTACAAAGCGGCGCTTGGGAATTGCTATGAATCTGAGGATTGGGGTCCTGTCGAGTTCTGCATAATGGCTAAGCATTTTGAGCGCCAGGGAAAGTCTCCATACGTTTACCACTCTCAATACATGGCTCACCTTCTTTCACAAGGCCAACTTGATGGAAGTGGCT-AGAGTCGTTGATGAC--TTGCAAGACAGCTCCTTTTTCAATC-TGTGTACCTAATCTTGTTATTGGAACTTCCTTCTTTACTCTTTTTCCGAATTTGTACGGCGATGGTATTTGAGGTTACCACCAA----GAAATATAAGAA-CATGTTCTGGTGTA--GACAA-TGAATGTAATAAACACATAAGATCAGACCTTGATATGA

Alignment file: XM_013747562.1.aln