Python for Bioinformatics

This Jupyter notebook is intended to be used alongside the book Python for Bioinformatics.

Note: The sample files must be accessible from this Jupyter notebook before the code below is run. The following commands download them from GitHub and extract them into a directory called samples.

Chapter 11: XML


In [1]:
!curl https://raw.githubusercontent.com/Serulab/Py4Bio/master/samples/samples.tar.bz2 -o samples.tar.bz2
!mkdir -p samples
!tar xvfj samples.tar.bz2 -C samples


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.5M  100 16.5M    0     0  10.5M      0  0:00:01  0:00:01 --:--:-- 10.5M
._.
./
./vectorssmall.fasta
./prot.fas
./conglycinin.phy
./input4align.dnd
./Q5R5X8.fas
./Q9JJE1.xml
./primers.txt
./NC_006581.gb
./PythonU.db
./hsc1.fasta
./B1.csv
./sampleXblast.xml
./sampledata.xlsx
./NC2033.txt
./conglycinin.dnd
./BLAST_output.xml
./UniVec_Core.nhr
./seqA.fas
./fishdata.csv
./template
./pdbaa
./bioinfo/
./fishbacteria.csv
./B1IXL9.txt
./._GSM188012.CEL
./GSM188012.CEL
./example.aln
./data.csv
./3seqs.fas
./BcrA.gp
./uniprotrecord.xml
./contig1.ace
./other.xml
./UniVec_Core.nsq
./test3.csv
./conglycinin.multiple.phy
./fasta22.fas
./pMOSBlue.txt
./readme.txt
./sampleX.fas
./UniVec_Core.nin
./pdb1apk.ent.gz
./cas9align.fasta
./a19.gp
./phd1
./t3beta.fasta
./conglycinin.fasta
./UniVec_Core
./BLAST_output.html
./spfile.txt
./input4align.fasta
./t3.fasta
./TAIR7_Transcripts_by_map_position.gz
./bioinfo/seqs/
./bioinfo/seqs/513710.fasta
./bioinfo/seqs/6598312.fasta
./bioinfo/seqs/4586830.fasta
./bioinfo/seqs/15721870.fasta
./bioinfo/seqs/2623545.fasta
./bioinfo/seqs/218744616.fasta
./bioinfo/seqs/7415878.fasta
./bioinfo/seqs/63108399.fasta
./bioinfo/seqs/513717.fasta
./bioinfo/seqs/7638455.fasta
./bioinfo/seqs/513719.fasta
./bioinfo/seqs/513419.fasta
./bioinfo/seqs/513718.fasta
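
Where shell commands are not available (for example on Windows without curl and tar), the same download and extraction can be sketched with the standard library. The helper names download and extract are hypothetical and are not part of the book's code; this is a minimal sketch, not a replacement for the cell above.

```python
import os
import tarfile
import urllib.request


def download(url, archive_path):
    """Save the file at `url` to `archive_path` and return that path."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract(archive_path, dest='samples'):
    """Unpack a .tar.bz2 archive into `dest`, creating the directory if needed."""
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(archive_path, 'r:bz2') as archive:
        archive.extractall(dest)
    # Return the top-level names so the caller can check what arrived.
    return sorted(os.listdir(dest))
```

A call such as extract(download(url, 'samples.tar.bz2')), with url set to the address used in the curl command above, would then play the role of the three shell lines.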

SAX: cElementTree Iterparse


In [2]:
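# xml.etree.cElementTree was removed in Python 3.9. ElementTree uses the
# C accelerator automatically, so the cET alias keeps the later cells working.
import xml.etree.ElementTree as cET
for event, elem in cET.iterparse('samples/uniprotrecord.xml',
                                 events=('start', 'end')):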
    if event=='end' and 'sequence' in elem.tag:
        print('Sequence: {0}'.format(elem.text))
        print('Checksum: {0}'.format(elem.attrib["checksum"]))
        print('Length: {0}'.format(elem.attrib["length"]))
        elem.clear()


Sequence: 
MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRL
EAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIH
LEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGS
LDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNS
RGEIKLCDFGVSGQLIDSMANSFVGTRSYMSPERLQGTHYSVQSDIWSMG
LSLVEMAVGRYPIPPPDAKELELLFGCHVEGDAAETPPRPRTPGGPLSSY
GMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERA
DLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAASI

Checksum: E0C0CC2E1F189B8A
Length: 393

In [3]:
for event, elem in cET.iterparse('samples/uniprotrecord.xml'):
    if 'sequence' in elem.tag:
        print('Sequence: {0}'.format(elem.text))
        print('Checksum: {0}'.format(elem.attrib["checksum"]))
        print('Length: {0}'.format(elem.attrib["length"]))
        elem.clear()


Sequence: 
MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRL
EAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIH
LEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGS
LDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNS
RGEIKLCDFGVSGQLIDSMANSFVGTRSYMSPERLQGTHYSVQSDIWSMG
LSLVEMAVGRYPIPPPDAKELELLFGCHVEGDAAETPPRPRTPGGPLSSY
GMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERA
DLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAASI

Checksum: E0C0CC2E1F189B8A
Length: 393
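
The cells above test 'sequence' in elem.tag rather than comparing the tag directly, because iterparse reports namespaced tags in the form {namespace}name. A substring test also matches any other tag whose name contains "sequence". The sketch below shows a stricter comparison on a toy record. The inline XML and the function name read_sequences are made up for illustration, and the namespace URI is assumed to be the one declared in the UniProt file.

```python
import io
import xml.etree.ElementTree as ET

# Toy record (assumption): mimics the namespaced layout of a UniProt entry.
XML = b"""<?xml version="1.0"?>
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <sequence length="8" checksum="ABC123">
MPKK
KPTP
    </sequence>
  </entry>
</uniprot>"""

# Assumed namespace; check the xmlns attribute of the real file.
NS = '{http://uniprot.org/uniprot}'


def read_sequences(source):
    """Yield (sequence, attributes) for every <sequence> element."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == NS + 'sequence':
            # Remove the line breaks and indentation kept inside the element.
            seq = ''.join(elem.text.split())
            yield seq, dict(elem.attrib)
            elem.clear()


records = list(read_sequences(io.BytesIO(XML)))
print(records)
```

Only 'end' events are requested here because the text of an element is guaranteed to be complete only once its closing tag has been read.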

In [4]:
allelements = cET.iterparse('samples/uniprotrecord.xml', events=('start', 'end'))
allelements = iter(allelements)
# The first event is the start of the root element; keep a reference to it.
event, root = next(allelements)

In [5]:
for event, elem in allelements:
    if event=='end' and 'sequence' in elem.tag:
        print(elem.text)
        # Clearing the root drops every element parsed so far from memory.
        root.clear()


MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRL
EAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIH
LEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGS
LDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNS
RGEIKLCDFGVSGQLIDSMANSFVGTRSYMSPERLQGTHYSVQSDIWSMG
LSLVEMAVGRYPIPPPDAKELELLFGCHVEGDAAETPPRPRTPGGPLSSY
GMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERA
DLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAASI
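
Clearing the root instead of the current element matters mostly for files that hold many records: elem.clear() empties an element but leaves it attached to its parent, whereas root.clear() also detaches the finished children. The following sketch illustrates this on a toy document with three entries. The XML, the entry tag and the id attribute are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

# Toy input (assumption): three small entries standing in for a large file.
XML = b"<root>" + b"".join(
    b"<entry id='%d'><sequence>ACGT</sequence></entry>" % i for i in range(3)
) + b"</root>"

context = ET.iterparse(io.BytesIO(XML), events=('start', 'end'))
event, root = next(context)   # the first event is the start of the root element

seen = []
for event, elem in context:
    if event == 'end' and elem.tag == 'entry':
        # Read what is needed before the entry is discarded.
        seen.append((elem.get('id'), elem.findtext('sequence')))
        root.clear()          # detach the finished entry from the tree

print(seen)
print(len(root))              # number of children still attached to the root
```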


In [6]:
from bs4 import BeautifulSoup as bs
# Use a context manager so the file handle is closed after parsing.
with open('samples/uniprotrecord.xml') as fh:
    soup = bs(fh, 'lxml')

In [7]:
import requests
url = 'https://s3.amazonaws.com/py4bio/uniprotrecord.xml'
req = requests.get(url)
c = req.content

In [8]:
from bs4 import BeautifulSoup as bs
soup = bs(c, 'lxml')

In [9]:
soup.sequence


Out[9]:
<sequence checksum="E0C0CC2E1F189B8A" length="393">
MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRL
EAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIH
LEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGS
LDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNS
RGEIKLCDFGVSGQLIDSMANSFVGTRSYMSPERLQGTHYSVQSDIWSMG
LSLVEMAVGRYPIPPPDAKELELLFGCHVEGDAAETPPRPRTPGGPLSSY
GMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERA
DLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAASI
</sequence>

In [10]:
soup.sequence.string


Out[10]:
'\nMPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRL\nEAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIH\nLEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGS\nLDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNS\nRGEIKLCDFGVSGQLIDSMANSFVGTRSYMSPERLQGTHYSVQSDIWSMG\nLSLVEMAVGRYPIPPPDAKELELLFGCHVEGDAAETPPRPRTPGGPLSSY\nGMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERA\nDLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAASI\n'

In [11]:
soup.sequence.get('checksum')


Out[11]:
'E0C0CC2E1F189B8A'

In [12]:
soup.sequence.get('length')


Out[12]:
'393'
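
Attribute values come back as strings, and the sequence text still contains its line breaks, so both need a small conversion before they can be compared. A minimal sketch follows, using a short made-up sequence in place of the real record. The function name check_length is hypothetical.

```python
def check_length(raw_sequence, declared_length):
    """Return the cleaned sequence and whether it matches the declared length."""
    # split() with no argument removes newlines, spaces and tabs alike.
    residues = ''.join(raw_sequence.split())
    return residues, len(residues) == int(declared_length)


# Toy values (assumption): same shape as soup.sequence.string and
# soup.sequence.get('length'), but much shorter.
residues, ok = check_length('\nMPKKKPTPIQ\nLNPAPDGSAV\n', '20')
print(residues, ok)
```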

In [13]:
for taxon in soup.lineage.children:
    if taxon.string != '\n':
        print(taxon.string)


Eukaryota
Metazoa
Chordata
Craniata
Vertebrata
Euteleostomi
Mammalia
Eutheria
Euarchontoglires
Glires
Rodentia
Sciurognathi
Muroidea
Muridae
Murinae
Mus
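
For comparison, the same lineage can be read with ElementTree by passing a prefix-to-namespace mapping to findall, which avoids filtering out the whitespace-only children. The sketch below runs on a toy fragment. The inline XML is invented, the prefix up is arbitrary, and the namespace URI is assumed to match the one declared in the UniProt file.

```python
import xml.etree.ElementTree as ET

# Toy fragment (assumption): a shortened lineage in the UniProt layout.
XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <organism>
      <lineage>
        <taxon>Eukaryota</taxon>
        <taxon>Metazoa</taxon>
        <taxon>Chordata</taxon>
      </lineage>
    </organism>
  </entry>
</uniprot>"""

# The prefix is local to this mapping; only the URI has to match the file.
ns = {'up': 'http://uniprot.org/uniprot'}

root = ET.fromstring(XML)
taxa = [taxon.text for taxon in root.findall('.//up:lineage/up:taxon', ns)]
print(taxa)
```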

In [14]:
print('Sequence: {0}'.format(soup.sequence.string))


Sequence: 
MPKKKPTPIQLNPAPDGSAVNGTSSAETNLEALQKKLEELELDEQQRKRL
EAFLTQKQKVGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIH
LEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGS
LDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNS
RGEIKLCDFGVSGQLIDSMANSFVGTRSYMSPERLQGTHYSVQSDIWSMG
LSLVEMAVGRYPIPPPDAKELELLFGCHVEGDAAETPPRPRTPGGPLSSY
GMDSRPPMAIFELLDYIVNEPPPKLPSGVFSLEFQDFVNKCLIKNPAERA
DLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAASI


In [15]:
print('Checksum: {0}'.format(soup.sequence.get('checksum')))


Checksum: E0C0CC2E1F189B8A

In [16]:
print('Length: {0}'.format(soup.sequence.get('length')))


Length: 393