Text manipulation is straightforward in Python thanks to a wide variety of packages. It is generally advisable not to reinvent the wheel, so only resort to quick-and-dirty text parsing when it is really necessary.
Here is a basic raw text file opening template in Python.
In [ ]:
# f = open('path/to/file.txt', 'r')
# # f.readlines() would read all lines into a list at once
# for l in f:
#     # do stuff
#     print(l)
# f.close()
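The commented template above can be written more idiomatically with a `with` block, which closes the file automatically even if an error occurs. A minimal self-contained sketch (the file path here is a throwaway temp file created just for the demo):

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.gettempdir(), 'example.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

# The `with` statement closes the file automatically when the block exits.
lines = []
with open(path) as f:
    for line in f:
        lines.append(line.rstrip('\n'))  # strip the trailing newline
print(lines)
```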
What if we want to read text from the standard input? (Useful for pipelining)
In [1]:
# use like this: cat file.txt | python script.py
import sys
for line in sys.stdin:
    # do stuff
    print(line)
# there is a dedicated module for in-memory text I/O
#import io
#t = io.StringIO()
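A script that reads from `sys.stdin` is awkward to test interactively, but any file-like object works in its place. A minimal sketch that simulates piped input with `io.StringIO` (the helper function name is ours, not a standard one):

```python
import io

def count_nonempty(stream):
    """Count non-empty lines in a text stream (sys.stdin or any file-like object)."""
    return sum(1 for line in stream if line.strip())

# Simulate `cat file.txt | python script.py` with an in-memory stream.
fake_stdin = io.StringIO('one\n\ntwo\nthree\n')
print(count_nonempty(fake_stdin))  # three non-empty lines
```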
In [2]:
d = {'first': [1,"two"], 'second': set([3, 4, 'five'])}
import pickle
with open('dumpfile.pkl','wb') as fout:
    pickle.dump(d, fout)
with open('dumpfile.pkl','rb') as fin:
    d2 = pickle.load(fin)
print(d2)
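Pickle can also serialize to a bytes object in memory with `dumps`/`loads`, with no file involved. Note that Python-specific types like `set` survive a pickle round trip (unlike JSON, as the next cell shows). A short sketch:

```python
import pickle

d = {'first': [1, "two"], 'second': {3, 4, 'five'}}
blob = pickle.dumps(d)           # serialize to an in-memory bytes object
restored = pickle.loads(blob)    # deserialize back to Python objects
print(restored == d)             # the set round-trips intact
```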
In [3]:
import json
#json_string = json.dumps([1, 2, 3, "a", "b", "c"])
d = {'first': [1,"two"], 'second': [3, 4, 'five']}
json_string = json.dumps(d)
print(json_string)
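The reverse direction is `json.loads`. Note that the `'second'` value above is a list rather than a set: JSON has no set type, and `json.dumps` raises `TypeError` on one. A quick round-trip sketch:

```python
import json

d = {'first': [1, "two"], 'second': [3, 4, 'five']}
s = json.dumps(d)
back = json.loads(s)
print(back == d)  # lists, strings, and numbers round-trip exactly
# Sets do not: json.dumps({1, 2}) raises TypeError.
```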
Example: FASTA parsing
Open the file containing all peptide sequences in the human body. How many unknown peptides does it contain? How many unique genes and transcripts are in there for the unknown peptides? Output a tab separated file containing the gene id and transcript id for each unknown peptide.
Observation: this exercise uses the Biopython and pandas modules.
Task: Order the chromosomes by the ratio of unknown peptides to the total number of peptides they translate.
ENSP00000388523 pep:known chromosome:GRCh38:7:142300924:142301432:1 gene:ENSG00000226660 transcript:ENST00000455382 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene MDTWLVCWAIFSLLKAGLTEPEVTQTPSHQVTQMGQEVILRCVPISNHLYFYWYRQILGQ KVEFLVSFYNNEISEKSEIFDDQFSVERPDGSNFTLKIRSTKLEDSAMYFCASSE
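As a warm-up, the colon-separated fields of a header like the one above can be pulled apart with plain string splitting. A minimal self-contained sketch on a shortened copy of that header:

```python
# Shortened copy of the FASTA header shown above.
header = ('>ENSP00000388523 pep:known '
          'chromosome:GRCh38:7:142300924:142301432:1 '
          'gene:ENSG00000226660 transcript:ENST00000455382')
fields = header.lstrip('>').split()
pepid = fields[0]                       # peptide id
status = fields[1].split(':')[1]        # 'known' or 'unknown'
gene = fields[3].split(':')[1]          # gene id
transcript = fields[4].split(':')[1]    # transcript id
print(pepid, status, gene, transcript)
```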
In [19]:
import sys
f = open('data/Homo_sapiens.GRCh38.pep.all.fa','r')
peptides = {}
for l in f:
    if l[0] == '>':
        #print(l.strip().split())
        record = {}
        r = l.strip('\n').split()
        pepid = r[0][1:]
        record['pep'] = 1 if r[1].split(':')[1] == 'known' else 0
        record['gene'] = r[3].split(':')[1]
        record['transcript'] = r[4].split(':')[1]
        peptides[pepid] = record
f.close()
## using regular expressions to match all unknown peptides
nupep2 = 0
import re
#pattern = re.compile('^>.*(known).*')
pattern = re.compile('^>((?!known).)*$')  # header lines NOT containing 'known'
with open('data/Homo_sapiens.GRCh38.pep.all.fa','r') as f:
    for l in f:
        if pattern.search(l) is not None: nupep2 += 1
npep = len(peptides)
upep = set([pepid for pepid in peptides if peptides[pepid]['pep']==0]) #unknown peptides
nunknown = len(upep)
genes = set([peptides[pepid]['gene'] for pepid in upep])
trans = set([peptides[pepid]['transcript'] for pepid in upep])
print(npep, nupep2, nunknown, len(genes), len(trans))
with open('unknown_peptides.txt','w') as f:
    for pepid in upep:
        f.write('\t'.join([pepid, peptides[pepid]['gene'], peptides[pepid]['transcript']])+'\n')
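The negative-lookahead pattern `^>((?!known).)*$` matches a header line only if it contains no occurrence of `known` anywhere. A self-contained check on two toy headers (the ids here are made up for illustration):

```python
import re

# Matches lines starting with '>' in which no position begins the string 'known'.
pattern = re.compile('^>((?!known).)*$')
headers = ['>ENSP1 pep:known gene:G1',   # contains 'known' -> rejected
           '>ENSP2 pep:novel gene:G2']   # no 'known'       -> matched
unknown = [h for h in headers if pattern.search(h)]
print(unknown)
```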
In [3]:
from Bio import SeqIO
f = open('data/Homo_sapiens.GRCh38.pep.all.fa','r')
i = 0
for record in SeqIO.parse(f, 'fasta'):
    name, sequence = record.id, str(record.seq)
    if 20 < len(sequence) < 100:
        i += 1
        print(i)
        print("Name", name)
        print("Sequence", sequence)
        if i > 5:
            break
f.close()
XML is a general file format used for data interchange, especially among different applications. One of its most popular uses in biology is the SBML format, which aims to store a biological model specification, regardless of how specialized that model may be.
Task:
Download a curated SBML file from the BioModels database: http://www.ebi.ac.uk/biomodels-main/
Find out how many reactions the file contains.
Extra task:
Make a simplified XML file of the reactants and their k-values for each reaction.
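Before tackling a real BioModels file, the counting approach can be sketched on a tiny inline SBML-like snippet with `xml.etree.ElementTree`. Note this toy string omits the XML namespace that real curated SBML files declare (e.g. an `xmlns` on the root element), which prefixes every tag and must be accounted for when matching:

```python
import xml.etree.ElementTree as ET

# Toy SBML-like document; real SBML carries a namespace on the root element.
xml_text = """<sbml>
  <model>
    <listOfReactions>
      <reaction id="R1"/>
      <reaction id="R2"/>
    </listOfReactions>
  </model>
</sbml>"""
root = ET.fromstring(xml_text)
reactions = root.findall('.//reaction')  # all <reaction> elements at any depth
print(len(reactions))
```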
In [13]:
import sys
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='data/curated_sbml.xml')
#tree = ET.parse(open('data/curated_sbml.xml'))
root = tree.getroot()
print(root.tag, root.attrib)
for child in root:
    print(child.tag, child.attrib)
    for child2 in child:
        print(child2.tag, child2.attrib)
#tree.write(sys.stdout)
for elem in root.iter('reaction'):
    print(elem.tag, elem.attrib)
for elem in root.iter('species'):
    print(elem.tag, elem.attrib)
    print(elem.get('id'))
print(tree.findall('.//reaction'))
In [ ]:
import xmltodict
with open('data/curated_sbml.xml','r') as fd:
    doc = xmltodict.parse(fd.read())
This section is concerned with automatic information processing from the Internet.
Task:
BeautifulSoup is loved by hackers. Aside from HTML it can also parse XML.
Here is a small script that lists all web anchors from the Reddit main page (an anchor is an HTML tag normally used to provide hyperlinks and reference points inside a web page).
In [4]:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2 was renamed in Python 3

redditFile = urlopen("http://www.reddit.com")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
In [ ]: