Sequence records, part 2

Instructions

This part of the course material does not rely on the Biopython tutorial. Rather, it shows how sequences can be searched and fetched from UniProt databases and how to use other online services.

Read the documentation for programmatic access to UniProt. It is also recommended to read the Proteins REST API, which is another way to access UniProt data.

Objectives

  • search and fetch sequence records from UniProt databases
  • use other online services programmatically
  • use named tuples to represent structural information

Summary

UniProt is a resource of protein sequences and their annotations. Biopython does not support online access to UniProt but it can parse the XML format that UniProt uses. Fortunately, UniProt has a simple API for searching and fetching data.

The requests module is a simple-to-use module for HTTP-based communication. With it, the UniProt API can be used in much the same way as EUtils is used via Biopython, so the approaches from the previous part of the course remain relevant. The requests module will also be used to communicate with other online services.


In [ ]:
import requests as R

UniProt can supply records in various formats

The simplest query is to fetch a single record from UniProt. The web address specifies both the ID of the record and the format of the output.


In [ ]:
# API address
# (%s are placeholders for user input)
base = "https://www.uniprot.org/uniprot/%s.%s"

# record ID
uid = 'P12345'
# output format
fmt = 'fasta'

# replace the first %s with ID and the second %s with format
url = base % (uid, fmt)

# send query and get response
r = R.get(url)

In [ ]:
# web address that was used (you can try and see what it looks like in browser)
print(r.url)

# (the address 'https://www.uniprot.org/uniprot/P12345' would return HTML that you normally see in the browser)

In [ ]:
# record in FASTA format as requested
print(r.text)
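
The response object can also be checked before its content is used; the status_code and ok attributes are standard parts of the requests module. A minimal sketch:


In [ ]:
# HTTP status code of the response (200 means success)
print(r.status_code)

# True if the status code indicates success, False otherwise
print(r.ok)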

Full records can be obtained in the XML format.


In [ ]:
# API address
base = "https://www.uniprot.org/uniprot/%s.%s"

uid = 'P12345'
fmt = 'xml'

url = base % (uid, fmt)
r = R.get(url)

# save record to file
with open('P12345.xml', 'w') as f:
    f.write(r.text)

Biopython can parse UniProt XML

The UniProt XML can be parsed by the Bio.SeqIO module.


In [ ]:
import Bio.SeqIO as BSIO

# parse a single record in the UniProt XML format to a SeqRecord object
r = BSIO.read("P12345.xml", "uniprot-xml")

# SeqRecord object
print(r)

UniProt can be queried for records satisfying specific conditions

Queries to UniProt can be made by sending the query (as you would write it into the search box in a browser) to the address https://www.uniprot.org/uniprot. The details of the query are supplied as parameters rather than as part of the address.

Take a look at the Query API for a list of parameters that can be used. Note particularly the limit and offset parameters, which correspond to the retmax and retstart arguments of EUtils, respectively.


In [ ]:
# API address
url = "https://www.uniprot.org/uniprot"

# required parameters as dictionary
data = {
    # query
    # - reviewed:yes == manually annotated
    # - name:p53 == proteins with 'p53' in their names
    # - organism:"Homo sapiens (Human) [9606]" == proteins from human
    'query': 'reviewed:yes AND name:p53 AND organism:"Homo sapiens (Human) [9606]"',
    # output in FASTA format
    'format': 'fasta',
    # fetch the first three records
    'limit': '3',
}

# send query and get response
r = R.get(url, params=data)

# save to file
with open('sequences.fasta', 'w') as f:
    f.write(r.text)

The UniProt website uses the same API when it is accessed with a browser. It is therefore possible to first design a query in the browser and then implement it in the code.


In [ ]:
# the url of the query
# (try it in the browser after removing the "format=fasta" parameter)
print(r.url)

In [ ]:
# show file content
with open('sequences.fasta') as f:
    print(f.read())

If you only need the list of matching IDs, there is no need to fetch any sequences.


In [ ]:
# API address
url = "https://www.uniprot.org/uniprot"

# required parameters as dictionary
data = {
    'query': 'reviewed:yes AND name:p53 AND organism:"Homo sapiens (Human) [9606]"',
    # output as list of IDs
    'format': 'list',
    # fetch the first ten records
    'limit': '10',
}

# send query and get response
r = R.get(url, params=data)

# store data to variable
ids = r.text

# raw text output
print(ids)

The output can be easily parsed into a Python list.


In [ ]:
# remove surrounding whitespace and split at newlines
ids = ids.strip().split("\n")

# Python list of ids
print(ids)
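
The offset parameter mentioned earlier can be combined with limit to page through a long list of matches. Below is a minimal sketch, assuming the same p53 query as above; it fetches the matching IDs in two pages of five.


In [ ]:
# API address
url = "https://www.uniprot.org/uniprot"

# fetch the first two pages of five IDs each
for offset in (0, 5):
    data = {
        'query': 'reviewed:yes AND name:p53 AND organism:"Homo sapiens (Human) [9606]"',
        # output as list of IDs
        'format': 'list',
        # page size
        'limit': '5',
        # start of this page (0 == first match)
        'offset': str(offset),
    }
    # send query and get response
    page = R.get(url, params=data)
    print(page.text)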

It is also possible to fetch specific fields in a tabular format. The columns parameter specifies which fields to fetch. See the Query API for the details of the available fields.


In [ ]:
# API address
url = "https://www.uniprot.org/uniprot"

# required parameters as dictionary
data = {
    'query': 'reviewed:yes AND name:p53 AND organism:"Homo sapiens (Human) [9606]"',
    # output as table
    'format': 'tab',
    # desired fields as comma-separated list
    'columns': 'id,entry name,length',
    # fetch the first ten records
    'limit': '10',
}

# send query and get response
r = R.get(url, params=data)

# store data to variable
text = r.text

# raw text output
print(text)

Named tuples are convenient for simple data structures

The collections module of Python contains specialised container datatypes for handling and organising data. The namedtuple function can be used to create custom classes that are tuples with named fields. They are convenient for storing simple structured data, like fields from UniProt records.


In [ ]:
import collections as C

In [ ]:
# data as tuple
record = ('P04637', 'P53_HUMAN', 393)
print(record)

In [ ]:
# access via index
print(record[1])

In [ ]:
# named tuple class that models some UniProt fields
# ('Entry' is the name of the class)
Entry = C.namedtuple('Entry', ['id', 'name', 'length'])

# data as named tuple
record = Entry('P04637', 'P53_HUMAN', 393)
print(record)

In [ ]:
# access via index
print(record[1])

In [ ]:
# access via name
print(record.name)

In [ ]:
# data as tuple
data = ['P04637', 'P53_HUMAN', 393]

# data as named tuple
# (the * operator converts a list into positional arguments)
record = Entry(*data)
print(record)
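
Named tuples tie in naturally with the tabular output fetched earlier. Below is a minimal sketch that parses the tab-separated text stored in the text variable into Entry objects; it assumes the first line is a header and each remaining line holds one record.


In [ ]:
# split the tabular output into lines and drop the header line
lines = text.strip().split("\n")[1:]

records = []
for line in lines:
    # fields are separated by tabs
    fields = line.split("\t")
    # convert the length field to an integer
    records.append(Entry(fields[0], fields[1], int(fields[2])))

# data as list of named tuples
for record in records:
    print(record)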

UniProt provides a mapping service between database IDs

Each database has its own set of records, but much of the biological information is shared between databases. UniProt can map between database IDs such that a record in one database is paired with the same (or corresponding) record(s) in another database.


In [ ]:
# API address
url = 'https://www.uniprot.org/uploadlists/'

# required parameters as dictionary
data = {
    # map from this database
    'from': 'ACC',
    # map to this database
    'to': 'PIR',
    # output format
    'format': 'tab',
    # space-separated list of IDs
    'query': 'P12345 P12346'
}

# send query and get response
r = R.get(url, params=data)

# store data to variable
text = r.text

# raw text output
print(text)
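
The mapping output is tab-separated as well, so it can be parsed in the same manner. A minimal sketch, assuming the response is a two-column table with a header line; note that if one ID mapped to several targets, later pairs would overwrite earlier ones, so a real implementation might collect lists instead.


In [ ]:
# parse the two-column table into a dictionary
mapping = {}
# skip the header line
for line in text.strip().split("\n")[1:]:
    source, target = line.split("\t")
    mapping[source] = target

# dictionary that maps UniProt IDs to PIR IDs
print(mapping)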

If you are mapping from a UniProt record, the cross-references can also be accessed via the dbxrefs attribute of the SeqRecord object. If you do not have the full record, or if you need to map to UniProt from another database, the UniProt mapping service is useful.


In [ ]:
# parse a single UniProt record
r = BSIO.read("P12345.xml", "uniprot-xml")

# the PIR ID 'B27103' is listed as a cross-reference for the UniProt record 'P12345'
for ref in r.dbxrefs:
    if ref.startswith('PIR:'):
        print(ref)

Other online services can be accessed by scraping HTML

You should primarily use APIs to access online services because they are intended for programmatic access. Always read the documentation to see how the owners of a service ask you to use it.

Some online services do not have an API. These services can still be used by communicating with the website like a browser would. In these cases, the services respond in the HTML format because that is what browsers expect. Since each website has its own HTML structure, there is no simple way to extract the desired information from the HTML response. You must look at the HTML of the website and implement your own solution.

SecretomeP service can be accessed via its website

The SecretomeP service by DTU Bioinformatics serves as an example in the course. Look at the source of the website to see how the user input is collected with the online form and sent to the server.

To simulate the behaviour of the online form in a browser, it is best to first locate the form element and then match the visible form fields to their parameter names and values. For example, the text area to supply the sequence is provided by

<textarea name="SEQPASTE" rows=3 cols=64>

and the "Organism group" radio buttons are provided by

  • <input name="orgtype" type="radio" value="gram-" ...>
  • <input name="orgtype" type="radio" value="gram+" ...>
  • <input name="orgtype" type="radio" value="mam" ...>

The form also contains a hidden field

<input type=HIDDEN name=configfile value="/usr/opt/www/pub/CBS/services/SecretomeP-2.0/SecretomeP.athena.cf">,

the value of which is sent along with the user input. The destination is defined in the element

<form ... action="/cgi-bin/webface2.fcgi" method="POST">,

which indicates that the data is sent to http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi as a POST request. Note the relative address in the action field.


In [ ]:
# sequences to use as input
# (since the service accepts FASTA format, we can use the file content as such)
with open('sequences.fasta') as f:
    seqs = f.read()

# it seems that the SecretomeP service expects Windows newlines
# (i.e. replace \n with \r\n)
seqs = seqs.replace('\n', '\r\n')

# service address
url = 'http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi'

# required parameters as dictionary
data = {
    # hidden input
    'configfile': '/usr/opt/www/pub/CBS/services/SecretomeP-2.0/SecretomeP.athena.cf',
    # input from text area
    'SEQPASTE': seqs,
    # input from radio buttons
    'orgtype': 'mam',
}

# send query as POST and get response
r = R.post(url, data=data)

# store response to variable
html = r.text

# raw response as HTML
print(html)

The SecretomeP service does not respond with the results but rather gives a job ID, which can be used to fetch the actual results once they are ready. The jobid value can be extracted from the HTML response with a regular expression, for example.

The job ID is mentioned several times in the HTML. The code below extracts it from the line that has the following format:

<!-- jobid: 56B2792F00004B9894F74F8C status: queued -->


In [ ]:
import re as RE

# regular expression to match "jobid: X status:" where X is 24 non-whitespace characters
# (the parentheses capture the job ID so that it can be extracted from the match)
match = RE.search(r"jobid: (\S{24}) status:", html)

jobid = ""
# check if there is a match (there should be...)
if match:
    # extract the part that was captured by the parentheses
    jobid = match.group(1)

# extracted job ID
print(jobid)

Based on the first response, the results can be fetched from the same address to which the job was submitted, but this time the request should be a GET request with jobid as a parameter:

http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi?jobid=X,

where X is the job ID. The ? indicates that the parameters are encoded in the address itself, as is typical of a GET request.


In [ ]:
# service address
url = 'http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi'

# request data as dictionary
data = {
    # job ID
    'jobid': jobid,
}

# fetch the actual results
r = R.get(url, params=data)

print(r.text)

Again, the response is in HTML and the desired information must be extracted from it. Looking at the response, the relevant information is in the <PRE> element, which contains a few lines in a tabular format.
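
As a sketch, the content of the <PRE> element can be extracted with a regular expression in the same way as the job ID was; the tag name and its case are assumptions based on the response described above.


In [ ]:
# regular expression to match everything between <PRE> and </PRE>
# (the DOTALL flag makes . match newlines as well)
match = RE.search(r"<PRE>(.*?)</PRE>", r.text, RE.DOTALL | RE.IGNORECASE)

if match:
    # the tabular results as plain text
    print(match.group(1))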

If you get a response that looks like the first response, your job is still queued and not yet ready. Wait for a few seconds (or minutes/hours/days, depending on the service) and try again.
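
The waiting can be automated with a simple polling loop. Below is a minimal sketch; the check for the text 'status: queued' is an assumption based on the comment line shown earlier, and a real implementation should also handle failed jobs and other statuses.


In [ ]:
import time

# poll the service until the job is no longer queued
# (an upper bound on attempts avoids an infinite loop)
for attempt in range(10):
    r = R.get(url, params={'jobid': jobid})
    # assumption: a queued job is reported in an HTML comment like
    # "<!-- jobid: ... status: queued -->"
    if "status: queued" not in r.text:
        break
    # wait before trying again to avoid hammering the service
    time.sleep(10)

print(r.text)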