The new search service under development at PDBe is powered by Apache Solr.
A pre-release version of the user interface is available here: http://wwwdev.ebi.ac.uk/pdbe/entry/search/index
For programmatic usage, a Solr instance is available here: http://wwwdev.ebi.ac.uk/pdbe/search/pdb
Please note that the search service will be released in 2015. From that point on, it would be better to use URLs similar to those above, but hosted on www instead of wwwdev.
To avoid writing and encoding long Solr URLs by hand, we will use a Solr client library called mysolr. It is lightweight and easy to install; for example, I installed it on my Red Hat (Enterprise 6.6) machine as follows:
easy_install mysolr==0.7
There are many such client libraries available for Python as well as for other languages.
Let us now make a simple query and look up a PDB entry.
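To see what a client library saves us from, here is a minimal sketch that builds an encoded query URL by hand using only the Python standard library. The `/select` path and the `wt=json` parameter are standard Solr conventions; the helper name `build_select_url` is made up for illustration, and the base URL is simply the one quoted above.

```python
from urllib.parse import urlencode

# Base URL of the Solr instance, as quoted above.
PDBE_SOLR_URL = "http://wwwdev.ebi.ac.uk/pdbe/search/pdb"

def build_select_url(base_url, **params):
    """Hypothetical helper: return an encoded Solr select URL."""
    params.setdefault("wt", "json")  # ask Solr for a JSON response
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# The colon, brackets and spaces in the range query all need encoding.
url = build_select_url(PDBE_SOLR_URL, q="resolution:[1 TO 2]", rows=0)
print(url)
```

Even this short range query becomes hard to read once encoded, which is the main reason to let a client library assemble the request.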
In [1]:
PDBE_SOLR_URL = "http://www.ebi.ac.uk/pdbe/search/pdb"
# the same query as a raw URL: https://www.ebi.ac.uk/pdbe/search/pdb/select?q=pdb_id:2qk9&wt=json
from mysolr import Solr
solr = Solr(PDBE_SOLR_URL, version=4)
response = solr.search(q='pdb_id:2qk9')  # look up a single PDB entry
documents = response.documents
print("Number of results:", len(documents))
#fields = response.documents[0].keys()
#print("Number of fields in the documents:", [len(rd.keys()) for rd in documents])
response.raw_content
Out[1]:
There are 3 documents in the Solr response for a single PDB id, and each has more than 75 fields. At this point, it is essential to understand what a document represents and contains before proceeding further.
The PDBe Solr instance serves documents based on polymeric entities in PDB entries, i.e. each document indexed by Solr represents a polymeric molecule of type protein, sugar, DNA, RNA or DNA/RNA hybrid. This is why for entry 2qk9 we get 3 documents in the response: one each for the protein, RNA and DNA molecules in that entry.
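Because each document describes one entity rather than one entry, it is often handy to regroup documents by entry. The following is a minimal sketch of that idea. The `sample_documents` list is an illustrative stand-in, not real response data, and `group_entities_by_entry` is a hypothetical helper; only the `pdb_id` and `entity_id` field names are taken from this tutorial.

```python
from collections import defaultdict

# Illustrative stand-ins for Solr documents (not real response data):
# one document per polymeric entity, each carrying its entry's pdb_id.
sample_documents = [
    {"pdb_id": "2qk9", "entity_id": 1},
    {"pdb_id": "2qk9", "entity_id": 2},
    {"pdb_id": "2qk9", "entity_id": 3},
]

def group_entities_by_entry(documents):
    """Map each pdb_id to the sorted list of entity_ids found for it."""
    entities = defaultdict(list)
    for doc in documents:
        entities[doc["pdb_id"]].append(doc["entity_id"])
    return {pdb_id: sorted(ids) for pdb_id, ids in entities.items()}

grouped = group_entities_by_entry(sample_documents)
print(grouped)
```

The number of keys in the result is the number of entries, while the total number of values is the number of entities, which is the distinction the rest of this tutorial relies on.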
Fields in PDBe's entity-based Solr documents cover a wide range of properties, such as the entry's experimental details, details of deposition and primary publication, the entity's taxonomy, the entry's quality, the entity's cross-references to UniProt and popular domain databases, biological assembly, etc. They are documented here: http://wwwdev.ebi.ac.uk/pdbe/api/doc/search.html
It is also useful at this point to understand a little more about Solr querying. Solr has a rich and complex query syntax, described at http://wiki.apache.org/solr/CommonQueryParameters and elsewhere.
The fields of immediate relevance to us in this tutorial are:
Solr's capabilities, combined with the wide-ranging description in each entity document, let us write powerful queries that find precisely the entries or polymers of interest.
Now let us write a query to find entities containing a Pfam domain called "Lipocalin" in X-ray entries of decent resolution (1 Å to 2 Å).
In [2]:
def join_with_AND(query_params):
    '''convenience function to create query string with AND'''
    return " AND ".join(["%s:%s" % (k, v) for k, v in query_params.items()])

def execute_solr_query(query, query_fields):
    '''convenience function to run a query and report how many entities and entries matched'''
    query["q"] = join_with_AND(query_fields)  # add q
    response = solr.search(**query)
    documents = response.documents
    print("Found %d matching entities in %d entries." % (len(documents), len({rd["pdb_id"] for rd in documents})))
    return documents

query_detail = {
    "pfam_name" : "Lipocalin",
    "resolution" : "[1 TO 2]",
}
query = {
    "rows" : pow(10,8),  # i.e. all matching documents are required in response
    "fl" : "pdb_id, entity_id",  # restrict the returned documents to these fields only
}
docs = execute_solr_query(query, query_detail)
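To make it clearer what is actually sent to Solr, the short sketch below repeats the logic of the convenience function above and prints the resulting `q` string. It runs offline and does not contact the service.

```python
def join_with_AND(query_params):
    """Same logic as the convenience function above."""
    return " AND ".join("%s:%s" % (k, v) for k, v in query_params.items())

query_detail = {
    "pfam_name": "Lipocalin",
    "resolution": "[1 TO 2]",
}

q = join_with_AND(query_detail)
print(q)  # pfam_name:Lipocalin AND resolution:[1 TO 2]
```

Note that this simple joining does no quoting or escaping, so a value containing spaces or Solr special characters would presumably need to be wrapped in double quotes by the caller.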
Let us narrow the search down to proteins of human origin (taxonomy id 9606).
In [3]:
query_detail = {
    "pfam_name" : "Lipocalin",
    "resolution" : "[1 TO 2]",
    "tax_id" : "9606",
}
query = {
    "rows" : pow(10,8),  # i.e. all matching documents are required in response
    "fl" : "pdb_id, entity_id",  # restrict the returned documents to these fields only
}
docs = execute_solr_query(query, query_detail)
Let us narrow the search down further to entries deposited by Kleywegt.
In [4]:
query_detail = {
    "pfam_name" : "Lipocalin",
    "resolution" : "[1 TO 2]",
    "tax_id" : "9606",
    "entry_authors" : "*Kleywegt*",
}
query = {
    "rows" : pow(10,8),  # i.e. all matching documents are required in response
    "fl" : "pdb_id, entity_id",  # restrict the returned documents to these fields only
}
docs = execute_solr_query(query, query_detail)
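Requesting `pow(10,8)` rows is convenient for small result sets like these. For larger ones, a common alternative is to page through the results with Solr's standard `start` and `rows` parameters. The sketch below is an assumption about how one might do that and is not part of the tutorial's own code: `fetch_page` is a hypothetical callable, and `fake_fetch_page` stands in for the real service with made-up documents so that the example runs offline.

```python
def iter_documents(fetch_page, page_size=100):
    """Yield documents page by page until a short (or empty) page is returned.

    fetch_page(start, rows) is a hypothetical callable returning a list of
    documents; in practice it might wrap a call that passes Solr's
    `start` and `rows` parameters along with the query.
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        for doc in page:
            yield doc
        if len(page) < page_size:  # last page reached
            break
        start += page_size

# Offline stand-in for the real service: 250 made-up documents.
all_docs = [{"pdb_id": "entry%d" % i, "entity_id": 1} for i in range(250)]

def fake_fetch_page(start, rows):
    """Return the requested slice of the made-up documents."""
    return all_docs[start:start + rows]

paged_docs = list(iter_documents(fake_fetch_page, page_size=100))
print(len(paged_docs))  # 250
```

Paging keeps each response small, at the cost of several round trips to the server.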
Can you now query the PDBe Solr instance to find entries that match the following criteria?