Introduction

The new search service under development at PDBe is powered by Apache Solr.

A pre-release version of the user interface is available here: http://wwwdev.ebi.ac.uk/pdbe/entry/search/index

For programmatic usage, a Solr instance is available here: http://wwwdev.ebi.ac.uk/pdbe/search/pdb

Please note that the search service will be released in 2015. From that point on, it is better to use URLs like those above, but hosted on www instead of wwwdev.

Getting started

To avoid writing long Solr URLs by hand and having to URL-encode them, we will use a Solr client library called mysolr. It is lightweight and easy to install; for example, I installed it on my Red Hat (Enterprise 6.6) machine as follows:

easy_install mysolr==0.7

There are many such client libraries available for Python as well as for other languages.
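
To see what a client library saves us from, here is a minimal sketch of assembling such a URL by hand with Python's standard library. The helper name build_select_url is hypothetical, and the /select path with wt=json is assumed to follow the usual Solr request layout; the snippet only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

PDBE_SOLR_URL = "http://www.ebi.ac.uk/pdbe/search/pdb"

def build_select_url(base_url, **params):
    """Return a Solr select URL with URL-encoded query parameters.

    Hypothetical helper for illustration; assumes the /select handler.
    """
    params.setdefault("wt", "json")  # ask for a JSON response by default
    # sort the parameters so that the resulting URL is reproducible
    return base_url.rstrip("/") + "/select?" + urlencode(sorted(params.items()))

url = build_select_url(PDBE_SOLR_URL, q="status:REL", rows=0)
print(url)
```

Note how the colon in the query has to be percent-encoded; with longer queries this quickly becomes tedious, which is the reason for using a client library.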

Let us now make a simple query: let us count the documents whose status is REL, without fetching any of them (rows=0).


In [1]:
PDBE_SOLR_URL = "http://www.ebi.ac.uk/pdbe/search/pdb"
# the equivalent raw request is https://www.ebi.ac.uk/pdbe/search/pdb/select?rows=0&q=status:REL&wt=json

from mysolr import Solr
solr = Solr(PDBE_SOLR_URL, version=4)

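# rows=0 asks Solr for the match count only, without returning any documents
response = solr.search(q='status:REL', rows=0)

# with rows=0 the document list is empty; the total is numFound in the raw response
documents = response.documents
print("Number of results:", len(documents))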

#fields = response.documents[0].keys()
#print("Number of fields in the documents:", [len(rd.keys()) for rd in documents])

response.raw_content


Number of results: 0
Out[1]:
{'response': {'docs': [], 'numFound': 237765, 'start': 0},
 'responseHeader': {'QTime': 3,
  'params': {'q': 'status:REL', 'rows': '0', 'wt': 'json'},
  'status': 0}}
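
Note that the document list is empty because rows=0, while the total number of matches is reported separately as numFound. The following is a minimal sketch of reading both values from the raw response; it works on a copy of the dictionary shown above rather than on a live query, and the helper name summarize_response is hypothetical.

```python
# Copy of the raw response shown above, so that no network access is needed.
raw_content = {
    "response": {"docs": [], "numFound": 237765, "start": 0},
    "responseHeader": {
        "QTime": 3,
        "params": {"q": "status:REL", "rows": "0", "wt": "json"},
        "status": 0,
    },
}

def summarize_response(raw):
    """Return (total matches in the index, documents actually returned)."""
    body = raw["response"]
    return body["numFound"], len(body["docs"])

total, returned = summarize_response(raw_content)
print("Matches in index: %d, documents returned: %d" % (total, returned))
```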

A query for a single PDB id, such as 2qk9, returns 3 documents in the Solr response, and each has more than 75 fields. At this juncture, it is essential to understand what a document represents and contains before proceeding further.

Entity document

The PDBe Solr instance serves documents based on polymeric entities in PDB entries, i.e. each document indexed by Solr represents a polymeric molecule of type protein, sugar, DNA, RNA or DNA/RNA hybrid. This is why for entry 2qk9 we get 3 documents in the response: one each for the protein, RNA and DNA molecules in that entry.
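
Because one entry can yield several documents, it is often handy to group the returned documents by entry. Here is a minimal sketch of such grouping; the documents are made up for illustration (the entity_id values 1 to 3 are an assumption, and real documents carry many more fields), and the helper name group_entities_by_entry is hypothetical.

```python
from collections import defaultdict

# Made-up documents for illustration only: one per polymeric entity.
documents = [
    {"pdb_id": "2qk9", "entity_id": 1},
    {"pdb_id": "2qk9", "entity_id": 2},
    {"pdb_id": "2qk9", "entity_id": 3},
]

def group_entities_by_entry(docs):
    """Map each pdb_id to the list of entity ids found for it."""
    entries = defaultdict(list)
    for doc in docs:
        entries[doc["pdb_id"]].append(doc["entity_id"])
    return dict(entries)

by_entry = group_entities_by_entry(documents)
for pdb_id, entity_ids in sorted(by_entry.items()):
    print("%s: %d entity documents %s" % (pdb_id, len(entity_ids), sorted(entity_ids)))
```

The number of keys in the grouped dictionary is the number of entries, while the number of documents is the number of entities.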

The fields in PDBe's entity-based Solr document cover a wide range of properties, such as the entry's experimental details, details of deposition and primary publication, the entity's taxonomy, the entry's quality, the entity's cross-references to UniProt and popular domain databases, the biological assembly, etc. They are documented here: http://wwwdev.ebi.ac.uk/pdbe/api/doc/search.html

Solr features

It is also useful now to understand a little more about Solr querying. Solr has a rich and complex query syntax, described at http://wiki.apache.org/solr/CommonQueryParameters and elsewhere.

The query parameters of immediate relevance to us in this tutorial are:

  • q - the query itself. There is a lot of flexibility in describing a query, e.g. fields, wildcards, case-insensitivity, logical operators, ranges, etc.
  • rows - the number of results returned by Solr. It needs to be set explicitly in mysolr because it defaults to 10. It is also useful when only part of the results is desired.
  • fl - the fields returned in each document. This is useful for reducing the size of the response.
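
As a minimal sketch of how these three parameters fit together, the snippet below assembles them into a dictionary of the kind that is passed to a client library. The field names pfam_name, resolution, pdb_id and entity_id are the ones used in the examples of this tutorial; the helper solr_range is hypothetical, and nothing is sent to the server.

```python
def solr_range(low, high):
    """Format an inclusive Solr range clause, e.g. [1 TO 2]."""
    return "[%s TO %s]" % (low, high)

params = {
    # field:value clauses combined with a logical operator and a range
    "q": "pfam_name:Lipocalin AND resolution:" + solr_range(1, 2),
    # return at most 5 documents
    "rows": 5,
    # return only these fields in each document
    "fl": "pdb_id,entity_id",
}

for name in ("q", "rows", "fl"):
    print("%-4s = %s" % (name, params[name]))
```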

Solr's capabilities, combined with the wide-ranging description in the entity document, can help us write really powerful Solr queries that find precisely the entries or polymers of interest.

Examples

Now let us write a query to find entities containing a Pfam domain called "Lipocalin" in X-ray entries of decent resolution (1Å - 2Å).


In [2]:
def join_with_AND(query_params):
    '''Build a Solr query string by joining field:value pairs with AND.'''
    return " AND ".join(["%s:%s" % (k, v) for k, v in query_params.items()])

def execute_solr_query(query, query_fields):
    '''Run a search built from query_fields and return the matching documents.

    Note that this sets query["q"] in place on the dictionary passed in.'''
    query["q"] = join_with_AND(query_fields)  # add q
    response = solr.search(**query)
    documents = response.documents
    # each document is one entity; the distinct pdb_id values give the entry count
    print("Found %d matching entities in %d entries." % (len(documents), len({rd["pdb_id"] for rd in documents})))
    return documents

query_detail = {                        
    "pfam_name"  : "Lipocalin",
    "resolution" : "[1 TO 2]",
}
query = {                                                                       
    "rows" : pow(10,8), # i.e. all matching documents are required in response
    "fl"   : "pdb_id, entity_id", # restrict the returned documents to these fields only
}

docs = execute_solr_query(query, query_detail)


Found 292 matching entities in 292 entries.

Let us narrow down to proteins of human origin.


In [3]:
query_detail = {                        
    "pfam_name"  : "Lipocalin",
    "resolution" : "[1 TO 2]",
    "tax_id"     : "9606",
}
query = {                                                                       
    "rows" : pow(10,8), # i.e. all matching documents are required in response
    "fl"   : "pdb_id, entity_id", # restrict the returned documents to these fields only
}

docs = execute_solr_query(query, query_detail)


Found 171 matching entities in 171 entries.

Let us narrow down further to entries that have Kleywegt among their authors.


In [4]:
query_detail = {                        
    "pfam_name"     : "Lipocalin",
    "resolution"    : "[1 TO 2]",
    "tax_id"        : "9606",
    "entry_authors" : "*Kleywegt*",
}
query = {                                                                       
    "rows" : pow(10,8), # i.e. all matching documents are required in response
    "fl"   : "pdb_id, entity_id", # restrict the returned documents to these fields only
}

docs = execute_solr_query(query, query_detail)


Found 2 matching entities in 2 entries.
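
The join_with_AND helper inserts values verbatim, which works for single words, ranges and wildcards, but not for values that contain spaces: in Solr query syntax a multi-word phrase has to be enclosed in double quotes. Here is a minimal sketch of a variant that adds the quotes when needed. The names format_value and join_with_AND_quoted are hypothetical, and some_text_field is a placeholder, not a real PDBe field.

```python
def format_value(value):
    """Quote a value for a Solr query if it contains whitespace."""
    text = str(value)
    if text.startswith(("[", "{")):
        return text  # range expression such as [1 TO 2]; leave untouched
    if any(ch.isspace() for ch in text):
        # escape backslashes and double quotes, then wrap as a phrase
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return '"%s"' % escaped
    return text

def join_with_AND_quoted(query_params):
    """Like join_with_AND, but quotes multi-word values; fields are sorted."""
    return " AND ".join(
        "%s:%s" % (field, format_value(value))
        for field, value in sorted(query_params.items())
    )

query_string = join_with_AND_quoted({
    "pfam_name": "Lipocalin",
    "resolution": "[1 TO 2]",
    "some_text_field": "two words",  # placeholder field name
})
print(query_string)
```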

Your turn!

Can you now query the PDBe Solr instance to find entries that match the following criteria?

  • entries published in Nature and containing a transmembrane protein.
  • the number of SCOP domain families in entries that have a homo-tetramer as the most likely assembly.