Search with facetting and grouping

Introduction

In search_introduction, we saw how basic selectors can be progressively added to a Solr query to find entries of interest.
Now we will see how facetting, grouping and pivoting can be used to find interesting facts about your favorite protein.

Getting started

Let us set up a logger and create a mysolr instance for the Solr core.



In [4]:

    
from mysolr import Solr
PDBE_SOLR_URL = "http://wwwdev.ebi.ac.uk/pdbe/search/pdb"
solr = Solr(PDBE_SOLR_URL, version=4)

UNLIMITED_ROWS = 10000000 # large row limit, needed because only 10 rows are returned by default

import logging, sys
#reload(logging) # reload is just a hack to make logging work in the notebook, it's usually unnecessary
logging.basicConfig( level=logging.INFO, stream=sys.stdout,
        format='LOG|%(asctime)s|%(levelname)s  %(message)s', datefmt='%d-%b-%Y %H:%M:%S' )
logging.getLogger("requests").setLevel(logging.WARNING)

def join_with_AND(selectors) :
    return " AND ".join(
        ["%s:%s" % (k,v) for k,v in selectors]
    )
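To make the role of join_with_AND concrete, here is a minimal sketch of the query string it builds. The function is repeated so the snippet runs on its own, and the two selectors are the kind used in this notebook; nothing is sent to Solr.

```python
def join_with_AND(selectors):
    # Each selector is a (field, value) pair rendered as "field:value".
    return " AND ".join(["%s:%s" % (k, v) for k, v in selectors])

selectors = [
    ("molecule_name", "/.*[Cc]arbonic.*[aA]nhydrase.*/"),
    ("NOT molecule_name", "(/.*Putative.*/ OR /.*Inhibitor.*/)"),
]

# The resulting string is what gets passed to Solr as the "q" parameter.
query = join_with_AND(selectors)
print(query)
```

Because a selector key can carry a prefix such as NOT, negated clauses can be expressed without any change to the helper.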

Find your protein

Identifying previous instances of your protein in the PDB is not easy, because the molecule names given by depositors can differ slightly. The SIFTS project assigns UniProt cross-references to proteins in PDB entries and names them consistently. The following function searches and facets on the UniProt name to find proteins of interest. Note how the facet options are used to list all distinct values of molecule_name.



In [5]:

    
def molecule_name_facet_search(selectors) :
    response = solr.search(**{
        "rows" : UNLIMITED_ROWS, "fl" : "pdb_id, entity_id", "q" : join_with_AND(selectors),
        "facet" : "true", "facet.limit" : UNLIMITED_ROWS, "facet.mincount" : 1,
        "facet.field" : "molecule_name",
    })
    num_mols = len(response.documents)
    mol_name_counts = response.facets['facet_fields']['molecule_name']
    logging.info("%d molecules found with %d distinct molecule_names." % (num_mols, len(mol_name_counts.keys())))
    for mol_name, nmol in mol_name_counts.items() :
        logging.info("%3d molecules are named as %s" % (nmol, mol_name))
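As a rough picture of what a field facet computes, the following sketch counts distinct values of one field over a handful of documents in plain Python. The documents and their identifiers are made up for illustration, and facet_counts is a hypothetical helper, not part of mysolr or Solr.

```python
from collections import Counter

# Made-up documents, one per molecule (illustrative only, not real PDB data).
docs = [
    {"pdb_id": "aaaa", "entity_id": 1, "molecule_name": "Carbonic anhydrase 2"},
    {"pdb_id": "bbbb", "entity_id": 1, "molecule_name": "Carbonic anhydrase 2"},
    {"pdb_id": "cccc", "entity_id": 1, "molecule_name": "Carbonic anhydrase 1"},
]

def facet_counts(documents, field, mincount=1):
    # Count how many documents carry each distinct value of the field,
    # keeping only values seen at least mincount times.
    counts = Counter(doc[field] for doc in documents if field in doc)
    return {value: n for value, n in counts.items() if n >= mincount}

name_counts = facet_counts(docs, "molecule_name")
for name, n in name_counts.items():
    print("%3d molecules are named as %s" % (n, name))
```

Solr does this counting on the server over all matching documents, which is why the facet options alone are enough to list every distinct name.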

Let us assume we are interested in carbonic anhydrases. We write the protein name as a regular expression that allows either case at the start of each word.



In [6]:

    
molecule_name_facet_search([
    (    'molecule_name'  ,  '/.*[Cc]arbonic.*[aA]nhydrase.*/'),
])









    



LOG|11-Jul-2018 09:53:09|INFO  0 molecules found with 0 distinct molecule_names.

Note that there are some unintended hits: one putative protein and one inhibitor. Let us filter those out.



In [7]:

    
selectors = [
    (    'molecule_name'  ,  '/.*[Cc]arbonic.*[aA]nhydrase.*/'),
    ('NOT molecule_name'  ,  '(/.*Putative.*/ OR /.*Inhibitor.*/)'),
]
molecule_name_facet_search(selectors)









    



LOG|11-Jul-2018 09:53:13|INFO  0 molecules found with 0 distinct molecule_names.

We can also sharpen our search considerably by using annotations such as GO or SCOP. The filters should, however, strike a balance between removing spurious hits and keeping genuine ones. Good filters are often found only after several trials.

Count entries by experiment type

Now let us see a summary of the experiment types that have been used to solve carbonic anhydrases. Since the experimental method is a property of the entry, not of the molecules within it, we need to group on pdb_id and facet in a group-sensitive way so that the counts we get are for entries.
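To see why grouping matters, here is a small plain-Python sketch comparing per-molecule counts with per-entry counts. The documents are made up for illustration and grouped_facet_counts is a hypothetical helper; it only mimics the idea of a group-sensitive facet, where each group contributes at most once to a facet value.

```python
from collections import Counter

# Made-up documents: entry "aaaa" has two molecules, the others have one.
docs = [
    {"pdb_id": "aaaa", "entity_id": 1, "experimental_method": "X-ray diffraction"},
    {"pdb_id": "aaaa", "entity_id": 2, "experimental_method": "X-ray diffraction"},
    {"pdb_id": "bbbb", "entity_id": 1, "experimental_method": "X-ray diffraction"},
    {"pdb_id": "cccc", "entity_id": 1, "experimental_method": "Solution NMR"},
]

def grouped_facet_counts(documents, facet_field, group_field):
    # Count each (group, facet value) pair only once.
    seen = set()
    counts = Counter()
    for doc in documents:
        key = (doc[group_field], doc[facet_field])
        if key not in seen:
            seen.add(key)
            counts[doc[facet_field]] += 1
    return dict(counts)

plain = dict(Counter(doc["experimental_method"] for doc in docs))
grouped = grouped_facet_counts(docs, "experimental_method", "pdb_id")
print("per molecule:", plain)
print("per entry:   ", grouped)
```

The per-molecule count reports three X-ray hits, while the per-entry count reports two, which is the number we are after.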



In [8]:

    
response = solr.search(**{
    "rows" : UNLIMITED_ROWS, "fl" : "pdb_id, entity_id",
    "q" : join_with_AND(selectors),
    "facet" : "true", "facet.limit" : UNLIMITED_ROWS, "facet.mincount" : 1,
    "facet.field" : "experimental_method",
    "group" : "true", "group.facet" : "true",
    "group.field" : "pdb_id",
})

expt_counts = response.facets['facet_fields']['experimental_method']
logging.info("There are %d experimental methods with which this protein's structure has been studied." % len(expt_counts))
for expt, count in expt_counts.items() :
    logging.info("%s : %d" % (expt,count))



LOG|11-Jul-2018 09:53:16|INFO  There are 0 experimental methods with which this protein's structure has been studied.

Count entries by year of deposition

Let us now facet on year of deposition to see how many carbonic anhydrase entries were deposited in each year.



In [9]:

    
response = solr.search(**{
    "rows" : UNLIMITED_ROWS, "fl" : "pdb_id, entity_id",
    "q" : join_with_AND(selectors),
    "facet" : "true", "facet.limit" : UNLIMITED_ROWS, "facet.mincount" : 1,
    "facet.field" : "deposition_year",
    "group" : "true", "group.facet" : "true",
    "group.field" : "pdb_id",
})
year_counts = response.facets['facet_fields']['deposition_year']
logging.info("There are %d years in which this protein's structure has been studied." % len(year_counts))
for year in sorted(year_counts.keys(), key=lambda x : int(x)) :
    logging.info("%s : %d" % (year,year_counts[year]))









    



LOG|11-Jul-2018 09:53:19|INFO  There are 0 years in which this protein's structure has been studied.

Note that we do not have to facet on one field at a time. We could have facetted on several fields individually in the same call, since Solr accepts more than one facet.field parameter per request and returns a separate set of counts for each field.
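As a sketch of what a multi-field facet request looks like on the wire, the snippet below encodes a parameter dictionary in which facet.field holds a list, using only the Python 3 standard library. Solr's documented form is a repeated facet.field parameter; whether mysolr forwards a list value in exactly this way is an assumption that has not been checked here.

```python
from urllib.parse import urlencode

params = {
    "q": "molecule_name:/.*[Cc]arbonic.*[aA]nhydrase.*/",
    "facet": "true",
    "facet.mincount": 1,
    # A list value is expanded into one facet.field=... pair per field.
    "facet.field": ["experimental_method", "deposition_year"],
}

# doseq=True makes urlencode repeat the key for every item in a list.
query_string = urlencode(params, doseq=True)
print(query_string)
```

Each field then gets its own entry under facet_fields in the response, so the counts can be read out exactly as in the single-field cells above.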

Facets can also be range based, which is useful for numeric fields such as resolution, year, or length of the crystallographic cell.



In [10]:

    
response = solr.search(**{
    "rows" : UNLIMITED_ROWS, "fl" : "pdb_id, entity_id",
    "q" : join_with_AND(selectors),
    "facet" : "true", "facet.limit" : UNLIMITED_ROWS, "facet.mincount" : 1,
    "facet.field" : "resolution",
    "facet.range" : "resolution",
    "f.resolution.facet.range.start" : "0.0",
    "f.resolution.facet.range.end" : "100",
    "f.resolution.facet.range.gap" : "0.5",
    "f.resolution.facet.range.other" : "between",
    "f.resolution.facet.range.include" : "upper",
    "group" : "true", "group.facet" : "true",
    "group.field" : "pdb_id",
})

import collections

# the range counts arrive as a flat list: [bin_start, count, bin_start, count, ...]
resol_counts = response.facets['facet_ranges']['resolution']['counts']
resol_counts = collections.OrderedDict(zip(resol_counts[0::2], resol_counts[1::2]))
logging.info("Resolutions at which this protein has been solved are as follows:")
# float() replaces string.atof, which no longer exists in Python 3
for resol in sorted(resol_counts.keys(), key=float) :
    logging.info("%3d entries in resolution bin starting %s" % (resol_counts[resol], resol))



LOG|11-Jul-2018 09:53:22|INFO  Resolutions at which this protein has been solved are as follows:
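To illustrate how start, end and gap define the bins, here is a plain-Python sketch of range counting in which each bin includes its upper bound, as requested by facet.range.include=upper in the query above. The resolution values are made up, range_facet_upper is a hypothetical helper, and unlike a query with facet.mincount set to 1 it also reports empty bins.

```python
def range_facet_upper(values, start, end, gap):
    """Count values into (lo, lo + gap] bins between start and end."""
    n_bins = int(round((end - start) / gap))
    counts = [0] * n_bins
    for value in values:
        for i in range(n_bins):
            lo = start + i * gap
            # Lower bound excluded, upper bound included.
            if lo < value <= lo + gap:
                counts[i] += 1
                break
    return [(start + i * gap, counts[i]) for i in range(n_bins)]

# Made-up resolutions in Angstrom, for illustration only.
resolutions = [1.2, 1.5, 2.0, 2.0, 2.6]
bins = range_facet_upper(resolutions, start=0.0, end=3.0, gap=0.5)
for lo, n in bins:
    print("%3d entries in resolution bin starting %.1f" % (n, lo))
```

Note that a structure at exactly 2.0 falls in the bin starting at 1.5, because the upper edge belongs to the bin.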

Hierarchical facetting

Facets can be used hierarchically too, e.g. facet first on resolution, then on year, and so on. Unfortunately mysolr does not support this feature, but the good news is that a little Python over the returned documents achieves the same effect. For example, let us find the distribution of resolution versus deposition year in this set of entries.



In [12]:

    
response = solr.search(**{
    "rows" : UNLIMITED_ROWS,
    "fl" : "pdb_id, entity_id, deposition_year, resolution",
    "q" : join_with_AND(selectors),
})

import collections, decimal, itertools

resbin_width = 0.5
def resol_bin(resol) :
    # lower edge of the resolution bin containing resol
    return decimal.Decimal(int(resol/resbin_width) * resbin_width)

yearbin_width = 5
def depyear_bin(year) :
    # lower edge of the year bin; // keeps this an integer division in Python 3 too
    return (year // yearbin_width) * yearbin_width

entry_counted = set()
counts = collections.defaultdict( lambda : collections.defaultdict(int) )
for adoc in response.documents :
    # documents are per molecule, so count each entry only once;
    # documents without a resolution value are skipped (assumed possible, e.g. for NMR entries)
    if adoc['pdb_id'] in entry_counted or 'resolution' not in adoc :
        continue
    entry_counted.add(adoc['pdb_id'])
    res_bin = resol_bin(adoc['resolution'])
    year_bin = depyear_bin(adoc['deposition_year'])
    counts[year_bin][res_bin] += 1

year_bins = sorted(counts.keys())
resol_bins = sorted(set( itertools.chain(*[v.keys() for v in counts.values()]) ))

# header row of resolution bins, then one row of entry counts per year bin
logging.info("          " + "  ".join("%.1f-%.1f" % (rb,float(rb)+resbin_width) for rb in resol_bins))
for year in year_bins :
    row = [counts[year].get(resol, 0) for resol in resol_bins]
    logging.info("%d-%d" % (year,year+yearbin_width) + "    ".join("%5d" % count for count in row))









    



LOG|11-Jul-2018 11:59:04|INFO

Note how the number of higher-resolution structures has increased over the years.

Your turn!

Find entries containing a protein of interest to you, and facet by organism, genus, etc.



In [ ]: