G2P Genotype to Phenotype example notebook.

This example illustrates how to access the GA4GH Genotype to Phenotype service. The GA4GH G2P allows the researcher to query features, phenotypes and their associated evidence.

The examples are based on the 1000 genomes server data set with a full complement of G2P data from the Monarch project.

Setup


Initialize the client

In this step we create a client object which will be used to communicate with the server. It is initialized using the URL.


In [1]:
from ga4gh.client import client
ga4gh_endpoint = "http://1kgenomes.ga4gh.org"
c = client.HttpClient(ga4gh_endpoint)

Search for hierarchy identifiers: FeatureSet and PhenotypeAssociationSet

The G2P dataset exists within the hierarchy of Ga4GH datasets and featuresets. This call returns phenotype association sets hosted by the API. Observe that we are querying all datasets hosted in the endpoint. The identifiers for the featureset and phenotype association set are used by all subsequent API calls.


In [29]:
datasets = c.search_datasets()
phenotype_association_set_id = None
phenotype_association_set_name = None
for  dataset in datasets:
  phenotype_association_sets = c.search_phenotype_association_sets(dataset_id=dataset.id)
  for phenotype_association_set in phenotype_association_sets:
    phenotype_association_set_id = phenotype_association_set.id
    phenotype_association_set_name = phenotype_association_set.name
    print 'Found G2P phenotype_association_set:', phenotype_association_set.id, phenotype_association_set.name
    break

assert phenotype_association_set_id
assert phenotype_association_set_name

feature_set_id = None
datasets = c.search_datasets()
for  dataset in datasets:
  featuresets = c.search_feature_sets(dataset_id=dataset.id)
  for featureset in featuresets:
    if phenotype_association_set_name in featureset.name:
      feature_set_id = featureset.id
      print 'Found G2P feature_set:', feature_set_id
      break        
assert feature_set_id


Found G2P phenotype_association_set: WyIxa2dlbm9tZXMiLCJjZ2QiXQ cgd
Found G2P feature_set: WyIxa2dlbm9tZXMiLCJjZ2QiXQ

Use case: Find evidence for a Genomic Feature


Search for Features by location

Using the feature set id returned above, the following request returns a list of features that exactly match a location


In [14]:
feature_generator = c.search_features(feature_set_id=feature_set_id,
                        reference_name="chr7",
                        start=55249005,
                        end=55249006
                    )

features = list(feature_generator)
assert len(features) == 1
print "Found {} features in G2P feature_set {}".format(len(features),feature_set_id)
feature = features[0]
print [feature.name,feature.gene_symbol,feature.reference_name,feature.start,feature.end]


Found 1 features in G2P feature_set WyIxa2dlbm9tZXMiLCJjZ2QiXQ
[u'EGFR S768I missense mutation', u'COSM6241', u'chr7', 55249005L, 55249006L]

Search features by name

Alternatively, if the location is not known, we can query using the name of the feature. Using the feature set id returned above, the following request returns a list of features that exactly match a given name - 'EGFR S768I missense mutation'.


In [31]:
feature_generator = c.search_features(feature_set_id=feature_set_id, name='EGFR S768I missense mutation')
features = list(feature_generator)
assert len(features) == 1
print "Found {} features in G2P feature_set {}".format(len(features),feature_set_id)
feature = features[0]
print [feature.name,feature.gene_symbol,feature.reference_name,feature.start,feature.end]


Found 1 features in G2P feature_set WyIxa2dlbm9tZXMiLCJjZ2QiXQ
[u'EGFR S768I missense mutation', u'COSM6241', u'chr7', 55249005L, 55249006L]

Get evidence associated with that feature.

Once we have looked up the feature, we can then search for all evidence associated with that feature.


In [32]:
feature_phenotype_associations =  c.search_genotype_phenotype(
                                    phenotype_association_set_id=phenotype_association_set_id,
                                    feature_ids=[f.id  for f in features])
associations = list(feature_phenotype_associations)
assert len(associations) >= len(features)
print "There are {} associations".format(len(associations))
print "\n".join([a.description for a in associations])


There are 2 associations
Association: genotype:[COSM6241] phenotype:[Adenosquamous carcinoma with response to therapy] environment:[irreversible EGFR TKIs] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22753918|http://www.ncbi.nlm.nih.gov/pubmed/22753918]
Association: genotype:[COSM6241] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]

Display evidence

Explore the evidence. For example, a publication.


In [53]:
from IPython.display import IFrame
IFrame(associations[0].evidence[0].info['publications'][0], "100%",300)


Out[53]:

Use Case: Find evidence for a Phenotype


Search a phenotype

Alternatively, a researcher can query for a phenotype. In this case by the phenotype's description matching 'Adenosquamous carcinoma .*'


In [19]:
phenotypes_generator = c.search_phenotype(
                phenotype_association_set_id=phenotype_association_set_id,
                description="Adenosquamous carcinoma .*"
                )
phenotypes = list(phenotypes_generator)

assert len(phenotypes) >= 0
print "\n".join(set([p.description for p in phenotypes]))


Adenosquamous carcinoma with response to therapy
Adenosquamous carcinoma with decreased sensitivity to therapy
Adenosquamous carcinoma with sensitivity to therapy

Get evidence associated with those phenotypes.

The researcher can use those phenotype identifiers to query for evidence associations.


In [33]:
feature_phenotype_associations =  c.search_genotype_phenotype(
                                    phenotype_association_set_id=phenotype_association_set_id,
                                    phenotype_ids=[p.id for p in phenotypes])
associations = list(feature_phenotype_associations)
assert len(associations) >= len(phenotypes)
print "There are {} associations. First five...".format(len(associations))
print "\n".join([a.description for a in associations][:5])


There are 38 associations. First five...
Association: genotype:[FGFR2 K660N missense mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[FGFR inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/25035393]
Association: genotype:[EGFR G719S missense mutation] phenotype:[Adenosquamous carcinoma with response to therapy] environment:[irreversible EGFR TKIs] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22753918|http://www.ncbi.nlm.nih.gov/pubmed/22753918]
Association: genotype:[COSM6213] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[EGFR exon 19 p.729-761 insertion mutation] phenotype:[Adenosquamous carcinoma with response to therapy] environment:[erlotinib] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22190593]
Association: genotype:[EGFR G719D missense mutation] phenotype:[Adenosquamous carcinoma with response to therapy] environment:[irreversible EGFR TKIs] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22753918]

Further constrain associations with environment

The researcher can limit the associations returned by introducing the evironment contraint


In [26]:
import ga4gh_client.protocol as protocol
evidence = protocol.EvidenceQuery()
evidence.description = "MEK inhibitors"
    
feature_phenotype_associations =  c.search_genotype_phenotype(
                                    phenotype_association_set_id=phenotype_association_set_id,
                                    phenotype_ids=[p.id for p in phenotypes],
                                    evidence = [evidence]
                                    )
associations = list(feature_phenotype_associations)
print "There are {} associations. First five...".format(len(associations))
print "\n".join([a.description for a in associations][:5])


There are 13 associations. First five...
Association: genotype:[COSM6213] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[EGFR G719D missense mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[COSM6240] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[EGFR exon 19 p.729-761 deletion mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[EGFR G719S missense mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]

Use Case: Association Heatmap

The bokeh package should be installed for graphing.


Find features

First, we collect a set of features.


In [34]:
feature_generator = c.search_features(feature_set_id=feature_set_id, name='.*KIT.*')
features = list(feature_generator)
assert len(features) > 0
print "Found {} features. First five...".format(len(features),feature_set_id)
print "\n".join([a.description for a in associations][:5])


Found 69 features. First five...
Association: genotype:[FGFR2 K660N missense mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[FGFR inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/25035393]
Association: genotype:[EGFR G719S missense mutation] phenotype:[Adenosquamous carcinoma with response to therapy] environment:[irreversible EGFR TKIs] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22753918|http://www.ncbi.nlm.nih.gov/pubmed/22753918]
Association: genotype:[COSM6213] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[EGFR exon 19 p.729-761 insertion mutation] phenotype:[Adenosquamous carcinoma with response to therapy] environment:[erlotinib] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22190593]
Association: genotype:[EGFR G719D missense mutation] phenotype:[Adenosquamous carcinoma with response to therapy] environment:[irreversible EGFR TKIs] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22753918]

Get all associations

Then we select all the associations for those features.


In [28]:
feature_phenotype_associations =  c.search_genotype_phenotype(
                                    phenotype_association_set_id=phenotype_association_set_id,
                                    feature_ids=[f.id  for f in features])
associations = list(feature_phenotype_associations)
print "There are {} associations.  First five...".format(len(associations))
print "\n".join([a.description for a in associations][:5])


There are 52 associations.  First five...
Association: genotype:[KIT exon 11 p.550-592 any mutation] phenotype:[GIST with response to therapy] environment:[HSP90 inhibitors] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22898035]
Association: genotype:[KIT exon 8 p.416-422 indel] phenotype:[ACUTE MYELOID LEUKAEMIA (AML) AND RELATED PRECURSOR NEOPLASMS with sensitivity to therapy] environment:[imatinib] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/15618474]
Association: genotype:[KIT V559I, H687Y, T670, V654A, A829P, D816, N822, Y823D missense mutation] phenotype:[GIST with resistance to therapy] environment:[imatinib] evidence:[resistance] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23582185|http://www.ncbi.nlm.nih.gov/pubmed/21689725|http://www.ncbi.nlm.nih.gov/pubmed/17259998]
Association: genotype:[KIT V559I, H687Y, T670, V654A, A829P, D816, N822, Y823D missense mutation] phenotype:[GIST with resistance to therapy] environment:[imatinib] evidence:[resistance] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23582185|http://www.ncbi.nlm.nih.gov/pubmed/21689725|http://www.ncbi.nlm.nih.gov/pubmed/17259998]
Association: genotype:[KIT exon 9 p.449-514 any mutation] phenotype:[GIST with decreased sensitivity to therapy] environment:[imatinib] evidence:[decreased_sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/18955458|http://www.ncbi.nlm.nih.gov/pubmed/18955451|http://www.ncbi.nlm.nih.gov/pubmed/16624552]

Association Heatmap

Developers can use the G2P package to create researcher friendly applications. Here we take the results from the GA4GH queries and create a dataframe showing association counts.


In [24]:
from bokeh.charts import HeatMap, output_notebook, output_file, show 

from bokeh.layouts import column
from bokeh.models import ColumnDataSource
from bokeh.models.widgets import DataTable,   TableColumn
from bokeh.models import HoverTool


feature_ids = {}
for feature in features:
    feature_ids[feature.id]=feature.name

phenotype_descriptions = []
feature_names = []
association_count = [] 
association_descriptions = []

for association in associations:
    for feature_id in association.feature_ids:
        phenotype_descriptions.append(association.phenotype.description)
        feature_names.append(feature_ids[feature_id])
        association_count.append(1)
        association_descriptions.append(association.description)

output_notebook()
output_file("g2p_heatmap.html")
  
data = {'feature': feature_names  ,
        'association_count': association_count,
        'phenotype': phenotype_descriptions,
        'association_descriptions': association_descriptions
        }

hover = HoverTool(
        tooltips=[
            ("associations", "@values")
        ]
    )

hm = HeatMap(data, x='feature', y='phenotype', values='association_count',
             title='G2P Associations for KIT', stat='sum',
             legend=False,width=1024,
             tools=[hover], #"hover,pan,wheel_zoom,box_zoom,reset,tap",
             toolbar_location="above")

source = ColumnDataSource(data)
columns = [
        TableColumn(field="association_descriptions", title="Description"),
    ]
data_table = DataTable(source=source, columns=columns,width=1024 )

show( column(hm,data_table)  )


Loading BokehJS ...