Facets Demo on 1000 Genomes Metadata

This notebook demonstrates how to use Facets to quickly visualize metadata from 1000 Genomes. Facets contains two visualizations that help you get a sense of the shape of each feature of your dataset:

  • Facets Overview
  • Facets Dive

Facets is from the People+AI Research Initiative.

Setup


In [ ]:
FACETS_INSTALL_DIR = './'

In [ ]:
%%bash -s "$FACETS_INSTALL_DIR"
if [ ! -d "${1}/facets" ]; then
    # Install facets - only need to do this once per Datalab instance.
    cd "$1"
    git clone https://github.com/PAIR-code/facets
    cd facets
    jupyter nbextension install facets-dist/
else
    echo Facets is already installed under $1.
fi

In [ ]:
# Add the facets overview python code to the python path and import dependencies.
import os
import sys
sys.path.append(os.path.join(FACETS_INSTALL_DIR, 'facets/facets_overview/python'))
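# Python 2 only: make UTF-8 the default string encoding.
# Python 3 has no builtin reload() or sys.setdefaultencoding() and already defaults to UTF-8.
if sys.version_info[0] < 3:
    reload(sys)
    sys.setdefaultencoding('utf-8')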
import pandas as pd
import google.datalab.bigquery as bq
from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
from IPython.display import display, HTML
import base64

Retrieve the data

Here we define one query for an initial demo using metadata from 1000 Genomes, but there are more queries at the bottom of this notebook.

In general, as long as the query returns tabular data (e.g., data you could export to CSV) and the result is on the order of tens of thousands of rows or fewer, it should work fine here. If the result is larger than that, please sample the data before visualizing.
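
If a query returns more rows than that, one way to sample is with pandas before handing the dataframe to Facets. The following is a minimal sketch and not part of the original notebook: the helper name sample_for_facets, the 10,000-row cap, and the fixed seed are all assumptions chosen for illustration.

```python
import pandas as pd


def sample_for_facets(df, max_rows=10000, seed=0):
    """Return df unchanged if it is small enough, else a random sample of max_rows rows.

    Hypothetical helper: the name, the default cap, and the seed are
    illustrative assumptions, not part of Facets or Datalab.
    """
    if len(df) <= max_rows:
        return df
    # random_state makes the sample reproducible across runs.
    return df.sample(n=max_rows, random_state=seed)


# Toy dataframe standing in for a large query result.
big = pd.DataFrame({'sample_id': range(50000)})
small = sample_for_facets(big, max_rows=10000)
print(small.shape)
```

In this notebook the helper could be applied as df = sample_for_facets(df) once the dataframe has been created by the query cell.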


In [ ]:
sql = """
--
-- The 1000 Genomes metadata includes gender, familial relationships, population,
-- super population, sequencing metrics, etc.
--
SELECT
  *
FROM
  `genomics-public-data.1000_genomes.sample_info`
"""

Execute the query to fill a Pandas dataframe with the data of interest.


In [ ]:
query = bq.Query(sql)
df = query.execute().result().to_dataframe()

Visualize the result with Facets

The blocks of code that follow are boilerplate for visualizing the data using Facets Overview and Facets Dive. They use the value of the variable df as the input to the visualization.

Note: This interactive visualization requires JavaScript, so if this notebook is viewed on GitHub, the output will be empty.

Facets Overview

The following cell (when executed) will display the dataframe with Facets Overview.


In [ ]:
proto = GenericFeatureStatisticsGenerator().ProtoFromDataFrames([{'name': 'test', 'table': df}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
HTML_TEMPLATE = """<link rel="import" href="{facetsPath}" >
        <h4>Facets Overview of dataframe with shape {shape}</h4>
        <facets-overview id="overviewelem"></facets-overview>
        <script>
          document.querySelector("#overviewelem").height = "1000px";
          document.querySelector("#overviewelem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(facetsPath=os.path.join(FACETS_INSTALL_DIR, 'facets/facets-dist/facets-jupyter.html'),
                            shape=str(df.shape),
                            protostr=protostr)
display(HTML(html))
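
The second line of the cell above base64-encodes the serialized proto so that arbitrary bytes can be embedded as a plain string in the HTML. The sketch below illustrates only that encoding step, using the standard library; the byte string is a made-up stand-in for proto.SerializeToString(), not real Facets output.

```python
import base64

# Made-up stand-in for proto.SerializeToString(); the real value is a
# binary protobuf message produced by the statistics generator.
raw = b'\x0a\x04test\x10\x01'

# b64encode returns bytes; decode to text so it can be passed to str.format().
encoded = base64.b64encode(raw).decode('utf-8')

# Standard base64 text contains only ASCII letters, digits, '+', '/', and '=',
# so it contains no quote characters that could end the JavaScript string early.
print(encoded)

# Decoding recovers the original bytes exactly.
assert base64.b64decode(encoded) == raw
```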

Facets Dive

The following cell (when executed) will display the dataframe with Facets Dive. For 1000 Genomes, the default settings reproduce the last plot in this notebook to compare sequencing center metrics. Zoom in to see more detail in each plot.


In [ ]:
jsonstr = df.to_json(orient='records')
HTML_TEMPLATE = """<link rel="import" href="{facetsPath}" >
        <h4>Facets Dive of dataframe with shape {shape}</h4>
        <facets-dive id="diveelem"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#diveelem").height = "1000px";
          document.querySelector("#diveelem").data = data;
          // Specify a few default settings.
          document.querySelector("#diveelem").positionMode = 'scatter';
          // Specify a few default settings specific to 1000 Genomes.
          document.querySelector("#diveelem").horizontalFacet = 'Super_Population';
          document.querySelector("#diveelem").verticalFacet = 'Main_Project_E_Centers';
          document.querySelector("#diveelem").horizontalPosition = 'Total_Exome_Sequence';
          document.querySelector("#diveelem").verticalPosition = 'Total_LC_Sequence';
          document.querySelector("#diveelem").colorBy = 'In_Phase1_Integrated_Variant_Set';
        </script>"""
html = HTML_TEMPLATE.format(facetsPath=os.path.join(FACETS_INSTALL_DIR, 'facets/facets-dist/facets-jupyter.html'),
                            shape=str(df.shape),
                            jsonstr=jsonstr)
display(HTML(html))
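
The default settings in the cell above name columns from the 1000 Genomes metadata. If df holds the result of a different query, those columns may be absent. Below is a minimal sketch of a check that could be run first; the helper name missing_dive_columns and the DIVE_DEFAULTS dictionary are hypothetical, and the toy column list stands in for list(df.columns).

```python
# Default Facets Dive settings copied from the cell above.
DIVE_DEFAULTS = {
    'horizontalFacet': 'Super_Population',
    'verticalFacet': 'Main_Project_E_Centers',
    'horizontalPosition': 'Total_Exome_Sequence',
    'verticalPosition': 'Total_LC_Sequence',
    'colorBy': 'In_Phase1_Integrated_Variant_Set',
}


def missing_dive_columns(columns, defaults=DIVE_DEFAULTS):
    """Return {setting: column} for every default whose column is not in columns."""
    available = set(columns)
    return {setting: col for setting, col in defaults.items()
            if col not in available}


# Toy column list standing in for list(df.columns).
toy_columns = ['Sample', 'Super_Population', 'Total_LC_Sequence']
missing = missing_dive_columns(toy_columns)
for setting, col in sorted(missing.items()):
    print('%s refers to missing column %s' % (setting, col))
```

An empty result means every default setting points at a column that is present in the dataframe.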

Additional Queries

Here are additional metadata queries for the Personal Genome Project and the Simons Genome Diversity Project.

If you execute one of the following cells, it will update the sql variable. You can then return to the query and visualization cells above to re-execute them and display the new data.


In [ ]:
sql = """
--
-- Examine metadata about individuals in the Personal Genomes Project.
--
SELECT * FROM `google.com:biggene.pgp.phenotypes` 
"""

In [ ]:
sql = """
--
-- Examine metadata about individuals in the Simons Genome Diversity Project.
--
SELECT * 
FROM `genomics-public-data.simons_genome_diversity_project.sample_metadata` 
"""
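
As an alternative to re-running a cell to overwrite sql, the queries in this notebook could be kept in one dictionary and selected by name. This is only a sketch: the QUERIES dictionary, its keys, and select_query are hypothetical names, while the table names are the ones used in the cells above.

```python
# Hypothetical lookup table; the table names are the ones queried in this notebook.
QUERIES = {
    '1000_genomes':
        'SELECT * FROM `genomics-public-data.1000_genomes.sample_info`',
    'pgp':
        'SELECT * FROM `google.com:biggene.pgp.phenotypes`',
    'sgdp':
        'SELECT * FROM '
        '`genomics-public-data.simons_genome_diversity_project.sample_metadata`',
}


def select_query(name):
    """Return the SQL text for a named dataset, or raise a readable error."""
    try:
        return QUERIES[name]
    except KeyError:
        raise ValueError('Unknown dataset %r; choose one of %s'
                         % (name, sorted(QUERIES)))


sql = select_query('sgdp')
print(sql)
```

The resulting sql string can then be passed to the query and visualization cells above, as before.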