This notebook will show you how to create a TCGA cohort using the publicly available TCGA BigQuery tables that the ISB-CGC project has produced based on the open-access TCGA data available at the Data Portal. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free trial, or contact us and become part of the community evaluation phase of our Cancer Genomics Cloud pilot.
We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are some resources that you might find useful:
There are also many tutorials and samples available on GitHub (see, in particular, the datalab repo and the Google Genomics project).
OK then, let's get started! In order to work with BigQuery, the first thing you need to do is import the bigquery module:
In [1]:
import gcp.bigquery as bq
The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we will be working with in this notebook are in a dataset called tcga_201607_beta, owned by the isb-cgc project. A full table identifier is of the form <project_id>:<dataset_id>.<table_id>. Let's start by getting some basic information about the tables in this dataset:
In [2]:
d = bq.DataSet('isb-cgc:tcga_201607_beta')
for t in d.tables():
  print '%10d rows %12d bytes %s' \
      % (t.metadata.rows, t.metadata.size, t.name.table_id)
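As a quick illustration of the <project_id>:<dataset_id>.<table_id> naming scheme mentioned above, here is a small helper (hypothetical, not part of the bigquery module) that splits a full table identifier into its three components:

```python
# Split a full BigQuery table identifier of the form
# <project_id>:<dataset_id>.<table_id> into its parts.
# (This helper is just for illustration; it is not part of gcp.bigquery.)

def parse_table_id(full_id):
    """Split 'project:dataset.table' into (project, dataset, table)."""
    project_id, rest = full_id.split(':', 1)
    dataset_id, table_id = rest.split('.', 1)
    return project_id, dataset_id, table_id

print(parse_table_id('isb-cgc:tcga_201607_beta.Clinical_data'))
# ('isb-cgc', 'tcga_201607_beta', 'Clinical_data')
```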
In this tutorial, we are going to look at a few different ways that we can use the information in these tables to create cohorts. Now, you may be asking what we mean by "cohort" and why you might be interested in creating one, or maybe what it even means to "create" a cohort. The TCGA dataset includes clinical, biospecimen, and molecular data from over 10,000 cancer patients who agreed to be a part of this landmark research project to build The Cancer Genome Atlas. This large dataset was originally organized and studied according to cancer type, but now that this multi-year project is nearing completion, with over 30 types of cancer and over 10,000 tumors analyzed, you have the opportunity to look at this dataset from whichever angle most interests you. Maybe you are particularly interested in early-onset cancers, or gastro-intestinal cancers, or a specific type of genetic mutation.

This is where the idea of a "cohort" comes in. The original TCGA "cohorts" were based on cancer type (aka "study"), but now you can define a cohort based on virtually any clinical or molecular feature by querying these BigQuery tables. A cohort is simply a list of samples, using the TCGA barcode system. Once you have created a cohort you can use it in any number of ways: you could further explore the data available for one cohort, or compare one cohort to another, for example.
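Since a cohort is just a list of TCGA barcodes, it helps to know how those barcodes are built. As a sketch (the helper names and example barcode are ours; the sample-type codes follow the published TCGA barcode convention, in which 01-09 denote tumor samples and 10-19 denote normals):

```python
# TCGA sample barcodes look like TCGA-A1-A0SB-01A:
#   project - tissue source site - participant - sample type + vial
# The first three fields form the participant barcode; the two-digit
# sample-type code distinguishes tumor (01-09) from normal (10-19) samples.

def participant_barcode(sample_barcode):
    """Return the participant-level barcode for a sample barcode."""
    return '-'.join(sample_barcode.split('-')[:3])

def is_tumor(sample_barcode):
    """True if the sample-type code indicates a tumor sample (01-09)."""
    code = int(sample_barcode.split('-')[3][:2])
    return 1 <= code <= 9

print(participant_barcode('TCGA-A1-A0SB-01A'))  # TCGA-A1-A0SB
print(is_tumor('TCGA-A1-A0SB-01A'))             # True  (01 = primary tumor)
print(is_tumor('TCGA-A1-A0SB-10A'))             # False (10 = blood-derived normal)
```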
In the rest of this tutorial, we will create several different cohorts based on different motivating research questions. We hope that these examples will provide you with a starting point from which you can build, to answer your own research questions.
Let's start by looking at the clinical data table. The TCGA dataset contains a few very basic clinical data elements for almost all patients, and contains additional information for some tumor types only. For example smoking history information is generally available only for lung cancer patients, and BMI (body mass index) is only available for tumor types where that is a known significant risk factor. Let's take a look at the clinical data table and see how many different pieces of information are available to us:
In [3]:
%bigquery schema --table isb-cgc:tcga_201607_beta.Clinical_data
Out[3]:
That's a lot of fields! We can also get at the schema programmatically:
In [4]:
table = bq.Table('isb-cgc:tcga_201607_beta.Clinical_data')
if ( table.exists() ):
  fieldNames = map(lambda tsf: tsf.name, table.schema)
  fieldTypes = map(lambda tsf: tsf.data_type, table.schema)
  print " This table has %d fields. " % ( len(fieldNames) )
  print " The first few field names and types are: "
  print "     ", fieldNames[:5]
  print "     ", fieldTypes[:5]
else:
  print " There is no existing table called %s:%s.%s" \
      % ( table.name.project_id, table.name.dataset_id, table.name.table_id )
Let's look at these fields and see which ones might be the most "interesting", by looking at how many times they are filled in (not NULL), and how much variation exists in the values. If we wanted to look at just a single field, "tobacco_smoking_history" for example, we could use a very simple query to get a basic summary:
In [5]:
%%sql
SELECT tobacco_smoking_history, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
GROUP BY tobacco_smoking_history
ORDER BY n DESC
Out[5]:
But if we want to loop over all fields and get a sense of which fields might provide us with useful criteria for specifying a cohort, we'll want to automate that. We'll put a threshold on the minimum number of patients that we expect information for, and the maximum number of unique values (since fields such as the "ParticipantBarcode" will be unique for every patient and, although we will need that field later, it's probably not useful for defining a cohort).
In [15]:
numPatients = table.metadata.rows
print " The %s table describes a total of %d patients. " % ( table.name.table_id, numPatients )

# let's set a threshold for the minimum number of values that a field should have,
# and also the maximum number of unique values
minNumPatients = int(numPatients*0.80)
maxNumValues = 50

numInteresting = 0
iList = []
for iField in range(len(fieldNames)):
  aField = fieldNames[iField]
  aType = fieldTypes[iField]
  try:
    qString = "SELECT {0} FROM [{1}]".format(aField, table)
    query = bq.Query(qString)
    df = query.to_dataframe()
    summary = df[str(aField)].describe()
    if ( aType == "STRING" ):
      topFrac = float(summary['freq'])/float(summary['count'])
      if ( summary['count'] >= minNumPatients ):
        if ( summary['unique'] <= maxNumValues and summary['unique'] > 1 ):
          if ( topFrac < 0.90 ):
            numInteresting += 1
            iList += [aField]
            print " > %s has %d values with %d unique (%s occurs %d times) " \
                % (str(aField), summary['count'], summary['unique'], summary['top'], summary['freq'])
    else:
      if ( summary['count'] >= minNumPatients ):
        if ( summary['std'] > 0.1 ):
          numInteresting += 1
          iList += [aField]
          print " > %s has %d values (mean=%.0f, sigma=%.0f) " \
              % (str(aField), summary['count'], summary['mean'], summary['std'])
  except:
    pass

print " "
print " Found %d potentially interesting features: " % numInteresting
print "   ", iList
The above helps us narrow down which fields are likely to be the most useful, but if you have a specific interest, for example in menopause or HPV status, you can still look at those in more detail very easily:
In [16]:
%%sql
SELECT menopause_status, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY menopause_status
ORDER BY n DESC
Out[16]:
We might wonder which specific tumor types have menopause information:
In [17]:
%%sql
SELECT Study, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY Study
ORDER BY n DESC
Out[17]:
In [18]:
%%sql
SELECT hpv_status, hpv_calls, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE hpv_status IS NOT NULL
GROUP BY hpv_status, hpv_calls
HAVING n > 20
ORDER BY n DESC
Out[18]:
An additional factor to consider when creating a cohort is that there may be information that might lead one to exclude a particular patient. In certain instances, patients have been redacted or excluded from analyses for reasons such as prior treatment. Since different researchers may have different criteria for including or excluding certain patients or samples from their analyses, in many cases the data is still available, while "annotations" describing any known issues have been entered into a searchable database. These annotations have also been uploaded into a BigQuery table and can be used in conjunction with the other BigQuery tables.
In this next code cell, we define several queries within a module, which allows us to use them both individually and by reference in the final, main query: select_on_annotations finds all patients in the Annotations table who have either been 'redacted' or had 'unacceptable prior treatment'; select_on_clinical selects all female breast-cancer patients who were diagnosed at age 50 or younger, while also pulling out a few additional fields that might be of interest; and the main query joins these two results, keeping only those patients who meet the clinical criteria and have neither of the disqualifying annotations.
In [19]:
%%sql --module createCohort_and_checkAnnotations
DEFINE QUERY select_on_annotations
SELECT
ParticipantBarcode,
annotationCategoryName AS categoryName,
annotationClassification AS classificationName
FROM
[isb-cgc:tcga_201607_beta.Annotations]
WHERE
( itemTypeName="Patient"
AND (annotationCategoryName="History of unacceptable prior treatment related to a prior/other malignancy"
OR annotationClassification="Redaction" ) )
GROUP BY
ParticipantBarcode,
categoryName,
classificationName
DEFINE QUERY select_on_clinical
SELECT
ParticipantBarcode,
vital_status,
days_to_last_known_alive,
ethnicity,
histological_type,
menopause_status,
race
FROM
[isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
( Study="BRCA"
AND age_at_initial_pathologic_diagnosis<=50
AND gender="FEMALE" )
SELECT
c.ParticipantBarcode AS ParticipantBarcode
FROM (
SELECT
a.categoryName,
a.classificationName,
a.ParticipantBarcode,
c.ParticipantBarcode,
FROM ( $select_on_annotations ) AS a
OUTER JOIN EACH
( $select_on_clinical ) AS c
ON
a.ParticipantBarcode = c.ParticipantBarcode
WHERE
(a.ParticipantBarcode IS NOT NULL
OR c.ParticipantBarcode IS NOT NULL)
ORDER BY
a.classificationName,
a.categoryName,
a.ParticipantBarcode,
c.ParticipantBarcode )
WHERE
( a.categoryName IS NULL
AND a.classificationName IS NULL
AND c.ParticipantBarcode IS NOT NULL )
ORDER BY
c.ParticipantBarcode
Here we explicitly call just the first query in the module, and we get a list of 212 patients with one of these disqualifying annotations:
In [20]:
bq.Query(createCohort_and_checkAnnotations.select_on_annotations).results().to_dataframe()
Out[20]:
and here we explicitly call just the second query, resulting in 329 patients:
In [21]:
bq.Query(createCohort_and_checkAnnotations.select_on_clinical).results().to_dataframe()
Out[21]:
and finally we call the main query:
In [22]:
bq.Query(createCohort_and_checkAnnotations).results().to_dataframe()
Out[22]:
Note that we didn't need to call each sub-query individually; we could have just called the main query and gotten the same result. As you can see, two patients that met the clinical select criteria (which returned 329 patients) were excluded from the final result (which returned 327 patients).
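The main query's exclusion step is essentially an anti-join: keep the clinically selected patients that do not appear in the annotations result. A minimal sketch of that logic in plain Python (the barcodes here are made up purely for illustration):

```python
# Mirror the main query's logic: keep only patients who met the clinical
# criteria AND have no disqualifying annotation.
# (These barcodes are hypothetical, purely for illustration.)

annotated = {'TCGA-XX-0001', 'TCGA-XX-0002'}                 # redacted / prior treatment
clinical = ['TCGA-XX-0001', 'TCGA-XX-0003', 'TCGA-XX-0004']  # met clinical criteria

cohort = sorted(p for p in clinical if p not in annotated)
print(cohort)  # ['TCGA-XX-0003', 'TCGA-XX-0004']
```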
Before we leave off, here are a few useful tricks for working with BigQuery in Cloud Datalab: displaying a query object shows the SQL that will be run, and execute_dry_run() reports how much data a query would process without actually running it.
In [23]:
q = bq.Query(createCohort_and_checkAnnotations)
q
Out[23]:
In [24]:
q.execute_dry_run()
Out[24]:
In [ ]: