The goal of this notebook is to introduce you to the mRNAseq gene expression BigQuery table.
This table contains all available TCGA Level-3 gene expression data produced by UNC's RNAseqV2 pipeline using the Illumina HiSeq platform, as of July 2016. The most recent archive (eg unc.edu_BRCA.IlluminaHiSeq_RNASeqV2.Level_3.1.11.0
) for each of the 33 tumor types was downloaded from the DCC, and data extracted from all files matching the pattern %.rsem.genes.normalized_results
. Each of these raw “RSEM genes normalized results” files has two columns: gene_id and normalized_count. The gene_id string contains two parts: the gene symbol, and the Entrez gene ID, separated by | eg: TP53|7157
. During ETL, the gene_id string is split and the gene symbol is stored in the original_gene_symbol
field, and the Entrez gene ID is stored in the gene_id
field. In addition, the Entrez ID is used to look up the current HGNC approved gene symbol, which is stored in the HGNC_gene_sybmol
field.
In order to work with BigQuery, you need to import the python bigquery module (gcp.bigquery
) and you need to know the name(s) of the table(s) you are going to be working with:
In [1]:
import gcp.bigquery as bq
mRNAseq_BQtable = bq.Table('isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM')
From now on, we will refer to this table using this variable ($mRNAseq_BQtable), but we could just as well explicitly give the table name each time.
Let's start by taking a look at the table schema:
In [2]:
%bigquery schema --table $mRNAseq_BQtable
Out[2]:
Now let's count up the number of unique patients, samples and aliquots mentioned in this table. We will do this by defining a very simple parameterized query. (Note that when using a variable for the table name in the FROM clause, you should not also use the square brackets that you usually would if you were specifying the table name as a string.)
In [3]:
%%sql --module count_unique
DEFINE QUERY q1
SELECT COUNT (DISTINCT $f, 25000) AS n
FROM $t
In [4]:
fieldList = ['ParticipantBarcode', 'SampleBarcode', 'AliquotBarcode']
for aField in fieldList:
field = mRNAseq_BQtable.schema[aField]
rdf = bq.Query(count_unique.q1,t=mRNAseq_BQtable,f=field).results().to_dataframe()
print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)
We can do the same thing to look at how many unique gene symbols and gene ids exist in the table:
In [5]:
fieldList = ['original_gene_symbol', 'HGNC_gene_symbol', 'gene_id']
for aField in fieldList:
field = mRNAseq_BQtable.schema[aField]
rdf = bq.Query(count_unique.q1,t=mRNAseq_BQtable,f=field).results().to_dataframe()
print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)
Based on the counts, we can see that there are a few instances where the original gene symbol (from the underlying TCGA data file), or the HGNC gene symbol or the gene id (also from the original TCGA data file) is missing, but for the majority of genes, all three values should be available and for the most part the original gene symbol and the HGNC gene symbol that was added during ETL should all match up. This next query will generate the complete list of genes for which none of the identifiers are null, and where the original gene symbol and the HGNC gene symbol match. This list has over 18000 genes in it.
In [6]:
%%sql
SELECT
HGNC_gene_symbol,
original_gene_symbol,
gene_id
FROM
$mRNAseq_BQtable
WHERE
( original_gene_symbol IS NOT NULL
AND HGNC_gene_symbol IS NOT NULL
AND original_gene_symbol=HGNC_gene_symbol
AND gene_id IS NOT NULL )
GROUP BY
original_gene_symbol,
HGNC_gene_symbol,
gene_id
ORDER BY
HGNC_gene_symbol
Out[6]:
We might also want to know how often the gene symbols do not agree:
In [7]:
%%sql
SELECT
HGNC_gene_symbol,
original_gene_symbol,
gene_id
FROM
$mRNAseq_BQtable
WHERE
( original_gene_symbol IS NOT NULL
AND HGNC_gene_symbol IS NOT NULL
AND original_gene_symbol!=HGNC_gene_symbol
AND gene_id IS NOT NULL )
GROUP BY
original_gene_symbol,
HGNC_gene_symbol,
gene_id
ORDER BY
HGNC_gene_symbol
Out[7]:
BigQuery is not just a "look-up" service -- you can also use it to perform calculations. In this next query, we take a look at the mean, standard deviation, and coefficient of variation for the expression of EGFR, within each tumor-type, as well as the number of primary tumor samples that went into each summary statistic.
In [8]:
%%sql
SELECT
Study,
n,
exp_mean,
exp_sigma,
(exp_sigma/exp_mean) AS exp_cv
FROM (
SELECT
Study,
AVG(LOG2(normalized_count+1)) AS exp_mean,
STDDEV_POP(LOG2(normalized_count+1)) AS exp_sigma,
COUNT(AliquotBarcode) AS n
FROM
$mRNAseq_BQtable
WHERE
( SampleTypeLetterCode="TP"
AND HGNC_gene_symbol="EGFR" )
GROUP BY
Study )
ORDER BY
exp_sigma DESC
Out[8]:
We can also easily move the gene-symbol out of the WHERE clause and into the SELECT and GROUP BY clauses and have BigQuery do this same calculation over all genes and all tumor types. This time we will use the --module
option to define the query and then call it in the next cell from python.
In [9]:
%%sql --module highVar
SELECT
Study,
HGNC_gene_symbol,
n,
exp_mean,
exp_sigma,
(exp_sigma/exp_mean) AS exp_cv
FROM (
SELECT
Study,
HGNC_gene_symbol,
AVG(LOG2(normalized_count+1)) AS exp_mean,
STDDEV_POP(LOG2(normalized_count+1)) AS exp_sigma,
COUNT(AliquotBarcode) AS n
FROM
$t
WHERE
( SampleTypeLetterCode="TP" )
GROUP BY
Study,
HGNC_gene_symbol )
ORDER BY
exp_sigma DESC
Once we have defined a query, we can put it into a python object and print out the SQL statement to make sure it looks as expected:
In [10]:
q = bq.Query(highVar,t=mRNAseq_BQtable)
print q.sql
And then we can run it and save the results in another python object:
In [11]:
r = bq.Query(highVar,t=mRNAseq_BQtable).results()
In [12]:
#r.to_dataframe()
Since the result of the previous query is quite large (over 600,000 rows representing ~20,000 genes x ~30 tumor types), we might want to put those results into one or more subsequent queries that further refine these results, for example:
In [13]:
%%sql --module hv_genes
SELECT *
FROM ( $hv_result )
HAVING
( exp_mean > 6.
AND n >= 200
AND exp_cv > 0.5 )
ORDER BY
exp_cv DESC
In [14]:
bq.Query(hv_genes,hv_result=r).results().to_dataframe()
Out[14]:
In [ ]: