The goal of this notebook is to introduce you to the microRNA expression BigQuery table.
This table contains all available TCGA Level-3 microRNA expression data produced by BCGSC's microRNA pipeline using the Illumina HiSeq platform, as of July 2016. The most recent archive (eg bcgsc.ca_THCA.IlluminaHiSeq_miRNASeq.Level_3.1.9.0
) for each of the 32 tumor types was downloaded from the DCC, and data extracted from all files matching the pattern %.isoform.quantification.txt
. The isoform-quantification values were then processed through a Perl script provided by BCGSC which produces normalized expression levels for mature microRNAs. Each of these mature microRNAs is identified by name (eg hsa-mir-21) and by MIMAT accession number (eg MIMAT0000076).
In order to work with BigQuery, you need to import the python bigquery module (gcp.bigquery
) and you need to know the name(s) of the table(s) you are going to be working with:
In [5]:
import gcp.bigquery as bq
miRNA_BQtable = bq.Table('isb-cgc:tcga_201607_beta.miRNA_Expression')
From now on, we will refer to this table using this variable ($miRNA_BQtable), but we could just as well explicitly give the table name each time.
Let's start by taking a look at the table schema:
In [6]:
%bigquery schema --table $miRNA_BQtable
Out[6]:
Now let's count up the number of unique patients, samples and aliquots mentioned in this table. We will do this by defining a very simple parameterized query. (Note that when using a variable for the table name in the FROM clause, you should not also use the square brackets that you usually would if you were specifying the table name as a string.)
In [7]:
%%sql --module count_unique
DEFINE QUERY q1
SELECT COUNT (DISTINCT $f, 25000) AS n
FROM $t
In [8]:
fieldList = ['ParticipantBarcode', 'SampleBarcode', 'AliquotBarcode']
for aField in fieldList:
field = miRNA_BQtable.schema[aField]
rdf = bq.Query(count_unique.q1,t=miRNA_BQtable,f=field).results().to_dataframe()
print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)
In [9]:
fieldList = ['mirna_id', 'mirna_accession']
for aField in fieldList:
field = miRNA_BQtable.schema[aField]
rdf = bq.Query(count_unique.q1,t=miRNA_BQtable,f=field).results().to_dataframe()
print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)
These counts show that the mirna_id field is not a unique identifier and should be used in combination with the MIMAT accession number.
Another thing to note about this table is that these expression values are obtained from two different platforms -- approximately 15% of the data is from the Illumina GA platform, and 85% from the Illumina HiSeq:
In [10]:
%%sql
SELECT
Platform,
COUNT(*) AS n
FROM
$miRNA_BQtable
GROUP BY
Platform
ORDER BY
n DESC
Out[10]: