Creating TCGA cohorts (part 2)

This notebook will provide another example of building a cohort, this time based on the types of samples available.


In [1]:
import gcp.bigquery as bq

Many different types of samples were obtained from the TCGA participants, and details about these samples are available in the Biospecimen data table. This next query shows how many samples exist of each type, as well as the full names and abbreviations of each type:


In [2]:
%%sql

SELECT
  SampleType,
  SampleTypeLetterCode,
  COUNT(*) AS n
FROM
  [isb-cgc:tcga_201607_beta.Biospecimen_data]
GROUP BY
  SampleType,
  SampleTypeLetterCode,
ORDER BY
  n DESC


Out[2]:
SampleTypeSampleTypeLetterCoden
Primary solid TumorTP10841
Blood Derived NormalNB9395
Solid Tissue NormalNT2726
MetastaticTM397
Primary Blood Derived Cancer - Peripheral BloodTB356
Recurrent Solid TumorTR60
Additional - New PrimaryTAP11
Buccal Cell NormalNBC5
Bone Marrow NormalNBM4
Additional MetastaticTAM2

(rows: 10, time: 0.9s, 610KB processed, job: job_Dlk2XaAt5ioV1a41HVygiZc3Dh8)

Note that there are many types of tumor samples: primary, metastatic, recurrent, etc, although the vast majority are samples from primary tumors. In the TCGA project, almost all tumor samples were assayed on multiple platforms for mRNA and miRNA expression, DNA methylation, DNA copy-number, and either exome- or whole-genome DNA sequence. For some tumor samples, protein activity was also measured using RPPA arrays. When available, adjacent "normal" tissue samples were also assayed on a subset of these platforms. The "blood normal" samples were primarily used only as a reference source of germline DNA in order to call somatic mutations.

We can do a similar counting exercise of the sample types represented in one of the molecular data tables, using one of the mRNA expression data tables:


In [3]:
%%sql

SELECT
  SampleTypeLetterCode,
  COUNT(*) AS n
FROM (
  SELECT
    SampleBarcode,
    SampleTypeLetterCode
  FROM
    [isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM]
  GROUP BY
    SampleBarcode,
    SampleTypeLetterCode )
GROUP BY
  SampleTypeLetterCode
ORDER BY
  n DESC


Out[3]:
SampleTypeLetterCoden
TP9112
NT726
TM393
TR46
TAP11
TAM1

(rows: 6, time: 2.7s, 4GB processed, job: job_J2Ok7Ohyp5KYDKot7SNzOi11Rho)

In this example, let's assume that we would like to do a study that requires a primary tumor sample and a matched-normal (adjacent) tissue sample. In order to find out which patients provided which types of samples, we need to query the Biospecimen data table. This next query module uses two sub-queries, one to get all patients with TP samples and another to get all patients with NT samples. The final query joins these two and returns a single list of patients.


In [4]:
%%sql --module patients_with_matched_samples


DEFINE QUERY patients_tp
SELECT
  ParticipantBarcode
FROM
  [isb-cgc:tcga_201607_beta.Biospecimen_data]
WHERE
  ( SampleTypeLetterCode="TP" )
GROUP BY
  ParticipantBarcode
ORDER BY
  ParticipantBarcode


DEFINE QUERY patients_nt
SELECT
  ParticipantBarcode
FROM
  [isb-cgc:tcga_201607_beta.Biospecimen_data]
WHERE
  ( SampleTypeLetterCode="NT" )
GROUP BY
  ParticipantBarcode
ORDER BY
  ParticipantBarcode


DEFINE QUERY patients_both
SELECT
  nt.ParticipantBarcode AS ParticipantBarcode
FROM ( $patients_nt ) AS nt
JOIN
  ( $patients_tp ) AS tp
ON
  nt.ParticipantBarcode = tp.ParticipantBarcode
ORDER BY
  ParticipantBarcode

In [5]:
bq.Query(patients_with_matched_samples).results().to_dataframe()


Out[5]:
ParticipantBarcode
0 TCGA-04-1335
1 TCGA-04-1336
2 TCGA-04-1337
3 TCGA-04-1338
4 TCGA-04-1342
5 TCGA-04-1346
6 TCGA-04-1347
7 TCGA-04-1348
8 TCGA-04-1349
9 TCGA-04-1350
10 TCGA-04-1351
11 TCGA-04-1353
12 TCGA-04-1356
13 TCGA-04-1357
14 TCGA-04-1360
15 TCGA-04-1361
16 TCGA-04-1369
17 TCGA-04-1370
18 TCGA-04-1371
19 TCGA-04-1516
20 TCGA-04-1517
21 TCGA-04-1519
22 TCGA-04-1536
23 TCGA-04-1638
24 TCGA-04-1644
25 TCGA-04-1646
26 TCGA-04-1648
27 TCGA-04-1649
28 TCGA-04-1651
29 TCGA-04-1652
... ...
2321 TCGA-VT-A80G
2322 TCGA-VT-A80J
2323 TCGA-W5-AA2I
2324 TCGA-W5-AA2O
2325 TCGA-W5-AA2Q
2326 TCGA-W5-AA2R
2327 TCGA-W5-AA2T
2328 TCGA-W5-AA2U
2329 TCGA-W5-AA2X
2330 TCGA-W5-AA2Z
2331 TCGA-W5-AA30
2332 TCGA-W5-AA31
2333 TCGA-W5-AA33
2334 TCGA-W5-AA34
2335 TCGA-W7-A93N
2336 TCGA-W7-A93O
2337 TCGA-W7-A93P
2338 TCGA-WA-A7GZ
2339 TCGA-X7-A8D6
2340 TCGA-X7-A8D7
2341 TCGA-XD-AAUH
2342 TCGA-XT-AASU
2343 TCGA-XX-A899
2344 TCGA-Y8-A8RY
2345 TCGA-Y8-A8RZ
2346 TCGA-Y8-A8S0
2347 TCGA-Y8-A8S1
2348 TCGA-YB-A89D
2349 TCGA-ZB-A969
2350 TCGA-ZU-A8S4

2351 rows × 1 columns

It might be interesting to find out what the distribution of tumor types is for this list of patients with matched tumor-normal sample pairs. We can define a new SQL module that refers to the results of a previously defined query as long as we pass that reference in when we call bq.Query():


In [6]:
%%sql --module tumor_type_counts

# now we'll use this list to find what types of tumors these patients
# belong to:
SELECT
  Study, 
  COUNT(*) AS n
FROM 
  [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
  ParticipantBarcode IN ($patients_both)
GROUP BY
  Study
ORDER BY
  n DESC

In [7]:
bq.Query(tumor_type_counts,
         patients_nt=patients_with_matched_samples.patients_nt,
         patients_tp=patients_with_matched_samples.patients_tp,
         patients_both=patients_with_matched_samples.patients_both).results().to_dataframe()


Out[7]:
Study n
0 KIRC 442
1 LUSC 254
2 LUAD 211
3 BRCA 162
4 OV 126
5 PRAD 118
6 STAD 101
7 THCA 100
8 COAD 93
9 LIHC 89
10 KIRP 88
11 HNSC 82
12 KICH 71
13 ESCA 65
14 UCEC 40
15 BLCA 37
16 PAAD 37
17 GBM 29
18 SARC 24
19 READ 18
20 CHOL 17
21 THYM 13
22 CESC 8
23 UCS 6
24 PCPG 5
25 ACC 5
26 MESO 1

In [ ]: