Creating TCGA cohorts (part 2)

This notebook will provide another example of building a cohort, this time based on the types of samples available.



In [1]:

    
import gcp.bigquery as bq

Many different types of samples were obtained from the TCGA participants, and details about these samples are available in the Biospecimen data table. This next query shows how many samples exist of each type, as well as the full names and abbreviations of each type:



In [2]:

    
%%sql

SELECT
  SampleType,
  SampleTypeLetterCode,
  COUNT(*) AS n
FROM
  [isb-cgc:tcga_201607_beta.Biospecimen_data]
GROUP BY
  SampleType,
  SampleTypeLetterCode,
ORDER BY
  n DESC









    Out[2]:





    SampleType SampleTypeLetterCode n
Primary solid Tumor TP 10841
Blood Derived Normal NB 9395
Solid Tissue Normal NT 2726
Metastatic TM 397
Primary Blood Derived Cancer - Peripheral Blood TB 356
Recurrent Solid Tumor TR 60
Additional - New Primary TAP 11
Buccal Cell Normal NBC 5
Bone Marrow Normal NBM 4
Additional Metastatic TAM 2
    
(rows: 10, time: 0.9s,   610KB processed, job: job_Dlk2XaAt5ioV1a41HVygiZc3Dh8)

Note that there are many types of tumor samples: primary, metastatic, recurrent, etc, although the vast majority are samples from primary tumors. In the TCGA project, almost all tumor samples were assayed on multiple platforms for mRNA and miRNA expression, DNA methylation, DNA copy-number, and either exome- or whole-genome DNA sequence. For some tumor samples, protein activity was also measured using RPPA arrays. When available, adjacent "normal" tissue samples were also assayed on a subset of these platforms. The "blood normal" samples were primarily used only as a reference source of germline DNA in order to call somatic mutations.

We can do a similar counting exercise of the sample types represented in one of the molecular data tables, using one of the mRNA expression data tables:



In [3]:

    
%%sql

SELECT
  SampleTypeLetterCode,
  COUNT(*) AS n
FROM (
  SELECT
    SampleBarcode,
    SampleTypeLetterCode
  FROM
    [isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM]
  GROUP BY
    SampleBarcode,
    SampleTypeLetterCode )
GROUP BY
  SampleTypeLetterCode
ORDER BY
  n DESC









    Out[3]:





    SampleTypeLetterCode n
TP 9112
NT 726
TM 393
TR 46
TAP 11
TAM 1
    
(rows: 6, time: 2.7s,     4GB processed, job: job_J2Ok7Ohyp5KYDKot7SNzOi11Rho)

In this example, let's assume that we would like to do a study that requires a primary tumor sample and a matched-normal (adjacent) tissue sample. In order to find out which patients provided which types of samples, we need to query the Biospecimen data table. This next query module uses two sub-queries, one to get all patients with TP samples and another to get all patients with NT samples. The final query joins these two and returns a single list of patients.



In [4]:

    
%%sql --module patients_with_matched_samples


DEFINE QUERY patients_tp
SELECT
  ParticipantBarcode
FROM
  [isb-cgc:tcga_201607_beta.Biospecimen_data]
WHERE
  ( SampleTypeLetterCode="TP" )
GROUP BY
  ParticipantBarcode
ORDER BY
  ParticipantBarcode


DEFINE QUERY patients_nt
SELECT
  ParticipantBarcode
FROM
  [isb-cgc:tcga_201607_beta.Biospecimen_data]
WHERE
  ( SampleTypeLetterCode="NT" )
GROUP BY
  ParticipantBarcode
ORDER BY
  ParticipantBarcode


DEFINE QUERY patients_both
SELECT
  nt.ParticipantBarcode AS ParticipantBarcode
FROM ( $patients_nt ) AS nt
JOIN
  ( $patients_tp ) AS tp
ON
  nt.ParticipantBarcode = tp.ParticipantBarcode
ORDER BY
  ParticipantBarcode



In [5]:

    
bq.Query(patients_with_matched_samples).results().to_dataframe()









    Out[5]:






  
    
      
      ParticipantBarcode
    
  
  
    
      0
      TCGA-04-1335
    
    
      1
      TCGA-04-1336
    
    
      2
      TCGA-04-1337
    
    
      3
      TCGA-04-1338
    
    
      4
      TCGA-04-1342
    
    
      5
      TCGA-04-1346
    
    
      6
      TCGA-04-1347
    
    
      7
      TCGA-04-1348
    
    
      8
      TCGA-04-1349
    
    
      9
      TCGA-04-1350
    
    
      10
      TCGA-04-1351
    
    
      11
      TCGA-04-1353
    
    
      12
      TCGA-04-1356
    
    
      13
      TCGA-04-1357
    
    
      14
      TCGA-04-1360
    
    
      15
      TCGA-04-1361
    
    
      16
      TCGA-04-1369
    
    
      17
      TCGA-04-1370
    
    
      18
      TCGA-04-1371
    
    
      19
      TCGA-04-1516
    
    
      20
      TCGA-04-1517
    
    
      21
      TCGA-04-1519
    
    
      22
      TCGA-04-1536
    
    
      23
      TCGA-04-1638
    
    
      24
      TCGA-04-1644
    
    
      25
      TCGA-04-1646
    
    
      26
      TCGA-04-1648
    
    
      27
      TCGA-04-1649
    
    
      28
      TCGA-04-1651
    
    
      29
      TCGA-04-1652
    
    
      ...
      ...
    
    
      2321
      TCGA-VT-A80G
    
    
      2322
      TCGA-VT-A80J
    
    
      2323
      TCGA-W5-AA2I
    
    
      2324
      TCGA-W5-AA2O
    
    
      2325
      TCGA-W5-AA2Q
    
    
      2326
      TCGA-W5-AA2R
    
    
      2327
      TCGA-W5-AA2T
    
    
      2328
      TCGA-W5-AA2U
    
    
      2329
      TCGA-W5-AA2X
    
    
      2330
      TCGA-W5-AA2Z
    
    
      2331
      TCGA-W5-AA30
    
    
      2332
      TCGA-W5-AA31
    
    
      2333
      TCGA-W5-AA33
    
    
      2334
      TCGA-W5-AA34
    
    
      2335
      TCGA-W7-A93N
    
    
      2336
      TCGA-W7-A93O
    
    
      2337
      TCGA-W7-A93P
    
    
      2338
      TCGA-WA-A7GZ
    
    
      2339
      TCGA-X7-A8D6
    
    
      2340
      TCGA-X7-A8D7
    
    
      2341
      TCGA-XD-AAUH
    
    
      2342
      TCGA-XT-AASU
    
    
      2343
      TCGA-XX-A899
    
    
      2344
      TCGA-Y8-A8RY
    
    
      2345
      TCGA-Y8-A8RZ
    
    
      2346
      TCGA-Y8-A8S0
    
    
      2347
      TCGA-Y8-A8S1
    
    
      2348
      TCGA-YB-A89D
    
    
      2349
      TCGA-ZB-A969
    
    
      2350
      TCGA-ZU-A8S4
    
  

2351 rows × 1 columns

It might be interesting to find out what the distribution of tumor types is for this list of patients with matched tumor-normal sample pairs. We can define a new SQL module that refers to the results of a previously defined query as long as we pass that reference in when we call bq.Query():



In [6]:

    
%%sql --module tumor_type_counts

# now we'll use this list to find what types of tumors these patients
# belong to:
SELECT
  Study, 
  COUNT(*) AS n
FROM 
  [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
  ParticipantBarcode IN ($patients_both)
GROUP BY
  Study
ORDER BY
  n DESC



In [7]:

    
bq.Query(tumor_type_counts,
         patients_nt=patients_with_matched_samples.patients_nt,
         patients_tp=patients_with_matched_samples.patients_tp,
         patients_both=patients_with_matched_samples.patients_both).results().to_dataframe()



In [ ]:

	Study	n
0	KIRC	442
1	LUSC	254
2	LUAD	211
3	BRCA	162
4	OV	126
5	PRAD	118
6	STAD	101
7	THCA	100
8	COAD	93
9	LIHC	89
10	KIRP	88
11	HNSC	82
12	KICH	71
13	ESCA	65
14	UCEC	40
15	BLCA	37
16	PAAD	37
17	GBM	29
18	SARC	24
19	READ	18
20	CHOL	17
21	THYM	13
22	CESC	8
23	UCS	6
24	PCPG	5
25	ACC	5
26	MESO	1

SampleType	SampleTypeLetterCode	n
Primary solid Tumor	TP	10841
Blood Derived Normal	NB	9395
Solid Tissue Normal	NT	2726
Metastatic	TM	397
Primary Blood Derived Cancer - Peripheral Blood	TB	356
Recurrent Solid Tumor	TR	60
Additional - New Primary	TAP	11
Buccal Cell Normal	NBC	5
Bone Marrow Normal	NBM	4
Additional Metastatic	TAM	2

	ParticipantBarcode
0	TCGA-04-1335
1	TCGA-04-1336
2	TCGA-04-1337
3	TCGA-04-1338
4	TCGA-04-1342
5	TCGA-04-1346
6	TCGA-04-1347
7	TCGA-04-1348
8	TCGA-04-1349
9	TCGA-04-1350
10	TCGA-04-1351
11	TCGA-04-1353
12	TCGA-04-1356
13	TCGA-04-1357
14	TCGA-04-1360
15	TCGA-04-1361
16	TCGA-04-1369
17	TCGA-04-1370
18	TCGA-04-1371
19	TCGA-04-1516
20	TCGA-04-1517
21	TCGA-04-1519
22	TCGA-04-1536
23	TCGA-04-1638
24	TCGA-04-1644
25	TCGA-04-1646
26	TCGA-04-1648
27	TCGA-04-1649
28	TCGA-04-1651
29	TCGA-04-1652
...	...
2321	TCGA-VT-A80G
2322	TCGA-VT-A80J
2323	TCGA-W5-AA2I
2324	TCGA-W5-AA2O
2325	TCGA-W5-AA2Q
2326	TCGA-W5-AA2R
2327	TCGA-W5-AA2T
2328	TCGA-W5-AA2U
2329	TCGA-W5-AA2X
2330	TCGA-W5-AA2Z
2331	TCGA-W5-AA30
2332	TCGA-W5-AA31
2333	TCGA-W5-AA33
2334	TCGA-W5-AA34
2335	TCGA-W7-A93N
2336	TCGA-W7-A93O
2337	TCGA-W7-A93P
2338	TCGA-WA-A7GZ
2339	TCGA-X7-A8D6
2340	TCGA-X7-A8D7
2341	TCGA-XD-AAUH
2342	TCGA-XT-AASU
2343	TCGA-XX-A899
2344	TCGA-Y8-A8RY
2345	TCGA-Y8-A8RZ
2346	TCGA-Y8-A8S0
2347	TCGA-Y8-A8S1
2348	TCGA-YB-A89D
2349	TCGA-ZB-A969
2350	TCGA-ZU-A8S4