This notebook will show you how to create a TCGA cohort using the publicly available TCGA BigQuery tables that the ISB-CGC project has produced based on the open-access TCGA data available at the Data Portal. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free trial, or contact us and become part of the community evaluation phase of our Cancer Genomics Cloud pilot.
We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are some resources that you might find useful:
There are also many tutorials and samples available on GitHub (see, in particular, the datalab repo and the Google Genomics project).
OK then, let's get started! In order to work with BigQuery, the first thing you need to do is import the bigquery module:
In [1]:
import gcp.bigquery as bq
The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we will be working with in this notebook are in a dataset called tcga_201607_beta, owned by the isb-cgc project. A full table identifier is of the form <project_id>:<dataset_id>.<table_id>. Let's start by getting some basic information about the tables in this dataset:
In [2]:
d = bq.DataSet('isb-cgc:tcga_201607_beta')
for t in d.tables():
  print '%10d rows %12d bytes %s' \
      % (t.metadata.rows, t.metadata.size, t.name.table_id)
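As a quick illustration of the <project_id>:<dataset_id>.<table_id> naming scheme mentioned above, here is a small helper (hypothetical, not part of the bigquery module) that splits a full table identifier into its three components:

```python
# Split a full BigQuery table identifier of the form
# <project_id>:<dataset_id>.<table_id> into its parts.
# (This helper is just for illustration; it is not part of gcp.bigquery.)

def parse_table_id(full_id):
    """Split 'project:dataset.table' into (project, dataset, table)."""
    project_id, rest = full_id.split(':', 1)
    dataset_id, table_id = rest.split('.', 1)
    return project_id, dataset_id, table_id

print(parse_table_id('isb-cgc:tcga_201607_beta.Clinical_data'))
# ('isb-cgc', 'tcga_201607_beta', 'Clinical_data')
```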
In this tutorial, we are going to look at a few different ways that we can use the information in these tables to create cohorts. Now, you may be asking what we mean by "cohort" and why you might be interested in creating one, or maybe what it even means to "create" a cohort. The TCGA dataset includes clinical, biospecimen, and molecular data from over 10,000 cancer patients who agreed to be a part of this landmark research project to build The Cancer Genome Atlas. This large dataset was originally organized and studied according to cancer type, but now that this multi-year project is nearing completion, with over 30 types of cancer and over 10,000 tumors analyzed, you have the opportunity to look at this dataset from whichever angle most interests you. Maybe you are particularly interested in early-onset cancers, or gastro-intestinal cancers, or a specific type of genetic mutation.

This is where the idea of a "cohort" comes in. The original TCGA "cohorts" were based on cancer type (aka "study"), but now you can define a cohort based on virtually any clinical or molecular feature by querying these BigQuery tables. A cohort is simply a list of samples, using the TCGA barcode system. Once you have created a cohort you can use it in any number of ways: you could further explore the data available for one cohort, or compare one cohort to another, for example.
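Since a cohort is just a list of TCGA barcodes, it helps to know how those barcodes are built. As a sketch (the helper names and example barcode are ours; the sample-type codes follow the published TCGA barcode convention, in which 01-09 denote tumor samples and 10-19 denote normals):

```python
# TCGA sample barcodes look like TCGA-A1-A0SB-01A:
#   project - tissue source site - participant - sample type + vial
# The first three fields form the participant barcode; the two-digit
# sample-type code distinguishes tumor (01-09) from normal (10-19) samples.

def participant_barcode(sample_barcode):
    """Return the participant-level barcode for a sample barcode."""
    return '-'.join(sample_barcode.split('-')[:3])

def is_tumor(sample_barcode):
    """True if the sample-type code indicates a tumor sample (01-09)."""
    code = int(sample_barcode.split('-')[3][:2])
    return 1 <= code <= 9

print(participant_barcode('TCGA-A1-A0SB-01A'))  # TCGA-A1-A0SB
print(is_tumor('TCGA-A1-A0SB-01A'))             # True  (01 = primary tumor)
print(is_tumor('TCGA-A1-A0SB-10A'))             # False (10 = blood-derived normal)
```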
In the rest of this tutorial, we will create several different cohorts based on different motivating research questions. We hope that these examples will provide you with a starting point from which you can build, to answer your own research questions.
Let's start by looking at the clinical data table. The TCGA dataset contains a few very basic clinical data elements for almost all patients, and contains additional information for some tumor types only. For example smoking history information is generally available only for lung cancer patients, and BMI (body mass index) is only available for tumor types where that is a known significant risk factor. Let's take a look at the clinical data table and see how many different pieces of information are available to us:
In [3]:
%bigquery schema --table isb-cgc:tcga_201607_beta.Clinical_data
Out[3]:
That's a lot of fields! We can also get at the schema programmatically:
In [4]:
table = bq.Table('isb-cgc:tcga_201607_beta.Clinical_data')
if ( table.exists() ):
  fieldNames = map(lambda tsf: tsf.name, table.schema)
  fieldTypes = map(lambda tsf: tsf.data_type, table.schema)
  print " This table has %d fields. " % ( len(fieldNames) )
  print " The first few field names and types are: "
  print "     ", fieldNames[:5]
  print "     ", fieldTypes[:5]
else:
  print " There is no existing table called %s:%s.%s" \
      % ( table.name.project_id, table.name.dataset_id, table.name.table_id )
Let's look at these fields and see which ones might be the most "interesting", by looking at how many times they are filled in (not NULL), and how much variation exists in the values. If we wanted to look at just a single field, "tobacco_smoking_history" for example, we could use a very simple query to get a basic summary:
In [5]:
%%sql
SELECT tobacco_smoking_history, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
GROUP BY tobacco_smoking_history
ORDER BY n DESC
Out[5]:
But if we want to loop over all fields and get a sense of which fields might provide us with useful criteria for specifying a cohort, we'll want to automate that. We'll put a threshold on the minimum number of patients that we expect information for, and the maximum number of unique values (since fields such as the "ParticipantBarcode" will be unique for every patient and, although we will need that field later, it's probably not useful for defining a cohort).
In [15]:
numPatients = table.metadata.rows
print " The %s table describes a total of %d patients. " % ( table.name.table_id, numPatients )

# let's set a threshold for the minimum number of values that a field should have,
# and also the maximum number of unique values
minNumPatients = int(numPatients*0.80)
maxNumValues = 50

numInteresting = 0
iList = []
for iField in range(len(fieldNames)):
  aField = fieldNames[iField]
  aType = fieldTypes[iField]
  try:
    qString = "SELECT {0} FROM [{1}]".format(aField, table)
    query = bq.Query(qString)
    df = query.to_dataframe()
    summary = df[str(aField)].describe()
    if ( aType == "STRING" ):
      topFrac = float(summary['freq'])/float(summary['count'])
      if ( summary['count'] >= minNumPatients ):
        if ( summary['unique'] <= maxNumValues and summary['unique'] > 1 ):
          if ( topFrac < 0.90 ):
            numInteresting += 1
            iList += [aField]
            print " > %s has %d values with %d unique (%s occurs %d times) " \
                % (str(aField), summary['count'], summary['unique'], summary['top'], summary['freq'])
    else:
      if ( summary['count'] >= minNumPatients ):
        if ( summary['std'] > 0.1 ):
          numInteresting += 1
          iList += [aField]
          print " > %s has %d values (mean=%.0f, sigma=%.0f) " \
              % (str(aField), summary['count'], summary['mean'], summary['std'])
  except:
    pass

print " "
print " Found %d potentially interesting features: " % numInteresting
print "   ", iList
The above helps us narrow down which fields are likely to be the most useful, but if you have a specific interest, for example in menopause or HPV status, you can still look at those in more detail very easily:
In [16]:
%%sql
SELECT menopause_status, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY menopause_status
ORDER BY n DESC
Out[16]:
We might wonder which specific tumor types have menopause information:
In [17]:
%%sql
SELECT Study, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY Study
ORDER BY n DESC
Out[17]:
In [18]:
%%sql
SELECT hpv_status, hpv_calls, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE hpv_status IS NOT NULL
GROUP BY hpv_status, hpv_calls
HAVING n > 20
ORDER BY n DESC
Out[18]:
An additional factor to consider when creating a cohort is that there may be information that might lead one to exclude a particular patient. In certain instances, patients have been redacted or excluded from analyses for reasons such as prior treatment. Since different researchers may have different criteria for including or excluding certain patients or samples from their analyses, in many cases the data is still available, while "annotations" describing any known issues have been entered into a searchable database. These annotations have also been uploaded into a BigQuery table and can be used in conjunction with the other BigQuery tables.
In this next code cell, we define several queries within a module, which allows us to use them both individually and by reference in the final, main query: select_on_annotations finds all patients in the Annotations table who have either been 'redacted' or had 'unacceptable prior treatment'; select_on_clinical selects all female breast-cancer patients who were diagnosed at age 50 or younger, while also pulling out a few additional fields that might be of interest; and the main query joins these two results, keeping only those patients who meet the clinical criteria and have neither of the disqualifying annotations.
In [19]:
%%sql --module createCohort_and_checkAnnotations
DEFINE QUERY select_on_annotations
SELECT
ParticipantBarcode,
annotationCategoryName AS categoryName,
annotationClassification AS classificationName
FROM
[isb-cgc:tcga_201607_beta.Annotations]
WHERE
( itemTypeName="Patient"
AND (annotationCategoryName="History of unacceptable prior treatment related to a prior/other malignancy"
OR annotationClassification="Redaction" ) )
GROUP BY
ParticipantBarcode,
categoryName,
classificationName
DEFINE QUERY select_on_clinical
SELECT
ParticipantBarcode,
vital_status,
days_to_last_known_alive,
ethnicity,
histological_type,
menopause_status,
race
FROM
[isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
( Study="BRCA"
AND age_at_initial_pathologic_diagnosis<=50
AND gender="FEMALE" )
SELECT
c.ParticipantBarcode AS ParticipantBarcode
FROM (
SELECT
a.categoryName,
a.classificationName,
a.ParticipantBarcode,
c.ParticipantBarcode,
FROM ( $select_on_annotations ) AS a
OUTER JOIN EACH
( $select_on_clinical ) AS c
ON
a.ParticipantBarcode = c.ParticipantBarcode
WHERE
(a.ParticipantBarcode IS NOT NULL
OR c.ParticipantBarcode IS NOT NULL)
ORDER BY
a.classificationName,
a.categoryName,
a.ParticipantBarcode,
c.ParticipantBarcode )
WHERE
( a.categoryName IS NULL
AND a.classificationName IS NULL
AND c.ParticipantBarcode IS NOT NULL )
ORDER BY
c.ParticipantBarcode
Here we explicitly call just the first query in the module, and we get a list of 212 patients with one of these disqualifying annotations:
In [20]:
bq.Query(createCohort_and_checkAnnotations.select_on_annotations).results().to_dataframe()
Out[20]:
and here we explicitly call just the second query, resulting in 329 patients:
In [21]:
bq.Query(createCohort_and_checkAnnotations.select_on_clinical).results().to_dataframe()
Out[21]:
and finally we call the main query:
In [22]:
bq.Query(createCohort_and_checkAnnotations).results().to_dataframe()
Out[22]:
Note that we didn't need to call each sub-query individually; we could have just called the main query and gotten the same result. As you can see, two patients that met the clinical select criteria (which returned 329 patients) were excluded from the final result (which returned 327 patients).
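The main query's exclusion step is essentially an anti-join: keep the clinically selected patients that do not appear in the annotations result. A minimal sketch of that logic in plain Python (the barcodes here are made up purely for illustration):

```python
# Mirror the main query's logic: keep only patients who met the clinical
# criteria AND have no disqualifying annotation.
# (These barcodes are hypothetical, purely for illustration.)

annotated = {'TCGA-XX-0001', 'TCGA-XX-0002'}                 # redacted / prior treatment
clinical = ['TCGA-XX-0001', 'TCGA-XX-0003', 'TCGA-XX-0004']  # met clinical criteria

cohort = sorted(p for p in clinical if p not in annotated)
print(cohort)  # ['TCGA-XX-0003', 'TCGA-XX-0004']
```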
Before we leave off, here are a few useful tricks for working with BigQuery in Cloud Datalab: displaying a query object shows the SQL that will be run, and execute_dry_run() reports how much data a query would process without actually running it.
In [23]:
q = bq.Query(createCohort_and_checkAnnotations)
q
Out[23]:
In [24]:
q.execute_dry_run()
Out[24]:
In [ ]: