CGHub Manifest

Download and parse the CGHub manifest file, which contains a summary of all of the data in CGHub.



In [1]:

    
import pandas as pd



In [2]:

    
MANIFEST = 'https://cghub.ucsc.edu/reports/SUMMARY_STATS/LATEST_MANIFEST.tsv'



In [3]:

    
cols = ['study','barcode','disease','platform','assembly','state',
        'sample_type','library_type','uploaded','analysis_id',
        'filename']
tcga = pd.read_table(MANIFEST, low_memory=False, usecols=cols)

For now I'm only looking at the TCGA data run on Illumina sequencers, so I pull out that subset of entries. I'm also only looking at data aligned to the HG19/GRCh37 genome. While we could realign the other samples, this takes a while and is outside the scope of our current analysis.



In [4]:

    
tcga = tcga[tcga.study == 'TCGA']
tcga = tcga[tcga.platform == 'ILLUMINA']
tcga = tcga[tcga.assembly.isin(['GRCh37-lite','HG19_Broad_variant','HG19',
                    'GRCh37-lite-+-HPV_Redux-build','GRCh37'])]

I load these records into a DataFrame with a 3-level hierarchical index (patient, sample type, library type). This index contains most of the information we will need for queries, while the columns give the information needed to retrieve the files. Some patients have had the same assay performed multiple times; in this case I take the most recently uploaded one.



In [5]:

    
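# Drop rows without a barcode; the first 12 characters identify the patient
tcga = tcga[tcga.barcode.notnull()]
tcga['patient'] = tcga['barcode'].map(lambda s: s[:12])
tcga = tcga[tcga.state == 'Live']
tcga = tcga[tcga.assembly != 'unaligned']
# Sort by upload date so that keep='last' retains the most recent assay
# (sort_values / keep= replace the older sort / take_last= spellings)
tcga = tcga.sort_values('uploaded')
tcga = tcga.drop_duplicates(subset=['patient', 'sample_type','library_type'],
                            keep='last')
tcga = tcga.set_index(['patient', 'sample_type','library_type'])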
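
The "keep the most recent assay" step can be checked on a toy table. This is only a sketch with made-up rows (the patient IDs, dates, and analysis IDs below are not from CGHub), written against current pandas, where the calls are spelled `sort_values` and `keep='last'`.

```python
import pandas as pd

# Toy manifest rows (made-up values, not real CGHub records).
toy = pd.DataFrame({
    'patient':      ['P1', 'P1', 'P2'],
    'sample_type':  ['TP', 'TP', 'NB'],
    'library_type': ['WXS', 'WXS', 'WXS'],
    'uploaded':     ['2010-08-27', '2011-03-01', '2010-12-01'],
    'analysis_id':  ['old', 'new', 'only'],
})

key = ['patient', 'sample_type', 'library_type']

# ISO dates sort chronologically as strings, so after sorting the last
# duplicate of each key is the most recently uploaded one.
latest = (toy.sort_values('uploaded')
             .drop_duplicates(subset=key, keep='last')
             .set_index(key))
```

Here P1 has two tumor exome rows, and only the one uploaded in 2011 survives.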



In [6]:

    
tcga.head().T









    Out[6]:






  
    
patient       TCGA-66-2768                             TCGA-64-1677                             TCGA-67-3772                             TCGA-67-3772                             TCGA-64-1677
sample_type   TP                                       TP                                       NB                                       TP                                       NB
library_type  WXS                                      WXS                                      WXS                                      WXS                                      WXS
study         TCGA                                     TCGA                                     TCGA                                     TCGA                                     TCGA
barcode       TCGA-66-2768-01A-01W-0877-08             TCGA-64-1677-01A-01W-0928-08             TCGA-67-3772-10A-01W-0928-08             TCGA-67-3772-01A-01W-0928-08             TCGA-64-1677-10A-01W-0928-08
disease       LUSC                                     LUAD                                     LUAD                                     LUAD                                     LUAD
platform      ILLUMINA                                 ILLUMINA                                 ILLUMINA                                 ILLUMINA                                 ILLUMINA
assembly      HG19_Broad_variant                       HG19_Broad_variant                       HG19_Broad_variant                       HG19_Broad_variant                       HG19_Broad_variant
filename      TCGA-66-2768-01A-01W-0877-08.bam         C347.TCGA-64-1677-01A-01W-0928-08.1.bam  C347.TCGA-67-3772-10A-01W-0928-08.1.bam  C347.TCGA-67-3772-01A-01W-0928-08.1.bam  C347.TCGA-64-1677-10A-01W-0928-08.1.bam
analysis_id   fb280562-6067-4a92-9c6f-fbadbb6a748e     709b873d-c4eb-4dc2-a1dd-e486a6aca50f     f2a2d424-0908-4589-b36c-cabf54c9920a     83024fb3-5782-4674-8e4c-83126cffa1a3     975c6317-99a8-4b0c-b70c-1774e42401d6
uploaded      2010-08-27                               2010-12-01                               2010-12-01                               2010-12-01                               2010-12-01
state         Live                                     Live                                     Live                                     Live                                     Live
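
With the 3-level index in place, a single record can be pulled by its full key and a whole cross-section by one level. The following is a minimal sketch on made-up rows (the IDs and filenames are placeholders, not CGHub entries), assuming the same index level names as above.

```python
import pandas as pd

# Made-up records indexed the same way as the manifest table.
toy = pd.DataFrame({
    'patient':      ['P1', 'P1', 'P2'],
    'sample_type':  ['TP', 'NB', 'TP'],
    'library_type': ['WXS', 'WXS', 'WXS'],
    'analysis_id':  ['a1', 'a2', 'a3'],
    'filename':     ['p1_tumor.bam', 'p1_normal.bam', 'p2_tumor.bam'],
}).set_index(['patient', 'sample_type', 'library_type']).sort_index()

# Full key: one record, returned as a Series of the remaining columns.
record = toy.loc[('P1', 'TP', 'WXS')]

# Single level: every row of that sample type; that level is dropped
# from the resulting index.
tumors = toy.xs('TP', level='sample_type')
```

`record['analysis_id']` and `record['filename']` are then the two fields needed to fetch the file.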

Form lists of matched analyses.

In TCGA, at least, most patients have a blood normal sample as a control. Some also have metastatic samples and/or adjacent normal tissue samples. Here I parse the manifest to form lists of data availability and paired samples. I do this for the patients with exome data, but the same could be done for the other measurement platforms.

Form tissue type lists



In [9]:

    
exome = tcga.xs('WXS', axis=0, level='library_type')

pats = set(tcga.index.get_level_values('patient'))

tumor = exome.xs('TP', level='sample_type')
tumor = set(tumor.index.get_level_values('patient'))

normal = exome.xs('NB', level='sample_type')
normal = set(normal.index.get_level_values('patient'))

normal_tissue = exome.xs('NT', level='sample_type')
normal_tissue = set(normal_tissue.index.get_level_values('patient'))

metastatic = exome.xs('TM', level='sample_type')
metastatic = set(metastatic.index.get_level_values('patient'))

blood = exome.xs('TB', level='sample_type')
blood = set(blood.index.get_level_values('patient'))
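
The five near-identical blocks above could be folded into one helper. The sketch below uses a hypothetical function, `patients_with`, and made-up rows; it assumes a frame indexed by `patient` and `sample_type`, like `exome`. It uses a boolean mask rather than `xs`, so a sample type that is absent from the table gives an empty set instead of a `KeyError`.

```python
import pandas as pd

def patients_with(frame, sample_type):
    """Hypothetical helper: set of patients having a row of this sample type."""
    idx = frame.index
    mask = idx.get_level_values('sample_type') == sample_type
    return set(idx.get_level_values('patient')[mask])

# Made-up stand-in for the exome table.
toy = pd.DataFrame(
    {'analysis_id': ['a1', 'a2', 'a3']},
    index=pd.MultiIndex.from_tuples(
        [('P1', 'TP'), ('P1', 'NB'), ('P2', 'TP')],
        names=['patient', 'sample_type']),
)

tumor = patients_with(toy, 'TP')
normal = patients_with(toy, 'NB')
metastatic = patients_with(toy, 'TM')  # type not present: empty set
```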

Compile lists of matched samples



In [10]:

    
tn_blood  = [i for i in pats if i in tumor and i in normal]
tn_tissue  = [i for i in pats if i in tumor and i in normal_tissue]
met_norm = [i for i in pats if i in normal and i in metastatic]
blood_tumor = [i for i in pats if i in normal_tissue and i in blood]

len(pats), len(tn_blood), len(tn_tissue), len(met_norm), len(blood_tumor)









    Out[10]:





(9759, 7002, 1689, 359, 18)

These are the patients with tumor, normal tissue, and blood samples.



In [11]:

    
triplets = [i for i in pats if i in normal and i in normal_tissue and i in tumor]
len(triplets)









    Out[11]:





450
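
Because the tissue lists are already sets, the same pairings can also be written as set intersections. A small sketch with made-up patient IDs follows. It gives the same members as the list comprehensions provided every set is a subset of `pats`, which holds here since the exome rows come from the same table.

```python
# Made-up patient sets standing in for the ones built from the manifest.
tumor = {'P1', 'P2', 'P3'}
normal = {'P1', 'P3'}
normal_tissue = {'P3', 'P4'}

# Intersections replace the "i in a and i in b" comprehensions;
# sorted() just makes the order reproducible.
tn_blood = sorted(tumor & normal)
tn_tissue = sorted(tumor & normal_tissue)
triplets = sorted(tumor & normal & normal_tissue)
```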
