CGHub Manifest

Download and parse the CGHub manifest file, which contains a summary of all of the data in CGHub.


In [1]:
import pandas as pd

In [2]:
MANIFEST = 'https://cghub.ucsc.edu/reports/SUMMARY_STATS/LATEST_MANIFEST.tsv'

In [3]:
cols = ['study','barcode','disease','platform','assembly','state',
        'sample_type','library_type','uploaded','analysis_id',
        'filename']
tcga = pd.read_table(MANIFEST, low_memory=False, usecols=cols)
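
The column selection can be tried without network access. This is a minimal sketch, assuming a tiny in-memory TSV in place of the real manifest; the rows, the `extra_column` field, and the names `SAMPLE_TSV`, `wanted` and `sample` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row manifest; the real file has many more rows and columns.
SAMPLE_TSV = (
    "study\tbarcode\tdisease\tplatform\tassembly\tstate\textra_column\n"
    "TCGA\tTCGA-66-2768-01A-01W-0877-08\tLUSC\tILLUMINA\tHG19_Broad_variant\tLive\tx\n"
    "TCGA\tTCGA-64-1677-01A-01W-0928-08\tLUAD\tILLUMINA\tHG19_Broad_variant\tLive\ty\n"
    "OTHER\tEXAMPLE-0001\tLUAD\tILLUMINA\tunaligned\tLive\tz\n"
)

wanted = ['study', 'barcode', 'disease', 'platform', 'assembly', 'state']

# usecols discards 'extra_column' while parsing, so it never reaches the frame.
sample = pd.read_csv(io.StringIO(SAMPLE_TSV), sep='\t', usecols=wanted)
print(sample.shape)
```

Restricting the columns at read time keeps the frame small, which matters for a manifest that covers a whole repository.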

For now I'm only looking at the TCGA data run on Illumina sequencers, so I pull out that subset of entries. I'm also only looking at data aligned to the HG19/GRCh37 genome. While we could realign the other samples, this takes a while and is outside the scope of the current analysis.


In [4]:
tcga = tcga[tcga.study == 'TCGA']
tcga = tcga[tcga.platform == 'ILLUMINA']
tcga = tcga[tcga.assembly.isin(['GRCh37-lite','HG19_Broad_variant','HG19',
                    'GRCh37-lite-+-HPV_Redux-build','GRCh37'])]
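
The three filters can also be combined into a single boolean mask. This is a sketch rather than the notebook's own method: the function name `select_tcga_illumina_hg19` and the toy rows (including the `'OTHER'`, `'SOLID'` and `'NCBI36'` values) are assumptions for illustration.

```python
import pandas as pd

HG19_ASSEMBLIES = ['GRCh37-lite', 'HG19_Broad_variant', 'HG19',
                   'GRCh37-lite-+-HPV_Redux-build', 'GRCh37']

def select_tcga_illumina_hg19(manifest):
    """Keep TCGA rows sequenced on Illumina and aligned to an HG19/GRCh37 build."""
    mask = (
        (manifest['study'] == 'TCGA')
        & (manifest['platform'] == 'ILLUMINA')
        & manifest['assembly'].isin(HG19_ASSEMBLIES)
    )
    return manifest[mask]

# Made-up rows: only the first one satisfies all three conditions.
toy = pd.DataFrame({
    'study': ['TCGA', 'TCGA', 'OTHER', 'TCGA'],
    'platform': ['ILLUMINA', 'SOLID', 'ILLUMINA', 'ILLUMINA'],
    'assembly': ['HG19', 'HG19', 'GRCh37', 'NCBI36'],
})
selected = select_tcga_illumina_hg19(toy)
print(selected)
```

A single mask avoids building two intermediate frames and keeps the selection criteria in one place.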

I load these records into a DataFrame with a 3-level hierarchical index (patient, sample type, library type). This index contains most of the information we will need for queries, while the columns give the information needed to retrieve the files. Some patients have had the same assay performed multiple times; in this case I take the most recent one.


In [5]:
tcga = tcga[tcga.barcode.notnull()]
# The first 12 characters of a barcode identify the patient.
tcga['patient'] = tcga['barcode'].map(lambda s: s[:12])
tcga = tcga[tcga.state == 'Live']
tcga = tcga[tcga.assembly != 'unaligned']
# Sort by upload date so that keep='last' retains the most recent assay.
tcga = tcga.sort_values('uploaded')
tcga = tcga.drop_duplicates(subset=['patient', 'sample_type', 'library_type'],
                            keep='last')
tcga = tcga.set_index(['patient', 'sample_type', 'library_type'])
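
The "keep the most recent assay" rule can be checked on a toy table. This is a minimal sketch, assuming a pandas version that provides `sort_values` and the `keep=` argument of `drop_duplicates`; the patients `P1`/`P2` and the `analysis_id` values are made up.

```python
import pandas as pd

# P1 has the same assay uploaded twice; P2 has it once.
toy = pd.DataFrame({
    'patient': ['P1', 'P1', 'P2'],
    'sample_type': ['TP', 'TP', 'TP'],
    'library_type': ['WXS', 'WXS', 'WXS'],
    'uploaded': ['2010-08-27', '2012-03-01', '2011-01-15'],
    'analysis_id': ['old-run', 'new-run', 'only-run'],
})

key = ['patient', 'sample_type', 'library_type']

# ISO-formatted dates sort chronologically even as strings, so after sorting
# the last row of each duplicate group is the most recent upload.
latest = (toy.sort_values('uploaded')
             .drop_duplicates(subset=key, keep='last')
             .set_index(key)
             .sort_index())
print(latest)
```

Sorting first is what gives `keep='last'` its meaning; without it, "last" would simply follow the order of the manifest file.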

In [6]:
tcga.head().T


Out[6]:
patient TCGA-66-2768 TCGA-64-1677 TCGA-67-3772 TCGA-67-3772 TCGA-64-1677
sample_type TP TP NB TP NB
library_type WXS WXS WXS WXS WXS
study TCGA TCGA TCGA TCGA TCGA
barcode TCGA-66-2768-01A-01W-0877-08 TCGA-64-1677-01A-01W-0928-08 TCGA-67-3772-10A-01W-0928-08 TCGA-67-3772-01A-01W-0928-08 TCGA-64-1677-10A-01W-0928-08
disease LUSC LUAD LUAD LUAD LUAD
platform ILLUMINA ILLUMINA ILLUMINA ILLUMINA ILLUMINA
assembly HG19_Broad_variant HG19_Broad_variant HG19_Broad_variant HG19_Broad_variant HG19_Broad_variant
filename TCGA-66-2768-01A-01W-0877-08.bam C347.TCGA-64-1677-01A-01W-0928-08.1.bam C347.TCGA-67-3772-10A-01W-0928-08.1.bam C347.TCGA-67-3772-01A-01W-0928-08.1.bam C347.TCGA-64-1677-10A-01W-0928-08.1.bam
analysis_id fb280562-6067-4a92-9c6f-fbadbb6a748e 709b873d-c4eb-4dc2-a1dd-e486a6aca50f f2a2d424-0908-4589-b36c-cabf54c9920a 83024fb3-5782-4674-8e4c-83126cffa1a3 975c6317-99a8-4b0c-b70c-1774e42401d6
uploaded 2010-08-27 2010-12-01 2010-12-01 2010-12-01 2010-12-01
state Live Live Live Live Live

Form lists of matched analyses.

In TCGA, at least, most patients are run against a blood control. Some also have additional metastatic samples and/or adjacent normal tissue samples. Here I parse the manifest to form lists of data availability and paired samples. I do this for the patients with exome data, but you could also look at the other measurement platforms.

Form tissue type lists


In [9]:
exome = tcga.xs('WXS', axis=0, level='library_type')

pats = set(tcga.index.get_level_values('patient'))

tumor = exome.xs('TP', level='sample_type')
tumor = set(tumor.index.get_level_values('patient'))

normal = exome.xs('NB', level='sample_type')
normal = set(normal.index.get_level_values('patient'))

normal_tissue = exome.xs('NT', level='sample_type')
normal_tissue = set(normal_tissue.index.get_level_values('patient'))

metastatic = exome.xs('TM', level='sample_type')
metastatic = set(metastatic.index.get_level_values('patient'))

blood = exome.xs('TB', level='sample_type')
blood = set(blood.index.get_level_values('patient'))
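
The repeated `xs` / `get_level_values` pattern can be wrapped in a small helper. This is a sketch under assumptions: the helper name `patients_with` and the toy index are hypothetical, and the helper filters on the index level directly instead of calling `xs`, so a sample type that is absent yields an empty set rather than a `KeyError`.

```python
import pandas as pd

def patients_with(frame, sample_type):
    """Return the set of patients that have at least one row of sample_type."""
    types = frame.index.get_level_values('sample_type')
    patients = frame.index.get_level_values('patient')
    return set(patients[types == sample_type])

# Toy stand-in for the exome table, indexed by (patient, sample_type).
toy_index = pd.MultiIndex.from_tuples(
    [('P1', 'TP'), ('P1', 'NB'), ('P2', 'TP'), ('P3', 'NT')],
    names=['patient', 'sample_type'])
toy_exome = pd.DataFrame({'filename': ['a.bam', 'b.bam', 'c.bam', 'd.bam']},
                         index=toy_index)

# One set of patients per sample-type code used in this notebook.
by_type = {code: patients_with(toy_exome, code)
           for code in ['TP', 'NB', 'NT', 'TM', 'TB']}
print(by_type)
```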

Compile lists of matched samples


In [10]:
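# Primary tumor (TP) matched with a blood normal (NB).
tn_blood = [i for i in pats if i in tumor and i in normal]
# Primary tumor (TP) matched with normal tissue (NT).
tn_tissue = [i for i in pats if i in tumor and i in normal_tissue]
# Metastatic sample (TM) matched with a blood normal (NB).
met_norm = [i for i in pats if i in normal and i in metastatic]
# Blood sample (TB) matched with normal tissue (NT).
blood_tumor = [i for i in pats if i in normal_tissue and i in blood]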

len(pats), len(tn_blood), len(tn_tissue), len(met_norm), len(blood_tumor)


Out[10]:
(9759, 7002, 1689, 359, 18)
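
Because the per-type collections are already sets, the same matching can be written with set intersection. This is a minimal sketch on made-up patient IDs, not a replacement for the cell above.

```python
# Toy patient sets standing in for the ones built from the manifest.
tumor = {'P1', 'P2', 'P3'}
normal = {'P1', 'P3'}          # blood normal
normal_tissue = {'P2', 'P3'}   # adjacent normal tissue

# `a & b` keeps only the patients present in both sets.
tn_blood = sorted(tumor & normal)
tn_tissue = sorted(tumor & normal_tissue)
print(tn_blood, tn_tissue)
```

The result matches the list comprehensions as long as every per-type set is a subset of `pats`, which holds here because the exome table is itself a subset of the full table.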

These are the patients with a tumor sample, a blood normal sample, and a normal tissue sample.


In [11]:
triplets = [i for i in pats if i in normal and i in normal_tissue and i in tumor]
len(triplets)


Out[11]:
450
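
Data availability can also be viewed as a patient-by-sample-type table. This is a sketch assuming `patient` and `sample_type` are ordinary columns (for the indexed frame, `reset_index()` would expose them); the toy rows and the names `availability` and `triplets_toy` are hypothetical.

```python
import pandas as pd

# Made-up rows: P1 has all three sample types, P2 and P3 have two each.
toy = pd.DataFrame({
    'patient': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3', 'P3'],
    'sample_type': ['TP', 'NB', 'NT', 'TP', 'NB', 'TP', 'NT'],
})

# Count rows per (patient, sample_type), then reduce to present / absent.
availability = pd.crosstab(toy['patient'], toy['sample_type']) > 0

# Patients for which tumor, blood normal and normal tissue are all present.
has_all = availability[['TP', 'NB', 'NT']].all(axis=1)
triplets_toy = availability.index[has_all].tolist()
print(availability)
print(triplets_toy)
```

A boolean table like this makes it easy to ask for other combinations without writing a new list comprehension each time.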