Download and parse the CGHub manifest file, which contains a summary of all of the data in CGHub.
In [1]:
import pandas as pd
In [2]:
MANIFEST = 'https://cghub.ucsc.edu/reports/SUMMARY_STATS/LATEST_MANIFEST.tsv'
In [3]:
cols = ['study','barcode','disease','platform','assembly','state',
'sample_type','library_type','uploaded','analysis_id',
'filename']
tcga = pd.read_table(MANIFEST, low_memory=False, usecols=cols)
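To check the column selection without touching the network, the same kind of call can be run on a small in-memory table. This is only a minimal sketch: the two rows are made-up placeholders rather than real manifest entries, and `extra_column` is a hypothetical column standing in for whatever fields are not listed in `cols`.

```python
import io
import pandas as pd

cols = ['study', 'barcode', 'platform']

# Made-up rows for illustration only; not real manifest content.
toy_tsv = (
    "study\tbarcode\tplatform\textra_column\n"
    "TCGA\tTCGA-XX-0001-01A\tILLUMINA\tfoo\n"
    "OTHER\tOTHER-0002\tSOLID\tbar\n"
)

# usecols keeps only the named columns and drops everything else at parse time.
toy = pd.read_csv(io.StringIO(toy_tsv), sep='\t', usecols=cols)
print(toy.columns.tolist())
```

Restricting the columns at read time keeps the resulting table small, which helps when the manifest has many rows.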
For now I'm only looking at the TCGA data run on Illumina sequencers, so I pull out that subset of entries. I'm also only looking at data aligned to the HG19/GRCh37 genome. We could realign the other samples, but that takes a while and is outside the scope of the current analysis.
In [4]:
tcga = tcga[tcga.study == 'TCGA']
tcga = tcga[tcga.platform == 'ILLUMINA']
tcga = tcga[tcga.assembly.isin(['GRCh37-lite','HG19_Broad_variant','HG19',
'GRCh37-lite-+-HPV_Redux-build','GRCh37'])]
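The three filters above can also be written as one combined boolean mask, which avoids building intermediate tables. A minimal sketch on a toy table follows; the rows, including the `CCLE`, `SOLID` and `NCBI36` values, are made up for illustration and are not taken from the manifest.

```python
import pandas as pd

# Toy stand-in for the manifest; all values are illustrative assumptions.
toy = pd.DataFrame({
    'study': ['TCGA', 'TCGA', 'CCLE', 'TCGA'],
    'platform': ['ILLUMINA', 'SOLID', 'ILLUMINA', 'ILLUMINA'],
    'assembly': ['HG19', 'HG19', 'GRCh37', 'NCBI36'],
})

keep_assemblies = ['GRCh37-lite', 'HG19_Broad_variant', 'HG19',
                   'GRCh37-lite-+-HPV_Redux-build', 'GRCh37']

# Each comparison yields a boolean Series; & combines them row by row.
mask = ((toy.study == 'TCGA')
        & (toy.platform == 'ILLUMINA')
        & toy.assembly.isin(keep_assemblies))
filtered = toy[mask]
print(filtered)
```

The parentheses around each comparison matter, because `&` binds more tightly than `==` in Python.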
I load these records into a DataFrame with a 3-level hierarchical index. This index contains most of the information we will need for queries, while the columns give the information needed to retrieve the files. Some patients have had the same assay performed multiple times; in this case I take the most recent one.
In [5]:
tcga = tcga[tcga.barcode.notnull()]
tcga['patient'] = tcga['barcode'].map(lambda s: s[:12])
tcga = tcga[tcga.state == 'Live']
tcga = tcga[tcga.assembly != 'unaligned']
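# DataFrame.sort and take_last are gone from current pandas;
# sort_values and keep='last' are the replacements.
tcga = tcga.sort_values('uploaded')
tcga = tcga.drop_duplicates(subset=['patient', 'sample_type','library_type'],
                            keep='last')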
tcga = tcga.set_index(['patient', 'sample_type','library_type'])
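The "keep the most recent assay" step is easy to get backwards, so here is a minimal sketch of it on a toy table, using the `sort_values` and `keep='last'` spellings available in current pandas. The patient IDs, dates and `analysis_id` values are made up. The sketch also assumes the upload dates are ISO-formatted strings, so that a plain string sort is chronological; if they are not, parsing them first with `pd.to_datetime` would be safer.

```python
import pandas as pd

# Toy records: patient P1 has the same assay uploaded twice.
toy = pd.DataFrame({
    'patient': ['P1', 'P1', 'P2'],
    'sample_type': ['TP', 'TP', 'TP'],
    'library_type': ['WXS', 'WXS', 'WXS'],
    'uploaded': ['2012-01-01', '2013-06-01', '2012-03-01'],
    'analysis_id': ['old', 'new', 'only'],
})

key = ['patient', 'sample_type', 'library_type']

# Sort oldest to newest, then keep the last (newest) row per key.
latest = (toy.sort_values('uploaded')
             .drop_duplicates(subset=key, keep='last')
             .set_index(key))
print(latest)
```

After this step each (patient, sample_type, library_type) combination appears once, so a full index tuple identifies a single record.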
In [6]:
tcga.head().T
Out[6]:
In TCGA, at least, most patients are run against a blood control. Some also have metastatic samples and/or adjacent normal tissue samples. Here I parse the manifest to build lists of data availability and paired samples. I do this for the patients with exome data, but you could also look at the other measurement platforms.
Form tissue type lists
In [9]:
exome = tcga.xs('WXS', axis=0, level='library_type')
pats = set(tcga.index.get_level_values('patient'))
tumor = exome.xs('TP', level='sample_type')
tumor = set(tumor.index.get_level_values('patient'))
normal = exome.xs('NB', level='sample_type')
normal = set(normal.index.get_level_values('patient'))
normal_tissue = exome.xs('NT', level='sample_type')
normal_tissue = set(normal_tissue.index.get_level_values('patient'))
metastatic = exome.xs('TM', level='sample_type')
metastatic = set(metastatic.index.get_level_values('patient'))
blood = exome.xs('TB', level='sample_type')
blood = set(blood.index.get_level_values('patient'))
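The five near-identical pairs of lines above could be folded into one small helper. The sketch below is one possible way to do it, not the notebook's own method: `patients_with` is a hypothetical name, and the toy index values are made up. It uses a boolean mask instead of `xs`, so a sample type that is absent from the table gives an empty set, whereas `xs` raises a `KeyError` in that case.

```python
import pandas as pd

def patients_with(frame, sample_type):
    """Return the set of patients that have a row of the given sample type."""
    is_type = frame.index.get_level_values('sample_type') == sample_type
    return set(frame[is_type].index.get_level_values('patient'))

# Toy stand-in for the exome table, indexed by patient and sample type.
idx = pd.MultiIndex.from_tuples(
    [('P1', 'TP'), ('P1', 'NB'), ('P2', 'TP')],
    names=['patient', 'sample_type'])
toy_exome = pd.DataFrame({'analysis_id': ['a', 'b', 'c']}, index=idx)

tumor_toy = patients_with(toy_exome, 'TP')
normal_toy = patients_with(toy_exome, 'NB')
metastatic_toy = patients_with(toy_exome, 'TM')  # no TM rows in the toy data
print(tumor_toy, normal_toy, metastatic_toy)
```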
Compile lists of matched samples
In [10]:
tn_blood = [i for i in pats if i in tumor and i in normal]
tn_tissue = [i for i in pats if i in tumor and i in normal_tissue]
met_norm = [i for i in pats if i in normal and i in metastatic]
blood_tumor = [i for i in pats if i in normal_tissue and i in blood]
len(pats), len(tn_blood), len(tn_tissue), len(met_norm), len(blood_tumor)
Out[10]:
The next cell lists the patients who have all three of tumor, adjacent normal tissue, and blood normal samples.
In [11]:
triplets = [i for i in pats if i in normal and i in normal_tissue and i in tumor]
len(triplets)
Out[11]:
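Since the tissue-type collections are already sets, the matched lists can also be expressed as set intersections. A minimal sketch with made-up patient IDs, assuming the same meaning for `tumor`, `normal` and `normal_tissue` as in the cells above:

```python
# Made-up patient IDs for illustration only.
tumor = {'P1', 'P2', 'P3'}
normal = {'P1', 'P3'}          # blood-derived normal
normal_tissue = {'P3', 'P4'}   # adjacent normal tissue

# & is set intersection: patients present in every operand.
tn_blood = sorted(tumor & normal)
tn_tissue = sorted(tumor & normal_tissue)
triplets = sorted(tumor & normal & normal_tissue)
print(tn_blood, tn_tissue, triplets)
```

Sorting the result gives a stable order, which the list comprehensions over a set do not guarantee between runs.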