TCGA Data Analysis: Pan-Cancer Data Import

Check for correlation between MYC-amplified tumor samples and TP53 mutations. MYC amplification is determined in two different ways:

  1. MYC is in a segment with a copy number > 4 (this likely excludes chromosome-arm level events)
  2. "Our" definition of a focal amplification
    • segment is small (smaller than 20 Mbp)
    • segment covering the gene is increased (log2-ratio > 0.2)
    • segment is focal (log2-ratio of the segment is at least 0.2 above the weighted mean of the neighbouring 20 Mbp)
    • segment does not contain segmental duplications (less than 50%)
    • segment doesn't overlap with a known CNV from the Database of Genomic Variants (overlap meaning the segment's start and end points are within 100 kbp of the variant's start and end points)
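
The two definitions above can be sketched as simple predicates. This is only an illustration of the listed criteria, not the code in the `Focal_amplification` module: the `Segment` record, its field names and both function names are hypothetical, and the neighbouring mean, the segmental-duplication fraction and the known-CNV match are assumed to be precomputed.

```python
from collections import namedtuple

# Hypothetical segment record; all field names are assumptions for illustration.
Segment = namedtuple("Segment", [
    "start", "end",          # genomic coordinates in bp
    "copy_number",           # copy number of the segment
    "log2_ratio",            # log2-ratio of the segment
    "neighbour_log2_mean",   # weighted mean log2-ratio of the neighbouring 20 Mbp
    "segdup_fraction",       # fraction of the segment covered by segmental duplications
    "matches_known_cnv",     # True if both ends lie within 100 kbp of a known variant's ends
])

def is_high_copy(seg, min_copy_number=4):
    """Definition 1: the segment has a copy number above the threshold."""
    return seg.copy_number > min_copy_number

def is_focal_amplification(seg, max_size=20e6, min_log2=0.2,
                           min_focal_diff=0.2, max_segdup=0.5):
    """Definition 2: all five focal-amplification criteria must hold."""
    small = (seg.end - seg.start) < max_size
    increased = seg.log2_ratio > min_log2
    focal = (seg.log2_ratio - seg.neighbour_log2_mean) >= min_focal_diff
    few_segdups = seg.segdup_fraction < max_segdup
    not_known_cnv = not seg.matches_known_cnv
    return small and increased and focal and few_segdups and not_known_cnv
```

A 5 Mbp segment with a log2-ratio of 1.0 over a neighbourhood mean of 0.1 passes definition 2, while the same segment stretched to 60 Mbp fails the size criterion.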

Import Packages, Modules and Classes


In [1]:
import glob
import sys
from TCGA_sample import TCGA_sample
from Focal_amplification import Focal_amplification
from CNV_segment import CNV_segment
from Gene import Gene
import pandas
import numpy
import os
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline

Define file and directory names for the pan-cancer samples and genes


In [2]:
gene_positions_file = "./Ref/genes_unique.txt"
focal_directory = "./PANCANCER/FocalOutput/"
rnaseq_directory = "./PANCANCER/RNASeq/"
cnv_directory = "./PANCANCER/CNV/"
somatic_directory = "./PANCANCER/SomaticMutations/"
clinical_directory="./PANCANCER/Clinical/"

Load Gene positions


In [3]:
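gene_positions = dict()
# open the file in a with-block so the handle is closed after reading
with open(gene_positions_file, "r") as GENES:
  header = GENES.readline()  # skip the header line
  for line in GENES:
    # strip the trailing newline so it does not end up in the last column
    info = line.rstrip("\n").split("\t")
    # the gene name is in the fifth column and doubles as the dictionary key
    tmp_gene = Gene(info[4],info[0], info[1], info[2], info[3])
    gene_positions[info[4]] = tmp_gene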

Define samples and load output of focal amplification calling (our definition)

We can only use samples where Focal Amplification data (hence CNV data) are available


In [4]:
print "Loading Focal Amplification data"
samples = dict()
files = glob.glob(focal_directory+"*.csv")
for focal in files: 
  sample_id = os.path.basename(focal)[:16]
  if sample_id in TCGA_sample.sample_ids:
     print "Sample ID already exists"
     continue
  sample = TCGA_sample(sample_id)
  sample.loadFocalOutput(focal)
  samples[sample_id]=sample


Loading Focal Amplification data

Load RNASeq Data


In [5]:
print "Loading RNASeq data"
files = glob.glob(rnaseq_directory+"TCGA*.txt")
for rna_file in files: 
  sample_id = os.path.basename(rna_file)[:16]
  if sample_id in samples:
    samples[sample_id].loadRNASeq(rna_file)


Loading RNASeq data

Load RNASeq Data of controls


In [ ]:
print "Loading RNASeq data of controls"
controls = dict()
files = glob.glob(rnaseq_directory+"TCGA*.txt")
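for rna_file in files:
    # derive the sample ID from the RNASeq file name
    sample_id = os.path.basename(rna_file)[:16]
    # sample type codes 10-19 (first digit '1') mark normal tissue
    if sample_id[13] == '1':
        control = TCGA_sample(sample_id)
        control.loadRNASeq(rna_file)
        controls[sample_id] = control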
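
The slices used in the loading cells rely on the fixed layout of TCGA barcodes: the first 12 characters identify the participant, characters 14 and 15 hold the sample type code (01-09 tumor, 10-19 normal) and character 16 is the vial letter. A minimal sketch of that layout follows; `parse_barcode` and `is_normal` are hypothetical helpers, not part of `TCGA_sample`.

```python
def parse_barcode(barcode):
    """Split a TCGA barcode (or a file name starting with one) into its parts."""
    sample_id = barcode[:16]              # e.g. "TCGA-A8-A07C-01A"
    return {
        "sample_id": sample_id,
        "patient_id": sample_id[:12],     # e.g. "TCGA-A8-A07C"
        "sample_type": sample_id[13:15],  # e.g. "01"
        "vial": sample_id[15],            # e.g. "A"
    }

def is_normal(sample_id):
    """Sample type codes 10-19 denote normal tissue, so the first digit is '1'."""
    return sample_id[13] == "1"
```

For example, `parse_barcode("TCGA-A8-A07C-01A")["sample_type"]` gives `"01"`, a tumor sample, so `is_normal` returns `False` for it.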

Load raw CNV data


In [6]:
print "Loading CNV Data"
files = glob.glob(cnv_directory+"TCGA*.txt")
for cnv_file in files: 
  sample_id = os.path.basename(cnv_file)[:16]
  if sample_id in samples:
    samples[sample_id].loadCNVData(cnv_file)


Loading CNV Data

Load Somatic Mutations calls


In [7]:
print "Loading Somatic Mutation Data"
files = glob.glob(somatic_directory+"TCGA*.maf.txt")
somatic_not_found = 0
for maf_file in files:
   # match on the first 15 characters; try vial "A" first, then vial "B"
   sample_id = os.path.basename(maf_file)[:15]+"A"
   if sample_id in samples:
      samples[sample_id].loadSomaticMutation(maf_file)
   elif sample_id[:-1]+"B" in samples:
      samples[sample_id[:-1]+"B"].loadSomaticMutation(maf_file)
   else:
      somatic_not_found += 1

print str(somatic_not_found)+" samples have somatic but no CNV data"


Loading Somatic Mutation Data
TCGA-17-Z059-01A not found
TCGA-AB-2876-03A not found
TCGA-A8-A07C-01A not found
TCGA-BP-4345-01A not found
TCGA-B5-A11M-01A not found
TCGA-19-1790-01A not found
TCGA-17-Z058-01A not found
TCGA-09-2049-01A not found
TCGA-BH-A0HN-01A not found
TCGA-17-Z011-01A not found
TCGA-17-Z041-01A not found
TCGA-D1-A16F-01A not found
TCGA-BH-A0B8-01A not found
TCGA-17-Z018-01A not found
TCGA-17-Z050-01A not found
TCGA-76-4927-01A not found
TCGA-B5-A11X-01A not found
TCGA-17-Z038-01A not found
TCGA-17-Z009-01A not found
TCGA-17-Z030-01A not found
TCGA-17-Z037-01A not found
TCGA-17-Z032-01A not found
TCGA-06-5417-01A not found
TCGA-AB-2802-03A not found
TCGA-CN-5361-01A not found
TCGA-17-Z020-01A not found
TCGA-17-Z001-01A not found
TCGA-17-Z062-01A not found
TCGA-17-Z008-01A not found
TCGA-CQ-6219-01A not found
TCGA-17-Z054-01A not found
TCGA-C4-A0F7-01A not found
TCGA-17-Z010-01A not found
TCGA-BS-A0UM-01A not found
TCGA-32-4209-01A not found
TCGA-AB-2833-03A not found
TCGA-17-Z055-01A not found
TCGA-13-0894-01A not found
TCGA-CQ-6222-01A not found
TCGA-17-Z052-01A not found
TCGA-17-Z033-01A not found
TCGA-BH-A0HF-01A not found
TCGA-B0-5707-01A not found
TCGA-A7-A4SC-01A not found
TCGA-17-Z031-01A not found
TCGA-AB-2981-03A not found
TCGA-13-0765-01A not found
TCGA-AB-2891-03A not found
TCGA-17-Z053-01A not found
TCGA-AR-A0TU-01A not found
TCGA-AR-A1AT-01A not found
TCGA-17-Z056-01A not found
TCGA-17-Z017-01A not found
TCGA-17-Z025-01A not found
TCGA-17-Z000-01A not found
TCGA-A5-A0VO-01A not found
TCGA-06-0167-01A not found
TCGA-17-Z046-01A not found
TCGA-19-4068-01A not found
TCGA-C4-A0F1-01A not found
TCGA-25-1324-01A not found
TCGA-17-Z026-01A not found
TCGA-17-Z004-01A not found
TCGA-17-Z003-01A not found
TCGA-09-2053-01A not found
TCGA-17-Z042-01A not found
TCGA-17-Z035-01A not found
TCGA-28-5211-01A not found
TCGA-17-Z015-01A not found
TCGA-17-Z023-01A not found
TCGA-14-3476-01A not found
TCGA-E2-A1LS-01A not found
TCGA-17-Z028-01A not found
TCGA-17-Z044-01A not found
TCGA-17-Z040-01A not found
TCGA-AA-3980-01A not found
TCGA-17-Z061-01A not found
TCGA-17-Z045-01A not found
TCGA-A2-A0CZ-01A not found
TCGA-17-Z049-01A not found
TCGA-BH-A0HL-01A not found
TCGA-17-Z012-01A not found
TCGA-A5-A0G3-01A not found
TCGA-AN-A0G0-01A not found
TCGA-28-2499-01A not found
TCGA-17-Z048-01A not found
TCGA-17-Z005-01A not found
TCGA-17-Z036-01A not found
TCGA-17-Z039-01A not found
TCGA-BH-A0B1-01A not found
TCGA-17-Z027-01A not found
TCGA-76-4932-01A not found
TCGA-17-Z013-01A not found
TCGA-17-Z047-01A not found
TCGA-17-Z043-01A not found
TCGA-17-Z057-01A not found
TCGA-17-Z060-01A not found
TCGA-25-1326-01A not found
TCGA-25-1328-01A not found
TCGA-B6-A0I8-01A not found
TCGA-17-Z014-01A not found
TCGA-17-Z051-01A not found
TCGA-AB-2918-03A not found
TCGA-AG-A036-01A not found
TCGA-17-Z007-01A not found
TCGA-CN-4734-01A not found
TCGA-17-Z022-01A not found
TCGA-AB-2843-03A not found
TCGA-12-1597-01A not found
TCGA-B6-A0I6-01A not found
TCGA-AB-2979-03A not found
TCGA-17-Z016-01A not found
TCGA-A6-3808-01A not found
TCGA-28-1747-01A not found
TCGA-AB-2847-03A not found
TCGA-17-Z021-01A not found
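
The fallback from vial "A" to vial "B" in the cell above can be factored into a small helper. This is a minimal sketch: `match_vial` is a hypothetical function, and the IDs in the usage note are made up for illustration.

```python
def match_vial(barcode_prefix, known_ids, vials=("A", "B")):
    """Return the first known sample ID formed by the 15-character prefix
    plus a vial letter, or None if no vial matches."""
    for vial in vials:
        candidate = barcode_prefix + vial
        if candidate in known_ids:
            return candidate
    return None
```

With `known_ids = {"TCGA-XX-0001-01B"}`, the call `match_vial("TCGA-XX-0001-01", known_ids)` returns `"TCGA-XX-0001-01B"`, and a prefix with no matching vial returns `None`, which corresponds to the `somatic_not_found` branch.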

Load Clinical Data from Biotab File


In [8]:
print "Loading Clinical Data"
for sample_id in samples.keys():
    clinical_file = clinical_directory+sample_id[:12]+".txt"
    if os.path.isfile(clinical_file):
       samples[sample_id].loadClinicalData(clinical_file)


Loading Clinical Data

Check if loading worked and get statistics


In [9]:
focal_samples = 0
cnv_samples = 0
rnaseq_samples = 0
somatic_mutation_samples = 0
clinical_samples = 0
for sample in samples.values():
    if sample.focal_amplification_data:
        focal_samples += 1
    if sample.CNV_data:
        cnv_samples += 1
    if sample.rnaseq_data:
        rnaseq_samples += 1
    if sample.somatic_mutation_data:
        somatic_mutation_samples += 1
    if sample.clinical:
        clinical_samples += 1
sample_count = len(samples)
print "Samples: "+str(sample_count)
print "  --Focal Data: "+str(focal_samples)
print "  --CNV:        "+str(cnv_samples)
print "  --RNASeq:     "+str(rnaseq_samples)
print "  --Somatic:    "+str(somatic_mutation_samples)
print "  --Clinical:   "+str(clinical_samples)


Samples: 5737
  --Focal Data: 5737
  --CNV:        5737
  --RNASeq:     4196
  --Somatic:    3297
  --Clinical:   5312
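
The same tally can be written more compactly by looping over attribute names. This sketch assumes only that each sample exposes the five attributes checked above; `tally_data_types` and `FakeSample` are hypothetical stand-ins so the example runs without the `TCGA_sample` module.

```python
from collections import Counter

# Attribute names as used in the statistics cell above.
DATA_ATTRIBUTES = ["focal_amplification_data", "CNV_data", "rnaseq_data",
                   "somatic_mutation_data", "clinical"]

def tally_data_types(samples, attributes=DATA_ATTRIBUTES):
    """Count, per attribute, how many samples have non-empty data loaded."""
    counts = Counter()
    for sample in samples.values():
        for attr in attributes:
            if getattr(sample, attr, None):
                counts[attr] += 1
    return counts

class FakeSample(object):
    """Stand-in for TCGA_sample holding only the attributes passed in."""
    def __init__(self, **data):
        self.__dict__.update(data)

# Made-up demo data: two samples with CNV data, one of them with RNASeq data.
demo = {
    "s1": FakeSample(CNV_data=[1], rnaseq_data=[2]),
    "s2": FakeSample(CNV_data=[1]),
}
counts = tally_data_types(demo)
```

Because `Counter` returns 0 for missing keys, data types that no sample has loaded can still be printed without a special case.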
