In [1]:
import os as os
import pandas as pd

import NotebookImport #local script which handles importing notebooks as modules

Style-Sheet

CGHub Data Retrieval Cookbook

CGHub is a great resource for obtaining large amounts of (mostly cancer) sequencing data. Here I am describing my analysis pipeline for retreiving and computing on this data.

10,000 foot view

The main objective here is to build up bash scripts to perform tasks related to downloading, processing, and cleaning up sequence files from CGHub. I define a set of objects which contain funtions to generate bash commands. Higher level functions then piece these outputs together and form them into scripts to preform a task. Finally annother optional layer is added to set up jobs on the cluster or distribute tasks to be run in parallel.

Dependencies

The first step in this analysis pipeline is setting up the required tools and dependencies to download the data and do basic tasks such as variant calling. For this I recomend downloading and installing bcbio-nextgen analysis package. While I do not explicitly use their pipeline, I have found that their automated software instalation scripts as well as their test suite to be very usefull in getting all of the basic dependencies in order.

Annother dependency is instalation of the CGHub GeneTerrent software. We are mainly using the GTDownload function. Instructions for downloading and instaliation can be found here.

Global Variables

We need to set up a number of global variables with the locations of data-files and software. This is going to be user specific so I recomend going into the Globals notebook and adjusting the paths.

For my use-case, I have two seperate systems I run this pipeline on. I use our in-house cluster for longer jobs in which I generally download full data-files, and our Annai Systems virtual machine for quick jobs in which I can get away with mounting BAM files using their GT-Fuse software. These have slightly difference configurations so I use different global configuration files for each.


In [2]:
import Globals_Cellar as cellar


importing IPython notebook from Globals_Cellar

In [3]:
import Globals_Annai as annai


importing IPython notebook from Globals_Annai

Import Helper Objects


In [4]:
from CGHelpers import CGFile, MutationCalling


importing IPython notebook from CGHelpers

CGFile object

This is our main abstraction for dealing with files from CGhub. Here we store a number of attributes that are important for retreiving, and cleaning up files from CGHub. The process for downloading and cleaning up files is different using the standard GTDownload mechansim or the proprietary GTFuse functionality available on the Annai machine. This object is meant to abtract away the housekeeping required for different methods of downloading the data.

MutationCalling Object

This is an object to handing doing variant calling on matched tumor-normal genomes. We run MuTect and SomaticIndelDetector for SNVs and indels, respectively.

Generate Mutation Calls: GTDownload

Here we are going to do mutation calls using GTDownload. I.E. we download the entire BAM files, call mutations on a subset of genes, and then clean up.

Initialize Objects

First we set global parameters in the CGFile object and intialize our mutation calling object.


In [5]:
GLOBALS = cellar

In [6]:
#CGFile.gt_fuse = gt_fuse
CGFile.gt_download = GLOBALS.GT_DOWNLOAD
CGFile.cghub = GLOBALS.CGHUB
#CGFile.cache = CACHE
caller = MutationCalling(GLOBALS.JAVA, GLOBALS.MUTECT_JAR, GLOBALS.SID_JAR, 
                         GLOBALS.REFERENCE, GLOBALS.DBSNP, GLOBALS.COSMIC)

Read in a set of regions we want to make calls across. Our downstream functions expect this in the form of a list. Regions should be formated of the style: chr:start-end.


In [7]:
f = open('/cellar/users/agross/Downloads_Old/cancer_genes.samtools.txt', 'rb')
regions = [l.split('$')[2].split()[1] for l in f.readlines()]
regions = regions[:10]

Import CGHub manifest of tcga exome files and list of tumor-normal pairs.


In [8]:
from CGHub_Manifest import tcga, tn_blood


importing IPython notebook from CGHub_Manifest

In [9]:
local_dir = '/cellar/users/agross/Data/WXS/Hannah'
if not os.path.isdir(local_dir):
    os.makedirs(local_dir)

Instantiate function to spit out scripts.


In [10]:
fx = caller.mutation_caller(tcga, regions=regions, directory='/tmp', 
                            local_dir=local_dir, library_type='WXS')

Here is an example of a script for a single patient.


In [11]:
print fx('TCGA-13-1411')


echo running patient TCGA-13-1411 at `date`;
mkdir /tmp/TCGA-13-1411_NB_WXS;
gtdownload --inactivity-timeout=720 --max-children 1 -c /cellar/users/agross/.ssh/cghub.key https://cghub.ucsc.edu/cghub/data/analysis/download/d8bab086-d1b9-4a27-a879-eb87bd77b834 -p /tmp/TCGA-13-1411_NB_WXS;
mkdir /tmp/TCGA-13-1411_TP_WXS;
gtdownload --inactivity-timeout=720 --max-children 1 -c /cellar/users/agross/.ssh/cghub.key https://cghub.ucsc.edu/cghub/data/analysis/download/d392dc95-c15e-4eb5-972f-ddadeeb9fefe -p /tmp/TCGA-13-1411_TP_WXS;
/cellar/software/jdk1.7.0_45/bin/java -Xmx32g -jar /cellar/users/agross/sources/bcbio_nextgen_local/share/java/mutect/muTect-1.1.5.jar --coverage_file /tmp//TCGA-13-1411_coverage.wig.txt --performanceLog /tmp//TCGA-13-1411_mutect_log.txt --cosmic /cellar/users/agross/sources/bcbio_nextgen_local/genomes/Hsapiens/GRCh37/variation/cosmic-v68-GRCh37.vcf.gz --analysis_type MuTect --dbsnp /cellar/users/agross/sources/bcbio_nextgen_local/genomes/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz --reference_sequence /cellar/users/agross/sources/bcbio_nextgen_local/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --out /tmp//TCGA-13-1411_call_stats.txt --input_file:normal /tmp/TCGA-13-1411_NB_WXS/d8bab086-d1b9-4a27-a879-eb87bd77b834/C239.TCGA-13-1411-10A-01W.2.bam --input_file:tumor /tmp/TCGA-13-1411_TP_WXS/d392dc95-c15e-4eb5-972f-ddadeeb9fefe/C239.TCGA-13-1411-01A-01W.2.bam --intervals 1:27021521-27109601 --intervals 2:29414639-30145477 --intervals 5:112042201-112182936 --intervals 6:157098063-157532913 --intervals 9:133588267-133764062 --intervals 12:46122619-46302819 --intervals 12:52344450-52391863 --intervals 14:105234686-105263080 --intervals X:63403996-63426624 --intervals X:66762873-66951461;
/cellar/software/jdk1.7.0_45/bin/java -Xmx32g -jar /cellar/users/hcarter/programs/GenomeAnalysisTK-2.2-2/GenomeAnalysisTK.jar --analysis_type SomaticIndelDetector --performanceLog /tmp//TCGA-13-1411_indel_log.txt --out /tmp//TCGA-13-1411_indels.txt --reference_sequence /cellar/users/agross/sources/bcbio_nextgen_local/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --input_file:normal /tmp/TCGA-13-1411_NB_WXS/d8bab086-d1b9-4a27-a879-eb87bd77b834/C239.TCGA-13-1411-10A-01W.2.bam --input_file:tumor /tmp/TCGA-13-1411_TP_WXS/d392dc95-c15e-4eb5-972f-ddadeeb9fefe/C239.TCGA-13-1411-01A-01W.2.bam --intervals 1:27021521-27109601 --intervals 2:29414639-30145477 --intervals 5:112042201-112182936 --intervals 6:157098063-157532913 --intervals 9:133588267-133764062 --intervals 12:46122619-46302819 --intervals 12:52344450-52391863 --intervals 14:105234686-105263080 --intervals X:63403996-63426624 --intervals X:66762873-66951461;
rm -rf /tmp/TCGA-13-1411_NB_WXS;
rm -rf /tmp/TCGA-13-1411_TP_WXS;
mv /tmp/TCGA-13-1411*.txt* /cellar/users/agross/Data/WXS/Hannah;
echo finished patient TCGA-13-1411 at `date`

Now we generate scripts for all of the patients we want to run, as well as a driver script to launch the SGE job.


In [12]:
from ParallelHelpers import generate_sge


importing IPython notebook from ParallelHelpers

In [13]:
script_dir = '/cellar/users/agross/scripts/Hannah_WXS_Test'

In [14]:
generate_sge(tn_blood[:100], fx, script_dir, threads=12)


Writing scripts and SGE driver to /cellar/users/agross/scripts/Hannah_WXS_Test

Pulling Reads with GTFuse and Samtools View

Here we are going to use GTFuse to grab the RNA-Seq reads mapping to a handfull of mutations for a single patient. region_subsetter is a function from the CGHelpers Notebook which does this. For now I have it configured to pull all reads mapping to a set of mutations supplied to it in the form of a MAF file. At some point I will generalize it.


In [15]:
from CGHelpers import region_subsetter

In [16]:
GLOBALS = annai

CGFile.gt_fuse = GLOBALS.GT_FUSE
CGFile.cghub = GLOBALS.CGHUB
CGFile.cache = GLOBALS.CACHE

In [17]:
maf = pd.read_csv('/cellar/users/agross/TCGA_Code/TCGA/Data/MAFs/mega_maf.csv', 
                  index_col=0)

In [18]:
subsetter = region_subsetter(maf, tcga, directory=GLOBALS.VM_DIRECTORY, 
                             library_type='RNA-Seq', tissue_code='TP',
                             out='RNA-seq')

In [19]:
print subsetter('TCGA-06-0878')


echo running patient TCGA-06-0878 at `date`;
mkdir /home/centos/projects/TCGA-06-0878_TP_RNA-Seq;
gtfuse -c /home/centos/cghub.key --inactivity-timeout=2 https://cghub.ucsc.edu/cghub/data/analysis/download/72ac9bf5-ccb1-46a7-9341-4178b5dc8f17 /home/centos/projects/TCGA-06-0878_TP_RNA-Seq;
samtools view /home/centos/projects/TCGA-06-0878_TP_RNA-Seq/72ac9bf5-ccb1-46a7-9341-4178b5dc8f17/UNCID_1538835.903d756b-42ba-49bd-9a53-085b3211d8a7.sorted_genome_alignments.bam chr1:15428142-15428142 chr1:15428142-15428142 chr1:22206668-22206668 chr1:89729595-89729595 chr1:104122114-104122114 chr1:157667597-157667597 chr1:249148230-249148230 chr2:3743415-3743415 chr2:103379127-103379127 chr2:161025770-161025770 chr2:196865488-196865488 chr2:227660808-227660808 chr3:13672223-13672223 chr3:182585181-182585181 chr4:1206064-1206064 chr4:10515205-10515205 chr4:106406491-106406491 chr4:106406491-106406491 chr4:155252747-155252747 chr5:11022883-11022883 chr5:140222517-140222517 chr5:140222517-140222517 chr5:169122848-169122848 chr6:55142306-55142306 chr7:31016925-31016925 chr7:55221781-55221781 chr7:55221796-55221796 chr7:55221821-55221821 chr7:82997221-82997221 chr7:100350611-100350611 chr7:107706946-107706946 chr7:151892993-151892993 chr8:69104577-69104577 chr9:113704320-113704320 chr9:123675735-123675735 chr9:127670739-127670739 chr9:138667205-138667205 chr10:105823552-105823552 chr11:55873034-55873034 chr11:67941367-67941367 chr12:43770043-43770043 chr12:54913072-54913072 chr12:81688794-81688794 chr12:109843786-109843786 chr12:112703783-112703783 chr13:50746904-50746904 chr13:50746904-50746904 chr15:50254197-50254197 chr16:315011-315011 chr16:20357449-20357449 chr16:50731207-50731207 chr16:84472802-84472802 chr17:21730916-21730916 chr17:21730916-21730916 chr19:40395919-40395919 chr19:41292794-41292794 chr19:41726597-41726597 chr19:45912970-45912970 chr19:55420770-55420770 chr20:5903131-5903131 chr20:60697790-60697790 chr21:31852408-31852408 chr21:32007726-32007726 chr22:18178948-18178948 chr22:37482392-37482392 chr22:47189513-47189513 chrX:18275064-18275064 chrX:117054239-117054239 > /home/centos/projects/RNA-seq/TCGA-06-0878_RNA-Seq;
fusermount -u /home/centos/projects/TCGA-06-0878_TP_RNA-Seq;
rm -rf /home/centos/cache/fusecache/72ac9bf5-ccb1-46a7-9341-4178b5dc8f17*;
rm -rf /home/centos/projects/TCGA-06-0878_TP_RNA-Seq

From here you would throw a bunch of these into a folder and run a them in parallel on the Annai machine.