In [1]:
import os
import pandas as pd
import NotebookImport #local script which handles importing notebooks as modules
CGHub is a great resource for obtaining large amounts of (mostly cancer) sequencing data. Here I describe my analysis pipeline for retrieving and computing on this data.
The main objective here is to build up bash scripts that perform tasks related to downloading, processing, and cleaning up sequence files from CGHub. I define a set of objects containing functions that generate bash commands. Higher-level functions then piece these outputs together into scripts that perform a task. Finally, another optional layer is added to set up jobs on the cluster or distribute tasks to be run in parallel.
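To make the layering concrete, here is a minimal sketch of the pattern: an object whose methods emit bash commands, and a higher-level function that pieces commands into a script. The names here are purely illustrative, not the actual objects defined later in CGHelpers.

```python
class DownloadCommand(object):
    '''Illustrative low-level object: emits the bash command to fetch one
    file by its analysis id (names are hypothetical).'''
    def __init__(self, binary):
        self.binary = binary

    def command(self, analysis_id):
        return '{0} -d {1}'.format(self.binary, analysis_id)


def build_script(commands):
    '''Higher-level function: piece individual commands together into a
    single bash script.'''
    return '\n'.join(['#!/bin/bash'] + commands)


dl = DownloadCommand('gtdownload')
script = build_script([dl.command('abc-123'), 'rm -rf abc-123'])
print(script)
```

A third layer (shown later with generate_sge) would then write many such scripts out and submit them to the cluster.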
The first step in this analysis pipeline is setting up the required tools and dependencies to download the data and do basic tasks such as variant calling. For this I recommend downloading and installing the bcbio-nextgen analysis package. While I do not explicitly use their pipeline, I have found their automated software installation scripts, as well as their test suite, to be very useful in getting all of the basic dependencies in order.
Another dependency is the CGHub GeneTorrent software; we mainly use its GTDownload function. Instructions for downloading and installation can be found here.
We need to set up a number of global variables with the locations of data files and software. These are user specific, so I recommend going into the Globals notebook and adjusting the paths.
For my use case, I have two separate systems on which I run this pipeline. I use our in-house cluster for longer jobs, in which I generally download full data files, and our Annai Systems virtual machine for quick jobs, in which I can get away with mounting BAM files using their GTFuse software. These have slightly different configurations, so I use a different global configuration file for each.
In [2]:
import Globals_Cellar as cellar
In [3]:
import Globals_Annai as annai
In [4]:
from CGHelpers import CGFile, MutationCalling
This is our main abstraction for dealing with files from CGHub. Here we store a number of attributes that are important for retrieving and cleaning up files from CGHub. The process for downloading and cleaning up files differs between the standard GTDownload mechanism and the proprietary GTFuse functionality available on the Annai machine. This object is meant to abstract away the housekeeping required by the different methods of obtaining the data.
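The two code paths the object has to reconcile can be sketched as follows. This is a hypothetical illustration of the kind of housekeeping involved, not the real CGFile API: a downloaded file must be deleted afterwards, while a mounted one must be unmounted.

```python
class BamFile(object):
    '''Hypothetical sketch of a CGFile-style abstraction; attribute and
    method names are illustrative, not the real API.'''
    gt_download = 'gtdownload'
    gt_fuse = 'gtfuse'

    def __init__(self, analysis_id, use_fuse=False):
        self.analysis_id = analysis_id
        self.use_fuse = use_fuse

    def fetch(self):
        if self.use_fuse:
            #mount the remote BAM rather than copying it locally
            return '{0} {1}'.format(self.gt_fuse, self.analysis_id)
        return '{0} -d {1}'.format(self.gt_download, self.analysis_id)

    def cleanup(self):
        #a mounted file needs unmounting; a downloaded one needs deleting
        if self.use_fuse:
            return 'fusermount -u {0}'.format(self.analysis_id)
        return 'rm -rf {0}'.format(self.analysis_id)
```

Downstream script-building code can then ask any file for its fetch and cleanup commands without caring which transport is in use.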
This object handles variant calling on matched tumor-normal genomes. We run MuTect and SomaticIndelDetector for SNVs and indels, respectively.
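The actual commands are assembled inside MutationCalling, but a MuTect invocation for one region of a tumor-normal pair looks roughly like the following. This is a sketch with placeholder file names; the flags follow MuTect 1.x.

```python
def mutect_command(java, jar, reference, dbsnp, cosmic,
                   normal_bam, tumor_bam, region, out):
    '''Assemble a MuTect 1.x command line for one region of a
    tumor-normal pair (illustrative helper, not the real object).'''
    return ' '.join([java, '-jar', jar,
                     '--analysis_type', 'MuTect',
                     '--reference_sequence', reference,
                     '--dbsnp', dbsnp,
                     '--cosmic', cosmic,
                     '--intervals', region,           #restrict to one chr:start-end region
                     '--input_file:normal', normal_bam,
                     '--input_file:tumor', tumor_bam,
                     '--out', out])


cmd = mutect_command('java', 'muTect.jar', 'hg19.fa', 'dbsnp.vcf',
                     'cosmic.vcf', 'normal.bam', 'tumor.bam',
                     'chr17:7571720-7590868', 'calls.txt')
```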
Here we are going to do mutation calls using GTDownload. That is, we download the entire BAM files, call mutations on a subset of genes, and then clean up.
First we set global parameters on the CGFile object and initialize our mutation-calling object.
In [5]:
GLOBALS = cellar
In [6]:
#CGFile.gt_fuse = gt_fuse
CGFile.gt_download = GLOBALS.GT_DOWNLOAD
CGFile.cghub = GLOBALS.CGHUB
#CGFile.cache = CACHE
caller = MutationCalling(GLOBALS.JAVA, GLOBALS.MUTECT_JAR, GLOBALS.SID_JAR,
                         GLOBALS.REFERENCE, GLOBALS.DBSNP, GLOBALS.COSMIC)
Read in a set of regions we want to make calls across. Our downstream functions expect this in the form of a list. Regions should be formatted in the style chr:start-end.
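As a quick sanity check, a region string in that format can be unpacked like this (a trivial helper for illustration, not part of the pipeline):

```python
def parse_region(region):
    '''Split a region string of the form chr:start-end into its parts.'''
    chrom, span = region.split(':')
    start, end = span.split('-')
    return chrom, int(start), int(end)


parse_region('chr17:7571720-7590868')  #('chr17', 7571720, 7590868)
```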
In [7]:
f = open('/cellar/users/agross/Downloads_Old/cancer_genes.samtools.txt', 'rb')
regions = [l.split('$')[2].split()[1] for l in f.readlines()]  #pull out the chr:start-end field
regions = regions[:10]  #only run on the first 10 regions
f.close()
Import the CGHub manifest of TCGA exome files and the list of tumor-normal pairs.
In [8]:
from CGHub_Manifest import tcga, tn_blood
In [9]:
local_dir = '/cellar/users/agross/Data/WXS/Hannah'
if not os.path.isdir(local_dir):
    os.makedirs(local_dir)
Instantiate a function that spits out the per-patient scripts.
In [10]:
fx = caller.mutation_caller(tcga, regions=regions, directory='/tmp',
                            local_dir=local_dir, library_type='WXS')
Here is an example of a script for a single patient.
In [11]:
print fx('TCGA-13-1411')
Now we generate scripts for all of the patients we want to run, as well as a driver script to launch the SGE job.
In [12]:
from ParallelHelpers import generate_sge
In [13]:
script_dir = '/cellar/users/agross/scripts/Hannah_WXS_Test'
In [14]:
generate_sge(tn_blood[:100], fx, script_dir, threads=12)
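I do not show generate_sge itself here, but a rough sketch of what such a helper might do (a hypothetical implementation, not the actual ParallelHelpers code) is: write one shell script per patient, plus a driver that submits them all as a single SGE array job.

```python
import os

def generate_scripts(patients, script_fx, script_dir, threads=1):
    '''Hypothetical sketch of a generate_sge-style helper: write one shell
    script per patient and an SGE driver to run them as an array job.'''
    if not os.path.isdir(script_dir):
        os.makedirs(script_dir)
    for i, patient in enumerate(patients):
        with open(os.path.join(script_dir, '{0}.sh'.format(i)), 'w') as f:
            f.write(script_fx(patient))
    driver = '\n'.join(['#!/bin/bash',
                        '#$ -t 1-{0}'.format(len(patients)),  #array job: one task per patient
                        '#$ -pe smp {0}'.format(threads),     #slots (threads) per task
                        'bash {0}/$(($SGE_TASK_ID - 1)).sh'.format(script_dir)])
    with open(os.path.join(script_dir, 'driver.sh'), 'w') as f:
        f.write(driver)
```

Submitting driver.sh with qsub then fans the per-patient scripts out across the cluster.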
Here we are going to use GTFuse to grab the RNA-Seq reads mapping to a handful of mutations for a single patient. region_subsetter is a function from the CGHelpers notebook which does this. For now I have it configured to pull all reads mapping to a set of mutations supplied to it in the form of a MAF file. At some point I will generalize it.
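Conceptually, once a BAM is mounted with GTFuse, subsetting reduces to samtools view over the mount point. A minimal sketch of the idea (a hypothetical helper, not the real region_subsetter, which also handles mounting and MAF parsing):

```python
def subset_commands(mount_point, bam, regions, out):
    '''Emit the samtools commands to pull reads overlapping a list of
    chr:start-end regions from a (GTFuse-mounted, indexed) BAM.'''
    region_str = ' '.join(regions)
    return ['samtools view -b {0}/{1} {2} > {3}.bam'.format(mount_point, bam,
                                                            region_str, out),
            'samtools index {0}.bam'.format(out)]
```

Because only the reads overlapping the requested regions are touched, GTFuse only has to fetch a tiny fraction of the remote file.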
In [15]:
from CGHelpers import region_subsetter
In [16]:
GLOBALS = annai
CGFile.gt_fuse = GLOBALS.GT_FUSE
CGFile.cghub = GLOBALS.CGHUB
CGFile.cache = GLOBALS.CACHE
In [17]:
maf = pd.read_csv('/cellar/users/agross/TCGA_Code/TCGA/Data/MAFs/mega_maf.csv',
                  index_col=0)
In [18]:
subsetter = region_subsetter(maf, tcga, directory=GLOBALS.VM_DIRECTORY,
                             library_type='RNA-Seq', tissue_code='TP',
                             out='RNA-seq')
In [19]:
print subsetter('TCGA-06-0878')
From here you would throw a bunch of these into a folder and run them in parallel on the Annai machine.