Global Variant Calling Parameters: Cellar

This is the configuration file for Cellar, our network file system, which is used on my local machine as well as our in-house cluster.

Link to CGHub manifest describing the contents of its database.


In [3]:
cghub_manifest = 'https://cghub.ucsc.edu/reports/SUMMARY_STATS/LATEST_MANIFEST.tsv'

CGHub download URL.


In [4]:
CGHUB = 'https://cghub.ucsc.edu/cghub/data/analysis/download'

These paths are on the local file system (Cellar).


In [5]:
BCBIO = '/cellar/users/agross/sources/bcbio_nextgen_local/'
DBSNP = BCBIO + 'genomes/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz'
COSMIC = BCBIO + 'genomes/Hsapiens/GRCh37/variation/cosmic-v68-GRCh37.vcf.gz'
REFERENCE = BCBIO + 'genomes/Hsapiens/GRCh37/seq/GRCh37.fa'
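
Since every path above is built from BCBIO, a wrong prefix silently breaks all of them. Below is a minimal sketch of a sanity check you could run before launching anything. The helper name `missing_paths` is hypothetical and not part of this configuration.

```python
import os


def missing_paths(paths):
    """Return the sorted names whose paths do not exist on this file system.

    `paths` maps a label (e.g. 'DBSNP') to a file-system path.
    This helper is illustrative only; it is not part of the original config.
    """
    return sorted(name for name, path in paths.items()
                  if not os.path.exists(path))
```

For example, `missing_paths({'DBSNP': DBSNP, 'COSMIC': COSMIC, 'REFERENCE': REFERENCE})` should return an empty list when the bcbio install is where this file expects it.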

Path to key for downloading data from CGHub. You need to contact CGHub to get a key.


In [6]:
KEY = '/cellar/users/agross/.ssh/cghub.key'

Path to MuTect. I was getting bugs with version 1.1.4, so make sure you use version 1.1.5, the one this path points to.


In [7]:
MUTECT_JAR = BCBIO + 'share/java/mutect/muTect-1.1.5.jar'

Path to SomaticIndelDetector jar. This is in a specific version of GATK, so make sure you are using version 2.2-2.


In [8]:
SID_JAR = '/cellar/users/hcarter/programs/GenomeAnalysisTK-2.2-2/GenomeAnalysisTK.jar'

Path to cache directory. This is only used in conjunction with GT-Fuse. This is important because the BAM files can pile up in the cache and eat up all of the space on a hard drive very quickly.


In [9]:
#CACHE = '/home/centos/cache/fusecache'

Number of processes you want to spawn for the variant calling. I'm taking a quick and dirty pass here and just running a bunch of bash scripts simultaneously. At some point we will probably switch to a scheduler, but we don't have one on our Annai VM currently and this seems to be working OK for now.


In [10]:
NUM_PROCESSES = 16
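
As a rough illustration of what "running a bunch of bash scripts simultaneously" can look like, here is a minimal sketch that caps concurrency at a fixed number of workers. The function names `run_script` and `run_all` are hypothetical, and the use of a thread pool is an assumption; this file does not say how the scripts are actually launched.

```python
import subprocess
from multiprocessing.pool import ThreadPool


def run_script(path):
    """Run one bash script and return its exit code."""
    return subprocess.call(['bash', path])


def run_all(script_paths, num_processes=16, runner=run_script):
    """Apply `runner` to every script, at most `num_processes` at a time.

    Results come back in the same order as `script_paths`.
    Threads are enough here because each worker just waits on a child process.
    """
    pool = ThreadPool(num_processes)
    try:
        return pool.map(runner, script_paths)
    finally:
        pool.close()
        pool.join()
```

Called as `run_all(scripts, NUM_PROCESSES)`, this returns one exit code per script, so non-zero entries point at the jobs that failed.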

Directory to store the data on whatever machine you are running the scripts on (we use a VM).


In [11]:
#VM_DIRECTORY = '/home/centos/projects'

Local directory to spit out the bash scripts.


In [12]:
LOCAL_DIRECTORY = '/cellar/users/agross/scripts'

In most situations you can probably get away with using the default java on your system, but we have had conflicts between java-6 and java-7, so I explicitly hard-code the path to the JDK here.


In [13]:
JAVA = '/cellar/software/jdk1.7.0_45/bin/java'

Command for downloading full BAM files via torrent. gtdownload should be in your path, or you can hard-code the path into this command. For my use I am setting two non-default parameters, which you may want to alter.

  • inactivity-timeout=720: This sets the inactivity timeout to 12 hours (720 minutes). I use this to kill jobs that are stalled. I have found that setting this parameter too low kills a lot of downloads when this command is being run in parallel with other downloads. This is because competition for bandwidth across the router can often stall jobs for a while.
  • max-children 1: This sets the torrent to use only a single thread to download rather than trying to multi-thread it. This seems to maximize throughput when we are running downloads of separate files in parallel.

In [14]:
GT_DOWNLOAD = 'gtdownload --inactivity-timeout=720 --max-children 1'
GT_DOWNLOAD = '{} -c {}'.format(GT_DOWNLOAD, KEY)
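
To make the two-step string construction concrete, the sketch below repeats the assignments and splits the result into an argument list, which is the form `subprocess` expects when you are not going through a shell. The variable name `gt_args` is hypothetical. The arguments that select which file to download are left out on purpose, since this file does not specify them.

```python
import shlex

# Same values as in the cells above, repeated so this block runs on its own.
KEY = '/cellar/users/agross/.ssh/cghub.key'
GT_DOWNLOAD = 'gtdownload --inactivity-timeout=720 --max-children 1'
GT_DOWNLOAD = '{} -c {}'.format(GT_DOWNLOAD, KEY)

# Split the command string into a list of arguments (hypothetical name).
gt_args = shlex.split(GT_DOWNLOAD)
print(gt_args)
```

Keeping the base command as a plain string suits the bash scripts this pipeline generates; the list form is only needed if you call gtdownload directly from Python.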