# Notes

- Modify writeScriptHeader
  * Load modules
- "main" starts around line 940
  * wrk is defined on 1130
  * really starts at 1336 (call runCA)


## Running the pipeline

### Gatekeeper

### Overlapper (MHAP)

### Correction
  - 


### Partition
  - ~500 MB RAM
  - Parallelizable (199 independent jobs)
  - 8 cores
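
Since the partition jobs are independent, one way to run them without a scheduler is a bounded worker pool. This is only a minimal sketch, assuming each job can be expressed as a standalone command; the function names and the `partition.sh` script in the comment are hypothetical, and the default of 8 workers simply mirrors the core count above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one job (an argument list) and return its exit code."""
    return subprocess.run(command).returncode


def run_independent_jobs(commands, max_workers=8):
    """Run independent commands with at most max_workers at a time.

    Threads are enough here: each worker only waits on a child process.
    Exit codes are returned in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))


# Hypothetical usage for the 199 partition jobs:
#   commands = [["sh", "partition.sh", str(i)] for i in range(1, 200)]
#   exit_codes = run_independent_jobs(commands)
#   failed = [i for i, code in enumerate(exit_codes, start=1) if code != 0]
```

Collecting the exit codes makes it easy to rerun only the jobs that failed.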

past approaches they have tried, why they failed, and what their status was

PBJelly + Quiver: gap filling MHAP -> low coverage after error correction

current approach you're trying

PBcR-MHAP

what problems you are running into, or that you anticipate running into

PBcR-MHAP is based on wgs-assembler (Celera), which only supports SGE and LSF. Submitting the whole pipeline as a single big job is a poor fit: some steps consume too much memory (gatekeeper -> 1.5 TB), while others could use more CPU (overlapper).

what other approaches people are using

Josh: I just worked on a fungus genome for which we used PacBio (Titus is involved with this). I used PBJelly and then SPAdes and got a really good assembly when paired with our Illumina data from the original sequencing project (PacBio was for genome scaffolding improvement).

Eichler lab: HGAP (using a modified Blasr, but they just added some outputs, no core modifications)

Dazzler - http://dazzlerblog.wordpress.com/ Currently just the aligner; still need to figure out how to use it together with the other parts of the pipeline. Talk with Jason Chin, who has done that for big genomes.

ectools - https://github.com/jgurtowski/ectools

dbg2olc - http://arxiv.org/abs/1410.2801 http://sites.google.com/site/dbg2olc/. Claims to be very fast.

http://wgs-assembler.sourceforge.net/wiki/index.php/RunCA

Adam tips

Hybrid

dbg2olc

  • arxiv
  • U. Maryland

lordec

ectools

pure

pacbio2ca

  • Align with reference (Blasr, galGal4)
  • Smaller subsets:
    • Unmapped reads
    • Reads mapping partially to micro chromosomes
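
The two subsets above could be pulled out of the alignments roughly as follows. This is a sketch, assuming the aligner output is available in SAM format and that "mapping partially" can be approximated by the fraction of the read that is clipped; the function names, the 20% threshold, and the reference names passed in are all hypothetical. The only SAM facts relied on are the unmapped flag bit (0x4) and the `S`/`H` clip operations in the CIGAR string.

```python
import re

UNMAPPED_FLAG = 0x4  # SAM flag bit: read is unmapped
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
READ_OPS = set("MIS=XH")  # operations that count toward the original read length
CLIP_OPS = set("SH")      # soft and hard clips


def clipped_fraction(cigar):
    """Return the fraction of the original read that is soft/hard clipped."""
    total = clipped = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in READ_OPS:
            total += n
        if op in CLIP_OPS:
            clipped += n
    return clipped / total if total else 0.0


def select_reads(sam_lines, target_refs, min_clipped=0.2):
    """Return (unmapped, partial) sets of read names.

    unmapped: reads with the unmapped flag set or no reference name.
    partial: reads aligned to a reference in target_refs with at least
             min_clipped of their length clipped (assumed threshold).
    """
    unmapped, partial = set(), set()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        name, flag, ref, cigar = fields[0], int(fields[1]), fields[2], fields[5]
        if flag & UNMAPPED_FLAG or ref == "*":
            unmapped.add(name)
        elif ref in target_refs and clipped_fraction(cigar) >= min_clipped:
            partial.add(name)
    return unmapped, partial
```

`select_reads` accepts any iterable of lines, so an open SAM file can be passed directly; the resulting read names can then be used to subset the original reads before assembly.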
