Earlier, I modified nanocorrect to write the number of overlaps detected by DALIGN into the FASTA header. Initially it looked like very few reads had overlaps, but that was misleading: reads without overlaps complete faster, so they dominate the early output. It's actually more like 50% of reads with the R7.3 data. One read with ~80 overlaps ended up at 98% identity, which gives me some confidence that higher coverage would do a better job. I am therefore trying again with the R7 and R7.3 data combined. The R7.3 data is from workflow 1.9, but the R7 data is from the old base caller, as Metrichor has not finished re-calling it. There is also the ONI data to add in if necessary.
In [1]:
!cat FC20.2D.fasta ../Ecoli_R7_2D.fasta > R7_R73.fasta
!make -f pipeline.make INPUT=R7_R73.fasta NAME=R7_R73
In [2]:
!grep ">" R7_R73.fasta | wc -l
In [4]:
!python makerange.py 1 34762 > input.txt
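The read count above is copied by hand from the `grep` output into the `makerange.py` call. A minimal sketch of doing both steps in Python, so the range always matches the FASTA file. Assumptions: `makerange.py` (not shown here) prints one integer per line from start to end inclusive, and `count_fasta_records` and `write_range` are hypothetical helper names, not part of nanocorrect.

```python
def count_fasta_records(path):
    """Count FASTA records by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def write_range(start, end, out_path):
    """Write the integers start..end (inclusive), one per line.

    Assumption: this mirrors what makerange.py emits; adjust if it differs.
    """
    with open(out_path, "w") as out:
        for i in range(start, end + 1):
            out.write(f"{i}\n")


# Example usage (file names taken from the cells above):
# n_reads = count_fasta_records("R7_R73.fasta")
# write_range(1, n_reads, "input.txt")
```

Note that `grep ">"` matches a `>` anywhere on a line, whereas this only counts lines that begin with `>`; for a well-formed FASTA file the two should agree.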
In [ ]:
!cat input.txt | parallel python nanocorrect.py R7_R73
It looks like many of the reads that I BLAST are no longer full length. Should the consensus part of the alignments be trimmed out of the PO? I suppose this might make sense if an errant overlap gets in.
It looks like nanocorrect already trims to the extents of the seed 'poabaseread' read. However, I have noticed that the input to nanocorrect trims the other reads down to the part of the read it thinks aligns. This could be a problem if the overlap is poor or incomplete, so I have disabled this trimming.
Hmm, not sure if this is helping just yet. I wonder if I have made things worse by forcing unrelated segments of reads into the alignment. I need to systematically review the output in terms of both longest alignment and % identity, and write a script to compare the results of different settings. It may also be worth scoring the POA output.
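A possible starting point for that comparison script, as a rough sketch only. It assumes the corrected reads from each setting have been BLASTed against the reference with tabular output (`-outfmt 6`, default column order `qseqid sseqid pident length ...`), and it keeps the longest alignment per read. The function names and the file names in the usage comment are hypothetical.

```python
import csv
from statistics import mean, median


def best_hits(blast_tsv_path):
    """Return {query_id: (alignment_length, percent_identity)} for the
    longest hit of each query in a BLAST -outfmt 6 file."""
    best = {}
    with open(blast_tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and comment lines (e.g. from -outfmt 7).
            if not row or row[0].startswith("#"):
                continue
            query, pident, length = row[0], float(row[2]), int(row[3])
            if query not in best or length > best[query][0]:
                best[query] = (length, pident)
    return best


def summarise(blast_tsv_path):
    """Summarise one run: reads with a hit, median longest-alignment
    length, and mean % identity of those longest alignments."""
    hits = best_hits(blast_tsv_path)
    lengths = [length for length, _ in hits.values()]
    idents = [pident for _, pident in hits.values()]
    return {
        "reads_with_hit": len(hits),
        "median_aln_length": median(lengths) if lengths else 0,
        "mean_identity": mean(idents) if idents else 0.0,
    }


# Example usage (hypothetical file names, one BLAST result per setting):
# for name in ["trimmed.blast.tsv", "untrimmed.blast.tsv"]:
#     print(name, summarise(name))
```

Taking the longest hit per read rather than the highest-identity hit is a deliberate choice here, since the concern above is reads no longer aligning at full length; a short, high-identity fragment would otherwise hide that.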
Ideas:
In [ ]: