Pacbio seq reads were mapped to the current version of the reference genome (galGal4).
Important details:
In [1]:
    
%matplotlib inline
from matplotlib import pyplot as plt
from glob import glob
import os
    
In [3]:
    
!cd .. && make pacbio_galGal4
    
    
There are 11,666,999 reads in the input files.
There are 627,636 (5.37%) unmapped reads to galGal4.
Reference for using samtools to count reads:
In [6]:
    
print "Total: ",
! samtools view -c ../outputs/pacbio/galGal4.pacbio.sorted.bam
    
    
In [7]:
    
print "Unmapped ",
! samtools view -c -f 4 ../outputs/pacbio/galGal4.pacbio.sorted.bam
    
    
In [2]:
    
total_N = !grep -F -o N ../outputs/reference/galGal4.fa |wc -l
total_bases = !grep -v ">" ../outputs/reference/galGal4.fa |wc -c
print float(total_N[0]) / float(total_bases[0])
    
    
In [10]:
    
!python ~/khmer/sandbox/assemstats3.py 100 ../outputs/pacbio_assembly/Chicken_10Kb20Kb_40X_Filtered_Subreads.fastq > ../outputs/pacbio_assembly/assemstats_output
    
In [11]:
    
import pandas as pd
pd.read_table("../outputs/pacbio_assembly/assemstats_output", sep=" ")
    
    Out[11]:
In [12]:
    
!cat ../outputs/pacbio_assembly/assemstats_output
    
    
In [ ]: