Exploring PacBio

Pacbio seq reads were mapped to the current version of the reference genome (galGal4).

Important details:

All reference genomes are soft masked. At first it was inconclusive how BWA behaved in this case, so I ran it with both soft masks and replacing 'agct' with 'AGCT'. The results were the same.



In [1]:

    
%matplotlib inline
from matplotlib import pyplot as plt
from glob import glob
import os



In [3]:

    
!cd .. && make pacbio_galGal4









    



make: Nothing to be done for `pacbio_galGal4'.

Counting unmapped reads

There are 11,666,999 reads in the input files.

There are 627,636 (5.37%) unmapped reads to galGal4.

Reference for using samtools to count reads:

galGal4



In [6]:

    
print "Total: ",
! samtools view -c ../outputs/pacbio/galGal4.pacbio.sorted.bam









    



Total: 11666999



In [7]:

    
print "Unmapped ",
! samtools view -c -f 4 ../outputs/pacbio/galGal4.pacbio.sorted.bam









    



Unmapped 627636



In [2]:

    
total_N = !grep -F -o N ../outputs/reference/galGal4.fa |wc -l
total_bases = !grep -v ">" ../outputs/reference/galGal4.fa |wc -c
print float(total_N[0]) / float(total_bases[0])









    



0.0131960026036

k-mer assembly stats



In [10]:

    
!python ~/khmer/sandbox/assemstats3.py 100 ../outputs/pacbio_assembly/Chicken_10Kb20Kb_40X_Filtered_Subreads.fastq > ../outputs/pacbio_assembly/assemstats_output



In [11]:

    
import pandas as pd
pd.read_table("../outputs/pacbio_assembly/assemstats_output", sep=" ")









    Out[11]:






  
    
      
      **
      cutoff:
      100
    
  
  
    
      0
                                   N\tsum\tmax\tfilename
      NaN
      NaN
    
    
      1
       8935637\t43071775702\t37722\t../outputs/pacbio...
      NaN
      NaN



In [12]:

    
!cat ../outputs/pacbio_assembly/assemstats_output









    



** cutoff: 100
N	sum	max	filename
8935637	43071775702	37722	../outputs/pacbio_assembly/Chicken_10Kb20Kb_40X_Filtered_Subreads.fastq



In [ ]:

	**	cutoff:	100
0	N\tsum\tmax\tfilename	NaN	NaN
1	8935637\t43071775702\t37722\t../outputs/pacbio...	NaN	NaN