Exploring PacBio

Pacbio seq reads were mapped to the current version of the reference genome (galGal4).

Important details:

  • All reference genomes are soft masked. At first it was inconclusive how BWA behaved in this case, so I ran it with both soft masks and replacing 'agct' with 'AGCT'. The results were the same.

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from glob import glob
import os

In [3]:
!cd .. && make pacbio_galGal4


make: Nothing to be done for `pacbio_galGal4'.

Counting unmapped reads

There are 11,666,999 reads in the input files.

There are 627,636 (5.37%) unmapped reads to galGal4.

Reference for using samtools to count reads:

galGal4


In [6]:
print "Total: ",
! samtools view -c ../outputs/pacbio/galGal4.pacbio.sorted.bam


Total: 11666999


In [7]:
print "Unmapped ",
! samtools view -c -f 4 ../outputs/pacbio/galGal4.pacbio.sorted.bam


Unmapped 627636


In [2]:
total_N = !grep -F -o N ../outputs/reference/galGal4.fa |wc -l
total_bases = !grep -v ">" ../outputs/reference/galGal4.fa |wc -c
print float(total_N[0]) / float(total_bases[0])


0.0131960026036

k-mer assembly stats


In [10]:
!python ~/khmer/sandbox/assemstats3.py 100 ../outputs/pacbio_assembly/Chicken_10Kb20Kb_40X_Filtered_Subreads.fastq > ../outputs/pacbio_assembly/assemstats_output

In [11]:
import pandas as pd
pd.read_table("../outputs/pacbio_assembly/assemstats_output", sep=" ")


Out[11]:
** cutoff: 100
0 N\tsum\tmax\tfilename NaN NaN
1 8935637\t43071775702\t37722\t../outputs/pacbio... NaN NaN

In [12]:
!cat ../outputs/pacbio_assembly/assemstats_output


** cutoff: 100
N	sum	max	filename
8935637	43071775702	37722	../outputs/pacbio_assembly/Chicken_10Kb20Kb_40X_Filtered_Subreads.fastq

In [ ]: