Exploring msu seq

msu seq reads were mapped to the current version of the reference genome (galGal4).

Important details:

  • All reference genomes are soft masked. At first it was inconclusive how BWA behaved in this case, so I ran it with both soft masks and replacing 'agct' with 'AGCT'. The results were the same.

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from glob import glob
import os

In [3]:
!cd .. && make msu_latest_galGal4


mkdir -p outputs/msu
bwa mem outputs/reference/galGal4.fa outputs/msu/latest.fasta > outputs/msu/latest.fasta.sam.galGal4
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 10462 sequences (10001506 bp)...
[M::process] read 10482 sequences (10001981 bp)...
[M::mem_process_seqs] Processed 10462 reads in 16.092 CPU sec, 16.092 real sec
[M::process] read 10450 sequences (10000765 bp)...
[M::mem_process_seqs] Processed 10482 reads in 16.613 CPU sec, 16.610 real sec
[M::process] read 10470 sequences (10002319 bp)...
[M::mem_process_seqs] Processed 10450 reads in 16.313 CPU sec, 16.306 real sec
[M::process] read 10588 sequences (10001208 bp)...
[M::mem_process_seqs] Processed 10470 reads in 15.887 CPU sec, 15.880 real sec
[M::process] read 10432 sequences (10001564 bp)...
[M::mem_process_seqs] Processed 10588 reads in 16.536 CPU sec, 16.529 real sec
[M::process] read 10480 sequences (10001527 bp)...
[M::mem_process_seqs] Processed 10432 reads in 16.167 CPU sec, 16.160 real sec
[M::process] read 10488 sequences (10000831 bp)...
[M::mem_process_seqs] Processed 10480 reads in 16.087 CPU sec, 16.077 real sec
[M::process] read 10544 sequences (10001205 bp)...
[M::mem_process_seqs] Processed 10488 reads in 15.795 CPU sec, 15.788 real sec
[M::process] read 10650 sequences (10002036 bp)...
[M::mem_process_seqs] Processed 10544 reads in 16.245 CPU sec, 16.236 real sec
[M::process] read 10390 sequences (10001440 bp)...
[M::mem_process_seqs] Processed 10650 reads in 16.151 CPU sec, 16.144 real sec
[M::process] read 10428 sequences (10000684 bp)...
[M::mem_process_seqs] Processed 10390 reads in 16.653 CPU sec, 16.643 real sec
[M::process] read 10566 sequences (10002874 bp)...
[M::mem_process_seqs] Processed 10428 reads in 15.624 CPU sec, 15.617 real sec
[M::process] read 10394 sequences (10000127 bp)...
[M::mem_process_seqs] Processed 10566 reads in 15.229 CPU sec, 15.218 real sec
[M::process] read 10540 sequences (10001688 bp)...
[M::mem_process_seqs] Processed 10394 reads in 16.694 CPU sec, 16.688 real sec
[M::process] read 10380 sequences (10000744 bp)...
[M::mem_process_seqs] Processed 10540 reads in 15.953 CPU sec, 15.942 real sec
[M::process] read 10510 sequences (10000494 bp)...
[M::mem_process_seqs] Processed 10380 reads in 16.299 CPU sec, 16.294 real sec
[M::process] read 10596 sequences (10001121 bp)...
[M::mem_process_seqs] Processed 10510 reads in 16.723 CPU sec, 16.715 real sec
[M::process] read 10402 sequences (10000353 bp)...
[M::mem_process_seqs] Processed 10596 reads in 16.751 CPU sec, 16.746 real sec
[M::process] read 10442 sequences (10003079 bp)...
[M::mem_process_seqs] Processed 10402 reads in 16.605 CPU sec, 16.595 real sec
[M::process] read 10542 sequences (10000214 bp)...
[M::mem_process_seqs] Processed 10442 reads in 16.335 CPU sec, 16.329 real sec
[M::process] read 10412 sequences (10001168 bp)...
[M::mem_process_seqs] Processed 10542 reads in 16.188 CPU sec, 16.177 real sec
[M::process] read 10404 sequences (10000241 bp)...
[M::mem_process_seqs] Processed 10412 reads in 16.451 CPU sec, 16.445 real sec
[M::process] read 10446 sequences (10000554 bp)...
[M::mem_process_seqs] Processed 10404 reads in 15.884 CPU sec, 15.871 real sec
[M::process] read 10542 sequences (10003040 bp)...
[M::mem_process_seqs] Processed 10446 reads in 16.322 CPU sec, 16.314 real sec
[M::process] read 10508 sequences (10000545 bp)...
[M::mem_process_seqs] Processed 10542 reads in 16.904 CPU sec, 16.896 real sec
[M::process] read 10524 sequences (10003149 bp)...
[M::mem_process_seqs] Processed 10508 reads in 15.994 CPU sec, 15.987 real sec
[M::process] read 10426 sequences (10000497 bp)...
[M::mem_process_seqs] Processed 10524 reads in 16.467 CPU sec, 16.461 real sec
[M::process] read 10346 sequences (10000563 bp)...
[M::mem_process_seqs] Processed 10426 reads in 16.571 CPU sec, 16.565 real sec
[M::process] read 10358 sequences (10000818 bp)...
[M::mem_process_seqs] Processed 10346 reads in 17.214 CPU sec, 17.206 real sec
[M::process] read 10384 sequences (10000608 bp)...
[M::mem_process_seqs] Processed 10358 reads in 16.096 CPU sec, 16.089 real sec
[M::process] read 10538 sequences (10000811 bp)...
[M::mem_process_seqs] Processed 10384 reads in 15.883 CPU sec, 15.869 real sec
[M::process] read 10484 sequences (10001499 bp)...
[M::mem_process_seqs] Processed 10538 reads in 16.284 CPU sec, 16.274 real sec
[M::process] read 10436 sequences (10000075 bp)...
[M::mem_process_seqs] Processed 10484 reads in 16.165 CPU sec, 16.155 real sec
[M::process] read 10402 sequences (10001574 bp)...
[M::mem_process_seqs] Processed 10436 reads in 16.386 CPU sec, 16.381 real sec
[M::process] read 10544 sequences (10000681 bp)...
[M::mem_process_seqs] Processed 10402 reads in 17.838 CPU sec, 17.832 real sec
[M::process] read 10360 sequences (10002658 bp)...
[M::mem_process_seqs] Processed 10544 reads in 15.003 CPU sec, 14.995 real sec
[M::process] read 10466 sequences (10003204 bp)...
[M::mem_process_seqs] Processed 10360 reads in 15.939 CPU sec, 15.929 real sec
[M::process] read 10528 sequences (10002286 bp)...
[M::mem_process_seqs] Processed 10466 reads in 15.655 CPU sec, 15.649 real sec
[M::process] read 10506 sequences (10000374 bp)...
[M::mem_process_seqs] Processed 10528 reads in 15.760 CPU sec, 15.750 real sec
[M::process] read 10450 sequences (10000878 bp)...
[M::mem_process_seqs] Processed 10506 reads in 16.571 CPU sec, 16.565 real sec
[M::process] read 10432 sequences (10000044 bp)...
[M::mem_process_seqs] Processed 10450 reads in 16.075 CPU sec, 16.065 real sec
[M::process] read 10438 sequences (10000143 bp)...
[M::mem_process_seqs] Processed 10432 reads in 15.820 CPU sec, 15.812 real sec
[M::process] read 10440 sequences (10000016 bp)...
[M::mem_process_seqs] Processed 10438 reads in 16.801 CPU sec, 16.794 real sec
[M::process] read 10400 sequences (10000192 bp)...
[M::mem_process_seqs] Processed 10440 reads in 15.541 CPU sec, 15.532 real sec
[M::process] read 10618 sequences (10001031 bp)...
[M::mem_process_seqs] Processed 10400 reads in 16.327 CPU sec, 16.316 real sec
[M::process] read 10522 sequences (10000266 bp)...
[M::mem_process_seqs] Processed 10618 reads in 16.461 CPU sec, 16.457 real sec
[M::process] read 377 sequences (342590 bp)...
[M::mem_process_seqs] Processed 10522 reads in 16.286 CPU sec, 16.298 real sec
[M::mem_process_seqs] Processed 377 reads in 0.551 CPU sec, 0.539 real sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa mem outputs/reference/galGal4.fa outputs/msu/latest.fasta
[main] Real time: 806.363 sec; CPU: 794.331 sec
samtools import outputs/reference/galGal4.fa.fai outputs/msu/latest.fasta.sam.galGal4 outputs/msu/galGal4.latest.bam
[samopen] SAM header is present: 15932 sequences.
samtools sort outputs/msu/galGal4.latest.bam outputs/msu/galGal4.latest.sorted 
[bam_sort_core] merging from 2 files...
samtools index outputs/msu/galGal4.latest.sorted.bam

Counting unmapped reads

There are 510,070 reads in the input files.

There are 661 (0.13%) unmapped reads to galGal4.

Reference for using samtools to count reads:

galGal4


In [5]:
print "Total: ",
! samtools view -c ../outputs/msu/galGal4.latest.sorted.bam


Total: 510070


In [4]:
print "Unmapped: ",
! samtools view -c -f 4 ../outputs/msu/galGal4.latest.sorted.bam


Unmapped: 661


In [6]:
661./510070


Out[6]:
0.0012959005626678691

Checking unmapped reads


In [8]:
! cd .. && make outputs/msu/galGal4.latest.unmapped_reads


scripts/extract_reads.sh outputs/msu/galGal4.latest.sorted.bam > outputs/msu/galGal4.latest.unmapped_reads

In [10]:
from random import sample

In [13]:
seqs = [line.strip() for line in open('../outputs/msu/galGal4.latest.unmapped_reads').readlines() 
                     if not line.startswith('>')]
sample(seqs, 5)


Out[13]:
['CAATATTCAGTGTGCTGTTCCATGTGCACTGTTCCATGTTCACTATTCCATGTTCCCTGTTCCGTGTGCACTATTCCGTGTTCGCTATTCCATGTTCACTATTCCGTGTGCACTATTTCACATGTGCTATTCCGTGTGCTGTTGCATGTGCACTGTTCCATGTTCACCATTGCATGTTCCCTATTCCGTGTTCGCTATTCCATGTGCTGTTCCATGTGCACTGTTCCATGTTCACTATTCCATGTGCACTATTCCGTGTGCTATTCCATGTGCACTATTCCATGTTCACTATTCCATGTGCTGTTCCATGTGCACTATTCCATGTGCACTATTCCGTGTGCTGTTCCATGTGCACTGTTCCATGTTCACTATTCCATGTTCACTATTCCGTGTGCACTATTCCGTGTTCACTATTCCATGTTCACTATTCCGTGTGCACTATTTCACATG',
 'CTTTAATTTTGCTCATGTAATTTATGAGTGTCTTCTGCTTGATTCCTCTGCTGGCCAGGATTTTTTCGTAGCGATCAAGCCATGAATGTAACGTAACGGAATTATCACTGTTGATTCTCGCTGTCAGAGGCTTGTGTTTGTGTCCTGAAAATAACTCAATGTTGGTCTGTATAGCTTCAGTGATTGCGATTCGCCTGTCTCTGCCTAATCCAAACTCTTTACCCGTCCTTGGGTCCCTGTAGCAGTAATATCCATTGTTTCTTATATAAAGGTTAGGGGGTAAATCCCGGCGCTCATGACTTCGCCTTCTTCCCATTTCTGATCCTCTTCAAAAGGCCACCTGTTACTGGTCGATTTAAGTCAACCTTTACCGCTGATTCGTGGATGGCCC',
 'CTTACCTTTTATATCAGGGTGATGCTGTCTTTCATCTGGAGAGAAGATTCTCCTTGTGAGGTGTTTTGTGTAGAGGCACGTGCTTGTGTTTATGTTATTTTTGTTTCTGGTGCAGTTAGGTTATGGCTTTTCAGACCTTCTTCTGTTTCTTCTCCAGCATTACTCCTATATTCTCATGGTGCTGGCAGTGCTTCTCGGTTGGCCGTGTGTACCTTCACATAACACTAAAGGCGCAGAAGCGTTTTTTTCTAAGGCAAACCAGTGTGTATCCTGGAGCGTTTCTAATGTGGTTGTTGAGAACAATTTTGAAGGAGCCTGGCTTCTGGTCTGTGGCATGGGCTGCTGAGAAGGGTGGAAATTGTGTTTGAGCTTAATGGCATGGAAAGGAAAAATGTGGGAGAATGGTTTTCAATATGAGGGAGTTATAGAGTAGAGTTTGCATAGTTT',
 'TGTCGAACAGGGGGATGATGCGTGGCACAAATTACGGCTCGGCGTCATCACCGCTTCAGAAGTTCACAACGCGATAGCAAAACCCCGCTCCGGAAAGAAGTGGCCTGACATGAAAATGTCCTACTTCCACACCCTGCTTGCTGAGGTTTGCACCGGTGTGGCTCCGGAAGTTAACGCTAAAGCACTGGCCTGGGGAAAACGGTACGAGAACGACGCCAGAACCCTGTTTGAATTCACTTCCGGCGTGAATGTTACTGAATCCCCGATCATCTATCGCGACGAAAGTATGCGTACCGCCTGCTCTCCCGATGGTTTATGCAGTGACGGCAACGGCCTTGAACTGAAATGCCCGTTTACCTCCCGGGATTTCATGAGGTTCCGGCTCGGTGGTTTCGAGGCCATAAAGTCAGCTTACATGGCCCAGGTGCAGTACAGCATGTGGGTGACGCGAAAAAATGCCTGGT',
 'GTCAGCTCTGGACAGGTCTGGGGGGCCCACACAAGAAATATTTTTGAAGATGGGCACCTCTGGCAAAGATCAAACGCCTAATATAGCTAGGTGCCCTTGTCTGAGCTTATGAGAAGCCCACAAGCTGAGTGTGGGTCCATCTGGAGCTGGTCAGGGCCATTGGGCTCCTTCAGGGCACCTAGAGCTCTGTATGTAATATAAAAATGAATGTGAAGCCCACCTGTACCTAAAATCGTAAGAAATTCTAGGCCAATAAGAAGAAATGAATAGAAAAAGTAAAAATGAGAAATATTACTGTAATTGAAGTACAATGTCTACTCAGTGCACCTGGCTGGAATTACAGGAATTACTGTTACTGATGCCAAGGAACAGTGGAAAAGATTTTGACTACTGTGCTCTGTATTTGTAATAAGAATGTGCAATCATCTTGTGATAGATAGTAGGTCTTGTTCAGCAGTGTTTATGAAATGGCTGGGTTTTGACCACTCAGTTTGAGGGCTAAACACAGAAATACTCCTGAGTTCCAGAGAAATATTATGTACATACTCGGTCACTAGTTTAGAAAGTGAATAAGAAAAGAAACACTGTCAGCTTACAGCTGGATGCTTAAAATTCAAGAAAACTTTAAAAATAAATTTTCATATAAGCTAGCAATCAAAAAATGGGGTCCAGATGGGATCTCAGACGCCACTTATAACATTCACAAGGGTGAGAGGTCCTGTCAAATCTGAATACTATCTAGCAGAAGTCACGGCATTACAAGTCTCACAAGGACAGAAAGTAAATTTAAGCTGACAAATAGATTTGATGTCAGAGGGAAGAGATTTTAAAACAAAACATGAGAATACAACCAGGAAAAATTAACAAAAAAAAAAATTGCAAACTACATCTGTAGAAATGT']

ENA exonerate results

Then I tried Jared's suggestion and use the same sequences in http://www.ebi.ac.uk/ena/

1) No match.

2) E.coli and gadus morhua

3) No match.

4) E.coli and taeniopygia_guttata

5) 86% identity to Meleagris gallopavo

k-mer assembly stats


In [14]:
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/msu/latest.fasta > ../outputs/msu/assemstats_output_latest

In [15]:
import pandas as pd
pd.read_table("../outputs/msu/assemstats_output_latest", sep=" ")


Out[15]:
filename sum n trim_n min med mean max n50 n50_len n90 n90_len
0 ../outputs/msu/latest.fasta 470397190 492527 492526 128 639 955 6607 112009 1273 377893 459

In [1]:
!cat ../outputs/msu/assemstats_output_latest


filename sum n trim_n min med mean max n50 n50_len n90 n90_len
../outputs/msu/latest.fasta 470397190 492527 492526 128 639 955 6607 112009 1273 377893 459