msu seq reads were mapped to:
Important details:
In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from glob import glob
import os
In [2]:
!cd .. && make -j3 msu_galGal4 msu_galGal3 msu_galGal5
There are 426,661 reads in the input files.
There is a different number of unmapped reads for each reference genome:
Reference for using samtools to count reads:
In [5]:
print "Total: ",
! samtools view -c -f 4 ../outputs/msu/galGal4.msu.sorted.bam
In [1]:
!samtools view -c -q 30 ../outputs/msu/galGal4.msu.sorted.bam
In [2]:
!samtools view -c -q 60 ../outputs/msu/galGal4.msu.sorted.bam
In [4]:
!samtools view -c -f 256 ../outputs/msu/galGal4.msu.sorted.bam
In [6]:
print "Total: ",
! samtools view -c -f 4 ../outputs/msu/galGal3.msu.sorted.bam
In [7]:
print "Total: ",
! samtools view -c -f 4 ../outputs/msu/galGal5.msu.sorted.bam
Notes:
In [3]:
total_N = !grep -F -o N ../outputs/reference/galGal5.fa |wc -l
total_bases = !grep -v ">" ../outputs/reference/galGal5.fa |wc -c
print float(total_N[0]) / float(total_bases[0])
In [12]:
!head -10 ../outputs/reference/galGal5.fa
In [2]:
total_N = !grep -F -o N ../outputs/reference/galGal4.fa |wc -l
total_bases = !grep -v ">" ../outputs/reference/galGal4.fa |wc -c
print float(total_N[0]) / float(total_bases[0])
In [9]:
! cd .. && make -j3 outputs/msu/galGal4.unmapped_reads outputs/msu/galGal3.unmapped_reads outputs/msu/galGal5.unmapped_reads
In [24]:
from random import sample
In [27]:
seqs = [line.strip() for line in open('../outputs/msu/galGal4.unmapped_reads').readlines() if not line.startswith('>')]
sample(seqs, 5)
Out[27]:
https://genome.ucsc.edu/cgi-bin/hgBlat?command=start
All matches are very small (< 7%) compared to the query size.
1)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 21 262 282 547 100.0% 4 - 11051873 11051893 21
browser details YourSeq 21 416 436 547 100.0% 1 + 24132887 24132907 21
2)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 26 1136 1169 1280 75.9% 2 - 120603081 120603109 29
browser details YourSeq 23 270 293 1280 100.0% 2 - 27266445 27266476 32
browser details YourSeq 20 415 434 1280 100.0% 20 - 9938886 9938905 20
3)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 24 142 166 677 100.0% 1 - 102735789 102735825 37
browser details YourSeq 23 642 668 677 96.2% 3 - 102342875 102342903 29
browser details YourSeq 23 149 174 677 96.2% 2 - 106722750 106722791 42
browser details YourSeq 21 328 348 677 100.0% 2 + 53633766 53633786 21
browser details YourSeq 20 375 394 677 100.0% 22 - 819629 819648 20
browser details YourSeq 20 394 413 677 100.0% 2 - 68978384 68978403 20
browser details YourSeq 20 150 169 677 100.0% 2 - 5589621 5589640 20
browser details YourSeq 20 575 596 677 95.5% 1 + 165085880 165085901 22
4)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 22 321 344 527 95.9% 10 + 14635936 14635959 24
browser details YourSeq 21 94 114 527 100.0% 8 - 17426718 17426738 21
browser details YourSeq 20 12 31 527 100.0% 2 - 144504712 144504731 20
browser details YourSeq 20 16 37 527 95.5% 4 + 51302011 51302032 22
5)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 22 99 121 542 100.0% 11 - 18247134 18247158 25
browser details YourSeq 21 93 113 542 100.0% 1 - 24132887 24132907 21
browser details YourSeq 21 247 267 542 100.0% 4 + 11051873 11051893 21
browser details YourSeq 20 513 532 542 100.0% 6 + 34444401 34444420 20
Then I tried Jared's suggestion and use the same sequences in http://www.ebi.ac.uk/ena/
All sequences have >99% identity to E.coli (and other organisms too). Is this some sort of contamination?
In [17]:
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/msu/msu.fasta > ../outputs/msu/assemstats_output
In [19]:
import pandas as pd
pd.read_table("../outputs/msu/assemstats_output", sep=" ")
Out[19]:
In [13]:
!cat ../outputs/msu/assemstats_output
In [ ]: