Exploring msu seq

msu seq reads were mapped to:

  • the current version of the reference genome (galGal4)
  • the previous version, galGal3
  • the next version draft, galGal5

Important details:

  • All reference genomes are soft masked. At first it was inconclusive how BWA behaved in this case, so I ran it with both soft masks and replacing 'agct' with 'AGCT'. The results were the same.

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from glob import glob
import os

In [2]:
!cd .. && make -j3 msu_galGal4 msu_galGal3 msu_galGal5


make: Nothing to be done for `msu_galGal4'.
make: Nothing to be done for `msu_galGal3'.
make: Nothing to be done for `msu_galGal5'.

Counting unmapped reads

There are 426,661 reads in the input files.

There is a different number of unmapped reads for each reference genome:

  • galGal4: 541 (0.13%)
  • galGal3: 2351 (0.55%)
  • galGal5: 4110 (0.96%)

Reference for using samtools to count reads:

galGal4


In [5]:
print "Total: ",
! samtools view -c -f 4 ../outputs/msu/galGal4.msu.sorted.bam


Total: 541


In [1]:
!samtools view -c -q 30 ../outputs/msu/galGal4.msu.sorted.bam


420611

In [2]:
!samtools view -c -q 60 ../outputs/msu/galGal4.msu.sorted.bam


415642

In [4]:
!samtools view -c -f 256 ../outputs/msu/galGal4.msu.sorted.bam


0

galGal3


In [6]:
print "Total: ",
! samtools view -c -f 4 ../outputs/msu/galGal3.msu.sorted.bam


Total: 2351

galGal5


In [7]:
print "Total: ",
! samtools view -c -f 4 ../outputs/msu/galGal5.msu.sorted.bam


Total: 4110

Notes:

  • 0.5% of the bases are N, meaning either unidentified or hard masked. For galGal4 this is ~1.4%

In [3]:
total_N = !grep -F -o N ../outputs/reference/galGal5.fa |wc -l
total_bases = !grep -v ">" ../outputs/reference/galGal5.fa |wc -c
print float(total_N[0]) / float(total_bases[0])


0.0

In [12]:
!head -10 ../outputs/reference/galGal5.fa


>Chr1|Contig0|quiver
caatcctagccctgttcttattctacctacccctatcttctacctaccctatgctacccc
atctaacctactccttctctacttgagcctaccctatccttatcctacccttttgaacct
atctacctaccctaacctatctatctaacctacagtatctgattctaccctatcatactt
accccattccaaccctacagtaccctacccgtatccttccctaccctaatcctaccaaaa
aaaggcccccttaaccttacctattctactctaccctatcctatcctacaccccatccta
atccgtcctacccattccctacattgccctaccctacctagcctatttttacccagttct
atcctacctctacccaaacctactccatcattgtccatacccttacccttctctattcta
ccctacctttcccaaccctgccctaccctaccctatccttatcctaccctaccgtatctt
atcctagcctaggcaccctatcctatcctaacttacctcccctattacaccataaccctt

In [2]:
total_N = !grep -F -o N ../outputs/reference/galGal4.fa |wc -l
total_bases = !grep -v ">" ../outputs/reference/galGal4.fa |wc -c
print float(total_N[0]) / float(total_bases[0])


0.0131960026036

Checking unmapped reads


In [9]:
! cd .. && make -j3 outputs/msu/galGal4.unmapped_reads outputs/msu/galGal3.unmapped_reads outputs/msu/galGal5.unmapped_reads


make: `outputs/msu/galGal4.unmapped_reads' is up to date.
make: `outputs/msu/galGal3.unmapped_reads' is up to date.
make: `outputs/msu/galGal5.unmapped_reads' is up to date.

In [24]:
from random import sample

In [27]:
seqs = [line.strip() for line in open('../outputs/msu/galGal4.unmapped_reads').readlines() if not line.startswith('>')]
sample(seqs, 5)


Out[27]:
['TATTCAATAAGTCAATATCATGCCGTTAATATGTTGCCATCCGTGGCAATCATGCTGCTAACGTGTGACCGCATTCAAAATGTTGTCTGCGATTGACTCTTCTTTGTGGCATTGCACCACCAGAGCGTCATACAGCGGCTTAACAGTGCGTGACCAGGTGGGTTGGGTAAGGTTTGGGGTTAGCATCGTCACAGCGCGATATGCTGCGCTTGCTGGCATCCTTGAATAGCCGACGCCTTTGCATCTTCCGCACTCTTTCTCGACAACTCTCCCCCACAGCTCTGTTTTGGCAATATCAACCGCACGGCCTGTACCATGACAATCTCTGCATCTTGCCCCCGGCGTCGCGGCACTACGGCAATAATCCGCATAAGCGAATGTTGCGAGCACTTGCAGTACCTTTGCCTTAGTATTTCCTTCAAGCTTTGCCACACCACGGTATTTCCCCGATACCTTGTGTGCAAATTGCATCAGATAGTTGATAGCCTTTTGTTTGTCGTTCTGGCTGAGTTCGTGCTTACCGCAGAATGCAGCCATACCGAATCCG',
 'AGTCGGTGTGAATCCCATCAGCGTTACCGTTTCGCGGTGCTTCTTCAGTACGCTACGGCAAATGTCATCGACGTTTTTATCCGGAAACTGCTGTCTGGCTTTTTTTGATTTCAGAATTAGCCTGACGGGCAATGCTGCGAAGGGCGTTTTCCTGCTGAGGTGTCATTGAACAAGTCCCATGTCGGCAAGCATAAGCACACAGAATATGAAGCCCGCTGCCAGAAAAATGCATTCCGTGGTTGTCATACCTGGTTTCTCTCATCTGCTTCTGCTTTCGCCACCATCATTTCCAGCTTTTGTGAAAGGGATGCGGCTAACGTATGAAATTCTTCGTCTGTTTCTACTGGTATTGGCACAAACCTGATTCCAATTTGAGCAAGGCTATGTGCCATCTCGATACTCATTCTTAACTCAACAGAAGATGCTTTGTGCATACAGCCCCTCGTTTATTATTTATCTCCTCAGCCAGCCGCTGTGCTTTCAGTGGATTTCGGATAACAGAAAGGCCGGGAAATACCCAGCCTCGCTCTGTAACGGAGTAGACGAAAGTGATTGCGCCTACCCGGATATTATCGTGAGGATGCGTCATCGCATTGCTCCCCAAATACAAAACCAATTTCAGCCAGTGCCTCGTCCATTTTTTCGATGAACTCCGGCACGATCTCGTCAAAACTCGCCATGTACTTTTCATCCCGCTCAATCACGACATAATGCAGGCCTTCACGCTTCATACGCGGGTCATAGTTGGCAAAGTACCAGGCATTTTTTCGCGTCACCCACATGCTGTACTGCACCTGGGCCATGTAAGCTGACTTTATGGCCTCGAAACCACCGAGCCGGAACTTCATGAAATCCCGGGAGGTAAACGGGCATTTCAGTTCAAGGCCGTTGCCGTCACTGCATAAACCATCGGGAGAGCAGGCGGTACGCATACTTTCGTCGCGATAGATGATCGGGGATTCAGTAACATTCACGCCGGAAGTGGATTCAAACAGGGTTCTGCTATCACGTCGTGAACTTCTGAAGCGGTGATGACGCCGAGCCGTAATTTGTGCCACGCATCATCCCCCTGTTCGACAGCTCTCACATCGATCCCGGTACGCTGCAGGATAATGTCCGGTGTCATGCTGCCACCTTCTGCTCTGCGGCTTTCTGTTTCAGGAATCCAAGAGCTTTTACTGCTTCGGCCTGTGTCAGTTCTGACGATGCACGAATGTCGCGGCGAAATACCTGGGAACAGAGCGGCAATAAGTCGTCATCCCATGTTTTATCCAGGGCGA',
 'AGCATTCTTGAGTCCAATATAAAAGTATTGTGTACCTTTTGCTGGGTCAGGTTGTTCTTTAGGAGGAGTAAAAGGATCAAACGCACTAAACGAAACTGAAACAAGCGATCGAAAATATCCCTTTGGGATTCTTGACTCGATAAGTCTATTATTTTCAGAGAAAAAATATTCATTGTTTTCTGGGTTGGTGACTGCACCAATCATTCCATTCAAAATTGTTGTTTTACCACACCCATTCCGCCCGATAAAAGCATGAATGTTCGTGCTGGGCATAGAATTAACCGTCACCTCAAAGGGTATAGTTAAATCACTGAATCCGGGAGCACTTTTTCTATTAAATGAAAAGTGGAAATCTGACAATTCTGGCAAACCATTTAGCACACGTGCGAACTGTCCATGAATTTCTGAAAGAGTTACCCCTCTAAGTAATGAGGTGTTAAGGACGCTTTCATTTTCAATGTCGGCTAATCGATTTGGCCATACTACTAAATCCTGAATAGCTTTAAGATGGTTATGTTTAAAACCATCGCTTAATTTGCTGAGATTAACATAGTAGTCAATGCTTTCACCTAAGGAAAAAAACATTTCAGGGAGTTGACTGAATTTTTTATCTATTAATGAATAAGTGCTTACCTCTTCTTTTTGACCTACAAAACCAATTTTAACATTTCCGATAT',
 'GCAGACATCATTGATTCAGCATCAGAAATAGAAGAATTACAGCGCAACACAGCAATAAAAATGCGCCGCCTGAACCACCAGGCTATATCTGCCACTCATTGTTGTGAGTGTGGCGATCCGATAGATGAACGAAGACGTCTGGTCGTTCAGGGTTGTCGGACTTGTGCAAGTTGCCAGGAGGATCTGGAACTTATCAGTAAACAGAGAGGTTCGAAGTGAGCGAAATTAACTCTCAGGCACTGCGTGAAGCGGCAGAGCAGGCAATGCATGACGACTGGGGATTTGACGCAGACCTTTTCCATGAATCGGTAACACCGTCGATTGTGCTGGAACTGCTGGATGAACGGGAAAGAAACCAGCAATACATCAAACGCCGCGACCAGGAGAACGAGGATATTGCGCTAACAGTAGGGAAACTGCGTGTTGAGCTTGAAACAGCAAAATCAAAACTCAACGAGCAGCGTGAGTATTACGAAGGTGTTATCTCGGATGGGAGTAAGCGTATTGCTAAACTGGAAAGCAACGAA',
 'TTCTGCGGTAAGCACGAACTCAGCCAGAACGACAAACAAAAGGCTATCAACTATCTGATGCAATTTGCACACGAGGTATCGGGGAAATACCGTGGTGTGGCAAAGCTTGAAGGAAATACTAAGGCAAAGGTACTGCAAGTGCTCGCAACATTCGCTTATGCGGATTATTGCCGTAGTGCCGCGACGCCGGGGGCAAGATGCAGAGATTGCCATGGTACAGGCCGTGCGGTTGATATTGCCAAAACAGAGCTGTGGGGGAGAGTTGTCGAGAGAGAGTGCGGAAGATGCAAAGGCGTCGGCTATTCAAGGATGCCAGCAAGCGCAGCATATCGCGCTGTGACGATGCTAATCCCAAACCTTACCCAACCCACCTGGTCACGCACTGTTAAGCCGCTGTATGACGCTCTGGTGGTGCAATGCCACAAAGAAGAGTCAATCGCAGACAACATTTTGAATGCGGTCACACGTTAGCAGCATGATTGCCACGGATGGCGACATATTAACGGCATGATATTGACTTATTGAATAAAATTGGGTAAATT']

UCSC Genome Browser BLAT Results

https://genome.ucsc.edu/cgi-bin/hgBlat?command=start

All matches are very small (< 7%) compared to the query size.

1)

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           21   262   282   547 100.0%     4   -   11051873  11051893     21
browser details YourSeq           21   416   436   547 100.0%     1   +   24132887  24132907     21


2) 

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           26  1136  1169  1280  75.9%     2   -  120603081 120603109     29
browser details YourSeq           23   270   293  1280 100.0%     2   -   27266445  27266476     32
browser details YourSeq           20   415   434  1280 100.0%    20   -    9938886   9938905     20

3) 

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           24   142   166   677 100.0%     1   -  102735789 102735825     37
browser details YourSeq           23   642   668   677  96.2%     3   -  102342875 102342903     29
browser details YourSeq           23   149   174   677  96.2%     2   -  106722750 106722791     42
browser details YourSeq           21   328   348   677 100.0%     2   +   53633766  53633786     21
browser details YourSeq           20   375   394   677 100.0%    22   -     819629    819648     20
browser details YourSeq           20   394   413   677 100.0%     2   -   68978384  68978403     20
browser details YourSeq           20   150   169   677 100.0%     2   -    5589621   5589640     20
browser details YourSeq           20   575   596   677  95.5%     1   +  165085880 165085901     22

4) 

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           22   321   344   527  95.9%    10   +   14635936  14635959     24
browser details YourSeq           21    94   114   527 100.0%     8   -   17426718  17426738     21
browser details YourSeq           20    12    31   527 100.0%     2   -  144504712 144504731     20
browser details YourSeq           20    16    37   527  95.5%     4   +   51302011  51302032     22


5) 

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           22    99   121   542 100.0%    11   -   18247134  18247158     25
browser details YourSeq           21    93   113   542 100.0%     1   -   24132887  24132907     21
browser details YourSeq           21   247   267   542 100.0%     4   +   11051873  11051893     21
browser details YourSeq           20   513   532   542 100.0%     6   +   34444401  34444420     20

ENA exonerate results

Then I tried Jared's suggestion and use the same sequences in http://www.ebi.ac.uk/ena/

All sequences have >99% identity to E.coli (and other organisms too). Is this some sort of contamination?

k-mer assembly stats


In [17]:
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/msu/msu.fasta > ../outputs/msu/assemstats_output

In [19]:
import pandas as pd
pd.read_table("../outputs/msu/assemstats_output", sep=" ")


Out[19]:
filename sum n trim_n min med mean max n50 n50_len n90 n90_len
0 ../outputs/msu/msu.fasta 397695267 426661 426657 126 637 932 6498 99131 1220 327577 455

In [13]:
!cat ../outputs/msu/assemstats_output


filename sum n trim_n min med mean max n50 n50_len n90 n90_len
../outputs/msu/msu.fasta 397695267 426661 426657 126 637 932 6498 99131 1220 327577 455

In [ ]: