Exploring moleculo

Moleculo reads were mapped to:

  • the current version of the reference genome (galGal4)
  • the previous version, galGal3
  • the next version draft, galGal5

Important details:

  • The first iteration used a version of galGal4 with hard-masked repeats, where N replaces the 'acgt' bases. BWA does not support hard-masked inputs, so those results were misleading.
  • All reference genomes used here are soft masked. It was initially unclear how BWA behaves in this case, so I ran it both on the soft-masked references and on copies with 'acgt' replaced by 'ACGT'. The results were the same.
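
The notebook does not show how the soft mask was removed. As a minimal sketch (Python 3), assuming a plain FASTA file read line by line, the conversion could look like the following. The function name `unmask_fasta_lines` and the toy records are hypothetical. Note that `str.upper()` also turns a lowercase 'n' into 'N', which is slightly broader than replacing only 'acgt'.

```python
def unmask_fasta_lines(lines):
    """Yield FASTA lines with soft-masked (lowercase) bases uppercased.

    Header lines (starting with '>') pass through unchanged so that
    sequence names keep their original case.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()


# Toy input standing in for a soft-masked reference (hypothetical data).
example = [">chr1 example\n", "ACGTacgtNNnn\n", "ttGGcc\n"]
unmasked = list(unmask_fasta_lines(example))
print("".join(unmasked), end="")
```

For a real reference the same generator can be fed an open file handle and its output written to a new file, which avoids loading the whole genome into memory.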

In [48]:
%matplotlib inline
from matplotlib import pyplot as plt
from glob import glob
import os

In [11]:
!cd .. && make moleculo_galGal4 moleculo_galGal3 moleculo_galGal5


make: Nothing to be done for `moleculo_galGal4'.
make: Nothing to be done for `moleculo_galGal4_masked'.

Counting unmapped reads

There are 1578022 Moleculo reads in the input files.

The number of unmapped reads differs for each reference genome:

  • galGal4: 326 (0.02%)
  • galGal3: 1504 (0.10%)
  • galGal5: 6085 (0.39%)
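
The percentages in this list are each unmapped count divided by the total number of input reads. A small Python 3 sketch of that arithmetic, using only the counts quoted in this section (the helper name `percent_unmapped` is made up for illustration):

```python
# Counts quoted in this section.
total_reads = 1578022
unmapped = {"galGal4": 326, "galGal3": 1504, "galGal5": 6085}


def percent_unmapped(count, total):
    """Return the share of unmapped reads as a percentage of all reads."""
    return 100.0 * count / total


percentages = {ref: percent_unmapped(n, total_reads) for ref, n in unmapped.items()}
for ref in ("galGal4", "galGal3", "galGal5"):
    print("{}: {} unmapped ({:.2f}%)".format(ref, unmapped[ref], percentages[ref]))
```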

Reference for using samtools to count reads:

galGal4


In [75]:
%%bash
for bam in ../outputs/moleculo/galGal4.L*fastq.sorted.bam
do
  echo -n "$(basename $bam): "
  samtools view -c -f 4 $bam
done

echo -n "Total: "
samtools view -c -f 4 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam


galGal4.LR6000017-DNA_A01-LRAAA-1_LongRead_500_1499nt.fastq.sorted.bam: 53
galGal4.LR6000017-DNA_A01-LRAAA-1_LongRead.fastq.sorted.bam: 16
galGal4.LR6000017-DNA_A01-LRAAA-2_LongRead_500_1499nt.fastq.sorted.bam: 34
galGal4.LR6000017-DNA_A01-LRAAA-2_LongRead.fastq.sorted.bam: 12
galGal4.LR6000017-DNA_A01-LRAAA-3_LongRead_500_1499nt.fastq.sorted.bam: 62
galGal4.LR6000017-DNA_A01-LRAAA-3_LongRead.fastq.sorted.bam: 22
galGal4.LR6000017-DNA_A01-LRAAA-4_LongRead_500_1499nt.fastq.sorted.bam: 47
galGal4.LR6000017-DNA_A01-LRAAA-4_LongRead.fastq.sorted.bam: 18
galGal4.LR6000017-DNA_A01-LRAAA-5_LongRead_500_1499nt.fastq.sorted.bam: 49
galGal4.LR6000017-DNA_A01-LRAAA-5_LongRead.fastq.sorted.bam: 13
Total: 326

In [1]:
!samtools view -c -q 30 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam


1616145

In [2]:
!samtools view -c -q 60 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam


1600071

In [3]:
!samtools view -c -f 256 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam


0
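
The `samtools view` options used in the cells above filter on the SAM FLAG and MAPQ fields: `-f INT` keeps records that have all bits of INT set, and `-q INT` keeps records with mapping quality of at least INT. The Python 3 sketch below mimics those two filters on plain integers to make the meaning of `-f 4` and `-f 256` explicit. The bit values come from the SAM specification; the helper names are hypothetical and the code does not call samtools.

```python
# A subset of the FLAG bits defined in the SAM specification.
SAM_FLAGS = {
    0x4: "read unmapped",
    0x10: "read reverse strand",
    0x100: "secondary alignment",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the names of the known bits that are set in a FLAG value."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if flag & bit]


def passes_f(flag, required):
    """Mimic `samtools view -f required`: all required bits must be set."""
    return (flag & required) == required


def passes_q(mapq, minimum):
    """Mimic `samtools view -q minimum`: keep MAPQ >= minimum."""
    return mapq >= minimum


print(decode_flag(4))    # the bit selected by -f 4
print(decode_flag(256))  # the bit selected by -f 256
```

Under this reading, `-f 4` counts unmapped reads and `-f 256` counts secondary alignments, so a result of 0 for `-f 256` means the BAM file contains no records flagged as secondary.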

galGal3


In [80]:
%%bash
for bam in ../outputs/moleculo/galGal3.L*fastq.sorted.bam
do
  echo -n "$(basename $bam): "
  samtools view -c -f 4 $bam
done

echo -n "Total: "
samtools view -c -f 4 ../outputs/moleculo/galGal3.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam


galGal3.LR6000017-DNA_A01-LRAAA-1_LongRead_500_1499nt.fastq.sorted.bam: 256
galGal3.LR6000017-DNA_A01-LRAAA-1_LongRead.fastq.sorted.bam: 40
galGal3.LR6000017-DNA_A01-LRAAA-2_LongRead_500_1499nt.fastq.sorted.bam: 254
galGal3.LR6000017-DNA_A01-LRAAA-2_LongRead.fastq.sorted.bam: 36
galGal3.LR6000017-DNA_A01-LRAAA-3_LongRead_500_1499nt.fastq.sorted.bam: 273
galGal3.LR6000017-DNA_A01-LRAAA-3_LongRead.fastq.sorted.bam: 35
galGal3.LR6000017-DNA_A01-LRAAA-4_LongRead_500_1499nt.fastq.sorted.bam: 254
galGal3.LR6000017-DNA_A01-LRAAA-4_LongRead.fastq.sorted.bam: 43
galGal3.LR6000017-DNA_A01-LRAAA-5_LongRead_500_1499nt.fastq.sorted.bam: 274
galGal3.LR6000017-DNA_A01-LRAAA-5_LongRead.fastq.sorted.bam: 39
Total: 1504

galGal5

Notes:

  • About 0.56% of the galGal5 bases are N, meaning either unidentified or hard masked. For galGal4 this is ~1.4%.

In [22]:
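# Count every N in the file (grep -o prints one match per line).
total_N = !grep -F -o N ../outputs/reference/galGal5.fa |wc -l
# Count characters on non-header lines (wc -c also counts newlines).
total_bases = !grep -v ">" ../outputs/reference/galGal5.fa |wc -c
print(float(total_N[0]) / float(total_bases[0]))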


0.00562306256509

In [77]:
# Same calculation for galGal4: N count over characters on non-header lines.
total_N = !grep -F -o N ../outputs/galGal4/galGal4.fa |wc -l
total_bases = !grep -v ">" ../outputs/galGal4/galGal4.fa |wc -c
print(float(total_N[0]) / float(total_bases[0]))


0.0138014789357
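
The shell pipelines above count newline characters in the denominator and would also count any N that appears in a header line. As a cross-check, the same fraction can be computed in pure Python 3 while skipping headers and line endings. This is a sketch, not the method used for the numbers above: `n_fraction` is a hypothetical helper, and it is exercised here on a tiny temporary FASTA file rather than on the real references, so the values it gives for galGal4 and galGal5 may differ slightly from the outputs shown.

```python
import os
import tempfile


def n_fraction(fasta_path):
    """Return the fraction of sequence characters that are uppercase 'N'.

    Header lines and line endings are excluded from both counts.
    """
    n_count = 0
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue
            seq = line.strip()
            n_count += seq.count("N")
            total += len(seq)
    return n_count / total if total else 0.0


# Toy FASTA file (hypothetical data): 16 bases, 4 of them N.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">toy\nACGTNNNN\nacgtACGT\n")
    toy_path = tmp.name

toy_fraction = n_fraction(toy_path)
os.remove(toy_path)
print(toy_fraction)
```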

In [78]:
%%bash
for bam in ../outputs/moleculo/galGal5.L*fastq.sorted.bam
do
  echo -n "$(basename $bam): "
  samtools view -c -f 4 $bam
done

echo -n "Total: "
samtools view -c -f 4 ../outputs/moleculo/galGal5.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam


galGal5.LR6000017-DNA_A01-LRAAA-1_LongRead_500_1499nt.fastq.sorted.bam: 952
galGal5.LR6000017-DNA_A01-LRAAA-1_LongRead.fastq.sorted.bam: 235
galGal5.LR6000017-DNA_A01-LRAAA-2_LongRead_500_1499nt.fastq.sorted.bam: 931
galGal5.LR6000017-DNA_A01-LRAAA-2_LongRead.fastq.sorted.bam: 248
galGal5.LR6000017-DNA_A01-LRAAA-3_LongRead_500_1499nt.fastq.sorted.bam: 1003
galGal5.LR6000017-DNA_A01-LRAAA-3_LongRead.fastq.sorted.bam: 237
galGal5.LR6000017-DNA_A01-LRAAA-4_LongRead_500_1499nt.fastq.sorted.bam: 993
galGal5.LR6000017-DNA_A01-LRAAA-4_LongRead.fastq.sorted.bam: 258
galGal5.LR6000017-DNA_A01-LRAAA-5_LongRead_500_1499nt.fastq.sorted.bam: 981
galGal5.LR6000017-DNA_A01-LRAAA-5_LongRead.fastq.sorted.bam: 247
Total: 6085

Checking unmapped reads

Without -c, samtools outputs the reads themselves, so I took 5 unmapped reads from galGal4.LR6000017-DNA_A01-LRAAA-1_LongRead and ran them through BLAT:

In [ ]:
#! cd .. && make outputs/moleculo/galGal4.unmapped_reads
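
The rule behind the make target is not shown in this notebook. As an illustration only, the Python 3 sketch below shows one way to turn the text output of `samtools view -f 4` into FASTA records that can be pasted into BLAT. It relies on the SAM column layout, where the read name is the first field and the sequence is the tenth. The function `sam_to_fasta` and the toy records are hypothetical and are not the author's pipeline.

```python
def sam_to_fasta(sam_lines, limit=5):
    """Convert up to `limit` SAM records into FASTA text.

    Header lines (starting with '@') and malformed records with fewer
    than the 11 mandatory SAM fields are skipped.
    """
    records = []
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        name, seq = fields[0], fields[9]  # QNAME and SEQ columns
        records.append(">{}\n{}\n".format(name, seq))
        if len(records) >= limit:
            break
    return "".join(records)


# Toy SAM text (hypothetical reads, both flagged as unmapped).
toy_sam = [
    "@HD\tVN:1.4\tSO:coordinate\n",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGGCCAA\tIIIIIIII\n",
]
fasta_text = sam_to_fasta(toy_sam, limit=1)
print(fasta_text, end="")
```

Keeping the read names in the FASTA headers makes it easier to trace a BLAT hit back to the original Moleculo read.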

In [50]:
!head -5 ../outputs/moleculo/galGal4.1_LongRead.unmapped_reads


TGCTCACTCTCTCTCCATCTCTTCCTCCTTCCCCTTTTTCTTGTTCAATTGATTTTCAGTGCAATAGGAACAAAGGGCCCAGTCAAGGTCAGGGCACTGGTCTGAGTCTGCAACATGCCTTTAGAGGGGAGCAGTGACAAATATAATCAGTGGAGCAATAGACAGTGAGTGGCAGTGGTGGACTTAGGGAGCTTATGACATGCTTTACACAAAGCTGAGTAGAAAGGCATGTTGTAAGTCAATGAAGTGACAATTTCAGCATTTTTATCCTACCTTACAATTGAAGGATCAGACACAGAGTTTGTAGTCATGGAATCAAGCTGTAAGCATAATTACTCCATTTTCCATTCTCTCATCTCTGCCACCAGTTCTGTGCCACATACTGCCATTCCCCACCTTTCCAGTCTTTCTACCCTGGTTTTCTACCCTGAAAAAATGTGTTTTAAGAACGTCACAACAACACCACAATATACGAGTATCTTTAGCTTTTATGAGGATTGACAGAAAAATATTGATGGAAAGTATCTGAGGAGAAATTTGACCTCCAAATTTTATTTCTGCTCCCCAGACAACCTGATTACTTTCCATAGAAGTATTAATATTTTCTTCATGAGAATCTGATGCTTTGACGGTTCCATGACATCCACATAGGAGCCATAATATGTCTCTTTCCTTCTCCTGAAAAATAAGAAAGTGGTGTTTGCTGGATGCTTTGGCCCAAGAATGGGATCCAGACTGAACTGAAAAATGAGCGGGTCAGGAACTCAATTCTCCGGAAGGTTTTCACAAACACATTTCAGAACAAGATTTTTAATTTATCAATCTTTTCTAGGGGGAAAAATATTTTTTAAAAAAGATTTTATTAACATTTCTACAGTAACTTAGATTTTACGCTAGGGGGAATTCTGAGATGAGTAACAAAATAAAGAGCATCAGACATCTTCATTTTTTCAGTAACTGCCACTTAAAAGATGCAGAAAAACATTTATATGCATCAGTATATTACATTCAATAAGCATCCTCCTGAGAAGAAGAGGAGGGTAAAATCTCTGCAATCCTTCTCGTATTTCAAAACAAAACAGAAGGAAGCCTGCACTTGGTTTGTTGCTCTGATAAACAAACAAGTCTTACATCATCAACTTTGTGTATTCATCATATACAGCTAAAGATTGCTAAACACCAGATCTAAATTAATTATAAGCTATAATTTATTTATCTAGGAGTTAAGTCCTATTCCCCTATCACCCCACTGTTTCTGTCATCAGGAGCAATTCCTCCACAGATGAAATCTTTGTTCAGGGACCTCTTATAAGTTGTGGCAGCAGAGCTGGGCTTCCTCTGCTTGGACAAGCCAGTAGTGCTCCCCCTCTAGCCCTTACAGCACCTTCATACTCCTTCCATTCCCCTTCACCTCAAAGACAAATCTTTTCTTATGATGTAGGTGAATGCTACTTGAATGCTCTGATGACATGAGAGAGCTTGACTTTTTATTCTGCAGTTACAAGGTTTTAGAGAAAGAAGTCACTGTGAGTGAGTTCACACCTCACACTATACTAAGTTCCACTGCAAGGGAGATTTTAGGGAGGTCTCTCTCTAATACCTCCTGAAAGCTGCACAAGGCTTCAGATTACATCTGGTCTGCACATATGACAAAAGGACCTTCTATTTCAGCATGGCAATAGTCAGCAGTTTAAATTCTGCTCTCTGATAGATGTTTAGTCTGAAAATGTGGACTGAAAGGTACCATCATGCTCTTCTTCAGAAACTAAATCACATGAGGCCTCACCTGTAGATCCACCTGCAAAAGAAAACAGAAATCCTAGGGGACACAGGAGATTGGAGATGATTTTTTCTGTCACTGCAGAGTTTGGTTTTCTCAATCCAAACTGCAATGCTCTTTGAGTGACACTGTCTTTTAAATAATAAATACTTTATCACTTAACATTATCAAATAGATTTTTGTCACTTTACTGATGCTATGAAAGTTTCTTAACCACTAG
AAATCAGATACATTTTAATAATTTGGATTTGTATTCACCTGCCAGAAAGGCAATTTGCCTAATACTTTTGTAATATTACCATAATTCACTGCTCCCAAAATTTTGGACTTTCTGTATTAACATCAACTTTCTGTCCTAACAATTGTCACAGACCATGTATGTTAGTACTTAGAGTGAGCTCATGATCCCAAAACACACATTTTCATTCCAAGACCTACATTTTACAGTCCAGAATGCTACTTCTATATGATGTGACTTCAGATTTGGTCTTCTTTCAGCCCTCTTTTGTGTCAAAAGCTTGTGTGACTTCTAATCCTCTGAAAAACAAAGGTGACATGCTACAGAAACTTACATTAGCCTACACCAAATGAGAGTTACAGACATTAGCTTCTCCCTCACTTCTACTTTTCAGGGCAGTTCATTGAGACTTCATGCATAGTGTGCTCCTCTGTGGTTACTAAGGAGGGGGTTCATATTTTCAGCATATCTTTAGTATATATTCTGCCTAGTGTGCTTCATCTGATAAGAATTCCAATGATTACAAAAAAAAAAAAATCTACAAACACTAATTGATTTTGGATTATTAAGACAACAGTTGAATGCCACGACTGCCACGGTGTGATGGGAAACATTACTGTCTGATGAGAGCAATTTCAAGATGGGGCTGGGGAAACTCAAAGAAGATAGGGCATAACTTGTTATTAAATACAACCAGTGAGATATGACATTCAGCTTAGGAAATTCCTAACCAACTGGCCATAAAAAATAGGAAGAGCAGAGTGGCTGGAGCATTTTGCAGTGCTTGTTCTCATTCTGACGTCCTTTGCCCACACATTTGCTGGTGGTCAGTGCAGGAGGAAAGATGTTATCTGGCCACTTCCAGCTATTCCTTTTTTCTTAGTGGCTCACTGAAACCACTTTCCTTTGTGAACTGCGTTTGTACTTGTTCTGAAGAGGTTAAAGTTATGACTGCCATAAAAACCCCACAACATAGTTGCTTCCTCCTTCCTGATTTTCTTAGAGCACTTCAGCAGAAGTAATGTACGGGGAGTGAGATTCTTATCTAAGTCATACTCCCTGAGTCTATCTCAAGTTCCTTGTTTTACAAGAGTTTTTGTAAGTGACTGAGAACAGCATGTAACC
ATTGGAGAAGGGCAGGGGTTACATAAGATTATTCCTGAAGTTGCAAAGCAATTGGGTAAAGGGTATTTGTTTCGGATAATCGGAGATGGGGGCGCTAAGGTTTTATTAGAGACTGCGATCCATGAGCAACAACTATCAAATGTTGTTTTGGAAAAACCGGTTAAAAGAGATGCTTTGATTGAAATATACAAACGCTCTCATTTTCTATTTATGCATTTAAACGATTATGCAGCATTTGAGAAAGTACTTCCATCAAAGTTATTTGAACTAGGAGCTTTCCCCAGACCAGTAATAGCAGGAGTAAACGGATATGCAAGAAGCTTTATAAAACAGCATGTACCAAACAGTATTGTGTTTAATCCAACTGATGCTAAGGAATTAGTTAAAAAGTTAGAAAGCTATAGCTATGAATTGCCACATCGTAATGAATTCATAAATCAGTTCAATAGAAAGCAGATAAATAAAGAGATGGCACAATCCATTGCCAGTTACCTATGAAGCGTATTGCTATTACTGGTGTCAGCGGATTTGTAGGCGCAAATTTGCTGCCGTATTTGAAAGAAAGGGGAGTGGATACTATACCCTTGTCTCGGACATTTGGTCCTGGATATGATGCTGTTGATGCTGTATTCTTGAACCAGCAGTCCGTTTATGGAATTGTTCATCTTGCAGGAAAGGCGCATGATTTGCGTAAAGAGGTTAATTCGGCAGAGTATTATGAAGTCAATACGGAGCTGACGAAGCGTTTGTTTGATGCATTCCTAGAATCAGATTCGGAAGTTTTCGTTTATCTGAGTTCAGTGAAAGCAATAGCAGACAGAGTGGAAGGGGCATTGACAGAAACTCATATACCAAAACCATTAACAGATTATGGAAAATCGAAACTGGCTGCAGAGACATATTTGCAACAAGCTATATTACCCCCTCGCAAAAGAGTTTATATTCTAAGGCCTTGTATGATACATGGGCCAGGAAATAAGGGTAATCTGAATCTTTTATACCAAGTAGTAAAAAGAGGTATTCCGTATCCATTAGCAGCATTTGACAACCAGCGATCTTTCTTATCTGTGGAGAATCTTTGTTTTGTGATTCAGGAGTTGCTGCAAAGGAAGGATATTGCACCAGGAGTTTACCATATTGCGGATGATCAATGTTTAAGTACGAATGAGGTAGTCGAGATCATTGCGGATGTTTTGGGCAAGAGAAAGCGATTGATCGCAATTCCTGCTTCATGGATTCGCGCTATAGCAAGAGCAGGCAATTATCTTCCTTTGCCGTTGAATACAGAAAGACTACAAAAGCTTACAGAAGACTATGTAGTAGACAACCGGAAGTTAATTGAAGCATTGCAAAAAAAGTTACCAGTGGATGCGGGAACTGGATTAAAAAAAACTATTTCTTCTTTTACTTATGTGGATTGATTATATAGGATTGAGTATTGTATTCCTGTTACTCGAATTATTATATTTTGTCATTGCAACAAAGCGAAATATTGTTGACAAGCCAAATAATAGAAGCTCCCATCAGTATATTACTATTAGGGGTGGAGGTATTATTTTCCCAATAGCTTCTGTTATTTTTCTGCCATTTAGTAGTTTAAATGAGCTGCTATTGATAGGATCTTTATTATTAATCTCGGTGTTAAGTTTTGTAGATGATATAAAAAGTGTTGATAGCAAGATTCGACTTGTTATACAAAGTATAGCAGTGATCGGTTTATTGTACTCATTTATTGGCGTATTATCGGTAGGATGGCTATTGGTATTTTTTGTTATCATTACAGGAGTGATCAATGCATATAATTTTATGGATGGTATCAATGGCGTTACTGCACTATACTCTATAGTAACTATTGCCAGCTTATTTTGGATTAGTGAACAGGTTCAGTTTTTACAAAACAGCTTGTTTTTTCTTTCAATTTTAGCAGCATTGACAATATTCTCTTTTTTTAACCTGCGTAAGCGTGCTAAATGCTTCGCAGGGGATGTAGGTAGTGTTTC
TCTTGCTTTTATTATCTGTTTTTTAGTATTATCATTGATTATCTCTACCTCGTTTCCATATTGGATATTGTTATTAAGTGTTTATGGTATTGATACCGTGTTTACAATCTTTTGTCGTATTTTACGAAAGGAACCGTTAATGAAAGCTCATCGCTCTCATTTTTATCAATACTTAACTAATGAAGCTGGATGGGATCATTGGATCGTTAGTTTATTGTACGCAGGCGTTCAGGCGATAGTTGATATTTTATTGATCTATGCATACCTAAGCCAGCTATATTTACTTCCGATCGTAGCACTATTTGTAATTCTTATAATATATGTTATATTTAGACTCCGATTTGAAGGTAAACATCGCTTATTTGCTACTTACAACTGCTAAACTATAAACATGTTTAAACACATAAACATTGTTCCACGTTGGATCATCTTTCTGATTGATCTTGTTATTTGCTGCTTTTCGTTCATTTTTTCTTCACTGATTAAATACAACCTAACTCTTGGTGGACTCAATTTACATGACTTGAGTGGCAATTTACTCATCATTATACTCATTAATTCTATTGTATTTATCAATTTCCGGACTTATGCAGGCATTATCCGCTATACAGGAGTACAAGATGCTTTAAGAATTTGCTATGCAATAGCTATGTCGACCAGTGTATTGTTTTTTATTAGTCTCGTGTCTTCCAATTCGGGCAGCGCACTTTTCTTTTCTAATGTAACTCTGATCATTTATGCATTCTTTAGTTTTTTATTTTTGATTTCTTACAGAGTATTGGTAAAATATACTTTTGCGTACTTCCGAAACTATAAAATGGACCGGAAGAATGTAATTATATACGGTGCAGGAGAAGCAGGATTTGCAACTAAAAGAGTATTAGAACATGATACTACTTCAAATGTTAATATTGTGGCTTTTGTAGATGATGACTTACGTAAAGTGGGTAAAGTAGTTGATGGCATTAAAATTAGCCATACGTTGGATCTTCAAACTCTTTCATTGACGCAGAAAATTGATGAAATTATTATAGCTGCCTTTAACCTCCCTCCGGCAAAAAAGAATGAACTGGTTGACTTTTGTTTAGACCATGATATTACCGTGTTAAATGTTCCTCCTTTAGATAAATGGATTAATGGTCAGTTCTCGGCTCGTCAACTTCAAACTATTAAGATTGAAAACCTTTTAGAAAGAGAACCTATCCGCATTAATAATGAAGAGATTGGTAATCAGATAAAGAATAAGCGGATATTGGTTACAGGTGCTGCCGGGTCAATTGGTAGTGAGATCGTTCGACAACTCCTAAAATTTGATCCTCAGACAATTGTACTTTGTGATCAGGCAGAAACGCCTTTACATCAATTAGAACTAGAATTACAAGATATAAAGACCTCTACTAACTATGTTTCTTACCTGGGAGATGTAACGAATAGGGATCGAATGGAAGAACTTTTTAACCTATTTGAACCACATTATGTGTATCATGCGGCAGCTTACAAGCATGTTCCTATGATGGAACTTTGCCCTTCTGAAGCAATTTTAACTAACGTATTGGGGACAAGAATTATAGCTGATCTTGCTGTAAAGTATAAAGCACAGCGATTCGTAATGGTGTCTACTGATAAAGCAGTCAATCCTACCAATGTAATGGGAGCTTCTAAACGACTAGCTGAAACTTATGTACAATCATTGCATTATCACCAGATCGCGAATCTGATTAACAATAACCATCACACTACCACAAAATTCATTACTACTCGTTTCGGAAACGTATTAGGGTCTAATGGCTCAGTAATTATTAGATTTAAGGAGCAGATTCAGAAGGGAGGGCCAGTAACAGTGACACATCCTAATATTACTCGTTTCTTCATGACCATCCCTGAAGCTTGTCAGCTTGTGTTAGAAGCTGGGTCTATGGGGAAGGGGGGAGAAATCTTTGTGTTTGACATGGGGAAACCTGTTCCTATTGTGGATTTGGCTAAAAAAATGATTAGATTAT
ACGGATTGGTACCTGGCATTGATGTTGATATAAAATATACTGGCTTAAGACCAGGTGAAAAATTATATGAAGAATTACTGACAGACTCTGAAAATACTTTGCCGACCTATCATGAAAAGATCATGATAGCGAAAGTAAGGCAAAATCATCTGGATGATATGCTGCATCATTTTGAGGACTTATTTGCATTAGCCAAACAAAAGGACGGCATGATGCAGATGGTAGCGAAAATGAAAGAACTGGTTCCGGAGTTTGTTAGTAATAATTCGGTGTTTGAGCAGTTGGATGGGGATAAGGCGGCGGTTATTGAGATGAATAGAGCTGTTTCGTAATCGGGCAGGTCCCTCGGCGTGCTTCGCTTGCTCGGGATGACGATTAATAAGGATATGAAGAACAAGTATAGAGTTAGCTTACTGAAATTAGGTAAGCTTTTTGTTTTTTTAATGGGTGGATTGCAGGCAGCTGCCCAGGGAGTGGAGATAGGATCGGTAAATGATCAGGCACTGCGGATGTTGCAATTGCAAGGGAAATTGAATGCAAAGTATTCTCTAATGGCACGACCTTTTTTTGCGGAAGGGGCCATTACAACTGACAGCATTTATAAACTGATTGATGATAGTGCCAGTATAAATGTGACCAGGAAACGAATAGGAGACAAAGGGGTATTAGAAATTTTCCCTCTTACCATCAACAGTCAGTTGAATACACATCACCCTTATGGATGGAATCAACCTGGGTTTGTACAGGCGAATGGATGGCAGGGGCTTATTACTGCCGGCGCTTATACTGCTATTGGGCCATTGAGCATACAGGTAAAGCCTGTAGTAGTATACGCTGCTAACACCGGGTTTGAACATACTAACACCTATGGTGCGGTAACAAGAGGTAGTTACTCGCGTGTCTTACCCGGCCAATCCAGTATCCGGCTGAATGCCGGAGCTGTTTCATTAGGTATTTCTACGGAAAATCTTTGGTGGGGGCCGGGTAGTTTTAATTCCTTGCTGATGAGCAATAACGCACCCGGATTCCTGCACCTTACTTTCAATTCTACAAGGCCTGTAAAAACTCCCATCGGTAGTTTTGAATGGCAATTGGTGAGTGGGAAACTTTATGAGGATACTATGCTATTAAGGGAAGATAAGAACCTGACTACTACGTATTATAATCCCAAAAATTATGATGGTAGTGGTTATAGTGGTCCCTATGACCCGCATCAAAAATGGCGATATTTCAATGGTGTTACTATCACTTACCAACCCAAATGGATCAAAGGGCTCTTCCTGGGCATTAATAGAATAGCCTATGCATACAATGATAGTTTGCAAAGTGGGAATAGCAATTTCTTCCACAGCTATATGCCTGTGATCTTTGGTGTATTTAGAGAGAGCTATGCTTACGGAACACGACAAGGGGTAAAAAAAAGGTACAAACAAATGCTAAGCTTGCACGCACGTTATCTATTCCCCCAAGCTCATACCGAAATTTATGCTGAATATGGCTGGGGAGATAATCTGCTCAATATCCGCGACTTTGTGCTTAATGTACCCCATTCAACTGCATATATTCTGGGTGCCCGAAAAATGGTTCCTCTAAACGCGAAGAATCAATGGCTGGACGTACAAGCAGAATTGACCAGGCTTTCACAGCCTTCGGACTATATTCCGCGAACTGCTGGTAATTGGTATGGATACCAAGGTGGCTATACTCAGCAAAGCCGCATAATCGGCGCCGGTATTGGCCCTGGCAACAATGTACAAACCTTCGCCACCACCTGGGTACACGGTTGGAAACGGTTGGGTATAAAATTGGAACGGCTCCAGCACGATCCTAACAGTTACCCCGTGTACTGGTCAGATTATTCTGTAGGCTTTACCGGCCAGCAACGTTTCGGTAAATGGATAGCGAGTTCACTCCTGCAATTCATCCAGTCGAAAAACTATACCTGGGAAGCCGGAAAAGACCGCTTCAACTTTTATGGAACCCTGAACCTCACTTATGTA
TTCTAAGCTATTTTTATTTTTTGCTTTATCCTTTTTGACAATTAATGAAGGTTGGGGGCAGTCTTTCACTAGTGACCGGATTAAAGAGCTGTTCCGCAACCAGCAATTATTGGGTACTTATCAACCCAACCATTCACTCTTGGTGAATGCCAACCAGGTGCCATTGACCGACCTGGATTCGGCGCTGGACATAAAAGCACAGCAACCTCTTATAGAGTTCCTCCCGGCCCAACTGGTACAGCAATACAACAGCCAATTACCTTACGACTGGAACAATGGCACCATGATACCCGCCCGCGGTTACCAGTTACAAGCTTCTGTAGGTGTGCACGCACAACTGGGCCGCCACCTCGAAATACAACTGGCCCCCGAAGCGGTACTCGCCGAAAACAAATCCTTTGAACAGTTCTCTTCTCAATTGAGCGATAAATCCTGGGCGGCGCGATACCGTTTCTGGAACACCATCGACATGCCCGACCGTTTCGGCAACGGCCATTACCAGAAACTGTTCCCCGGCCAGTCATTCATCCGCTACAATACCCGCTCCCTTTCCTTCGGTATTTCTACGCAAAGCCTATGGTGGGGTCCCGGTTACCGCAATGCGCTCATCATGAGCAGCAACGCCCCCGGCTTCCTGCACGCTACCATCAATACCATCCGCCCGCTGCACACCGGCATAGGCGATTTTGAAGGGCAGATTATTGCCGGCAAACTGGATGGCAGTGATGTATTGCCACCCCGCATCTACAGTGTGTACAACGGCCAGTTCCTGTACCAGCCCAAAAACGATGAATGGCGCTACCTGGCCGGGATGGCGCTCACCTGGCGACCCAAGTGGACGCCCAACCTCTTCCTGGGCTTCGCCAAAGCCTCTTACCTCTACCACAGCGACATCACCAACCCGTTAGATGTATTGCCTTTTGAAGGCTTCCTGGGACACAGCCGTACCCAGGCCGAACGCACCGGTAAAAAAGCTTCCCTGGGCTCTTTGTTCATGCGCTATATCATGCCCAAAGAACAGGCTGAAATGTACCTGGAATACGGCCGCAAAGACATTTCCATGATGCCATGGAACGTGCTGCAGAACGCTCCCTACCGCAGGGCCTTCACCGGCGGCTTCCGCAAACTCTTCAATTGGAAAAACCAAAGCCACATTCTGCTGGCAGTAGAACTAACGCAATTACAGGCCTCCGATGCCACCCTCATACGCAACCCCGATAGCTGGTACACCCATGCCTACGTGCGGCAGGGCTATACGCAACTGGGACGTCCGCTGGGCGCCGGCATAGGCCCTGGCAGCAACAGCGAGACACTGGAAATAGCCTGGGTAAAAGGACTGAAAAAAATAGGCATCCAGTTTGAACGCCTCCGCCACAACGGCGATTTTTACTATTATGCTTTTGAATCGATCGGCGACTTCCGCCGCAACTGGGTAGACCTCTCTACCACCTTCAAAGCCGATTGGAACTACCAGCGTTTCTTCTTCTCCGGCCAGCTCGGCATCATCCGCTCCCTCAACTACCAATGGCTGGT
AACGAAATGGACGGTCTTTTGACGCAGCGTTTCTCCCTTTATCCGACTTATTTTAATGCGTTTGTTCCTTATTCCAACTATAATGCAAACGGTCGTTCCGGTGTTGATATAATGCTGAACGTGAACAAAAAAATAGGGGAACTTGATTTGAATCTGGGTGTGAATGCTACGTATGCCACTTCGAAAGTGACCAAAAGTGATGGGCTGTATGCCGATGCTTATCAGTCCAGGATCGGCAAACCGGTTGATGCCATCTTCGGGCTGCAGAGTAAAGGCTTTTTTGCAGACCAATCCGATATCGACAAAAGCCCCAAACAATTGTTTGGCGTAGTAAAACCCGGAGATATCAAATATGTAGACCAAAACGGAGATGGTATCATTGATCAAAGAGACTTCGTAATGATCGGTCGTTATGTAGCTCCTTTTTCTTATGGGATCACATTCAACGCAACCTATAAAAATTTCAATCTCTTTTTGCTGGGTACAGGAAACAATGGTGGATATGGATTGAAGAACAACGACTACTACTGGGTTTTCGGCGATAAAAAATATTCCCAGGTTGTCTTGAATCGATGGACGCCTGCCACAAAAGAAACGGCAACGTTTCCCCGCCTCAGTTCACAACAGAATAATAATGATTTCCGCAGTTCGGATTTCTGGTTGTACAAAACAGACCGGTTTAACCTGCAAAAGATTCAGTTGACGTATAACGTGTCAGGCAATGTGTTGCGCAAAACATTCGTCAAAGAGTTAGGCGTATATGTTTCCGGATCTAACCTGTTTACGTTCTCCAGGAATCGAGAAATACTGGATTTAAATATCGGCACAGCCCCGCAATTCAGAAATTATATGGTTGGTCTCAGGGCCGGATTTTAAATGATAATCTCAAATAATTAACTGATGAACATGAATAAAAATATATGGGTGCTTGTTCTGCTCGCCGCCATGTTTGCGGGTTGCAAAAAGGAACTTGCTCCGCTTGATGACAACCACCTCTCACAGACAAGACTTTCCAGCAATCCTTATTTTGAGGAAGGTATCCTGATGAATGCATATACAAGACTTCCGACGGCTAACTATTCTTTTAGCGAAGTGGCGACAGACGATGCCGTCACGAACGATAAAGCAAACGGATTTCTTTTTATGGCAACAGGCGCATGGTCGGCAGCCAATAACCCGATGGACCAATGGAATAACTCCTATACAGCCATCATGTACCTGAACCTTTTTCTGCAGCAGGTCGATACCATCAACTGGTCGCCGATGAATCAAAATGTATCGGCTTTGTTTCGCGACCGCCTGAAAGGGGAAGCCTTGGGTCTGAGAGGATTGTTCTACCTGAATCTTTTGCAGGCGCATGCAGGTAATGATATGAGCGGCCAGTTATTGGGTGTGCCCATCATTACAGACGTACTCACTCCCAATACTGATTTCAAAAAGCCAAGAAATACGTTTGATGAGTGCATGAAACAGATCTACAATGATCTTACGGAAGCAGAGAAATACCTGCCTTTGGATTACCAGGACATTGCCAGCACATCGCAACTGCCATCCAGATACAGCGCTGCGGTAATAGCCGATTATAACCGGGTTTTTGGTAGCGTTGACAGGCAGCGCCTATCGGCCCGCATTGTAAAAGGCATTCGGGCAAGAGCCGCTTTGCTGGCAGCCAGCCCTGCTTTTAACCCACAGGGAGATGCGGCTAAATGGACGGATGCTGCCAACTATGCCGGTGATGTATTGGATCGCATCGGAGGCGTTAGTGGTCTGGACCCGAAAGGGGCACTTTTTTACACTGCAACCAATGTCAATGCCATCAACCTGGCCAGTAACATCGATCAAAAAGAAATGTTATGGCGAGGCAGTATTGCAACCAGTAACAATATGGAGATAAACAATTTTCCACCATCACTTTTTGGCAATGGAAGAGTGAATCCAACACAGAATTTGGTGGATGCCTTTCCTATGGCAAACGGGTATCCGATTACAGACCCGGGCAG
CGGTTATGCAGCCGTCAATCCTTATGCAGGTCGCGATCCGCGTCTGAAAAATTATATCCTGGTAAATGGTGGAACCATGTCAAATAAAACCTTCTATACCAAAACCGACAATCCCACCAGCGATGCCGTTAATTTTCTTCCCACCTCAACCCGCACCGGCTATTACCTGCGGAAACTGTTACGGGAAGATGTAAATGCAGACCCTACTTCCAGTTCGGTTCAAAAGCACTATCCCGTTCATATGCGGTATACTGAACTCTTTTTGACCTATGCAGAAGCAGCCAATGAAGCCTGGGGGCCTGATGGTGTGGGTACACATGGCTATTCGGCCCGCAATGTGATTGCCGCCATCCGAAAAAGAGCCGGCATCTCGCAGCCTGATAACTACCTGGCCTCCATCTCTTCCAGGGAAGACATGCGCACGTTAATTCGCAATGAGCGAAGACTGGAGTTAAGTTTTGAAGGTTTCCGTTTCTGGGATTTGCGCCGCTGGAAACAAATGCTGACTGAGCCGGCAGAAGGTGTCTCCATCAACAACAACATTTATACGTATAACATTCTTGAAAACAGGCAGTATGCAGATTATATGTATTACGGTCCTATTCCTTACAATGAAGTTTTGAGAGCAAATCTTCGGCAAAACAAAGGATGGTAACAGCAATATCTAATTTATAAAAGAATACAAATGAAAAGTTTTTTTTGGGTCATGGCTCTTGTTGCTTTTGTCTTTGTGCTTTCTGCATGCAATAAAGAGCAGACATTCCCTGATTACAAGTACACCACGGTTTATTTCGCTTACCAGTCACCTGTAAGAACACTGGTAATGGGAGATGATATTTATGACAACACATTGGATAACCAGCACAAATGCCAGATTATGGCCACCATGGGCGGCGTTTATGAAAACAAAAAGAACATTACGCTGAATGTGATGGTGGATACGACGCTTGCCAGTCATTTAAAATTTGAAACGCCTACCGGGGATAATGTCATTGCCATGCCGTCTAATTATTACACCCTGGCCAAGAATATGCAGATCACCATTCCTGCCGGGAGTGTTATGGGCGGAATAGAAGTGCAATTAACCGATGCTTTTTTTCAGGACCCACTGTCTATTAAAAATACCTTTGTGATTCCACTCCGTATTACATCGGTAAGTGGCGCCGATTCAGTCCTGCGCGGGAAAAGTGATTTGGCTTCGCCCGACCGCCGTAAACCGGGTGATTGGACAATTACTCCGAAAGATTATATTCTCTATGCTGTTAAATATGTGAATCCTTATCATGGCGTTTATTTAAGAAGAGGTGTAGATCTGGTAAAAGGTGCAAACGGAAATACTGCGCTGGATACAAGCGTTGTCTATCATACGCAGTATGTAGAGACGGATCAGATAGTTAATGTTGTTACTACGGCGTTGGACCAAAATACCATTTCATTAACCACCAGGAATAAAGGCAGTAATCTTAATGTGCCCTTTCAACTGATGCTGAAATTTGACAATCAGGGTAATTGTACCGTTACCAATCCTGCATCAGCAAACTATACAATAAGCGGCAACGGTGCATTTGTAAAGAATGGAGATATGTGGGGAAATCAGCCACGTAATGTGCTTCGCCTTAAATACCAGGTAGACTTTGGAACAACAACGCATACAATTACCGATACTATCGTGGTGCGTGACAGAGGTGTGAAAATGGAAACATTCAACCCTGTCGTATACTGACAATAATCCCCATTTATTAAAGATAAAACTGTATATGATTCAAACAGACAGGCAATATGATTTACAACAATTAAAAGATGTTGCCTCACAGGTAAGACGTGATATTGTTCGCATGGTACATTCCTGCCAATCGGGACACCCGGGCGGTTCGCTTGGCTGCACGGATGTAGTGGTCTCACTTTATTTCGCAATCATGAAAATCAATGAAGCGAAAAACGAAAATGGCTTTTTACAATTCACTCGTGAGGGCCAGGAAGAAGATGTGTTCTTTCTTTCC
AACGGCCACATTTCTCCGCTGTTTTATTCTGTCTTGGCCCGACGCGGTTATTTCCCGGTTGAAGAGCTTTCAACCTTTCGTAAAATCAATTCACGCTTGCAAGGCCATCCGGCAACGCATGAAGGCCTGCCCGGAATTCGGGTGGCATCGGGTTCCCTGGGACAAGGGCTTTCGGTGGCAGCCGGTGTTGCTTATGCCAAAAAATTAAGGAATGATGACCGAAAAGTATATGCACTAATGGGAGATGGTGAGCAACAGGAAGGGCAGATTTGGGAAGCCGTTCAATTTGCGGCGCACCACGAATTGGACAACCTTATTGCCATTATTGACTGGAACGGGCAACAAATTGATGGTCCTACTGAAAAGGTGATGAGCAATCGTGATCTAAAAGCCAAATACGAAGCCTTTGGTTGGGAGACCGTGATGCTTGCCAACGGCAACGATATGCAATTGACGTTAGAAGCCCTGAAAGACGCTAATCAAAAAGCCGGTAACAACAAACCGGTTGTGATACTGATGAAGACAGAAATGGGATTTGGCGTAGACTTCATGATGGGTACGCACAAATGGCATGGCGCTGCTCCCAATGATGAACAACTGGAAGTAGCACTCCTGCAATTACCTGCCACTTTCGGAGACTATTAATAACCGCTTAAACGTTGATGAACAAGGATGAAAAAATACAATTACACAGAAAAAAAAGATACTCGAAGCGGTTTTGGTGCCGGCATAGCAGAAGTAGGAAAAACAAATCCGAATGTAGTTGCCCTCTGTGCCGACCTGGTAGCTTCTTTAAAGCTCGATACACTTATTAAAAACAATCCGCAACGATTTATACAATGCGGCATTGCCGAAGCCAACATGATTGGCGTGGCAGCCGGATTAGCCATAGCAGGGTACATTCCTTATGCTACCACCTTTGCTAACTTTGGTTCCGGCCGTGTATATGACCAGATACGCCAAAGTGTGGCTTACTCCGGCACCAATGTAAAGATTTGTGTTTCGCATGCGGGATTAACACTGGGGGAAGATGGAGCAACGCACCAGATATTAGAAGACATCGCTATGATGCGGGCTATGCCGGAAATGACAGTGATTAATCCATGTGATTTCAACCAAACAAAAGCGGCTACCATCGCCATTGCTGATTATGAAAGACCGGTGTACCTGCGGTTTGGAAGACCAGTTGTTCCGGTGTTTACCAACCCTGATCAGAAGTTCGAGATCGGCAAAGCATGGACAGTGAACGAAGGAAAAGATGTAAGTATTTTTGCCACAGGTCATTTGGTATGGGAAGCCATATTGGCAGGAGAGATGCTGGAAAAAGAAGGAATAGATGCAGAAATTATCAACCTCCATACTATTAAGCCATTGGATGAACAAGCGATTTTGAATTCGGTAAGCAAGACAAGATGCGTGGTAACGGCGGAAGAGCATCAACTGAATGGAGGGTTGGGTGATGCTGTATGCCAGGTGTTAAGTCGAAGGCTTCCTGCACCGGTGGAAATGATAGGAGTAAACAACAGTTTTGGTGAAAGCGGTACACCGGCTGAACTAATGAAAAAGTACGGGTTGGATGCAACCAACATTGTAAAAGCCGTAAAGAAAGTGATACAGAAAAAACAGCAGGTTTTCGCTGATTATTCATCTGTACTATAATTGTGTATTCAATGAAACAACTTCTTTCGTTTTATTTACTGCTCATCACTATGAGCAGCTATCCGCAAAATAAGCCGGCCTTAAAACTTTGGTATAATCAACCATCTGGTAATACATGGGAGAATGCATTGCCTATCGGCAACGGCAGGTTGGGCGCCATGGTTTATGGCAATGTAGAAAAAGAGGTCATTCAGCTAAATGAACATACGGTTTGGAGCGGCAGCCCTAACCGGAATGATAATCCTTTAGCATTAGATTCACTGGCCCTCATCCGGCAACTGATATTTGATGGCAAACAAAAGGATGCGGAGAAGCTTGCCAACAGGGTTATTATTTCAAAA
AAATCGCATGGCCAAATGTTTGAACCCGTAGGTAATCTGGTCCTTGTCTTTGAAAAATCCGGTAACTACAGCCATTATTACAGGGAGTTGGATATTGAAAAGGCTGTTATAAAAACAGCTTACACTGTTAATAATGTAACTTATACCCGTGAAGCATTGGTTTCTTTTCCGGATCGTGTTATTGTGATACGATTAACCGCAAGCAAACCACATGGTATTTCATTTAGCGCTTTTTATTCAACCCCGCAGCCCAAAGCAGAGATGAAAACAACAACGGTAAAAGACCTTACCATTGCGGGAACAACCATAGATCATGAAACCGTCTCAGGCAGGATGAAATTTAAAGGCATTACCCGCATTAAACTGGAAGGGGGTACGCTTTCAGCAAACGACGCTTCCCTTGCGGTCAGGAATGCCAATGCCGCAACTATTTACATCTCAATTGCTTCCAACTTTAATAATTACAACGACATTAGTGCTGATGAAAATAAAAGAGCAGCAGCTTATCTGAATAAAGCCTATGCAAAATCGTATGCCACAATTTTAAATGCACACGTTGCTGCTTATCAAAGGTATTTTAACCGGGTAAAATTAGACCTTGGATCAACCGACGCTGCCAATCATCCAACCGATGAACGGTTAAAAAATTTTCGTTCCGCTAATGATCCGCAGTTGGTTACACTTTACTATCAGTTTGGCCGTTATTTGTTGATATCCTCTTCGCAGCCGGGCGGACAGCCAGCCAATCTGCAAGGTATTTGGAATGATAAAATTCGTCCGCCATGGGATAGTAAATACACCATTAACATCAATGCTGAGATGAATTATTGGCCGGCAGAAAAAACCAACCTGGCCGAAATGCATGAACCCTTTTTGCGAATGGTGAAAGAGTTATCCGAAACAGGACAGGAAACGGCGCGGAGCATGTATGGCGCAAGAGGCTGGATGGCGCATCACAACACCGATATCTGGCGGGCTACTGGTGCAGTGGATGGTGCCTTTTGGGGAGCATGGAGCAACGGCGGTGGCTGGGCCAGTCAACATCTTTGGGAGCATTATTTATACAATGGCGACAAATCTTATTTAGCGTTGGTTTACCCCGTTTTAAAAGGAGCAGCCTTATTTTATGCTGATTTTCTTGTAGAACATCCAAAGTATCACTGGCTGGTGGTTTGTCCAGACGAGTCGCCGGAAAATGCATCCAAAGTACACCAGGGCTCTTCACTGGATGCCGGCGTAACGATGACCAACCAGATTATTTTTGATGTGTTCAGTACAGCGATCCGTGCTGCACAGATGTTGAACAAAGACAAAGTTTTTGCTGATACTTTGAAACAAATGCGTAGCCGGTTAGCTCCGATGCATGTTGGCCAATATGGACAATTGCAGGAGTGGTTAGATGATATTGACGATCCAAATGATCATCATCGCCACATTTCACATTTGTATGGCTTGTTTCCTTCTAATCAAATATCACCCTATCGCACGCCAGAATTGTACAGTGCGGCCCGAAATACACTGATGCAAAGGGGAGATGTGTCAACAGGATGGAGCATGGGCTGGAAAGTAAACTGGTGGGCAAGAATGCTGGATGGGAATCATGCCTATAAGTTAATTCAAAACCAACTGACACCAGTTGGTACCAATGAAGGAGGAGGCGGCACTTACAACAACCTCTTTGATGCGCATCCACCGTTTCAAATTGACGGTAACTTTGGATGCACCTCCGGTATCACCGAAATGCTTATGCAAAGTGCAGATGGTGCCATTCATTTGCTTCCGGCACTTCCTGACGTTTGGCGATCGGGGAGTATCAGCGGCCTGCGTGCAAGAGGCGGTTTTGAAATCATAAACATGCAATGGAAAGATGCAAAACTGGTGGCGGTTGTAATCAAATCAAATTTAGGTGGTAATCTTCGGCTACGTGTTCCGAATGCAATGAAGTTAAGTAGTGGCGGTCTGTTGACAAAAGCGACAGGTCAAAATGTCAATCCGTTTTA
CCAAACTGAAGAAACGGCTAAGCCGATTGTATCGCTTAAAGCGAATATTACCACCCTGCAATTGAAAGAAACAATGCTTTATGATTTACCCACACAAAGAGGAAGAACATATATGCTGGTTGTGGAATAAAATTGTAAACATTTGTCTCATGCTTTTCAACAAATGAGCTATCAAATAGAAAACTGTCCATATAGCTTTATCCTTTCACTGTTCATGAAATATATAGATCAAAAATGTGTTTTAGAGTTATTGTATTTGCTTGCCTGTATTTATTAGGATTGAATAACGCCCCAATGGCTCAAAAAGCATTGAACCTTATAGCGGGTATCAAAAGAAATTCGCCGGCCTTCGATTCGTCGGTTTTTTATACCCATGTGAAAACCTATGTGAACCCTGTATTACCCGGCGACTACCCGGATCCCACTTTGCTGAAAGTGGGCGATGATTTTTATCATTGCGGTTCAAGTTTTCATTTTACACCTTATCTGCCCATTTATCATTCGAGAGATTTAGTGCATTGGGAAGTGATCAGCCGGGTAGTTCCACCTGCAAAAGCAAAGTTTGTTACGGACCGGCCATCGGCGGGAATATGGCAGGGAGCTATTACTTATTTCTATGGTTCTTA
TGATCCCATACGGGCAGCCGCCTCATCCCATACAGACAGCCTACTGATCCCATAGTGACAGCCTGCTGATCCCACACAGACAGCCTGGGGGTCCCATAGTGAGATCCTACTGATCCCATACAGACAGCCCACTGATTCCATATGGACAGCCGCCTCATCCCATACAGACAGCCCACTGATCCCACACGGACAGCCTGGTGGTCCCATAGTGAGATCCTGCTGATCCCATAGGAACAGCCTGCTGATCCCATGGAGAGATCCTACTGATCCCATAGGAACAGCCTGCTGATCCCATACAGACAGCCCACTGATCCCACACAGACAGCCTGCTGATCCCATACAGACAGCCCACTGATCCCACACGGACAGCCTGCTGATCCCATACAGACAGCCCACTGATCCCACACGGACAGCCTTCTGATCCCATACGGACAGCCTACTGATCCCATAGGGACAGCCTGCTGATCCCACACAGACAGCCTGCTGATCCCATTAGGTCAGCCGCCTCATCCCATACAGACAGCCTACTGATCCCATAGGGACAGCCTGCTGATCCCATACAGACAGCCTGCTGATCCCATAGGGACAGCCTTCTGATCCCATAGGGACAGTCTGCTGATCCCATACGGACAGCCTACTGATCCCATAGTGACAGCCTGGTGATGCCGTGGTGACATTACGGTGATCCCATAGCGACATCCTGCTGATCTCATAGGGACAGCTTGGTGATCCCATAGGGACATCCTGGTGATCCCATACAGACAGCCTTCTGATCCCATGGTGACGTTGCAGTGATCCCATAGTGACAGCCTGCTGATCCCATAGTGACATCCTACTGATCCCATACGGGCAGCCCACTGATCCCATACGGACAGCTGCTGATCCCATAGTGAGATCCTGCTGATCCCATAGAGAGATCCTACTGATCCCATAGGAACAGCCTGCTGATCCCATATAGACAGCCCACTGATCCCATTAGGTCAGCCTTCTGATCCCATACAGACAGCCTACTGATCCCATAGGGACAGTCTGCTGATCCCATACAGACAGCCCATTGATCCCATACGGGCAGCCGCCTCATCCCATACGGACAGCCTTCTGATCCCATAGGGACAGCCTGCTGATCCCACACAGACAGCCTGGTGGTCCCATAGTGAGATCCTACTGATCCCATACAGACAGCCCACTGATCCCATAGGGACAGCCGCCTCATCCCATACAGACAGCCCACTGATCCCACACGGACAGCCTGCTGATCCCATACGGACAGCCTACTGATCCCATAGTGACAGCCTGCTGATCCCATACAGACAGCCCACTGATCCCATTAGGTCAGCCTTCTGATCCCATACGGACAGCCTACAGATCCCATAGGGACAGTCTGCTGATCCCATACAGACAGCCTGCTGATCCCATAGTGACAGCCTTCTGATCCCATAGGGACAGTCTGCTGATCCCATATGGACAGCCTGCTGATCCCATAGTGACAGCCTGCTGATCCCGTGGTGACGTTACGGTGATCCCATACAGACAGCCTGCTGATCCCATAGTGTCAGCCTGGTACTGCTATGGTGGGATCCTGCTGATCCCATGGTGACATCCTTGTGATCCTGTGGTGACATCCTGCTGATCCCATGGTGACATCCTGGTGATCCCATAATGAGATCCTGGTGATCCCATGGTGACATTCAGGTGATCCCATAATGAGCTCCTGGTGATCCCATGGGGATATCCTGGTGCTGCCATGGTGACATCCTGGTGATCCCATAGTGACATCCTGGTGATCCCATGGTGATGTTATGGTGATCCCATGGGGACATCCTGGTGATCCCATGGGGACATCGCACTGATCCCATGGTCACATCCTGGTGATCCCATAATGAGCTCCTGGTGATCCCATGGGGACATTCAGGTGATCCCATGGTGGAATCCTGGTGATCCCATGGGGACATTCAGGTGATCCCATAATGAGCTCTTGGTGATCCCATGGGGACATTCAG
GTGATCCCGTGGTGACATCCTGGTGATCCCATGGTGACGTTCCGGTGATCCCATACGGACAGCCCGCTGATCCCATAGTGACATCCTGCTGATCCCGTGGTGACATTCAGGTGATCCCATGGTGACATCCTGGTGACCCCATGGTGACATTCAGGTGATCCCATGGTGACATCCTGGTGATCCCATGGTCATATCTTGGTGCTGTCATGGTAACATCCTGGTGATCCCATGGTGACATTCAGGTGATCCCATGGGGACATCCTGGTGATCCCATAATGAGCTCCTGGTACTGCCATGGTGACATCCTGGTGATCCCATAGCGACATCCTGGTGATCCCATAGCGACATCCTGCTGATCCCATGGTCACATCCTGGTGATTCCATAGTGACAGTCAAGTGGTCCCATGGTGACATCCTGGTGATCCCATAGTGACATCCTCGTGATCCCATGGGGACATCCTGGTGATCCCGTGGTCACATCCTGGTGCTGTCATGGTGACATCCTGGTGATCCCATGGGGACCTCCTGGTGATCCCATGGTGATGTTATGGTGATCCCATGCTGACCTCCCGGTGATCCCATGGTGACATTCTGGTGATCCCGTGGTGACATTCAGGTGATCCCATGGGGACATCCTGGTGATCCTGTGGACACATCCTGGTGATCCCTTGGGGACATTCAGGTGATCCCGTGGGGACATTCAGGTGATCCCATGATGACATTCAGGTGATCCCATGTGGACATCCTGGTGATCCCATAATGAGCTCCTGGTACTGCCATGGTGACGTTATGGTGATCCCATGGGGACATTCAGGTGATCCCATAGCGACATCCTGCTGATCCCATAGTGACAGTCAAGTGGTCCCATGGTAACATCCTGGTGATCCTATGGGGGACATCCTGGTGATCCCATGGTGACATCCCAGTGATCCCATGGGGACGTTCAGGTGATCCCATAATGAGCTCCTGGTGATCCCATGGGGACATTCAGGTGCTGCCATGGTGACATCCTGGTGATCCCATGGGGACATTCAGGTGATCCCATGTGGACCTCCTGGTGATCCCATAGTGAGGTCCTGGTGATCCCATGGGGACATCCTGGTGATCCCATTATGAGCTCCTGGTACTGCCATGGTGACATCCTGGTGATCCCATAGCGACATCCTGGTGATCCCATAGCGACATCCTGCTGATCCCATAGTGACAGCCTGCTGATCCCATGGTCACATCCTGGTGATTCCATAGTGACAGTCAAGTGGTCCCATGGTAACATCCTGGTGATCCTATGGGGGACATCCTGGTGATCCCATGGGGACTTCCTGGTGATCCCATGGTGAGATCCTGGTACTGCTATGGTGACATCCTGGTGATCCCATGGGGACATTCAGGTGATCCCATGGTGGAATCCTGGTAATCCCATAGTGAGATCCTGGTGATCCCGTGGTGACATTCAGGCAATCCCATGGTCACATCTTGGTGCTGCCATGGTGACATCCTGGTGATCCCGTGGTGACTTCGTGGTGATCCCATGGGGACATCCTGGTGATCCCATGGTGACATCCTGGTGATCCCATAGTGACCTCCTGGCACTGCTATGGTGAGATTCTGGTGCTACCATGGTGACTCCTGGTGATCCCATGGGGACATCGTGGTGATCCCATGGGGACATCCTGGTGATCCTGTGGTGACATTCAGGTGATCCCATGGTGACATCCTGGTGATCCCATGGGGACATCGCACTGATCCCATGGTCACATCCTGGTGATCCCATAATGAGCTCCTGGTGAGCCCATGGGGACATCCTGGTGATTCCGTGGTGACAGCCTGGTGATCCCATGGGGACGTTCAGGTGATCCCATAATGAGCTCCTGGTGATCCCATGGGGACATCCAGGTGATCCCATGGTGACATCCTGGTGATCCCGTGGTGACATTCAGGTGATCCCATGGGGACATCCTGGTGATCCCATGATGACATCCTGGTGATCCCATGGGGACATT
CAGGTGATCCCGTGGGGACATTCAGGTGATCCCATGATGACATCCTGGTGATCCCATGGGGACATTCAGGTGATCCCATAATGAGATCCTGGTGATCCCATGGTGACATCTTGGTGCTGTCATGGCGACCTCCTGGTGATCCCATGTGGACATCCTGGTGATCCCATAATGAGCTCCTGGTACTGCCATGGTGACGTTATGGTGATCCCATGGTGACATCCTGGTGATCCCATAATGAGATCCTGGTGATCCCATGGTGACAGCCTGCTGATCCCATGGGGACATTTAGGTGATCCTGTGGTGACATTCAGGTGATCCCATGGGGACATTCAGGTGATCCCATAGTGAGATCCTGGTGATCCCATGGTGACATCCTGGTGACCCCATGGGGACGTTCAGGTGATCCCATGGTGGAATCCTGGTGATCCCATAATGAGCTCCTGGTGAGCCCATGGGGACATCCTGGTGATCCCGTGGTGACAGCCTGGTGATCCCATGGGGACGTTCAGGTGATCCCATAATGAGCTCCTGGTGATCCCATGGGGACATCCTGGT
TTTTCTTTCACTGCTTTGATCTGGAAAATTTCCTGCCCATTTTCTTCTAAGAAAACAATGCGGTAACGGTGTTGTTTATAACTGGGCAGGCGCACGGTGAGGTAACCTTTGTCGTTCGTGAATACATAAGGAGAAGGGACCCAGGATAATGCCTTGGCAACAAAGGGCCGCCAGTCGATGGAATCGGCATCAACGGTGTACAGGGAATCGCTGGTTCTGTTGATGATGGAATCACGAAAACGCAGGAAGTCTTTGTAGGGTAATTTGAGCGCCAGTTCACCCCGTTTATAAATGCGTATGACAGCATCGGCTGTCATGGCAGCAGCATTGGGTAATGGTTGTACGCGGGTATCCGGACTGTTATCGCCAACCCTTTTCACATTGCTAAAGAAATACTGGCCGCCCCTCAGCACATAAAAAATGCGGTAATACACTTTATAACCCGGCATTGCCTGCGCGTCTACAAAACCATTGGAGGGCAATTCGGGCGATTGGGAGGAGAAGATGGTCCTGAAATTGGTGATACTGTCGTAAGACCGCTGTACGGCCAATTGGATGCAGGAAGGGTAAGGATTGATCCAACTGATCTGGGTCTTTCCTTTGGTGAGTTCTTTAACGGTAAAACCGGGCAGCGTGGTCTGGGCCTGTGCACTGTTCCATACAACAAACATCGCCAGCCAGGCCGGAAAAATTAAGCGTAGTGATTTCATCTTGTACAAAGATAAAAGCTGTTGGGTTATCAAAAAGGTAATGTATGCGGTGGTAGTGAATTGTTTCCCTGCAGGCCGCCGCTATGGATAACCAGGAGGCGGCTGTTTTTATGGAAATATTTTTTCTCAATACAATCCTTCACGGCGAATAATAGCTTCGATGTGTACACGATATCGGTAGGCAATTGTTCATTTTCCCATAAGCCCTGCATGAACCGTACCAGATTTGAAGGGTGCTTGCCATATCCCCCGAAATGATAACCGTGTAAGATGCGATGTTGCTTCTGTTGATCACCGGCATCCAGTAAATCGTCAACGGCTTTTTGCAGCGCTTCATTGCCTTTCATAACGCTGATGCCAATAACTTCCTGGTTGGCAGATGCTGTCCGCACCAATCCCGCCATCATAGTGCCTGTACCTACTGCGCAGAGTATATGCGAATAGCCCGAAACATCAACCAGTTTGAGTATTTCCGAAGCGCCCCTCGCACCTTCCGGTCCGTAGCCGCCTTCTGGTATCCAATACCAACCATGATCGGCATGGGTTTGGATGATAGCATCTTTAAAACGGTAATCGGAGCGATTGATGAATACCAGCTCCATGCCGCAAGACTGCGCGGCTGCCAACGTATGTGAAAGCGGAGAATGGGCTTCACCGCGAATGAAACCGATCGTTTTCAGTCCCCGTTCCCGTCCTGCATAGGCCAGCGCTGCAATATGGTTGGAGTAGGCTCCGCCGAAACTGGCAATGGTTTTCTTACCGGTACGTTCGGCAGCTTCGAAGTAGTACTTTAGTTTGAACCATTTATTGCCGCTTATAACCGGGTGTATGCGGTCGAGCCGCAATACATCAACCGTTGTATAGTCGCCACTGACCCAACTGCCGAGCGATTGAATATTGATATTTTGTGAATTATAGGACATTCGTGCTGTTTGTTACTCAAATCGAATGATTTAGCTTCGCCACTCGAAAGTATCAAAACAAACCAATCAATGAAACCCACCCTTGTGATCCTTGCAGCAGGCATGGCCAGCCGTTACGGAAGTATGAAACAAATAGAATCATTCGGTCCATCCGGTGAAACCATTATGGACTATTCCATTTATGATGCCATCCAGGCCGGATTTGGAAAGATCGTTTTCATTATCCGCGAAGATTTTGCCGACCAGTTCAAAGCCATTTTTGAACCCAAGCTCAAGGGACGGATCGAGATCGATTATGTGTACCAGGACCTCAAATCTTTTACCGGCAACCGGGTAATACCTGCCGACAGAACCAAACCCTGGGGTA
CGGCGCACGCGGTATTATGTTGTAAAGGCAAGATACACGAGCCTTTTGCAGTGATCAATGCCGATGATTATTATGGACGCGATGCTTTCATCAAGGCCTATGATTTTCTCGTTACCAAATGCAACGAAAAAACATATTGCATCATCGGATATGAGCTGAATAAAACGTTGAGCGATAACGGCAGTGTGAGCAGGGGTGTTTGCGAAGTGGATGCCGATAATAACCTGACCGATATCAACGAACGTACCAAAATTTCACGCCAGGCCAACGGTGATATTATTTTTGAAGACGAGACTGGTACCCATCATGTATTATCCGAAAATGCAATGGTGAGCATGAACTATCTCTGTTTTGCTCCGGGCTTTATTGATGTTTGTGAAGGCTTTTTCGGCGAGTTCCTGGATAAGAATATCAATAACCTCAAGTCGGAGTTTTTCATTCCTGTTGTAGCGGGACAGTTCGTATCATCCGGAAAAGGTGTAGTGAAAGTAATACCTACTTCAGCAAAATGGTTCGGCGTTACTTATAAAGAAGATGCCCCGGTGGTACAGGCCAGCATCGATCAATTGGTAGCTGCGGGAGAATACCCCAATAATCTCTGGGCTTGATTTGATATCCATCGTCATCCCGAACGTAGTCGAGGGATCTGACCAAACTTTCAGCAGATCCTTCGACTCGCTTTCGCTCGCTCAGGATGACGGCCCATAATCGGACCTAATCCTCATCTGAGCGATGATTATCTGCTGTAAGCTTTTACATAGAATACGCAAGGCTCCAGTTCGGCAATTTGTTCCTTGCGCTGTACACCCCGGAGACTGCTGCGTGCCGGCAGTTTGATTTTTACCATGGAACTGTCGCTTTTCTTGAGTTCTTCCACCATGCTGAAAATGTTCCTGGATTTAACTGCAAATGCTACTCCTTCTGCCTGTGCCTGGCGGGTGCTCAATACACCTACCACTTCTCCGTTGGTATTGAATACGGGTCCGCCTGAGTTGCCGGGATTGGCGCTGATCTGCACCTGGTAGGAGAGTGAATCTCCGTTGTAGCCCGTTTTGGCGCTCAGGTATCCCATTCCGTATACGATGTCGTTCCGGGGATATCCCAGGGTGAAGATCTCTTCGCCGAGATCGGTATTGTTTTTACGGATACTATAGGGAAGCGAGTGCAATGGCTCGTAATCTTTGTCGCTGATCTTAAGAATGGCCAGGTCTTTATCCTGGTCGATATGCACAATGGTAGCGCTGAATTCCTGCCCCTGGTTATTCACGACAATAGCGCCCGAACCTTTGAGTACATGGGCATTGGTCATGATATAACCTTTGGTATCGATGAGGAACCCTGTGCCGCCGCTGATGAGCCGGGCATTCTCGGGTATCTTGCTTTTTACTTCGTTGATCAATGAACCCTGGTATTGCTGGTTCCTTTTGATCACTTCAATGTCTTTGCTCAGTTGTTGCAACTGGTTGCCGTTTACAGCGGGGGAGAGGTAAACGGTGAGGCCACTGATGACCAATGCTATGATACCACCCACCGAAGCGGCGATGGCGGTTACTTTTTTGTATTTGTTCCAGAGTTGGATAACGCGCCCTTTAGCCGAAGGTGCGGCGCCTTCATTGATGTCGCCCCGGGCCAGTAATCGGGCATGGGATTCATGCATGCTGTGTTTCAGTGTGCGGTGGGCTGCATAAAAATCCATCTGGTGCAGGAACATACCGTGCTCTACCACCATCTGGTCAATTTCAGGCGTATTTTTCCGGAGCTCTTCAAAATAAGCCCGTTCGGCGGCGTCCATATCGCCGCTGAGGTACCTTTCAATAGCTTCCAGTAATAAGATATCGTCCATCATAGTGCTTACTCCCCGATATTATATTGCGCAAAGAAGAGTTTCTTTAAACGCATTAAACATTTATATTTTTGGTTCTTGGCGTTATCGGCATTGGTATAACCGAATTCCTGCGCCAATTCCTGCATGTCCATTTTCTTCAGGTAAAATCCTTCCA
ACAATCCCTTGCAGGGCTCCCCCAGGCTGTTCAGGGCCCGGTCCATTATCGCAAACTCGGCATTCCGCTTCTCATGGATCTCCAGGTCTTCTTCCACCGCCACTGTTTCATCAAGCGGGTCAACTTGCCGTGTAAAACGACCCAGTTGTTGTAGCCGTTTGAGCCAAAGTCTCCTACATATTGAATAAACATAAGTTCTAATTTGACAGGTTAAGACAAATGATTCGCTCTGTGCTTTTTCATAGAGGGCAATCATCGCTTCCTGAAAAACATCCCTTGCGTCATCGTAAGAGCCATTGTTGTTCAGAATGAAAGACTGAATGGATGCGAAATTCTCCTTATAAATGGCATCAATGGCCTTCGAGTCATTGTTGGCCAACCCTTTCAATAACACCTGTTCGTTGCTGTCAGCTTTCACTATGCACCTTTTAATACTGTGAGGAGGGTAAAAGTAACCCAAAATAGTCCTAAAATGTTTTTTTGAAAAATATTCCGGCCCCTAGGTTACCTTTTATGTATCGCCGGTATTAAAAAGAAATTCACTCACACAAAAAACACACAACAAATGAAAAAGCTGTTATTCGTTTTAGCTATCGGCGCTCTTGCTGCCTGCGGTTCTGGTGCCAACAAAGATGCTGCTAAAGATTCTACTGCAACTGAATCAGCAGCAGGAGCTACTAAAGCTGCTGCTGACACCACTAAGAAAGCAGATACAGCTGCTAAAGCTGCTGTTGACACTTCAGCTAAGAAGAAGTAATTTATTTTAGGCTGCCGAATATAAAACGAACGGAGGCGTGCTTGCCTGCCCCTGGTAGCTTATCAAAGATTTTTTCATATGGCAAGCAATCGTTAGAGGCCGCTCCGTTTCCACGGAGGGGCTTTTTTCTTTTTTGAGTGAAGAGAATAGTTCCGTTAACGAGTTGTCAGATTTCATCCGCCAGTTGGCAAAAAACAGGCTTCAATTGTACGATTGTGATCTTCTGCTGTACTCCTAACTACTATTTGTTAGCTTAATTTTTTTTCATGAATTGATGGTTGGTATCAAATCAGTAGCTATTTTTGACACAGAATGAAAACAACAAATACTTTCTTTTTCTTTTATTACTTTTTCTTCTTTACACAAAGGAGCCGGGTAGTATTTGCGAAAGCTTAAAAGATAAAATGAATCAAAATACGATCCCGGCTCAAAGGCCGGGATTTTTTATACCCCCCACCCAACCACTATGAACAAAAGAATCGCTATTCAAGGATATGAAGGCAGCTTCCACCAGGTAGCCGCCATGCACATATTCGGTAAGAACATTGACGTAATTCCCTGTGATACTTTCAGGGAACTTATCAAAATAGCGGAAGACAAAAAGCAGTCCGACGGCGCGGTAATGGCCATCGAAAACTCCATCGCGGGCAGCATCCTGCCCAATTACAATCTCCTGCAAAAAAGCAAACTGAAAGTAACCGGCGAAGTATACCTCTCCATCAGCCAGAACCTGATGGTCAATCCCGGTGTAAGGTTCGAAGACATCCGCGAAGTGCACAGTCACCCGATGGCCATCCTGCAATGTCTCGACTACCTCGAAAAGCACAACTGGAAACTGGTGGAAACAGAAGACACCGCTTTAAGCGCCAAACTGCTGCACCAGCACCGGCGCCAACACGCTGCAGCTATTGCCAGCAGACTCGCAGCCGAATTATTCGGACTGGAAATCCTGGCTCCCAATATCCATACACTCAAAAACAATGTGACCCGGTTCCTGGTGCTGCAAAAAGAAAATGATGTGGAGCCGGTACCTGATGCCGACAAAGCATCGGTTTATTTCCAGACCGATCATTCCAAAGGATCACTGGCCCGCGTGCTCACACATATTGCCAGCGCCGGCATCAACCTGAGCAAATTGCAAAGCATGCCCATACCGGGCAGCGATTTCAAATATGGTTTCTATGCCGATATGGAATTCGAAGGCATGCAACAATTGAACGAAGTGCTGAAAGCCATGCAGCCA
CTGACCAATTTGGTCAAAACATTCGGTATTTATAAACAGGGAAAACTGGTGAAAGGATGATACAGGTAAACACAGCAAAAAGACTGGAAGGCATTGGTGAATACTATTTTTCGCAAAAACTGCGCGAAATAGAGGAGCTGAACAAACAGGGCAAGCAGATCATCAACCTGGGTATCGGTAGTCCCGATCTGCCCCCGCATCCCGATGTGATCAAAGTATTGCAGGACGAAGCGGCCAAGCCCAACGTGCATGCTTACCAGAATTACAAAGGGTCGCCGGTGTTGCGAAAAGCGATTGCCGACTGGTATGCGAAATGGTATGGTGTAACGCTGAACCCCGAAAGCGAGATACTCCCGCTGATCGGCAGTAAAGAAGGTATCATGCATATTTGCATGACCTACCTCAACGAAGGCGATCAGGCATTGATTCCTGATCCCGGCTATCCTACCTACAGCAGCGCCGTTCGCCTGTCGGGCGCTACTCCCGTAGTGTATGAATTATCGGAAGCCAGCAACTGGGAACCCGATTTTGCGCAACTGGAAAAAACAGATCTCAGCAAGGTAAAACTCATGTGGGTGAATTACCCGCACATGCCAACCGGCCGCTTGCCACAGAAAGACCTCTTTAACAAACTGATCGCGTTTGGAAAGAAACACCATATCCTGATCTGCCACGATAATCCTTACAGTTTTATTTTGAATGATGCACCACTCAGTTTGCTGAGTGTGGAAGGGGCAAAAGAAACTGCTGTTGAGCTGAACTCACTGAGCAAAAGTTCCAATATGGCCGGATGGCGCGTGGGGATGTTGTCGGGCGCGAAAGAGCGGATCGATGAAATATTGCGCTTCAAAAGCAATATGGACAGCGGTATGTTCCTGCCCGTGCAACTGGCTGCAGCCAAAGCGCTGGGCCTCGGCAAAGACTGGTACGATGAAGTGAATGCGATCTATAAAGAAAGAAGAGAAAAAGTATTTGAATTATTGACCCTGCTCCGTTGTGCATTTGATACAAAACAGGCAGGCATGTTTGTGTGGGCAAAGATTCCCGCTACGCATGCAAACGGTTTTGCCCTGAGCGATGCAGTATTATACAATGCAAATGTATTCATCACACCGGGAGGCATATTCGGCAAAGCCGGAGAACCTTATATCAGGGTCAGCTTGTGCGGATCGGTAGAACGGTTCACAGAAGCCATTAACAGGATCAGCAACGCAGGTATATGAGAGTAACGATCATTGGTACGGGATTGATAGGTGGCTCCATGGCCATCGCCCTGAAAGAGAAAGGCTTTGCGAAGCATATCATTGGGGTGGAAAAGCATGCCGCGCATGCAGAAAAGGCACTGGCACTGGGGCTGATAGATGCAGTGTTGCCATTGCAGGAAGCTGTAGCACAATCAGATCTGGTAGTGTTGGCCACGCCGATCAATGCGGCTGAAACTTTATTGCCGCAGGTGCTTGATATGGCAGACCGACAGGTAGTGATGGATGTAGGCTCTACTAAAAAAATGATCTGTGCATCGGTAGCAGGTCATGCAAAAAGAGGGCGCTTCGTGGCCACACACCCGATGTGGGGAACGGAATACAGCGGTCCGGAAGCTGCTGTAAAAGGCGCATTCACCGATAAAGCCACTGTGATCTGCGATAAAGCCAACAGCGATGCCGATGCGGTAGCCTGTGTTGAAGAAGTATATCGGTTATTGGGCATGCACCTGGTGTACATGAACGCCGGTGACCACGATGTGCACGTGGCTTATGTGAGCCATATATCACACATCACTTCTTTTGCACTGGCCAACACGGTACTGGAAAAAGAAAAAGAAGAGGACGCCATTTTTGAACTGGCCAGTGGTGGTTTTGAGAGTACGGTGCGATTGGCCAAGAGTAATCCCGCTATGTGGGTGCCCATCTTTATGCAGAATAAAGAAAATGTGTTGGATGTACTGAATGAACATATTGCCCAGTTGCGCAAATTCAAATCATGCCTGGAGAAGGAAAATT
TCGATTACCTGCAGGAACTGATTGAAAATGCCAA

In [56]:
from random import sample
!ls ../outputs/moleculo/galGal4.

In [72]:
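# Collect the FASTA header lines of the unmapped reads; the file is closed when the block ends
with open('../outputs/moleculo/galGal4.2_LongRead.unmapped_reads') as handle:
    seqs = [line.strip() for line in handle if line.startswith('>')]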
sample(seqs, 5)


Out[72]:
['>Read_1187-Barcode=BC003-PIPELINE=Develop_T40',
 '>Read_56254-Barcode=BC107-PIPELINE=Develop_T40',
 '>Read_6955-Barcode=BC014-PIPELINE=Develop_T40',
 '>Read_54717-Barcode=BC104-PIPELINE=Develop_T40',
 '>Read_11550-Barcode=BC023-PIPELINE=Develop_T40']
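
The cell above samples only header lines, while a BLAT or ENA search needs the sequences themselves. Below is a minimal sketch of sampling whole records instead, assuming the `.unmapped_reads` file is plain FASTA (a `>` header followed by one or more sequence lines). `read_fasta` and `sample_records` are hypothetical helper names, not part of the original notebook, and the toy reads are made up.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def sample_records(lines, k, seed=None):
    """Return up to k randomly chosen (header, sequence) records."""
    records = list(read_fasta(lines))
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Toy input standing in for the real file (made-up reads).
toy = [">read_a", "ACGT", "TTGA", ">read_b", "GGCC", ">read_c", "AT"]
for name, seq in sample_records(toy, 2, seed=0):
    print(">%s\n%s" % (name, seq))
```

With the real data, an open handle on `../outputs/moleculo/galGal4.2_LongRead.unmapped_reads` can be passed as `lines`, and the printed records pasted straight into a search form.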

UCSC Genome Browser BLAT Results

All matches are small (< 20%) compared to the query size.

1)

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           31   594   680  3143  97.1%     2   -   24098595  24098686     92
browser details YourSeq           24   692   715  3143 100.0%     5   -   28574499  28574522     24
browser details YourSeq           24  2538  2563  3143  96.2%     2   -  147135181 147135206     26
browser details YourSeq           21  1094  1114  3143 100.0%    19   +    6310366   6310386     21
browser details YourSeq           20   837   862  3143  88.5%     1   +   11646631  11646656     26

2) 

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           22  1377  1400  7533  87.0%     Z   -   16508791  16508813     23
browser details YourSeq           22  3936  3961  7533  92.4%     2   +   67429996  67430021     26
browser details YourSeq           21  2606  2626  7533 100.0%     1   +   18773352  18773372     21
browser details YourSeq           20  4378  4399  7533  95.5%     1   -   20087286  20087307     22

3) 

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           27  5305  5352  8626  71.5%     6   +   24508087  24508127     41
browser details YourSeq           21  3120  3140  8626 100.0%     2   -  116678516 116678536     21
browser details YourSeq           21  8251  8271  8626 100.0%     5   +   29670651  29670671     21

4) 

   ACTIONS      QUERY          SCORE START   END QSIZE IDENTITY  CHRO             STRAND     START       END   SPAN
-------------------------------------------------------------------------------------------------------------------
browser details YourSeq          147   571  1636  4555    81.2%  28               +         971245    971827    583
browser details YourSeq          137   528  1446  4555    79.9%  Un_AADN03020025  -               437      1104    668
browser details YourSeq          133   528  1504  4555    81.3%  28               +         971246    971883    638
browser details YourSeq          123   666  1636  4555    80.2%  28               +         971229    971741    513
browser details YourSeq          121  1758  2434  4555    80.9%  Un_AADN03016205  +               166       679    514
browser details YourSeq          112   333  1027  4555    92.4%  Un_AADN03021523  -               358      1418   1061
browser details YourSeq          108   793  1286  4555    92.2%  28               +         971247    971829    583
browser details YourSeq          107   877  1549  4555    92.8%  Un_JH375825      +               181       997    817
(...)

5) 

   ACTIONS      QUERY          SCORE START   END QSIZE IDENTITY  CHRO             STRAND     START       END   SPAN
-------------------------------------------------------------------------------------------------------------------
browser details YourSeq           23  5089  5113  8034    87.5%  1                +      134733453 134733476     24
browser details YourSeq           22  7904  7925  8034   100.0%  2                -       45462825  45462846     22
browser details YourSeq           22  4526  4547  8034   100.0%  Un_AADN03014882  +              1461      1482     22
browser details YourSeq           21  6959  6979  8034   100.0%  1                -        7639014   7639034     21
browser details YourSeq           21  4423  4443  8034   100.0%  1                +       30953311  30953331     21
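
One way to put a number on how small a hit is relative to its query is to divide either the query interval or the score by QSIZE. The sketch below assumes the first START/END pair in the BLAT tables are 1-based, inclusive query coordinates and that the score is roughly the number of matching bases minus penalties. Both function names are hypothetical.

```python
def query_span_fraction(q_start, q_end, q_size):
    """Fraction of the query covered by the hit's query interval (1-based, inclusive)."""
    return (q_end - q_start + 1) / float(q_size)


def score_fraction(score, q_size):
    """BLAT score relative to the query length."""
    return score / float(q_size)


# First hit listed for sequence 1: SCORE 31, START 594, END 680, QSIZE 3143.
print(round(query_span_fraction(594, 680, 3143), 4))
print(round(score_fraction(31, 3143), 4))
```

The two measures can differ considerably for gapped hits, where the query interval is long but the score is low, so it helps to state which one a threshold refers to.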

ENA exonerate results

Then I followed Jared's suggestion and searched the same sequences against ENA (http://www.ebi.ac.uk/ena/).

Sequences 1 and 4 mapped to Gallus gallus sequences.

Sequences 2, 3 and 5, surprisingly, mapped to Sediminibacterium sp., a bacterium whose genome was published in January 2014.

1)

ENA - 3 Results
ACCESSION     ORGANISM             ALN LEN  TARGET LEN  IDENTITY(%)  E-VALUE  DESCRIPTION
-----------------------------------------------------------------------------------------
AC187113      Gallus gallus           3143        3144           99  0        Gallus gallus chromosome Z clone CH261-86G12, WORKING DRAFT SEQUENCE, 5 unordered pieces.
ADDD01052603  Meleagris gallopavo      674         670           90  4E-257   Meleagris gallopavo breed Aviagen turkey brand Nicholas breeding stock CTG_7180001608729, whole genome shotgun sequence.
ADDD01052603  Meleagris gallopavo      241         240           89  1E-77    Meleagris gallopavo breed Aviagen turkey brand Nicholas breeding stock CTG_7180001608729, whole genome shotgun sequence.

2)

ENA - 6 Results
ACCESSION     ORGANISM                    ALN LEN  TARGET LEN  IDENTITY(%)  E-VALUE  DESCRIPTION
------------------------------------------------------------------------------------------------
AZXP01000001  Sediminibacterium sp. OR53     7533        7533          100  0        Sediminibacterium sp. OR53 scaffold1_C1, whole genome shotgun sequence.
KI911562      Sediminibacterium sp. OR53     7533        7533          100  0        Sediminibacterium sp. OR53 genomic scaffold scaffold1, whole genome shotgun sequence.
ATYE01000005  Sediminibacterium sp. OR43     5231        5243           86  0        Sediminibacterium sp. OR43 SedOR43DRAFT_scaffold_3.4_C, whole genome shotgun sequence.
CU207366      Gramella forsetii KT0803        325         325           74  7E-27    Gramella forsetii KT0803 complete circular genome.
ABNO01018736  coral metagenome                109         109           85  6E-20    Coral metagenome 40382316, whole genome shotgun sequence.
AKZQ01000023  Flavobacterium sp. F52          450         451           71  2E-15    Flavobacterium sp. F52 Contig23, whole genome shotgun sequence.

3)

ENA - 36 Results
ACCESSION     ORGANISM                                    ALN LEN  TARGET LEN  IDENTITY(%)  E-VALUE  DESCRIPTION
----------------------------------------------------------------------------------------------------------------
AZXP01000001  Sediminibacterium sp. OR53                     8626        8626          100  0        Sediminibacterium sp. OR53 scaffold1_C1, whole genome shotgun sequence.
KI911562      Sediminibacterium sp. OR53                     8626        8626          100  0        Sediminibacterium sp. OR53 genomic scaffold scaffold1, whole genome shotgun sequence.
KB893315      Segetibacter koreensis DSM 18137               1988        1987           75  7E-274   Segetibacter koreensis DSM 18137 genomic scaffold B154DRAFT_scaffold_1.2, whole genome shotgun sequence.
ARFB01000002  Segetibacter koreensis DSM 18137               1988        1987           75  7E-274   Segetibacter koreensis DSM 18137 B154DRAFT_scaffold_1.2_C, whole genome shotgun sequence.
AXYK01000032  Chitinophagaceae bacterium JGI 0001013-J17     2443        2448           71  4E-158   Chitinophagaceae bacterium JGI 0001013-J17 G081DRAFT_2522133819.32_C, whole genome shotgun sequence.
AXBK01000004  Adhaeribacter aquaticus DSM 16391              2110        2107           71  7E-137   Adhaeribacter aquaticus DSM 16391 AdhaqDRAFT_Scaffold1.4_C1, whole genome shotgun sequence.
ARFB01000014  Segetibacter koreensis DSM 18137                996         997           75  4E-127   Segetibacter koreensis DSM 18137 B154DRAFT_scaffold_12.13_C, whole genome shotgun sequence.
AXVJ01000058  Runella limosa DSM 17973                       1148        1148           72  7E-98    Runella limosa DSM 17973 K339DRAFT_scaffold00024.24_C, whole genome shotgun sequence.
KE384045      Runella zeae DSM 19591                         1145        1145           72  5E-91    Runella zeae DSM 19591 genomic scaffold G563DRAFT_scaffold00014.14, whole genome shotgun sequence.
KE384044      Runella zeae DSM 19591                          950         952           72  9E-66    Runella zeae DSM 19591 genomic scaffold G563DRAFT_scaffold00013.13, whole genome shotgun sequence.

4)

ENA - 100 Results
ACCESSION     ORGANISM       ALN LEN  TARGET LEN  IDENTITY(%)  E-VALUE  DESCRIPTION
-----------------------------------------------------------------------------------
AF063649      Gallus gallus      304         308           83  9E-68    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      302         307           82  7E-61    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      303         308           82  9E-60    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      304         308           81  3E-59    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      302         307           82  3E-59    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      298         303           82  1E-58    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      303         308           81  1E-54    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      303         308           80  7E-53    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      303         308           80  4E-51    Gallus gallus clone hm98155 genomic marker COM164.
AF063649      Gallus gallus      303         308           80  4E-51    Gallus gallus clone hm98155 genomic marker COM164.

5)

ENA - 64 Results
ACCESSION     ORGANISM                                  ALN LEN  TARGET LEN  IDENTITY(%)  E-VALUE  DESCRIPTION
--------------------------------------------------------------------------------------------------------------
AZXP01000001  Sediminibacterium sp. OR53                   8034        8034          100  0        Sediminibacterium sp. OR53 scaffold1_C1, whole genome shotgun sequence.
KI911562      Sediminibacterium sp. OR53                   8034        8034          100  0        Sediminibacterium sp. OR53 genomic scaffold scaffold1, whole genome shotgun sequence.
ATYE01000006  Sediminibacterium sp. OR43                   8034        8035           96  0        Sediminibacterium sp. OR43 SedOR43DRAFT_scaffold_4.5_C, whole genome shotgun sequence.
KI669560      Sediminibacterium sp. C3                      645         645           74  2E-62    Sediminibacterium sp. C3 genomic scaffold scf7180000002434, whole genome shotgun sequence.
CP003178      Niastella koreensis GR20-10                  1172        1180           70  3E-57    Niastella koreensis GR20-10, complete genome.
KI866530      Sediminibacterium salmoneum NBRC 103935       660         658           73  6E-55    Sediminibacterium salmoneum NBRC 103935 genomic scaffold scaffold00001, whole genome shotgun sequence.
ANHQ01012310  Puccinia striiformis f. sp. tritici CY32     1101        1101           69  1E-36    Puccinia striiformis f. sp. tritici CY32 contig_12310, whole genome shotgun sequence.
ANHQ01012310  Puccinia striiformis f. sp. tritici CY32      848         853           70  1E-36    Puccinia striiformis f. sp. tritici CY32 contig_12310, whole genome shotgun sequence.
KI866530      Sediminibacterium salmoneum NBRC 103935       910         904           70  3E-30    Sediminibacterium salmoneum NBRC 103935 genomic scaffold scaffold00001, whole genome shotgun sequence.
KI866530      Sediminibacterium salmoneum NBRC 103935       910         904           70  3E-30    Sediminibacterium salmoneum NBRC 103935 genomic scaffold scaffold00001, whole genome shotgun sequence.

Kmer assembly stats


In [1]:
!python ~/khmer/sandbox/assemstats.py 500 ../inputs/moleculo/*_LongRead_500_1499nt.fastq.gz ../inputs/moleculo/*Read.fastq.gz


filename sum n trim_n min med mean max n50 n50_len n90 n90_len
Traceback (most recent call last):
  File "/mnt/home/irberlui/khmer/sandbox/assemstats.py", line 129, in <module>
    main()
  File "/mnt/home/irberlui/khmer/sandbox/assemstats.py", line 105, in main
    lens = getLens(filename)
  File "/mnt/home/irberlui/khmer/sandbox/assemstats.py", line 55, in getLens
    for record in fa_instance:
  File "/opt/software/ged-software/anaconda/lib/python2.7/site-packages/screed/fasta.py", line 21, in fasta_iter
    raise IOError("Bad FASTA format: no '>' at beginning of line")
IOError: Bad FASTA format: no '>' at beginning of line
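
The traceback above comes from screed's FASTA parser being handed FASTQ input. As a stopgap, read-length statistics can be computed directly from FASTQ with the standard library. This is a minimal sketch for Python 3, assuming four-line FASTQ records (no wrapped sequences). `fastq_lengths` and `length_stats` are hypothetical helpers, and the N50 used here (the read length at which the cumulative length, sorted longest first, reaches half the total) is one common convention, not necessarily the one `assemstats.py` implements.

```python
import io
import gzip  # only needed for the real .fastq.gz files


def fastq_lengths(handle, min_len=0):
    """Yield read lengths from a text handle holding four-line FASTQ records."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            n = len(line.strip())
            if n >= min_len:
                yield n


def length_stats(lengths):
    """Summarise read lengths; returns None for an empty input."""
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        return None
    total = sum(lengths)
    running = 0
    n50 = None
    for n in lengths:
        running += n
        if 2 * running >= total:
            n50 = n
            break
    mid = len(lengths) // 2
    if len(lengths) % 2:
        median = lengths[mid]
    else:
        median = (lengths[mid - 1] + lengths[mid]) / 2.0
    return {"n": len(lengths), "sum": total, "min": lengths[-1],
            "median": median, "mean": total / float(len(lengths)),
            "max": lengths[0], "n50": n50}


# Toy FASTQ (made-up reads) standing in for a real file.
toy = u"@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACG\n+\nIII\n@r3\nACGTA\n+\nIIIII\n"
print(length_stats(fastq_lengths(io.StringIO(toy))))
```

For a real input, the handle would come from `gzip.open(path, 'rt')`. Counting by line position rather than by a leading `@` avoids misreading quality strings that happen to start with `@`.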

In [ ]:
!python ~/khmer/sandbox/assemstats.py 100 ../inputs/moleculo/*_LongRead_500_1499nt.fastq.gz > ../outputs/moleculo/assemstats_output

Kmer overlap


In [8]:
#!load-into-counting.py -x 1e9 -N 4 -k 32 galGal4.fa.masked.kh galGal4.fa.masked.gz


galGal4.fa.masked        galGal4.fa.masked.gz     galGal4.fa.masked.gz.ann galGal4.fa.masked.gz.pac reads
galGal4.fa.masked.fai    galGal4.fa.masked.gz.amb galGal4.fa.masked.gz.bwt galGal4.fa.masked.gz.sa

In [12]:
#!load-into-counting.py -x 1e9 -N 4 -k 32 ../moleculo/LR6000017-DNA_A01-LRAAA-AllReads.kh ../moleculo/*.fastq.gz

In [13]:
#!python ~/repos/khmer/scripts/count-overlap.py -k 32 -N 4 -x 1e9 moleculo/LR6000017-DNA_A01-LRAAA-AllReads.kh galGal4/galGal4.fa.masked overlap_report
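
The commands above count k-mers (k = 32) in the reads and in the reference using khmer's probabilistic counting tables and then report how many are shared. As a toy illustration of the quantity being measured, the sketch below computes the overlap exactly with Python sets. It is not khmer's method: it keeps every k-mer in memory, ignores reverse complements, and does not treat `N` bases specially. `kmers` and `kmer_overlap` are hypothetical names.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_overlap(seqs_a, seqs_b, k):
    """Return (unique k-mers in A, unique k-mers in B, k-mers shared by both)."""
    a = set().union(*(kmers(s, k) for s in seqs_a))
    b = set().union(*(kmers(s, k) for s in seqs_b))
    return len(a), len(b), len(a & b)


# Made-up sequences; k is kept tiny so the result is easy to check by hand.
reads = ["ACGTAC", "TTTT"]
reference = ["CGTACG"]
print(kmer_overlap(reads, reference, 3))
```

Exact sets are only practical for small inputs; for a whole genome and millions of reads the memory cost is the reason to use a probabilistic structure instead.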

In [20]:
import numpy as np  # needed unless the notebook was started in pylab mode
curve = np.loadtxt('../overlap_report.curve')

In [34]:
plt.figure()
#plt.title('Overlap')
#plt.plot(curve)


Out[34]:
<matplotlib.figure.Figure at 0x1066cf950>

In [41]:
curve[:,1] - curve[:,0]


Out[41]:
array([  4.67924001e+18,  -3.05695600e+06,  -4.71728458e+13,
         4.84136960e+18,  -1.56116890e+07,  -4.71754615e+13,
         4.60267882e+18,  -9.76562500e+06,   4.75203689e+18,
         1.40734399e+14,   4.71754642e+13,   1.40735024e+14,
         1.48284610e+07,  -4.18960707e+16,  -1.07674842e+13,
        -6.08653160e+14,  -3.26417515e+11,   1.40735006e+14,
         4.71749555e+13,  -2.81309900e+06,   4.71754618e+13,
        -6.25000008e+08,   4.60267882e+18,   4.71717733e+13,
        -1.73887040e+07,  -6.10135464e+08,   4.71754794e+13,
         4.27757859e+09,  -1.40730729e+14,  -3.00000000e+00,
        -3.00000000e+00,  -6.25000008e+08,  -1.61596220e+07,
         4.15000920e+07,   1.16343741e+08,  -1.61854270e+07,
         1.40735008e+14,   1.91856027e+13,   5.36854732e+11,
         4.71704609e+13,   9.35593953e+13,   4.67959549e+18,
        -8.06286132e+08,   9.35595627e+13,  -4.71756216e+13,
         1.40735007e+14,   1.40735009e+14,   8.28758000e+05,
        -1.67920240e+07,  -4.71754464e+13,   1.49969040e+07,
        -1.39094868e+09,   2.00000000e+00,   1.84467441e+19,
         4.71756285e+13,   4.71754644e+13,   4.71754794e+13,
         4.71756136e+13,  -2.95237040e+18,   1.65227760e+07,
        -4.71716093e+13,   4.71754451e+13,   4.66429034e+13,
        -1.75820870e+07,  -4.71755632e+13,  -2.84750400e+06,
         4.71755695e+13,  -4.71754615e+13,  -7.68000000e+02,
        -9.35595497e+13,  -1.84466033e+19,   1.42256700e+06,
        -4.71717261e+13,  -1.64711200e+06,   1.69418365e+19,
        -1.84467441e+19,  -4.00266330e+09,  -4.71754479e+13,
         4.75198971e+18,   4.71755797e+13,   4.71754794e+13,
         3.32248700e+06,  -1.53058400e+06,   4.71754444e+13,
         4.75203689e+18,   9.35595629e+13,   1.84467441e+19,
        -4.71754642e+13,   4.71754444e+13,   4.71754448e+13,
         4.71754642e+13,  -7.14677000e+05,   1.65219040e+07,
        -4.71754610e+13,  -4.71754627e+13,   4.71754774e+13,
         6.10135460e+08,   4.71754612e+13,  -1.61234880e+07,
        -4.71748360e+13])
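
Several of the differences above have magnitudes around 1.8e19, which matches 2^64 to the printed precision. If the two columns are meant to be k-mer counts, values that large are implausible for about 1.6 million reads, so the loaded array deserves a sanity check before plotting. The sketch below flags rows with non-finite, negative, or implausibly large entries. `suspicious_rows` is a hypothetical helper, and `upper_bound` is an assumption the user has to choose.

```python
import math


def suspicious_rows(rows, upper_bound):
    """Return indices of rows holding a non-finite, negative, or too-large value."""
    bad = []
    for i, row in enumerate(rows):
        for value in row:
            value = float(value)
            if not math.isfinite(value) or value < 0 or value > upper_bound:
                bad.append(i)
                break
    return bad


# Made-up rows: the second holds 2**64 and the fourth a negative value.
toy = [[10.0, 12.0], [20.0, 2.0 ** 64], [30.0, 31.0], [-3.0, 5.0]]
print(suspicious_rows(toy, upper_bound=1e12))
```

Because iterating over a 2-D NumPy array yields its rows, the same call works on `curve` directly, for example `suspicious_rows(curve, upper_bound=1e12)`.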