Moleculo reads were mapped to:
Important details:
In [48]:
%matplotlib inline
from matplotlib import pyplot as plt
from glob import glob
import os
In [11]:
!cd .. && make moleculo_galGal4 moleculo_galGal3 moleculo_galGal5
There are 1578022 Moleculo reads in the input files.
There is a different number of unmapped reads for each reference genome:
Reference for using samtools to count reads:
In [75]:
%%bash
for bam in ../outputs/moleculo/galGal4.L*fastq.sorted.bam
do
echo -n "$(basename $bam): "
samtools view -c -f 4 $bam
done
echo -n "Total: "
samtools view -c -f 4 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
In [1]:
!samtools view -c -q 30 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
In [2]:
!samtools view -c -q 60 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
In [3]:
!samtools view -c -f 256 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
In [80]:
%%bash
for bam in ../outputs/moleculo/galGal3.L*fastq.sorted.bam
do
echo -n "$(basename $bam): "
samtools view -c -f 4 $bam
done
echo -n "Total: "
samtools view -c -f 4 ../outputs/moleculo/galGal3.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
Notes:
In [22]:
total_N = !grep -F -o N ../outputs/reference/galGal5.fa |wc -l
total_bases = !grep -v ">" ../outputs/reference/galGal5.fa |wc -c
print float(total_N[0]) / float(total_nucleotides[0])
In [77]:
total_N = !grep -F -o N ../outputs/galGal4/galGal4.fa |wc -l
total_bases = !grep -v ">" ../outputs/galGal4/galGal4.fa |wc -c
print float(total_N[0]) / float(total_nucleotides[0])
In [78]:
%%bash
for bam in ../outputs/moleculo/galGal5.L*fastq.sorted.bam
do
echo -n "$(basename $bam): "
samtools view -c -f 4 $bam
done
echo -n "Total: "
samtools view -c -f 4 ../outputs/moleculo/galGal5.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
In [ ]:
#! cd .. && make outputs/moleculo/galGal4.unmapped_reads
In [50]:
!head -5 ../outputs/moleculo/galGal4.1_LongRead.unmapped_reads
In [56]:
from random import sample
!ls ../outputs/moleculo/galGal4.
In [72]:
seqs = [line.strip() for line in open('../outputs/moleculo/galGal4.2_LongRead.unmapped_reads').readlines() if line.startswith('>')]
sample(seqs, 5)
Out[72]:
All matches are small (< 20%) compared to the query size.
1)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 31 594 680 3143 97.1% 2 - 24098595 24098686 92
browser details YourSeq 24 692 715 3143 100.0% 5 - 28574499 28574522 24
browser details YourSeq 24 2538 2563 3143 96.2% 2 - 147135181 147135206 26
browser details YourSeq 21 1094 1114 3143 100.0% 19 + 6310366 6310386 21
browser details YourSeq 20 837 862 3143 88.5% 1 + 11646631 11646656 26
2)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 22 1377 1400 7533 87.0% Z - 16508791 16508813 23
browser details YourSeq 22 3936 3961 7533 92.4% 2 + 67429996 67430021 26
browser details YourSeq 21 2606 2626 7533 100.0% 1 + 18773352 18773372 21
browser details YourSeq 20 4378 4399 7533 95.5% 1 - 20087286 20087307 22
3)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 27 5305 5352 8626 71.5% 6 + 24508087 24508127 41
browser details YourSeq 21 3120 3140 8626 100.0% 2 - 116678516 116678536 21
browser details YourSeq 21 8251 8271 8626 100.0% 5 + 29670651 29670671 21
4)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 147 571 1636 4555 81.2% 28 + 971245 971827 583
browser details YourSeq 137 528 1446 4555 79.9% Un_AADN03020025 - 437 1104 668
browser details YourSeq 133 528 1504 4555 81.3% 28 + 971246 971883 638
browser details YourSeq 123 666 1636 4555 80.2% 28 + 971229 971741 513
browser details YourSeq 121 1758 2434 4555 80.9% Un_AADN03016205 + 166 679 514
browser details YourSeq 112 333 1027 4555 92.4% Un_AADN03021523 - 358 1418 1061
browser details YourSeq 108 793 1286 4555 92.2% 28 + 971247 971829 583
browser details YourSeq 107 877 1549 4555 92.8% Un_JH375825 + 181 997 817
(...)
5)
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq 23 5089 5113 8034 87.5% 1 + 134733453 134733476 24
browser details YourSeq 22 7904 7925 8034 100.0% 2 - 45462825 45462846 22
browser details YourSeq 22 4526 4547 8034 100.0% Un_AADN03014882 + 1461 1482 22
browser details YourSeq 21 6959 6979 8034 100.0% 1 - 7639014 7639034 21
browser details YourSeq 21 4423 4443 8034 100.0% 1 + 30953311 30953331 21
Then I tried Jared's suggestion and use the same sequences in http://www.ebi.ac.uk/ena/
Sequences 1 and 4 mapped to Gallus gallus sequences.
Sequences 2, 3 and 5 weirdly mapped to Sediminibacterium sp., a bacteria with a genome published January 2014.
1)
ENA - 3 Results | |||||||||||
Accession |
| Organism | Alignment Length | Target Length | Identity(%) | E-Value | |||||
AC187113 |
| Gallus gallus | 3143 | 3144 | 99 | 0 | |||||
ADDD01052603 |
| Meleagris gallopavo | 674 | 670 | 90 | 4E-257 | |||||
ADDD01052603 |
| Meleagris gallopavo | 241 | 240 | 89 | 1E-77 |
2)
ENA - 6 Results | |||||||||||
Accession |
| Organism | Alignment Length | Target Length | Identity(%) | E-Value | |||||
AZXP01000001 |
| Sediminibacterium sp. OR53 | 7533 | 7533 | 100 | 0 | |||||
KI911562 |
| Sediminibacterium sp. OR53 | 7533 | 7533 | 100 | 0 | |||||
ATYE01000005 |
| Sediminibacterium sp. OR43 | 5231 | 5243 | 86 | 0 | |||||
CU207366 |
| Gramella forsetii KT0803 | 325 | 325 | 74 | 7E-27 | |||||
ABNO01018736 |
| coral metagenome | 109 | 109 | 85 | 6E-20 | |||||
AKZQ01000023 |
| Flavobacterium sp. F52 | 450 | 451 | 71 | 2E-15 |
3)
ENA - 36 Results | |||||||||||
Accession |
| Organism | Alignment Length | Target Length | Identity(%) | E-Value | |||||
AZXP01000001 |
| Sediminibacterium sp. OR53 | 8626 | 8626 | 100 | 0 | |||||
KI911562 |
| Sediminibacterium sp. OR53 | 8626 | 8626 | 100 | 0 | |||||
KB893315 |
| Segetibacter koreensis DSM 18137 | 1988 | 1987 | 75 | 7E-274 | |||||
ARFB01000002 |
| Segetibacter koreensis DSM 18137 | 1988 | 1987 | 75 | 7E-274 | |||||
AXYK01000032 |
| Chitinophagaceae bacterium JGI 0001013-J17 | 2443 | 2448 | 71 | 4E-158 | |||||
AXBK01000004 |
| Adhaeribacter aquaticus DSM 16391 | 2110 | 2107 | 71 | 7E-137 | |||||
ARFB01000014 |
| Segetibacter koreensis DSM 18137 | 996 | 997 | 75 | 4E-127 | |||||
AXVJ01000058 |
| Runella limosa DSM 17973 | 1148 | 1148 | 72 | 7E-98 | |||||
KE384045 |
| Runella zeae DSM 19591 | 1145 | 1145 | 72 | 5E-91 | |||||
KE384044 |
| Runella zeae DSM 19591 | 950 | 952 | 72 | 9E-66 |
4)
ENA - 100 Results | |||||||||||
Accession |
| Organism | Alignment Length | Target Length | Identity(%) | E-Value | |||||
AF063649 |
| Gallus gallus | 304 | 308 | 83 | 9E-68 | |||||
AF063649 |
| Gallus gallus | 302 | 307 | 82 | 7E-61 | |||||
AF063649 |
| Gallus gallus | 303 | 308 | 82 | 9E-60 | |||||
AF063649 |
| Gallus gallus | 304 | 308 | 81 | 3E-59 | |||||
AF063649 |
| Gallus gallus | 302 | 307 | 82 | 3E-59 | |||||
AF063649 |
| Gallus gallus | 298 | 303 | 82 | 1E-58 | |||||
AF063649 |
| Gallus gallus | 303 | 308 | 81 | 1E-54 | |||||
AF063649 |
| Gallus gallus | 303 | 308 | 80 | 7E-53 | |||||
AF063649 |
| Gallus gallus | 303 | 308 | 80 | 4E-51 | |||||
AF063649 |
| Gallus gallus | 303 | 308 | 80 | 4E-51 |
5)
ENA - 64 Results | |||||||||||
Accession |
| Organism | Alignment Length | Target Length | Identity(%) | E-Value | |||||
AZXP01000001 |
| Sediminibacterium sp. OR53 | 8034 | 8034 | 100 | 0 | |||||
KI911562 |
| Sediminibacterium sp. OR53 | 8034 | 8034 | 100 | 0 | |||||
ATYE01000006 |
| Sediminibacterium sp. OR43 | 8034 | 8035 | 96 | 0 | |||||
KI669560 |
| Sediminibacterium sp. C3 | 645 | 645 | 74 | 2E-62 | |||||
CP003178 |
| Niastella koreensis GR20-10 | 1172 | 1180 | 70 | 3E-57 | |||||
KI866530 |
| Sediminibacterium salmoneum NBRC 103935 | 660 | 658 | 73 | 6E-55 | |||||
ANHQ01012310 |
| Puccinia striiformis f. sp. tritici CY32 | 1101 | 1101 | 69 | 1E-36 | |||||
ANHQ01012310 |
| Puccinia striiformis f. sp. tritici CY32 | 848 | 853 | 70 | 1E-36 | |||||
KI866530 |
| Sediminibacterium salmoneum NBRC 103935 | 910 | 904 | 70 | 3E-30 | |||||
KI866530 |
| Sediminibacterium salmoneum NBRC 103935 | 910 | 904 | 70 | 3E-30 |
In [1]:
!python ~/khmer/sandbox/assemstats.py 500 ../inputs/moleculo/*_LongRead_500_1499nt.fastq.gz ../inputs/moleculo/*Read.fastq.gz
In [ ]:
!python ~/khmer/sandbox/assemstats.py 100 ../inputs/moleculo/*_LongRead_500_1499nt.fastq.gz > ../outputs/moleculo/assemstats_output
In [8]:
#!load-into-counting.py -x 1e9 -N 4 -k 32 galGal4.fa.masked.kh galGal4.fa.masked.gz
In [12]:
#!load-into-counting.py -x 1e9 -N 4 -k 32 ../moleculo/LR6000017-DNA_A01-LRAAA-AllReads.kh ../moleculo/*.fastq.gz
In [13]:
#!python ~/repos/khmer/scripts/count-overlap.py -k 32 -N 4 -x 1e9 moleculo/LR6000017-DNA_A01-LRAAA-AllReads.kh galGal4/galGal4.fa.masked overlap_report
In [20]:
curve = np.loadtxt('../overlap_report.curve')
In [34]:
figure()
#title('Overlap')
#plot(curve)
Out[34]:
In [41]:
curve[:,1] - curve[:,0]
Out[41]: