From 1: Pretty much everything mapped to galGal4, only 661 reads out of 500k didn't (0.13%) I sampled 5 reads from these 661 and checked on ENA: two matches E.coli, one matched turkey and two didn't match anything, which is curious.

From 2: The reads cover ~30% of the reference genome

From 3: The sequences only in RNA are about the same as from the Moleculo data diagram 4, but there are less in the intersection. I think this is expected, since the coverage is way lower (30% against 88% on Moleculo).

From 5: I'm still doing this one, the goal is to check if it captures more orthologs than Moleculo did. My guess is that it probably won't (since both are Illumina sequencing), but let's see.

--


In [10]:
!samtools view -c -f 4 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
!samtools view -c ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam


326
1683294

In [3]:
import pandas as pd

In [ ]:
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/msu/latest.fasta > ../outputs/msu/assemstats_output_latest

In [30]:
msu_reads = pd.read_table("../outputs/msu/assemstats_output_latest", sep=" ")

In [13]:
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta > ../outputs/moleculo/assemstats

In [31]:
mol_reads = pd.read_table("../outputs/moleculo/assemstats", sep=" ")

No problem. Any progress on a figure or table showing alignment quality, and a methods paragraph?


In [32]:
both_reads = msu_reads.append(mol_reads, ignore_index=True)
both_reads


Out[32]:
filename sum n trim_n min med mean max n50 n50_len n90 n90_len
0 ../outputs/msu/latest.fasta 470397190 492527 492526 128 639 955 6607 112009 1273 377893 459
1 ../outputs/moleculo/LR6000017-DNA_A01-LRAAA-Al... 5979144764 1579060 1579060 500 2514 3786 17521 347311 7374 954271 1729

In [55]:
subset = (both_reads[['sum', 'n', 'min', 'mean', 'med', 'max']]
            .rename(
              index={
                0: 'MSU',
                1: 'Moleculo'},
              columns={
                'sum': 'Total bases',
                'n': "Number of reads",
                'min': 'Minimum read length',
                'mean': 'Mean read length',
                'med': 'Median read length',
                'max': 'Maximum read length'
              }))
subset['Unmapped reads to galGal4'] = [661, 326]
subset['Percentage of galGal4 covered by reads'] = [28.9, 88.4]

In [54]:
subset


Out[54]:
Total bases Number of reads Minimum read length Mean read length Median read length Maximum read length Unmapped reads to galGal4 Percentage of galGal4 covered by reads
MSU 470397190 492527 128 955 639 6607 661 28.9
Moleculo 5979144764 1579060 500 3786 2514 17521 326 88.4

In [57]:
subset.to_csv('../outputs/pub.csv')