From 1: Pretty much everything mapped to galGal4, only 661 reads out of 500k didn't (0.13%) I sampled 5 reads from these 661 and checked on ENA: two matches E.coli, one matched turkey and two didn't match anything, which is curious.

From 2: The reads cover ~30% of the reference genome

From 3: The sequences only in RNA are about the same as from the Moleculo data diagram 4, but there are less in the intersection. I think this is expected, since the coverage is way lower (30% against 88% on Moleculo).

From 5: I'm still doing this one, the goal is to check if it captures more orthologs than Moleculo did. My guess is that it probably won't (since both are Illumina sequencing), but let's see.



In [10]:

    
!samtools view -c -f 4 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
!samtools view -c ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam



In [3]:

    
import pandas as pd



In [ ]:

    
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/msu/latest.fasta > ../outputs/msu/assemstats_output_latest



In [30]:

    
msu_reads = pd.read_table("../outputs/msu/assemstats_output_latest", sep=" ")



In [13]:

    
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta > ../outputs/moleculo/assemstats



In [31]:

    
mol_reads = pd.read_table("../outputs/moleculo/assemstats", sep=" ")

No problem. Any progress on a figure or table showing alignment quality, and a methods paragraph?



In [32]:

    
both_reads = msu_reads.append(mol_reads, ignore_index=True)
both_reads









    Out[32]:






  
    
      
      filename
      sum
      n
      trim_n
      min
      med
      mean
      max
      n50
      n50_len
      n90
      n90_len
    
  
  
    
      0
                             ../outputs/msu/latest.fasta
        470397190
        492527
        492526
       128
        639
        955
        6607
       112009
       1273
       377893
        459
    
    
      1
       ../outputs/moleculo/LR6000017-DNA_A01-LRAAA-Al...
       5979144764
       1579060
       1579060
       500
       2514
       3786
       17521
       347311
       7374
       954271
       1729



In [55]:

    
subset = (both_reads[['sum', 'n', 'min', 'mean', 'med', 'max']]
            .rename(
              index={
                0: 'MSU',
                1: 'Moleculo'},
              columns={
                'sum': 'Total bases',
                'n': "Number of reads",
                'min': 'Minimum read length',
                'mean': 'Mean read length',
                'med': 'Median read length',
                'max': 'Maximum read length'
              }))
subset['Unmapped reads to galGal4'] = [661, 326]
subset['Percentage of galGal4 covered by reads'] = [28.9, 88.4]



In [54]:

    
subset









    Out[54]:






  
    
      
      Total bases
      Number of reads
      Minimum read length
      Mean read length
      Median read length
      Maximum read length
      Unmapped reads to galGal4
      Percentage of galGal4 covered by reads
    
  
  
    
      MSU
        470397190
        492527
       128
        955
        639
        6607
       661
       28.9
    
    
      Moleculo
       5979144764
       1579060
       500
       3786
       2514
       17521
       326
       88.4



In [57]:

    
subset.to_csv('../outputs/pub.csv')

	filename	sum	n	trim_n	min	med	mean	max	n50	n50_len	n90	n90_len
0	../outputs/msu/latest.fasta	470397190	492527	492526	128	639	955	6607	112009	1273	377893	459
1	../outputs/moleculo/LR6000017-DNA_A01-LRAAA-Al...	5979144764	1579060	1579060	500	2514	3786	17521	347311	7374	954271	1729