From 1: Pretty much everything mapped to galGal4, only 661 reads out of 500k didn't (0.13%) I sampled 5 reads from these 661 and checked on ENA: two matches E.coli, one matched turkey and two didn't match anything, which is curious.
From 2: The reads cover ~30% of the reference genome
From 3: The sequences only in RNA are about the same as from the Moleculo data diagram 4, but there are less in the intersection. I think this is expected, since the coverage is way lower (30% against 88% on Moleculo).
From 5: I'm still doing this one, the goal is to check if it captures more orthologs than Moleculo did. My guess is that it probably won't (since both are Illumina sequencing), but let's see.
--
In [10]:
!samtools view -c -f 4 ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
!samtools view -c ../outputs/moleculo/galGal4.LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam
In [3]:
import pandas as pd
In [ ]:
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/msu/latest.fasta > ../outputs/msu/assemstats_output_latest
In [30]:
msu_reads = pd.read_table("../outputs/msu/assemstats_output_latest", sep=" ")
In [13]:
!python ~/khmer/sandbox/assemstats.py 100 ../outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta > ../outputs/moleculo/assemstats
In [31]:
mol_reads = pd.read_table("../outputs/moleculo/assemstats", sep=" ")
No problem. Any progress on a figure or table showing alignment quality, and a methods paragraph?
In [32]:
both_reads = msu_reads.append(mol_reads, ignore_index=True)
both_reads
Out[32]:
In [55]:
subset = (both_reads[['sum', 'n', 'min', 'mean', 'med', 'max']]
.rename(
index={
0: 'MSU',
1: 'Moleculo'},
columns={
'sum': 'Total bases',
'n': "Number of reads",
'min': 'Minimum read length',
'mean': 'Mean read length',
'med': 'Median read length',
'max': 'Maximum read length'
}))
subset['Unmapped reads to galGal4'] = [661, 326]
subset['Percentage of galGal4 covered by reads'] = [28.9, 88.4]
In [54]:
subset
Out[54]:
In [57]:
subset.to_csv('../outputs/pub.csv')