Comparison of bowtie, bowtie2, and kallisto


In [2]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
%matplotlib inline

Data preparation

We will use some of Ben's Sjorgrens data for this. We will generate a random sample of 1 million reads from the full data set.

Prepare data with Snakemake

snakemake -s aligners.snakefile

It appears that kallisto needs at least 51 bases of the reference to successfully align most of the reads. Must be some kind of off-by-one issue with the data structures.

Load alignments


In [5]:
names = ['QNAME', 'FLAG', 'RNAME', 'POS', 'MAPQ', 'CIGAR', 'RNEXT', 'PNEXT', 'TLEN', 'SEQ', 'QUAL']

In [20]:
bowtie_alns = pd.read_csv('alns/bowtie-51mer.aln', sep='\t', header=None, usecols=list(range(11)), names=names)
bowtie2_alns = pd.read_csv('alns/bowtie2-51mer.aln', sep='\t', header=None, usecols=list(range(11)), names=names)
kallisto_alns = pd.read_csv('alns/kallisto-51mer.sam', sep='\t', header=None, usecols=list(range(11)), names=names, comment='@')

In [21]:
(bowtie_alns.RNAME != '*').sum() / len(bowtie_alns)


Out[21]:
0.73175699999999999

In [22]:
(bowtie2_alns.RNAME != '*').sum() / len(bowtie2_alns)


Out[22]:
0.81860999999999995

In [23]:
(kallisto_alns.RNAME != '*').sum() / len(kallisto_alns)


Out[23]:
0.75900699999999999

Bowtie2 vs kallisto


In [29]:
bt2_k_joined = pd.merge(bowtie2_alns, kallisto_alns, how='inner', on='QNAME', suffixes=['_bt2', '_k'])

How many reads do bowtie2 and kallisto agree on?


In [31]:
(bt2_k_joined.RNAME_bt2 == bt2_k_joined.RNAME_k).sum()


Out[31]:
920029

In [43]:
For the minority of reads they disagree on, what do they look like?


Object `like` not found.

For the minority of reads they disagree on, what do they look like


In [40]:
bt2_k_joined[bt2_k_joined.RNAME_bt2 != bt2_k_joined.RNAME_k].RNAME_k


Out[40]:
9                                      *
28                                     *
43                                     *
67                                     *
72                                     *
95                                     *
97                                     *
108                                    *
113                                    *
148                                    *
149                                    *
159                                    *
179                                    *
232       hblOligo171177|NM_001102386|2B
242                                    *
247                                    *
259                                    *
276                                    *
314                                    *
329                                    *
332                                    *
341                                    *
384                                    *
392                                    *
399                                    *
400                                    *
405                                    *
420                                    *
422                                    *
472                                    *
                       ...              
999585                                 *
999626                                 *
999631                                 *
999636                                 *
999640                                 *
999649      hblOligo131523|NM_181806|20B
999669                                 *
999673                                 *
999684                                 *
999708                                 *
999721                                 *
999732                                 *
999743                                 *
999749                                 *
999791                                 *
999808                                 *
999811                                 *
999833                                 *
999836                                 *
999879                                 *
999881                                 *
999913                                 *
999936                                 *
999947                                 *
999958                                 *
999966                                 *
999968                                 *
999973                                 *
999985                                 *
999988       hblOligo69326|NM_002332|19A
Name: RNAME_k, Length: 79971, dtype: object

Mostly lower sensitivity of kallisto due to indels in the read. Specifically, out of


In [45]:
(bt2_k_joined.RNAME_bt2 != bt2_k_joined.RNAME_k).sum()


Out[45]:
79971

discordant reads, the number where kallisto failed to map is


In [46]:
(bt2_k_joined[bt2_k_joined.RNAME_bt2 != bt2_k_joined.RNAME_k].RNAME_k == '*').sum()


Out[46]:
69787

or as a fraction


In [49]:
(bt2_k_joined[bt2_k_joined.RNAME_bt2 != bt2_k_joined.RNAME_k].RNAME_k == '*').sum() / (bt2_k_joined.RNAME_bt2 != bt2_k_joined.RNAME_k).sum()


Out[49]:
0.87265383701591825

Are there any cases where bowtie2 fails to align


In [50]:
(bt2_k_joined[bt2_k_joined.RNAME_bt2 != bt2_k_joined.RNAME_k].RNAME_bt2 == '*').sum()


Out[50]:
10184

Which means there are no cases where bowtie and kallisto align to different peptides.


In [57]:
((bt2_k_joined.RNAME_bt2 != bt2_k_joined.RNAME_k) & (bt2_k_joined.RNAME_bt2 != '*') & (bt2_k_joined.RNAME_k != '*')).sum()


Out[57]:
0

What do examples look like of kallisto aligning and bowtie2 not?


In [59]:
bt2_k_joined[(bt2_k_joined.RNAME_bt2 != bt2_k_joined.RNAME_k) & (bt2_k_joined.RNAME_bt2 == '*')]


Out[59]:
QNAME FLAG_bt2 RNAME_bt2 POS_bt2 MAPQ_bt2 CIGAR_bt2 RNEXT_bt2 PNEXT_bt2 TLEN_bt2 SEQ_bt2 ... FLAG_k RNAME_k POS_k MAPQ_k CIGAR_k RNEXT_k PNEXT_k TLEN_k SEQ_k QUAL_k
232 D00542:238:C9N7UANXX:5:2302:8303:4212 4 * 0 0 * * 0 0 AAATCCACCATTGTGAAGCAGATGAAGATCATTCATGGTTACTCAG... ... 0 hblOligo171177|NM_001102386|2B 1 255 50M * 0 0 AAATCCACCATTGTGAAGCAGATGAAGATCATTCATGGTTACTCAG... CCCCCGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGG...
523 D00542:238:C9N7UANXX:5:2302:11114:7010 4 * 0 0 * * 0 0 CTGGCCTTCTTTGACGATTGTATTGAAAAGCTTTTCCCAACAGAAA... ... 0 hblOligo157434|NM_017925|10B 1 255 50M * 0 0 CTGGCCTTCTTTGACGATTGTATTGAAAAGCTTTTCCCAACAGAAA... CBCCCGGGGCFGGGGGGGEFBGCGGGGGGGGGGFGFGGGGGGGGGG...
569 D00542:238:C9N7UANXX:5:2302:9302:7847 4 * 0 0 * * 0 0 GGTCCTCACGCCGCCCGCGTTCGCGGGTTGGCATTACAATCCGCTT... ... 0 hblOligo181715|NM_001145304|6B 1 255 50M * 0 0 GGTCCTCACGCCGCCCGCGTTCGCGGGTTGGCATTACAATCCGCTT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
671 D00542:238:C9N7UANXX:5:2302:8540:8897 4 * 0 0 * * 0 0 GGCTCTGGTAAGTCAAAACACTCTTCTCCTCACATTCCCTAGTTCC... ... 0 hblOligo19065|NM_058241|13A 1 255 50M * 0 0 GGCTCTGGTAAGTCAAAACACTCTTCTCCTCACATTCCCTAGTTCC... CCCCCGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGG...
708 D00542:238:C9N7UANXX:5:2302:6508:9139 4 * 0 0 * * 0 0 GAGAAGGGTTCTCTCTCTGACTTTCTGAAAGCAAATGGGAATGAAC... ... 0 hblOligo1809|NM_001616|7A 1 255 50M * 0 0 GAGAAGGGTTCTCTCTCTGACTTTCTGAAAGCAAATGGGAATGAAC... CBBCCGGGGGGGGGGGGGGGGGGGGGGGFFGGEGGGGGGGGGGGGE...
774 D00542:238:C9N7UANXX:5:2302:12874:9719 4 * 0 0 * * 0 0 AACCTTATTGCACCATGTAACGCTAATTGCAATTGCATTCGCAGGT... ... 0 hblOligo228923|NM_180991|12B 1 255 50M * 0 0 AACCTTATTGCACCATGTAACGCTAATTGCAATTGCATTCGCAGGT... ?ABBBGGGGGGGEGBGEGGGGGGGDGGGGGGGGGGGFGGGG<GGGG...
818 D00542:238:C9N7UANXX:5:2302:12576:10241 4 * 0 0 * * 0 0 ACTCAGCAGACAGTCAATGATTCCCTCTGCGATGCTTGTAATACAG... ... 0 hblOligo133592|NM_207517|16B 1 255 50M * 0 0 ACTCAGCAGACAGTCAATGATTCCCTCTGCGATGCTTGTAATACAG... CCCCCGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
854 D00542:238:C9N7UANXX:5:2302:13690:10377 4 * 0 0 * * 0 0 CTCAGTAAGATTAAGGGGGTGGAAGAGGAGGTTTTGAACATGATGC... ... 0 hblOligo148091|NM_144978|6B 1 255 50M * 0 0 CTCAGTAAGATTAAGGGGGTGGAAGAGGAGGTTTTGAACATGATGC... CCCCCGGGGGGGGGGGGGGEEGFGGGGGGGGGGGGGGGGGGGGGGG...
1001 D00542:238:C9N7UANXX:5:2302:15849:11668 4 * 0 0 * * 0 0 ATCCCTCGCAAACGCAAGAACTCTATGAAAGTGCACAGCTGCAGGC... ... 0 hblOligo130229|NM_001033723|5A 1 255 50M * 0 0 ATCCCTCGCAAACGCAAGAACTCTATGAAAGTGCACAGCTGCAGGC... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
1156 D00542:238:C9N7UANXX:5:2302:7457:13233 4 * 0 0 * * 0 0 ATCCCTCTCGGGTCACTGGGCTGGAAAACCGTCAACCAGTCTACGT... ... 0 hblOligo203366|NM_005002|6B 1 255 50M * 0 0 ATCCCTCTCGGGTCACTGGGCTGGAAAACCGTCAACCAGTCTACGT... BBC
1196 D00542:238:C9N7UANXX:5:2302:15153:13675 4 * 0 0 * * 0 0 ATGCGCTCACAGCATGGGTCACCAGCCCTGTGGCTGCAGCGCGGGG... ... 0 hblOligo63927|XM_002346915|1A 1 255 50M * 0 0 ATGCGCTCACAGCATGGGTCACCAGCCCTGTGGCTGCAGCGCGGGG... BBAABGGGGGGFGBGGGFGGGGGGGGGGEEGGGGGGGGGGGGGG
1315 D00542:238:C9N7UANXX:5:2302:14660:14970 4 * 0 0 * * 0 0 AACGGCCAGAGCACCGCAGCTAAACTTGGTCTCACCACTTACACCA... ... 0 hblOligo218118|NM_078480|2B 1 255 50M * 0 0 AACGGCCAGAGCACCGCAGCTAAACTTGGTCTCACCACTTACACCA... BBCCBGGGGGGGGGB><GGGDDGGGGGGGEGGGGGGGGGGGGGGGB...
1416 D00542:238:C9N7UANXX:5:2302:9355:16216 4 * 0 0 * * 0 0 CAGGTTTTGGAGCCTCCTCAGCACGGGGCTCTCCAGAACGCACACT... ... 0 hblOligo155033|NM_001897|32B 1 255 50M * 0 0 CAGGTTTTGGAGCCTCCTCAGCACGGGGCTCTCCAGAACGCACACT... BBCBBGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGG...
1433 D00542:238:C9N7UANXX:5:2302:9852:16296 4 * 0 0 * * 0 0 TTTTTTCAGCAAGACGCCCTTACTTGGAATAATGTTTTGTGAGACT... ... 0 hblOligo243343|NM_203494|6B 1 255 50M * 0 0 TTTTTTCAGCAAGACGCCCTTACTTGGAATAATGTTTTGTGAGACT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
1453 D00542:238:C9N7UANXX:5:2302:2990:16558 4 * 0 0 * * 0 0 ATGGAGGAGCCACGCCCATCTAAGCGCCTGCGCTCTGGCCCCAAAT... ... 0 hblOligo68367|XM_002347282|1A 1 255 50M * 0 0 ATGGAGGAGCCACGCCCATCTAAGCGCCTGCGCTCTGGCCCCAAAT... BB
1529 D00542:238:C9N7UANXX:5:2302:10063:16798 4 * 0 0 * * 0 0 GAGAAGGAAGTTTCTGCAACTCTCAAGTCAACAGTGGGCGCGAGCT... ... 0 hblOligo210592|NM_006031|64B 1 255 50M * 0 0 GAGAAGGAAGTTTCTGCAACTCTCAAGTCAACAGTGGGCGCGAGCT... NaN
1557 D00542:238:C9N7UANXX:5:2302:18289:17663 4 * 0 0 * * 0 0 GAGCTCATTCCACTTATCTTGTGTACAGCTTGCCTCTGAGCCAAAG... ... 0 hblOligo54427|NM_020854|13A 1 255 50M * 0 0 GAGCTCATTCCACTTATCTTGTGTACAGCTTGCCTCTGAGCCAAAG... BCBCCGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGG...
1583 D00542:238:C9N7UANXX:5:2302:17019:17773 4 * 0 0 * * 0 0 TTGCAAGTGCTTAAGGGCTTCGGTTGCTTCCTGTATGATACATGGT... ... 0 hblOligo203499|NM_001164507|152B 1 255 50M * 0 0 TTGCAAGTGCTTAAGGGCTTCGGTTGCTTCCTGTATGATACATGGT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
1658 D00542:238:C9N7UANXX:5:2302:15869:18648 4 * 0 0 * * 0 0 GTCATGGAGAAGTGGATTGTCGGCATTGCAAAGTCAACCGTGACCT... ... 0 hblOligo246282|NM_016653|12B 1 255 50M * 0 0 GTCATGGAGAAGTGGATTGTCGGCATTGCAAAGTCAACCGTGACCT... CCBCCFGGGGEGGGGFGGGGGGGGGGGGGGEGGGGGGGGGGGGGGG...
1739 D00542:238:C9N7UANXX:5:2302:9546:19739 4 * 0 0 * * 0 0 GAATCTATTGTGCCTTATTCACCTTATGAGGTTAAAGTGGTATAAT... ... 0 hblOligo152860|NM_020872|18B 1 255 50M * 0 0 GAATCTATTGTGCCTTATTCACCTTATGAGGTTAAAGTGGTATAAT... BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
1871 D00542:238:C9N7UANXX:5:2302:19197:20570 4 * 0 0 * * 0 0 ATTCAGGGTGCACCACTCTCTGCCTATCTTAGTAATAGTGACACAG... ... 0 hblOligo210208|NM_018912|12B 1 255 50M * 0 0 ATTCAGGGTGCACCACTCTCTGCCTATCTTAGTAATAGTGACACAG... BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
2080 D00542:238:C9N7UANXX:5:2302:12735:22516 4 * 0 0 * * 0 0 CCAGCCCTTGTTTCCATCCCTGAAGACGTGGTCCGAGGCTGGTGGT... ... 0 hblOligo97370|NM_001048194|3A 1 255 50M * 0 0 CCAGCCCTTGTTTCCATCCCTGAAGACGTGGTCCGAGGCTGGTGGT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEFGG...
2107 D00542:238:C9N7UANXX:5:2302:19946:22872 4 * 0 0 * * 0 0 GCTCATGTCATCCCAGATGGCACTGATCCAAATGACGCGGAAAAAC... ... 0 hblOligo224804|NM_006379|6B 1 255 50M * 0 0 GCTCATGTCATCCCAGATGGCACTGATCCAAATGACGCGGAAAAAC... BBB
2188 D00542:238:C9N7UANXX:5:2302:14096:23789 4 * 0 0 * * 0 0 GAAACCCCATCTGCAATTAACGGTAACCCATCCTGGCATCAGCTCT... ... 0 hblOligo141141|NM_138578|2B 1 255 50M * 0 0 GAAACCCCATCTGCAATTAACGGTAACCCATCCTGGCATCAGCTCT... BBB?A1E/;GGDFGCGGEGGGBECFBGGG<
2191 D00542:238:C9N7UANXX:5:2302:14804:23878 4 * 0 0 * * 0 0 TCAATCAGTTCCCTGGATCGCACACGCATGATGACACCTGGGGATC... ... 0 hblOligo140313|NM_015570|24B 1 255 50M * 0 0 TCAATCAGTTCCCTGGATCGCACACGCATGATGACACCTGGGGATC... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
2319 D00542:238:C9N7UANXX:5:2302:15618:25043 4 * 0 0 * * 0 0 GAACTGGGCTCCAAAGGTCTGTGGGCTGACTGATGGCAACGGTTTG... ... 0 hblOligo152063|NM_005602|2B 1 255 50M * 0 0 GAACTGGGCTCCAAAGGTCTGTGGGCTGACTGATGGCAACGGTTTG... BBCCCGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
2347 D00542:238:C9N7UANXX:5:2302:16495:25461 4 * 0 0 * * 0 0 CAATGCTCTACAAATCCATTGACAGGTTCTACTCTGCCTGTGCCAA... ... 0 hblOligo205172|NM_004557|10B 1 255 50M * 0 0 CAATGCTCTACAAATCCATTGACAGGTTCTACTCTGCCTGTGCCAA... ABB=AEFGGGGGGGGGGDGGGGGGGFGGGGGGGGGGGGGGGGGGGG...
2420 D00542:238:C9N7UANXX:5:2302:5288:26111 4 * 0 0 * * 0 0 CACTCCTTTACTCTGGCCTCTGCTGAGACAACAGTGGCTGGTGCAG... ... 0 hblOligo21737|NM_004000|7A 1 255 50M * 0 0 CACTCCTTTACTCTGGCCTCTGCTGAGACAACAGTGGCTGGTGCAG... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
2711 D00542:238:C9N7UANXX:5:2302:12861:28713 4 * 0 0 * * 0 0 AAGCAGGACGGTTACGCCTGTAACCAGAATCAGGGCCGCTCAATAT... ... 0 hblOligo133226|NM_003812|14B 1 255 50M * 0 0 AAGCAGGACGGTTACGCCTGTAACCAGAATCAGGGCCGCTCAATAT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGG...
2773 D00542:238:C9N7UANXX:5:2302:4567:29713 4 * 0 0 * * 0 0 AATCAGCTCGATTCCCATGTGCTTCTTTGCTTCGTCTTGTTGCCTG... ... 0 hblOligo156565|NM_001029955|4B 1 255 50M * 0 0 AATCAGCTCGATTCCCATGTGCTTCTTTGCTTCGTCTTGTTGCCTG... B
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
996620 D00542:238:C9N7UANXX:5:1108:19072:67502 4 * 0 0 * * 0 0 ATGCCTCCTAAAAAAGGTGGTGACGGGATCAAACCACCCCTATTTG... ... 0 hblOligo83194|NM_013341|1A 1 255 50M * 0 0 ATGCCTCCTAAAAAAGGTGGTGACGGGATCAAACCACCCCTATTTG... BBCABG>FGGGGGGGDGGGBGEGGGAAGGGGGGGGGGGGGEEDGGG...
996661 D00542:238:C9N7UANXX:5:1108:2772:68725 4 * 0 0 * * 0 0 ATGTATTCTTCTCAAACACAAACATACGGCGCATTGATCCCTCCTC... ... 0 hblOligo257975|NM_172106|1C 1 255 50M * 0 0 ATGTATTCTTCTCAAACACAAACATACGGCGCATTGATCCCTCCTC... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
996747 D00542:238:C9N7UANXX:5:1108:1220:69670 4 * 0 0 * * 0 0 CGCAGTTACCGCTATGCCCTTCAGGATATGGACAAATTTNNTCNNN... ... 0 hblOligo209872|NM_019035|20B 1 255 50M * 0 0 CGCAGTTACCGCTATGCCCTTCAGGATATGGACAAATTTNNTCNNN... BBBBBCGDGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGG##<<###...
996771 D00542:238:C9N7UANXX:5:1108:10201:69953 4 * 0 0 * * 0 0 CAATCTGCAGACACAAGTTGGCGCGCCCACGAATTGGCTGCTTTGC... ... 0 hblOligo203743|NM_006617|2B 1 255 50M * 0 0 CAATCTGCAGACACAAGTTGGCGCGCCCACGAATTGGCTGCTTTGC... BCABBGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGFGG...
996900 D00542:238:C9N7UANXX:5:1108:9708:71103 4 * 0 0 * * 0 0 GAACCACGCGTGGACCTCGGCATCTGGTTTGGTGCACGCCTTCCTT... ... 0 hblOligo187732|XM_002343215|6B 1 255 50M * 0 0 GAACCACGCGTGGACCTCGGCATCTGGTTTGGTGCACGCCTTCCTT... BBBCCGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGE...
996986 D00542:238:C9N7UANXX:5:1108:19115:71715 4 * 0 0 * * 0 0 AATCGCGACACTCCAGAGGAAAACCCTTACTCCCTACTTCAAATTC... ... 0 hblOligo230147|NM_001034954|10B 1 255 50M * 0 0 AATCGCGACACTCCAGAGGAAAACCCTTACTCCCTACTTCAAATTC... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
997190 D00542:238:C9N7UANXX:5:1108:11195:73890 4 * 0 0 * * 0 0 ACTGTGTGGGCTGAACCTCGCCTTCAAATTCCTATGCTTCTTCCTG... ... 0 hblOligo208182|NM_001005188|2B 1 255 50M * 0 0 ACTGTGTGGGCTGAACCTCGCCTTCAAATTCCTATGCTTCTTCCTG... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
997197 D00542:238:C9N7UANXX:5:1108:12048:73643 4 * 0 0 * * 0 0 GTCTCCGGTCGCAAAGAATCCTTGAAACACGGCTGACCACTACGAT... ... 0 hblOligo249152|NM_181489|10B 1 255 50M * 0 0 GTCTCCGGTCGCAAAGAATCCTTGAAACACGGCTGACCACTACGAT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
997398 D00542:238:C9N7UANXX:5:1108:13598:75506 4 * 0 0 * * 0 0 CCTAGTGTCGGGACTTCTCTTCCTTCAGTTGGGTCCCGCCTGAGAG... ... 0 hblOligo63676|XM_002342157|5A 1 255 50M * 0 0 CCTAGTGTCGGGACTTCTCTTCCTTCAGTTGGGTCCCGCCTGAGAG... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
997488 D00542:238:C9N7UANXX:5:1108:12961:76454 4 * 0 0 * * 0 0 ATGGGCGCAGCTGTCTTTTTTGGGTGTACCTTTATCACTGTGGCCG... ... 0 hblOligo6249|NM_001077628|1A 1 255 50M * 0 0 ATGGGCGCAGCTGTCTTTTTTGGGTGTACCTTTATCACTGTGGCCG... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
997558 D00542:238:C9N7UANXX:5:1108:16670:77170 4 * 0 0 * * 0 0 GAACTCGAAAGTAAGGAGCTCTCAGCTACCGTGGCCGACCGCAGAT... ... 0 hblOligo81665|NM_000901|3A 1 255 50M * 0 0 GAACTCGAAAGTAAGGAGCTCTCAGCTACCGTGGCCGACCGCAGAT... BBBBBFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
997607 D00542:238:C9N7UANXX:5:1108:13923:77676 4 * 0 0 * * 0 0 GACGCAGAACGCCAACGCGCAGGCCCACGCGACGAAGACGGTCTCG... ... 0 hblOligo190725|XM_002344523|6B 1 255 50M * 0 0 GACGCAGAACGCCAACGCGCAGGCCCACGCGACGAAGACGGTCTCG... BBABA=EDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
997699 D00542:238:C9N7UANXX:5:1108:1986:78547 4 * 0 0 * * 0 0 GGTCACTTCGAGCCAGATGACGGGACTAGTGCATGGACTGAAGAAT... ... 0 hblOligo241637|NM_014847|6B 1 255 50M * 0 0 GGTCACTTCGAGCCAGATGACGGGACTAGTGCATGGACTGAAGAAT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
997926 D00542:238:C9N7UANXX:5:1108:3248:81398 4 * 0 0 * * 0 0 TTGGTTAACGGTATCCAAAAACTCAAAACCACAGCTAGCCCGATCT... ... 0 hblOligo158811|NM_003777|70B 1 255 50M * 0 0 TTGGTTAACGGTATCCAAAAACTCAAAACCACAGCTAGCCCGATCT... ABABAFGGGGGGGGGGGGGGGGEFGGGGGGGGGGGGGGGGGGGGGG...
998086 D00542:238:C9N7UANXX:5:1108:9468:82587 4 * 0 0 * * 0 0 CTTGTGAAACGCACAGATTTGAAAGATCTCACAGTCAACGCATGCG... ... 0 hblOligo150811|NM_024491|2B 1 255 50M * 0 0 CTTGTGAAACGCACAGATTTGAAAGATCTCACAGTCAACGCATGCG... ABBAAF;F1CGGGCGGDGGE
998339 D00542:238:C9N7UANXX:5:1108:18062:84983 4 * 0 0 * * 0 0 ACAGGTCTCCCTCGCAATACCACTATCTCATTGCTGTGATCCAACA... ... 0 hblOligo251636|NM_001001567|1C 1 255 50M * 0 0 ACAGGTCTCCCTCGCAATACCACTATCTCATTGCTGTGATCCAACA... CCCCCEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
998363 D00542:238:C9N7UANXX:5:1108:12942:85337 4 * 0 0 * * 0 0 CTGAGTGTGAAGGGGCCTGAACTGATCAATATGTGTTGGGCAAAGT... ... 0 hblOligo211564|NM_000287|18B 1 255 50M * 0 0 CTGAGTGTGAAGGGGCCTGAACTGATCAATATGTGTTGGGCAAAGT... BBABBGGGECGGGGGGGGEGGGEGC1C>=BFE
998369 D00542:238:C9N7UANXX:5:1108:2135:85920 4 * 0 0 * * 0 0 ACTGCTTTGCACTACGCTGTCTATAACAAAGGGACGCCTTCTTAGC... ... 0 hblOligo136276|NM_147195|4B 1 255 50M * 0 0 ACTGCTTTGCACTACGCTGTCTATAACAAAGGGACGCCTTCTTAGC... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
998390 D00542:238:C9N7UANXX:5:1108:18952:85668 4 * 0 0 * * 0 0 CAAGAGAAGAAGCTGCAGCAGGAGCTGGAGCGCCAGAAAGCGCCGC... ... 0 hblOligo48836|NM_032821|53A 1 255 50M * 0 0 CAAGAGAAGAAGCTGCAGCAGGAGCTGGAGCGCCAGAAAGCGCCGC... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
998512 D00542:238:C9N7UANXX:5:1108:19590:86940 4 * 0 0 * * 0 0 ATGTCAATGTTGCCAAGTTTTGGTTTCACCCAAGAATTTGGAGCGC... ... 0 hblOligo104551|NM_005982|1A 1 255 50M * 0 0 ATGTCAATGTTGCCAAGTTTTGGTTTCACCCAAGAATTTGGAGCGC... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
998551 D00542:238:C9N7UANXX:5:1108:15371:87191 4 * 0 0 * * 0 0 CCACCTAACGAGGCCGTTTCCGTCGTGATTAATGGGCCCAACTCTT... ... 0 hblOligo10894|NM_016252|39A 1 255 50M * 0 0 CCACCTAACGAGGCCGTTTCCGTCGTGATTAATGGGCCCAACTCTT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
998871 D00542:238:C9N7UANXX:5:1108:6186:90471 4 * 0 0 * * 0 0 ATGAGTGGCCCTGGGCCTCGCGAGCCTCCTCTGCAGTCGAAGGCGC... ... 0 hblOligo113495|NM_006602|1A 1 255 50M * 0 0 ATGAGTGGCCCTGGGCCTCGCGAGCCTCCTCTGCAGTCGAAGGCGC... AA?0<BCDGGFFGEFBDGGDGGGGGGGGGGGG11FDFCFGGGGGGG...
999180 D00542:238:C9N7UANXX:5:1108:11698:93498 4 * 0 0 * * 0 0 GGTAGTTCACGCGACTACGTGGCCGTCAACCGCTGCAAGGTCTTAA... ... 0 hblOligo175705|NM_000861|8B 1 255 50M * 0 0 GGTAGTTCACGCGACTACGTGGCCGTCAACCGCTGCAAGGTCTTAA... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
999345 D00542:238:C9N7UANXX:5:1108:20603:94823 4 * 0 0 * * 0 0 GGTTTTTATGCATGCCATGTTCGCTCTTTGATTCCATGCCAACTAA... ... 0 hblOligo67894|XM_001725815|3A 1 255 50M * 0 0 GGTTTTTATGCATGCCATGTTCGCTCTTTGATTCCATGCCAACTAA... BBBBBGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGFGGGGGGGG...
999361 D00542:238:C9N7UANXX:5:1108:17858:95047 4 * 0 0 * * 0 0 ACTGTCGCAGCCGCCTCAGCAACTACAGCAGCAAGCGTTGGGCAGT... ... 0 hblOligo124163|NM_025160|3A 1 255 50M * 0 0 ACTGTCGCAGCCGCCTCAGCAACTACAGCAGCAAGCGTTGGGCAGT... CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
999515 D00542:238:C9N7UANXX:5:1108:13601:96320 4 * 0 0 * * 0 0 GCAGCCTTGGCTCAGCGCGTCCGCAATCGCGTTAATTCTTAATACT... ... 0 hblOligo173186|NM_004492|2B 1 255 50M * 0 0 GCAGCCTTGGCTCAGCGCGTCCGCAATCGCGTTAATTCTTAATACT... CCCCCGGGGGGGGGGGGFGGGGGGGGGGGGGDGGGGGGGGGGGGGG...
999540 D00542:238:C9N7UANXX:5:1108:13669:96568 4 * 0 0 * * 0 0 TATACCCTCATTATGACACGCGAGCTGTGTACTCAGATGCGGTCTT... ... 0 hblOligo207286|NM_001004471|4B 1 255 50M * 0 0 TATACCCTCATTATGACACGCGAGCTGTGTACTCAGATGCGGTCTT... A:AB?>FGG1FG
999546 D00542:238:C9N7UANXX:5:1108:5719:96802 4 * 0 0 * * 0 0 GTTCGCAAGGCCCCAAATGCTTCAGATTTTGACAAATTCAGAGGTT... ... 0 hblOligo24646|NM_014711|3A 1 255 50M * 0 0 GTTCGCAAGGCCCCAAATGCTTCAGATTTTGACAAATTCAGAGGTT... =BBBBBGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGEGGGG...
999649 D00542:238:C9N7UANXX:5:1108:19160:97927 4 * 0 0 * * 0 0 GTCAAATCCAGCGCAACCATGGACCCAACTACCAGCATGCCTACGC... ... 0 hblOligo131523|NM_181806|20B 1 255 50M * 0 0 GTCAAATCCAGCGCAACCATGGACCCAACTACCAGCATGCCTACGC... BCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
999988 D00542:238:C9N7UANXX:5:1108:1965:101280 4 * 0 0 * * 0 0 AACGGGGGTTGCAGCTCACTGTGTCTGGCAANNNCNNNNNGTNNNN... ... 0 hblOligo69326|NM_002332|19A 1 255 50M * 0 0 AACGGGGGTTGCAGCTCACTGTGTCTGGCAANNNCNNNNNGTNNNN... BBCCCGGGGGGGGGGGEGGGGGGGGGGGGGG###=#####==####...

10184 rows × 21 columns

Looks like there is a perfect match to a prefix and the latter part of the read doesn't match

read AAATCCACCATTGTGAAGCAGATGAAGATCATTCATGGTTACTCAGAGCA
ref  AAATCCACCATTGTGAAGCAGATGAAGATCATTCATAAAAATGGTTACTCA

read GGTCCTCACGCCGCCCGCGTTCGCGGGTTGGCATTACAATCCGCTTTCCA
ref  GGTCCTCACGCCGCCCGCGTTCGCGGGTTGGCATTCCTCCCACACCAGACT

Bowtie vs kallisto


In [61]:
bt_k_joined = pd.merge(bowtie_alns, kallisto_alns, how='inner', on='QNAME', suffixes=['_bt', '_k'])

How many reads do bowtie and kallisto agree on?


In [62]:
(bt_k_joined.RNAME_bt == bt_k_joined.RNAME_k).sum()


Out[62]:
925110

For the minority of reads they disagree on, what do they look like


In [65]:
bt_k_joined[bt_k_joined.RNAME_bt != bt_k_joined.RNAME_k][['RNAME_bt', 'RNAME_k']]


Out[65]:
RNAME_bt RNAME_k
20 * hblOligo195898|NM_018050|8B
32 * hblOligo153699|NM_153264|48B
96 * hblOligo242494|NM_199242|18B
232 * hblOligo171177|NM_001102386|2B
253 hblOligo49822|NM_153461|7A *
254 hblOligo144036|NM_152769|8B *
260 * hblOligo124752|NM_001042424|17A
269 * hblOligo252161|NM_001032281|1C
281 * hblOligo38354|NM_203301|5A
316 * hblOligo12261|NM_020193|15A
341 hblOligo135200|NM_003748|6B *
343 * hblOligo199132|NM_015026|2B
346 hblOligo151665|NM_019886|10B *
381 hblOligo14177|NM_024319|1A *
392 * hblOligo248176|NM_006526|14B
431 * hblOligo29988|NM_020877|1A
438 * hblOligo34134|NM_005232|9A
450 * hblOligo99154|NM_020914|31A
456 * hblOligo13283|NM_144600|3A
495 * hblOligo123699|NM_014588|1A
499 * hblOligo143652|NM_001013657|4B
506 * hblOligo157434|NM_017925|10B
523 hblOligo40672|NM_015030|3A *
530 hblOligo153953|NM_182476|6B *
544 * hblOligo235927|NM_152658|4B
547 * hblOligo240228|NM_005706|4B
563 hblOligo142041|NM_139199|2B *
566 * hblOligo176312|NM_007312|6B
569 * hblOligo86484|NM_018907|15A
574 * hblOligo161089|NM_001949|10B
... ... ...
999634 * hblOligo96183|NM_030665|23A
999650 hblOligo211847|NM_004426|16B *
999658 * hblOligo190674|XM_002346229|8B
999667 hblOligo81556|NM_000910|1A *
999677 * hblOligo30457|NM_032364|5A
999682 hblOligo154749|NM_015974|2B *
999686 * hblOligo131523|NM_181806|20B
999692 * hblOligo187274|XM_002343409|2B
999695 * hblOligo137270|NM_001642|12B
999712 * hblOligo258452|NM_197949|2C
999753 * hblOligo51557|NM_002215|5A
999782 hblOligo209024|NM_000430|6B *
999801 * hblOligo127723|NM_003454|1A
999804 hblOligo207417|NM_001004451|2B *
999806 * hblOligo193000|XM_001128324|4B
999811 * hblOligo187548|XM_002343162|12B
999814 hblOligo211800|NM_024419|6B *
999869 * hblOligo141295|NM_004051|6B
999875 * hblOligo208869|NM_000917|10B
999876 * hblOligo2660|NM_001127687|7A
999889 * hblOligo138878|NM_006828|46B
999902 hblOligo240634|NM_003318|6B *
999922 hblOligo31137|NM_032482|19A *
999923 * hblOligo38334|NM_032145|1A
999924 * hblOligo132470|NM_014049|10B
999942 * hblOligo196233|NM_145686|14B
999949 * hblOligo8073|NM_018489|53A
999955 hblOligo15559|NM_001029863|1A *
999962 * hblOligo33761|NM_004435|5A
999998 * hblOligo69326|NM_002332|19A

74890 rows × 2 columns

Looks like many disagreeents, but probably still few disagreements on a positive mapping.


In [66]:
(bt_k_joined.RNAME_bt != bt_k_joined.RNAME_k).sum()


Out[66]:
74890

discordant reads, the number where kallisto failed to map is


In [67]:
(bt_k_joined[bt_k_joined.RNAME_bt != bt_k_joined.RNAME_k].RNAME_k == '*').sum()


Out[67]:
23820

and the number where bowtie failed is


In [68]:
(bt_k_joined[bt_k_joined.RNAME_bt != bt_k_joined.RNAME_k].RNAME_bt == '*').sum()


Out[68]:
51070

which means there are no disagreements on mapping. kallisto appears to be somewhat higher sensitivity.

Quantitation


In [4]:
bowtie_counts = pd.read_csv('counts/bowtie-51mer.tsv', sep='\t', header=0, names=['id', 'input', 'output'])
bowtie2_counts = pd.read_csv('counts/bowtie2-51mer.tsv', sep='\t', header=0, names=['id', 'input', 'output'])
kallisto_counts = pd.read_csv('counts/kallisto-51mer.tsv', sep='\t', header=0)

In [8]:
fig, ax = plt.subplots()
_ = ax.hist(bowtie_counts.output, bins=100, log=True)
_ = ax.set(title='bowtie')



In [9]:
fig, ax = plt.subplots()
_ = ax.hist(bowtie2_counts.output, bins=100, log=True)
_ = ax.set(title='bowtie2')



In [10]:
fig, ax = plt.subplots()
_ = ax.hist(kallisto_counts.est_counts, bins=100, log=True)
_ = ax.set(title='kallisto')



In [17]:
bt2_k_counts = pd.merge(bowtie2_counts, kallisto_counts, how='inner', left_on='id', right_on='target_id')

In [25]:
fig, ax = plt.subplots()
ax.scatter(bt2_k_counts.output, bt2_k_counts.est_counts)


Out[25]:
<matplotlib.collections.PathCollection at 0x11f36a860>

In [28]:
sp.stats.pearsonr(bt2_k_counts.output, bt2_k_counts.est_counts)


Out[28]:
(0.98889129323484826, 0.0)

In [30]:
sp.stats.spearmanr(bt2_k_counts.output, bt2_k_counts.est_counts)


Out[30]:
SpearmanrResult(correlation=0.98001436308964673, pvalue=0.0)

Otherwise, the kallisto index is about 3x bigger than the bowtie indices, but kallisto (5.7 s single-threaded) is about 3.5x faster than bowtie2 (20 s) and 7.3x faster than bowtie (42 s; though still appears to be using 2 threads).

Note: it appears that kallisto needs a few extra bases on the reference to achieve its sensitivity. Performed an analysis like so:

Looked at discordant cells according to Ben.

cpm = pd.read_csv('cpm.tsv', sep='\t', index_col=0, header=0)
mlxp = pd.read_csv('mlxp.tsv', sep='\t', index_col=0, header=0)
beadsonlycols = list(filter(lambda c: 'BEADS_ONLY' in c, mlxp.columns))

samples = ['Sjogrens.serum.Sjogrens.FS08-01647.20A20G.1']

oligo1 = list(filter(lambda c: 'hblOligo32108' in c, mlxp.index))[0]  # hit for Ben
oligo2 = list(filter(lambda c: 'hblOligo223219' in c, mlxp.index))[0]  # null for Ben
oligos = [oligo1, oligo2]

print(cpm[beadsonlycols + samples].loc[oligos].to_csv(sep='\t'))
print(mlxp[beadsonlycols + samples].loc[oligos].to_csv(sep='\t'))

Built some indices of different sizes

from Bio import SeqIO
k = 60
output = f'reference{k}.fasta'
with open(output, 'w') as op:
    for sr in SeqIO.parse('/Users/laserson/repos/phage_libraries_private/human90/human90-ref.fasta', 'fasta'):
        print(sr[:k].format('fasta'), end='', file=op)
kallisto index -i human90-50.idx reference50.fasta
kallisto index -i human90-51.idx reference51.fasta
kallisto index -i human90-52.idx reference52.fasta
kallisto index -i human90-55.idx reference55.fasta
kallisto index -i human90-60.idx reference60.fasta


kallisto quant -i human90-50.idx -o quant-50 --single -l 50 -s 0.1 --pseudobam Sjogrens.serum.Sjogrens.FS08-01647.20A20G.1.fastq.gz > aln-50.sam
kallisto quant -i human90-51.idx -o quant-51 --single -l 50 -s 0.1 --pseudobam Sjogrens.serum.Sjogrens.FS08-01647.20A20G.1.fastq.gz > aln-51.sam
kallisto quant -i human90-52.idx -o quant-52 --single -l 50 -s 0.1 --pseudobam Sjogrens.serum.Sjogrens.FS08-01647.20A20G.1.fastq.gz > aln-52.sam
kallisto quant -i human90-55.idx -o quant-55 --single -l 50 -s 0.1 --pseudobam Sjogrens.serum.Sjogrens.FS08-01647.20A20G.1.fastq.gz > aln-55.sam
kallisto quant -i human90-60.idx -o quant-60 --single -l 50 -s 0.1 --pseudobam Sjogrens.serum.Sjogrens.FS08-01647.20A20G.1.fastq.gz > aln-60.sam

Generated the following numbers of alignments

6,369 reads pseudoaligned
1,419,515 reads pseudoaligned
1,477,736 reads pseudoaligned
1,490,788 reads pseudoaligned
1,498,420 reads pseudoaligned

But looking at the results

grep hblOligo32108 quant-50/abundance.tsv
grep hblOligo32108 quant-51/abundance.tsv
grep hblOligo32108 quant-52/abundance.tsv
grep hblOligo32108 quant-55/abundance.tsv
grep hblOligo32108 quant-60/abundance.tsv

It was clear that at least 52 bases was necessary for the 50 base read to get max alignments for the peptides chosen.


In [ ]: