In [1]:

    
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

Extracting information about sequence quality and enrichment

Enrichment

Often want to compare two datasets (tissue 1 vs. tissue 2; -drug vs. +drug; etc.)
Done by taking ratio of counts for sequences between data sets

$$f_{seq} = \frac{C_{seq}}{C_{total}}$$

The normalized frequency of a sequence $f_{seq}$ is determined by the number of counts of that sequence relative to all counts in the data set.

The enrichment of the sequence in dataset 2 vs. dataset 1 is given by:

$$E_{seq} = \frac{f_{seq,2}}{f_{seq,1}}$$

where $f_{seq,1}$ and $f_{seq,2}$ are the normalized frequencies of the sequence in dataset 1 and 2.

How do we decide which sequences are high enough quality to include?

For each cluster, you get a sequence of colors representing the sequence

Some bases are read well, others are ambiguous.

The "Phred" score measures confidence in the base "call":

\begin{equation} Q = -10log_{10}(p) \qquad \qquad \text{$(1)$} \end{equation}

where $p$ is the probability that the call is wrong.

By rearranging Eq. $(1)$ above, we get: \begin{equation} p = 10^{- \frac{Q}{10}} \qquad \qquad \text{$(2)$} \end{equation}

Create a plot of Q vs. p. Is a high "Q" good or bad?



In [ ]:

Phred scores are encoded in last line:

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGG
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIII

Encoding goes like

Letter	ASCII	$Q$	$p$
`!`	33	0	1.00000
`"`	34	1	0.79433
`#`	35	2	0.63096
...	...	...	...
`J`	74	41	0.00008
`K`	75	42	0.00006

python `chr` command converts integer ASCII to character



In [2]:

    
print(chr(33))
print(chr(34))
print(chr(35))
print("...")
print(chr(74))
print(chr(75))









    



!
"
#
...
J
K

Can create dictionary that converts letters to quality scores

Recall that \begin{equation} p = 10^{- \frac{Q}{10}} \qquad \qquad \text{$(2)$} \end{equation}



In [3]:

    
Q_dict = {}
p_dict = {}
for i in range(33,76):
    Q_dict[chr(i)] = i-33
    p_dict[chr(i)] = 10**(-(Q_dict[chr(i)])/10.)
    
p_dict["K"]









    Out[3]:





6.309573444801929e-05

Example

$$p_{correct} = \prod_{i=1}^{L} (1-p_{incorrect})$$

where $i$ indexes along sequence and $L$ is sequence length.



In [4]:

    
qual_string = "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIII"

p_correct = 1.0
for q in qual_string:
    p_correct = p_correct*(1-p_dict[q])

print(p_correct)









    



0.9857512130454114

Modify this code to pull out the $p_{correct}$ each the sequence



In [5]:

    
import gzip

get_line = False
seqs = {}
with gzip.open("files/example.fastq.gz") as f:
    for l in f:
        l_ascii = l.decode("ascii")
        if l_ascii[0] == "@":
            get_line = True
            continue
        if get_line:
            try:
                seqs[l_ascii.strip()] += 1
            except KeyError:
                seqs[l_ascii.strip()] = 1
            get_line = False

Does sequence quality vary along the length of your reads?



In [ ]:



In [ ]: