Comparison of Acevedo's passage 8 data when 3 repeats or only 2 repeats are used

In both cases, we use a combined quality threshold of 60, i.e. a per-repeat quality of 20 with 3 repeats, or 30 with 2 repeats. We therefore expect fewer remaining bases when only 2 repeats are used, because many will not pass the quality threshold. The analyses were performed with Acevedo's scripts, modified for the 2-repeat case. When 2 repeats are analysed, only the first and the third repeats are used; since the third repeat should have lower quality than the second, this is a conservative way to look at the data.
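As a minimal sketch of the intended filtering logic, for illustration only (this is an assumption about what Acevedo's scripts do, not a copy of them; in particular, requiring the repeats to agree on the base call is an extra assumption here):

# Hypothetical sketch of the per-position quality filter; NOT Acevedo's actual code.
# Assumes each repeat contributes one (base, Phred quality) call for the position.
QUALITY_SUM_THRESHOLD = 60  # 3 repeats at Q20, or 2 repeats at Q30

def passes_threshold(calls):
    """calls: list of (base, quality) tuples, one per repeat used."""
    bases = {base for base, _ in calls}
    total_quality = sum(quality for _, quality in calls)
    # Keep the position only if the repeats agree and the summed quality reaches the threshold.
    return len(bases) == 1 and total_quality >= QUALITY_SUM_THRESHOLD

print(passes_threshold([("A", 20), ("A", 20), ("A", 20)]))  # True: 3 x Q20 = 60
print(passes_threshold([("A", 20), ("A", 25)]))             # False: 2 repeats summing to 45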

Reading in the data


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
import matplotlib

colNames = ["position","referenceBase","A","C", "G", "T"]
counts = pd.read_csv("outputPassage8/Q20threshold.txt", sep="\t", header=None, names=colNames)

counts2Repeats = pd.read_csv("output2RepeatsPassage8/Q30threshold.txt", sep="\t", header=None, names=colNames)

# Utility variables for plotting
axisFontSize = 20
titleFontSize = 25
axisTickFontSize = 12

matplotlib.rc('xtick', labelsize=axisTickFontSize)
matplotlib.rc('ytick', labelsize=axisTickFontSize)
rcParams["axes.labelsize"] = axisFontSize

Computing statistics of interest


In [2]:
def coverage(a,c,g,t):
    cov = a+c+g+t
    return cov

def majorBaseRatio(a,c,g,t,coverage):
    l=[a,c,g,t]
    l.sort()
    return l[3]/coverage 

def secondMajorBaseRatio(a,c,g,t,coverage):
    l=[a,c,g,t]
    l.sort()
    return l[2]/coverage 

def thirdMajorBaseRatio(a,c,g,t,coverage):
    l=[a,c,g,t]
    l.sort()
    return l[1]/coverage 

def fourthMajorBaseRatio(a,c,g,t,coverage):
    l=[a,c,g,t]
    l.sort()
    return l[0]/coverage 

def ARatio(a,coverage):
    return a/coverage 

def CRatio(c,coverage):
    return c/coverage 

def GRatio(g,coverage):
    return g/coverage 

def TRatio(t,coverage):
    return t/coverage 

def computeCoverageAndRatios (counts):
    counts["coverage"]= list ( map(coverage, counts["A"], counts["C"], counts["G"], counts["T"]) )
    counts["majorBaseRatio"]  = list ( map(majorBaseRatio, counts["A"], counts["C"], counts["G"], counts["T"], counts["coverage"]) )
    counts["secondMajorBaseRatio"]  = list ( map(secondMajorBaseRatio, counts["A"], counts["C"], counts["G"], counts["T"], counts["coverage"]) )
    counts["thirdMajorBaseRatio"] = list ( map(thirdMajorBaseRatio, counts["A"], counts["C"], counts["G"], counts["T"], counts["coverage"]) ) 
    counts["fourthMajorBaseRatio"] = list ( map(fourthMajorBaseRatio, counts["A"], counts["C"], counts["G"], counts["T"], counts["coverage"]) )
    counts["ARatio"] = list ( map(ARatio, counts["A"], counts["coverage"]) )
    counts["CRatio"] = list ( map(CRatio, counts["C"], counts["coverage"]) )
    counts["GRatio"] = list ( map(GRatio, counts["G"], counts["coverage"]) )
    counts["TRatio"] = list ( map(TRatio, counts["T"], counts["coverage"]) )
    return

computeCoverageAndRatios(counts)
counts.describe()  # not displayed: only the last expression of the cell is shown

computeCoverageAndRatios(counts2Repeats)
counts2Repeats.describe()


Out[2]:
position A C G T coverage majorBaseRatio secondMajorBaseRatio thirdMajorBaseRatio fourthMajorBaseRatio ARatio CRatio GRatio TRatio
count 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000
mean 3720.500000 6378.266263 4450.078495 4476.822984 4855.572581 20160.740323 0.999802 0.000182 0.000015 0.000001 0.296490 0.233578 0.229992 0.239940
std 2147.887334 16949.345539 13589.430771 13364.229586 14537.080609 23661.456526 0.000905 0.000901 0.000043 0.000009 0.456635 0.423011 0.420754 0.426948
min 1.000000 0.000000 0.000000 0.000000 0.000000 59.000000 0.931071 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1860.750000 0.000000 0.000000 0.000000 0.000000 6950.000000 0.999751 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 3720.500000 1.000000 0.000000 0.000000 1.000000 12909.000000 0.999883 0.000104 0.000000 0.000000 0.000044 0.000000 0.000000 0.000022
75% 5580.250000 5357.000000 10.000000 8.000000 21.250000 24012.500000 1.000000 0.000225 0.000000 0.000000 0.999728 0.000491 0.000368 0.000700
max 7440.000000 203304.000000 191022.000000 182796.000000 179859.000000 203317.000000 1.000000 0.068892 0.000647 0.000235 1.000000 1.000000 1.000000 1.000000

Impact of the number of repeats on coverage.


In [3]:
%matplotlib inline

def plotCoverage(subplot, counts, title):
    rcParams['figure.figsize'] = 18, 8
    subplot.set_title(title, fontsize=15, fontweight='bold')
    subplot.set_ylabel('Number of reads')
    subplot.plot(counts['position'], counts['coverage'], 'b-')
    subplot.legend(['Sequencing coverage'], loc='upper left')
    # Fit the axes to the data range.
    yaxis_lower_limit = counts['coverage'].min()
    yaxis_upper_limit = counts['coverage'].max()
    xaxis_lower_limit = counts['position'].min()
    xaxis_upper_limit = counts['position'].max()
    subplot.axis([xaxis_lower_limit, xaxis_upper_limit, yaxis_lower_limit, yaxis_upper_limit])

f, axarr = plt.subplots(2, sharex=True,figsize=(15,15))
plotCoverage(axarr[0], counts,'Genome coverage by position, 3 repeats')
plotCoverage(axarr[1], counts2Repeats,'Genome coverage by position, 2 repeats')
plt.xlabel ('Genome position')
plt.tight_layout()



In [13]:
print("Average coverage for 3 repeats, quality threshold at 20: "+str(counts["coverage"].mean()))
print("Average coverage for 2 repeats, quality threshold at 30: "+str(counts2Repeats["coverage"].mean()))


Average coverage for 3 repeats, quality threshold at 20: 169749.96707
Average coverage for 2 repeats, quality threshold at 30: 20160.7403226

Conclusion: Coverage takes a severe hit when going from 3 repeats with a quality threshold of 20 to 2 repeats with a quality threshold of 30.
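For scale, the ratio of the two mean coverages printed above is roughly 8.4; a one-liner on the columns computed earlier gives the exact fold change:

# Fold change in mean coverage when dropping from 3 repeats (Q20) to 2 repeats (Q30).
fold_change = counts["coverage"].mean() / counts2Repeats["coverage"].mean()
print("Mean coverage drops by a factor of about {:.1f}".format(fold_change))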

Impact of the number of repeats on allele frequency.


In [4]:
%matplotlib inline

def plotFrequencyOfMostFrequentNucleotide(subplot, counts, title):
    rcParams['figure.figsize'] = 18, 8
    subplot.set_title(title, fontsize=15, fontweight='bold')
    subplot.set_ylabel('Frequency')
    subplot.plot(counts['position'], counts['majorBaseRatio'], 'r.')
    subplot.legend(['Frequency of the most frequent nucleotide'], loc='upper right')
    # Fit the axes to the data range.
    yaxis_lower_limit = counts['majorBaseRatio'].min()
    yaxis_upper_limit = counts['majorBaseRatio'].max()
    xaxis_lower_limit = counts['position'].min()
    xaxis_upper_limit = counts['position'].max()
    subplot.axis([xaxis_lower_limit, xaxis_upper_limit, yaxis_lower_limit, yaxis_upper_limit])


f, axarr = plt.subplots(2, sharex=True,figsize=(15,15))
plotFrequencyOfMostFrequentNucleotide(axarr[0], counts, 'Frequency of the most frequent nucleotide by genome position, 3 repeats')
plotFrequencyOfMostFrequentNucleotide(axarr[1], counts2Repeats, 'Frequency of the most frequent nucleotide by genome position, 2 repeats')
plt.xlabel ('Genome position')
plt.tight_layout()


Conclusion: most outlier sites in the 3-repeat analysis also appear to be outliers in the 2-repeat analysis, except in low-coverage regions.


In [16]:
%matplotlib inline

def runningMeanFast(x, N):
    # Running mean over a window of N positions; the last N-1 values are edge-affected.
    return np.convolve(x, np.ones((N,))/N)[(N-1):]

def plotRMFrequencyOfMostFrequentNucleotide(subplot, counts, title):
    majorBaseRatioMovingAverage10 = runningMeanFast(counts['majorBaseRatio'], 10)
    secondMajorBaseRatioMovingAverage10 = runningMeanFast(counts['secondMajorBaseRatio'], 10)
    rcParams['figure.figsize'] = 18, 8
    subplot.set_title(title, fontsize=15, fontweight='bold')
    subplot.plot(counts['position'], majorBaseRatioMovingAverage10, 'r-')
    # The second most frequent nucleotide is plotted as 1 - frequency so that it fits on the same axis, close to 1.
    subplot.plot(counts['position'], 1 - secondMajorBaseRatioMovingAverage10, 'b-', alpha=0.7)
    subplot.set_ylabel('Frequency')
    subplot.legend(['Frequency of the most frequent nucleotide', '1 - frequency of the second most frequent nucleotide'], loc='lower left')
    yaxis_lower_limit = 0.991
    yaxis_upper_limit = majorBaseRatioMovingAverage10.max()
    xaxis_lower_limit = counts['position'].min()
    xaxis_upper_limit = counts['position'].max()
    subplot.axis([xaxis_lower_limit, xaxis_upper_limit, yaxis_lower_limit, yaxis_upper_limit])

f, axarr = plt.subplots(2, sharex=True,figsize=(15,15))
plotRMFrequencyOfMostFrequentNucleotide(axarr[0], counts, 'Frequency of the most frequent nucleotide by genome position, 3 repeats')
plotRMFrequencyOfMostFrequentNucleotide(axarr[1], counts2Repeats, 'Frequency of the most frequent nucleotide by genome position, 2 repeats')
plt.xlabel ('Genome position')
plt.tight_layout()


Conclusion: Same result as above: some outlier regions have been lost, probably due to low coverage.

Now we plot the 3 least frequent bases per position.


In [6]:
%matplotlib inline

def plotFrequenciesOf3LeastFrequentNucleotides(subplot, counts, title):
    rcParams['figure.figsize'] = 18, 8
    subplot.set_title(title, fontsize=15, fontweight='bold')
    subplot.plot(counts['position'], counts['secondMajorBaseRatio'], 'r.')
    subplot.plot(counts['position'], counts['thirdMajorBaseRatio'], 'b.')
    subplot.plot(counts['position'], counts['fourthMajorBaseRatio'], 'g.')
    subplot.set_ylabel('Frequency')
    subplot.legend(['Frequency of the second most frequent nucleotide', 'Frequency of the third most frequent nucleotide', 'Frequency of the fourth most frequent nucleotide'], loc='upper left')
    yaxis_lower_limit = counts['fourthMajorBaseRatio'].min()
    yaxis_upper_limit = 1
    xaxis_lower_limit = counts['position'].min()
    xaxis_upper_limit = counts['position'].max()
    subplot.set_yscale('log')
    # Dashed reference line at a frequency of 1%.
    subplot.axhline(y=0.01, linewidth=1, linestyle='dashed', color='0.75')
    subplot.axis([xaxis_lower_limit, xaxis_upper_limit, yaxis_lower_limit, yaxis_upper_limit])

f, axarr = plt.subplots(2, sharex=True,figsize=(15,15))
plotFrequenciesOf3LeastFrequentNucleotides(axarr[0], counts, 'Frequency of the 3 least frequent nucleotides by genome position, 3 repeats')
plotFrequenciesOf3LeastFrequentNucleotides(axarr[1], counts2Repeats, 'Frequency of the 3 least frequent nucleotides by genome position, 2 repeats')
plt.xlabel ('Genome position')
plt.tight_layout()


Conclusion: The minor alleles suffer the most from using only 2 repeats instead of 3. In many cases they disappear entirely, i.e. their estimated frequency drops to 0.

Impact of sequence coverage on frequency estimation

Now we investigate the impact of sequencing coverage on our ability to recover nucleotide frequencies. To do that, we compute the Jensen-Shannon divergence (a symmetrised version of the Kullback-Leibler divergence) between the per-position nucleotide frequencies obtained with 2 and with 3 repeats.
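For reference, the standard definition is

\mathrm{JSD}(P \parallel Q) = \tfrac{1}{2}\,\mathrm{KL}(P \parallel M) + \tfrac{1}{2}\,\mathrm{KL}(Q \parallel M), \qquad M = \tfrac{1}{2}(P + Q), \qquad \mathrm{KL}(P \parallel Q) = \sum_i p_i \ln\frac{p_i}{q_i}.

The implementation below sums the two Kullback-Leibler terms without the 1/2 factors and uses natural logarithms, so its values are twice the standard JSD; this only changes the scale, not the ranking of positions.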


In [7]:
import scipy
import math

def KullbackLeibler(p, q):
    # Kullback-Leibler divergence KL(p || q), natural log, with a small
    # pseudocount (1e-10) to avoid division by zero.
    if len(p) != len(q):
        raise ValueError("Error in KL: lists with different lengths")
    kl = 0
    for i in range(len(p)):
        if p[i] == 0:
            continue  # 0 * log(.) contributes nothing
        elif q[i] == 0:
            ratio = p[i]/0.0000000001
        else:
            ratio = p[i]/q[i]
        kl = kl + p[i]*math.log(ratio)
    return kl

def JensenShannon(p, q):
    # Sum of the two KL terms against the mean distribution; this is twice the
    # standard (1/2-weighted) Jensen-Shannon divergence, which only rescales the values.
    meanpq = [sum(x)/2 for x in zip(p, q)]
    js = KullbackLeibler(p, meanpq) + KullbackLeibler(q, meanpq)
    return js

def computeDivergences(count1, count2):
    ratios1 = list(zip(count1['ARatio'], count1['CRatio'], count1['GRatio'], count1['TRatio']))
    ratios2 = list(zip(count2['ARatio'], count2['CRatio'], count2['GRatio'], count2['TRatio']))
    result=list()
    for i in range(len(ratios1)):
        result.append(JensenShannon (ratios1[i], ratios2[i]))
    return result

counts2Repeats['Divergence'] = computeDivergences(counts2Repeats, counts)

In [8]:
counts2Repeats.describe()


Out[8]:
position A C G T coverage majorBaseRatio secondMajorBaseRatio thirdMajorBaseRatio fourthMajorBaseRatio ARatio CRatio GRatio TRatio Divergence
count 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000 7440.000000
mean 3720.500000 6378.266263 4450.078495 4476.822984 4855.572581 20160.740323 0.999802 0.000182 0.000015 0.000001 0.296490 0.233578 0.229992 0.239940 0.000132
std 2147.887334 16949.345539 13589.430771 13364.229586 14537.080609 23661.456526 0.000905 0.000901 0.000043 0.000009 0.456635 0.423011 0.420754 0.426948 0.000211
min 1.000000 0.000000 0.000000 0.000000 0.000000 59.000000 0.931071 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1860.750000 0.000000 0.000000 0.000000 0.000000 6950.000000 0.999751 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000027
50% 3720.500000 1.000000 0.000000 0.000000 1.000000 12909.000000 0.999883 0.000104 0.000000 0.000000 0.000044 0.000000 0.000000 0.000022 0.000067
75% 5580.250000 5357.000000 10.000000 8.000000 21.250000 24012.500000 1.000000 0.000225 0.000000 0.000000 0.999728 0.000491 0.000368 0.000700 0.000157
max 7440.000000 203304.000000 191022.000000 182796.000000 179859.000000 203317.000000 1.000000 0.068892 0.000647 0.000235 1.000000 1.000000 1.000000 1.000000 0.004305

In [9]:
%matplotlib inline

def scatterPlotDivergenceVsCoverage(subplot, counts, title):
    # Note: this function is called with the pyplot module itself (see below),
    # hence suptitle/ylabel rather than the Axes setters used elsewhere.
    rcParams['figure.figsize'] = 18, 8
    subplot.suptitle(title, fontsize=15, fontweight='bold')
    # Smooth the divergence as a function of coverage with a running mean of width 200.
    orderDivergenceByCoverage = [x for (y, x) in sorted(zip(counts['coverage'], counts['Divergence']))]
    smoothedCurve = runningMeanFast(orderDivergenceByCoverage, 200)
    subplot.plot(counts['coverage'], counts['Divergence'], 'r.', alpha=0.3)
    subplot.plot(sorted(counts['coverage']), smoothedCurve, '-', alpha=1, linewidth=2)
    subplot.ylabel('Divergence')
    subplot.legend(['Divergence'], loc='upper left')
    yaxis_lower_limit = counts['Divergence'].min()
    yaxis_upper_limit = counts['Divergence'].max()
    xaxis_lower_limit = counts['coverage'].min()
    xaxis_upper_limit = counts['coverage'].max()
    subplot.axis([xaxis_lower_limit, xaxis_upper_limit, yaxis_lower_limit, yaxis_upper_limit])


scatterPlotDivergenceVsCoverage(plt, counts2Repeats, 'Divergence in nucleotide frequencies between experiments, against coverage')
plt.xlabel ('Coverage')
plt.tight_layout()


Conclusion: As expected, at higher coverage, frequencies found using 2 repeats match better those found using 3 repeats.

Now let's look at the differences in the major and various minor allele frequencies.
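For each position, the next cell compares the 3-repeat estimate \(f_3\) and the 2-repeat estimate \(f_2\) of an allele frequency through the symmetric relative difference

\Delta = \frac{f_3 - f_2}{f_3 + f_2},

which lies in \([-1, 1]\); when both estimates are 0 (frequent for the minor alleles) the ratio is undefined (NaN for the float columns used here) and is skipped when setting the axis limits.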


In [10]:
%matplotlib inline

def scatterPlotRatioDifferenceVsCoverage(subplot, ratio1, ratio2, title):
    rcParams['figure.figsize'] = 18, 8
    subplot.set_title(title, fontsize=15, fontweight='bold')
    # Symmetric relative difference between the two estimates; NaN when both are 0.
    diffRatio = pd.Series([(x - y)/(x + y) for (x, y) in zip(ratio1, ratio2)])
    yaxis_lower_limit = diffRatio.min(skipna=True)
    yaxis_upper_limit = diffRatio.max(skipna=True)
    xaxis_lower_limit = counts['coverage'].min()
    xaxis_upper_limit = counts['coverage'].max()
    # Coverage of the 3-repeat data set (global `counts`) is used as the x axis.
    orderdiffRatioByCoverage = [x for (y, x) in sorted(zip(counts['coverage'], diffRatio))]
    smoothedCurve = runningMeanFast(orderdiffRatioByCoverage, 50)
    subplot.plot(counts['coverage'], diffRatio, 'r.', alpha=0.3)
    subplot.plot(sorted(counts['coverage']), smoothedCurve, '-', alpha=1, linewidth=2)
    subplot.set_ylabel('Frequency difference')
    subplot.legend(['Relative frequency difference'], loc='upper left')
    subplot.locator_params(axis='y', nbins=6)
    subplot.axhline(y=0.00, linewidth=2, linestyle='dashed', color='0.1')
    subplot.axis([xaxis_lower_limit, xaxis_upper_limit, yaxis_lower_limit, yaxis_upper_limit])


f, axarr = plt.subplots(4, sharex=True,figsize=(15,15))
scatterPlotRatioDifferenceVsCoverage(axarr[0], counts['majorBaseRatio'], counts2Repeats['majorBaseRatio'], 'Relative difference in nucleotide frequencies between experiments against coverage, major allele')
scatterPlotRatioDifferenceVsCoverage(axarr[1], counts['secondMajorBaseRatio'], counts2Repeats['secondMajorBaseRatio'], 'Relative difference in nucleotide frequencies between experiments against coverage, second major allele')
scatterPlotRatioDifferenceVsCoverage(axarr[2], counts['thirdMajorBaseRatio'], counts2Repeats['thirdMajorBaseRatio'], 'Relative difference in nucleotide frequencies between experiments against coverage, third major allele')
scatterPlotRatioDifferenceVsCoverage(axarr[3], counts['fourthMajorBaseRatio'], counts2Repeats['fourthMajorBaseRatio'], 'Relative difference in nucleotide frequencies between experiments against coverage, fourth major allele')



plt.xlabel ('Coverage')
plt.tight_layout()


Conclusion: The frequency of the major allele seems to be well estimated with 2 repeats instead of 3; however, the minor allele frequencies are estimated less well.

General conclusions: What really matters is the high-quality coverage, i.e. how many bases with a combined quality over 60 have been read at each position of the genome.

There is therefore a trade-off: using more repeats increases the quality of all bases, but decreases the number of informative bases. We need to look further (e.g. at our real data) to investigate whether 2 repeats could be enough.

