Set the matplotlib magic to notebook enable inline plots.



In [ ]:

    
%pylab inline

Calculate the Nonredundant Read Fraction (NRF)

SAM format example:

SRR585264.8766235       0       1       4       15      35M     *       0       0       CTTAAACAATTATTCCCCCTGCAAACATTTTCAAT     GGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGG    XT:A:U   NM:i:1  X0:i:1  X1:i:6  XM:i:1  XO:i:0  XG:i:0  MD:Z:8T26

Import the required modules



In [ ]:

    
import subprocess
import matplotlib.pyplot as plt
import random
import numpy as np

Make figures prettier and biger



In [ ]:

    
plt.style.use('ggplot')
figsize(10,5)

Parse the SAM file and extract the unique start coordinates.

First store the file name in the variable



In [1]:

    
file = ""

Next we read the file using samtools. From each read we need to store the flag, chromosome name and start coordinate.



In [ ]:

    
p = subprocess.Popen(["samtools", "view", "-q10", "-F260", file],
                      stdout=subprocess.PIPE)
coords = []
for line in p.stdout:
    , ,  = line.decode('utf-8').split()[ : ]
    coords.append([ , , ])



In [ ]:



In [ ]:

What is the total number of our unique reads?



In [ ]:

    
len()

In python we can randomly sample the coordinates to get 1M for NRF calculations



In [ ]:

    
random.seed(1234)
sample = random.sample(coords, 1000000)

How many of those coordinates are unique? (We will use the set python object which only the unique items.)



In [ ]:

    
uniqueStarts = {'watson': set(), 'crick': set()}
for coord in sample:
    flag, ref, start = coord
    if int(flag) & 16:
        uniqueStarts['crick'].add((ref, start))
    else:
        uniqueStarts['watson'].add((ref, start))

How many on the Watson strand?



In [ ]:

    
len(uniqueStarts['watson'])

And on the Crick?



In [ ]:

Calculate the NRF



In [ ]:

    
NRF_input = ( + )*1.0 /
print(NRF_input)

Lets create a function from what we did above and apply it to all of our files!

To use our function on the real sequencing datasets (not only on a small subset) we need to optimize our method a bit- we will use python module called numpy.



In [ ]:

    
def calculateNRF(filePath, pickSample=True, sampleSize=10000000, seed=1234):
    p = subprocess.Popen(['samtools', 'view', '-q10', '-F260', filePath],
                          stdout=subprocess.PIPE)
    coordType = np.dtype({'names': ['flag', 'ref', 'start'],
                          'formats': ['uint16', 'U10', 'uint32']})
    coordArray = np.empty(10000000, dtype=coordType)
    i = 0
    for line in p.stdout:
        if i >= len(coordArray):
            coordArray = np.append(coordArray, np.empty(1000000, dtype=coordType), axis=0)
        fg, rf, st = line.decode('utf-8').split()[1:4]
        coordArray[i] = np.array((fg, rf, st), dtype=coordType)
        i += 1
    coordArray = coordArray[:i]
    sample = coordArray
    if pickSample and len(coordArray) > sampleSize:
        np.random.seed(seed)
        sample = np.random.choice(coordArray, sampleSize, replace=False)
    uniqueStarts = {'watson': set(), 'crick': set()}
    for read in sample:
        flag, ref, start = read
        if flag & 16:
            uniqueStarts['crick'].add((ref, start))
        else:
            uniqueStarts['watson'].add((ref, start))
    NRF = (len(uniqueStarts['watson']) + len(uniqueStarts['crick']))*1.0/len(sample)
    return NRF

Calculate the NRF for the chip-seq sample



In [ ]:

    
NRF_chip = calculateNRF("", sampleSize=1000000)
print(NRF_chip)

Plot the NRF!



In [ ]:



In [ ]:

    
plt.bar([0,2],[NRF_input, NRF_chip], width=1)
plt.xlim([-0.5,3.5]), plt.xticks([0.5, 2.5], ['Input', 'ChIP'])
plt.xlabel('Sample')
plt.ylabel('NRF')
plt.ylim([0, 1.25]), plt.yticks(np.arange(0, 1.2, 0.2))
plt.plot((-0.5,3.5), (0.8,0.8), 'red', linestyle='dashed')
plt.show()

Calculate the Signal Extraction Scaling

Load the results from the coverage calculations. Lets take the input sample first.

20      0       1000    6
20      1000    2000    15
20      2000    3000    13
...



In [ ]:

    
countList = []
with open('./bedtools/input_coverage.bed', 'r') as covFile:
    for line in covFile:
        countList.append(int(line.strip('\n').split('\t')[3]))



In [ ]:



In [ ]:

Lets see where do our reads align to the genome. Plot the distribution of tags along the genome.



In [ ]:

    
plt.plot(range(len(countList)), countList)
plt.xlabel('Bin number')
plt.ylabel('Bin coverage')
plt.xlim([0, len(countList)])
plt.show()

Now sort the list- order the windows based on the tag count



In [ ]:

    
countList.sort()

What do you suppose is in the beginning of our list?



In [ ]:

Sum all the aligned tags



In [ ]:

    
countSum = sum()
countSum

Calculate the summaric fraction of tags along the ordered windows.



In [ ]:

    
countFraction = []
for i, count in enumerate(countList):
    if i == 0:
        countFraction.append(count*1.0 / countSum)
    else:
        countFraction.append((count*1.0 / countSum) + countFraction[i-1])

Look at the last five items of the list:



In [ ]:

Calculate the number of windows.



In [ ]:

    
winNumber = 
winNumber

Calculate what fraction of a whole is the position of each window.



In [ ]:

    
winFraction = []
for i in range(winNumber):
    winFraction.append(i*1.0 / winNumber)

Look at the last five items of our new list:



In [ ]:

Now prepare the function!



In [ ]:

    
def calculateSES(filePath):
    countList = []
    with open(filePath, 'r') as covFile:
        for line in covFile:
            countList.append(int(line.strip('\n').split('\t')[3]))
    plt.plot(range(len(countList)), countList)
    plt.xlabel('Bin number')
    plt.ylabel('Bin coverage')
    plt.xlim([0, len(countList)])
    plt.show()
    countList.sort()
    countSum = sum(countList)
    countFraction = []
    for i, count in enumerate(countList):
        if i == 0:
            countFraction.append(count*1.0 / countSum)
        else:
            countFraction.append((count*1.0 / countSum) + countFraction[i-1])
    winNumber = len(countFraction)
    winFraction = []
    for i in range(winNumber):
        winFraction.append(i*1.0 / winNumber)
    return [winFraction, countFraction]

Use our function to calculate the signal extraction scaling for the Sox2 ChIP sample:



In [ ]:

    
chipSes = calculateSES("")

Now we can plot the calculated fractions for both the input and ChIP sample:



In [ ]:



In [ ]:

    
plt.plot(winFraction, countFraction, label='input')
plt.plot(chipSes[0], chipSes[1], label='Sox2 ChIP')
plt.ylim([0,1])
plt.xlabel('Ordered window franction')
plt.ylabel('Genome coverage fraction')
plt.legend(loc='best')
plt.show()