Set the matplotlib magic to notebook enable inline plots



In [1]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib

Calculate the Nonredundant Read Fraction (NRF)

SAM format example:

SRR585264.8766235       0       1       4       15      35M     *       0       0       CTTAAACAATTATTCCCCCTGCAAACATTTTCAAT     GGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGG    XT:A:U   NM:i:1  X0:i:1  X1:i:6  XM:i:1  XO:i:0  XG:i:0  MD:Z:8T26

Import the required modules



In [2]:

    
import subprocess
import matplotlib.pyplot as plt
import random
import numpy as np

Make figures prettier and biger



In [3]:

    
plt.style.use('ggplot')
figsize(10,5)

Parse the SAM file and extract the unique start coordinates.

First store the file name in the variable



In [4]:

    
file = "./bwa/input2.sorted.bam"

Next we read the file using samtools. From each read we need to store the flag, chromosome name and start coordinate.



In [5]:

    
p = subprocess.Popen(["samtools", "view", "-q10", "-F260", file],
                      stdout=subprocess.PIPE)
coords = []
for line in p.stdout:
    flag, ref, start = line.decode('utf-8').split()[1:4]
    coords.append([flag, ref, start])



In [36]:

    
coords[:3]









    Out[36]:





[[u'16', u'20', u'194'], [u'16', u'20', u'211'], [u'16', u'20', u'479']]

What is the total number of our unique reads?



In [6]:

    
len(coords)









    Out[6]:





1459281

Randomly sample the coordinates to get 1M for NRF calculations



In [7]:

    
random.seed(1234)
sample = random.sample(coords, 1000000)



In [8]:

    
len(sample)









    Out[8]:





1000000

How many of those coordinates are unique? (We will use the set python object which only the unique items.)



In [9]:

    
uniqueStarts = {'watson': set(), 'crick': set()}
for coord in sample:
    flag, ref, start = coord
    if int(flag) & 16:
        uniqueStarts['crick'].add((ref, start))
    else:
        uniqueStarts['watson'].add((ref, start))

How many on the Watson strand?



In [10]:

    
len(uniqueStarts['watson'])









    Out[10]:





462589

And on the Crick?



In [11]:

    
len(uniqueStarts['crick'])









    Out[11]:





451756

Calculate the NRF



In [12]:

    
NRF_input = (len(uniqueStarts['watson']) + len(uniqueStarts['crick']))*1.0/len(sample)
print(NRF_input)

Lets create a function from what we did above and apply it to all of our files!

To use our function on the real sequencing datasets (not only on a small subset) we need to optimize our method a bit- we will use python module called numpy.



In [13]:

    
def calculateNRF(filePath, pickSample=True, sampleSize=10000000, seed=1234):
    p = subprocess.Popen(['samtools', 'view', '-q10', '-F260', filePath],
                          stdout=subprocess.PIPE)
    coordType = np.dtype({'names': ['flag', 'ref', 'start'],
                          'formats': ['uint16', 'U10', 'uint32']})
    coordArray = np.empty(10000000, dtype=coordType)
    i = 0
    for line in p.stdout:
        if i >= len(coordArray):
            coordArray = np.append(coordArray, np.empty(1000000, dtype=coordType), axis=0)
        fg, rf, st = line.decode('utf-8').split()[1:4]
        coordArray[i] = np.array((fg, rf, st), dtype=coordType)
        i += 1
    coordArray = coordArray[:i]
    sample = coordArray
    if pickSample and len(coordArray) > sampleSize:
        np.random.seed(seed)
        sample = np.random.choice(coordArray, sampleSize, replace=False)
    uniqueStarts = {'watson': set(), 'crick': set()}
    for read in sample:
        flag, ref, start = read
        if flag & 16:
            uniqueStarts['crick'].add((ref, start))
        else:
            uniqueStarts['watson'].add((ref, start))
    NRF = (len(uniqueStarts['watson']) + len(uniqueStarts['crick']))*1.0/len(sample)
    return NRF

Calculate the NRF for the chip-seq sample



In [14]:

    
NRF_chip = calculateNRF("./bwa/sox2_chip.sorted.bam", sampleSize=1000000)
print(NRF_chip)

Plot the NRF!



In [15]:

    
plt.bar([0,2],[NRF_input, NRF_chip], width=1)
plt.xlim([-0.5,3.5]), plt.xticks([0.5, 2.5], ['Input', 'ChIP'])
plt.xlabel('Sample')
plt.ylabel('NRF')
plt.ylim([0, 1.25]), plt.yticks(np.arange(0, 1.2, 0.2))
plt.plot((-0.5,3.5), (0.8,0.8), 'red', linestyle='dashed')
plt.show()

Calculate the Signal Extraction Scaling

Load the results from the coverage calculations



In [17]:

    
countList = []
with open('./bedtools/input_coverage.bed', 'r') as covFile:
    for line in covFile:
        countList.append(int(line.strip('\n').split('\t')[3]))
countList[0:6]









    Out[17]:





[6, 15, 13, 19, 1, 6]



In [18]:

    
countList[-15:]









    Out[18]:





[26, 42, 21, 21, 9, 16, 16, 28, 25, 19, 13, 4, 18, 36, 20]

Lets see where do our reads align to the genome. Plot the distribution of tags along the genome.



In [19]:

    
plt.plot(range(len(countList)), countList)
plt.xlabel('Bin number')
plt.ylabel('Bin coverage')
plt.xlim([0, len(countList)])
plt.show()

Now sort the list- order the windows based on the tag count



In [20]:

    
countList.sort()



In [21]:

    
countList[0:6]









    Out[21]:





[0, 0, 0, 0, 0, 0]

Sum all the aligned tags



In [22]:

    
countSum = sum(countList)
countSum









    Out[22]:





1563897

Calculate the summaric fraction of tags along the ordered windows.



In [ ]:

    
countFraction = []
for i, count in enumerate(countList):
    if i == 0:
        countFraction.append(count*1.0 / countSum)
    else:
        countFraction.append((count*1.0 / countSum) + countFraction[i-1])

Look at the last five items of the list:



In [42]:

    
countFraction[-5:]









    Out[42]:





[0.9965374957558283,
 0.9968501761945017,
 0.9973303868477608,
 0.9984954252102616,
 1.0000000000000335]

Calculate the number of windows.



In [24]:

    
winNumber = len(countFraction)
winNumber









    Out[24]:





101267

Calculate what fraction of a whole is the position of each window.



In [40]:

    
winFraction = []
for i in range(winNumber):
    winFraction.append(i*1.0 / winNumber)

Look at the last five items of our new list:



In [41]:

    
winFraction[-5:]









    Out[41]:





[0.9999506255739777,
 0.9999605004591822,
 0.9999703753443866,
 0.999980250229591,
 0.9999901251147956]

Now prepare the function!



In [26]:

    
def calculateSES(filePath):
    countList = []
    with open(filePath, 'r') as covFile:
        for line in covFile:
            countList.append(int(line.strip('\n').split('\t')[3]))
    plt.plot(range(len(countList)), countList)
    plt.xlabel('Bin number')
    plt.ylabel('Bin coverage')
    plt.xlim([0, len(countList)])
    plt.show()
    countList.sort()
    countSum = sum(countList)
    countFraction = []
    for i, count in enumerate(countList):
        if i == 0:
            countFraction.append(count*1.0 / countSum)
        else:
            countFraction.append((count*1.0 / countSum) + countFraction[i-1])
    winNumber = len(countFraction)
    winFraction = []
    for i in range(winNumber):
        winFraction.append(i*1.0 / winNumber)
    return [winFraction, countFraction]

Use our function to calculate the signal extraction scaling for the Sox2 ChIP sample:



In [27]:

    
chipSes = calculateSES("./bedtools/sox2_chip_coverage.bed")

Now we can plot the calculated fractions for both the input and ChIP sample:



In [39]:

    
plt.plot(winFraction, countFraction, label='input')
plt.plot(chipSes[0], chipSes[1], label='Sox2 ChIP')
plt.ylim([0,1])
plt.xlabel('Ordered window franction')
plt.ylabel('Genome coverage fraction')
plt.legend(loc='best')
plt.show()