Peakcalling Bam Stats and Filtering Report - Filtering Stats

This notebook is for the analysis of outputs from the peakcalling pipeline

There are several stats that you will want collected and graphed (topics covered in this notebook are shown in bold).

These are:

  • how many reads input
  • how many reads removed at each step (numbers and percentages)
  • how many reads left after filtering
  • insert size distribution pre filtering for PE reads
  • how many reads mapping to each chromosome before filtering?
  • how many reads mapping to each chromosome after filtering?
  • X:Y reads ratio
  • samtools flags - check how many reads are in categories they shouldn't be
  • picard stats - check how many reads are in categories they shouldn't be

This notebook takes the sqlite3 database created by CGAT peakcalling_pipeline.py and uses it for plotting the above statistics

It assumes a file directory of:

    location of database = project_folder/csvdb

    location of this notebook = project_folder/notebooks.dir/
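A minimal sanity check of that assumed layout (the relative path below matches the default set later in this notebook, but where you actually run it from is an assumption):

```python
import os

# The notebook is assumed to live in project_folder/notebooks.dir/,
# so the database should sit one directory up.
database_path = '../csvdb'
print('database resolves to:', os.path.abspath(database_path))
print('database exists:', os.path.exists(database_path))
```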

In [ ]:
import sqlite3

import pandas as pd
import numpy as np
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
import CGATPipelines.Pipeline as P
import os
import statistics
import collections
#load R and the R packages required

#  use these functions to display tables nicely as html 
from IPython.display import display, HTML
plt.style.use('ggplot')

In [ ]:
%load_ext rpy2.ipython
%R require(ggplot2)

This is where and when the notebook was run


In [ ]:
!pwd
!date

First let's set the output path where we want our plots to be saved, set the database path, and then see what tables the database contains


In [ ]:
database_path = '../csvdb'
output_path = '.'

This code adds a button to see/hide code in html


In [ ]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

The code below provides functions for accessing the project database and extracting the table names, so you can see which tables have been loaded into the database and are available for plotting. It also has a function for getting a table from the database and indexing it by track name.


In [ ]:
def getTableNamesFromDB(database_path):
    # Create a SQL connection to our SQLite database
    con = sqlite3.connect(database_path)
    cur = con.cursor()
    # the result of a "cursor.execute" can be iterated over by row
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")
    available_tables = (cur.fetchall())
    #Be sure to close the connection.
    con.close()
    return available_tables

db_tables = getTableNamesFromDB(database_path)
print('Tables contained by the database:')
for x in db_tables: 
    print('\t\t%s' % x[0])
    
#This function retrieves a table from sql database and indexes it with track name
def getTableFromDB(statement,database_path):
    '''gets a table from the sql database according to the statement
    and sets 'track' as the index if it appears in the column names'''
    conn = sqlite3.connect(database_path)
    df = pd.read_sql_query(statement,conn)
    if 'track' in df.columns:
        df.index = df['track']
    return df

Number of reads per sample

First, let's look at the size of our bam files pre and post filtering - hopefully the post-filtering bams will be smaller, showing that some filtering has taken place


In [ ]:
#get table of bam file size 
filtering_stats = getTableFromDB('select * from post_filtering_read_counts;',database_path)
filtering_stats.index = filtering_stats['Input_Bam']
filtering_stats.drop('Input_Bam', axis=1, inplace=True)

#sort dataframe by values in rows to get order filters were applied 
#this is based on the number of reads in each row
new_cols = filtering_stats.columns[filtering_stats.loc[filtering_stats.last_valid_index()].argsort()]
filtering_stats = filtering_stats[new_cols[::-1]]

#get number of reads in the bams before and after filtering - smallest_col = last filtering step applied
smallest_col = filtering_stats.idxmin(axis=1).iloc[1]

#plot bar graph of pre vs post filtering sizes
ax = filtering_stats[['pre_filtering',smallest_col]].divide(1000000).plot.bar()
ax.set_ylabel('Million Reads (not pairs)')
ax.legend(['pre_filtering','post_filtering'], loc=2,bbox_to_anchor=(1.05, 1),borderaxespad=0. )
ax.set_title('number of reads (not pairs) pre and post filtering')

This should give you a good idea of:

1) whether filtering has been applied - if post and pre filtering differ it has!
2) whether the proportion of filtering corresponds to the initial size of library
3) whether the proportion of filtering is consistent across samples
4) final bam size that is being taken forward to peakcalling - the requirements will differ for different technologies 

Encode ATAC-Seq Guidelines (Jan 2017): each replicate should have 25 million non-duplicate, non-mitochondrial aligned reads for single-end sequencing, or 50 million for paired-end (i.e. there should be 25 million fragments regardless of sequencing type).
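As a rough sketch of checking that guideline, one could compare each sample's post-filtering read count against the relevant threshold. The sample names and counts below are made up; in practice you would use the post-filtering column of `filtering_stats`:

```python
import pandas as pd

# Hypothetical post-filtering read counts (individual reads, not pairs)
post_filtering = pd.Series({'sample1': 62_000_000, 'sample2': 41_000_000})

# ENCODE ATAC-seq guideline (Jan 2017): 50 million filtered reads for
# paired-end data; use 25_000_000 instead for single-end data
threshold = 50_000_000
meets_guideline = post_filtering >= threshold
print(meets_guideline)
```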


In [ ]:
#get the order of filters applied
def get_filter_order(dataframe):
    '''function to print out the order of filters in dataframe'''
    print('order of filters applied to bam file:')
    for x in list(dataframe):
        if x != 'pre_filtering':
            print ('\t%s' % x)
    return list(dataframe)

filter_order = get_filter_order(filtering_stats)

In [ ]:
print('Table of number of reads remaining at each stage of filtering')
display(filtering_stats.T)

Let's graph the number of reads that remain at each step for each bam file


In [ ]:
#plot how the reads have been filtered 
ax = filtering_stats.T.divide(1000000).plot(rot=90)
ax.set_xlabel('filters')
ax.set_ylabel('million reads (not pairs)')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title('number of reads remaining at\neach stage of filtering')

Now let's look at the number of reads filtered at each step side by side - this uses R for plotting


In [ ]:
filtered_df = filtering_stats.copy()
filtered_df = filtered_df.divide(1000000)
filtered_df['Name'] = filtered_df.index

In [ ]:
%%R -i filtered_df -w 600 -h 600 -u px
library("reshape2")
filtered_df$Name <- factor(filtered_df$Name)

df.m = melt(filtered_df)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
## pink #CC79A7 # orange #D55E00 # 0072B2 blue # yellow #F0E442 # green #009E73 # light blue

g = ggplot(data=df.m, aes(factor(Name), y=value,fill=variable)) + labs(title="Number of individual reads remaining after each filtering step") + geom_bar(stat="identity",position="dodge", width=0.7)  +  scale_fill_manual(values=cbPalette) + theme_bw()
g + scale_y_continuous(name="million reads remaining \n (individual reads not pairs)") + theme(plot.title=element_text(size=16, hjust=0.5, face='bold'), legend.position="top",axis.text=element_text(size=15,face='bold'),axis.text.x=element_text(size=15,face='bold',angle=90),axis.title.x=element_blank(),axis.title.y=element_text(size=10,face='bold'))

Now let's have a look at the percentage of reads remaining at each stage of filtering

The percentage dataframe is created in python but uses R and ggplot for plotting


In [ ]:
#Make percentage of reads dataframe 
percentage_filtered_df = filtering_stats.copy()    
percentage_filtered_df = percentage_filtered_df.div(percentage_filtered_df.pre_filtering, axis='index')*100
percentage_filtered_df = percentage_filtered_df.round(3)
percentage_filtered_df['Name']= percentage_filtered_df.index
print('Table showing the percentage of reads remaining at each filtering step')
percentage_filtered_df.T

In [ ]:
%%R -i percentage_filtered_df -w 600 -h 600 -u px
library("reshape2")
percentage_filtered_df$Name <- factor(percentage_filtered_df$Name)

df.m = melt(percentage_filtered_df)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
## pink #CC79A7 # orange #D55E00 # 0072B2 blue # yellow #F0E442 # green #009E73 # light blue

g = ggplot(data=df.m, aes(factor(Name), y=value,fill=variable)) + labs(title="Percentage of reads remaining after each filtering step") + geom_bar(stat="identity",position="dodge", width=0.7)  +  scale_fill_manual(values=cbPalette) + theme_bw()
g + scale_y_continuous(name="Percentage reads remaining") + theme(plot.title=element_text(size=16, hjust=0.5, face='bold'), legend.position="top",axis.text=element_text(size=15,face='bold'),axis.text.x=element_text(size=15,face='bold',angle=90),axis.title.x=element_blank(),axis.title.y=element_text(size=10,face='bold'))

Now let's get the number of reads filtered out at each stage by subtracting the number of reads remaining after the filter of interest from the number remaining at the stage before that filter was applied
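The per-step subtraction in the loop below can also be done in a single call with pandas' `diff` along the columns, assuming (as the loop does) that the columns run from `pre_filtering` to the last filter applied. A sketch on toy data:

```python
import pandas as pd

# Toy read counts; columns ordered from pre_filtering to the last filter
counts = pd.DataFrame({'pre_filtering': [100, 200],
                       'mapping': [90, 180],
                       'duplicates': [80, 150]},
                      index=['sample1', 'sample2'])

# diff(axis=1) gives each column minus the previous column; negating it
# yields the number of reads removed by each filtering step
removed = -counts.diff(axis=1).drop(columns='pre_filtering')
print(removed)
```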


In [ ]:
#Get number of reads removed by each stage of filtering 
order_of_filters = get_filter_order(filtering_stats)

df_reads_removed = pd.DataFrame(index=filtering_stats.index)
for loc in range(len(order_of_filters)): 
    filt = order_of_filters[loc]
    if filt == 'pre_filtering':
        df_reads_removed['total_reads'] = filtering_stats['pre_filtering']
    else:
        previous_filter_step = order_of_filters[loc-1]
        #print("calculating number removed by the %s filtering step: number of reads in %s - number of reads in %s \n" % (filt, previous_filter_step, filt))
        df_reads_removed['removed_by_%s_filter' % filt] = filtering_stats[previous_filter_step] - filtering_stats[filt]

print('\n\nTable shown as million reads removed by each filter:')
display(df_reads_removed.T.divide(1000000))
#plot how the reads have been filtered 

ax = df_reads_removed.divide(1000000).plot(rot=90)
ax.set_xlabel('filters')
ax.set_ylabel('million reads removed (not pairs)')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title('number of reads removed at each stage of filtering')

ax = df_reads_removed.T.divide(1000000).plot(rot=90)
ax.set_xlabel('filters')
ax.set_ylabel('million reads removed (not pairs)')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title('number of reads removed at each stage of filtering')

ax = df_reads_removed.T.divide(1000000).drop('total_reads').plot(rot=90,kind='bar')
ax.set_xlabel('filters')
ax.set_ylabel('million reads removed (not pairs)')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title('number of reads removed at each stage of filtering')

Now let's plot the number of reads removed at each filtering step side by side


In [ ]:
df_reads_removed_mills = df_reads_removed.divide(1000000)
df_reads_removed_mills['Name'] = df_reads_removed_mills.index

In [ ]:
%%R -i df_reads_removed_mills -w 900 -h 800 -u px
library("reshape2")
df_reads_removed_mills$Name <- factor(df_reads_removed_mills$Name)

df.m = melt(df_reads_removed_mills)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
## pink #CC79A7 # orange #D55E00 # 0072B2 blue # yellow #F0E442 # green #009E73 # light blue

g = ggplot(data=df.m, aes(factor(Name), y=value,fill=variable)) + labs(title="Number of reads removed by each filtering step") + geom_bar(stat="identity",position="dodge", width=0.7)  +  scale_fill_manual(values=cbPalette) + theme_bw()
g + scale_y_continuous(name="Million reads removed at each step") + theme(plot.title=element_text(size=16, hjust=0.5, face='bold'), legend.position='top',axis.text=element_text(size=15,face='bold'),axis.text.x=element_text(size=15,face='bold',angle=90),axis.title.x=element_blank(),axis.title.y=element_text(size=10,face='bold'))

Now let's get the percentage of reads removed at each filtering step


In [ ]:
#Get number of reads removed by each stage of filtering 
percentage_filtered_df = percentage_filtered_df.drop('Name',axis=1)
order_of_filters = get_filter_order(percentage_filtered_df)

df_percentreads_removed = pd.DataFrame(index=percentage_filtered_df.index)
for loc in range(len(order_of_filters)): 
    filt = order_of_filters[loc]
    
    if filt == 'pre_filtering':
        df_percentreads_removed['total_reads'] = percentage_filtered_df['pre_filtering']
    else:
        previous_filter_step = order_of_filters[loc-1]
        #print("calculating number removed by the %s filtering step: percentage of reads in %s - percentage of reads in %s \n" % (filt, previous_filter_step, filt))
        df_percentreads_removed['removed_by_%s_filter' % filt] = percentage_filtered_df[previous_filter_step] - percentage_filtered_df[filt]

print('\n\nTable showing the percentage of reads removed by each filter:')
display(df_percentreads_removed.T)
#plot how the reads have been filtered 

ax = df_percentreads_removed.plot(rot=90)
ax.set_xlabel('bam file')
ax.set_ylabel('percentage reads removed (not pairs)')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title('percentage of reads removed at each stage of filtering')

ax = df_percentreads_removed.T.plot(rot=90)
ax.set_xlabel('filters')
ax.set_ylabel('percentage reads removed (not pairs)')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title('percentage of reads removed at each stage of filtering')

ax = df_percentreads_removed.T.drop('total_reads').plot(rot=90,kind='bar')
ax.set_xlabel('filters')
ax.set_ylabel('percentage reads removed (not pairs)')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title('percentage of reads removed at each stage of filtering')

In [ ]:
df_percentreads_removed['Name'] = df_percentreads_removed.index

In [ ]:
%%R -i df_percentreads_removed -w 900 -h 800 -u px
library("reshape2")
df_percentreads_removed$Name <- factor(df_percentreads_removed$Name)

df.m = melt(df_percentreads_removed)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
## pink #CC79A7 # orange #D55E00 # 0072B2 blue # yellow #F0E442 # green #009E73 # light blue

g = ggplot(data=df.m, aes(factor(Name), y=value,fill=variable)) + labs(title="Percentage of reads removed by each filtering step") + geom_bar(stat="identity",position="dodge", width=0.7)  +  scale_fill_manual(values=cbPalette) + theme_bw()
g + scale_y_continuous(name="Percentage of reads removed at each step") + theme(plot.title=element_text(size=16, hjust=0.5, face='bold'), legend.position='top',axis.text=element_text(size=15,face='bold'),axis.text.x=element_text(size=15,face='bold',angle=90),axis.title.x=element_blank(),axis.title.y=element_text(size=10,face='bold'))

Great - that's all the filtering stats done. By now you should have a good idea about:

  • the number of reads present in your original bam file
  • the number of reads left after filtering
  • the proportion of reads filtered by each filter
  • whether the proportion of reads filtered at each stage looks quite consistent across samples

Calculate the nonredundant fraction (NRF)

NRF = Number of distinct uniquely mapping reads (after removing duplicates)/Total number of reads

for ChIP-Seq

  • NRF < 0.5 = Concerning
  • 0.8 > NRF > 0.5 = Acceptable
  • 0.9 > NRF > 0.8 = Compliant
  • NRF > 0.9 = Ideal

for ATAC-Seq

  • NRF < 0.7 = Concerning
  • 0.9 > NRF > 0.7 = Acceptable
  • NRF > 0.9 = Ideal
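Those bands can be wrapped in a small helper for labelling samples. The function name and the exact handling of the boundary values are illustrative, not part of the pipeline:

```python
def classify_nrf(nrf, assay='chip'):
    '''Map an NRF value onto the quality bands listed above.'''
    if assay == 'chip':
        bands = [(0.9, 'Ideal'), (0.8, 'Compliant'), (0.5, 'Acceptable')]
    else:  # 'atac'
        bands = [(0.9, 'Ideal'), (0.7, 'Acceptable')]
    # walk down from the strictest threshold to the loosest
    for threshold, label in bands:
        if nrf > threshold:
            return label
    return 'Concerning'

print(classify_nrf(0.85, 'chip'))  # Compliant
print(classify_nrf(0.60, 'atac'))  # Concerning
```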

In [ ]:
filtering_stats['NRF'] = filtering_stats.duplicates/filtering_stats.pre_filtering
filtering_stats