Peakcalling Bam Stats and Filtering Report - Insert Sizes

This notebook is for the analysis of outputs from the peakcalling pipeline

There are several stats that you want collected and graphed (the topics covered in this notebook are the insert size distributions).

These are:

  • how many reads input
  • how many reads removed at each step (numbers and percentages)
  • how many reads left after filtering
  • insert size distribution pre-filtering for PE reads
  • how many reads mapping to each chromosome before filtering?
  • how many reads mapping to each chromosome after filtering?
  • X:Y reads ratio
  • insert size distribution after filtering for PE reads
  • samtools flags - check how many reads are in categories they shouldn't be
  • picard stats - check how many reads are in categories they shouldn't be

This notebook takes the sqlite3 database created by CGAT peakcalling_pipeline.py and uses it for plotting the above statistics

It assumes the following directory layout:

    location of database = project_folder/csvdb

    location of this notebook = project_folder/notebooks.dir/
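
As a quick sanity check (a minimal sketch that only assumes the layout described above), you can confirm the expected paths before running anything else:


In [ ]:
import os

# Check the assumed project layout: csvdb should sit one directory above
# this notebook (project_folder/csvdb alongside project_folder/notebooks.dir/).
print('notebook directory:', os.getcwd())
print('database found:', os.path.exists('../csvdb'))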

Firstly, let's load everything that might be needed

Insert size distribution

This section looks at the size distribution of the fragments that have been sequenced in paired-end sequencing. The pipeline calculates the size distribution from the distance between the most 5' positions of the two reads in a pair: for reads mapping to the + strand this is the leftmost position, and for reads mapping to the - strand it is the rightmost coordinate.
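
To make that calculation concrete, here is a minimal illustrative sketch (plain Python, not the pipeline's actual code) of how a fragment size can be derived from the 5' positions of the two mates of a properly paired read:


In [ ]:
def fragment_size(plus_strand_leftmost, minus_strand_rightmost):
    '''Insert size as the distance between the most 5' positions of the two
    mates: the leftmost coordinate of the + strand read and the rightmost
    coordinate of the - strand read.'''
    return minus_strand_rightmost - plus_strand_leftmost

# e.g. a forward mate starting at position 1000 paired with a reverse mate
# whose rightmost aligned base is at 1180 gives a 180 bp fragment
print(fragment_size(1000, 1180))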

This plot is especially useful for ATAC-Seq experiments: good samples should show peaks with a period approximately equivalent to the length of a nucleosome (~146 bp). A lack of this phasing might indicate poor quality samples, with either over-transposition (if there are lots of small fragments) or under-transposition (if there is an excess of large fragments) by the transposase.


In [ ]:
import sqlite3

import pandas as pd
import numpy as np
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
#import CGATPipelines.Pipeline as P
import os
import statistics
#import collections
#load R and the R packages required
#%load_ext rpy2.ipython
#%R require(ggplot2)

#  use these functions to display tables nicely as html 
from IPython.display import display, HTML
plt.style.use('ggplot')
#plt.style.available

This is where we are and when the notebook was run


In [ ]:
!pwd
!date

First, let's set the output path where we want our plots to be saved and the database path, then see what tables the database contains


In [ ]:
database_path = '../csvdb'
output_path = '.'
#database_path= "/ifs/projects/charlotteg/pipeline_peakcalling/csvdb"

This code adds a button to show/hide the code in the HTML version of this notebook


In [ ]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

The code below provides functions for accessing the project database and extracting the table names, so you can see which tables have been loaded into the database and are available for plotting. It also has a function for getting a table from the database and indexing it by track name


In [ ]:
def getTableNamesFromDB(database_path):
    # Create a SQL connection to our SQLite database
    con = sqlite3.connect(database_path)
    cur = con.cursor()
    # the result of a "cursor.execute" can be iterated over by row
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")
    available_tables = (cur.fetchall())
    #Be sure to close the connection.
    con.close()
    return available_tables

db_tables = getTableNamesFromDB(database_path)
print('Tables contained by the database:')
for x in db_tables: 
    print('\t\t%s' % x[0])
    
#This function retrieves a table from sql database and indexes it with track name
def getTableFromDB(statement,database_path):
    '''gets table from sql database depending on statement
    and set track as index if contains track in column names'''
    conn = sqlite3.connect(database_path)
    df = pd.read_sql_query(statement,conn)
    if 'track' in df.columns:
        df.index = df['track']
    return df

Insert Size Summary

1) Let's get the insert_sizes table from the database

Firstly, let's look at the summary statistics: the mean fragment size, sequencing type and mean read length. This table is produced using macs2 for PE data, or bamtools for SE data

If IDR has been run, the insert_sizes table will also contain entries for the pooled and pseudo replicates. We don't really want these as they would duplicate the data from the original samples, so we subset them out


In [ ]:
insert_df = getTableFromDB('select * from insert_sizes;',database_path)
insert_df = insert_df[insert_df["filename"].str.contains('pseudo')==False].copy()
insert_df = insert_df[insert_df["filename"].str.contains('pooled')==False].copy()

In [ ]:
def add_expt_to_insertdf(dataframe):
    '''Splits a track name (for example HsTh1-RATotal-R1.star) into its
    design features (expt, sample_treatment and replicate) and adds
    these as columns to the dataframe.'''
    expt = []
    treatment = []
    replicate = []
    for value in dataframe.filename:
        x = value.split('/')[-1]
        x = x.split('_insert')[0]
        # split into design features
        y = x.split('-')
        expt.append(y[-3])
        treatment.append(y[-2])
        replicate.append(y[-1])

    if len(expt) == len(treatment) and len(expt) == len(replicate):
        print('all values loaded into the lists correctly')
    else:
        print('error in loading values into lists')

    # add columns to the dataframe
    dataframe['expt_name'] = expt
    dataframe['sample_treatment'] = treatment
    dataframe['replicate'] = replicate

    return dataframe

insert_df = add_expt_to_insertdf(insert_df)
insert_df

Let's graph the mean fragment length and the tag size grouped by sample treatment so we can see whether they differ much between samples


In [ ]:
ax = insert_df.boxplot(column='fragmentsize_mean', by='sample_treatment')
ax.set_title('mean fragment size', size=10)
ax.set_ylabel('mean fragment length')
ax.set_xlabel('sample treatment')

ax = insert_df.boxplot(column='tagsize', by='sample_treatment')
ax.set_title('tag size', size=10)
ax.set_ylabel('tag size')
ax.set_xlabel('sample treatment')
ax.set_ylim(((insert_df.tagsize.min()-2),(insert_df.tagsize.max()+2)))

Ok, now let's get the fragment length distributions for each sample and plot them


In [ ]:
def getFraglengthTables(database_path):
    '''Takes the path to the sqlite3 database and retrieves the fraglengths
    tables for the individual samples; returns a list of
    (table_name, fraglengths dataframe) tuples.'''
    frag_tabs = []
    db_tables = getTableNamesFromDB(database_path)
    for table_name in db_tables:
        if 'fraglengths' in str(table_name[0]):
            tab_name = str(table_name[0])
            statement ='select * from %s;' % tab_name
            df = getTableFromDB(statement,database_path)
            frag_tabs.append((tab_name,df))
    print('detected fragment length distribution tables for %s files: \n' % len(frag_tabs))
    for val in frag_tabs:
        print(val[0])
    return frag_tabs

def getDFofFragLengths(database_path):
    '''Takes the path to the database and returns a dataframe where fragment
    length is the index, each column is a sample and the values are the number
    of reads with that fragment length in that sample.
    '''
    fraglength_dfs_list = getFraglengthTables(database_path)
    dfs=[]
    for item in fraglength_dfs_list:
        track = item[0].split('_filtered_fraglengths')[0]
        df = item[1]
        #rename columns so that they are correct - correct this in the pipeline then delete this
        #df.rename(columns={'frequency':'frag_length', 'frag_length':'frequency'}, inplace=True)
        df.index = df.frag_length
        df.drop('frag_length',axis=1,inplace=True)
        df.rename(columns={'frequency':track},inplace=True)
        dfs.append(df)
    
    frag_length_df = pd.concat(dfs,axis=1)
    frag_length_df.fillna(0, inplace=True)
    return frag_length_df


#Note: the frequency and fragment length columns are the wrong way around!
#frequency is actually fragment length, and fragment length is the frequency

#This gets the tables from db and makes master df of all fragment length frequencies 
frag_length_df = getDFofFragLengths(database_path)

#plot fragment length frequencies 
ax = frag_length_df.divide(1000).plot()
ax.set_ylabel('Number of fragments\n(thousands)')
ax.legend(loc=2,bbox_to_anchor=(1.05, 1),borderaxespad=0. )
ax.set_title('fragment length distribution')
ax.set_xlabel('fragment length (bp)')

Now let's zoom in on the interesting region of the plot. By default the code below looks at fragment lengths from 0 to 800 bp; you can change this by editing the tuple passed to ax.set_xlim()


In [ ]:
ax = frag_length_df.divide(1000).plot(figsize=(9,9))
ax.set_ylabel('Number of fragments\n(thousands)')
ax.legend(loc=2,bbox_to_anchor=(1.05, 1),borderaxespad=0. )
ax.set_title('fragment length distribution')
ax.set_xlabel('fragment length (bp)')
ax.set_xlim((0,800))

It is a bit tricky to see differences between samples with different library sizes, so let's normalise each sample to percentages and check whether the proportion of reads at each fragment length is similar


In [ ]:
percent_frag_length_df = pd.DataFrame(index=frag_length_df.index)

for column in frag_length_df:
    total_frags = frag_length_df[column].sum()
    percent_frag_length_df[column] = frag_length_df[column].divide(total_frags)*100
    

ax = percent_frag_length_df.plot(figsize=(9,9))
ax.set_ylabel('Percentage of fragments')
ax.legend(loc=2,bbox_to_anchor=(1.05, 1),borderaxespad=0. )
ax.set_title('percentage fragment length distribution')
ax.set_xlabel('fragment length (bp)')
ax.set_xlim((0,800))

SUMMARISE HERE

From these plots you should be able to tell whether there are any distinctive patterns in the fragment length sizes. This is especially important for ATAC-Seq data: in successful experiments you should be able to detect nucleosome phasing, and the plots can also indicate over-fragmentation or biases in cutting.
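
As a rough, purely illustrative follow-up (the size cut-offs below are assumptions for a crude ATAC-Seq check, not pipeline defaults), you can summarise what percentage of each sample's fragments falls in the nucleosome-free and mono-nucleosome size ranges:


In [ ]:
# Crude summary of the percentage fragment length distribution computed above:
# nucleosome-free (< 100 bp) vs mono-nucleosome (180-247 bp) fragments.
# These boundaries are illustrative assumptions only.
frag_len = pd.to_numeric(percent_frag_length_df.index)
nucleosome_free = percent_frag_length_df.loc[frag_len < 100].sum()
mono_nucleosome = percent_frag_length_df.loc[(frag_len >= 180) & (frag_len <= 247)].sum()
pd.DataFrame({'percent_nucleosome_free': nucleosome_free,
              'percent_mono_nucleosome': mono_nucleosome})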

Let's also look at the Picard insert size metrics


In [ ]:
insert_df = getTableFromDB('select * from picard_stats_insert_size_metrics;',database_path)
for c in insert_df.columns:
    print(c)
insert_df

These metrics are actually quite different to the ones we calculate ourselves; for some reason Picard seems to split each file in two and gives one distribution for smaller fragments and one for larger fragments. It is not clear why at the moment.
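
To dig into this, one option (a sketch only: the column names below are Picard's standard headers, so adjust them to match whatever the column listing printed above actually shows) is to pull out the headline columns so they can be compared with fragmentsize_mean from the insert_sizes table plotted earlier:


In [ ]:
# Sketch: select the headline Picard insert size columns for comparison with
# the pipeline's own fragmentsize_mean values. Assumes Picard's standard
# column names (MEAN_INSERT_SIZE, MEDIAN_INSERT_SIZE, PAIR_ORIENTATION);
# adjust if csvdb stores them with different casing.
wanted = ('MEAN_INSERT_SIZE', 'MEDIAN_INSERT_SIZE', 'PAIR_ORIENTATION')
picard_cols = [c for c in insert_df.columns if c.upper() in wanted]
insert_df[picard_cols]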