Peakcalling Peak Stats

This notebook is for the analysis of outputs from the peakcalling pipeline relating to the quality of the peakcalling steps.

There are several stats that you will want collected and graphed (the topics covered in this notebook are marked in bold).

These are:

  • Number of peaks called in each sample
  • Size distribution of the peaks
  • Number of reads in peaks
  • Location of peaks
  • Correlation of peaks between samples
  • Other stats of interest
  • IDR stats
  • Which peak lists are the best

This notebook takes the sqlite3 database created by CGAT peakcalling_pipeline.py and uses it to plot the statistics above.

It assumes the following directory layout:

    location of database = project_folder/csvdb

    location of this notebook = project_folder/notebooks.dir/
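
You can quickly confirm that the notebook is being run from the expected location with a check like the one below (a minimal sketch; it simply assumes the layout described above):


In [ ]:
# minimal sketch: check the expected project layout before running anything
# assumes this notebook lives in project_folder/notebooks.dir/ and the
# database is at project_folder/csvdb, as described above
import os
print('csvdb found' if os.path.exists('../csvdb') else 'csvdb NOT found - check your working directory')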

First, let's load everything that might be needed.


In [ ]:
import sqlite3

import os
import statistics

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

#import CGATPipelines.Pipeline as P
#import collections

# load R and the R packages required
#%load_ext rpy2.ipython
#%R require(ggplot2)

# use these imports to display tables nicely as HTML
from IPython.display import display, HTML

plt.style.use('bmh')
#plt.style.available

This is where we are and when the notebook was run:


In [ ]:
!pwd
!date

First, let's set the output path where we want our plots to be saved, set the database path, and see what tables the database contains.


In [ ]:
database_path = '../csvdb'
output_path = '.'
#database_path= "/ifs/projects/charlotteg/pipeline_peakcalling/csvdb"

This code allows you to show/hide the code cells in the HTML version.


In [ ]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

The code below provides functions for accessing the project database and extracting its table names, so you can see which tables have been loaded into the database and are available for plotting. It also has a function for getting a table from the database and indexing it by track name.


In [ ]:
def getTableNamesFromDB(database_path):
    '''returns the names of all tables in the SQLite database'''
    # create a SQL connection to our SQLite database
    con = sqlite3.connect(database_path)
    cur = con.cursor()
    # the result of a "cursor.execute" can be iterated over by row
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")
    available_tables = cur.fetchall()
    # be sure to close the connection
    con.close()
    return available_tables

db_tables = getTableNamesFromDB(database_path)
print('Tables contained by the database:')
for x in db_tables:
    print('\t\t%s' % x[0])


# this function retrieves a table from the SQL database and indexes it by track name
def getTableFromDB(statement, database_path):
    '''gets a table from the SQL database depending on the statement
    and sets track as the index if 'track' is in the column names'''
    conn = sqlite3.connect(database_path)
    df = pd.read_sql_query(statement, conn)
    conn.close()
    if 'track' in df.columns:
        df.index = df['track']
    return df
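
For example, once these helpers are defined you can pull any table printed above into a DataFrame and inspect its columns (here the 'design' table, which is also examined in the next section):


In [ ]:
# example usage of the helper: fetch a table and list its columns
# (replace 'design' with any table name printed above)
example_df = getTableFromDB('select * from design;', database_path)
print(example_df.columns.tolist())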

Design of Experiment

First, let's check the experimental design - this is specified in the design_file.tsv that is used to run the pipeline.

1) Let's get the table from the database


In [ ]:
design_df = getTableFromDB('select * from design;', database_path)
design_df

Now let's double-check which files peakcalling was performed on and whether they were paired with an input file. The input file is used in peakcalling to control for background noise. If the bamControl column contains 'None', an input control was not used for peakcalling.

Let's also double-check this in the 'peakcalling_bams_and_inputs' table that is used to generate the peakcalling statement:


In [ ]:
peakcalling_design_df = getTableFromDB('select * from peakcalling_bams_and_inputs;', database_path)
print('''peakcalling_bams_and_inputs table used to generate the peakcalling statement:
           ChIPBams = the file you want to call peaks in e.g. ChIP or ATAC-Seq sample. 
           InputBam = the sample used as the control in peakcalling. In ChIP-Seq this would be your input control\n''')
peakcalling_design_df

Check that the files are matched up correctly - if they are not, there is a bug in the peakcalling section of the pipeline.
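
As a quick programmatic check, a sketch like the one below flags any samples that were called without an input control. The column name 'InputBam' follows the printout above but is an assumption - check peakcalling_design_df.columns in your own database.


In [ ]:
# minimal sketch: flag samples peak-called without an input control
# 'InputBam' is an assumed column name - adjust it to match
# peakcalling_design_df.columns in your database
no_input = peakcalling_design_df[peakcalling_design_df['InputBam'].isin(['None', None])]
if len(no_input):
    print('Samples peak-called without an input control:')
    display(no_input)
else:
    print('All samples have an input control')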


Now let's look at the insert sizes that are calculated by macs2 (for PE samples) or bamtools (for SE reads).


In [ ]:
insert_df = getTableFromDB('select * from insert_sizes;', database_path)
insert_df
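
If you want to visualise these values, a minimal sketch along the lines below can help. The column name 'fragmentsize_mean' is an assumption - check insert_df.columns for the correct name in your pipeline version.


In [ ]:
# minimal sketch: bar plot of mean insert/fragment size per sample
# 'fragmentsize_mean' is an assumed column name - check insert_df.columns
if 'fragmentsize_mean' in insert_df.columns:
    ax = insert_df['fragmentsize_mean'].plot(kind='bar')
    ax.set_ylabel('mean fragment size (bp)')
    ax.set_title('mean insert size per sample')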

Let's also have a quick check of the number of reads and the number of fragments in our samples.


In [ ]:
peakcalling_frags_df = getTableFromDB('select * from post_filtering_check;',database_path)
peakcalling_frags_df = peakcalling_frags_df[['Input_Filename','total_reads']].copy()
peakcalling_frags_df['total_fragments'] = peakcalling_frags_df['total_reads'].divide(2)
peakcalling_frags_df

Now let's look at the peakcalling_summary table, which summarises the number of fragments and the number of peaks called for each file.


In [ ]:
peakcalling_summary_df = getTableFromDB('select * from peakcalling_summary;', database_path)
peakcalling_summary_df.rename(columns={'sample':'track'},inplace=True)
peakcalling_summary_df.index = peakcalling_summary_df['track']
peakcalling_summary_df.T

Is there any correlation between the number of peaks and the number of fragments? Let's plot this. Can you see any saturation, where an increase in fragment number does not result in any further gain in peak number?


In [ ]:
# both columns are scaled to thousands for readability
ax = peakcalling_summary_df[['number_of_peaks','fragment_treatment_total']].divide(1000).plot.scatter(
    x='fragment_treatment_total', y='number_of_peaks')
ax.set_xlabel('number of PE fragments (thousands)')
ax.set_ylabel('number of peaks (thousands)')
ax.set_title('correlation of number of fragments \n& number of peaks')
#ax.set_ylim((50000,160000))
#ax.set_xlim((20000000,70000000))
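
To put a number on this, you can compute the correlation coefficient directly from the same two columns (a minimal sketch):


In [ ]:
# Pearson correlation between fragment counts and peak counts across samples
r = peakcalling_summary_df['fragment_treatment_total'].corr(
    peakcalling_summary_df['number_of_peaks'])
print('Pearson r between fragments and peaks: %.2f' % r)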

The code below provides a comparison with a published dataset, which you can look at if you want to.


In [ ]:
#greenleaf_data = pd.read_csv('/Users/charlotteg/Documents/7_BassonProj/Mar17/allelic-atac-seq.csv')
#greenleaf_data.drop([0],inplace=True)
#greenleaf_data['total usable reads'] = greenleaf_data['total usable reads'] / 2
#ax = greenleaf_data.plot.scatter(x='total usable reads', y='# of allelic informative(AI) peaks (>=10 reads)')
#ax.set_ylim((50000,160000))
#ax.set_xlim((20000000,70000000))
#greenleaf_data

In [ ]:
#factor between number of reads and number of peaks

Now let's look at the ratio of fragments to peaks called.


In [ ]:
df = peakcalling_summary_df[['number_of_peaks','fragment_treatment_total']].copy()
df['frag_to_peaks'] = peakcalling_summary_df.fragment_treatment_total / peakcalling_summary_df.number_of_peaks
df
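
A bar plot of this ratio makes outlier samples easy to spot - samples that need many more fragments per called peak may have poorer signal-to-noise. A minimal sketch, using only the frag_to_peaks column computed above:


In [ ]:
# plot fragments-per-peak for each sample; unusually high bars suggest
# samples where peakcalling was less efficient
ax = df['frag_to_peaks'].plot(kind='bar')
ax.set_ylabel('fragments per peak')
ax.set_title('number of fragments per called peak')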

Plot a bar graph of the number of peaks:


In [ ]:
peakcalling_summary_df['number_of_peaks'].plot(kind='bar')

SUMMARISE HERE

From these plots you should be able to tell whether there are any distinctive relationships between the number of fragments/reads and the number of peaks. You should also get a good idea of how many peaks are being detected in peakcalling, which can indicate whether the experiment has worked. It is strongly recommended to look at these peaks, along with the bigWig files of the bams used for peakcalling, in a genome browser so you can assess whether peaks are being called correctly.