This notebook analyses outputs from the peakcalling pipeline relating to the quality of the peakcalling steps.
There are several stats that we want collected and graphed (topics covered in this notebook in bold).
These are:
other things?
IDR stats
This notebook takes the sqlite3 database created by CGAT peakcalling_pipeline.py and uses it to plot the statistics above.
It assumes a directory layout of:
location of database = project_folder/csvdb
location of this notebook = project_folder/notebooks.dir/
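Given that layout, the database path relative to the notebook can be derived programmatically. A minimal sketch (the `resolve_paths` helper is hypothetical, not part of the pipeline; it just encodes the layout stated above):

```python
import os

def resolve_paths(project_folder):
    """Return the expected database and notebook directory paths,
    assuming the layout described above: csvdb at the project root
    and this notebook under notebooks.dir/."""
    database_path = os.path.join(project_folder, 'csvdb')
    notebook_dir = os.path.join(project_folder, 'notebooks.dir')
    return database_path, notebook_dir

db, nb = resolve_paths('project_folder')
print(db)  # project_folder/csvdb
```

From inside `notebooks.dir/` this is why the notebook below simply uses the relative path `../csvdb`.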
Firstly, let's load all the modules that might be needed.
In [ ]:
import sqlite3
import os
import statistics
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
#import CGATPipelines.Pipeline as P
#import collections

# load R and the R packages required
#%load_ext rpy2.ipython
#%R require(ggplot2)

# use these functions to display tables nicely as html
from IPython.display import display, HTML
plt.style.use('bmh')
#plt.style.available
This is where we are and when the notebook was run
In [ ]:
!pwd
!date
First, let's set the output path where we want our plots to be saved and the database path, and see what tables the database contains.
In [ ]:
database_path = '../csvdb'
output_path = '.'
#database_path= "/ifs/projects/charlotteg/pipeline_peakcalling/csvdb"
This code allows you to show/hide the code in the html version.
In [ ]:
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
The code below provides functions for accessing the project database and extracting table names, so you can see what tables have been loaded into the database and are available for plotting. It also has a function for getting a table from the database and indexing it by track name.
In [ ]:
def getTableNamesFromDB(database_path):
    # Create a SQL connection to our SQLite database
    con = sqlite3.connect(database_path)
    cur = con.cursor()
    # the result of a "cursor.execute" can be iterated over by row
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")
    available_tables = cur.fetchall()
    # Be sure to close the connection
    con.close()
    return available_tables

db_tables = getTableNamesFromDB(database_path)
print('Tables contained by the database:')
for x in db_tables:
    print('\t\t%s' % x[0])

# This function retrieves a table from the sql database and indexes it with the track name
def getTableFromDB(statement, database_path):
    '''gets a table from the sql database depending on statement
    and sets track as the index if "track" is in the column names'''
    conn = sqlite3.connect(database_path)
    df = pd.read_sql_query(statement, conn)
    conn.close()
    if 'track' in df.columns:
        df.index = df['track']
    return df
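To show how these helpers behave, here is a self-contained sketch that builds a throwaway database with a toy `design` table (the real csvdb is produced by the pipeline; the column names here are illustrative, and the helper is repeated so the cell runs on its own):

```python
import os
import sqlite3
import tempfile
import pandas as pd

def getTableFromDB(statement, database_path):
    '''gets a table from the sql database depending on statement
    and sets track as the index if present (as defined above)'''
    conn = sqlite3.connect(database_path)
    df = pd.read_sql_query(statement, conn)
    conn.close()
    if 'track' in df.columns:
        df.index = df['track']
    return df

# build a throwaway database with a toy "design" table
tmp = tempfile.NamedTemporaryFile(suffix='.db', delete=False)
tmp.close()
con = sqlite3.connect(tmp.name)
con.execute('CREATE TABLE design (track TEXT, bamControl TEXT)')
con.executemany('INSERT INTO design VALUES (?, ?)',
                [('sample1', 'input1'), ('sample2', 'None')])
con.commit()
con.close()

df = getTableFromDB('select * from design;', tmp.name)
print(df.index.tolist())  # ['sample1', 'sample2'] (indexed by track)
os.unlink(tmp.name)
```

Because the `track` column is present, the returned frame is indexed by track, which is what makes the per-sample lookups below convenient.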
1) Let's get the design table from the database
In [ ]:
design_df= getTableFromDB('select * from design;',database_path)
design_df
Now let's double-check which files peakcalling was performed on and whether they were paired with an input file. The input file is used in peakcalling to control for background noise. If the bamControl column contains 'None', then an input control was not used for peakcalling.
Let's also double-check this in the 'peakcalling_bams_and_inputs' table that is used to generate the peakcalling statement:
In [ ]:
peakcalling_design_df= getTableFromDB('select * from peakcalling_bams_and_inputs;',database_path)
print ('''peakcalling_bams_and_inputs table used to generate the peakcalling statement:
ChIPBams = the file you want to call peaks in e.g. ChIP or ATAC-Seq sample.
InputBam = the sample used as the control in peakcalling. In ChIP-Seq this would be your input control\n''')
peakcalling_design_df
Check that the files are matched up correctly; if they are not, there is a bug in the peakcalling section of the pipeline.
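One way to check this programmatically is to flag samples with no input control. A minimal sketch on a toy stand-in for the pairing table (the column names follow the description above but may differ in your database, so adjust before running against the real table):

```python
import pandas as pd

# toy stand-in for the peakcalling_bams_and_inputs table
pairs = pd.DataFrame({
    'ChIPBam':  ['chip1.bam', 'chip2.bam', 'atac1.bam'],
    'InputBam': ['input1.bam', 'input2.bam', 'None'],
})

# samples called without an input control (background is then modelled
# from the treatment sample alone)
no_input = pairs.loc[pairs['InputBam'] == 'None', 'ChIPBam'].tolist()
print(no_input)  # ['atac1.bam']
```

An empty list here means every sample was paired with an input control.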
Now let's look at the insert sizes that are calculated by macs2 (for PE samples) or bamtools (for SE reads).
In [ ]:
insert_df = getTableFromDB('select * from insert_sizes;',database_path)
insert_df
Let's also have a quick check of the number of reads and number of fragments in our samples.
In [ ]:
peakcalling_frags_df = getTableFromDB('select * from post_filtering_check;',database_path)
peakcalling_frags_df = peakcalling_frags_df[['Input_Filename','total_reads']].copy()
peakcalling_frags_df['total_fragments'] = peakcalling_frags_df['total_reads'].divide(2)
peakcalling_frags_df
Now let's look at the peakcalling_summary table, which summarizes the number of fragments and the number of peaks called for each file.
In [ ]:
peakcalling_summary_df= getTableFromDB('select * from peakcalling_summary;',database_path)
peakcalling_summary_df.rename(columns={'sample':'track'},inplace=True)
peakcalling_summary_df.index = peakcalling_summary_df['track']
peakcalling_summary_df.T
Is there any correlation between the number of peaks and the number of fragments? Let's plot this. Can you see any saturation, where an increase in fragment number does not result in any further gains in peak number?
In [ ]:
ax = peakcalling_summary_df[['number_of_peaks','fragment_treatment_total']].divide(1000).plot.scatter(
    x='fragment_treatment_total', y='number_of_peaks')
ax.set_xlabel('number of PE fragments (thousands)')
ax.set_ylabel('number of peaks (thousands)')
ax.set_title('correlation of number of fragments \n& number of peaks')
#ax.set_ylim((50000,160000))
#ax.set_xlim((20000000,70000000))
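The trend can also be quantified rather than just eyeballed. A sketch with made-up numbers standing in for `peakcalling_summary_df` (replace `demo` with the real columns when running against your database): a high Pearson r together with a flattening scatter is what saturation looks like.

```python
import pandas as pd

# toy numbers standing in for fragment and peak counts per sample
demo = pd.DataFrame({
    'fragment_treatment_total': [10e6, 20e6, 30e6, 40e6, 50e6],
    'number_of_peaks':          [30e3, 48e3, 55e3, 58e3, 59e3],
})

# Pearson correlation between fragment counts and peak counts
r = demo['fragment_treatment_total'].corr(demo['number_of_peaks'])
print(round(r, 3))
```

In this toy example peak gains shrink as fragments increase, so the relationship is strong but visibly sub-linear.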
The code below provides a look at published datasets that you can compare against if you want to.
In [ ]:
#greenleaf_data = pd.read_csv('/Users/charlotteg/Documents/7_BassonProj/Mar17/allelic-atac-seq.csv')
#greenleaf_data.drop([0],inplace=True)
#greenleaf_data['total usable reads'] = greenleaf_data['total usable reads'] / 2
#ax = greenleaf_data.plot.scatter(x='total usable reads', y='# of allelic informative(AI) peaks (>=10 reads)')
#ax.set_ylim((50000,160000))
#ax.set_xlim((20000000,70000000))
#greenleaf_data
In [ ]:
#factor between number of reads and number of peaks
Now let's look at the ratio of the number of fragments to the number of peaks called.
In [ ]:
df = peakcalling_summary_df[['number_of_peaks','fragment_treatment_total']].copy()
df['frag_to_peaks'] = peakcalling_summary_df.fragment_treatment_total / peakcalling_summary_df.number_of_peaks
df
Plot a bar graph of the number of peaks.
In [ ]:
peakcalling_summary_df['number_of_peaks'].plot(kind='bar')
From these plots you should be able to tell whether there are any distinctive relationships between the number of fragments/reads and the number of peaks. You should also get a good idea of the number of peaks being detected by peakcalling, which can indicate whether the experiment has worked. It is strongly recommended to look at these peaks, along with bigwig files of the bams used for peakcalling, in a genome browser so you can assess whether peaks are being called correctly.