Peakcalling Peak Stats

This notebook analyses the outputs of the peakcalling pipeline that relate to the quality of the peakcalling steps.

There are several stats that you will want collected and graphed. Click on the links below to open either the Jupyter notebooks, where you can interact with the code directly, or the html files, which can be opened in your web browser.

Stats you should be interested in are:

Quality of BAM files for peakcalling

  • how many reads input: notebook html
  • how many reads removed at each step (numbers and percentages): notebook html
  • how many reads left after filtering: notebook html
  • how many reads mapping to each chromosome before filtering: notebook html
  • how many reads mapping to each chromosome after filtering: notebook html
  • X:Y reads ratio: notebook html
  • insert size distribution after filtering for PE reads: notebook html
  • samtools flags - check how many reads are in categories they shouldn't be: notebook html
  • picard stats - check how many reads are in categories they shouldn't be:
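
The read-count stats in the list above come down to simple arithmetic. The following is only a minimal sketch of that arithmetic, not the pipeline's own code: the function names, the step names, the chromosome labels `chrX`/`chrY` and all the numbers are made up for illustration, and percentages are expressed relative to the input read count.

```python
def filtering_summary(counts):
    """Summarise reads removed at each filtering step.

    counts is an ordered list of (step_name, reads_remaining) pairs;
    the first pair is the unfiltered input. Returns a list of
    (step_name, reads_removed, percent_of_input_removed) tuples.
    """
    total = counts[0][1]
    previous = total
    summary = []
    for step, remaining in counts[1:]:
        removed = previous - remaining
        percent = 100.0 * removed / total if total else 0.0
        summary.append((step, removed, percent))
        previous = remaining
    return summary


def xy_ratio(reads_per_chromosome, x_name="chrX", y_name="chrY"):
    """Return the ratio of X to Y reads, or None if there are no Y reads."""
    x_reads = reads_per_chromosome.get(x_name, 0)
    y_reads = reads_per_chromosome.get(y_name, 0)
    return x_reads / y_reads if y_reads else None


# Toy numbers, purely illustrative (not real results).
toy_counts = [("input", 1000), ("duplicates", 800), ("unmapped", 700)]
for step, removed, percent in filtering_summary(toy_counts):
    print("%s: removed %d reads (%.1f%% of input)" % (step, removed, percent))

print("X:Y ratio:", xy_ratio({"chrX": 300, "chrY": 100}))
```

The number of reads left after filtering is simply the last `reads_remaining` value in the list.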

Peakcalling stats

  • Number of peaks called in each sample: notebook html
  • Number of reads in peaks: notebook html
  • Size distribution of the peaks
  • Location of peaks
  • Correlation of peaks between samples
  • Other things?
  • IDR stats
  • Which peak lists are the best

This notebook takes the sqlite3 database created by the CGAT peakcalling_pipeline.py and uses it to plot the statistics listed above.

It assumes the following directory layout:

    location of database = project_folder/csvdb

    location of this notebook = project_folder/notebooks.dir/
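
Given that layout, the database sits one level above the notebook. Here is a hedged sketch of how one might check that it is reachable and see which tables it holds before plotting. It uses only Python's standard `sqlite3` module. The helper name `list_tables` and the relative path `../csvdb` are assumptions based on the layout above, and the actual table names depend on what the pipeline wrote.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the sorted names of all tables in the SQLite database at db_path."""
    if not os.path.exists(db_path):
        raise FileNotFoundError("database not found: %s" % db_path)
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [row[0] for row in rows]


# Assumption: the notebook is run from project_folder/notebooks.dir/,
# so the database is one directory up.
DATABASE_PATH = os.path.join("..", "csvdb")

if os.path.exists(DATABASE_PATH):
    print(list_tables(DATABASE_PATH))
else:
    print("No database found at", os.path.abspath(DATABASE_PATH))
```

Checking for the file first matters because `sqlite3.connect` silently creates an empty database when the path does not exist, which would hide a wrong working directory.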

First, let's load everything that might be needed.

This is where we are and when the notebook was run:


In [2]:
!pwd
!date


/Users/charlotteg/Documents/7_BassonProj/Mar17
Sat 11 Mar 2017 19:50:55 GMT