In this section we perform simple functions, such as overlaps, on the most common file type used for describing genomic regions, the BED file. We will examine the results of the ChIP-Seq peak calling you have performed on the transcription factor PAX5 and perform simple operations on these files, using the bedtools suite of programs. You will then annotate the MACS2 peaks with respect to genomic annotations. Finally, we will select the most significantly enriched peaks, and extract the genomic sequence flanking their summits, the point of highest enrichment.
If you are not in there already, change into the data directory.
In [ ]:
cd data
The bedtools package permits complex, interval-based manipulation of BED and GTF files. They are also very fast. The general invocation of bedtools
is bedtools <COMMAND>
.
To get an overview of the available commands, simply call bedtools
without any command or options in the terminal window.
In [ ]:
bedtools
To get help for a command, type bedtools <COMMAND>
. Extensive documentation and examples are available at https://bedtools.readthedocs.org/en/latest/. We will now use bedtools to calculate simple coverage statistics of the peak calls over the genome (keep in mind that only peaks on Chromosome 1 are in the file).
To bring up the help page for the bedtools genomecov
command, type:
In [ ]:
bedtools genomecov
Calculate the genome coverage of the PAX5 peaks:
In [ ]:
bedtools genomecov -i PAX5_peaks.narrowPeak -g genome/hg19.chrom.sizes
In order to biologically interpret the results of ChIP-Seq experiments, it is useful to look at the genes and other annotated elements that are located in proximity to the identified enriched regions. We will now use bedtools
to identify how many PAX5 peaks overlap GENCODE genes.
First we use awk
to filter out only the genes from the GTF file:
In [ ]:
awk '$3=="gene"' genome/gencode.v18.annotation.gtf \
> genome/gencode.v18.annotation.genes.gtf
Next, count the total number of PAX5 peaks:
In [ ]:
wc -l PAX5_peaks.narrowPeak
Then use bedtools
to find the number overlapping GENCODE genes:
In [ ]:
bedtools intersect -a PAX5_peaks.narrowPeak \
-b genome/gencode.v18.annotation.genes.gtf | wc -l
You can use the bedtools closest
command to find the closest gene to each peak.
In [ ]:
bedtools closest -a PAX5_peaks.narrowPeak \
-b genome/gencode.v18.annotation.genes.gtf | head
Transcription factor binding near to the transcript start sites (TSS) of genes is known to drive gene expression or repression, so it is of interest to know which TSS regions are bound by PAX5. To determine this, we will first create a BED file of the GENCODE TSS using the GTF.
You can use this awk
command to create the TSS BED file:
In [ ]:
awk 'BEGIN {FS=OFS="\t"} { if($7=="+"){tss=$4-1} else { tss = $5 } \
print $1,tss, tss+1, ".", ".", $7, $9}' \
genome/gencode.v18.annotation.genes.gtf > genome/gencode.tss.bed
Now use the bedtools closest
command again to find the closest TSS to each peak:
In [ ]:
sortBed -i genome/gencode.tss.bed > genome/gencode.tss.sorted.bed
In [ ]:
bedtools closest -a PAX5_peaks.narrowPeak \
-b genome/gencode.tss.sorted.bed > PAX5_closestTSS.txt
Use head to inspect the results:
In [ ]:
head PAX5_closestTSS.txt
You have now matched up all the PAX5 transcription factor peaks to their nearest gene transcription start site.
Q1. Looking at the output of the bedtools genomecov
we ran, what percentage of chromosome 1 do the peaks of PAX5 cover?
Q2. Looking at the output from bedtools intersect
, what proportion of PAX5 peaks overlap genes?
Q3. Looking at PAX5_closestTSS.txt
, which gene was found to be closest to MACS peak 2?
You can head back to file formats or continue on to motif analysis.