Inspecting genomic regions using bedtools

In this section we perform simple functions, such as overlaps, on the most common file type used for describing genomic regions, the BED file. We will examine the results of the ChIP-Seq peak calling you have performed on the transcription factor PAX5 and perform simple operations on these files, using the bedtools suite of programs. You will then annotate the MACS2 peaks with respect to genomic annotations. Finally, we will select the most significantly enriched peaks, and extract the genomic sequence flanking their summits, the point of highest enrichment.

If you are not in there already, change into the data directory.


In [ ]:
cd data

The bedtools package permits complex, interval-based manipulation of BED and GTF files. They are also very fast. The general invocation of bedtools is bedtools <COMMAND>.

To get an overview of the available commands, simply call bedtools without any command or options in the terminal window.


In [ ]:
bedtools

To get help for a command, type bedtools <COMMAND>. Extensive documentation and examples are available at https://bedtools.readthedocs.org/en/latest/. We will now use bedtools to calculate simple coverage statistics of the peak calls over the genome (keep in mind that only peaks on Chromosome 1 are in the file).

To bring up the help page for the bedtools genomecov command, type:


In [ ]:
bedtools genomecov

Calculate the genome coverage of the PAX5 peaks:


In [ ]:
bedtools genomecov -i PAX5_peaks.narrowPeak -g genome/hg19.chrom.sizes

In order to biologically interpret the results of ChIP-Seq experiments, it is useful to look at the genes and other annotated elements that are located in proximity to the identified enriched regions. We will now use bedtools to identify how many PAX5 peaks overlap GENCODE genes.

First we use awk to filter out only the genes from the GTF file:


In [ ]:
awk '$3=="gene"' genome/gencode.v18.annotation.gtf \
> genome/gencode.v18.annotation.genes.gtf

Next, count the total number of PAX5 peaks:


In [ ]:
wc -l PAX5_peaks.narrowPeak

Then use bedtools to find the number overlapping GENCODE genes:


In [ ]:
bedtools intersect -a PAX5_peaks.narrowPeak \
-b genome/gencode.v18.annotation.genes.gtf | wc -l

You can use the bedtools closest command to find the closest gene to each peak.


In [ ]:
bedtools closest -a PAX5_peaks.narrowPeak \
-b genome/gencode.v18.annotation.genes.gtf | head

Transcription factor binding near to the transcript start sites (TSS) of genes is known to drive gene expression or repression, so it is of interest to know which TSS regions are bound by PAX5. To determine this, we will first create a BED file of the GENCODE TSS using the GTF.

You can use this awk command to create the TSS BED file:


In [ ]:
awk 'BEGIN {FS=OFS="\t"} { if($7=="+"){tss=$4-1} else { tss = $5 } \
print $1,tss, tss+1, ".", ".", $7, $9}' \
genome/gencode.v18.annotation.genes.gtf > genome/gencode.tss.bed

Now use the bedtools closest command again to find the closest TSS to each peak:


In [ ]:
sortBed -i genome/gencode.tss.bed > genome/gencode.tss.sorted.bed

In [ ]:
bedtools closest -a PAX5_peaks.narrowPeak \
-b genome/gencode.tss.sorted.bed > PAX5_closestTSS.txt

Use head to inspect the results:


In [ ]:
head PAX5_closestTSS.txt

You have now matched up all the PAX5 transcription factor peaks to their nearest gene transcription start site.


Questions

Q1. Looking at the output of the bedtools genomecov we ran, what percentage of chromosome 1 do the peaks of PAX5 cover?

Q2. Looking at the output from bedtools intersect, what proportion of PAX5 peaks overlap genes?

Q3. Looking at PAX5_closestTSS.txt, which gene was found to be closest to MACS peak 2?


What's next?

You can head back to file formats or continue on to motif analysis.