Over the years, a set of commonly used file formats for genomic intervals has emerged. Most of these formats are tabular: each row describes one interval, and the columns have predefined meanings such as chromosome, location and score. The UCSC Genome Browser has an informative list of these at http://genome.ucsc.edu/FAQ/FAQformat.html.
The BED format is the simplest of these. A BED file has at least three columns denoting the chromosome, start and end of an interval. The following example denotes three intervals, two on chromosome chr1 and one on chr2.
chromosome | start | end |
---|---|---|
chr1 | 50 | 100 |
chr1 | 500 | 1000 |
chr2 | 600 | 800 |
BED files follow the UCSC Genome Browser’s convention of making the start position 0-based and the end position 1-based. In other words, the first base of an interval sits at position start + 1 when counting bases from 1, and the length of the interval is end - start. For example, the following BED feature represents a single base on chromosome 1, namely the first base.
chromosome | start | end | description |
---|---|---|---|
chr1 | 0 | 1 | I-am-the-first-position-on-chrom-1 |
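Because the start is 0-based and the end is 1-based, the length of a BED interval is simply end minus start. The snippet below is a minimal sketch of that arithmetic using awk; `bed_lengths` is a hypothetical helper name introduced here for illustration, and the input is assumed to be whitespace- or tab-separated with the chromosome, start and end in the first three columns.

```shell
# Append the interval length (end - start) as a fourth column.
# bed_lengths is a hypothetical helper, not part of any toolkit.
bed_lengths() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2, $3, $3 - $2 }' "$@"
}

# The single-base feature from the table above: length is 1 - 0 = 1.
printf 'chr1\t0\t1\n' | bed_lengths
```

The helper reads from standard input or from any file names passed as arguments, so it can also be pointed at a BED file on disk.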
Using the BED format documentation found at http://genome.ucsc.edu/FAQ/FAQformat.html#format1, answer the following questions.
Q1. The simplest BED file contains just three columns (chromosome, start, end) and is often called BED3 format. What extra columns does BED6 contain?
Hint: look for information about columns 4 to 6 in the documentation http://genome.ucsc.edu/FAQ/FAQformat.html#format1
Q2. In the above examples, what are the lengths of the intervals?
Q3. Can you write BED6 records for a transcript called “loc1”, transcribed on the forward strand and having three exons of length 100 starting at positions 1000, 2000 and 3000?
Hint: you will need one line per exon
In [ ]:
cd data
View the first 10 lines of PAX5_peaks.narrowPeak using the head command:
In [ ]:
head -10 PAX5_peaks.narrowPeak
NarrowPeak files can also be uploaded to IGV or other genome browsers.
Try uploading the peak file generated by MACS2 to IGV.
Q4. What additional information is given in the narrowPeak file, besides the location of the peaks?
Hint: See http://genome.ucsc.edu/FAQ/FAQformat.html#format12 for details
Q5. Does the first peak that was called look convincing to you?
A second popular format is the GTF format. Each row in a GTF formatted file denotes a genomic interval. The GTF format documentation can be found at http://mblab.wustl.edu/GTF2.html.
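In GTF, the three intervals from above might look like the table below. GTF coordinates are 1-based and inclusive, so each start is one higher than in the BED example while the end is unchanged.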
seqid | source | type | start | stop | score | strand | phase | attributes |
---|---|---|---|---|---|---|---|---|
chr1 | gene | exon | 51 | 100 | . | + | 0 | gene_id "001";transcript_id "001.1"; |
chr1 | gene | exon | 501 | 1000 | . | + | 2 | gene_id "001";transcript_id "001.1"; |
chr2 | repeat | exon | 601 | 800 | . | + | . | |
The 9th column permits intervals to be grouped and linked in a hierarchical fashion, which makes the format popular for describing gene models. Note how the first two intervals are linked through a common transcript_id and gene_id.
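GTF coordinates are 1-based and inclusive, whereas BED starts are 0-based, so moving an interval from BED to GTF means adding 1 to the start and leaving the end alone. The following is a minimal sketch of that conversion for the coordinate columns only; `bed_to_one_based` is a hypothetical helper name, and a real conversion would also need to fill in the remaining GTF columns.

```shell
# Convert BED3 coordinates (0-based start) to 1-based inclusive coordinates.
# bed_to_one_based is a hypothetical helper for illustration only.
bed_to_one_based() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2 + 1, $3 }' "$@"
}

# The three BED intervals from the first example.
printf 'chr1\t50\t100\nchr1\t500\t1000\nchr2\t600\t800\n' | bed_to_one_based
```

The resulting starts (51, 501 and 601) match the start column of the GTF table above.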
The aim of the GENCODE project is to annotate all evidence-based genes and gene features in the entire human genome with high accuracy. Annotation of the GENCODE gene set is carried out using a mix of manual annotation, experimental analysis and computational biology methods. The GENCODE v18 gene set is available in the genome folder.
Look at the first 10 lines of the GENCODE annotation file:
In [ ]:
head -n 10 genome/gencode.v18.annotation.gtf
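Beyond looking at the first few lines, it can be useful to see which feature types (the third column) an annotation file contains. The sketch below tallies them with awk; `gtf_feature_counts` is a hypothetical helper name, and it assumes that the file is tab-separated and that any header lines start with `#`. The demonstration input is a small made-up example, not real GENCODE content.

```shell
# Count how often each feature type (column 3) occurs in a GTF file.
# gtf_feature_counts is a hypothetical helper; lines starting with '#'
# are treated as header comments and skipped.
gtf_feature_counts() {
    awk -F '\t' '!/^#/ { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$@" \
        | LC_ALL=C sort
}

# Toy input: one comment line and the first two rows of the GTF table above.
printf '##toy header\nchr1\tgene\texon\t51\t100\t.\t+\t0\tgene_id "001";transcript_id "001.1";\nchr1\tgene\texon\t501\t1000\t.\t+\t2\tgene_id "001";transcript_id "001.1";\n' \
    | gtf_feature_counts
```

To apply it to the annotation used here, pass the file name as an argument: `gtf_feature_counts genome/gencode.v18.annotation.gtf`.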
You can head back to identifying enriched areas using MACS or continue on to inspecting genomic regions using bedtools.