File Formats

BED files

Over the years, a set of commonly used file formats for genomic intervals has emerged. Most of these formats are tabular: each row describes one interval, and the columns have predefined meanings such as chromosome, location and score. The UCSC Genome Browser has an informative list of these at http://genome.ucsc.edu/FAQ/FAQformat.html.

The BED format is the simplest of these. A minimal BED file has at least three columns denoting the chromosome, start and end of an interval. The following example denotes three intervals, two on chromosome chr1 and one on chr2.

chromosome	start	end
chr1	50	100
chr1	500	1000
chr2	600	800
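The three required columns can be sanity-checked with a short awk filter. This is only an illustrative sketch: the function name check_bed3 is made up here, and the input is assumed to be tab-separated with no header row.

```shell
# check_bed3: hypothetical helper. Reads tab-separated BED lines on stdin and
# prints every line that has fewer than three columns, non-integer
# coordinates, or a start greater than its end.
check_bed3() {
    awk -F'\t' '
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
            print "line " NR ": " $0
        }'
}

# The three intervals from the example above (header row left out).
printf 'chr1\t50\t100\nchr1\t500\t1000\nchr2\t600\t800\n' | check_bed3
```

If every line passes the checks, the command prints nothing.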

BED files follow the UCSC Genome Browser’s convention of making the start position 0-based and the end position 1-based (often described as 0-based, half-open coordinates). In other words, the 1-based position of the first base in an interval is one higher than the value in the "start" column. For example, the following BED feature represents a single base on chromosome 1, namely the first base.

chromosome	start	end	description
chr1	0	1	I-am-the-first-position-on-chrom-1
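The convention can be made concrete with a small awk sketch that rewrites a BED3 line as 1-based inclusive coordinates and appends the interval length. The function name bed_to_one_based is hypothetical, and tab-separated input without a header is assumed.

```shell
# bed_to_one_based: hypothetical helper. Reads BED3 lines on stdin and prints
# chromosome, 1-based inclusive start, 1-based inclusive end, and length.
# Only the start needs shifting, because the BED end is already 1-based.
bed_to_one_based() {
    awk -F'\t' -v OFS='\t' '{ print $1, $2 + 1, $3, $3 - $2 }'
}

# The single-base feature from the example above.
printf 'chr1\t0\t1\n' | bed_to_one_based
```

For the single-base feature this prints chr1, 1, 1 and 1, separated by tabs: the interval starts and ends at base 1 and has length 1.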

Using the BED format documentation found at http://genome.ucsc.edu/FAQ/FAQformat.html#format1, answer the following questions.

Questions

Q1. The simplest BED file contains just three columns (chromosome, start, end) and is often called BED3 format. What extra columns does BED6 contain?
Hint: look for information about columns 4 to 6 in the documentation http://genome.ucsc.edu/FAQ/FAQformat.html#format1

Q2. In the above examples, what are the lengths of the intervals?

Q3. Can you write a BED6 file describing a transcript called “loc1”, transcribed on the forward strand, with three exons of length 100 starting at positions 1000, 2000 and 3000?
Hint: you will need one line per exon

narrowPeak files

The narrowPeak format is a BED6+4 format used to describe and visualise called peaks. Previously, we used MACS2 to call peaks on the PAX5 ChIP-seq data set.

If you are not in there already, change into the data directory.



cd data

View the first 10 lines in PAX5_peaks.narrowPeak using the head command:



head -10 PAX5_peaks.narrowPeak
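Because narrowPeak is a BED6+4 format, its first six columns can be read on their own as ordinary BED6. A minimal sketch, assuming the file is tab-separated; the function name narrowpeak_to_bed6 is made up for this example.

```shell
# narrowpeak_to_bed6: hypothetical helper. Keeps only the first six
# tab-separated columns of each line read on stdin.
narrowpeak_to_bed6() {
    cut -f1-6
}

# Show the BED6 part of the first three peaks, if the file is present.
if [ -f PAX5_peaks.narrowPeak ]; then
    narrowpeak_to_bed6 < PAX5_peaks.narrowPeak | head -3
fi
```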

NarrowPeak files can also be uploaded to IGV or other genome browsers.

Try uploading the peak file generated by MACS2 to IGV.

Questions

Q4. What additional information is given in the narrowPeak file, besides the location of the peaks?
Hint: See http://genome.ucsc.edu/FAQ/FAQformat.html#format12 for details

Q5. Does the first peak that was called look convincing to you?

GTF files

A second popular format is the GTF format. Each row in a GTF formatted file denotes a genomic interval. The GTF format documentation can be found at http://mblab.wustl.edu/GTF2.html.

The three intervals from above might be:

seqid	source	type	start	stop	score	strand	phase	attributes
chr1	gene	exon	51	100	.	+	0	gene_id "001";transcript_id "001.1";
chr1	gene	exon	501	1000	.	+	2	gene_id "001";transcript_id "001.1";
chr2	repeat	exon	601	800	.	+	.

The 9th column permits intervals to be grouped and linked in a hierarchical fashion. This format is thus popular to describe gene models. Note how the first two intervals are linked through a common transcript_id and gene_id.
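To illustrate how the 9th column can be used, the sketch below pulls one attribute value out of it with awk. The function name gtf_attr is hypothetical, and the parsing assumes simple attributes of the form key "value"; separated by semicolons, as in the example rows above.

```shell
# gtf_attr: hypothetical helper. Prints the value of one attribute (named by
# the first argument) from column 9 of each GTF line read on stdin.
gtf_attr() {
    awk -F'\t' -v key="$1" '{
        n = split($9, parts, ";")
        for (i = 1; i <= n; i++) {
            gsub(/^ +| +$/, "", parts[i])          # trim surrounding spaces
            if (index(parts[i], key " ") == 1) {   # attribute starts with the key
                value = substr(parts[i], length(key) + 2)
                gsub(/"/, "", value)               # drop the quotes
                print value
            }
        }
    }'
}

# First interval from the example above.
printf 'chr1\tgene\texon\t51\t100\t.\t+\t0\tgene_id "001";transcript_id "001.1";\n' | gtf_attr transcript_id
```

For the first example row this prints 001.1. Attribute values that themselves contain semicolons would need a more careful parser.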

The aim of the GENCODE project is to annotate all evidence-based genes and gene features in the entire human genome with high accuracy. Annotation of the GENCODE gene set is carried out using a mix of manual annotation, experimental analysis and computational biology methods. The GENCODE v18 gene set is available in the genome folder.

Look at the first 10 lines of the GENCODE annotation file:



head -n 10 genome/gencode.v18.annotation.gtf
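To get a feel for what the annotation contains beyond the first few lines, the feature types in column 3 can be tallied. This is a sketch only: the function name count_features is made up, and it assumes that any header lines in the file start with "#".

```shell
# count_features: hypothetical helper. Reads GTF on stdin, skips comment lines
# starting with "#", and counts how often each value of column 3 occurs,
# most frequent first.
count_features() {
    grep -v '^#' | cut -f3 | sort | uniq -c | sort -rn
}

# Tally the feature types in the GENCODE file, if it is present.
if [ -f genome/gencode.v18.annotation.gtf ]; then
    count_features < genome/gencode.v18.annotation.gtf
fi
```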

Questions

Q6. In the small example table above, why have the coordinates changed from the BED description?

What's next?

You can head back to identifying enriched areas using MACS or continue on to inspecting genomic regions using bedtools.