Introducing the tutorial dataset

The data we will use for this practical comes from the ENCODE (Encyclopedia of DNA Elements) Consortium, a large international collaboration that aims to build a comprehensive catalogue of functional elements in the human genome. As part of this project, many human tissues and cell lines were studied using high-throughput sequencing technologies.

In this tutorial, we will work on datasets from GM12878, a lymphoblastoid cell line produced from the blood of a female donor of European ancestry. Specifically, we will look at binding data for the transcription factor PAX5, a known regulator of B-cell differentiation. Aberrant expression of PAX5 is linked to lymphoblastic leukaemia. If there is time, we will also look at ChIP-seq data for Polymerase II and the histone modification H3K36me3.

The FASTQ file that we will align is called PAX5.fastq. It is based on PAX5 ChIP-seq data produced by the Myers lab as part of the ENCODE project. We will align these reads to the human genome.

The tutorial files can be found in the data directory. Let's go there now!

Move into the directory containing the tutorial data files.


In [ ]:
cd data

Check to see if the tutorial files are there.


In [ ]:
ls *.fastq

If the previous ls command did not list any .fastq files, download and uncompress the tutorial files.


In [ ]:
wget ftp://ftp.sanger.ac.uk/pub/project/pathogens/workshops/chipseq_data.tar.gz
tar -xf chipseq_data.tar.gz
mv chipseq_data/* .
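
The check-then-download steps above can also be combined into one guarded command, so that re-running the notebook does not fetch the archive again. The following is only a sketch: the function name fetch_tutorial_data is hypothetical and not part of the tutorial, and it assumes a bash-like shell in which an unmatched glob makes ls exit with a non-zero status.

```shell
# Hypothetical helper (not part of the original tutorial): download the
# archive only when no FASTQ files are present in the current directory.
fetch_tutorial_data() {
    # ls exits with a non-zero status when the glob matches nothing
    if ls ./*.fastq >/dev/null 2>&1; then
        echo "FASTQ files already present, skipping download"
        return 0
    fi
    # Same commands as in the cell above; && stops at the first failure
    wget ftp://ftp.sanger.ac.uk/pub/project/pathogens/workshops/chipseq_data.tar.gz &&
        tar -xf chipseq_data.tar.gz &&
        mv chipseq_data/* .
}
```

Calling fetch_tutorial_data from inside the data directory would then either report that the files are already there or run the three download steps in order.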

Take a look at one of the FASTQ files.


In [ ]:
head PAX5.fastq
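
By default, head prints the first 10 lines, which is two and a half FASTQ records: each read normally occupies four lines (an identifier starting with @, the sequence, a separator line starting with +, and the quality string). As a rough sketch, the number of reads in a file can be estimated by dividing its line count by four. The helper name count_fastq_reads is hypothetical, and the calculation assumes an uncompressed file whose sequences are not wrapped over several lines.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file,
# assuming the standard four-lines-per-record layout.
count_fastq_reads() {
    local lines
    # Redirecting the file into wc prints the count without the file name
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}
```

For example, head -n 8 PAX5.fastq would show exactly the first two complete records, and count_fastq_reads PAX5.fastq would print the estimated number of reads in the file.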

What's next?

For a quick recap of what the tutorial covers and the software you will need, head back to the introduction.

Otherwise, let's get started with aligning the PAX5 sample to the genome.