NGS pipelining in the cloud

  • Download the necessary FASTQ files
  • Open a cloud instance and install required programs
  • Make a pipeline in Python/Snakemake/Nextflow to run an RNA-Seq gene expression study
  • Perform simple DE in Python (t-test based) or in R (edgeR, called from Python)
  • Functional enrichment, expression plots

This exercise will teach you how to run a simple RNA-Seq pipeline in the cloud. We will use the Amazon cloud (AWS). Note that you may have to pay for your usage, but I will teach you how to properly monitor your usage statistics and charges. It is not expensive, but credit card details are required to open your Amazon account.

To figure out the pricing, go here:

https://aws.amazon.com/ec2/pricing/

Note that you will not be using the instance fully for the whole eight hours, and that setup and development can be done on the cheapest instance available.

Open a cloud instance and install required programs

Use the EC2 interface to gain access to a small server.

If possible use this image:

http://thecloudmarket.com/image/ami-9bcb9cf2--cloudbiolinux-ubuntu-13-04-2013-08-28
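
If you prefer to script the launch rather than click through the EC2 web console, the AWS Python library boto3 can start an instance from this image. The sketch below is illustrative only: the region, key pair name and security group are placeholders you must replace with your own, and it assumes boto3 is installed and your AWS credentials are configured.


In [ ]:
# Illustrative sketch only: assumes boto3 is installed and AWS credentials
# are configured; region, key pair and security group are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")   # assumed region

instances = ec2.create_instances(
    ImageId="ami-9bcb9cf2",           # the CloudBioLinux image linked above
    InstanceType="t2.micro",          # a cheap instance is enough for setup
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",            # placeholder: your own SSH key pair
    SecurityGroups=["allow-ssh"],     # placeholder: a group that allows SSH
)
print("Started instance %s" % instances[0].id)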

Make relevant plots from the FASTQ file using BioPython

Follow the example below as inspiration.


In [ ]:
from Bio import SeqIO

# count the reads in the FASTQ file
count = 0
for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
    count += 1
print "%i reads" % count

# keep only reads whose minimum Phred quality is at least 20
# (threshold chosen here just as an example)
good_reads = (rec for rec in SeqIO.parse("SRR020192.fastq", "fastq")
              if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print "Saved %i good-quality reads" % count
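
For instance, one simple plot is a histogram of the mean Phred quality per read. The sketch below assumes matplotlib is installed on the instance and that SRR020192.fastq is in the working directory.


In [ ]:
# Minimal sketch, assuming matplotlib is available on the instance.
from Bio import SeqIO
import matplotlib.pyplot as plt

mean_quals = []
for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
    quals = rec.letter_annotations["phred_quality"]
    mean_quals.append(sum(quals) / float(len(quals)))

plt.hist(mean_quals, bins=40)
plt.xlabel("Mean Phred quality per read")
plt.ylabel("Number of reads")
plt.title("SRR020192 read quality distribution")
plt.savefig("read_quality_hist.png")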

Set up the pipeline

Use this small pipelining example, in which I run two programs sequentially from Python. Use it as a template for your pipeline.


In [ ]:
from subprocess import Popen, PIPE

class Command( object ):
    def __init__( self, text ):
        self.args = text.split()
    def execute( self ):
        # communicate() reads stdout/stderr to the end and waits for the
        # process, avoiding the deadlocks that wait() + a full PIPE can cause
        p = Popen(self.args, shell=False, stdin=None, stdout=PIPE, stderr=PIPE)
        self.stdout, self.stderr = p.communicate()
        self.returncode = p.returncode

t1 = "makeblastdb -in data/ecoli_prom.fasta -dbtype nucl -out out/promdb"
t2 = "blastn -db out/promdb -query out/promoter.fasta"

pipeline = [Command(t1), Command(t2)]

for c in pipeline:
    print "Running command:", " ".join(c.args)
    c.execute()
    print "Return code:", c.returncode
    print "Output:", c.stdout
    print "Error:", c.stderr

Mapping

We next have to map the reads back onto the sequences that generated them; in the RNA-Seq case these are RNA transcripts. We have several options:

  • De novo assembly. Use when you have no reference genome, or very poor annotations for it. Assembly will be difficult and incomplete.
  • Genome mapping. You might discover novel transcribed regions, which is especially useful for finding gene models in species without annotations. It is difficult to obtain accurate transcript expression estimates.
  • Transcriptome mapping. More contiguous mapped regions and faster mapping, but problems arise with multiple transcripts of the same gene (reads map equally well to all of them).

A simplified set of commands to perform genome mapping is:

# index the reference genome
bowtie-build genome.fa genome.toplevel

# align the single reads
tophat genome.toplevel file_1.1.fastq,file_1.2.fastq

Question: What are bowtie and tophat doing? Look up their manuals. How are single reads and paired-end reads treated? Find out how to write the output to a specific location.

Task: Inside your 'lab' folder you have a few fastq files that were greatly trimmed down to speed up computation for classroom purposes. Run the above mapping on them sequentially. For this purpose, create a small Python script (a sketch is given after the download links below). Warning: I intentionally wrote the commands above in an overly simplified manner to give you some sense of the real skills required in sequencing: patience and problem solving!

ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.chromosome.4.fa.gz

ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
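
Here is a minimal sketch of such a script, run as plain Python with subprocess. The fastq file names are placeholders (substitute the trimmed files from your 'lab' folder), and it assumes the chromosome 4 FASTA linked above has been downloaded and gunzipped and that bowtie-build and tophat are on the PATH.


In [ ]:
# Minimal sketch: the fastq file names below are placeholders.
from subprocess import Popen, PIPE

def run(text):
    # run one command, wait for it and show its stderr (where most
    # mapping tools write their progress messages)
    p = Popen(text.split(), shell=False, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    print "Ran:", text
    print err
    return p.returncode

commands = [
    # index the reference (chromosome 4 only, to keep things small)
    "bowtie-build Homo_sapiens.GRCh37.75.dna.chromosome.4.fa chr4",
    # map each trimmed fastq file onto the index, one after the other;
    # -o writes the results to a chosen output directory
    "tophat -o tophat_sample1 chr4 lab/sample1.fastq",
    "tophat -o tophat_sample2 chr4 lab/sample2.fastq",
]

for cmd in commands:
    if run(cmd) != 0:
        print "Command failed, stopping the pipeline here"
        break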

SAM, BAM file formats

This is the standard way of representing the results from the alignment step. It contains the same information as in the fastq file, plus some extra fields providing mapping information, for example, the coordinates where each of the reads was aligned. A SAM file is a plain text file with the information spread across different columns, and a BAM file is just its compressed version in binary format. In order to save disk space, we will typically work with BAM files; however, we can easily transform a BAM file into SAM format using samtools:

(do not run)

samtools view -h -o file.sam file.bam

We can now inspect the first lines of the file with standard Unix commands:

head -n20 file.sam

Alternatively, we can directly inspect the contents of a BAM file with the following samtools command:

samtools view file.bam | head
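
You can also sort, index and inspect BAM files from Python with pysam, a wrapper around the samtools library. The sketch below assumes a reasonably recent pysam and an existing file.bam; the sorted, indexed copy it produces is also what IGV (next section) expects.


In [ ]:
# Minimal sketch, assuming a recent pysam and an existing file.bam.
import pysam

# coordinate-sort and index the BAM (needed for IGV and random access)
pysam.sort("-o", "file.sorted.bam", "file.bam")
pysam.index("file.sorted.bam")

# print the first ten mapped alignments: name, reference, position, MAPQ
bam = pysam.AlignmentFile("file.sorted.bam", "rb")
shown = 0
for read in bam:
    if read.is_unmapped:
        continue
    print("%s %s %d %d" % (read.query_name, read.reference_name,
                           read.reference_start, read.mapping_quality))
    shown += 1
    if shown >= 10:
        break
bam.close()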

Visualization with IGV

http://www.broadinstitute.org/igv/

Download and install IGV.

  • Load the reference genome: Genomes > Load genome from file > genome.fa
  • Load the BAM file: File > Load from file > file.bam. (First index with samtools index bam_file)
  • Load the annotations: File > Load from file > genome.gtf

Can you find meaningful information? Look, for example, for structural variants.

http://www.broadinstitute.org/igv/AlignmentData

Normalizing counts and differential expression

Normalization adjusts the raw counts for the sizes of the templates involved (gene length and library size). Differential expression then gives you the sets of genes whose expression differs between samples. "Gene expression" in RNA-Seq jargon simply means normalized read counts.

If we manage to get this far, we can try the following:

Here is a helpful link to an NGS course I once helped supervise in Umeå: http://uppnex.se/twiki/do/view/Courses/NgsIntro1411/RNASeqMap

Cufflinks output: The first two files contain basic information about expressed genes and transcripts, respectively (those known from the annotation file as well as novel loci identified by cufflinks) and the strength of their expression, given in FPKM. FPKM stands for ‘Fragments Per Kilobase of exon per Million fragments mapped’ and is a normalized measure of transcript abundance. That is the short explanation. The longer version for the more mathematically inclined among us can be found at http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1226.html .
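
To make the formula concrete, here is a tiny worked example with made-up numbers; the values are not from any real dataset.


In [ ]:
# A minimal worked example of the FPKM formula, with made-up numbers.
# FPKM = fragments mapped to the transcript
#        / (transcript exon length in kilobases * total mapped fragments in millions)

fragments_on_transcript = 500          # fragments assigned to this transcript (made up)
transcript_length_bp    = 2000         # summed exon length of the transcript (made up)
total_mapped_fragments  = 20 * 10**6   # library size: all mapped fragments (made up)

fpkm = fragments_on_transcript / (
    (transcript_length_bp / 1000.0) * (total_mapped_fragments / 1e6)
)
print("FPKM = %.2f" % fpkm)   # 500 / (2 kb * 20 M) = 12.50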

Cuffmerge is a tool that takes cufflinks-derived annotation files (known & ‘novel’ loci) and reconciles them into a consensus annotation, discarding e.g. spuriously transcribed loci and merging overlapping loci into larger transcription units where possible.

Cuffdiff takes aligned reads from two or more samples, estimates comparable expression values and performs statistical analysis of the resulting data to determine which genes exhibit significantly different activity profiles between any two samples. The nitty-gritty details of the underlying mathematics can be found here: http://cufflinks.cbcb.umd.edu/howitworks.html.

The finer points

Verify your results by plotting the count values and by performing your own, simpler DE study using Z- or t-tests and FDR correction with scipy and scikit-learn; a sketch is given after the command listing below.


In [ ]:
samtools view accepted_hits.bam | htseq-count \
    --mode=intersection-nonempty \
    --stranded=no \
    --type=exon \
    --idattr=gene_id - \
    ../genome.gtf > htseq_count.out
    
#Assembly and transcript calling
cufflinks -o my_output_folder -p 8 -g genome.gtf my_infile.bam
#Cuffmerge: Reconciling different transcript models
cuffmerge -o merged -g reference/genes.gtf -p 8 -s reference/genome.fa transcripts.txt
#Differential expression analysis
cuffdiff -o cuffdiff.out -L rat,human -p 8 -u merged.gtf rat.bam human.bam
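
Here is a minimal sketch of such a verification: a per-gene Welch t-test with scipy followed by a hand-written Benjamini-Hochberg FDR correction. The toy count matrix (1000 genes, three replicates per condition, Poisson-simulated) is a stand-in; replace it with the table you assemble from your htseq-count outputs.


In [ ]:
# Minimal sketch: the count matrix below is simulated, not course data.
import numpy as np
from scipy import stats

# toy example: 1000 genes, 3 replicates per condition
np.random.seed(0)
cond_a = np.random.poisson(50, size=(1000, 3)).astype(float)
cond_b = np.random.poisson(55, size=(1000, 3)).astype(float)

# library-size normalization: counts per million in each sample
cond_a = cond_a / cond_a.sum(axis=0) * 1e6
cond_b = cond_b / cond_b.sum(axis=0) * 1e6

# per-gene two-sample (Welch) t-test
t, p = stats.ttest_ind(cond_a, cond_b, axis=1, equal_var=False)

# Benjamini-Hochberg FDR, written out by hand
order = np.argsort(p)
ranked = p[order]
n = len(p)
bh = ranked * n / np.arange(1, n + 1)
bh = np.minimum.accumulate(bh[::-1])[::-1]     # enforce monotonicity
qvalues = np.empty_like(bh)
qvalues[order] = bh

print("genes with q < 0.05: %d" % (qvalues < 0.05).sum())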

Galaxy

For the finer points, set up Galaxy and run your pipelining script through it.

Apart from pipelining, Galaxy contains a number of data analysis and visualization options. There are a number of public servers as well as a main server, available here: https://wiki.galaxyproject.org/Main.

Galaxy is used through the web, so it reduces the frustrations associated with command-line Linux environments; but for fine tuning, or for anything other than a simple project, Galaxy is usually installed locally and heavily tuned to the particular NGS project a user is involved with.

Your BioLinux instance comes with Galaxy installed by default, so for learning or small-scale projects you can use the cloud instance too.

