This exercise will teach you how to run a simple RNA-Seq pipeline in the cloud. We will use the Amazon cloud. Note that you may have to pay for your Amazon cloud usage, but I will teach you how to monitor your usage statistics and charges. It is not expensive, but credit card details are required to open an Amazon account.
To figure out the pricing, go here:
https://aws.amazon.com/ec2/pricing/
Note that you will not be using the instance for the full eight hours, and that setup/development can be done on the cheapest instance available.
Use the EC2 interface to gain access to a small server.
If possible use this image:
http://thecloudmarket.com/image/ami-9bcb9cf2--cloudbiolinux-ubuntu-13-04-2013-08-28
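If you prefer scripting to clicking through the EC2 console, the launch can also be done from Python. Here is a minimal sketch, assuming the boto3 library is installed, your AWS credentials are configured, and the image is available in your region; the key pair name is hypothetical:
In [ ]:
import boto3

# Launch one small instance from the CloudBioLinux image referenced above
ec2 = boto3.resource("ec2")
instances = ec2.create_instances(
    ImageId="ami-9bcb9cf2",   # the CloudBioLinux AMI from the link above
    InstanceType="t1.micro",  # a cheap instance type for setup/development
    KeyName="my-key-pair",    # hypothetical key pair; create your own first
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)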
In [ ]:
from Bio import SeqIO

# Count the reads in the FASTQ file
count = 0
for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
    count += 1
print("%i reads" % count)

# Keep only records whose worst per-base Phred quality is at least 20
# (the threshold of 20 is an example choice)
good_quality = (rec for rec in SeqIO.parse("SRR020192.fastq", "fastq")
                if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_quality, "good_quality.fastq", "fastq")
print("Saved %i reads" % count)
In [ ]:
from subprocess import Popen, PIPE

class Command(object):
    def __init__(self, text):
        self.args = text.split()

    def execute(self):
        # Run the command, capturing standard output and standard error
        p = Popen(self.args, shell=False, stdout=PIPE, stderr=PIPE)
        self.stdout, self.stderr = p.communicate()
        self.status = p.returncode

t1 = "makeblastdb -in data/ecoli_prom.fasta -dbtype nucl -out out/promdb"
t2 = "blastn -db out/promdb -query out/promoter.fasta"

pipeline = [Command(t1), Command(t2)]
for c in pipeline:
    print("Running command:", " ".join(c.args))
    c.execute()
    print("Exit status:", c.status)
    print("Output:", c.stdout)
    print("Error:", c.stderr)
We next have to map those reads back onto the sequences that generated them; in the RNA-Seq case these are RNA transcripts. There are several options for this step.
A simplified set of commands to perform genome mapping is:
# index the reference genome
bowtie-build genome.fa genome.toplevel
# align the single reads
tophat genome.toplevel file_1.1.fastq,file_1.2.fastq
Question: What are bowtie and tophat doing? Look up their manuals. How are single reads and paired-end reads treated? Find out how to write the output to a specific location.
Task: Inside your 'lab' folder you have a few fastq files that were greatly trimmed down to speed up computation for classroom purposes. Run the above mapping on them sequentially. For this purpose, create a small Python script (a sketch follows after the download links below). Warning: I intentionally wrote the commands above in an overly simplified manner to give you some sense of the real skills required in sequencing: patience and problem solving!
ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.chromosome.4.fa.gz
ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
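A minimal sketch of such a script, assuming the trimmed fastq files sit in the 'lab' folder, that bowtie and tophat are on your PATH, and that the chromosome 4 fasta above was unpacked as genome.fa; file names are illustrative:
In [ ]:
import glob
import subprocess

# Index the reference genome once
subprocess.check_call(["bowtie-build", "genome.fa", "genome.toplevel"])

# Map each fastq file sequentially, one TopHat run per file
for fastq in sorted(glob.glob("lab/*.fastq")):
    outdir = fastq.replace(".fastq", "_tophat_out")
    print("Mapping", fastq, "into", outdir)
    subprocess.check_call(["tophat", "-o", outdir, "genome.toplevel", fastq])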
SAM, BAM file formats
This is the standard way of representing the results from the alignment step. It contains the same information as in the fastq file, plus some extra fields providing mapping information, for example, the coordinates where each of the reads was aligned. A SAM file is a plain text file with the information spread across different columns, and a BAM file is just its compressed version in binary format. In order to save disk space, we will typically work with BAM files; however, we can easily transform a BAM file into SAM format using samtools:
(do not run)
samtools view -h -o file.sam file.bam
We can now inspect the first lines of the file with standard Unix commands:
head -n20 file.sam
Alternatively, we can directly inspect the contents of a BAM file with the following samtools command:
samtools view file.bam | head
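The same inspection can also be done from Python with the pysam library; this is not part of this exercise's setup, so treat it as an optional sketch assuming a recent pysam:
In [ ]:
import itertools
import pysam

# Print the first ten alignments of the BAM file as SAM-formatted lines
with pysam.AlignmentFile("file.bam", "rb") as bam:
    for read in itertools.islice(bam, 10):
        print(read.to_string())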
Visualization with IGV
http://www.broadinstitute.org/igv/
Download and install IGV.
Can you find meaningful information? Look, for example, for structural variants.
Normalization is concerned with making the counts comparable by accounting for the sizes of the templates (gene lengths) and of the libraries (total read counts). Differential expression analysis will finally give you sets of genes that differ in expression among samples. In RNA-Seq jargon, gene expression simply means normalized read counts.
But in case we manage to get this far, we can try the following:
Here is a helpful link to an NGS course I once helped supervise in Umeå: http://uppnex.se/twiki/do/view/Courses/NgsIntro1411/RNASeqMap
Cufflinks output: The first two files contain basic information about expressed genes and transcripts, respectively (those known from the annotation file as well as novel loci identified by cufflinks) and the strength of their expression, given in FPKM. FPKM stands for ‘Fragments Per Kilobase of exon per Million fragments mapped’ and is a normalized measure of transcript abundance. That is the short explanation. The longer version for the more mathematically inclined among us can be found at http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1226.html.
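To make the formula concrete, here is the FPKM arithmetic on made-up numbers; this is only the naive calculation, not Cufflinks' full estimation procedure, which also handles multi-mapped reads and isoform deconvolution:
In [ ]:
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    # Fragments Per Kilobase of exon per Million fragments mapped
    kilobases = exon_length_bp / 1e3
    millions = total_mapped_fragments / 1e6
    return fragments / kilobases / millions

# 500 fragments on a gene with 2 kb of exon, in a library of
# 30 million mapped fragments
print(fpkm(500, 2000, 30e6))  # about 8.33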
Cuffmerge is a tool that takes cufflinks-derived annotation files (known & ‘novel’ loci) and reconciles them into a consensus annotation, discarding e.g. spuriously transcribed loci and merging overlapping loci into larger transcription units where possible.
Cuffdiff takes aligned reads from two or more samples, estimates comparable expression values and performs statistical analysis of the resulting data to determine which genes exhibit significantly different activity profiles between any two samples. The nitty-gritty details of the underlying mathematics can be found here: http://cufflinks.cbcb.umd.edu/howitworks.html.
Verify your results by plotting the count values and by performing your own simpler DE study using Z-testing and FDR with scipy and scikit-learn.
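A minimal sketch of such a check on simulated data; all numbers are made up, and a real study would model count variance properly, so treat this only as a sanity exercise:
In [ ]:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated log2 normalized counts: 3 replicates x 1000 genes per condition
a = rng.normal(5.0, 1.0, size=(3, 1000))
b = rng.normal(5.0, 1.0, size=(3, 1000))
b[:, :50] += 2.0  # spike in 50 truly differential genes

# Per-gene z statistic on the difference of the condition means
diff = b.mean(axis=0) - a.mean(axis=0)
se = np.sqrt(a.var(axis=0, ddof=1) / 3 + b.var(axis=0, ddof=1) / 3)
p = 2 * stats.norm.sf(np.abs(diff / se))

# Benjamini-Hochberg FDR control at 5%
m = p.size
order = np.argsort(p)
below = p[order] <= 0.05 * np.arange(1, m + 1) / m
k = below.nonzero()[0].max() + 1 if below.any() else 0
print("Genes called significant:", k)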
In [ ]:
# Count reads per gene from the alignments
samtools view accepted_hits.bam | htseq-count \
--mode=intersection-nonempty \
--stranded=no \
--type=exon \
--idattr=gene_id - \
../genome.gtf > htseq_count.out
#Assembly and transcript calling
cufflinks -o my_output_folder -p 8 -g genome.gtf my_infile.bam
#Cuffmerge: Reconciling different transcript models
cuffmerge -o merged -g reference/genes.gtf -p 8 -s reference/genome.fa transcripts.txt
#Differential expression analysis
cuffdiff -o cuffdiff.out -L rat,human -p 8 -u merged.gtf rat.bam human.bam
For finer points, set up Galaxy and run your pipelining script through it.
Apart from pipelining, Galaxy contains a number of data analysis and visualization options. There are a number of public servers as well as a main server, available here: https://wiki.galaxyproject.org/Main.
Galaxy is used through the web, so it reduces the frustrations associated with command-line Linux environments; however, for fine tuning, or for anything other than a simple project, Galaxy is usually installed locally and heavily tuned to the particular NGS project a user is involved with.
Your BioLinux instance comes with Galaxy installed by default, so for learning or small-scale projects you can use the cloud instance too.