IN-CLASS RNASEQ 2

Now that we've aligned the reads, we want to count how many reads go to each gene. This is called the "quantification" step.

If there are more reads in one sample than another, then the one sample will naturally have more reads mapping to each gene even if no gene is differentially expressed. Therefore, we also need to normalized the counts somehow before proceeding to do statistical analysis.

There are many ways to do these two operations (quantification and normalization). We will start by showing one of the most common and standard ways. Then we will look at some issues with the standard approach and we will learn how to overcome those issues.

2. HTSeq / FPKM

A. HTSeq

HTSeq is a widely used program for counting the number or reads that map to genes. It has many options, We will run HTSeq by running a Perl script that in turn runs HTSeq for us. Run the Perl script as follows:

perl $HOME/19_RNA-Seq-I/scripts/runall_htseq.pl $HOME/RNASEQ/sample_dirs.txt $HOME/RNASEQ/reads/ /ext/data/RNASEQ/gene_info/mm9_ens.chr1_chr2_chrM.gtf

As with STAR on day I, we used a Perl script to do the work of running all four samples, so we don't have to do them one-by-one.

(1) To view the actual command to run HTSEQ for sample 1, run the following cat command and copy the result into the text box below.

cat $HOME/RNASEQ/shell_scripts/sample1.htseq.sh

Notice the option --stranded=no

Notice it takes two files as arguments:

A SAM file giving the aligned reads (aligned to the genome)
A file telling HTSeq where the genes are in the genome.

(2) Examining the htseq output:

Do a head on the file output by HTSeq as follows, and copy the result into the text box below.

head $HOME/RNASEQ/reads/sample1/sample1.htseqcount

(3) How many lines are in the htseq output file ($HOME/RNASEQ/reads/sample1/sample1.htseqcount)? Give the answer in the box below.

B. FPKM Normalization

Now we will do the normalization that accounts for variation in sequencing depth and in gene length.

We will do this by running a perl script. But this is also an easy thing to do in R if you ever need to do it yourself.

get_htseq_fpkm_spreadsheet $HOME/RNASEQ/sample_dirs.txt $HOME/RNASEQ/reads/

(4) Let's now examine the top five rows of the data from both before and after normalization. Run the following command and paste the restult in the box below.

head -5 $HOME/RNASEQ/HTSEQ_FPKM/FINAL_master_list_of_genes_counts.htseq.*txt

C. Identify Differentially Expressed Genes

The first two columns are from one treatment and the second two columns are from another treatment. So there could be differentially expressed genes between the two treatments.

[t-test]

We will run T-tests to compute p-values in R. Do this as follows.

It may take some time to calculate all of the p-values, so if it hangs don't worry it may just still be thinking. It should take minutes though, not hours, so if it hangs too long there may be a problem. If there is a problem and you can't recover, you can always kill the terminal and start a new one.



In [0]:

    
setwd("~/RNASEQ/HTSEQ_FPKM")
data=read.table(pipe("cut -f 1-5 FINAL_master_list_of_genes_counts.htseq.FPKM.txt"), header=TRUE, sep="\t", row.names=1)
    
t_test <- as.data.frame(c(rep(NA,dim(data)[1])))
colnames(t_test) = "P_VAL"
t_stat <- as.data.frame(c(rep(NA,dim(data)[1])))
colnames(t_stat) = "T_STAT"

for (i in 1:dim(data)[1]) {
    t_val_cur <- tryCatch(t.test(as.matrix(data[i,1:2]),as.matrix(data[i,3:4]))$p.value, error=function(x) 1)
    t_test$P_VAL[i]=t_val_cur
    t_stat_cur <- tryCatch(t.test(as.matrix(data[i,1:2]),as.matrix(data[i,3:4]))$statistic, error=function(x) 1)
    t_stat$T_STAT[i]=t_stat_cur
}
    
out_table = cbind(data,t_test,t_stat)

write.table(out_table, col.names=NA, quote=FALSE, sep="\t", file="t-test.RNASEQ.UNNORM.htseq.FPKM.txt")

[T-test output]

(5) Let's examine the top of the file we just generated. Paste the result in the box below.

head $HOME/RNASEQ/HTSEQ_FPKM/t-test.RNASEQ.UNNORM.htseq.FPKM.txt

Genes are considered significant if their p-values are small.

(6) Give the command to sort the t-test output file by p-value and output the top six lines.

So we can examine the top five genes plus the header line

Note that the unix sort command takes an argument of the form -kn where you replace n by the number of the column you want to sort by. Also don't forget the -n option which tells it to sort things as numbers and not as strings.

Put the command and it's output in the following two boxes.

[Histogram]

Next we will draw a histogram of the p-values. We will do this in R as follows.



In [0]:

    
setwd("~/RNASEQ/HTSEQ_FPKM")
d=read.delim("t-test.RNASEQ.UNNORM.htseq.FPKM.txt")
p_val=d$P_VAL 
#save as png
png("hist.HTSEQ_FPKM.png")
hist(p_val,xlim=c(0,0.1),ylim=c(0,100),main="RNASEQ.HTSEQ_FPKM",breaks=100)
dev.off()
#print to screen
hist(p_val,xlim=c(0,0.1),ylim=c(0,100),main="RNASEQ.HTSEQ_FPKM",breaks=100)

Next we will normalize the data in a more sophisticated way. Later we will re-calculate these p-values using this normalization and we will compare to the histogram we just created to see how the normalization changed things.

3. PORT - Gene Level

PORT: RNA-Seq Normalization and Quantification pipeline

Gene level normalization/quantification

This normalization method is called PORT. We start by making a config file:

cd $HOME/RNASEQ/ ls $HOME/RNASEQ/reads/*/*.fq > $HOME/RNASEQ/unaligned.txt more $HOME/19_RNA-Seq-I/scripts/port_gene.cfg

Now take a look quickly through the configuration file for PORT. PORT can do many things so it has a complicated config file. But you can usually leave things as their defaults.

[PART1]

PORT is run in two parts. Run part one as shown in the box below. Note that we tell PORT where the reads are with the -loc flag. By default PORT will then write the results to subdirectories of the same directory where the reads directory is. Later we will run PORT with other options and when we do that we will need to specify an alternate location to write the results. But for this first run we'll use the defaults as follows:

run_normalization --sample_dirs $HOME/RNASEQ/sample_dirs.txt --loc $HOME/RNASEQ/reads/ --unaligned $HOME/RNASEQ/unaligned.txt --alignedfilename Aligned.out.sam --cfg $HOME/19_RNA-Seq-I/scripts/port_gene.cfg

While we wait for PORT step one to finish, you can run the following command to monitor the progress.

bjobs

You can also watch the log file populate in real time with this command (use CTRL-C to exit)

tail -f $HOME/RNASEQ/logs/RNASEQ.run_normalization.log

After all jobs have finished take a look at this file which gives the expected number of reads:

more $HOME/RNASEQ/STATS/GENE/expected_num_reads_gnorm.txt

(7) Now look at this file which has the information about high expressors. Paste the result in the box below.

more $HOME/RNASEQ/STATS/GENE/percent_high_expresser_gene.txt

Notice that there is a gene that is extremely highly expressed in two samples.

Since it is one sample in each of the two conditions, this will introduce considerable global variance.

The more variance there is, the bigger the p-values are and consequently the less power we have to detect the real differential gene expression.

We will first proceed without accounting for this highly expressed gene. Then later we will go back and normalize again where we do properly account for this high expressor. We will draw the p-value histogram in both cases to see the effect of properly accounting for the high expressor.

[PART2]

We will do this in two different ways.

First Way: Without Filtering for highly expressed genes

Fire off part two as follows:

Monitor the job using bjobs.

The results will be written to the following directory:

$HOME/RNASEQ/NORMALIZED_DATA/GENE/SPREADSHEETS/

Second Way: With filtering for highly expressed genes

Wait until the previous run is finished. Once it is finished we want to run it again, but this time we will specify the option to filter high expressors. We only need to run part two again to accomplish this.

But first we need to create an alternative directory for these results to live in.

mkdir $HOME/RNASEQ/NORMALIZED_DATA_filter_highEXP

Now fire off PORT part two as follows. Note that we specify the --alt_out flag this time, to tell it where to write these results to.

The final spreadsheets for this run will be written to the following directory:

$HOME/RNASEQ/NORMALIZED_DATA_filter_highEXP/GENE/SPREADSHEETS/

Homework

For homework you will need to draw the same p-value histogram we drew earlier, but you will use the two PORT normalizations of the data that we just computed.

To do this you will need to mimic the R code we used today to compute the p-values and draw histograms. You just need to be careful to setwd to the right directory and to change the names of the input and output files as appropriate.

[Q1] Using the MIN spreadsheet in $HOME/RNASEQ/NORMALIZED_DATA/GENE/SPREADSHEETS/

a. Compute p-values



In [0]:

b. Make a histogram



In [0]:

[Q2] Using the MIN spreadsheet in $HOME/RNASEQ/NORMALIZED_DATA_filter_highEXP/GENE/SPREADSHEETS/,

a. Compute p-values



In [0]:

b. Make a histogram



In [0]: