Now that we've aligned the reads, we want to count how many reads go to each gene. This is called the "quantification" step.
If there are more reads in one sample than another, then the one sample will naturally have more reads mapping to each gene even if no gene is differentially expressed. Therefore, we also need to normalized the counts somehow before proceeding to do statistical analysis.
There are many ways to do these two operations (quantification and normalization). We will start by showing one of the most common and standard ways. Then we will look at some issues with the standard approach and we will learn how to overcome those issues.
As with STAR on day I, we used a Perl script to do the work of running all four samples, so we don't have to do them one-by-one.
(1) To view the actual command to run HTSEQ for sample 1, run the following cat
command and copy the result into the text box below.
Notice the option --stranded=no
Notice it takes two files as arguments:
(2) Examining the htseq output:
Do a head on the file output by HTSeq as follows, and copy the result into the text box below.
(3) How many lines are in the htseq output file ($HOME/RNASEQ/reads/sample1/sample1.htseqcount)? Give the answer in the box below.
(4) Let's now examine the top five rows of the data from both before and after normalization. Run the following command and paste the restult in the box below.
[t-test]
We will run T-tests to compute p-values in R. Do this as follows.
It may take some time to calculate all of the p-values, so if it hangs don't worry it may just still be thinking. It should take minutes though, not hours, so if it hangs too long there may be a problem. If there is a problem and you can't recover, you can always kill the terminal and start a new one.
In [0]:
setwd("~/RNASEQ/HTSEQ_FPKM")
data=read.table(pipe("cut -f 1-5 FINAL_master_list_of_genes_counts.htseq.FPKM.txt"), header=TRUE, sep="\t", row.names=1)
t_test <- as.data.frame(c(rep(NA,dim(data)[1])))
colnames(t_test) = "P_VAL"
t_stat <- as.data.frame(c(rep(NA,dim(data)[1])))
colnames(t_stat) = "T_STAT"
for (i in 1:dim(data)[1]) {
t_val_cur <- tryCatch(t.test(as.matrix(data[i,1:2]),as.matrix(data[i,3:4]))$p.value, error=function(x) 1)
t_test$P_VAL[i]=t_val_cur
t_stat_cur <- tryCatch(t.test(as.matrix(data[i,1:2]),as.matrix(data[i,3:4]))$statistic, error=function(x) 1)
t_stat$T_STAT[i]=t_stat_cur
}
out_table = cbind(data,t_test,t_stat)
write.table(out_table, col.names=NA, quote=FALSE, sep="\t", file="t-test.RNASEQ.UNNORM.htseq.FPKM.txt")
[T-test output]
(5) Let's examine the top of the file we just generated. Paste the result in the box below.
Genes are considered significant if their p-values are small.
(6) Give the command to sort the t-test output file by p-value and output the top six lines.
Note that the unix sort
command takes an argument of the form -kn
where you replace n
by the number of the column you want to sort by. Also don't forget the -n
option which tells it to sort things as numbers and not as strings.
Put the command and it's output in the following two boxes.
[Histogram]
Next we will draw a histogram of the p-values. We will do this in R as follows.
In [0]:
setwd("~/RNASEQ/HTSEQ_FPKM")
d=read.delim("t-test.RNASEQ.UNNORM.htseq.FPKM.txt")
p_val=d$P_VAL
#save as png
png("hist.HTSEQ_FPKM.png")
hist(p_val,xlim=c(0,0.1),ylim=c(0,100),main="RNASEQ.HTSEQ_FPKM",breaks=100)
dev.off()
#print to screen
hist(p_val,xlim=c(0,0.1),ylim=c(0,100),main="RNASEQ.HTSEQ_FPKM",breaks=100)
Next we will normalize the data in a more sophisticated way. Later we will re-calculate these p-values using this normalization and we will compare to the histogram we just created to see how the normalization changed things.
Now take a look quickly through the configuration file for PORT. PORT can do many things so it has a complicated config file. But you can usually leave things as their defaults.
PORT is run in two parts. Run part one as shown in the box below. Note that we tell PORT where the reads are with the -loc
flag. By default PORT will then write the results to subdirectories of the same directory where the reads directory is. Later we will run PORT with other options and when we do that we will need to specify an alternate location to write the results. But for this first run we'll use the defaults as follows:
While we wait for PORT step one to finish, you can run the following command to monitor the progress.
You can also watch the log file populate in real time with this command (use CTRL-C to exit)
After all jobs have finished take a look at this file which gives the expected number of reads:
(7) Now look at this file which has the information about high expressors. Paste the result in the box below.
Notice that there is a gene that is extremely highly expressed in two samples.
Since it is one sample in each of the two conditions, this will introduce considerable global variance.
The more variance there is, the bigger the p-values are and consequently the less power we have to detect the real differential gene expression.
We will first proceed without accounting for this highly expressed gene. Then later we will go back and normalize again where we do properly account for this high expressor. We will draw the p-value histogram in both cases to see the effect of properly accounting for the high expressor.
Monitor the job using bjobs
.
The results will be written to the following directory:
$HOME/RNASEQ/NORMALIZED_DATA/GENE/SPREADSHEETS/
Wait until the previous run is finished. Once it is finished we want to run it again, but this time we will specify the option to filter high expressors. We only need to run part two again to accomplish this.
But first we need to create an alternative directory for these results to live in.
Now fire off PORT part two as follows. Note that we specify the --alt_out flag this time, to tell it where to write these results to.
The final spreadsheets for this run will be written to the following directory:
For homework you will need to draw the same p-value histogram we drew earlier, but you will use the two PORT normalizations of the data that we just computed.
To do this you will need to mimic the R code we used today to compute the p-values and draw histograms. You just need to be careful to setwd
to the right directory and to change the names of the input and output files as appropriate.
[Q1] Using the MIN spreadsheet in $HOME/RNASEQ/NORMALIZED_DATA/GENE/SPREADSHEETS/
a. Compute p-values
In [0]:
b. Make a histogram
In [0]:
[Q2] Using the MIN spreadsheet in $HOME/RNASEQ/NORMALIZED_DATA_filter_highEXP/GENE/SPREADSHEETS/
,
a. Compute p-values
In [0]:
b. Make a histogram
In [0]: