IN-CLASS RNASEQ 3

4. Compare three histograms

Look at the three histograms you created yesterday (one in class, two for homework). Notice how the final normalization gave us many more genes with small p-values.

This illustrates the potential negative impact on the global analysis of even one gene that is highly expressed and highly variable.

This is just one of many factors that introduce global variance.

The PORT pipeline normalizes for many such factors.
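
The effect described above can be illustrated with a small toy example. This is only a sketch with made-up counts, and it uses simple total-count scaling as a stand-in; it is not PORT's algorithm. The gene names and the helper scale_to_common_total are hypothetical.

```python
# Toy illustration (made-up counts, NOT PORT's algorithm): total-count scaling
# with and without one dominant, highly variable gene.

def scale_to_common_total(samples, exclude=()):
    """Scale each sample so that its non-excluded genes sum to the same total."""
    totals = [sum(c for g, c in s.items() if g not in exclude) for s in samples]
    target = min(totals)
    return [{g: c * target / t for g, c in s.items()}
            for s, t in zip(samples, totals)]

# gene1..gene3 are identical in both samples; only "big" differs.
sample_a = {"gene1": 100, "gene2": 200, "gene3": 300, "big": 400}
sample_b = {"gene1": 100, "gene2": 200, "gene3": 300, "big": 5400}

naive = scale_to_common_total([sample_a, sample_b])
filtered = scale_to_common_total([sample_a, sample_b], exclude={"big"})

print("gene1, naive scaling:   ", naive[0]["gene1"], naive[1]["gene1"])
print("gene1, 'big' filtered:  ", filtered[0]["gene1"], filtered[1]["gene1"])
```

In this toy data gene1 has 100 counts in both samples, yet naive scaling makes it look about six-fold lower in the second sample, because the single dominant gene inflates that sample's total. Leaving the dominant gene out of the scaling factor restores the agreement.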

5. PORT - Exon-Intron-Junction Level normalization/quantification

So far everything we have done is at the gene level.

But one of the powers of RNA-Seq is that it allows us to drill down to the level of exons/introns/junctions and ultimately all the way down to the level of single bases.

So we will now run PORT again, but at the exon/intron/junction level. This analysis is much more complicated and requires several normalization considerations that do not arise at the gene level. One example is the balance between intronic and exonic signal, which can cause problems if it is not handled properly.

However, PORT takes care of all of this complication under the hood, so running PORT in this mode is no more complicated than running it in gene-level mode.

(1) Just as before, PORT is run in two parts. The only thing you have to change is the configuration file.

Run [PART1] using the following Exon-Intron-Junction level config file:

 $HOME/19_RNA-Seq-I/scripts/port_eij.cfg

Copy the full command you used to the box below:

You can view the help page for PORT by running:

 run_normalization -h

PORT is configurable in many ways. If you run PORT on your own data, you can edit the config file to change PORT's behavior to suit your needs.

(2) How many highly expressed introns did PORT identify? Look in the appropriate file in the $HOME/RNASEQ/STATS/EXON_INTRON_JUNCTION folder.

(3) Run [PART2], filtering out the high expressers, and type the command you used. You will need to specify the "-cutoff_highexp 5" flag. Don't forget to use the -alt_out flag, as we did before when we filtered out the high expresser in the gene-level analysis.

6. Data Visualization

UCSC Genome Browser

UCSC Track Hub - Before Normalization

UCSC Track Hub - PORT

UCSC Track Hub - PORT (Filter High Expressers)

Now you know basically how to get from raw data to a normalized spreadsheet, and how to run a basic differential expression analysis.

If you need to do this for real, you can use the UPenn PMACS compute cluster. You can obtain an account by writing to pmacshpc@med.upenn.edu.

On PMACS you can run the analysis similarly to how we did it here. To learn more about PORT, click here

Homework

[Q1] Run the same kind of differential expression analysis we ran at gene level, but this time run it at intron level.

You will find the spreadsheet with the intron data in the $HOME/RNASEQ/NORMALIZED_DATA_filter_highEXP/EXON_INTRON_JUNCTION/SPREADSHEETS/ directory (use the MIN spreadsheet for introns). You do not have to generate the histogram; just do the first part, where you compute the p-values. Write the results to a file called t-test_intron-level.txt in the same directory as the intron data spreadsheet.
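
As a reminder of what the p-value step involves, here is a minimal sketch using only the Python standard library. It assumes an equal-variance two-sample t-test and a tab-delimited table with a header row, a feature ID in the first column, and the sample columns split into two consecutive groups. All of these are assumptions; the layout of the real spreadsheet and the test variant used in class may differ, and the toy rows are made up. In practice a library routine such as scipy.stats.ttest_ind would normally replace the hand-written p-value code.

```python
import math
from statistics import mean, variance

def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|].

    Uses Simpson's rule (steps must be even); precise enough for a sketch,
    but not for resolving extremely small p-values.
    """
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def t_test(group1, group2):
    """Equal-variance two-sample t-test; returns (t, p), or None if undefined."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    if pooled == 0:
        return None  # no variation within either group: t is undefined
    t = (mean(group1) - mean(group2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, two_sided_p(t, df)

def p_values_from_table(text, n_group1):
    """Return (feature_id, p) pairs sorted by p, smallest first.

    Assumed layout: header row, ID in column 1, then n_group1 columns for
    group 1 followed by the columns for group 2.
    """
    results = []
    for line in text.strip().splitlines()[1:]:
        fields = line.split("\t")
        values = [float(x) for x in fields[1:]]
        res = t_test(values[:n_group1], values[n_group1:])
        if res is not None:
            results.append((fields[0], res[1]))
    return sorted(results, key=lambda r: r[1])

# Made-up toy table with two samples per group.
toy = ("id\ts1\ts2\ts3\ts4\n"
       "intronA\t1\t2\t3\t4\n"
       "intronB\t10\t11\t30\t31\n")
ranked = p_values_from_table(toy, n_group1=2)
for feature_id, p in ranked:
    print(feature_id, p)
```

Sorting by p-value puts the feature with the strongest evidence of a difference first, which is the row needed for the question below.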




Question: Which intron has the smallest p-value? Enter its genome coordinates and its p-value in the boxes below:

[Q2] Download the unique coverage file for sample1 in $HOME/RNASEQ/NORMALIZED_DATA_filter_highEXP/EXON_INTRON_JUNCTION/COV/ and the high quality junction file for the same sample in $HOME/RNASEQ/NORMALIZED_DATA_filter_highEXP/EXON_INTRON_JUNCTION/JUNCTIONS/.

Upload the coverage file and the junction file to the genome browser (mouse genome/mm9). Go to the intron location identified in [Q1] and take a screenshot. Put the image in the $HOME/21_RNA-Seq-III/homework_images/ directory.

In the box below, enter the three ENSEMBL Transcript IDs for the transcripts that contain the intron identified above. You may have to turn on the ENSEMBL annotation track if it is not on already.

Note that ENSEMBL Transcript IDs start with the prefix ENSMUST followed by a number.

[Q3] Now find the ENSEMBL gene ID for the gene that has the three transcripts you gave above. Gene IDs start with the prefix ENSMUSG followed by a number. Enter it in the box below.
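
If you want to double-check that you copied the right kind of ID, the prefix rule stated above can be expressed as a small sketch. The helper id_kind is hypothetical, and the IDs in the example are placeholders, not answers.

```python
import re

# Patterns follow the rule stated in the text: prefix followed by a number.
TRANSCRIPT_ID = re.compile(r"^ENSMUST\d+$")
GENE_ID = re.compile(r"^ENSMUSG\d+$")

def id_kind(identifier):
    """Classify a string as a mouse ENSEMBL 'transcript' or 'gene' ID, else 'unknown'."""
    s = identifier.strip()
    if TRANSCRIPT_ID.match(s):
        return "transcript"
    if GENE_ID.match(s):
        return "gene"
    return "unknown"

# Placeholder IDs for illustration only.
for candidate in ["ENSMUST00000000000", "ENSMUSG00000000000", "chr1:100-200"]:
    print(candidate, id_kind(candidate))
```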