Answers

To run the commands alongside the answers you must first have run all of the tutorial commands for that section and generated the output files they reference.

Tutorial sections

Preparing input data
Running a quality control (QC) analysis
Running a differential expression (DE) analysis
Running a GO term enrichment (GO) analysis

Let's go to our data directory.



In [ ]:

    
cd data

Preparing input data

Q1: How many genes are there in each of the count files?

41,388

There are several ways you can do this. First look at the fist few lines of one of the files.



In [ ]:

    
head counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv

We have to header lines. The first starts with "#" and gives the details of the program and command used to generate the count file.

# Program:featureCounts v1.4.5-p1; Command:"featureCounts" "-O" "-T" "1" "-t" "exon" "-g" "gene_id" "-a" "/lustre/scratch118/infgen/pathogen/pathpipe/refs/Mus/musculus/Mus_musculus_mm10.gtf" "-o" "390176.pe.markdup.bam.featurecounts.csv" "390176.pe.markdup.bam"

The second contains the column headers and starts with "Geneid"

Geneid  Chr Start   End Strand  Length  390176.pe.markdup.bam

So, we could count the total lines in the file and subtract 2 (the header lines).



In [ ]:

    
wc -l counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv

Or, we can use grep to exclude the header lines in turn using the -v option and count the remaining lines.



In [ ]:

    
grep -v '^#' counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv | grep -v '^Geneid' | wc -l

Or, we can use the -c option in grep to count the lines which don't match the header lines, combining the header filter into one search '^[#|Geneid]'. This essentially means count all the lines which don't (-v) start with (^) either "#" or (|) "Geneid".



In [ ]:

    
grep -vc '^[#|Geneid]' counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv

Q2: How many genes have associated annotations?

53,945

We can use wc -l to count the lines in the DEAGO-formatted annotation file.



In [10]:

    
wc -l ensembl_mm10_deago_formatted.tsv









    



   53945 ensembl_mm10_deago_formatted.tsv

Notice that there are more annotated genes than are present in our counts file. This is because we got our gene annotations from Ensembl where there may have been more genes than were in the GTF file used with featureCounts. Your annotation file can contain extra genes or no annotation at all for a gene, it won't matter to DEAGO as it will only pick up the annotations for genes in the count files. If a gene in the count file isn't annotated, DEAGO will set it as 'unknown'. However, the annotation file has to match at least one gene identifier from the counts files.

Q3: How many of those genes have associated gene names?

We can use many different ways. The simplest is to use one of awk's built-in variables, NF, which counts the number of fields in the line. Here, we're looking for all lines containing less than 3 fields (i.e. gene id only, gene id + gene name or gene_id + GO terms) with NR<3. We can then ask that the second field found contains GO terms (as gene name is empty) with $2 ~ "GO:".



In [22]:

    
awk 'NF<3 && $2 ~ "GO:"' ensembl_mm10_deago_formatted.tsv | wc -l

Q4: How many of those genes have associated GO terms?

32,069

We can just switch our search to look for genes where there are less than three fields NR<3, but where the second column doesn't contain GO terms $2 !~ "GO:".



In [23]:

    
awk 'NF<3 && $2 !~ "GO:"' ensembl_mm10_deago_formatted.tsv | wc -l

Running a quality control analysis

Q1: Do all the samples have similar total read counts?

All except WT_IL22 replicate 4 (4.1 and 4.2) which were higher than the others

Q2: Look at the PCA plot. How many clusters have the samples grouped into?

Q3: Do you notice anything in the PCA and sample-to-sample distances plot that you might want to look closer at?

There may be another factor causing variance.

Looking at the PCA plot there seem to be some two slight outliers in both the WT_Ctrl and WT_IL22 clusters. Also, the KO cluster is slightly split into two sub-clusters which don't correspond to the cell type or treatment. Looking at the sample-to-sample distances, it looks like there may be some clustering by replicate number (i.e. a batch effect), in particular with replicate 4. It may be worth considering including replicate as a factor in any downstream analyses.

Running a differential expression analysis

You will need to look at both the results tables and the results files to get the answers. Sec1 is found in the report as it has a log2FoldChange > 2. Fut8 and B4galt1 are not. This is because a default cutoff of log2FoldChange < -2 or > 2 is used for the report tables to keep the report as small as possible so that it can be opened in a web browser.

Sec1

To get these answers you can search the differential expression table.

Q1: What is the gene identifier (geneID)?

ENSMUSG00000040364

Q2: What is the log2 fold change?

5.66

Q3: What is the adjusted p-value?

6.66e-73

Fut8

To get the answer you will need to use awk or grep with the results file for the contrast between WT_IL22 and KO_IL22. You will need to update the "" portion of the name of the results folder so that it matches yours.

awk -F"\t" '$2=="Fut8"{print $1"\t"$2"\t"$36"\t"$40}' results_<timestamp>/wt_il22_vs_ko_il22_q0.05.txt

Q1: What is the gene identifier (geneID)?

ENSMUSG00000021065

Q2: What is the log2 fold change?

0.796

Q3: What is the adjusted p-value?

2.22e-19

B4galt1

awk -F"\t" '$2=="B4galt1"{print $1"\t"$2"\t"$36"\t"$40}' results_<timestamp>/wt_il22_vs_ko_il22_q0.05.txt

Q1: What is the gene identifier (geneID)?

ENSMUSG00000028413

Q2: What is the log2 fold change?

1.13

Q3: What is the adjusted p-value?

1.43e-41

Running a GO term enrichment analysis

Q1: Which biological processes is this upregulated gene associated with?

Fut2

GO:0006486 protein glycosylation

Sec1

GO:0006486 protein glycosylation

Note: be careful of similar gene names e.g. Sec11, try searching with Sec1,

Fut8

GO:0006486 protein glycosylation

B4galt1

GO:0006486 protein glycosylation
GO:0043065 positive regulation of apoptotic process
GO:0006954 inflammatory response