To run the commands alongside the answers you must first have run all of the tutorial commands for that section and generated the output files they reference.
Let's go to our data
directory.
In [ ]:
cd data
In [ ]:
head counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv
We have to header lines. The first starts with "#" and gives the details of the program and command used to generate the count file.
# Program:featureCounts v1.4.5-p1; Command:"featureCounts" "-O" "-T" "1" "-t" "exon" "-g" "gene_id" "-a" "/lustre/scratch118/infgen/pathogen/pathpipe/refs/Mus/musculus/Mus_musculus_mm10.gtf" "-o" "390176.pe.markdup.bam.featurecounts.csv" "390176.pe.markdup.bam"
The second contains the column headers and starts with "Geneid"
Geneid Chr Start End Strand Length 390176.pe.markdup.bam
So, we could count the total lines in the file and subtract 2 (the header lines).
In [ ]:
wc -l counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv
Or, we can use grep
to exclude the header lines in turn using the -v
option and count the remaining lines.
In [ ]:
grep -v '^#' counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv | grep -v '^Geneid' | wc -l
Or, we can use the -c
option in grep
to count the lines which don't match the header lines, combining the header filter into one search '^[#|Geneid]'
. This essentially means count all the lines which don't (-v
) start with (^
) either "#" or (|
) "Geneid".
In [ ]:
grep -vc '^[#|Geneid]' counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv
Q2: How many genes have associated annotations?
53,945
We can use wc -l
to count the lines in the DEAGO-formatted annotation file.
In [10]:
wc -l ensembl_mm10_deago_formatted.tsv
Notice that there are more annotated genes than are present in our counts file. This is because we got our gene annotations from Ensembl where there may have been more genes than were in the GTF file used with featureCounts. Your annotation file can contain extra genes or no annotation at all for a gene, it won't matter to DEAGO as it will only pick up the annotations for genes in the count files. If a gene in the count file isn't annotated, DEAGO will set it as 'unknown'. However, the annotation file has to match at least one gene identifier from the counts files.
Q3: How many of those genes have associated gene names?
0
We can use many different ways. The simplest is to use one of awk
's built-in variables, NF
, which counts the number of fields in the line. Here, we're looking for all lines containing less than 3 fields (i.e. gene id only, gene id + gene name or gene_id + GO terms) with NR<3
. We can then ask that the second field found contains GO terms (as gene name is empty) with $2 ~ "GO:"
.
In [22]:
awk 'NF<3 && $2 ~ "GO:"' ensembl_mm10_deago_formatted.tsv | wc -l
Q4: How many of those genes have associated GO terms?
32,069
We can just switch our search to look for genes where there are less than three fields NR<3
, but where the second column doesn't contain GO terms $2 !~ "GO:"
.
In [23]:
awk 'NF<3 && $2 !~ "GO:"' ensembl_mm10_deago_formatted.tsv | wc -l
Q1: Do all the samples have similar total read counts?
All except WT_IL22 replicate 4 (4.1 and 4.2) which were higher than the others
Q2: Look at the PCA plot. How many clusters have the samples grouped into?
3
Q3: Do you notice anything in the PCA and sample-to-sample distances plot that you might want to look closer at?
There may be another factor causing variance.
Looking at the PCA plot there seem to be some two slight outliers in both the WT_Ctrl and WT_IL22 clusters. Also, the KO cluster is slightly split into two sub-clusters which don't correspond to the cell type or treatment. Looking at the sample-to-sample distances, it looks like there may be some clustering by replicate number (i.e. a batch effect), in particular with replicate 4. It may be worth considering including replicate as a factor in any downstream analyses.
You will need to look at both the results tables and the results files to get the answers. Sec1 is found in the report as it has a log2FoldChange > 2. Fut8 and B4galt1 are not. This is because a default cutoff of log2FoldChange < -2 or > 2 is used for the report tables to keep the report as small as possible so that it can be opened in a web browser.
To get these answers you can search the differential expression table.
Q1: What is the gene identifier (geneID)?
ENSMUSG00000040364
Q2: What is the log2 fold change?
5.66
Q3: What is the adjusted p-value?
6.66e-73
To get the answer you will need to use awk
or grep
with the results file for the contrast between WT_IL22 and KO_IL22. You will need to update the "
awk -F"\t" '$2=="Fut8"{print $1"\t"$2"\t"$36"\t"$40}' results_<timestamp>/wt_il22_vs_ko_il22_q0.05.txt
Q1: What is the gene identifier (geneID)?
ENSMUSG00000021065
Q2: What is the log2 fold change?
0.796
Q3: What is the adjusted p-value?
2.22e-19
To get the answer you will need to use awk
or grep
with the results file for the contrast between WT_IL22 and KO_IL22. You will need to update the "
awk -F"\t" '$2=="B4galt1"{print $1"\t"$2"\t"$36"\t"$40}' results_<timestamp>/wt_il22_vs_ko_il22_q0.05.txt
Q1: What is the gene identifier (geneID)?
ENSMUSG00000028413
Q2: What is the log2 fold change?
1.13
Q3: What is the adjusted p-value?
1.43e-41
Q1: Which biological processes is this upregulated gene associated with?
GO:0006486 protein glycosylation
Note: be careful of similar gene names e.g. Sec11, try searching with Sec1,