Based on the bioconductor vignettes:
KEGGgraph: Application Examples by Jitao David Zhang (October 30, 2017)
and Bioconductor’s SPIA package Adi L. Tarca, Purvesh Khatri and Sorin Draghici (October 30, 2017)
In [4]:
source("https://bioconductor.org/biocLite.R")
In [8]:
# install KEGGgraph for manipulating KEGG pathways
biocLite("KEGGgraph")
# install SPIA for Signaling Pathway Impact Analysis
biocLite("SPIA")
In [7]:
# install libraries for visualization
biocLite("graph")
biocLite('Rgraphviz')
In [6]:
# install a database to allow mapping of microarray probesets to NCBI and UCSC gene identifiers
biocLite("hgu133plus2.db")
In [5]:
biocLite("AnnotationDbi")
biocLite("RBGL")
we will use the human colorectal cancer pathway in this demonstration, to show how to visualize KGML files. This pathway has the ID hsa05210, and on the KEGG website is rendered thus:
Remember, only the green-highlighted rectangles indicate gene products in the human pathway. The white boxes indicate gene products that are found in this pathway in other organisms.
In [1]:
# load the KEGGgraph library and its functions:
library(KEGGgraph)
Now we will load the KGML file for hsa00010 from file hsa00010.xml
. This file was downloaded from ftp://ftp.bioinformatics.jp/kegg/xml/kgml/non-metabolic/organisms/hsa.tar.gz, which unpacks to all the non-metabolic pathways and their image files. The library has an academic subscription to KEGG, making the FTP database of KEGG files freely accessible to anyone from within the WCM network.
In [10]:
demoKGMLfile <- "xml-nonmet-hsa/hsa05210.xml"
demoGraph <- parseKGML2Graph(demoKGMLfile, genesOnly=FALSE)
In [11]:
head(nodes(demoGraph))
Another option is to use retrieveKGML
to download the KGML file. This requires knowing the pathway ID and the three-letter organism code. This will save the KGML to the file specified.
In [12]:
FileName <- "hsa05210-web.xml"
retrieveKGML(pathwayid="05210", organism="hsa", destfile=FileName)
Once you have the KGML pathway loaded by parseKGML2Graph, you can visualize the nodes and their connections:
In [15]:
# to visualize the graph, we will use the Rgraphviz library:
library(Rgraphviz)
nodeInfo <- getKEGGnodeData(demoGraph)
nodeType <- sapply(nodeInfo, getType)
makeNodeRenderAttrs <- function(g, label=nodes(g),
shape="ellipse", fill="#e0e0e0",...) {
rv <- list(label=label, shape=shape, fill=fill, ...)
nA <- nodeRenderInfo(g)
for(i in seq(along=rv)) {
if (length(rv[[i]]) == 1) {
rv[[i]] <- rep(rv[[i]], numNodes(g))
} else {
if (length(rv[[i]]) != numNodes(g))
stop("Attribute vector must have as many elements as 'g' has nodes.")
}
names(rv[[i]]) <- nodes(g)
nA[[ names(rv)[[i]] ]] <- rv[[i]]
}
nodeRenderInfo(g) <- nA
return(g)
}
In [16]:
demoDrawn <- plotKEGGgraph(demoGraph)
In [17]:
# add some helpful coloring and shapes to differentiate elements
demoDrawnRefine <- makeNodeRenderAttrs(demoDrawn, fill=ifelse(nodeType=="gene", "lightblue", "orange"),
shape=ifelse(nodeType=="gene", "ellipse","rectangle"))
renderGraph(demoDrawnRefine)
In [20]:
library(SPIA)
In [21]:
if(require(SPIA)) {
data(colorectalcancer,package="SPIA")
} else {
data(colorectalcancerSPIA, package="KEGGgraph")
}
In [22]:
data(colorectalcancer)
In [23]:
# view the first rows of the analysed data set "top"
head(top)
We only consider the probes annotated with EntrezGeneID and differentially expressed with FDR p-value less than 0.05.
In [26]:
# load affymetrix array annotation data
library(hgu133plus2.db)
# get list of all differentially expressed genes (DEGs):
x <- hgu133plus2ENTREZID
top$ENTREZ <- unlist(as.list(x[top$ID]))
top <- top[!is.na(top$ENTREZ),]
top <- top[!duplicated(top$ENTREZ),]
tg1 <- top[top$adj.P.Val < 0.05,]
Documentation can be found here.
The SPIA algorithm takes as input two vectors: log2 fold changes of differentially expressed genes (DE_Colorectal) and those of all EntrezGeneID annotated genes (ALL_Colorectal), and returns a table of pathways ranked from the most to the least significant. The results show that Colorectal cancer pathway (ID:05210) is significantly activated (ranked the 5th of the result table). To visualize the results, we visualize the pathway with differentially expressed genes marked with color
Now create a dataframe of differentially expressed genes and their log fold-change values, and a dataframe of all Entrez genes on the Affymetrix array:
In [27]:
DE_Colorectal <- tg1$logFC
names(DE_Colorectal) <- as.vector(tg1$ENTREZ)
ALL_Colorectal <- top$ENTREZ
In [28]:
DE_Colorectal[1:5]
In [29]:
ALL_Colorectal[1:5]
NB: The default KEGG data that was preprocessed for SPIA analysis and is included for the hsa and mmu organisms was downloaded from KEGG’s website on: 09/07/2012. It is thus outdated, but can be updated using the FTP xml files.
To update the pathways available for SPIA, download the relevant pathway KGML files from the FTP site, then use the following code:
xmldir <- "path_to_kgml_files"
org <- "organism_code" # eg "hsa"
makeSPIAdata(kgml.path=xmldir,organism=org,out.path="./")
Then, when running the analysis, specify the directory to the prepared data files with the data.dir
parameter:
res <- spia(de=DE_Colorectal, all=ALL_Colorectal, organism="hsa",data.dir="./")
In [30]:
# This code takes a couple of minutes to run
# pathway analysis based on combined evidence;
# use nB=2000 or more for more accurate results
res=spia(de=DE_Colorectal,all=ALL_Colorectal,organism="hsa",nB=2000,plots=FALSE,beta=NULL,combine="fisher",verbose=FALSE)
#make the output fit this screen
res$Name=substr(res$Name,1,10)
#show first 15 pathways, omit KEGG links
res[1:20,-12]
In [31]:
plotP(res,threshold=0.05)
points(I(-log(pPERT))~I(-log(pNDE)),data=res[res$ID=="05210",],col="green",pch=19,cex=1.5)
We now have a look to see how many of the genes in our pathway are differentially expressed from our microarray data.
In [32]:
# load KEGGgraph object:
colFile <- "xml-nonmet-hsa/hsa05210.xml"
g <- parseKGML2Graph(colFile)
# get number of differentially expressed genes:
deKID <- translateGeneID2KEGGID(names(DE_Colorectal))
# get number of genes from microarray:
allKID <- translateGeneID2KEGGID(ALL_Colorectal)
# create binary list representing DEGs and non-DEGs:
isDiffExp <- nodes(g) %in% deKID
# print output:
sprintf("%2.2f%% genes in hsa05210 are differentially-expressed", mean(isDiffExp)*100)
In [35]:
library(RColorBrewer)
library(org.Hs.eg.db)
library(RBGL)
library(grid)
library(Rgraphviz)
# create color palette for displaying log fold change differences
ar <- 20
cols <- rev(colorRampPalette(brewer.pal(6, "RdBu"))(ar))
# match the node names from DE_Colorectal to the KEGG IDs
logfcs <- DE_Colorectal[match(nodes(g), deKID)]
names(logfcs) <- nodes(g)
# set null values to zero
logfcs[is.na(logfcs)] <- 0
incol <- round((logfcs+2)*5); incol[incol>ar] <- ar
# identify nodes in KEGG graph not in microarray
undetected <- !nodes(g) %in% allKID
# color undetected nodes yellow, and nodes with no fold change grey
logcol <- cols[incol]; logcol[logfcs==0] <- "darkgrey"; logcol[undetected] <- "yellow"
names(logcol) <- names(logfcs)
# set values of the nodes for our graph g:
nA <- makeNodeAttrs(g, fillcolor=logcol, label="", width=10, height=1.2)
par(mar=c(3,5,0,5), mgp=c(0,0,0))
layout(mat=matrix(c(rep(1,8),2), ncol=1, byrow=TRUE))
plot(g, "dot", nodeAttrs=nA)
image(as.matrix(seq(1,ar)), col=cols, yaxt="n", xaxt="n")
mtext("down-regulation", side=1, at=0, line=1)
mtext("up-regulation", side=1, at=1, line=1)
In the above figure, we observe that:
Not all the genes are connected with each other, 40.4761904761905% (34/84) nodes in the pathway have a degree of 0 – they are not conencted to any other node. There are several reasons for this. One of them is the genes with genetic alterations are indicated in the pathway map, but not interweaved into the pathway (cf. http: //www.genome.jp/dbget-bin/show_pathway?hsa05210). Another important factor is that in some pathways, especially human disease pathways, KEGG records genes involved in the disease but does not provide any edge in that pathway, although one can find their interaction partners in other pathways (e.g., Grb2 has no edges in colorectal cancer pathway, however in other maps including MAPK signaling pathway, both up- and down-stream interactors are found). One solution to the later problem is to merge (union) the related pathways together, to this end KEGGgraph provides the function mergeKEGGgraphs.
From the visualization above, it is difficult to recognize any patterns. To examine the pathway in more details, it is necessary to use Divide and Conquer strategy, namely to subset the graph into subgraphs. To this end KEGGgraph provides the function subKEGGgraph, which maintains the KEGG information while subsetting the graph.
In [36]:
# get degree of each node (how many other nodes it is connected to)
gDeg <- degree(g)
# identify unconnected nodes (degree 0)
gIsSingle <- gDeg[[1]] + gDeg[[2]] == 0
options(digits=3)
# translate KEGG identifiers (KEGGID) into Entrez GeneID
gGeneID <- translateKEGGID2GeneID(nodes(g))
# translate Entrez GeneID to gene symbol
gSymbol <- sapply(gGeneID, function(x) mget(x, org.Hs.egSYMBOL, ifnotfound=NA)[[1]])
# create lists of up- and down-regulated genes
isUp <- logfcs > 0
isDown <- logfcs < 0
singleUp <- isUp & gIsSingle
singleDown <- isDown & gIsSingle
List which genes are up or down regulated:
In [37]:
gSymbol[isUp==TRUE]
In [38]:
gSymbol[isDown==TRUE]
In [39]:
# get KEGG IDs for upregulated genes:
ups <- nodes(g)[logfcs > 0]
# get the neighbours of the up-regulated genes:
upNs <- unique(unlist(neighborhood(g, ups, return.self=TRUE)))
# create subgraph:
upSub <- subKEGGgraph(upNs, g)
# remove nodes in subgraph with zero connections
upNeighbor <- nodes(upSub)[sapply(neighborhood(upSub, nodes(upSub)), length)>0]
upNeighbor <- setdiff(upNeighbor, nodes(g)[undetected])
upSub <- subKEGGgraph(upNeighbor, upSub)
# translate KEGG identifiers (KEGGID) into Entrez GeneID
upSubGID <- translateKEGGID2GeneID(nodes(upSub))
# translate Entrez GeneID to gene symbol
upSymbol <- gSymbol[upSubGID]
# plot subgraph
upnA <- makeNodeAttrs(upSub, fillcolor=logcol[nodes(upSub)], label=upSymbol, fixedsize=TRUE, width=10, height=10, font=20)
plot(upSub, "dot", nodeAttrs=upnA)
We can use degree centrality (the sum of in- and out-degree of each node) as a measure of the relative importance of each node compared to others. In this graph, 8 nodes have an degree equal or larger than three (in decreasing order: AKT3, TCF7L1, CCND1, FOS, MAPK3, MAPK1, JUN, MAPK10), and 5 of them are upreguated to different extent. This reinforces the conclusion of SPIA that the pathway is activated - it seems that the relatively important nodes are up-regulated. Especially we note the upregulation of Jun and Fos, two transcription factors with many targeted genes, even though their up-stream interactors (MAPK3 and MAPK1) in this pathway map are either not significantly differentially expressed or slightly down-regulated (MAPK1, logFC=-0.381).
Further analysis could done, for example, by merging the colorectal pathway with linked pathways (Wnt signaling, apoptosis, etc) and investigate the graph characteristics of differnetially expressed genes and their links.
In [9]:
sessionInfo()
In [ ]: