We'll use the bladder cancer gene expression data to test out the REVEAL algorithm. First, we'll load the data and filter to include only genes for which the median log2 expression level is > 14. That should give us 13 genes to work with.
In [31]:
gene_matrix_for_network <- read.table("shared/bladder_cancer_genes_tcga.txt",
sep="\t",
header=TRUE,
row.names=1,
stringsAsFactors=FALSE)
print(sprintf("Dimensions of the original matrix of RNA-seq data: %d", dim(gene_matrix_for_network)))
M <- ncol(gene_matrix_for_network)
gene_matrix_filt <- gene_matrix_for_network[apply(gene_matrix_for_network, 1, median) > 14, ]
N <- nrow(gene_matrix_filt)
print(sprintf("Dimensions of the filtered gene matrix: %d", dim(gene_matrix_filt)))
Binarize the gene expression matrix using the mean value as a breakpoint, turning it into a NxM matrix of logicals (true/false).
In [14]:
mean_values <- apply(gene_matrix_filt, 1, mean)
gene_matrix_binarized <- (gene_matrix_filt > matrix(rep(mean_values, ncol(gene_matrix_filt)), ncol=ncol(gene_matrix_filt)))
M <- ncol(gene_matrix_filt)
Test your matrix by printing the first four columns of the first four rows:
In [32]:
gene_matrix_binarized[1:4,1:4]
The core part of the REVEAL algorithm is a function that can compute the joint entropy of a collection of logical (TRUE/FALSE) vectors X1, X2, ..., Xn (where length(X1) = length(Xi) = M
).
Write a function entropy_multiple_vecs
that takes as its input a n x M
matrix (where n
is the number of variables, i.e., the number of genes, and M
is the number of samples in which gene expression was measured). The function should use the log2 definition of the Shannon entropy. It should return the joint entropy H(X1, X2, ..., Xn) as a single scalar numeric value. I have created a skeleton version of this function for you, in which you can fill in the code. I have also created some test code that you can use to test your function, below.
In [29]:
entropy_multiple_vecs <- function(binary_vecs) {
stopifnot(ncol(binary_vecs) == M) # sanity check input
stopifnot(is.matrix(binary_vecs)) # sanity check input
n <- nrow(binary_vecs)
## convert the nxM matrix of logicals to a Mxn data frame of factors
df <- data.frame(lapply(data.frame(t(binary_vecs)), factor))
## use "xtabs" to tabulate the number of occurrences of each combination of logical
## values for the n inputs; divide by M and call the resulting matrix "probmat"
probmat <- xtabs(~., df)/M
## define a matrix "hmat" that is the negative of probmat element-wise-times log2 of probmat
hmat <- -probmat*log2(probmat)
## some entries of hmat will be NaN (do you know why?). Set them to zero
hmat[is.nan(hmat)] <- 0
## return the sum of entries of hmat
sum(hmat)
}
Here are some test cases to run, after you have written your function
In [30]:
print(sprintf("Test 1 for your entropy function. H(X,Y,Z) for three independent RVs X,Y,Z should be very close to 3: %f",
entropy_multiple_vecs(t(replicate(3,sample(c(FALSE,TRUE), size=M, replace=TRUE))))))
print(sprintf("Test 2 for your entropy function. H(X,Y,Z) for X,Y,Z always = FALSE, should be identically zero: %f",
entropy_multiple_vecs(matrix(rep(FALSE, 3*M), nrow=3))))
print(sprintf("Test 3 for your entropy function. H(X1,X2,X3,X4) for the first four rows of your matrix should be 3.937690: %f",
entropy_multiple_vecs(gene_matrix_binarized[1:4,])))
Here is an example implementation of the REVEAL algorithm using your function. It will find regulators and store regulators in the list regulators
(the regulators
list is initialized to NA for each gene which means that no regulators are known for that gene). Study the code and see if you can figure out exactly how it works. Note the use of do.call
, expand.grid
, and apply
in order to vectorize the code, and the use of list
and array
in order to define data structures used in the algorithm.
In [27]:
ratio_thresh <- 0.1
genes_to_fit <- 1:N
stage <- 0
regulators <- as.list(rep(NA, N))
min_gene_to_fit <- Inf
entropies_for_stages <- list()
max_stage <- 3
print(sprintf("Getting marginal entropies"))
entropies_for_stages[[1]] <- apply(gene_matrix_binarized, 1, function(row) { entropy_multiple_vecs(matrix(row, nrow=1)) })
last_gene_to_fit <- Inf
for (stage in 1:max_stage) {
for (gene in genes_to_fit) {
poss_regs = setdiff(1:N, gene)
poss_regs_combs = combn(poss_regs, stage, simplify=FALSE)
HGX = sapply(poss_regs_combs, function(poss_reg_comb) {
entropy_multiple_vecs(gene_matrix_binarized[c(poss_reg_comb, gene), ])
})
HX = sapply(poss_regs_combs, function(poss_reg_comb) {
entropy_multiple_vecs(gene_matrix_binarized[poss_reg_comb, , drop=FALSE])
})
HG = entropies_for_stages[[1]][gene]
min_value = min(HGX-HX)
if (HG - min_value >= ratio_thresh * HG) {
regulators[[gene]] <- poss_regs_combs[[which.min(HGX-HX)]]
genes_to_fit <- setdiff(genes_to_fit, gene)
}
}
}
regulators