Class 21: joint entropy and the REVEAL algorithm

We'll use the bladder cancer gene expression data to test out the REVEAL algorithm. First, we'll load the data and filter to include only genes for which the median log2 expression level is > 14. That should give us 13 genes to work with.


In [31]:
gene_matrix_for_network <- read.table("shared/bladder_cancer_genes_tcga.txt",
                                     sep="\t",
                                     header=TRUE,
                                     row.names=1,
                                     stringsAsFactors=FALSE)
print(sprintf("Dimensions of the original matrix of RNA-seq data: %d", dim(gene_matrix_for_network)))
M <- ncol(gene_matrix_for_network)
gene_matrix_filt <- gene_matrix_for_network[apply(gene_matrix_for_network, 1, median) > 14, ]

N <- nrow(gene_matrix_filt)

print(sprintf("Dimensions of the filtered gene matrix: %d", dim(gene_matrix_filt)))


[1] "Dimensions of the original matrix of RNA-seq data: 4473"
[2] "Dimensions of the original matrix of RNA-seq data: 414" 
[1] "Dimensions of the filtered gene matrix: 13" 
[2] "Dimensions of the filtered gene matrix: 414"

Binarize the gene expression matrix using the mean value as a breakpoint, turning it into a NxM matrix of logicals (true/false).


In [14]:
mean_values <- apply(gene_matrix_filt, 1, mean)
gene_matrix_binarized <- (gene_matrix_filt > matrix(rep(mean_values, ncol(gene_matrix_filt)), ncol=ncol(gene_matrix_filt)))
M <- ncol(gene_matrix_filt)

Test your matrix by printing the first four columns of the first four rows:


In [32]:
gene_matrix_binarized[1:4,1:4]


X01a589db.02d8.4d75.a2da.bb0bd8140a32X05126c55.410a.4450.8e88.3b0fa9e49b14X06e30c48.0b24.46e5.8876.8209ea65d704X088a8d3a.1884.4d14.93a2.66df2ff47628
ENSG00000111057FALSETRUE FALSEFALSE
ENSG00000170421FALSETRUE FALSEFALSE
ENSG00000179218 TRUETRUE TRUEFALSE
ENSG00000158710FALSETRUE FALSEFALSE

The core part of the REVEAL algorithm is a function that can compute the joint entropy of a collection of logical (TRUE/FALSE) vectors X1, X2, ..., Xn (where length(X1) = length(Xi) = M). Write a function entropy_multiple_vecs that takes as its input a n x M matrix (where n is the number of variables, i.e., the number of genes, and M is the number of samples in which gene expression was measured). The function should use the log2 definition of the Shannon entropy. It should return the joint entropy H(X1, X2, ..., Xn) as a single scalar numeric value. I have created a skeleton version of this function for you, in which you can fill in the code. I have also created some test code that you can use to test your function, below.


In [29]:
entropy_multiple_vecs <- function(binary_vecs) {
    stopifnot(ncol(binary_vecs) == M)  # sanity check input
    stopifnot(is.matrix(binary_vecs))  # sanity check input
    n <- nrow(binary_vecs)
    
    ## convert the nxM matrix of logicals to a Mxn data frame of factors
    df <- data.frame(lapply(data.frame(t(binary_vecs)), factor))
    
    ## use "xtabs" to tabulate the number of occurrences of each combination of logical
    ## values for the n inputs; divide by M and call the resulting matrix "probmat"
    probmat <- xtabs(~., df)/M

    ## define a matrix "hmat" that is the negative of probmat element-wise-times log2 of probmat
    hmat <- -probmat*log2(probmat)
    
    ## some entries of hmat will be NaN (do you know why?).  Set them to zero
    hmat[is.nan(hmat)] <- 0
    
    ## return the sum of entries of hmat
    sum(hmat)
}

Here are some test cases to run, after you have written your function


In [30]:
print(sprintf("Test 1 for your entropy function.  H(X,Y,Z) for three independent RVs X,Y,Z should be very close to 3: %f",
              entropy_multiple_vecs(t(replicate(3,sample(c(FALSE,TRUE), size=M, replace=TRUE))))))

print(sprintf("Test 2 for your entropy function.  H(X,Y,Z) for X,Y,Z always = FALSE, should be identically zero: %f",
              entropy_multiple_vecs(matrix(rep(FALSE, 3*M), nrow=3))))

print(sprintf("Test 3 for your entropy function.  H(X1,X2,X3,X4) for the first four rows of your matrix should be 3.937690: %f",
              entropy_multiple_vecs(gene_matrix_binarized[1:4,])))


[1] "Test 1 for your entropy function.  H(X,Y,Z) for three independent RVs X,Y,Z should be very close to 3: 2.986697"
[1] "Test 2 for your entropy function.  H(X,Y,Z) for X,Y,Z always = FALSE, should be identically zero: 0.000000"
[1] "Test 3 for your entropy function.  H(X1,X2,X3,X4) for the first four rows of your matrix should be 3.937690: 3.471945"

Here is an example implementation of the REVEAL algorithm using your function. It will find regulators and store regulators in the list regulators (the regulators list is initialized to NA for each gene which means that no regulators are known for that gene). Study the code and see if you can figure out exactly how it works. Note the use of do.call, expand.grid, and apply in order to vectorize the code, and the use of list and array in order to define data structures used in the algorithm.


In [27]:
ratio_thresh <- 0.1
genes_to_fit <- 1:N
stage <- 0
regulators <- as.list(rep(NA, N))
min_gene_to_fit <- Inf
entropies_for_stages <- list()
max_stage <- 3

print(sprintf("Getting marginal entropies"))
entropies_for_stages[[1]] <- apply(gene_matrix_binarized, 1, function(row) { entropy_multiple_vecs(matrix(row, nrow=1)) })
last_gene_to_fit <- Inf

for (stage in 1:max_stage) {
    for (gene in genes_to_fit) {
        poss_regs = setdiff(1:N, gene)
        poss_regs_combs = combn(poss_regs, stage, simplify=FALSE)
        HGX = sapply(poss_regs_combs, function(poss_reg_comb) {
                entropy_multiple_vecs(gene_matrix_binarized[c(poss_reg_comb, gene), ])
        })
        HX = sapply(poss_regs_combs, function(poss_reg_comb) {
                entropy_multiple_vecs(gene_matrix_binarized[poss_reg_comb, , drop=FALSE])
        })
        HG = entropies_for_stages[[1]][gene]
        min_value = min(HGX-HX)
        if (HG - min_value >= ratio_thresh * HG) {
            regulators[[gene]] <- poss_regs_combs[[which.min(HGX-HX)]]
            genes_to_fit <- setdiff(genes_to_fit, gene)
        }
    }
}

regulators


[1] "Getting marginal entropies"
  1. 2
  2. 1
  3. 9
  4. <NA>
  5. <NA>
  6. <NA>
  7. 8
  8. 7
  9. 3
    1. 8
    2. 9
    3. 11
  10. 12
  11. 11
    1. 2
    2. 7
    3. 10