Class 21: joint entropy and the REVEAL algorithm

We'll use the bladder cancer gene expression data to test out the REVEAL algorithm. First, we'll load the data and filter to include only genes for which the median log2 expression level is > 14. That should give us 13 genes to work with.



In [31]:

    
gene_matrix_for_network <- read.table("shared/bladder_cancer_genes_tcga.txt",
                                     sep="\t",
                                     header=TRUE,
                                     row.names=1,
                                     stringsAsFactors=FALSE)
print(sprintf("Dimensions of the original matrix of RNA-seq data: %d", dim(gene_matrix_for_network)))
M <- ncol(gene_matrix_for_network)
gene_matrix_filt <- gene_matrix_for_network[apply(gene_matrix_for_network, 1, median) > 14, ]

N <- nrow(gene_matrix_filt)

print(sprintf("Dimensions of the filtered gene matrix: %d", dim(gene_matrix_filt)))









    



[1] "Dimensions of the original matrix of RNA-seq data: 4473"
[2] "Dimensions of the original matrix of RNA-seq data: 414" 
[1] "Dimensions of the filtered gene matrix: 13" 
[2] "Dimensions of the filtered gene matrix: 414"

Binarize the gene expression matrix using the mean value as a breakpoint, turning it into a NxM matrix of logicals (true/false).



In [14]:

    
mean_values <- apply(gene_matrix_filt, 1, mean)
gene_matrix_binarized <- (gene_matrix_filt > matrix(rep(mean_values, ncol(gene_matrix_filt)), ncol=ncol(gene_matrix_filt)))
M <- ncol(gene_matrix_filt)

Test your matrix by printing the first four columns of the first four rows:



In [32]:

    
gene_matrix_binarized[1:4,1:4]









    





X01a589db.02d8.4d75.a2da.bb0bd8140a32 X05126c55.410a.4450.8e88.3b0fa9e49b14 X06e30c48.0b24.46e5.8876.8209ea65d704 X088a8d3a.1884.4d14.93a2.66df2ff47628

	ENSG00000111057 FALSE TRUE FALSE FALSE
	ENSG00000170421 FALSE TRUE FALSE FALSE
	ENSG00000179218  TRUE TRUE  TRUE FALSE
	ENSG00000158710 FALSE TRUE FALSE FALSE

The core part of the REVEAL algorithm is a function that can compute the joint entropy of a collection of logical (TRUE/FALSE) vectors X1, X2, ..., Xn (where length(X1) = length(Xi) = M). Write a function entropy_multiple_vecs that takes as its input a n x M matrix (where n is the number of variables, i.e., the number of genes, and M is the number of samples in which gene expression was measured). The function should use the log2 definition of the Shannon entropy. It should return the joint entropy H(X1, X2, ..., Xn) as a single scalar numeric value. I have created a skeleton version of this function for you, in which you can fill in the code. I have also created some test code that you can use to test your function, below.



In [29]:

    
entropy_multiple_vecs <- function(binary_vecs) {
    stopifnot(ncol(binary_vecs) == M)  # sanity check input
    stopifnot(is.matrix(binary_vecs))  # sanity check input
    n <- nrow(binary_vecs)
    
    ## convert the nxM matrix of logicals to a Mxn data frame of factors
    df <- data.frame(lapply(data.frame(t(binary_vecs)), factor))
    
    ## use "xtabs" to tabulate the number of occurrences of each combination of logical
    ## values for the n inputs; divide by M and call the resulting matrix "probmat"
    probmat <- xtabs(~., df)/M

    ## define a matrix "hmat" that is the negative of probmat element-wise-times log2 of probmat
    hmat <- -probmat*log2(probmat)
    
    ## some entries of hmat will be NaN (do you know why?).  Set them to zero
    hmat[is.nan(hmat)] <- 0
    
    ## return the sum of entries of hmat
    sum(hmat)
}

Here are some test cases to run, after you have written your function



In [30]:

    
print(sprintf("Test 1 for your entropy function.  H(X,Y,Z) for three independent RVs X,Y,Z should be very close to 3: %f",
              entropy_multiple_vecs(t(replicate(3,sample(c(FALSE,TRUE), size=M, replace=TRUE))))))

print(sprintf("Test 2 for your entropy function.  H(X,Y,Z) for X,Y,Z always = FALSE, should be identically zero: %f",
              entropy_multiple_vecs(matrix(rep(FALSE, 3*M), nrow=3))))

print(sprintf("Test 3 for your entropy function.  H(X1,X2,X3,X4) for the first four rows of your matrix should be 3.937690: %f",
              entropy_multiple_vecs(gene_matrix_binarized[1:4,])))









    



[1] "Test 1 for your entropy function.  H(X,Y,Z) for three independent RVs X,Y,Z should be very close to 3: 2.986697"
[1] "Test 2 for your entropy function.  H(X,Y,Z) for X,Y,Z always = FALSE, should be identically zero: 0.000000"
[1] "Test 3 for your entropy function.  H(X1,X2,X3,X4) for the first four rows of your matrix should be 3.937690: 3.471945"

Here is an example implementation of the REVEAL algorithm using your function. It will find regulators and store regulators in the list regulators (the regulators list is initialized to NA for each gene which means that no regulators are known for that gene). Study the code and see if you can figure out exactly how it works. Note the use of do.call, expand.grid, and apply in order to vectorize the code, and the use of list and array in order to define data structures used in the algorithm.



In [27]:

    
ratio_thresh <- 0.1
genes_to_fit <- 1:N
stage <- 0
regulators <- as.list(rep(NA, N))
min_gene_to_fit <- Inf
entropies_for_stages <- list()
max_stage <- 3

print(sprintf("Getting marginal entropies"))
entropies_for_stages[[1]] <- apply(gene_matrix_binarized, 1, function(row) { entropy_multiple_vecs(matrix(row, nrow=1)) })
last_gene_to_fit <- Inf

for (stage in 1:max_stage) {
    for (gene in genes_to_fit) {
        poss_regs = setdiff(1:N, gene)
        poss_regs_combs = combn(poss_regs, stage, simplify=FALSE)
        HGX = sapply(poss_regs_combs, function(poss_reg_comb) {
                entropy_multiple_vecs(gene_matrix_binarized[c(poss_reg_comb, gene), ])
        })
        HX = sapply(poss_regs_combs, function(poss_reg_comb) {
                entropy_multiple_vecs(gene_matrix_binarized[poss_reg_comb, , drop=FALSE])
        })
        HG = entropies_for_stages[[1]][gene]
        min_value = min(HGX-HX)
        if (HG - min_value >= ratio_thresh * HG) {
            regulators[[gene]] <- poss_regs_combs[[which.min(HGX-HX)]]
            genes_to_fit <- setdiff(genes_to_fit, gene)
        }
    }
}

regulators









    



[1] "Getting marginal entropies"






    





	2
	1
	9
	<NA>
	<NA>
	<NA>
	8
	7
	3
	
	8
	9
	11


	12
	11
	
	2
	7
	10

	X01a589db.02d8.4d75.a2da.bb0bd8140a32	X05126c55.410a.4450.8e88.3b0fa9e49b14	X06e30c48.0b24.46e5.8876.8209ea65d704	X088a8d3a.1884.4d14.93a2.66df2ff47628
ENSG00000111057	FALSE	TRUE	FALSE	FALSE
ENSG00000170421	FALSE	TRUE	FALSE	FALSE
ENSG00000179218	TRUE	TRUE	TRUE	FALSE
ENSG00000158710	FALSE	TRUE	FALSE	FALSE