Class 21: joint entropy and the REVEAL algorithm

We'll use the bladder cancer gene expression data to test out the REVEAL algorithm. First, we'll load the data and filter to include only genes for which the median log2 expression level is > 12 (as we did in class session 20). That should give us 164 genes to work with.

Import the Python modules that we will need for this exercise


In [31]:
import pandas
import numpy
import itertools

Load the data file shared/bladder_cancer_genes_tcga.txt into a pandas.DataFrame, convert it to a numpy.ndarray matrix, and print the matrix dimensions


In [4]:
gene_matrix_for_network_df = pandas.read_csv("shared/bladder_cancer_genes_tcga.txt", sep="\t")
gene_matrix_for_network = gene_matrix_for_network_df.as_matrix()
print(gene_matrix_for_network.shape)


(4473, 414)

Filter the matrix to include only rows for which the column-wise median is > 14; matrix should now be 13 x 414.


In [114]:
genes_keep = numpy.where(numpy.median(gene_matrix_for_network, axis=1) > 14)
matrix_filt = gene_matrix_for_network[genes_keep, ][0]
matrix_filt.shape
N = matrix_filt.shape[0]
M = matrix_filt.shape[1]

Binarize the gene expression matrix using the mean value as a breakpoint, turning it into a NxM matrix of booleans (True/False). Call it gene_matrix_binarized.


In [115]:
gene_matrix_binarized = numpy.tile(numpy.mean(matrix_filt, axis=1),(M,1)).transpose() < matrix_filt
print(gene_matrix_binarized.shape)


(13, 414)

Test your matrix by printing the first four columns of the first four rows:


In [129]:
gene_matrix_binarized[0:4,0:4]


Out[129]:
array([[False,  True, False, False],
       [False,  True, False, False],
       [ True,  True,  True, False],
       [False,  True, False, False]], dtype=bool)

The core part of the REVEAL algorithm is a function that can compute the joint entropy of a collection of binary (TRUE/FALSE) vectors X1, X2, ..., Xn (where length(X1) = length(Xi) = M). Write a function entropy_multiple_vecs that takes as its input a nxM matrix (where n is the number of variables, i.e., genes, and M is the number of samples in which gene expression was measured). The function should use the log2 definition of the Shannon entropy. It should return the joint entropy H(X1, X2, ..., Xn) as a scalar numeric value. I have created a skeleton version of this function for you, in which you can fill in the code. I have also created some test code that you can use to test your function, below.


In [96]:
def entropy_multiple_vecs(binary_vecs):
    ## use shape to get the numbers of rows and columns as [n,M]
    [n, M] = binary_vecs.shape
    
    # make a "M x n" dataframe from the transpose of the matrix binary_vecs
    binary_df = pandas.DataFrame(binary_vecs.transpose())
    
    # use the groupby method to obtain a data frame of counts of unique occurrences of the 2^n possible logical states
    binary_df_counts = binary_df.groupby(binary_df.columns.values.tolist()).size().values
    
    # divide the matrix of counts by M, to get a probability matrix
    probvec = binary_df_counts/M
    
    # compute the shannon entropy using the formula
    hvec = -probvec*numpy.log2(probvec)
    return numpy.sum(hvec)

This test case should produce the value 3.938:


In [97]:
print(entropy_multiple_vecs(gene_matrix_binarized[0:4,]))


3.81175425925

Example implementation of the REVEAL algorithm:

We'll go through stage 3


In [126]:
ratio_thresh = 0.1
genes_to_fit = list(range(0,N))
stage = 0
regulators = [None]*N
entropies_for_stages = [None]*N
max_stage = 4

entropies_for_stages[0] = numpy.zeros(N)

for i in range(0,N):
    single_row_matrix = gene_matrix_binarized[i,:,None].transpose()
    entropies_for_stages[0][i] = entropy_multiple_vecs(single_row_matrix)
    
genes_to_fit = set(range(0,N))

In [127]:
for stage in range(1,max_stage + 1):
    for gene in genes_to_fit.copy():
        # we are trying to find regulators for gene "gene"
        poss_regs = set(range(0,N)) - set([gene])
        poss_regs_combs = [list(x) for x in itertools.combinations(poss_regs, stage)]
        HGX = numpy.array([ entropy_multiple_vecs(gene_matrix_binarized[[gene] + poss_regs_comb,:]) for poss_regs_comb in poss_regs_combs ])
        HX = numpy.array([ entropy_multiple_vecs(gene_matrix_binarized[poss_regs_comb,:]) for poss_regs_comb in poss_regs_combs ])
        HG = entropies_for_stages[0][gene]
        min_value = numpy.min(HGX - HX)
        if HG - min_value >= ratio_thresh * HG:
            regulators[gene]=poss_regs_combs[numpy.argmin(HGX - HX)]
            genes_to_fit.remove(gene)
regulators


Out[127]:
[[1],
 [0],
 [8],
 None,
 None,
 None,
 [7],
 [6],
 [2],
 [7, 8, 10],
 [11],
 [10],
 [1, 6, 9]]

In [ ]: