CS446/546 - Class Session 19 - Correlation network

In this class session we are going to analyze gene expression data from a human bladder cancer cohort, using Python. We will load a data matrix of expression measurements of 4,473 genes in 414 different bladder cancer samples. These genes were selected because they are differentially expressed between normal bladder and bladder cancer (and thus are more likely to have a function specifically in bladder cancer), but the columns of the data matrix are restricted to bladder cancer samples (not normal bladder) because we want to obtain a network representing variation across cancers. The measurements in the matrix have already been normalized to account for inter-sample heterogeneity and then log2 transformed. Our job is to compute Pearson correlation coefficients between all pairs of genes, obtain Fisher-transformed z scores for all pairs of genes, test each pair of genes for significance of the z score, adjust for multiple hypothesis testing, filter out any pair for which R < 0.75 or Padj > 0.01, load the resulting graph into an igraph.Graph object, and plot the degree distribution on log-log scale. We will then answer two questions: (1) does the network appear to be scale-free? and (2) what is its best-fit scaling exponent?

We will start by importing all of the modules that we will need for this notebook. Note the difference in language-design philosophy between R (which requires loading one package for this analysis) and Python (where we have to load eight modules). Python keeps its core minimal, whereas R has a lot of statistical and plotting functions in the base language (or in packages that are loaded by default).


In [57]:
import pandas
import scipy.stats
import matplotlib
import pylab
import numpy
import statsmodels.sandbox.stats.multicomp
import igraph
import math

Using pandas.read_csv, load the tab-delimited text file of gene expression measurements (rows correspond to genes, columns correspond to bladder tumor samples) into a data frame gene_matrix_for_network_df.


In [58]:
gene_matrix_for_network_df = pandas.read_csv("shared/bladder_cancer_genes_tcga.txt", sep="\t")

Use the pandas.DataFrame.as_matrix method to make a matrix gene_matrix_for_network. Print out the dimensions of the matrix by accessing its shape attribute.


In [59]:
gene_matrix_for_network = gene_matrix_for_network_df.as_matrix()
gene_matrix_for_network.shape


Out[59]:
(4473, 414)
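
Aside: DataFrame.as_matrix was deprecated in later pandas releases and eventually removed, so the call above may fail on a newer installation. Assuming your pandas version provides them, either of the commented-out alternatives below should yield the same NumPy array.

# alternatives for newer pandas, where DataFrame.as_matrix is unavailable:
# gene_matrix_for_network = gene_matrix_for_network_df.values        # works on essentially all pandas versions
# gene_matrix_for_network = gene_matrix_for_network_df.to_numpy()    # pandas >= 0.24 (assumed)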

Use del to delete the data frame, since we no longer need it (this saves memory).


In [60]:
del gene_matrix_for_network_df

Look at the online help for the numpy.corrcoef function, using help(numpy.corrcoef). When you pass a single argument x which is a 2D "array" (i.e., a matrix), by default does corrcoef compute coefficients for pairs of rows, or pairs of columns?


In [61]:
help(numpy.corrcoef)


Help on function corrcoef in module numpy.lib.function_base:

corrcoef(x, y=None, rowvar=True, bias=<class 'numpy._globals._NoValue'>, ddof=<class 'numpy._globals._NoValue'>)
    Return Pearson product-moment correlation coefficients.
    
    Please refer to the documentation for `cov` for more detail.  The
    relationship between the correlation coefficient matrix, `R`, and the
    covariance matrix, `C`, is
    
    .. math:: R_{ij} = \frac{ C_{ij} } { \sqrt{ C_{ii} * C_{jj} } }
    
    The values of `R` are between -1 and 1, inclusive.
    
    Parameters
    ----------
    x : array_like
        A 1-D or 2-D array containing multiple variables and observations.
        Each row of `x` represents a variable, and each column a single
        observation of all those variables. Also see `rowvar` below.
    y : array_like, optional
        An additional set of variables and observations. `y` has the same
        shape as `x`.
    rowvar : bool, optional
        If `rowvar` is True (default), then each row represents a
        variable, with observations in the columns. Otherwise, the relationship
        is transposed: each column represents a variable, while the rows
        contain observations.
    bias : _NoValue, optional
        Has no effect, do not use.
    
        .. deprecated:: 1.10.0
    ddof : _NoValue, optional
        Has no effect, do not use.
    
        .. deprecated:: 1.10.0
    
    Returns
    -------
    R : ndarray
        The correlation coefficient matrix of the variables.
    
    See Also
    --------
    cov : Covariance matrix
    
    Notes
    -----
    Due to floating point rounding the resulting array may not be Hermitian,
    the diagonal elements may not be 1, and the elements may not satisfy the
    inequality abs(a) <= 1. The real and imaginary parts are clipped to the
    interval [-1,  1] in an attempt to improve on that situation but is not
    much help in the complex case.
    
    This function accepts but discards arguments `bias` and `ddof`.  This is
    for backwards compatibility with previous versions of this function.  These
    arguments had no effect on the return values of the function and can be
    safely ignored in this and previous versions of numpy.

Compute the 4,473 x 4,473 matrix of gene-gene Pearson correlation coefficients, using numpy.corrcoef (this function treats each row as a variable, so you don't have to do any transposing of the matrix, unlike the situation in R).


In [62]:
gene_matrix_for_network_cor = numpy.corrcoef(gene_matrix_for_network)

Look at the online help for numpy.fill_diagonal. Does it return the modified matrix or modify the matrix argument in place?


In [63]:
help(numpy.fill_diagonal)


Help on function fill_diagonal in module numpy.lib.index_tricks:

fill_diagonal(a, val, wrap=False)
    Fill the main diagonal of the given array of any dimensionality.
    
    For an array `a` with ``a.ndim >= 2``, the diagonal is the list of
    locations with indices ``a[i, ..., i]`` all identical. This function
    modifies the input array in-place, it does not return a value.
    
    Parameters
    ----------
    a : array, at least 2-D.
      Array whose diagonal is to be filled, it gets modified in-place.
    
    val : scalar
      Value to be written on the diagonal, its type must be compatible with
      that of the array a.
    
    wrap : bool
      For tall matrices in NumPy version up to 1.6.2, the
      diagonal "wrapped" after N columns. You can have this behavior
      with this option. This affects only tall matrices.
    
    See also
    --------
    diag_indices, diag_indices_from
    
    Notes
    -----
    .. versionadded:: 1.4.0
    
    This functionality can be obtained via `diag_indices`, but internally
    this version uses a much faster implementation that never constructs the
    indices and uses simple slicing.
    
    Examples
    --------
    >>> a = np.zeros((3, 3), int)
    >>> np.fill_diagonal(a, 5)
    >>> a
    array([[5, 0, 0],
           [0, 5, 0],
           [0, 0, 5]])
    
    The same function can operate on a 4-D array:
    
    >>> a = np.zeros((3, 3, 3, 3), int)
    >>> np.fill_diagonal(a, 4)
    
    We only show a few blocks for clarity:
    
    >>> a[0, 0]
    array([[4, 0, 0],
           [0, 0, 0],
           [0, 0, 0]])
    >>> a[1, 1]
    array([[0, 0, 0],
           [0, 4, 0],
           [0, 0, 0]])
    >>> a[2, 2]
    array([[0, 0, 0],
           [0, 0, 0],
           [0, 0, 4]])
    
    The wrap option affects only tall matrices:
    
    >>> # tall matrices no wrap
    >>> a = np.zeros((5, 3),int)
    >>> fill_diagonal(a, 4)
    >>> a
    array([[4, 0, 0],
           [0, 4, 0],
           [0, 0, 4],
           [0, 0, 0],
           [0, 0, 0]])
    
    >>> # tall matrices wrap
    >>> a = np.zeros((5, 3),int)
    >>> fill_diagonal(a, 4, wrap=True)
    >>> a
    array([[4, 0, 0],
           [0, 4, 0],
           [0, 0, 4],
           [0, 0, 0],
           [4, 0, 0]])
    
    >>> # wide matrices
    >>> a = np.zeros((3, 5),int)
    >>> fill_diagonal(a, 4, wrap=True)
    >>> a
    array([[4, 0, 0, 0, 0],
           [0, 4, 0, 0, 0],
           [0, 0, 4, 0, 0]])

Set the diagonal elements of the matrix to zero, using numpy.fill_diagonal.


In [64]:
numpy.fill_diagonal(gene_matrix_for_network_cor, 0)

Look at the online help for numpy.multiply. Does it do element-wise multiplication or matrix multiplication?


In [65]:
help(numpy.multiply)


Help on ufunc object:

multiply = class ufunc(builtins.object)
 |  Functions that operate element by element on whole arrays.
 |  
 |  To see the documentation for a specific ufunc, use `info`.  For
 |  example, ``np.info(np.sin)``.  Because ufuncs are written in C
 |  (for speed) and linked into Python with NumPy's ufunc facility,
 |  Python's help() function finds this page whenever help() is called
 |  on a ufunc.
 |  
 |  A detailed explanation of ufuncs can be found in the docs for :ref:`ufuncs`.
 |  
 |  Calling ufuncs:
 |  ===============
 |  
 |  op(*x[, out], where=True, **kwargs)
 |  Apply `op` to the arguments `*x` elementwise, broadcasting the arguments.
 |  
 |  The broadcasting rules are:
 |  
 |  * Dimensions of length 1 may be prepended to either array.
 |  * Arrays may be repeated along dimensions of length 1.
 |  
 |  Parameters
 |  ----------
 |  *x : array_like
 |      Input arrays.
 |  out : ndarray, None, or tuple of ndarray and None, optional
 |      Alternate array object(s) in which to put the result; if provided, it
 |      must have a shape that the inputs broadcast to. A tuple of arrays
 |      (possible only as a keyword argument) must have length equal to the
 |      number of outputs; use `None` for outputs to be allocated by the ufunc.
 |  where : array_like, optional
 |      Values of True indicate to calculate the ufunc at that position, values
 |      of False indicate to leave the value in the output alone.
 |  **kwargs
 |      For other keyword-only arguments, see the :ref:`ufunc docs <ufuncs.kwargs>`.
 |  
 |  Returns
 |  -------
 |  r : ndarray or tuple of ndarray
 |      `r` will have the shape that the arrays in `x` broadcast to; if `out` is
 |      provided, `r` will be equal to `out`. If the function has more than one
 |      output, then the result will be a tuple of arrays.
 |  
 |  Methods defined here:
 |  
 |  __call__(self, /, *args, **kwargs)
 |      Call self as a function.
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  __str__(self, /)
 |      Return str(self).
 |  
 |  accumulate(...)
 |      accumulate(array, axis=0, dtype=None, out=None, keepdims=None)
 |      
 |      Accumulate the result of applying the operator to all elements.
 |      
 |      For a one-dimensional array, accumulate produces results equivalent to::
 |      
 |        r = np.empty(len(A))
 |        t = op.identity        # op = the ufunc being applied to A's  elements
 |        for i in range(len(A)):
 |            t = op(t, A[i])
 |            r[i] = t
 |        return r
 |      
 |      For example, add.accumulate() is equivalent to np.cumsum().
 |      
 |      For a multi-dimensional array, accumulate is applied along only one
 |      axis (axis zero by default; see Examples below) so repeated use is
 |      necessary if one wants to accumulate over multiple axes.
 |      
 |      Parameters
 |      ----------
 |      array : array_like
 |          The array to act on.
 |      axis : int, optional
 |          The axis along which to apply the accumulation; default is zero.
 |      dtype : data-type code, optional
 |          The data-type used to represent the intermediate results. Defaults
 |          to the data-type of the output array if such is provided, or the
 |          the data-type of the input array if no output array is provided.
 |      out : ndarray, None, or tuple of ndarray and None, optional
 |          A location into which the result is stored. If not provided or `None`,
 |          a freshly-allocated array is returned. For consistency with
 |          :ref:`ufunc.__call__`, if given as a keyword, this may be wrapped in a
 |          1-element tuple.
 |      
 |          .. versionchanged:: 1.13.0
 |             Tuples are allowed for keyword argument.
 |      keepdims : bool
 |          Has no effect. Deprecated, and will be removed in future.
 |      
 |      Returns
 |      -------
 |      r : ndarray
 |          The accumulated values. If `out` was supplied, `r` is a reference to
 |          `out`.
 |      
 |      Examples
 |      --------
 |      1-D array examples:
 |      
 |      >>> np.add.accumulate([2, 3, 5])
 |      array([ 2,  5, 10])
 |      >>> np.multiply.accumulate([2, 3, 5])
 |      array([ 2,  6, 30])
 |      
 |      2-D array examples:
 |      
 |      >>> I = np.eye(2)
 |      >>> I
 |      array([[ 1.,  0.],
 |             [ 0.,  1.]])
 |      
 |      Accumulate along axis 0 (rows), down columns:
 |      
 |      >>> np.add.accumulate(I, 0)
 |      array([[ 1.,  0.],
 |             [ 1.,  1.]])
 |      >>> np.add.accumulate(I) # no axis specified = axis zero
 |      array([[ 1.,  0.],
 |             [ 1.,  1.]])
 |      
 |      Accumulate along axis 1 (columns), through rows:
 |      
 |      >>> np.add.accumulate(I, 1)
 |      array([[ 1.,  1.],
 |             [ 0.,  1.]])
 |  
 |  at(...)
 |      at(a, indices, b=None)
 |      
 |      Performs unbuffered in place operation on operand 'a' for elements
 |      specified by 'indices'. For addition ufunc, this method is equivalent to
 |      `a[indices] += b`, except that results are accumulated for elements that
 |      are indexed more than once. For example, `a[[0,0]] += 1` will only
 |      increment the first element once because of buffering, whereas
 |      `add.at(a, [0,0], 1)` will increment the first element twice.
 |      
 |      .. versionadded:: 1.8.0
 |      
 |      Parameters
 |      ----------
 |      a : array_like
 |          The array to perform in place operation on.
 |      indices : array_like or tuple
 |          Array like index object or slice object for indexing into first
 |          operand. If first operand has multiple dimensions, indices can be a
 |          tuple of array like index objects or slice objects.
 |      b : array_like
 |          Second operand for ufuncs requiring two operands. Operand must be
 |          broadcastable over first operand after indexing or slicing.
 |      
 |      Examples
 |      --------
 |      Set items 0 and 1 to their negative values:
 |      
 |      >>> a = np.array([1, 2, 3, 4])
 |      >>> np.negative.at(a, [0, 1])
 |      >>> print(a)
 |      array([-1, -2, 3, 4])
 |      
 |      ::
 |      
 |      Increment items 0 and 1, and increment item 2 twice:
 |      
 |      >>> a = np.array([1, 2, 3, 4])
 |      >>> np.add.at(a, [0, 1, 2, 2], 1)
 |      >>> print(a)
 |      array([2, 3, 5, 4])
 |      
 |      ::
 |      
 |      Add items 0 and 1 in first array to second array,
 |      and store results in first array:
 |      
 |      >>> a = np.array([1, 2, 3, 4])
 |      >>> b = np.array([1, 2])
 |      >>> np.add.at(a, [0, 1], b)
 |      >>> print(a)
 |      array([2, 4, 3, 4])
 |  
 |  outer(...)
 |      outer(A, B, **kwargs)
 |      
 |      Apply the ufunc `op` to all pairs (a, b) with a in `A` and b in `B`.
 |      
 |      Let ``M = A.ndim``, ``N = B.ndim``. Then the result, `C`, of
 |      ``op.outer(A, B)`` is an array of dimension M + N such that:
 |      
 |      .. math:: C[i_0, ..., i_{M-1}, j_0, ..., j_{N-1}] =
 |         op(A[i_0, ..., i_{M-1}], B[j_0, ..., j_{N-1}])
 |      
 |      For `A` and `B` one-dimensional, this is equivalent to::
 |      
 |        r = empty(len(A),len(B))
 |        for i in range(len(A)):
 |            for j in range(len(B)):
 |                r[i,j] = op(A[i], B[j]) # op = ufunc in question
 |      
 |      Parameters
 |      ----------
 |      A : array_like
 |          First array
 |      B : array_like
 |          Second array
 |      kwargs : any
 |          Arguments to pass on to the ufunc. Typically `dtype` or `out`.
 |      
 |      Returns
 |      -------
 |      r : ndarray
 |          Output array
 |      
 |      See Also
 |      --------
 |      numpy.outer
 |      
 |      Examples
 |      --------
 |      >>> np.multiply.outer([1, 2, 3], [4, 5, 6])
 |      array([[ 4,  5,  6],
 |             [ 8, 10, 12],
 |             [12, 15, 18]])
 |      
 |      A multi-dimensional example:
 |      
 |      >>> A = np.array([[1, 2, 3], [4, 5, 6]])
 |      >>> A.shape
 |      (2, 3)
 |      >>> B = np.array([[1, 2, 3, 4]])
 |      >>> B.shape
 |      (1, 4)
 |      >>> C = np.multiply.outer(A, B)
 |      >>> C.shape; C
 |      (2, 3, 1, 4)
 |      array([[[[ 1,  2,  3,  4]],
 |              [[ 2,  4,  6,  8]],
 |              [[ 3,  6,  9, 12]]],
 |             [[[ 4,  8, 12, 16]],
 |              [[ 5, 10, 15, 20]],
 |              [[ 6, 12, 18, 24]]]])
 |  
 |  reduce(...)
 |      reduce(a, axis=0, dtype=None, out=None, keepdims=False)
 |      
 |      Reduces `a`'s dimension by one, by applying ufunc along one axis.
 |      
 |      Let :math:`a.shape = (N_0, ..., N_i, ..., N_{M-1})`.  Then
 |      :math:`ufunc.reduce(a, axis=i)[k_0, ..,k_{i-1}, k_{i+1}, .., k_{M-1}]` =
 |      the result of iterating `j` over :math:`range(N_i)`, cumulatively applying
 |      ufunc to each :math:`a[k_0, ..,k_{i-1}, j, k_{i+1}, .., k_{M-1}]`.
 |      For a one-dimensional array, reduce produces results equivalent to:
 |      ::
 |      
 |       r = op.identity # op = ufunc
 |       for i in range(len(A)):
 |         r = op(r, A[i])
 |       return r
 |      
 |      For example, add.reduce() is equivalent to sum().
 |      
 |      Parameters
 |      ----------
 |      a : array_like
 |          The array to act on.
 |      axis : None or int or tuple of ints, optional
 |          Axis or axes along which a reduction is performed.
 |          The default (`axis` = 0) is perform a reduction over the first
 |          dimension of the input array. `axis` may be negative, in
 |          which case it counts from the last to the first axis.
 |      
 |          .. versionadded:: 1.7.0
 |      
 |          If this is `None`, a reduction is performed over all the axes.
 |          If this is a tuple of ints, a reduction is performed on multiple
 |          axes, instead of a single axis or all the axes as before.
 |      
 |          For operations which are either not commutative or not associative,
 |          doing a reduction over multiple axes is not well-defined. The
 |          ufuncs do not currently raise an exception in this case, but will
 |          likely do so in the future.
 |      dtype : data-type code, optional
 |          The type used to represent the intermediate results. Defaults
 |          to the data-type of the output array if this is provided, or
 |          the data-type of the input array if no output array is provided.
 |      out : ndarray, None, or tuple of ndarray and None, optional
 |          A location into which the result is stored. If not provided or `None`,
 |          a freshly-allocated array is returned. For consistency with
 |          :ref:`ufunc.__call__`, if given as a keyword, this may be wrapped in a
 |          1-element tuple.
 |      
 |          .. versionchanged:: 1.13.0
 |             Tuples are allowed for keyword argument.
 |      keepdims : bool, optional
 |          If this is set to True, the axes which are reduced are left
 |          in the result as dimensions with size one. With this option,
 |          the result will broadcast correctly against the original `arr`.
 |      
 |          .. versionadded:: 1.7.0
 |      
 |      Returns
 |      -------
 |      r : ndarray
 |          The reduced array. If `out` was supplied, `r` is a reference to it.
 |      
 |      Examples
 |      --------
 |      >>> np.multiply.reduce([2,3,5])
 |      30
 |      
 |      A multi-dimensional array example:
 |      
 |      >>> X = np.arange(8).reshape((2,2,2))
 |      >>> X
 |      array([[[0, 1],
 |              [2, 3]],
 |             [[4, 5],
 |              [6, 7]]])
 |      >>> np.add.reduce(X, 0)
 |      array([[ 4,  6],
 |             [ 8, 10]])
 |      >>> np.add.reduce(X) # confirm: default axis value is 0
 |      array([[ 4,  6],
 |             [ 8, 10]])
 |      >>> np.add.reduce(X, 1)
 |      array([[ 2,  4],
 |             [10, 12]])
 |      >>> np.add.reduce(X, 2)
 |      array([[ 1,  5],
 |             [ 9, 13]])
 |  
 |  reduceat(...)
 |      reduceat(a, indices, axis=0, dtype=None, out=None)
 |      
 |      Performs a (local) reduce with specified slices over a single axis.
 |      
 |      For i in ``range(len(indices))``, `reduceat` computes
 |      ``ufunc.reduce(a[indices[i]:indices[i+1]])``, which becomes the i-th
 |      generalized "row" parallel to `axis` in the final result (i.e., in a
 |      2-D array, for example, if `axis = 0`, it becomes the i-th row, but if
 |      `axis = 1`, it becomes the i-th column).  There are three exceptions to this:
 |      
 |      * when ``i = len(indices) - 1`` (so for the last index),
 |        ``indices[i+1] = a.shape[axis]``.
 |      * if ``indices[i] >= indices[i + 1]``, the i-th generalized "row" is
 |        simply ``a[indices[i]]``.
 |      * if ``indices[i] >= len(a)`` or ``indices[i] < 0``, an error is raised.
 |      
 |      The shape of the output depends on the size of `indices`, and may be
 |      larger than `a` (this happens if ``len(indices) > a.shape[axis]``).
 |      
 |      Parameters
 |      ----------
 |      a : array_like
 |          The array to act on.
 |      indices : array_like
 |          Paired indices, comma separated (not colon), specifying slices to
 |          reduce.
 |      axis : int, optional
 |          The axis along which to apply the reduceat.
 |      dtype : data-type code, optional
 |          The type used to represent the intermediate results. Defaults
 |          to the data type of the output array if this is provided, or
 |          the data type of the input array if no output array is provided.
 |      out : ndarray, None, or tuple of ndarray and None, optional
 |          A location into which the result is stored. If not provided or `None`,
 |          a freshly-allocated array is returned. For consistency with
 |          :ref:`ufunc.__call__`, if given as a keyword, this may be wrapped in a
 |          1-element tuple.
 |      
 |          .. versionchanged:: 1.13.0
 |             Tuples are allowed for keyword argument.
 |      
 |      Returns
 |      -------
 |      r : ndarray
 |          The reduced values. If `out` was supplied, `r` is a reference to
 |          `out`.
 |      
 |      Notes
 |      -----
 |      A descriptive example:
 |      
 |      If `a` is 1-D, the function `ufunc.accumulate(a)` is the same as
 |      ``ufunc.reduceat(a, indices)[::2]`` where `indices` is
 |      ``range(len(array) - 1)`` with a zero placed
 |      in every other element:
 |      ``indices = zeros(2 * len(a) - 1)``, ``indices[1::2] = range(1, len(a))``.
 |      
 |      Don't be fooled by this attribute's name: `reduceat(a)` is not
 |      necessarily smaller than `a`.
 |      
 |      Examples
 |      --------
 |      To take the running sum of four successive values:
 |      
 |      >>> np.add.reduceat(np.arange(8),[0,4, 1,5, 2,6, 3,7])[::2]
 |      array([ 6, 10, 14, 18])
 |      
 |      A 2-D example:
 |      
 |      >>> x = np.linspace(0, 15, 16).reshape(4,4)
 |      >>> x
 |      array([[  0.,   1.,   2.,   3.],
 |             [  4.,   5.,   6.,   7.],
 |             [  8.,   9.,  10.,  11.],
 |             [ 12.,  13.,  14.,  15.]])
 |      
 |      ::
 |      
 |       # reduce such that the result has the following five rows:
 |       # [row1 + row2 + row3]
 |       # [row4]
 |       # [row2]
 |       # [row3]
 |       # [row1 + row2 + row3 + row4]
 |      
 |      >>> np.add.reduceat(x, [0, 3, 1, 2, 0])
 |      array([[ 12.,  15.,  18.,  21.],
 |             [ 12.,  13.,  14.,  15.],
 |             [  4.,   5.,   6.,   7.],
 |             [  8.,   9.,  10.,  11.],
 |             [ 24.,  28.,  32.,  36.]])
 |      
 |      ::
 |      
 |       # reduce such that result has the following two columns:
 |       # [col1 * col2 * col3, col4]
 |      
 |      >>> np.multiply.reduceat(x, [0, 3], 1)
 |      array([[    0.,     3.],
 |             [  120.,     7.],
 |             [  720.,    11.],
 |             [ 2184.,    15.]])
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  identity
 |      The identity value.
 |      
 |      Data attribute containing the identity element for the ufunc, if it has one.
 |      If it does not, the attribute value is None.
 |      
 |      Examples
 |      --------
 |      >>> np.add.identity
 |      0
 |      >>> np.multiply.identity
 |      1
 |      >>> np.power.identity
 |      1
 |      >>> print(np.exp.identity)
 |      None
 |  
 |  nargs
 |      The number of arguments.
 |      
 |      Data attribute containing the number of arguments the ufunc takes, including
 |      optional ones.
 |      
 |      Notes
 |      -----
 |      Typically this value will be one more than what you might expect because all
 |      ufuncs take  the optional "out" argument.
 |      
 |      Examples
 |      --------
 |      >>> np.add.nargs
 |      3
 |      >>> np.multiply.nargs
 |      3
 |      >>> np.power.nargs
 |      3
 |      >>> np.exp.nargs
 |      2
 |  
 |  nin
 |      The number of inputs.
 |      
 |      Data attribute containing the number of arguments the ufunc treats as input.
 |      
 |      Examples
 |      --------
 |      >>> np.add.nin
 |      2
 |      >>> np.multiply.nin
 |      2
 |      >>> np.power.nin
 |      2
 |      >>> np.exp.nin
 |      1
 |  
 |  nout
 |      The number of outputs.
 |      
 |      Data attribute containing the number of arguments the ufunc treats as output.
 |      
 |      Notes
 |      -----
 |      Since all ufuncs can take output arguments, this will always be (at least) 1.
 |      
 |      Examples
 |      --------
 |      >>> np.add.nout
 |      1
 |      >>> np.multiply.nout
 |      1
 |      >>> np.power.nout
 |      1
 |      >>> np.exp.nout
 |      1
 |  
 |  ntypes
 |      The number of types.
 |      
 |      The number of numerical NumPy types - of which there are 18 total - on which
 |      the ufunc can operate.
 |      
 |      See Also
 |      --------
 |      numpy.ufunc.types
 |      
 |      Examples
 |      --------
 |      >>> np.add.ntypes
 |      18
 |      >>> np.multiply.ntypes
 |      18
 |      >>> np.power.ntypes
 |      17
 |      >>> np.exp.ntypes
 |      7
 |      >>> np.remainder.ntypes
 |      14
 |  
 |  signature
 |  
 |  types
 |      Returns a list with types grouped input->output.
 |      
 |      Data attribute listing the data-type "Domain-Range" groupings the ufunc can
 |      deliver. The data-types are given using the character codes.
 |      
 |      See Also
 |      --------
 |      numpy.ufunc.ntypes
 |      
 |      Examples
 |      --------
 |      >>> np.add.types
 |      ['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l',
 |      'LL->L', 'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D',
 |      'GG->G', 'OO->O']
 |      
 |      >>> np.multiply.types
 |      ['??->?', 'bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l',
 |      'LL->L', 'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D',
 |      'GG->G', 'OO->O']
 |      
 |      >>> np.power.types
 |      ['bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l', 'LL->L',
 |      'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'FF->F', 'DD->D', 'GG->G',
 |      'OO->O']
 |      
 |      >>> np.exp.types
 |      ['f->f', 'd->d', 'g->g', 'F->F', 'D->D', 'G->G', 'O->O']
 |      
 |      >>> np.remainder.types
 |      ['bb->b', 'BB->B', 'hh->h', 'HH->H', 'ii->i', 'II->I', 'll->l', 'LL->L',
 |      'qq->q', 'QQ->Q', 'ff->f', 'dd->d', 'gg->g', 'OO->O']

Look at the online help for numpy.tri. Does it modify a matrix argument in-place or return a matrix? What is in the matrix that it returns?


In [66]:
help(numpy.tri)


Help on function tri in module numpy.lib.twodim_base:

tri(N, M=None, k=0, dtype=<class 'float'>)
    An array with ones at and below the given diagonal and zeros elsewhere.
    
    Parameters
    ----------
    N : int
        Number of rows in the array.
    M : int, optional
        Number of columns in the array.
        By default, `M` is taken equal to `N`.
    k : int, optional
        The sub-diagonal at and below which the array is filled.
        `k` = 0 is the main diagonal, while `k` < 0 is below it,
        and `k` > 0 is above.  The default is 0.
    dtype : dtype, optional
        Data type of the returned array.  The default is float.
    
    Returns
    -------
    tri : ndarray of shape (N, M)
        Array with its lower triangle filled with ones and zero elsewhere;
        in other words ``T[i,j] == 1`` for ``i <= j + k``, 0 otherwise.
    
    Examples
    --------
    >>> np.tri(3, 5, 2, dtype=int)
    array([[1, 1, 1, 0, 0],
           [1, 1, 1, 1, 0],
           [1, 1, 1, 1, 1]])
    
    >>> np.tri(3, 5, -1)
    array([[ 0.,  0.,  0.,  0.,  0.],
           [ 1.,  0.,  0.,  0.,  0.],
           [ 1.,  1.,  0.,  0.,  0.]])

Set the upper-triangle of the matrix to zero, using numpy.multiply and numpy.tri:


In [67]:
gene_matrix_for_network_cor = numpy.multiply(gene_matrix_for_network_cor, numpy.tri(*gene_matrix_for_network_cor.shape))
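
Aside (not required for the assignment): numpy.tril does the same masking in one call, returning a copy with everything above the k-th diagonal zeroed; with k=-1 the diagonal itself is zeroed too, which would make the earlier fill_diagonal step unnecessary. A minimal sketch, left commented out since the masking above has already been done:

# equivalent one-step masking; k=-1 zeroes the diagonal as well as the upper triangle
# gene_matrix_for_network_cor = numpy.tril(gene_matrix_for_network_cor, k=-1)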

Using numpy.where, get a tuple of two numpy arrays containing the row/col indices of the entries of the matrix for which R >= 0.75. Use array indexing to obtain the R values for these matrix entries, as a numpy array cor_coeff_values_above_thresh.


In [68]:
inds_correl_above_thresh = numpy.where(gene_matrix_for_network_cor >= 0.75)
cor_coeff_values_above_thresh = gene_matrix_for_network_cor[inds_correl_above_thresh]

Refer to Eq. (13.5) in the assigned reading for today's class (p. 9 of the PDF). Fisher-transform the correlation coefficient values that exceeded 0.75 (cor_coeff_values_above_thresh) to obtain a vector z_scores of z scores. Each of these z scores will correspond to an edge in the network, unless the absolute z score is too small for us to reject the null hypothesis that the corresponding two genes' expression values are independent (we will perform that check in the next step).


In [69]:
z_scores = 0.5*numpy.log((1 + cor_coeff_values_above_thresh)/
                         (1 - cor_coeff_values_above_thresh))
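
Incidentally, the Fisher transform 0.5*log((1+r)/(1-r)) is the inverse hyperbolic tangent, so numpy.arctanh gives the same values; an optional consistency check:

# the Fisher transformation is just arctanh, so the two vectors should match
z_scores_alt = numpy.arctanh(cor_coeff_values_above_thresh)
numpy.allclose(z_scores, z_scores_alt)   # expect True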

Delete the correlation matrix object in order to save memory (we won't need it from here on out).


In [70]:
del gene_matrix_for_network_cor

Under the null hypothesis that two genes are independent, sqrt(M-3)·z for the pair of genes is a sample from the standard normal distribution (zero mean, unit variance), where M is the number of samples used to compute the Pearson correlation coefficient (here, M = 414). For each entry in z_scores, compute a P value as the area under the two tails of the standard normal density N(x), where the two tails are x < -sqrt(M-3)|z| and x > sqrt(M-3)|z|. (You'll know you are doing it right if z = 0 gives a P value of 1.) You will want to use the functions numpy.abs and scipy.stats.norm.cdf, as well as math.sqrt (to compute the square root).


In [71]:
M = gene_matrix_for_network.shape[1]
P_values = 2*scipy.stats.norm.cdf(-numpy.abs(z_scores)*math.sqrt(M-3))
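
A quick sanity check on the formula, per the note above that z = 0 should give P = 1:

# at z = 0 the two tails together cover the entire distribution, so the P value is 1
2*scipy.stats.norm.cdf(-numpy.abs(0.0)*math.sqrt(M - 3))   # expect 1.0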

Adjust the P values for multiple hypothesis testing, using the statsmodels.sandbox.stats.multicomp.multipletests function with method="fdr_bh".


In [72]:
P_values_adj = statsmodels.sandbox.stats.multicomp.multipletests(P_values, method="fdr_bh")[1]
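
One caveat: the statsmodels.sandbox namespace has been reorganized over time, and in recent statsmodels releases the same function lives at statsmodels.stats.multitest.multipletests, with the same call signature. Assuming a recent statsmodels, the commented-out equivalent would be:

# equivalent Benjamini-Hochberg adjustment in newer statsmodels:
# import statsmodels.stats.multitest
# P_values_adj = statsmodels.stats.multitest.multipletests(P_values, method="fdr_bh")[1]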

Verify that we don't need to drop any entries because of an adjusted P value that is not small enough (use numpy.where and len); the count should be zero, since we have M = 414 samples per gene.


In [73]:
len(numpy.where(P_values_adj >= 0.01)[0])


Out[73]:
0

Read the online help for the function zip. What does it do?


In [74]:
help(zip)


Help on class zip in module builtins:

class zip(object)
 |  zip(iter1 [,iter2 [...]]) --> zip object
 |  
 |  Return a zip object whose .__next__() method returns a tuple where
 |  the i-th element comes from the i-th iterable argument.  The .__next__()
 |  method continues until the shortest iterable in the argument sequence
 |  is exhausted and then it raises StopIteration.
 |  
 |  Methods defined here:
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  __next__(self, /)
 |      Implement next(self).
 |  
 |  __reduce__(...)
 |      Return state information for pickling.

We want to pass our tuple of numpy arrays containing row and column indices to Graph.TupleList; however, Graph.TupleList accepts a tuple list, not a tuple of numpy arrays. So we need to make a tuple list, using zip:


In [75]:
row_col_inds_tuple_list = zip(inds_correl_above_thresh[0], inds_correl_above_thresh[1])

## [note this can be done more elegantly using the unary "*" operator:
##    row_col_inds_tuple_list = zip(*inds_correl_above_thresh)  
##  see how we only need to type the variable name once, if we use the unary "*" ]

Make an undirected graph from the row/column indices of the (upper-triangle) gene pairs whose correlations were above our threshold, using igraph.Graph.TupleList. Print a summary of the network, as a sanity check, using the igraph.Graph.summary method.


In [76]:
final_network = igraph.Graph.TupleList(row_col_inds_tuple_list)
final_network.summary()


Out[76]:
'IGRAPH UN-- 1394 9916 -- \n+ attr: name (v)'

Plot the degree distribution on log-log scale; does it appear to be scale-free?


In [77]:
degree_dist = final_network.degree_distribution()
xs, ys = zip(*[(left, count) for left, _, count in degree_dist.bins()])
matplotlib.pyplot.scatter(xs, ys, marker="o")
ax = matplotlib.pyplot.gca()
ax.set_yscale("log")
ax.set_xscale("log")
matplotlib.pyplot.ylim((0.5,1000))
pylab.xlabel("k")
pylab.ylabel("N(k)")
pylab.show()


Use the igraph.statistics.power_law_fit function to estimate the scaling exponent alpha of the degree distribution:


In [78]:
igraph.statistics.power_law_fit(final_network.degree()).alpha


Out[78]:
2.7453577551868795
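
Optional: for more detail on the fit, and some evidence bearing on question 1, you can inspect the whole fit object. Assuming the igraph version used here, it also exposes the lower degree cutoff used for the fitting (the attribute names alpha and xmin are assumed below), and printing it gives a short summary:

fit = igraph.statistics.power_law_fit(final_network.degree())
print(fit)                      # summary of the fitted power law
print(fit.alpha, fit.xmin)      # scaling exponent and lower degree cutoff (attribute names assumed)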

Extra challenge:

If you got this far, see if you can scatter plot the relationship between R (as the independent variable) and -log10(P) value (as the dependent variable). When the effect size variable (e.g., R) can range from negative to positive, this plot is sometimes called a "volcano plot".


In [85]:
inds_use = numpy.where(P_values_adj > 0)
matplotlib.pyplot.scatter(cor_coeff_values_above_thresh[inds_use], -numpy.log10(P_values_adj[inds_use]))
pylab.xlabel("R")
pylab.ylabel("-log10(P)")
pylab.show()


Extra-extra challenge:

For each of the gene pairs for which R>0.75, see if you can compute the t-test P value for each correlation coefficient (don't bother adjusting for false discovery rate control). Compare to the (un-adjusted) P values that you got using the Fisher transformation, using a scatter plot. How do they compare? Which test has better statistical power, for this case where M = 414? (If you are wondering, general advice is to use Fisher if M>=10; for very small numbers of samples, use the Student t test).


In [114]:
ts = numpy.divide(cor_coeff_values_above_thresh * math.sqrt(M - 2), numpy.sqrt(1 - cor_coeff_values_above_thresh**2))
P_values_studentT = 2*scipy.stats.t.cdf(-ts, M-2)
inds_use = numpy.where(numpy.logical_and(P_values > 0, P_values_studentT > 0))
matplotlib.pyplot.scatter(-numpy.log10(P_values[inds_use]),
                          -numpy.log10(P_values_studentT[inds_use]))
pylab.xlabel("Fisher transform")
pylab.ylabel("Student t")
pylab.show()
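
One more optional cross-check: scipy.stats.pearsonr returns the correlation coefficient together with a two-sided P value computed under the same null model as the t test, so spot-checking a single gene pair against P_values_studentT is straightforward (for strongly correlated pairs both P values may underflow to 0.0). A minimal sketch:

# spot-check the first above-threshold gene pair against scipy's built-in test
i, j = inds_correl_above_thresh[0][0], inds_correl_above_thresh[1][0]
r_check, p_check = scipy.stats.pearsonr(gene_matrix_for_network[i, :], gene_matrix_for_network[j, :])
print(r_check, cor_coeff_values_above_thresh[0])   # should agree
print(p_check, P_values_studentT[0])               # should agree closely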