| | Cell type A | Cell type B | Cell type C | Cell type D |
|---|---|---|---|---|
| Gene 0 | 100 | 200 | 50 | 400 |
| Gene 1 | 50 | 0 | 0 | 100 |
| Gene 2 | 350 | 100 | 50 | 200 |
In [1]:
gene0 = [100, 200, 50, 400]
gene1 = [50, 0, 0, 100]
gene2 = [350, 100, 50, 200]
expression_data = [gene0, gene1, gene2]
Why is storing the data in plain Python lists a bad idea?
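To see one reason (a small sketch, not from the original notebook): Python lists don't support elementwise arithmetic, so even simple operations on the expression values go wrong or require explicit loops.
In [ ]:
gene0 * 2               # repeats the list: [100, 200, 50, 400, 100, 200, 50, 400]
# gene0 + gene1 concatenates rather than adds, and gene0 / 2 raises a TypeError
[g / 2 for g in gene0]  # elementwise math needs a loop or comprehension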
In [2]:
import numpy as np
a = np.array(expression_data)
print(a)
We are going to compute RPKM (Reads Per Kilobase per Million reads), a normalized measure of gene expression, using the power of NumPy.
In [3]:
def print_info(a):
    print('number of elements:', a.size)
    print('number of dimensions:', a.ndim)
    print('shape:', a.shape)
    print('data type:', a.dtype)
    print('strides:', a.strides)
    print('flags:')
    print(a.flags)

print_info(a)
In [4]:
print(a.data)
In [5]:
print(a.flatten())
print(a.ravel())
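A quick sketch of the difference (the array `c` here is illustrative, not from the original): ravel() returns a view whenever it can, so writing through it changes the original array, while flatten() always copies.
In [ ]:
c = np.arange(6).reshape((2, 3))
c.ravel()[0] = 99     # writes through the view
print(c[0, 0])        # 99: the original changed
c.flatten()[0] = -1   # writes into a copy
print(c[0, 0])        # still 99: the original is untouched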
In [6]:
bbytes = a.flatten().view(dtype=np.uint8)
In [7]:
# performance issues with C (row-major) vs Fortran (column-major) memory layout
big_3d_image = np.random.rand(250, 250, 250)
print_info(big_3d_image)
In [8]:
%%timeit big_3d_image = np.random.rand(250, 250, 250)
#big_3d_image *= 5
#for plane in big_3d_image:
#    plane *= 5
# slicing the last axis would give contiguous planes for a Fortran-ordered
# array; for this C-ordered array each slice is strided, which is slower
for i in range(big_3d_image.shape[2]):
    big_3d_image[:, :, i] *= 5
#brighter_3d_image = big_3d_image * 5.1
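A direct way to compare the two layouts (a sketch with assumed names, not from the original notebook): slicing along the contiguous axis is fast, slicing across it is slow.
In [ ]:
c_img = np.random.rand(250, 250, 250)  # C order: the last axis is contiguous
f_img = np.asfortranarray(c_img)       # Fortran order: the first axis is contiguous
%timeit c_img[0, :, :] * 5             # contiguous plane in C order
%timeit c_img[:, :, 0] * 5             # strided plane in C order: slower
%timeit f_img[:, :, 0] * 5             # contiguous plane in Fortran order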
In [9]:
bbytes
Out[9]:
In [10]:
# a.ravel() gives a flattened version of the array; unlike flatten(),
# it returns a view (no copy) whenever it can
abytes = a.ravel().view(dtype=np.uint8)
In [11]:
print_info(abytes)
In [12]:
print(abytes[:24])
In [13]:
print_info(a)
In [14]:
print_info(a.T)
In [16]:
print_info(a.T[::2])
In [17]:
print_info(a.T[::2, ::2])
In [18]:
b = a  # no copy: b is just another name for the same array
In [19]:
print(b)
In [20]:
a[0, 0] = 5
print(b)
a[0, 0] = 100
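By contrast (a brief sketch, not in the original), a.copy() gives an independent array that is unaffected by later writes to a:
In [ ]:
c = a.copy()
a[0, 0] = 5
print(c[0, 0])   # unchanged: c owns its own copy of the data
a[0, 0] = 100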
In [21]:
expr = np.load('expr.npy')
In [22]:
print_info(expr)
This contains the raw read count data: 20,500 genes measured across 375 patients. However, each sample gets a different number of reads, so we want to normalize by the library size, i.e. the total number of reads in a column (the total number of reads for one person).
The np.sum function returns the sum of all the elements of an array. With the axis argument, you can take the sum along a given axis.
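For a concrete picture of the axis argument (a small sketch, not from the original notebook):
In [ ]:
t = np.array([[1, 2], [3, 4]])
print(np.sum(t))           # 10: all elements
print(np.sum(t, axis=0))   # [4 6]: sums down the columns
print(np.sum(t, axis=1))   # [3 7]: sums along the rows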
In [23]:
lib_size = np.sum(expr, axis=0)
In [24]:
x = np.random.rand(10, 3)
def find_closest_value_per_column(x, value=0.75):
    xacc = abs(x - value)
    # argmin along axis 0 gives, for each column, the row index of the
    # entry closest to `value`
    return np.argmin(xacc, axis=0)
find_closest_value_per_column(x)
Out[24]:
In [25]:
np.argmax(x)
Out[25]:
In [26]:
%pylab inline
In [27]:
plot?
In [28]:
x = np.random.rand(100, 100)
x[20:30, 20:30] = 3
x[25, 25] = 3.5
imshow(x, cmap='gray')
max_idx = np.argmax(x)                    # index into the flattened array
xm, ym = unravel_index(max_idx, x.shape)  # convert back to 2D coordinates
plot(ym, xm, 'ro')
In order to normalize every column by its corresponding library size, we have to align the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use np.newaxis to match the dimensions.
In [29]:
print(expr.shape)
print(lib_size.shape)
print(lib_size[np.newaxis, :].shape)
However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:
In [30]:
np.all(expr / lib_size ==
       expr / lib_size[np.newaxis, :])
Out[30]:
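The prepending rule can be seen on small arrays too (a sketch; the names here are not in the original):
In [ ]:
p = np.ones((2, 3))
q = np.ones(3)
print((p + q).shape)   # (2, 3): q's shape (3,) is treated as (1, 3)
r = np.ones(2)
try:
    p + r              # (2,) becomes (1, 2), which cannot align with (2, 3)
except ValueError as e:
    print('shapes do not broadcast:', e)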
In [31]:
expr_lib = expr / lib_size
We also multiply by $10^6$ in order to keep the numbers on a readable scale (reads per million reads).
In [32]:
expr_lib *= 1e6
In [33]:
imshow(expr_lib)
Out[33]:
Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).
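Putting the pieces together (a summary of the steps above, with gene lengths assumed to be stored in base pairs, as the $10^3$ factor in the next cell suggests): with $C_{ij}$ the read count for gene $i$ in sample $j$, $N_j$ the library size, and $L_i$ the gene length,
$$\text{RPKM}_{ij} = \frac{C_{ij}}{\frac{N_j}{10^6} \cdot \frac{L_i}{10^3}} = 10^9 \, \frac{C_{ij}}{N_j L_i}.$$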
In [34]:
gene_len = np.load('gene-lens.npy')
print(gene_len.shape)
In [35]:
gene_len.shape, lib_size.shape
Out[35]:
In [36]:
# rpkm[i, j] = 10**9 * expr[i, j] / (lib_size[j] * gene_len[i])
# option 1: use np.newaxis
%timeit 10**3 * expr_lib / gene_len[:, np.newaxis]
In [114]:
# probably better
rpkm = 10**3 * expr_lib / gene_len[:, np.newaxis]
In [38]:
# option 2: transpose expr_lib until the shapes line up, then transpose back
rpkm2 = (10**3 * expr_lib.T / gene_len).T
In [39]:
rpkm - rpkm2
Out[39]:
In [40]:
%timeit (10**3 * expr_lib.T / gene_len).T
In [41]:
from matplotlib import pyplot as plt
from scipy import stats

def plot_col_density(data, xlim=None, *args, **kwargs):
    # use Gaussian kernel density estimation to estimate each column's density
    density_per_col = [stats.gaussian_kde(col) for col in data.T]
    if xlim is not None:
        m, M = xlim
    else:
        m, M = np.min(data), np.max(data)
    x = np.linspace(m, M, 100)
    plt.figure()
    for density in density_per_col:
        plt.plot(x, density(x), *args, **kwargs)
    plt.xlabel('log-counts')
    plt.ylabel('frequency')
    if xlim is not None:
        plt.xlim(xlim)
    plt.show()
In [24]:
%matplotlib inline
In [123]:
plt.style.use('ggplot')
In [124]:
plot_col_density(np.log(expr+1))
In [126]:
plot_col_density(np.log(rpkm + 1), xlim=(0, 20))
In [42]:
x = np.random.rand(3, 5)
y = np.random.randint(10, size=8)
z = x[..., np.newaxis] + y  # shapes (3, 5, 1) + (8,) broadcast to (3, 5, 8)
In [43]:
print(x.shape)
print(y.shape)
In [44]:
# To use broadcasting, aligned dimensions must either match or be 1.
# This is the explicit way of aligning the two arrays:
x[np.newaxis, :, :] + y[:, np.newaxis, np.newaxis]  # (1, 3, 5) + (8, 1, 1) -> (8, 3, 5)
Out[44]:
In [45]:
def repeat(arr, n):
    # a zero stride along the new first axis makes every row point at the
    # same memory, so `arr` is "repeated" n times without copying
    return np.lib.stride_tricks.as_strided(arr,
                                           shape=(n,) + arr.shape,
                                           strides=(0,) + arr.strides)
In [46]:
a = repeat(np.random.rand(5), 4)
a
Out[46]:
In [47]:
print_info(a)
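Because all n rows share the same memory (note the zero stride above), writing through this array has surprising effects. A small sketch, not from the original:
In [ ]:
r = repeat(np.arange(3.0), 2)
r[0, 1] = 42   # writes to the single shared buffer
print(r)       # the change shows up in every row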
In [49]:
def sliding_window(arr, size=2):
    """Produce an array of sliding window views of `arr`.

    Parameters
    ----------
    arr : 1D array, shape (N,)
        The input array.
    size : int, optional
        The size of the sliding window.

    Returns
    -------
    arr_slide : 2D array, shape (N - size + 1, size)
        The sliding windows of size `size` of `arr`.

    Examples
    --------
    >>> a = np.array([0, 1, 2, 3])
    >>> sliding_window(a, 2)
    array([[0, 1],
           [1, 2],
           [2, 3]])
    """
    N = len(arr)
    desired_shape = (N - size + 1, size)
    s = arr.strides[0]  # using the stride keeps this independent of the dtype
    return np.lib.stride_tricks.as_strided(arr,
                                           shape=desired_shape,
                                           strides=(s, s))
In [152]:
# test your code here
%timeit sliding_window(np.random.rand(100000), 3).sum(axis=1) / 3.
You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.
In [52]:
values = np.array([0, 5, 99])
selector = np.random.randint(0, 3, size=(3, 4))
print(selector)
print(values[selector])
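The cell above uses an integer selector; a boolean mask works similarly (a sketch, with names not in the original):
In [ ]:
m = np.arange(12).reshape((3, 4))
mask = m % 2 == 0   # boolean array with the same shape as m
print(m[mask])      # the even entries, returned as a 1D array
m[m > 8] = 0        # boolean indexing also works for assignment
print(m)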
Quantile normalization (https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.
Hint: look at the documentation for scipy.stats.rankdata, np.sort, and np.argsort.
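As a warm-up for the hint (a sketch, not part of the original exercise): applying argsort twice converts values into their ranks.
In [ ]:
v = np.array([30.0, 10.0, 20.0])
order = np.argsort(v)      # [1 2 0]: the positions that would sort v
ranks = np.argsort(order)  # [2 0 1]: the rank of each original entry
print(ranks)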
In [90]:
np.mgrid[0:4,0:3][1]
Out[90]:
In [88]:
repeat(np.arange(0,3),4)
Out[88]:
In [140]:
testlist = np.array([1.1,2.1, 2.1, 2.1, 3.1, 6.1, 4.1, 5.1])
In [141]:
indices = np.argsort(testlist)
unique_items,number_of_occurences = np.unique(testlist, return_counts=True)
In [142]:
indices, unique_items, number_of_occurences
Out[142]:
In [143]:
# exploratory attempt at tie handling: collapse the positions of duplicate
# values so that tied entries share one index
index = 0
for count in number_of_occurences:
    if count > 1:
        indices[index + 1:index + count] -= np.arange(1, count)
        indices[index + count:] -= count - 1
    index += count
indices
Out[143]:
In [138]:
index
Out[138]:
In [110]:
np.unique(testdata)
for r in testdata.T:
    print(np.unique(r, return_counts=True))
In [111]:
np.argsort?
In [104]:
testdata = np.loadtxt('./testdata.dat', delimiter=',')
print(testdata)
# convert the data into an array of per-column rank values
rank_idx = np.argsort(np.argsort(testdata, axis=0), axis=0)
#np.argsort(rank_idx, axis=0)
# TODO: handle ties (e.g. two occurrences of the same value should share a rank)
# sort the input data
sorted_testdata = np.sort(testdata, axis=0)
# determine the value for each rank by taking the mean across each row
ranks = np.mean(sorted_testdata, axis=1)
ranks[rank_idx]
#print(rank_idx)
#testdata[rank_idx, np.mgrid[0:4, 0:3][1]]
Out[104]:
In [100]:
np.sort(testdata, axis=0)
Out[100]:
In [112]:
def qnorm(x):
    """Quantile normalize an input matrix.

    Parameters
    ----------
    x : 2D array of float, shape (M, N)
        The input data, with each column being a
        distribution to normalize.

    Returns
    -------
    xn : 2D array of float, shape (M, N)
        The normalized data.
    """
    # convert into an array of per-column rank values
    rank_idx = np.argsort(np.argsort(x, axis=0), axis=0)
    # sort the input data
    sorted_x = np.sort(x, axis=0)
    # determine the value for each rank by taking the mean across each row
    ranks = np.mean(sorted_x, axis=1)
    xn = ranks[rank_idx]
    return xn
In [115]:
logexpr = np.log(expr + 1)
logrpkm = np.log(rpkm + 1)
In [116]:
logexprn = qnorm(logexpr)
logrpkmn = qnorm(logrpkm)
In [117]:
plot_col_density(logexprn)
In [118]:
plot_col_density(logrpkmn, xlim=(0, 0.25))
(If time permits.)
Date: Wed, 16 Jul 2008 16:45:37 -0500
From: Jack Cook
To: <numpy-discussion@scipy.org>
Subject: Numpy Advanced Indexing Question
Greetings,
I have an I,J,K 3D volume of amplitude values at regularly sampled time intervals. I have an I,J 2D slice which contains a time (K) value at each I, J location. What I would like to do is extract a subvolume at a constant +/- K window around the slice. Is there an easy way to do this using advanced indexing or some other method? Thanks in advanced for your help.
-- Jack
In [ ]:
# "data"
ni, nj, nk = (10, 15, 20)
amplitude = np.random.rand(ni, nj, nk)
horizon = np.random.randint(5, 15, size=(ni, nj))
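A possible approach (a sketch, not from the original notebook; `width` is an assumed half-window size): broadcast integer index arrays against each other so that fancy indexing extracts a window of K samples around the horizon at every (i, j) location.
In [ ]:
width = 3  # half-window size (assumed for illustration)
ii = np.arange(ni)[:, np.newaxis, np.newaxis]                  # shape (ni, 1, 1)
jj = np.arange(nj)[np.newaxis, :, np.newaxis]                  # shape (1, nj, 1)
kk = horizon[:, :, np.newaxis] + np.arange(-width, width + 1)  # shape (ni, nj, 2*width + 1)
subvolume = amplitude[ii, jj, kk]  # the three index arrays broadcast together
print(subvolume.shape)             # (ni, nj, 2 * width + 1)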
In [ ]:
An author of a foreign package (included with the exercises as problems/mutable_str.py) provides a string class that allocates its own memory:
In [1]: from mutable_str import MutableString
In [2]: s = MutableString('abcde')
In [3]: print(s)
abcde
You'd like to view these mutable strings (strings that can be modified in place) as ndarrays, in order to manipulate the underlying memory.
Add an __array_interface__ dictionary attribute to s, then convert s to an ndarray. Numerically add "2" to the array (use the in-place operator +=). Then print the original string to ensure that its value was modified.
Hint: documentation for NumPy's __array_interface__ may be found in the online docs.
Here's a skeleton outline:
In [ ]:
import numpy as np
from mutable_str import MutableString

s = MutableString('abcde')

# --- EDIT THIS SECTION ---
# Create an array interface to this foreign object
s.__array_interface__ = {'data': (XXX, False),  # (ptr, is read_only?)
                         'shape': XXX,
                         'typestr': '|u1',      # typecode: unsigned byte
                         }
# --- EDIT THIS SECTION ---

print('String before converting to array:', s)
sa = np.asarray(s)
print('String after converting to array:', sa)
sa += 2
print('String after adding "2" to array:', s)