# Intro

Juan Nunez-Iglesias
Victorian Life Sciences Computation Initiative (VLSCI)
University of Melbourne

# Quick example: gene expression, without numpy

Cell type A Cell type B Cell type C Cell type D
Gene 0 100 200 50 400
Gene 1 50 0 0 100
Gene 2 350 100 50 200
``````

In [1]:

gene0 = [100, 200, 50, 400]
gene1 = [50, 0, 0, 100]
gene2 = [350, 100, 50, 200]
expression_data = [gene0, gene1, gene2]

``````

Why is this a bad idea?

# Now with NumPy

``````

In [2]:

import numpy as np
a = np.array(expression_data)
print(a)

``````
``````

[[100 200  50 400]
[ 50   0   0 100]
[350 100  50 200]]

``````

We are going to:

• Obtain an RPKM expression matrix
• Quantile normalize the data

using the awesome power of NumPy

# Inside a numpy ndarray

``````

In [3]:

def print_info(a):
print('number of elements:', a.size)
print('number of dimensions:', a.ndim)
print('shape:', a.shape)
print('data type:', a.dtype)
print('strides:', a.strides)
print('flags:')
print(a.flags)

print_info(a)

``````
``````

number of elements: 12
number of dimensions: 2
shape: (3, 4)
data type: int64
strides: (32, 8)
flags:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````
``````

In [4]:

print(a.data)

``````
``````

<memory at 0x7f7f9fab5c88>

``````
``````

In [5]:

print(a.flatten())
print(a.ravel())

``````
``````

[100 200  50 400  50   0   0 100 350 100  50 200]
[100 200  50 400  50   0   0 100 350 100  50 200]

``````
``````

In [6]:

bbytes = a.flatten().view(dtype=np.uint8)

``````
``````

In [7]:

# performance issues with C_alignment vs Fortran alignment
big_3d_image = np.random.rand(250,250,250)
print_info(big_3d_image)

``````
``````

number of elements: 15625000
number of dimensions: 3
shape: (250, 250, 250)
data type: float64
strides: (500000, 2000, 8)
flags:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````
``````

In [8]:

%%timeit  big_3d_image = np.random.rand(250,250,250)
#big_3d_image *=5

#for plane in big_3d_image:
#    plane *= 5

# fortran contiguous
for i in range(big_3d_image.shape[1]):
big_3d_image[:,:,i] *=5
#brighter_3d_image = big_3d_image*5.1

``````
``````

10 loops, best of 3: 177 ms per loop

``````
``````

In [9]:

bbytes

``````
``````

Out[9]:

array([100,   0,   0,   0,   0,   0,   0,   0, 200,   0,   0,   0,   0,
0,   0,   0,  50,   0,   0,   0,   0,   0,   0,   0, 144,   1,
0,   0,   0,   0,   0,   0,  50,   0,   0,   0,   0,   0,   0,
0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
0,   0,   0,   0, 100,   0,   0,   0,   0,   0,   0,   0,  94,
1,   0,   0,   0,   0,   0,   0, 100,   0,   0,   0,   0,   0,
0,   0,  50,   0,   0,   0,   0,   0,   0,   0, 200,   0,   0,
0,   0,   0,   0,   0], dtype=uint8)

``````
``````

In [10]:

# a.ravel gives a flattened version of the array. Difference with flatten:
abytes = a.ravel().view(dtype=np.uint8)

``````
``````

In [11]:

print_info(abytes)

``````
``````

number of elements: 96
number of dimensions: 1
shape: (96,)
data type: uint8
strides: (1,)
flags:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````
``````

In [12]:

print(abytes[:24])

``````
``````

[100   0   0   0   0   0   0   0 200   0   0   0   0   0   0   0  50   0
0   0   0   0   0   0]

``````

### Example: take the transpose of `a`

``````

In [13]:

print_info(a)

``````
``````

number of elements: 12
number of dimensions: 2
shape: (3, 4)
data type: int64
strides: (32, 8)
flags:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````
``````

In [14]:

print_info(a.T)

``````
``````

number of elements: 12
number of dimensions: 2
shape: (4, 3)
data type: int64
strides: (8, 32)
flags:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````

### Example: skipping rows and columns with slicing

``````

In [15]:

print_info(a.T)

``````
``````

number of elements: 12
number of dimensions: 2
shape: (4, 3)
data type: int64
strides: (8, 32)
flags:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````
``````

In [16]:

print_info(a.T[::2])

``````
``````

number of elements: 6
number of dimensions: 2
shape: (2, 3)
data type: int64
strides: (16, 32)
flags:
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````
``````

In [17]:

print_info(a.T[::2, ::2])

``````
``````

number of elements: 4
number of dimensions: 2
shape: (2, 2)
data type: int64
strides: (16, 64)
flags:
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````

### Getting a copy

``````

In [18]:

b = a

``````
``````

In [19]:

print(b)

``````
``````

[[100 200  50 400]
[ 50   0   0 100]
[350 100  50 200]]

``````
``````

In [20]:

a[0, 0] = 5
print(b)
a[0, 0] = 100

``````
``````

[[  5 200  50 400]
[ 50   0   0 100]
[350 100  50 200]]

``````

``````

In [21]:

``````
``````

In [22]:

print_info(expr)

``````
``````

number of elements: 7687500
number of dimensions: 2
shape: (20500, 375)
data type: uint32
strides: (4, 82000)
flags:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````

This has the raw read count data. 20500 genes are read for 375 patients. However, each sample gets a different number of reads, so we want to normalize by the library size, which is the total number of reads across a column. (The total number of reads of one person)

The `np.sum` function returns the sum of all the elements of an array. With the `axis` argument, you can take the sum along the given axis.

``````

In [23]:

lib_size = np.sum(expr, axis=0)

``````

### Exercise

Generate a 10 x 3 array of random numbers. From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.

``````

In [24]:

x = np.random.rand(10,3)

def find_closest_value_per_column(x, value=0.75):

xacc = abs(x - value)
return np.argmin(xacc, axis=1)

find_closest_value_per_column(x)

``````
``````

Out[24]:

array([0, 1, 2, 2, 2, 2, 2, 0, 0, 0])

``````
``````

In [ ]:

``````
``````

In [25]:

np.argmax(x)

``````
``````

Out[25]:

2

``````
``````

In [26]:

%pylab inline

``````
``````

Populating the interactive namespace from numpy and matplotlib

``````
``````

In [27]:

plot?

``````
``````

In [28]:

x = np.random.rand(100,100)

x[20:30,20:30] = 3
x[25,25] = 3.5

imshow(x, cmap='gray')
xm, ym =unravel_index(max_idx, x.shape)
max_idx = np.argmax(x)
unravel_index(max_idx, x.shape)
plot(ym, xm, 'ro')

``````
``````

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-28-b04c70f5921b> in <module>()
5
6 imshow(x, cmap='gray')
----> 7 xm, ym =unravel_index(max_idx, x.shape)
8 max_idx = np.argmax(x)
9 unravel_index(max_idx, x.shape)

NameError: name 'max_idx' is not defined

``````
``````

In [ ]:

``````

In order to normalize every column by its corresponding library size, we have to align the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use `np.newaxis` to match the dimensions.

``````

In [29]:

print(expr.shape)
print(lib_size.shape)
print(lib_size[np.newaxis, :].shape)

``````
``````

(20500, 375)
(375,)
(1, 375)

``````

However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:

``````

In [30]:

np.all(expr / lib_size ==
expr / lib_size[np.newaxis, :])

``````
``````

Out[30]:

True

``````
``````

In [31]:

expr_lib = expr / lib_size

``````

We also multiply by \$10^6\$ in order to keep the numbers on a readable scale (reads per million reads).

``````

In [32]:

expr_lib *= 1e6

``````
``````

In [33]:

imshow(expr_lib)

``````
``````

Out[33]:

<matplotlib.image.AxesImage at 0x7f7f717c24e0>

``````

Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).

``````

In [34]:

print(gene_len.shape)

``````
``````

(20500,)

``````

### Exercise: broadcast `expr_lib` and `gene_len` together to produce RPKM

``````

In [35]:

gene_len.shape
lib_size.shape

``````
``````

Out[35]:

(375,)

``````
``````

In [36]:

#rpkm[i,j] = 10**11 * expr[i,j] / (lib_size[j]*gene_size[i])
# option 1: Use np.newaxis
%timeit 10**3 * expr_lib / gene_len[:, np.newaxis]

``````
``````

10 loops, best of 3: 48.9 ms per loop

``````
``````

In [114]:

# probably better
rpkm = 10**3 * expr_lib / gene_len[:, np.newaxis]

``````
``````

In [38]:

# have the expr_lib matrix dancing around untill it works
rpkm2 = (10**3* expr_lib.T / gene_len).T

``````
``````

In [39]:

rpkm1 - rpkm2

``````
``````

Out[39]:

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
[ 0.,  0.,  0., ...,  0.,  0.,  0.],
[ 0.,  0.,  0., ...,  0.,  0.,  0.],
...,
[ 0.,  0.,  0., ...,  0.,  0.,  0.],
[ 0.,  0.,  0., ...,  0.,  0.,  0.],
[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

``````
``````

In [40]:

%timeit (10**11* (expr_lib / lib_size ).T / gene_len).T

``````
``````

10 loops, best of 3: 74.9 ms per loop

``````
``````

In [41]:

from matplotlib import pyplot as plt
from scipy import stats

def plot_col_density(data, xlim=None, *args, **kwargs):
# Use gaussian smoothing to estimate the density
density_per_col = [stats.kde.gaussian_kde(col) for col in data.T]
if xlim is not None:
m, M = xlim
else:
m, M = np.min(data), np.max(data)
x = np.linspace(m, M, 100)

plt.figure()
for density in density_per_col:
plt.plot(x, density(x), *args, **kwargs)
plt.xlabel('log-counts')
plt.ylabel('frequency')
if xlim is not None:
plt.xlim(xlim)
plt.show()

``````
``````

In [24]:

%matplotlib inline

``````
``````

In [123]:

plt.style.use('ggplot')

``````
``````

In [124]:

plot_col_density(np.log(expr+1))

``````
``````

``````
``````

In [126]:

plot_col_density(np.log(rpkm + 1), xlim=(0, 20))

``````
``````

``````

Below, produce the array containing the sum of every element in `x` with every element in `y`

``````

In [42]:

x = np.random.rand(3, 5)
y = np.random.randint(10, size=8)
z = x[..., newaxis] + y#[:,newaxis, newaxis]

``````
``````

In [43]:

print(x.shape)
print(y.shape)

``````
``````

(3, 5)
(8,)

``````
``````

In [44]:

# To use broadcasting efficiently, the dimensions need to either match or be one.
# This is the explicit way.
x[newaxis,:,:] + y[:,newaxis, newaxis]

``````
``````

Out[44]:

array([[[ 0.86462109,  0.45549902,  0.66052184,  0.96832494,  0.29765314],
[ 0.18511956,  0.7596875 ,  0.6910831 ,  0.15048665,  0.47884507],
[ 0.22415889,  0.95820765,  0.3423851 ,  0.56196965,  0.38359481]],

[[ 0.86462109,  0.45549902,  0.66052184,  0.96832494,  0.29765314],
[ 0.18511956,  0.7596875 ,  0.6910831 ,  0.15048665,  0.47884507],
[ 0.22415889,  0.95820765,  0.3423851 ,  0.56196965,  0.38359481]],

[[ 8.86462109,  8.45549902,  8.66052184,  8.96832494,  8.29765314],
[ 8.18511956,  8.7596875 ,  8.6910831 ,  8.15048665,  8.47884507],
[ 8.22415889,  8.95820765,  8.3423851 ,  8.56196965,  8.38359481]],

[[ 4.86462109,  4.45549902,  4.66052184,  4.96832494,  4.29765314],
[ 4.18511956,  4.7596875 ,  4.6910831 ,  4.15048665,  4.47884507],
[ 4.22415889,  4.95820765,  4.3423851 ,  4.56196965,  4.38359481]],

[[ 3.86462109,  3.45549902,  3.66052184,  3.96832494,  3.29765314],
[ 3.18511956,  3.7596875 ,  3.6910831 ,  3.15048665,  3.47884507],
[ 3.22415889,  3.95820765,  3.3423851 ,  3.56196965,  3.38359481]],

[[ 9.86462109,  9.45549902,  9.66052184,  9.96832494,  9.29765314],
[ 9.18511956,  9.7596875 ,  9.6910831 ,  9.15048665,  9.47884507],
[ 9.22415889,  9.95820765,  9.3423851 ,  9.56196965,  9.38359481]],

[[ 4.86462109,  4.45549902,  4.66052184,  4.96832494,  4.29765314],
[ 4.18511956,  4.7596875 ,  4.6910831 ,  4.15048665,  4.47884507],
[ 4.22415889,  4.95820765,  4.3423851 ,  4.56196965,  4.38359481]],

[[ 5.86462109,  5.45549902,  5.66052184,  5.96832494,  5.29765314],
[ 5.18511956,  5.7596875 ,  5.6910831 ,  5.15048665,  5.47884507],
[ 5.22415889,  5.95820765,  5.3423851 ,  5.56196965,  5.38359481]]])

``````

### Exercise: explicit broadcasting and stride tricks

Use `np.broadcast_arrays` to get the same-shape arrays that numpy adds together. Then use `print_info` on the output. Notice anything weird?

``````

In [ ]:

``````

## Stride tricks

By manipulating the shape and strides of an array, we can produce a "virtual" array much bigger than its memory usage:

``````

In [45]:

def repeat(arr, n):
return np.lib.stride_tricks.as_strided(arr,
shape=(n,) + arr.shape,
strides=(0,) + arr.strides)

``````
``````

In [46]:

a = repeat(np.random.rand(5), 4)
a

``````
``````

Out[46]:

array([[ 0.54598241,  0.86746443,  0.31584841,  0.81297895,  0.20909736],
[ 0.54598241,  0.86746443,  0.31584841,  0.81297895,  0.20909736],
[ 0.54598241,  0.86746443,  0.31584841,  0.81297895,  0.20909736],
[ 0.54598241,  0.86746443,  0.31584841,  0.81297895,  0.20909736]])

``````
``````

In [47]:

print_info(a)

``````
``````

number of elements: 20
number of dimensions: 2
shape: (4, 5)
data type: float64
strides: (0, 8)
flags:
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

``````

### Exercise: `np.lib.stride_tricks.as_strided`

Use `as_strided` to produce a sliding-window view of a 1D array.

``````

In [49]:

def sliding_window(arr, size=2):
"""Produce an array of sliding window views of `arr`

Parameters
----------
arr : 1D array, shape (N,)
The input array.
size : int, optional
The size of the sliding window.

Returns
-------
arr_slide : 2D array, shape (N - size + 1, size)
The sliding windows of size `size` of `arr`.

Examples
--------
>>> a = np.array([0, 1, 2, 3])
>>> sliding_window(a, 2)
array([[0, 1],
[1, 2],
[2, 3]])
"""
N = len(arr)
desired_shape = (N - size + 1, size)
s = arr.strides[0] # like this it will be independent on the datatype

return np.lib.stride_tricks.as_strided(arr,
shape=desired_shape,
strides=(s, s))
#return arr  # fix this

``````
``````

In [152]:

%timeit sliding_window(np.random.rand(1e5), 3).sum(axis=1)/3.

``````
``````

100 loops, best of 3: 3.05 ms per loop

``````
``````

In [ ]:

``````

# Fancy indexing

You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.

``````

In [52]:

values = np.array([0, 5, 99])
selector = np.random.randint(0, 3, size=(3, 4))
print(selector)
print(values[selector])

``````
``````

[[2 2 1 1]
[2 1 0 2]
[1 1 0 0]]
[[99 99  5  5]
[99  5  0 99]
[ 5  5  0  0]]

``````

### Exercise: quantile normalization

Quantile Normalization(https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.

Hint: look for documentation for `scipy.mstats.rankdata`, `np.sort`, and `np.argsort`.

``````

In [90]:

np.mgrid[0:4,0:3][1]

``````
``````

Out[90]:

array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])

``````
``````

In [88]:

repeat(np.arange(0,3),4)

``````
``````

Out[88]:

array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])

``````
``````

In [140]:

testlist = np.array([1.1,2.1, 2.1, 2.1, 3.1, 6.1, 4.1, 5.1])

``````
``````

In [ ]:

``````
``````

In [141]:

indices = np.argsort(testlist)
unique_items,number_of_occurences = np.unique(testlist, return_counts=True)

``````
``````

In [142]:

indices, unique_items, number_of_occurences

``````
``````

Out[142]:

(array([0, 1, 2, 3, 4, 6, 7, 5]),
array([ 1.1,  2.1,  3.1,  4.1,  5.1,  6.1]),
array([1, 3, 1, 1, 1, 1]))

``````
``````

In [143]:

index = 0
while index < len(testlist):
if number_of_occurences[i] == 1:
pass
else:
indices[index+1:index+number_of_occurences[i]] -= np.arange(number_of_occurences[i]-1)
indices[index+number_of_occurences[i]:] -= number_of_occurences[i]-1
index += 1
indices

``````
``````

Out[143]:

array([0, 1, 2, 3, 4, 6, 7, 5])

``````
``````

In [138]:

index

``````
``````

Out[138]:

6

``````
``````

In [110]:

np.unique(testdata)

for r in testdata.T:
print (np.unique(r, return_counts=True))

``````
``````

(array([ 2.,  3.,  4.,  5.]), array([1, 1, 1, 1]))
(array([ 1.,  2.,  4.]), array([1, 1, 2]))
(array([ 3.,  4.,  6.,  8.]), array([1, 1, 1, 1]))

``````
``````

In [111]:

np.argsort?

``````
``````

In [104]:

print(testdata)
# convert into an array with rank values according to column
rank_idx = np.argsort(np.argsort(testdata,axis=0), axis=0)

#np.argsort(rank_idx, axis=0)

# TODO SOLVE FOR 2 values of 4

# sort the input data
sorted_testdata = np.sort(testdata, axis=0)
# determine the ranks by computing the mean for each row
ranks = np.mean(sorted_testdata, axis=1)
ranks[rank_idx]
#print(rank_idx)
#testdata[rank_idx, np.mgrid[0:4,0:3][1]]

``````
``````

[[ 5.  4.  3.]
[ 2.  1.  4.]
[ 3.  4.  6.]
[ 4.  2.  8.]]

Out[104]:

array([[ 5.66666667,  4.66666667,  2.        ],
[ 2.        ,  2.        ,  3.        ],
[ 3.        ,  5.66666667,  4.66666667],
[ 4.66666667,  3.        ,  5.66666667]])

``````
``````

In [102]:

``````
``````

Out[102]:

array([ 2.        ,  3.        ,  4.66666667,  5.66666667])

``````
``````

In [100]:

np.sort(testdata, axis=0)

``````
``````

Out[100]:

array([[ 2.,  1.,  3.],
[ 3.,  2.,  4.],
[ 4.,  4.,  6.],
[ 5.,  4.,  8.]])

``````
``````

In [112]:

def qnorm(x):
"""Quantile normalize an input matrix.

Parameters
----------
x : 2D array of float, shape (M, N)
The input data, with each column being a
distribution to normalize.

Returns
-------
xn : 2D array of float, shape (M, N)
The normalized data.
"""
# convert into an array with rank values according to column
rank_idx = np.argsort(np.argsort(x,axis=0), axis=0)
# sort the input data
sorted_x = np.sort(x, axis=0)
# determine the ranks by computing the mean for each row
ranks = np.mean(sorted_x, axis=1)
xn = ranks[rank_idx]
return xn

``````
``````

In [115]:

logexpr = np.log(expr + 1)
logrpkm = np.log(rpkm + 1)

``````
``````

In [116]:

logexprn = qnorm(logexpr)
logrpkmn = qnorm(logrpkm)

``````
``````

In [117]:

plot_col_density(logexprn)

``````
``````

``````
``````

In [118]:

plot_col_density(logrpkmn, xlim=(0, 0.25))

``````
``````

``````

(If time permits.)

```Date: Wed, 16 Jul 2008 16:45:37 -0500
From: Jack Cook
To: <numpy-discussion@scipy.org>
```

Greetings,

I have an I,J,K 3D volume of amplitude values at regularly sampled time intervals. I have an I,J 2D slice which contains a time (K) value at each I, J location. What I would like to do is extract a subvolume at a constant +/- K window around the slice. Is there an easy way to do this using advanced indexing or some other method? Thanks in advanced for your help.

-- Jack

``````

In [ ]:

# "data"

ni, nj, nk = (10, 15, 20)
amplitude = np.random.rand(ni, nj, nk)
horizon = np.random.randint(5, 15, size=(ni, nj))

``````
``````

In [ ]:

``````

## Even more advanced: NumPy Array Interface

An author of a foreign package (included with the exercizes as `problems/mutable_str.py`) provides a string class that allocates its own memory:

``````ipython
In [1]: from mutable_str import MutableString
In [2]: s = MutableString('abcde')
In [3]: print s
abcde``````

You'd like to view these mutable (mutable means the ability to modify in place) strings as ndarrays, in order to manipulate the underlying memory.

Add an array_interface dictionary attribute to s, then convert s to an ndarray. Numerically add "2" to the array (use the in-place operator `+=`).

Then print the original string to ensure that its value was modified.

Hint: Documentation for NumPy's `__array_interface__` may be found in the online docs.

Here's a skeleton outline:

``````

In [ ]:

import numpy as np
from mutable_str import MutableString

s = MutableString('abcde')

# --- EDIT THIS SECTION ---

# Create an array interface to this foreign object
s.__array_interface__ = {'data' : (XXX, False), # (ptr, is read_only?)
'shape' : XXX,
'typestr' : '|u1', # typecode unsigned character
}

# --- EDIT THIS SECTION ---

print 'String before converting to array:', s
sa = np.asarray(s)

print 'String after converting to array:', sa

sa += 2
print 'String after adding "2" to array:', s

``````
``````

In [ ]:

``````