| | Cell type A | Cell type B | Cell type C | Cell type D |
|---|---|---|---|---|
| Gene 0 | 100 | 200 | 50 | 400 |
| Gene 1 | 50 | 0 | 0 | 100 |
| Gene 2 | 350 | 100 | 50 | 200 |
In [1]:
gene0 = [100, 200, 50, 400]
gene1 = [50, 0, 0, 100]
gene2 = [350, 100, 50, 200]
expression_data = [gene0, gene1, gene2]
Why is storing the data in plain Python lists a bad idea?
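To see one reason (a small sketch, not from the original notebook): Python lists don't support elementwise arithmetic, so even simple operations on the expression values go wrong or require explicit loops.
In [ ]:
gene0 * 2               # repeats the list: [100, 200, 50, 400, 100, 200, 50, 400]
# gene0 + gene1 concatenates rather than adds, and gene0 / 2 raises a TypeError
[g / 2 for g in gene0]  # elementwise math needs a loop or comprehension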
In [2]:
import numpy as np
a = np.array(expression_data)
print(a)
We are going to compute RPKM (Reads Per Kilobase per Million reads), a normalized measure of gene expression, using the power of NumPy.
In [3]:
def print_info(a):
    print('number of elements:', a.size)
    print('number of dimensions:', a.ndim)
    print('shape:', a.shape)
    print('data type:', a.dtype)
    print('strides:', a.strides)
    print('flags:')
    print(a.flags)

print_info(a)
In [4]:
print(a.data)
In [5]:
print(a.flatten())
print(a.ravel())
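A quick sketch of the difference (the array `c` here is illustrative, not from the original): ravel() returns a view whenever it can, so writing through it changes the original array, while flatten() always copies.
In [ ]:
c = np.arange(6).reshape((2, 3))
c.ravel()[0] = 99     # writes through the view
print(c[0, 0])        # 99: the original changed
c.flatten()[0] = -1   # writes into a copy
print(c[0, 0])        # still 99: the original is untouched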
In [6]:
bbytes = a.flatten().view(dtype=np.uint8)
In [7]:
# performance issues with C (row-major) vs Fortran (column-major) memory layout
big_3d_image = np.random.rand(250, 250, 250)
print_info(big_3d_image)
In [8]:
%%timeit big_3d_image = np.random.rand(250, 250, 250)
#big_3d_image *= 5
#for plane in big_3d_image:
#    plane *= 5
# slicing the last axis would give contiguous planes for a Fortran-ordered
# array; for this C-ordered array each slice is strided, which is slower
for i in range(big_3d_image.shape[2]):
    big_3d_image[:, :, i] *= 5
#brighter_3d_image = big_3d_image * 5.1
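A direct way to compare the two layouts (a sketch with assumed names, not from the original notebook): slicing along the contiguous axis is fast, slicing across it is slow.
In [ ]:
c_img = np.random.rand(250, 250, 250)  # C order: the last axis is contiguous
f_img = np.asfortranarray(c_img)       # Fortran order: the first axis is contiguous
%timeit c_img[0, :, :] * 5             # contiguous plane in C order
%timeit c_img[:, :, 0] * 5             # strided plane in C order: slower
%timeit f_img[:, :, 0] * 5             # contiguous plane in Fortran order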
In [9]:
bbytes
Out[9]:
In [10]:
# a.ravel() gives a flattened version of the array; unlike flatten(),
# it returns a view (no copy) whenever it can
abytes = a.ravel().view(dtype=np.uint8)
In [11]:
print_info(abytes)
In [12]:
print(abytes[:24])
In [13]:
print_info(a)
In [14]:
print_info(a.T)
In [16]:
print_info(a.T[::2])
In [17]:
print_info(a.T[::2, ::2])
In [18]:
b = a  # no copy: b is just another name for the same array
In [19]:
print(b)
In [20]:
a[0, 0] = 5
print(b)
a[0, 0] = 100
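By contrast (a brief sketch, not in the original), a.copy() gives an independent array that is unaffected by later writes to a:
In [ ]:
c = a.copy()
a[0, 0] = 5
print(c[0, 0])   # unchanged: c owns its own copy of the data
a[0, 0] = 100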
In [21]:
expr = np.load('expr.npy')
In [22]:
print_info(expr)
This contains the raw read count data: 20,500 genes measured across 375 patients. However, each sample gets a different number of reads, so we want to normalize by the library size, i.e. the total number of reads in a column (the total number of reads for one person).
The np.sum function returns the sum of all the elements of an array. With the axis argument, you can take the sum along a given axis.
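For a concrete picture of the axis argument (a small sketch, not from the original notebook):
In [ ]:
t = np.array([[1, 2], [3, 4]])
print(np.sum(t))           # 10: all elements
print(np.sum(t, axis=0))   # [4 6]: sums down the columns
print(np.sum(t, axis=1))   # [3 7]: sums along the rows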
In [23]:
lib_size = np.sum(expr, axis=0)
In [24]:
x = np.random.rand(10, 3)
def find_closest_value_per_column(x, value=0.75):
    xacc = abs(x - value)
    # argmin along axis 0 gives, for each column, the row index of the
    # entry closest to `value`
    return np.argmin(xacc, axis=0)
find_closest_value_per_column(x)
Out[24]:
In [25]:
np.argmax(x)
Out[25]:
In [26]:
%pylab inline
In [27]:
plot?
In [28]:
x = np.random.rand(100, 100)
x[20:30, 20:30] = 3
x[25, 25] = 3.5
imshow(x, cmap='gray')
max_idx = np.argmax(x)                    # index into the flattened array
xm, ym = unravel_index(max_idx, x.shape)  # convert back to 2D coordinates
plot(ym, xm, 'ro')
In order to normalize every column by its corresponding library size, we have to align the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use np.newaxis to match the dimensions.
In [29]:
print(expr.shape)
print(lib_size.shape)
print(lib_size[np.newaxis, :].shape)
However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:
In [30]:
np.all(expr / lib_size ==
       expr / lib_size[np.newaxis, :])
Out[30]:
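The prepending rule can be seen on small arrays too (a sketch; the names here are not in the original):
In [ ]:
p = np.ones((2, 3))
q = np.ones(3)
print((p + q).shape)   # (2, 3): q's shape (3,) is treated as (1, 3)
r = np.ones(2)
try:
    p + r              # (2,) becomes (1, 2), which cannot align with (2, 3)
except ValueError as e:
    print('shapes do not broadcast:', e)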
In [31]:
expr_lib = expr / lib_size
We also multiply by $10^6$ in order to keep the numbers on a readable scale (reads per million reads).
In [32]:
expr_lib *= 1e6
In [33]:
imshow(expr_lib)
Out[33]:
Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).
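Putting the pieces together (a summary of the steps above, with gene lengths assumed to be stored in base pairs, as the $10^3$ factor in the next cell suggests): with $C_{ij}$ the read count for gene $i$ in sample $j$, $N_j$ the library size, and $L_i$ the gene length,
$$\text{RPKM}_{ij} = \frac{C_{ij}}{\frac{N_j}{10^6} \cdot \frac{L_i}{10^3}} = 10^9 \, \frac{C_{ij}}{N_j L_i}.$$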
In [34]:
gene_len = np.load('gene-lens.npy')
print(gene_len.shape)
In [35]:
gene_len.shape, lib_size.shape
Out[35]:
In [36]:
# rpkm[i, j] = 10**9 * expr[i, j] / (lib_size[j] * gene_len[i])
# option 1: use np.newaxis
%timeit 10**3 * expr_lib / gene_len[:, np.newaxis]
In [114]:
# probably better
rpkm = 10**3 * expr_lib / gene_len[:, np.newaxis]
In [38]:
# option 2: transpose expr_lib until the shapes line up, then transpose back
rpkm2 = (10**3 * expr_lib.T / gene_len).T
In [39]:
rpkm - rpkm2
Out[39]:
In [40]:
%timeit (10**3 * expr_lib.T / gene_len).T
In [41]:
from matplotlib import pyplot as plt
from scipy import stats

def plot_col_density(data, xlim=None, *args, **kwargs):
    # use Gaussian kernel density estimation to estimate each column's density
    density_per_col = [stats.gaussian_kde(col) for col in data.T]
    if xlim is not None:
        m, M = xlim
    else:
        m, M = np.min(data), np.max(data)
    x = np.linspace(m, M, 100)
    plt.figure()
    for density in density_per_col:
        plt.plot(x, density(x), *args, **kwargs)
    plt.xlabel('log-counts')
    plt.ylabel('frequency')
    if xlim is not None:
        plt.xlim(xlim)
    plt.show()
In [24]:
%matplotlib inline
In [123]:
plt.style.use('ggplot')
In [124]:
plot_col_density(np.log(expr+1))
In [126]:
plot_col_density(np.log(rpkm + 1), xlim=(0, 20))
In [42]:
x = np.random.rand(3, 5)
y = np.random.randint(10, size=8)
z = x[..., np.newaxis] + y  # shapes (3, 5, 1) + (8,) broadcast to (3, 5, 8)
In [43]:
print(x.shape)
print(y.shape)
In [44]:
# To use broadcasting, aligned dimensions must either match or be 1.
# This is the explicit way of aligning the two arrays:
x[np.newaxis, :, :] + y[:, np.newaxis, np.newaxis]  # (1, 3, 5) + (8, 1, 1) -> (8, 3, 5)
Out[44]:
In [45]:
def repeat(arr, n):
    # a zero stride along the new first axis makes every row point at the
    # same memory, so `arr` is "repeated" n times without copying
    return np.lib.stride_tricks.as_strided(arr,
                                           shape=(n,) + arr.shape,
                                           strides=(0,) + arr.strides)
In [46]:
a = repeat(np.random.rand(5), 4)
a
Out[46]:
In [47]:
print_info(a)
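Because all n rows share the same memory (note the zero stride above), writing through this array has surprising effects. A small sketch, not from the original:
In [ ]:
r = repeat(np.arange(3.0), 2)
r[0, 1] = 42   # writes to the single shared buffer
print(r)       # the change shows up in every row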
In [49]:
def sliding_window(arr, size=2):
    """Produce an array of sliding window views of `arr`.

    Parameters
    ----------
    arr : 1D array, shape (N,)
        The input array.
    size : int, optional
        The size of the sliding window.

    Returns
    -------
    arr_slide : 2D array, shape (N - size + 1, size)
        The sliding windows of size `size` of `arr`.

    Examples
    --------
    >>> a = np.array([0, 1, 2, 3])
    >>> sliding_window(a, 2)
    array([[0, 1],
           [1, 2],
           [2, 3]])
    """
    N = len(arr)
    desired_shape = (N - size + 1, size)
    s = arr.strides[0]  # using the stride keeps this independent of the dtype
    return np.lib.stride_tricks.as_strided(arr,
                                           shape=desired_shape,
                                           strides=(s, s))
In [152]:
# test your code here
%timeit sliding_window(np.random.rand(100000), 3).sum(axis=1) / 3.
You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.
In [52]:
values = np.array([0, 5, 99])
selector = np.random.randint(0, 3, size=(3, 4))
print(selector)
print(values[selector])
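The cell above uses an integer selector; a boolean mask works similarly (a sketch, with names not in the original):
In [ ]:
m = np.arange(12).reshape((3, 4))
mask = m % 2 == 0   # boolean array with the same shape as m
print(m[mask])      # the even entries, returned as a 1D array
m[m > 8] = 0        # boolean indexing also works for assignment
print(m)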
Quantile normalization (https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.
Hint: look at the documentation for scipy.stats.rankdata, np.sort, and np.argsort.
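As a warm-up for the hint (a sketch, not part of the original exercise): applying argsort twice converts values into their ranks.
In [ ]:
v = np.array([30.0, 10.0, 20.0])
order = np.argsort(v)      # [1 2 0]: the positions that would sort v
ranks = np.argsort(order)  # [2 0 1]: the rank of each original entry
print(ranks)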
In [90]:
np.mgrid[0:4,0:3][1]
Out[90]:
In [88]:
repeat(np.arange(0,3),4)
Out[88]:
In [140]:
testlist = np.array([1.1,2.1, 2.1, 2.1, 3.1, 6.1, 4.1, 5.1])
In [141]:
indices = np.argsort(testlist)
unique_items,number_of_occurences = np.unique(testlist, return_counts=True)
In [142]:
indices, unique_items, number_of_occurences
Out[142]:
In [143]:
# exploratory attempt at tie handling: collapse the positions of duplicate
# values so that tied entries share one index
index = 0
for count in number_of_occurences:
    if count > 1:
        indices[index + 1:index + count] -= np.arange(1, count)
        indices[index + count:] -= count - 1
    index += count
indices
Out[143]:
In [138]:
index
Out[138]:
In [110]:
np.unique(testdata)
for r in testdata.T:
    print(np.unique(r, return_counts=True))
In [111]:
np.argsort?
In [104]:
testdata = np.loadtxt('./testdata.dat', delimiter=',')
print(testdata)
# convert the data into an array of per-column rank values
rank_idx = np.argsort(np.argsort(testdata, axis=0), axis=0)
#np.argsort(rank_idx, axis=0)
# TODO: handle ties (e.g. two occurrences of the same value should share a rank)
# sort the input data
sorted_testdata = np.sort(testdata, axis=0)
# determine the value for each rank by taking the mean across each row
ranks = np.mean(sorted_testdata, axis=1)
ranks[rank_idx]
#print(rank_idx)
#testdata[rank_idx, np.mgrid[0:4, 0:3][1]]
Out[104]:
In [100]:
np.sort(testdata, axis=0)
Out[100]:
In [112]:
def qnorm(x):
    """Quantile normalize an input matrix.

    Parameters
    ----------
    x : 2D array of float, shape (M, N)
        The input data, with each column being a
        distribution to normalize.

    Returns
    -------
    xn : 2D array of float, shape (M, N)
        The normalized data.
    """
    # convert into an array of per-column rank values
    rank_idx = np.argsort(np.argsort(x, axis=0), axis=0)
    # sort the input data
    sorted_x = np.sort(x, axis=0)
    # determine the value for each rank by taking the mean across each row
    ranks = np.mean(sorted_x, axis=1)
    xn = ranks[rank_idx]
    return xn
In [115]:
logexpr = np.log(expr + 1)
logrpkm = np.log(rpkm + 1)
In [116]:
logexprn = qnorm(logexpr)
logrpkmn = qnorm(logrpkm)
In [117]:
plot_col_density(logexprn)
In [118]:
plot_col_density(logrpkmn, xlim=(0, 0.25))
(If time permits.)
Date: Wed, 16 Jul 2008 16:45:37 -0500
From: Jack Cook
To: <numpy-discussion@scipy.org>
Subject: Numpy Advanced Indexing Question
Greetings,
I have an I,J,K 3D volume of amplitude values at regularly sampled time intervals. I have an I,J 2D slice which contains a time (K) value at each I, J location. What I would like to do is extract a subvolume at a constant +/- K window around the slice. Is there an easy way to do this using advanced indexing or some other method? Thanks in advanced for your help.
-- Jack
In [ ]:
# "data"
ni, nj, nk = (10, 15, 20)
amplitude = np.random.rand(ni, nj, nk)
horizon = np.random.randint(5, 15, size=(ni, nj))
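A possible approach (a sketch, not from the original notebook; `width` is an assumed half-window size): broadcast integer index arrays against each other so that fancy indexing extracts a window of K samples around the horizon at every (i, j) location.
In [ ]:
width = 3  # half-window size (assumed for illustration)
ii = np.arange(ni)[:, np.newaxis, np.newaxis]                  # shape (ni, 1, 1)
jj = np.arange(nj)[np.newaxis, :, np.newaxis]                  # shape (1, nj, 1)
kk = horizon[:, :, np.newaxis] + np.arange(-width, width + 1)  # shape (ni, nj, 2*width + 1)
subvolume = amplitude[ii, jj, kk]  # the three index arrays broadcast together
print(subvolume.shape)             # (ni, nj, 2 * width + 1)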
In [ ]:
An author of a foreign package (included with the exercises as problems/mutable_str.py) provides a string class that allocates its own memory:
In [1]: from mutable_str import MutableString
In [2]: s = MutableString('abcde')
In [3]: print(s)
abcde
You'd like to view these mutable strings (strings that can be modified in place) as ndarrays, in order to manipulate the underlying memory.
Add an __array_interface__ dictionary attribute to s, then convert s to an ndarray. Numerically add "2" to the array (use the in-place operator +=). Then print the original string to ensure that its value was modified.
Hint: documentation for NumPy's __array_interface__ may be found in the online docs.
Here's a skeleton outline:
In [ ]:
import numpy as np
from mutable_str import MutableString

s = MutableString('abcde')

# --- EDIT THIS SECTION ---
# Create an array interface to this foreign object
s.__array_interface__ = {'data': (XXX, False),  # (ptr, is read_only?)
                         'shape': XXX,
                         'typestr': '|u1',      # typecode: unsigned byte
                         }
# --- EDIT THIS SECTION ---

print('String before converting to array:', s)
sa = np.asarray(s)
print('String after converting to array:', sa)
sa += 2
print('String after adding "2" to array:', s)