Scientific Programming in Python

Topic 3: Handling Very Large Arrays, Memory Mappings

Notebook created by Martín Villanueva - martin.villanueva@usm.cl - DI UTFSM - April 2017.


In [1]:
%matplotlib inline
%load_ext memory_profiler

import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import scipy.sparse
import sys
import h5py


def get_size(obj, seen=None):
    """
    Recursively finds size of objects
    ref: https://goshippo.com/blog/measure-real-size-any-python-object/
    """
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])
    return size
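As a quick sanity check (the nested object below is just an arbitrary example): sys.getsizeof only measures the outer container, whereas get_size also follows its contents.

nested = {"ids": list(range(1000)), "name": "example"}
print(sys.getsizeof(nested))   # size of the dict object alone
print(get_size(nested))        # dict + list + its 1000 ints + the strings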

Memory Profiling

In order to use the %memit magic, you have to install the memory_profiler module with one of these commands:

conda install memory_profiler
pip install memory_profiler

In [2]:
%memit?

Measure memory usage of a Python statement

Usage, in line mode:

%memit [-r<R>t<T>i<I>] statement

Usage, in cell mode:

%%memit [-r<R>t<T>i<I>] setup_code
  code...
  code...

This function can be used both as a line and cell magic:

  • In line mode you can measure a single-line statement (though multiple ones can be chained with using semicolons).

  • In cell mode, the statement in the first line is used as setup code (executed but not measured) and the body of the cell is measured. The cell body has access to any variables created in the setup code.

Options:

-r<R>: repeat the loop iteration <R> times and take the best result. Default: 1

-t<T>: timeout after <T> seconds. Default: None

-i<I>: Get time information at an interval of I times per second. Defaults to 0.1 so that there is ten measurements per second.

-c: If present, add the memory usage of any children process to the report.

-o: If present, return a object containing memit run details

-q: If present, be quiet and do not output a result.

Examples

::

  In [1]: %memit range(10000)
  peak memory: 21.42 MiB, increment: 0.41 MiB

  In [2]: %memit range(1000000)
  peak memory: 52.10 MiB, increment: 31.08 MiB

  In [3]: %%memit l=range(1000000)
     ...: len(l)
     ...:
  peak memory: 52.14 MiB, increment: 0.08 MiB

In [3]:
%memit  np.ones((100,100), dtype=np.float64)


peak memory: 74.50 MiB, increment: 0.28 MiB

1.- Very Large Arrays

numpy.arrays are meant to live in memory. If you want to work with matrices larger than your RAM, you have to work around that. There are at least two approaches you can follow:

  1. Try a more efficient matrix representation that exploits any special structure that your matrices have. For example, there are efficient data structures for sparse matrices (matrices with lots of zeros), like scipy.sparse.csc_matrix.
  2. Modify your algorithm to work on submatrices. You can read from disk only the matrix blocks that are currently needed for the computation. Algorithms designed to run on clusters usually work blockwise, since the data is scattered across different machines and transferred only when needed; a minimal sketch of this pattern follows.
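As a minimal sketch of the blockwise idea in point 2 (here a plain in-memory array stands in for a matrix that would normally live on disk, and block_size is an arbitrary choice):

big = np.random.random((10000, 500))   # stand-in for a matrix stored on disk
block_size = 1000
col_sums = np.zeros(big.shape[1])
for start in range(0, big.shape[0], block_size):
    block = big[start:start + block_size]   # only this block needs to be in memory
    col_sums += block.sum(axis=0)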

2.- NumPy's Memory Mappings

Some important features of Memory Mappings:

  1. Memory mapping lets you work with huge arrays almost as if they were regular arrays.
  2. Python code that accepts a NumPy array as input will also accept a memmap array.
  3. We need to ensure that the array is used efficiently, i.e., it is never loaded into memory as a whole.
  4. This method is not the best choice for long-term storage and data sharing (HDF5 is the right tool for that!).

1.- Let's create a memory-mapped array:


In [4]:
nrows, ncols = 100, 1000000
f = np.memmap('memmapped.dat', dtype=np.float32, mode='w+', shape=(nrows, ncols))

print(type(f))


<class 'numpy.core.memmap.memmap'>

2.- Let's fill the array with random values, one row at a time, because our system's memory is limited! (The explicit del also reduces garbage-collector pressure.)


In [5]:
%%memit
for i in range(nrows):
    tmp = np.random.random(ncols)
    f[i,:] = tmp
    del tmp


peak memory: 463.69 MiB, increment: 389.13 MiB

In [6]:
%%memit
np.random.random((nrows,ncols))


peak memory: 1203.21 MiB, increment: 739.52 MiB

We save the last row of the array:


In [7]:
x = f[-1,:]

3.- Now, we flush memory changes to disk by deleting the object:


In [8]:
del f
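Deleting the object is not the only way to persist pending writes; numpy.memmap also exposes a flush() method. A small sketch (the name g is just illustrative), useful if we want to keep working with the mapping after saving:

g = np.memmap('memmapped.dat', dtype=np.float32, mode='r+', shape=(nrows, ncols))
# ... modify g in place here ...
g.flush()   # writes pending changes to disk while the object stays usable
del g       # still needed eventually to release the mapping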

4.- Reading a memory-mapped array from disk involves the same memmap function. The data type and the shape need to be specified again, as this information is not stored in the file:


In [9]:
%%memit 
f = np.memmap('memmapped.dat', dtype=np.float32, shape=(nrows, ncols))


peak memory: 463.70 MiB, increment: 0.00 MiB

In [10]:
%%memit
np.array_equal(f[-1,:],x)


peak memory: 468.48 MiB, increment: 4.78 MiB

In [11]:
print(type(f[-1,:]))


<class 'numpy.core.memmap.memmap'>
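Slicing a memmap returns another memmap backed by the same file, so data is still read lazily. A small illustration of the difference between pulling one row and the whole array into RAM (np.array forces a copy into memory):

row = np.array(f[-1, :])      # np.array copies, so only this row is read into RAM
print(type(row), row.nbytes)  # a plain ndarray of ncols * 4 bytes
# np.array(f) instead would copy the whole ~400 MB file into memory -- avoid it.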

3.- HDF5 and h5py

Why HDF5 instead of NumPy's native memmap:

  1. HDF5 (Hierarchical Data Format version 5) is an open format specification, i.e., language independent: it can be used from Python, Matlab, R, C, Java, etc.
  2. HDF5 not only implements memory mappings but also provides a POSIX-like hierarchy to organize different arrays.
  3. It is more versatile than NumPy's memmap: numpy.memmap stores the array in binary form on the hard disk using contiguous blocks of memory, whereas HDF5 works with chunks (which are atomic objects) organized in a B-tree structure; only the content of each chunk is stored contiguously on disk.

There are two libraries to work with HDF5 in Python: PyTables and h5py. We use the latter since it is more lightweight and, in some situations, better suited than PyTables.

1.- Let's create a new empty HDF5 file:


In [12]:
f = h5py.File("myfile.h5", "w")

2.- We create a 100x1000000 array (in fact, just the memory mapping on disk...)


In [13]:
nrows = 100
ncols = 1000000
dset = f.create_dataset("dataset", (nrows,ncols), dtype='f')

In [14]:
print(type(dset))
print(dset.shape)
print(dset.dtype)


<class 'h5py._hl.dataset.Dataset'>
(100, 1000000)
float32

3.- We now fill it with random values


In [15]:
%%memit
for i in range(nrows):
    tmp = np.random.random(ncols)
    dset[i,:] = tmp
    del tmp


peak memory: 466.38 MiB, increment: 1.12 MiB

In [16]:
%%memit
a = np.random.random((nrows,ncols))


peak memory: 1229.33 MiB, increment: 762.95 MiB

In [17]:
# we will store the last row
x = dset[-1,:]
print(type(x))


<class 'numpy.ndarray'>

About groups and hierarchical organization...

An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups. The most fundamental thing to remember when using h5py is:

Groups work like folders, and datasets work like NumPy arrays!


In [18]:
# The File object we created is itself a group, in this case the root group
print( f.name )
print( dset.name )


/
/dataset

In [19]:
# Let's create a subgroup
f.create_group("subgroup")


Out[19]:
<HDF5 group "/subgroup" (0 members)>

In [20]:
# Let's create a new dataset inside this subgroup
dset2 = f.create_dataset('subgroup/dataset2', (10,), dtype='i')
print(dset2.name)


/subgroup/dataset2

In [21]:
# Retrieving datasets/arrays from hdf5 file
print( f["dataset"] )
print( f["subgroup/dataset2"] )


<HDF5 dataset "dataset": shape (100, 1000000), type "<f4">
<HDF5 dataset "dataset2": shape (10,), type "<i4">

In [22]:
f.close()

Now let's re-open the HDF5 file and compare the last row:


In [23]:
%%memit
f = h5py.File("myfile.h5", "r+")
dset = f["dataset"]


peak memory: 851.89 MiB, increment: 0.04 MiB

In [24]:
%%memit
print(np.array_equal(dset[-1,:],x))


True
True
True
peak memory: 856.66 MiB, increment: 4.77 MiB

A practical example: Huge matrix-vector multiplication.

Suppose we want to compute the product of $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^n$ (the result is $c = Ab \in \mathbb{R}^m$), where $A$ doesn't fit in the available RAM. How can we solve this? Answer: memory mappings.


In [25]:
f.close()

In [26]:
# We first create the hdf5 file, and add a dataset/array
nrows = 1000
ncols = 100000

f = h5py.File("test.h5", "w")
dset = f.create_dataset("array", (nrows,ncols), dtype='f')

In [27]:
%%memit
tmp = np.random.random((nrows,ncols))


peak memory: 1263.43 MiB, increment: 406.77 MiB

We fill the array with random values, without ever loading the whole array into memory:


In [28]:
%%memit -r 2
for i in range(nrows):
    tmp = np.random.random(ncols)
    dset[i,:] = tmp
    del tmp


peak memory: 298.52 MiB, increment: 0.04 MiB

Then we create the $b$ array and an empty array $c$ (of length nrows) to store the result:


In [29]:
b = np.random.random(ncols)
c = np.empty(nrows)  # c = A.dot(b) has one entry per row of A

And perform the dot product without loading $A$ completely into main memory:


In [30]:
%%memit
for i in range(nrows):
    c[i] = np.dot(dset[i,:],b)


peak memory: 78.02 MiB, increment: 3.11 MiB

In [31]:
%%timeit
for i in range(nrows):
    c[i] = np.dot(dset[i,:],b)


1 loop, best of 3: 474 ms per loop

We save memory, but pay a time overhead (the time needed to retrieve each row of dset into main memory):


In [32]:
A = np.random.random((nrows,ncols))

In [33]:
%%timeit 
np.dot(A,b)


10 loops, best of 3: 56.3 ms per loop

Note: The performance of the memory-mapping approach can be improved by multiplying more than one row at each iteration, as sketched below.
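For instance, a minimal sketch that processes block_rows rows per iteration (block_rows is an arbitrary choice; larger blocks mean fewer reads but more memory):

block_rows = 100
for start in range(0, nrows, block_rows):
    stop = min(start + block_rows, nrows)
    # one disk read brings block_rows rows in, one BLAS call multiplies them
    c[start:stop] = np.dot(dset[start:stop, :], b)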

Chunking the datasets: Improving I/O performance

  • An HDF5 dataset created with the default settings is contiguous, i.e., laid out on disk in traditional C order.
  • Datasets may also be created using HDF5's chunked storage layout. This means the dataset is divided up into regularly-sized pieces which are stored at arbitrary locations on disk and indexed using a B-tree.

When a single element of a chunk is accessed, the whole chunk is loaded (chunks are atomic)!


In [34]:
# To enable chunked storage, set the keyword chunks to a tuple indicating the chunk shape
f = h5py.File("test2.h5", "w")
dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100), dtype='f')

Data is now read and written in blocks of shape (100,100); a slice like dset[0:100,0:100], for example, maps exactly onto a single chunk.

Trade-off: many small chunks cause a large overhead from managing lots of chunks, while a few large chunks cause inefficient disk I/O (a whole chunk is read even when only a small part of it is needed). Keeping the chunk size below 1 MiB is recommended.
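h5py can also choose a chunk shape automatically with chunks=True; a brief sketch (the dataset name "auto" is just illustrative):

dset_auto = f.create_dataset("auto", (1000, 1000), chunks=True, dtype='f')
print(dset.chunks)           # (100, 100), the shape we chose explicitly above
print(dset_auto.chunks)      # a chunk shape guessed by h5py
block = dset[0:100, 0:100]   # a read aligned with the chunk boundaries touches a single chunk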

4.- Sparse Matrices

A sparse array is an array in which most of the elements are zero. Explicitly storing and computing with a large number of zero-valued elements is inefficient in both memory and time, so sparse representations store only the nonzero entries.

Such matrices appear naturally in many applications:

  1. Finite Element Methods (PDEs solving method).
  2. Finite Differences Methods (PDEs solving method).
  3. Discrete Wavelet Transforms (à trous algorithm).
  4. Machine Learning: SVM and support vectors.
  5. Machine Learning: Neural Networks (FeedForward) Weights.
  6. Etc...

Generate a sparse matrix of the given shape and density with randomly distributed values:


In [35]:
nrows = 20
ncols = 20
A = sp.sparse.random(nrows, ncols, density=0.1, format='csr')
B = A.toarray()
print(type(A))
print(type(B))


<class 'scipy.sparse.csr.csr_matrix'>
<class 'numpy.ndarray'>

And we can visualize the matrix structure with the matplotlib.pyplot.matshow function:


In [36]:
plt.figure(figsize=(15,15))
plt.matshow(B, cmap=plt.cm.gray, fignum=0)  # draw into the figure created above instead of a new one
plt.show()


[matshow figure showing the sparsity pattern of B]

How are sparse matrices stored in memory?

Answer: by storing only the nonzero values together with their positions, using format-dependent data structures (index arrays, lists of lists, dictionaries, ...). There are seven available sparse matrix types/formats:

  1. csc_matrix: Compressed Sparse Column format
  2. csr_matrix: Compressed Sparse Row format
  3. bsr_matrix: Block Sparse Row format
  4. lil_matrix: List of Lists format
  5. dok_matrix: Dictionary of Keys format
  6. coo_matrix: COOrdinate format (aka IJV, triplet format)
  7. dia_matrix: DIAgonal format

CSR example

For more info see Scipy Documentation
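To make the CSR layout concrete, here is a small sketch that builds a tiny scipy.sparse.csr_matrix and prints its three internal arrays (data, indices, indptr); the matrix values are arbitrary:

M = sp.sparse.csr_matrix(np.array([[1, 0, 2],
                                   [0, 0, 3],
                                   [4, 5, 6]]))
print(M.data)     # nonzero values, row by row:        [1 2 3 4 5 6]
print(M.indices)  # column index of each stored value: [0 2 2 0 1 2]
print(M.indptr)   # start of each row inside data:     [0 2 3 6]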

Important note: Despite their similarity to NumPy arrays, it is strongly discouraged to use NumPy functions directly on these matrices because NumPy may not properly convert them for computations, leading to unexpected (and incorrect) results.
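In practice this means using the sparse object's own methods (or scipy.sparse functions) and only converting to dense explicitly when it fits in memory; a brief sketch using the A defined above (v is just an illustrative vector):

v = np.random.random(A.shape[1])
y_sparse = A.dot(v)               # sparse-aware matrix-vector product (same as A @ v)
y_dense = np.dot(A.toarray(), v)  # densify explicitly first if NumPy routines are needed
print(np.allclose(y_sparse, y_dense))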

A practical example again...


In [37]:
nrows = 1000
ncols = 100000
A = sp.sparse.random(nrows, ncols, density=0.1, format='csr')
B = A.toarray()
b = np.random.random(ncols)

In [38]:
print("Size of A:",get_size(A),"Bytes")
print("Size of B:",get_size(B),"Bytes")


Size of A: 4940 Bytes
Size of B: 800000368 Bytes

In [39]:
%timeit A.dot(b)


10 loops, best of 3: 23.6 ms per loop

In [40]:
%timeit B.dot(b)


10 loops, best of 3: 57.2 ms per loop