Processing large NumPy arrays with memory mapping

Reference: IPython Interactive Computing and Visualization Cookbook - Second Edition, by Cyrille Rossant


Sometimes, we need to deal with NumPy arrays that are too big to fit in the system memory.

A common solution is to use memory mapping and implement out-of-core computations.

The array is stored in a file on the hard drive, and we create a memory-mapped object to this file that can be used as a regular NumPy array.

Accessing a portion of the array results in the corresponding data being automatically fetched from the hard drive. Therefore, we only consume what we use.


In [1]:
import numpy as np

In [2]:
# Let's create a Memory-Mapped Array in write mode

nrows, ncols = 1000000, 100
f = np.memmap('memmapped.dat', dtype=np.float32, mode='w+', shape=(nrows, ncols))

Let's feed the array with random values, one column at a time because our system's memory is limited!


In [3]:
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)

Save the last column of the Array


In [4]:
x = f[:, -1]

Now, we flush memory changes to disk by deleting the object:


In [5]:
del f

Reading a memory-mapped array from disk involves the same memmap() function. The data type and the shape need to be specified again, as this information is not stored in the file:


In [8]:
f = np.memmap('memmapped.dat', dtype=np.float32,
                  shape=(nrows, ncols))

In [9]:
np.array_equal(f[:, -1], x)


Out[9]:
True

In [10]:
del f

Note:

This method is not adapted for long-term storage of data and data sharing. A better file format for this specific case will be the HDF5.

How memmap works

Memory mapping lets you work with huge arrays almost as if they were regular arrays. Python code that accepts a NumPy array as input will also accept a memmap array. However, we need to ensure that the array is used efficiently. That is, the array is never loaded as a whole (otherwise, it would waste system memory and would obviate any advantage of the technique).

Memory mapping is also useful when you have a huge file containing raw data in a homogeneous binary format with a known data type and shape.

In this case, an alternative solution is to use NumPy's fromfile() function with a file handle created with Python's native open() function.

Using f.seek() lets you position the cursor at any location and load a given number of bytes into a NumPy array.

The numpy package makes it possible to memory map large contiguous chunks of binary files as shared memory for all the Python processes running on a given host:

Memmap Operations


In [11]:
mm_w = np.memmap('small_test.mmap', shape=10, dtype=np.float32, mode='w+')
print(mm_w)


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  • This binary file can then be mapped as a new numpy array by all the engines having access to the same filesystem.
  • The mode='r+' opens this shared memory area in read write mode:

In [12]:
mm_r = np.memmap('small_test.mmap', dtype=np.float32, mode='r+')
print(mm_r)


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

In [13]:
mm_w[0] = 42
print(mm_w)


[42.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

In [14]:
print(mm_r)


[42.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
  • Memory mapped arrays created with mode='r+' can be modified and the modifications are shared
    • in case of multiple process

In [15]:
mm_r[1] = 43

In [16]:
print(mm_r)


[42. 43.  0.  0.  0.  0.  0.  0.  0.  0.]

Memmap arrays generally behave very much like regular in-memory numpy arrays:


In [17]:
print(mm_r.sum())
print("sum={0}, mean={1}, std={2}".format(mm_r.sum(), 
                                          np.mean(mm_r), np.std(mm_r)))


85.0
sum=85.0, mean=8.5, std=17.0014705657959

Before allocating more data let us define a couple of utility functions from the previous exercise (and more) to monitor what is used by which engine and what is still free on the cluster as a whole:

  • Let's allocate a 80MB memmap array:

In [18]:
np.memmap('bigger_test.mmap', shape=10 * int(1e6), dtype=np.float64, mode='w+')


Out[18]:
memmap([0., 0., 0., ..., 0., 0., 0.])

No significant memory was used in this operation as we just asked the OS to allocate the buffer on the hard drive and just maitain a virtual memory area as a cheap reference to this buffer.

Let's open new references to the same buffer from all the engines at once:


In [19]:
%time big_mmap = np.memmap('bigger_test.mmap', dtype=np.float64, mode='r+')


CPU times: user 616 µs, sys: 778 µs, total: 1.39 ms
Wall time: 17.3 ms

In [20]:
big_mmap


Out[20]:
memmap([0., 0., 0., ..., 0., 0., 0.])
  • Let's trigger an actual load of the data from the drive into the in-memory disk cache of the OS, this can take some time depending on the speed of the hard drive (on the order of 100MB/s to 300MB/s hence 3s to 8s for this dataset):

In [21]:
%time np.sum(big_mmap)


CPU times: user 20.5 ms, sys: 32.9 ms, total: 53.5 ms
Wall time: 54.3 ms
Out[21]:
0.0
  • Now back into memory

In [22]:
%time np.sum(big_mmap)


CPU times: user 15 ms, sys: 1.36 ms, total: 16.4 ms
Wall time: 14.7 ms
Out[22]:
0.0