Processing large NumPy arrays with memory mapping

Reference: IPython Interactive Computing and Visualization Cookbook - Second Edition, by Cyrille Rossant

Sometimes, we need to deal with NumPy arrays that are too big to fit in the system memory.

A common solution is to use memory mapping and implement out-of-core computations.

The array is stored in a file on the hard drive, and we create a memory-mapped object to this file that can be used as a regular NumPy array.

Accessing a portion of the array results in the corresponding data being automatically fetched from the hard drive. Therefore, we only consume what we use.



In [1]:

    
import numpy as np



In [2]:

    
# Let's create a Memory-Mapped Array in write mode

nrows, ncols = 1000000, 100
f = np.memmap('memmapped.dat', dtype=np.float32, mode='w+', shape=(nrows, ncols))

Let's feed the array with random values, one column at a time because our system's memory is limited!



In [3]:

    
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)

Save the last column of the Array



In [4]:

    
x = f[:, -1]

Now, we flush memory changes to disk by deleting the object:



In [5]:

    
del f

Reading a memory-mapped array from disk involves the same memmap() function. The data type and the shape need to be specified again, as this information is not stored in the file:



In [8]:

    
f = np.memmap('memmapped.dat', dtype=np.float32,
                  shape=(nrows, ncols))



In [9]:

    
np.array_equal(f[:, -1], x)









    Out[9]:





True



In [10]:

    
del f

Note:

This method is not adapted for long-term storage of data and data sharing. A better file format for this specific case will be the HDF5.

How `memmap` works

Memory mapping lets you work with huge arrays almost as if they were regular arrays. Python code that accepts a NumPy array as input will also accept a memmap array. However, we need to ensure that the array is used efficiently. That is, the array is never loaded as a whole (otherwise, it would waste system memory and would obviate any advantage of the technique).

Memory mapping is also useful when you have a huge file containing raw data in a homogeneous binary format with a known data type and shape.

In this case, an alternative solution is to use NumPy's fromfile() function with a file handle created with Python's native open() function.

Using f.seek() lets you position the cursor at any location and load a given number of bytes into a NumPy array.

The numpy package makes it possible to memory map large contiguous chunks of binary files as shared memory for all the Python processes running on a given host:

Memmap Operations



In [11]:

    
mm_w = np.memmap('small_test.mmap', shape=10, dtype=np.float32, mode='w+')
print(mm_w)









    



[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

This binary file can then be mapped as a new numpy array by all the engines having access to the same filesystem.
The mode='r+' opens this shared memory area in read write mode:



In [12]:

    
mm_r = np.memmap('small_test.mmap', dtype=np.float32, mode='r+')
print(mm_r)









    



[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]



In [13]:

    
mm_w[0] = 42
print(mm_w)









    



[42.  0.  0.  0.  0.  0.  0.  0.  0.  0.]



In [14]:

    
print(mm_r)









    



[42.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Memory mapped arrays created with mode='r+' can be modified and the modifications are shared
- in case of multiple process



In [15]:

    
mm_r[1] = 43



In [16]:

    
print(mm_r)









    



[42. 43.  0.  0.  0.  0.  0.  0.  0.  0.]

Memmap arrays generally behave very much like regular in-memory numpy arrays:



In [17]:

    
print(mm_r.sum())
print("sum={0}, mean={1}, std={2}".format(mm_r.sum(), 
                                          np.mean(mm_r), np.std(mm_r)))









    



85.0
sum=85.0, mean=8.5, std=17.0014705657959

Before allocating more data let us define a couple of utility functions from the previous exercise (and more) to monitor what is used by which engine and what is still free on the cluster as a whole:

Let's allocate a 80MB memmap array:



In [18]:

    
np.memmap('bigger_test.mmap', shape=10 * int(1e6), dtype=np.float64, mode='w+')









    Out[18]:





memmap([0., 0., 0., ..., 0., 0., 0.])

No significant memory was used in this operation as we just asked the OS to allocate the buffer on the hard drive and just maitain a virtual memory area as a cheap reference to this buffer.

Let's open new references to the same buffer from all the engines at once:



In [19]:

    
%time big_mmap = np.memmap('bigger_test.mmap', dtype=np.float64, mode='r+')









    



CPU times: user 616 µs, sys: 778 µs, total: 1.39 ms
Wall time: 17.3 ms



In [20]:

    
big_mmap









    Out[20]:





memmap([0., 0., 0., ..., 0., 0., 0.])

Let's trigger an actual load of the data from the drive into the in-memory disk cache of the OS, this can take some time depending on the speed of the hard drive (on the order of 100MB/s to 300MB/s hence 3s to 8s for this dataset):



In [21]:

    
%time np.sum(big_mmap)









    



CPU times: user 20.5 ms, sys: 32.9 ms, total: 53.5 ms
Wall time: 54.3 ms






    Out[21]:





0.0

Now back into memory



In [22]:

    
%time np.sum(big_mmap)









    



CPU times: user 15 ms, sys: 1.36 ms, total: 16.4 ms
Wall time: 14.7 ms






    Out[22]:





0.0

Processing large NumPy arrays with memory mapping

How memmap works

Memmap Operations

How `memmap` works