Sometimes, we need to deal with NumPy arrays that are too big to fit in the system memory.
A common solution is to use memory mapping and implement out-of-core computations.
The array is stored in a file on the hard drive, and we create a memory-mapped object pointing to this file that can be used like a regular NumPy array.
Accessing a portion of the array results in the corresponding data being automatically fetched from the hard drive. Therefore, we only consume what we use.
In [1]:
import numpy as np
In [2]:
# Let's create a memory-mapped array in write mode
nrows, ncols = 1000000, 100
f = np.memmap('memmapped.dat', dtype=np.float32, mode='w+', shape=(nrows, ncols))
Let's feed the array with random values, one column at a time because our system's memory is limited!
In [3]:
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)
We save the last column of the array:
In [4]:
x = f[:, -1]
Now, we flush the memory changes to disk by deleting the object (alternatively, calling f.flush() writes the pending changes without releasing the memmap):
In [5]:
del f
Reading a memory-mapped array from disk involves the same memmap() function; the default mode is 'r+' (read-write). The data type and the shape need to be specified again, as this information is not stored in the file:
In [8]:
f = np.memmap('memmapped.dat', dtype=np.float32,
              shape=(nrows, ncols))
In [9]:
np.array_equal(f[:, -1], x)
Out[9]:
True
In [10]:
del f
Note:
This method is not suited to long-term data storage or to sharing data with others; a better file format for that specific use case is HDF5.
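For illustration, here is a minimal sketch using the third-party h5py package (not part of this recipe) to store an array of the same shape in an HDF5 file, again writing one column at a time:

import h5py

# Create an HDF5 file and a float32 dataset with the same shape as above.
with h5py.File('memmapped.h5', 'w') as h5f:
    dset = h5f.create_dataset('data', shape=(nrows, ncols),
                              dtype=np.float32)
    for i in range(ncols):
        dset[:, i] = np.random.rand(nrows)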
Memory mapping lets you work with huge arrays almost as if they were regular arrays. Python code that accepts a NumPy array as input will also accept a memmap array. However, we need to make sure that the array is used efficiently, that is, never loaded as a whole; otherwise this would waste system memory and defeat the purpose of the technique.
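For example, here is a minimal sketch (reusing the memmapped.dat file and the nrows and ncols variables defined above; the block size is an arbitrary choice) that computes the global sum block by block, so that only a slice of rows is read from disk at a time:

f = np.memmap('memmapped.dat', dtype=np.float32, shape=(nrows, ncols))
block = 100000  # number of rows processed at a time
total = 0.0
for start in range(0, nrows, block):
    # Summing a slice only touches the pages backing these rows.
    total += float(f[start:start + block].sum(dtype=np.float64))
print(total)
del f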
Memory mapping is also useful when you have a huge file containing raw data in a homogeneous binary format with a known data type and shape.
In this case, an alternative solution is to use NumPy's fromfile() function with a file handle created with Python's native open() function. Calling seek() on the file handle lets you position the cursor at any location and load a given number of bytes into a NumPy array.
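Here is a minimal sketch reusing the memmapped.dat file and the ncols variable from above (the row offset and row count are arbitrary example values): it skips the first ten rows, then loads the next five rows into a regular in-memory array:

itemsize = np.dtype(np.float32).itemsize
with open('memmapped.dat', 'rb') as fh:
    # Position the cursor just after the first 10 rows...
    fh.seek(10 * ncols * itemsize)
    # ...and read 5 rows' worth of float32 values.
    block = np.fromfile(fh, dtype=np.float32, count=5 * ncols)
block = block.reshape((5, ncols))
print(block.shape)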
The numpy package makes it possible to memory map large contiguous chunks of binary files as shared memory for all the Python processes running on a given host:
In [11]:
mm_w = np.memmap('small_test.mmap', shape=10, dtype=np.float32, mode='w+')
print(mm_w)
mode='r+' opens this shared memory area in read-write mode:
In [12]:
mm_r = np.memmap('small_test.mmap', dtype=np.float32, mode='r+')
print(mm_r)
In [13]:
mm_w[0] = 42
print(mm_w)
In [14]:
print(mm_r)
Arrays opened with mode='r+' can be modified, and the modifications are shared between all the views of the buffer:
In [15]:
mm_r[1] = 43
In [16]:
print(mm_r)
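The same sharing works across separate Python processes, since every process that memory-maps the file sees the same underlying buffer. Here is a minimal sketch (assuming the small_test.mmap file created above and using the standard multiprocessing module; run it as a standalone script, as the spawn start method may not find functions defined interactively) that reads the shared buffer from a child process:

import numpy as np
from multiprocessing import Process

def read_shared(path):
    # The child process opens its own read-only view on the same file.
    arr = np.memmap(path, dtype=np.float32, mode='r')
    print("child process sees:", arr[:2])

if __name__ == '__main__':
    p = Process(target=read_shared, args=('small_test.mmap',))
    p.start()
    p.join()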
Memmap arrays generally behave very much like regular in-memory NumPy arrays:
In [17]:
print(mm_r.sum())
print("sum={0}, mean={1}, std={2}".format(mm_r.sum(),
np.mean(mm_r), np.std(mm_r)))
Let's now allocate a much larger memory-mapped buffer on disk (ten million float64 values, about 80 MB):
In [18]:
np.memmap('bigger_test.mmap', shape=10 * int(1e6), dtype=np.float64, mode='w+')
Out[18]:
No significant memory was used in this operation: we merely asked the OS to allocate the buffer on the hard drive and to maintain a virtual memory area as a cheap reference to this buffer.
Let's open a new reference to the same buffer:
In [19]:
%time big_mmap = np.memmap('bigger_test.mmap', dtype=np.float64, mode='r+')
In [20]:
big_mmap
Out[20]:
Summing the whole array touches every page of the buffer:
In [21]:
%time np.sum(big_mmap)
Out[21]:
Running the same computation a second time is typically faster, as the pages touched during the first pass are kept in the operating system's file cache:
In [22]:
%time np.sum(big_mmap)
Out[22]: