Notebook created by Martín Villanueva - martin.villanueva@usm.cl - DI UTFSM - April 2017.
In [1]:
%matplotlib inline
%load_ext memory_profiler
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import scipy.sparse
import sys
import h5py
def get_size(obj, seen=None):
    """
    Recursively finds size of objects
    ref: https://goshippo.com/blog/measure-real-size-any-python-object/
    """
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important: mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])
    return size
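As a quick illustration (a sketch, not part of the original notebook), get_size traverses the contents of containers, unlike sys.getsizeof, which only measures the container object itself:

d = {"a": list(range(1000)), "b": "hello"}
print(sys.getsizeof(d))  # size of the dict object alone
print(get_size(d))       # dict plus the list, its integers, and the string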
In order to use the memit magic, you have to install the memory_profiler module with one of these commands:
conda install memory_profiler
pip install memory_profiler
In [2]:
%memit?
Measure memory usage of a Python statement
Usage, in line mode:
%memit [-r<R>t<T>i<I>] statement
Usage, in cell mode:
%%memit [-r<R>t<T>i<I>] setup_code
code...
code...
This function can be used both as a line and cell magic:
In line mode you can measure a single-line statement (though multiple ones can be chained using semicolons).
In cell mode, the statement in the first line is used as setup code (executed but not measured) and the body of the cell is measured. The cell body has access to any variables created in the setup code.
Options:
-r<R>: Repeat the loop iteration <R> times and take the best result. Default: 1
-t<T>: Timeout after <T> seconds. Default: None
-i<I>: Get time information at an interval of I times per second. Defaults to 0.1 so that there are ten measurements per second.
-c: If present, add the memory usage of any children process to the report.
-o: If present, return an object containing memit run details.
-q: If present, be quiet and do not output a result.
Examples:
In [1]: %memit range(10000)
peak memory: 21.42 MiB, increment: 0.41 MiB
In [2]: %memit range(1000000)
peak memory: 52.10 MiB, increment: 31.08 MiB
In [3]: %%memit l=range(1000000)
...: len(l)
...:
peak memory: 52.14 MiB, increment: 0.08 MiB
In [3]:
%memit np.ones((100,100), dtype=np.float64)
numpy.arrays are meant to live in memory. If you want to work with matrices larger than your RAM, you have to work around that. There are at least two approaches you can follow: memory mappings (numpy.memmap or HDF5 datasets) and sparse matrices (e.g. scipy.sparse.csc_matrix).
Some important features of memory mappings are illustrated in the following steps.
1.- Let's create a memory-mapped array:
In [4]:
nrows, ncols = 100, 1000000
f = np.memmap('memmapped.dat', dtype=np.float32, mode='w+', shape=(nrows, ncols))
print(type(f))
2.- Let's feed the array with random values, one row at a time, because our system's memory is limited! (Deleting tmp after each row reduces the garbage collector's work.)
In [5]:
%%memit
for i in range(nrows):
    tmp = np.random.random(ncols)
    f[i,:] = tmp
    del tmp
In [6]:
%%memit
np.random.random((nrows,ncols))
We save the last row of the array:
In [7]:
x = f[-1,:]
3.- Now, we flush memory changes to disk by deleting the object:
In [8]:
del f
4.- Reading a memory-mapped array from disk involves the same memmap function. The data type and the shape need to be specified again, as this information is not stored in the file:
In [9]:
%%memit
f = np.memmap('memmapped.dat', dtype=np.float32, shape=(nrows, ncols))
In [10]:
%%memit
np.array_equal(f[-1,:],x)
In [11]:
print(type(f[-1,:]))
Why HDF5 instead of NumPy's native memmap?
The latter (numpy.memmap) stores the array in binary form on the hard disk using contiguous blocks of memory, whereas the former (hdf5) works with chunks (which are atomic objects) organized in a B-tree structure; the content of each chunk is stored contiguously on the hard disk.
There are two libraries to work with HDF5 in Python: PyTables and h5py. We use the second one since it is more lightweight and better adapted than PyTables in some situations.
1.- Let's create a new empty HDF5 file:
In [12]:
f = h5py.File("myfile.h5", "w")
2.- We create a 100x1000000 array (just the on-disk mapping in fact, no data is loaded into memory yet):
In [13]:
nrows = 100
ncols = 1000000
dset = f.create_dataset("dataset", (nrows,ncols), dtype='f')
In [14]:
print(type(dset))
print(dset.shape)
print(dset.dtype)
3.- We now fill it with random values, one row at a time:
In [15]:
%%memit
for i in range(nrows):
    tmp = np.random.random(ncols)
    dset[i,:] = tmp
    del tmp
In [16]:
%%memit
a = np.random.random((nrows,ncols))
In [17]:
# we will store the last row
x = dset[-1,:]
print(type(x))
An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups. The most fundamental thing to remember when using h5py is:
Groups work like folders, and datasets work like NumPy arrays!
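As a small sketch (not in the original notebook), the file f created above can be explored with h5py's dict-like API, and slicing a dataset returns an ordinary NumPy array:

print(list(f.keys()))              # top-level members of the root group
print("dataset" in f)              # membership test, as with a folder
print(type(f["dataset"][0, :10]))  # slicing yields a numpy.ndarray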
In [18]:
# The File object we created is itself a group, in this case the root group
print( f.name )
print( dset.name )
In [19]:
# Let's create a subgroup
f.create_group("subgroup")
Out[19]:
In [20]:
# Let's create a new dataset inside this subgroup
dset2 = f.create_dataset('subgroup/dataset2', (10,), dtype='i')
print(dset2.name)
In [21]:
# Retrieving datasets/arrays from hdf5 file
print( f["dataset"] )
print( f["subgroup/dataset2"] )
In [22]:
f.close()
Now let's re-open the hdf5 file and compare the last row:
In [23]:
%%memit
f = h5py.File("myfile.h5", "r+")
dset = f["dataset"]
In [24]:
%%memit
print(np.array_equal(dset[-1,:],x))
In [25]:
f.close()
In [26]:
# We first create the hdf5 file, and add a dataset/array
nrows = 1000
ncols = 100000
f = h5py.File("test.h5", "w")
dset = f.create_dataset("array", (nrows,ncols), dtype='f')
In [27]:
%%memit
tmp = np.random.random((nrows,ncols))
We fill the array with random values, without loading the whole array into memory at once:
In [28]:
%%memit -r 2
for i in range(nrows):
    tmp = np.random.random(ncols)
    dset[i,:] = tmp
    del tmp
Then we create the $b$ array and an empty array $c$ to store the result:
In [29]:
b = np.random.random(ncols)
c = np.empty(nrows)
And perform the dot product without completely loading $A$ (the on-disk dataset dset) into main memory:
In [30]:
%%memit
for i in range(nrows):
    c[i] = np.dot(dset[i,:],b)
In [31]:
%%timeit
for i in range(nrows):
    c[i] = np.dot(dset[i,:],b)
We save memory, but pay a time overhead (the time needed to retrieve each row of dset into main memory):
In [32]:
A = np.random.random((nrows,ncols))
In [33]:
%%timeit
np.dot(A,b)
Note: The performance of the memory-mapping approach can be improved by multiplying more than one row at each iteration!
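A minimal sketch of that idea, reusing dset, b and nrows from above (the block size of 100 and the name c_blocked are illustrative choices, not from the original notebook):

block = 100
c_blocked = np.empty(nrows)
for i in range(0, nrows, block):
    j = min(i + block, nrows)
    # only `block` rows of the on-disk dataset are loaded per iteration
    c_blocked[i:j] = dset[i:j, :].dot(b)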
An hdf5 dataset created with the default settings is contiguous, i.e. laid out on disk in traditional C order. Datasets may also be created using hdf5's chunked storage layout: the dataset is divided up into regularly-sized pieces which are stored haphazardly on disk and indexed using a B-tree. Whenever a single value inside a chunk is accessed, the whole chunk is loaded (chunks are atomic)!
In [34]:
# To enable chunked storage, set the keyword chunks to a tuple indicating the chunk shape
f = h5py.File("test2.h5", "w")
dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100), dtype='f')
Data can still be read and written with arbitrary slices, but access is most efficient in blocks aligned with the chunk shape (100,100), e.g. dset[0:100,0:100].
Trade-off: There is a trade-off between many small chunks (large overhead due to managing lots of chunks) and a few large chunks (inefficient disk I/O). The chunk size is recommended to be smaller than 1 MB.
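A short sketch of chunk-aligned access on the dataset created above (the particular block indices are only illustrative):

block = np.random.random((100, 100))
dset[0:100, 0:100] = block          # writes exactly one chunk
corner = dset[200:300, 500:600]     # reads exactly one chunk into memory
print(corner.shape)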
A sparse array is an array in which most of the elements have the value 0. Explicitly storing and computing with a large number of zero-valued elements is inefficient in both storage and computation.
Such matrices appear naturally in many applications.
Generate a sparse matrix of the given shape and density with randomly distributed values:
In [35]:
nrows = 20
ncols = 20
A = sp.sparse.random(nrows, ncols, density=0.1, format='csr')
B = A.toarray()
print(type(A))
print(type(B))
And we can visualize the matrix structure with the matplotlib.pyplot.matshow function:
In [36]:
plt.figure(figsize=(15,15))
plt.matshow(B, cmap=plt.cm.gray, fignum=0)  # draw into the current figure so figsize takes effect
plt.show()
Answer: With linked lists! (We just store the needed values in such lists.) There are seven available sparse matrix types/formats in scipy.sparse: csc_matrix (Compressed Sparse Column), csr_matrix (Compressed Sparse Row), bsr_matrix (Block Sparse Row), lil_matrix (List of Lists), dok_matrix (Dictionary of Keys), coo_matrix (COOrdinate format), and dia_matrix (DIAgonal storage).
For more info see Scipy Documentation
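For instance, a common pattern (a sketch; the names and values are illustrative, not from the original notebook) is to build a matrix incrementally in lil_matrix form and convert it to csr_matrix for fast arithmetic:

M = sp.sparse.lil_matrix((1000, 1000))   # efficient for incremental insertion
M[0, 10] = 3.0
M[500, 2] = -1.5
M_csr = M.tocsr()                        # efficient for matrix-vector products
v = np.random.random(1000)
print(M_csr.dot(v).shape)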
Important note: Despite their similarity to NumPy arrays, it is strongly discouraged to use NumPy functions directly on these matrices because NumPy may not properly convert them for computations, leading to unexpected (and incorrect) results.
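In practice this means preferring the sparse matrix's own methods, or converting explicitly with toarray, instead of passing the sparse object to NumPy functions. A short sketch (names are illustrative):

S = sp.sparse.random(100, 100, density=0.05, format='csr')
v = np.random.random(100)
y_sparse = S.dot(v)               # sparse-aware matrix-vector product
y_dense = np.dot(S.toarray(), v)  # or convert explicitly before using NumPy
print(np.allclose(y_sparse, y_dense))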
In [37]:
nrows = 1000
ncols = 100000
A = sp.sparse.random(nrows, ncols, density=0.1, format='csr')
B = A.toarray()
b = np.random.random(ncols)
In [38]:
print("Size of A:",get_size(A),"Bytes")
print("Size of B:",get_size(B),"Bytes")
In [39]:
%timeit A.dot(b)
In [40]:
%timeit B.dot(b)