This short post details an easy way to speed up array comparison. My problem was that I had to compare two 3D arrays, A and B. These arrays differ only in their first dimension and have equal $2^{nd}$ and $3^{rd}$ dimensions. I needed to find out whether any of the 2D arrays along the first dimension of, say, A were also present along the first dimension of B.

The first approach one might try is based on nested for-loops, but that quickly becomes unwieldy even for moderately sized arrays. The faster alternative is to hash the 2D slices of one of the arrays and store them in a hash table; I prefer to do that with the smaller array for space efficiency. The next step is to walk along the first dimension of the other array, hash each 2D slice, and look it up in the table. This requires only sequential for-loops rather than nested ones.

Note that the Python version here is 3.6: in Python 3, calling bytes() on a NumPy array returns its raw data buffer, which is what gets hashed below, whereas Python 2.x behaves differently. Note also the use of f-strings, a nifty new feature of Python 3.6.
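A minimal illustration of that trick (the small arrays here are purely for demonstration): two equal arrays of the same dtype have identical raw buffers, and therefore identical hashes.

import numpy as np

a = np.arange(9).reshape((3, 3))
b = a.copy()

# Same values, same dtype -> same raw buffer -> same hash.
print(hash(bytes(a)) == hash(bytes(b)))   # expected: True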


In [1]:
import sys
import numpy as np
import time
print(f"Python: {sys.version.rsplit('|')[0]}")
print(f"Numpy: {np.__version__}")


Python: 3.6.2 
Numpy: 1.13.1

First create the arrays...


In [2]:
A = np.random.randint(0, 10, size=270).reshape((30, 3, 3))
B = np.random.randint(0, 10, size=450000).reshape((50000, 3, 3))

Then plant the 2D arrays from A at random positions along B's first dimension...


In [3]:
dupIdx = np.random.choice(B.shape[0], size=A.shape[0], replace=False)
B[dupIdx] = A
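
As a quick sanity check (not part of the timing below), one can verify that the planted slices really do match their sources:

# Every planted slice of B should equal the corresponding slice of A.
print((B[dupIdx] == A).all())   # expected: True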

Next, implement, execute, and time the nested for-loop approach one would use to compare the two arrays:


In [4]:
start = time.clock()
count = 0
for bi in B:
    for ai in A:
        if np.array_equal(ai, bi):
            count += 1
tot1 = time.clock() - start
print(f"duplicates: {count} -- time taken: {tot1}")


duplicates: 30 -- time taken: 5.684732

That's pretty long for such small arrays: every 2D slice of B gets compared against every 2D slice of A, so the work grows with len(A) * len(B). Now to implement, execute, and time the hashing-based approach:


In [5]:
start = time.clock()
hshTbl = {}
count = 0
# Hash each 2D slice of the smaller array and remember its index.
for idx, ai in enumerate(A):
    hsh = hash(bytes(ai))
    hshTbl[hsh] = idx
# Walk the larger array, hash each slice, and confirm real matches.
for bi in B:
    hsh = hash(bytes(bi))
    if hsh in hshTbl and np.array_equal(A[hshTbl[hsh]], bi):
        count += 1
tot2 = time.clock() - start
print(f"duplicates: {count} -- time taken: {tot2}")


duplicates: 30 -- time taken: 0.06085700000000038
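
A note on the np.array_equal check in the loop above: it guards against hash collisions, since two different byte strings can in principle hash to the same value. A variant that avoids collisions altogether is to key the dictionary by the raw bytes themselves, at the cost of storing the full byte strings (a sketch of that alternative, not the version timed above):

# Key by the raw buffer itself; equal slices of the same dtype have equal bytes.
byteTbl = {bytes(ai): idx for idx, ai in enumerate(A)}
count = sum(1 for bi in B if bytes(bi) in byteTbl)
print(f"duplicates: {count}")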

In [7]:
tot1 / tot2


Out[7]:
93.41130847724936

Almost 100x faster! Happy coding!