Benchmarks of different version of Cross Correlations

Author: Cody W. Eilar

In this notebook, I explore speed comparisons of several different methods of implementing cross correlation in C++. All these experiments were done using a kernel that is 17x17.


In [46]:
from __future__ import print_function
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Table of Values

In the below table, I compare four different methods for implementing cross correlation.

  • NoGpuSupport - this is OpenCV's normal implementation of cross correlation using the matchTemplate function.
  • GpuSupport - This is OpenCV's GPU implementation of the matchTemplate function.
  • Decompose - This is somewhat of a dummy test that demonstrates the maximum speed that overlap and add could work. This does not include any addtions or memory copies. It just calculates how long it takes to compute the cross correlation multiple times based on what value the image would be broke up. i.e. if the L value is set to 512, there would be a total of 4 cross correlation operations for a 1024x1024 image.
  • OverlapAdd - This benchmark demonstrates my implementation of overlap and add for a 2D signal.

In the table below, Problem Space is referring to the image size, i.e. 512 means that the image is 512x512.


In [53]:
data = pd.read_csv("./cross_correlation_results.csv")
data[['Experiment', 'Problem Space', 'Baseline', 'Iterations/sec', 'Min (us)', 'Mean (us)',
      'Max (us)', 'Standard Deviation']]


Out[53]:
Experiment Problem Space Baseline Iterations/sec Min (us) Mean (us) Max (us) Standard Deviation
0 NoGpuSupport 512 1.000000 340.69200 14676 20277.300 30644 8987.5200
1 NoGpuSupport 1024 1.000000 95.08600 52584 53188.700 54085 791.8660
2 NoGpuSupport 2048 1.000000 19.32700 258706 312162.000 393190 71356.0000
3 NoGpuSupport 4096 1.000000 5.20190 961188 5034320.000 12925656 6835280.0000
4 NoGpuSupport 8192 1.000000 1.10492 4525209 19837100.000 50431764 26495800.0000
5 GpuSupport 512 0.023167 14705.90000 340 925.333 2083 1002.5900
6 GpuSupport 1024 0.005401 17605.60000 284 336.000 373 46.3573
7 GpuSupport 2048 0.001473 13123.40000 381 667.667 1123 398.7200
8 GpuSupport 4096 0.001002 5192.11000 963 1037.330 1088 65.7749
9 GpuSupport 8192 0.013229 83.52540 59862 62917.300 66453 3321.6500
10 Decompose 512 0.084969 4009.62000 1247 7933.330 21256 11537.8000
11 Decompose 1024 0.087327 1088.85000 4592 5887.670 7780 1675.6600
12 Decompose 2048 0.054796 352.70900 14176 16838.000 20755 3464.4000
13 Decompose 4096 0.055190 94.25430 53048 56219.000 60922 4154.5400
14 Decompose 8192 0.057984 19.05560 262390 269734.000 275596 6726.4700
15 OverlapAdd 512 0.291496 1168.77000 4278 16098.700 37689 18725.9000
16 OverlapAdd 1024 0.211642 449.27700 11129 13553.300 18065 3910.8500
17 OverlapAdd 2048 0.194275 99.48270 50260 51690.000 52716 1276.8700
18 OverlapAdd 4096 0.258704 20.10750 248663 253193.000 261774 7435.4900
19 OverlapAdd 8192 0.265302 4.16477 1200547 1305220.000 1360988 90717.4000

Plotted results

In the following plot, I visually demonstrate the statistics for Iterations/sec. These are plotted using log scaling so that smaller values can easily be seen.


In [52]:
import matplotlib.cm as cm
prob_space = data.groupby('Experiment')
ind = np.arange(len(data.groupby('Problem Space')))
colors = cm.rainbow(np.linspace(0, 1, len(ind)))

fig, ax = plt.subplots()
width = .15; 
offset = 0
rects = []
names = []
for (name, group), c in zip(prob_space, colors): 
    names.append(name)
    rects.append(ax.bar(ind +offset, np.log10(group['Iterations/sec']), width, color=c))
    offset = offset + width
    
ax.set_ylabel('log10(Frames per second)')
ax.set_xlabel('Image size in pixels')
ax.set_title('Comparison of Xcorr Methods')
ax.set_xticks(ind + width)
ax.set_xticklabels(data['Problem Space'].unique())
ax.legend(rects, names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)


Out[52]:
<matplotlib.legend.Legend at 0x11a3d7cd0>

Results

From the plot above, we can see that the best implementation, even for large images (8k x 8k), is the GPU supported matchTemplate function available in OpenCV. As we would expect, the decompose experiment is always better than the OverlapAdd experiment but is never able to beat out the GpuSupport experiment. Despite all this, all GPU implementations significantly beat ou the CPU only implementation. Experiments were also attempted using 16k x 16k images, but I found that there wasn't sufficient memory on the GPU to handle images of this size. In that regard, overlap and add can easily handle images of arbitrary size.