Benchmarks of different version of Cross Correlations

Author: Cody W. Eilar

In this notebook, I explore speed comparisons of several different methods of implementing cross correlation in C++. All these experiments were done using a kernel that is 17x17.



In [46]:

    
from __future__ import print_function
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Table of Values

In the below table, I compare four different methods for implementing cross correlation.

NoGpuSupport - this is OpenCV's normal implementation of cross correlation using the matchTemplate function.
GpuSupport - This is OpenCV's GPU implementation of the matchTemplate function.
Decompose - This is somewhat of a dummy test that demonstrates the maximum speed that overlap and add could work. This does not include any addtions or memory copies. It just calculates how long it takes to compute the cross correlation multiple times based on what value the image would be broke up. i.e. if the L value is set to 512, there would be a total of 4 cross correlation operations for a 1024x1024 image.
OverlapAdd - This benchmark demonstrates my implementation of overlap and add for a 2D signal.

In the table below, Problem Space is referring to the image size, i.e. 512 means that the image is 512x512.



In [53]:

    
data = pd.read_csv("./cross_correlation_results.csv")
data[['Experiment', 'Problem Space', 'Baseline', 'Iterations/sec', 'Min (us)', 'Mean (us)',
      'Max (us)', 'Standard Deviation']]









    Out[53]:






  
    
      
      Experiment
      Problem Space
      Baseline
      Iterations/sec
      Min (us)
      Mean (us)
      Max (us)
      Standard Deviation
    
  
  
    
      0
      NoGpuSupport
      512
      1.000000
      340.69200
      14676
      20277.300
      30644
      8987.5200
    
    
      1
      NoGpuSupport
      1024
      1.000000
      95.08600
      52584
      53188.700
      54085
      791.8660
    
    
      2
      NoGpuSupport
      2048
      1.000000
      19.32700
      258706
      312162.000
      393190
      71356.0000
    
    
      3
      NoGpuSupport
      4096
      1.000000
      5.20190
      961188
      5034320.000
      12925656
      6835280.0000
    
    
      4
      NoGpuSupport
      8192
      1.000000
      1.10492
      4525209
      19837100.000
      50431764
      26495800.0000
    
    
      5
      GpuSupport
      512
      0.023167
      14705.90000
      340
      925.333
      2083
      1002.5900
    
    
      6
      GpuSupport
      1024
      0.005401
      17605.60000
      284
      336.000
      373
      46.3573
    
    
      7
      GpuSupport
      2048
      0.001473
      13123.40000
      381
      667.667
      1123
      398.7200
    
    
      8
      GpuSupport
      4096
      0.001002
      5192.11000
      963
      1037.330
      1088
      65.7749
    
    
      9
      GpuSupport
      8192
      0.013229
      83.52540
      59862
      62917.300
      66453
      3321.6500
    
    
      10
      Decompose
      512
      0.084969
      4009.62000
      1247
      7933.330
      21256
      11537.8000
    
    
      11
      Decompose
      1024
      0.087327
      1088.85000
      4592
      5887.670
      7780
      1675.6600
    
    
      12
      Decompose
      2048
      0.054796
      352.70900
      14176
      16838.000
      20755
      3464.4000
    
    
      13
      Decompose
      4096
      0.055190
      94.25430
      53048
      56219.000
      60922
      4154.5400
    
    
      14
      Decompose
      8192
      0.057984
      19.05560
      262390
      269734.000
      275596
      6726.4700
    
    
      15
      OverlapAdd
      512
      0.291496
      1168.77000
      4278
      16098.700
      37689
      18725.9000
    
    
      16
      OverlapAdd
      1024
      0.211642
      449.27700
      11129
      13553.300
      18065
      3910.8500
    
    
      17
      OverlapAdd
      2048
      0.194275
      99.48270
      50260
      51690.000
      52716
      1276.8700
    
    
      18
      OverlapAdd
      4096
      0.258704
      20.10750
      248663
      253193.000
      261774
      7435.4900
    
    
      19
      OverlapAdd
      8192
      0.265302
      4.16477
      1200547
      1305220.000
      1360988
      90717.4000

Plotted results

In the following plot, I visually demonstrate the statistics for Iterations/sec. These are plotted using log scaling so that smaller values can easily be seen.



In [52]:

    
import matplotlib.cm as cm
prob_space = data.groupby('Experiment')
ind = np.arange(len(data.groupby('Problem Space')))
colors = cm.rainbow(np.linspace(0, 1, len(ind)))

fig, ax = plt.subplots()
width = .15; 
offset = 0
rects = []
names = []
for (name, group), c in zip(prob_space, colors): 
    names.append(name)
    rects.append(ax.bar(ind +offset, np.log10(group['Iterations/sec']), width, color=c))
    offset = offset + width
    
ax.set_ylabel('log10(Frames per second)')
ax.set_xlabel('Image size in pixels')
ax.set_title('Comparison of Xcorr Methods')
ax.set_xticks(ind + width)
ax.set_xticklabels(data['Problem Space'].unique())
ax.legend(rects, names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)









    Out[52]:





<matplotlib.legend.Legend at 0x11a3d7cd0>

Results

From the plot above, we can see that the best implementation, even for large images (8k x 8k), is the GPU supported matchTemplate function available in OpenCV. As we would expect, the decompose experiment is always better than the OverlapAdd experiment but is never able to beat out the GpuSupport experiment. Despite all this, all GPU implementations significantly beat ou the CPU only implementation. Experiments were also attempted using 16k x 16k images, but I found that there wasn't sufficient memory on the GPU to handle images of this size. In that regard, overlap and add can easily handle images of arbitrary size.

	Experiment	Problem Space	Baseline	Iterations/sec	Min (us)	Mean (us)	Max (us)	Standard Deviation
0	NoGpuSupport	512	1.000000	340.69200	14676	20277.300	30644	8987.5200
1	NoGpuSupport	1024	1.000000	95.08600	52584	53188.700	54085	791.8660
2	NoGpuSupport	2048	1.000000	19.32700	258706	312162.000	393190	71356.0000
3	NoGpuSupport	4096	1.000000	5.20190	961188	5034320.000	12925656	6835280.0000
4	NoGpuSupport	8192	1.000000	1.10492	4525209	19837100.000	50431764	26495800.0000
5	GpuSupport	512	0.023167	14705.90000	340	925.333	2083	1002.5900
6	GpuSupport	1024	0.005401	17605.60000	284	336.000	373	46.3573
7	GpuSupport	2048	0.001473	13123.40000	381	667.667	1123	398.7200
8	GpuSupport	4096	0.001002	5192.11000	963	1037.330	1088	65.7749
9	GpuSupport	8192	0.013229	83.52540	59862	62917.300	66453	3321.6500
10	Decompose	512	0.084969	4009.62000	1247	7933.330	21256	11537.8000
11	Decompose	1024	0.087327	1088.85000	4592	5887.670	7780	1675.6600
12	Decompose	2048	0.054796	352.70900	14176	16838.000	20755	3464.4000
13	Decompose	4096	0.055190	94.25430	53048	56219.000	60922	4154.5400
14	Decompose	8192	0.057984	19.05560	262390	269734.000	275596	6726.4700
15	OverlapAdd	512	0.291496	1168.77000	4278	16098.700	37689	18725.9000
16	OverlapAdd	1024	0.211642	449.27700	11129	13553.300	18065	3910.8500
17	OverlapAdd	2048	0.194275	99.48270	50260	51690.000	52716	1276.8700
18	OverlapAdd	4096	0.258704	20.10750	248663	253193.000	261774	7435.4900
19	OverlapAdd	8192	0.265302	4.16477	1200547	1305220.000	1360988	90717.4000