Stephen Bailey, 15 January 2017
This notebook documents the scaling performance of
redrock.zscan.parallel_calc_zchi2_targets(), which fits
target spectra to a redshifted 5-component PCA template to map
the best-fit $\chi^2$ vs. redshift.
zscan is parallelized with Python multiprocessing within a single node.
The processes are data parallel, i.e. they do not communicate with each other.
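As an illustration of what "data parallel" means here, the following is a minimal sketch and not the redrock implementation. The function names `chi2_vs_z` and `scan_targets` and the toy $\chi^2$ model are hypothetical; only the use of `multiprocessing` comes from the description above. Each worker receives whole targets and returns its result without talking to any other worker.

```python
import multiprocessing as mp

import numpy as np


def chi2_vs_z(flux):
    """Toy stand-in for a per-target redshift scan (hypothetical, not redrock code)."""
    redshifts = np.linspace(0.0, 1.0, 50)
    # Fake "model": a constant equal to z; chi2 is the summed squared residual.
    return np.array([np.sum((flux - z) ** 2) for z in redshifts])


def scan_targets(targets, nproc):
    """Scan every target independently using nproc worker processes."""
    with mp.Pool(nproc) as pool:
        # Each worker gets whole targets; workers never exchange data.
        return pool.map(chi2_vs_z, targets)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = [rng.normal(0.5, 0.1, size=100) for _ in range(10)]
    results = scan_targets(targets, nproc=4)
    print(len(results), results[0].shape)
```

Because no state is shared, ideal scaling for this pattern would be linear in the number of processes, which is the "perfect" line drawn in the plots.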
The plot below shows the rate of
target redshifts processed per second vs. number of processes used. For Haswell tests,
$OMP_NUM_THREADS = $MKL_NUM_THREADS = 64 // numprocesses,
while for KNL tests these were set to either 272 // numprocesses (corresponding to the
number of hyperthreads) or 68 // numprocesses (corresponding to the number of physical cores). Compared to real data, these tests used spectra with fewer wavelength samples and tested fewer redshifts so that the tests would run quickly.
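The thread-count rule above can be sketched as follows. This is an assumption-laden illustration rather than the code in rrzscan.py: the helper names `threads_per_process` and `set_thread_env` are hypothetical, and the clamp to at least one thread is inferred from the KNL68 table, where 128 processes are listed with nthread = 1 even though 68 // 128 is 0.

```python
import os


def threads_per_process(nslots, nproc):
    """Integer split of nslots hardware threads (or cores) across nproc processes."""
    # Clamp to 1 (assumption): 68 // 128 is 0, but the KNL68 table shows nthread = 1.
    return max(1, nslots // nproc)


def set_thread_env(nthread):
    """Export the same thread count for OpenMP and MKL."""
    # These variables are typically read when the library initialises,
    # so they should be set before the numerical libraries are loaded.
    os.environ["OMP_NUM_THREADS"] = str(nthread)
    os.environ["MKL_NUM_THREADS"] = str(nthread)


for nproc in (4, 8, 16, 32, 64, 128):
    print(nproc, threads_per_process(272, nproc), threads_per_process(68, nproc))
```

With 272 slots this reproduces the nthread column of the KNL272 table (68, 34, 17, 8, 4, 2), and with 68 slots that of the KNL68 table (17, 8, 4, 2, 1, 1).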
Initial observations:
Code versions used (as reported by "git describe"):
To reproduce these results on NERSC cori:
module load python/3.5-anaconda
git clone https://github.com/sbailey/redrock
cd redrock; git checkout 3addbf9; cd ..
git clone https://github.com/sbailey/knltest
cd knltest; git checkout ec2d7fa; cd ..
export PYTHONPATH=`pwd`/redrock/py:$PYTHONPATH
srun -N 1 -p debug -C haswell -t 00:20:00 python knltest/code/rrzscan.py
srun -N 1 -p debug -C knl,quad,flat -t 00:20:00 python knltest/code/rrzscan.py
Note: I did not explicitly set any other OMP or MKL environment variables, nor did I provide a -c cores option to srun. It is possible that some parameter like that is required for KNL but not Haswell, but I haven't found what that is yet.
In [1]:
%pylab inline
import numpy as np
import io
In [2]:
#- Datasets from running "python rrzscan.py [haswell|knl]"
hswdata = '''\
ncpu nthread time rate
1 64 4.74 1055.49
2 32 3.21 1556.29
4 16 1.49 3351.20
8 8 1.04 4786.74
16 4 0.72 6923.60
32 2 0.48 10412.14
64 1 0.46 10914.74
'''
knldata272 = '''\
ncpu nthread time rate
4 68 42.73 117.01
8 34 40.08 124.74
16 17 38.29 130.59
32 8 42.92 116.49
64 4 38.48 129.93
128 2 45.27 110.45
'''
knldata68 = '''\
ncpu nthread time rate
4 17 34.60 144.51
8 8 35.54 140.68
16 4 34.00 147.08
32 2 38.74 129.06
64 1 34.92 143.20
128 1 39.03 128.09
'''
def parse_data(datastr):
    #- Parse a whitespace-separated table (header row skipped) into a structured array
    dtype = [('ncpu', int), ('nthread', int), ('time', float), ('rate', float)]
    data = np.loadtxt(io.StringIO(datastr), skiprows=1, dtype=dtype)
    return data
hsw = parse_data(hswdata)
knl272 = parse_data(knldata272)
knl68 = parse_data(knldata68)
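One way to put numbers on the scaling in these tables is to compute speedup and parallel efficiency relative to the first row. The sketch below is an addition for illustration; the helper name `scaling_efficiency` is hypothetical, and the Haswell table is repeated so the block runs on its own.

```python
import io

import numpy as np

# Haswell table copied from the cell above.
table = '''\
ncpu nthread time rate
1 64 4.74 1055.49
2 32 3.21 1556.29
4 16 1.49 3351.20
8 8 1.04 4786.74
16 4 0.72 6923.60
32 2 0.48 10412.14
64 1 0.46 10914.74
'''


def scaling_efficiency(ncpu, rate):
    """Return (speedup, efficiency) relative to the first row of the table."""
    ncpu = np.asarray(ncpu, dtype=float)
    rate = np.asarray(rate, dtype=float)
    speedup = rate / rate[0]
    # Perfect scaling would give speedup == ncpu / ncpu[0].
    efficiency = speedup / (ncpu / ncpu[0])
    return speedup, efficiency


dtype = [('ncpu', int), ('nthread', int), ('time', float), ('rate', float)]
data = np.loadtxt(io.StringIO(table), skiprows=1, dtype=dtype)
speedup, efficiency = scaling_efficiency(data['ncpu'], data['rate'])
for n, s, e in zip(data['ncpu'], speedup, efficiency):
    print(f"{n:4d} {s:6.2f} {e:5.2f}")
```

For the Haswell table the speedup at 64 processes is roughly a factor of 10 over one process. The same function can be applied to the KNL tables, keeping in mind that their first row is 4 processes rather than 1.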
In [3]:
figure(figsize=(6,6))
plot(hsw['ncpu'], hsw['rate'], 'bs-', label='Haswell64')
plot(knl272['ncpu'], knl272['rate'], 'rs-', label='KNL272')
plot(knl68['ncpu'], knl68['rate'], 'ks-', label='KNL68')
plot(hsw['ncpu'], hsw['ncpu']*hsw['rate'][0], 'g-', label='perfect')
loglog()
xticks([1,2,4,8,16,32,64,128], [1,2,4,8,16,32,64,128])
xlim(0.7, 128/0.7)
ylim(50, 2e5)
xlabel('number of cores used')
ylabel('target redshifts per second')
title('Redrock zscan 10 targets x 500 redshifts')
legend(loc='upper left')
savefig('rrzscan.png')
In [4]:
np.max(hsw['rate']) / np.max(knl68['rate'])
Out[4]:
In [10]:
hsw1data = """\
ncpu nthread time rate
1 1 4.55 1098.74
2 1 2.27 2202.55
4 1 1.20 4173.72
8 1 1.04 4814.61
16 1 0.64 7849.45
32 1 0.45 11176.03
64 1 0.45 11002.66
"""
knl1data = """\
ncpu nthread time rate
1 1 26.83 186.33
4 1 30.39 164.54
8 1 31.03 161.13
16 1 31.62 158.14
32 1 32.61 153.32
64 1 34.84 143.50
128 1 38.69 129.25
"""
hsw1 = parse_data(hsw1data)
knl1 = parse_data(knl1data)
figure(figsize=(6,6))
plot(hsw1['ncpu'], hsw1['rate'], 'bs-', label='Haswell')
plot(hsw['ncpu'], hsw['rate'], 'b.-', label='_none_', alpha=0.5)
plot(knl1['ncpu'], knl1['rate'], 'rs-', label='KNL')
plot(knl68['ncpu'], knl68['rate'], 'r.-', label='_none_', alpha=0.5)
plot(hsw1['ncpu'], hsw1['ncpu']*hsw1['rate'][0], 'g-', label='perfect')
loglog()
xticks([1,2,4,8,16,32,64,128], [1,2,4,8,16,32,64,128])
xlim(0.7, 128/0.7)
ylim(50, 2e5)
xlabel('number of cores used')
ylabel('target redshifts per second')
title('Redrock zscan 10 targets x 500 redshifts')
legend(loc='upper left')
Out[10]:
Full logs cut-and-paste from each run
<cori code> python debug.py cori
len(redshifts) = 500
len(template.redshifts) = 20
WARNING: Using 50 cores for 50 redshifts
len(redshifts) = 500
ncpu nthread time rate
1 64 4.74 1055.49
2 32 3.21 1556.29
4 16 1.49 3351.20
8 8 1.04 4786.74
16 4 0.72 6923.60
32 2 0.48 10412.14
64 1 0.46 10914.74
<cori code> python debug.py knlfast
len(redshifts) = 250
len(template.redshifts) = 20
WARNING: Using 25 cores for 25 redshifts
len(redshifts) = 250
ncpu nthread time rate
4 68 10.75 116.29
8 34 10.62 117.71
16 17 10.54 118.60
32 8 11.61 107.70
64 4 12.06 103.67
128 2 15.89 78.66
#- with nthread = 272 // ncpu
<cori code> python debug.py knl
len(redshifts) = 500
len(template.redshifts) = 20
WARNING: Using 50 cores for 50 redshifts
len(redshifts) = 500
ncpu nthread time rate
4 68 42.73 117.01
8 34 40.08 124.74
16 17 38.29 130.59
32 8 42.92 116.49
64 4 38.48 129.93
128 2 45.27 110.45
#- second run with nthread = 272 // ncpu later on
len(redshifts) = 500
len(template.redshifts) = 20
WARNING: Using 50 cores for 50 redshifts
len(redshifts) = 500
ncpu nthread time rate
4 68 42.56 117.49
8 34 39.34 127.11
16 17 37.70 132.64
32 8 42.42 117.88
64 4 37.10 134.76
128 2 44.61 112.09
#- with nthread = 136 // ncpu
<cori code> python debug.py knl
len(redshifts) = 500
len(template.redshifts) = 20
WARNING: Using 50 cores for 50 redshifts
len(redshifts) = 500
ncpu nthread time rate
4 34 35.76 139.81
8 17 36.71 136.20
16 8 38.89 128.56
32 4 35.89 139.30
64 2 39.45 126.74
128 1 38.88 128.60
#- with nthread = 68 // ncpu
<cori code> python debug.py knl
len(redshifts) = 500
len(template.redshifts) = 20
WARNING: Using 50 cores for 50 redshifts
len(redshifts) = 500
ncpu nthread time rate
4 17 34.60 144.51
8 8 35.54 140.68
16 4 34.00 147.08
32 2 38.74 129.06
64 1 34.92 143.20
128 1 39.03 128.09
[cori04 temp] srun -N 1 -p debug -C haswell --cpu_bind=cores -t 00:10:00 python knltest/code/rrzscan.py
srun: job 3614909 queued and waiting for resources
srun: job 3614909 has been allocated resources
len(redshifts) = 500
len(targets) = 10
ncpu nthread time rate
0 1 6.09 821.52
0 32 6.09 820.93
0 64 6.09 820.84
1 64 4.86 1028.44
2 32 3.74 1336.07
4 16 4.09 1221.34
8 8 3.58 1394.77
16 4 3.91 1279.03
32 2 8.34 599.68
64 1 4.04 1238.11
[cori04 temp] srun -N 1 -p debug -C haswell -t 00:10:00 python knltest/code/rrzscan.py
srun: job 3614945 queued and waiting for resources
srun: job 3614945 has been allocated resources
len(redshifts) = 500
len(targets) = 10
ncpu nthread time rate
0 1 4.74 1054.48
0 32 4.73 1057.31
0 64 4.72 1058.22
1 64 4.85 1031.84
2 32 2.57 1947.81
4 16 2.30 2175.37
8 8 1.04 4825.79
16 4 0.75 6678.03
32 2 0.49 10204.57
64 1 0.48 10386.34
Doesn't make much of a difference
[cori04 temp] srun -N 1 -p regular -C knl,quad,flat -t 00:30:00 python knltest/code/rrzscan.py
srun: job 3615063 queued and waiting for resources
srun: job 3615063 has been allocated resources
len(redshifts) = 500
len(targets) = 10
ncpu nthread time rate
0 1 31.56 158.42
0 136 31.65 157.99
0 272 31.60 158.21
1 1 26.83 186.33
4 1 30.39 164.54
8 1 31.03 161.13
16 1 31.62 158.14
32 1 32.61 153.32
64 1 34.84 143.50
128 1 38.69 129.25
[cori04 temp] srun -N 1 -p regular -C knl,quad,cache -t 00:30:00 python knltest/code/rrzscan.py
srun: job 3615257 queued and waiting for resources
srun: job 3615257 has been allocated resources
len(redshifts) = 500
len(targets) = 10
ncpu nthread time rate
0 1 33.86 147.68
0 136 33.88 147.58
0 272 33.81 147.90
1 1 27.94 178.96
4 1 31.82 157.14
8 1 32.57 153.52
16 1 33.45 149.48
32 1 34.62 144.41
64 1 37.12 134.68
128 1 41.66 120.03