Performance Benchmarking for KV Drive

The goal of this set of experiments is to characterize performance variability across platforms, relative to the KV drive, in a systematic and consistent way. The steps are as follows:

Run Stress-ng benchmarks on one KV drive;
Run Stress-ng benchmarks on machine issdm-6, and get the "without limit" result;
Find all the benchmarks common to both results;
Calculate the speedup (normalized value) of each benchmark relative to the KV drive result (issdm-6 (without limit) / KV drive);
Use torpor to calculate the best CPU quota by minimizing the average speedup. We later use this parameter to limit CPU usage in the Docker container;
Run Stress-ng benchmarks in the constrained Docker container on machine issdm-6, and get the "with limit" result;
Calculate the speedup relative to the KV drive again (issdm-6 (with limit) / KV drive), which gives a new "speedup range" that should be much smaller than the previous one;
Run a set of other benchmarks on both the KV drive and the constrained Docker container to verify that they all fall within the latter "speedup range";
Draw a conclusion.



In [1]:

    
%matplotlib inline
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

pd.set_option("display.max_rows", 8)

First, we load all test data.



In [2]:

    
df = pd.read_csv('stress-ng/second/results/combo/1/alltests.csv')

Let's have a look at the structure of the data.



In [3]:

    
df.head()









    Out[3]:






  
    
      
   machine limits               benchmark class  lower_is_better  repetition        result
0  issdm-6   with  stressng-cpu-ackermann   cpu            False           1      0.738698
1  issdm-6   with     stressng-cpu-bitops   cpu            False           1    128.877630
2  issdm-6   with   stressng-cpu-callfunc   cpu            False           1  17637.443025
3  issdm-6   with    stressng-cpu-cdouble   cpu            False           1    225.798848
4  issdm-6   with     stressng-cpu-cfloat   cpu            False           1    160.300367

Show all the test machines.



In [4]:

    
df['machine'].unique()









    Out[4]:





array(['issdm-6', 'kv3'], dtype=object)

Define a predicate for machine issdm-6



In [5]:

    
machine_is_issdm_6 = df['machine'] == 'issdm-6'

The number of benchmarks we ran on issdm-6 with limit is



In [6]:

    
limits_is_with = df['limits'] == 'with'
df_issdm_6_with_limit = df[machine_is_issdm_6 & limits_is_with]
len(df_issdm_6_with_limit)









    Out[6]:





122

The number of benchmarks we ran on issdm-6 without limit is



In [7]:

    
limits_is_without = df['limits'] == 'without'
len(df[machine_is_issdm_6 & limits_is_without])









    Out[7]:





124

The number of benchmarks we ran on kv3



In [8]:

    
df_kv3 = df[df['machine'] == 'kv3']
len(df_kv3)









    Out[8]:





129

Some benchmarks can fail while the test suite is running, and failed tests do not appear in the result report. We therefore want to know how many tests both machines completed.



In [9]:

    
df_common = pd.merge(df_issdm_6_with_limit, df_kv3, how='inner', on='benchmark')
len(df_common)









    Out[9]:





113
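As a minimal sketch of why an inner merge keeps only the benchmarks that both machines completed, consider two toy frames. The benchmark names (bench-a, bench-b, bench-c) and values are hypothetical, and the `suffixes` argument is an addition for readability; the notebook itself relies on the pandas defaults.

```python
import pandas as pd

# Hypothetical toy data: bench-b is missing on the first host, bench-c on the second.
toy_issdm = pd.DataFrame({
    "benchmark": ["bench-a", "bench-c"],
    "result": [10.0, 30.0],
})
toy_kv3 = pd.DataFrame({
    "benchmark": ["bench-a", "bench-b"],
    "result": [5.0, 20.0],
})

# An inner merge keeps only the benchmarks that appear in both frames.
toy_common = pd.merge(
    toy_issdm, toy_kv3, how="inner", on="benchmark",
    suffixes=("_issdm", "_kv3"),
)
print(toy_common)
```

Note that the row count of such a merge equals the number of common benchmarks only when each benchmark appears once per frame.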

Read the normalized results.



In [10]:

    
df = pd.read_csv('stress-ng/second/results/combo/1/alltests_with_normalized_results_1.1.csv')

Show the first few rows. The normalized value is the speedup relative to kv3 (base_result). It is negative when the benchmark runs slower on issdm-6 than on kv3 (a slowdown), in which case its magnitude is base_result / result.



In [11]:

    
df.head()









    Out[11]:






  
    
      
                benchmark   base_result  machine   limits class  lower_is_better  repetition        result  normalized
0  stressng-cpu-ackermann      1.344893  issdm-6     with   cpu            False           1      0.738698   -1.820626
1  stressng-cpu-ackermann      1.344893  issdm-6  without   cpu            False           3      7.413836    5.512584
2     stressng-cpu-bitops    540.245816  issdm-6     with   cpu            False           1    128.877630   -4.191929
3     stressng-cpu-bitops    540.245816  issdm-6  without   cpu            False           3   1291.897690    2.391315
4   stressng-cpu-callfunc  28821.929834  issdm-6     with   cpu            False           1  17637.443025   -1.634133
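The sign convention of the normalized column can be sketched as a small function. This is inferred from the rows shown in Out[11], not taken from the script that produced the CSV; the name `signed_speedup` is hypothetical, and the sketch ignores the lower_is_better flag because the notebook does not describe how that flag is handled.

```python
def signed_speedup(result, base_result):
    """Signed ratio of a result against its kv3 baseline.

    Returns result / base_result when the result is at least the baseline,
    otherwise -(base_result / result). Assumption: inferred from the rows
    displayed in the notebook; lower_is_better is not taken into account.
    """
    if result >= base_result:
        return result / base_result
    return -(base_result / result)


# Values taken from the two stressng-cpu-ackermann rows.
with_limit = signed_speedup(0.738698, 1.344893)
without_limit = signed_speedup(7.413836, 1.344893)
print(round(with_limit, 4), round(without_limit, 4))
```

Under this convention the magnitude is never below 1, which is why no normalized value can fall inside (-1, 1).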

One benchmark is not present in both the with-limit and the without-limit result sets, so the number of rows is odd.



In [12]:

    
len(df) / 2









    Out[12]:





113.5

Since the number of common benchmarks is 113, we want to find the one benchmark that has fewer than two results (all results here are from issdm-6).



In [13]:

    
grouped = df.groupby('benchmark')
df[grouped['benchmark'].transform(len) < 2]









    Out[13]:






  
    
      
                    benchmark  base_result  machine   limits   class  lower_is_better  repetition    result  normalized
172  stressng-memory-oom-pipe     0.199394  issdm-6  without  memory            False           3  0.088345   -2.256992

In other words, stressng-memory-oom-pipe should not appear in the with-limit results of issdm-6.



In [14]:

    
df_issdm_6_with_limit[df_issdm_6_with_limit['benchmark'] == 'stressng-memory-oom-pipe'].empty









    Out[14]:





True

We can count how many benchmarks show a speedup and how many show a slowdown in the without-limit results.



In [15]:

    
predicate_without_limits = df['limits'] == 'without'
predicate = predicate_without_limits & (df['normalized'] >= 0)
len(df[predicate])









    Out[15]:





109



In [16]:

    
predicate = predicate_without_limits & (df['normalized'] < 0)
len(df[predicate])









    Out[16]:





5

All right, let's draw a bar plot for all results.



In [17]:

    
sns.set()
sns.set_context("poster")
plt.xticks(rotation=90)
sns.barplot(x='benchmark', y='normalized', hue='limits', data=df)









    Out[17]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f5ea4516080>

Which benchmarks have the greatest and the smallest speedups in the without-limit results?



In [18]:

    
df_without_sorted = df[df['limits'] == 'without'].sort_values(by='normalized', ascending=False)
head_without = df_without_sorted.head()
tail_without = df_without_sorted.tail()
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([head_without, tail_without])









    Out[18]:






  
    
      
                      benchmark    base_result  machine   limits   class  lower_is_better  repetition        result   normalized
79     stressng-cpu-int64double       2.651071  issdm-6  without     cpu            False           3   3407.650197  1285.386245
85     stressng-cpu-int32double       2.672973  issdm-6  without     cpu            False           3   3411.702766  1276.370082
129           stressng-cpu-sqrt       5.177912  issdm-6  without     cpu            False           3   3842.276549   742.051342
41           stressng-cpu-gamma       0.197453  issdm-6  without     cpu            False           3    122.824428   622.043869
...                         ...            ...      ...      ...     ...              ...         ...           ...          ...
178   stressng-memory-stackmmap       0.199517  issdm-6  without  memory            False           3      0.099770    -1.999769
204  stressng-string-strcasecmp   13501.029625  issdm-6  without  string            False           3   6366.731877    -2.120559
172    stressng-memory-oom-pipe       0.199394  issdm-6  without  memory            False           3      0.088345    -2.256992
89          stressng-cpu-jenkin  471476.431807  issdm-6  without     cpu            False           3  17104.186860   -27.564972

10 rows × 9 columns

Let's have a look at the frequency distribution of speedups in the without-limit results.



In [19]:

    
ax = df[df['limits'] == 'without'].groupby('limits').normalized.hist(bins=100,xrot=90,figsize=(20,10),alpha=0.5)
plt.xlabel('Speedup (re-execution / original)')
plt.ylabel('Frequency (# of benchmarks)')









    Out[19]:





<matplotlib.text.Text at 0x7f5ea1ed0cf8>

Which benchmarks have the greatest and the smallest speedups in the with-limit results?



In [20]:

    
df_with_sorted = df[df['limits'] == 'with'].sort_values(by='normalized', ascending=False)
head_with = df_with_sorted.head()
tail_with = df_with_sorted.tail()
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([head_with, tail_with])









    Out[20]:






  
    
      
                        benchmark    base_result  machine limits   class  lower_is_better  repetition       result   normalized
78       stressng-cpu-int64double       2.651071  issdm-6   with     cpu            False           1   338.585053   127.716328
84       stressng-cpu-int32double       2.672973  issdm-6   with     cpu            False           1   340.177617   127.265639
128             stressng-cpu-sqrt       5.177912  issdm-6   with     cpu            False           1   386.598691    74.663048
40             stressng-cpu-gamma       0.197453  issdm-6   with     cpu            False           1    12.284550    62.215059
...                           ...            ...      ...    ...     ...              ...         ...          ...          ...
80   stressng-cpu-int64longdouble     371.986874  issdm-6   with     cpu            False           1    46.107568    -8.067805
217   stressng-string-strncasecmp   11049.836636  issdm-6   with  string            False           1   621.401437   -17.782123
203    stressng-string-strcasecmp   13501.029625  issdm-6   with  string            False           1   629.827637   -21.436070
88            stressng-cpu-jenkin  471476.431807  issdm-6   with     cpu            False           1  1706.527391  -276.278268

10 rows × 9 columns

The average speedup of with limit benchmarks is,



In [21]:

    
df[df['limits'] == 'with']['normalized'].mean()









    Out[21]:





4.8308031516988885

Let's have a look at the frequency distribution of speedups in the with-limit results.



In [22]:

    
ax = df[df['limits'] == 'with'].groupby('limits').normalized.hist(bins=100,xrot=90,figsize=(20,10),alpha=0.5)
plt.xlabel('Speedup (re-execution / original)')
plt.ylabel('Frequency (# of benchmarks)')









    Out[22]:





<matplotlib.text.Text at 0x7f5ea1ab4b70>

The stressng-cpu-jenkin benchmark exercises a Jenkins hash function; the Jenkins hash functions are a collection of (non-cryptographic) hash functions for multi-byte keys. See the Wikipedia article on the Jenkins hash function for more details.

Using the parameters --cpuset-cpus=1 --cpu-quota=1000 --cpu-period=10000, we obtained a speedup range of -276.278268 to 127.716328. These parameters mean that the Docker container gets only 1 ms of CPU run-time in every 10 ms period, on CPU 1 (see cpu for more details).
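The arithmetic behind these parameters can be sketched as follows. Docker expresses --cpu-quota and --cpu-period in microseconds; the variable names are illustrative only.

```python
# Docker expresses --cpu-quota and --cpu-period in microseconds.
cpu_quota_us = 1000
cpu_period_us = 10000

# Share of one CPU the container may use, and the same values in milliseconds.
cpu_fraction = cpu_quota_us / cpu_period_us
runtime_ms_per_period = cpu_quota_us / 1000
period_ms = cpu_period_us / 1000

print(f"{runtime_ms_per_period:g} ms of CPU every {period_ms:g} ms "
      f"= {cpu_fraction:.0%} of one CPU")
# prints: 1 ms of CPU every 10 ms = 10% of one CPU
```

This appears consistent with the results above: the largest speedup drops from 1285.386245 without the limit to 127.716328 with it, roughly a factor of ten.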

Now we use 9 other benchmark programs to verify this result. These programs are,

blogbench: A filesystem benchmark.
compilebench: It tries to age a filesystem by simulating some of the disk IO common in creating, compiling, patching, stating and reading kernel trees.
fhourstones: This integer benchmark solves positions in the game of Connect-4.
himeno: The Himeno benchmark score depends on the performance of a computer, especially its memory bandwidth. The program measures the time taken by the major loops of a Poisson equation solver that uses the Jacobi iteration method.
interbench: It is designed to measure the effect of changes in Linux kernel design or in system configuration, such as CPU, I/O scheduler and filesystem changes and options.
nbench: NBench is a synthetic computing benchmark program developed in the mid-1990s by the now defunct BYTE magazine, intended to measure a computer's CPU, FPU, and memory system speed.
pybench: It is a collection of tests that provides a standardized way to measure the performance of Python implementations.
ramsmp: RAMspeed is a free open source command line utility to measure cache and memory performance of computer systems.
stockfish-7: A simple benchmark in which Stockfish analyzes a set of positions, each for a given limit.

Read the verification test data.



In [23]:

    
df = pd.read_csv('verification/results/1/alltests_with_normalized_results_1.0.csv')

Show number of test benchmarks.



In [24]:

    
len(df)









    Out[24]:





93

Let's see the speedup of each individual result.



In [25]:

    
sns.set()
sns.set_context("poster")
plt.xticks(rotation=90)
sns.barplot(x='benchmark', y='normalized', data=df)









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f5ea1ba7358>

Sort the test results by the absolute value of normalized and show the eight largest.



In [26]:

    
df.reindex(df.normalized.abs().sort_values(ascending=False).index).head(8)









    Out[26]:






  
    
      
                   benchmark  base_result  lower_is_better    result  normalized
29                 nbench_fp        0.165            False    21.959  133.084848
1            blogbench_reads   271351.000            False  9113.000  -29.776254
18    interbench_audio_write      275.000             True    15.000  -18.333333
26  interbench_video_compile      165.000             True    11.000  -15.000000
7     compilebench_read_tree        0.840            False    10.950   13.035714
24    interbench_video_write      116.000             True    10.000  -11.600000
0           blogbench_writes      464.000            False    55.000   -8.436364
88               ramsmp_copy     1289.030            False   245.050   -5.260273

Here is the histogram of all speedups after filtering out the single outlier.



In [27]:

    
df_t = df[df['benchmark'] != 'nbench_fp']
ax = df_t.normalized.hist(bins=100,xrot=90,figsize=(20,10),alpha=0.5)
plt.xlabel('Speedup (re-execution / original)')
plt.ylabel('Frequency (# of benchmarks)')









    Out[27]:





<matplotlib.text.Text at 0x7f5ea0353208>

The average speedup of the test benchmarks, excluding the one outlier, is,



In [28]:

    
df_t['normalized'].mean()









    Out[28]:





0.37300988620902487

Conclusion: Except for nbench_fp, all 92 other benchmarks fall within our predicted speedup range [-276.278268, 127.716328], and most of them (86) lie in [-6, 4], an interval with an effective width of 8 (= 10 - 2, because no speedup value can lie in (-1, 1)).
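A range check of this kind can be sketched on the eight largest-magnitude values shown in Out[26]. This is only an illustration on that subset, not a recomputation over all 93 rows; the variable names are hypothetical.

```python
import pandas as pd

# The eight largest-magnitude normalized values from the verification results.
normalized = pd.Series(
    [133.084848, -29.776254, -18.333333, -15.000000,
     13.035714, -11.600000, -8.436364, -5.260273],
    index=["nbench_fp", "blogbench_reads", "interbench_audio_write",
           "interbench_video_compile", "compilebench_read_tree",
           "interbench_video_write", "blogbench_writes", "ramsmp_copy"],
)

# Predicted range taken from the with-limit stress-ng run.
lower, upper = -276.278268, 127.716328

# Series.between is inclusive on both ends by default.
in_range = normalized.between(lower, upper)
outliers = list(normalized[~in_range].index)
print(int(in_range.sum()), "of", len(normalized), "within range; outliers:", outliers)
# prints: 7 of 8 within range; outliers: ['nbench_fp']
```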

Question: Is this an acceptable emulation environment for the KV drive?
