Tune TensorFlow Serving

Guidelines

CPU-only

If your system is CPU-only (no GPU), then consider the following values:

  • num_batch_threads set to the number of CPU cores
  • max_batch_size set to infinity (i.e. MAX_INT)
  • batch_timeout_micros set to 0

Then experiment with batch_timeout_micros values in the 1-10 millisecond (1000-10000 microsecond) range, while keeping in mind that 0 may be the optimal value.
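
For example, a CPU-only starting point that follows these guidelines might look like the sketch below. It mirrors the server invocation used later in this section; the thread count of 4 assumes a 4-core machine, and the model path is the one used in this workshop.

# Sketch: CPU-only starting point (one batch thread per core, effectively
# unbounded batch size, zero batch timeout to tune upward from)
tensorflow_model_server \
  --port=9000 \
  --model_name=linear \
  --model_base_path=/root/models/linear_fully_optimized/cpu/ \
  --enable_batching=true \
  --num_batch_threads=4 \
  --max_batch_size=9999999 \
  --batch_timeout_micros=0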

GPU

If your model uses a GPU device for part or all of its inference work, consider the following values:

  • num_batch_threads set to the number of CPU cores.
  • batch_timeout_micros set to infinity while tuning max_batch_size to achieve the desired balance between throughput and average latency. Consider values in the hundreds or thousands.

For online serving, tune batch_timeout_micros to rein in tail latency.

The idea is that batches normally fill up to max_batch_size, but when there is an occasional lapse in incoming requests, it makes sense to process whatever is already in the queue, even as an underfull batch, rather than introduce a latency spike by waiting for the batch to fill.

The best value for batch_timeout_micros is typically a few milliseconds, and depends on your context and goals.

Zero is worth considering, as it works well for some workloads. For bulk-processing batch jobs, choose a large value, perhaps a few seconds, to ensure good throughput without waiting too long for the final (and likely underfull) batch.
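
As an illustration only, a GPU-oriented configuration for online serving might cap the batch size in the hundreds and use a batch window of a few milliseconds, as in the sketch below. The specific numbers and the /gpu/ model path are assumptions for illustration, not recommendations for any particular model.

# Sketch: GPU-oriented online serving (batch size capped in the hundreds,
# batch window of a few milliseconds to rein in tail latency)
tensorflow_model_server \
  --port=9000 \
  --model_name=linear \
  --model_base_path=/root/models/linear_fully_optimized/gpu/ \
  --enable_batching=true \
  --num_batch_threads=4 \
  --max_batch_size=512 \
  --batch_timeout_micros=5000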

Close TensorFlow Serving and Load Test Terminals

Open a Terminal through Jupyter Notebook

(Menu Bar -> Terminal -> New Terminal)

Enable Request Batching

Start TensorFlow Serving in a Separate Terminal

Point to the optimized version of the model.

tensorflow_model_server \
  --port=9000 \
  --model_name=linear \
  --model_base_path=/root/models/linear_fully_optimized/cpu/ \
  --enable_batching=true \
  --num_batch_threads=4 \
  --max_batch_size=9999999 \
  --batch_timeout_micros=10000
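
Before load testing, you can confirm that the server started and is listening on the configured port. One way, assuming the ss utility is available in the container, is:

# Check that something is listening on the gRPC port configured above (9000)
ss -lnt | grep 9000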

The params are as follows:

  • port for TensorFlow Serving (int)
  • model_name (any name; clients use it to address the model in their requests)
  • model_base_path (/path/to/model/ above all versioned sub-directories)
  • enable_batching (true|false)
  • num_batch_threads (usually equal to the number of CPU cores or a multiple thereof)
  • max_batch_size (# of requests - start with infinity, tune down to find the right balance between latency and throughput)
  • batch_timeout_micros (maximum time to wait before processing an underfull batch)

Start Load Test in the Terminal

loadtest high

Notice the throughput and avg/min/max latencies:

summary ... =  301.1/s Avg:   227 Min:     3 Max:   456 Err:     0 (0.00%)

Modify Request Batching Parameters, Repeat Load Test

Gain intuition for how changing the request batching parameters affects throughput and latency.
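
For example, you might restart the server with a bounded batch size and a tighter batch window, as in the sketch below (the specific numbers are illustrative assumptions, not recommendations), then run loadtest high again and compare the new summary line against the earlier one.

# Sketch: bounded batch size with a 1 ms batch window
tensorflow_model_server \
  --port=9000 \
  --model_name=linear \
  --model_base_path=/root/models/linear_fully_optimized/cpu/ \
  --enable_batching=true \
  --num_batch_threads=4 \
  --max_batch_size=1000 \
  --batch_timeout_micros=1000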