If your system is CPU-only (no GPU), then consider starting with the following values:

- num_batch_threads: equal to the number of CPU cores
- max_batch_size: infinity (i.e., MAX_INT)
- batch_timeout_micros: 0

Then experiment with batch_timeout_micros values in the 1-10 millisecond (1000-10000 microsecond) range, while keeping in mind that 0 may be the optimal value.
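Concretely, those starting values map onto server flags as in the sketch below. The model name and path are taken from the example later in this section, num_batch_threads=4 assumes a 4-core machine, and 2147483647 (MAX_INT for a 32-bit int) stands in for infinity:

# CPU-only starting point: one batch thread per core (assuming 4 cores),
# an effectively unbounded batch size, and a zero timeout to begin tuning from.
tensorflow_model_server \
--port=9000 \
--model_name=linear \
--model_base_path=/root/models/linear_fully_optimized/cpu/ \
--enable_batching=true \
--num_batch_threads=4 \
--max_batch_size=2147483647 \
--batch_timeout_micros=0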
If your model uses a GPU device for part or all of its inference work, consider the following values:

- num_batch_threads: the number of CPU cores
- batch_timeout_micros: infinity, temporarily, while you tune max_batch_size to achieve the desired balance between throughput and average latency; consider max_batch_size values in the hundreds or thousands

For online serving, tune batch_timeout_micros to rein in tail latency.
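As a sketch under stated assumptions (a 4-core machine, a hypothetical gpu/ build of the model, and 512 as one point in the max_batch_size sweep), the tuning phase might look like this:

# GPU tuning phase: an effectively infinite timeout while sweeping
# max_batch_size over values in the hundreds or thousands.
# The gpu/ model path is hypothetical.
tensorflow_model_server \
--port=9000 \
--model_name=linear \
--model_base_path=/root/models/linear_fully_optimized/gpu/ \
--enable_batching=true \
--num_batch_threads=4 \
--max_batch_size=512 \
--batch_timeout_micros=100000000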
The idea is that batches normally fill to max_batch_size, but when there is an occasional lapse in incoming requests, it makes sense to process whatever is in the queue, even an underfull batch, rather than introduce a latency spike.
The best value for batch_timeout_micros is typically a few milliseconds, and depends on your context and goals. Zero is a value to consider, as it works well for some workloads. For bulk-processing batch jobs, choose a large value, perhaps a few seconds, to ensure good throughput without waiting too long for the final (and likely underfull) batch.
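For instance, a bulk-processing job might pair the unbounded batch size with a two-second timeout; the exact value is an assumption to be tuned, not a recommendation:

# Bulk job: a large (2-second) timeout so batches normally fill to
# max_batch_size, while the final underfull batch is still flushed
# without waiting indefinitely.
tensorflow_model_server \
--port=9000 \
--model_name=linear \
--model_base_path=/root/models/linear_fully_optimized/cpu/ \
--enable_batching=true \
--num_batch_threads=4 \
--max_batch_size=2147483647 \
--batch_timeout_micros=2000000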
Point the server at the optimized version of the model:
tensorflow_model_server \
--port=9000 \
--model_name=linear \
--model_base_path=/root/models/linear_fully_optimized/cpu/ \
--enable_batching=true \
--num_batch_threads=4 \
--max_batch_size=9999999 \
--batch_timeout_micros=10000
The params are as follows:

- port: the port TensorFlow Serving listens on (int)
- model_name: the name clients use to address the model (anything)
- model_base_path: /path/to/model/ above all versioned sub-directories
- enable_batching: true|false
- num_batch_threads: usually equal to the number of CPU cores or a multiple thereof
- max_batch_size: number of requests per batch; start with infinity, then tune down to find the right balance between latency and throughput
- batch_timeout_micros: maximum time to wait before processing a (possibly underfull) batch
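As a closing sketch, and depending on your tensorflow_model_server build, the same knobs can usually be supplied in one place as an ASCII BatchingParameters protobuf via --batching_parameters_file rather than as individual flags; the values below simply restate the example above:

# Assumes your tensorflow_model_server build supports --batching_parameters_file.
cat > /tmp/batching_parameters.txt <<'EOF'
num_batch_threads { value: 4 }
max_batch_size { value: 9999999 }
batch_timeout_micros { value: 10000 }
EOF

tensorflow_model_server \
--port=9000 \
--model_name=linear \
--model_base_path=/root/models/linear_fully_optimized/cpu/ \
--enable_batching=true \
--batching_parameters_file=/tmp/batching_parameters.txt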