In [ ]:
# Copyright 2019 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

Jasper Inference For TensorRT 6

This Jupyter notebook provides scripts to perform high-performance inference using NVIDIA TensorRT. Jasper is a neural acoustic model for speech recognition whose network architecture is designed to facilitate fast GPU inference. NVIDIA TensorRT is a platform for high-performance deep learning inference; it includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for deep learning inference applications. After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch.

1. Overview

The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of ASR systems in deployment. The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment. This post-processing step is called decoding.

The original paper is Jasper: An End-to-End Convolutional Neural Acoustic Model https://arxiv.org/pdf/1904.03288.pdf.

1.1 Model architecture

By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks. Each sub-block applies the following operations in sequence: 1D convolution, batch normalization, ReLU activation, and dropout. In the original paper Jasper is trained with masked convolutions, which mask out the padded part of an input sequence in a batch before the 1D convolution. For inference, masking is not used: on the test and development datasets the mask operation does not improve accuracy over the unmasked model, while the unmasked model achieves better inference performance, especially after TensorRT optimization. More information on the model architecture can be found in the repository root folder.
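As a rough sketch (not the repository's implementation), one such sub-block can be expressed in PyTorch as follows; the channel counts, kernel size, and dropout rate are illustrative placeholders, and the cell assumes PyTorch is available (for example, inside the container):

In [ ]:
# Hedged sketch of one Jasper sub-block (Conv1d -> BatchNorm -> ReLU -> Dropout).
# This mirrors the description above; it is NOT the repository's implementation,
# and all hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn

class JasperSubBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)  # "same"-style padding for odd kernels
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, channels, time)
        return self.dropout(self.relu(self.bn(self.conv(x))))

# Example: 64 feature channels in, 256 channels out, kernel size 11
block = JasperSubBlock(64, 256, 11)
features = torch.randn(1, 64, 300)   # dummy (batch, features, frames) input
print(block(features).shape)          # torch.Size([1, 256, 300])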

1.2 TensorRT Inference pipeline

The Jasper inference pipeline consists of three components: data preprocessor, acoustic model, and greedy decoder. The acoustic model is the most compute-intensive component, accounting for more than 90% of the end-to-end pipeline's runtime. It is also the only component with learnable parameters and the part that differentiates Jasper from other approaches, so we focus on the acoustic model. In the non-TensorRT Jasper inference pipeline, all three components are implemented and run with native PyTorch. In the TensorRT inference pipeline, we show the speedup of running the acoustic model with TensorRT, while preprocessing and decoding are reused from the native PyTorch pipeline. To run a model with TensorRT, we first construct the model in PyTorch and export it to an ONNX file. Finally, a TensorRT engine is constructed from the ONNX file, serialized to a TRT plan file, and launched to run inference. Note that the TensorRT engine is runtime-optimized before serialization: TensorRT tries a vast set of options to find the strategy that performs best on the user's GPU, so building the engine takes a few minutes. Once the TRT plan file has been created, it can be reused.
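The repository's trt/perf.py script handles this export and engine construction for Jasper; the cell below is only a minimal, hedged sketch of the same PyTorch → ONNX → TensorRT flow on a toy model, to make the steps concrete. The model, file names, and workspace size are placeholders, and the cell assumes PyTorch and the TensorRT Python bindings are available (for example, inside the container).

In [ ]:
# Minimal, hedged sketch of the PyTorch -> ONNX -> TensorRT flow described above.
# trt/perf.py implements the real pipeline for Jasper; the tiny model, shapes,
# and file names below are placeholders for illustration only.
import torch
import tensorrt as trt

# 1. Export a (placeholder) PyTorch model to ONNX.
model = torch.nn.Conv1d(64, 256, 11, padding=5).eval()
dummy = torch.randn(1, 64, 3600)                      # (batch, features, frames)
torch.onnx.export(model, dummy, "model.onnx", opset_version=10)

# 2. Parse the ONNX file and build a TensorRT engine (this is where the
#    runtime optimization mentioned above happens, so it can take a while).
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network(EXPLICIT_BATCH) as network, \
     trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open("model.onnx", "rb") as f:
        parser.parse(f.read())
    builder.max_workspace_size = 1 << 30              # 1 GiB of build scratch space
    engine = builder.build_cuda_engine(network)

    # 3. Serialize the engine to a plan file that can be reused later.
    with open("model.plan", "wb") as f:
        f.write(engine.serialize())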

1.3 Learning objectives

This notebook demonstrates:

  • Speeding up Jasper inference with TensorRT
  • Downloading and using fine-tuned NVIDIA Jasper models
  • Using mixed precision for inference

2. Requirements

Please refer to README.md.

3. Jasper Inference

3.1 Start a detached session in the NGC container


In [ ]:
!nvidia-docker run -it -d --rm --name "JasperTRT" \
  --runtime=nvidia \
  --shm-size=4g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v $PWD/data:/datasets \
  -v $PWD/checkpoint:/checkpoints/ \
  -v $PWD/result:/results/ \
  -v $PWD:/workspace/jasper/ \
  jasper:trt6 bash

You can also restrict the container to a single GPU or to multiple specific GPUs by adding "NV_GPU" before the "nvidia-docker run" command. For example, to run the container on GPU ID 2 only, add "NV_GPU=2" before the "nvidia-docker run" command, as shown in the example below. You can use the "nvidia-smi" command to check your GPU IDs and utilization.
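For example, to start the same container pinned to GPU ID 2 only (run this instead of the cell above, since both commands use the same container name):

In [ ]:
# Hedged example: same command as above, restricted to GPU ID 2
# (check available GPU IDs and utilization with nvidia-smi first).
!NV_GPU=2 nvidia-docker run -it -d --rm --name "JasperTRT" \
  --runtime=nvidia --shm-size=4g --ulimit memlock=-1 --ulimit stack=67108864 \
  -v $PWD/data:/datasets -v $PWD/checkpoint:/checkpoints/ \
  -v $PWD/result:/results/ -v $PWD:/workspace/jasper/ \
  jasper:trt6 bash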


In [ ]:
#check the container that you just started
!docker ps -a

3.2 Download and preprocess the dataset.

You do not need to download the dataset if you go directly to Section 5 to play with the audio examples.

If LibriSpeech (http://www.openslr.org/12) has already been downloaded and preprocessed, no further steps in this subsection need to be taken. If LibriSpeech has not been downloaded yet, note that only a subset of LibriSpeech is typically used for inference (the dev-* and test-* splits). LibriSpeech contains 1000 hours of 16 kHz read English speech derived from public domain audiobooks from the LibriVox project and has been carefully segmented and aligned. For more information, see the paper LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS. To acquire the inference subset of LibriSpeech, run the following (this does not require a GPU):


In [ ]:
!nvidia-docker exec -it JasperTRT bash trt/scripts/download_inference_librispeech.sh

Once the data download is complete, the following folders should exist:

  • /datasets/LibriSpeech/
    • dev-clean/
    • dev-other/
    • test-clean/
    • test-other/

Since /datasets/ in the container is mounted from $PWD/data on the host, once the dataset is downloaded it is also accessible from outside of the container at $PWD/data/LibriSpeech.

Next, the data can be preprocessed with the following command:


In [ ]:
!nvidia-docker exec -it JasperTRT bash trt/scripts/preprocess_inference_librispeech.sh

Once the data is preprocessed, the following additional files should now exist:

  • /datasets/LibriSpeech/
    • librispeech-dev-clean-wav.json
    • librispeech-dev-other-wav.json
    • librispeech-test-clean-wav.json
    • librispeech-test-other-wav.json
    • dev-clean/
    • dev-other/
    • test-clean/
    • test-other/

3.3. Start TensorRT inference prediction

Inside the container, use the following script to run inference with TensorRT. You will need to set parameters such as:

  • CHECKPOINT: model checkpoint path
  • TRT_PRECISION: "fp32" or "fp16"; defines which precision kernels are used for the TensorRT engine (default: "fp32")
  • PYTORCH_PRECISION: "fp32" or "fp16"; defines which precision is used for inference in PyTorch (default: "fp32")
  • TRT_PREDICTION_PATH: file in which to store the inference predictions generated with TensorRT
  • PYT_PREDICTION_PATH: file in which to store the inference predictions generated with native PyTorch
  • DATASET: LibriSpeech dataset split (default: dev-clean)
  • NUM_STEPS: number of inference steps (default: -1)
  • BATCH_SIZE: mini-batch size (default: 1)
  • NUM_FRAMES: cuts/pads all pre-processed feature tensors to this length; 100 frames ~ 1 second of audio (default: 3600)

In [ ]:
!nvidia-docker exec -it -e CHECKPOINT=/checkpoints/jasper_fp16.pt -e TRT_PREDICTION_PATH=/results/result.txt JasperTRT bash trt/scripts/trt_inference.sh
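For example, to also override the dataset split, batch size, and precision settings from the list above (the values below are purely illustrative):

In [ ]:
# Illustrative variant (values are placeholders): evaluate dev-other with
# batch size 8 and FP16 in both TensorRT and PyTorch.
!nvidia-docker exec -it \
  -e CHECKPOINT=/checkpoints/jasper_fp16.pt \
  -e DATASET=dev-other \
  -e BATCH_SIZE=8 \
  -e TRT_PRECISION=fp16 \
  -e PYTORCH_PRECISION=fp16 \
  -e TRT_PREDICTION_PATH=/results/result_fp16.txt \
  JasperTRT bash trt/scripts/trt_inference.sh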

3.4. Start TensorRT Inference Benchmark

Run the following command inside the container to run the inference benchmark with TensorRT.

You will need to set parameters such as:

  • CHECKPOINT: model checkpoint path
  • NUM_STEPS: number of inference steps; if -1, runs inference on the entire dataset (default: -1)
  • NUM_FRAMES: cuts/pads all pre-processed feature tensors to this length; 100 frames ~ 1 second of audio (default: 512)
  • BATCH_SIZE: data batch size (default: 64)
  • TRT_PRECISION: "fp32" or "fp16"; defines which precision kernels are used for the TensorRT engine (default: "fp32")
  • PYTORCH_PRECISION: "fp32" or "fp16"; defines which precision is used for inference in PyTorch (default: "fp32")
  • CSV_PATH: file to store the CSV results (default: "/results/res.csv")

In [ ]:
!nvidia-docker exec -it -e CHECKPOINT=/checkpoints/jasper_fp16.pt -e TRT_PREDICTION_PATH=/results/benchmark.txt JasperTRT bash trt/scripts/trt_inference_benchmark.sh

4. Automatic Mixed Precision

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta and Turing architectures, significant training speedups are observed by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures.

Using mixed precision training requires two steps:

  • Porting the model to use the FP16 data type where appropriate.
  • Adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK. For information about:

  • How to train using mixed precision, see the Mixed Precision Training paper and the Training With Mixed Precision documentation.
  • Techniques used for mixed precision training, see the blog post Mixed-Precision Training of Deep Neural Networks.
  • APEX tools for mixed precision training, see NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch.
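As a brief, hedged illustration of the two training-side steps above (this notebook itself only performs inference), Apex wraps both in a few lines; the model, optimizer, and data below are placeholders:

In [ ]:
# Hedged sketch of the two mixed precision training steps with apex.amp
# (training only; this notebook performs inference). The model, optimizer,
# and data are placeholders and a CUDA-capable GPU is required.
import torch
from apex import amp

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Step 1: let Apex cast the model/optimizer so FP16 is used where appropriate
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Step 2: scale the loss to preserve small gradient values
loss = model(torch.randn(32, 128).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()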

To enable mixed precision, set the variables TRT_PRECISION=fp16 and PYTORCH_PRECISION=fp16 when running inference. To run the TensorRT inference benchmark using automatic mixed precision:


In [ ]:
!nvidia-docker exec -it -e CHECKPOINT=/checkpoints/jasper_fp16.pt -e TRT_PREDICTION_PATH=/results/benchmark.txt -e TRT_PRECISION=fp16 -e PYTORCH_PRECISION=fp16 -e CSV_PATH=/results/res_fp16.csv JasperTRT bash trt/scripts/trt_inference_benchmark.sh

From the performance metrics (pyt_infer) in the res.csv (fp32) and res_fp16.csv (automatic mixed precision) files, you can see that automatic mixed precision provides a significant inference speedup compared to fp32.
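For example, once both benchmark runs have finished, the two CSV files can be inspected from the host, where /results/ is mounted at $PWD/result. This is only a hedged sketch: the exact CSV layout (including the "pyt_infer" column name) is an assumption here, so inspect the files first if the names differ.

In [ ]:
# Hedged sketch: compare the fp32 and mixed-precision benchmark CSVs from the host.
# /results/ inside the container is mounted from $PWD/result on the host, and the
# exact column layout is an assumption -- check the files if the names differ.
# Requires pandas on the host.
import pandas as pd

fp32 = pd.read_csv("result/res.csv")
amp  = pd.read_csv("result/res_fp16.csv")

print(fp32.head())
print(amp.head())

# If a latency column such as "pyt_infer" is present in both files,
# a rough speedup estimate would be:
# print(fp32["pyt_infer"].mean() / amp["pyt_infer"].mean())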

5. Play with audio examples

You can perform inference using the pre-trained checkpoints: the pipeline takes an audio file (in .wav format) as input and produces the corresponding transcript as a text file, so the output text depends entirely on the audio you provide. For example, there are several example input files in the "notebooks" directory; we can listen to example1.wav:


In [ ]:
import IPython.display as ipd
ipd.Audio('notebooks/example1.wav', rate=22050)

You can run inference using the trt/perf.py script:

  • the checkpoint is passed with the --ckpt_path argument
  • --model_toml specifies the path to the network configuration file (see examples in the "configs" directory)
  • --make_onnx exports the model to an ONNX file at the path given by --onnx_path
  • --engine_path specifies where the TensorRT engine file (*.plan) is saved

To create a new engine file (jasper.plan) for TensorRT and run it using fp32 (building the engine for the first time can take several minutes):


In [ ]:
!nvidia-docker exec -it JasperTRT python trt/perf.py --ckpt_path /checkpoints/jasper_fp16.pt --wav=notebooks/example1.wav --model_toml=configs/jasper10x5dr_nomask.toml --make_onnx --onnx_path jasper.onnx --engine_path jasper.plan

If you already have the engine file (jasper.plan), you can run the existing TensorRT engine using fp32:


In [ ]:
!nvidia-docker exec -it JasperTRT python trt/perf.py --wav=notebooks/example1.wav --model_toml=configs/jasper10x5dr_nomask.toml --use_existing_engine --engine_path jasper.plan

To run inference on the input audio file using automatic mixed precision, add the --trt_fp16 argument. With automatic mixed precision, inference time is reduced significantly compared to fp32 (building the engine for the first time can take several minutes):


In [ ]:
!nvidia-docker exec -it JasperTRT python trt/perf.py --ckpt_path /checkpoints/jasper_fp16.pt --wav=notebooks/example1.wav --model_toml=configs/jasper10x5dr_nomask.toml --make_onnx --onnx_path jasper.onnx --engine_path jasper_fp16.plan --trt_fp16

If you already have the engine file (jasper_fp16.plan), you can run the existing TensorRT engine using automatic mixed precision:


In [ ]:
!nvidia-docker exec -it JasperTRT python trt/perf.py --wav=notebooks/example1.wav --model_toml=configs/jasper10x5dr_nomask.toml --use_existing_engine --engine_path jasper_fp16.plan --trt_fp16

You can play with the other examples in the "notebooks" directory. You can also provide your own audio files and generate the corresponding output text files in the same way.

For more information about TensorRT and building an engine file in Python, please see: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#python_topics
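As a short, hedged sketch of the Python API referenced above, a serialized plan file such as jasper.plan can be reloaded without rebuilding the engine. Input/output buffer allocation and the actual inference call are omitted (trt/perf.py with --use_existing_engine handles the full job), and the cell assumes it runs where the TensorRT Python package is installed, e.g. inside the container.

In [ ]:
# Hedged sketch: reload a serialized TensorRT engine (plan file) with the Python API.
# Buffer management and the inference call are omitted; trt/perf.py with
# --use_existing_engine performs the full pipeline.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("jasper.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
print("Engine bindings:", [engine.get_binding_name(i) for i in range(engine.num_bindings)])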


In [ ]:
# Stop your container when you are done
!docker stop JasperTRT

7. What's next

Now that you are familiar with running Jasper inference with TensorRT using automatic mixed precision, you may want to run inference on your own dataset, or train the model on your own dataset. For information on training, please see our GitHub repo: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper