How to export πŸ€— Transformers Models to ONNX?

ONNX is an open format for machine learning models. It allows you to save your neural network's computation graph in a framework-agnostic way, which can be particularly helpful when deploying deep learning models.

Indeed, businesses often have requirements (languages, hardware, ...) for which the training framework is not the best suited in inference scenarios. In that context, having a representation of the actual computation graph that can be shared across the various business units and stacks of an organization is a desirable property.

Along with the serialization format, ONNX also provides a runtime library which allows efficient and hardware-specific execution of the ONNX graph. This is done through the onnxruntime project, which already includes collaborations with many hardware vendors to seamlessly deploy models on various platforms.

In this notebook we'll walk you through the process of converting a PyTorch or TensorFlow transformers model to ONNX and leveraging onnxruntime to run inference tasks on models from πŸ€— transformers.

Exporting πŸ€— transformers model to ONNX


Exporting models (either PyTorch or TensorFlow) is easily achieved through the conversion tool provided as part of the πŸ€— transformers repository.

Under the hood, the process is essentially the following (a manual sketch with torch.onnx.export follows the list):

  1. Allocate the model from transformers (PyTorch or TensorFlow)
  2. Forward dummy inputs through the model, so that ONNX can record the set of operations executed
  3. Optionally define dynamic axes on input and output tensors
  4. Save the graph along with the network parameters
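
For reference, here is a minimal sketch of what these steps look like when done by hand with torch.onnx.export. The input/output names, dynamic-axis labels and output path below are illustrative assumptions; the conversion tool used in the next cell handles all of this for you.

In [ ]:
from os import makedirs

import torch
from transformers import BertModel, BertTokenizerFast

# 1. Allocate the model and a tokenizer to build dummy inputs
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
model.eval()

# 2. Forward dummy inputs so ONNX can record the executed operations
dummy = tokenizer.encode_plus("This is a sample input", return_tensors="pt")

# 3. Dynamic axes so batch size and sequence length are not hard-coded in the graph
symbolic_names = {0: "batch", 1: "sequence"}
dynamic_axes = {name: symbolic_names for name in ("input_ids", "attention_mask", "token_type_ids", "last_hidden_state")}

# 4. Save the graph along with the network parameters
makedirs("onnx", exist_ok=True)
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"], dummy["token_type_ids"]),
    "onnx/bert-base-cased-manual.onnx",  # illustrative output path
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes=dynamic_axes,
    opset_version=11,
)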

InΒ [Β ]:
!pip install --upgrade git+https://github.com/huggingface/transformers

InΒ [Β ]:
!rm -rf onnx/
from transformers.convert_graph_to_onnx import convert

# Handles all the above steps for you
convert(framework="pt", model="bert-base-cased", output="onnx/bert-base-cased.onnx", opset=11)

# TensorFlow
# convert(framework="tf", model="bert-base-cased", output="onnx/bert-base-cased.onnx", opset=11)

How to leverage the runtime for inference over an ONNX graph


As mentioned in the introduction, ONNX is a serialization format, and many side projects can load the saved graph and run the actual computations from it. Here, we'll focus on the official onnxruntime. The runtime is implemented in C++ for performance reasons and provides APIs/bindings for C++, C, C#, Java and Python.

In this notebook, we will use the Python API to highlight how to load a serialized ONNX graph and run an inference workload on various backends through onnxruntime.

onnxruntime is available on PyPI:

  • onnxruntime: ONNX + MLAS (Microsoft Linear Algebra Subprograms)
  • onnxruntime-gpu: ONNX + MLAS + CUDA

InΒ [Β ]:
!pip install transformers onnxruntime-gpu onnx psutil matplotlib
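
Once installed, a quick sanity check is to list the execution providers your onnxruntime build actually exposes; the names printed depend on the flavour installed (for instance, CUDAExecutionProvider only shows up with onnxruntime-gpu).

In [ ]:
import onnxruntime

# Execution providers compiled into this onnxruntime build
print(onnxruntime.__version__)
print(onnxruntime.get_available_providers())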

Preparing for an Inference Session


Inference is done using a specific backend definition which turns on hardware-specific optimizations of the graph.

Optimizations are basically of three kinds:

  • Constant Folding: Convert static variables to constants in the graph
  • Deadcode Elimination: Remove nodes never accessed in the graph
  • Operator Fusing: Merge multiple instructions into one (Linear -> ReLU can be fused into LinearReLU)

ONNX Runtime automatically applies most optimizations by setting specific SessionOptions.
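
As a minimal sketch (not required for the rest of the notebook; the output path below is an arbitrary example), this is how the optimization level can be set explicitly and the optimized graph dumped to disk for later inspection:

In [ ]:
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions

options = SessionOptions()

# Enable all graph-level optimizations (constant folding, dead code elimination, fusions, ...)
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL

# Persist the optimized graph so it can be inspected later, e.g. with Netron
options.optimized_model_filepath = "onnx/bert-base-cased-optimized.onnx"  # arbitrary example path

_ = InferenceSession("onnx/bert-base-cased.onnx", options, providers=["CPUExecutionProvider"])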

Note: Some of the latest optimizations that are not yet integrated into ONNX Runtime are available in an optimization script that tunes models for best performance.


InΒ [Β ]:
# # An optional step, unless you want a model with mixed precision for performance
# # acceleration on newer GPUs, or you are working with TensorFlow (tf.keras) models
# # or PyTorch models other than BERT

# !pip install onnxruntime-tools
# from onnxruntime_tools import optimizer

# # Mixed precision conversion for bert-base-cased model converted from Pytorch
# optimized_model = optimizer.optimize_model("bert-base-cased.onnx", model_type='bert', num_heads=12, hidden_size=768)
# optimized_model.convert_model_float32_to_float16()
# optimized_model.save_model_to_file("bert-base-cased.onnx")

# # optimizations for bert-base-cased model converted from Tensorflow(tf.keras)
# optimized_model = optimizer.optimize_model("bert-base-cased.onnx", model_type='bert_keras', num_heads=12, hidden_size=768)
# optimized_model.save_model_to_file("bert-base-cased.onnx")

InΒ [2]:
from os import environ
from psutil import cpu_count

# Constants from the performance optimization available in onnxruntime.
# They need to be set before importing onnxruntime.
environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True))
environ["OMP_WAIT_POLICY"] = 'ACTIVE'

from onnxruntime import InferenceSession, SessionOptions, get_all_providers

InΒ [3]:
def create_model_for_provider(model_path: str, provider: str) -> InferenceSession: 
  
  assert provider in get_all_providers(), f"provider {provider} not found, {get_all_providers()}"

  # A few properties that might have an impact on performance (provided by MS)
  options = SessionOptions()
  options.intra_op_num_threads = 1

  # Load the model as a graph and prepare the CPU backend 
  return InferenceSession(model_path, options, providers=[provider])

Forwarding through our optimized ONNX model running on CPU


When the model is loaded for inference over a specific provider, for instance CPUExecutionProvider as used below, an optimized graph can be saved. This graph might include various optimizations, and you might be able to see some higher-level operations in the graph (through Netron, for instance) such as:

  • EmbedLayerNormalization
  • Attention
  • FastGeLU

These operations are an example of the kind of optimizations onnxruntime performs, for instance here merging multiple operations into a bigger one (Operator Fusing).
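
One way to check this is to load the graph back with the onnx package and count the operator types. The path below assumes you dumped an optimized graph as in the earlier sketch; the original export works as well and will simply show the standard operators.

In [ ]:
from collections import Counter

import onnx

# Count operator types in the graph to spot fused, higher-level nodes
graph = onnx.load("onnx/bert-base-cased-optimized.onnx").graph
for op_type, count in Counter(node.op_type for node in graph.node).most_common(10):
    print(f"{op_type}: {count}")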


InΒ [4]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
cpu_model = create_model_for_provider("onnx/bert-base-cased.onnx", "CPUExecutionProvider")

# Inputs are provided through numpy array
model_inputs = tokenizer.encode_plus("My name is Bert", return_tensors="pt")
inputs_onnx = {k: v.cpu().detach().numpy() for k, v in model_inputs.items()}

# Run the model (None = get all the outputs)
sequence, pooled = cpu_model.run(None, inputs_onnx)

# Print information about outputs

print(f"Sequence output: {sequence.shape}, Pooled output: {pooled.shape}")


Sequence output: (1, 6, 768), Pooled output: (1, 768)

Benchmarking different CPU & GPU providers

Disclaimer: results may vary depending on the hardware used to run the model


InΒ [5]:
from torch.cuda import get_device_name
from contextlib import contextmanager
from dataclasses import dataclass
from time import time
from typing import List
from tqdm import trange

print(f"Doing GPU inference on {get_device_name(0)}", flush=True)

@contextmanager
def track_infer_time(buffer: List[int]):
    start = time()
    yield
    end = time()

    buffer.append(end - start)


@dataclass
class OnnxInferenceResult:
  model_inference_time: List[int]
  optimized_model_path: str


# All the providers we'll be using in the test
results = {}
providers = [
  "CUDAExecutionProvider",
  "CPUExecutionProvider",            
  "TensorrtExecutionProvider",
  "DnnlExecutionProvider",          
]

# Iterate over all the providers
for provider in providers:

  # Create the model with the specified provider
  model = create_model_for_provider("onnx/bert-base-cased.onnx", provider)

  # Keep track of the inference time
  time_buffer = []

  # Warm up the model
  for _ in trange(10, desc="Warming up"):
    model.run(None, inputs_onnx)

  # Compute 
  for _ in trange(100, desc=f"Tracking inference time on {provider}"):
    with track_infer_time(time_buffer):
      model.run(None, inputs_onnx)

  # Store the result
  results[provider] = OnnxInferenceResult(
      time_buffer,
      model.get_session_options().optimized_model_filepath
  )


Doing GPU inference on TITAN RTX
Warming up: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:00<00:00, 333.82it/s]
Tracking inference time on CUDAExecutionProvider: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100/100 [00:00<00:00, 521.76it/s]
Warming up: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:00<00:00, 62.95it/s]
Tracking inference time on CPUExecutionProvider: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100/100 [00:01<00:00, 68.65it/s]
Warming up: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:00<00:00, 69.72it/s]
Tracking inference time on TensorrtExecutionProvider: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100/100 [00:01<00:00, 71.31it/s]
Warming up: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:00<00:00, 66.28it/s]
Tracking inference time on DnnlExecutionProvider: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100/100 [00:01<00:00, 72.03it/s]

InΒ [7]:
from transformers import BertModel

# Add PyTorch to the providers
model_pt = BertModel.from_pretrained("bert-base-cased")
for _ in trange(10, desc="Warming up"):
  model_pt(**model_inputs)

# Compute 
time_buffer = []
for _ in trange(100, desc=f"Tracking inference time on PyTorch"):
  with track_infer_time(time_buffer):
    model_pt(**model_inputs)

# Store the result
results["PyTorch"] = OnnxInferenceResult(
    time_buffer,
    None  # no optimized ONNX graph path for the PyTorch baseline
)


Warming up: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:00<00:00, 18.04it/s]
Tracking inference time on PyTorch: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 100/100 [00:05<00:00, 18.88it/s]

Show the inference performance of each provider

Note: PyTorch model benchmark is run on CPU


InΒ [24]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

# Compute average inference time + std per provider (in ms)
time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}
time_results_std = {k: np.std(v.model_inference_time) * 1e3 for k, v in results.items()}

plt.rcdefaults()
fig, ax = plt.subplots(figsize=(16, 12))
ax.set_ylabel("Avg Inference time (ms)")
ax.set_title("Average inference time (ms) for each provider")
ax.bar(list(time_results.keys()), list(time_results.values()), yerr=list(time_results_std.values()))
plt.show()