How to Convert Transformers to ONNX with Hugging Face Optimum for Faster Inference

Jul 01, 2025 By Alison Perry

The growing use of transformer models in machine learning has created a demand for faster inference and more flexible deployment. While these models deliver impressive results, they are often heavy, slow, and not ideal for production environments out of the box. That's where exporting them to a more efficient format, such as ONNX (Open Neural Network Exchange), becomes useful.

ONNX helps break the dependency on a specific framework, making models easier to run across various platforms and devices. Hugging Face Optimum offers a straightforward way to convert transformers into ONNX without needing to dig deep into framework-specific conversion logic. This guide covers how it works, why you'd want to use it, and how to handle practical aspects of the conversion process.

Why Convert Transformers to ONNX?

Transformers are powerful but computationally expensive. PyTorch-based models from Hugging Face's Transformers library are great for experimentation but might not perform well when deployed in production settings where speed, memory, or cross-platform compatibility matters. ONNX helps by representing models in a platform-agnostic format that can run on various backends, such as ONNX Runtime, TensorRT, or OpenVINO, depending on the hardware and use case.

Using ONNX also strips away overhead tied to the PyTorch or TensorFlow environments. The ONNX model graph is optimized to run faster and use fewer resources, which is particularly valuable on edge devices, in mobile applications, or in large-scale inference systems. Hugging Face Optimum wraps this capability in an easy-to-use interface, so users do not need to write conversion code by hand or juggle multiple tools.

Getting Started with Hugging Face Optimum

Hugging Face Optimum is a companion library built to optimize transformer models for deployment. One of its key features is ONNX export, and it integrates smoothly with Hugging Face Transformers. Before converting anything, make sure the necessary packages are installed. This includes transformers, optimum, onnxruntime, and optionally optimum[onnxruntime] if you want everything bundled.
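A typical setup, assuming a standard pip environment, looks like this (the onnxruntime extra bundles the runtime together with Optimum):

pip install transformers "optimum[onnxruntime]"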

Start by loading the model and tokenizer you want to convert. Most common transformer models like BERT, DistilBERT, and GPT-2 are supported. Hugging Face Optimum uses optimum.exporters.onnx under the hood, offering CLI tools and Python APIs for model export. Both approaches work well, though the Python API gives more control.

To export a model using the command line, a simple command looks like this:

optimum-cli export onnx --model bert-base-uncased ./onnx-model

This fetches the model from the Hugging Face Hub and saves the ONNX version locally. If you prefer using Python, you can do:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "bert-base-uncased"
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("./onnx-model")
tokenizer.save_pretrained("./onnx-model")

In this code, export=True ensures that the PyTorch model is exported to ONNX when loaded.

What Happens During Conversion?

When you export a transformer to ONNX using Hugging Face Optimum, the library first runs shape inference and checks the model's compatibility with ONNX. Not every model or layer is fully supported by the ONNX format, but most standard architectures convert cleanly.

During export, the model is traced using torch.onnx.export or other backend-specific exporters. The resulting ONNX model contains a simplified graph of the operations needed for inference. Some dynamic operations are converted into static ones when possible, which makes the model easier to optimize later.

The exported file usually includes:

  • A .onnx file containing the model graph.
  • A tokenizer configuration in JSON format.
  • Additional configuration files such as config.json and preprocessor_config.json.

These components are needed to replicate inference exactly as it works in Transformers.
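Because these files mirror a regular Transformers checkpoint, the exported directory can be loaded back through Optimum and used with a standard pipeline. A minimal sketch, assuming the export above was saved to ./onnx-model:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the ONNX model and tokenizer written during export
onnx_model = ORTModelForSequenceClassification.from_pretrained("./onnx-model")
onnx_tokenizer = AutoTokenizer.from_pretrained("./onnx-model")

# The ONNX-backed model drops into a normal Transformers pipeline
classifier = pipeline("text-classification", model=onnx_model, tokenizer=onnx_tokenizer)
print(classifier("This is a test sentence."))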

ONNX also has different versions and operator sets, so Hugging Face Optimum tries to match the ONNX opset with what’s most compatible with the model architecture and target runtime. You can manually adjust this by passing the opset argument during export, but the default usually works fine.
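For example, the command-line exporter accepts an opset flag; a quick sketch, assuming opset 14 suits your target runtime:

optimum-cli export onnx --model bert-base-uncased --opset 14 ./onnx-model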

Optimizing and Running the ONNX Model

Exporting the model is only the first step. To get real gains, you need to run it using an optimized inference engine like ONNX Runtime. ONNX Runtime supports several acceleration options, including CPU, GPU, and custom accelerators, depending on your setup.

You can load and run inference on your ONNX model like this:

import onnxruntime
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./onnx-model")
session = onnxruntime.InferenceSession("./onnx-model/model.onnx")

inputs = tokenizer("This is a test sentence.", return_tensors="np")
outputs = session.run(None, dict(inputs))
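If a GPU is available, the session can be pointed at a specific execution provider; a brief sketch, assuming onnxruntime-gpu and the CUDA provider are installed, with a CPU fallback:

import onnxruntime

# Prefer the CUDA execution provider when present, otherwise fall back to CPU
session = onnxruntime.InferenceSession(
    "./onnx-model/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)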

Inference with ONNX Runtime can lead to significant speedups—up to 2x or 3x, depending on the model and hardware. Memory consumption also tends to be lower since the ONNX graph is stripped of training components and other unused logic.

You can go further by using ONNX Runtime's optimization tools, quantization, or hardware-specific providers like TensorRT. Hugging Face Optimum offers integration points for some of these, though not all are covered in the basic export step. Quantization, for instance, can reduce model size and speed up inference even more by converting model weights from float32 to int8.
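As a rough sketch of what dynamic int8 quantization can look like through Optimum's ONNX Runtime integration (assuming the export above and an AVX2-capable CPU target; the right configuration depends on your hardware):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic quantization configuration aimed at AVX2 CPUs; choose one that matches your hardware
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)

# Build a quantizer from the exported ONNX model and save an int8 copy alongside it
quantizer = ORTQuantizer.from_pretrained("./onnx-model")
quantizer.quantize(save_dir="./onnx-model-quantized", quantization_config=qconfig)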

When working with ONNX models in real applications, batch size and sequence length impact performance significantly. Fixed-shape models (e.g., batch size 1, sequence length 128) often run faster than models supporting dynamic shapes, so if your use case is predictable, fix these values during export.

Conclusion

Converting transformer models to ONNX using Hugging Face Optimum is a practical step for making machine learning projects more efficient and deployment-ready. This process simplifies the transition from research to production by reducing model size, improving inference speed, and enabling compatibility across different platforms and hardware. With minimal setup, you can move from a PyTorch-based model to a fully optimized ONNX version that runs smoothly with ONNX Runtime or other supported engines. Hugging Face Optimum streamlines the entire process, letting you focus more on the application and less on the infrastructure. It's a valuable tool when performance and flexibility are crucial in real-world AI tasks.
