
Does converting a seq2seq NLP model to the ONNX format negatively affect its performance?

I was looking at converting an NLP model to the ONNX format in order to take advantage of the speed increase from ONNX Runtime. However, I don’t really understand what is fundamentally changed in the converted model compared to the original. I also don’t know if there are any drawbacks. Any thoughts on this would be very appreciated.

Answer

The model’s accuracy will be the same, since conversion does not change the outputs of the encoder and decoder. Inference quality and speed, however, can vary depending on the decoding method you use (e.g. greedy search, beam search, or top-k & top-p sampling); see the linked resource for more info on these strategies.
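As a minimal sketch of what "decoding method" means here, the same Hugging Face seq2seq model can be run with different strategies purely through `generate()` arguments; the model name and prompt below are just placeholders:

```python
# Same model, different decoding strategies: only token selection at
# inference time changes, not the encoder/decoder weights or their accuracy.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
inputs = tokenizer("translate English to German: The house is small.",
                   return_tensors="pt")

greedy = model.generate(**inputs, max_length=40)                 # greedy search
beam = model.generate(**inputs, max_length=40, num_beams=4)      # beam search
sampled = model.generate(**inputs, max_length=40, do_sample=True,
                         top_k=50, top_p=0.95)                   # top-k / top-p sampling

print(tokenizer.decode(beam[0], skip_special_tokens=True))
```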

For an ONNX seq2seq model, you need to implement the model.generate() method by hand. The onnxt5 library has done a good job of implementing greedy search for ONNX models. However, most generative NLP models give better results with beam search (you can refer to the linked source for how Hugging Face implements beam search for its models); unfortunately, for ONNX models you have to implement it yourself. A rough greedy-search sketch follows.
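Here is a minimal, hand-rolled greedy loop over separately exported encoder/decoder ONNX graphs. The file names (`t5_encoder.onnx`, `t5_decoder.onnx`) and the input/output names (`input_ids`, `attention_mask`, `encoder_hidden_states`, etc.) are assumptions; they depend on how the model was exported:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = ort.InferenceSession("t5_encoder.onnx")   # assumed file name
decoder = ort.InferenceSession("t5_decoder.onnx")   # assumed file name

enc = tokenizer("translate English to German: The house is small.",
                return_tensors="np")
input_ids = enc["input_ids"].astype(np.int64)
attention_mask = enc["attention_mask"].astype(np.int64)

# Run the encoder once; its hidden states are reused at every decoding step.
encoder_hidden = encoder.run(None, {"input_ids": input_ids,
                                    "attention_mask": attention_mask})[0]

# T5 starts decoding from the pad token; greedily append the argmax token
# until EOS or the step limit is reached.
decoder_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int64)
for _ in range(40):
    logits = decoder.run(None, {
        "input_ids": decoder_ids,
        "encoder_hidden_states": encoder_hidden,
        "encoder_attention_mask": attention_mask,
    })[0]
    next_id = int(logits[0, -1].argmax())
    decoder_ids = np.concatenate(
        [decoder_ids, np.array([[next_id]], dtype=np.int64)], axis=1)
    if next_id == tokenizer.eos_token_id:
        break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```

Beam search would replace the single argmax with keeping the top-k partial sequences and their scores at every step, which is exactly the part you have to write yourself for ONNX models.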

The inference speed definitely increases, as shown in this notebook from onnxruntime (the example is on BERT). You have to export and run the encoder and decoder separately on ONNX Runtime in order to take advantage of it. If you want to know more about ONNX and its runtime, refer to this link.
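As a sketch of what "separately" means, the encoder can be wrapped so it returns a plain tensor and exported as its own ONNX graph; the decoder (plus LM head) would be wrapped and exported the same way. The wrapper class, file name, and dynamic axes below are illustrative assumptions:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


class EncoderWrapper(torch.nn.Module):
    """Return only the hidden states so the exported graph has a plain tensor output."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state


model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval()
tokenizer = AutoTokenizer.from_pretrained("t5-small")
sample = tokenizer("a sample input", return_tensors="pt")

torch.onnx.export(
    EncoderWrapper(model.get_encoder()),
    (sample["input_ids"], sample["attention_mask"]),
    "t5_encoder.onnx",                                   # assumed file name
    input_names=["input_ids", "attention_mask"],
    output_names=["encoder_hidden_states"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "attention_mask": {0: "batch", 1: "sequence"},
                  "encoder_hidden_states": {0: "batch", 1: "sequence"}},
    opset_version=13,
)
```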

Update: you can refer to the fastT5 library, which implements both greedy and beam search for T5 on ONNX. For BART, have a look at this issue.
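A short usage sketch, assuming fastT5’s export_and_get_onnx_model helper as described in its README (the model name is a placeholder):

```python
# fastT5 exports the encoder/decoder to ONNX and returns a model whose
# generate() supports greedy and beam search on ONNX Runtime.
from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = "t5-small"
model = export_and_get_onnx_model(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("translate English to German: The house is small.",
                   return_tensors="pt")
tokens = model.generate(input_ids=inputs["input_ids"],
                        attention_mask=inputs["attention_mask"],
                        num_beams=2)          # beam search on the ONNX model
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```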
