I was looking at potentially converting an ML NLP model to the ONNX format in order to take advantage of the speed increase it offers when run with ONNX Runtime. However, I don't really understand what fundamentally changes in the converted model compared to the original one. I also don't know whether there are any drawbacks. Any thoughts on this would be very appreciated.
Answer
The accuracy of the model will be the same (considering just the output of the encoder and decoder). Inference quality may vary based on the decoding method you use (e.g. greedy search, beam search, top-k & top-p sampling); see the linked source for more info on this.
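To make the decoding-method point concrete, here is a minimal sketch of how those strategies are selected via `generate()` parameters on a plain Hugging Face model (the `t5-small` checkpoint here is just an assumed example; the same choices apply once you move the encoder/decoder to ONNX, except that you then have to drive them yourself):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")

# greedy search: always pick the highest-probability next token
greedy_ids = model.generate(**inputs, max_length=40)

# beam search: keep the num_beams best partial hypotheses at each step
beam_ids = model.generate(**inputs, max_length=40, num_beams=4, early_stopping=True)

# top-k / top-p sampling: sample from a truncated probability distribution
sampled_ids = model.generate(**inputs, max_length=40, do_sample=True, top_k=50, top_p=0.95)

print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
```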
For an ONNX seq2seq model, you need to implement the model.generate() method by hand. The onnxt5 lib has done a good job of implementing greedy search for ONNX models. However, most NLP generative models yield better results with beam search (you can refer to the linked source for how Hugging Face implemented beam search for their models). Unfortunately, for ONNX models you have to implement it yourself; a rough sketch of what a hand-rolled greedy loop looks like is shown below.
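This is only an illustrative sketch, not onnxt5's actual implementation. It assumes you have already exported the encoder and decoder to separate ONNX files, and the file names, input/output names, and the T5-specific start/EOS token ids are all assumptions that depend on how you exported the model:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# assumed file names and graph input/output names; adjust to your own export
encoder_sess = ort.InferenceSession("t5_encoder.onnx")
decoder_sess = ort.InferenceSession("t5_decoder.onnx")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

def greedy_generate(text, max_length=40, eos_token_id=1, decoder_start_token_id=0):
    # eos_token_id=1 and decoder_start_token_id=0 are T5's values
    enc = tokenizer(text, return_tensors="np")

    # run the encoder once
    encoder_hidden = encoder_sess.run(
        None,
        {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
    )[0]

    decoder_input_ids = np.array([[decoder_start_token_id]], dtype=np.int64)
    for _ in range(max_length):
        # run the decoder on everything generated so far
        logits = decoder_sess.run(
            None,
            {
                "input_ids": decoder_input_ids,
                "encoder_hidden_states": encoder_hidden,
                "encoder_attention_mask": enc["attention_mask"],
            },
        )[0]
        # greedy step: take the most likely next token
        next_token = int(logits[0, -1].argmax())
        decoder_input_ids = np.concatenate(
            [decoder_input_ids, np.array([[next_token]], dtype=np.int64)], axis=1
        )
        if next_token == eos_token_id:
            break
    return tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
```

Beam search follows the same encoder-once / decoder-in-a-loop pattern, but keeps several candidate sequences alive at each step instead of one.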
The inference speed definitely increases, as shown in this notebook by onnx-runtime (the example is on BERT). You have to run the encoder and the decoder separately on ONNX Runtime to take advantage of it. If you want to know more about ONNX and its runtime, refer to this link.
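For the "export the two halves separately" part, here is a minimal sketch of exporting just the encoder with torch.onnx.export; the wrapper class, file name, dynamic axes, and opset version are my assumptions, not the only valid choices:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()


class EncoderWrapper(torch.nn.Module):
    """Wrap the encoder so the exported ONNX graph returns a plain tensor."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state


enc = tokenizer("a sample input", return_tensors="pt")
torch.onnx.export(
    EncoderWrapper(model.encoder),
    (enc["input_ids"], enc["attention_mask"]),
    "t5_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["encoder_hidden_states"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "encoder_hidden_states": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)
# the decoder (plus lm_head) is exported the same way with its own wrapper,
# taking the decoder input_ids and the encoder hidden states as inputs
```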
Update: you can refer to the fastT5 library; it implements both greedy search and beam search for T5. For BART, have a look at the linked issue.
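A minimal usage sketch, roughly following the fastT5 README (the export_and_get_onnx_model call is taken from that README and may differ between versions):

```python
# roughly follows the fastT5 README; the API may change between versions
from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = "t5-small"
model = export_and_get_onnx_model(model_name)  # exports encoder/decoder to ONNX and wraps them
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("translate English to French: The weather is nice today.", return_tensors="pt")

# generate() behaves like the Hugging Face one, so beam search is just a parameter
tokens = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    num_beams=2,
)
print(tokenizer.decode(tokens.squeeze(), skip_special_tokens=True))
```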