
Documentation for exporting openai/whisper-large-v3 to ONNX #1752

Open
mmingo848 opened this issue Mar 10, 2024 · 10 comments
Labels
feature-request New feature or request onnx Related to the ONNX export

Comments

@mmingo848

mmingo848 commented Mar 10, 2024

Feature request

Hello, I am exporting the OpenAI Whisper-large-v3 model to ONNX and see that it exports several files, most importantly the encoder (encoder_model.onnx & encoder_model.onnx.data) and decoder (decoder_model.onnx, decoder_model.onnx.data, decoder_with_past_model.onnx, decoder_with_past_model.onnx.data) files. I'd also like to be able to reuse as much as possible of the following pipeline with the new ONNX files:

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

Is there documentation that explains how to tie all of these pieces together? I know transformer models go through a rather different export process, and I cannot find a clear A -> B guide on how to export this model and then perform tasks such as quantization. I see I can do the following for the tokenizer, but I'd like more insight into the rest mentioned above (how to use the separate ONNX files and how to reuse as much of the preexisting pipeline as possible).

processor.tokenizer.save_pretrained(onnx_path)

I also see I can do:

model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)

but I cannot find documentation on how to specify where the model is exported to, which makes me think I am either missing something fairly simple or it is just not linked from the documentation.
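
Something along these lines is what I am hoping exists (just a sketch; I am assuming that calling save_pretrained on the loaded model writes the exported ONNX files to the given folder, but I have not found this spelled out in the docs):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# Export the PyTorch checkpoint to ONNX on load...
model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3", export=True)

# ...then (assumption) write the resulting ONNX files and configs to a chosen directory.
model.save_pretrained("whisper_onnx")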

Motivation

I'd love to see further documentation on the entire export process for this highly popular model. Deployment is significantly slowed by the lack of an easy-to-find A -> B guide for exporting the model and reusing the pipeline provided for the vanilla model.

Your contribution

I am able to provide additional information to make this process easier.

@fxmarty
Contributor

fxmarty commented Mar 19, 2024

@mmingo848 You can use:

optimum-cli export onnx --help
optimum-cli export onnx --model openai/whisper-large-v3 whisper_onnx

and then use ORTModelForSpeechSeq2Seq.

Although decoder_model.onnx and decoder_with_past_model.onnx are saved in the output folder, they are not required for inference; you can just use decoder_model_merged.onnx for the decoder, which handles both the case without KV cache (first decoding step) and the case with KV cache (following decoding steps). ORTModelForSpeechSeq2Seq does not use decoder_model.onnx and decoder_with_past_model.onnx by default.
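
For example, a minimal sketch of loading the exported folder and reusing the transformers pipeline (assuming the tokenizer/preprocessor files were exported alongside the ONNX files, and using a hypothetical sample.wav):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

# Load the ONNX files produced by optimum-cli above.
model = ORTModelForSpeechSeq2Seq.from_pretrained("whisper_onnx")
processor = AutoProcessor.from_pretrained("whisper_onnx")

# Same transformers pipeline as with the PyTorch model.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
print(pipe("sample.wav"))  # hypothetical audio file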

Feel free to refer to:

Let me know if this documentation is helpful!

@MrRace

MrRace commented Mar 29, 2024

@fxmarty Here is the export log:

Validating ONNX model /share_model_zoo/LLM/openai/onnx_whisper-large-v3/encoder_model.onnx...
        -[✓] ONNX model output names match reference model (last_hidden_state)
        - Validating ONNX Model output "last_hidden_state":
                -[✓] (2, 1500, 1280) matches (2, 1500, 1280)
                -[x] values not close enough, max diff: 0.019733428955078125 (atol: 0.001)
Validating ONNX model /share_model_zoo/LLM/openai/onnx_whisper-large-v3/decoder_model.onnx...
        -[✓] ONNX model output names match reference model (logits)
        - Validating ONNX Model output "logits":
                -[✓] (2, 16, 51866) matches (2, 16, 51866)
                -[✓] all values close (atol: 0.001)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 0.001:
- last_hidden_state: max diff = 0.019733428955078125.

You can see [x] values not close enough, max diff: 0.019733428955078125 (atol: 0.001) for "last_hidden_state". Is it normal for this to occur?

@fxmarty
Contributor

fxmarty commented Mar 29, 2024

@MrRace Yes it can happen, I would not be worried. We should improve the warning.

@MrRace

MrRace commented Apr 1, 2024


@fxmarty I exported the Whisper ONNX model files using the following command:

optimum-cli export onnx --model /share_model_zoo/LLM/openai/whisper-large-v3/ --task automatic-speech-recognition --device cuda:0 /share_model_zoo/LLM/openai/onnx_gpu_whisper-large-v3/

Under the export directory /share_model_zoo/LLM/openai/onnx_gpu_whisper-large-v3/, there are four ONNX model files: encoder_model.onnx, encoder_model.onnx_data, decoder_model.onnx, and decoder_model.onnx_data.


However, the decoder_model_merged.onnx and decoder_with_past_model.onnx files you mentioned are not present. Why?

@fxmarty
Contributor

fxmarty commented Apr 2, 2024

@MrRace You need --task automatic-speech-recognition-with-past. There should be a log during the export about this (specifying --task automatic-speech-recognition explicitly disables the KV cache).
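
For example, a command along these lines (reusing your paths; the output directory is just a placeholder) should also produce decoder_model_merged.onnx:

optimum-cli export onnx --model /share_model_zoo/LLM/openai/whisper-large-v3/ --task automatic-speech-recognition-with-past --device cuda:0 /share_model_zoo/LLM/openai/onnx_gpu_with-past-whisper-large-v3/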

@MrRace

MrRace commented Apr 2, 2024


@fxmarty Thank you very much for your response. However, after following the commands you provided, the following error occurred. How can I fix this error? Thanks again.

Validation for the model /share_model_zoo/LLM/openai/onnx_gpu_with-past-whisper-large-v3/decoder_model_merged.onnx raised: [ONNXRuntimeError] : 1 : FAIL : Load model from /b4-ai/share_model_zoo/LLM/openai/onnx_gpu_with-past-whisper-large-v3/decoder_model_merged.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:180 onnxruntime::Model::Model(onnx::ModelProto&&, const PathString&, const IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9

Traceback (most recent call last):
  File "/share/opt/minicoda/lib/python3.11/site-packages/optimum/exporters/onnx/convert.py", line 1207, in onnx_export_from_model
    validate_models_outputs(
  File "/share/opt/minicoda/lib/python3.11/site-packages/optimum/exporters/onnx/convert.py", line 182, in validate_models_outputs
    raise exceptions[-1][1]
  File "/share/opt/minicoda/lib/python3.11/site-packages/optimum/exporters/onnx/convert.py", line 165, in validate_models_outputs
    validate_model_outputs(
  File "/share/opt/minicoda/lib/python3.11/site-packages/optimum/exporters/onnx/convert.py", line 233, in validate_model_outputs
    raise error
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /share_model_zoo/LLM/openai/onnx_gpu_with-past-whisper-large-v3/decoder_model_merged.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:180 onnxruntime::Model::Model(onnx::ModelProto&&, const PathString&, const IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9

optimum 1.18.0
onnx 1.16.0
onnxruntime 1.17.1
onnxruntime_extensions 0.10.1
onnxruntime-gpu 1.17.1

@fxmarty
Contributor

fxmarty commented Apr 2, 2024

Yes, this was fixed in #1780, which is not yet in a release.

Please downgrade to onnx 1.15 or use optimum from source.
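
For example (assuming a pip-based environment), either of the following should work:

pip install onnx==1.15.0

or

pip install git+https://github.com/huggingface/optimum.git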

@MrRace

MrRace commented Apr 3, 2024


@fxmarty Thanks a lot, it works now. After obtaining the decoder_model_merged.onnx and decoder_with_past_model.onnx files, how can I perform inference on test audio? Could you provide a complete example, or advise on how to modify my attempt below? Thank you very much.

import os

from onnxruntime import InferenceSession
import onnxruntime as ort
from transformers import WhisperProcessor
import time
import soundfile as sf

print("onnxruntime device=", ort.get_device())

onnx_model_dir = "/share_model_zoo/LLM/openai/onnx_gpu_with-past-whisper-large-v3/"
onnx_model_file = "decoder_model_merged.onnx"
onnx_model_file_path = os.path.join(onnx_model_dir, onnx_model_file)
print("Use onnx file=", onnx_model_file_path)
is_use_gpu = True
if is_use_gpu:
    session = InferenceSession(onnx_model_file_path, providers=['CUDAExecutionProvider'])
    print("Use onnxruntime-GPU")
else:
    session = InferenceSession(onnx_model_file_path, providers=['CPUExecutionProvider'])
    print("Use onnxruntime-CPU")


processor = WhisperProcessor.from_pretrained(onnx_model_dir)

test_audio_file = "./samples/jfk.wav"
array, sampling_rate = sf.read(test_audio_file)

input_features = processor(array, sampling_rate=sampling_rate, return_tensors="pt").input_features

# for i in range(len(session.get_inputs())):
#     print("session.get_inputs()[{}].name={}".format(i, session.get_inputs()[i].name))

# inference
decoder_input = {session.get_inputs()[0].name: input_features}
decoder_output = session.run(None, decoder_input)
print("decoder_output=", decoder_output)

The above code will raise an error, such as ValueError: Required inputs (['encoder_hidden_states', 'past_key_values.0.decoder.key', 'past_key_values.0.decoder.value', 'past_key_values.0.encoder.key', 'past_key_values.0.encoder.value', 'past_key_values.1.decod ... and so on.

@fxmarty
Contributor

fxmarty commented Apr 5, 2024

Hi @MrRace, if you don't want to reimplement the inference code from scratch, I advise you to use https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModelForSpeechSeq2Seq. An example is available there. By default, only encoder_model.onnx and decoder_model_merged.onnx will be used at inference.

I advise you to use https://github.com/lutzroeder/netron if you would like to visualize the ONNX graphs and understand their inputs/outputs.
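
If you do want to go the manual route, note that decoder_model_merged.onnx expects the encoder output (plus the KV cache inputs), not the raw mel features, so encoder_model.onnx has to run first. A rough sketch of that first step, with the input/output names assumed from the export above (the full autoregressive decoding loop is not shown):

import os

import numpy as np
import soundfile as sf
from onnxruntime import InferenceSession
from transformers import WhisperProcessor

onnx_model_dir = "/share_model_zoo/LLM/openai/onnx_gpu_with-past-whisper-large-v3/"

processor = WhisperProcessor.from_pretrained(onnx_model_dir)
encoder = InferenceSession(
    os.path.join(onnx_model_dir, "encoder_model.onnx"),
    providers=["CPUExecutionProvider"],
)

array, sampling_rate = sf.read("./samples/jfk.wav")
input_features = processor(array, sampling_rate=sampling_rate, return_tensors="np").input_features

# Run the encoder once; its output is the "encoder_hidden_states" the decoder error asks for.
(encoder_hidden_states,) = encoder.run(None, {"input_features": input_features.astype(np.float32)})

# The merged decoder is then called in a loop: the first step without past key/values,
# and later steps feeding the "present.*" outputs back in as the "past_key_values.*"
# inputs (together with the use_cache_branch flag of the merged graph).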

@MrRace

MrRace commented Apr 7, 2024

Hi @MrRace, if you don't want to reimplement the inference code from scratch, I advise you to use https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModelForSpeechSeq2Seq. An example is available there. By default, only encoder_model.onnx and decoder_model_merged.onnx will be used at inference.

I advise you to use https://github.com/lutzroeder/netron if you would like to visualize the ONNX graphs and understand their inputs/outputs.

@fxmarty Thanks a lot for your reply. Yes, I want to implement it from scratch to better understand the overall inference process.

@tengomucho tengomucho added feature-request New feature or request onnx Related to the ONNX export labels Oct 9, 2024