
Do you have any plans for Speech-to-Text or Speech-to-Speech end-to-end models? #78

Open
Irvingao opened this issue May 21, 2024 · 6 comments

Comments

@Irvingao

🚀 The feature, motivation and pitch

As we all know, GPT-4o is an end-to-end multi-modal model that supports Speech-to-Text and Speech-to-Speech. I have some ideas about it:

  1. Speech to Text: Could we try combining a pretrained ASR encoder with a trainable linear projection to make Speech to Text possible? (A rough sketch follows below this list.)
  2. Speech to Speech: Align the pretrained ASR decoder with the main LLM backbone.
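
To make idea 1 concrete, here is a rough PyTorch sketch of what I mean: a frozen pretrained ASR encoder, a single trainable linear projector, and an LLM backbone that receives the projected speech frames as prefix embeddings. All module names and dimensions are placeholders (e.g. 1280 for a Whisper-style encoder, 4096 for a LLaMA-7B-style LLM), and the LLM is assumed to accept `inputs_embeds` in the HuggingFace style; this is not code from this repo.

```python
import torch
import torch.nn as nn

class SpeechToTextBridge(nn.Module):
    """Sketch: frozen ASR encoder -> trainable linear projector -> frozen LLM."""

    def __init__(self, asr_encoder, llm, asr_dim=1280, llm_dim=4096):
        super().__init__()
        self.asr_encoder = asr_encoder                # pretrained ASR encoder, kept frozen
        for p in self.asr_encoder.parameters():
            p.requires_grad = False
        self.projector = nn.Linear(asr_dim, llm_dim)  # the only trainable part
        self.llm = llm                                # LLM backbone, also frozen here
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, speech_input, text_embeds):
        # speech_input -> (batch, frames, asr_dim) hidden states from the ASR encoder
        speech_states = self.asr_encoder(speech_input)
        # project into the LLM embedding space and prepend to the text embeddings
        speech_embeds = self.projector(speech_states)
        inputs_embeds = torch.cat([speech_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```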

Alternatives

No response

Additional context

No response

@byrTony-Frankzyq
Collaborator

For your first idea, I think the ASR example has already done it.

@Irvingao
Author

> For your first idea, I think the ASR example has already done it.

I mean speech inputs with LLM outputs.

@byrTony-Frankzyq
Collaborator

byrTony-Frankzyq commented May 21, 2024

> For your first idea, I think the ASR example has already done it.
>
> I mean speech inputs with LLM outputs.

Your "text" means response, right?

@Irvingao
Author

> For your first idea, I think the ASR example has already done it.
>
> I mean speech inputs with LLM outputs.
>
> Your "text" means response, right? Though I don't fully understand.

Exactly.

@zszheng147
Collaborator

zszheng147 commented May 22, 2024

Are you talking about ASR for the speech-to-text task? If so, you can try our ASR example.

We may support speech-to-speech in the future, but this task is much more difficult than ASR or TTS; it is more like combining the two seamlessly. Thank you for your advice; we will take it into consideration.

If you have any further questions or need additional assistance, feel free to ask!
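
To illustrate the "combining the two" point (every name below is a placeholder, not an API from this repo): a naive cascade would look roughly like the sketch below, and an end-to-end speech-to-speech model would replace the intermediate text hops with a single network.

```python
def cascaded_speech_to_speech(audio, asr_model, llm, tts_model):
    """Naive ASR -> LLM -> TTS cascade; all calls here are hypothetical placeholders."""
    user_text = asr_model.transcribe(audio)                # speech -> text
    response_text = llm.generate(user_text)                # text   -> text
    response_audio = tts_model.synthesize(response_text)   # text   -> speech
    return response_audio
```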

@Learneducn

I used the SLAM framework to fine-tune a model and then ran inference. Why are my test results on LibriSpeech not as good as directly using the open-source Whisper model?
