
Do you have any plans for Speech-to-Text or Speech-to-Speech end-to-end models? #78

Open
Irvingao opened this issue May 21, 2024 · 6 comments

Comments

@Irvingao

🚀 The feature, motivation and pitch

As we all know, GPT-4o is an end-to-end multi-modal model that supports Speech-to-Text and Speech-to-Speech. I have some ideas about it:

  1. Speech to Text: Could we try combining a pretrained ASR encoder with a trainable linear projection to make Speech to Text possible? (A rough sketch follows below this list.)
  2. Speech to Speech: Align the pretrained ASR decoder with the main LLM backbone.
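
To make idea 1 concrete, here is a rough PyTorch sketch of what I mean: a frozen pretrained ASR encoder, a single trainable linear projector, and an LLM backbone that receives the projected speech frames as prefix embeddings. All module names and dimensions are placeholders (e.g. 1280 for a Whisper-style encoder, 4096 for a LLaMA-7B-style LLM), and the LLM is assumed to accept `inputs_embeds` in the HuggingFace style; this is not code from this repo.

```python
import torch
import torch.nn as nn

class SpeechToTextBridge(nn.Module):
    """Sketch: frozen ASR encoder -> trainable linear projector -> frozen LLM."""

    def __init__(self, asr_encoder, llm, asr_dim=1280, llm_dim=4096):
        super().__init__()
        self.asr_encoder = asr_encoder                # pretrained ASR encoder, kept frozen
        for p in self.asr_encoder.parameters():
            p.requires_grad = False
        self.projector = nn.Linear(asr_dim, llm_dim)  # the only trainable part
        self.llm = llm                                # LLM backbone, also frozen here
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, speech_input, text_embeds):
        # speech_input -> (batch, frames, asr_dim) hidden states from the ASR encoder
        speech_states = self.asr_encoder(speech_input)
        # project into the LLM embedding space and prepend to the text embeddings
        speech_embeds = self.projector(speech_states)
        inputs_embeds = torch.cat([speech_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```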

Alternatives

No response

Additional context

No response

@byrTony-Frankzyq
Collaborator

For your first idea, I think the ASR example has already done it.

@Irvingao
Author

> For your first idea, I think the ASR example has already done it.

I mean speech inputs with LLM outputs.

@byrTony-Frankzyq
Collaborator

byrTony-Frankzyq commented May 21, 2024

> For your first idea, I think the ASR example has already done it.
>
> I mean speech inputs with LLM outputs.

Your "text" means response, right?

@Irvingao
Author

> For your first idea, I think the ASR example has already done it.
>
> I mean speech inputs with LLM outputs.
>
> Your "text" means response, right? Though I don't fully understand.

Exactly.

@zszheng147
Collaborator

zszheng147 commented May 22, 2024

Are you talking about ASR for the speech-to-text task? If so, you can try our ASR example.

We may support speech-to-speech in the future, but this task is much more difficult than ASR or TTS; it is more like combining the two seamlessly. Thank you for your advice; we will take it into consideration.

If you have any further questions or need additional assistance, feel free to ask!
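
To illustrate the "combining the two" point (every name below is a placeholder, not an API from this repo): a naive cascade would look roughly like the sketch below, and an end-to-end speech-to-speech model would replace the intermediate text hops with a single network.

```python
def cascaded_speech_to_speech(audio, asr_model, llm, tts_model):
    """Naive ASR -> LLM -> TTS cascade; all calls here are hypothetical placeholders."""
    user_text = asr_model.transcribe(audio)                # speech -> text
    response_text = llm.generate(user_text)                # text   -> text
    response_audio = tts_model.synthesize(response_text)   # text   -> speech
    return response_audio
```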

@Learneducn

I used the SLAM framework to fine-tune a model and then ran inference. Why are my test results on LibriSpeech not as good as directly using the open-source Whisper model?
