
support concurrent inference from multiple models #512

Closed
mikestaub opened this issue Jul 23, 2024 · 4 comments

Comments

@mikestaub

I would like to stream the response from two different LLMs simultaneously

@CharlieFRuan
Contributor

Thanks for the request. Having multiple models in a single engine simultaneously is something we are looking into now. Meanwhile, would having two MLCEngine work for your case?
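For reference, a minimal sketch of the two-engine approach, assuming the device has enough memory for both models. The model IDs below are placeholders; substitute any IDs from web-llm's prebuilt model list:

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Stream a completion from one engine, tagging each delta for readability.
async function streamFrom(
  engine: webllm.MLCEngineInterface,
  prompt: string,
  tag: string,
) {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of chunks) {
    // Each chunk carries a delta of the generated text.
    const delta = chunk.choices[0]?.delta?.content ?? "";
    console.log(`[${tag}]`, delta);
  }
}

async function main() {
  // Placeholder model IDs: each engine loads its own model
  // and holds its own GPU resources.
  const [engineA, engineB] = await Promise.all([
    webllm.CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC"),
    webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC"),
  ]);

  // Stream from both engines simultaneously.
  await Promise.all([
    streamFrom(engineA, "Explain WebGPU in one paragraph.", "A"),
    streamFrom(engineB, "Explain WebGPU in one paragraph.", "B"),
  ]);
}

main();
```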

@mikestaub
Author

Yes, that should work, assuming the device has enough resources. Is this possible today? Is there an example I can play with?

@CharlieFRuan
Contributor

Hi @mikestaub, as of npm 0.2.60, a single engine can load multiple models, and those models can process requests concurrently. However, I have not tested the performance benefit (if any) of processing requests simultaneously rather than sequentially. Still, being able to load multiple models is definitely convenient: it makes the engine behave like an OpenAI()-style endpoint, assuming the device has enough resources.

Note: each model can still only process one request at a time (i.e. concurrent batching is not supported).

The two main related PRs are:

See examples/multi-models for an example, which has the effect below with parallelGeneration():

[video: web-llm-multi-models.mov]
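For anyone landing here later, a minimal sketch of the single-engine flow. It assumes, per the examples/multi-models example, that CreateMLCEngine accepts an array of model IDs and that the OpenAI-style model field on each request selects which loaded model serves it; the model IDs are placeholders:

```typescript
import * as webllm from "@mlc-ai/web-llm";

async function main() {
  // Placeholder model IDs; substitute any supported ones.
  const modelA = "Llama-3.1-8B-Instruct-q4f32_1-MLC";
  const modelB = "Phi-3.5-mini-instruct-q4f16_1-MLC";

  // One engine loads both models (assumes the array form of CreateMLCEngine).
  const engine = await webllm.CreateMLCEngine([modelA, modelB]);

  // Fire both requests at once; the `model` field routes each request
  // to the corresponding loaded model. Note that each model still serves
  // one request at a time internally.
  const [replyA, replyB] = await Promise.all([
    engine.chat.completions.create({
      messages: [{ role: "user", content: "Hello from model A!" }],
      model: modelA,
    }),
    engine.chat.completions.create({
      messages: [{ role: "user", content: "Hello from model B!" }],
      model: modelB,
    }),
  ]);

  console.log(replyA.choices[0].message.content);
  console.log(replyB.choices[0].message.content);
}

main();
```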

@CharlieFRuan
Contributor

Closing this issue as completed. Feel free to reopen/open new ones if issues arise!
