
[Fix] Allow concurrent inference for multi model in WebWorker #546

Merged (2 commits) on Aug 13, 2024

Conversation

CharlieFRuan
Contributor

This is a follow-up to #542. It updates examples/multi-model to use a web worker, and also showcases generating responses from two models concurrently from the same engine. This was already supported by MLCEngine prior to this PR, but WebWorkerMLCEngine needed a patch. Specifically:

  • Prior to this PR, WebWorkerMLCEngineHandler maintained a single asyncGenerator, assuming only one model is loaded.
  • Now, to support concurrent streaming requests, we replace this.asyncGenerator with this.loadedModelIdToAsyncGenerator, which maps each model id to its dedicated asyncGenerator.
  • As a result, messages related to streaming need to specify the selectedModelId, hence the updates to the sending and handling of chatCompletionStreamInit, completionStreamInit, and completionStreamNextChunk.
    • Upon handling these messages, the handler needs to know which model to initiate an async generator for, or which async generator to call .next() on; a sketch follows this list.
    • This also means completion() and chatCompletion() of WebWorkerMLCEngine now call getModelIdToUse() themselves, a resolution that prior to this PR was delayed until it reached the underlying MLCEngine.
  • As of now, this.loadedModelIdToAsyncGenerator may not be cleaned up properly when an asyncGenerator finishes. We only call clear() in unload(), which may not be invoked upon reload(); in addition, service_worker may skip reload() entirely. Will leave it as is for now.
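
Below is a minimal sketch of the map-based bookkeeping described above. The field and message names follow this PR's description, but the surrounding types (StreamChunk, the message shape, the chunk-posting logic) are simplified stand-ins, not the actual web-llm code:

```typescript
// Simplified stand-in for the chunk type streamed back to the main thread.
type StreamChunk = unknown;

class WebWorkerMLCEngineHandler {
  // Replaces the old single `this.asyncGenerator`:
  // one dedicated async generator per loaded model.
  private loadedModelIdToAsyncGenerator = new Map<
    string,
    AsyncGenerator<StreamChunk>
  >();

  // Streaming messages now carry `selectedModelId`, so the handler knows
  // which generator to create or advance.
  onmessage(msg: { kind: string; selectedModelId: string }) {
    switch (msg.kind) {
      case "chatCompletionStreamInit": {
        // Register a dedicated generator for this model.
        this.loadedModelIdToAsyncGenerator.set(
          msg.selectedModelId,
          this.chatCompletionStream(msg.selectedModelId),
        );
        break;
      }
      case "completionStreamNextChunk": {
        // Advance only the generator that belongs to the requesting model,
        // so two models can stream concurrently without clobbering state.
        const generator = this.loadedModelIdToAsyncGenerator.get(
          msg.selectedModelId,
        );
        generator?.next().then((result) => {
          // Post `result.value` back to the main thread (omitted here).
        });
        break;
      }
    }
  }

  private async *chatCompletionStream(
    modelId: string,
  ): AsyncGenerator<StreamChunk> {
    // Placeholder for the real streaming logic bound to `modelId`.
  }
}
```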

Tested with WebLLMChat, including manually terminating the service worker.

@CharlieFRuan merged commit d351b6a into mlc-ai:main on Aug 13, 2024
1 check passed
@CharlieFRuan
Contributor (Author)

Demo of examples/multi-models with parallelGeneration():

web-llm-multi-models.mov
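
For reference, here is a hedged sketch of what driving two models concurrently from one WebWorkerMLCEngine can look like on the main thread; the model ids and prompts are placeholders, and `examples/multi-models` contains the actual parallelGeneration():

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Placeholder model ids -- substitute any two models from the prebuilt list.
const MODEL_A = "Llama-3.1-8B-Instruct-q4f32_1-MLC";
const MODEL_B = "Phi-3.5-mini-instruct-q4f16_1-MLC";

async function main() {
  // Load both models into the same engine (multi-model loading is from #542).
  const engine = await webllm.CreateWebWorkerMLCEngine(
    new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
    [MODEL_A, MODEL_B],
  );

  // Each request selects its model via the `model` field; with this PR the
  // two streaming generations can run concurrently through the worker.
  const ask = async (model: string, prompt: string) => {
    const chunks = await engine.chat.completions.create({
      model,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    });
    let reply = "";
    for await (const chunk of chunks) {
      reply += chunk.choices[0]?.delta?.content ?? "";
    }
    return reply;
  };

  const [replyA, replyB] = await Promise.all([
    ask(MODEL_A, "Tell me about Pittsburgh."),
    ask(MODEL_B, "Tell me about Seattle."),
  ]);
  console.log(replyA, replyB);
}

main();
```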

CharlieFRuan added a commit that referenced this pull request Aug 13, 2024
### Changes
- #546
  - Fix WebWorker's async generator, supporting concurrent generation from different models in the same engine
  - See the PR description, the demo in the reply, and `examples/multi-models` for more

### TVMjs
Still compiled at apache/tvm@1fcb620, no change