
[Fix] Allow concurrent inference for multi model in WebWorker #546

Merged (2 commits) on Aug 13, 2024

Conversation

CharlieFRuan
Contributor

This is a follow-up to #542. It updates examples/multi-model to use a web worker, and also showcases generating responses from two models concurrently from the same engine. This was already supported by MLCEngine prior to this PR, but WebWorkerMLCEngine needed a patch. Specifically:

  • Prior to this PR, WebWorkerMLCEngineHandler maintained a single asyncGenerator, assuming only one model is loaded.
  • Now, to support concurrent streaming requests, we replace this.asyncGenerator with this.loadedModelIdToAsyncGenerator, which maps each model id to its dedicated asyncGenerator.
  • As a result, messages related to streaming need to specify the selectedModelId, hence the updates to the sending and handling of chatCompletionStreamInit, completionStreamInit, and completionStreamNextChunk.
    • Upon handling these messages, the handler needs to know which model to initiate an async generator for, or which async generator to call .next() on; a sketch follows this list.
    • This also means completion() and chatCompletion() of WebWorkerMLCEngine now call getModelIdToUse() themselves, a resolution that prior to this PR was delayed until it reached the underlying MLCEngine.
  • As of now, this.loadedModelIdToAsyncGenerator may not be cleaned up properly when an asyncGenerator finishes. We only call clear() in unload(), which may not be invoked upon reload(); in addition, service_worker may skip reload() entirely. Will leave it as is for now.
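
Below is a minimal sketch of the map-based bookkeeping described above. The field and message names follow this PR's description, but the surrounding types (StreamChunk, the message shape, the chunk-posting logic) are simplified stand-ins, not the actual web-llm code:

```typescript
// Simplified stand-in for the chunk type streamed back to the main thread.
type StreamChunk = unknown;

class WebWorkerMLCEngineHandler {
  // Replaces the old single `this.asyncGenerator`:
  // one dedicated async generator per loaded model.
  private loadedModelIdToAsyncGenerator = new Map<
    string,
    AsyncGenerator<StreamChunk>
  >();

  // Streaming messages now carry `selectedModelId`, so the handler knows
  // which generator to create or advance.
  onmessage(msg: { kind: string; selectedModelId: string }) {
    switch (msg.kind) {
      case "chatCompletionStreamInit": {
        // Register a dedicated generator for this model.
        this.loadedModelIdToAsyncGenerator.set(
          msg.selectedModelId,
          this.chatCompletionStream(msg.selectedModelId),
        );
        break;
      }
      case "completionStreamNextChunk": {
        // Advance only the generator that belongs to the requesting model,
        // so two models can stream concurrently without clobbering state.
        const generator = this.loadedModelIdToAsyncGenerator.get(
          msg.selectedModelId,
        );
        generator?.next().then((result) => {
          // Post `result.value` back to the main thread (omitted here).
        });
        break;
      }
    }
  }

  private async *chatCompletionStream(
    modelId: string,
  ): AsyncGenerator<StreamChunk> {
    // Placeholder for the real streaming logic bound to `modelId`.
  }
}
```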

Tested with WebLLMChat, including manually terminating the service worker.

@CharlieFRuan merged commit d351b6a into mlc-ai:main on Aug 13, 2024
1 check passed
@CharlieFRuan
Contributor (Author)

Demo of examples/multi-models with parallelGeneration():

web-llm-multi-models.mov
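
For reference, here is a hedged sketch of what driving two models concurrently from one WebWorkerMLCEngine can look like on the main thread; the model ids and prompts are placeholders, and `examples/multi-models` contains the actual parallelGeneration():

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Placeholder model ids -- substitute any two models from the prebuilt list.
const MODEL_A = "Llama-3.1-8B-Instruct-q4f32_1-MLC";
const MODEL_B = "Phi-3.5-mini-instruct-q4f16_1-MLC";

async function main() {
  // Load both models into the same engine (multi-model loading is from #542).
  const engine = await webllm.CreateWebWorkerMLCEngine(
    new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
    [MODEL_A, MODEL_B],
  );

  // Each request selects its model via the `model` field; with this PR the
  // two streaming generations can run concurrently through the worker.
  const ask = async (model: string, prompt: string) => {
    const chunks = await engine.chat.completions.create({
      model,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    });
    let reply = "";
    for await (const chunk of chunks) {
      reply += chunk.choices[0]?.delta?.content ?? "";
    }
    return reply;
  };

  const [replyA, replyB] = await Promise.all([
    ask(MODEL_A, "Tell me about Pittsburgh."),
    ask(MODEL_B, "Tell me about Seattle."),
  ]);
  console.log(replyA, replyB);
}

main();
```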

CharlieFRuan added a commit that referenced this pull request Aug 13, 2024
### Changes
- #546
  - Fix WebWorker's async generator, supporting concurrent generation from different models in the same engine
  - See the PR description, the demo in the reply, and `examples/multi-models` for more

### TVMjs
Still compiled at apache/tvm@1fcb620, no change