[Fix] Implement lock to ensure FCFS of requests to same model (#549)
A model cannot handle more than one concurrent request (e.g. more than one call to `chat.completions.create()`) since we do not support continuous batching, and each request requires its own resources such as the KV cache. ("Concurrent" requests to different models in the same engine are supported, though.) As a result, as pointed out in #522, when users try something like the following code:

```typescript
const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC");

async function sendRequest() {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    max_tokens: 64,
  });
  console.log(reply.choices[0].message.content);
}

await Promise.all([sendRequest(), sendRequest()]);
```

the model's state and the generation results get corrupted.

To resolve this, we implement `CustomLock` using Promises, maintaining a queue to ensure FCFS ordering of incoming requests to a model, so that for a single model, a request only starts after all previous requests have finished (see the sketch at the end of this description). The code above now works.

### Implementation Details
- We add `loadedModelIdToLock` to MLCEngine, maintaining a lock for each loaded model
  - Reminder: the need for a critical section is only per model, since each loaded model has its own `LLMChatPipeline` / `EmbeddingPipeline`
  - `loadedModelIdToLock` is cleared in `unload()` and set in `reloadInternal()`
- We acquire the lock at the very beginning of `completion()`, `chatCompletion()`, and `embedding()`, once we know which model the current call will use
- We release the lock at the end of `embedding()`, `completion()`, and `chatCompletion()` (for non-streaming cases), and at the end of `asyncGenerate()` (for streaming cases)
- Since we also want to release the lock when errors occur, we wrap the code in a big `try` `finally`
  - Since `asyncGenerate()` is an async generator, we instead add `try` `catch` in a fine-grained way, only around the places that can throw errors
  - This makes the code less readable, but it is not clear whether there is a better solution
- For WebWorkerMLCEngine, no special handling is needed, since WebWorkerMLCEngineHandler calls the underlying engine's APIs (e.g. `chatCompletion()`), which will block

### Tested
- Tested the `CustomLock` implementation with unit tests (implementation follows [this blog post](https://jackpordi.com/posts/locks-in-js-because-why-not))
- The example above now works
- [get-started, get-started-web-worker] x [streaming, non-streaming] x [concurrent requests, single request]
- examples/simple-chat-ts
- examples/multi-models
- WebLLMChat (with generation interrupts and manual termination of the service worker)
- Opening two WebLLMChat tabs and sending concurrent requests: the later request waits for the earlier one to finish (prior to this PR, garbage output was generated just as in the simple example above, since the two WebLLMChat tabs share the same service worker and hence the same engine)
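For reference, below is a minimal sketch of what a Promise-based FCFS lock along the lines of `CustomLock`, and the acquire/`try`/`finally`/release pattern described above, can look like. This is an illustration under stated assumptions, not the exact code in this PR; apart from `CustomLock` and `loadedModelIdToLock`, the names and signatures are hypothetical.

```typescript
// Sketch of a Promise-based FCFS lock. Callers that find the lock held are
// queued and resumed one at a time, in arrival order.
class CustomLock {
  private waitingResolvers: Array<() => void> = [];
  private locked = false;

  // Resolves immediately if the lock is free; otherwise queues the caller.
  async acquire(): Promise<void> {
    if (!this.locked) {
      this.locked = true;
      return;
    }
    return new Promise<void>((resolve) => {
      this.waitingResolvers.push(resolve);
    });
  }

  // Wakes the next queued caller (lock stays held), or marks the lock free.
  async release(): Promise<void> {
    const next = this.waitingResolvers.shift();
    if (next !== undefined) {
      next();
    } else {
      this.locked = false;
    }
  }
}

// Hypothetical shape of how a request-handling method guards its critical
// section; the real MLCEngine method signatures may differ.
async function chatCompletionGuarded(
  loadedModelIdToLock: Map<string, CustomLock>,
  selectedModelId: string,
  doGenerate: () => Promise<string>,
): Promise<string> {
  const lock = loadedModelIdToLock.get(selectedModelId)!;
  await lock.acquire(); // wait until all earlier requests to this model finish
  try {
    return await doGenerate(); // exclusive use of the model's pipeline
  } finally {
    await lock.release(); // release even if generation throws
  }
}
```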