This issue tracks various action items we would like to complete for the function calling and embeddings features.
Function calling (beta)
We are calling it beta because function calling may need multiple iterations: it may be hard to conform different open-source models' function calling formats to the OpenAI API. We will try to make each iteration non-breaking.
The official OpenAI API also has new fields (such as `tools` and `tool_calls`), which we should support as well if possible.
This may limit flexibility for the user. For instance, while Llama 3.1 offers roughly three formats for function calling, using `tools` will force us to use only one of them.
We want to allow the model to make tool calls or respond in natural language at its own discretion.
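For reference, a request exposing tools in the OpenAI-style `tools` format might look like the sketch below. The `get_weather` function and its schema are hypothetical examples; only the field shapes follow the official OpenAI API:

```typescript
// Hypothetical tool declaration in the OpenAI `tools` format.
// Given these, the model may answer in natural language or emit a
// `tool_calls` entry referencing one of the declared functions.
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather", // hypothetical example function
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        properties: {
          city: { type: "string", description: "City name" },
        },
        required: ["city"],
      },
    },
  },
];

// Small helper: list the function names a request exposes to the model.
function toolNames(ts: typeof tools): string[] {
  return ts.map((t) => t.function.name);
}
```

In OpenAI-style usage, such an array would be passed alongside `messages` in a chat completion request; the single fixed schema is what constrains models like Llama 3.1 to one of their several native formats.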
F3: Use `BNFGrammar` to guarantee tool call generation correctness.
This requires the model to use a special token to signify the beginning of a function call (`<tool_call>` in the case of Hermes 2). Upon such a token being generated, we instantiate a `BNFGrammar` instance to constrain decoding. When the constrained generation ends, we force the model to emit `</tool_call>`. Before and after this tool call, the model can generate either natural language or other tool calls.
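The flow above can be sketched as a small pass over the model output that separates natural-language text from tool-call spans. This is a simplified illustration: in the real engine, the span between the delimiters would be produced under `BNFGrammar` constraints rather than merely collected after the fact:

```typescript
// Split a model output into natural-language text and tool-call
// payloads, using Hermes-2-style <tool_call>...</tool_call> delimiters.
type Segment =
  | { kind: "text"; content: string }
  | { kind: "tool_call"; content: string };

function splitToolCalls(output: string): Segment[] {
  const segments: Segment[] = [];
  const re = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  let last = 0;
  for (const m of output.matchAll(re)) {
    if (m.index! > last) {
      segments.push({ kind: "text", content: output.slice(last, m.index) });
    }
    // In the engine, this span would be generated under grammar
    // constraints, guaranteeing a well-formed tool-call payload.
    segments.push({ kind: "tool_call", content: m[1].trim() });
    last = m.index! + m[0].length;
  }
  if (last < output.length) {
    segments.push({ kind: "text", content: output.slice(last) });
  }
  return segments;
}
```

Note how text, tool calls, and further text can interleave freely, matching the requirement that the model may speak or call tools at its own discretion.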
Embedding, Multi-model Engine, Concurrency
For applications like RAG, two models are needed: one embedding model and one LLM. We'd like to hold all models in a single `MLCEngine` instead of instantiating multiple engines. This makes `MLCEngine` behave like an endpoint and offers the possibility of intra-engine optimizations in the future.
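As a rough sketch of why a RAG application wants both models behind one endpoint: the embedding model scores documents against the query, and the LLM then answers over the top hits. The retrieval step alone, with pre-computed embedding vectors (the vectors below are toy values, not real model output), looks like:

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the indices of the k document embeddings closest to the query.
// In the envisioned single-engine setup, both the query and document
// vectors would come from the engine's embedding model, and the top
// documents would be stuffed into the LLM prompt as context.
function topK(query: number[], docs: number[][], k: number): number[] {
  return docs
    .map((d, i) => ({ i, score: cosine(query, d) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((e) => e.i);
}
```

Holding both models in one `MLCEngine` means this embed-retrieve-generate loop needs no coordination between separate engine instances.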
Support manual function calling through `system`, `user`, `assistant`, and `tool` messages, without using the `tools` and `tool_calls` fields of the OpenAI API.
Support embeddings via `engine.embeddings.create()` (completed via [Embeddings][OpenAI] Support embeddings via engine.embeddings.create() #538, supported in npm 0.2.58).
CharlieFRuan changed the title from "[Tracking][WebLLM] Function calling and Embeddings" to "[Tracking][WebLLM] Function calling (beta) and Embeddings" on Aug 4, 2024.