
[OpenAI] Add include_usage in streaming, add prefill decode speed to usage #456

Merged: 4 commits from pr-0603-usage into mlc-ai:main on Jun 5, 2024

Conversation

@CharlieFRuan (Contributor) commented Jun 3, 2024

This change is not breaking. Behavior only changes when `stream_options: { include_usage: true }` is set for streaming, in which case we yield one more chunk with empty `choices` after the last content chunk.

We also include `prefill_tokens_per_s` and `decode_tokens_per_s` in `usage` for both streaming and non-streaming requests, replacing the previous use of `runtimeStatsText()`.
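
For reference, here is a minimal consumer-side sketch of the new behavior, using the OpenAI-style `engine.chat.completions.create()` streaming API this change targets. The engine-creation helper and model id below are illustrative placeholders rather than part of this PR, and the exact placement of the speed fields inside `usage` may differ by version.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Placeholder engine setup; any OpenAI-compatible web-llm engine behaves
// the same way with respect to this change.
const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC");

const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Tell me a short joke." }],
  stream: true,
  // Opt in to the extra final chunk introduced by this PR.
  stream_options: { include_usage: true },
});

let reply = "";
for await (const chunk of chunks) {
  // Regular chunks carry content deltas; the final chunk has empty
  // `choices` and carries `usage` instead.
  reply += chunk.choices[0]?.delta?.content ?? "";
  if (chunk.usage) {
    // Besides token counts, `usage` now reports prefill/decode speed
    // (prefill_tokens_per_s, decode_tokens_per_s).
    console.log(chunk.usage);
  }
}
console.log(reply);
```

Non-streaming requests are unaffected except that their `usage` now also carries the two speed fields.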

@CharlieFRuan marked this pull request as ready for review on June 4, 2024 03:47
@CharlieFRuan changed the title from "[OpenAI] Add usage to last chunk in streaming, add prefill and decode speed to usage" to "[OpenAI] Add stream_options and include_usage in streaming, add prefill decode speed to usage" on June 4, 2024
@CharlieFRuan changed the title from "[OpenAI] Add stream_options and include_usage in streaming, add prefill decode speed to usage" to "[OpenAI] Add include_usage in streaming, add prefill decode speed to usage" on June 4, 2024
@CharlieFRuan force-pushed the pr-0603-usage branch 2 times, most recently from 5dc369a to baf10fd on June 5, 2024 12:31
@CharlieFRuan merged commit ee2745d into mlc-ai:main on Jun 5, 2024 (1 check passed)
CharlieFRuan added a commit that referenced this pull request Jun 5, 2024
### Changes
- New models:
  - Mistral-7B-Instruct-v0.3-q4f16_1-MLC (we had v0.2 before)
  - Mistral-7B-Instruct-v0.3-q4f32_1-MLC
  - TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC (we had v0.4 before)
  - TinyLlama-1.1B-Chat-v1.0-q4f32_1-MLC (we had v0.4 before)
- #457
  - **[Breaking] Renamed `max_gen_len` to `max_tokens` in `ChatCompletionRequest`**
  - Remove usage of `mean_gen_len` and `shift_fill_factor`; throw an error when the request's prompt exceeds `contextWindowSize`
  - Terminate generation with the `"length"` stop reason when decoding exceeds `contextWindowSize`
- #456
  - Add `include_usage` in streaming and add prefill/decode speed to `usage`, replacing `runtimeStatsText()`
- #455
  - Allow overriding KVCache settings via `ModelRecord.overrides` or `chatOptions`
  - Sliding window can now be used on any model by specifying `sliding_window_size` and `attention_sink_size` (see the sketch after this list)
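
As a rough illustration of the #457 and #455 items above, here is a hedged sketch of the new request and override shapes. The type imports and the exact call site for passing the chat options (engine creation, `reload`, or `ModelRecord.overrides`) vary by web-llm version and are assumptions here; the field names themselves come from these notes.

```typescript
import type { ChatOptions, ChatCompletionRequest } from "@mlc-ai/web-llm";

// #455: KVCache overrides. These can be supplied via `ModelRecord.overrides`
// or as chat options when loading a model (exact call site depends on the
// web-llm version in use).
const chatOpts: ChatOptions = {
  sliding_window_size: 1024, // use a sliding-window KV cache on any model
  attention_sink_size: 4,
};

// #457 (breaking): `max_gen_len` is removed; use OpenAI-style `max_tokens`.
const request: ChatCompletionRequest = {
  messages: [{ role: "user", content: "Summarize the release notes." }],
  max_tokens: 256, // was `max_gen_len` before this release
};
```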

### TVMjs
Compiled at
apache/tvm@1400627
with no changes

### WASM version
No change -- 0.2.39
tqchen pushed a commit that referenced this pull request Jun 6, 2024
Recently we added `usage` to the last chunk in streaming when the user specifies `stream_options: { include_usage: true }` in the request, to be compatible with the latest OpenAI API; for more, see #456.

This PR updates our streaming examples to use `chunk.usage` instead of `runtimeStatsText()` (see the sketch after the list below). We expect to deprecate `runtimeStatsText()` in the future. Currently, only the low-level API `forwardTokenAndSample()` still needs it, since it does not go through the OpenAI API (e.g. `examples/logit-processor`).

The updated examples include:
- simple-chat-ts
- simple-chat-js
- next-simple-chat
- streaming (already updated)
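
The change in those examples looks roughly like the following hedged sketch; `engine` is assumed to be an already-loaded web-llm engine, and the old lines are shown as comments.

```typescript
// Before: fetch engine-level stats after the stream finished.
//   const statsText = await engine.runtimeStatsText();
//   console.log(statsText);

// After: request usage in the stream and read it from the final chunk.
const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
  stream_options: { include_usage: true },
});

let reply = "";
let usage;
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta?.content ?? "";
  if (chunk.usage) {
    usage = chunk.usage; // only present on the final, empty-choices chunk
  }
}
console.log(reply);
console.log(usage); // replaces the old runtimeStatsText() readout
```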