
[OpenAI] Add include_usage in streaming, add prefill decode speed to usage #456

Merged: 4 commits from pr-0603-usage into mlc-ai:main on Jun 5, 2024

Conversation

@CharlieFRuan (Contributor) commented Jun 3, 2024

This change is not breaking. Behavior only changes when `stream_options: { include_usage: true }` is set for streaming, in which case we yield one more chunk with empty `choices` after the last content chunk.

We also include `prefill_tokens_per_s` and `decode_tokens_per_s` in `usage` for both streaming and non-streaming requests, replacing the previous use of `runtimeStatsText()`.
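
For reference, here is a minimal consumer-side sketch of the new behavior, using the OpenAI-style `engine.chat.completions.create()` streaming API this change targets. The engine-creation helper and model id below are illustrative placeholders rather than part of this PR, and the exact placement of the speed fields inside `usage` may differ by version.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Placeholder engine setup; any OpenAI-compatible web-llm engine behaves
// the same way with respect to this change.
const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC");

const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Tell me a short joke." }],
  stream: true,
  // Opt in to the extra final chunk introduced by this PR.
  stream_options: { include_usage: true },
});

let reply = "";
for await (const chunk of chunks) {
  // Regular chunks carry content deltas; the final chunk has empty
  // `choices` and carries `usage` instead.
  reply += chunk.choices[0]?.delta?.content ?? "";
  if (chunk.usage) {
    // Besides token counts, `usage` now reports prefill/decode speed
    // (prefill_tokens_per_s, decode_tokens_per_s).
    console.log(chunk.usage);
  }
}
console.log(reply);
```

Non-streaming requests are unaffected except that their `usage` now also carries the two speed fields.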

@CharlieFRuan marked this pull request as ready for review on June 4, 2024 03:47
@CharlieFRuan changed the title from "[OpenAI] Add usage to last chunk in streaming, add prefill and decode speed to usage" to "[OpenAI] Add stream_options and include_usage in streaming, add prefill decode speed to usage" on June 4, 2024
@CharlieFRuan changed the title from "[OpenAI] Add stream_options and include_usage in streaming, add prefill decode speed to usage" to "[OpenAI] Add include_usage in streaming, add prefill decode speed to usage" on June 4, 2024
@CharlieFRuan force-pushed the pr-0603-usage branch 2 times, most recently from 5dc369a to baf10fd on June 5, 2024 12:31
@CharlieFRuan merged commit ee2745d into mlc-ai:main on Jun 5, 2024 (1 check passed)
CharlieFRuan added a commit that referenced this pull request Jun 5, 2024
### Changes
- New models:
  - Mistral-7B-Instruct-v0.3-q4f16_1-MLC (we had v0.2 before)
  - Mistral-7B-Instruct-v0.3-q4f32_1-MLC
  - TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC (we had v0.4 before)
  - TinyLlama-1.1B-Chat-v1.0-q4f32_1-MLC (we had v0.4 before)
- #457
  - **[Breaking] Renamed `max_gen_len` to `max_tokens` in `ChatCompletionRequest`**
  - Remove usage of `mean_gen_len` and `shift_fill_factor`; throw an error when the request's prompt exceeds `contextWindowSize`
  - Terminate generation with the `"length"` stop reason when decoding exceeds `contextWindowSize`
- #456
  - Add `include_usage` in streaming and add prefill/decode speed to `usage`, replacing `runtimeStatsText()`
- #455
  - Allow overriding KVCache settings via `ModelRecord.overrides` or `chatOptions`
  - Sliding window can now be used on any model by specifying `sliding_window_size` and `attention_sink_size` (see the sketch after this list)
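
As a rough illustration of the #457 and #455 items above, here is a hedged sketch of the new request and override shapes. The type imports and the exact call site for passing the chat options (engine creation, `reload`, or `ModelRecord.overrides`) vary by web-llm version and are assumptions here; the field names themselves come from these notes.

```typescript
import type { ChatOptions, ChatCompletionRequest } from "@mlc-ai/web-llm";

// #455: KVCache overrides. These can be supplied via `ModelRecord.overrides`
// or as chat options when loading a model (exact call site depends on the
// web-llm version in use).
const chatOpts: ChatOptions = {
  sliding_window_size: 1024, // use a sliding-window KV cache on any model
  attention_sink_size: 4,
};

// #457 (breaking): `max_gen_len` is removed; use OpenAI-style `max_tokens`.
const request: ChatCompletionRequest = {
  messages: [{ role: "user", content: "Summarize the release notes." }],
  max_tokens: 256, // was `max_gen_len` before this release
};
```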

### TVMjs
Compiled at
apache/tvm@1400627
with no changes

### WASM version
No change -- 0.2.39
tqchen pushed a commit that referenced this pull request Jun 6, 2024
Recently we added `usage` to the last chunk in streaming when the user specifies `stream_options: { include_usage: true }` in the request, to be compatible with the latest OpenAI API; for more, see #456.

This PR updates our streaming examples to use `chunk.usage` instead of `runtimeStatsText()` (see the sketch after the list below). We expect to deprecate `runtimeStatsText()` in the future. Currently, only the low-level API `forwardTokenAndSample()` still needs it, since it does not go through the OpenAI API (e.g. `examples/logit-processor`).

The updated examples include:
- simple-chat-ts
- simple-chat-js
- next-simple-chat
- streaming (already updated)
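
The change in those examples looks roughly like the following hedged sketch; `engine` is assumed to be an already-loaded web-llm engine, and the old lines are shown as comments.

```typescript
// Before: fetch engine-level stats after the stream finished.
//   const statsText = await engine.runtimeStatsText();
//   console.log(statsText);

// After: request usage in the stream and read it from the final chunk.
const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
  stream_options: { include_usage: true },
});

let reply = "";
let usage;
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta?.content ?? "";
  if (chunk.usage) {
    usage = chunk.usage; // only present on the final, empty-choices chunk
  }
}
console.log(reply);
console.log(usage); // replaces the old runtimeStatsText() readout
```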