[OpenAI] Add include_usage in streaming, add prefill decode speed to usage #456 (Merged)
CharlieFRuan force-pushed the pr-0603-usage branch from dcadbde to a9a34cb on June 4, 2024 03:42
CharlieFRuan force-pushed the pr-0603-usage branch from a9a34cb to 4d340b1 on June 4, 2024 03:56
CharlieFRuan changed the title from "[OpenAI] Add usage to last chunk in streaming, add prefill and decode speed to usage" to "[OpenAI] Add stream_options and include_usage in streaming, add prefill decode speed to usage" on Jun 4, 2024
CharlieFRuan changed the title from "[OpenAI] Add stream_options and include_usage in streaming, add prefill decode speed to usage" to "[OpenAI] Add include_usage in streaming, add prefill decode speed to usage" on Jun 4, 2024
CharlieFRuan force-pushed the pr-0603-usage branch 2 times, most recently from 5dc369a to baf10fd on June 5, 2024 12:31
CharlieFRuan force-pushed the pr-0603-usage branch from baf10fd to 7340078 on June 5, 2024 12:32
CharlieFRuan added a commit that referenced this pull request on Jun 5, 2024:
### Changes
- New models:
  - Mistral-7B-Instruct-v0.3-q4f16_1-MLC (we had v0.2 before)
  - Mistral-7B-Instruct-v0.3-q4f32_1-MLC
  - TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC (we had v0.4 before)
  - TinyLlama-1.1B-Chat-v1.0-q4f32_1-MLC (we had v0.4 before)
- #457
  - **[Breaking] Renamed `max_gen_len` to `max_tokens` in `ChatCompletionRequest`**
  - Removed usage of `mean_gen_len` and `shift_fill_factor`; throw an error when the request's prompt exceeds `contextWindowSize`
  - Terminate generation with the `"length"` stop reason when decoding exceeds `contextWindowSize`
- #456
  - Add `include_usage` in streaming; add prefill/decode speed to `usage`, replacing `runtimeStatsText()`
- #455
  - Allow overriding KVCache settings via `ModelRecord.overrides` or `chatOptions`
  - Any model can now use a sliding window by specifying `sliding_window_size` and `attention_sink_size`

### TVMjs
Compiled at apache/tvm@1400627 with no changes

### WASM version
No change -- 0.2.39
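To illustrate the breaking rename and the new KV-cache overrides from the changelog above, here is a minimal sketch. The interface and option names are taken from the changelog, but the exact web-llm type definitions may differ; the shapes below are assumptions for illustration only.

```typescript
// Hypothetical sketch of a request after the breaking rename: the field is
// now `max_tokens` (it was `max_gen_len` before this release).
interface ChatCompletionRequestSketch {
  messages: { role: string; content: string }[];
  max_tokens?: number; // was `max_gen_len` (breaking rename)
  stream?: boolean;
}

const request: ChatCompletionRequestSketch = {
  messages: [{ role: "user", content: "Hello!" }],
  max_tokens: 256, // replaces the old `max_gen_len`
  stream: true,
};

// Per the changelog, sliding-window attention can now be enabled on any
// model by overriding KV-cache settings (e.g. via `chatOptions`).
// The surrounding option structure here is an assumption.
const chatOptionsSketch = {
  sliding_window_size: 1024,
  attention_sink_size: 4,
};
```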
tqchen pushed a commit that referenced this pull request on Jun 6, 2024:
Recently we added `usage` to the last chunk in streaming when the user specifies `streamOptions: { include_usage: True }` in the request, for compatibility with the latest OpenAI API; for more, see #456. This PR updates our streaming examples to use `chunk.usage` instead of `runtimeStatsText()`, which we expect to deprecate in the future. Currently, only the low-level API `forwardTokenAndSample()` still needs it, since it does not use the OpenAI API (e.g. `examples/logit-processor`). The updated examples are:
- simple-chat-ts
- simple-chat-js
- next-simple-chat
- streaming (already updated)
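The migration described in this commit can be sketched as follows. The chunk shape is an assumption based on this PR's description (a final chunk with empty `choices` carrying `usage`); the real web-llm types may differ, a real stream is async (simplified to a sync generator here), and the placement of the speed fields on `usage` is also an assumption.

```typescript
// Sketch: read `usage` from the final streaming chunk instead of calling
// runtimeStatsText(). Shapes are hypothetical, not the actual web-llm types.
interface ChunkSketch {
  choices: { delta?: { content?: string } }[];
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    prefill_tokens_per_s: number; // field placement is an assumption
    decode_tokens_per_s: number;
  };
}

function* fakeStream(): Generator<ChunkSketch> {
  yield { choices: [{ delta: { content: "Hi" } }] };
  yield { choices: [{ delta: { content: " there" } }] };
  // Extra final chunk: empty `choices`, carries `usage`.
  yield {
    choices: [],
    usage: {
      prompt_tokens: 5,
      completion_tokens: 2,
      prefill_tokens_per_s: 120,
      decode_tokens_per_s: 45,
    },
  };
}

let text = "";
let finalUsage: ChunkSketch["usage"] = undefined;
for (const chunk of fakeStream()) {
  text += chunk.choices[0]?.delta?.content ?? "";
  if (chunk.usage) finalUsage = chunk.usage; // replaces runtimeStatsText()
}
```

The consumer loop stays unchanged except for the last line: stats are picked up from the chunk that has `usage` set, rather than queried from the engine afterwards.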
This change is not breaking. Behavior only changes when `stream_options: { include_usage: True }` is set for streaming, during which we yield another chunk with empty `choices` after the last chunk. Also include `prefill_tokens_per_s` and `decode_tokens_per_s` for `usage` in both streaming and non-streaming, hence replacing the previous usage of `runtimeStatsText()`.
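The opt-in described above can be sketched on the request side as follows. The field names `stream_options` and `include_usage` come from this PR; the surrounding request shape and the exact layout of the final chunk are assumptions for illustration.

```typescript
// Hypothetical request that opts into usage reporting during streaming.
const streamingRequest = {
  messages: [{ role: "user", content: "Explain streaming." }],
  stream: true,
  stream_options: { include_usage: true }, // opt-in from this PR
};

// With include_usage set, one additional final chunk is yielded: its
// `choices` array is empty and it carries the `usage` object, including
// the new speed fields (exact placement is an assumption).
const finalChunkShape = {
  choices: [] as unknown[],
  usage: {
    prompt_tokens: 5,
    completion_tokens: 2,
    prefill_tokens_per_s: 120,
    decode_tokens_per_s: 45,
  },
};
```

Without `stream_options: { include_usage: true }`, no such extra chunk is produced, which is why the change is not breaking for existing consumers.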