This repository has been archived by the owner on May 28, 2024. It is now read-only.

Commit

Update models/README.md
Co-authored-by: shrekris-anyscale <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
sihanwang41 and shrekris-anyscale authored Jan 8, 2024
1 parent b2f71dd commit ae1c87a
Showing 1 changed file with 1 addition and 1 deletion.
models/README.md: 2 changes (1 addition, 1 deletion)
@@ -51,7 +51,7 @@ RayLLM supports continuous batching, meaning incoming requests are processed as
* `scheduler_policy` selects the scheduler policy, either `max_utilization` or `guaranteed_no_evict`.
(`MAX_UTILIZATION` packs as many requests as the underlying TRT engine can support in any iteration of the InflightBatching generation loop. While this is expected to maximize GPU throughput, it might require that some requests be paused and restarted depending on peak KV cache memory availability.
`GUARANTEED_NO_EVICT` uses KV cache more conservatively and guarantees that a request, once started, runs to completion without eviction.)
-* `logger_level` sets the log level for the TensorRT-LLM engine ("INFO", "ERROR", "VERBOSE", "WARNING").
+* `logger_level` sets the log level for the TensorRT-LLM engine ("VERBOSE", "INFO", "WARNING", "ERROR").
* `max_num_sequences` is the maximum number of requests/sequences the backend can maintain state for.
* `max_tokens_in_paged_kv_cache` sets the maximum number of tokens held in the paged KV cache.
* `kv_cache_free_gpu_mem_fraction` sets the fraction of free GPU memory that can be allocated to the KV cache (see the config sketch below).
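
The field names above come straight from this README; how they are nested in a model's YAML config is not shown in this hunk. Below is a minimal, hypothetical sketch assuming the options sit under a `trtllm_args` block inside an `engine_config` section; the surrounding keys and every value are illustrative assumptions, not the verified RayLLM schema.

```yaml
# Hypothetical sketch only: the nesting (engine_config / trtllm_args) and all
# values are assumptions for illustration, not the verified RayLLM schema.
engine_config:
  model_id: meta-llama/Llama-2-7b-chat-hf      # assumed example model
  trtllm_args:
    scheduler_policy: max_utilization          # or guaranteed_no_evict
    logger_level: INFO                         # one of VERBOSE, INFO, WARNING, ERROR
    max_num_sequences: 64                      # max requests the backend tracks state for
    max_tokens_in_paged_kv_cache: 16384        # cap on tokens in the paged KV cache
    kv_cache_free_gpu_mem_fraction: 0.9        # share of free GPU memory given to the KV cache
```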
