From 8665b507e8e78e00aa54e20e98d78553f2061417 Mon Sep 17 00:00:00 2001
From: Kushal Agrawal <98145879+kushal34712@users.noreply.github.com>
Date: Sat, 5 Oct 2024 15:56:46 +0530
Subject: [PATCH] Update README.md

---
 README.md | 48 ++++++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 3b164840e..13c398290 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@

- mistral.rs
+ Mistral.rs

@@ -7,7 +7,7 @@ Blazingly fast LLM inference.

-| Rust Documentation | Python Documentation | Discord | Matrix |
+Rust Documentation | Python Documentation | Discord | Matrix

 Please submit requests for new models [here](https://github.com/EricLBuehler/mistral.rs/issues/156).
@@ -18,7 +18,7 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
 
 2) [Get models](#getting-models)
 
-3) Deploy with our easy to use APIs
+3) Deploy with our easy-to-use APIs
    - [Python](examples/python)
    - [Rust](mistralrs/examples)
    - [OpenAI compatible HTTP server](docs/HTTP.md)
@@ -41,7 +41,7 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
 ```
 ./mistralrs-server -i toml -f toml-selectors/anymoe_lora.toml
 ```
-- φ³ Run the new Phi 3.5/3.1/3 model with 128K context window
+- φ³ Run the new Phi 3.5/3.1/3 model with a 128K context window
 
 ```
 ./mistralrs-server -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3
@@ -76,7 +76,7 @@ Mistal.rs supports several model categories:
 ## Description
 
 **Easy**:
-- Lightweight OpenAI API compatible HTTP server
+- Lightweight OpenAI API-compatible HTTP server
 - Python API
 - Grammar support with Regex and Yacc
 - [ISQ](docs/ISQ.md) (In situ quantization): run `.safetensors` models directly from 🤗 Hugging Face by quantizing in-place
@@ -91,11 +91,11 @@ Mistal.rs supports several model categories:
 - [Details](docs/QUANTS.md)
 - GGML: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit, with ISQ support.
 - GPTQ: 2-bit, 3-bit, 4-bit and 8-bit
-- HQQ: 4-bit and 8 bit, with ISQ support
+- HQQ: 4-bit and 8-bit, with ISQ support
 
 **Powerful**:
 - LoRA support with weight merging
-- First X-LoRA inference platform with first class support
+- First X-LoRA inference platform with first-class support
 - [AnyMoE](docs/ANYMOE.md): Build a memory-efficient MoE model from anything, in seconds
 - Various [sampling and penalty](docs/SAMPLING.mds) methods
 - Tool calling: [docs](docs/TOOL_CALLING.md)
@@ -293,7 +293,7 @@ This is passed in the following ways:
 
 [Here](examples/python/token_source.py) is an example of setting the token source.
 
-If token cannot be loaded, no token will be used (i.e. effectively using `none`).
+If a token cannot be loaded, no token will be used (i.e. effectively using `none`).
 
 ### Loading models from local files:
@@ -321,7 +321,7 @@ Throughout mistral.rs, any model ID argument or option may be a local path and s
 
 ### Running GGUF models
 
-To run GGUF models, the only mandatory arguments are the quantized model ID and the quantized filename. The quantized model ID can be a HF model ID.
+To run GGUF models, the only mandatory arguments are the quantized model ID and the quantized filename. The quantized model ID can be an HF model ID.
 
 GGUF models contain a tokenizer. However, mistral.rs allows you to run the model with a tokenizer from a specified model, typically the official one. This means there are two options:
 1) [With a specified tokenizer](#with-a-specified-tokenizer)
@@ -339,7 +339,7 @@ If the specified tokenizer model ID contains a `tokenizer.json`, then it will be
 
 #### With the builtin tokenizer
 
-Using the builtin tokenizer:
+Using the built-in tokenizer:
 
 ```bash
 ./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf
@@ -357,7 +357,7 @@ There are a few more ways to configure:
 
 The chat template can be automatically detected and loaded from the GGUF file if no other chat template source is specified including the tokenizer model ID.
 
-If that does not work, you can either [provide a tokenizer](#with-a-specified-tokenizer) (recommended), or specify a custom chat template.
+If that does not work, you can either [provide a tokenizer](#with-a-specified-tokenizer) (recommended) or specify a custom chat template.
 
 ```bash
 ./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3.5-mini-instruct-Q4_K_M.gguf
@@ -366,10 +366,10 @@ If that does not work, you can either [provide a tokenizer](#with-a-specified-to
 
 **Tokenizer**
 
 The following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise,
-please consider using the method demonstrated in examples below, where the tokenizer is sourced from Hugging Face.
+please consider using the method demonstrated in the examples below, where the tokenizer is sourced from Hugging Face.
 
 **Supported GGUF tokenizer types**
-- `llama` (sentencepiece)
+- `llama` (SentencePiece)
 - `gpt2` (BPE)
 
 ## Run with the CLI
@@ -380,7 +380,7 @@ Additionally, for models without quantization, the model architecture should be
 
 ### Architecture for plain models
 
-> Note: for plain models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dype`/`-d` parameter after the model architecture (`plain`).
+> Note: for plain models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16`, or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`plain`).
 
 If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.
 
@@ -397,7 +397,7 @@ If you do not specify the architecture, an attempt will be made to use the model
 
 ### Architecture for vision models
 
-> Note: for vision models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dype`/`-d` parameter after the model architecture (`vision-plain`).
+> Note: for vision models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16`, or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`vision-plain`).
 
 - `phi3v`
 - `idefics2`
@@ -421,7 +421,7 @@ If you do not specify the architecture, an attempt will be made to use the model
 
 ### Interactive mode
 
-You can launch interactive mode, a simple chat application running in the terminal, by passing `-i`:
+You can launch interactive mode, a simple chat application running in the terminal, by passing the `-i` flag:
 
 ```bash
 ./mistralrs-server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
@@ -469,7 +469,7 @@ Example:
 
 > Note: All CUDA tests for mistral.rs conducted with PagedAttention enabled, block size = 32
 
-Please submit more benchmarks via raising an issue!
+Please submit more benchmarks by raising an issue!
 
 ## Supported models
@@ -539,21 +539,21 @@ Please submit more benchmarks via raising an issue!
 |Llama 3.2 Vision| |
 
-### Using derivative model
+### Using a derivative model
 
 To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass `--help` after the subcommand.
 
 For example, when using a different model than the default, specify the following for the following types of models:
 
 - **Plain**: Model id
-- **Quantized**: Quantized model id, quantized filename, and tokenizer id
+- **Quantized**: Quantized model ID, quantized filename, and tokenizer ID
 - **X-LoRA**: Model id, X-LoRA ordering
-- **X-LoRA quantized**: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
+- **X-LoRA quantized**: Quantized model ID, quantized filename, tokenizer ID, and X-LoRA ordering
 - **LoRA**: Model id, LoRA ordering
-- **LoRA quantized**: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
+- **LoRA quantized**: Quantized model ID, quantized filename, tokenizer ID, and LoRA ordering
 - **Vision Plain**: Model id
 
 See [this](#adapter-ordering-file) section to determine if it is necessary to prepare an X-LoRA/LoRA ordering file, it is always necessary if the target modules or architecture changed, or if the adapter order changed.
 
-It is also important to check the chat template style of the model. If the HF hub repo has a `tokenizer_config.json` file, it is not necessary to specify. Otherwise, templates can be found in `chat_templates` and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, no messages.
+It is also important to check the chat template style of the model. If the HF hub repo has a `tokenizer_config.json` file, it is not necessary to specify. Otherwise, templates can be found in `chat_templates` and should be passed before the subcommand. If the model is not instruction-tuned, no chat template will be found and the APIs will only accept a prompt, no messages.
 
 For example, when using a Zephyr model:
@@ -568,7 +568,7 @@ Mistral.rs will attempt to automatically load a chat template and tokenizer. Thi
 
 ## Contributing
 
-Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request.
+Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or open a pull request.
 If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
 
 ## FAQ
@@ -581,7 +581,7 @@ If you want to add a new model, please contact us via an issue and we can coordi
 - Error: `recompile with -fPIE`:
   - Some Linux distributions require compiling with `-fPIE`.
   - Set the `CUDA_NVCC_FLAGS` environment variable to `-fPIE` during build: `CUDA_NVCC_FLAGS=-fPIE`
-- Error `CUDA_ERROR_NOT_FOUND` or symbol not found when using a normal or vison model:
+- Error `CUDA_ERROR_NOT_FOUND` or symbol not found when using a normal or vision model:
   - For non-quantized models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device.
 
 ## Credits
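
For readers cross-checking the interactive-mode and data-type wording touched in the hunks above, a combined invocation might look like the sketch below. The `-i plain -m ... -a phi3` portion is the command already shown in the README; the `--dtype bf16` flag and its placement after the `plain` subcommand are assumptions inferred from the note about the data-type parameter, not something stated in the patch.

```bash
# Hedged sketch: interactive mode plus an explicit data type.
# The -i/plain/-m/-a portion is taken verbatim from the README; --dtype bf16 and its
# placement are assumptions based on the dtype note, so verify against the CLI help.
./mistralrs-server -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3 --dtype bf16
```

If a given build spells the flag differently, passing `--help` after the subcommand (as the README recommends) lists the accepted form.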