Generate standalone UQFF models (#849)
* Add method for extracting residual tensors

* Adding some better serialization

* Full serialization

* Add support for mllama

* Save processor filename

* Add support for the vision models

* Add support to apis

* Fix some bugs

* Undo gemma weights

* Fixes for gemma and regexing

* Clippy

* Update docs

* Further update docs

* Ensure generation

* Typo
EricLBuehler authored Oct 15, 2024
1 parent eaaaa84 commit 8fa1a0c
Showing 53 changed files with 1,708 additions and 620 deletions.
36 changes: 18 additions & 18 deletions Cargo.lock

7 changes: 4 additions & 3 deletions README.md
@@ -28,6 +28,9 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis

*After following installation instructions*

- Check out UQFF for prequantized models covering a variety of quantization methods!
- Models can be found [here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).

- 🦙📷 Run the **Llama 3.2 Vision** Model: [documentation and guide here](docs/VLLAMA.md)

<img src="https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg" alt="Mount Washington" width = "400" height = "267">
@@ -106,7 +109,7 @@ Mistral.rs supports several model categories:
- [PagedAttention](docs/PAGED_ATTENTION.md) and continuous batching
- Prefix caching
- [Topology](docs/TOPOLOGY.md): Configure ISQ and device mapping easily
- [UQFF](docs/UQFF.md): Quantized file format for easy mixing of quants, see some [models](docs/UQFF.md#list-of-models) which have already been converted.
- [UQFF](docs/UQFF.md): Quantized file format for easy mixing of quants, [collection here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).
- Speculative Decoding: Mix supported models as the draft model or the target model
- Dynamic LoRA adapter activation with adapter preloading: [examples and docs](docs/ADAPTER_MODELS.md#adapter-model-dynamic-adapter-activation)

Expand Down Expand Up @@ -377,8 +380,6 @@ please consider using the method demonstrated in examples below, where the token

Mistral.rs uses subcommands to control the model type. They are generally of the format `<XLORA/LORA>-<QUANTIZATION>`. Please run `./mistralrs-server --help` to see the subcommands.

Additionally, for models without quantization, the model architecture should be provided as the `--arch` or `-a` argument in contrast to GGUF models which encode the architecture in the file.

### Architecture for plain models

> Note: for plain models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`plain`).
39 changes: 22 additions & 17 deletions docs/UQFF.md
@@ -56,22 +56,26 @@ The following quantization formats are supported in UQFF. One can, of course, be

## Loading a UQFF model

To load a UQFF model, one should specify the artifact path. This can either be a local path to a UQFF file, or a Hugging Face model ID with the format `<MODEL ID>/<FILE>`. For example, the following work:
To load a UQFF model, one should specify the filename. The file is located based on the model ID and can be loaded either locally or from Hugging Face.

- `EricB/Phi-3.5-mini-instruct-ISQ/phi3.5-mini-instruct-q4k.uqff`
- `phi3.5-mini-instruct-q4k.uqff`
- `../UQFF/phi3.5-mini-instruct-q4k.uqff`
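
For illustration, a minimal sketch of how such a filename could be resolved is shown below: if the value exists as a local path it is used directly, otherwise it is treated as a file in the Hugging Face repo named by the model ID. This is an assumption for illustration (using the `hf-hub` and `anyhow` crates), not the exact mistral.rs resolution logic.

```rust
use std::path::{Path, PathBuf};

/// Resolve a UQFF filename: prefer a local path, otherwise treat it as a file
/// hosted in the Hugging Face repository given by `model_id`.
/// NOTE: illustrative sketch only; mistral.rs's actual logic may differ.
fn resolve_uqff(model_id: &str, from_uqff: &str) -> anyhow::Result<PathBuf> {
    let local = Path::new(from_uqff);
    if local.exists() {
        // e.g. `../UQFF/phi3.5-mini-instruct-q4k.uqff`
        return Ok(local.to_path_buf());
    }
    // e.g. `phi3.5-mini-instruct-q4k.uqff` inside the repo given by the model ID
    let api = hf_hub::api::sync::Api::new()?;
    Ok(api.model(model_id.to_string()).get(from_uqff)?)
}
```

In the CLI and API examples below, only the filename (or a local path) is passed via `from_uqff`; the lookup itself is handled by mistral.rs.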

> Note: when loading a UQFF model, it will take precedence over any ISQ setting.
You can find a [collection of UQFF models here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c), each of which includes a simple
command to get started.

> Note: when loading a UQFF model, *any* ISQ setting will be ignored.

### Running with the CLI

```
cargo run --features cuda -- -i plain -m microsoft/Phi-3.5-mini-instruct --from-uqff EricB/Phi-3.5-mini-instruct-ISQ/phi3.5-mini-instruct-q4k.uqff
./mistralrs-server -i plain -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-f8e4m3.uqff
```

### Using with the Rust API

Modify the Normal or Vision config as follows:
Modify the Normal or Vision config as follows and update the model ID to point to a UQFF model:

```diff
NormalSpecificConfig {
@@ -81,7 +85,7 @@ NormalSpecificConfig {
organization: Default::default(),
write_uqff: None,
- from_uqff: None,
+ from_uqff: Some("EricB/Phi-3.5-mini-instruct-ISQ/phi3.5-mini-instruct-q4k.uqff".to_string()),
+ from_uqff: Some("phi3.5-mini-instruct-q4k.uqff".to_string()), // Pull from specified HF hub repo
}
```

@@ -92,16 +96,16 @@ VisionSpecificConfig {
topology: None,
write_uqff: None,
- from_uqff: None,
+ from_uqff: Some("../UQFF/phi3.5-mini-instruct-q4k.uqff".to_string()),
+ from_uqff: Some("../phi3.5-mini-instruct-q4k.uqff".to_string()), // Local path
}
```

### Using the Python API
Modify the `Which` instantiation as follows:
```diff
Which.Plain(
model_id="microsoft/Phi-3.5-mini-instruct",
+ from_uqff="EricB/Phi-3.5-mini-instruct-ISQ/phi3.5-mini-instruct-q4k.uqff"
model_id="EricB/Phi-3.5-mini-instruct-UQFF",
+ from_uqff="phi3.5-mini-instruct-q4k.uqff"
),
```

@@ -112,6 +116,11 @@ Creating a UQFF model requires you to generate the UQFF file.
- This means specifying a local path to a file ending in `.uqff`, where your new UQFF model will be created.
- The quantization of a UQFF model is determined from the ISQ or model topology (see the [topology docs](TOPOLOGY.md) for more details on how ISQ and the topology mix).

Along with the UQFF file, the generation process will also output several `.json` configuration files and `residual.safetensors`. Together, these files make up
the UQFF model, and they should be kept (and uploaded) together.

> Note: Only the `.uqff` files are unique to the quantization level(s). If you are generating multiple UQFF files, it is OK for the other (non-`.uqff`) files to be overwritten.

After creating the UQFF file, you can upload the model to Hugging Face. To do this:
1) [Create a new model](https://huggingface.co/docs/transformers/v4.17.0/en/create_a_model).
2) Upload the UQFF file:
@@ -123,7 +132,7 @@ After creating the UQFF file, you can upload the model to Hugging Face. To do th
### Creating with the CLI

```
cargo run --features cuda -- --isq Q4K -i plain -m microsoft/Phi-3.5-mini-instruct --write-uqff phi3.5-mini-instruct-q4k.uqff
./mistralrs-server --isq Q4K -i plain -m microsoft/Phi-3.5-mini-instruct --write-uqff phi3.5-mini-instruct-q4k.uqff
```

### Creating with the Rust API
@@ -154,7 +163,7 @@ VisionSpecificConfig {
```

### Creating with the Python API
Modify the `Which` instantiation as follows:
Modify the `Which` instantiation as follows. Be sure to add the `in_situ_quant` argument.
```diff
Which.Plain(
model_id="microsoft/Phi-3.5-mini-instruct",
@@ -173,10 +182,6 @@ After this, you can use Git to track, commit, and push files.

## List of models

Have you created a UQFF model on Hugging Face? If so, please [create an issue](https://github.com/EricLBuehler/mistral.rs/issues/new) and we will include it here!
You can find a list of models in the [Hugging Face model collection](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).

| Name | Base model | UQFF model |
| -- | -- | -- |
| Phi 3.5 Mini Instruct | microsoft/Phi-3.5-mini-instruct | [EricB/Phi-3.5-mini-instruct-UQFF](EricB/Phi-3.5-mini-instruct-UQFF) |
| Llama 3.2 Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | [EricB/Llama-3.2-11B-Vision-Instruct-UQFF](https://huggingface.co/EricB/Llama-3.2-11B-Vision-Instruct-UQFF) |
| Mistral Nemo 2407 | mistralai/Mistral-Nemo-Instruct-2407 | [EricB/Mistral-Nemo-Instruct-2407-UQFF](https://huggingface.co/EricB/Mistral-Nemo-Instruct-2407-UQFF) |
Have you created a UQFF model on Hugging Face? If so, please [create an issue](https://github.com/EricLBuehler/mistral.rs/issues/new).
14 changes: 14 additions & 0 deletions docs/UQFF/LAYOUT.md
@@ -6,6 +6,7 @@ The following describes the exact memory layout of HQFF tensors of version 0.1.0
- [GGUF quantization](#gguf-quantization)
- [HQQ quantization](#hqq-quantization)
- [Unquantized layers](#unquantized-layers)
- [FP8 layers](#fp8-layers)
- [Standard tensors](#standard-tensors)


@@ -32,6 +33,19 @@ The following describes the exact memory layout of HQFF tensors of version 0.1.0
| **Array** Weight tensor data, see [docs](#standard-tensors) | See [docs](#standard-tensors) | See [docs](#standard-tensors) |
| **[Optional]** **Array** Bias tensor data, see [docs](#standard-tensors) | See [docs](#standard-tensors) | See [docs](#standard-tensors) |

## FP8 layers
| ID | Element type | Endianness |
| -------- | -------- | -------- |
| HQFF version | u32 | little endian |
| ISQ type (1) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| **Array** Weight tensor data, see [docs](#standard-tensors) | See [docs](#standard-tensors) | See [docs](#standard-tensors) |
| Dequant W scalar | f32 | little endian |
| Dequant X scalar | f32 | little endian |
| Quant scalar | f32 | little endian |
| Quantization type | u32 | little endian |
| **[Optional]** **Array** Bias tensor data, see [docs](#standard-tensors) | See [docs](#standard-tensors) | See [docs](#standard-tensors) |
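
As a rough illustration of this layout (a sketch, not the mistral.rs serialization code), the scalar fields of an FP8 layer can be written in the order and endianness listed above. The struct and field names here are placeholders, and the weight/bias arrays, which use the standard-tensor layout, are only marked by comments:

```rust
use std::io::{self, Write};

/// Illustrative FP8-layer fields, mirroring the order in the table above.
/// Names and types here are placeholders, not mistral.rs definitions.
struct Fp8LayerHeader {
    hqff_version: u32, // HQFF version
    isq_type: u8,      // ISQ type (1)
    has_bias: bool,    // whether bias tensor data is included
    dequant_w: f32,    // dequant W scalar
    dequant_x: f32,    // dequant X scalar
    quant: f32,        // quant scalar
    quant_dtype: u32,  // quantization type
}

impl Fp8LayerHeader {
    fn write_to<W: Write>(&self, out: &mut W) -> io::Result<()> {
        // All multi-byte fields are written little endian, per the table.
        out.write_all(&self.hqff_version.to_le_bytes())?;
        out.write_all(&[self.isq_type])?;
        out.write_all(&[self.has_bias as u8])?;
        // ... weight tensor data (standard-tensor layout) goes here ...
        out.write_all(&self.dequant_w.to_le_bytes())?;
        out.write_all(&self.dequant_x.to_le_bytes())?;
        out.write_all(&self.quant.to_le_bytes())?;
        out.write_all(&self.quant_dtype.to_le_bytes())?;
        // ... optional bias tensor data (standard-tensor layout) follows when `has_bias` is true ...
        Ok(())
    }
}
```

A corresponding reader would simply consume the same fields in the same order.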

## HQQ quantization
| ID | Element type | Endianness |
| -------- | -------- | -------- |
68 changes: 63 additions & 5 deletions mistralrs-core/src/layers.rs
@@ -16,7 +16,7 @@ use candle_core::{
};
use candle_nn::{Linear, Module, VarBuilder};
use mistralrs_quant::QuantMethod;
use serde::Deserialize;
use serde::{Deserialize, Serialize};

pub use crate::attention::Sdpa;
pub use crate::layers_masker::CausalMasker;
@@ -49,9 +49,21 @@ impl RmsNorm {
        Ok(Self { eps, weight: w })
    }

    /// Gemma uses weight + 1.0. Undo for UQFF generation.
    pub fn undo_gemma(&self) -> Result<Self> {
        Ok(Self {
            eps: self.eps,
            weight: (&self.weight - 1.0)?,
        })
    }

    pub fn from_w(w: Tensor, eps: f64) -> Result<Self> {
        Ok(Self { eps, weight: w })
    }

    pub fn weight(&self) -> &Tensor {
        &self.weight
    }
}

impl Module for RmsNorm {
@@ -90,7 +102,8 @@ pub struct PhiRotaryEmbedding {
    original_max_position_embeddings: usize,
}

#[derive(Debug, Clone, Deserialize)]
#[derive(Debug, Clone, Deserialize, Serialize)]
#[serde(rename_all = "lowercase")]
pub enum ScaledRopeType {
    #[serde(alias = "su")]
    #[serde(alias = "longrope")]
@@ -112,7 +125,7 @@ impl FromStr for ScaledRopeType {
    }
}

#[derive(Debug, Clone, Deserialize)]
#[derive(Debug, Clone, Deserialize, Serialize)]
#[serde(untagged)]
pub enum PhiRopeScalingConfig {
    Classic {
@@ -393,7 +406,7 @@ pub enum Llama3RotaryEmbedding {
    Default(RotaryEmbedding),
}

#[derive(Debug, Clone, Deserialize, Default)]
#[derive(Debug, Clone, Deserialize, Serialize, Default)]
pub enum Llama3RopeType {
    #[serde(rename = "llama3")]
    Llama3,
@@ -402,7 +415,7 @@ pub enum Llama3RopeType {
    Default,
}

#[derive(Debug, Clone, Deserialize, Default)]
#[derive(Debug, Clone, Deserialize, Serialize, Default)]
pub struct Llama3RopeConfig {
    pub factor: f32,
    pub low_freq_factor: f32,
@@ -870,6 +883,51 @@ impl RotaryEmbedding {
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Deserialize, Serialize, Default)]
#[serde(rename_all = "lowercase")]
pub enum Activation {
    #[default]
    #[serde(alias = "gelu")]
    Gelu,
    #[serde(alias = "gelu_new")]
    NewGelu,
    Relu,
    Relu2,
    Relu6,
    Silu,
    Sigmoid,
    HardSigmoid,
    Swiglu,
    Swish,
    HardSwish,
    Elu(f64),
    LeakyRelu(f64),
    #[serde(alias = "gelu_pytorch_tanh")]
    GeluPytorchTanh,
}

impl Module for Activation {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        match self {
            Self::Gelu => xs.gelu_erf(),
            // https://github.com/huggingface/transformers/blob/12f043eaeaabfef6f6efea411d98e6f6d3c094b7/src/transformers/activations.py#L49-L78
            Self::NewGelu => xs.gelu(),
            Self::Relu => xs.relu(),
            Self::Relu2 => xs.relu()?.sqr(),
            Self::Relu6 => xs.clamp(0f32, 6f32),
            Self::Silu => xs.silu(),
            Self::Sigmoid => candle_nn::ops::sigmoid(xs),
            Self::HardSigmoid => candle_nn::ops::hard_sigmoid(xs),
            Self::Swiglu => candle_nn::ops::swiglu(xs),
            Self::Swish => xs * candle_nn::ops::sigmoid(xs)?,
            Self::HardSwish => xs * candle_nn::ops::hard_sigmoid(xs)?,
            &Self::Elu(alpha) => xs.elu(alpha),
            &Self::LeakyRelu(negative_slope) => candle_nn::ops::leaky_relu(xs, negative_slope),
            Self::GeluPytorchTanh => xs.gelu(),
        }
    }
}

mod tests {

    #[test]
