Generate standalone UQFF models (#849)
* Add method for extracting residual tensors

* Adding some better serialization

* Full serialization

* Add support for mllama

* Save processor filename

* Add support for the vision models

* Add support to apis

* Fix some bugs

* Undo gemma weights

* Fixes for gemma and regexing

* Clippy

* Update docs

* Further update docs

* Ensure generation

* Typo
EricLBuehler authored Oct 15, 2024
1 parent eaaaa84 commit 8fa1a0c
Showing 53 changed files with 1,708 additions and 620 deletions.
36 changes: 18 additions & 18 deletions Cargo.lock

7 changes: 4 additions & 3 deletions README.md
@@ -28,6 +28,9 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis

*After following installation instructions*

- Check out UQFF for prequantized models covering a variety of quantization methods!
- Models can be found [here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).

- 🦙📷 Run the **Llama 3.2 Vision** Model: [documentation and guide here](docs/VLLAMA.md)

<img src="https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg" alt="Mount Washington" width = "400" height = "267">
@@ -106,7 +109,7 @@ Mistral.rs supports several model categories:
- [PagedAttention](docs/PAGED_ATTENTION.md) and continuous batching
- Prefix caching
- [Topology](docs/TOPOLOGY.md): Configure ISQ and device mapping easily
- [UQFF](docs/UQFF.md): Quantized file format for easy mixing of quants, see some [models](docs/UQFF.md#list-of-models) which have already been converted.
- [UQFF](docs/UQFF.md): Quantized file format for easy mixing of quants, [collection here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).
- Speculative Decoding: Mix supported models as the draft model or the target model
- Dynamic LoRA adapter activation with adapter preloading: [examples and docs](docs/ADAPTER_MODELS.md#adapter-model-dynamic-adapter-activation)

Expand Down Expand Up @@ -377,8 +380,6 @@ please consider using the method demonstrated in examples below, where the token

Mistral.rs uses subcommands to control the model type. They are generally of the format `<XLORA/LORA>-<QUANTIZATION>`. Please run `./mistralrs-server --help` to see the subcommands.

Additionally, for models without quantization, the model architecture should be provided as the `--arch` or `-a` argument in contrast to GGUF models which encode the architecture in the file.

### Architecture for plain models

> Note: for plain models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`plain`).
39 changes: 22 additions & 17 deletions docs/UQFF.md
@@ -56,22 +56,26 @@ The following quantization formats are supported in UQFF. One can, of course, be

## Loading a UQFF model

To load a UQFF model, one should specify the artifact path. This can either be a local path to a UQFF file, or a Hugging Face model ID with the format `<MODEL ID>/<FILE>`. For example, the following work:
To load a UQFF model, one should specify the filename. The file is located based on the model ID and can be loaded either locally or from Hugging Face.

- `EricB/Phi-3.5-mini-instruct-ISQ/phi3.5-mini-instruct-q4k.uqff`
- `phi3.5-mini-instruct-q4k.uqff`
- `../UQFF/phi3.5-mini-instruct-q4k.uqff`
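
For illustration, a minimal sketch of how such a filename could be resolved is shown below: if the value exists as a local path it is used directly, otherwise it is treated as a file in the Hugging Face repo named by the model ID. This is an assumption for illustration (using the `hf-hub` and `anyhow` crates), not the exact mistral.rs resolution logic.

```rust
use std::path::{Path, PathBuf};

/// Resolve a UQFF filename: prefer a local path, otherwise treat it as a file
/// hosted in the Hugging Face repository given by `model_id`.
/// NOTE: illustrative sketch only; mistral.rs's actual logic may differ.
fn resolve_uqff(model_id: &str, from_uqff: &str) -> anyhow::Result<PathBuf> {
    let local = Path::new(from_uqff);
    if local.exists() {
        // e.g. `../UQFF/phi3.5-mini-instruct-q4k.uqff`
        return Ok(local.to_path_buf());
    }
    // e.g. `phi3.5-mini-instruct-q4k.uqff` inside the repo given by the model ID
    let api = hf_hub::api::sync::Api::new()?;
    Ok(api.model(model_id.to_string()).get(from_uqff)?)
}
```

In the CLI and API examples below, only the filename (or a local path) is passed via `from_uqff`; the lookup itself is handled by mistral.rs.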

> Note: when loading a UQFF model, it will take precedence over any ISQ setting.
You can find a [collection of UQFF models here](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c), each of which includes a simple
command to get started.

> Note: when loading a UQFF model, *any* ISQ setting will be ignored.

### Running with the CLI

```
cargo run --features cuda -- -i plain -m microsoft/Phi-3.5-mini-instruct --from-uqff EricB/Phi-3.5-mini-instruct-ISQ/phi3.5-mini-instruct-q4k.uqff
./mistralrs-server -i plain -m EricB/Phi-3.5-mini-instruct-UQFF --from-uqff phi3.5-mini-instruct-f8e4m3.uqff
```

### Using with the Rust API

Modify the Normal or Vision config as follows:
Modify the Normal or Vision config as follows and update the model ID to point to a UQFF model:

```diff
NormalSpecificConfig {
@@ -81,7 +85,7 @@ NormalSpecificConfig {
organization: Default::default(),
write_uqff: None,
- from_uqff: None,
+ from_uqff: Some("EricB/Phi-3.5-mini-instruct-ISQ/phi3.5-mini-instruct-q4k.uqff".to_string()),
+ from_uqff: Some("phi3.5-mini-instruct-q4k.uqff".to_string()), // Pull from specified HF hub repo
}
```

@@ -92,16 +96,16 @@ VisionSpecificConfig {
topology: None,
write_uqff: None,
- from_uqff: None,
+ from_uqff: Some("../UQFF/phi3.5-mini-instruct-q4k.uqff".to_string()),
+ from_uqff: Some("../phi3.5-mini-instruct-q4k.uqff".to_string()), // Local path
}
```

### Using the Python API
Modify the `Which` instantiation as follows:
```diff
Which.Plain(
model_id="microsoft/Phi-3.5-mini-instruct",
+ from_uqff="EricB/Phi-3.5-mini-instruct-ISQ/phi3.5-mini-instruct-q4k.uqff"
model_id="EricB/Phi-3.5-mini-instruct-UQFF",
+ from_uqff="phi3.5-mini-instruct-q4k.uqff"
),
```

@@ -112,6 +116,11 @@ Creating a UQFF model requires you to generate the UQFF file.
- This means specifying a local path to a file ending in `.uqff`, where your new UQFF model will be created.
- The quantization of a UQFF model is determined from the ISQ or model topology (see the [topology docs](TOPOLOGY.md) for more details on how ISQ and the topology mix).

Along with the UQFF file, the generation process will also output several `.json` configuration files and `residual.safetensors`. Together, these files make up
the UQFF model, and they should be kept (and uploaded) together.

> Note: Only the `.uqff` files are unique to the quantization level(s). If you are generating multiple UQFF files, it is OK for the other (non-`.uqff`) files to be overwritten.

After creating the UQFF file, you can upload the model to Hugging Face. To do this:
1) [Create a new model](https://huggingface.co/docs/transformers/v4.17.0/en/create_a_model).
2) Upload the UQFF file:
@@ -123,7 +132,7 @@ After creating the UQFF file, you can upload the model to Hugging Face. To do th
### Creating with the CLI

```
cargo run --features cuda -- --isq Q4K -i plain -m microsoft/Phi-3.5-mini-instruct --write-uqff phi3.5-mini-instruct-q4k.uqff
./mistralrs-server --isq Q4K -i plain -m microsoft/Phi-3.5-mini-instruct --write-uqff phi3.5-mini-instruct-q4k.uqff
```

### Creating with the Rust API
@@ -154,7 +163,7 @@ VisionSpecificConfig {
```

### Creating with the Python API
Modify the `Which` instantiation as follows:
Modify the `Which` instantiation as follows. Be sure to add the `in_situ_quant` argument.
```diff
Which.Plain(
model_id="microsoft/Phi-3.5-mini-instruct",
@@ -173,10 +182,6 @@ After this, you can use Git to track, commit, and push files.

## List of models

Have you created a UQFF model on Hugging Face? If so, please [create an issue](https://github.com/EricLBuehler/mistral.rs/issues/new) and we will include it here!
You can find a list of models in the [Hugging Face model collection](https://huggingface.co/collections/EricB/uqff-670e4a49d56ecdd3f7f0fd4c).

| Name | Base model | UQFF model |
| -- | -- | -- |
| Phi 3.5 Mini Instruct | microsoft/Phi-3.5-mini-instruct | [EricB/Phi-3.5-mini-instruct-UQFF](EricB/Phi-3.5-mini-instruct-UQFF) |
| Llama 3.2 Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | [EricB/Llama-3.2-11B-Vision-Instruct-UQFF](https://huggingface.co/EricB/Llama-3.2-11B-Vision-Instruct-UQFF) |
| Mistral Nemo 2407 | mistralai/Mistral-Nemo-Instruct-2407 | [EricB/Mistral-Nemo-Instruct-2407-UQFF](https://huggingface.co/EricB/Mistral-Nemo-Instruct-2407-UQFF) |
Have you created a UQFF model on Hugging Face? If so, please [create an issue](https://github.com/EricLBuehler/mistral.rs/issues/new).
14 changes: 14 additions & 0 deletions docs/UQFF/LAYOUT.md
@@ -6,6 +6,7 @@ The following describes the exact memory layout of HQFF tensors of version 0.1.0
- [GGUF quantization](#gguf-quantization)
- [HQQ quantization](#hqq-quantization)
- [Unquantized layers](#unquantized-layers)
- [FP8 layers](#fp8-layers)
- [Standard tensors](#standard-tensors)


@@ -32,6 +33,19 @@ The following describes the exact memory layout of HQFF tensors of version 0.1.0
| **Array** Weight tensor data, see [docs](#standard-tensors) | See [docs](#standard-tensors) | See [docs](#standard-tensors) |
| **[Optional]** **Array** Bias tensor data, see [docs](#standard-tensors) | See [docs](#standard-tensors) | See [docs](#standard-tensors) |

## FP8 layers
| ID | Element type | Endianness |
| -------- | -------- | -------- |
| HQFF version | u32 | little endian |
| ISQ type (1) | u8 | little endian |
| Whether bias data is included (boolean) | u8 | little endian |
| **Array** Weight tensor data, see [docs](#standard-tensors) | See [docs](#standard-tensors) | See [docs](#standard-tensors) |
| Dequant W scalar | f32 | little endian |
| Dequant X scalar | f32 | little endian |
| Quant scalar | f32 | little endian |
| Quantization type | u32 | little endian |
| **[Optional]** **Array** Bias tensor data, see [docs](#standard-tensors) | See [docs](#standard-tensors) | See [docs](#standard-tensors) |
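
As a rough illustration of this layout (a sketch, not the mistral.rs serialization code), the scalar fields of an FP8 layer can be written in the order and endianness listed above. The struct and field names here are placeholders, and the weight/bias arrays, which use the standard-tensor layout, are only marked by comments:

```rust
use std::io::{self, Write};

/// Illustrative FP8-layer fields, mirroring the order in the table above.
/// Names and types here are placeholders, not mistral.rs definitions.
struct Fp8LayerHeader {
    hqff_version: u32, // HQFF version
    isq_type: u8,      // ISQ type (1)
    has_bias: bool,    // whether bias tensor data is included
    dequant_w: f32,    // dequant W scalar
    dequant_x: f32,    // dequant X scalar
    quant: f32,        // quant scalar
    quant_dtype: u32,  // quantization type
}

impl Fp8LayerHeader {
    fn write_to<W: Write>(&self, out: &mut W) -> io::Result<()> {
        // All multi-byte fields are written little endian, per the table.
        out.write_all(&self.hqff_version.to_le_bytes())?;
        out.write_all(&[self.isq_type])?;
        out.write_all(&[self.has_bias as u8])?;
        // ... weight tensor data (standard-tensor layout) goes here ...
        out.write_all(&self.dequant_w.to_le_bytes())?;
        out.write_all(&self.dequant_x.to_le_bytes())?;
        out.write_all(&self.quant.to_le_bytes())?;
        out.write_all(&self.quant_dtype.to_le_bytes())?;
        // ... optional bias tensor data (standard-tensor layout) follows when `has_bias` is true ...
        Ok(())
    }
}
```

A corresponding reader would simply consume the same fields in the same order.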

## HQQ quantization
| ID | Element type | Endianness |
| -------- | -------- | -------- |
68 changes: 63 additions & 5 deletions mistralrs-core/src/layers.rs
@@ -16,7 +16,7 @@ use candle_core::{
};
use candle_nn::{Linear, Module, VarBuilder};
use mistralrs_quant::QuantMethod;
use serde::Deserialize;
use serde::{Deserialize, Serialize};

pub use crate::attention::Sdpa;
pub use crate::layers_masker::CausalMasker;
@@ -49,9 +49,21 @@ impl RmsNorm {
        Ok(Self { eps, weight: w })
    }

    /// Gemma uses weight + 1.0. Undo for UQFF generation.
    pub fn undo_gemma(&self) -> Result<Self> {
        Ok(Self {
            eps: self.eps,
            weight: (&self.weight - 1.0)?,
        })
    }

    pub fn from_w(w: Tensor, eps: f64) -> Result<Self> {
        Ok(Self { eps, weight: w })
    }

    pub fn weight(&self) -> &Tensor {
        &self.weight
    }
}

impl Module for RmsNorm {
@@ -90,7 +102,8 @@ pub struct PhiRotaryEmbedding {
    original_max_position_embeddings: usize,
}

#[derive(Debug, Clone, Deserialize)]
#[derive(Debug, Clone, Deserialize, Serialize)]
#[serde(rename_all = "lowercase")]
pub enum ScaledRopeType {
    #[serde(alias = "su")]
    #[serde(alias = "longrope")]
@@ -112,7 +125,7 @@ impl FromStr for ScaledRopeType {
    }
}

#[derive(Debug, Clone, Deserialize)]
#[derive(Debug, Clone, Deserialize, Serialize)]
#[serde(untagged)]
pub enum PhiRopeScalingConfig {
    Classic {
@@ -393,7 +406,7 @@ pub enum Llama3RotaryEmbedding {
    Default(RotaryEmbedding),
}

#[derive(Debug, Clone, Deserialize, Default)]
#[derive(Debug, Clone, Deserialize, Serialize, Default)]
pub enum Llama3RopeType {
    #[serde(rename = "llama3")]
    Llama3,
@@ -402,7 +415,7 @@ pub enum Llama3RopeType {
    Default,
}

#[derive(Debug, Clone, Deserialize, Default)]
#[derive(Debug, Clone, Deserialize, Serialize, Default)]
pub struct Llama3RopeConfig {
    pub factor: f32,
    pub low_freq_factor: f32,
@@ -870,6 +883,51 @@ impl RotaryEmbedding {
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Deserialize, Serialize, Default)]
#[serde(rename_all = "lowercase")]
pub enum Activation {
    #[default]
    #[serde(alias = "gelu")]
    Gelu,
    #[serde(alias = "gelu_new")]
    NewGelu,
    Relu,
    Relu2,
    Relu6,
    Silu,
    Sigmoid,
    HardSigmoid,
    Swiglu,
    Swish,
    HardSwish,
    Elu(f64),
    LeakyRelu(f64),
    #[serde(alias = "gelu_pytorch_tanh")]
    GeluPytorchTanh,
}

impl Module for Activation {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        match self {
            Self::Gelu => xs.gelu_erf(),
            // https://github.com/huggingface/transformers/blob/12f043eaeaabfef6f6efea411d98e6f6d3c094b7/src/transformers/activations.py#L49-L78
            Self::NewGelu => xs.gelu(),
            Self::Relu => xs.relu(),
            Self::Relu2 => xs.relu()?.sqr(),
            Self::Relu6 => xs.clamp(0f32, 6f32),
            Self::Silu => xs.silu(),
            Self::Sigmoid => candle_nn::ops::sigmoid(xs),
            Self::HardSigmoid => candle_nn::ops::hard_sigmoid(xs),
            Self::Swiglu => candle_nn::ops::swiglu(xs),
            Self::Swish => xs * candle_nn::ops::sigmoid(xs)?,
            Self::HardSwish => xs * candle_nn::ops::hard_sigmoid(xs)?,
            &Self::Elu(alpha) => xs.elu(alpha),
            &Self::LeakyRelu(negative_slope) => candle_nn::ops::leaky_relu(xs, negative_slope),
            Self::GeluPytorchTanh => xs.gelu(),
        }
    }
}

mod tests {

    #[test]
