TGI: export model if configuration is cached (#445)
* feat(cache): use one registry per optimum version

* feat(registry): use model_type as primary key

This makes it possible to identify cached configurations that can be applied to models
that differ only by their weights, like meta-llama/Llama-2-7b-hf and
meta-llama/Llama-2-7b-chat-hf.
It also makes it possible to look up cached configurations for local model folders
containing a model config.

* doc(cache): fix image link

* doc(cache): add cache lookup

* refactor(decoder): add get_export_config helper

* feat(tgi): export model if cached

* review: addressing code comments

* wip

* review: address doc comments
dacorvo authored Jan 30, 2024
1 parent 0f7bf4a commit c114fc8
Showing 11 changed files with 386 additions and 115 deletions.
2 changes: 1 addition & 1 deletion docs/source/benchmarks/inferentia-llama2.mdx
@@ -48,7 +48,7 @@ while 768 is more typical of a Retrieval Augmented Generation (RAG) use-case.

Encoding time is expressed in **seconds**.

![Llama2 inferentia2 encoding-time](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2/encoding-times.png "Encoding time")
![Llama2 inferentia2 encoding-time](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2/encoding_times.png "Encoding time")

We can see that all deployed models exhibit excellent response times, even for long contexts.

108 changes: 92 additions & 16 deletions docs/source/guides/cache_system.mdx
@@ -13,35 +13,111 @@ specific language governing permissions and limitations under the License.
# Neuron Model Cache

The Neuron Model Cache is a remote cache for compiled Neuron models in the `neff` format.
It is integrated into the [`NeuronTrainer` and `NeuronModelForCausalLM`] classes to enable loading pretrained models from the cache instead of compiling them locally.
It is integrated into the `NeuronTrainer` and `NeuronModelForCausalLM` classes to enable loading pretrained models from the cache instead of compiling them locally.

**Note: it is not available for models exported using any other NeuronModelXX classes, which use a different export mechanism.**

The Neuron Model Cache is hosted on the [Hugging Face Hub](https://huggingface.co/aws-neuron/optimum-neuron-cache) and includes compiled files for all popular and supported `optimum-neuron` pre-trained models.

When loading a Transformers or Diffusion model, it needs to be compiled to neuron format with [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx),
in order to run on Neuron platforms.
The compilation produces several compilation files stored in a local directory, usually `/var/tmp/neuron-compile-cache`.
This means that every time you train or export a model on a new host, you need to recompile it, which takes a lot of time.
Before training a Transformers or Diffusion model or loading a `NeuronModelForCausalLM` on Neuron platforms, the model needs to be exported to neuron format
with [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx).

When exporting a model, [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx) will:

- convert it to a set of [XLA](https://github.com/pytorch/xla/) subgraphs,
- compile each subgraph with the neuronx compiler into a Neuron Executable File Format (NEFF) binary file.

The first step is relatively fast, but the compilation takes a lot of time.
To avoid recompiling all NEFF files every time a model is loaded on a NeuronX host, [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx)
stores NEFF files in a local directory, usually `/var/tmp/neuron-compile-cache`.
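
To see what has already been compiled on a given host, you can simply list that directory; a minimal sketch, assuming the default cache location mentioned above:

```python
from pathlib import Path

# Default local NeuronX compiler cache (the location can be overridden).
local_cache = Path("/var/tmp/neuron-compile-cache")

# Each compiled subgraph ends up as a NEFF binary somewhere under this tree.
neff_files = sorted(local_cache.rglob("*.neff"))
print(f"{len(neff_files)} NEFF files in the local compiler cache")
```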

However, this local cache is not shared between platforms, which means that every time you train or export a model on a new host, you need to recompile it.

We created the Neuron Model Cache to solve this limitation by providing a public repository of precompiled model graphs.

Note: we also support the creation of a private, secured, remote model cache.

We created the Neuron Model Cache to solve this limitation by providing a public cache of precompiled available models and a private cache to create your private, secured, remote model cache.
## How to use the Neuron model cache

## How the caching system works
The public model cache will be used when you use the `NeuronTrainer` or `NeuronModelForCausalLM` classes. There are no additional changes needed.

### Hash computation
When exporting a model to neuron format, `optimum-neuron` will simply look for cached NEFF files in the hub repository during the compilation of the
model subgraphs.

Many factors can trigger compilation among which:
If the NEFF files are cached, they will be fetched from the hub and directly loaded instead of being recompiled.
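
For example, exporting a decoder model with export parameters that match a cached configuration should be fast, because only the checkpoint weights need to be loaded and converted; a minimal sketch (the model id and parameter values are illustrative):

```python
from optimum.neuron import NeuronModelForCausalLM

# If these export parameters match a cached configuration, the NEFF files are
# fetched from the hub cache instead of being recompiled locally.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=24,
    auto_cast_type="fp16",
)
```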

- The input shapes,
- The precision of the model, full-precision or bf16,
## How caching works

The Optimum Neuron Cache is built on top of the [NeuronX compiler cache](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html).

It is important to understand that the cache operates on NEFF binaries, and not on the model itself.

As explained previously, each model exported to Neuron using the `NeuronTrainer` or `NeuronModelForCausalLM` is composed of [XLA](https://github.com/pytorch/xla/) subgraphs.

Each subgraph is unique, and results from the combination of:
- the `transformers` or `transformers_neuronx` python modeling code,
- the `transformers` model config,
- the `input_shapes` selected during the export,
- the precision of the model (full-precision, fp16 or bf16).

When compiling a subgraph to a NEFF file, other parameters influence the result:
- The version of the Neuron X compiler,
- The number of Neuron cores used.
- The number of Neuron cores used,
- The compilation parameters (such as the optimization level).

All these parameters are combined together to create a unique hash that identifies a NEFF file.

These parameters are used to compute a hash that uniquely identifies each compilation file.
This has two very important consequences:
- it is only when actually exporting a model that the associated NEFF files can be identified,
- even a small change in the model configuration will lead to a different set of NEFF files.

**It is important to keep in mind that even a small change in the model configuration will trigger a recompilation.**
It is therefore very difficult to know in advance if the NEFFs associated with a specific model configuration are cached.
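
To make the idea concrete, here is a purely illustrative sketch of how such a unique key could be derived; this is not the actual NeuronX compiler implementation:

```python
import hashlib
import json


def illustrative_neff_key(serialized_subgraph: bytes, compiler_version: str,
                          num_cores: int, compiler_flags: str) -> str:
    # Illustrative only: the real compiler cache derives its own key from the
    # serialized subgraph and the compilation options.
    payload = json.dumps(
        {
            "subgraph": hashlib.sha256(serialized_subgraph).hexdigest(),
            "compiler_version": compiler_version,
            "num_cores": num_cores,
            "flags": compiler_flags,
        },
        sort_keys=True,
    ).encode()
    return hashlib.sha256(payload).hexdigest()
```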

## Neuron model cache lookup (inferentia only)

The neuron cache lookup is a feature allowing users to look for compatible cached model configurations before exporting
a model for inference.

It is based on a dedicated registry composed of stored cached configurations.

Cached model configurations are stored as entries under a specific subfolder in the Neuron Model Cache:

```
neuronxcc-2.12.54.0+f631c2365
├── 0_REGISTRY
└── 0.0.18
└── llama
└── meta-llama
└── Llama-2-7b-chat-hf
└── 54c1f6689cd88f246fce.json
```

Each entry corresponds to the combination of a model configuration and its export parameters: this is as close as we can get to
uniquely identifying the exported model.
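
Based on the layout above, an entry path combines the compiler version, the `optimum-neuron` version, the model type and the model id; an illustrative sketch (the entry hash is copied from the example tree and is computed internally):

```python
# Illustrative only: composing a registry entry path from the example layout above.
compiler_version = "2.12.54.0+f631c2365"
optimum_version = "0.0.18"
model_type = "llama"
model_id = "meta-llama/Llama-2-7b-chat-hf"
entry_hash = "54c1f6689cd88f246fce"  # hypothetical, taken from the example tree

entry_path = (
    f"neuronxcc-{compiler_version}/0_REGISTRY/{optimum_version}/"
    f"{model_type}/{model_id}/{entry_hash}.json"
)
print(entry_path)
```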

You can use the `optimum-cli` to look up compatible cached entries by passing it a hub model_id or the path to a local model folder
containing a model `config.json`.

```shell
$ optimum-cli neuron cache lookup meta-llama/Llama-2-7b-chat-hf

*** 1 entrie(s) found in cache for meta-llama/Llama-2-7b-chat-hf ***

task: text-generation
batch_size: 1
num_cores: 24
auto_cast_type: fp16
sequence_length: 2048
compiler_type: neuronx-cc
compiler_version: 2.12.54.0+f631c2365
checkpoint_id: meta-llama/Llama-2-7b-chat-hf
checkpoint_revision: c1b0db933684edbfe29a06fa47eb19cc48025e93
```
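
The same lookup can also be done programmatically through the helper re-exported from `optimum.neuron.utils`; a hedged sketch, assuming the helper accepts a hub model id (the exact signature may vary between versions):

```python
from optimum.neuron.utils import get_hub_cached_entries

# Assumption: the helper takes a hub model id and returns the cached entries
# (the same key/value pairs as in the CLI output above).
entries = get_hub_cached_entries("meta-llama/Llama-2-7b-chat-hf")
for entry in entries:
    print(entry)
```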

### How to use the Neuron model cache
**Note that even when compatible cached entries exist, the model may still be recompiled during export
if you modify the compilation parameters or update the neuronx packages.**

The public model cache will be used when you use the [`NeuronTrainer` or `NeuronModelForCausalLM`] classes. There are no additional changes needed.
## Advanced usage (trainium only)

### How to use a private Neuron model cache (trainium only)

90 changes: 62 additions & 28 deletions optimum/neuron/modeling_decoder.py
@@ -14,6 +14,7 @@
# limitations under the License.
"""Base class for text-generation model architectures on neuron devices."""

import copy
import logging
import os
import shutil
@@ -28,7 +29,7 @@
from ..exporters.neuron.model_configs import * # noqa: F403
from ..exporters.tasks import TasksManager
from ..modeling_base import OptimizedModel
from .utils import CacheEntry, hub_neuronx_cache, is_transformers_neuronx_available
from .utils import ModelCacheEntry, hub_neuronx_cache, is_transformers_neuronx_available
from .utils.require_utils import requires_transformers_neuronx
from .utils.version_utils import check_compiler_compatibility, get_neuronxcc_version

@@ -126,7 +127,7 @@ def __init__(
os.environ["NEURON_CC_FLAGS"] = neuron_cc_flags + " --model-type=transformer"
checkpoint_id = neuron_config.get("checkpoint_id", None)
# Only create a cache entry if the model comes from the hub
cache_entry = None if checkpoint_id is None else CacheEntry(neuron_config["checkpoint_id"], neuron_config)
cache_entry = None if checkpoint_id is None else ModelCacheEntry(checkpoint_id, config)
with hub_neuronx_cache(entry=cache_entry):
neuronx_model.to_neuron()
os.environ["NEURON_CC_FLAGS"] = neuron_cc_flags
@@ -170,14 +171,7 @@ def _create_checkpoint(
return checkpoint_dir

@classmethod
@requires_transformers_neuronx
def _from_transformers(cls, *args, **kwargs):
# Deprecate it when optimum uses `_export` as from_pretrained_method in a stable release.
return cls._export(*args, **kwargs)

@classmethod
@requires_transformers_neuronx
def _export(
def get_export_config(
cls,
model_id: str,
config: "PretrainedConfig",
@@ -187,23 +181,11 @@ def _export(
batch_size: Optional[int] = None,
sequence_length: Optional[int] = None,
num_cores: Optional[int] = None,
auto_cast_type: Optional[str] = "fp32",
**kwargs,
) -> "NeuronDecoderModel":
if not os.path.isdir("/sys/class/neuron_device/"):
raise SystemError("Decoder models can only be exported on a neuron platform.")

auto_cast_type: Optional[str] = None,
) -> "PretrainedConfig":
if task is None:
task = TasksManager.infer_task_from_model(cls.auto_model_class)

# Instantiate the transformers model checkpoint
checkpoint_dir = cls._create_checkpoint(
model_id,
task=task,
revision=revision,
**kwargs,
)

if os.path.isdir(model_id):
checkpoint_id = None
checkpoint_revision = None
@@ -223,9 +205,15 @@
if num_cores is None:
# Use all available cores
num_cores = len(os.listdir("/sys/class/neuron_device/")) * 2

# Update the config
config.neuron = {
if auto_cast_type is None:
auto_cast_type = "fp32"
if config.torch_dtype == "float16":
auto_cast_type = "fp16"
elif config.torch_dtype == "bfloat16":
auto_cast_type = "bf16"

new_config = copy.deepcopy(config)
new_config.neuron = {
"task": task,
"batch_size": batch_size,
"num_cores": num_cores,
@@ -236,6 +224,52 @@
"checkpoint_id": checkpoint_id,
"checkpoint_revision": checkpoint_revision,
}
return new_config

@classmethod
@requires_transformers_neuronx
def _from_transformers(cls, *args, **kwargs):
# Deprecate it when optimum uses `_export` as from_pretrained_method in a stable release.
return cls._export(*args, **kwargs)

@classmethod
@requires_transformers_neuronx
def _export(
cls,
model_id: str,
config: "PretrainedConfig",
use_auth_token: Optional[str] = None,
revision: Optional[str] = None,
task: Optional[str] = None,
batch_size: Optional[int] = None,
sequence_length: Optional[int] = None,
num_cores: Optional[int] = None,
auto_cast_type: Optional[str] = "fp32",
**kwargs,
) -> "NeuronDecoderModel":
if not os.path.isdir("/sys/class/neuron_device/"):
raise SystemError("Decoder models can only be exported on a neuron platform.")

# Update the config
new_config = cls.get_export_config(
model_id,
config,
use_auth_token=use_auth_token,
revision=revision,
task=task,
batch_size=batch_size,
sequence_length=sequence_length,
num_cores=num_cores,
auto_cast_type=auto_cast_type,
)

# Instantiate the transformers model checkpoint
checkpoint_dir = cls._create_checkpoint(
model_id,
task=new_config.neuron["task"],
revision=revision,
**kwargs,
)

# Try to reload the generation config (if any)
generation_config = None
@@ -244,7 +278,7 @@ def _export(
except OSError:
pass

return cls(config, checkpoint_dir, generation_config=generation_config)
return cls(new_config, checkpoint_dir, generation_config=generation_config)

@classmethod
def _get_neuron_dirs(cls, model_path: Union[str, Path]) -> Tuple[str, str]:
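The new `get_export_config` helper makes it possible for a caller such as the TGI server to build the neuron export configuration up front and compare it against the cached entries before deciding to export. Below is a hedged, caller-side sketch of that logic; it is not code from this diff, and it assumes that cached entries are returned as dictionaries of export parameters and that `get_hub_cached_entries` accepts a hub model id.

```python
# Hedged caller-side sketch (not part of this diff).
from transformers import AutoConfig

from optimum.neuron import NeuronModelForCausalLM
from optimum.neuron.utils import get_hub_cached_entries

model_id = "meta-llama/Llama-2-7b-chat-hf"
config = AutoConfig.from_pretrained(model_id)

# Build the neuron export configuration without exporting anything yet.
export_config = NeuronModelForCausalLM.get_export_config(
    model_id,
    config,
    batch_size=1,
    sequence_length=2048,
    num_cores=24,
    auto_cast_type="fp16",
)
neuron_kwargs = export_config.neuron

# Assumption: cached entries are dictionaries of export parameters, as listed
# by `optimum-cli neuron cache lookup`.
entries = get_hub_cached_entries(model_id)
keys = ("batch_size", "sequence_length", "num_cores", "auto_cast_type")
is_cached = any(all(entry.get(k) == neuron_kwargs[k] for k in keys) for entry in entries)

if is_cached:
    # The export should then only fetch cached NEFF files instead of recompiling them.
    model = NeuronModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        **{k: neuron_kwargs[k] for k in keys},
    )
```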
2 changes: 1 addition & 1 deletion optimum/neuron/utils/__init__.py
@@ -24,7 +24,7 @@
ENCODER_NAME,
NEURON_FILE_NAME,
)
from .hub_neuronx_cache import CacheEntry, get_hub_cached_entries, hub_neuronx_cache, synchronize_hub_cache
from .hub_neuronx_cache import ModelCacheEntry, get_hub_cached_entries, hub_neuronx_cache, synchronize_hub_cache
from .import_utils import (
is_accelerate_available,
is_neuron_available,
