[Pre Neuron Inf Cache system]Support neff/weights decoupling #402

Merged
JingyaHuang merged 24 commits into main from decouple-weight-graph
Jan 30, 2024

Conversation

JingyaHuang
Collaborator

@JingyaHuang JingyaHuang commented Jan 9, 2024

With the inline_weights_to_neff argument, we can now decouple the weights from the neff graph during compilation. This means we can set up a caching system that reuses shared compiled neff graphs and loads them with weights from different checkpoints at inference time. This could save most of the compilation time, which can take up to hours.

  • Support inline_weights_to_neff in the Neuron exporter
  • Add a replace-weights function that works for Optimum-wrapped models
  • Support replacing weights in the modeling
  • Tests
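
To illustrate the idea behind decoupling, here is a minimal, self-contained sketch in plain Python. It is not the code from this PR and does not call the Neuron SDK: `CompiledGraph`, `replace_weights`, and `numel` are hypothetical names used only for illustration. The assumption is that a compiled graph only fixes parameter names and shapes, so any checkpoint matching that signature can be swapped in without recompiling.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical stand-ins: none of these names come from optimum-neuron or the Neuron SDK.
Shape = Tuple[int, ...]


@dataclass
class CompiledGraph:
    """Toy stand-in for a compiled neff whose weights are NOT inlined."""

    # Parameter names and shapes the graph expects; fixed at compile time.
    signature: Dict[str, Shape]
    # Weights live outside the graph and can be swapped after compilation.
    weights: Dict[str, List[float]] = field(default_factory=dict)


def numel(shape: Shape) -> int:
    """Number of elements in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def replace_weights(graph: CompiledGraph, new_weights: Dict[str, List[float]]) -> None:
    """Swap in flattened weights from another checkpoint without recompiling."""
    if set(new_weights) != set(graph.signature):
        raise ValueError("checkpoint parameter names do not match the compiled graph")
    for name, values in new_weights.items():
        if len(values) != numel(graph.signature[name]):
            raise ValueError(f"wrong number of elements for {name!r}")
    graph.weights = {name: list(values) for name, values in new_weights.items()}
```

In this sketch a checkpoint with mismatched names or sizes is rejected, which is the case where a real system would have to recompile instead.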

[Next step]
Initial caching for inference of encoder models

  • Set up hashing for inference
  • Set up a caching mechanism based on the current caching of models with independent neff files (optimum-cli for creating a public/private cache)
  • Check the cache and replace weights in the modeling from_pretrained when export=True and a cache exists.
    • Exists (local / remote) -> load the precompiled neff + replace weights with the current checkpoint
    • Does not exist -> compile from scratch and optionally cache
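
The lookup flow in the list above could be sketched roughly as follows. This is an assumption-laden illustration, not the planned implementation: `config_hash`, `load_or_compile`, the dict-based `cache`, and the `compile_fn` callback are all hypothetical, and the real cache would live on disk or on a remote hub. The key point it shows is that the hash covers only what determines the graph (for example architecture, input shapes, compiler arguments), never the weights.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def config_hash(compile_config: Dict[str, Any]) -> str:
    """Hash only what shapes the graph; weights are deliberately excluded."""
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps(compile_config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def load_or_compile(
    compile_config: Dict[str, Any],
    checkpoint_weights: Dict[str, Any],
    cache: Dict[str, Any],
    compile_fn: Callable[[Dict[str, Any]], Any],
) -> Dict[str, Any]:
    """Reuse a cached graph when one exists, otherwise compile and cache it."""
    key = config_hash(compile_config)
    graph = cache.get(key)
    cache_hit = graph is not None
    if not cache_hit:
        # Cache miss: compile from scratch (the slow path) and store the result.
        graph = compile_fn(compile_config)
        cache[key] = graph
    # In both cases the graph is paired with the weights of the current checkpoint.
    return {"graph": graph, "weights": checkpoint_weights, "cache_hit": cache_hit}
```

Two checkpoints that share the same compile configuration map to the same key, so only the first one pays the compilation cost; changing any field of the configuration, such as the sequence length, produces a new key and a fresh compilation.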

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@JingyaHuang JingyaHuang changed the title from "[Neuron Inf Cache system]Support neff/weights decoupling + setup initial caching for inference" to "[Neuron Inf Cache system]Support neff/weights decoupling" Jan 24, 2024
@JingyaHuang JingyaHuang changed the title from "[Neuron Inf Cache system]Support neff/weights decoupling" to "[Pre Neuron Inf Cache system]Support neff/weights decoupling" Jan 25, 2024
@JingyaHuang JingyaHuang marked this pull request as ready for review January 27, 2024 17:28
optimum/commands/export/neuronx.py (resolved)
optimum/neuron/modeling_base.py (outdated, resolved)
optimum/neuron/utils/misc.py (outdated, resolved)
JingyaHuang and others added 2 commits January 29, 2024 12:11
Collaborator

@dacorvo dacorvo left a comment

LGTM, but you have a typo in your tests.

tests/inference/test_modeling.py (outdated, resolved)
@JingyaHuang JingyaHuang merged commit de5752d into main Jan 30, 2024
8 checks passed
@JingyaHuang JingyaHuang deleted the decouple-weight-graph branch January 30, 2024 16:39
4 participants