Finetune Hydra #797

Merged · 19 commits merged into main from rgao_finetune_hydra · Aug 14, 2024

Conversation

@rayg1234 (Collaborator) commented Aug 7, 2024


Design doc

Description

This PR creates a FineTuneHydra model that allows users to finetune entire models or to easily replace heads and finetune. The main concept is to treat a finetune job as starting a brand-new training job, so we retain the functionality of --mode=train. We also do not use the --checkpoint option, as that would indicate resuming from a checkpoint rather than starting a new training job.

To finetune, the user replaces the model component of a fairchem config with that of a FineTuneHydra model. A starting_checkpoint is supplied to tell FineTuneHydra to start from the initial model and weights of the given checkpoint.

To allow finetuning from hydra models, we initially support two modes (a sketch of how each mode might apply checkpoint weights follows the configs below):
DATA_ONLY: does not change the model; loads all previous weights and finetunes only on new data

model:
  name: finetune_hydra
  finetune_config:
    mode: DATA_ONLY
    starting_checkpoint: "./checkpoints/2024-08-07-20-20-16-test/checkpoint.pt"

RETAIN_BACKBONE_ONLY: loads only the backbone and requires the user to specify new heads

model:
  name: finetune_hydra
  finetune_config:
    mode: RETAIN_BACKBONE_ONLY
    starting_checkpoint: "./checkpoints/2024-08-07-20-20-16-test/checkpoint.pt"
    heads:
      oc22_energy:
        module: equiformer_v2_energy_head
      oc22_forces:
        module: equiformer_v2_force_head
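
Conceptually, the two modes differ only in how much of the checkpoint's state dict is restored. The sketch below illustrates that logic under stated assumptions; it is not the actual fairchem implementation, and build_model as well as the "backbone." key prefix are hypothetical placeholders.

import torch

def build_model(model_config: dict):
    # Hypothetical stand-in for fairchem's model construction; in practice
    # this would dispatch on model_config["name"] via the model registry.
    raise NotImplementedError

def build_finetune_model(finetune_config: dict):
    # Load the starting checkpoint on CPU (illustrative sketch only).
    checkpoint = torch.load(finetune_config["starting_checkpoint"], map_location="cpu")
    mode = finetune_config["mode"]

    if mode == "DATA_ONLY":
        # Rebuild the original hydra model and restore every weight;
        # only the training data differs from the original run.
        model = build_model(checkpoint["config"]["model"])
        model.load_state_dict(checkpoint["state_dict"])
    elif mode == "RETAIN_BACKBONE_ONLY":
        # Swap in the user-specified heads, then restore backbone weights only.
        model_config = dict(checkpoint["config"]["model"], heads=finetune_config["heads"])
        model = build_model(model_config)
        backbone_state = {
            k: v
            for k, v in checkpoint["state_dict"].items()
            if k.startswith("backbone.")  # assumed parameter-name layout
        }
        model.load_state_dict(backbone_state, strict=False)
    else:
        raise ValueError(f"Unknown finetune mode: {mode}")
    return model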

Example workflow:

  1. Train the original oc20 model:
  • fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/equiformer_v2_oc20.yml --optim.batch_size=1 --amp --num-gpus=1 --optim.eval_every=100 --distributed
  2. Finetune the oc20 model on oc22 data:
  • create a finetune config yml with starting_checkpoint=<checkpoint from oc20 run>
  • fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/finetune_on_oc22.yml --optim.batch_size=1 --num-gpus=1 --optim.eval_every=100
    NOTE: --checkpoint is not given on the command line here because we are starting a brand-new training run, not resuming from a previous state.
  3. Resume the training run from step 2:
  • fairchem --mode train --identifier test --config-yml configs/s2ef/all_md/equiformer_v2/finetune_on_oc22.yml --optim.batch_size=1 --num-gpus=1 --optim.eval_every=100 --checkpoint "./checkpoints/2024-08-07-23-34-24-test/checkpoint.pt"
  4. Finetune on another dataset from the checkpoint of step 2:
  • create another finetune config yml with starting_checkpoint=<checkpoint from oc22 finetune run>

Not supported in this PR (but planned as a follow-up):

  • A general finetune mode where heads can be partially retained and used as input

Other notable changes

  • Removed the mutation of the model config (model -> model_attributes) in base_trainer. This only affects downstream applications that assume "model_attributes" is present in the checkpoint's config, and I have not found any hard dependencies on it; a migration sketch follows below.
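
For downstream code that previously read "model_attributes", a minimal backward-compatible lookup might look like the following; the key names come from the description above, and the snippet is otherwise illustrative only.

import torch

# Illustrative only: prefer the new "model" key, falling back to
# "model_attributes" for checkpoints written before this change.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
config = checkpoint["config"]
model_config = config.get("model", config.get("model_attributes"))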

TODO:

  • add configs
  • add tests

Test Plan

Sanity checks

  • Run finetuning on oc22 (on top of oc20) with DATA_ONLY
  • Run finetuning on oc22 with new force/energy heads
  • Run finetuning on oc22 with a new single force head
  • Resume an interrupted finetuning run
  • Run finetuning a second time from the finetuned model
  • Train a new oc20 base 31M model on the cluster
  • Finetune oc22 on the fully trained oc20 model on the cluster

Tests:
pytest tests/core/e2e/test_e2e_finetune_hydra.py

@rayg1234 rayg1234 marked this pull request as ready for review August 8, 2024 17:22

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 95.10490% with 7 lines in your changes missing coverage. Please review.

Files | Patch % | Lines missing
src/fairchem/core/models/finetune_hydra.py | 96.19% | 4 ⚠️
src/fairchem/core/models/base.py | 87.50% | 2 ⚠️
src/fairchem/core/common/utils.py | 90.00% | 1 ⚠️

Files | Coverage Δ
src/fairchem/core/trainers/base_trainer.py | 89.76% <100.00%> (+0.20%) ⬆️
src/fairchem/core/common/utils.py | 66.10% <90.00%> (+0.36%) ⬆️
src/fairchem/core/models/base.py | 86.88% <87.50%> (-1.08%) ⬇️
src/fairchem/core/models/finetune_hydra.py | 96.19% <96.19%> (ø)

... and 8 files with indirect coverage changes

@rayg1234 rayg1234 requested review from misko and wood-b August 10, 2024 05:48
misko previously approved these changes Aug 12, 2024

@misko (Collaborator) left a comment:

This looks great! I'm happy with it as is. A few tiny nits, feel free to ignore!
LGTM!

@wood-b (Collaborator) left a comment:

Looking good! I added a few small comments. Also, do we want to include fine-tuning configs in this PR? I'm assuming we don't know how well those work yet (especially on the MD+all checkpoint; it would be easier to have configs for a 2M model), and others might assume they work well.

@rayg1234 rayg1234 added the minor (Minor version release) and enhancement (New feature or request) labels Aug 13, 2024
@lbluque (Collaborator) left a comment:

Thanks @rayg1234! Just a few small suggestions in file comments.

@rayg1234 rayg1234 added this pull request to the merge queue Aug 13, 2024
@wood-b (Collaborator) left a comment:

LGTM!

Merged via the queue into main with commit 8fb16d6 Aug 14, 2024
12 checks passed
@rayg1234 rayg1234 deleted the rgao_finetune_hydra branch August 14, 2024 00:07