Added option for continue training from checkpoint #49

PicoCentauri · 2024-02-05T15:18:55Z

📚 Documentation preview 📚: https://metatensor-models--49.org.readthedocs.build/en/49/

frostedoyster · 2024-02-06T13:13:44Z

Here is the finalized continuation of training for the SOAP-BPNN.
This not only allows continuing training on the same dataset, but also allows "fine-tuning" a pre-trained model on a new dataset (this is tested and it works).
When continuing training with a new dataset, we add new capabilities, but we require the species of the new dataset to be a subset of the original species. There are a few ways to get around this in the future, and I will open an issue for it.
The composition weights are only recalculated for new targets (this should be a warning in the docs, but we can't add it before we have actual usage docs for the SOAP-BPNN). I'll open an issue for this as well.
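
A minimal sketch of the two rules described above, written with plain Python sets and lists. The helper `check_continuation` and its arguments are hypothetical and are not taken from the metatensor-models code base; it only illustrates the subset requirement on the species and which targets would count as new (and would therefore need composition weights).

```python
from typing import Iterable, List


def check_continuation(
    old_species: Iterable[int],
    new_species: Iterable[int],
    old_targets: Iterable[str],
    new_targets: Iterable[str],
) -> List[str]:
    """Validate a new dataset against a pre-trained model (hypothetical helper).

    Returns the targets the pre-trained model does not know yet, i.e. the
    only ones for which composition weights would have to be calculated.
    """
    # the species of the new dataset must be a subset of the original species
    unknown = set(new_species) - set(old_species)
    if unknown:
        raise ValueError(
            f"species {sorted(unknown)} are not supported by the original model"
        )

    known = set(old_targets)
    # keep the order of the new dataset, drop targets the model already has
    return [target for target in new_targets if target not in known]


if __name__ == "__main__":
    # atomic numbers and target names are illustrative values only
    print(check_continuation([1, 6, 7, 8], [1, 6], ["energy"], ["energy", "new_target"]))
```

A dataset containing a species outside the original set raises a `ValueError` in this sketch, which mirrors the current restriction.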

Two considerations:

- I'm using None as a string because that's what hydra is feeding me @PicoCentauri
- I'm using type: ignore on the composition weights in the SOAP-BPNN because I think the linter is getting confused by the mechanics of register_buffer
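
For the first consideration, one possible workaround is to normalize the value right after it comes out of the configuration. This is only a sketch: the helper name `normalize_optional` is hypothetical and is not part of this PR.

```python
from typing import Optional


def normalize_optional(value: Optional[str]) -> Optional[str]:
    """Map the string "None" (any casing) to a real None (hypothetical helper)."""
    if value is None or value.strip().lower() == "none":
        return None
    # any other string, e.g. a checkpoint path, is passed through unchanged
    return value


if __name__ == "__main__":
    print(normalize_optional("None"))
    print(normalize_optional("bpnn-model.pt"))
```

With such a helper the rest of the code could test `is None` instead of comparing against a string.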

PicoCentauri · 2024-02-06T13:33:20Z

src/metatensor/models/cli/train_model.py

+        "--continue",
+        dest="continue_from",
+        type=str,
+        required=False,


Maybe adding default=None here keeps it from being a string.
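
A standalone sketch of the flag with the standard-library argparse, assuming a parser that looks roughly like the diff above (the `prog` name and the positional `options` argument are assumptions). Because `continue` is a Python keyword, the value has to be stored under a different name, which is what `dest="continue_from"` does.

```python
import argparse

parser = argparse.ArgumentParser(prog="train-sketch")
parser.add_argument("options", type=str)
parser.add_argument(
    "--continue",
    dest="continue_from",  # "continue" is a keyword, so store it elsewhere
    type=str,
    required=False,
    default=None,  # explicit, although argparse already defaults to None
)

# flag omitted: the attribute is the None object, not the string "None"
args_without = parser.parse_args(["options.yaml"])
# flag given: the attribute holds the checkpoint path as a string
args_with = parser.parse_args(["options.yaml", "--continue", "bpnn-model.pt"])

if __name__ == "__main__":
    print(repr(args_without.continue_from))
    print(repr(args_with.continue_from))
```

In plain argparse an omitted optional argument is already `None`, so the string `"None"` mentioned in the description presumably appears at a later step, when the value is handed over to hydra.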

PicoCentauri

Seems to work, but the test is not actually using the continue flag.

Also, maybe add some more tests for the quite complex new functions you added.

PicoCentauri · 2024-02-06T13:47:54Z

src/metatensor/models/soap_bpnn/model.py

+        """Add a new output to the model."""
+        # add a new row to the composition weights tensor


Shouldn't both be in the docstring?

PicoCentauri · 2024-02-06T13:49:51Z

tests/cli/test_train_model.py

+    shutil.copy(RESOURCES_PATH / "bpnn-model.pt", "bpnn-model.pt")
+    shutil.copy(RESOURCES_PATH / "options_continue.yaml", "options_continue.yaml")
+
+    command = ["metatensor-models", "train", "options_continue.yaml"]


But this is not using the continue flag?

Sorry, added it

PicoCentauri · 2024-02-06T13:50:11Z

src/metatensor/models/utils/merge_capabilities.py

+from metatensor.torch.atomistic import ModelCapabilities
+
+
+def merge_capabilities(


Should there be a test for this function?

Yes, my bad

PicoCentauri · 2024-02-06T13:52:24Z

tests/resources/options_continue.yaml

+architecture:
+  name: soap_bpnn
+  model:
+    restart: bpnn-model.pt
+  training:
+    batch_size: 2
+    num_epochs: 1
+
+# Section defining the parameters for structure and target data
+training_set:
+  structures:
+    read_from: "qm9_reduced_100.xyz"
+  targets:
+    energy:
+      key: "U0"
+
+test_set: 0.1
+validation_set: 0.1


I would rather try to overwrite the options in the test instead of adding more files. Basically works like this:

    options = OmegaConf.load(RESOURCES_PATH / "options.yaml")
    options["foo"] = "bar"
    OmegaConf.save(config=options, f="options.yaml")

PicoCentauri added 2 commits February 5, 2024 16:18

Added cli option for continue training

5f7b77e

Added cli option for continue training

699d726

PicoCentauri force-pushed the restart branch from 5f7b77e to 699d726 Compare February 6, 2024 11:33

frostedoyster added 3 commits February 6, 2024 13:58

Finalize continuation of training

bb54b84

Merge branch 'restart' of https://github.com/lab-cosmo/metatensor-models into restart

ea3254e

Fix some merge issues

8c2dc5e

frostedoyster marked this pull request as ready for review February 6, 2024 13:13

PicoCentauri commented Feb 6, 2024

frostedoyster and others added 3 commits February 6, 2024 15:03

Address comments

99e9abb

Sue correct type for continue_from

689bcae

Linter

e03c5bd

frostedoyster approved these changes Feb 6, 2024

frostedoyster merged commit 89f5d36 into main Feb 6, 2024
7 of 8 checks passed

frostedoyster deleted the restart branch February 6, 2024 15:15
