[BUG] Issue with DP-GEN workflow for DPA-2 model using PyTorch backend #1654

chenggoj · 2024-10-08T06:30:35Z

Bug summary

Dear DeePMD community,

I'm encountering an issue while using the DP-GEN workflow with the DPA-2 model and PyTorch backend. Here are the details:

Environment:

DeePMD-kit version: 3.0.0b4-GPU-py3.9-cuda120
Model: DPA-2
Backend: PyTorch
Workflow control: DP-GEN

Issue Description:

In my machine.json file, I'm using parallel training with the following command: "command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt"
The training phase completes successfully for all four models. Each model directory contains the expected output files, including "*_task_tag_finished" and "frozen_model.pth".

├── 000
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── f74eaa2be2cab187505b354f787e5e5530d141f4_task_tag_finished
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/000/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 001
│   ├── 84f1c8acd2f9dc640b2fea97f8aad68396a0fc93_task_tag_finished
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/001/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 002
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── dpdispatcher.log
│   ├── e193485d0db3952cdb32f6406c9580c43f010989_task_tag_finished
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/002/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 003
│   ├── 19f28cb5828301f7434aaed206c3956f6890eb78_task_tag_finished
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/003/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log

However, the workflow stops at the model_devi stage with the following error: `FileNotFoundError: cannot find download file frozen_model.pb`
I believe DP-GEN looks for `frozen_model.pb` (TensorFlow format) by default, which does not match the `frozen_model.pth` file produced by the PyTorch backend.
When I try to convert the model manually with `dp convert-backend frozen_model.pth frozen_model.pb`, I get another error: `RuntimeError: Unknown descriptor type: dpa2. Did you mean: dpa1?`

Analysis:
It appears that the DPA-2 model currently only supports PyTorch and cannot be converted to the TensorFlow format (frozen_model.pb). This prevents me from proceeding with subsequent DP-GEN operations for the DPA-2 model.
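To confirm which model format each training task actually produced, a small check like the following may help. It is only a sketch: the helper name `find_frozen_models` is hypothetical and not part of DP-GEN or DeePMD-kit, and the default path assumes the `iter.000000/00.train` layout shown in the listing above.

```python
from pathlib import Path


def find_frozen_models(train_dir):
    """Map each task directory name to the frozen_model.* files found in it.

    Hypothetical helper for illustration; not part of DP-GEN.
    """
    found = {}
    for task in sorted(Path(train_dir).iterdir()):
        if task.is_dir():
            found[task.name] = sorted(p.name for p in task.glob("frozen_model.*"))
    return found


if __name__ == "__main__":
    import sys

    # Assumed default: the training directory of the first iteration.
    train_dir = sys.argv[1] if len(sys.argv) > 1 else "iter.000000/00.train"
    if Path(train_dir).is_dir():
        for task, models in find_frozen_models(train_dir).items():
            print(task, models if models else "no frozen model found")
    else:
        print(f"{train_dir} is not a directory")
```

If every task reports only `frozen_model.pth`, training itself went as expected and the mismatch lies in the file name that the later stage requests.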

Questions:

Is there a way to configure DP-GEN to work with PyTorch's "frozen_model.pth" for the DPA-2 model?
Are there plans to support TensorFlow backend or format conversion for the DPA-2 model in future releases?
Is there an alternative workflow or workaround to use the DPA-2 model with DP-GEN?

Any guidance or suggestions would be greatly appreciated. Thank you for your time and assistance.

DeePMD-kit Version

3.0.0b4

Backend and its version

PyTorch 2.1.2

How did you download the software?

conda

Input Files, Running Commands, Error Log, etc.

machine.json

"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt",

Steps to Reproduce

Use DPA-2 model in DP-GEN.

Further Information, Files, and Links

No response


njzjz · 2024-10-08T18:34:42Z

Have you set train_backend to pytorch? Note this option has not been released in a stable version.

chenggoj · 2024-10-08T20:14:57Z

> Have you set train_backend to pytorch? Note this option has not been released in a stable version.

Oh, I had not noticed that before. Now I see it:

def _get_model_suffix(jdata) -> str:
    """Return the model suffix based on the backend."""
    mlp_engine = jdata.get("mlp_engine", "dp")
    if mlp_engine == "dp":
        suffix_map = {"tensorflow": ".pb", "pytorch": ".pth"}
        backend = jdata.get("train_backend", "tensorflow")
        if backend in suffix_map:
            suffix = suffix_map[backend]
        else:
            raise ValueError(
                f"The backend {backend} is not available. Supported backends are: 'tensorflow', 'pytorch'."
            )
        return suffix
    else:
        raise ValueError(f"Unsupported engine: {mlp_engine}")

Now I have set it:

{
    "type_map": ["Al", "O", "Pt"],
    "mass_map": [27, 16, 195],
    "init_data_prefix": "../",
    "init_data_sys": [
        "init/data/data_SA",
        "init/data/data_NP",
        "init/data/data_mix",
        "init/data/data_NP_gamma-Al2O3_001"
    ],
    "sys_configs_prefix": "../",
    "sys_configs": [
        ["init/model_devi/POSCAR_SA"],
        ["init/model_devi/POSCAR_NP"],
        ["init/model_devi/POSCAR_mix"],
        ["init/model_devi/POSCAR_gamma-Al2O3_001"]
    ],
    "_comment": " that's all ",
    "numb_models": 4,
    "train_backend": "pytorch",
    "default_training_param": {
.......

But it still does not work. I get the same error:

FileNotFoundError: cannot find download file ........frozen_model.pb

njzjz · 2024-10-16T19:21:42Z

Which commit of DP-GEN do you use?
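One way to answer this is to print the installed package versions from the same Python environment that runs DP-GEN. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and `dpgen` and `deepmd-kit` are assumed to be the distribution names used at install time. For a development install the version string often embeds a short commit hash, though that depends on how the package was built; in a source checkout, `git rev-parse HEAD` gives the commit directly.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them if the packages
    # were installed under different names.
    for name in ("dpgen", "deepmd-kit"):
        print(name, installed_version(name) or "not installed")
```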

chenggoj added the bug label Oct 8, 2024

njzjz transferred this issue from deepmodeling/deepmd-kit Oct 10, 2024
