Bug summary
Dear DeePMD community,
I'm encountering an issue while using the DP-GEN workflow with the DPA-2 model and PyTorch backend. Here are the details:
Environment:
In my machine.json file, I'm using parallel training with the following command: `"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt"`
The training phase completes successfully for all four models. Each model directory contains the expected output files, including `*_task_tag_finished` and `frozen_model.pth`:
├── 000
│ ├── checkpoint
│ ├── dpa2.hdf5
│ ├── f74eaa2be2cab187505b354f787e5e5530d141f4_task_tag_finished
│ ├── frozen_model.pth
│ ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/000/input.json
│ ├── input_v2_compat.json
│ ├── lcurve.out
│ ├── model.ckpt-100.pt
│ ├── model.ckpt-200.pt
│ ├── model.ckpt-300.pt
│ ├── model.ckpt.pt -> model.ckpt-300.pt
│ ├── out.json
│ └── train.log
├── 001
│ ├── 84f1c8acd2f9dc640b2fea97f8aad68396a0fc93_task_tag_finished
│ ├── checkpoint
│ ├── dpa2.hdf5
│ ├── frozen_model.pth
│ ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/001/input.json
│ ├── input_v2_compat.json
│ ├── lcurve.out
│ ├── model.ckpt-100.pt
│ ├── model.ckpt-200.pt
│ ├── model.ckpt-300.pt
│ ├── model.ckpt.pt -> model.ckpt-300.pt
│ ├── out.json
│ └── train.log
├── 002
│ ├── checkpoint
│ ├── dpa2.hdf5
│ ├── dpdispatcher.log
│ ├── e193485d0db3952cdb32f6406c9580c43f010989_task_tag_finished
│ ├── frozen_model.pth
│ ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/002/input.json
│ ├── input_v2_compat.json
│ ├── lcurve.out
│ ├── model.ckpt-100.pt
│ ├── model.ckpt-200.pt
│ ├── model.ckpt-300.pt
│ ├── model.ckpt.pt -> model.ckpt-300.pt
│ ├── out.json
│ └── train.log
├── 003
│ ├── 19f28cb5828301f7434aaed206c3956f6890eb78_task_tag_finished
│ ├── checkpoint
│ ├── dpa2.hdf5
│ ├── frozen_model.pth
│ ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/003/input.json
│ ├── input_v2_compat.json
│ ├── lcurve.out
│ ├── model.ckpt-100.pt
│ ├── model.ckpt-200.pt
│ ├── model.ckpt-300.pt
│ ├── model.ckpt.pt -> model.ckpt-300.pt
│ ├── out.json
│ └── train.log
However, the workflow stops at the model_devi stage with the following error:
`FileNotFoundError: cannot find download file frozen_model.pb`
I believe DP-GEN is looking for `frozen_model.pb` (TensorFlow format) by default, but that file is not produced by the PyTorch backend, which writes `frozen_model.pth` instead.
When I manually attempt to convert the format using `dp convert-backend frozen_model.pth frozen_model.pb`, I receive another error: `RuntimeError: Unknown descriptor type: dpa2. Did you mean: dpa1?`
Analysis:
It appears that the DPA-2 model currently only supports PyTorch and cannot be converted to the TensorFlow format (frozen_model.pb). This prevents me from proceeding with subsequent DP-GEN operations for the DPA-2 model.
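To double-check which frozen model formats are actually present before model_devi runs, a minimal sketch like the one below could be used. It only inspects the file system and does not call DP-GEN or DeePMD-kit; the function names `find_frozen_models` and `missing_pb` and the default path are hypothetical and chosen for illustration.

```python
from pathlib import Path


def find_frozen_models(train_dir):
    """Map each task directory name (e.g. "000") to the sorted list of
    frozen_model.* file names found directly inside it.

    Returns an empty dict if train_dir does not exist.
    """
    root = Path(train_dir)
    if not root.is_dir():
        return {}
    found = {}
    for task in sorted(root.iterdir()):
        if task.is_dir():
            found[task.name] = sorted(p.name for p in task.glob("frozen_model.*"))
    return found


def missing_pb(found):
    """Return the task directories that contain no frozen_model.pb."""
    return [name for name, files in found.items() if "frozen_model.pb" not in files]


if __name__ == "__main__":
    # Hypothetical location; adjust to the actual iteration directory.
    train_dir = "iter.000000/00.train"
    found = find_frozen_models(train_dir)
    for name, files in found.items():
        print(name, files)
    print("tasks without frozen_model.pb:", missing_pb(found))
```

For the directory tree shown above, every task directory would be reported as lacking `frozen_model.pb`, which matches the file the model_devi stage fails to find.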
Questions:
Is there a way to configure DP-GEN to work with PyTorch's "frozen_model.pth" for the DPA-2 model?
Are there plans to support the TensorFlow backend, or conversion to the TensorFlow format, for the DPA-2 model in future releases?
Is there an alternative workflow or workaround to use the DPA-2 model with DP-GEN?
Any guidance or suggestions would be greatly appreciated. Thank you for your time and assistance.
DeePMD-kit Version
3.0.0b4
Backend and its version
PyTorch 2.1.2
How did you download the software?
conda
Input Files, Running Commands, Error Log, etc.
machine.json
"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt",
Steps to Reproduce
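Run DP-GEN with a DPA-2 model trained using the PyTorch backend (`dp --pt`). Training completes, then the model_devi stage fails with the error above.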
Further Information, Files, and Links
No response