About machine.json using lsf #378

Open
DM0815 opened this issue Mar 16, 2023 · 4 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

DM0815 commented Mar 16, 2023

I am running dpgen with the LSF queue system from the login node of a cluster. After I submit the command, it fails with "RuntimeError: Meet errors will handle unexpected submission state." and suggests that I look in remote_root. However, there is no error information in the work dir. In the dp task dir the jobs are still running and train.log looks fine, and I can see the jobs in the queue system. I don't know what is wrong. Can you give me some hints? My machine.json and the error information are attached.

machine.json:
{
    "api_version": "1.0",
    "_deepmd_version": "2.1.0",
    "train": {
        "command": "dp",
        "machine": {
            "batch_type": "LSF",
            "context_type": "local",
            "local_root": "./",
            "remote_root": "/public/home/dmeng/DPGEN/0316testlsf/tmp"
        },
        "resources": {
            "number_node": 1,
            "cpu_per_node": 8,
            "gpu_per_node": 0,
            "queue_name": "normal",
            "group_size": 2,
            "_batch_type": "LSF",
            "_kwargs": {},
            "source_list": ["/public/home/dmeng/anaconda3/bin/activate deepmd"]
        }
    },
    "model_devi": {
        "command": "lmp -i input.lammps -v restart 0",
        "machine": {
            "batch_type": "LSF",
            "context_type": "local",
            "local_root": "./",
            "remote_root": "/public/home/dmeng/DPGEN/0316testlsf/tmp"
        },
        "resources": {
            "number_node": 1,
            "cpu_per_node": 8,
            "gpu_per_node": 0,
            "queue_name": "normal",
            "group_size": 100,
            "_batch_type": "LSF",
            "_kwargs": {},
            "source_list": ["/public/home/dmeng/anaconda3/bin/activate deepmd"]
        }
    },
    "fp": {
        "command": "ulimit -s unlimited && mpirun -n 8 /public/home/dmeng/softwares/vasp.5.4/bin/vasp_std",
        "machine": {
            "batch_type": "LSF",
            "context_type": "local",
            "local_root": "./",
            "remote_root": "/public/home/dmeng/DPGEN/0316testlsf/tmp"
        },
        "resources": {
            "number_node": 1,
            "cpu_per_node": 8,
            "gpu_per_node": 0,
            "queue_name": "normal",
            "group_size": 50,
            "_batch_type": "LSF",
            "_kwargs": {},
            "source_list": ["/public/softwares/intel/oneapi/setvars.sh"]
        }
    }
}
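
When comparing a machine.json like the one above across the three stages, it can help to print only the fields that affect job submission. The following is a minimal sketch using just the Python standard library; it does not call dpgen or dpdispatcher, and the helper names (`summarize_stage`, `summarize_machine_config`, `load_and_summarize`) are made up for illustration. It also lists keys that start with an underscore, on the assumption (not confirmed in this thread) that such keys are notes rather than active settings.

```python
import json

# Stage names as they appear in the machine.json posted above.
STAGES = ("train", "model_devi", "fp")


def summarize_stage(stage_cfg):
    """Collect the submission-related fields of one stage."""
    machine = stage_cfg.get("machine", {})
    resources = stage_cfg.get("resources", {})
    return {
        "batch_type": machine.get("batch_type"),
        "context_type": machine.get("context_type"),
        "remote_root": machine.get("remote_root"),
        "queue_name": resources.get("queue_name"),
        # Assumption: underscore-prefixed keys are treated as notes, so list
        # them separately to make them easy to spot.
        "underscore_keys": sorted(k for k in resources if k.startswith("_")),
    }


def summarize_machine_config(cfg):
    """Return {stage name: summary} for every stage present in cfg."""
    return {name: summarize_stage(cfg[name]) for name in STAGES if name in cfg}


def load_and_summarize(path):
    """Read a machine.json file from disk and summarize it."""
    with open(path) as fh:
        return summarize_machine_config(json.load(fh))
```

Calling `load_and_summarize("machine.json")` returns one small dictionary per stage, which makes it easier to see at a glance whether all stages point at the same `remote_root` and queue.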
(Four screenshots attached: 1678950833716, 1678951498610, 1678951517568, 1678951549987)

njzjz (Member) commented Apr 10, 2023

image

Please provide the "above exception" mentioned in your error message. Thanks.
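
For context on what the "above exception" is: when Python raises a new exception while handling an earlier one, the traceback prints the earlier exception first, followed by a line saying that another exception occurred during its handling. The sketch below reproduces that pattern with two hypothetical functions (`submit_job` and `handle_submission` are stand-ins written for this example, not dpdispatcher code) and shows one way to capture the whole chain as text so that it can be pasted into an issue.

```python
import traceback


def submit_job():
    # Hypothetical stand-in for whatever step fails first.
    raise ValueError("hypothetical root cause")


def handle_submission():
    try:
        submit_job()
    except ValueError:
        # Raising inside an except block chains the new exception to the
        # one being handled (stored in __context__).
        raise RuntimeError("Meet errors will handle unexpected submission state.")


def full_traceback_text(exc):
    """Format an exception together with every exception chained to it."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


try:
    handle_submission()
except RuntimeError as err:
    report = full_traceback_text(err)
    print(report)
```

The first part of the printed report is the earlier exception (here the `ValueError`), which is usually the part that explains the real cause.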

DM0815 (Author) commented Apr 21, 2023

> image
>
> Please provide the "above exception" mentioned in your error message. Thanks.

Sorry for the late reply.
image

12jscvb commented Sep 26, 2023

Hi, I have run into the same problem. Have you solved the error?

njzjz transferred this issue from deepmodeling/dpgen on Oct 16, 2023
njzjz added the help wanted and bug labels on Oct 20, 2023
njzjz (Member) commented Oct 20, 2023

> Hi, I have run into the same problem. Have you solved the error?

At this time, we don't have access to any LSF node. If you have found what is wrong with dpdispatcher, feel free to contribute to the code.
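
For anyone with LSF access who wants to dig into this, one possible starting point is to record what the scheduler itself reports for the submitted jobs. The sketch below parses text in the default `bjobs` column layout and flags any state outside a short list of known ones. It is a standalone diagnostic helper, not how dpdispatcher queries LSF; the function names are made up, the state list may be incomplete, and `SAMPLE` is invented text for illustration rather than output from the reporter's cluster.

```python
# Job states commonly reported by bjobs (this set may be incomplete).
KNOWN_STATES = {
    "PEND", "PROV", "PSUSP", "RUN", "USUSP", "SSUSP",
    "DONE", "EXIT", "UNKWN", "WAIT", "ZOMBI",
}


def parse_bjobs(text):
    """Map job id -> state from text in the default bjobs table layout."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    if "JOBID" not in header or "STAT" not in header:
        # Anything that is not a job table (for example a plain message
        # line) ends up here.
        raise ValueError("unexpected bjobs header: %r" % lines[0])
    id_col, stat_col = header.index("JOBID"), header.index("STAT")
    states = {}
    for ln in lines[1:]:
        fields = ln.split()
        # Skip continuation lines that do not start with a numeric job id.
        if len(fields) <= max(id_col, stat_col) or not fields[id_col].isdigit():
            continue
        states[fields[id_col]] = fields[stat_col]
    return states


def unrecognized(states):
    """Return the jobs whose state is not in KNOWN_STATES."""
    return {jid: st for jid, st in states.items() if st not in KNOWN_STATES}


# Made-up sample in the default column layout, for illustration only.
SAMPLE = """\
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
101     user1   RUN   normal     login1      node1       job_a      Mar 16 10:00
102     user1   PEND  normal     login1                  job_b      Mar 16 10:01
"""

states = parse_bjobs(SAMPLE)
print(states)
print(unrecognized(states))
```

Feeding it the real `bjobs` text captured at the moment the error appears would show whether the scheduler reports a state that a submission tool might not expect.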
