EC(wISC)30to60 performance tests are failing on Perlmutter and Chicoma #497

Closed
xylar opened this issue Jan 11, 2023 · 9 comments · Fixed by #624

Comments

@xylar
Collaborator

xylar commented Jan 11, 2023

After the recent module changes on Perlmutter and Chicoma, I'm seeing PIO errors but only for the EC performance tests:

ERROR: MPAS IO Error: Bad return value from PIO
CRITICAL ERROR: Core init failed for core ocean

This is on all cores except 0000.

See:

/pscratch/sd/x/xylar/compass_1.2/test_20230111/ocean_pr/ocean/global_ocean/EC30to60/PHC/performance_test/forward

I tried changing the PIO layout but that didn't make a difference. More debugging is needed.
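For anyone reproducing this, a minimal sketch of how one might confirm which ranks report the PIO error. It assumes the per-rank error logs in the run directory are named `log.ocean.NNNN.err`; that naming and the helper name `list_pio_error_ranks` are assumptions for illustration, not something confirmed in this issue.

```shell
# Hypothetical helper: print the rank numbers whose error logs contain the
# PIO message. Assumes per-rank logs are named log.ocean.NNNN.err.
list_pio_error_ranks() {
    run_dir="${1:-.}"
    # -l prints only the names of files that contain the message
    grep -l "Bad return value from PIO" "$run_dir"/log.ocean.*.err 2>/dev/null \
        | sed -E 's/.*log\.ocean\.([0-9]+)\.err$/\1/' \
        | sort
}
```

Calling `list_pio_error_ranks` with the `forward` directory as its argument should list every rank that hit the error, which makes it easy to check whether 0000 really is the only one missing.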

@xylar xylar added bug Something isn't working ocean labels Jan 11, 2023
@xylar xylar changed the title EC(wISC)30to60 performance tests are failing on Perlmutter EC(wISC)30to60 performance tests are failing on Perlmutter and Chicoma Jan 16, 2023
@xylar xylar mentioned this issue Jan 16, 2023
7 tasks
@mark-petersen
Collaborator

Note: On Perlmutter, use the head of compass. On Chicoma, use the xylar/add_chicoma-cpu branch.

@xylar
Collaborator Author

xylar commented Jan 17, 2023

Let's see if E3SM-Project/mache#100 happens to fix this as a first change. We should be able to test this by just adding:

export FI_CXI_RX_MATCH_MODE=software
export MPICH_COLL_SYNC=MPI_Bcast

manually to the load script.
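A minimal sketch of one way to do that without duplicating lines on repeated runs. The helper name `add_workaround_exports` is hypothetical, and the load script path is whatever script you already source; nothing here is part of compass or mache itself.

```shell
# Hypothetical helper: append the two workaround exports to a load script,
# skipping any line that is already present.
add_workaround_exports() {
    load_script="$1"
    for line in \
        "export FI_CXI_RX_MATCH_MODE=software" \
        "export MPICH_COLL_SYNC=MPI_Bcast"
    do
        # -x matches the whole line, -F treats the pattern as a fixed string
        grep -qxF "$line" "$load_script" || printf '%s\n' "$line" >> "$load_script"
    done
}
```

After running the helper on the load script, source the script again in a fresh shell so the variables are set before the test suite is launched.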

@xylar
Collaborator Author

xylar commented Mar 9, 2023

At this point, I'm not seeing the PIO error, but the EC test is just hanging on Chicoma.

@xylar xylar mentioned this issue Mar 9, 2023
64 tasks
@xylar
Collaborator Author

xylar commented Mar 10, 2023

@mark-petersen, as I test #555, this and the probably related issue #500 are really giving me trouble. I could use some help debugging them.

In every case where I'm seeing these issues, it's with the GNU compilers (not sure whether that's a coincidence). The problem shows up as a PIO error in some cases and just as a hang in others.

@xylar
Collaborator Author

xylar commented Mar 10, 2023

This issue makes the pr test suite not useful at all on Perlmutter and Chicoma, and limits its usefulness on other machines where GNU isn't our primary compiler but where we do want to fully support it.

@xylar
Collaborator Author

xylar commented Mar 10, 2023

The latest example of this on Perlmutter can be found at:

/pscratch/sd/x/xylar/compass_1.2/test_20230310/ocean_pr2/ocean/global_ocean/EC30to60/PHC/performance_test/forward
/pscratch/sd/x/xylar/compass_1.2/test_20230310/ocean_pr2/ocean/global_ocean/ECwISC30to60/PHC/performance_test/forward

@mark-petersen
Collaborator

I think this issue is the same as #500. I just fixed the hang with E3SM-Project/E3SM#5575. We can retest the pr suite with that to see if there remains a PIO issue.

@xylar
Collaborator Author

xylar commented Apr 12, 2023

As I commented in E3SM-Project/E3SM#5575 (comment), I unfortunately don't think that branch has fixed this problem, although it does seem to have fixed #500.

@xylar
Collaborator Author

xylar commented Apr 19, 2023

The pr suite runs on Perlmutter with the fix in E3SM-Project/E3SM#5610. I believe we can close this as soon as that gets merged and I update the `E3SM-Project` submodule.
