
use hash of comm_tag if not numeric #222

Merged
6 commits merged into main from hashing_comm_tag on Mar 2, 2022

Conversation

matthiasdiener (Collaborator) commented Mar 2, 2022

grudge/trace_pair.py: review thread (outdated, resolved)
inducer marked this pull request as ready for review March 2, 2022 16:20
inducer enabled auto-merge (squash) March 2, 2022 16:20
matthiasdiener (Collaborator, Author)

I think this is probably incorrect in case hash(comm_tag) > MPI_TAG_UB (but I'm not sure if that can happen).

inducer (Owner) commented Mar 2, 2022

Fair point; stick a modulo in there. It's a stopgap.
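
The modulo stopgap can be sketched as follows. The function name and the default are illustrative, not grudge's actual code; in real code tag_ub would be queried from the communicator via mpi4py's comm.Get_attr(MPI.TAG_UB), and the MPI standard only guarantees TAG_UB >= 32767.

```python
# Illustrative sketch of the stopgap discussed above: reduce hash(comm_tag)
# into the valid MPI tag range. Names are hypothetical, not grudge's code.

def symbolic_tag_to_mpi_tag(comm_tag, tag_ub=32767):
    # hash() may be negative or larger than tag_ub; Python's % with a
    # positive modulus always yields a result in [0, tag_ub).
    return hash(comm_tag) % tag_ub

class _ExampleTag:  # stand-in for a symbolic tag class like _EulerCVTag
    pass

tag = symbolic_tag_to_mpi_tag(_ExampleTag)
assert 0 <= tag < 32767
```

Note the caveat raised later in the thread: hash() itself is not stable across interpreter instances, so the modulo alone does not make ranks agree.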

inducer (Owner) commented Mar 2, 2022

You were right: OverflowError: value too large to convert to int

inducer enabled auto-merge (squash) March 2, 2022 19:08
inducer disabled auto-merge March 2, 2022 19:08
grudge/trace_pair.py: review thread (outdated, resolved)
Co-authored-by: Andreas Klöckner <[email protected]>
inducer enabled auto-merge (squash) March 2, 2022 19:21
inducer (Owner) commented Mar 2, 2022

Thx!

MTCam (Collaborator) commented Mar 2, 2022

Don't all ranks need to agree on the message-specific tags? I don't know how this example CI is passing. When I run the MIRGE-Com examples by hand, and in MIRGE-Com CI, they appear to hang.

After mpiexec -n 2 python -m mpi4py pulse-mpi.py:

----- cfl=array(0.00100953)
P(101324.99999999999, 101325.0), T(1499.9999999999998, 1500.0)
building face restriction: start
building face restriction: done
building face restriction: start
building face restriction: done
building face restriction: start
building face restriction: done
/Users/mtcampbe/CEESD/devel/omicron/production/emirge/grudge/grudge/trace_pair.py:505: UserWarning: Encountered unknown symbolic tag '<class 'mirgecom.euler._EulerCVTag'>', assigning a value of '261176863'. This is a temporary workaround, please ensure that tags are sufficiently distinct for your use case.
  warn("Encountered unknown symbolic tag "
building face restriction: start
building face restriction: done
/Users/mtcampbe/CEESD/devel/omicron/production/emirge/grudge/grudge/trace_pair.py:505: UserWarning: Encountered unknown symbolic tag '<class 'mirgecom.euler._EulerCVTag'>', assigning a value of '53455899'. This is a temporary workaround, please ensure that tags are sufficiently distinct for your use case.
  warn("Encountered unknown symbolic tag "

Example hangs at this point.

inducer disabled auto-merge March 2, 2022 19:32
inducer (Owner) commented Mar 2, 2022

Ah right. hash() is not consistent across interpreter instances. pytools.persistent_dict has a whole separate hash infrastructure that solves this.
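
The idea of a process-independent tag can be sketched with the standard library alone. This is an illustration with hypothetical names, not grudge's actual implementation (which builds on pytools' persistent hashing): derive the tag from a process-independent name and a cryptographic digest rather than from hash().

```python
import hashlib

def stable_tag(comm_tag, tag_ub=32767):
    # Derive a name that is identical in every process: for a class, its
    # fully qualified name; otherwise fall back to repr(). (Names and the
    # fallback are illustrative assumptions, not grudge's actual scheme.)
    if isinstance(comm_tag, type):
        name = f"{comm_tag.__module__}.{comm_tag.__qualname__}"
    else:
        name = repr(comm_tag)
    # sha256 is deterministic across processes and interpreter versions,
    # unlike the built-in hash().
    digest = hashlib.sha256(name.encode()).hexdigest()
    return int(digest, 16) % tag_ub

class _EulerCVTag:  # stand-in for the tag class from the log above
    pass

assert stable_tag(_EulerCVTag) == stable_tag(_EulerCVTag)
```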

inducer (Owner) commented Mar 2, 2022

@MTCam As a workaround for the workaround, if you set the env var PYTHONHASHSEED=0, that'll prop this up until we fix this up.

https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED

MTCam (Collaborator) commented Mar 2, 2022

@MTCam As a workaround for the workaround, if you set the env var PYTHONHASHSEED=0, that'll prop this up until we fix this up.

docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED

Hrm. I tried this, and after it failed, I tried it like this:
mpiexec -n 2 env PYTHONHASHSEED=0 python -m mpi4py autoignition-mpi.py,
and the tag hash was still rank-disparate, resulting in a hang.
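
A likely explanation, not spelled out in the thread: PYTHONHASHSEED only controls the randomized hashing of str and bytes. A tag that is a class (like _EulerCVTag above) gets the default id()-based object hash, which the seed does not affect and which can differ between processes. A small demonstration with a hypothetical helper:

```python
import os
import subprocess
import sys

def hash_in_new_interpreter(snippet, seed="0"):
    # Evaluate hash(<snippet>) in a fresh Python process with a fixed
    # PYTHONHASHSEED, mimicking separate MPI ranks.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({snippet}))"],
        env=env, capture_output=True, text=True, check=True)
    return int(out.stdout)

# str hashes are covered by PYTHONHASHSEED, so a fixed seed makes them
# agree across processes:
assert (hash_in_new_interpreter("'_EulerCVTag'")
        == hash_in_new_interpreter("'_EulerCVTag'"))

# A class's hash, however, comes from the default object hash (derived
# from id()), which PYTHONHASHSEED does not control, so two processes may
# still disagree; this would explain the hang observed above.
c = hash_in_new_interpreter("type('_EulerCVTag', (), {})")
d = hash_in_new_interpreter("type('_EulerCVTag', (), {})")
print(c, d)  # frequently different; equality is not guaranteed either way
```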

inducer (Owner) commented Mar 2, 2022

Just pushed a version that uses the persistent hash.

MTCam (Collaborator) commented Mar 2, 2022

Just pushed a version that uses the persistent hash.

Eureka! Thanks!

MTCam (Collaborator) commented Mar 2, 2022

Just FYI, this is totally OK, since it just requires the MIRGE-Com examples (and production drivers) to be updated, but we still get this error when running eager in parallel:

  File "/Users/mtcampbe/CEESD/devel/omicron/production/emirge/grudge/grudge/trace_pair.py", line 307, in interior_trace_pairs
    + cross_rank_trace_pairs(dcoll, vec, comm_tag=comm_tag)
  File "/Users/mtcampbe/CEESD/devel/omicron/production/emirge/grudge/grudge/trace_pair.py", line 508, in cross_rank_trace_pairs
    tag_ub = actx.mpi_communicator.Get_attr(MPI.TAG_UB)
AttributeError: 'PyOpenCLArrayContext' object has no attribute 'mpi_communicator'
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

The fix is to make sure to update everything to use the new API for getting the reasonable array context instead of just picking PyOpenCLArrayContext.

inducer merged commit bde1ad7 into main Mar 2, 2022
inducer deleted the hashing_comm_tag branch March 2, 2022 20:57
inducer (Owner) commented Mar 3, 2022

The fix is to make sure to update everything to use the new API for getting the reasonable array context instead of just picking PyOpenCLArrayContext.

This could have been done in a backward-compatible manner, and I'm really kicking myself that it wasn't, for no good reason other than perhaps that I was rushing to get it done. Generally, the change is what we want: the array context, rather than the discretization, is the source of truth for the communicator. Is there a point to retrofitting backward compatibility now, or not really?

MTCam (Collaborator) commented Mar 3, 2022

The fix is to make sure to update everything to use the new API for getting the reasonable array context instead of just picking PyOpenCLArrayContext.

This could have been done in a backward-compatible manner, and I'm really kicking myself that it wasn't, for no good reason other than perhaps that I was rushing to get it done. Generally, the change is what we want: the array context, rather than the discretization, is the source of truth for the communicator. Is there a point to retrofitting backward compatibility now, or not really?

Actually, in many ways this is working out better. This would just have added to our pile of technical debt if we had been allowed to limp forward without making the changes. All is well, imo. I warned folks not to pull new toolchains until we get all this worked out. I understand the grumbling too, but on some level we need to be fluid in supporting important changes coming down from upstream packages.

Development

Successfully merging this pull request may close these issues.

Errors after switching to grudge@main
3 participants