
Multigroup Per-Thread Cache Conversion #2591

Merged: 11 commits, Jul 21, 2023

Conversation

jtramm
Contributor

@jtramm jtramm commented Jul 6, 2023

Overview

This PR converts per-thread multigroup cross section data caches into particle-owned caches. It is a partial remedy for #2160 and a full fix for #2590. This PR closes #2590, but #2160 will remain open since this PR does not solve all of the performance issues there.

As discussed in Issue #2160, the use of the per-thread cache involved access patterns like:

openmc/src/mgxs.cpp

Lines 428 to 439 in 33ead74

double Mgxs::get_xs(
  MgxsType xstype, int gin, const int* gout, const double* mu, const int* dg)
{
  // This method assumes that the temperature and angle indices are set
#ifdef _OPENMP
  int tid = omp_get_thread_num();
  XsData* xs_t = &xs[cache[tid].t];
  int a = cache[tid].a;
#else
  XsData* xs_t = &xs[cache[0].t];
  int a = cache[0].a;
#endif

This existing strategy was problematic for two reasons:

  1. It is a canonical case of multithreaded "false sharing": multiple cache objects fit on the same cache line but are read and written by different threads (Poor thread scaling in multi-group mode #2160).

  2. It generates incorrect results in event-based mode (Multigroup MC Gives Wrong Answer in Event-Based Mode #2590), since in that mode a single thread processes many particles, so per-thread particle state like this is meaningless.

To fix these issues, this PR simply moves the cache into the ParticleData class. I also experimented with removing the MGXS cache altogether, but found that retaining the cache in the particle gave better performance.
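
As a rough illustration of the change (struct and member names here are assumptions for the sketch, not necessarily the exact OpenMC identifiers), the temperature/angle cache moves out of Mgxs, where it was sized by the thread count, and into the particle:

#include <vector>

// Sketch only: names are illustrative.
struct CacheDataMG {
  int t {0}; // cached temperature index into the MGXS data
  int a {0}; // cached angle index into the MGXS data
};

// Before: Mgxs owned one slot per OpenMP thread, indexed by
// omp_get_thread_num(). Adjacent slots can share a cache line (false
// sharing), and in event-based mode a thread does not map to one particle.
struct MgxsBefore {
  std::vector<CacheDataMG> cache; // size = number of threads
};

// After: the cache is a plain member of the particle, so it stays private to
// that particle regardless of which thread happens to advance it.
struct ParticleDataAfter {
  CacheDataMG mg_xs_cache_;
};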

Other Necessary Changes

Number of Caches

The original scheme involved per-thread caches both for all materials and for all nuclides (if present). For simplicity, to reduce memory footprint, and because the determination of temperature and angle parameters is not that expensive, the new particle-owned scheme only maintains a single cache, which now corresponds to only the most recent material index. I.e., in the previous implementation, a particle might traverse from:

Material A -> Material B -> Material A

without changing energy and still reuse a cache entry when re-entering Material A, since each thread kept a separate cache for every material. With the new scheme in this PR, the cache is overwritten when the particle traverses from A -> B, so the temperature and angle indices must be redetermined when re-entering Material A. As the performance analysis section shows, the new scheme is faster overall, so trading a few extra lookups for the memory savings appears to be a net benefit.
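
Continuing the sketch above (again with hypothetical names), the single particle-owned entry only stays valid while the particle remains in the same material; on a material change, the cheap temperature and angle searches are simply redone:

// Sketch: the cached indices are tied to the material they were computed for.
struct MgxsCacheEntry {
  int material {-1}; // material index the cached values belong to
  int t {0};         // cached temperature index
  int a {0};         // cached angle index
};

// Redo the index searches only when the particle has entered a different
// material since the entry was last filled. (A real implementation would also
// account for changes in temperature or angle within the same material; this
// sketch shows only the material guard described above.)
template <typename TempLookup, typename AngleLookup>
void refresh_cache(MgxsCacheEntry& cache, int current_material,
  TempLookup find_temperature_index, AngleLookup find_angle_index)
{
  if (cache.material != current_material) {
    cache.t = find_temperature_index();
    cache.a = find_angle_index();
    cache.material = current_material;
  }
}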

Interfaces

Previously, the Mgxs::get_xs() function that utilized caches had the following interface:

double Mgxs::get_xs(MgxsType xstype, int gin, const int* gout, const double* mu, const int* dg)

but this PR adds two integer arguments for the temperature index and angle index:

double Mgxs::get_xs(MgxsType xstype, int gin, const int* gout, const double* mu, const int* dg, int t, int a)

This means the caller needs to read/update the particle's cache itself and pass these indices to Mgxs::get_xs(). Thankfully, this is only done in two places:

  1. as part of the Mgxs::calculate_xs(Particle& p) function
  2. when scoring tallies in the score_general_mg() function

While refactoring calculate_xs() was straightforward, refactoring the tally scoring routines was more painful, as each score typically makes a few calls to get_xs().
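
For illustration, a simplified caller-side pattern under the new interface might look like the following; the particle cache accessor name and the MgxsType value shown are assumptions for this sketch rather than the exact code in the PR:

#include "openmc/mgxs.h"     // Mgxs, MgxsType
#include "openmc/particle.h" // Particle

// Sketch: the caller resolves the temperature (t) and angle (a) indices from
// the particle-owned cache and passes them explicitly as the new trailing
// arguments of get_xs().
double score_nu_fission_sketch(openmc::Particle& p, openmc::Mgxs& macro_xs, int gin)
{
  const auto& cache = p.mg_xs_cache(); // hypothetical accessor name
  return macro_xs.get_xs(openmc::MgxsType::NU_FISSION, gin,
    nullptr /*gout*/, nullptr /*mu*/, nullptr /*dg*/, cache.t, cache.a);
}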

Correctness

I reused the multi-group multi-temperature pincell from #2590 and ran the same tests on the new branch. As shown in the table below, the parallel event-based mode now gives the correct answer (i.e., the same as a serial run in history-based mode). Therefore, this PR closes #2590.

| Parallelism | Branch | Mode | k-eff |
|---|---|---|---|
| Serial | Thread Cache (Baseline) | history | 1.32339 |
| Serial | Thread Cache (Baseline) | event | 1.32117 |
| MPI 112 Ranks | Thread Cache (Baseline) | event | 1.31853 |
| OpenMP 112 Threads | Thread Cache (Baseline) | event | 1.32285 |
| OpenMP 112 Threads | Particle Cache (This PR) | event | 1.32339 |
| OpenMP 112 Threads | No Cache | event | 1.32339 |

Performance Analysis

I tested the performance of this PR against develop on the C5G7 benchmark. I also tested a version of the code with no MGXS cache at all (i.e., the angle and temperature indices are looked up on the fly each time they are needed). Given that the C5G7 MGXS data is isotropic and isothermal, the no-cache variant was expected to be fast, but it turned out to carry a measurable performance cost.

I ran each configuration 10 times to get a sense of the variance in the timings. I used two test systems: a dual-socket Xeon 8180M node with 56 cores/112 threads total, and a dual-socket Xeon 6336Y node with 48 cores/96 threads total. The 8180M node has so far been the only node to show the poor thread scaling, whereas the 6336Y node behaves well (as most architectures I've tested do on this problem).

2D C5G7 - 5 inactive - 5 active - 100k particles/batch - gcc 12.2.0 - 2x 8180M (56c/112t)

| Parallelism | Branch | Inactive (particles/s) | std. dev. | Active (particles/s) | std. dev. |
|---|---|---|---|---|---|
| Serial | Thread Cache (Baseline) | 39,904 | 104 | 36,669 | 97 |
| MPI - 112 Ranks | Thread Cache (Baseline) | 1,351,320 | 266,381 | 957,790 | 119,047 |
| OpenMP - 112 Threads | Thread Cache (Baseline) | 166,462 | 8,084 | 95,067 | 2,244 |
| OpenMP - 112 Threads | Particle Cache (This PR) | 176,075 | 11,243 | 99,457 | 3,963 |
| OpenMP - 112 Threads | No Cache | 152,166 | 16,650 | 96,740 | 4,366 |

2D C5G7 - 5 inactive - 5 active - 100k particles/batch - gcc 12.2.0 - 2x 6336Y (48c/96t)

| Parallelism | Branch | Inactive (particles/s) | std. dev. | Active (particles/s) | std. dev. |
|---|---|---|---|---|---|
| Serial | Thread Cache (Baseline) | 44,123 | 164 | 40,334 | 136 |
| MPI - 112 Ranks | Thread Cache (Baseline) | 1,445,500 | 47,892 | 1,028,990 | 12,994 |
| OpenMP - 112 Threads | Thread Cache (Baseline) | 1,084,334 | 22,947 | 762,377 | 8,591 |
| OpenMP - 112 Threads | Particle Cache (This PR) | 1,008,600 | 7,293 | 742,361 | 9,525 |
| OpenMP - 112 Threads | No Cache | 854,124 | 28,345 | 656,255 | 21,873 |

Basically, converting the thread-owned XS cache to a particle-owned one does not really change the performance picture: it gives a minor speedup on the troublesome 8180M system but a minor slowdown on the 6336Y node. Removing the cache completely was not a good idea either, as that variant performed noticeably slower than the particle-owned cache on both systems. Thus, #2160 will need to remain open, as this PR did not really address whatever the underlying cause of the poor thread scaling on the 8180M is.

Other Potential Areas of Memory Contention to Investigate on the 8180M node

There are a few other areas that may have been causing memory contention:

  1. The neighbor list (due to locking when appending to a cell's neighbor list)
  2. The shared fission bank (due to atomically incrementing its size counter when appending; see the sketch after this list). This is relatively more costly in MG mode, which usually runs much faster than CE, so contention over this counter becomes more of an issue.
  3. Particle initialization (due to calls to new or malloc in its associated constructors)
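
For context on item 2, the contended pattern is one atomic increment of the shared bank's size for every appended site; a common way to reduce that pressure (shown here only as a generic sketch, not necessarily what was implemented in this investigation) is to buffer sites thread-locally and reserve one block of slots per flush:

#include <cstddef>
#include <vector>

struct SourceSite { double E {0.0}; /* position, direction, weight, ... */ };

// Flush a thread-local buffer into the shared bank using a single atomic
// block reservation instead of one atomic increment per site.
// Assumes shared_bank was sized to its full capacity up front.
void flush_sites(std::vector<SourceSite>& shared_bank, long long& bank_size,
  const std::vector<SourceSite>& local_sites)
{
  long long start;
  long long n = static_cast<long long>(local_sites.size());
#pragma omp atomic capture
  {
    start = bank_size;
    bank_size += n;
  }
  for (std::size_t i = 0; i < local_sites.size(); ++i)
    shared_bank[static_cast<std::size_t>(start) + i] = local_sites[i];
}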

To test whether these were the root causes, I implemented algorithms to remove all contention from these areas, but did not observe any additional speedups. Analysis with perf and VTune was somewhat confounding: they may not have been running correctly on this system due to cluster permission issues, and they identified seemingly odd locations as being impractically expensive. Manually adding OpenMP timers in history-based mode (and examining the performance timers in event-based mode) revealed that basically all the major areas of the code ran slowly once many threads were active. I.e., the slowdown did not appear to be limited to a particular location; rather, the entire program seemed to be running slowly.

I also experimented with using numactl to manually control where allocations happen and which thread affinity policies are used; these had marginal benefit but did not explain the overall poor thread scaling.

Overall, my guess is that there may be a system configuration issue with the 8180M node that is affecting us, or it may simply be a quirk of the Skylake cache architecture that has been remedied in newer architectures since then (2017's 8180M is now 6 years old). As discussed in #2160, the other dual-socket CPUs I've tested have shown good thread scaling, even before this PR, so right now it doesn't seem worth worrying about further. However, if we see similar thread scaling issues on newer chips, I will plan to commit more time to solving this issue. For now, I will consider this a partial remedy for #2160 and will leave that issue open.

Checklist

  • I have performed a self-review of my own code
  • I have run clang-format on any C++ source files (if applicable)
  • I have followed the style guidelines for Python source files (if applicable)
  • I have made corresponding changes to the documentation (if applicable)
  • I have added tests that prove my fix is effective or that my feature works (if applicable)

@jtramm jtramm requested a review from nelsonag as a code owner July 6, 2023 18:40
@jtramm jtramm changed the title Multigroup Per-Thread Cache Removal Multigroup Per-Thread Cache Conversion Jul 6, 2023
@jtramm jtramm removed the request for review from nelsonag July 6, 2023 18:42
@jtramm jtramm added the Bugs label Jul 7, 2023
Contributor

@paulromano paulromano left a comment


Thanks for this fix @jtramm and the very thorough explanation of the whole situation! Other than the very minor line comments below, one thing that I would like to see before we merge is the addition of a test to cover this case so we don't mess it up in the future.

Line comments (resolved): include/openmc/mgxs.h, src/mgxs.cpp
@jtramm
Contributor Author

jtramm commented Jul 17, 2023

I've added a test and can confirm that it fails with develop in event-based mode (e.g., pytest --event) but passes with this branch in both history- and event-based modes.

The CI is giving failures that seem to be unrelated to anything I've added:

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for h5py
  ERROR: Could not build wheels for h5py, which is required to install pyproject.toml-based projects
  Failed to build h5py

  Notice: A new release of pip is available: 23.1.2 -> 23.2
  Notice: To update, run: pip install --upgrade pip
  Error: Process completed with exit code 1.

Any ideas on how to resolve the CI issue?

@paulromano
Contributor

Argh, sometimes I hate CI 😞 I'll look into that and see if I can get to the bottom of it

@paulromano
Contributor

@jtramm I think you'll need to push a new commit to this branch to get CI passing (can even be an empty commit)

@jtramm
Contributor Author

jtramm commented Jul 21, 2023

I added in a comment-only commit -- looks like CI is passing now.

Contributor

@paulromano paulromano left a comment


Thanks @jtramm!

@paulromano paulromano enabled auto-merge (squash) July 21, 2023 04:48
@paulromano paulromano merged commit b5001ee into openmc-dev:develop Jul 21, 2023
16 checks passed
stchaker pushed a commit to stchaker/openmc that referenced this pull request Oct 25, 2023