
Multigroup Per-Thread Cache Conversion #2591

Merged: 11 commits, Jul 21, 2023

Conversation

jtramm
Contributor

@jtramm jtramm commented Jul 6, 2023

Overview

This PR converts per-thread multigroup cross section data caches into particle-owned caches. It is a partial remedy for #2160 and a full fix for #2590. This PR closes #2590, but #2160 will remain open since this PR does not solve all of the performance issues there.

As discussed in Issue #2160, the use of the per-thread cache involved access patterns like:

openmc/src/mgxs.cpp

Lines 428 to 439 in 33ead74

double Mgxs::get_xs(
  MgxsType xstype, int gin, const int* gout, const double* mu, const int* dg)
{
  // This method assumes that the temperature and angle indices are set
#ifdef _OPENMP
  int tid = omp_get_thread_num();
  XsData* xs_t = &xs[cache[tid].t];
  int a = cache[tid].a;
#else
  XsData* xs_t = &xs[cache[0].t];
  int a = cache[0].a;
#endif

This existing strategy was problematic for two reasons:

  1. It is a canonical case of multithreaded "false sharing": multiple cache objects fit on the same cache line but are read and written by different threads (Poor thread scaling in multi-group mode #2160).

  2. It generates incorrect results in event-based mode (Multigroup MC Gives Wrong Answer in Event-Based Mode #2590), since in that mode a single thread processes many particles, so per-thread particle state like this is meaningless.

To fix these issues, this PR simply moves the cache into the ParticleData class. I also experimented with removing the MGXS cache altogether, but found that retaining the cache in the particle gave better performance.
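
As a rough illustration of the change (struct and member names here are assumptions for the sketch, not necessarily the exact OpenMC identifiers), the temperature/angle cache moves out of Mgxs, where it was sized by the thread count, and into the particle:

#include <vector>

// Sketch only: names are illustrative.
struct CacheDataMG {
  int t {0}; // cached temperature index into the MGXS data
  int a {0}; // cached angle index into the MGXS data
};

// Before: Mgxs owned one slot per OpenMP thread, indexed by
// omp_get_thread_num(). Adjacent slots can share a cache line (false
// sharing), and in event-based mode a thread does not map to one particle.
struct MgxsBefore {
  std::vector<CacheDataMG> cache; // size = number of threads
};

// After: the cache is a plain member of the particle, so it stays private to
// that particle regardless of which thread happens to advance it.
struct ParticleDataAfter {
  CacheDataMG mg_xs_cache_;
};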

Other Necessary Changes

Number of Caches

The original scheme involved per-thread caches both for all materials and for all nuclides (if present). For simplicity, to reduce memory footprint, and because the determination of temperature and angle parameters is not that expensive, the new particle-owned scheme only maintains a single cache, which now corresponds to only the most recent material index. I.e., in the previous implementation, a particle might traverse from:

Material A -> Material B -> Material A

without changing energy and still reuse a cache entry when re-entering Material A, since each thread kept a separate cache for every material. With the new scheme in this PR, the cache is overwritten when the particle traverses from A -> B, so the temperature and angle indices must be redetermined when re-entering Material A. As the performance analysis section shows, the new scheme is faster overall, so trading a few extra lookups for the memory savings appears to be a net benefit.
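
Continuing the sketch above (again with hypothetical names), the single particle-owned entry only stays valid while the particle remains in the same material; on a material change, the cheap temperature and angle searches are simply redone:

// Sketch: the cached indices are tied to the material they were computed for.
struct MgxsCacheEntry {
  int material {-1}; // material index the cached values belong to
  int t {0};         // cached temperature index
  int a {0};         // cached angle index
};

// Redo the index searches only when the particle has entered a different
// material since the entry was last filled. (A real implementation would also
// account for changes in temperature or angle within the same material; this
// sketch shows only the material guard described above.)
template <typename TempLookup, typename AngleLookup>
void refresh_cache(MgxsCacheEntry& cache, int current_material,
  TempLookup find_temperature_index, AngleLookup find_angle_index)
{
  if (cache.material != current_material) {
    cache.t = find_temperature_index();
    cache.a = find_angle_index();
    cache.material = current_material;
  }
}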

Interfaces

Previously, the Mgxs::get_xs() function that utilized caches had the following interface:

double Mgxs::get_xs(MgxsType xstype, int gin, const int* gout, const double* mu, const int* dg)

but this PR adds two integer arguments for the temperature index and angle index:

double Mgxs::get_xs(MgxsType xstype, int gin, const int* gout, const double* mu, const int* dg, int t, int a)

This means the caller needs to read/update the particle's cache itself and pass these indices to Mgxs::get_xs(). Thankfully, this is only done in two places:

  1. as part of the Mgxs::calculate_xs(Particle& p) function
  2. when scoring tallies in the score_general_mg() function

While refactoring calculate_xs() was straightforward, refactoring the tally scoring routines was more painful, as each score typically makes a few calls to get_xs().
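
For illustration, a simplified caller-side pattern under the new interface might look like the following; the particle cache accessor name and the MgxsType value shown are assumptions for this sketch rather than the exact code in the PR:

#include "openmc/mgxs.h"     // Mgxs, MgxsType
#include "openmc/particle.h" // Particle

// Sketch: the caller resolves the temperature (t) and angle (a) indices from
// the particle-owned cache and passes them explicitly as the new trailing
// arguments of get_xs().
double score_nu_fission_sketch(openmc::Particle& p, openmc::Mgxs& macro_xs, int gin)
{
  const auto& cache = p.mg_xs_cache(); // hypothetical accessor name
  return macro_xs.get_xs(openmc::MgxsType::NU_FISSION, gin,
    nullptr /*gout*/, nullptr /*mu*/, nullptr /*dg*/, cache.t, cache.a);
}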

Correctness

I reused the multi-group multi-temperature pincell from #2590 and ran the same tests on the new branch. As shown in the table below, the parallel event-based mode now gives the correct answer (i.e., the same as a serial run in history-based mode). Therefore, this PR closes #2590.

| Parallelism | Branch | Mode | k-eff |
|---|---|---|---|
| Serial | Thread Cache (Baseline) | history | 1.32339 |
| Serial | Thread Cache (Baseline) | event | 1.32117 |
| MPI 112 Ranks | Thread Cache (Baseline) | event | 1.31853 |
| OpenMP 112 Threads | Thread Cache (Baseline) | event | 1.32285 |
| OpenMP 112 Threads | Particle Cache (This PR) | event | 1.32339 |
| OpenMP 112 Threads | No Cache | event | 1.32339 |

Performance Analysis

I tested the performance of this PR against develop on the C5G7 benchmark. I also tested a version of the code with no MGXS cache at all (i.e., the angle and temperature indices are looked up on the fly each time they are needed). Given that the C5G7 MGXS data is isotropic and isothermal, the no-cache variant was expected to be fast, but it turned out to carry a measurable performance cost.

I ran each configuration 10 times to get a sense of the variance in the timings. I used two test systems: a dual-socket Xeon 8180M node with 56 cores/112 threads total, and a dual-socket Xeon 6336Y node with 48 cores/96 threads total. The 8180M node has so far been the only node to show the poor thread scaling, whereas the 6336Y node behaves well (as most architectures I've tested do on this problem).

2D C5G7 - 5 inactive - 5 active - 100k particles/batch - gcc 12.2.0 - 2x 8180M (56c/112t)

| Parallelism | Branch | Inactive (particles/s) | std. dev. | Active (particles/s) | std. dev. |
|---|---|---|---|---|---|
| Serial | Thread Cache (Baseline) | 39,904 | 104 | 36,669 | 97 |
| MPI - 112 Ranks | Thread Cache (Baseline) | 1,351,320 | 266,381 | 957,790 | 119,047 |
| OpenMP - 112 Threads | Thread Cache (Baseline) | 166,462 | 8,084 | 95,067 | 2,244 |
| OpenMP - 112 Threads | Particle Cache (This PR) | 176,075 | 11,243 | 99,457 | 3,963 |
| OpenMP - 112 Threads | No Cache | 152,166 | 16,650 | 96,740 | 4,366 |

2D C5G7 - 5 inactive - 5 active - 100k particles/batch - gcc 12.2.0 - 2x 6336Y (48c/96t)

| Parallelism | Branch | Inactive (particles/s) | std. dev. | Active (particles/s) | std. dev. |
|---|---|---|---|---|---|
| Serial | Thread Cache (Baseline) | 44,123 | 164 | 40,334 | 136 |
| MPI - 112 Ranks | Thread Cache (Baseline) | 1,445,500 | 47,892 | 1,028,990 | 12,994 |
| OpenMP - 112 Threads | Thread Cache (Baseline) | 1,084,334 | 22,947 | 762,377 | 8,591 |
| OpenMP - 112 Threads | Particle Cache (This PR) | 1,008,600 | 7,293 | 742,361 | 9,525 |
| OpenMP - 112 Threads | No Cache | 854,124 | 28,345 | 656,255 | 21,873 |

Basically, converting the thread-owned XS cache to a particle-owned one does not really change the performance picture: it gives a minor speedup on the troublesome 8180M system but a minor slowdown on the 6336Y node. Removing the cache completely was not a good idea either, as that variant performed noticeably slower than the particle-owned cache on both systems. Thus, #2160 will need to remain open, as this PR did not really address whatever the underlying cause of the poor thread scaling on the 8180M is.

Other Potential Areas of Memory Contention to Investigate on the 8180M node

There are a few other areas that may have been causing memory contention:

  1. The neighbor list (due to locking when appending to a cell's neighbor list)
  2. The shared fission bank (due to atomically incrementing its size counter when appending; see the sketch after this list). This is relatively more costly in MG mode, which usually runs much faster than CE, so contention over this counter becomes more of an issue.
  3. Particle initialization (due to calls to new or malloc in its associated constructors)
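
For context on item 2, the contended pattern is one atomic increment of the shared bank's size for every appended site; a common way to reduce that pressure (shown here only as a generic sketch, not necessarily what was implemented in this investigation) is to buffer sites thread-locally and reserve one block of slots per flush:

#include <cstddef>
#include <vector>

struct SourceSite { double E {0.0}; /* position, direction, weight, ... */ };

// Flush a thread-local buffer into the shared bank using a single atomic
// block reservation instead of one atomic increment per site.
// Assumes shared_bank was sized to its full capacity up front.
void flush_sites(std::vector<SourceSite>& shared_bank, long long& bank_size,
  const std::vector<SourceSite>& local_sites)
{
  long long start;
  long long n = static_cast<long long>(local_sites.size());
#pragma omp atomic capture
  {
    start = bank_size;
    bank_size += n;
  }
  for (std::size_t i = 0; i < local_sites.size(); ++i)
    shared_bank[static_cast<std::size_t>(start) + i] = local_sites[i];
}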

To test whether these were the root causes, I implemented algorithms to remove all contention from these areas, but did not observe any additional speedups. Analysis with perf and VTune was somewhat confounding: they may not have been running correctly on this system due to cluster permission issues, and they identified seemingly odd locations as being impractically expensive. Manually adding OpenMP timers in history-based mode (and examining the performance timers in event-based mode) revealed that basically all the major areas of the code ran slowly once many threads were active. I.e., the slowdown did not appear to be limited to a particular location; rather, the entire program seemed to be running slowly.

I also experimented with using numactl to manually control where allocations happen and which thread affinity policies are used; these had marginal benefit but did not explain the overall poor thread scaling.

Overall, my guess is that there may be a system configuration issue with the 8180M node that is affecting us, or it may simply be a quirk of the Skylake cache architecture that has been remedied in newer architectures since then (2017's 8180M is now 6 years old). As discussed in #2160, the other dual-socket CPUs I've tested have shown good thread scaling, even before this PR, so right now it doesn't seem worth worrying about further. However, if we see similar thread scaling issues on newer chips, I will plan to commit more time to solving this issue. For now, I will consider this a partial remedy for #2160 and will leave that issue open.

Checklist

  • I have performed a self-review of my own code
  • I have run clang-format on any C++ source files (if applicable)
  • I have followed the style guidelines for Python source files (if applicable)
  • I have made corresponding changes to the documentation (if applicable)
  • I have added tests that prove my fix is effective or that my feature works (if applicable)

@jtramm jtramm requested a review from nelsonag as a code owner July 6, 2023 18:40
@jtramm jtramm changed the title Multigroup Per-Thread Cache Removal Multigroup Per-Thread Cache Conversion Jul 6, 2023
@jtramm jtramm removed the request for review from nelsonag July 6, 2023 18:42
@jtramm jtramm added the Bugs label Jul 7, 2023
Contributor

@paulromano paulromano left a comment


Thanks for this fix @jtramm and the very thorough explanation of the whole situation! Other than the very minor line comments below, one thing that I would like to see before we merge is the addition of a test to cover this case so we don't mess it up in the future.

Line comments (resolved): include/openmc/mgxs.h, src/mgxs.cpp
@jtramm
Contributor Author

jtramm commented Jul 17, 2023

I've added a test and can confirm that it fails with develop in event-based mode (e.g., pytest --event) but passes with this branch in both history- and event-based modes.

The CI is giving failures that seem to be unrelated to anything I've added:

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for h5py
  ERROR: Could not build wheels for h5py, which is required to install pyproject.toml-based projects
  Failed to build h5py

  Notice: A new release of pip is available: 23.1.2 -> 23.2
  Notice: To update, run: pip install --upgrade pip
  Error: Process completed with exit code 1.

Any ideas on how to resolve the CI issue?

@paulromano
Contributor

Argh, sometimes I hate CI 😞 I'll look into that and see if I can get to the bottom of it

@paulromano
Contributor

@jtramm I think you'll need to push a new commit to this branch to get CI passing (can even be an empty commit)

@jtramm
Contributor Author

jtramm commented Jul 21, 2023

I added in a comment-only commit -- looks like CI is passing now.

Contributor

@paulromano paulromano left a comment


Thanks @jtramm!

@paulromano paulromano enabled auto-merge (squash) July 21, 2023 04:48
@paulromano paulromano merged commit b5001ee into openmc-dev:develop Jul 21, 2023
16 checks passed
stchaker pushed a commit to stchaker/openmc that referenced this pull request Oct 25, 2023