
Regression Test Heap Optimization: Remove Extinct Result Arrays #433

Open
ozmorph wants to merge 1 commit into master

Conversation

@ozmorph (Contributor) commented Jul 2, 2020

Objective

Reduce RAM usage in regression tests so that GitHub Action runners can run all 4 tests without paging memory.

Background

The Results struct in Model.h is nearly 805 KB when compiled with Clang 9 on Linux. TSMeanE and TSVarE are each arrays of P.NumSamples (720 in the regression tests) of these structs, so each requires an allocation of around 580 MB. As a result, a single regression test uses over 1.1 GB of RAM for these two arrays alone, and when all 4 tests are run in parallel they consume 4.4 GB of RAM.
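The arithmetic above can be sketched as a few compile-time constants. This is a minimal illustration, not the real CovidSim code: the constant names are hypothetical, and the real values come from `sizeof(Results)` in Model.h and `P.NumSamples` in the parameter file.

```cpp
#include <cstddef>

// Figures quoted above (approximate; real values come from sizeof(Results)
// and P.NumSamples in the regression-test configuration).
constexpr std::size_t kResultsStructBytes = 805 * 1024;  // ~805 KB per sample
constexpr std::size_t kNumSamples = 720;                 // regression-test value

// Bytes needed by one time-series array (e.g. TSMeanE or TSVarE).
constexpr std::size_t time_series_bytes() {
    return kResultsStructBytes * kNumSamples;  // ~580 MB
}

// Total for both arrays in one test run, in MB (~1.1 GB).
constexpr std::size_t per_test_mb() {
    return 2 * time_series_bytes() / (1024 * 1024);
}
```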

Rationale

As noted in #388, the default GitHub Actions runners have only 7 GB of RAM and a 14 GB temporary storage device allocated to them. On Linux the swap space is set to 4 GB, and on Windows it is most likely set to 2 GB.

Local runs using heaptrack indicate an average heap size of about 3.5 GB per CovidSim run. Even accounting for the fact that the US regression tests run for a fraction of the time the UK tests do, let's assume the regression tests use 7 GB of RAM on average and 14 GB at peak. Windows Server 2019 itself requires 512 MB of RAM to run, so if 7 GB is the total amount of RAM given to the runners, that leaves little room for the operating system. I therefore surmise that CovidSim is hitting paged memory for at least the first half of a ctest -j6 -V run.
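The headroom estimate above can be written out explicitly. The constants below are the rough figures from this paragraph, not measured values, so treat this as a back-of-the-envelope sketch:

```cpp
// Rough headroom arithmetic from the figures quoted above (all estimates).
constexpr double kRunnerRamGb = 7.0;      // default GitHub Actions runner
constexpr double kOsFootprintGb = 0.5;    // Windows Server 2019 minimum
constexpr double kAvgHeapPerRunGb = 3.5;  // from local heaptrack runs

// With N tests running concurrently, paging is expected once their combined
// heap exceeds what is left after the OS takes its share.
constexpr bool expect_paging(int concurrent_runs) {
    return concurrent_runs * kAvgHeapPerRunGb > kRunnerRamGb - kOsFootprintGb;
}
```

Under these assumptions a single run fits, but two or more concurrent runs already push past the available RAM.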

Solution

This pull request removes the TSMeanE and TSVarE arrays from the code, since they have not contributed to the final output since the initial squash of commits back in April. The regression tests still pass.

Known Concerns

If this code is still used on the supercomputers or outside of the regression tests listed on GitHub, then please close this PR and I will open a new PR that documents the arrays' use, so that no one attempts to propose this change again in the future.

Alternative/Additional Solution

The memory requirements of the Results arrays could also be reduced by replacing the static double arrays of size MAX_ADUNITS (which is set to 3200, though the regression tests use only 1 or 2 elements) with heap-allocated C-style arrays or std::vectors. However, this would require far deeper and more extensive changes to the code base than this pull request makes.
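The std::vector alternative can be sketched in a few lines. The struct and field names here are illustrative stand-ins, not the real CovidSim types; the point is that each row allocates only as many per-admin-unit slots as the run actually uses, rather than a fixed MAX_ADUNITS:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the alternative: size per-admin-unit storage to the number of
// units in the run instead of a fixed MAX_ADUNITS (3200). "ResultsRow" and
// "num_adunits" are hypothetical names for illustration only.
struct ResultsRow {
    std::vector<double> incidence;  // replaces: double incidence[MAX_ADUNITS];

    explicit ResultsRow(std::size_t num_adunits)
        : incidence(num_adunits, 0.0) {}  // allocate only what the run needs
};
```

For a regression test with 2 admin units this shrinks each such array from 3200 doubles (25 KB) to 16 bytes, at the cost of touching every site that indexes the arrays.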

… they are not currently used in the final output. The Results struct is somewhere around 800 KB; multiplied by P.NumSamples (720 in the regression tests), TSMeanE and TSVarE each require an allocation of around 580 MB. As a result, the regression tests, when run in parallel, use 1160 x 4 = 4640 MB (4.53 GB) less RAM than before, which allows the default GitHub Action runners, with only 7 GB total, to run all 4 tests without having to page memory to disk.
@matt-gretton-dann (Collaborator) left a comment


We still use them, but I like the idea of not initialising them when they are not needed. Could we use predicates need_mean() and need_var() to determine whether to initialise and populate them, and assert that the predicates hold when we produce output that makes use of them?
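The suggestion above might look something like the following sketch. The struct and member names are hypothetical; in the real code the predicates would be derived from the run configuration, and TSMeanE/TSVarE are the arrays in question:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of predicate-guarded allocation: only build the mean/variance time
// series when the run needs them, and assert the predicates at output time.
// All names here are illustrative, not the real CovidSim identifiers.
struct TimeSeriesStore {
    bool want_mean = false;  // would come from the run configuration
    bool want_var = false;

    std::vector<double> ts_mean_e;  // stands in for TSMeanE
    std::vector<double> ts_var_e;   // stands in for TSVarE

    bool need_mean() const { return want_mean; }
    bool need_var() const { return want_var; }

    void allocate(std::size_t num_samples) {
        if (need_mean()) ts_mean_e.assign(num_samples, 0.0);
        if (need_var()) ts_var_e.assign(num_samples, 0.0);
    }

    void output_mean() const {
        // Catch any output path that uses the array without having asked
        // for it to be initialised.
        assert(need_mean() && "mean output requested but not initialised");
        // ... write ts_mean_e to the output files ...
    }
};
```

A run that never requests the mean/variance output then pays nothing for the two ~580 MB arrays.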

I'm surprised the regression tests don't make use of them. We should add a config that does, but we can do that in a way that uses less memory (fewer time steps?) to check the main steps.

Also note that in the CI loop we only run -j2 as the Actions VMs are all dual-core.

@ozmorph (Contributor, Author) commented Jul 2, 2020

Sounds good! I like having a check as an alternative.

I am surprised too that it did not affect run-time. It could also be that the paging algorithm and Windows allocators are good enough to hide this cost. I'm going to try to replicate the CI binaries on my Windows system and analyze the performance versus my Linux runs to try to determine where the real difference lies.

Good to note about running only 2 tests at once on the CI.

@ozmorph (Contributor, Author) commented Jul 2, 2020

@matt-gretton-dann Using the exact same compiler that GitHub Actions uses, I get similar performance on my local machine to what Clang 9 on WSL Ubuntu 20.04 achieves. This led me down a rabbit hole, but I think your past theory that all of the fprintf() calls are responsible for the slowdown is probably correct.

I/O performance appears to be a common issue for GitHub Actions runners (especially on Windows), and this project is not the only one affected. It also explains why Windows performance is so drastically different from Linux and Mac.

To conclude, reducing memory usage will not by itself make the regression tests faster, but I think it would provide the extra headroom needed to track more variables in additional simulations in the future.

@matt-gretton-dann (Collaborator) commented

This led me down a rabbit hole, but I think your past theory that all of the fprintf() calls are responsible for the slowdown is probably correct.

I have a WIP branch that enables us to control output verbosity, so that should show whether I am right or not.
