
Regression Test Heap Optimization: Remove Extinct Result Arrays #433

Open
ozmorph wants to merge 1 commit into master

Conversation

@ozmorph (Contributor) commented Jul 2, 2020

Objective

Reduce RAM usage in regression tests so that GitHub Action runners can run all 4 tests without paging memory.

Background

The Results struct in Model.h is nearly 805 KB when compiled with Clang 9 on Linux. TSMeanE and TSVarE are each arrays of P.NumSamples (720 in the regression tests) of these structs, so each requires an allocation of around 580 MB. As a result, a single regression test uses over 1.1 GB of RAM for these two arrays alone, and when all 4 tests are run in parallel they consume 4.4 GB of RAM.
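The arithmetic above can be sketched as a few compile-time constants. This is a minimal illustration, not the real CovidSim code: the constant names are hypothetical, and the real values come from `sizeof(Results)` in Model.h and `P.NumSamples` in the parameter file.

```cpp
#include <cstddef>

// Figures quoted above (approximate; real values come from sizeof(Results)
// and P.NumSamples in the regression-test configuration).
constexpr std::size_t kResultsStructBytes = 805 * 1024;  // ~805 KB per sample
constexpr std::size_t kNumSamples = 720;                 // regression-test value

// Bytes needed by one time-series array (e.g. TSMeanE or TSVarE).
constexpr std::size_t time_series_bytes() {
    return kResultsStructBytes * kNumSamples;  // ~580 MB
}

// Total for both arrays in one test run, in MB (~1.1 GB).
constexpr std::size_t per_test_mb() {
    return 2 * time_series_bytes() / (1024 * 1024);
}
```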

Rationale

As noted in #388, the default GitHub Actions runners have only 7 GB of RAM and a 14 GB temporary storage device allocated to them. On Linux the swap space is set to 4 GB, and on Windows it is most likely set to 2 GB.

Local runs using heaptrack indicate an average heap size of about 3.5 GB per CovidSim run. Even accounting for the fact that the US regression tests run for a fraction of the time the UK tests do, let's assume the regression tests use 7 GB of RAM on average and 14 GB at peak. Windows Server 2019 itself requires 512 MB of RAM to run, so if 7 GB is the total amount of RAM given to the runners, that leaves little room for the operating system. I therefore surmise that CovidSim is hitting paged memory for at least the first half of a ctest -j6 -V run.
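The headroom estimate above can be written out explicitly. The constants below are the rough figures from this paragraph, not measured values, so treat this as a back-of-the-envelope sketch:

```cpp
// Rough headroom arithmetic from the figures quoted above (all estimates).
constexpr double kRunnerRamGb = 7.0;      // default GitHub Actions runner
constexpr double kOsFootprintGb = 0.5;    // Windows Server 2019 minimum
constexpr double kAvgHeapPerRunGb = 3.5;  // from local heaptrack runs

// With N tests running concurrently, paging is expected once their combined
// heap exceeds what is left after the OS takes its share.
constexpr bool expect_paging(int concurrent_runs) {
    return concurrent_runs * kAvgHeapPerRunGb > kRunnerRamGb - kOsFootprintGb;
}
```

Under these assumptions a single run fits, but two or more concurrent runs already push past the available RAM.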

Solution

This pull request removes the TSMeanE and TSVarE arrays from the code, since they have not contributed to the final output since the initial squash of commits back in April. The regression tests still pass.

Known Concerns

If this code is still used on the supercomputers or outside of the regression tests listed on GitHub, then please close this PR and I will open a new PR that documents the arrays' use, so that no one attempts to propose this change again in the future.

Alternative/Additional Solution

The memory requirements of the Results arrays could also be reduced by replacing the static double arrays of size MAX_ADUNITS (which is set to 3200, though the regression tests use only 1 or 2 elements) with heap-allocated C-style arrays or std::vectors. However, this would require far deeper and more extensive changes to the code base than this pull request makes.
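The std::vector alternative can be sketched in a few lines. The struct and field names here are illustrative stand-ins, not the real CovidSim types; the point is that each row allocates only as many per-admin-unit slots as the run actually uses, rather than a fixed MAX_ADUNITS:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the alternative: size per-admin-unit storage to the number of
// units in the run instead of a fixed MAX_ADUNITS (3200). "ResultsRow" and
// "num_adunits" are hypothetical names for illustration only.
struct ResultsRow {
    std::vector<double> incidence;  // replaces: double incidence[MAX_ADUNITS];

    explicit ResultsRow(std::size_t num_adunits)
        : incidence(num_adunits, 0.0) {}  // allocate only what the run needs
};
```

For a regression test with 2 admin units this shrinks each such array from 3200 doubles (25 KB) to 16 bytes, at the cost of touching every site that indexes the arrays.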

… they are not currently used in the final output. The Results struct is somewhere around 800 KB; multiplied by P.NumSamples (720 in the regression tests), TSMeanE and TSVarE each require an allocation of around 580 MB. As a result, the regression tests, when run in parallel, use 1160 x 4 = 4640 MB (4.53 GB) less RAM than before, which allows the default GitHub Action runners, with only 7 GB total, to run all 4 tests without having to page memory to disk.
@matt-gretton-dann (Collaborator) left a comment


We still use them, but I like the idea of not initialising them when they are not needed. Could we use predicates need_mean() and need_var() to determine whether to initialise and populate them, and assert that the predicates hold when we produce output that makes use of them?
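The suggestion above might look something like the following sketch. The struct and member names are hypothetical; in the real code the predicates would be derived from the run configuration, and TSMeanE/TSVarE are the arrays in question:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of predicate-guarded allocation: only build the mean/variance time
// series when the run needs them, and assert the predicates at output time.
// All names here are illustrative, not the real CovidSim identifiers.
struct TimeSeriesStore {
    bool want_mean = false;  // would come from the run configuration
    bool want_var = false;

    std::vector<double> ts_mean_e;  // stands in for TSMeanE
    std::vector<double> ts_var_e;   // stands in for TSVarE

    bool need_mean() const { return want_mean; }
    bool need_var() const { return want_var; }

    void allocate(std::size_t num_samples) {
        if (need_mean()) ts_mean_e.assign(num_samples, 0.0);
        if (need_var()) ts_var_e.assign(num_samples, 0.0);
    }

    void output_mean() const {
        // Catch any output path that uses the array without having asked
        // for it to be initialised.
        assert(need_mean() && "mean output requested but not initialised");
        // ... write ts_mean_e to the output files ...
    }
};
```

A run that never requests the mean/variance output then pays nothing for the two ~580 MB arrays.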

I'm surprised the regression tests don't make use of them. We should add a config that does, but we can do that in a way that uses less memory (fewer time steps?) to check the main steps.

Also note that in the CI loop we only run -j2 as the Actions VMs are all dual-core.

@ozmorph (Contributor, Author) commented Jul 2, 2020

Sounds good! I like having a check as an alternative.

I am surprised too that it did not affect run-time. It could also be that the paging algorithm and Windows allocators are good enough to hide this cost. I'm going to try to replicate the CI binaries on my Windows system and analyze the performance versus my Linux runs to try to determine where the real difference lies.

Good to note about running only 2 tests at once on the CI.

@ozmorph (Contributor, Author) commented Jul 2, 2020

@matt-gretton-dann Using the exact same compiler that GitHub Actions uses, I get similar performance on my local machine to what Clang 9 on WSL Ubuntu 20.04 achieves. This led me down a rabbit hole, but I think your past theory that all of the fprintf() calls are responsible for the slowdown is probably correct.

I/O performance appears to be a common issue for GitHub Actions runners (especially on Windows), and this project is not the only one affected. It also explains why Windows performance is so drastically different from Linux and Mac.

To conclude, reducing memory usage will not by itself make the regression tests faster, but I think it would provide the extra headroom needed to track more variables in additional simulations in the future.

@matt-gretton-dann (Collaborator) commented

This led me down a rabbit hole, but I think your past theory that all of the fprintf() calls are responsible for the slowdown is probably correct.

I have a WIP branch that enables us to control output verbosity, so that should show whether I am right or not.
