
Memory growth #839

Open
inducer opened this issue Feb 21, 2023 · 11 comments
@inducer
Contributor

inducer commented Feb 21, 2023

I understand that there is some type of memory growth occurring.

From the 2023-02-17 dev meeting notes, I gather that

  • that memory growth only occurs when using the memory pool
  • "Using jemalloc fixes the issue": is that before or after turning off the pool?

Possibly related: #212.

cc @matthiasdiener

@inducer
Contributor Author

inducer commented Feb 21, 2023

Is the growth reflected in the memory pool's statistics? I.e., do those increase timestep-over-timestep?

If there is growth, can you identify which bins in the memory pool are affected? Can you identify which allocations? Python makes it straightforward to attach stack traces to allocations.
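On the Python side, one stdlib way to attach stack traces to allocations is tracemalloc (a sketch; it only sees Python-level allocations, not the pool's device-side bins, and the bytearrays here are just stand-ins for suspect allocations):

```python
import tracemalloc

tracemalloc.start(10)  # record up to 10 stack frames per allocation

# ... run a few timesteps; here, a stand-in for suspect allocations:
data = [bytearray(1 << 20) for _ in range(4)]

snapshot = tracemalloc.take_snapshot()
# group allocations by allocating stack trace, largest first
for stat in snapshot.statistics("traceback")[:3]:
    print(f"{stat.size / 1024:.0f} KiB in {stat.count} blocks, allocated at:")
    for line in stat.traceback.format():
        print(line)
```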

Do we know if this growth is of "array" memory or "other" memory?

How do your findings change if you call free_held between steps?

What is the simplest driver that exhibits the growth? I gather from @lukeolson that, maybe, examples/wave-lazy.py may be affected. Could you please confirm? Is grudge/examples/wave/wave-min-mpi.py affected as well? Is, say, vortex-mpi.py affected? Grudge's Euler?

@lukeolson
Contributor

Also, it looks like set_trace is exposed, so you could get some additional information from that:
https://github.com/inducer/pyopencl/blob/main/src/mempool.hpp#L164

including bin size data

@matthiasdiener
Member

matthiasdiener commented Feb 21, 2023

  • that memory growth only occurs when using the memory pool

The growth happens both with and without the pool. Here is an example with drivers_y2-prediction/smoke_test_ks (lazy-eval), 1 rank, Lassen CPU (y-axes are in "MByte"):

with SVM mempool:

[memory usage graph]

with non-pool SVM:
[memory usage graph]

  • both seem to "level off" after ~140 steps, but memory will likely keep growing in longer runs. See e.g. this graph for a different Lassen run (SVM pool):

    [memory usage graph]

  • These results are qualitatively reproducible between runs, but quantitatively differ widely (even when rerunning the same exact configuration).

  • Using CL buffers vs. SVM allocations seems to show the same behavior.

@inducer
Contributor Author

inducer commented Feb 21, 2023

  • Please confirm that the relevant growth is only of memory allocated via OpenCL.
  • I gather that you are using some (unspecified?) system/process-level metric of memory usage. What do things look like at the level of the OpenCL API? If you keep a running tally of memory allocated via OpenCL, does that grow as well or stay constant?
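Such a running tally could be kept with a small wrapper around whatever allocation routine is in use (a sketch; `CountingAllocator` is a hypothetical name, and `bytearray` stands in for the real OpenCL allocation call):

```python
class CountingAllocator:
    """Wrap a base allocation function and track live (unfreed) bytes."""

    def __init__(self, base_alloc):
        self._base_alloc = base_alloc
        self.live_bytes = 0

    def allocate(self, nbytes):
        self.live_bytes += nbytes
        return self._base_alloc(nbytes)

    def free(self, buf, nbytes):
        # a real version would also release buf via the underlying API
        self.live_bytes -= nbytes


# usage with a dummy backend standing in for the OpenCL allocator:
alloc = CountingAllocator(bytearray)
a = alloc.allocate(1024)
b = alloc.allocate(2048)
alloc.free(a, 1024)
print(alloc.live_bytes)  # 2048: only b is still live
```

Logging `live_bytes` once per timestep would show directly whether OpenCL-level allocations grow or stay constant.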

Btw, please keep vertical space in mind when writing issue text. Write claims, and hide supporting evidence under a <details>. I've done that for your comment above.

@matthiasdiener
Member

Tracing the memory pool allocations with set_trace (and using #840) with the same config as before (1 rank, smoke_test_ks, CPU) revealed some interesting information:

  • the (SVM) memory pool keeps growing throughout execution, although for some reason the active byte count drops by 75% at step 38
  • throughout the whole execution, at least some memory pool requests required new allocations, i.e. [pool] allocation of size 1511472 required new memory

[memory usage graph]

How do your findings change if you call free_held between steps?

Looking at the graph above, it seems like freeing the held memory may not help?


I gather that you are using some (unspecified?) system/process-level metric of memory usage.

The memory usage I initially added here is the RSS high water mark measured with illinois-ceesd/logpyle#79 (= max_rss).
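For reference, that high-water mark is what the OS reports via getrusage (a sketch using the stdlib resource module; whether logpyle's max_rss uses exactly this call is an assumption here, and note the platform-dependent units):

```python
import resource
import sys

usage = resource.getrusage(resource.RUSAGE_SELF)
# ru_maxrss units differ by platform: KiB on Linux, bytes on macOS
scale = 1 if sys.platform == "darwin" else 1024
print(f"RSS high-water mark: {usage.ru_maxrss * scale / 2**20:.1f} MiB")
```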

@inducer
Contributor Author

inducer commented Feb 22, 2023

Thanks. This tally of pool-held memory means (to me) that the issue is very likely "above" the pool, i.e. in Python. I.e., replacing the memory allocation scheme used by the pool should not help, or at least not much.

My read of this is that some member of a group of objects that cyclically refer to each other holds a reference to our arrays. This follows because Python's refcounting frees objects without cyclic referents effectively instantaneously, i.e. as soon as a reference to them is no longer being held.

To validate the latter conclusion, you could try calling gc.collect() every $N$ time steps to see if that helps free those objects. (Of course, that won't do much if there is some cyclic behavior in what references are held.)

Assuming the above conclusion is correct, the way to address this would be to find the objects referring to the arrays and make it so they no longer hold those references.

@matthiasdiener
Member

matthiasdiener commented Feb 22, 2023

What is the simplest driver that exhibits the growth? I gather from @lukeolson that, maybe, examples/wave-lazy.py may be affected. Could you please confirm? Is grudge/examples/wave/wave-min-mpi.py affected as well? Is, say, vortex-mpi.py affected? Grudge's Euler?

I've seen the growth in all drivers I tried, including the simplest ones:

  • Mirgecom's wave, wave-mpi
  • Grudge's euler/vortex, wave/wave-op-mpi

The growth only happens in lazy mode, not eager. Neither the specific memory pool used (SVM, CL buffer) nor the lazy actx class seems to matter.

Graph for mirgecom's wave:

[memory usage graph]

@matthiasdiener
Member

matthiasdiener commented Feb 22, 2023

To validate the latter conclusion, you could try calling gc.collect() every N time steps to see if that helps free those objects. (Of course, that won't do much if there is some cyclic behavior in what references are held.)

It does seem that running gc.collect mitigates this issue for us. The following results are for smoke_test_ks, but they are similar for the simpler test cases.

GC collect every 10 steps (no measurable performance overhead):
[memory usage graph]

GC collect every 1 step (~25% performance overhead):
[memory usage graph]
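Schematically, the workaround amounts to this (a sketch; the step function and the interval are placeholders, not the actual driver code):

```python
import gc

GC_INTERVAL = 10  # collecting every step gave ~25% overhead; every 10 was free


def run_timesteps(nsteps, step_fn):
    for istep in range(nsteps):
        step_fn(istep)
        if istep % GC_INTERVAL == GC_INTERVAL - 1:
            # break the reference cycles that plain refcounting cannot free
            gc.collect()


run_timesteps(100, lambda istep: None)
```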

@inducer
Contributor Author

inducer commented Feb 22, 2023

It's important that gc.collect is not a solution, but a workaround. It's quite expensive (and should be unnecessary), and it only masks the problem.

@MTCam
Member

MTCam commented Feb 23, 2023

It's important that gc.collect is not a solution, but a workaround. It's quite expensive (and should be unnecessary), and it only masks the problem.
👍

I like your idea of running it every $N$ steps, though. This workaround can likely keep us running comfortably in the interim. afaict, after injecting this fix into the prediction driver, the code infrastructure is now capable of production-scale prediction-like runs, and at the very least in good shape for February trials (leaps and bounds over last year). Gigantic cool.

@matthiasdiener
Member

matthiasdiener commented Feb 28, 2023

A few more updates for mirgecom's wave (w/ lazy eval):

  • When running without any gc invocations or gc config changes, gc.garbage is empty (which is expected, I think).
  • When running with gc.set_debug(gc.DEBUG_SAVEALL), gc.garbage contains ~62000 objects after the first time step. Each subsequent time step adds ~1000 objects. Is my assumption correct that those objects are the ones we suspect of having circular references (and holding references to arrays)? I was adapting this code https://code.activestate.com/recipes/523004-find-cyclical-references/ to check whether there are array references among the objects with circular references, but this appears to be extremely time-consuming.
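A cheaper scan than the full cyclical-references recipe might be a single type filter over gc.garbage after collecting under DEBUG_SAVEALL (a sketch; Node and the list payload stand in for the cyclically-referring objects and the arrays they hold):

```python
import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # collected objects are kept in gc.garbage


class Node:  # stand-in for a cyclically-referring object
    def __init__(self):
        self.payload = [0.0] * 256  # stand-in for a held array


a, b = Node(), Node()
a.other, b.other = b, a  # build a reference cycle
del a, b                 # now unreachable, but only the gc can free the cycle

gc.collect()
holders = [o for o in gc.garbage if isinstance(o, Node)]
payloads = [o for o in gc.garbage if isinstance(o, list) and len(o) == 256]
print(len(holders), "holders,", len(payloads), "payloads in gc.garbage")

gc.set_debug(0)
gc.garbage.clear()
```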

Edit:
