
OOM kill on a small system with workqueue consumers at high speed resulting in fast memory growth - #5673

Closed
roeschter opened this issue Jul 18, 2024 · 20 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@roeschter

roeschter commented Jul 18, 2024

Observed behavior

Consuming from a workqueue with multiple consumers (subject segmented) at high speed (150 MB/s) results in rapid memory growth and eventually memory exhaustion. The server dies with a range of symptoms: mostly it is silently OOM-killed, sometimes it throws readloop processing time warnings first.

Testing on a small system with 4 GB and 2 CPUs - this is deliberate, to simulate an edge node.

This does not seem to be a memory leak per se: repeated tests with a below-threshold message volume do not trigger the effect.

Expected behavior

The server should self-limit memory growth or throughput.

Server and client version

Client: jnats-2.19.1.jar on Java 17
Server: 2.10.17 and 2.11-preview-rc3 (self-compiled)
Memory growth is not platform specific; it is seen on Windows as well, but crashes have only been observed on smaller systems with <8 GB.

Host environment

Linux - Old Kernel
Linux lvps5-35-240-105.dedicated.hosteurope.de 3.10.0-1160.80.1.vz7.191.4 #1 SMP Thu Dec 15 20:31:06 MSK 2022 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce

playground.jar.zip
nats00_hosteurope.config.txt
source.zip

Full eclipse project available in private repo: https://github.com/roeschter/natsPlayground

Run on a system with 4GB (or less) - CPU count >=2
Java 17

/var/lib/java17/bin/java -cp playground.jar:jnats-2.19.1.jar com.synadia.test.config03_performance.C03JetstreamWorQueueSlowDeletes --user user --password dja673b48fh3kqp9 --messages 200000 --messageSize 10000 --fetchsize 10 --subjects 10 --consumemode fetch

--messages 200000 should be sufficient on a 4 GB system; increase to 1000000 if it does not crash.
--messageSize 10000: increasing the message size may speed up the crash. The effect is not observable with small messages (throughput too low, I presume).

Memory should increase rapidly when the consumers start; the rate should be 10000 msg/s or more.

@roeschter roeschter added the defect Suspected defect such as a bug or regression label Jul 18, 2024
@roeschter
Author

roeschter commented Jul 19, 2024

Will experiment with
export GOGC=nnn

@Jarema
Member

Jarema commented Jul 19, 2024

To control memory and prevent such OOMs under high load on insufficient resources, you should use GOMEMLIMIT. Keep in mind that if the limit is set far below what the workload needs, it will throttle throughput considerably while putting more load on the CPU (as the GC will need to work much harder).

@roeschter
Author

roeschter commented Jul 19, 2024

GOMEMLIMIT solves the immediate problem (I tried from 256MiB to 1024MiB), but the memory usage reported by the OS still vastly exceeds the limit I set.

Also: I would expect the nats-server to self-limit to a reasonable memory footprint.
"Reasonable" would be 50% of system memory when JS is enabled (because we need space for buffers and caches) and 70% for pure NATS Core.

My test system had 2.5 GB of 4 GB available for NATS when the test starts.

@roeschter
Author

roeschter commented Jul 19, 2024

(screenshot of memory usage attached)
E.g. In this case I set
export GOMEMLIMIT=256MiB
export GOGC=20

Resident memory still exceeds 2 GB (consistent with the overall system memory report).
Performance is not much impacted by the limits above.

Note that this only happens when reading from the stream; the footprint is small when writing.

@roeschter
Author

roeschter commented Jul 19, 2024

I have now created a stream state where nats-server exceeds all limits during startup while recovering the stream, and gets killed.
export GOMEMLIMIT=256MiB
export GOGC=10
Stream size on disk ~3.5GB
Resident memory grows to 2.4 GB very fast - then the server is OOM-killed.

@neilalexander
Member

Can you get a memory profile when the memory usage is inflated, either through nats server request profile allocs or https://docs.nats.io/running-a-nats-service/nats_admin/profiling?

@neilalexander
Member

Also: I would expect the nats-server to self-limit to a reasonable memory footprint.
"Reasonable" would be 50% of system memory when JS is enabled (because we need space for buffers and caches) and 70% for pure NATS Core.

Short of using GOMEMLIMIT, which is actually a runtime hint to the GC and not an actual hard limit, this is simply not practical to do. Runtime stats can't tell us about the actual RSS size, nor can they tell us about buffers/caches from the kernel, and we also don't have custom allocators in Go to actually enforce those limits. Even if we did, failing to allocate structures would just leave the system in an undefined state, which is worse than crashing.
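
A minimal standalone sketch (assuming Go 1.19+; not nats-server code) of why that is: GOMEMLIMIT is the same soft target you get from debug.SetMemoryLimit, and the runtime's own statistics describe the Go heap, not the resident set size that the kernel's OOM killer acts on.

    package main

    import (
        "fmt"
        "runtime"
        "runtime/debug"
    )

    func main() {
        // Same effect as starting the process with GOMEMLIMIT=256MiB:
        // a target the GC steers towards, not a hard cap on allocations.
        debug.SetMemoryLimit(256 << 20)

        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("heap in use: %d MiB, returned to OS: %d MiB\n",
            m.HeapInuse>>20, m.HeapReleased>>20)
        // The process RSS has to be read from the OS (e.g. /proc/self/status
        // on Linux) and can sit well above these figures, since it also
        // reflects memory the runtime has not yet returned to the kernel.
    }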

We would need to figure out from profiles what is using the memory and whether that can be optimised for this case.

@roeschter
Author

Here are the memory profiles:
03_allocs_after_stream_delete_from_client.zip

Observations:

  1. After consuming messages, the allocated memory is not freed even after the client shuts down AND the stream has been deleted. (03_allocs_after)
  2. What eventually frees the memory is a new client connection (without stream deletion) and resumed activity.
  3. Then the whole game starts over.

@neilalexander
Member

Out of interest, how did you capture these profiles? They're in text rather than binary format, which means the pprof visualiser can't parse them. Please make sure to use --level=0 if using the CLI or ?debug=0 if using the pprof HTTP endpoint.

@roeschter
Author

roeschter commented Jul 19, 2024

Human readable output via:

@roeschter
Author

roeschter commented Jul 19, 2024

Testing under Windows now, as the memory growth happens there as well (just no OOM kill).

Binary Allocs:
allocs.zip

The more I look into the behavior, the less deterministic it is. Under Windows GOMEMLIMIT is respected: the server shrinks back to the set size. BUT under Linux it didn't. In any case, memory temporarily grows to about the amount consumed by the consumers during 10 seconds.

It seems the buffer management relies on https://pkg.go.dev/sync#Pool
It seems this only cleans up after multiple GC cycles. But also, why is the memory not being re-used effectively? Who is not returning it fast enough?

@neilalexander
Member

The sync.Pools are indeed only purged during a GC run; that is by design. Otherwise they largely rely on the number of gets roughly matching the number of puts in order to stay balanced in their resource consumption.
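
A minimal standalone sketch of that purge behaviour (not nats-server code; assuming Go 1.13+ pool semantics, where a pooled item survives roughly two GC cycles before being released):

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        var pool sync.Pool // no New func: Get returns nil once the pool is empty

        buf := make([]byte, 1<<20)
        pool.Put(&buf)
        fmt.Println(pool.Get() != nil) // true: the buffer comes straight back

        pool.Put(&buf)
        runtime.GC() // first cycle: the item moves to the pool's victim cache
        runtime.GC() // second cycle: the victim cache is dropped as well
        fmt.Println(pool.Get() != nil) // usually false: the pool has been purged
    }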

In this case, looking at the profile, it appears as though lots of blocks are being loaded in from the filestore and held in the block cache. They should be held there for 10 seconds from the last access, after which the buffer will be returned to the pool. That would roughly explain why you see the correlation between the data accessed and the memory growth in that window.

Perhaps exposing the cache expiry as a configurable option, i.e. being able to set it to less than 10 seconds, would help reduce memory usage at the expense of potentially more I/O operations. As an experiment, you could potentially test with a custom build that changes defaultCacheBufferExpiration in filestore.go to a lower value like time.Second * 2.
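
For anyone trying that experiment, a sketch of the one-line change to filestore.go (abbreviated; only the constant is shown, and the 10-second default is the behaviour described above):

    const (
        // Cached message blocks expire this long after their last access;
        // lowering the value trades memory held in the block cache for
        // extra disk reads.
        defaultCacheBufferExpiration = 2 * time.Second
    )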

@roeschter
Author

roeschter commented Jul 19, 2024

Correct! If the behavior of the block cache were documented (I could guess it exists, but I incorrectly assumed it to be of fixed size per connection) and its size or timeout were configurable, I would not have been surprised.

Last post on observed behavior:

  • Both Linux and Windows respect GOMEMLIMIT; that is, they shrink back to the configured size.
  • For the block cache to grow it is necessary to have multiple consumers on the stream, consuming with different subject filters. In the case of a workqueue it is also required that they leave messages behind (e.g. one consumer out of ten currently down).

It does not happen:

  • When a single consumer reads on a limits stream
  • When a workqueue gets fully emptied by its consumers

In summary:

  1. The block cache growth happens in certain circumstances only (hard to guess from the outside)
  2. It does not happen in "trivial" tests (the good cases)
  3. The current unconfigurable timeout makes the memory growth uncontrollable. I can make GOMEMLIMIT small, forcing the server to slow down, but that's it.
  4. It happens during stream recovery as well, resulting in nats-server crashing during startup.

@roeschter
Author

I will test with DefaultCacheBufferExpiration in filestore.go set to a lower value like time.Second * 2.

@roeschter
Author

Reducing the buffer timeout to 1s limits the memory consumption to around 600 MB in my tests, at the cost of reduced throughput, which is fine (still about 50 MB/s from a workqueue).

Tested with DefaultCacheBufferExpiration = time.Second * 1
Windows and Linux based on 2.11 preview rc3
In combination with GOMEMLIMIT=512MiB

@roeschter
Author

Closing the investigation, but I would like the cache buffer expiration to be configurable, as a small enhancement request.
Should I try to patch this myself and make it a pull request?
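
A purely hypothetical sketch of the shape such an option could take; the names below (fileStoreOpts, CacheExpire, cacheExpiration) are illustrative only and are not the actual nats-server code or configuration surface:

    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical only: a configurable expiry with a fallback to the
    // ~10s default behaviour described earlier in this thread.
    type fileStoreOpts struct {
        CacheExpire time.Duration // zero means "use the built-in default"
    }

    func cacheExpiration(o fileStoreOpts) time.Duration {
        if o.CacheExpire > 0 {
            return o.CacheExpire
        }
        return 10 * time.Second
    }

    func main() {
        fmt.Println(cacheExpiration(fileStoreOpts{CacheExpire: 2 * time.Second})) // 2s
    }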

@derekcollison
Member

If you could file an issue, that would be great.

@kohlisid

kohlisid commented Aug 2, 2024

What is the stream maxBytes that you were using? @roeschter

@roeschter
Author

I was not using maxBytes (unlimited). The effect is purely about the speed of consuming.
As I was testing on a 4 GB device with 2.5 GB free, a 2 GB effective stream size was enough to trigger the OOM.

@roeschter
Author

Closing this in favor of a new CR - #5740
