
Limit parallel processing for clients #3358

Merged

Conversation

vicsn
Contributor

@vicsn vicsn commented Jul 9, 2024

Motivation

Clients run out of memory because they have no limit on how many transactions or solutions they verify in parallel. This PR proposes to queue them (just like the validator does in Consensus) and to limit how much verification runs in parallel.

We could do cleverer things to increase processing speed, such as checking how many constraints incoming transactions have, or awaiting on a channel so verification starts as soon as a slot frees up, but the focus for now is simplicity and safety.

Even though it was recently suggested that clients should have at least 128 GiB of memory, the current implementation uses "only" up to 30 GiB for transaction verification. The right number is up for debate.
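
As a rough sketch of the queueing approach (assuming a tokio runtime and a placeholder `Transaction` type; the cap and names are illustrative, not this PR's actual code):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Hypothetical cap on concurrent verifications; in practice this would be
/// derived from the memory budget discussed above.
const MAX_PARALLEL_VERIFICATIONS: usize = 8;

/// Placeholder for the real transaction type and its heavy verification check.
struct Transaction;
impl Transaction {
    fn verify(&self) -> bool {
        true
    }
}

/// Verify a transaction only once a slot in the bounded pool is free.
/// Excess transactions wait here instead of all verifying at once and
/// exhausting memory.
async fn verify_bounded(pool: Arc<Semaphore>, tx: Transaction) -> bool {
    let _permit = pool.acquire_owned().await.expect("semaphore closed");
    // Run the CPU- and memory-heavy verification off the async executor threads.
    tokio::task::spawn_blocking(move || tx.verify())
        .await
        .unwrap_or(false)
}

#[tokio::main]
async fn main() {
    let pool = Arc::new(Semaphore::new(MAX_PARALLEL_VERIFICATIONS));
    let handles: Vec<_> = (0..100)
        .map(|_| tokio::spawn(verify_bounded(pool.clone(), Transaction)))
        .collect();
    for handle in handles {
        let _ = handle.await;
    }
}
```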

Test Plan

CI passed
Ran a local network and fired some executions at it.
On canary, heavier concurrent traffic can be fired at the network.

Related PRs

Potentially closes: #3341
Replaces: #2970

@vicsn
Contributor Author

vicsn commented Jul 9, 2024

An open design question is how much memory should be reserved for worst-case transaction + solution verification.

Summoning @evanmarshall @zosorock @Meshiest. Provable previously recommended that clients have 128 GiB of RAM, but I understand that you and others want to run with less. So my questions to you:

  1. What do you think is a sufficient default amount of RAM which should be assumed to be available for transaction/solution verification?
  2. How badly do you want an additional feature flag which lets you increase the amount of RAM used for transaction/solution verification?
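
As a back-of-the-envelope illustration of question 1, a RAM budget translates into a verification-concurrency cap roughly as follows; both constants are assumptions for illustration, not values proposed by this PR:

```rust
/// Assumed RAM budget reserved for verification (GiB); purely illustrative.
const ASSUMED_RAM_BUDGET_GIB: u64 = 30;
/// Assumed worst-case memory per in-flight verification (MiB); purely illustrative.
const ASSUMED_WORST_CASE_MIB_PER_VERIFICATION: u64 = 512;

/// Derive the maximum number of parallel verifications from the budget.
fn max_parallel_verifications() -> usize {
    // 30 GiB / 512 MiB = 60 parallel verifications with the numbers above.
    ((ASSUMED_RAM_BUDGET_GIB * 1024) / ASSUMED_WORST_CASE_MIB_PER_VERIFICATION).max(1) as usize
}
```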

@Meshiest
Contributor

Meshiest commented Jul 9, 2024

My node-operation experience isn't very diversified, so I don't have a good estimate for (1), and I personally haven't run into anything that would need the feature flag from (2).

The main limiter we've observed in client sync is core count. 16 cores were barely able to keep up with block production on canary after the 15 TPS tests. Upgrading a client VM from 16 to 32 cores massively increased sync speed. Our servers with more than 64 cores were powerhouses when syncing. RAM seemed less important in our tests, though we weren't using large programs or verifying any solutions.

raychu86
raychu86 previously approved these changes Jul 10, 2024
Contributor

@raychu86 raychu86 left a comment

The code quality and logic LGTM.

Dialing in a reasonable number for these bounds will likely be an iterative process.

In theory, we could also use the sysinfo crate to fetch the total memory of the machine that's running the node.
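
A minimal sketch of that suggestion, assuming a recent sysinfo release where `total_memory()` reports bytes (older releases reported KiB):

```rust
use sysinfo::System;

/// Query the machine's total RAM, from which a verification budget could be derived.
fn total_memory_bytes() -> u64 {
    let mut sys = System::new();
    sys.refresh_memory(); // only the memory statistics are needed here
    sys.total_memory()
}
```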

node/src/client/mod.rs (outdated review comment, resolved)
ljedrz
ljedrz previously approved these changes Jul 10, 2024
Collaborator

@ljedrz ljedrz left a comment

LGTM with one open point

@vicsn vicsn dismissed stale reviews from ljedrz and raychu86 via 799f157 July 11, 2024 11:27
@vicsn
Contributor Author

vicsn commented Jul 11, 2024

@raychu86 sorry, I added a separate execution queue; I couldn't let a simple 200x improvement in max throughput slide (assuming the compute is available): 9049f3e
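
To illustrate the separate-queue idea (not the actual code in 9049f3e): executions are much cheaper to verify than deployments, so giving them their own, much larger bound keeps them from queueing behind deployments. The permit counts below are made up:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Illustrative bounds only: deployments are expensive to verify, so they get a
/// small pool; executions are cheap, so their pool can be much larger.
struct VerificationPools {
    deployments: Arc<Semaphore>,
    executions: Arc<Semaphore>,
}

impl VerificationPools {
    fn new() -> Self {
        Self {
            deployments: Arc::new(Semaphore::new(2)),  // made-up limit
            executions: Arc::new(Semaphore::new(400)), // made-up limit
        }
    }
}
```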

In theory we could also use the sysinfo crate to fetch the total memory of the machine thats running the node.

Yes, I thought about it, but I think this dynamic behaviour would complicate infrastructure planning too much. If users have very diverse preferences, there should rather be an --available-ram flag or something similar.
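
A sketch of the --available-ram idea, using clap's derive API (clap 4 style); the flag does not exist in this PR and the name and default are only illustrative:

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct ClientCli {
    /// Hypothetical flag: RAM (in GiB) the client may use for transaction/solution verification.
    #[arg(long, default_value_t = 30)]
    available_ram: u64,
}

fn main() {
    let cli = ClientCli::parse();
    println!("verification memory budget: {} GiB", cli.available_ram);
}
```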

@vicsn vicsn requested review from ljedrz and raychu86 July 11, 2024 11:34
raychu86
raychu86 previously approved these changes Jul 15, 2024
Contributor

@raychu86 raychu86 left a comment

LGTM

node/src/client/router.rs (outdated review comment, resolved)
@raychu86
Contributor

@vicsn Is this PR still relevant?

@vicsn
Contributor Author

vicsn commented Oct 3, 2024

@vicsn Is this PR still relevant?

Yes, I believe this is still relevant for reducing client downtime. The main alternative would be to recommend that the ecosystem auto-restart clients when they OOM, but such events would still lead to, in web3 terms, FUD. :D

I finally got around to testing this on a devnet. We can clearly see that memory usage at the 2nd and 3rd peaks (which ran with this PR) is now capped at a lower point. If I increased the number of deployments, the first peak would keep growing until hitting OOM.

[Screenshot (2024-10-03): devnet memory-usage graph showing the capped 2nd and 3rd peaks]

Overall processing speed of deployments does not seem significantly impacted because validators also throttle deployments.

@zosorock zosorock requested a review from a team October 18, 2024 22:30

@gluax gluax left a comment

Code-wise LGTM.

node/src/client/mod.rs (review comment, resolved)
zkxuerb
zkxuerb previously approved these changes Oct 20, 2024
Collaborator

@zkxuerb zkxuerb left a comment

LGTM

@zosorock
Contributor

@zosorock zosorock added enhancement v1.1.4 canary release number labels Nov 12, 2024
@zosorock zosorock merged commit aab8a83 into AleoNet:staging Nov 12, 2024
24 checks passed
Successfully merging this pull request may close these issues.

[Bug] OOM kill because sudden memory increase and requests
8 participants