Limit parallel processing for clients #3358
Conversation
An open design question is how much memory should be reserved for worst-case transaction + solution verification. Summoning @evanmarshall @zosorock @Meshiest. Provable previously recommended that clients should have 128 GiB of RAM, but I understand you and others want to run with less RAM. So my questions to you:
My node operational experience is not very diversified, so I don't have a good estimate for 1, and I personally haven't run into anything needing the feature flag from 2. The main limiter we've observed in client sync is core count. 16 cores were barely able to keep up with block production on canary after the 15 TPS tests. Upgrading a client VM from 16 to 32 cores massively increased sync speed. Our servers with more than 64 cores were powerhouses when syncing. RAM seemed less important in our tests, though we weren't using large programs or verifying any solutions.
The code quality and logic LGTM.
Dialing in a reasonable number for these bounds will likely be an iterative process.
In theory we could also use the sysinfo crate to fetch the total memory of the machine that's running the node.
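Something in this spirit could work (a rough sketch, not this PR's code; it assumes a recent sysinfo where total_memory() returns bytes, and the 4 GiB worst-case figure is a placeholder):

```rust
// Rough sketch: derive the verification bound from total machine memory.
// Assumes sysinfo >= 0.30, where total_memory() returns bytes; the 4 GiB
// worst-case cost per verification is a placeholder, not a measured value.
use sysinfo::System;

fn max_parallel_verifications() -> usize {
    let mut sys = System::new();
    sys.refresh_memory();
    // Placeholder: assume each worst-case verification needs ~4 GiB.
    const WORST_CASE_BYTES: u64 = 4 * 1024 * 1024 * 1024;
    // Reserve half of total memory for verification; keep at least one slot.
    ((sys.total_memory() / 2) / WORST_CASE_BYTES).max(1) as usize
}
```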
LGTM with one open point
@raychu86 sorry, I added a separate execution queue; I couldn't let a simple 200x improvement in max throughput slide (assuming available compute): 9049f3e
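The separation is roughly in this spirit (a sketch with illustrative names and permit counts, not the actual diff): deployments can each need GiBs of RAM and get few permits, while comparatively cheap executions can run wide.

```rust
// Sketch only: bound deployments and executions independently, so cheap
// executions are not queued behind memory-heavy deployments.
use std::sync::Arc;
use tokio::sync::Semaphore;

struct VerificationQueues {
    deployments: Arc<Semaphore>, // few permits: each may need GiBs of RAM
    executions: Arc<Semaphore>,  // many permits: comparatively cheap
}

impl VerificationQueues {
    fn new() -> Self {
        Self {
            // Illustrative permit counts, not the PR's actual numbers.
            deployments: Arc::new(Semaphore::new(2)),
            executions: Arc::new(Semaphore::new(256)),
        }
    }
}
```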
Yes, I thought about it, but I think this dynamic behaviour would complicate infra planning too much, so there should rather be an --available-ram flag or something if users have very diverse preferences.
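If we went that route, a clap flag along these lines would do; only the name --available-ram comes from this thread, while the default value and units are guesses:

```rust
// Sketch of the suggested flag using clap 4's derive API. Only the
// `--available-ram` name comes from the discussion; the default value
// and GiB units are assumptions for illustration.
use clap::Parser;

#[derive(Parser)]
struct Cli {
    /// Upper bound, in GiB, on memory used for parallel verification.
    #[arg(long = "available-ram", default_value_t = 30)]
    available_ram: u64,
}

fn main() {
    let cli = Cli::parse();
    println!("verification memory cap: {} GiB", cli.available_ram);
}
```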
LGTM
@vicsn Is this PR still relevant?
Yes, I believe this is still relevant to reduce client downtime. The main alternative would be to recommend that the ecosystem auto-restart their clients if they OOM, but such events would still lead to, in web3 terms, FUD. :D I finally got around to testing this on a devnet; we can clearly see how the memory usage on the 2nd and 3rd peaks (which run with this PR) is now capped at a lower point. If I increased the # of deployments, the first peak would keep growing until hitting OOM. Overall processing speed of deployments does not seem significantly impacted, because validators also throttle deployments.
Code-wise LGTM.
LGTM
Motivation
Clients run out of memory because they have no limit on how many transactions or solutions they verify in parallel. This PR proposes to queue them (just like the validator does in Consensus) and limit how much parallel verification we do.
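To make the pattern concrete, here is a minimal sketch of semaphore-bounded verification on a tokio runtime; the permit count, the Transaction stub, and verify_transaction are illustrative stand-ins, not this PR's actual code.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

struct Transaction; // stand-in for the real transaction type

fn verify_transaction(_tx: &Transaction) -> bool {
    true // stand-in for the real (memory-heavy) verification
}

async fn verify_all(transactions: Vec<Transaction>) {
    // At most 4 verifications run at once; the rest wait on the semaphore,
    // which is what bounds peak memory use.
    let semaphore = Arc::new(Semaphore::new(4));
    let mut handles = Vec::new();
    for tx in transactions {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::task::spawn_blocking(move || {
            let ok = verify_transaction(&tx);
            drop(permit); // release the slot for the next queued item
            ok
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}
```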
We could do a lot of clever things to increase processing speed, like checking how many constraints the incoming transactions have, or awaiting on a channel to start verifying immediately, but the focus for now is simplicity and safety.
Even though it was recently suggested that clients should have at least 128 GiB of memory, the current implementation uses "only" up to 30 GiB for transaction verification. The right number is up for debate.
Test Plan
CI passed
Ran a local network and shot some executions at it.
In canary, more serious concurrent traffic can be shot at the network.
Related PRs
Potentially closes: #3341
Replaces: #2970