Limit parallel processing for clients #3358
Conversation
An open design question is how much memory should be reserved for worst-case transaction + solution verification. Summoning @evanmarshall @zosorock @Meshiest. Provable previously recommended that clients should have 128 GiB of RAM, but I understand you and others want to run with less RAM. So my questions to you:
My node operational experience is not very diversified, so I don't have a good estimate for 1, and I personally haven't run into anything needing the feature flag from 2. The main limiter we've observed in client sync is core count. 16 cores were barely able to keep up with block production on canary after the 15 TPS tests. Upgrading a client VM from 16 to 32 cores massively increased sync speed. Our servers with more than 64 cores were powerhouses when syncing. RAM seemed less important in our tests, though we weren't using large programs or verifying any solutions.
The code quality and logic LGTM.
Dialing in a reasonable number for these bounds will likely be an iterative process.
In theory we could also use the sysinfo crate to fetch the total memory of the machine that's running the node.
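Something in this spirit could work (a rough sketch, not this PR's code; it assumes a recent sysinfo where total_memory() returns bytes, and the 4 GiB worst-case figure is a placeholder):

```rust
// Rough sketch: derive the verification bound from total machine memory.
// Assumes sysinfo >= 0.30, where total_memory() returns bytes; the 4 GiB
// worst-case cost per verification is a placeholder, not a measured value.
use sysinfo::System;

fn max_parallel_verifications() -> usize {
    let mut sys = System::new();
    sys.refresh_memory();
    // Placeholder: assume each worst-case verification needs ~4 GiB.
    const WORST_CASE_BYTES: u64 = 4 * 1024 * 1024 * 1024;
    // Reserve half of total memory for verification; keep at least one slot.
    ((sys.total_memory() / 2) / WORST_CASE_BYTES).max(1) as usize
}
```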
LGTM with one open point
@raychu86 sorry, I added a separate execution queue; I couldn't let a simple 200x improvement in max throughput slide (assuming available compute): 9049f3e
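The separation is roughly in this spirit (a sketch with illustrative names and permit counts, not the actual diff): deployments can each need GiBs of RAM and get few permits, while comparatively cheap executions can run wide.

```rust
// Sketch only: bound deployments and executions independently, so cheap
// executions are not queued behind memory-heavy deployments.
use std::sync::Arc;
use tokio::sync::Semaphore;

struct VerificationQueues {
    deployments: Arc<Semaphore>, // few permits: each may need GiBs of RAM
    executions: Arc<Semaphore>,  // many permits: comparatively cheap
}

impl VerificationQueues {
    fn new() -> Self {
        Self {
            // Illustrative permit counts, not the PR's actual numbers.
            deployments: Arc::new(Semaphore::new(2)),
            executions: Arc::new(Semaphore::new(256)),
        }
    }
}
```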
Yes, I thought about it, but I think this dynamic behaviour would complicate infra planning too much, so there should rather be an --available-ram flag or something if users have very diverse preferences.
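If we went that route, a clap flag along these lines would do; only the name --available-ram comes from this thread, while the default value and units are guesses:

```rust
// Sketch of the suggested flag using clap 4's derive API. Only the
// `--available-ram` name comes from the discussion; the default value
// and GiB units are assumptions for illustration.
use clap::Parser;

#[derive(Parser)]
struct Cli {
    /// Upper bound, in GiB, on memory used for parallel verification.
    #[arg(long = "available-ram", default_value_t = 30)]
    available_ram: u64,
}

fn main() {
    let cli = Cli::parse();
    println!("verification memory cap: {} GiB", cli.available_ram);
}
```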
LGTM
@vicsn Is this PR still relevant?
Yes, I believe this is still relevant to reduce client downtime. The main alternative would be to recommend that the ecosystem auto-restart their clients if they OOM, but such events would still lead to, in web3 terms, FUD. :D I finally got around to testing this on a devnet; we can clearly see how the memory usage on the 2nd and 3rd peaks (which run with this PR) is now capped at a lower point. If I increased the # of deployments, the first peak would keep growing until hitting OOM. Overall processing speed of deployments does not seem significantly impacted, because validators also throttle deployments.
Code-wise LGTM.
LGTM
Motivation
Clients run out of memory because they have no limit on how many transactions or solutions they verify in parallel. This PR proposes to queue them (just like the validator does in Consensus) and limit how much parallel verification we do.
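To make the pattern concrete, here is a minimal sketch of semaphore-bounded verification on a tokio runtime; the permit count, the Transaction stub, and verify_transaction are illustrative stand-ins, not this PR's actual code.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

struct Transaction; // stand-in for the real transaction type

fn verify_transaction(_tx: &Transaction) -> bool {
    true // stand-in for the real (memory-heavy) verification
}

async fn verify_all(transactions: Vec<Transaction>) {
    // At most 4 verifications run at once; the rest wait on the semaphore,
    // which is what bounds peak memory use.
    let semaphore = Arc::new(Semaphore::new(4));
    let mut handles = Vec::new();
    for tx in transactions {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::task::spawn_blocking(move || {
            let ok = verify_transaction(&tx);
            drop(permit); // release the slot for the next queued item
            ok
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}
```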
We could do a lot of clever things to increase processing speed, like checking how many constraints the incoming transactions have, or awaiting on a channel to start verifying immediately, but the focus for now is simplicity and safety.
Even though it was recently suggested that clients should have at least 128 GiB of memory, the current implementation uses "only" up to 30 GiB for transaction verification. The right number is up for debate.
Test Plan
CI passed
Ran a local network and shot some executions at it.
In canary, more serious concurrent traffic can be shot at the network.
Related PRs
Potentially closes: #3341
Replaces: #2970