
[Performance] Change BLOCK_REQUEST_TIMEOUT_IN_SECS to 10s from 600s (10 mins) #3327

Open
wants to merge 1 commit into mainnet-staging

Conversation

@damons commented Jun 23, 2024

Motivation

Now that we've fixed the double locks and threading issues in client sync and clients can reliably sync all the way, we can begin improving overall client sync performance. Currently, a 100k+ block ledger on canary can take over 24 hours to sync, especially when there are many large blocks with many transactions.

Currently, clients make block sync requests to other clients, asking for specific blocks. There's a timeout, BLOCK_REQUEST_TIMEOUT_IN_SECS, in block_sync.rs that specifies how long to wait for a client to respond to a request. It's currently set to 600s, or 10 minutes. In practice, if a client cannot respond within ten minutes, something is exceptionally wrong: it's either completely overwhelmed or malfunctioning. Rather than continuing to wait for a response from that client, it's better to simply move on and make requests to a client that responds almost immediately.
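For reference, the change itself is a one-constant edit in block_sync.rs; a minimal before/after sketch (the exact integer type and surrounding context are assumed here):

```rust
// block_sync.rs (sketch only; the constant's exact type and context are assumed)

// Before: wait up to ten minutes for a peer to answer a block request.
// const BLOCK_REQUEST_TIMEOUT_IN_SECS: u64 = 600;

// After: give a peer ten seconds, then drop the request and ask a more responsive peer.
const BLOCK_REQUEST_TIMEOUT_IN_SECS: u64 = 10;
```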

Slow or under-resourced clients can be slow to respond, or never respond at all, to block sync requests. If a fast, capable client wishing to sync comes into contact with these unreliable clients, its block requests go unfulfilled and its sync can be slowed dramatically. Currently, clients will only make up to 21 connections with other clients. A fast, performant client can easily fill its connections with unresponsive clients and spend most of its time waiting for responses.

This change reduces the timeout to 10 seconds, which frees performant clients to consistently request blocks from other performant clients whose ledgers actually contain the sought-after blocks.

Testing suggests this is a 20-30% performance improvement over the 10-minute timeout. As the ledger grows, performance improvements like this will become exceptionally important, since sync times can already run multiple days. Performance improvements will also become more and more difficult over time, so we should land as many as we can, as soon as is reasonable, while maintaining no regressions.

Test Plan

  1. Started two clients at the same time on the canary network, each on a machine dedicated to only that client. Machines tested: two EPYC 3.8GHz 2x32-core machines, 128 threads each, 256GB RAM.
  2. Canary currently has 200 slow clients running on GCP instances (10 clients each, 16 CPUs, all 100% CPU bound and very slow to respond). They have been syncing since canary launched a week ago and are only 33k blocks into a 217k-block ledger; that block range also contains very large blocks with lots of transactions. Our two test clients routinely encountered these slow clients and made block sync requests to them that timed out.
  3. Compared sync progress over time. The client with the 10s timeout was observed disconnecting from slow clients that took over ten seconds to respond, and new block requests were then made to other clients. The unmodified client, with the 10-minute timeout, maintained connections with slow clients and experienced slow responses.
  4. Throughout the sync of 100k blocks, the modified 10s-timeout client maintained a 20-30% lead over the 10-minute-timeout client, finishing its sync about eight hours sooner than the unmodified client.
  5. The slow GCP nodes are still struggling through the 30-40k block range, and are currently slowing down every client connected to them.

Related PRs

Some related issues. Potentially fixes/affects: #3322

Also, #3320

Note: the Cargo.lock file is also included, as for some reason it was not updated by another upstream change in the AleoNet repository.

…conds (ten minutes). Testing shows a 20-30% performance improvement in sync time.
@damons assigned and then unassigned damons on Jun 23, 2024
@joske (Contributor) commented Jun 24, 2024

I think 10 seconds is too low for big blocks.

@damons (Author) commented Jun 24, 2024

With a longer timeout, what performance improvement are we hoping to achieve?

What do you think is a better timeout? Happy to test suggestions.

@joske (Contributor) commented Jun 24, 2024

When there are big blocks, I've seen BlockResponses come in after minutes. I would set it to at least 3 minutes.

There's not really a downside: once the ledger has moved beyond a BlockRequest, it will be removed anyway (is_obsolete).
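For illustration, a minimal sketch of the two ways a pending request can be dropped, using hypothetical names rather than the actual snarkOS types:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch (not the actual snarkOS code): a pending block request is
// dropped either when it times out, or when the local ledger has already advanced
// past the requested height and the response is no longer needed (the "is_obsolete" case).
const BLOCK_REQUEST_TIMEOUT_IN_SECS: u64 = 10;

struct PendingBlockRequest {
    requested_height: u32,
    sent_at: Instant,
}

impl PendingBlockRequest {
    /// The peer has had longer than the timeout to respond.
    fn is_timed_out(&self) -> bool {
        self.sent_at.elapsed() >= Duration::from_secs(BLOCK_REQUEST_TIMEOUT_IN_SECS)
    }

    /// The local ledger is already past this height, so the response is moot.
    fn is_obsolete(&self, ledger_height: u32) -> bool {
        self.requested_height <= ledger_height
    }
}
```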

@damons (Author) commented Jun 24, 2024

When you say "come in after minutes", was it because those clients were degraded? Why not proactively avoid those clients and move on to the next? Why bog down a degraded client further? Also, was this in an environment where network clients were still suffering from deadlocks?

The question is whether or not a performant client would take minutes to respond. What comes to mind there is large deployments. I want to test a very large deployment (i.e., maximum size), time the blocks to and from a performant client, and use 100Mb/s as the basis for bandwidth. Will test this.
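As a rough back-of-the-envelope check (assuming, for illustration, a 50MB block, roughly the size mentioned elsewhere in this thread):

```rust
// Back-of-the-envelope: raw transfer time for a large block over a 100 Mb/s link.
// Illustrative numbers only; real responses also pay serialization and processing costs.
fn main() {
    let block_bytes = 50.0e6; // assumed 50 MB block
    let link_bytes_per_sec = 100.0e6 / 8.0; // 100 Mb/s = 12.5 MB/s
    let transfer_secs = block_bytes / link_bytes_per_sec;
    println!("raw transfer time: {transfer_secs:.1} s"); // prints ~4.0 s
}
```

On that basis, a healthy peer on a 100Mb/s link should answer well within tens of seconds; responses taking minutes would point to something other than raw bandwidth.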

@joske (Contributor) commented Jun 24, 2024

Not sure what you mean by degraded?

I mean when the client is syncing and the blocks that are up next are big, it can take a long time for the response to come in (and also a long time for the client to deserialize and process them).

@damons (Author) commented Jun 24, 2024

Not sure what you mean by degraded?

I mean they are overwhelmed, unresponsive, slow, stalled, or simply under-resourced. This is exactly the case we have right now on canary: 200 of the clients are under-resourced and are still processing the 30k range of the 200k+ block ledger (and quickly falling behind). Those two hundred clients have slowed down the network dramatically because of the 10-minute timeout. Since they exist on the network, they routinely pop up in connections with performant clients and drag them down. This 10s timeout addresses that very issue.

I mean when the client is syncing and the blocks that are up next are big, it can take a long time for the response to come in (and also a long time for the client to deserialize and process them).

Do we know why the responses are taking a long time? Other than what we are seeing in canary today with those under-resourced clients we are running?

On today's canary call, I confirmed that Demox has already tested maximum-sized deployments (the max number of gates in circuits), and those blocks currently exist on canary. This means that our client with the 10s timeout still performed 20-30% faster than the unmodified client, even while syncing through maximum-sized blocks.

Also, our testing has shown that serialization of blocks takes a few seconds at most when processing blocks with hundreds of transactions. That said, I'd very much like to see the exact times when serializing max deployments. It seems acceptable to me to time out block transfers (i.e., is_incomplete="true" in the logs), which did happen consistently when communicating with a degraded client. It happened a lot, actually. This explicitly enabled the performant client to connect to another performant client sooner.

I suggest we let the experimental data guide us in making this decision. We should pick the timeout with the highest level of performance. I can run more tests, but we are on the verge of another canary network reset, and we will need to run the load tests again (which we are going to do anyway, as we discussed today).

Do you have data showing specific blocks taking a long time, and is that data based on clients and a build made after we fixed the thread double-lock issue?

@joske (Contributor) commented Jun 24, 2024

Yes, experiments should guide us to the right value.

We tested on a separate network where we had a lot of TX cannons, and we had 50MB blocks that were taking minutes just to receive the BlockResponse, and also minutes to deserialize. This was before the last fix, though. Our current separate network doesn't have big blocks at the moment, unfortunately. I'll check if we can get some to test.
