Review club: Prevention of Stalling when close to the tip #88

Prabhat1308 · 2024-10-20T19:40:46Z

Prabhat1308
Oct 20, 2024
Maintainer

Session Details

Date: 24-10-2024
Time: IST 20:00 (UTC 14:30)
Link: PR #29664
Difficulty: Medium

Motivation:

For stalling at the tip, we have a parallel download mechanism for compact blocks that was added in bitcoin/bitcoin#27626.
For stalling during IBD, we have a lookahead window of 1024 blocks, and if that is exceeded, we disconnect the stalling peer.
However, if we are close to but not at the tip (<=1024 blocks behind), neither of these mechanisms applies. We can't use compact blocks yet, and the stalling mechanism doesn't work because the 1024-block window cannot be exceeded.

As a result, we have to resort to BLOCK_DOWNLOAD_TIMEOUT_BASE, which only disconnects a peer after 10 minutes (plus 5 more minutes for each additional peer that currently has blocks in flight). This is too long in my opinion, especially since peers get assigned up to 16 blocks (MAX_BLOCKS_IN_TRANSIT_PER_PEER) and could repeat this process to stall us even longer if they send us a block after 10 minutes.
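To make the arithmetic concrete, here is a minimal sketch of the fallback timeout described above. It only encodes the two figures given in the text (10 minutes base, 5 minutes per additional peer); the names `TIMEOUT_BASE`, `TIMEOUT_PER_EXTRA_PEER` and `block_download_timeout` are hypothetical and are not taken from the Bitcoin Core source.

```python
from datetime import timedelta

# Figures from the text above; the constant names are illustrative only.
TIMEOUT_BASE = timedelta(minutes=10)
TIMEOUT_PER_EXTRA_PEER = timedelta(minutes=5)


def block_download_timeout(other_peers_with_blocks_in_flight: int) -> timedelta:
    """Time before a non-delivering peer is disconnected, per the text above."""
    if other_peers_with_blocks_in_flight < 0:
        raise ValueError("peer count cannot be negative")
    return TIMEOUT_BASE + TIMEOUT_PER_EXTRA_PEER * other_peers_with_blocks_in_flight


# With three other peers downloading, a stalling peer survives for 25 minutes.
print(block_download_timeout(3))
```

Under these assumptions, even the best case (no other peers downloading) leaves a stalling peer connected for a full 10 minutes.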

Pre-requisites:

PR 27626

Questions:

What is the difference between headers sync timeout and block download timeout?
Do you feel the time of 2s is enough to disconnect a peer for stalling?
Why do you think the adjusted time for disconnecting a peer whose blocks are in-flight

Learnings:

Initial Block Download
Block Stalling
Net Processing
Concurrency

Prabhat1308 · 2024-10-26T01:09:59Z

Prabhat1308
Oct 26, 2024
Maintainer Author

Summary

Tip Stalling

Suppose my node is at height 1000 and the network's current height is 1002. I am trying to catch up, but the peer I requested a block from is not responding. This is tip stalling. In that case we request the block from other peers in parallel. The PR mentions getblocktxn, so we may be requesting only the transactions needed to complete the block rather than the entire block; this needs testing to confirm. The PR that introduced this is #27626.

IBD stalling

This happens when we are either starting to sync with the network or are far behind the current height. Say we are syncing from height 100000 and the current height is 120000. We request blocks from our peers within a lookahead window of 1024 blocks, and as blocks arrive and are stored, the window moves forward and we request more. The network is asynchronous, so later blocks in the window can arrive before earlier ones have been delivered. This keeps the sync smooth, on the expectation that honest, high-bandwidth peers will deliver the earlier blocks while the newer ones come in. The problem arises when an earlier block still has not arrived while the rest of the window has filled up: the window cannot advance and we cannot request anything new. This is the breach of the 1024 limit, and the peer holding up the window is considered to be stalling. We disconnect the stalling peer and continue syncing with other peers.
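The window logic can be sketched roughly as follows. This is a simplified model for illustration, not the actual net processing code: the function `find_window_staller` and its parameters are hypothetical, and the real implementation tracks more state than this.

```python
from typing import Dict, Optional, Set

BLOCK_DOWNLOAD_WINDOW = 1024  # lookahead window size mentioned in the text


def find_window_staller(
    tip_height: int,
    best_header_height: int,
    in_flight: Dict[int, str],  # block height -> peer we requested it from
    received: Set[int],         # heights downloaded but not yet connected
) -> Optional[str]:
    """Return the peer holding up the window, or None if nothing is stalled."""
    window_end = tip_height + BLOCK_DOWNLOAD_WINDOW
    # Fewer than 1024 blocks beyond the window start: the limit cannot be hit.
    if best_header_height <= window_end:
        return None
    window = range(tip_height + 1, window_end + 1)
    # If some height in the window is still unrequested, we can keep working.
    if any(h not in in_flight and h not in received for h in window):
        return None
    # Window is full: the peer owing us the lowest missing block is the staller.
    for h in window:
        if h in in_flight:
            return in_flight[h]
    return None


staller = find_window_staller(
    tip_height=100000,
    best_header_height=120000,
    in_flight={100001: "peer_a"},
    received=set(range(100002, 101025)),
)
print(staller)
```

Note the early return when the best known header is within 1024 blocks of our tip: in this model no peer can ever be flagged there, which is the gap the PR under review is concerned with.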

Meeting in the middle

This is the case where we are not at the tip and the 1024 limit is never breached, yet our download is slow. For example, say we are at height 1000 and the current height is 1567. We are not at the tip, and the 1024 limit won't be reached because the difference between the network's height and our node's height is less than 1024. If a peer stalls us here, it takes a long time to remove it: 10 minutes, plus additional time based on the other peers that have blocks in flight.

Questions

What is the difference between headers sync timeout and block download timeout?

Headers sync happens before the blocks are downloaded during IBD, to fetch the headers of all the missing blocks. The headers are downloaded from a single peer rather than from many peers in parallel. If the peer stops sending headers, we may disconnect it. The logic is defined here:
headers sync timeout

After all the headers have been synced and we are downloading the blocks, a peer may stall us because of issues such as poor connectivity or low bandwidth, and we may need to disconnect from it. The logic is defined here:
IBD timeout

Notice that these two timeouts apply at distinctly different stages of the sync.

Do you feel the time of 2s is enough to disconnect a peer for stalling?

At any given moment we may be connected to multiple peers and downloading multiple blocks in parallel. This can leave very little download bandwidth per peer, so a block's data may not arrive in time. We would then wrongly conclude that the peer is trying to stall us, and we could end up disconnecting many honest, high-bandwidth peers. So 2 seconds is not always enough. To tackle this, a PR was introduced to make the timeout adaptive.
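One way an adaptive timeout can work is to lengthen it each time a peer is disconnected for stalling and shorten it again as blocks connect successfully. The sketch below assumes doubling on each stall disconnect up to a 64-second cap and a 0.85 decay factor per connected block; those two values and all names here are assumptions for illustration, and only the 2-second starting point comes from the question above.

```python
STALL_TIMEOUT_DEFAULT = 2.0  # seconds, from the question above
STALL_TIMEOUT_MAX = 64.0     # assumed cap, for illustration
DECAY_FACTOR = 0.85          # assumed decay per connected block


class AdaptiveStallTimeout:
    """Toy model of a stall timeout that backs off and then recovers."""

    def __init__(self) -> None:
        self.timeout = STALL_TIMEOUT_DEFAULT

    def on_stall_disconnect(self) -> None:
        # We may have been too impatient: give the next peer twice as long.
        self.timeout = min(self.timeout * 2, STALL_TIMEOUT_MAX)

    def on_block_connected(self) -> None:
        # Download is progressing: drift back towards the default.
        self.timeout = max(self.timeout * DECAY_FACTOR, STALL_TIMEOUT_DEFAULT)


t = AdaptiveStallTimeout()
for _ in range(3):
    t.on_stall_disconnect()
print(t.timeout)  # 2 -> 4 -> 8 -> 16 seconds
```

The design intent in this model is that a slow connection on our side quickly stops causing disconnections of honest peers, while a healthy download gradually restores the short timeout.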

Why do you think the adjusted time for disconnecting a peer whose blocks are in-flight

As discussed in the previous answer, much depends on the number of peers we are connected to and downloading blocks from. As the number of peers increases, the download speed per peer decreases, leading to the scenarios described above. To compensate, the formula increases the timeout as the number of peers increases.
