Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in net_plugin that can lead to unnecessary unlinkable block exception #643

Open
arhag opened this issue Aug 26, 2024 · 0 comments
Assignees
Labels
👍 lgtm OCI Work exclusive to OCI team

Comments

@arhag
Copy link
Member

arhag commented Aug 26, 2024

The net_plugin will send a request to peer asking it to return a particular span of blocks referenced by block number. It will then send another request with a later span of block to next peer (round robin). The net_plugin waits until it receives (which includes successful unpacking of the serialized data) all of the blocks from the latest requested span before sending out the request for the next span of blocks. But it does not validate the integrity of the block data received prior to making the next request.

So if the block data was corrupted some how, e.g. missing a transaction, the net_plugin would realize this after it had already sent out the next request. The corrupted block would of course be dropped, but the block that comes after would likely be received by another peer before the net_plugin has a chance to re-request the block that was corrupted. So that block would run into a unlinkable block exception when attempting to add the block to the fork database since its necessary parent block (the corrupted block) would not yet be in the fork database. The unlinkable block exception causes net_plugin to post that block into the queue for the main thread to process along with the other remaining blocks it receives from that peer for the span of blocks requested.

Those sequence of blocks will all ultimately fail to link because the corrupted block is missing. But after a certain number of failure, net_plugin will close the connection with its "bad" peer, and then after some time, after it reconnects, it will re-request the necessary blocks and resolve the issue automatically. But it gets there after a very inefficient process.

It would be more efficient if the net_plugin was aware of the block corruption and the need to request blocks starting from the block it needs.

One simple way to achieve this is to wait until all the received blocks from one requested span of blocks have been minimally verified to have block integrity before requesting the next span. For example, ensuring the last block of the span has been successfully been added into the fork database prior to requesting the next span is one way of achieving this goal. If it encounters a corrupted block, it can disconnect from that peer immediately (ignoring the remainder of the blocks in the span that the peer is sending), and then send a request for a new span of blocks to the next peer starting with the block number of the corrupted block. This solution can be part of the work of #529 since that also changes the behavior of net_plugin to never post an unlinkable block to the queue for the main thread.

@heifner heifner added 👍 lgtm and removed triage labels Aug 27, 2024
@arhag arhag added this to the Spring v1.1.0-rc1 milestone Aug 27, 2024
@heifner heifner self-assigned this Sep 26, 2024
@heifner heifner added the OCI Work exclusive to OCI team label Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
👍 lgtm OCI Work exclusive to OCI team
Projects
Status: In Progress
Development

No branches or pull requests

7 participants
@heifner @arhag @enf-ci-bot and others