Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add GenesisHash to AttestationStatusClient and reject unexpected updates (BFT-496) #167

Merged
merged 5 commits into from
Aug 1, 2024

Conversation

aakoshh
Copy link
Contributor

@aakoshh aakoshh commented Aug 1, 2024

What ❔

  • Add the initial GenesisHash to AttestationStatus and change AttestationStatusWatch::update to take GenesisHash and return an error if the latest from the client indicates a change in genesis. This stops the runner and thus should stop the node itself when the error bubbles up. On external nodes this is redundant with the routine that monitors the main node API for changes. On the main node I shouldn't happen.
  • Change next_batch_to_attest() -> Option<BatchNumber> into attestation_status() -> Option<AttestationStatus> on AttestationStatusClient, so it now includes the GenesisHash as well.
  • LocalAttestationStatusClient returns the GenesisHash it was started with

Why ❔

This came out of a discussion here: matter-labs/zksync-era#2544 (comment)

Notably the ConsensusDal::attestation_status() now returns the GenesisHash, which I thought was only to signal through the API from the main node to the external nodes that a reorg has happened (which they would eventually learn anyway through polling the genesis endpoint, causing them to restart), and they should not attest for batches if they are on a different fork. The comment raised the issue that on the main node I discarded the genesis field from the response because I assumed it can only be the same as the one we started the Executor with, and second guessing it in the BatchStore would be awkward: the original genesis wasn't available to check against (it's cached in the BlockStore) and running a query to fetch it (even if the PersistentBatchStore supported it) would just compare it to what it already is.

The way I handled this mismatch for external nodes was to just stop updating the status by returning None and wait for the restart, treating it like any other transient RPC error, it was just logged. It does make sense though to elevate it higher and be able to stop the node if it's in an inconsistent state.

Now it's part of the AttestationStatusWatch because that is what we consider to be our interface with the node, (and not the AttestationStatusRunner which is a convenience construct).

Reorgs without restart?

If an inconsistency happened on the main node it wouldn't necessarily have to be fatal: if the genesis was allowed to change, and the components such as the BlockStore were able to handle it at all, then ostensibly the AttestationStatus* stuff could recover as well, for example by resetting the BatchVotes if they see a change in the genesis. The problem currently is that they might have already received and discarded votes for the new fork, which would not be gossiped again until clients are reconnected.

Apparently we don't plan to handle regenesis on the fly, so AttestationStatus::genesis is only present to prevent any attempt at changing it by causing a restart instead.

Atomicity

In the first version of this PR the BatchStore and PersistentBatchStore were also changed to have an attestation_status() method that returned a GenesisHash along with the BatchNumber. The only way I could make sense of this was if 1) changes in genesis while running the main node were allowed and 2) they could happen between calls to BlockStore::genesis() and BatchStore::next_batch_to_attest(), since those are not atomic methods. We discussed that this will not be supported, the Genesis cached in BlockStore will not change. Since we only start the Executor and the AttestationStatusRunner after regenesis in zksync-era, we don't have to worry about this changing. In that light it would be counter intuitive to return the genesis hash from BatchStore.

(In the future we could consider using Software Transactional Memory to have an in-memory representation of the state where we can do consistent read and writes between multiple Watch-like references. The DAL is capable of transactional reads, and it would be nice if we could combine reading for example here the genesis and the last/next batch and not have to worry about having to compare hashes, when all of these are local reads.)

@aakoshh aakoshh force-pushed the bft-496-attester-genesis-error branch from d36057e to 8dcd1e7 Compare August 1, 2024 11:31
@aakoshh aakoshh changed the title feat: Add GenesisHash to AttestationStatusClient and AttestationStatusRunner feat: Add GenesisHash to AttestationStatusClient and reject unexpected updates Aug 1, 2024
@aakoshh aakoshh changed the title feat: Add GenesisHash to AttestationStatusClient and reject unexpected updates feat: Add GenesisHash to AttestationStatusClient and reject unexpected updates (BFT-496) Aug 1, 2024
@aakoshh aakoshh merged commit ecbabff into main Aug 1, 2024
5 checks passed
@aakoshh aakoshh deleted the bft-496-attester-genesis-error branch August 1, 2024 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants