-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Poll the main node for the next batch to sign (BFT-496) #2544
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
8 tasks
aakoshh
force-pushed
the
bft-496-poll-next-batch-to-sign
branch
from
July 31, 2024 15:02
2fab9a6
to
f8c63c9
Compare
brunoffranca
pushed a commit
to matter-labs/era-consensus
that referenced
this pull request
Aug 1, 2024
…FT-496) (#166) ## What ❔ Fixes `AttestationStatusRunner::poll_until_some` to return after a successful poll. Originally it returned a value, then decided to write it into the field and forgot to add a return statement. ## Why ❔ Tests in matter-labs/zksync-era#2544 time out because the initialisation never completes and the `Executor` is never started.
pompon0
reviewed
Aug 1, 2024
pompon0
reviewed
Aug 1, 2024
pompon0
reviewed
Aug 1, 2024
pompon0
reviewed
Aug 1, 2024
aakoshh
force-pushed
the
bft-496-poll-next-batch-to-sign
branch
from
August 1, 2024 08:05
3c01e41
to
abd883f
Compare
pompon0
previously approved these changes
Aug 1, 2024
3 tasks
aakoshh
added a commit
to matter-labs/era-consensus
that referenced
this pull request
Aug 1, 2024
…d updates (BFT-496) (#167) ## What ❔ - [x] Add the initial `GenesisHash` to `AttestationStatus` and change `AttestationStatusWatch::update` to take `GenesisHash` and return an error if the latest from the client indicates a change in genesis. This stops the runner and thus should stop the node itself when the error bubbles up. On external nodes this is redundant with the routine that monitors the main node API for changes. On the main node I shouldn't happen. - [x] Change `next_batch_to_attest() -> Option<BatchNumber>` into `attestation_status() -> Option<AttestationStatus>` on `AttestationStatusClient`, so it now includes the `GenesisHash` as well. - [x] `LocalAttestationStatusClient` returns the `GenesisHash` it was started with ## Why ❔ This came out of a discussion here: matter-labs/zksync-era#2544 (comment) Notably the `ConsensusDal::attestation_status()` now returns the `GenesisHash`, which I thought was only to signal through the API from the main node to the external nodes that a reorg has happened (which they would eventually learn anyway through polling the genesis endpoint, causing them to restart), and they should not attest for batches if they are on a different fork. The comment raised the issue that on the main node I discarded the `genesis` field from the response because I assumed it can only be the same as the one we started the `Executor` with, and second guessing it in the `BatchStore` would be awkward: the original genesis wasn't available to check against (it's cached in the `BlockStore`) and running a query to fetch it (even if the `PersistentBatchStore` supported it) would just compare it to what it already is. The way I handled this mismatch for external nodes was to just stop updating the status by returning `None` and wait for the restart, treating it like any other transient RPC error, it was just logged. It does make sense though to elevate it higher and be able to stop the node if it's in an inconsistent state. Now it's part of the `AttestationStatusWatch` because that is what we consider to be our interface with the node, (and not the `AttestationStatusRunner` which is a convenience construct). ### Reorgs without restart? If an inconsistency happened on the main node it wouldn't necessarily have to be fatal: if the genesis was allowed to change, and the components such as the `BlockStore` were able to handle it at all, then ostensibly the `AttestationStatus*` stuff could recover as well, for example by resetting the `BatchVotes` if they see a change in the `genesis`. The problem currently is that they might have already received and discarded votes for the new fork, which would not be gossiped again until clients are reconnected. Apparently we don't plan to handle regenesis on the fly, so `AttestationStatus::genesis` is only present to prevent any attempt at changing it by _causing_ a restart instead. ### Atomicity In the first version of this PR the `BatchStore` and `PersistentBatchStore` were also changed to have an `attestation_status()` method that returned a `GenesisHash` along with the `BatchNumber`. The only way I could make sense of this was if 1) changes in genesis while running the main node were allowed and 2) they could happen between calls to `BlockStore::genesis()` and `BatchStore::next_batch_to_attest()`, since those are not atomic methods. We discussed that this will not be supported, the `Genesis` cached in `BlockStore` will not change. Since we only start the `Executor` and the `AttestationStatusRunner` [after regenesis](https://github.com/matter-labs/zksync-era/blob/1d206c0af8f28eb00eb1498d6f2cdbb45ffef72a/core/node/consensus/src/mn.rs#L39) in `zksync-era`, we don't have to worry about this changing. In that light it would be counter intuitive to return the genesis hash from `BatchStore`. (In the future we could consider using Software Transactional Memory to have an in-memory representation of the state where we can do consistent read and writes between multiple Watch-like references. The DAL is capable of transactional reads, and it would be nice if we could combine reading for example here the genesis and the last/next batch and not have to worry about having to compare hashes, when all of these are local reads.)
aakoshh
force-pushed
the
bft-496-poll-next-batch-to-sign
branch
from
August 1, 2024 13:54
d291919
to
c0ed2e1
Compare
brunoffranca
approved these changes
Aug 1, 2024
This was referenced Aug 1, 2024
github-merge-queue bot
pushed a commit
that referenced
this pull request
Aug 2, 2024
🤖 I have created a release *beep* *boop* --- ## [24.14.0](core-v24.13.0...core-v24.14.0) (2024-08-01) ### Features * Adding SLChainID ([#2547](#2547)) ([656e830](656e830)) * **consensus:** add tracing instrumentation to consensus store ([#2546](#2546)) ([1e53940](1e53940)) * Poll the main node for the next batch to sign (BFT-496) ([#2544](#2544)) ([22cf820](22cf820)) * Support sending logs via OTLP ([#2556](#2556)) ([1d206c0](1d206c0)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: zksync-era-bot <[email protected]>
github-merge-queue bot
pushed a commit
that referenced
this pull request
Aug 2, 2024
🤖 I have created a release *beep* *boop* --- ## [0.1.1](zk_toolbox-v0.1.0...zk_toolbox-v0.1.1) (2024-08-02) ### Features * Add recovery tests to zk_supervisor ([#2444](#2444)) ([0c0d10a](0c0d10a)) * add revert tests (external node) to zk_toolbox ([#2408](#2408)) ([3fbbee1](3fbbee1)) * add revert tests to zk_toolbox ([#2317](#2317)) ([c9ad002](c9ad002)) * add zk toolbox ([#2005](#2005)) ([60a633b](60a633b)) * Adding SLChainID ([#2547](#2547)) ([656e830](656e830)) * Base Token Fundamentals ([#2204](#2204)) ([39709f5](39709f5)) * change `zkSync` occurences to `ZKsync` ([#2227](#2227)) ([0b4104d](0b4104d)) * **configs:** Do not panic if config is only partially filled ([#2545](#2545)) ([db13fe3](db13fe3)) * **eth-watch:** Integrate decentralized upgrades ([#2401](#2401)) ([5a48e10](5a48e10)) * L1 batch signing (BFT-474) ([#2414](#2414)) ([ab699db](ab699db)) * Minimal External API Fetcher ([#2383](#2383)) ([9f255c0](9f255c0)) * Poll the main node for the next batch to sign (BFT-496) ([#2544](#2544)) ([22cf820](22cf820)) * Revisit base config values ([#2532](#2532)) ([3fac8ac](3fac8ac)) * Support sending logs via OTLP ([#2556](#2556)) ([1d206c0](1d206c0)) * Switch to using crates.io deps ([#2409](#2409)) ([27fabaf](27fabaf)) * **toolbox:** add format and clippy to zk_toolbox ci ([#2100](#2100)) ([49a5c3a](49a5c3a)) * **toolbox:** add verify to zk-toolbox ([#2013](#2013)) ([23a545c](23a545c)) * **toolbox:** add zk supervisor database commands ([#2051](#2051)) ([f99739b](f99739b)) * **toolbox:** add zk_toolbox ci ([#1985](#1985)) ([4ab4922](4ab4922)) * **toolbox:** refactor config to its own crate ([#2063](#2063)) ([5cfcc24](5cfcc24)) * Update to consensus 0.1.0-rc.4 (BFT-486) ([#2475](#2475)) ([ff6b10c](ff6b10c)) * **vlog:** New vlog interface + opentelemtry improvements ([#2472](#2472)) ([c0815cd](c0815cd)) * **zk toolbox:** External node support ([#2287](#2287)) ([6384cad](6384cad)) * **zk_toolbox:** Add check for zksync repo path ([#2447](#2447)) ([f1cbb74](f1cbb74)) * **zk_toolbox:** Add contract verifier support for zk toolbox ([#2420](#2420)) ([d10a24b](d10a24b)) * **zk_toolbox:** Add grafana support ([#2557](#2557)) ([f5aaefe](f5aaefe)) * **zk_toolbox:** Add prover generate-sk command ([#2222](#2222)) ([40e0a95](40e0a95)) * **zk_toolbox:** Add prover init command ([#2298](#2298)) ([159af3c](159af3c)) * **zk_toolbox:** Add prover run ([#2272](#2272)) ([598ef7b](598ef7b)) * **zk_toolbox:** add test upgrade subcommand to zk_toolbox ([#2515](#2515)) ([1a12f5f](1a12f5f)) * **zk_toolbox:** Add update command ([#2440](#2440)) ([e2fa86f](e2fa86f)) * **zk_toolbox:** Allow toolbox find Zkstack.yaml in parent dirs ([#2430](#2430)) ([0957119](0957119)) * **zk_toolbox:** Clean command ([#2387](#2387)) ([52a4680](52a4680)) * **zk_toolbox:** Dev command ([#2347](#2347)) ([f508ac1](f508ac1)) * **zk_toolbox:** Implement default upgrader deployment ([#2526](#2526)) ([6d86959](6d86959)) * **zk_toolbox:** resume functionality ([#2376](#2376)) ([e5e0473](e5e0473)) * **zk_toolbox:** Small adjustment for zk toolbox ([#2424](#2424)) ([ce43c42](ce43c42)) * **zk_toolbox:** Update prover support ([#2533](#2533)) ([63c92b6](63c92b6)) * **zk_toolbox:** Update reamde for toolbox ([#2531](#2531)) ([d5ba7d8](d5ba7d8)) * **zk_toolbox:** use configs from the main repo ([#2470](#2470)) ([4222d13](4222d13)) * **zk_toolbox:** Use docker compose instead of docker-compose ([#2195](#2195)) ([2f528ec](2f528ec)) * **zk_toolbox:** use low level command for running verbose command" ([#2358](#2358)) ([29671c8](29671c8)) * **zk-toolbox:** add balance check ([#2016](#2016)) ([a8b8e4b](a8b8e4b)) * **zk-toolbox:** Deploy custom token ([#2329](#2329)) ([3a8fed4](3a8fed4)) ### Bug Fixes * **api:** correct default fee data in eth call ([#2072](#2072)) ([e71f6f9](e71f6f9)) * disable localhost wallets on external network interaction ([#2212](#2212)) ([a00317d](a00317d)) * **house-keeper:** Fix queue size queries ([#2106](#2106)) ([183502a](183502a)) * **toolbox:** Temporary disable fast mode for deploying l1 contracts … ([#2011](#2011)) ([2a1d37b](2a1d37b)) * update rust toolchain version ([#2047](#2047)) ([9fe5212](9fe5212)) * **zk_toolbox:** Add chain id for local wallet ([#2041](#2041)) ([8e147c1](8e147c1)) * **zk_toolbox:** Fix error with balances ([#2034](#2034)) ([5d23a3e](5d23a3e)) * **zk_toolbox:** Fix installation guide ([#2035](#2035)) ([e9038be](e9038be)) * **zk_toolbox:** Fix protocol version ([#2118](#2118)) ([67f6080](67f6080)) * **zk_toolbox:** improve readme to include containers command and cd ([#2073](#2073)) ([5e5628f](5e5628f)) * **zk_toolbox:** Move l1 rpc to init stage ([#2074](#2074)) ([c127ff1](c127ff1)) * **zk_toolbox:** readme added dependencies section and cleaned up ([#2044](#2044)) ([78244c7](78244c7)) * **zk_toolbox:** Set proper pubdata sending mode ([#2507](#2507)) ([21fbd77](21fbd77)) * **zk_toolbox:** Show balance ([#2254](#2254)) ([f1d9f03](f1d9f03)) * **zk_toolbox:** Some small nit ([#2023](#2023)) ([4e96e32](4e96e32)) * **zk_toolbox:** Use both folders for loading contracts ([#2030](#2030)) ([97c6d5c](97c6d5c)) * **zk_toolbox:** Use existing ecosystem ([#2534](#2534)) ([99fd2bd](99fd2bd)) * **zk_toolbox:** Use slug crate instead of self written function ([#2309](#2309)) ([a61f273](a61f273)) * **zk_toolbox:** Use the same l2 address for shared and erc20 bridge ([#2260](#2260)) ([26f2010](26f2010)) * **zk_tool:** Change some texts ([#2027](#2027)) ([a6232c5](a6232c5)) * zk-toolbox integration tests ci ([#2226](#2226)) ([3f521ac](3f521ac)) ### Reverts * "feat: Poll the main node for the next batch to sign (BFT-496)" ([#2574](#2574)) ([72d3be8](72d3be8)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
github-merge-queue bot
pushed a commit
that referenced
this pull request
Aug 2, 2024
🤖 I have created a release *beep* *boop* --- ## [16.2.0](prover-v16.1.0...prover-v16.2.0) (2024-08-02) ### Features * **configs:** Do not panic if config is only partially filled ([#2545](#2545)) ([db13fe3](db13fe3)) * Introduce more tracing instrumentation ([#2523](#2523)) ([79d407a](79d407a)) * New prover documentation ([#2466](#2466)) ([1b61d07](1b61d07)) * optimize LWG and NWG ([#2512](#2512)) ([0d00650](0d00650)) * Poll the main node for the next batch to sign (BFT-496) ([#2544](#2544)) ([22cf820](22cf820)) * Remove CI and docker images for CPU prover & compressor ([#2540](#2540)) ([0dda805](0dda805)) * Remove unused VKs, add docs for BWG ([#2468](#2468)) ([2fa6bf0](2fa6bf0)) * Support sending logs via OTLP ([#2556](#2556)) ([1d206c0](1d206c0)) * Update to consensus 0.1.0-rc.4 (BFT-486) ([#2475](#2475)) ([ff6b10c](ff6b10c)) * **vlog:** New vlog interface + opentelemtry improvements ([#2472](#2472)) ([c0815cd](c0815cd)) * **witness_vector_generator:** Make it possible to run multiple wvg instances in one binary ([#2493](#2493)) ([572ad40](572ad40)) * **zk_toolbox:** use configs from the main repo ([#2470](#2470)) ([4222d13](4222d13)) ### Bug Fixes * **prover:** Parallelize circuit metadata uploading for BWG ([#2520](#2520)) ([f49720f](f49720f)) * **prover:** Reduce database connection overhead for BWG ([#2543](#2543)) ([c2439e9](c2439e9)) * **witness_generator:** Only spawn 1 prometheus exporter per witness generator ([#2492](#2492)) ([b9cb222](b9cb222)) ### Reverts * "feat: Poll the main node for the next batch to sign (BFT-496)" ([#2574](#2574)) ([72d3be8](72d3be8)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: Danil <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What ❔
Injects an
AttestationStatusWatch
into theExecutor
which on the main node is backed by anAttestationStatusRunner
polling theBatchStore::next_batch_to_attest
and external nodes call theMainNodeApi::fetch_attestation_status
.TODO:
era-consensus
dependency once feat: Poll the main node for batch to vote on (BFT-496) era-consensus#161 is merged and0.1.0-rc.5
is publishedTest failure - pruning the main node
The following test never finished:
zk test rust -p zksync_node_consensus tests::test_with_pruning::case_0 --no-capture
A little extra logging revealed that the
AttesterStatusRunner
never gets initialised, because the blocks get pruned earlier than it could read the result, ie.ConsensusDal::batch_of_block(genesis.first_block)
probably returnsNone
because it's not found, and never will because it's pruned.We discussed in #2480 that the main node is not supposed to be pruned, which is why the SQL that looks for what to attest on doesn't look for pruning records any more. Yet this test prunes the main node, and if I just try to prune the external node instead it panics because it doesn't have that block yet.
Either the SQL should take pruning into account after all, or we have to figure out a way to wait in the test until the main node executor is running, or change the test to prune the external node; I did the latter.
Why ❔
This is coordinating attesters to cast their votes on the batch which the main node tells them to do to produce a quorum certificate for all L1 batches without gaps.
Checklist
zk fmt
andzk lint
.