Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Poll the main node for the next batch to sign (BFT-496) #2544

Merged
merged 14 commits into from
Aug 1, 2024

Conversation

aakoshh
Copy link
Contributor

@aakoshh aakoshh commented Jul 30, 2024

What ❔

Injects an AttestationStatusWatch into the Executor which on the main node is backed by an AttestationStatusRunner polling the BatchStore::next_batch_to_attest and external nodes call the MainNodeApi::fetch_attestation_status.

TODO:

Test failure - pruning the main node

The following test never finished:

zk test rust -p zksync_node_consensus tests::test_with_pruning::case_0 --no-capture

A little extra logging revealed that the AttesterStatusRunner never gets initialised, because the blocks get pruned earlier than it could read the result, ie. ConsensusDal::batch_of_block(genesis.first_block) probably returns None because it's not found, and never will because it's pruned.

We discussed in #2480 that the main node is not supposed to be pruned, which is why the SQL that looks for what to attest on doesn't look for pruning records any more. Yet this test prunes the main node, and if I just try to prune the external node instead it panics because it doesn't have that block yet.

Either the SQL should take pruning into account after all, or we have to figure out a way to wait in the test until the main node executor is running, or change the test to prune the external node; I did the latter.

Why ❔

This is coordinating attesters to cast their votes on the batch which the main node tells them to do to produce a quorum certificate for all L1 batches without gaps.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zk fmt and zk lint.

@aakoshh aakoshh force-pushed the bft-496-poll-next-batch-to-sign branch from 2fab9a6 to f8c63c9 Compare July 31, 2024 15:02
@aakoshh aakoshh changed the base branch from main to gprusak-endpoint2 July 31, 2024 15:03
@aakoshh aakoshh marked this pull request as ready for review July 31, 2024 15:16
brunoffranca pushed a commit to matter-labs/era-consensus that referenced this pull request Aug 1, 2024
…FT-496) (#166)

## What ❔

Fixes `AttestationStatusRunner::poll_until_some` to return after a
successful poll. Originally it returned a value, then decided to write
it into the field and forgot to add a return statement.

## Why ❔

Tests in matter-labs/zksync-era#2544 time out
because the initialisation never completes and the `Executor` is never
started.
Base automatically changed from gprusak-endpoint2 to main August 1, 2024 07:38
Cargo.toml Outdated Show resolved Hide resolved
@aakoshh aakoshh force-pushed the bft-496-poll-next-batch-to-sign branch from 3c01e41 to abd883f Compare August 1, 2024 08:05
pompon0
pompon0 previously approved these changes Aug 1, 2024
aakoshh added a commit to matter-labs/era-consensus that referenced this pull request Aug 1, 2024
…d updates (BFT-496) (#167)

## What ❔
- [x] Add the initial `GenesisHash` to `AttestationStatus` and change
`AttestationStatusWatch::update` to take `GenesisHash` and return an
error if the latest from the client indicates a change in genesis. This
stops the runner and thus should stop the node itself when the error
bubbles up. On external nodes this is redundant with the routine that
monitors the main node API for changes. On the main node I shouldn't
happen.
- [x] Change `next_batch_to_attest() -> Option<BatchNumber>` into
`attestation_status() -> Option<AttestationStatus>` on
`AttestationStatusClient`, so it now includes the `GenesisHash` as well.
- [x] `LocalAttestationStatusClient` returns the `GenesisHash` it was
started with

## Why ❔

This came out of a discussion here:
matter-labs/zksync-era#2544 (comment)

Notably the `ConsensusDal::attestation_status()` now returns the
`GenesisHash`, which I thought was only to signal through the API from
the main node to the external nodes that a reorg has happened (which
they would eventually learn anyway through polling the genesis endpoint,
causing them to restart), and they should not attest for batches if they
are on a different fork. The comment raised the issue that on the main
node I discarded the `genesis` field from the response because I assumed
it can only be the same as the one we started the `Executor` with, and
second guessing it in the `BatchStore` would be awkward: the original
genesis wasn't available to check against (it's cached in the
`BlockStore`) and running a query to fetch it (even if the
`PersistentBatchStore` supported it) would just compare it to what it
already is.

The way I handled this mismatch for external nodes was to just stop
updating the status by returning `None` and wait for the restart,
treating it like any other transient RPC error, it was just logged. It
does make sense though to elevate it higher and be able to stop the node
if it's in an inconsistent state.

Now it's part of the `AttestationStatusWatch` because that is what we
consider to be our interface with the node, (and not the
`AttestationStatusRunner` which is a convenience construct).

### Reorgs without restart?

If an inconsistency happened on the main node it wouldn't necessarily
have to be fatal: if the genesis was allowed to change, and the
components such as the `BlockStore` were able to handle it at all, then
ostensibly the `AttestationStatus*` stuff could recover as well, for
example by resetting the `BatchVotes` if they see a change in the
`genesis`. The problem currently is that they might have already
received and discarded votes for the new fork, which would not be
gossiped again until clients are reconnected.

Apparently we don't plan to handle regenesis on the fly, so
`AttestationStatus::genesis` is only present to prevent any attempt at
changing it by _causing_ a restart instead.

### Atomicity

In the first version of this PR the `BatchStore` and
`PersistentBatchStore` were also changed to have an
`attestation_status()` method that returned a `GenesisHash` along with
the `BatchNumber`. The only way I could make sense of this was if 1)
changes in genesis while running the main node were allowed and 2) they
could happen between calls to `BlockStore::genesis()` and
`BatchStore::next_batch_to_attest()`, since those are not atomic
methods. We discussed that this will not be supported, the `Genesis`
cached in `BlockStore` will not change. Since we only start the
`Executor` and the `AttestationStatusRunner` [after
regenesis](https://github.com/matter-labs/zksync-era/blob/1d206c0af8f28eb00eb1498d6f2cdbb45ffef72a/core/node/consensus/src/mn.rs#L39)
in `zksync-era`, we don't have to worry about this changing. In that
light it would be counter intuitive to return the genesis hash from
`BatchStore`.


(In the future we could consider using Software Transactional Memory to
have an in-memory representation of the state where we can do consistent
read and writes between multiple Watch-like references. The DAL is
capable of transactional reads, and it would be nice if we could combine
reading for example here the genesis and the last/next batch and not
have to worry about having to compare hashes, when all of these are
local reads.)
@aakoshh aakoshh force-pushed the bft-496-poll-next-batch-to-sign branch from d291919 to c0ed2e1 Compare August 1, 2024 13:54
@aakoshh aakoshh requested a review from pompon0 August 1, 2024 15:40
@RomanBrodetski RomanBrodetski added this pull request to the merge queue Aug 1, 2024
Merged via the queue into main with commit 22cf820 Aug 1, 2024
53 checks passed
@RomanBrodetski RomanBrodetski deleted the bft-496-poll-next-batch-to-sign branch August 1, 2024 19:37
github-merge-queue bot pushed a commit that referenced this pull request Aug 2, 2024
🤖 I have created a release *beep* *boop*
---


##
[24.14.0](core-v24.13.0...core-v24.14.0)
(2024-08-01)


### Features

* Adding SLChainID
([#2547](#2547))
([656e830](656e830))
* **consensus:** add tracing instrumentation to consensus store
([#2546](#2546))
([1e53940](1e53940))
* Poll the main node for the next batch to sign (BFT-496)
([#2544](#2544))
([22cf820](22cf820))
* Support sending logs via OTLP
([#2556](#2556))
([1d206c0](1d206c0))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <[email protected]>
pompon0 added a commit that referenced this pull request Aug 2, 2024
RomanBrodetski pushed a commit that referenced this pull request Aug 2, 2024
…)" (#2574)

Reverts #2544

Consensus on EN gets stuck because the attestationStatus endpoint was
not not rolled out to the main node yet
github-merge-queue bot pushed a commit that referenced this pull request Aug 2, 2024
🤖 I have created a release *beep* *boop*
---


##
[0.1.1](zk_toolbox-v0.1.0...zk_toolbox-v0.1.1)
(2024-08-02)


### Features

* Add recovery tests to zk_supervisor
([#2444](#2444))
([0c0d10a](0c0d10a))
* add revert tests (external node) to zk_toolbox
([#2408](#2408))
([3fbbee1](3fbbee1))
* add revert tests to zk_toolbox
([#2317](#2317))
([c9ad002](c9ad002))
* add zk toolbox
([#2005](#2005))
([60a633b](60a633b))
* Adding SLChainID
([#2547](#2547))
([656e830](656e830))
* Base Token Fundamentals
([#2204](#2204))
([39709f5](39709f5))
* change `zkSync` occurences to `ZKsync`
([#2227](#2227))
([0b4104d](0b4104d))
* **configs:** Do not panic if config is only partially filled
([#2545](#2545))
([db13fe3](db13fe3))
* **eth-watch:** Integrate decentralized upgrades
([#2401](#2401))
([5a48e10](5a48e10))
* L1 batch signing (BFT-474)
([#2414](#2414))
([ab699db](ab699db))
* Minimal External API Fetcher
([#2383](#2383))
([9f255c0](9f255c0))
* Poll the main node for the next batch to sign (BFT-496)
([#2544](#2544))
([22cf820](22cf820))
* Revisit base config values
([#2532](#2532))
([3fac8ac](3fac8ac))
* Support sending logs via OTLP
([#2556](#2556))
([1d206c0](1d206c0))
* Switch to using crates.io deps
([#2409](#2409))
([27fabaf](27fabaf))
* **toolbox:** add format and clippy to zk_toolbox ci
([#2100](#2100))
([49a5c3a](49a5c3a))
* **toolbox:** add verify to zk-toolbox
([#2013](#2013))
([23a545c](23a545c))
* **toolbox:** add zk supervisor database commands
([#2051](#2051))
([f99739b](f99739b))
* **toolbox:** add zk_toolbox ci
([#1985](#1985))
([4ab4922](4ab4922))
* **toolbox:** refactor config to its own crate
([#2063](#2063))
([5cfcc24](5cfcc24))
* Update to consensus 0.1.0-rc.4 (BFT-486)
([#2475](#2475))
([ff6b10c](ff6b10c))
* **vlog:** New vlog interface + opentelemtry improvements
([#2472](#2472))
([c0815cd](c0815cd))
* **zk toolbox:** External node support
([#2287](#2287))
([6384cad](6384cad))
* **zk_toolbox:** Add check for zksync repo path
([#2447](#2447))
([f1cbb74](f1cbb74))
* **zk_toolbox:** Add contract verifier support for zk toolbox
([#2420](#2420))
([d10a24b](d10a24b))
* **zk_toolbox:** Add grafana support
([#2557](#2557))
([f5aaefe](f5aaefe))
* **zk_toolbox:** Add prover generate-sk command
([#2222](#2222))
([40e0a95](40e0a95))
* **zk_toolbox:** Add prover init command
([#2298](#2298))
([159af3c](159af3c))
* **zk_toolbox:** Add prover run
([#2272](#2272))
([598ef7b](598ef7b))
* **zk_toolbox:** add test upgrade subcommand to zk_toolbox
([#2515](#2515))
([1a12f5f](1a12f5f))
* **zk_toolbox:** Add update command
([#2440](#2440))
([e2fa86f](e2fa86f))
* **zk_toolbox:** Allow toolbox find Zkstack.yaml in parent dirs
([#2430](#2430))
([0957119](0957119))
* **zk_toolbox:** Clean command
([#2387](#2387))
([52a4680](52a4680))
* **zk_toolbox:** Dev command
([#2347](#2347))
([f508ac1](f508ac1))
* **zk_toolbox:** Implement default upgrader deployment
([#2526](#2526))
([6d86959](6d86959))
* **zk_toolbox:** resume functionality
([#2376](#2376))
([e5e0473](e5e0473))
* **zk_toolbox:** Small adjustment for zk toolbox
([#2424](#2424))
([ce43c42](ce43c42))
* **zk_toolbox:** Update prover support
([#2533](#2533))
([63c92b6](63c92b6))
* **zk_toolbox:** Update reamde for toolbox
([#2531](#2531))
([d5ba7d8](d5ba7d8))
* **zk_toolbox:** use configs from the main repo
([#2470](#2470))
([4222d13](4222d13))
* **zk_toolbox:** Use docker compose instead of docker-compose
([#2195](#2195))
([2f528ec](2f528ec))
* **zk_toolbox:** use low level command for running verbose command"
([#2358](#2358))
([29671c8](29671c8))
* **zk-toolbox:** add balance check
([#2016](#2016))
([a8b8e4b](a8b8e4b))
* **zk-toolbox:** Deploy custom token
([#2329](#2329))
([3a8fed4](3a8fed4))


### Bug Fixes

* **api:** correct default fee data in eth call
([#2072](#2072))
([e71f6f9](e71f6f9))
* disable localhost wallets on external network interaction
([#2212](#2212))
([a00317d](a00317d))
* **house-keeper:** Fix queue size queries
([#2106](#2106))
([183502a](183502a))
* **toolbox:** Temporary disable fast mode for deploying l1 contracts …
([#2011](#2011))
([2a1d37b](2a1d37b))
* update rust toolchain version
([#2047](#2047))
([9fe5212](9fe5212))
* **zk_toolbox:** Add chain id for local wallet
([#2041](#2041))
([8e147c1](8e147c1))
* **zk_toolbox:** Fix error with balances
([#2034](#2034))
([5d23a3e](5d23a3e))
* **zk_toolbox:** Fix installation guide
([#2035](#2035))
([e9038be](e9038be))
* **zk_toolbox:** Fix protocol version
([#2118](#2118))
([67f6080](67f6080))
* **zk_toolbox:** improve readme to include containers command and cd
([#2073](#2073))
([5e5628f](5e5628f))
* **zk_toolbox:** Move l1 rpc to init stage
([#2074](#2074))
([c127ff1](c127ff1))
* **zk_toolbox:** readme added dependencies section and cleaned up
([#2044](#2044))
([78244c7](78244c7))
* **zk_toolbox:** Set proper pubdata sending mode
([#2507](#2507))
([21fbd77](21fbd77))
* **zk_toolbox:** Show balance
([#2254](#2254))
([f1d9f03](f1d9f03))
* **zk_toolbox:** Some small nit
([#2023](#2023))
([4e96e32](4e96e32))
* **zk_toolbox:** Use both folders for loading contracts
([#2030](#2030))
([97c6d5c](97c6d5c))
* **zk_toolbox:** Use existing ecosystem
([#2534](#2534))
([99fd2bd](99fd2bd))
* **zk_toolbox:** Use slug crate instead of self written function
([#2309](#2309))
([a61f273](a61f273))
* **zk_toolbox:** Use the same l2 address for shared and erc20 bridge
([#2260](#2260))
([26f2010](26f2010))
* **zk_tool:** Change some texts
([#2027](#2027))
([a6232c5](a6232c5))
* zk-toolbox integration tests ci
([#2226](#2226))
([3f521ac](3f521ac))


### Reverts

* "feat: Poll the main node for the next batch to sign (BFT-496)"
([#2574](#2574))
([72d3be8](72d3be8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
github-merge-queue bot pushed a commit that referenced this pull request Aug 2, 2024
🤖 I have created a release *beep* *boop*
---


##
[16.2.0](prover-v16.1.0...prover-v16.2.0)
(2024-08-02)


### Features

* **configs:** Do not panic if config is only partially filled
([#2545](#2545))
([db13fe3](db13fe3))
* Introduce more tracing instrumentation
([#2523](#2523))
([79d407a](79d407a))
* New prover documentation
([#2466](#2466))
([1b61d07](1b61d07))
* optimize LWG and NWG
([#2512](#2512))
([0d00650](0d00650))
* Poll the main node for the next batch to sign (BFT-496)
([#2544](#2544))
([22cf820](22cf820))
* Remove CI and docker images for CPU prover & compressor
([#2540](#2540))
([0dda805](0dda805))
* Remove unused VKs, add docs for BWG
([#2468](#2468))
([2fa6bf0](2fa6bf0))
* Support sending logs via OTLP
([#2556](#2556))
([1d206c0](1d206c0))
* Update to consensus 0.1.0-rc.4 (BFT-486)
([#2475](#2475))
([ff6b10c](ff6b10c))
* **vlog:** New vlog interface + opentelemtry improvements
([#2472](#2472))
([c0815cd](c0815cd))
* **witness_vector_generator:** Make it possible to run multiple wvg
instances in one binary
([#2493](#2493))
([572ad40](572ad40))
* **zk_toolbox:** use configs from the main repo
([#2470](#2470))
([4222d13](4222d13))


### Bug Fixes

* **prover:** Parallelize circuit metadata uploading for BWG
([#2520](#2520))
([f49720f](f49720f))
* **prover:** Reduce database connection overhead for BWG
([#2543](#2543))
([c2439e9](c2439e9))
* **witness_generator:** Only spawn 1 prometheus exporter per witness
generator
([#2492](#2492))
([b9cb222](b9cb222))


### Reverts

* "feat: Poll the main node for the next batch to sign (BFT-496)"
([#2574](#2574))
([72d3be8](72d3be8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: Danil <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants