[SEV1] Nodes stop syncing due to "failed to verify beacon". #12467

Closed
6 tasks done
jennijuju opened this issue Sep 16, 2024 · 9 comments · Fixed by #12469

Comments

@jennijuju
Member

jennijuju commented Sep 16, 2024

This issue is capturing the investigation and communication about the network sync issue that occurred for some Lotus nodes on 2024-09-16 16:35 UTC (https://status.filecoin.io/#subscribe-modal-s11bzdg62qcb).

Tasks

@github-project-automation github-project-automation bot moved this to 📌 Triage in FilOz Sep 16, 2024
@momack2
Contributor

momack2 commented Sep 16, 2024

We are investigating a possible issue with Lotus versions 1.28.2 and later. A number of SPs report success after downgrading to Lotus 1.28.1 or earlier, and we have been able to reproduce regaining sync by downgrading to 1.28.1. If you're experiencing this issue and are a Storage Provider, we recommend downgrading to 1.28.1 or earlier in the meantime, while Lotus engineering diagnoses the specific root cause. If you are an RPC provider, please do not downgrade yet and wait for more updates.

@BigLep
Member

BigLep commented Sep 16, 2024

We have identified the root cause and will be publishing Lotus node v1.28.3 and v1.29.1 releases that address the issue. We'll post here as soon as the releases are ready for consumption.

We'll update this issue afterwards with details about the root cause.

@BigLep
Member

BigLep commented Sep 16, 2024

The problem with 1.28.2 and 1.29.0 was the update to github.com/kilic/bls12-381 v0.1.1 from v0.1.0 in https://github.com/filecoin-project/lotus/pull/12382/files#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6L252

We are still tracking down what changed in kilic, but Lotus node v1.28.3 and v1.29.1 will use kilic/bls12-381 v0.1.0.

@rvagg
Member

rvagg commented Sep 16, 2024

@BigLep BigLep mentioned this issue Sep 16, 2024
51 tasks
@BigLep
Member

BigLep commented Sep 16, 2024

To be clear, as soon as the patches are released, we recommend all users update to either v1.28.3 or v1.29.1. With these upcoming patches, a user only needs to upgrade. One does NOT need to adjust any config flags like EnableEthRPC.

@github-project-automation github-project-automation bot moved this from 📌 Triage to 🎉 Done in FilOz Sep 16, 2024
@BigLep BigLep reopened this Sep 16, 2024
@github-project-automation github-project-automation bot moved this from 🎉 Done to 📌 Triage in FilOz Sep 16, 2024
@BigLep BigLep mentioned this issue Sep 16, 2024
20 tasks
@rjan90 rjan90 moved this from 📌 Triage to ⌨️ In Progress in FilOz Sep 16, 2024
@rjan90 rjan90 moved this from ⌨️ In Progress to 🎉 Done in FilOz Sep 16, 2024
@BigLep
Member

BigLep commented Sep 16, 2024

Patches have been released:
v1.28.3: https://github.com/filecoin-project/lotus/releases/tag/v1.28.3
v1.29.1: https://github.com/filecoin-project/lotus/releases/tag/v1.29.1

As noted before, all users are encouraged to upgrade ASAP, and no changes are needed to one's config.
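For operators scripting a fleet check, here is a minimal Go sketch of the upgrade decision described in this thread. It assumes only what is stated above: 1.28.2 and 1.29.0 are the affected releases, and version strings may carry a `+...` build suffix. The names `affectedVersions`, `coreVersion`, and `needsUpgrade` are hypothetical helpers, not part of Lotus, and pre-release tags are deliberately not classified because the thread does not cover them.

```go
package main

import (
	"fmt"
	"strings"
)

// affectedVersions lists the releases named in this thread as shipping the
// problematic dependency update. Assumption: no other releases are affected.
var affectedVersions = map[string]bool{
	"1.28.2": true,
	"1.29.0": true,
}

// coreVersion strips an optional leading "v" and any "+build" suffix,
// e.g. "1.28.3+mainnet+git.3c4334071" becomes "1.28.3".
func coreVersion(v string) string {
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if i := strings.Index(v, "+"); i >= 0 {
		v = v[:i]
	}
	return v
}

// needsUpgrade reports whether the given version string is one of the
// releases identified as affected in this issue.
func needsUpgrade(v string) bool {
	return affectedVersions[coreVersion(v)]
}

func main() {
	versions := []string{"1.28.1", "1.28.2", "1.29.0", "1.28.3+mainnet+git.3c4334071"}
	for _, v := range versions {
		fmt.Printf("%s needs upgrade: %t\n", v, needsUpgrade(v))
	}
}
```

The version strings would typically be collected from each node's `lotus version` output and passed to `needsUpgrade`.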

@BigLep
Member

BigLep commented Sep 16, 2024

Resolving this issue since patches have been released that address the issue.

(A separate issue has been created to properly fix this rather than rely on a go mod replace: #12472 )

@BigLep BigLep closed this as completed Sep 16, 2024
@s0nik42

s0nik42 commented Sep 16, 2024

We have 5 nodes:

  • 2 bare-metal machines (never lost sync)
  • 1 VM that lost sync during today's event and recovered.
  • 2 containers (LXC) still erroring even after upgrading to 1.28.3 and reimporting the chain from the Forest snapshot.

These 2 containers had been working fine for months and started erroring sporadically a few days ago.

For now they are both unusable.

All 5 nodes are deployed with the same deployment scripts.

Daemon: 1.28.3+mainnet+git.3c4334071+api1.5.0
Local: lotus version 1.28.3+mainnet+git.3c4334071

@joshdougall

Hi all, I just wanted to post an update about the ChainSafe snapshot service for mainnet and calibnet. We have two Forest nodes running that export snapshots for both networks. To validate these snapshots, they are imported into a Lotus node running v1.28.2, and once synced successfully, the snapshots are made available at https://forest-archive.chainsafe.dev/list/[mainnet/calibnet]/latest.

The validation component was failing due to running Lotus v1.28.2, so we downgraded to v1.28.1 temporarily. There is a small gap in our snapshot archive due to this change, but the snapshot following the downgrade contains all of the blocks since the previous snapshot. We will roll out v1.28.3 shortly, but our snapshot service remained functional during the incident.

@rjan90 rjan90 moved this from 🎉 Done to ☑️ Done (Archive) in FilOz Sep 24, 2024