Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node design is unnecessarily fragile for poet registration phase #6035

Open
Tracked by #328
reythia opened this issue Jun 12, 2024 · 1 comment
Open
Tracked by #328

Node design is unnecessarily fragile for poet registration phase #6035

reythia opened this issue Jun 12, 2024 · 1 comment
Assignees

Comments

@reythia
Copy link

reythia commented Jun 12, 2024

There's a couple of design choices which result in frustrating experiences for users, and issues for poet-operators beyond their control due to behavior at the node level. It's a very fragile design.

It doesn't need to be this way.

Issue 1: Limited retry

Nodes only retry registration for 15-20 minutes, regardless of grace period set. Even the default 12hr servers have a 1hr grace period, but a node with a connection issue in the first 30 minutes will fail to register despite having adequate time before the cycle gap ends.

The 15-20 minute limit is shared by the get proof path, making it impossible to increase registration timeout/retry without also increasing proof start timeout.

Solution: Establish a seperate retry parameter for registration allowing users to make use of the entire grace-period. Better yet, the default behavior should be retry until cycle gap ends as that's what users expect.

Issue 2: Single shot

Nodes fire a batch of registrations to all configured poet servers, if any succeed no retry is made on the other servers. There are obviously many reasons why a system might not be able to reach an external server or get the expected response - that's why retrying network requests is standard practice.

This caused serious isues for Team24 in April - which I originally thought was entirely our fault - whereby some users only registered with one poet due to temporary issues on another poet. Those issues were resolved well within the grace period but I now understand that nodes did not even try to register with the other poet. Unfortunately in the next epoch a poet server died leaving many users unable to proof. It could have been avoided if nodes retried registrations.

Solution: Currently BuildNIPost only attempts registrations if registrations for that node in local.sql has count == 0. It should check registrations vs servers listed in config instead and retry any missing.

@rnizametdinov
Copy link

very annoying design bug

@ConvallariaMaj ConvallariaMaj self-assigned this Jul 12, 2024
spacemesh-bors bot pushed a commit that referenced this issue Aug 15, 2024
## Motivation
Fix an issue #6035



Co-authored-by: ConvallariaMaj <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 📋 Backlog
Development

No branches or pull requests

3 participants