There are a couple of design choices that result in frustrating experiences for users, and in problems for poet operators that are beyond their control because they stem from node-level behavior. It's a very fragile design.
It doesn't need to be this way.
Issue 1: Limited retry
Nodes only retry registration for 15-20 minutes, regardless of the configured grace period. Even the default 12hr servers have a 1hr grace period, yet a node with a connection issue in the first 30 minutes will fail to register despite having ample time before the cycle gap ends.
The 15-20 minute limit is shared with the get-proof path, making it impossible to increase the registration timeout/retry without also increasing the proof start timeout.
Solution: Establish a separate retry parameter for registration, allowing users to make use of the entire grace period. Better yet, the default behavior should be to retry until the cycle gap ends, as that's what users expect.
Issue 2: Single shot
Nodes fire a batch of registrations to all configured poet servers; if any succeed, no retry is made on the others. There are obviously many reasons why a system might temporarily fail to reach an external server or get the expected response - that's why retrying network requests is standard practice.
This caused serious issues for Team24 in April - which I originally thought was entirely our fault - whereby some users only registered with one poet due to temporary issues on another poet. Those issues were resolved well within the grace period, but I now understand that the nodes did not even try to register with the other poet again. Unfortunately, in the next epoch a poet server died, leaving many users unable to get a proof. It could have been avoided if nodes retried registrations.
Solution: Currently BuildNIPost only attempts registrations if the registration count for that node in local.sql is zero. It should instead compare the recorded registrations against the servers listed in the config and retry any that are missing.