There are a couple of design choices that result in frustrating experiences for users, and in problems for poet operators that are beyond their control because they stem from node-level behavior. It's a very fragile design.
It doesn't need to be this way.
Issue 1: Limited retry
Nodes only retry registration for 15-20 minutes, regardless of the configured grace period. Even the default 12hr servers have a 1hr grace period, yet a node with a connection issue in the first 30 minutes will fail to register despite having ample time before the cycle gap ends.
The 15-20 minute limit is shared with the get-proof path, making it impossible to increase the registration timeout/retry without also increasing the proof start timeout.
Solution: Establish a separate retry parameter for registration, allowing users to make use of the entire grace period. Better yet, the default behavior should be to retry until the cycle gap ends, as that's what users expect.
Issue 2: Single shot
Nodes fire a batch of registrations to all configured poet servers; if any succeed, no retry is made on the others. There are obviously many reasons why a system might temporarily fail to reach an external server or get the expected response - that's why retrying network requests is standard practice.
This caused serious issues for Team24 in April - which I originally thought was entirely our fault - whereby some users only registered with one poet due to temporary issues on another poet. Those issues were resolved well within the grace period, but I now understand that the nodes did not even try to register with the other poet again. Unfortunately, in the next epoch a poet server died, leaving many users unable to get a proof. It could have been avoided if nodes retried registrations.
Solution: Currently BuildNIPost only attempts registrations if the registration count for that node in local.sql is zero. It should instead compare the recorded registrations against the servers listed in the config and retry any that are missing.