Pool Upgrade tried to restart, 1 year later #1823

lynnbendixsen · 2023-11-01T18:32:41Z

While nodes were being upgraded to 20.04, an old "pool_upgrade" command appears to have reactivated, and some nodes started writing NODE_UPGRADE txns to the config ledger with an "in process" status. See https://indyscan.indiciotech.io/txs/IND_DEMONET/config for the events as shown in indy_scan.

Sequence of events that resulted in this "issue" being reported:
Sep 2, 2022 18:07:13 -> pool_upgrade command sent for entire Network
Two hours later, 3 nodes still hadn't upgraded (not sure why), so...
Sep 2, 2022 20:26:36 -> pool_upgrade command sent for the 3 nodes that didn't upgrade
Sep 2, 2022 20:41:03 -> pool_upgrade command completes, with the last of the three nodes reporting "complete" for the upgrade
There was no other indication that anything had gone wrong until the first of these three nodes was started back up as a newly installed 20.04 node. That node determined that an upgrade was needed based on the commands sent a year earlier. No new "pool_upgrade" txn was written to the ledger, but the node began writing a txn every 15 minutes stating that a node_upgrade was "in process".
The logs show a repeated occurrence of the following sequence:
upgrader.py: found upgrade START txn
upgrader.py: Node 'My_Node' handles upgrade txn
upgrader.py: Node 'My_Node' schedules upgrade to 1.1.97
upgrader.py: ...zsRy's upgrader processing upgrade for version sovrin=1.1.97
upgrader.py: ...ezsRy's upgrader calling agent for upgrade
node.py: My_Node is about to be upgraded, sending NODE_UPGRADE in_progress to version 1.1.97
upgrader.py: Sending message to control tool: {"message_type": "upgrade", "version": "1.1.97", "pkg_name": "sovrin"}
upgrader.py: Waiting 15 minutes for upgrade to be performed
upgrader.py:
upgrader.py:
upgrader.py: Timeout exceeded for 2022-09-02
upgrader.py: Node My_Node failed upgrade 1662150396107164000 to version 1.1.97 of package sovrin scheduled on 2022-09-02 ... because of exceeded upgrade timeout
The sequence then immediately repeats in the logs:
upgrader.py: found upgrade START txn ...
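
As a quick cross-check of the log above, the upgrade ID in the "failed upgrade" line can be decoded. This is only a sketch and assumes the ID is a Unix timestamp in nanoseconds (an assumption, not something confirmed from the upgrader code). Under that assumption it decodes to the time of the second pool_upgrade command from Sep 2, 2022, which would suggest the restarted node is replaying that year-old txn:

```python
from datetime import datetime, timezone

# Value copied from the "failed upgrade" log line.
# Assumption: it is a Unix timestamp expressed in nanoseconds.
upgrade_id_ns = 1662150396107164000

# Use integer arithmetic; float division would lose precision on a 19-digit value.
seconds, nanos = divmod(upgrade_id_ns, 1_000_000_000)
scheduled = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(scheduled.isoformat())  # 2022-09-02T20:26:36+00:00
```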

I suggest that we either research the proper "fail" command for the controller to return to the node, so that the node writes the "fail" to the ledger properly and cleans things up, OR honor the timeout by writing a fail to the ledger after the timeout instead of simply trying again.
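
To make the second option concrete, here is a minimal sketch of the intended behavior. Everything in it is hypothetical: `UpgradeTracker`, `Status`, and the in-memory `ledger` list are illustration names only and are not taken from upgrader.py or any indy-node API. The idea is that a timeout records a terminal "fail" entry, and a START txn is only acted on when no terminal entry exists for that upgrade ID:

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"
    FAIL = "fail"


TERMINAL = (Status.COMPLETE, Status.FAIL)


@dataclass
class UpgradeTracker:
    """Hypothetical model of the suggested behavior, not the real upgrader."""

    # Stand-in for NODE_UPGRADE entries on the config ledger: (upgrade_id, status).
    ledger: list = field(default_factory=list)

    def last_status(self, upgrade_id):
        # Walk the ledger backwards to find the newest entry for this upgrade.
        for uid, status in reversed(self.ledger):
            if uid == upgrade_id:
                return status
        return None

    def start(self, upgrade_id):
        # Ignore a START txn whose upgrade already reached a terminal status.
        if self.last_status(upgrade_id) in TERMINAL:
            return False
        self.ledger.append((upgrade_id, Status.IN_PROGRESS))
        return True

    def on_timeout(self, upgrade_id):
        # Honor the timeout: record a terminal "fail" instead of retrying.
        self.ledger.append((upgrade_id, Status.FAIL))


tracker = UpgradeTracker()
first_attempt = tracker.start(1662150396107164000)
tracker.on_timeout(1662150396107164000)
second_attempt = tracker.start(1662150396107164000)  # rejected after the fail
```

With this shape, the 15-minute retry loop seen in the logs would stop after the first timeout, because the second `start` call finds the recorded fail and does nothing.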
