-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Upgrade Watcher] Try restarting Agent in multiple ways during rollback #3268
[Upgrade Watcher] Try restarting Agent in multiple ways during rollback #3268
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
internal/pkg/agent/cmd/run.go
Outdated
// FIXME: for testing only! | ||
time.Sleep(10 * time.Second) | ||
return errors.New("deliberately crashing agent (after invoking watcher)") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be removed once this PR is tested and approved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in 2af8c30.
version/version.go
Outdated
@@ -4,5 +4,5 @@ | |||
|
|||
package version | |||
|
|||
const defaultBeatVersion = "8.11.0" | |||
const defaultBeatVersion = "8.12.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be removed once this PR is tested and approved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in 2af8c30.
7b20499
to
06be899
Compare
install.StopService(topPath) | ||
err := install.StartService(topPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it better to create a new install.RestartServive(topPath)
wrapper around these two calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I like that idea. Logging the error for stop service might be helpful as well, just for debugging issues in the future, having the contents of that error in the log could be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, restarting by service should work in rollback cases without a need for check if service is there as for now we don't support upgrade scenario without agent being installed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good to me.
install.StopService(topPath) | ||
err := install.StartService(topPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I like that idea. Logging the error for stop service might be helpful as well, just for debugging issues in the future, having the contents of that error in the log could be useful.
🌐 Coverage report
|
func RestartService(topPath string) error { | ||
err := StopService(topPath) | ||
if err != nil { | ||
return err | ||
} | ||
|
||
return StartService(topPath) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are dealing with unlikely problems already, if the agent were to stop for whatever reason between the Stop
and Start
calls it would be left stopped.
The most likely reason this could happen is a laptop system being powered off.
It would be much safer to use systemctl restart
or equivalent if it exists because it should prevent this situation. The service manager knows you want the services end state to be running in this case, which is not true with an isolated call to stop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed in d81fe70.
2af8c30
to
f5560d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
45f327a
to
8917a6e
Compare
8917a6e
to
14ea553
Compare
This pull request is now in conflicts. Could you fix it? 🙏
|
14ea553
to
44bd7ec
Compare
Unrelated CI failures. Force merge this one as it will bring some additional stability to our upgrade process. |
…ck (#3268) (#3311) * Try restarting via the service as a backup * [Testing] Crashing agent * Adding CHANGELOG fragment * [Testing] Bump up version * Adding install.RestartService * Removing testing code * Use service manager's restart * Updating comment (cherry picked from commit 9377dca) Co-authored-by: Shaunak Kashyap <[email protected]>
What does this PR do?
This PR fixes a potential bug where, during an upgrade, if the upgraded Agent process crashes, the Upgrade Watcher initiates a rollback but isn't able to restart the rolled back Agent.
Before this PR, the Upgrade Watcher would attempt to restart the rolled back Agent via a GRPC call to the Agent's
Restart()
function. However, if the Agent process has crashed, there is no GRPC server available to receive such a call, thus preventing the Agent from being restarted.With this PR, the Upgrade Watcher continues to use the GRPC
Restart()
call to restart the Agent. However, if that call fails, the Upgrade Watcher uses a fallback method of asking the Elastic Agent service to restart the Agent.Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesI have added tests that prove my fix is effective or that my feature works./changelog/fragments
using the changelog toolI have added an integration test or an E2E testThis bug was found while writing the integration test in [Integration Test] New Agent after upgrade fails to start #3085. As such, this bug fix will be tested by that test.How to test this PR locally
Build an Agent package from
main
(currently at version8.11.0
), unpack it, and install it. This will serve as the version of Agent we will upgrade from.Build an Agent package from the branch for this PR (say at version
8.12.0
). This will serve as the version of Agent we will upgrade to. This Agent will deliberately crash upon start up but it also has the bugfix in the Upgrade Watcher code path.Upgrade to the
8.12.0
Agent.Ensure that the Agent version is now
8.12.0
. Because the Agent will be crashing, you may only see the binary version, not the daemon version.Tail the Upgrade Watcher logs. Ensure that the Upgrade Watcher eventually initiates rollback.
Ensure that the rolled back Agent gets restarted.
Ensure that the Agent version is now back to
8.11.0
.Related issues