[Upgrade Watcher][Crash Checker] Consider Agent process as crashed if its PID remains 0 #3166

Merged
ycombinator merged 24 commits into elastic:main from crash-checker-pid-zero on Sep 8, 2023

Conversation

ycombinator
Contributor

@ycombinator ycombinator commented Aug 1, 2023

What does this PR do?

This PR fixes a bug in the Upgrade Watcher's Crash Checker where it considered the Agent process healthy (not crashed) despite its PID remaining 0 every time the Crash Checker retrieved the PID from the service.

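Conceptually, the new check treats a run of all-zero PIDs from the service manager as a crash, instead of relying only on counting PID changes. Below is a minimal, self-contained sketch of the idea; it is not the actual CrashChecker code, and the function name, parameters, and threshold handling are illustrative only.

    package main

    import (
        "fmt"
        "time"
    )

    // agentCrashed sketches the new check: if every PID retrieved from the
    // service manager within the evaluation window is 0, the Agent process
    // never came (back) up, so report it as crashed. If any non-zero PID was
    // seen, the existing PID-change counting (not shown here) still applies.
    func agentCrashed(recentPIDs []int32, checkWindow time.Duration) error {
        if len(recentPIDs) == 0 {
            return nil
        }
        for _, pid := range recentPIDs {
            if pid != 0 {
                return nil
            }
        }
        return fmt.Errorf("service remained crashed (PID = 0) within '%v' seconds", checkWindow.Seconds())
    }

    func main() {
        // Six evaluations, all of which returned PID 0 from the service manager.
        fmt.Println(agentCrashed([]int32{0, 0, 0, 0, 0, 0}, 10*time.Second))
    }
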
Why is it important?

To detect when Agent has crashed so the Upgrade Watcher can initiate a rollback.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

Manually testing this PR is currently blocked on #3377.

Testing this PR locally is not trivial, but it's possible. It involves upgrading to an Agent build where the Agent binary deliberately crashes, and making sure that the Agent is then rolled back to the previous (pre-upgrade) version.

  1. Install and run Elastic Agent >= 8.10.0. This is necessary because we need the pre-upgrade Agent to kick off the Upgrade Watcher using the post-upgrade Agent's binary, a change that was implemented in 8.10.0.

  2. Check out this PR and make changes to the elastic-agent run code path such that the Agent process will exit with an error. The easiest change would probably be to add a small sleep, say 5 seconds, followed by returning an error, right here (a sketch of this change appears after this list):

    RunE: func(cmd *cobra.Command, _ []string) error {
    // done very early so the encrypted store is never used

  3. Also bump the Agent version to 8.12.0 here so the upgrade is possible (this change is also shown in the sketch after this list):

    const defaultBeatVersion = "8.11.0"

  4. Build the Agent package with the changes. Since the Agent version has been bumped up and the corresponding component binaries won't be available, make sure to set AGENT_DROP_PATH to nothing. Make sure NOT to use SNAPSHOT=true; otherwise, the snapshot artifact downloader will kick in during the upgrade process.

    DEV=true AGENT_DROP_PATH= PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package
    
  5. Upgrade the running Agent to the built Agent.

    sudo elastic-agent upgrade 8.12.0 --source-uri file:///home/shaunak/development/github/elastic-agent/build/distributions/ --skip-verify
    
  6. Ensure that the Agent is upgrading.

    sudo elastic-agent status
    
  7. While the upgrade is in progress, watch the Upgrade Watcher's log.

  8. After a couple of minutes, check that the Agent was rolled back.

    sudo elastic-agent version
    
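For steps 2 and 3 above, the testing-only change could look roughly like the following. This is a standalone sketch, not a patch against the actual elastic-agent sources: the real edit goes into the run command's RunE shown in step 2, and the exact file holding defaultBeatVersion may differ.

    // Step 2 (sketch): make the run command sleep briefly, then exit with an
    // error, so the post-upgrade Agent keeps crashing and the Crash Checker
    // sees it die repeatedly.
    package main

    import (
        "errors"
        "time"

        "github.com/spf13/cobra"
    )

    func main() {
        cmd := &cobra.Command{
            Use: "run",
            RunE: func(cmd *cobra.Command, _ []string) error {
                // Testing-only change: pretend to start, then fail.
                time.Sleep(5 * time.Second)
                return errors.New("deliberate crash for rollback testing")
            },
        }
        _ = cmd.Execute()
    }

    // Step 3 (sketch): bump the version constant referenced above so the
    // upgrade target is newer than the installed Agent, e.g.:
    //
    //     const defaultBeatVersion = "8.12.0"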

Related issues

@ycombinator ycombinator added the bug (Something isn't working) and Team:Elastic-Agent (Label for the Agent team) labels on Aug 1, 2023
@mergify
Contributor

mergify bot commented Aug 1, 2023

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip label Aug 1, 2023
@ycombinator ycombinator added the backport-v8.9.0 (Automated backport with mergify) label and removed the backport-skip label on Aug 1, 2023
@elasticmachine
Contributor

elasticmachine commented Aug 1, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-09-08T14:35:44.076+0000

  • Duration: 27 min 17 sec

Test stats 🧪

Test Results
Failed 0
Passed 6281
Skipped 55
Total 6336

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Contributor

elasticmachine commented Aug 1, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.78% (81/82) 👍
Files 66.212% (194/293) 👍
Classes 65.562% (356/543) 👍
Methods 52.701% (1122/2129) 👍 0.089
Lines 38.232% (12759/33373) 👍 0.059
Conditionals 100.0% (0/0) 💚

@ycombinator
Contributor Author

By adding a bunch of logging statements in the rollback code path, I've discovered what is potentially another bug or an area for improvement.

{"log.level":"error","@timestamp":"2023-08-02T00:05:47.525Z","log.origin":{"file.name":"cmd/watch.go","file.line":178},"message":"Agent crash detected: service remained crashed (PID = 0) within '10' seconds","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-08-02T00:05:47.525Z","log.origin":{"file.name":"cmd/watch.go","file.line":105},"message":"Error detected proceeding to rollback: service remained crashed (PID = 0) within '10' seconds","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-08-02T00:05:47.525Z","log.origin":{"file.name":"upgrade/step_relink.go","file.line":42},"message":"Changing symlink","symlink_path":"/opt/Elastic/Agent/elastic-agent","new_path":"/opt/Elastic/Agent/data/elastic-agent-1ed06f/elastic-agent","prev_path":"/opt/Elastic/Agent/elastic-agent.prev","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-08-02T00:05:47.527Z","log.origin":{"file.name":"upgrade/step_mark.go","file.line":131},"message":"Updating active commit","file.path":"/opt/Elastic/Agent/.elastic-agent.active.commit","hash":"1ed06f","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-08-02T00:05:47.528Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":47},"message":"Restarting the agent after rollback","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-02T00:05:47.528Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":155},"message":"Restart count = 5","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-02T00:05:57.528Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":132},"message":"In restartFn","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-08-02T00:05:57.529Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":143},"message":"Failed to trigger restart of running Agent daemon: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused\"","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-02T00:05:57.529Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":155},"message":"Restart count = 4","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-02T00:06:17.529Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":132},"message":"In restartFn","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-08-02T00:06:17.530Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":143},"message":"Failed to trigger restart of running Agent daemon: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused\"","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-02T00:06:17.530Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":155},"message":"Restart count = 3","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-02T00:06:57.530Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":132},"message":"In restartFn","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-08-02T00:06:57.531Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":143},"message":"Failed to trigger restart of running Agent daemon: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused\"","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-02T00:06:57.531Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":155},"message":"Restart count = 2","ecs.version":"1.6.0"}

It looks like the Upgrade Watcher does its job correctly in deciding to roll back to the old Agent. At that point it correctly switches the symlink and updates the active commit in the agent commit file. Then it tries to restart the Agent by connecting to the Agent gRPC server and issuing a Restart request. This fails because there is no Agent running to receive the request.

What's worse is that, because the restart fails, the Upgrade Watcher never gets to the step of cleaning up the new (post-upgrade) Agent's files and, more importantly, the Upgrade Marker file. Eventually, the service starts up the old (pre-upgrade) Agent, which sees that there's an Upgrade Marker file present and starts the Upgrade Watcher. For reasons I haven't figured out yet, this seems to be the new Agent's Upgrade Watcher. This Upgrade Watcher process eventually (in about 10 minutes) succeeds since the old Agent is healthy. Upon success, this Upgrade Watcher then proceeds to clean up files for any Agent installations other than its own... which means it ends up cleaning up the old (and currently running) Agent's files!

@ycombinator
Contributor Author

ycombinator commented Aug 2, 2023

It looks like the Upgrade Watcher does its job correctly in deciding to roll back to the old Agent. At that point it correctly switches the symlink and updates the active commit in the agent commit file. Then it tries to restart the Agent by connecting to the Agent gRPC server and issuing a Restart request. This fails because there is no Agent running to receive the request.

What's worse is that, because the restart fails, the Upgrade Watcher never gets to the step of cleaning up the new (post-upgrade) Agent's files and, more importantly, the Upgrade Marker file. Eventually, the service starts up the old (pre-upgrade) Agent, which sees that there's an Upgrade Marker file present and starts the Upgrade Watcher. For reasons I haven't figured out yet, this seems to be the new Agent's Upgrade Watcher. This Upgrade Watcher process eventually (in about 10 minutes) succeeds since the old Agent is healthy. Upon success, this Upgrade Watcher then proceeds to clean up files for any Agent installations other than its own... which means it ends up cleaning up the old (and currently running) Agent's files!

After thinking about this a bit, my instinct says that we should change the order of operations for the rollback process.

Currently, the order of operations is:

  1. Switch symlink
  2. Update active commit in the agent commit file
  3. Restart Agent (the one we rolled back to)
  4. Cleanup Agent (the one we rolled back from) files + upgrade marker file

I think we should change the order of operations to:

  1. Switch symlink
  2. Update active commit in the agent commit file
  3. Cleanup Agent (the one we rolled back from) files + upgrade marker file
  4. Restart Agent (the one we rolled back to)

This way, if the final restart step fails (as is happening while testing this PR; see previous comment for details), at least all the installed files are in a consistent state. Additionally, the service manager may yet restart the correct Agent process, effectively completing the final restart step.
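
A rough illustration of the proposed ordering (the helper functions below are hypothetical stand-ins for the real steps in upgrade/rollback.go, and the hashes come from the logs above):

    package main

    import (
        "context"
        "fmt"
    )

    // Hypothetical stand-ins for the real rollback steps; they exist only to
    // make the proposed ordering concrete.
    func switchSymlink(hash string) error        { fmt.Println("switch symlink ->", hash); return nil }
    func updateActiveCommit(hash string) error   { fmt.Println("update active commit ->", hash); return nil }
    func cleanup(hash string) error              { fmt.Println("clean up files + upgrade marker for", hash); return nil }
    func restartAgent(ctx context.Context) error { fmt.Println("restart agent"); return nil }

    // rollback applies the proposed order: restore the symlink and active
    // commit, clean up the failed Agent's files and the upgrade marker, and
    // only then attempt the restart. If the restart fails, the on-disk state
    // is already consistent and the service manager can bring the correct
    // Agent back up on its own.
    func rollback(ctx context.Context, prevHash, newHash string) error {
        if err := switchSymlink(prevHash); err != nil {
            return err
        }
        if err := updateActiveCommit(prevHash); err != nil {
            return err
        }
        if err := cleanup(newHash); err != nil {
            return err
        }
        return restartAgent(ctx)
    }

    func main() {
        _ = rollback(context.Background(), "1ed06f", "3b6e07")
    }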

@cmacknz @michalpristas WDYT?

@ycombinator
Contributor Author

Discussed the proposed solution in #3166 (comment) in today's team meeting. Unfortunately the solution is not as straightforward as proposed, mainly due to Windows systems not allowing cleanup of a running process's files (which could happen if the new Agent was running but unhealthy).

One idea that was discussed was to move the cleanup of no-longer-needed Agent files to the Agent itself. But this has some edge cases to consider, e.g. when we need the new Agent to keep the older Agent's files around for a potential rollback. Some potential solutions to such cases were also discussed, e.g. the Upgrade Watcher storing enough information about the desired outcome in the upgrade marker file and not deleting it, but this could potentially be problematic for already-released versions of Agent which may not be expecting the upgrade marker file to be present under the happy path.

After much discussion, the thinking is that we first need to get a handle on the various interactions between the Upgrade Watcher, the upgrade marker file, and the Agent, so we can understand the various failure scenarios as well as the impact of any potential changes to the rollback process.

@mergify
Contributor

mergify bot commented Aug 2, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b crash-checker-pid-zero upstream/crash-checker-pid-zero
git merge upstream/main
git push upstream crash-checker-pid-zero

@ycombinator
Contributor Author

ycombinator commented Aug 3, 2023

I just rebased this PR on main so it now includes the InstallChecker enhancement. Testing again, I'm seeing a new/different potential bug.

To recap, this PR deliberately (just for testing) produces an Agent binary that exits with an error after sleeping for 11 seconds. The test is to start with a good build of Agent, say from main, and then try to upgrade to the Agent built from this PR. I just ran this test and here's what I'm seeing in the Upgrade Watcher log:

$ cat /opt/Elastic/Agent/data/elastic-agent-3b6e07/logs/elastic-agent-watcher-20230803.ndjson
{"log.level":"debug","@timestamp":"2023-08-03T22:38:00.005Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":62},"message":"running checks using 'DBus' controller","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:00.006Z","log.origin":{"file.name":"upgrade/error_checker.go","file.line":52},"message":"Error checker started","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:00.006Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":71},"message":"Crash checker started","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:00.006Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":73},"message":"watcher having PID: 1785608","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:00.006Z","log.origin":{"file.name":"upgrade/install_checker.go","file.line":43},"message":"Install checker started","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:10.009Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":87},"message":"retrieved service PID [1700826]","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:10.011Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":128},"message":"service PID changed 1 times within 6 evaluations","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:10.011Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":73},"message":"watcher having PID: 1785608","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:10.019Z","log.origin":{"file.name":"upgrade/install_checker.go","file.line":54},"message":"retrieve service status: installed","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:20.012Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":87},"message":"retrieved service PID [0]","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:20.012Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":110},"message":"most recent 2 service PIDs within 6 evaulations: [0 1700826]","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:20.012Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":128},"message":"service PID changed 2 times within 6 evaluations","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-08-03T22:38:20.012Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":73},"message":"watcher having PID: 1785608","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-08-03T22:38:20.030Z","log.origin":{"file.name":"cmd/watch.go","file.line":195},"message":"Agent uninstall detected","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-08-03T22:38:20.030Z","log.origin":{"file.name":"cmd/watch.go","file.line":107},"message":"Exiting early due to: %vElastic Agent was uninstalled: service is not installed","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-08-03T22:38:20.030Z","log.origin":{"file.name":"cmd/watch.go","file.line":51},"message":"Watch command failed","error":{"message":"Elastic Agent was uninstalled: service is not installed"},"ecs.version":"1.6.0"}

But while the upgrade was running, and even after the watcher exited, I can see that the Agent service is installed. I'm on Ubuntu so the service manager is systemd.

$ systemctl status elastic-agent.service
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2023-08-03 23:13:19 UTC; 1min 58s ago
    Process: 1826225 ExecStart=/usr/bin/elastic-agent (code=exited, status=1/FAILURE)
   Main PID: 1826225 (code=exited, status=1/FAILURE)
        CPU: 38ms

Aug 03 23:13:19 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
Aug 03 23:13:19 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Failed with result 'exit-code'.

@blakerouse Given that the service is still installed, shouldn't the InstallChecker keep succeeding? The Upgrade Watcher logs seem to indicate that the InstallChecker fails because the service isn't installed.

{"log.level":"warn","@timestamp":"2023-08-03T22:38:20.030Z","log.origin":{"file.name":"cmd/watch.go","file.line":195},"message":"Agent uninstall detected","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-08-03T22:38:20.030Z","log.origin":{"file.name":"cmd/watch.go","file.line":107},"message":"Exiting early due to: %vElastic Agent was uninstalled: service is not installed","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-08-03T22:38:20.030Z","log.origin":{"file.name":"cmd/watch.go","file.line":51},"message":"Watch command failed","error":{"message":"Elastic Agent was uninstalled: service is not installed"},"ecs.version":"1.6.0"}

To be clear, the service is installed, as seen from the systemctl status elastic-agent.service output above, but the Agent process that the service controls keeps crashing (deliberately, as part of testing this PR).

If you agree this is a bug, I can create an issue for it with steps to reproduce.
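
For what it's worth, one way to reproduce the library's view of the service independently of the watcher would be a tiny standalone program against github.com/kardianos/service. It is an assumption on my part that this is the external service module in question, and the service name below is also assumed.

    package main

    import (
        "errors"
        "fmt"

        "github.com/kardianos/service"
    )

    // noop satisfies service.Interface; we only want Status(), not Run().
    type noop struct{}

    func (noop) Start(service.Service) error { return nil }
    func (noop) Stop(service.Service) error  { return nil }

    func main() {
        // Assumed service name; adjust to match the installed unit.
        svc, err := service.New(noop{}, &service.Config{Name: "elastic-agent"})
        if err != nil {
            panic(err)
        }
        status, err := svc.Status()
        switch {
        case errors.Is(err, service.ErrNotInstalled):
            fmt.Println("library reports: not installed")
        case err != nil:
            fmt.Println("status error:", err)
        default:
            // service.StatusRunning / StatusStopped / StatusUnknown
            fmt.Println("status:", status)
        }
    }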

@blakerouse
Contributor

@ycombinator It definitely should not be saying it's uninstalled. Either I did something wrong or the service module we use, which is an external dependency, is doing something wrong.

Might be better off to revert my install checker PR. Don't want it killing the watcher all the time; that would be bad.

@ycombinator
Contributor Author

ycombinator commented Aug 3, 2023

Thanks @blakerouse for the quick check.

This PR here has a lot of changes in it. Let me try to come up with a minimal set of steps to reproduce the issue. I'll file an issue with the steps to reproduce and then we can decide if it makes sense to resolve that by reverting the install checker PR or fixing forward. Either way, it'll be good to have minimal repro steps documented in case we decide to bring the install checker back in the future with tweaks.

@ycombinator
Contributor Author

This PR here has a lot of changes in it. Let me try to come up with a minimal set of steps to reproduce the issue. I'll file an issue with the steps to reproduce and then we can decide if it makes sense to resolve that by reverting the install checker PR or fixing forward. Either way, it'll be good to have minimal repro steps documented in case we decide to bring the install checker back in the future with tweaks.

#3188

@ycombinator
Contributor Author

After much discussion, the thinking is that we first need to get a handle on the various interactions between the Upgrade Watcher, the upgrade marker file, and the Agent, so we can understand the various failure scenarios as well as the impact of any potential changes to the rollback process.

PR to document the details of these interactions: #3189

@mergify
Contributor

mergify bot commented Aug 9, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b crash-checker-pid-zero upstream/crash-checker-pid-zero
git merge upstream/main
git push upstream crash-checker-pid-zero

@ycombinator
Contributor Author

Waiting on #3268 to fix the bug explained in #3166 (comment).

@mergify
Contributor

mergify bot commented Aug 28, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b crash-checker-pid-zero upstream/crash-checker-pid-zero
git merge upstream/main
git push upstream crash-checker-pid-zero

@ycombinator ycombinator force-pushed the crash-checker-pid-zero branch 2 times, most recently from 62359a7 to d523003 on September 7, 2023 at 16:06
@@ -29,7 +29,7 @@ type serviceHandler interface {
 // CrashChecker checks agent for crash pattern in Elastic Agent lifecycle.
 type CrashChecker struct {
     notifyChan chan error
-    q          *disctintQueue
+    q          *distinctQueue

Contributor Author

Just fixing a typo.


@ycombinator ycombinator merged commit 2ce32f8 into elastic:main Sep 8, 2023
8 of 11 checks passed
mergify bot pushed a commit that referenced this pull request Sep 8, 2023
… its PID remains 0 (#3166)

* Refactoring: extract helper method

* Add check for PID remaining 0

* Update + add tests

* Fix typo

* Add CHANGELOG fragment

* Better error messages

* Bump up Agent version + cause error on start

* Better logging for debugging

* More logging for debugging

* Trying secondary restart via service manager

* Add FIXME comments for testing-only changes

* Fix compile errors

* Update testing version

* Implement restart for upstart service manager

* Include service provider name in error

* Implement restart for sysv and darwin

* Implement Restart for Windows

* Remove all Restart() implementations

* Removing extraneous logging statements

* Undo vestigial changes

* Rename all canc -> cancel

* Use assert instead of require

* Remove testing changes

* Use assert instead of require

(cherry picked from commit 2ce32f8)
@ycombinator ycombinator deleted the crash-checker-pid-zero branch September 8, 2023 16:00
ycombinator added a commit that referenced this pull request Sep 12, 2023
… its PID remains 0 (#3166)

(cherry picked from commit 2ce32f8)
ycombinator added a commit that referenced this pull request Sep 14, 2023
… its PID remains 0 (#3166)

(cherry picked from commit 2ce32f8)
ycombinator added a commit that referenced this pull request Sep 14, 2023
… its PID remains 0 (#3166) (#3386)

(cherry picked from commit 2ce32f8)

Co-authored-by: Shaunak Kashyap <[email protected]>
Labels
backport-v8.10.0 (Automated backport with mergify), bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)
Development

Successfully merging this pull request may close these issues.

Upgrade Watcher's Crash Checker is not detecting the correct PID for the Agent process from systemd