Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add / report persistent policy error metadata in Coordinator #3076

Merged
merged 8 commits into from
Jul 24, 2023

Conversation

faec
Copy link
Contributor

@faec faec commented Jul 13, 2023

Update Coordinator's error tracking and aggregate state reporting:

  • Add standalone CoordinatorState / CoordinatorMessage fields to coordinator.State and proto.StateResponse, so Coordinator has its own state instead of overloading the global Agent state.
  • Add new error fields to Coordinator tracking the three types of errors a policy update can encounter (initial parsing, component model generation, and runtime update).
  • Update (*Coordinator).generateReportableState to aggregate the new error/state fields into the global State/Message reported by Fleet.

Fixes #2852.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

@faec faec added bug Something isn't working Team:Elastic-Agent Label for the Agent team backport-skip skip-changelog labels Jul 13, 2023
@faec faec self-assigned this Jul 13, 2023
@elasticmachine
Copy link
Contributor

elasticmachine commented Jul 13, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-07-24T14:58:08.008+0000

  • Duration: 23 min 21 sec

Test stats 🧪

Test Results
Failed 0
Passed 6099
Skipped 27
Total 6126

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

control_v2.proto Outdated
// State + message for the Agent Coordinator, reporting manager-level errors
// like failed policy updates.
State coordinatorState = 7;
string coordinatorMessage = 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like the idea of exposing an internal detail like the coordinator to the status output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I expect it wouldn't be user-facing unless we change the UI, and it might be nice to have a way to check it when debugging... but I wouldn't mind being conservative about what we add to the proto in this pass. The most important part of Coordinator state was reporting manager errors and similar, which this PR replaces with explicit aggregation.

What if we break the implicit sync between coordinator state and the proto? Leave the proto unchanged, but keep the coordinator state/message in the internal struct? That way it can still be persisted / aggregated into the reported value, but without any explicit new fields over the wire?

@faec faec marked this pull request as ready for review July 17, 2023 20:46
@faec faec requested a review from a team as a code owner July 17, 2023 20:46
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine
Copy link
Contributor

elasticmachine commented Jul 17, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.718% (77/78) 👍
Files 68.657% (184/268) 👍
Classes 67.606% (336/497) 👍
Methods 54.222% (1053/1942) 👍 0.094
Lines 40.342% (12048/29865) 👍 0.137
Conditionals 100.0% (0/0) 💚

@faec
Copy link
Contributor Author

faec commented Jul 17, 2023

buildkite test this

Copy link
Contributor

@michalpristas michalpristas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments here and there
tested this scenario locally:

  • install agent with system integration logs/metrics enabled
  • unplug cable
  • check fleet status is FAILED
  • reconnect
  • check status is back to green
  • removed log collection from system integration
  • rename filebeat executable
  • add logs back and check status is failed
  • rename filebeat back to filebeat
  • check it's green

@@ -214,11 +214,26 @@ type Coordinator struct {
// into the reported state before broadcasting -- State() will report
// agentclient.Failed if one of these is set, even if the underlying
// coordinator state is agentclient.Healthy.
runtimeMgrErr error
runtimeMgrErr error // Currently unused
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we keep it then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair question -- I asked @blakerouse a while ago and he said it was so the runtime manager satisfies the Runner interface. A more immediate answer is that it will probably be used soon: I have a PR in progress to centralize more of the runtime manager logic in its own run loop. For that, it's useful to decouple runtime manager Update calls from the error result they produce, to avoid accessing runtime internals from the Coordinator goroutine. In that PR I've started using the previously-idle error channel to report component model update failures, and those end up being stored in this field.

return nil
}

span, ctx := apm.StartSpan(ctx, "refreshComponentModel", "app.internal")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be considered breaking? renaming span name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

being that APM tracing inside of the Elastic Agent itself is for development or debugging only I would think not

@@ -124,53 +158,66 @@ func (c *Coordinator) applyComponentState(state runtime.ComponentComponentState)
// components and units are also healthy (or in ephemeral non-error states).
// Must be called on the main Coordinator goroutine.
func (c *Coordinator) generateReportableState() (s State) {
s.State = c.state.State
s.Message = c.state.Message
//s.State = c.state.State
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove or explain why this is commented out in additional comment
but considering logic below i think it can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, leftover from the earlier draft, thank you!

@mergify
Copy link
Contributor

mergify bot commented Jul 20, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b coord-reporting upstream/coord-reporting
git merge upstream/main
git push upstream coord-reporting

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. Testing looks to cover the additions.

return nil
}

span, ctx := apm.StartSpan(ctx, "refreshComponentModel", "app.internal")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

being that APM tracing inside of the Elastic Agent itself is for development or debugging only I would think not

@faec
Copy link
Contributor Author

faec commented Jul 20, 2023

jenkins test this

1 similar comment
@faec
Copy link
Contributor Author

faec commented Jul 24, 2023

jenkins test this

@faec faec merged commit 4269472 into elastic:main Jul 24, 2023
9 checks passed
@faec faec deleted the coord-reporting branch July 24, 2023 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip bug Something isn't working skip-changelog Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Agent discards unhealthy state because of unrelated changes
4 participants