-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add / report persistent policy error metadata in Coordinator #3076
Conversation
control_v2.proto
Outdated
// State + message for the Agent Coordinator, reporting manager-level errors | ||
// like failed policy updates. | ||
State coordinatorState = 7; | ||
string coordinatorMessage = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like the idea of exposing an internal detail like the coordinator to the status output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I expect it wouldn't be user-facing unless we change the UI, and it might be nice to have a way to check it when debugging... but I wouldn't mind being conservative about what we add to the proto in this pass. The most important part of Coordinator state was reporting manager errors and similar, which this PR replaces with explicit aggregation.
What if we break the implicit sync between coordinator state and the proto? Leave the proto unchanged, but keep the coordinator state/message in the internal struct? That way it can still be persisted / aggregated into the reported value, but without any explicit new fields over the wire?
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
🌐 Coverage report
|
buildkite test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments here and there
tested this scenario locally:
- install agent with system integration logs/metrics enabled
- unplug cable
- check fleet status is FAILED
- reconnect
- check status is back to green
- removed log collection from system integration
- rename filebeat executable
- add logs back and check status is failed
- rename filebeat back to filebeat
- check it's green
@@ -214,11 +214,26 @@ type Coordinator struct { | |||
// into the reported state before broadcasting -- State() will report | |||
// agentclient.Failed if one of these is set, even if the underlying | |||
// coordinator state is agentclient.Healthy. | |||
runtimeMgrErr error | |||
runtimeMgrErr error // Currently unused |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we keep it then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair question -- I asked @blakerouse a while ago and he said it was so the runtime manager satisfies the Runner
interface. A more immediate answer is that it will probably be used soon: I have a PR in progress to centralize more of the runtime manager logic in its own run loop. For that, it's useful to decouple runtime manager Update
calls from the error result they produce, to avoid accessing runtime internals from the Coordinator goroutine. In that PR I've started using the previously-idle error channel to report component model update failures, and those end up being stored in this field.
return nil | ||
} | ||
|
||
span, ctx := apm.StartSpan(ctx, "refreshComponentModel", "app.internal") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be considered breaking? renaming span name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
being that APM tracing inside of the Elastic Agent itself is for development or debugging only I would think not
@@ -124,53 +158,66 @@ func (c *Coordinator) applyComponentState(state runtime.ComponentComponentState) | |||
// components and units are also healthy (or in ephemeral non-error states). | |||
// Must be called on the main Coordinator goroutine. | |||
func (c *Coordinator) generateReportableState() (s State) { | |||
s.State = c.state.State | |||
s.Message = c.state.Message | |||
//s.State = c.state.State |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove or explain why this is commented out in additional comment
but considering logic below i think it can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, leftover from the earlier draft, thank you!
This pull request is now in conflicts. Could you fix it? 🙏
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good. Testing looks to cover the additions.
return nil | ||
} | ||
|
||
span, ctx := apm.StartSpan(ctx, "refreshComponentModel", "app.internal") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
being that APM tracing inside of the Elastic Agent itself is for development or debugging only I would think not
jenkins test this |
1 similar comment
jenkins test this |
Update Coordinator's error tracking and aggregate state reporting:
CoordinatorState
/CoordinatorMessage
fields tocoordinator.State
and, so Coordinator has its own state instead of overloading the global Agent state.proto.StateResponse
Coordinator
tracking the three types of errors a policy update can encounter (initial parsing, component model generation, and runtime update).(*Coordinator).generateReportableState
to aggregate the new error/state fields into the globalState
/Message
reported by Fleet.Fixes #2852.
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesI have added an entry in./changelog/fragments
using the changelog toolI have added an integration test or an E2E test