install fails if enroll fails and surface errors #3207
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This pull request is now in conflicts. Could you fix it? 🙏
/test
/test
We are still seeing failures related to failing to restart the agent during enroll: https://buildkite.com/elastic/elastic-agent/builds/3774#018affdb-04c0-40bf-b7f0-8ef6ecb6e453
@@ -443,24 +450,32 @@ func (c *enrollCmd) prepareFleetTLS() error {

func (c *enrollCmd) daemonReloadWithBackoff(ctx context.Context) error {
	err := c.daemonReload(ctx)
	if err != nil &&
Why do we need to special case this? Why can't we just enter the backoff loop below and exit it once it is finished?
	if err == nil ||
		errors.Is(err, context.DeadlineExceeded) ||
		errors.Is(err, context.Canceled) {
You could replace this with https://pkg.go.dev/github.com/cenkalti/backoff/v4, which has a WithContext option that natively understands contexts. We already use this package for upgrades (a sketch of how that could look here follows the quoted code below). Example:
elastic-agent/internal/pkg/agent/application/upgrade/step_download.go
Lines 155 to 185 in 261b94e
expBo := backoff.NewExponentialBackOff()
expBo.InitialInterval = settings.RetrySleepInitDuration
boCtx := backoff.WithContext(expBo, cancelCtx)
var path string
var attempt uint
opFn := func() error {
	attempt++
	u.log.Infof("download attempt %d", attempt)
	downloader, err := downloaderCtor(version, u.log, settings)
	if err != nil {
		return fmt.Errorf("unable to create fetcher: %w", err)
	}
	// All download artifacts expect a name that includes <major>.<minor>.<patch>[-SNAPSHOT] so we have to
	// make sure not to include build metadata we might have in the parsed version (for snapshots we already
	// used that to configure the URL we download the files from)
	path, err = downloader.Download(cancelCtx, agentArtifact, version.VersionWithPrerelease())
	if err != nil {
		return fmt.Errorf("unable to download package: %w", err)
	}
	// Download successful
	return nil
}
opFailureNotificationFn := func(err error, retryAfter time.Duration) {
	u.log.Warnf("%s; retrying (will be retry %d) in %s.", err.Error(), attempt, retryAfter)
}
c.log.Errorf("Elastic Agent might not be running; unable to trigger restart: %v", err)
return fmt.Errorf("could not reload agent daemon, unable to trigger restart: %w", err)
I'm not sure you need both the log and the error; at least I hope the error you return gets logged somewhere obvious.
c.log.Infow("Elastic Agent might not be running; unable to trigger restart", "error", err)
} else {
	c.log.Info("Successfully triggered restart on running Elastic Agent.")

if err = c.daemonReloadWithBackoff(ctx); err != nil {
I think the 1 minute retry with a 10 minute wait might be too slow for attempts to restart the agent on the same host. We can likely retry every second to start, with a maximum wait of up to 2 minutes. If we are still retrying after 2 minutes, chances are we aren't going to get the agent to restart.
We seem to need this logic in integration tests to get them to pass reliably, and using a 1 minute minimum interval is going to make every test that hits this case take an additional minute to run.
The same is probably true for the Fleet retries, but that one is tied to system scalability during mass enrollments so I am more hesitant to change it.
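To make those timings concrete, here is a hypothetical configuration of the same backoff package with a 1-second initial retry and a roughly 2-minute cap. The values are illustrative only, not taken from the PR:

```go
// Illustrative settings for restarting a local daemon: retry quickly and
// give up after about 2 minutes, as suggested above.
expBo := backoff.NewExponentialBackOff()
expBo.InitialInterval = 1 * time.Second // first retry after ~1s
expBo.MaxInterval = 10 * time.Second    // cap the gap between retries
expBo.MaxElapsedTime = 2 * time.Minute  // stop retrying after ~2 minutes
boCtx := backoff.WithContext(expBo, ctx)
// boCtx would then be passed to backoff.RetryNotify as in the sketch above.
```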
/buildkite test this
buildkite test this
You will need to merge in the latest changes in main to get the tests to run properly if you haven't done that already. We had to change the region we provisioned the stack in.
This is failing with the error it could be fixing:
We need more than 1 retry when we can't restart the agent to fix this.
Also that error message can be improved since we know exactly why enroll failed.
* surface errors that might occur during enroll
* fail install command if agent cannot be restarted
* do not print success message if there was an enroll error. Print an error message and the error instead
* add logs to show the different enroll attempts
* add more context to errors
* refactor internal/pkg/agent/install/perms_unix.go and add more context to errors; restore main version
* ignore agent restart error on enroll tests as there is no agent to be restarted
* daemonReloadWithBackoff does not retry on context deadline exceeded
* surface errors that might occur during enroll
* fail install command if agent cannot be restarted
* do not print success message if there was an enroll error. Print an error message and the error instead
* add logs to show the different enroll attempts
* add more context to errors
* refactor internal/pkg/agent/install/perms_unix.go and add more context to errors; restore main version
* ignore agent restart error on enroll tests as there is no agent to be restarted

(cherry picked from commit 33c6934)
What does this PR do?
This change makes the enroll command fail if it cannot restart the agent after enrolling in Fleet.
Why is it important?
For the agent to actually be enrolled, it needs to restart after the enroll process completes, so that it picks up the new config and "connects" to fleet-server.
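To make the described behavior concrete, a hypothetical sketch of the flow: enroll, then force a daemon restart, and fail the command if the restart cannot be triggered. Names such as enrollWithRestart and c.enroll are illustrative; daemonReloadWithBackoff and the log message follow the snippets quoted earlier in this conversation, and the real command is wired differently.

```go
// Illustrative flow only: enrollment only counts as successful once the
// running agent has been restarted and picks up the new Fleet configuration.
func (c *enrollCmd) enrollWithRestart(ctx context.Context) error {
	if err := c.enroll(ctx); err != nil {
		return fmt.Errorf("enroll failed: %w", err)
	}
	// Without a restart the agent keeps running with its old configuration,
	// so a failed restart must surface as an enroll (and install) failure.
	if err := c.daemonReloadWithBackoff(ctx); err != nil {
		return fmt.Errorf("could not restart agent after enrolling in Fleet: %w", err)
	}
	c.log.Info("Successfully triggered restart on running Elastic Agent.")
	return nil
}
```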
Checklist
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added an entry in ./changelog/fragments using the changelog tool
- [ ] I have added an integration test or an E2E test

Related issues
Questions to ask yourself