install fails if enroll fails and surface errors #3207

Merged
1 commit merged on Oct 6, 2023

Conversation

AndersonQ
Member

What does this PR do?

This change makes the enroll command fail if it cannot restart the agent after enrolling in Fleet.

Why is it important?

For the agent to actually be enrolled, it needs to restart after the enroll process completes so that it picks up the new config and "connects" to fleet-server.

Checklist

  • [x] My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Related issues

  • N/A

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@AndersonQ added the bug, Team:Elastic-Agent, and backport-v8.9.0 labels Aug 8, 2023
@AndersonQ self-assigned this Aug 8, 2023
@elasticmachine
Contributor

elasticmachine commented Aug 8, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-10-06T06:40:17.906+0000

  • Duration: 28 min 4 sec

Test stats 🧪

Test Results
Failed 0
Passed 6485
Skipped 59
Total 6544

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@AndersonQ changed the title from "WIP: enroll command fails if the agent does not restart after enroll" to "install fails if enroll fails and surface errors" Aug 24, 2023
@AndersonQ added the backport-v8.10.0 label Aug 24, 2023
@AndersonQ marked this pull request as ready for review August 24, 2023 15:50
@AndersonQ requested a review from a team as a code owner August 24, 2023 15:50
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz requested review from ycombinator and removed request for strawgate August 24, 2023 16:43
@AndersonQ changed the base branch from main to 8.10 August 24, 2023 17:07
@AndersonQ force-pushed the fail-enroll branch 2 times, most recently from 845f24e to fefb9cb August 24, 2023 17:19
@mergify
Contributor

mergify bot commented Aug 30, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fail-enroll upstream/fail-enroll
git merge upstream/8.10
git push upstream fail-enroll

@AndersonQ requested a review from a team as a code owner August 30, 2023 07:34
@AndersonQ requested review from gsantoro and constanca-m and removed request for a team August 30, 2023 07:34
@AndersonQ changed the base branch from 8.10 to main August 30, 2023 07:35
@AndersonQ removed the backport-v8.9.0 label Aug 31, 2023
@AndersonQ
Member Author

/test

@elasticmachine
Contributor

elasticmachine commented Oct 4, 2023

🌐 Coverage report

Name          Metrics % (covered/total)   Diff
Packages      98.78% (81/82)
Files         67.003% (199/297)
Classes       65.642% (363/553)
Methods       52.976% (1148/2167)
Lines         38.543% (13114/34024)
Conditionals  100.0% (0/0)                💚

@AndersonQ
Member Author

/test

@AndersonQ added the backport-v8.11.0 label and removed the backport-v8.10.0 label Oct 5, 2023
@cmacknz
Member

cmacknz commented Oct 5, 2023

We are still seeing failures related to failing to restart the agent during enroll:

https://buildkite.com/elastic/elastic-agent/builds/3774#018affdb-04c0-40bf-b7f0-8ef6ecb6e453

    endpoint_security_test.go:135: >>> Ran Enroll. Output: Installing in non-interactive mode.
        Copying files................................................. DONE
        Installing service...... DONE
        Starting service... DONE
        Enrolling Elastic Agent with Fleet....{"log.level":"info","@timestamp":"2023-10-05T13:50:39.547Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":479},"message":"Starting enrollment to URL: https://892e41f82f2641428afef7c98bc63fa3.fleet.us-central1.gcp.foundit.no:443/","ecs.version":"1.6.0"}
        ...{"log.level":"info","@timestamp":"2023-10-05T13:50:40.761Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":275},"message":"Elastic Agent might not be running; unable to trigger restart","error":{"message":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /opt/Elastic/Agent/data/tmp/elastic-agent-control.sock: connect: no such file or directory\""},"ecs.version":"1.6.0"}
        Successfully enrolled the Elastic Agent.
         DONE
        Elastic Agent has been successfully installed.

@@ -443,24 +450,32 @@ func (c *enrollCmd) prepareFleetTLS() error {

func (c *enrollCmd) daemonReloadWithBackoff(ctx context.Context) error {
err := c.daemonReload(ctx)
if err != nil &&
Member

Why do we need to special case this? Why can't we just enter the backoff loop below and exit it once it is finished?

Comment on lines +470 to +472
if err == nil ||
errors.Is(err, context.DeadlineExceeded) ||
errors.Is(err, context.Canceled) {
Member

@cmacknz cmacknz Oct 5, 2023

You could replace this with https://pkg.go.dev/github.com/cenkalti/backoff/v4, which has a WithContext option that natively understands contexts. We already use this package for upgrades. Example:

expBo := backoff.NewExponentialBackOff()
expBo.InitialInterval = settings.RetrySleepInitDuration
boCtx := backoff.WithContext(expBo, cancelCtx)
var path string
var attempt uint
opFn := func() error {
	attempt++
	u.log.Infof("download attempt %d", attempt)
	downloader, err := downloaderCtor(version, u.log, settings)
	if err != nil {
		return fmt.Errorf("unable to create fetcher: %w", err)
	}
	// All download artifacts expect a name that includes <major>.<minor.<patch>[-SNAPSHOT] so we have to
	// make sure not to include build metadata we might have in the parsed version (for snapshots we already
	// used that to configure the URL we download the files from)
	path, err = downloader.Download(cancelCtx, agentArtifact, version.VersionWithPrerelease())
	if err != nil {
		return fmt.Errorf("unable to download package: %w", err)
	}
	// Download successful
	return nil
}
opFailureNotificationFn := func(err error, retryAfter time.Duration) {
	u.log.Warnf("%s; retrying (will be retry %d) in %s.", err.Error(), attempt, retryAfter)
}

Comment on lines +280 to +281
c.log.Errorf("Elastic Agent might not be running; unable to trigger restart: %v", err)
return fmt.Errorf("could not reload agent daemon, unable to trigger restart: %w", err)
Member

I'm not sure you need the log and the error, at least I hope the error you return gets logged somewhere obvious.

c.log.Infow("Elastic Agent might not be running; unable to trigger restart", "error", err)
} else {
c.log.Info("Successfully triggered restart on running Elastic Agent.")
if err = c.daemonReloadWithBackoff(ctx); err != nil {
Member

I think the 1 minute retry with a 10 minute wait might be too slow for attempts to restart the agent on the same host. We can likely retry every second to start with, with a maximum wait of up to 2 minutes. If we are still retrying after 2 minutes, chances are we aren't going to get the agent to restart.

We seem to need this logic in integration tests to get them to pass reliably and using a 1 minute minimum interval is going to make every test that hits this case take an additional minute to run.

The same is probably true for the Fleet retries but that one is tied to system scalability during mass enrollments so I am more hesitant to change it.

@AndersonQ
Member Author

/buildkite test this

@AndersonQ
Member Author

buildkite test this

@cmacknz
Member

cmacknz commented Oct 5, 2023

You will need to merge in the latest changes in main to get the tests to run properly if you haven't done that already. We had to change the region we provisioned the stack in.

@cmacknz
Member

cmacknz commented Oct 5, 2023

This is failing with the error it could be fixing:

    endpoint_security_test.go:189: Installing in non-interactive mode.
        Copying files................................................... DONE
        Installing service.... DONE
        Starting service... DONE
        Elastic Agent successfully installed, starting enrollment.
        Enrolling Elastic Agent with Fleet....{"log.level":"info","@timestamp":"2023-10-05T17:49:49.581Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":494},"message":"Starting enrollment to URL: https://2bb73e932bae48fba9bd4ac7037849c5.fleet.us-central1.gcp.foundit.no:443/","ecs.version":"1.6.0"}
        .............................................................................{"log.level":"info","@timestamp":"2023-10-05T17:50:10.808Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":468},"message":"Retrying to restart...","ecs.version":"1.6.0"}
        {"log.level":"error","@timestamp":"2023-10-05T17:50:10.826Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":280},"message":"Elastic Agent might not be running; unable to trigger restart: could not reload deamon after 1 retries: %!w(<nil>)","ecs.version":"1.6.0"}
        Something went wrong while enrolling the Elastic Agent: could not reload deamon after 1 retries: %!w(<nil>)
        Error: could not reload agent daemon, unable to trigger restart: could not reload deamon after 1 retries: %!w(<nil>)
        For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.10/fleet-troubleshooting.html
         FAILED
        Stopping service... DONE
        Uninstalling...
           Stopping service... DONE
           Stopping upgrade watcher; none found... DONE
        .   Removing service.... DONE
           Removing install directory... DONE
           DONE
        Error: enroll command failed for unknown reason: exit status 1
        For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.10/fleet-troubleshooting.html
        
    endpoint_security_test.go:191: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/endpoint_security_test.go:191
        	            				/home/ubuntu/agent/testing/integration/endpoint_security_test.go:88
        	Error:      	Received unexpected error:
        	            	unable to enroll Elastic Agent: error running agent install command: exit status 1
        	Test:       	TestInstallAndCLIUninstallWithEndpointSecurity/protected
        	Messages:   	failed to install agent with policy
--- FAIL: TestInstallAndCLIUninstallWithEndpointSecurity/protected (55.99s)

We need more than 1 retry when we can't restart the agent to fix this.

@cmacknz
Member

cmacknz commented Oct 5, 2023

Error: enroll command failed for unknown reason: exit status 1

Also that error message can be improved since we know exactly why enroll failed.

* surface errors that might occur during enroll
* fail install command if agent cannot be restarted
* do not print success message if there was an enroll error. Print an error message and the error instead
* add logs to show the different enroll attempts
* add more context t errors
* refactor internal/pkg/agent/install/perms_unix.go and add more context to errors
restore main version
* ignore agent restart error on enroll tests as there is no agent to be restarted
* daemonReloadWithBackoff does not retry on context deadline exceeded
@AndersonQ merged commit 33c6934 into elastic:main Oct 6, 2023
24 checks passed
mergify bot pushed a commit that referenced this pull request Oct 6, 2023
* surface errors that might occur during enroll
* fail install command if agent cannot be restarted
* do not print success message if there was an enroll error. Print an error message and the error instead
* add logs to show the different enroll attempts
* add more context t errors
* refactor internal/pkg/agent/install/perms_unix.go and add more context to errors
restore main version
* ignore agent restart error on enroll tests as there is no agent to be restarted

(cherry picked from commit 33c6934)
AndersonQ added a commit that referenced this pull request Oct 6, 2023
Labels
backport-v8.11.0, bug, Team:Elastic-Agent
4 participants