Enrolling into Fleet can fail because the control socket doesn't exist #3664

Closed
cmacknz opened this issue Oct 26, 2023 · 5 comments · Fixed by #3815
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@cmacknz
Member

cmacknz commented Oct 26, 2023

This frequently affects our integration tests, because many of them enroll as part of the agent installation. A typical failure looks like:

fixture.go:365: >> running agent with: [/tmp/TestInstallWithEndpointSecurityAndRemoveEndpointIntegrationunprotected1325593366/001/elastic-agent-8.11.0-SNAPSHOT-linux-arm64/elastic-agent install --force --non-interactive --url https://xxxxxx.fleet.us-central1.gcp.foundit.no:443 --enrollment-token <REDACTED>]
    endpoint_security_test.go:358: >>> Ran Enroll. Output: Installing in non-interactive mode.
        Copying files................................................................. DONE
        Installing service...... DONE
        Starting service... DONE
        Enrolling Elastic Agent with Fleet...{"log.level":"info","@timestamp":"2023-10-26T17:31:05.010Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":479},"message":"Starting enrollment to URL: https://da362739b9424443a6e94773953da8d7.fleet.us-central1.gcp.foundit.no:443/","ecs.version":"1.6.0"}
        ......{"log.level":"info","@timestamp":"2023-10-26T17:31:06.610Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":275},"message":"Elastic Agent might not be running; unable to trigger restart","error":{"message":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /opt/Elastic/Agent/data/tmp/elastic-agent-control.sock: connect: no such file or directory\""},"ecs.version":"1.6.0"}
        Successfully enrolled the Elastic Agent.
  1. The error tells us exactly what is wrong: the elastic-agent-control.sock file doesn't exist yet. Possibly we just need to retry accessing it, but there could also have been an unlogged problem creating it.
  2. The command reports "Successfully enrolled the Elastic Agent.", which I don't think is expected. The agent is enrolled, but it won't connect to Fleet until it is manually restarted.
@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Oct 26, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member Author

cmacknz commented Oct 26, 2023

A solution to this was started in #3631, but changing the agent to wait longer for this error broke enrollment in a Docker container.

@belimawr
Contributor

@faec did you have time to look into that? I have a backport PR to fix another flaky test failing for the same reason 😢. #3704

@belimawr
Contributor

@cmacknz I'm a bit confused by this issue.

  1. I cannot reproduce the test failure. I have seen it fail multiple times in CI, but I can't get it to fail when I run it from my machine.
  2. Looking at "Elastic Agent enroll fails to restart daemon on docker" #3628, I see the commits were reverted. I tested with docker.elastic.co/beats/elastic-agent:8.12.0-SNAPSHOT (43267fab08cf) and the enrollment works. I do see the error, but the Elastic Agent starts working:
{"log.level":"info","@timestamp":"2023-11-22T19:38:31.543Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":481},"message":"Starting enrollment to URL: https://c7328563f8804414a613d61c8a77231e.fleet.us-west2.gcp.elastic-cloud.com:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-11-22T19:38:33.162Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":277},"message":"Elastic Agent might not be running; unable to trigger restart","error":{"message":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /usr/share/elastic-agent/state/data/tmp/elastic-agent-control.sock: connect: no such file or directory\""},"ecs.version":"1.6.0"}
Successfully enrolled the Elastic Agent.
{"log.level":"info","@timestamp":"2023-11-22T19:38:33.182Z","log.origin":{"file.name":"cmd/run.go","file.line":157},"message":"Elastic Agent started","log":{"source":"elastic-agent"},"process.pid":6,"agent.version":"8.12.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-11-22T19:38:33.398Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":114},"message":"agent is not upgradable, not starting watcher","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-11-22T19:38:33.398Z","log.origin":{"file.name":"cmd/run.go","file.line":240},"message":"APM instrumentation disabled","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-11-22T19:38:33.399Z","log.origin":{"file.name":"application/application.go","file.line":62},"message":"Gathered system information","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

Should I add back #3554 and then fix the enrol process?

@cmacknz
Member Author

cmacknz commented Nov 22, 2023

You can bring back #3628 as a starting point but enrolling a Docker container into Fleet needs to be fixed before the change can be merged.

docker run \
  --env FLEET_ENROLL=1 \
  --env FLEET_URL=xxxxxxx \
  --env FLEET_ENROLLMENT_TOKEN=xxxxxxx \
  --env FLEET_INSECURE=true \
  docker.elastic.co/beats/elastic-agent:8.12.0-SNAPSHOT

@rdner rdner added the flaky-test Unstable or unreliable test cases. label Jan 18, 2024