Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race when accessing unversionedHome #3669

Closed

Conversation

belimawr
Copy link
Contributor

What does this PR do?

It is possible to have a data race when reading/modifying unversionedHome via the helper functions. This PR adds a mutex to prevent the data race.

Why is it important?

It fixes this data race.

[2023-10-27T11:37:31.791Z] ==================

[2023-10-27T11:37:31.791Z] WARNING: DATA RACE

[2023-10-27T11:37:31.791Z] Write at 0x000002a64b70 by goroutine 868742:

[2023-10-27T11:37:31.791Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/paths.SetVersionHome()

[2023-10-27T11:37:31.791Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/paths/common.go:112 +0x54

[2023-10-27T11:37:31.791Z]   github.com/elastic/elastic-agent/pkg/component/runtime.testPaths.func1()

[2023-10-27T11:37:31.791Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/manager_test.go:3625 +0x48

[2023-10-27T11:37:31.791Z]   testing.(*common).Cleanup.func1()

[2023-10-27T11:37:31.791Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1150 +0x134

[2023-10-27T11:37:31.791Z]   testing.(*common).runCleanup()

[2023-10-27T11:37:31.791Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1328 +0x148

[2023-10-27T11:37:31.791Z]   testing.tRunner.func2()

[2023-10-27T11:37:31.791Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1570 +0x48

[2023-10-27T11:37:31.791Z]   runtime.deferreturn()

[2023-10-27T11:37:31.791Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/runtime/panic.go:476 +0x30

[2023-10-27T11:37:31.791Z]   testing.(*T).Run.func1()

[2023-10-27T11:37:31.791Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1629 +0x40

[2023-10-27T11:37:31.791Z] 

[2023-10-27T11:37:31.791Z] Previous read at 0x000002a64b70 by goroutine 868786:

[2023-10-27T11:37:31.791Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/paths.Home()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/paths/common.go:96 +0x90c

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*commandRuntime).start()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/command.go:369 +0x91c

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*commandRuntime).Run()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/command.go:154 +0x348

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*componentRuntimeState).runLoop.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:137 +0xc4

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/internal/pkg/runner.Start.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/internal/pkg/runner/runner.go:69 +0x5c

[2023-10-27T11:37:31.792Z] 

[2023-10-27T11:37:31.792Z] Goroutine 868742 (running) created at:

[2023-10-27T11:37:31.792Z]   testing.(*T).Run()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1629 +0x5b4

[2023-10-27T11:37:31.792Z]   testing.runTests.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:2036 +0x80

[2023-10-27T11:37:31.792Z]   testing.tRunner()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1576 +0x180

[2023-10-27T11:37:31.792Z]   testing.runTests()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:2034 +0x6a8

[2023-10-27T11:37:31.792Z]   testing.(*M).Run()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1906 +0x8e0

[2023-10-27T11:37:31.792Z]   main.main()

[2023-10-27T11:37:31.792Z]       _testmain.go:109 +0x2b8

[2023-10-27T11:37:31.792Z] 

[2023-10-27T11:37:31.792Z] Goroutine 868786 (running) created at:

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/internal/pkg/runner.Start()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/internal/pkg/runner/runner.go:68 +0x24c

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*componentRuntimeState).runLoop()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:135 +0xac

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.newComponentRuntimeState.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:128 +0x34

[2023-10-27T11:37:31.792Z] ==================

[2023-10-27T11:37:31.792Z] ==================

[2023-10-27T11:37:31.792Z] WARNING: DATA RACE

[2023-10-27T11:37:31.792Z] Write at 0x000001ab4fe0 by goroutine 868742:

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/paths.SetTop()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/paths/common.go:81 +0x74

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.testPaths.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/manager_test.go:3626 +0x68

[2023-10-27T11:37:31.792Z]   testing.(*common).Cleanup.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1150 +0x134

[2023-10-27T11:37:31.792Z]   testing.(*common).runCleanup()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1328 +0x148

[2023-10-27T11:37:31.792Z]   testing.tRunner.func2()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1570 +0x48

[2023-10-27T11:37:31.792Z]   runtime.deferreturn()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/runtime/panic.go:476 +0x30

[2023-10-27T11:37:31.792Z]   testing.(*T).Run.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1629 +0x40

[2023-10-27T11:37:31.792Z] 

[2023-10-27T11:37:31.792Z] Previous read at 0x000001ab4fe0 by goroutine 868786:

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*commandRuntime).start()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/command.go:369 +0x924

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*commandRuntime).Run()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/command.go:154 +0x348

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*componentRuntimeState).runLoop.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:137 +0xc4

[2023-10-27T11:37:31.792Z]   github.com/elastic/elastic-agent/internal/pkg/runner.Start.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/internal/pkg/runner/runner.go:69 +0x5c

[2023-10-27T11:37:31.792Z] 

[2023-10-27T11:37:31.792Z] Goroutine 868742 (running) created at:

[2023-10-27T11:37:31.792Z]   testing.(*T).Run()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1629 +0x5b4

[2023-10-27T11:37:31.792Z]   testing.runTests.func1()

[2023-10-27T11:37:31.792Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:2036 +0x80

[2023-10-27T11:37:31.792Z]   testing.tRunner()

[2023-10-27T11:37:31.793Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1576 +0x180

[2023-10-27T11:37:31.793Z]   testing.runTests()

[2023-10-27T11:37:31.793Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:2034 +0x6a8

[2023-10-27T11:37:31.793Z]   testing.(*M).Run()

[2023-10-27T11:37:31.793Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/.gvm/versions/go1.20.10.linux.arm64/src/testing/testing.go:1906 +0x8e0

[2023-10-27T11:37:31.793Z]   main.main()

[2023-10-27T11:37:31.793Z]       _testmain.go:109 +0x2b8

[2023-10-27T11:37:31.793Z] 

[2023-10-27T11:37:31.793Z] Goroutine 868786 (running) created at:

[2023-10-27T11:37:31.793Z]   github.com/elastic/elastic-agent/internal/pkg/runner.Start()

[2023-10-27T11:37:31.793Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/internal/pkg/runner/runner.go:68 +0x24c

[2023-10-27T11:37:31.793Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*componentRuntimeState).runLoop()

[2023-10-27T11:37:31.793Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:135 +0xac

[2023-10-27T11:37:31.793Z]   github.com/elastic/elastic-agent/pkg/component/runtime.newComponentRuntimeState.func1()

[2023-10-27T11:37:31.793Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3465/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:128 +0x34

[2023-10-27T11:37:31.793Z] ==================

[2023-10-27T11:37:31.793Z]     testing.go:1446: race detected during execution of tes

Jenkins Job: https://fleet-ci.elastic.co/blue/organizations/jenkins/elastic-agent%2Felastic-agent-mbp/detail/PR-3465/29/pipeline/
Failing PR: #3465

Checklist

  • My code follows the style guidelines of this project
    - [ ] I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
    - [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
    - [ ] I have added an integration test or an E2E test

## Author's Checklist

How to test this PR locally

You can run the test that triggered the data race pkg/component/runtime TestManager_FakeInput_Configure however it's not guaranteed that the data race will trigger. I ran the tests a 100 times and it was not enough to trigger the condition on my machine.

## Related issues
## Use cases
## Screenshots
## Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?

It is possible to have a data race when reading/modifying
unversionedHome via the helper functions. This commit adds a mutex to
prevent the data race.
@mergify
Copy link
Contributor

mergify bot commented Oct 27, 2023

This pull request does not have a backport label. Could you fix it @belimawr? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@elastic-sonarqube
Copy link

@cmacknz
Copy link
Member

cmacknz commented Oct 27, 2023

We have had this fix attempted before, I'm not convinced the mutex actually fixes the underlying bug, it only silences the race detector:

#3243 (comment)

Reposting that comment below

@cmacknz
Copy link
Member

cmacknz commented Oct 27, 2023

Looking at the stack trace in the race, the race is triggered by test code:

[2023-08-15T12:18:14.379Z] WARNING: DATA RACE
[2023-08-15T12:18:14.379Z] Write at 0x0000015625d0 by goroutine 307:
[2023-08-15T12:18:14.379Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/paths.SetVersionHome()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/paths/common.go:112 +0x54
[2023-08-15T12:18:14.379Z]   github.com/elastic/elastic-agent/pkg/component/runtime.testPaths.func1()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/pkg/component/runtime/manager_test.go:2964 +0x48
[2023-08-15T12:18:14.379Z]   testing.(*common).Cleanup.func1()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:1150 +0x134
[2023-08-15T12:18:14.379Z]   testing.(*common).runCleanup()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:1328 +0x148
[2023-08-15T12:18:14.379Z]   testing.tRunner.func2()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:1570 +0x48
[2023-08-15T12:18:14.379Z]   runtime.deferreturn()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/runtime/panic.go:476 +0x30
[2023-08-15T12:18:14.379Z]   testing.(*T).Run.func1()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:1629 +0x40
[2023-08-15T12:18:14.379Z] 
[2023-08-15T12:18:14.379Z] Previous read at 0x0000015625d0 by goroutine 346:
[2023-08-15T12:18:14.379Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/paths.Home()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/paths/common.go:96 +0x90c
[2023-08-15T12:18:14.379Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*commandRuntime).start()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/pkg/component/runtime/command.go:354 +0x91c
[2023-08-15T12:18:14.379Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*commandRuntime).Run()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/pkg/component/runtime/command.go:136 +0x34c
[2023-08-15T12:18:14.379Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*componentRuntimeState).runLoop.func1()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:137 +0xc4
[2023-08-15T12:18:14.379Z]   github.com/elastic/elastic-agent/internal/pkg/runner.Start.func1()
[2023-08-15T12:18:14.379Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/internal/pkg/runner/runner.go:69 +0x5c
[2023-08-15T12:18:14.379Z] 
[2023-08-15T12:18:14.380Z] Goroutine 307 (running) created at:
[2023-08-15T12:18:14.380Z]   testing.(*T).Run()
[2023-08-15T12:18:14.380Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:1629 +0x5b4
[2023-08-15T12:18:14.380Z]   testing.runTests.func1()
[2023-08-15T12:18:14.380Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:2036 +0x80
[2023-08-15T12:18:14.380Z]   testing.tRunner()
[2023-08-15T12:18:14.380Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:1576 +0x180
[2023-08-15T12:18:14.380Z]   testing.runTests()
[2023-08-15T12:18:14.380Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:2034 +0x6a8
[2023-08-15T12:18:14.380Z]   testing.(*M).Run()
[2023-08-15T12:18:14.380Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/.gvm/versions/go1.20.7.linux.arm64/src/testing/testing.go:1906 +0x8e0
[2023-08-15T12:18:14.380Z]   main.main()
[2023-08-15T12:18:14.380Z]       _testmain.go:101 +0x2b8
[2023-08-15T12:18:14.380Z] 
[2023-08-15T12:18:14.380Z] Goroutine 346 (running) created at:
[2023-08-15T12:18:14.380Z]   github.com/elastic/elastic-agent/internal/pkg/runner.Start()
[2023-08-15T12:18:14.380Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/internal/pkg/runner/runner.go:68 +0x24c
[2023-08-15T12:18:14.380Z]   github.com/elastic/elastic-agent/pkg/component/runtime.(*componentRuntimeState).runLoop()
[2023-08-15T12:18:14.380Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:135 +0xac
[2023-08-15T12:18:14.380Z]   github.com/elastic/elastic-agent/pkg/component/runtime.newComponentRuntimeState.func1()
[2023-08-15T12:18:14.380Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3214/src/github.com/elastic/elastic-agent/pkg/component/runtime/runtime.go:128 +0x34

I think it might be specifically coming from this function changing the value that was read by the command runtime used for the fake components:

func testPaths(t *testing.T) {
t.Helper()
versioned := paths.IsVersionHome()
topPath := paths.Top()
tmpDir, err := os.MkdirTemp("", "at-*")
if err != nil {
t.Fatalf("failed to create temp directory: %s", err)
}
paths.SetVersionHome(false)
paths.SetTop(tmpDir)
t.Cleanup(func() {
paths.SetVersionHome(versioned)
paths.SetTop(topPath)
_ = os.RemoveAll(tmpDir)
})
}

I'm not actually sure that the mutexes here fix the problem. They guarantee only one goroutine reads or writes at a time, but it gives you no guarantee about the ordering of those reads an writes.

If one goroutine reads the value and continues on, then another updates it, then the update is effectively lost if the first goroutine never reads the value again. It's like it never happened. I think this is what the race detector is telling us here.

If my understanding of this is correct then:

  1. This isn't that urgent to fix because it only happens in tests, although we do want to make the race detector happy eventually.
  2. We need to fix the ordering of these RW operations in the tests, which adding locks doesn't fix. Looking at where the goroutines were started it looks like a race between test body and an instance of the runtime manager. This seems strange because just reading the code the ordering actually looks right where the testPaths function is called before the runtime manager gets started:

func TestManager_FakeInput_StartStop(t *testing.T) {
testPaths(t)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
ai, _ := info.NewAgentInfo(true)
m, err := NewManager(newDebugLogger(t), newDebugLogger(t), "localhost:0", ai, apmtest.DiscardTracer, newTestMonitoringMgr(), configuration.DefaultGRPCConfig())

I would bet that the problem is actually a race from the previous instance of the runtime manager since the paths are all global, every goroutine in this test is reading from the same source. The ordering I suspect is:

  1. Test A sets global paths A
  2. Test A starts runtime manager which reads paths A
  3. Test B sets global paths B
  4. The goroutine from 2 is still running and the race detector thinks it lost the write.

If this is true the fix isn't mutexes, it's to stop using global state. We need to be able to create a set of paths per test. We could do this by having the paths all read from a single structure which can be atomically swapped out or copied. In the example above test A would never see the global paths write from test B because they would each have their own copy of the paths.

You could probably also fix this by guaranteeing that all goroutines from previous tests exited before starting the next one. Doing both things would be nice, but you probably only need one of them.

@elasticmachine
Copy link
Contributor

💔 Tests Failed

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-10-27T13:31:04.927+0000

  • Duration: 26 min 27 sec

Test stats 🧪

Test Results
Failed 1
Passed 6552
Skipped 59
Total 6612

Test errors 1

Expand to view the tests failures

Test / Matrix - PLATFORM = 'ubuntu-22 && immutable' / Test / TestManager_FakeInput_GoodUnitToBad – github.com/elastic/elastic-agent/pkg/component/runtime
    Expand to view the error details

     Failed 
    

    Expand to view the stacktrace

     === RUN   TestManager_FakeInput_GoodUnitToBad
        manager_test.go:1371: component state changed: {State:STARTING Message:Starting: spawned pid '53166' Units:map[{UnitType:input UnitID:fake-input}:{State:STARTING Message:Starting: spawned pid '53166' Payload:map[] unitState:0 unitMessage:Starting unitPayload:map[] configStateIdx:0 err:<nil>} {UnitType:input UnitID:good-input}:{State:STARTING Message:Starting: spawned pid '53166' Payload:map[] unitState:0 unitMessage:Starting unitPayload:map[] configStateIdx:0 err:<nil>}] Features:<nil> FeaturesIdx:0 Component:<nil> ComponentIdx:0 VersionInfo:{Name: Version: Meta:map[]} expectedUnits:map[{UnitType:0 UnitID:fake-input}:{state:2 configStateIdx:1 config:0xc0008de000 err:<nil> logLevel:4} {UnitType:0 UnitID:good-input}:{state:2 configStateIdx:1 config:0xc0008de240 err:<nil> logLevel:4}] expectedFeatures:<nil> expectedFeaturesIdx:1 expectedComponent:<nil> expectedComponentIdx:1}
        manager_test.go:1371: component state changed: {State:HEALTHY Message:Healthy: communicating with pid '53166' Units:map[{UnitType:input UnitID:fake-input}:{State:STARTING Message:Starting: spawned pid '53166' Payload:map[] unitState:0 unitMessage: unitPayload:map[] configStateIdx:0 err:<nil>} {UnitType:input UnitID:good-input}:{State:STARTING Message:Starting: spawned pid '53166' Payload:map[] unitState:0 unitMessage: unitPayload:map[] configStateIdx:0 err:<nil>}] Features:<nil> FeaturesIdx:0 Component:<nil> ComponentIdx:0 VersionInfo:{Name:fake Version:1.0 Meta:map[input:fake]} expectedUnits:map[{UnitType:0 UnitID:fake-input}:{state:2 configStateIdx:1 config:0xc0008de000 err:<nil> logLevel:4} {UnitType:0 UnitID:good-input}:{state:2 configStateIdx:1 config:0xc0008de240 err:<nil> logLevel:4}] expectedFeatures:<nil> expectedFeaturesIdx:1 expectedComponent:<nil> expectedComponentIdx:1}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"info","state":"STARTING","expectedState":"HEALTHY","time":"2023-10-27T13:44:19Z","message":"unit change received","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"time":"2023-10-27T13:44:19Z","message":"registering set_state action for unit","level":"trace","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering send_event action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering kill action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering retrieve_features action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering retrieve_apm_config action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"debug","state":"HEALTHY","message":"updating unit state","time":"2023-10-27T13:44:19Z","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","kill_on_interval":false,"time":"2023-10-27T13:44:19Z","message":"kill_on_interval config set value","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"info","state":"STARTING","expectedState":"HEALTHY","time":"2023-10-27T13:44:19Z","message":"unit change received","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering set_state action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering send_event action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering kill action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering retrieve_features action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","time":"2023-10-27T13:44:19Z","message":"registering retrieve_apm_config action for unit","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"message":"updating unit state","time":"2023-10-27T13:44:19Z","level":"debug","state":"HEALTHY","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"trace","kill_on_interval":false,"time":"2023-10-27T13:44:19Z","message":"kill_on_interval config set value","ecs.version":"1.6.0"}
        manager_test.go:1371: component state changed: {State:HEALTHY Message:Healthy: communicating with pid '53166' Units:map[{UnitType:input UnitID:fake-input}:{State:STARTING Message:Starting Payload:map[] unitState:0 unitMessage:Starting unitPayload:map[] configStateIdx:1 err:<nil>} {UnitType:input UnitID:good-input}:{State:HEALTHY Message:Fake Health 2 Payload:map[] unitState:2 unitMessage:Fake Health 2 unitPayload:map[] configStateIdx:1 err:<nil>}] Features:<nil> FeaturesIdx:1 Component:<nil> ComponentIdx:1 VersionInfo:{Name:fake Version:1.0 Meta:map[input:fake]} expectedUnits:map[{UnitType:0 UnitID:fake-input}:{state:2 configStateIdx:1 config:0xc0008de000 err:<nil> logLevel:4} {UnitType:0 UnitID:good-input}:{state:2 configStateIdx:1 config:0xc0008de240 err:<nil> logLevel:4}] expectedFeatures:<nil> expectedFeaturesIdx:1 expectedComponent:<nil> expectedComponentIdx:1}
        manager_test.go:1389: marking good-input as having a hard-error for config
        manager_test.go:1371: component state changed: {State:HEALTHY Message:Healthy: communicating with pid '53166' Units:map[{UnitType:input UnitID:fake-input}:{State:HEALTHY Message:Fake Healthy Payload:map[] unitState:2 unitMessage:Fake Healthy unitPayload:map[] configStateIdx:1 err:<nil>} {UnitType:input UnitID:good-input}:{State:HEALTHY Message:Fake Health 2 Payload:map[] unitState:2 unitMessage:Fake Health 2 unitPayload:map[] configStateIdx:1 err:<nil>}] Features:<nil> FeaturesIdx:1 Component:<nil> ComponentIdx:1 VersionInfo:{Name:fake Version:1.0 Meta:map[input:fake]} expectedUnits:map[{UnitType:0 UnitID:fake-input}:{state:2 configStateIdx:1 config:0xc0008de000 err:<nil> logLevel:4} {UnitType:0 UnitID:good-input}:{state:2 configStateIdx:1 config:0xc0008de240 err:<nil> logLevel:4}] expectedFeatures:<nil> expectedFeaturesIdx:1 expectedComponent:<nil> expectedComponentIdx:1}
        manager_test.go:1447: 
            	Error Trace:	/var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3669/src/github.com/elastic/elastic-agent/pkg/component/runtime/manager_test.go:1447
            	Error:      	Received unexpected error:
            	            	good-input unit should be either failed or stopped, but it's 'HEALTHY'
            	Test:       	TestManager_FakeInput_GoodUnitToBad
        manager_test.go:1371: component state changed: {State:HEALTHY Message:Healthy: communicating with pid '53166' Units:map[{UnitType:input UnitID:fake-input}:{State:HEALTHY Message:Fake Healthy Payload:map[] unitState:2 unitMessage:Fake Healthy unitPayload:map[] configStateIdx:1 err:<nil>} {UnitType:input UnitID:good-input}:{State:FAILED Message:hard-error for config Payload:map[] unitState:2 unitMessage:Fake Health 2 unitPayload:map[] configStateIdx:1 err:0xc0002e0150}] Features:<nil> FeaturesIdx:1 Component:<nil> ComponentIdx:1 VersionInfo:{Name:fake Version:1.0 Meta:map[input:fake]} expectedUnits:map[{UnitType:0 UnitID:fake-input}:{state:2 configStateIdx:1 config:0xc0008de000 err:<nil> logLevel:4} {UnitType:0 UnitID:good-input}:{state:4 configStateIdx:2 config:<nil> err:0xc0002e0150 logLevel:0}] expectedFeatures:<nil> expectedFeaturesIdx:1 expectedComponent:<nil> expectedComponentIdx:1}
    {"log.level":"debug","@timestamp":"2023-10-27T13:44:19.990Z","log.origin":{"file.name":"runtime/manager.go","file.line":708},"message":"Stopping component \"fake-default\"","ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"0000-01-01T00:00:00.000Z","message":"","component":{"binary":"","dataset":"elastic_agent.","id":"fake-default","type":"fake"},"log":{"source":"fake-default"},"level":"info","state":"HEALTHY","expectedState":"STOPPED","time":"2023-10-27T13:44:19Z","message":"unit change received","ecs.version":"1.6.0"}
    --- FAIL: TestManager_FakeInput_GoodUnitToBad (1.28s)
     
    

Steps errors 2

Expand to view the steps failures

Go unitTest
  • Took 6 min 51 sec . View more details here
  • Description: mage unitTest
Publish Cobertura Coverage Report
  • Took 0 min 1 sec . View more details here
  • Description: [2023-10-27T13:47:15.256Z] [Cobertura] Publishing Cobertura coverage report... [2023-10-27T13:47:15

🐛 Flaky test report

❕ There are test failures but not known flaky tests.

Expand to view the summary

Genuine test errors 1

💔 There are test failures but not known flaky tests, most likely a genuine test failure.

  • Name: Test / Matrix - PLATFORM = 'ubuntu-22 && immutable' / Test / TestManager_FakeInput_GoodUnitToBad – github.com/elastic/elastic-agent/pkg/component/runtime

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@belimawr
Copy link
Contributor Author

Thanks @cmacknz. I saw it was triggered by test come, but because the container command uses one of those functions, I was not sure it it could actually happen.

Avoiding global state is definitely a good idea.

I'll close this PR and re-run the tests on my PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants