Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harden linux/windows vault implementation for concurrent access. #3240

Merged
merged 17 commits into from
Aug 24, 2023

Conversation

aleksmaus
Copy link
Member

What does this PR do?

Harden linux/windows vault implementation for concurrent access.

Problem details

During repetitive E2E tests runs (took over 5 hours to reproduce) the file based vault implementation (that is used for linux and Windows) got into the state where the key value was corrupted, the resulted in the error:

Enrollment failed after failed to store agent config: could not save enrollment information: failed to ensure key: 
could not get agent key: could not decrypt key: cipher: message authentication failed

It's difficult to reproduce in the normal agent usage scenario, but looking at the implementation there is no mitigation against multiple processes accessing vault. The lower level API allows to reproduce the situation artificially see TestVaultConcurrent test https://github.com/elastic/elastic-agent/compare/main...aleksmaus:feature/harden_fsvault?expand=1#diff-8a5388ab5e868a37ee07711c7978dc51561a302be8fa560946038e71375fe70bR183

Solution

Harden linux/windows vault implementation for concurrent access.

Why is it important?

Hardens the vault data agains corruption when it's being accessed concurrently.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

@aleksmaus aleksmaus added the enhancement New feature or request label Aug 15, 2023
@aleksmaus aleksmaus requested a review from a team as a code owner August 15, 2023 02:15
@aleksmaus aleksmaus changed the title Feature/harden fsvault Harden linux/windows vault implementation for concurrent access. Aug 15, 2023
@mergify
Copy link
Contributor

mergify bot commented Aug 15, 2023

This pull request does not have a backport label. Could you fix it @aleksmaus? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@elasticmachine
Copy link
Contributor

elasticmachine commented Aug 15, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-08-24T15:44:54.124+0000

  • Duration: 27 min 59 sec

Test stats 🧪

Test Results
Failed 0
Passed 6217
Skipped 47
Total 6264

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Contributor

elasticmachine commented Aug 15, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.765% (80/81) 👍 0.015
Files 67.368% (192/285) 👎 -0.464
Classes 66.105% (353/534) 👎 -0.187
Methods 52.874% (1104/2088) 👍
Lines 38.572% (12596/32656) 👎 -0.141
Conditionals 100.0% (0/0) 💚

@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Aug 15, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

internal/pkg/agent/vault/vault_linux.go Show resolved Hide resolved
darwin = "darwin"
)

// MigrateAgentSecret migrates agent secret if the secret doesn't exists agent upgrade from 8.3.0 - 8.3.2 to 8.x and above on Linux and Windows platforms.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need this anymore? Upgrading from 8.3.x -> 8.9.x is still possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea, but it's not called from anywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we lost this in the big 8.6 merge

// This is specific for the agent upgrade from 8.3.0 - 8.3.2 to 8.x and above on Linux and Windows platforms.
// Addresses the issue: https://github.com/elastic/elastic-agent/issues/682
// The vault directory was located in the hash versioned "Home" directory of the agent.
// This moves the vault directory two levels up into the "Config" directory next to fleet.enc file
// in order to be able to "upgrade" the agent from deb/rpm that is not invoking the upgrade handle and
// doesn't perform the migration of the state or vault.
// If the agent secret doesn't exist, then search for the newest agent secret in the agent data directories
// and migrate it into the new vault location.
err = migration.MigrateAgentSecret(logger)

Here was the original bug: #682

Would need to confirm if that situation is still possible with the change in directory structure in 8.6 and above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whooops. Been doing ok without it for few releases. 🤷

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to decide what to do about this one. I am fine with deleting it.

The original fix went into 8.3.3 which also permanently moved the vault directory location outside of the versioned data path:

At that point it should be fixed forever as the migration isn't needed anymore. Anyone on 8.3.0 would have upgraded to 8.3.3 and got this fix (or gone directly to 8.4.x or 8.5.x and gotten it).

If we ever seen a version of this problem it can be solved by upgrading to one of 8.3.3, 8.4.x or 8.5.x before going to a version higher than that so the fix hasn't been lost.

@mergify
Copy link
Contributor

mergify bot commented Aug 15, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature/harden_fsvault upstream/feature/harden_fsvault
git merge upstream/main
git push upstream feature/harden_fsvault

@cmacknz cmacknz added the backport-v8.10.0 Automated backport with mergify label Aug 16, 2023
@mergify mergify bot removed the backport-skip label Aug 16, 2023
internal/pkg/agent/vault/vault_linux.go Outdated Show resolved Hide resolved
if err != nil {
return fmt.Errorf("failed to acquire shared lock: %v, err: %w", v.lock.Path(), err)
}
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker]
Please, err is always nil so return nil.

Suggested change
return err
return nil

internal/pkg/agent/vault/vault_linux.go Outdated Show resolved Hide resolved
Comment on lines 228 to 240
// writeFile "atomic" file write, utilizes temp file and replace
// os.CreateTemp creates the file with 0600 mask, which is what we need
func writeFile(fp string, data []byte) (err error) {
dir, fn := filepath.Split(fp)
if dir == "" {
dir = "."
}

f, err := os.CreateTemp(dir, fn)
if err != nil {
return fmt.Errorf("failed creating temp file: %w", err)
}
defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion]
1 - move the comment about the os.CreateTem to where it's used.
2 - I think the sentence could be improved

Suggested change
// writeFile "atomic" file write, utilizes temp file and replace
// os.CreateTemp creates the file with 0600 mask, which is what we need
func writeFile(fp string, data []byte) (err error) {
dir, fn := filepath.Split(fp)
if dir == "" {
dir = "."
}
f, err := os.CreateTemp(dir, fn)
if err != nil {
return fmt.Errorf("failed creating temp file: %w", err)
}
defer func() {
// writeFile performs an "atomic" file write operation, using a temporary file
// and then replacing the original.
func writeFile(fp string, data []byte) (err error) {
dir, fn := filepath.Split(fp)
if dir == "" {
dir = "."
}
// os.CreateTemp creates the file with a 0600 permission mask,
// which is the required permission setting.
f, err := os.CreateTemp(dir, fn)
if err != nil {
return fmt.Errorf("failed creating temp file: %w", err)
}
defer func() {

internal/pkg/agent/vault/vault_linux.go Outdated Show resolved Hide resolved
internal/pkg/agent/vault/vault_options_nondarwin.go Outdated Show resolved Hide resolved
internal/pkg/agent/vault/vault_test.go Outdated Show resolved Hide resolved
}
}

func doCrud(t *testing.T, ctx context.Context, vaultPath, key string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion]
I don't see t being used, so you could remove it.

Suggested change
func doCrud(t *testing.T, ctx context.Context, vaultPath, key string) error {
func doCrud(ctx context.Context, vaultPath, key string) error {

@@ -59,42 +65,90 @@ func New(path string, opts ...OptionFunc) (v *Vault, err error) {
}
}

entropy, err := getOrCreateSeed(path, options.readonly)
r := &Vault{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NIT]
Why are you using r instead of v?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to return v as nil if err happens after this line

internal/pkg/agent/vault/vault_windows.go Outdated Show resolved Hide resolved
internal/pkg/agent/vault/vault_windows.go Show resolved Hide resolved
@aleksmaus
Copy link
Member Author

buildkite test this

@pierrehilbert
Copy link
Contributor

Buildkite is failing on main because we are unable to deploy 8.11.0 on Elastic Cloud for now. @AndersonQ is working on a workaround to unlock us in the meantime.

Copy link
Member

@AndersonQ AndersonQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one small blocker, I marked it as [Blocker]

return nil, err
}
defer func() {
err = r.unlock(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker]
It'll override the error of getOrCreateSeed. Better to errors.Join the errors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will not, it's a helper func using errors.Join

return nil, err
}
defer func() {
err = v.unlock(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker]
Same here, it'll override any other error

Comment on lines 211 to 214
// unlock unlocks the file lock and preserves the original error if there was a error
func (v *Vault) unlock(err error) error {
return errors.Join(err, v.lock.Unlock())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker]
I don't understand why an unlock function needs to receive an error. The caller is responsible to handle any previous error and then decide what to do if unlock fails as well.

Besides it harms the single responsibility principle, unlock is also responsible to aggregate an error for no obvious reason.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not agree with making it a blocker, it's a helper function to make the deferred call code above more compact and consistent, since it's done in multiple places

Comment on lines 141 to 146
if _, err = os.Stat(v.filepathFromKey(key)); err == nil {
ok = true
} else if errors.Is(err, fs.ErrNotExist) {
err = nil
}
return ok, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion]
Avoid if else for a clearer, more explicit code

Suggested change
if _, err = os.Stat(v.filepathFromKey(key)); err == nil {
ok = true
} else if errors.Is(err, fs.ErrNotExist) {
err = nil
}
return ok, err
if _, err = os.Stat(v.filepathFromKey(key)); err != nil {
if errors.Is(err, fs.ErrNotExist) {
return false, nil
}
return false, err
}
return true, nil

return err
}
defer func() {
err = v.unlock(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker]
same as the others

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explained above why this is done the way it's done

Comment on lines 181 to 183
// fileNameFromKey returns the filename as a hash of the vault seed combined with the key
// this ties the key with the vault seed eliminating the change of attempting
// to decrypt the key for the wrong vault seed value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// fileNameFromKey returns the filename as a hash of the vault seed combined with the key
// this ties the key with the vault seed eliminating the change of attempting
// to decrypt the key for the wrong vault seed value.
// fileNameFromKey returns the filename as a hash of the vault seed combined with the key.
// This ties the key with the vault seed eliminating the chance of attempting
// to decrypt the key for the wrong vault seed value.

Comment on lines 47 to 49
func writeFile(fp string, data []byte) (err error) {
return os.WriteFile(fp, data, 0600)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion]

Suggested change
func writeFile(fp string, data []byte) (err error) {
return os.WriteFile(fp, data, 0600)
}
func writeFile(fp string, data []byte) error {
return os.WriteFile(fp, data, 0600)
}

@aleksmaus
Copy link
Member Author

just one small blocker, I marked it as [Blocker]

addressed the comments. do not agree with the blocker, added explanation why, it's a private helper function to avoid repeating the same code in 5 places. could you review again?

@cmacknz cmacknz changed the base branch from main to 8.10 August 24, 2023 14:42
@cmacknz cmacknz added forwardport-main and removed backport-v8.10.0 Automated backport with mergify labels Aug 24, 2023
@cmacknz
Copy link
Member

cmacknz commented Aug 24, 2023

This is going to go into 8.10 first and then get forward ported to main. This will get us around the issue with the 8.11 snapshot.

@cmacknz
Copy link
Member

cmacknz commented Aug 24, 2023

/test

1 similar comment
@cmacknz
Copy link
Member

cmacknz commented Aug 24, 2023

/test

@AndersonQ AndersonQ merged commit c4c314a into elastic:8.10 Aug 24, 2023
14 checks passed
mergify bot pushed a commit that referenced this pull request Aug 24, 2023
* Harden linux/windows vault implementation for concurrent access

* Spit off some platform-specific code to make linter happy

* Add missing field for windows implementation

* Fix return params for Get on windows

* Replace deprecated ioutil with os package API for windows implementation

* Reorganise platform specific code

* Refactored vault code, extracted shared code, addressed the code review comments

* Split off platform specific options

* Remove wrong build directive from the common options

* do not try to generate HTML test report if test XML output isn't present

---------

Co-authored-by: Anderson Queiroz <[email protected]>
(cherry picked from commit c4c314a)
@AndersonQ
Copy link
Member

Merged with failing integration tests because this PR fixes another integration tests issue affecting all tests installing the agent. Therefore we need to have it merged to reduce the number of failing tests and focus on one issue at a time.

cmacknz pushed a commit that referenced this pull request Aug 24, 2023
…) (#3288)

* Harden linux/windows vault implementation for concurrent access

* Spit off some platform-specific code to make linter happy

* Add missing field for windows implementation

* Fix return params for Get on windows

* Replace deprecated ioutil with os package API for windows implementation

* Reorganise platform specific code

* Refactored vault code, extracted shared code, addressed the code review comments

* Split off platform specific options

* Remove wrong build directive from the common options

* do not try to generate HTML test report if test XML output isn't present

---------

Co-authored-by: Anderson Queiroz <[email protected]>
(cherry picked from commit c4c314a)

Co-authored-by: Aleksandr Maus <[email protected]>
cmacknz added a commit to cmacknz/elastic-agent that referenced this pull request Aug 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request forwardport-main skip-changelog Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants