-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harden linux/windows vault implementation for concurrent access. #3240
Conversation
This pull request does not have a backport label. Could you fix it @aleksmaus? 🙏
NOTE: |
7ac89ee
to
7fca502
Compare
🌐 Coverage report
|
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
darwin = "darwin" | ||
) | ||
|
||
// MigrateAgentSecret migrates agent secret if the secret doesn't exists agent upgrade from 8.3.0 - 8.3.2 to 8.x and above on Linux and Windows platforms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we need this anymore? Upgrading from 8.3.x -> 8.9.x is still possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea, but it's not called from anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we lost this in the big 8.6 merge
elastic-agent/internal/pkg/agent/cmd/run.go
Lines 132 to 140 in 0067844
// This is specific for the agent upgrade from 8.3.0 - 8.3.2 to 8.x and above on Linux and Windows platforms. | |
// Addresses the issue: https://github.com/elastic/elastic-agent/issues/682 | |
// The vault directory was located in the hash versioned "Home" directory of the agent. | |
// This moves the vault directory two levels up into the "Config" directory next to fleet.enc file | |
// in order to be able to "upgrade" the agent from deb/rpm that is not invoking the upgrade handle and | |
// doesn't perform the migration of the state or vault. | |
// If the agent secret doesn't exist, then search for the newest agent secret in the agent data directories | |
// and migrate it into the new vault location. | |
err = migration.MigrateAgentSecret(logger) |
Here was the original bug: #682
Would need to confirm if that situation is still possible with the change in directory structure in 8.6 and above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whooops. Been doing ok without it for few releases. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been trying to decide what to do about this one. I am fine with deleting it.
The original fix went into 8.3.3 which also permanently moved the vault directory location outside of the versioned data path:
- Here's the commit in 8.3.3 74a597c
- Here it is in 8.3.3 as the last commit on July 19th: https://github.com/elastic/elastic-agent/commits/v8.3.3
At that point it should be fixed forever as the migration isn't needed anymore. Anyone on 8.3.0 would have upgraded to 8.3.3 and got this fix (or gone directly to 8.4.x or 8.5.x and gotten it).
If we ever seen a version of this problem it can be solved by upgrading to one of 8.3.3, 8.4.x or 8.5.x before going to a version higher than that so the fix hasn't been lost.
This pull request is now in conflicts. Could you fix it? 🙏
|
if err != nil { | ||
return fmt.Errorf("failed to acquire shared lock: %v, err: %w", v.lock.Path(), err) | ||
} | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Blocker]
Please, err
is always nil
so return nil
.
return err | |
return nil |
// writeFile "atomic" file write, utilizes temp file and replace | ||
// os.CreateTemp creates the file with 0600 mask, which is what we need | ||
func writeFile(fp string, data []byte) (err error) { | ||
dir, fn := filepath.Split(fp) | ||
if dir == "" { | ||
dir = "." | ||
} | ||
|
||
f, err := os.CreateTemp(dir, fn) | ||
if err != nil { | ||
return fmt.Errorf("failed creating temp file: %w", err) | ||
} | ||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Suggestion]
1 - move the comment about the os.CreateTem
to where it's used.
2 - I think the sentence could be improved
// writeFile "atomic" file write, utilizes temp file and replace | |
// os.CreateTemp creates the file with 0600 mask, which is what we need | |
func writeFile(fp string, data []byte) (err error) { | |
dir, fn := filepath.Split(fp) | |
if dir == "" { | |
dir = "." | |
} | |
f, err := os.CreateTemp(dir, fn) | |
if err != nil { | |
return fmt.Errorf("failed creating temp file: %w", err) | |
} | |
defer func() { | |
// writeFile performs an "atomic" file write operation, using a temporary file | |
// and then replacing the original. | |
func writeFile(fp string, data []byte) (err error) { | |
dir, fn := filepath.Split(fp) | |
if dir == "" { | |
dir = "." | |
} | |
// os.CreateTemp creates the file with a 0600 permission mask, | |
// which is the required permission setting. | |
f, err := os.CreateTemp(dir, fn) | |
if err != nil { | |
return fmt.Errorf("failed creating temp file: %w", err) | |
} | |
defer func() { |
} | ||
} | ||
|
||
func doCrud(t *testing.T, ctx context.Context, vaultPath, key string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Suggestion]
I don't see t
being used, so you could remove it.
func doCrud(t *testing.T, ctx context.Context, vaultPath, key string) error { | |
func doCrud(ctx context.Context, vaultPath, key string) error { |
@@ -59,42 +65,90 @@ func New(path string, opts ...OptionFunc) (v *Vault, err error) { | |||
} | |||
} | |||
|
|||
entropy, err := getOrCreateSeed(path, options.readonly) | |||
r := &Vault{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[NIT]
Why are you using r
instead of v
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
want to return v as nil if err happens after this line
buildkite test this |
Buildkite is failing on main because we are unable to deploy 8.11.0 on Elastic Cloud for now. @AndersonQ is working on a workaround to unlock us in the meantime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just one small blocker, I marked it as [Blocker]
return nil, err | ||
} | ||
defer func() { | ||
err = r.unlock(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Blocker]
It'll override the error of getOrCreateSeed
. Better to errors.Join
the errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will not, it's a helper func using errors.Join
return nil, err | ||
} | ||
defer func() { | ||
err = v.unlock(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Blocker]
Same here, it'll override any other error
// unlock unlocks the file lock and preserves the original error if there was a error | ||
func (v *Vault) unlock(err error) error { | ||
return errors.Join(err, v.lock.Unlock()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Blocker]
I don't understand why an unlock function needs to receive an error. The caller is responsible to handle any previous error and then decide what to do if unlock
fails as well.
Besides it harms the single responsibility principle, unlock is also responsible to aggregate an error for no obvious reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not agree with making it a blocker, it's a helper function to make the deferred call code above more compact and consistent, since it's done in multiple places
if _, err = os.Stat(v.filepathFromKey(key)); err == nil { | ||
ok = true | ||
} else if errors.Is(err, fs.ErrNotExist) { | ||
err = nil | ||
} | ||
return ok, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Suggestion]
Avoid if else
for a clearer, more explicit code
if _, err = os.Stat(v.filepathFromKey(key)); err == nil { | |
ok = true | |
} else if errors.Is(err, fs.ErrNotExist) { | |
err = nil | |
} | |
return ok, err | |
if _, err = os.Stat(v.filepathFromKey(key)); err != nil { | |
if errors.Is(err, fs.ErrNotExist) { | |
return false, nil | |
} | |
return false, err | |
} | |
return true, nil |
return err | ||
} | ||
defer func() { | ||
err = v.unlock(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Blocker]
same as the others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explained above why this is done the way it's done
// fileNameFromKey returns the filename as a hash of the vault seed combined with the key | ||
// this ties the key with the vault seed eliminating the change of attempting | ||
// to decrypt the key for the wrong vault seed value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// fileNameFromKey returns the filename as a hash of the vault seed combined with the key | |
// this ties the key with the vault seed eliminating the change of attempting | |
// to decrypt the key for the wrong vault seed value. | |
// fileNameFromKey returns the filename as a hash of the vault seed combined with the key. | |
// This ties the key with the vault seed eliminating the chance of attempting | |
// to decrypt the key for the wrong vault seed value. |
func writeFile(fp string, data []byte) (err error) { | ||
return os.WriteFile(fp, data, 0600) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Suggestion]
func writeFile(fp string, data []byte) (err error) { | |
return os.WriteFile(fp, data, 0600) | |
} | |
func writeFile(fp string, data []byte) error { | |
return os.WriteFile(fp, data, 0600) | |
} |
addressed the comments. do not agree with the blocker, added explanation why, it's a private helper function to avoid repeating the same code in 5 places. could you review again? |
This is going to go into 8.10 first and then get forward ported to main. This will get us around the issue with the 8.11 snapshot. |
/test |
1 similar comment
/test |
* Harden linux/windows vault implementation for concurrent access * Spit off some platform-specific code to make linter happy * Add missing field for windows implementation * Fix return params for Get on windows * Replace deprecated ioutil with os package API for windows implementation * Reorganise platform specific code * Refactored vault code, extracted shared code, addressed the code review comments * Split off platform specific options * Remove wrong build directive from the common options * do not try to generate HTML test report if test XML output isn't present --------- Co-authored-by: Anderson Queiroz <[email protected]> (cherry picked from commit c4c314a)
Merged with failing integration tests because this PR fixes another integration tests issue affecting all tests installing the agent. Therefore we need to have it merged to reduce the number of failing tests and focus on one issue at a time. |
…) (#3288) * Harden linux/windows vault implementation for concurrent access * Spit off some platform-specific code to make linter happy * Add missing field for windows implementation * Fix return params for Get on windows * Replace deprecated ioutil with os package API for windows implementation * Reorganise platform specific code * Refactored vault code, extracted shared code, addressed the code review comments * Split off platform specific options * Remove wrong build directive from the common options * do not try to generate HTML test report if test XML output isn't present --------- Co-authored-by: Anderson Queiroz <[email protected]> (cherry picked from commit c4c314a) Co-authored-by: Aleksandr Maus <[email protected]>
…ss. (elastic#3240) (elastic#3288)" This reverts commit 996043d.
What does this PR do?
Harden linux/windows vault implementation for concurrent access.
Problem details
During repetitive E2E tests runs (took over 5 hours to reproduce) the file based vault implementation (that is used for linux and Windows) got into the state where the key value was corrupted, the resulted in the error:
It's difficult to reproduce in the normal agent usage scenario, but looking at the implementation there is no mitigation against multiple processes accessing vault. The lower level API allows to reproduce the situation artificially see
TestVaultConcurrent
test https://github.com/elastic/elastic-agent/compare/main...aleksmaus:feature/harden_fsvault?expand=1#diff-8a5388ab5e868a37ee07711c7978dc51561a302be8fa560946038e71375fe70bR183Solution
Harden linux/windows vault implementation for concurrent access.
.lock
file is created inside of thevault
directory now.migrate_secret.go
that doesn't seem to be used anywhere anymore.Why is it important?
Hardens the vault data agains corruption when it's being accessed concurrently.
Checklist
./changelog/fragments
using the changelog tool