
Auth: Fix missing snapshots and backups from storage pool used-by URLs #14324

Draft · wants to merge 16 commits into main from used-by-bug

Conversation

@markylaing (Contributor) commented Oct 22, 2024

The underlying cause of this bug was that the general filtering of used-by URLs assumes the can_view entitlement is available for all entity types. That is a fair assumption, but it wasn't true for instance or storage volume backups and snapshots.

To fix this, four new entity types have been added to the authorization model:

  • instance_backup
  • instance_snapshot
  • storage_volume_backup
  • storage_volume_snapshot

Each has associated entitlements:

  • can_edit
  • can_view
  • can_delete

It is still not possible to grant these entitlements via the API. Instead, they are granted via can_manage_snapshots or can_manage_backups on the associated instance or storage volume.
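As a sketch, the inheritance described above could be expressed in OpenFGA's modeling DSL roughly as follows. The type and entitlement names come from the PR description; the exact model in LXD's source may differ:

```
type instance_snapshot
  relations
    define instance: [instance]
    define can_view: can_manage_snapshots from instance
    define can_edit: can_manage_snapshots from instance
    define can_delete: can_manage_snapshots from instance
```

Because the three entitlements are defined only via `can_manage_snapshots from instance`, there is no direct assignment to identities or groups, matching the restriction above.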

The OpenFGADatastore implementation has been updated to handle instance and storage_volume relations between the parent and its snapshots/backups.

  • Updates OpenFGADatastore comments: they previously stated that an instance is not a relation, but after this PR it is.

Closes #14291

@markylaing markylaing added the Bug Confirmed to be a bug label Oct 22, 2024
@markylaing markylaing added this to the lxd-6.2 milestone Oct 22, 2024
@markylaing markylaing self-assigned this Oct 22, 2024
@github-actions github-actions bot added the Documentation Documentation needs updating label Oct 22, 2024

Heads up @mionaalex - the "Documentation" label was applied to this issue.

@markylaing (Contributor Author)

CC @mas-who @edlerd

@tomponline (Member)

tests are sad

@markylaing (Contributor Author) commented Oct 23, 2024

@tomponline tests are mostly green except for one: https://github.com/canonical/lxd/actions/runs/11463430112/job/31913802942#step:12:52804

I'm not certain why this is failing as it doesn't seem to have anything to do with this PR. It is potentially related to #14315 since lxc profile assign calls PUT /1.0/instances/{name} which does some Profile.ToAPI work. I also don't understand why it only failed with the dir storage backend.

Edit: Note that this also doesn't fail locally. I'll have to get a tmate session running.

@tomponline (Member) commented Oct 23, 2024

> I'm not certain why this is failing as it doesn't seem to have anything to do with this PR. It is potentially related to #14315 since lxc profile assign calls PUT /1.0/instances/{name} which does some Profile.ToAPI work. I also don't understand why it only failed with the dir storage backend.

@hamistao please can you check this out, thanks

Seems like a panic.

@markylaing (Contributor Author)

@tomponline @hamistao The CI passed on the third attempt. I'll investigate a bit more though as I don't want to introduce any races, especially when they may be causing a panic.

@markylaing (Contributor Author)

> @tomponline @hamistao The CI passed on the third attempt. I'll investigate a bit more though as I don't want to introduce any races, especially when they may be causing a panic.

I've been investigating this for an hour or so with no progress. It would be very useful to surface panics in the test logs. I'm trying to figure out a way to do this.

@tomponline (Member)

> @tomponline @hamistao The CI passed on the third attempt. I'll investigate a bit more though as I don't want to introduce any races, especially when they may be causing a panic.
>
> I've been investigating this for an hour or so with no progress. It would be very useful to surface panics in the test logs. I'm trying to figure out a way to do this.

Did you identify which commit introduced it yet?

Did you try reverting the earlier profiles PR?

@markylaing (Contributor Author)

> Did you identify which commit introduced it yet?
>
> Did you try reverting the earlier profiles PR?

With it being intermittent, I didn't think reverting the profiles PR would tell me very much (i.e. I'd need to figure out where the panic is occurring in either case). I've added a commit to check LXD logs for panics. It's failing on standalone tests but not in the cluster tests, which is a bit odd. Still investigating.

@markylaing (Contributor Author)

I've re-run the test 8 times now and the panic only occurred on the first two runs. I've added a PR to handle panics a bit more cleanly in the future (#14346). If it happens again it should be obvious where it occurred.

@markylaing (Contributor Author)

Of course it fails again as soon as I move the panic checker work into another PR 🤦

@markylaing markylaing marked this pull request as draft October 25, 2024 12:26
@markylaing markylaing force-pushed the used-by-bug branch 5 times, most recently from 320408e to 6b8b335 Compare October 31, 2024 15:02
@markylaing markylaing force-pushed the used-by-bug branch 2 times, most recently from a0a21d4 to ef70dd8 Compare November 6, 2024 14:36
Adds instance and storage volume snapshots and backups to the OpenFGA
model. These entitlements cannot be assigned to identities, service
accounts, or group members. Instead they are inherited from the parent
instance or volume.

Signed-off-by: Mark Laing <[email protected]>
The auth.ValidateEntitlement function validates all entitlements that
can be granted via the API. Since the new entitlements on snapshots and
backups cannot be granted via the API, this check fails.

The OpenFGA server will return an error if an invalid query is performed
based on its own understanding of the authorization model.

Signed-off-by: Mark Laing <[email protected]>
Previously the only entities that had inherited relations were project and
server. Now that we are linking instances and storage volumes to their
snapshots and backups, the OpenFGADatastore implementation needs to handle
these relations.

On Read, we can connect a snapshot or backup to its parent instance or
storage volume using the information stored in its URL. For example, the
storage volume backup URL:

/1.0/storage-pools/default/volumes/custom/vol1/backups/backup1?project=project1

is related to its parent:

/1.0/storage-pools/default/volumes/custom/vol1?project=project1

via the `storage_volume` relation.

Signed-off-by: Mark Laing <[email protected]>
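The parent derivation described in this commit can be sketched as follows. This is an illustrative helper, not the actual LXD implementation; it simply strips the trailing `/backups/<name>` or `/snapshots/<name>` segment while preserving the query string:

```go
package main

import (
	"fmt"
	"strings"
)

// parentURL derives the parent entity URL from a snapshot or backup URL by
// stripping the final "/backups/<name>" or "/snapshots/<name>" path segment
// while preserving the query string (e.g. the project).
func parentURL(u string) string {
	path, query, hasQuery := strings.Cut(u, "?")
	for _, seg := range []string{"/backups/", "/snapshots/"} {
		if i := strings.LastIndex(path, seg); i >= 0 {
			path = path[:i]
			break
		}
	}
	if hasQuery {
		return path + "?" + query
	}
	return path
}

func main() {
	fmt.Println(parentURL("/1.0/storage-pools/default/volumes/custom/vol1/backups/backup1?project=project1"))
}
```

Running this prints the parent volume URL from the example above, `/1.0/storage-pools/default/volumes/custom/vol1?project=project1`.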
…tartingWithUser.

Previously the only entities that had inherited relations were project and
server. Now that we are linking instances and storage volumes to their
snapshots and backups, the OpenFGADatastore implementation needs to handle
these relations.

On ReadStartingWithUser, the function needs to return all backups or snapshots that
are related to a parent instance or storage volume. This is used in the `ListObjects`
call to the OpenFGA server, which is used by `(auth.Authorizer).GetPermissionChecker`.

To do this, I have naively queried for all snapshots or backups in the project, and
filtered out those that don't have the correct parent. This keeps the implementation
simple and makes use of `GetEntityURLs`, which performs as few queries as possible.
Further optimisation may be needed.

Signed-off-by: Mark Laing <[email protected]>
We can now use the `can_view`, `can_edit`, and `can_delete` entitlements
with instance backups and snapshots. We should do this so that our checks
more accurately reflect the authorization model.

Signed-off-by: Mark Laing <[email protected]>
The access handler was performing some logic to determine
the location of the storage volume for use in the access check.
This was based on whether the storage pool is remote, and if not,
the cluster member where the volume is located.

This commit removes that logic and adds a "location" field to
`storageVolumeDetails` so that it can be used in the handlers.
The logic for determining the location is modified to suit the call
site. It is only set when the pool is not remote.

Signed-off-by: Mark Laing <[email protected]>
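The rule this commit describes can be sketched as below. The struct and function names here are hypothetical, not LXD's actual internals; the point is that the location is only populated when the pool is not remote, since volumes on remote pools are reachable from any cluster member:

```go
package main

import "fmt"

// storageVolumeDetails is a sketch of the details struct described above
// (field names hypothetical).
type storageVolumeDetails struct {
	volumeName string
	location   string // empty when the pool is remote
}

// newDetails records the cluster member as the volume's location only when
// the pool is not remote.
func newDetails(volumeName, memberName string, poolIsRemote bool) storageVolumeDetails {
	d := storageVolumeDetails{volumeName: volumeName}
	if !poolIsRemote {
		d.location = memberName
	}
	return d
}

func main() {
	fmt.Printf("%+v\n", newDetails("vol1", "node1", false))
}
```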
The storage volume snapshot and backup access handlers need to share
almost identical logic with the storage volume access handler, including
getting the storage pool and determining whether the storage volume is
located on another cluster member.

This commit parameterises the function so that it can be used by the
snapshot and backup entity types as well, creating and checking against
the correct URL when called.

Signed-off-by: Mark Laing <[email protected]>
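The "correct URL per entity type" part of this parameterisation can be sketched as follows. Function and parameter names are illustrative, not LXD's actual code; the idea is that the entity type the handler was parameterised with selects which URL the entitlement is checked against:

```go
package main

import "fmt"

// entityURL builds the URL the entitlement is checked against, based on the
// entity type the access handler was parameterised with. For the plain
// storage_volume type, sub is unused and the volume URL itself is returned.
func entityURL(entityType, pool, volType, vol, project, sub string) string {
	base := fmt.Sprintf("/1.0/storage-pools/%s/volumes/%s/%s", pool, volType, vol)
	switch entityType {
	case "storage_volume_snapshot":
		base += "/snapshots/" + sub
	case "storage_volume_backup":
		base += "/backups/" + sub
	}
	return base + "?project=" + project
}

func main() {
	fmt.Println(entityURL("storage_volume_backup", "default", "custom", "vol1", "project1", "backup1"))
}
```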
We can now check `can_view`, `can_edit`, and `can_delete` against
the backup/snapshot itself. We should do so to more accurately reflect
the authorization model.

Signed-off-by: Mark Laing <[email protected]>
@markylaing markylaing force-pushed the used-by-bug branch 6 times, most recently from 97046fa to 9b67811 Compare November 8, 2024 11:42
@markylaing (Contributor Author)

Update on this. I've set up a tmate session 3 times and in each case I:

  • Ran the standalone suite up to the intermittently failing test (config_profiles)
  • Ran the config_profiles test in a loop 60 times without tearing down the test harness (i.e. keeping the test environment). To do this I had to edit the test slightly, but only to clean things up (deleting leftover profiles and directories)

It didn't fail once. I have also run the full suite ~15 times over the last week and haven't seen the failure again.

At this point I'm pretty baffled 🤷

@markylaing (Contributor Author)

I've just spotted another failure in another PR (#14434) that might be related: https://github.com/canonical/lxd/actions/runs/11777602677/job/32802823358#step:12:38935

@tomponline (Member)

> I've just spotted another failure in another PR (#14434) that might be related: https://github.com/canonical/lxd/actions/runs/11777602677/job/32802823358#step:12:38935

are your test fails always happening on ceph too?

@markylaing (Contributor Author)

> I've just spotted another failure in another PR (#14434) that might be related: https://github.com/canonical/lxd/actions/runs/11777602677/job/32802823358#step:12:38935
>
> are your test fails always happening on ceph too?

No it was failing with the dir backend.


Successfully merging this pull request may close these issues.

Snapshots missing in used_by for custom volumes and storage pools on latest/edge LXD build