
As a user I can pull-through cache container images when remote is defined on distribution #507

Closed
pulpbot opened this issue Dec 15, 2021 · 12 comments · Fixed by #1299

@pulpbot
Member

pulpbot commented Dec 15, 2021

Author: @ipanova ([email protected])

Redmine Issue: 9560, https://pulp.plan.io/issues/9560


See the pulp_python PR for more details: https://github.com/pulp/pulp_python/pull/384/files
pulp_container won't have it as easy because multiple content types need to be downloaded and relations created between them.

@lubosmj
Member

lubosmj commented Apr 12, 2022

In pulp_python, we are streaming artifacts from a remote repository. The artifacts are not attached to any repository and are orphans. The pull-through cache therefore consists of content marked as orphaned, and an orphan clean-up task will remove all of the "cached" content.

Proposal 1

In pulp_container, we have more than one content type to preserve in the "cache". I propose to extend the concept introduced in pulp_python to "cached" repositories. It allows us to better track content pulled from a remote. As a result, we will have two types of repositories referenced from a distribution:

  • the original (synced) one,
  • a temporary one (that will contain the "cached" content, having an additional field timestamp_of_interest).

The Registry serves the content from either the original repository or temporary repository, based on the presence of the content. If the content is not present either in the original repository or the temporary repository, Pulp pulls the missing content (tags, manifests, blobs) from a remote specified in a distribution. Here, we will utilize the workflows implemented in the sync pipeline when downloading the content from a remote (a new async task will be triggered).
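
To make the lookup order concrete, here is a minimal sketch in Python; the field names (repository, temporary_repository, remote) and the helper callables are illustrative assumptions, not the final data model:

```python
# A minimal sketch of the serving order described above. The field names
# (repository, temporary_repository, remote) and the helper callables are
# illustrative assumptions, not the final data model.

class ContentNotCached(Exception):
    """Raised when the content first has to be fetched from the remote."""


def resolve(distribution, reference, find_content, dispatch_pull_through_task):
    """Return cached content or trigger a pull-through download task.

    ``find_content(repository_version, reference)`` and
    ``dispatch_pull_through_task(remote, repository, reference)`` stand in
    for the real Pulp plumbing and are supplied by the caller.
    """
    # 1) The original (synced) repository referenced by the distribution.
    if distribution.repository is not None:
        content = find_content(distribution.repository.latest_version(), reference)
        if content is not None:
            return content

    # 2) The temporary repository holding the "cached" content.
    if distribution.temporary_repository is not None:
        content = find_content(distribution.temporary_repository.latest_version(), reference)
        if content is not None:
            return content

    # 3) Nothing cached: trigger an async task that reuses the sync-pipeline
    #    workflows to download the missing tags/manifests/blobs from the remote
    #    defined on the distribution and add them to the temporary repository.
    dispatch_pull_through_task(distribution.remote, distribution.temporary_repository, reference)
    raise ContentNotCached(reference)
```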

A PoC is shown in #704. The extended workflow (15 additional lines) for removing temporary repositories via the orphan clean-up machinery needs to be added to pulpcore.

The whole concept applies to a distribution that handles pull-through caching only for a specific upstream repository:

  • an empty distribution with a remote
  • a distribution with a synced repository with a remote

Side-note

  1. @ipanova mentioned a scenario where we should also consider checking the content on a remote once it is specified in a distribution. If the content on the remote is no longer the same as the content present in Pulp, we should download it and cache it. The problem is that we would be querying the remote repository every time a user makes a request to Pulp.

    Amazon (https://aws.amazon.com/blogs/aws/announcing-pull-through-cache-repositories-for-amazon-elastic-container-registry/) refers to "cached" content as content that is periodically checked and updated locally if any changes are detected on the upstream. In Pulp, we assume that the cached content will be treated as orphaned content.

    This is naturally supported in Pulp with mirror=True sync tasks, where an administrator can configure a periodic task for re-syncing repositories.

  2. The way our downloaders work will probably need to be adjusted. I think I saw in pulp_rpm (?) that during syncing we download content every time and then discard it once the digests of the artifacts match the digests already present in Pulp. This is not very acceptable for pulp_container considering pull-through caching. Are we handling this in some way, or are we downloading content all the time? (A sketch of such a check follows below.)
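
A sketch of the check hinted at above: skip the download entirely when an artifact with the expected digest is already stored in Pulp, instead of downloading the bytes and discarding them afterwards (simplified; real code also has to maintain RemoteArtifact records):

```python
# Simplified digest check: reuse the stored artifact instead of re-downloading.
from pulpcore.plugin.models import Artifact


def needs_download(expected_sha256):
    # If an artifact with this digest already exists in Pulp, there is no need
    # to download the bytes only to discard them afterwards.
    return not Artifact.objects.filter(sha256=expected_sha256).exists()
```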

Proposal 2

The second idea is to make pull-through caching work as a standalone entity where an administrator creates a special type of "distribution". The "distribution" will hold the reference to a remote repository. Content requested by a user will be automatically downloaded (and thus cached) to Pulp on demand. It will work for all repositories hosted on a remote Registry (the option upstream_name in ContainerRemote will be marked as an optional parameter).

Such an approach will result in having a couple of temporary repositories referenced by a distribution, or a couple of distribution-repository pairs created from the special "distribution". The latter can benefit from the implementation of Proposal 1 (if caching is enabled for a sub-distribution).
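
A small sketch of how a requested path could be resolved under Proposal 2: everything after the special distribution's base_path identifies the upstream repository. Names and behavior here are illustrative only.

```python
# Illustrative path resolution for the special pull-through "distribution".

def split_pull_through_path(path, base_path):
    """Map 'dockerhub-cache/library/busybox' -> 'library/busybox'."""
    prefix = base_path.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} is not served by the {base_path!r} distribution")
    return path[len(prefix):]


assert split_pull_through_path("dockerhub-cache/library/busybox", "dockerhub-cache") == "library/busybox"
```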

Side-note

Again, one of the problems is the time required for downloading content from a remote: downloading large artifacts/blobs to Pulp and then forwarding them to a user will take some time (on-demand downloading, querying existing content units as we do in a standard sync pipeline, and creating new repositories/namespaces/distributions on the fly). This might not be tolerated by container CLI clients because of their pre-defined timeouts.

Permissions Handling

To be decided.

Any suggestions or ideas on how to improve or adjust the logic?

@lubosmj
Member

lubosmj commented Apr 21, 2022

In Proposal 2, we will enable a user to download/forward/stream content from a remote registry, no matter which repository she is talking to.

Considering that, users could (intentionally) flood Pulp and make it unavailable for some time. It might be convenient to take a look at whitelisting upstreams, as proposed in #459.

@ipanova
Member

ipanova commented Aug 19, 2022

We will revisit this once it becomes a higher priority.

@ipanova ipanova removed Low labels Aug 19, 2022
@benedikt-bartscher

This would be awesome to easily prevent dockerhub rate-limits. Is there anything the community could do to code/test/sponsor this?

@ipanova
Member

ipanova commented Feb 15, 2023

@benedikt-bartscher thank you for showing your interest in this feature. There is already a mechanism in place in pulp_container to prevent Docker Hub rate limits. It consists of creating and mirroring the external repo locally with the on_demand policy before the client pull (it does not download blobs, just manifests). Pulp downloads all of the blob data from the remote source on the first client request; all subsequent requests are served directly from Pulp (a rough sketch of this workflow follows below).
This feature, specifically, would remove the need to create the repo on the Pulp side in advance: Pulp would transparently create all needed objects based on the incoming client pull request. Pulp would download all the data on the first request, and all subsequent requests would be served directly from Pulp.
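
A rough sketch of that existing per-repository workflow against the pulp_container REST API, as I understand it; host, credentials, and names are placeholders:

```python
# Mirror the upstream metadata with the on_demand policy, then distribute it.
import requests

BASE = "https://pulp.example.com"
AUTH = ("admin", "password")  # placeholder credentials


def post(path, payload):
    response = requests.post(f"{BASE}{path}", json=payload, auth=AUTH)
    response.raise_for_status()
    return response.json()


# 1) Remote pointing at the upstream repo; on_demand mirrors manifests only,
#    blobs are fetched from the source on the first client pull.
remote = post("/pulp/api/v3/remotes/container/container/", {
    "name": "busybox",
    "url": "https://registry-1.docker.io",
    "upstream_name": "library/busybox",
    "policy": "on_demand",
})

# 2) Repository plus a sync task to mirror the upstream metadata.
repository = post("/pulp/api/v3/repositories/container/container/", {"name": "busybox"})
post(repository["pulp_href"] + "sync/", {"remote": remote["pulp_href"]})

# 3) Distribution so clients can `podman pull pulp.example.com/busybox`.
post("/pulp/api/v3/distributions/container/container/", {
    "name": "busybox",
    "base_path": "busybox",
    "repository": repository["pulp_href"],
})
```

Each upstream repository needs its own remote/repository/distribution set, which is exactly the manual step this feature would remove.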
Would the existing workflow cover your current needs?

@benedikt-bartscher

Hey @ipanova, thanks for your reply. I know about that mechanism; I am currently using it. However, it's not very convenient to set up every repo manually, which is why I pushed this issue.

@ipanova
Member

ipanova commented Feb 16, 2023

@benedikt-bartscher gotcha, yeah, that's the inconvenience compared to the pull-through cache. We will try to resume the work on this; there are some design challenges we need to wrap our heads around ;)

@ipanova
Member

ipanova commented Mar 15, 2023

#732

@mdellweg
Member

Braindump of an idea:
Can we use a registry subdirectory (maybe a namespace, like /quay-pull-through/) to attach a PullThroughRemote to it that will have all the necessary configuration to create all the requested repositories with remotes and distributions?
Having it be a namespace would allow us to sort out the RBAC stuff for it.
Related question: Is it possible to configure registry.json in a way that a user issuing podman pull busybox will do the equivalent of podman pull pulp-server.io/quay-pull-through/busybox?

@ipanova
Member

ipanova commented Apr 5, 2023

Summary of today's meeting. A few phases are outlined below; the first one follows the KISS rule and further ones will gradually add improvements on a per-need basis.

Phase 1:

  1. Add a special Distribution type, called, for example, PullThroughCacheDistribution, that will reference a new remote type, PullThroughCacheRemote. These two new objects will be used explicitly in the pull-through cache workflow.

  2. To enable the pull-through cache workflow, the user, given that they have the permissions, will create:

    • a cache distribution by assigning a name and base_path to it, for example 'dockerhub-cache'. The user can decide whether pull-through cache repos coming through this special Pulp distribution are private or public (pullable by an anonymous user) by setting the private boolean flag accordingly.
    • a cache remote by specifying the source registry URL and credentials, if the registry is not publicly accessible.
  3. The end user (podman client) will access pull-through content via:

    • podman pull pulp.example.com/dockerhub-cache/library/busybox
    • podman pull pulp.example.com/dockerhub-cache/library/alpine
    • podman pull pulp.example.com/dockerhub-cache/portainer/portainer
  4. Under the hood, Pulp will create repositories named dockerhub-cache/library/busybox, dockerhub-cache/portainer/portainer, etc., respectively, and assign them to that special distribution type.

  5. We will probably need to subclass the downloader so it can assemble the download URL from the special remote type and the podman pull command.

  6. On the very first client pull request, where Pulp has nothing in the cache, Pulp will download the manifest, create remote artifacts for the respective blobs, create a repo version containing the tag, manifest, and blobs, and stream the manifest back to the client.

  7. On identical subsequent client pull requests, where Pulp already has the content cached, we need to ensure the cached content is up to date with the remote source. We will do this by sending a HEAD request for the tagged manifest to check whether the tag still references the same manifest digest; if it is outdated, we go back to step 6, including first checking existing artifacts for blobs (see the sketch after this list).

  8. For now, there won't be any cache expiration; the repo will be mirror=True, meaning that it will always match the remote source and won't additively store outdated content. Consider setting retain_repo_versions to 1 for such repos during repo creation; we probably do not care about the version history. These repos should be read-only, which means one cannot push content into them.
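
A minimal sketch of the freshness check from step 7, assuming the upstream registry returns the standard Docker-Content-Digest header on a manifest HEAD request; authentication against the upstream (e.g. bearer tokens) is omitted here:

```python
# Freshness check: compare the cached manifest digest with the upstream one.
import requests

MANIFEST_ACCEPT = ", ".join([
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
])


def cached_manifest_is_current(upstream_url, upstream_name, tag, cached_digest):
    """Return True if the cached tagged manifest still matches the upstream digest."""
    response = requests.head(
        f"{upstream_url}/v2/{upstream_name}/manifests/{tag}",
        headers={"Accept": MANIFEST_ACCEPT},
    )
    response.raise_for_status()
    return response.headers.get("Docker-Content-Digest") == cached_digest


# E.g. before serving dockerhub-cache/library/busybox:latest from the cache:
# cached_manifest_is_current("https://registry-1.docker.io", "library/busybox",
#                            "latest", "sha256:...")
# If this returns False, fall back to step 6 and refresh the cached content.
```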

@mdellweg
Member

mdellweg commented Apr 6, 2023

repo will be mirror=True

Not sure if the repository needs to know that it is used in a pull-through manner. We will always be accessing it from the PullThroughDistribution. We probably do not want to create yet another repository type.

@ipanova
Member

ipanova commented Apr 6, 2023

repo will be mirror=True

Not sure if the repository needs to know that it is used in a pull-through manner. We will always be accessing it from the PullThroughDistribution. We probably do not want to create yet another repository type.

It won't be another repo type. mirror=True describes how the repo stores content: it won't be additive but a mirror. This option is usually passed to the sync task (https://github.com/pulp/pulpcore/blob/main/pulpcore/plugin/stages/declarative_version.py#L21); however, we will need to adopt similar logic in the pull-through workflow so we do not store unnecessary content which is no longer available on the remote source.
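
For context, a simplified sketch of how a sync task hands mirror over to the pulpcore stages API; pulp_container uses its own DeclarativeVersion subclass and first stage, but the mirror semantics are the same, and build_first_stage is a hypothetical helper for this sketch:

```python
# Simplified sketch of passing `mirror` to the pulpcore stages API.
from pulpcore.plugin.stages import DeclarativeVersion


def synchronize(remote, repository, mirror=True):
    first_stage = build_first_stage(remote)  # hypothetical: yields declarative content from the remote
    # With mirror=True the new repository version contains exactly the remote's
    # content; with mirror=False the content is added on top of the previous version.
    return DeclarativeVersion(first_stage, repository, mirror=mirror).create()
```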

@lubosmj lubosmj self-assigned this Apr 25, 2023
@lubosmj lubosmj added this to the 2.16 milestone May 26, 2023
@lubosmj lubosmj mentioned this issue Jun 15, 2023
lubosmj added commits to lubosmj/pulp_container that referenced this issue between Jun 15, 2023 and Jan 17, 2024
@lubosmj lubosmj removed this from the 2.17 milestone Oct 30, 2023
lubosmj added a commit that referenced this issue Jan 17, 2024