Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement support for dockerignore and containerignore #1205

Merged
merged 6 commits into from
Jan 24, 2024

Conversation

sgaist
Copy link
Contributor

@sgaist sgaist commented Oct 26, 2022

Currently repo2docker creates a context object that includes the whole content of the repository it builds an image for. Thus it includes folders like .git which is usually something that has no interest in the final image, can take quite a lot of space and most importantly, kills the caching of that layer.

This patch adds support for reading dockerignore and containerignore files that are used to ensure only the relevant data are used to build the image.

By default it also excludes the .git folder if neither of these files are provided.

It makes use of docker-py's own file discovery mechanism to ensure coherence and avoid re-inventing a working wheel.

Note: containerignore comes from podman that reads either file. See the podman-build documentation.

Fixes #937
Related to #268 and #269 final discussions.

@welcome
Copy link

welcome bot commented Oct 26, 2022

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
welcome
You can meet the other Jovyans by joining our Discourse forum. There is also a intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@sgaist
Copy link
Contributor Author

sgaist commented Oct 27, 2022

I just thought about something to discuss based on the default "ignore .git" behaviour introduced by this patch.

Since the goal of repo2docker is to provide a ready to use work environment with only the relevant pieces from a repo, should we have a configurable list of paths that we could ignore by default so people don't need to introduce another configuration file ?

I am currently mainly thinking of git and forge specific files and folders like .github, .gitlab-ci.yml, etc.

@manics
Copy link
Member

manics commented Oct 28, 2022

I think we need to keep .git since it provides a link back to the source repo if users are in a terminal of have a Jupyter git extension installed. It's also good to have it since repo2docker is about reproducibility.

@consideRatio consideRatio changed the title [MRG] Implement support for dockerignore and containerignore Implement support for dockerignore and containerignore Oct 30, 2022
@sgaist
Copy link
Contributor Author

sgaist commented Oct 30, 2022

@manics I follow you on the reproducibility side but I thought that the following labels were there for that:

LABEL repo2docker.ref="<commit hash>"
LABEL repo2docker.repo="[email protected]/mygroup/myprojectgit"
LABEL repo2docker.version="2022.10.0"

?

I just saw that repo2docker.ref could contain HEAD which does not provide a useful information for reproducibility. The information would only be retrievable if one I explicitly runs the image to go into the git clone or based on the image generated name. May I suggest that HEAD should be replaced by the corresponding hash at image build time ?

In case we keep the .git folder by default, what are you thinking about my suggestion of a new argument that can be used with a list of files/folders to ignore ?

@consideRatio
Copy link
Member

I rebased this PR by mistake, sorry for the noise!

@sgaist
Copy link
Contributor Author

sgaist commented Nov 2, 2022

@consideRatio no worries !

On that matter, something I am unsure of, it it globally preferred to rebase or to merge main back to the branch ? I am used to do the former to have a "semi-linear" history as they say in GitLab's configuration. I thought I have read something about it in the contributor documentation but I can't find it anymore.

@consideRatio
Copy link
Member

On that matter, something I am unsure of, it it globally preferred to rebase or to merge main back to the branch ?

There is clear no consensus, but I think of two situations. When a PR is accepted an merged/rebased/squashed, and how to handle PR development before that.

  • We almost always use the merge button when merging PRs in jupyterhub org to retain commit authorship as I understand it.
  • I appreciate rebasing my PRs before merging them to have an easier to follow git history, but I dislike if rebasing makes a reviewer uncertain what is already reviewed once and what are new changes etc. Due to this, I'm often rebasing PRs before review efforts are put in, but not after.

@sgaist
Copy link
Contributor Author

sgaist commented Nov 3, 2022

Thanks @consideRatio, sounds good to me.

I haven't yet had a really big review on GitHub were I could see that rebasing did break the process for reviewers. Some times I have encountered policies that would be "we use merge from main" when updating a pull request but that is all. If it could help shape a consensus for the Jupyter project, we can use one of my contribution to do some testing in that regard.

@sgaist
Copy link
Contributor Author

sgaist commented Jun 27, 2023

@manics a small ping to get your opinion an my reproducibility related questions above as well as .git handling.

@sgaist
Copy link
Contributor Author

sgaist commented Dec 13, 2023

Having re-read #269, the ignore .git by default would be "out of spec" with regard to what docker does so I will remove it.

I'll open tickets for the other points as they are not directly related to this patch.

Currently repo2docker creates a context object that includes the whole content
of the repository it builds an image for. Thus it includes folders like .git
which is usually something that has no interest in the final image, can take
quite a lot of space and most importantly, kills the caching of that layer.

This patch adds support for reading dockerignore and containerignore files that
are used to ensure only the relevant data are used to build the image.

By default it also excludes the .git folder if neither of these files are
provided.
The original behavior was to create an src directory with the content of the
repository. The creation would happen in any case (remote or local repository).
With the filtering in place and the default to remove the .git folder, it breaks
the build as the src folder can be missing.

This patch ensures that the directory is present in the tar so the build can
continue as it did until now.
This ensures that the ignore files are retrieved from the proper folder.
If they weren't the build would not succeed as the binder folder is
ignored.
@sgaist sgaist force-pushed the implement_dockerignore_support branch from 8572871 to 77df191 Compare January 19, 2024 08:45
@minrk minrk merged commit 6aad2bc into jupyterhub:main Jan 24, 2024
20 checks passed
Copy link

welcome bot commented Jan 24, 2024

Congrats on your first merged pull request in this project! 🎉
congrats
Thank you for contributing, we are very proud of you! ❤️

@sgaist sgaist deleted the implement_dockerignore_support branch January 29, 2024 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support ignoring paths from a repo by reading a .dockerignore file
4 participants