Index pip-installed packages #109
Conversation
This isn't very efficient; we should use PEP 658 metadata, which PyPI provides now, so we don't have to redownload the wheel. We should also do the fetches concurrently and `asyncio.gather` them. But it gets the basic concept across.
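A minimal sketch of what the comment proposes — fetching PEP 658 metadata for many wheels concurrently with `asyncio.gather`. The `fetch_text` coroutine here is a hypothetical stand-in for a real async HTTP GET (it returns a canned METADATA payload), so only the PEP 658 URL convention and the gather pattern are the point:

```python
import asyncio
from email.parser import HeaderParser

# Hypothetical stand-in: a real implementation would do an async HTTP GET.
async def fetch_text(url: str) -> str:
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return "Metadata-Version: 2.1\nName: example\nVersion: 1.0\n"

async def fetch_metadata(wheel_url: str) -> dict:
    # PEP 658: indexes serve the wheel's METADATA file at
    # "<wheel_url>.metadata", so the wheel itself need not be downloaded.
    raw = await fetch_text(wheel_url + ".metadata")
    headers = HeaderParser().parsestr(raw)
    return {"name": headers["Name"], "version": headers["Version"]}

async def fetch_all(wheel_urls: list[str]) -> list[dict]:
    # asyncio.gather issues all fetches concurrently instead of serially.
    return await asyncio.gather(*(fetch_metadata(u) for u in wheel_urls))

results = asyncio.run(
    fetch_all(["https://example.invalid/pkg-1.0-py3-none-any.whl"])
)
```

METADATA is RFC 822-style, which is why the stdlib `email` parser suffices for the core fields.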
This is an interesting idea, but as you point out, the querying and download of wheels seems expensive. Have we yet explored how important it is for the `file_name` and `sha256` fields to be populated in the result? Could those fields instead be made optional for installed packages that no longer have their attribution? i.e. Why not make `file_name: str | None`?
```python
file_name=wheel.url,
install_dir="site",
sha256=wheel.sha256,
imports=[],
```
It looks like `imports` is stubbed here. It should probably query `top_level.txt`, same as micropip.
It's a bit unfortunate that `top_level.txt` is setuptools-only. As I understand it, the other build backends don't have anything similar. Do you know why?
I think `top_level.txt` is not very good for querying import names, and we should probably stop using it in micropip.
Several good examples in that thread. Also, namespace packages aren't reflected accurately either:

```shell
@ pip-run jaraco.functools -- -c "import importlib.metadata as md; print(md.distribution('jaraco.functools').read_text('top_level.txt'))"
jaraco
```

(`jaraco.functools` is one of many packages that exposes the top-level `jaraco` package; the unique offering of `jaraco.functools` is the Python package of the same name.)
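One alternative the thread hints at: instead of trusting `top_level.txt`, derive importable top-level names from the installed file list (RECORD), which every build backend produces. This is only a sketch under that assumption — not micropip's actual logic — and it still wouldn't distinguish which distribution "owns" a shared namespace directory like `jaraco/`:

```python
from importlib.metadata import PackagePath

def infer_import_names(files) -> set[str]:
    # Walk the installed file paths and collect top-level modules and
    # package directories; skip metadata directories.
    names = set()
    for f in files:
        parts = f.parts
        if parts[0].endswith((".dist-info", ".data")):
            continue  # metadata, not importable code
        if len(parts) == 1 and parts[0].endswith(".py"):
            names.add(parts[0][:-3])  # top-level module: bar.py -> bar
        elif len(parts) > 1:
            names.add(parts[0])       # top-level package directory
    return names

# Synthetic RECORD-style listing for illustration:
record = [
    PackagePath("foo/__init__.py"),
    PackagePath("foo/util.py"),
    PackagePath("bar.py"),
    PackagePath("foo-1.0.dist-info/METADATA"),
]
```

In real use the listing would come from `importlib.metadata.files(dist_name)`.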
```python
ver = resp.releases.get(Version(dist.version), None)
if ver is None:
    return None
wheel = next(ver)
```
This approach seems risky. What if the package doesn't have a wheel, but only an sdist? What if the sdist is chosen over the wheel? What if the package was installed from a local checkout that advertises the same version (but is different)? What if the package was installed from a local checkout that advertises a local version (that's not present in PyPI)?
Well, I agree with these concerns. There needs to be some sort of architecture to our packaging system; currently there is no architecture, merely an accumulation of behaviors. I will answer your questions in defense of this approach because I think the answers contain interesting information about our domain logic, not because this approach is any good.
> What if the package doesn't have a wheel, but only an sdist?

This is a great question. It will fail. If the wheel is platformed, then the pip build frontend cannot build it; the only way to build a Pyodide-platformed wheel is with our own `pyodide build` frontend. I looked into patching pip so that it uses our frontend, but I am pretty sure it is not a reasonable task without support from the pip team. For a non-platformed `py3-none-any` package, though, the maintainer can just upload an sdist and pip will download and install it. The problem is that this whole lock-file idea is predicated on the idea that the packages came from somewhere and that we didn't do a local build. Probably if we find packages that were installed from sdist, then `freeze()` should raise an error complaining specifically about this.
> What if the package was installed from a local checkout that advertises the same version (but is different)?

Same issue: we can only install from a URL, so yes, we would indeed freeze a different wheel than was actually installed. We should probably check for this. Note that `pip freeze` also can't save you from this situation, although in normal Python it is fine if pip itself installed the package from sdist. But the fact that if you do

```shell
pip freeze > requirements.txt
python -m venv .venv-new
.venv-new/bin/pip install -r requirements.txt
```

then `.venv-new` can end up being different from the current environment indicates that `pip freeze` does not generate a true lock but something weaker. Our aspiration is to generate a lock, and we should error with an explanation if we cannot generate one.
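One way to detect the "same version, different contents" case discussed above would be to hash the local artifact and compare it against the digest the index reports for that file. A sketch of the hashing half, using only the stdlib:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    # Stream the file in chunks so large wheels don't need to fit in memory;
    # a mismatch with the index's reported sha256 signals a local build
    # that a naive freeze would silently paper over.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Demonstration on a throwaway file:
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"wheel bytes")
digest = sha256_of(Path(tmp.name))
```

Where the reference digest comes from (index API, `direct_url.json`, or a prior lock file) is the design question the thread is circling.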
```python
url = dist.read_text("PYODIDE_URL")
if url:
    pkg_entry = get_pkg_entry_micropip(dist, url)
elif IN_VENV:
```
I'm unsure why `IN_VENV` is relevant. Packages can be installed without virtualenvs (e.g. on `PYTHONPATH`), and virtualenvs can expose site-packages.

What is the motivation for varying on `IN_VENV`?
Yes, it is wrong to branch on `IN_VENV`; I meant to remove it but failed to before I put up this PR. I think the correct thing to check is actually `dist.read_text("INSTALLER")`, for whether the installer was pip or micropip.

> Packages can be installed without virtualenvs (e.g. on PYTHONPATH).

Right... some day it would be good to test our system against this; it is a good feature. The Pyodide `python` CLI will correctly pick up `PYTHONPATH`. Though I'm also not sure whether we want to pick up the current `PYTHONPATH` when locking? It's not 100% clear to me what the right logic is there.
```python
if ver is None:
    return None
wheel = next(ver)
await wheel.download({})
```
I see this method doesn't yet exist. This call could be very expensive and would need to manage the lifecycle of the downloaded content. It may also need to re-implement things that pip does like caching or re-using pip's cache.
As I said, this PR is intended as a worst-possible implementation to get the discussion moving. I think the situation is that we only need to download the wheel if the index does not implement PEP 658; since PyPI finally does implement PEP 658, we should use that where possible.

As far as I know, `.dist-info` is too weak to lock against other indices because the index is not written anywhere into the `.dist-info` directory. Maybe we could have a packaging PEP asking for an additional `INDEX` file that reports the index the file was installed from, if any? It would be a complement to `direct_url.json`, which tells us what to do when we installed not from an index but from a URL. There's a third case where we installed from a file-system path; I think that goes into `direct_url.json` too? But we can't lock it in any case.
> caching or re-using pip's cache

Yes, that's a great idea. Does pip expose any of its caching logic? If not, is it reasonably stable?
Pip does expose a `list` command that will list all matching packages in the cache:

```shell
~ @ pip cache list foo.bar-0.2 --format abspath
/Users/jaraco/Library/Caches/pip/wheels/0b/ec/8b/f669ae78d94a691e6c63872909196857158b3c20f045058e7b/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/10/f7/18/a4a94460da8bf74f841d7df4f10984b730907bea49783615a4/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/11/9f/15/3a750be9586c80770d8355ac4e9dc3c58237b1d994f36a2a21/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/1b/9d/16/254558b9de54233baf2ef9a35eac16ad3c23a8c5fe4122374f/foo.bar-0.2-py3-none-any.whl
/Users/jaraco/Library/Caches/pip/wheels/20/96/a7/21d787cb6f024ae978823ce36306736a68ffc7c728077edf4d/foo.bar-0.2-py3-none-any.whl
...
```
It does appear as if that "wheel cache" is only for locally-built wheels.
I haven't found a way to query pip's HTTP cache. Here's what pip gives for `cache info`:

```shell
draft @ pip cache info
Package index page cache location (pip v23.3+): /Users/jaraco/Library/Caches/pip/http-v2
Package index page cache location (older pips): /Users/jaraco/Library/Caches/pip/http
Package index page cache size: 883.2 MB
Number of HTTP files: 10736
Locally built wheels location: /Users/jaraco/Library/Caches/pip/wheels
Locally built wheels size: 21.2 MB
Number of locally built wheels: 440
```
It's possible the `http-v2` cache is managed using some library. If not, there's seemingly no API to access it. My guess, based on the fact that there are two cache locations, is that the `http-v2` cache is reasonably stable.
See #110 where I've done some work to refactor the functionality to make it easier to alter behavior such as proposed here.
Yeah, thanks for the review; this is definitely a DO NOT MERGE PR. The point is that it passes the one test in the worst possible way and gives us a baseline for writing a good implementation.
```python
version=dist.version,
file_name=wheel.url,
install_dir="site",
sha256=wheel.sha256,
```
So IIUC, we are fetching the installed package again to get the URL and the sha256? I am not sure that is reliable behavior. What happens if a user installed packages from non-PyPI repositories or from a URL?
It works if they installed from a URL (at least in principle it can be made to work) because we can get the URL from `direct_url.json`. The multiple-indexes problem is bigger because we don't have metadata indicating which index we got it from. I think micropip should store the index URL, maybe into an `INDEX` file.

@pradyunsg @henryiii is there something I'm missing? If someone has a pip.conf with an extra-index-url, or passes this as a CLI option, is it possible to reconstruct which packages in the venv came from which indices? If not, maybe we could make a packaging PEP adding `INDEX` to package metadata.
Would resolve #107.
cc @jaraco @ryanking13 do either of you see anything wrong with this concept?