Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a wheel management API #697

Open
agronholm opened this issue Jun 10, 2023 · 27 comments · May be fixed by #805
Open

Adding a wheel management API #697

agronholm opened this issue Jun 10, 2023 · 27 comments · May be fixed by #805

Comments

@agronholm
Copy link

I'm the maintainer of wheel. I'm wondering if you would be interested in receiving an API for reading and writing wheels in this project. I've been wondering for a while if it makes any sense for PyPA to have so many packaging projects.

This is what I had in mind:

  1. The new "wheel public API" would go into this project instead of wheel
  2. setuptools would adopt this API and ditch wheel as a dependency for building wheels
  3. wheel would depend on packaging at run time (currently it vendors this project) and become nothing but a CLI tool
  4. (optional) auditwheel and wheel would be merged (preferably into wheel) (optional, to be discussed later)

The goal is to get rid of at least one PyPA project.

@brettcannon
Copy link
Member

Do you have an issue or something that shows what you're proposing as a "wheel public API"?

And for reading or writing wheel files, we have purposefully taken a sans-I/O approach, so it would be more like "processing and creating zipfile.ZipFile instances that represent wheel files".

@agronholm
Copy link
Author

There's a long time draft PR here: pypa/wheel#472

And for reading or writing wheel files, we have purposefully taken a sans-I/O approach, so it would be more like "processing and creating zipfile.ZipFile instances that represent wheel files".

But ZipFile does I/O, so how does that work? And is there ongoing work towards this in packaging already?

@brettcannon
Copy link
Member

There's a long time draft PR here: pypa/wheel#472

It looks like https://github.com/pypa/wheel/pull/472/files#diff-8ca627810a744afb59bbb1d28a71f6ee66c9878506cfbbf2b6b980053a13c2b8 is the key bit. We already deal with wheel file names, so the key bit seems to be the parts that parse/create all the details of a wheel file.

But ZipFile does I/O, so how does that work?

ZipFile can do whatever it wants, but the point is we as a project wouldn't directly be doing any I/O and would be defining more of an interface that we would work with.

And is there ongoing work towards this in packaging already?

Nope. Closest thing we're working on right now is parsing metadata.

@agronholm
Copy link
Author

ZipFile can do whatever it wants, but the point is we as a project wouldn't directly be doing any I/O and would be defining more of an interface that we would work with.

Ok, so to clarify, does that mean you don't want to have an I/O implementation in this project? I was hoping to reduce the amount of separate projects, and having wheel only host a thin I/O implementation + a CLI tool that few people use doesn't seem sensible to me.

@pradyunsg
Copy link
Member

pradyunsg commented Jun 19, 2023

FWIW, an abstracted interface for interacting with wheels was created as part of the installer project (https://installer.pypa.io/en/stable/concepts/).

@pradyunsg
Copy link
Member

pradyunsg commented Jun 19, 2023

I am doubtful that we can create a sans-io implementation of the wheel format -- wheels can and do contain files that are GBs in size, which would make operating in-memory with such wheels memory-intensive, if we insist on operating with only complete strings/bytes. If we're doing things incrementally, it's better to rely on existing I/O patterns.

Notably, that's basically what we have with https://github.com/pypa/installer/blob/5f281b8e312bac4da00670b643dd4499e1c66a9a/src/installer/sources.py#L139 and even the wheel public API PR ~basically implements that interface (thanks @agronholm for doing that!).

@brettcannon
Copy link
Member

@pradyunsg if installer already has an API then I think it more sense to rely on that API then us moving it here and duplicating it or going through the hassle of deprecating in installer. Thoughts?

@dstufft
Copy link
Member

dstufft commented Jun 19, 2023

packaging should probably not depend on installer?

I suspect wheel also shouldn't depend on installer?

@abravalheri
Copy link
Contributor

Probably installer also does not cover the "creating" part of the API thar wheel covers, just the "reading" part.

@brettcannon
Copy link
Member

packaging should probably not depend on installer?

Correct, and I wasn't proposing that. We actually can't have any dependencies at this point.

@pradyunsg
Copy link
Member

pradyunsg commented Jul 15, 2023

Quoting https://installer.pypa.io/en/stable/concepts/ (about WheelSource):

This protocol/abstraction is designed to be implementable without a direct dependency on this library. This allows for other libraries in the Python packaging ecosystem to provide implementations of the protocol, allowing for more code reuse opportunities.

Do I get a cookie for this not being needless overengineering? 🙃


Probably installer also does not cover the "creating" part of the API thar wheel covers, just the "reading" part.

That's correct. The installation usecase only cares about reading, so that was the scoping there.

@pradyunsg
Copy link
Member

pradyunsg commented Jul 15, 2023

I'm wondering if you would be interested in receiving an API for reading and writing wheels in this project.

I just realised that this was never answered directly -- yes, I think that's a good idea. If @brettcannon and @dstufft agree (or at least are OK with going down this route), we'd have concensus among that packaging maintainers. :)

@brettcannon
Copy link
Member

I'm wondering if you would be interested in receiving an API for reading and writing wheels in this project.

I just realised that this was never answered directly -- yes, I think that's a good idea. If @brettcannon and @dstufft agree (or at least are OK with going down this route), we'd have concensus among that packaging maintainers. :)

Sorry, missed the notification for this comment!

My only real concern is starting to have to care about file I/O which we have avoided thus far. If we can manage to keep a sans-I/O approach then I'm fine with it. If we can't then I think we need to discuss what that change in stance means and how we are going to keep it under control (i.e., I don't want to take on any networking, for instance).

@agronholm
Copy link
Author

I don't quite get the apprehension about file I/O. Abstracting it out is fine, but what's the problem with providing a implementation that handles the file I/O? I was hoping to get all of that moved out of wheel into packaging.

@brettcannon
Copy link
Member

@agronholm not getting in the middle of I/O has been a project policy because people want to use this project in odd places and ways. For instance, if we try go do file I/O we then will have people asking about how to make it work with files stored in some cloud blob storage, some async framework, etc. We have been able to avoid those situations by simply not engaging in any I/O in any way and thus not forcing people into a specific I/O approach.

And even abstractions are a bit tenuous. For instance, if we have a creation API for wheels, how do you pass in the files you want to write and their locations? The simple way is to take a directory and just slurp it all into the zip file. But that requires an API to something that can read a directory and then read those contents. We could define our own protocol that mirrors pathlib.Path and that's what we operate with, but we would have to maintain that protocol as there's no abstract path protocol (yet; probably Python 3.13).

I personally think we could do this in stages. The obvious stuff that requires no I/O we can do without any controversy in the first stage. And then for anything that we think would require file I/O we would have to be very thoughtful about whatever protocols we wanted to support to make it happen and not corner ourselves in some way that isn't appropriately flexible and in a second stage.

@pradyunsg
Copy link
Member

@brettcannon I'm not sure I follow. Why can't we have an IO[bytes] object, taken as input?

The detail of dealing with the filesystem or network is then a problem for the IO object passed in, and if it can't provide a compatible API, then we can't work with that. Taking in an IO object as an argument implies that we don't have to deal with the details of the IO here (leaving that to the object at hand).

@agronholm
Copy link
Author

I would like to add that during the 5 years I've maintained wheel, nobody has come asking for these things 😃 , even when they thought wheel has a public API. But indeed, declaring an interface that works with file descriptors should be good enough, and we're not in any way obligated to fulfill every request coming our way.

@brettcannon
Copy link
Member

Why can't we have an IO[bytes] object, taken as input?

I'm assuming you mean typing.IO[bytes] (which is typing.BinaryIO)?

And we definitely can go with that protocol, we just to have to agree that's the protocol we want to go with and understand the implications. Specifically, that protocol is big and it isn't async. I'm personally fine with all of that, to be clear, but I want us to be explicit that this is how we want to support file I/O in the project going forward, especially if we don't want to put in the effort to specify smaller, stricter protocols (which I'm once again fine not doing for the simplicity of it all).

We also have to understand that protocol has no way to walk the file system (e.g. pathlib.Path.iterdir()). So if we wanted a create_wheel() that took in a directory like src/ and create a wheel from that, we don't have a way to express such an operation. Now we can obviously work around that by having whatever "create a wheel" API we have build the wheel up file-by-file via some add_file() method or something.

Once again, I'm not saying "no" to trying to make this work at all (I think it makes sense for a wheel API to exist here and it makes sense for it to be able to read and create wheels), I just want us to all agree on where the edges of this project are going to be pushed out to since this will change some things for us.

@pradyunsg
Copy link
Member

pradyunsg commented Aug 10, 2023

Ah, OK. I understand the concern now. 😅

I agree that the choice of having I/O-based API (and the corresponding I/O mechanism) should be an explicit design choice. And, I'd be on-board for having a sans-I/O implementation (with a filesystem-based I/O helper provided out of the box).

However, I can't really come up with any sans-I/O design where we can have tooling handle wheels with large files in them, without a bunch of memory overhead in a sans-I/O model. Given that, I'm OK with a non-async I/O model for composing/creating a wheel and processing/reading a wheel.

The reading portion of a wheel API is covered by the installer's WheelSource API (the one piece of that which might be worth improving is the RECORD file validation, which is currently coupled to the WheelSource, and coupled to the I/O model -- I'm happy to evolve installer's WheelSource to have a better approach for that).

@dstufft
Copy link
Member

dstufft commented Aug 10, 2023

It seems like it should be possible to design a sans I/O approach here? Maybe not with the zipfile module itself, but that's just a limitation of that module then. But in theory, we should be able to define an abstraction that is passed and emits chunks rather than whole files at once.

@pradyunsg
Copy link
Member

pradyunsg commented Aug 10, 2023

Are you thinking of something like:

from functools import partial
from shutil import COPY_BUFSIZE

wheel = ...

with open(large_model_filename, "rb") as model_file:
    # something before to initialize the file, somehow?
    for chunk in iter(partial(model_file.read, COPY_BUFSIZE), b''):
        wheel.<something>(chunk, "purelib", "awesome-model.safetensors")
    # something after to finalize the file, somehow?

If we put it in a context manager... I'd argue we're basically re-creating:

with open(large_model_filename, "rb") as model_file:
    with wheel.open(destination, "purelib", "wb") as dest_file:
        shutil.copyfileobj(model_file, dest_file)

... except without the ability to use shutil. 😉 (and, no, I don't think that's a good API)

Maybe not with the zipfile module itself, but that's just a limitation of that module then.

Yea, that's part of what I was hinting at -- there's only one-shot mechanisms to write to a zip (in-memory with io.BytesIO, or on disk directly) with Zipfile and we don't want to take on non-standard-lib dependencies on this project. This means that even if we had an API that took chunks one-at-a-time, we need to write things onto the zipfile on a file-by-file basis for things to actually work. In other words, we either need to write in one-shot or somehow buffer a large file.

That leaves us with the only way to do things (without taking an I/O object) as pretending that it's an I/O-ish object (while not using standard I/O APIs) while allowing for potential misuse, or a stateful object when performing the "add a file" operation.


I don't think a sans-IO model based on chunks would be better than:

with open(large_model_filename, "rb") as model_file:
    wheel.<something>(model_file, "purelib", "awesome-model.safetensors")

and having a loop in there that's basically https://github.com/pypa/installer/blob/444e5295f2403ab303dcf14160591074396f1d21/src/installer/utils.py#L116.

@dstufft
Copy link
Member

dstufft commented Aug 10, 2023

I think it's a fine API? Like yea you're kinda doing it weird for shutil, but it also means you can do:

from functools import partial
from shutil import COPY_BUFSIZE


async def make_wheel():
    wheel = ...

    session = aioboto3.Session()
    async with session.client("s3") as s3:
        s3_ob = await s3.get_object(Bucket=bucket, Key=blob_s3_key)
        stream = s3_ob["Body"]
        while chunk := stream.read(COPY_BUFSIZE):
            wheel.<something>(chunk, "purelib", "awesome-model.safetensors")

I think allowing that is worth the slightly worse implementation when you couple yourself to an IO implementation (and let's be honest, all Sans I/O APIs are at least a little bit less streamlined than ones that tie yourself to an IO implementation, because they have to be to actually function).

Stepping back a minute though, can we define what it is exactly that this API is trying to do? Looking at the installer code linked, I see a few things:

  • Parses the wheel filename.
  • Determine what the .dist-info directory name of the wheel is.
  • Gets a list of the filenames that are inside of the .dist-info directory.
  • Gets the contents of a file from the .dist-info directory.
  • Validates the RECORD file against the contents of the zip file.
  • Iterate over the files in the Wheel, validating that all of the files are in the RECORD.

Reading that, to me none of the interesting bits are actually driving the ZipFile implementation, so we could reimplement that in a sans i/o way like this:

import zipfile

from shutil import COPY_BUFSIZE

filename = "..."

with zipfile.ZipFile(filename, "r") as zfp:
    wheel = SansIOWheelFile(filename, zfp.namelist())  # Parses the filename and assigns attributes to wheel.version etc
    wheel.dist_info_dir  # Searches the list of filenames it's stored internally and returns the correct one.
    wheel.dist_info_filenames  # Searches the list of filenames it's stored internally and returns the correct ones.

    # Equivalent to installer.sources.WheelFile.read_dist_info, but letting the caller do the I/O
    zfp.read(wheel.dist_info_filename("METADATA"))

   # Equivalent to installer.sources.WheelFile.validate_record(validate_contents=True) but
   # letting the caller do the I/O.
   # 
   # Note: Context manager and for loop only need for the validate_contents = True case
   with wheel.record_validator(zfp.read(wheel.dist_info_filename("RECORD"))) as validator:
       for file_validator in validator:
           with zfp.open(file_validator.filename, "r") as stream:
               for chunk in stream.read(COPY_BUFSIZE):
                    _chunk = file_validator.read(chunk)  # Does no I/O, just updates the internal hash and returns the chunk
     # Upon validator.__exit__ all of the hashes will get finalized and we'll collect our issues

    #  Equivalent to installer.sources.WheelFile.get_contents but letting the caller do the I/O
    for filename in wheel.get_filenames(zfp.read(wheel.dist_info_filename("RECORD"))):
        zfp.read(filename)  # ... Or use zfp.open or whatever

Like all Sans I/O APIs, it's a little more verbose than one that is married to a specific I/O implementation, but it's not like.. terrible? It seems perfectly fine, and most of these things actually exist in installer already (particularly the reading of chunks to validate the files in records) but by turning the API "inside out" we divorce the logic of what we're doing from the logic of how I/O happens. Making it possible to drive it from an asynchronous I/O case, but also making testing significantly easier since we separate it from the need to do I/O at all.

In fact, you can see that the installer.sources.WheelFile doesn't even want to drive the I/O in all cases, because it's returning a file object from its get_contents, because the actually interesting part of that method is validating that the filenames are in the RECORD not calling ZipFile.open().

You can tweak the exact API a bit, for instance you could have the various methods that need to access the list of filenames to each pass in the filenames rather than requiring it passed into the constructor... or you could also record the RECORD contents to be passed into the constructor so the RECORD consuming methods don't need to implement them.

Of course, given a SansIOWheelFile like above, there's nothing stopping us from also including a synchronous WheelFile that just acts as a wrapper around SansIOWheelFile and does the boilerplate I/O stuff. Something like:

import zipfile

class WheelFile:

    def __init__(self,  zfp: zipfile.ZipFile):
        self._zipfile = zfp
        self._wheel = SansIOWheelFile(os.path.basename(zfp.filename), zfp.namelist())

    @property
    def dist_info_dir(self):
        return self._wheel.dist_info_dir

    @property
    def dist_info_filenames(self):
        return self._wheel.dist_info_filenames

    def read_dist_info(self, filename):
        self._zipfile.read(self._wheel.dist_info_filename(filename))

    def validate_record(self):
         with wheel.record_validator(zfp.read(wheel.dist_info_filename("RECORD"))) as validator:
           for file_validator in validator:
               with self._zipfile.open(file_validator.filename, "r") as stream:
                   for chunk in stream.read(COPY_BUFSIZE):
                        file_validator.read(chunk)

    def get_contents(self):
        for filename in wheel.get_filenames(zfp.read(wheel.dist_info_filename("RECORD"))):
            yield self._zipfile.open(filename, "r")

@abravalheri
Copy link
Contributor

abravalheri commented Aug 11, 2023

Stepping back a minute though, can we define what it is exactly that this API is trying to do? Looking at the installer code linked, I see a few things:
Looking at the installer code linked, I see a few things:

@dstufft , the ones you wrote, plus:

  • Creating a Python representation of a wheel, that you can use to:
    • add files
    • automatically create a WHEEL file
    • automatically create a RECORD file once you are done adding files.

Lastly:

  • We want to create an actual file from this representation (it does not have to do IO, but it should be easy to integrate with something else that does).

I think this thread started from a comment in the setuptools issue tracker, when @agronholm and I were discussing moving bdist_wheel to setuptools. So one of the primary use cases is creating wheels.


Other part of the discussion that preceeded this thread is that, wheel currently impements the macosx_libfile module to calculate the most adequate platform tag for the current machine. I think is an important piece necessary to create wheels.

This is related to some existing functionality in packaging, and conceptually feels like a natural extension.
So maybe diserves a different thread, but when talking about porting code from pypa/wheel to pypa/packaging, that is also part of the work imho.

Anyway, this is the final requirement:

  • calculate the most appropriate platform tag to use for the current machine when creating a binary wheel.

@agronholm
Copy link
Author

Sounds good so far! Having a sans-I/O base implementation, plus a helper for actually working with wheels in the file system would work great in my opinion. I'm not opposed to an async helper either if someone wants that.

Anyway, this is the final requirement:

  • calculate the most appropriate platform tag to use for the current machine when creating a binary wheel.

Just to be clear, the current machine should preferably be a non-factor when creating wheels. I would like the implementation to determine the platform tags from the metadata and files being presented to the API, if possible.

@dstufft
Copy link
Member

dstufft commented Aug 12, 2023

I assume the inputs for creating a wheel are the files to include + the relevant metadata files, sans the wheel specific ones like WHEEL and RECORD?

import contextlib
import os
import zipfile

from shutil import COPY_BUFSIZE

# Kinda sucks I feel like read/create should be different classes, but both kinda 
# want to be called WheelFile
from something import  SansIOWheelMaker


# Maybe we can have some defaults for the various tags? Or maybe we should
# require them to be passed in. Probably require them to be passed in? I definitely
# think any serious tag generation should live outside this, and get passed in.
#
# This should also include stuff like the Generator, If root should be purelib,
# etc. Basically all the stuff that goes into WHEEL and/or influences where stuff
# gets placed in the wheel.
wheel = SansIOWheelMaker("myproject", "myversion", python_tag="...")


with contextlib.ExitStack() as stack:
    # The library has computed the wheel filename for us, so we can just go
    # ahead and open it.
    wheelfile = stack.enter_context(zipfile.ZipFile(wheel.filename, "w"))

    # Real things will probably do something smarter than just scandir, like
    # checking for files committed in VCS, or looking at the package names or
    # whatever.
    for entry in stack.enter_context(os.scandir("build/")):
        # Zip files can contain directories, but the RECORD file doesn't mention
        # them, so it's unclear if wheels can/should contain directory records
        # or not, and if they do what they should be in RECORD, but if they
        # can we can just do the right thing here.
        if entry.is_dir():
            continue

        # We'll need to open the existing file, and also the zip file to add it
        with contextlib.ExitStack() as fstack:
            # In reality, we'd have to have logic to determine if this is a purelib
            # or something else.
            file = fstack.enter_context(wheel.add_file(entry.name, "purelib"))
            sfp = fstack.enter_context(open(entry.path, "rb"))
            zfp = fstack.enter_context(wheelfile.open(file.name, "w"))

            while chunk := sfp.read(COPY_BUFSIZE):
                # file.write does not hold onto any data, it's just taking chunks
                # so that it can run them through a hasher for RECORD, and returns
                # the chunk so the caller can do something with (presumably write
                # it to a zip file).
                zfp.write(file.write(chunk))
            
            # When the `file` context manager ends, the RECORD file hash will
            # be finalized. We could make it not a context manager, and just
            # finalize whenever we emit the RECORD file, or offer both.
    
    # We also need to pass the metadata in somehow, which we could make a
    # different function for it, or we could just use a "metadata" sigil to
    # wheel.add_file. Here we pick a different function, because there's no
    # promise that sysconfig schemes don't add a `metadata` type in the future.
    #
    # Another option for METADATA would be to have it take a Metadata class
    # instance, or possibly even have the SansIOWheelMaker take it and it can
    # use that for the project name / version / etc, then METADATA can be treated
    # like WHEEL and RECORD is, as "finalizing" files.
    with contextlib.ExitStack() as fstack:
        file = fstack.enter_context(wheel.add_metadata_file("METADATA"))
        zfp = fstack.enter_context(wheelfile.open(file.name, "w"))
        zfp.write(file.write(b"...."))
    
    # After this, calling anything to add any kind of file to the wheel should
    # be an error, as should trying to write to any existing file objects, etc.
    # This emits the RECORD and WHEEL file (and METDATA, or any other future
    # files that may be mandatory).
    #
    # We could make SansIOWheelFile a context manager and have its __exit__
    # error if this method hasn't been called.
    for filename, filedata in wheel.finalize():
        with wheelfile.open(filename, "w") as zfp:
            zfp.write(filedata)

I think that API looks perfectly fine?

What's interesting is I tried to make a synchronous API like I did for reading, and that actually feels much harder to do in a generic way, because there's a lot of little fiddly bits that are hard to get right generically, because for each file you basically need to know:

  • If the file should be included.
  • What sysconfig scheme this file should be a part of.
  • What is the filename, relative to the root of the sysconfig scheme directory for the files sysconfig scheme.

However, that becomes hard to do without ending up with API that looks similar to the above or that makes assumptions about the above based on something. I kind of sketched it out the best I could, and came up with this:

class WheelMaker:
    def add_files(self, path, callback):
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir():
                    continue

                if meta := callback(path, entry.name):
                    scheme, arcname = meta
                    with contextlib.ExitStack() as fstack:
                        file = fstack.enter_context(self._wheel.add_file(arcname, scheme))
                        sfp = fstack.enter_context(open(entry.path, "rb"))
                        zfp = fstack.enter_context(self._zipfilefile.open(file.name, "w"))

                        while chunk := sfp.read(COPY_BUFSIZE):
                            zfp.write(file.write(chunk))

You would have to pass a callback that ws something like:

def callback(base, filepath):
    if filepath.endswith("*.pyc"):  # Check in VCS, or whatever makes sense
        return None

    # (Scheme, Filepath Relative To Scheme Dir)
    return "purelib", filepath

The real callback would likely be much more complicated than that of course, and it works? But it assumes that you have the files on disk in a layout that makes sense to be shuffled right into the wheel (which may be a reasonable assumption?) or that you will call add_files multiple times or something.

Another option is to add (either instead of, or in addition to) a add_file API, but that can take a path or a file object (and if you do it in addition to, you can refactor add_files to use it), and that can end up looking something like:

class WheelMaker:
    def add_files(self, path, callback):
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir():
                    continue

                if meta := callback(path, entry.name):
                    scheme, arcname = meta
                    self.add_file(entry.path, arcname, scheme)
    
    def add_file(self, path_or_obj, arcname, scheme):
        with contextlib.ExitStack() as fstack:
            if isinstance(path_or_obj, str):  # PathLike I guess?
                sfp = fstack.enter_context(open(path_or_obj, "rb"))
            else:
                sfp = path_or_obj

            file = fstack.enter_context(self._wheel.add_file(arcname, scheme))
            zfp = fstack.enter_context(self._zipfilefile.open(file.name, "w"))

            while chunk := sfp.read(COPY_BUFSIZE):
                zfp.write(file.write(chunk))

Even with all of that, you still need a way to add metadata files... which I guess you could just treat as a special type of scheme? But like earlier there's no guarantee that "metadata" doesn't become a scheme name in the future. We could add an add_file like API that just adds the metadata file, so something like:

class WheelMaker:
    def add_metadata_file(self, filename, path_or_obj):
        with contextlib.ExitStack() as fstack:
            if isinstance(path_or_obj, str):  # PathLike I guess?
                sfp = fstack.enter_context(open(path_or_obj, "rb"))
            else:
                sfp = path_or_obj

            file = fstack.enter_context(self._wheel.add_metadata_file(filename))
            zfp = fstack.enter_context(self._zipfile.open(file.name, "w"))

            while chunk := sfp.read(COPY_BUFSIZE):
                zfp.write(file.write(chunk))

Since we can do synchronous I/O, we wouldn't need the .finalize() method, and could just emit RECORD and WHEEL in __exit__?

Oh, but in the above I was assuming that they could pass us a fp that we could wrap with a ZipFile, but they won't actually know the filename (unless we require they figure it out themselves) until after the WheelMaker has been instantiated, or we end up needing an explicit WheelMaker().open() that takes the fp, or we need WheelMaker to take a directory it will put the file in?

Anyways, all together, the sync API use could end up looking something like:

import contextlib
import io

from something import  WheelMaker


class FileFilter:
    def __init__(self, *args, **kwargs):
        # Some information stored here like the VCS root, or the package name
        # or whatever this caller uses to classify and filter files.
        pass 

    def __call__(self, base, filepath):
        # Real filtering/classification Logic
        pass



wheel = WheelMaker("myproject", "myversion", python_tag="...")

with contextlib.ExitStack() as stack:
    wfp = stack.enter_context(open(wheel.filename, "wb"))
    # This returns an object, and we probably move all of the add_* stuff to an
    # inner class that is returned from __enter__, unless we want to just return
    # self and have all of the add_* methods have to check if we have a file
    # we're operating on at the moment.
    wheel = stack.enter_context(wheel.open(wfp))

    # Assume that we have a build directory that has stuff in the correct place
    # since that is our best case, more complicated cases will have to use the
    # wheel.add_file() directly.
    wheel.add_files("build/", FileFilter())

    # This file is generated in memory or whatever, for some reason it's not in
    # the build directory
    wheel.add_file(io.BytesIO(b"..."), "mypkg/__generated__.py", "purelib")

    # Add whatever metadata you want.
    wheel.add_metadata_file("METADATA", io.BytesIO(b"..."))


# When the wheel context manager exits, it will write the RECORD and WHEEL and
# any other files to the archive, then close the internal zipfile, the underlying
# file object will still be open, which in the above will be closed by ExitStack
# next.

It gets a little awkward due to the circular dependency between the WheelMaker.__init__ being the thing that computes the filename and also wanting to have the caller pass in a path or fileobj to operate on, but overall it seems ok?

@pelson
Copy link
Contributor

pelson commented Aug 28, 2024

I saw #805, and wanted to highlight that there is a Wheel editing API in setuptools-ext (full disclosure: I have contributed to it in the past). That API also supports read of existing wheels, and write of new wheels (to some extent - this could be improved though IMO). It is lazy in the sense that when you call the wheel.write() method to add a file, it holds that in-memory until the wheel.write_wheel() method is called (this can obviously be memory intensive, but given we are talking about wheels I think that is reasonable [even considering the size of tensorflow etc.]).

I would be happy to discuss the api in more detail if there is interest (at the least, I think it is interesting to consider when designing a wheel api from scratch). An example of its use for modifying wheel metadata can be found at https://github.com/wimglenn/setuptools-ext/blob/0.9/setuptools_ext.py#L221-L233.

Finally, it is a bit of a surprise to me that we would go from a specialised project (wheel) to a general one (packaging). My guess is that, given there is no definitive answer to what a good API looks like for wheels, you would want to expose an experimental API and build experience from there. I don't personally believe packaging is the best place to do that. Just my 2 cents from an avid packaging user.

@agronholm
Copy link
Author

My question is, since it's not a lot of code, why should we want to have so many scattered packaging infrastructure projects? I'm trying to undo at least some of the insanity. You have so far given no rationale for putting this interface to wheel - where I don't want it. I want wheel to become nothing but a CLI tool for manipulating wheels. Perhaps one day even that could be eliminated entirely.

If you have suggestions/criticism about the API itself, feel free to voice your opinions in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants