DM-26658: Re-implement the Formatter class #1018

Merged: 30 commits, Jul 25, 2024
4bb47f6
Reset storage class singleton between tests
timj May 16, 2024
2e8eee2
Change approach to kwargs merging
timj May 17, 2024
9631f24
Catch warnings from TAI to UTC conversion for future dates
timj Jun 8, 2024
69b89d6
Write new Formatter implementation
timj May 15, 2024
d589050
Attach notes to re-raised exception if the formatter attached any
timj Jul 3, 2024
f084fd9
Add news fragment
timj Jul 3, 2024
0ee2c6b
Update the documentation for FormatterV2
timj Jul 3, 2024
62b1577
Remove the old Formatter V1 implementations
timj Jul 3, 2024
69eb69f
Allow a formatter to declare it can accept a python type without coer…
timj Jul 4, 2024
522211c
Test stringification of formatter
timj Jul 4, 2024
03d295b
Update only resources for docker container
timj Jul 4, 2024
7673101
Changes from review
timj Jul 8, 2024
9f01c6c
Add missing docstring
timj Jul 9, 2024
3006e42
Return NotImplmented rather than raising in formatter
timj Jul 10, 2024
528f663
Document exceptions raised by formatter V2
timj Jul 10, 2024
2020eec
Now defer calling read_from_uri if whole file is to be read and cached
timj Jul 10, 2024
f4c137f
Documentation clean up
timj Jul 10, 2024
c22ab2f
Use the pydantic JSON parser for speed
timj Jul 15, 2024
2f5488a
Use lists for __all__ rather than string
timj Jul 23, 2024
64466e5
Change read_from_local_file to use a path string and not URI
timj Jul 23, 2024
bd7a4d0
Refresh pre-commit
timj Jul 23, 2024
b659506
Change order for is_dataclass check to aid mypy 1.11
timj Jul 23, 2024
759a106
Remove logging that was duplicated
timj Jul 23, 2024
8c5fe69
Disable default acceptance on put
timj Jul 24, 2024
2be2ed9
Clarify in docs that read_from_local_file uses a path not URI
timj Jul 24, 2024
fd00140
Fix some errors in docstring markup
timj Jul 24, 2024
43f1e9d
Allow for a storage class converter to be a builtin type
timj Jul 24, 2024
914b91e
Remove now unused cache_ref parameter
timj Jul 24, 2024
e847c28
Ensure that component override storage class is forwarded when disass…
timj Jul 24, 2024
c4dd9bc
Add some notes on storage class overrides for formatters
timj Jul 24, 2024
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -23,7 +23,7 @@ repos:
name: isort (python)
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.5.1
rev: v0.5.4
hooks:
- id: ruff
- repo: https://github.com/numpy/numpydoc
3 changes: 3 additions & 0 deletions doc/changes/DM-26658.feature.rst
@@ -0,0 +1,3 @@
Added a new formatter class, ``lsst.daf.butler.FormatterV2``, that has been redesigned to be solely focused on file I/O with a much cleaner interface.
This is now the recommended interface for writing a formatter.
Butler continues to support the legacy ``Formatter`` but you should plan to migrate to the new simpler interface.
115 changes: 73 additions & 42 deletions doc/lsst.daf.butler/formatters.rst
@@ -217,7 +217,7 @@ Read parameters are used to adjust what is returned by the `Butler.get()` call b
For example this means that a read parameter that subsets an image is valid because the type returned would still be an image.

If read parameters are defined then a `StorageClassDelegate.handleParameters()` method must be defined that understands how to apply these parameters to the Python object and should return a modified copy.
This method must be written even if a `Formatter` is to be used.
This method must be written even if a `FormatterV2` is to be used.
There are two reasons for this; firstly, there is no guarantee that a particular formatter implementation will understand the parameter (and no requirement for that to be the case), and secondly there is no guarantee that a formatter will be involved in retrieval of the dataset.
In-memory datastores never involve a file artifact so whilst composite disassembly is never an issue, a delegate must at least provide the parameter handler to allow the user to configure such a datastore.

@@ -228,69 +228,100 @@ Formatters
==========

Formatters are responsible for serializing a Python type to a storage system and for reconstructing the Python type from the serialized form.
A formatter has to implement at minimum a `Formatter.read()` method and a `Formatter.write()` method.
The ``write()`` method takes a Python object and serializes it somewhere and the ``read()`` method is optionally given a component name and returns the matching Python object.
Details of where the artifact may be located within the datastore are passed to the constructor by the datastore as a `FileDescriptor` instance.
A formatter author should define their formatter as a subclass of `FormatterV2`.

.. warning::
Reading a Dataset
^^^^^^^^^^^^^^^^^

The formatter system has only been used to write datasets to files or to bytes that would be written to a file.
The interface may evolve as other types of datastore become available and make use of the formatter system.
The interface is being reassessed on :jira:`DM-26658`.
A datastore knows which formatter was used to write or ingest a dataset.
There are three methods a formatter author can implement in order to read a Python type from a file:

When ingesting files from external sources formatters are associated with each incoming file but these formatters are only required to support a `Formatter.read()` method.
They must though declare all the file extensions that they can support.
This allows the datastore to ensure that the image being ingested has not obviously been associated with a formatter that does not recognize it.
``read_from_local_file``
The ``read_from_local_file`` method is guaranteed to be passed a local file path.
If the resource was initially remote it will be downloaded before calling the method and the file can be cached if the butler has been configured to do that.

In the current implementation that is focussed entirely on external files in datastores, the location of the serialized data is available to the formatter using the `Formatter.fileDescriptor` property.
This `FileDescriptor` property makes the file location available as a `Location` and also gives access to read parameters supplied by the caller and also defines the `StorageClass` of the dataset being written.
On read, the storage class used to read the file can be different from the storage class expected to be returned by `Datastore`.
This happens if a composite was written but a component from that composite is being read.
``read_from_uri``
The ``read_from_uri`` method is given a URI which might be local or remote and the method can access the resource directly.
This can be especially helpful if the formatter can support partial reads of a remote resource if a component is requested or some parameters that subset the data.
This file might be read from the local cache, if it is available, but will not trigger a download of the remote resource to the local cache.
If the formatter is being called without a component or parameters such that the whole file would be read and if the dataset should be cached, this method will be called with a local file URI.

File Extensions
^^^^^^^^^^^^^^^
``read_from_stream``
The ``read_from_stream`` method is given a file handle (usually a `lsst.resources.ResourceHandleProtocol`) which might be a local or remote resource.
The resource might be read from local cache but the file will not be downloaded to the local cache prior to calling this method.
If the file is being read without components or parameters and if it would be cached, this method will be bypassed if a file reader is available.

Each formatter that reads or writes a file must declare the file extensions that it supports.
For a formatter that supports a single extension this is most easily achieved by setting the class property `Formatter.extension` to that extension.
In some scenarios a formatter might support multiple formats that are controlled by write parameters.
In this case the formatter should assign a frozen set to the `Formatter.supportedExtensions` class property.
It is then required that the class implements an instance property for ``extension`` that returns the extension that will be used by this formatter for writing the current dataset.
By default all of these read methods are disabled: the corresponding ``can_read_from_*`` class properties are set to `False`.
A formatter author must set the matching property to `True` for each method they implement.
Only one of these methods needs to be implemented, but if multiple options are available the priority order is specified in the `FormatterV2.read` documentation.
Any of these methods can also be skipped at run time by returning `NotImplemented` from it.

File vs Bytes
^^^^^^^^^^^^^
The read method has access to the storage class that was used to write the original dataset and the storage class that has been requested by the caller.
These are available in ``self.file_descriptor.storageClass`` (the one used for the write) and ``self.file_descriptor.readStorageClass``.
If a component is requested the read storage class will be that of the component.

Some datastores can stream bytes from remote storage systems and do not require that a local file is created before the Python object can be created.
To support this use case an implementer can implement `Formatter.fromBytes()` for reading in from a datastore and `Formatter.toBytes()` for serializing to a datastore.
If a formatter raises `NotImplementedError` when these byte-like methods are called the datastore will default to using the `Formatter.read()` and `Formatter.write()` methods making use of local temporary files.
Composite:

.. warning::
If "X" is the storage class of the dataset type associated with the registry, "Y" is the storage class of a component and "X'" and "Y'" are user overrides of those storage classes then:

This interface has some rough edges since it is not yet possible for the formatter to optionally support bytes directly based on the amount of data involved.
Even though bytes may be more efficient for small or medium-sized datasets, in some cases with significant datasets the memory overhead of multiple copies may be excessive and a temporary file would be more prudent.
Neither datastore nor the formatter can opt out of using bytes on a per-dataset basis.
========= ====== ======= ======
Component UserSC WriteSC ReadSC
========= ====== ======= ======
No - X X
No X' X X'
Yes - X Y
Yes Y' X Y'
========= ====== ======= ======

FileFormatter Subclass
^^^^^^^^^^^^^^^^^^^^^^
For a disassembled composite the file being opened by the formatter is a component and not directly the composite dataset.
In this situation the ``self.file_descriptor.component`` property will be set to indicate which component this file corresponds to and the ``self.dataset_ref`` property will refer to the composite dataset type.
As for the previous case, the write storage class will match the storage class used to write the file and the read storage class will be the storage class that has been requested and can either match the write storage class or be a user-provided override.
A component will be provided as a parameter solely in the cases where a derived component has been requested and in this scenario the read storage class will be the storage class of the derived component.
Any storage class override request for the composite will be applied by the storage class delegate if the composite has been disassembled and then reassembled.

For many file-based formatter implementations a subclass of `Formatter` can be used that has a much simplified interface.
`~formatters.file.FileFormatter` allows a formatter implementation to be written using two methods: `~formatters.file.FileFormatter._readFile()` takes a local path to the file system and the expected Python type, and `~formatters.file.FileFormatter._writeFile()` takes the in-memory object to be serialized.
Derived components will always set the read storage class to be that of the derived component, including any requested override.
If the original storage class for the derived component is required it can be obtained from the write storage class.

Composites are not handled by `~formatters.file.FileFormatter`.
Writing a Dataset
^^^^^^^^^^^^^^^^^

.. note::
When storing a dataset in a file datastore, the datastore looks up the relevant formatter in the configuration based on the storage class or dataset type name.
A formatter author can define one of the following methods to support writing:

``to_bytes``
This method is given an in-memory dataset and returns the serialized bytes.

``write_local_file``
This method is given an in-memory dataset and a local file name to write to.
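The write side can be sketched the same way. This is a hedged, self-contained illustration: in real code the class would subclass `lsst.daf.butler.FormatterV2`, only one of the two methods is needed, and the parameter names are assumptions for illustration.

```python
import json

# Sketch: the write side of a hypothetical JSON formatter.
class JsonWriteFormatter:
    def to_bytes(self, in_memory_dataset):
        # Serialize the dataset to bytes; the datastore decides where
        # those bytes end up.
        return json.dumps(in_memory_dataset).encode("utf-8")

    def write_local_file(self, in_memory_dataset, path):
        # Alternative form: write directly to a local file name supplied
        # by the datastore.
        with open(path, "w") as fh:
            json.dump(in_memory_dataset, fh)
```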


When ingesting files from external sources, formatters are associated with each incoming file, but these formatters are only required to support reads.
They must, however, declare all the file extensions that they can support.
This allows the datastore to check that the file being ingested has not obviously been associated with a formatter that does not recognize it.

Some formatters can handle multiple Python types without requiring the datastore to force a conversion to a specific type before using the formatter.
A formatter that can support this should override the default `FormatterV2.can_accept` method such that it returns `True` for all supported Python types.
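An override of ``can_accept`` might look like the sketch below. The method name follows the documentation above; the stand-alone class and the exact signature are assumptions made to keep the example self-contained.

```python
# Sketch: a formatter that accepts several Python types directly, so the
# datastore does not need to coerce the in-memory object to a single
# type before handing it over.
class MultiTypeFormatter:
    def can_accept(self, in_memory_dataset) -> bool:
        # Accept both dicts and lists without prior storage-class
        # conversion; anything else is declined.
        return isinstance(in_memory_dataset, (dict, list))
```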

File Extensions
^^^^^^^^^^^^^^^

Each formatter that reads or writes a file must declare the file extensions that it supports.
For a formatter that supports a single extension this is most easily achieved by setting the class property `FormatterV2.default_extension` to that extension.
In some scenarios a formatter might support multiple formats that are controlled by write parameters.
In this case the formatter should assign a frozen set to the `FormatterV2.supported_extensions` class property.

The design of this class hierarchy will be reassessed in :jira:`DM-26658`.
It is then required that the subclass overrides the ``get_write_extension`` method so that it returns the extension that this formatter will use when writing the current dataset (for example, by looking in the write parameters configuration).
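A multiple-extension formatter might look like this sketch. The ``supported_extensions`` and ``get_write_extension`` names follow the documentation; the ``write_parameters`` attribute, its ``compress`` key, and the constructor are illustrative assumptions.

```python
# Sketch: choosing the write extension from a write parameter.
class CompressedTextFormatter:
    supported_extensions = frozenset({".txt", ".txt.gz"})

    def __init__(self, write_parameters=None):
        self.write_parameters = write_parameters or {}

    def get_write_extension(self) -> str:
        # Pick the extension implied by a hypothetical "compress"
        # write parameter; plain text is the default.
        return ".txt.gz" if self.write_parameters.get("compress") else ".txt"
```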

Write Parameters
^^^^^^^^^^^^^^^^

Datastores can be configured to specify parameters that can control how a formatter serializes a Python object.
These configuration parameters are not available to `Butler` users as part of `Butler.put` since the user does not know how a datastore is configured or which formatter will be used for a particular `DatasetType`.

When datastore instantiates the `Formatter` the relevant write parameters are supplied.
When the datastore instantiates the `FormatterV2` the relevant write parameters are supplied.
These write parameters can be accessed when the data are written and they can control any aspect of the write.
The only caveat is that the `Formatter.read` method must be able to read the resulting file without having to know which write parameters were used to create it.
The `Formatter.read` method can look at the file extension and file metadata but it will not have the write parameters supplied to it by datastore.
The only caveat is that the `FormatterV2` read methods must be able to read the resulting file without knowing which write parameters were used to create it.
The read implementation methods can inspect the file extension and file metadata but will not have the write parameters supplied to them by the datastore.

Write Recipes
^^^^^^^^^^^^^
@@ -301,7 +332,7 @@ Rather than require that every formatter is explicitly configured with this deta
Write recipes have their own configuration section and are associated with a specific formatter class and contain named collections of parameters.
The write parameters can then specify one of the named recipes by name.

If write recipes are used the formatter should implement a `Formatter.validateWriteRecipes` method.
If write recipes are used the formatter should implement a `FormatterV2.validate_write_recipes` method.
This method not only checks that the parameters are reasonable, it can also update the parameters with default values to make them self-consistent.
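A recipe validator might be sketched as below. This is an assumption-laden illustration: the real `FormatterV2.validate_write_recipes` signature is not shown in this document, and the ``level`` and ``shuffle`` recipe keys are invented for the example.

```python
# Sketch: validating named write recipes and filling in defaults so
# each recipe is self-consistent before it is used.
class RecipeFormatter:
    @classmethod
    def validate_write_recipes(cls, recipes):
        validated = {}
        for name, recipe in recipes.items():
            # Reject recipes missing a required (hypothetical) key.
            if "level" not in recipe:
                raise ValueError(f"Recipe {name!r} is missing 'level'")
            # Apply a default, letting the recipe override it.
            validated[name] = {"shuffle": True, **recipe}
        return validated
```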

Configuring Formatters
8 changes: 4 additions & 4 deletions python/lsst/daf/butler/_butler_collections.py
@@ -47,7 +47,7 @@ def extend_chain(self, parent_collection_name: str, child_collection_names: str
----------
parent_collection_name : `str`
The name of a CHAINED collection to which we will add new children.
child_collection_names : `~collections.abc.Iterable` [ `str ` ] | `str`
child_collection_names : `~collections.abc.Iterable` [ `str` ] | `str`
A child collection name or list of child collection names to be
added to the parent.

@@ -80,7 +80,7 @@ def prepend_chain(self, parent_collection_name: str, child_collection_names: str
----------
parent_collection_name : `str`
The name of a CHAINED collection to which we will add new children.
child_collection_names : `~collections.abc.Iterable` [ `str ` ] | `str`
child_collection_names : `~collections.abc.Iterable` [ `str` ] | `str`
A child collection name or list of child collection names to be
added to the parent.

@@ -113,7 +113,7 @@ def redefine_chain(
parent_collection_name : `str`
The name of a CHAINED collection to which we will assign new
children.
child_collection_names : `~collections.abc.Iterable` [ `str ` ] | `str`
child_collection_names : `~collections.abc.Iterable` [ `str` ] | `str`
A child collection name or list of child collection names to be
added to the parent.

@@ -146,7 +146,7 @@ def remove_from_chain(
parent_collection_name : `str`
The name of a CHAINED collection from which we will remove
children.
child_collection_names : `~collections.abc.Iterable` [ `str ` ] | `str`
child_collection_names : `~collections.abc.Iterable` [ `str` ] | `str`
A child collection name or list of child collection names to be
removed from the parent.

2 changes: 1 addition & 1 deletion python/lsst/daf/butler/_dataset_ref.py
@@ -671,7 +671,7 @@ def iter_by_type(
Returns
-------
grouped : `~collections.abc.Iterable` [ `tuple` [ `DatasetType`, \
`Iterable` [ `DatasetRef` ] ]]
`~collections.abc.Iterable` [ `DatasetRef` ] ]]
Grouped `DatasetRef` instances.
"""
if isinstance(refs, _DatasetRefGroupedIterable):
4 changes: 2 additions & 2 deletions python/lsst/daf/butler/_dataset_type.py
@@ -119,11 +119,11 @@ def direct(


class DatasetType:
r"""A named category of Datasets.
"""A named category of Datasets.

Defines how they are organized, related, and stored.

A concrete, final class whose instances represent `DatasetType`\ s.
A concrete, final class whose instances represent a `DatasetType`.
`DatasetType` instances may be constructed without a `Registry`,
but they must be registered
via `Registry.registerDatasetType()` before corresponding Datasets
9 changes: 8 additions & 1 deletion python/lsst/daf/butler/_file_descriptor.py
@@ -49,22 +49,27 @@
Storage class associated with reading the file. Defines the
Python type that the in memory Dataset will have. Will default
to the ``storageClass`` if not specified.
component : `str` or `None`
Component associated with this file. Will only be set for disassembled
composites. Will be `None` for standard composites.
parameters : `dict`, optional
Additional parameters that can be used for reading and writing.
"""

__slots__ = ("location", "storageClass", "_readStorageClass", "parameters")
__slots__ = ("location", "storageClass", "_readStorageClass", "parameters", "component")

def __init__(
self,
location: Location,
storageClass: StorageClass,
readStorageClass: StorageClass | None = None,
component: str | None = None,
parameters: Mapping[str, Any] | None = None,
):
self.location = location
self._readStorageClass = readStorageClass
self.storageClass = storageClass
self.component = component
self.parameters = dict(parameters) if parameters is not None else None

def __repr__(self) -> str:
@@ -73,6 +78,8 @@
optionals["readStorageClass"] = self._readStorageClass
if self.parameters:
optionals["parameters"] = self.parameters
if self.component:
optionals["component"] = self.component


# order is preserved in the dict
options = ", ".join(f"{k}={v!r}" for k, v in optionals.items())