Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some Zarr-based datatypes #19040

Open
wants to merge 23 commits into
base: dev
Choose a base branch
from

Conversation

davelopez
Copy link
Contributor

@davelopez davelopez commented Oct 22, 2024

Requires #17614 and #19077

Includes the following datatypes:

General Zarr datatypes

  • CompressedZarrZipArchive (zarr.zip): represents a Zarr ZipStore. It seems to have some limitations (i.e. it doesn't work with zips containing the zarr store in a subfolder).
  • ZarrDirectory (zarr): represents a Zarr DirectoryStore. Contains the zarr structure in the extra_files_path of the dataset. I wonder if there is a way to make the "main" dataset to really point to the actual root folder in extra_files like a symlink or something like this, but I don't know if this really makes sense. Right now, to access your zarr store root you will need to use the input like this '$zarrinput.extra_files_path/$zarrinput.metadata.store_root' (see Handling different input types section below)

OME-Zarr datatypes

  • CompressedOMEZarrZipArchive (ome_zarr.zip): Similar to CompressedZarrZipArchive but expects to find an OME/METADATA.ome.xml file in the store root so it can be easily converted/extracted to an OMEZarr directory.
  • OMEZarr (ome_zarr): Similar to ZarrDirectory but identify this datatype as an OME Zarr image.

Handling different input types

If your zarr input is defined like this:

<inputs>
    <param name="zarrinput" type="data" format="zarr,zarr.zip" allow_uri_if_protocol="https" label="Zarr input"/>
</inputs>

You need to access the zarr store in a "slightly" different way depending on how the store is provided as input.

Input as zarr.zip

UploadOMEZarr.zip.mp4

You can upload a zip file containing the zarr store. In order to open it in your tool you need to do the following:

if '$zarrinput.extension' == 'zarr.zip':
    compression = int('$zarrinput.metadata.compression' or 0)
    zip_store = zarr.ZipStore('$zarrinput', mode='r', compression=compression)
    input_zarr = zarr.open(zip_store, mode='r')`

Please note that the root store must be in the root of the zip (i.e. no subdirectory containing the store)

Input as zarr (Directory)

You cannot directly upload a directory, but you can upload a zarr.zip and then extract it to a "zarr directory dataset".

ConvertZipToOMEZarr.mp4

To open this kind of zarr in your tool you must do the following:

if '$zarrinput.extension' == 'zarr' and '$zarrinput.state' != 'deferred':
        input_store = '$zarrinput.extra_files_path/$zarrinput.metadata.store_root'
        input_zarr = zarr.open(input_store, mode='r')`

Note that you need to reference the extra_files_path and append the $zarrinput.metadata.store_root to it. This will point to the right internal directory where the zarr store has been extracted and will work even if the uploaded zip had a parent subfolder containing the zarr store.

Input as a deferred zarr (URI to a remote zarr)

DeferredZarr.mp4

In your tool, you can just directly use '$zarrinput', it will point to the remote URI. But you need to check this is actually a deferred dataset like this:

if '$zarrinput.extension' == 'zarr' and '$zarrinput.state' == 'deferred':
        input_zarr = zarr.open('$zarrinput', mode='r')`

How to test the changes?

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    • For testing the "directory" variant datatype:
    • For testing the "deferred" variant (remote URI):
      • Paste the following URI as content: https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0062A/6001240.zarr. You can find more examples here.
      • Make sure you set the dataset options to Defer dataset resolution
    • You can use this "simple" tool wrapper to test or make your own:
<tool id="zarr_test" name="Zarr Test" version="0.1.0+galaxy0" profile="23.0">
    <description>test wrapper for zarr format</description>
    <requirements>
        <requirement type="package" version="2.18.3">zarr</requirement>
        <requirement type="package" version="2024.9.0">s3fs</requirement>
    </requirements>
    <command detect_errors="exit_code"><![CDATA[
        python $script
    ]]></command>
    <configfiles>
        <configfile name="script"><![CDATA[
import zarr

input_store = None
if '$zarrinput.extension' == 'zarr':
    if '$zarrinput.is_deferred':
        print('Using remote S3 store as input')
        # Deferred datasets will return directly the URI in combination with allow_uri_if_protocol
        input_store = '$zarrinput'
    else:
        print('Using local directory store as input')
        # TODO: Is there a way for the dataset file to directly reference the extra_files_path?
        # We need to append the store_root to the extra_files_path to get the full path to the zarr store
        # in case it was extracted to a subdirectory
        input_store = '$zarrinput.extra_files_path/$zarrinput.metadata.store_root'
elif '$zarrinput.extension' == 'zarr.zip':
    compression = int('$zarrinput.metadata.compression' or 0)
    print(f'Using zipped store as input with compression {compression}')
    # Currently only zip files containing the zarr store at the root (i.e. no subdirectories) are supported
    input_store = zarr.ZipStore('$zarrinput', mode='r', compression=compression)
else:
    raise ValueError('Unsupported input format')

if input_store is None:
    raise ValueError('Unable to determine input store. Your input dataset may be an unsupported format')

# Inspect the input zarr
input_zarr = zarr.open(input_store, mode='r')
print("Input zarr info:")
print(f"  {input_zarr.info_items()}")

# Create the output store where the new zarr will be written
output_store = '$zarroutput.extra_files_path'
output_zarr = zarr.open(output_store, mode='w')
foo = output_zarr.create_group('foo')
bar = foo.create_group('bar')
baz = bar.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
        ]]></configfile>
    </configfiles>
    <inputs>
        <param name="zarrinput" type="data" format="zarr,zarr.zip" allow_uri_if_protocol="https" label="Zarr input"/>
    </inputs>
    <outputs>
        <data name="zarroutput" format="zarr" label="${tool.name} on ${on_string}: New ZARR"/>
    </outputs>
    <help><![CDATA[

simple wrapper to test zarr datatype

    ]]></help>
</tool>

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jmchilton
Copy link
Member

I think the URI datatype is a little wonky also. I don't want to encode remote URIs this way if we can avoid it. Lets focus on a syntax in tool data parameters that allows accepting URIs instead of doing this. The other datatypes and enhancements all look really great to me. I'll create an issue for the data parameter handling.

@davelopez
Copy link
Contributor Author

I think the URI datatype is a little wonky also. I don't want to encode remote URIs this way if we can avoid it. Lets focus on a syntax in tool data parameters that allows accepting URIs instead of doing this. The other datatypes and enhancements all look really great to me. I'll create an issue for the data parameter handling.

Thanks again for the great feedback! I've dropped the "wonky" URI datatypes and implemented #19077 as discussed. Once that is merged, I'll rebase here and should be ready to go :)

@bgruening
Copy link
Member

This one needs now a final rebase :)
Thanks @davelopez!

@davelopez davelopez marked this pull request as ready for review November 5, 2024 09:04
@github-actions github-actions bot added this to the 24.2 milestone Nov 5, 2024

# This wouldn't be needed if the CompressedFile.extract function didn't
# create an extra folder under the dataset's extra_files_path.
# Maybe this can be avoided somehow?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge with this or do we need to investigate this more?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to get rid of this ? Otherwise I'd like to see a tool actually use this metadata element.

Copy link
Contributor Author

@davelopez davelopez Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me explain the issue in more detail to see if we can eliminate the need for store_root metadata. This store_root isn't true metadata—it's more of a workaround that indicates the folder containing the actual root of the Zarr directory.

Both tools and visualizations require a path to this root directory to access the correct contents.

When we upload a zip file containing a Zarr directory, it’s common for the zip to include the parent folder of the Zarr store. Many Zarr zips I’ve encountered are structured this way. Ideally, the Zarr store would be zipped without this extra parent folder, but even if it isn’t, when we extract it using the converter, it creates a new folder (like dataset_{uuid}) within extra_files_path, resulting in an additional layer.

Currently, to access the Zarr directory correctly, any tool needs to reference it as follows:

input_zarr = zarr.open('$zarrinput.extra_files_path/$zarrinput.metadata.store_root', mode='r')

This approach, however, is not fully reliable—what if the Zarr store is nested deeper within subdirectories? A better solution might be to use a dedicated converter (rather than archive_to_directory.xml) that finds and extracts the root store directly to extra_files_path, without any parent folders, would this be better?

Another drawback is that tool developers must remember to reference the $zarrinput.extra_files_path, and even add /$zarrinput.metadata.store_root to reach the actual Zarr store.

Any ideas on how to make this process more elegant and eliminate the store_root?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants