Consolidated Zarr support could improve S3 data loading #2987

Open
mannreis opened this issue Aug 19, 2024 · 6 comments
@mannreis
Contributor

Hello 👋

We've noticed the difference between reading a remote Zarr dataset [https://...#mode=s3,zarr] and a local one [file://....#mode=file,zarr]:

$ time ncdump/ncdump -v tas file://${HOME}/${DATASET}/#mode=zarr | tail -n+2 | md5sum
abd28bc55fb9d0c25a3767a43d27110a -
real 0m0.111s
user 0m0.104s
sys 0m0.017s
$ time ncdump/ncdump -v tas https://${ENDPOINT}/${BUCKET}/${DATASET}/#mode=zarr,s3 | tail -n+2 | md5sum
abd28bc55fb9d0c25a3767a43d27110a -
real 0m9.854s
user 0m4.739s
sys 0m0.162s

Network overhead is expected, especially if the service imposes rate limits. But such a large difference motivated me to look at the implementation behaviour.

It seems that the approach used by netcdf is similar to the one used by Python Zarr: all the metadata is fetched in advance. For the example above, netcdf sends the following requests:

  • 4 GET requests to list the dataset metadata files
  • 674 HEAD requests, mainly to fetch the size of each object to be transferred
  • 224 GET requests to actually read the content of both metadata (223) and data/chunk (1) objects

There are 3x more HEAD requests than GET requests, so eliminating them would only be a tiny improvement; overall this is not much different from what Python does:

import os
import s3fs
import zarr

s3 = s3fs.S3FileSystem(endpoint_url=f'https://{os.environ["ENDPOINT"]}/', anon=True)
store = s3fs.S3Map(root=f'{os.environ["BUCKET"]}/{os.environ["DATASET"]}', s3=s3)
d = zarr.open(store, 'r')  # non-consolidated open: probes the .z* keys one by one
print(d.info)

Which produces:

  • 127 GET requests to list metadata files or variable names under the dataset prefix
  • 349 HEAD requests to check for metadata files
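
To see where counts like these come from, one rough way to observe the access pattern from Python is to wrap the store mapping and tally lookups. This is a hypothetical helper, not part of either library; it counts mapping operations, which only approximate the underlying HTTP requests since s3fs also caches listings:

import os
from collections.abc import MutableMapping

import s3fs
import zarr

class CountingStore(MutableMapping):
    """Tally store operations that typically translate to
    S3 GETs (__getitem__) and HEADs (__contains__)."""
    def __init__(self, inner):
        self.inner, self.gets, self.heads = inner, 0, 0
    def __getitem__(self, key):
        self.gets += 1
        return self.inner[key]
    def __contains__(self, key):
        self.heads += 1
        return key in self.inner
    def __setitem__(self, key, value):
        self.inner[key] = value
    def __delitem__(self, key):
        del self.inner[key]
    def __iter__(self):
        return iter(self.inner)
    def __len__(self):
        return len(self.inner)

s3 = s3fs.S3FileSystem(endpoint_url=f'https://{os.environ["ENDPOINT"]}/', anon=True)
store = CountingStore(s3fs.S3Map(root=f'{os.environ["BUCKET"]}/{os.environ["DATASET"]}', s3=s3))
d = zarr.open(store, 'r')
print(d.info)
print(f'item reads: {store.gets}, membership checks: {store.heads}')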

Implementing a consolidated access mode could improve the situation. In Python, the example above can be simplified to a single request:

  • 1 GET request to fetch the content of /.zmetadata (note that not even a HEAD request is made in advance)

import os
import s3fs
import zarr

s3 = s3fs.S3FileSystem(endpoint_url=f'https://{os.environ["ENDPOINT"]}/', anon=True)
store = s3fs.S3Map(root=f'{os.environ["BUCKET"]}/{os.environ["DATASET"]}', s3=s3)
d = zarr.open_consolidated(store)  # single GET for .zmetadata; no per-key probing
print(d.info)
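
For reference, there is no formal spec for this file, but the de-facto layout written by zarr-python v2's zarr.consolidate_metadata() is a single JSON object mapping the relative paths of the usual .z* files to their parsed content. A minimal sketch, using the tas variable from the example above (the ... placeholders stand for the content of each .z* file):

# De-facto .zmetadata layout as written by zarr-python v2; the key names
# are an implementation detail rather than a specified format.
zmetadata = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        ".zattrs": {...},        # group attributes
        "tas/.zarray": {...},    # shape, chunks, dtype, compressor, ...
        "tas/.zattrs": {...},    # array attributes
    },
}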

If this is desired, perhaps it could be supported by other modes, like file (or even zip!?), as well. In that case I think it would belong in the zarr API rather than in the S3-specific zmap implementation.

I will try to come up with a PR for this, but it would be great to have some feedback first and, if it's positive, some pointers/draft on how to support it (via #mode=consolidated controls? An environment variable? Only when built --with-consolidated-zarr?)

Thanks!

@joshmoore

👍 for support of Zarr v2 "consolidated". A discussion on the zarr-python team yesterday touched on how to deal with potential differences in the definition of consolidated metadata between the v2 and v3 formats. The decision was to add arguments that enable the v2 consolidated format in the v3 library, but potentially to disallow those arguments when producing the v3 format (since the v3 library will need to support both the v2 and v3 formats).

@DennisHeimbigner
Collaborator

Sorry, I apparently missed this Issue when it was first posted. In any case, we have always planned to support consolidated metadata for both V2 and V3. The problem was that there appeared to be no specification for the JSON for consolidated metadata. Has that changed? Can you point me to that spec?

Josh's note about V3 supporting both V2 and V3 is unclear. I get that the actual (python) library will need to read files in both V2 and V3 formats. But I do not understand this remark:

The decision was to add arguments to enable the v2 consolidated format to the v3 library, but potentially disallow those arguments when producing the v3 format

What kind of arguments are being considered?

@joshmoore

The problem was that there appeared to be no specification for the JSON for consolidated metadata. Has that changed? Can you point me to that spec?

No, that has not changed, but agreed that this is a difficulty for the v2 format.

What kind of arguments are being considered?

Correction. For zarr-python library v2, I should have said "methods" or "API" for activating consolidated metadata. (Those don't yet exist for zarr-python library v3.) The method arguments I was thinking of are in xarray: https://docs.xarray.dev/en/stable/user-guide/io.html#consolidated-metadata
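
For concreteness, a minimal sketch of those xarray arguments (the paths are hypothetical; the point is the consolidated keyword, which tells xarray to read or write the .zmetadata object):

import xarray as xr

ds = xr.open_zarr("s3://bucket/dataset", consolidated=True)   # read metadata via one .zmetadata GET
ds.to_zarr("s3://bucket/copy", mode="w", consolidated=True)   # write .zmetadata alongside the store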

@DennisHeimbigner
Collaborator

Ok, I see. So the big holdup at the moment is a JSON spec for consolidated metadata for V2, and another for V3.

@joshmoore

The discussion around V3 is currently ongoing. It's unlikely that there will be significant work on a V2 "spec". (I would certainly be for having an "upgrade guide" between the two, which may be as close as we can come.)

@mannreis
Contributor Author

Thanks for the discussion! In the meantime I've tried to just add a "caching layer" to the metadata functions that GET the .z* files, to see what the difference would be [1]. I've opened #2992, but it's a draft and perhaps not useful in the long term.

[1]

$ time ncdump/ncdump -v tas https://${ENDPOINT}/${BUCKET}/${DATASET}/#mode=zarr,s3,consolidated | tail -n+2 | md5sum
abd28bc55fb9d0c25a3767a43d27110a -
real	0m0.262s
user	0m0.155s
sys	0m0.022s
