Data Version Control of data in the object store #25

Open
mattjbr123 opened this issue Oct 23, 2024 · 0 comments

Comments

@mattjbr123
Collaborator

Earthmover have released Icechunk, the version-control-of-ARCO aspect of Arraylake. It does not include Arraylake's cataloguing features (those remain available only to paying Arraylake customers).

Is there a use for Icechunk in this project?
✔ Setup is simple
✔ All changes to datasets are tracked
✔ N new versions of the dataset don't take up O(N x dataset_size) storage
✔ Could lend itself to provenance tracking
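
To illustrate the storage point above, here is a toy sketch of how a versioned chunk store can keep N versions without storing N full copies. It is not Icechunk's actual format or API: the `ToyVersionedStore` class, its methods, and the use of content hashes to detect unchanged chunks are all assumptions made purely for illustration. The idea it shows is that each version is a small manifest pointing at chunks, and chunks that did not change are shared between versions.

```python
import hashlib


class ToyVersionedStore:
    """Toy chunk store with snapshots (hypothetical, not Icechunk's real format)."""

    def __init__(self):
        self.chunks = {}     # content hash -> chunk bytes, each stored once
        self.snapshots = []  # one manifest per version: {chunk key -> content hash}

    def commit(self, dataset):
        """Record a new version; only chunks with unseen content are stored."""
        manifest = {}
        for key, data in dataset.items():
            digest = hashlib.sha256(data).hexdigest()
            self.chunks.setdefault(digest, data)  # no-op if already stored
            manifest[key] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1  # version number

    def read(self, version):
        """Reassemble the full dataset as it was at the given version."""
        manifest = self.snapshots[version]
        return {key: self.chunks[digest] for key, digest in manifest.items()}

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


def demo():
    store = ToyVersionedStore()
    # Initial dataset: 10 distinct chunks of 1000 bytes each.
    dataset = {f"temperature/{i}": bytes([i]) * 1000 for i in range(10)}
    store.commit(dataset)
    # Five further versions, each changing a single chunk.
    for v in range(1, 6):
        dataset = dict(dataset)
        dataset[f"temperature/{v}"] = bytes([100 + v]) * 1000
        store.commit(dataset)
    return store


if __name__ == "__main__":
    store = demo()
    naive = len(store.snapshots) * 10 * 1000  # full copy per version
    print(f"versions: {len(store.snapshots)}")
    print(f"stored bytes: {store.stored_bytes()} (naive full copies: {naive})")
```

In this toy, six versions cost the original ten chunks plus one new chunk per update, and every earlier version remains readable, which is also what makes the change history usable for provenance.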

✖ Might be overkill for datasets that are never updated
✖ Another code library for users to get their heads around. Examples would be essential to mitigate this, but it takes us a little away from the "use the data as if it were on disk" idea, which is easier with some boilerplate "load the data and get out of the way" fsspec code
✖ Integrating Icechunk with EIDC could be trickier, given that Icechunk modifies the storage format of the files

Originally posted by @mattjbr123 in #19
