Data Version Control of data in the object store #25

mattjbr123 · 2024-10-23T17:30:34Z

Earthmover have released Icechunk, the part of Arraylake that handles version control of ARCO (analysis-ready, cloud-optimised) data. It does not include Arraylake's cataloguing features, which remain available only if you pay for Arraylake.

Is there a use for Icechunk in this project?
✔ Setup is simple
✔ All changes to datasets are tracked
✔ Storing N versions of a dataset doesn't take O(N × dataset_size) space
✔ Could lend itself to provenance tracking

✖ Might be overkill for datasets that are never updated
✖ Another code library for users to get their heads around. Examples would be essential to mitigate this, but it takes us a little away from the "use the data as if it were on disk" idea, which is easier with some boilerplate "load the data and get out of the way" fsspec code
✖ Integrating Icechunk with EIDC could be trickier, given that Icechunk does modify the storage format of the files
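
To illustrate the storage point in the list above (N versions not costing N full copies), here is a toy sketch of a content-addressed chunk store. It is not Icechunk's actual storage format or API; the class `ToyVersionedStore`, the `temperature/i` chunk keys and the 1000-byte chunk size are all made up for illustration. The idea it shows is simply that each version is a small manifest pointing at chunks, and unchanged chunks are shared between versions.

```python
import hashlib


class ToyVersionedStore:
    """Toy content-addressed chunk store (hypothetical, not Icechunk's format)."""

    def __init__(self):
        self.chunks = {}     # sha256 hex digest -> chunk bytes, each stored once
        self.snapshots = []  # one manifest per version: chunk key -> digest

    def commit(self, dataset):
        """Record a new version of `dataset` (a dict of chunk key -> bytes)."""
        manifest = {}
        for key, data in dataset.items():
            digest = hashlib.sha256(data).hexdigest()
            # Only write the chunk if identical content is not already stored.
            self.chunks.setdefault(digest, data)
            manifest[key] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1  # version id

    def read(self, version):
        """Reassemble the dataset as it was at `version`."""
        return {k: self.chunks[d] for k, d in self.snapshots[version].items()}

    def stored_bytes(self):
        """Total bytes of chunk data held across all versions."""
        return sum(len(c) for c in self.chunks.values())


def demo():
    store = ToyVersionedStore()
    # Ten distinct 1000-byte chunks standing in for a chunked array.
    dataset = {f"temperature/{i}": bytes([i]) * 1000 for i in range(10)}
    v0 = store.commit(dataset)
    size_after_first = store.stored_bytes()

    # Update a single chunk and commit a second version.
    dataset["temperature/3"] = b"\xff" * 1000
    v1 = store.commit(dataset)
    return store, v0, v1, size_after_first
```

In this toy model the second commit adds one chunk's worth of data rather than a full copy, while both versions stay readable via `store.read(v0)` and `store.read(v1)`. How closely this matches the real on-disk layout would need checking against the Icechunk documentation.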

Originally posted by @mattjbr123 in #19

mattjbr123 mentioned this issue Oct 23, 2024

Cataloguing the data #19

Open
