
Add an end to end test #3

Open
mattjbr123 opened this issue Oct 1, 2024 · 2 comments

@mattjbr123
Collaborator

Is the output zarr dataset the same as the input netcdf dataset?

Exact method to do this is TBD.

@mattjbr123 mattjbr123 transferred this issue from NERC-CEH/object_store_tutorial Oct 7, 2024
mattjbr123 added 4 commits that referenced this issue Oct 7, 2024
@mattjbr123
Collaborator Author

Ignore the above commits and comments; I think they've been assigned to the wrong issue...

@mattjbr123 mattjbr123 reopened this Oct 7, 2024
@mattjbr123 mattjbr123 changed the title from "Add some an end to end test" to "Add an end to end test" Oct 7, 2024
@dolegi dolegi self-assigned this Oct 18, 2024
@mattjbr123 mattjbr123 linked a pull request Oct 22, 2024 that will close this issue
@mattjbr123
Collaborator Author

mattjbr123 commented Oct 25, 2024

We want to compare the data in the input netcdf file(s) to the output zarr dataset to ensure they are the same.

TL;DR

  • Do we do full data-point by data-point comparison or hashing/summarising?
  • Where do we store the data?
  • Where do we run the test?

If we compare fully (data-point by data-point), one major issue is the size of the datasets, potentially multi-TB. Could hashing or some other simple calculation get around this? Probably not entirely, since it is still a computationally expensive operation and still needs to read in all the data anyway. This must be an issue the EIDC team face and have solutions for. @phtrceh, would you be able to advise? Maybe we just compare the datasets in chunks/slices: it would still take a while, but probably not be as computationally expensive as calculating a hash or summary parameter from the data. Given we'd probably still want to use Beam to parallelise this as much as possible, we could build it into the conversion pipeline itself somehow.
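For reference, a minimal sketch of the chunk/slice comparison using xarray and dask (outside Beam), assuming the input is a set of netCDF files and the output is a zarr store; the paths, open arguments and function name are placeholders, not anything from the pipeline:

```python
import xarray as xr


def compare_datasets(netcdf_paths, zarr_path):
    """Lazily compare every data variable in the input and output datasets."""
    ds_in = xr.open_mfdataset(netcdf_paths, combine="by_coords")
    ds_out = xr.open_zarr(zarr_path)

    for name, da_in in ds_in.data_vars.items():
        da_out = ds_out[name]
        # Guard against a silent false pass if alignment drops everything.
        assert da_in.shape == da_out.shape, f"Shape mismatch in {name!r}"
        # (da_in == da_out) builds a lazy dask graph, so data is read and
        # compared chunk by chunk rather than loaded all at once.
        # NaNs compare unequal, so treat matching NaNs as equal too.
        equal = (da_in == da_out) | (da_in.isnull() & da_out.isnull())
        assert bool(equal.all().compute()), f"Mismatch in variable {name!r}"
```

If we do fold this into the conversion pipeline, the same per-variable, per-chunk comparison could in principle be expressed as a Beam transform over chunk keys, so it parallelises the same way as the conversion itself.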

Then there is the question of where to run the test. If we want to run it via GitHub Actions/CI, we'd need to link to whatever HPC or HPC-like environment we run the conversion on and run the test there, unless we get the pipeline to calculate a number that represents the whole dataset somehow for each dataset; then the comparison is trivial and can run directly on a teeny tiny instance on GitHub.
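To illustrate the "one number per dataset" idea, here is a hypothetical sketch that reduces each variable to a few cheap summary statistics (not a cryptographic hash) which each job could write to JSON, leaving only a trivial diff to run on a GitHub-hosted runner; the paths are placeholders:

```python
import json

import xarray as xr


def fingerprint(ds: xr.Dataset) -> dict:
    """Reduce each data variable to a small, cheap-to-compare summary."""
    summary = {}
    for name, da in ds.data_vars.items():
        summary[name] = {
            "shape": list(da.shape),
            "sum": float(da.sum().compute()),
            "mean": float(da.mean().compute()),
            "nan_count": int(da.isnull().sum().compute()),
        }
    return summary


if __name__ == "__main__":
    # Placeholder paths; in practice each side would write its own JSON
    # file and CI would just compare the two files.
    original = fingerprint(xr.open_mfdataset("input/*.nc"))
    converted = fingerprint(xr.open_zarr("output.zarr"))
    print(json.dumps({"match": original == converted}, indent=2))
```

One caveat: floating-point reductions can differ slightly depending on chunking order, so comparing the summaries with a tolerance would be safer than strict equality, and summaries like these can collide, so they complement rather than replace a fuller check.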

Another issue is that we cannot store the data on GitHub, as again it is too big. A potential way around this would be to upload the original and converted datasets to an object store and read them from there. We would have to find a way to safely store the credentials needed to access the object store, but this should be a problem already solved elsewhere (e.g. the time series FDRI product?). Eventually we will not need to upload the converted data to object storage as a manual step, as it will be done as part of the Beam pipeline anyway; but for that we need to stop using the DirectRunner, which involves creating a Flink or Spark instance for the Beam Flink or Spark runners to use.
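On the credentials point, a sketch of what reading both datasets from an S3-compatible object store might look like, with the keys injected as environment variables (e.g. from GitHub Actions repository secrets); the endpoint, bucket and object names below are made up:

```python
import os

import s3fs
import xarray as xr

# Credentials come from the environment (e.g. GitHub Actions secrets
# exposed via the workflow's `env:` block), never from the repo itself.
fs = s3fs.S3FileSystem(
    key=os.environ["OBJECT_STORE_ACCESS_KEY"],
    secret=os.environ["OBJECT_STORE_SECRET_KEY"],
    client_kwargs={"endpoint_url": "https://object-store.example.org"},
)

# Zarr stores can be opened straight through a filesystem mapper.
ds_zarr = xr.open_zarr(fs.get_mapper("my-bucket/converted.zarr"))

# netCDF files are opened as file-like objects (h5netcdf can read from these).
with fs.open("my-bucket/original.nc", "rb") as f:
    ds_nc = xr.open_dataset(f, engine="h5netcdf")
```

Storing the two keys as repository secrets and passing them to the job as environment variables is the standard GitHub Actions pattern, and presumably what the FDRI time-series work already does.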

Lots of questions!!
