
Add an end to end test #3

Open
mattjbr123 opened this issue Oct 1, 2024 · 2 comments

@mattjbr123
Collaborator

Is the output zarr dataset the same as the input netcdf dataset?

Exact method to do this is TBD.

@mattjbr123 mattjbr123 transferred this issue from NERC-CEH/object_store_tutorial Oct 7, 2024
mattjbr123 added 4 commits that referenced this issue Oct 7, 2024
@mattjbr123
Collaborator Author

Ignore the above commits and comments; I think they've been assigned to the wrong issue...

@mattjbr123 mattjbr123 reopened this Oct 7, 2024
@mattjbr123 mattjbr123 changed the title from "Add some an end to end test" to "Add an end to end test" Oct 7, 2024
@dolegi dolegi self-assigned this Oct 18, 2024
@mattjbr123 mattjbr123 linked a pull request Oct 22, 2024 that will close this issue
@mattjbr123
Collaborator Author

mattjbr123 commented Oct 25, 2024

We want to compare the data in the input netcdf file(s) to the output zarr dataset to ensure they are the same.

TL;DR

  • Do we do full data-point by data-point comparison or hashing/summarising?
  • Where do we store the data?
  • Where do we run the test?

If we compare fully (data-point by data-point), one major issue is the size of the datasets, potentially multi-TB. Could hashing or some other simple calculation get around this? Probably not entirely, since it is still a computationally expensive operation and still needs to read in all the data anyway. This must be an issue the EIDC team face and have solutions for. @phtrceh, would you be able to advise? Maybe we just compare the datasets in chunks/slices: it would still take a while, but probably not be as computationally expensive as calculating a hash or summary parameter from the data. Given we'd probably still want to use Beam to parallelise this as much as possible, we could build it into the conversion pipeline itself somehow.
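For reference, a minimal sketch of the chunk/slice comparison using xarray and dask (outside Beam), assuming the input is a set of netCDF files and the output is a zarr store; the paths, open arguments and function name are placeholders, not anything from the pipeline:

```python
import xarray as xr


def compare_datasets(netcdf_paths, zarr_path):
    """Lazily compare every data variable in the input and output datasets."""
    ds_in = xr.open_mfdataset(netcdf_paths, combine="by_coords")
    ds_out = xr.open_zarr(zarr_path)

    for name, da_in in ds_in.data_vars.items():
        da_out = ds_out[name]
        # Guard against a silent false pass if alignment drops everything.
        assert da_in.shape == da_out.shape, f"Shape mismatch in {name!r}"
        # (da_in == da_out) builds a lazy dask graph, so data is read and
        # compared chunk by chunk rather than loaded all at once.
        # NaNs compare unequal, so treat matching NaNs as equal too.
        equal = (da_in == da_out) | (da_in.isnull() & da_out.isnull())
        assert bool(equal.all().compute()), f"Mismatch in variable {name!r}"
```

If we do fold this into the conversion pipeline, the same per-variable, per-chunk comparison could in principle be expressed as a Beam transform over chunk keys, so it parallelises the same way as the conversion itself.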

Then there is the question of where to run the test. If we want to run it via GitHub Actions/CI, we'd need to link to whatever HPC or HPC-like environment we run the conversion on and run the test there, unless we get the pipeline to calculate a number that represents the whole dataset somehow for each dataset; then the comparison is trivial and can run directly on a teeny tiny instance on GitHub.
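To illustrate the "one number per dataset" idea, here is a hypothetical sketch that reduces each variable to a few cheap summary statistics (not a cryptographic hash) which each job could write to JSON, leaving only a trivial diff to run on a GitHub-hosted runner; the paths are placeholders:

```python
import json

import xarray as xr


def fingerprint(ds: xr.Dataset) -> dict:
    """Reduce each data variable to a small, cheap-to-compare summary."""
    summary = {}
    for name, da in ds.data_vars.items():
        summary[name] = {
            "shape": list(da.shape),
            "sum": float(da.sum().compute()),
            "mean": float(da.mean().compute()),
            "nan_count": int(da.isnull().sum().compute()),
        }
    return summary


if __name__ == "__main__":
    # Placeholder paths; in practice each side would write its own JSON
    # file and CI would just compare the two files.
    original = fingerprint(xr.open_mfdataset("input/*.nc"))
    converted = fingerprint(xr.open_zarr("output.zarr"))
    print(json.dumps({"match": original == converted}, indent=2))
```

One caveat: floating-point reductions can differ slightly depending on chunking order, so comparing the summaries with a tolerance would be safer than strict equality, and summaries like these can collide, so they complement rather than replace a fuller check.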

Another issue is that we cannot store the data on GitHub, as again it is too big. A potential way around this would be to upload the original and converted datasets to an object store and read them from there. We would have to find a way to safely store the credentials needed to access the object store, but this should be a problem already solved elsewhere (e.g. the time series FDRI product?). Eventually we will not need to upload the converted data to object storage as a manual step, as it will be done as part of the Beam pipeline anyway; but for that we need to stop using the DirectRunner, which involves creating a Flink or Spark instance for the Beam Flink or Spark runners to use.
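On the credentials point, a sketch of what reading both datasets from an S3-compatible object store might look like, with the keys injected as environment variables (e.g. from GitHub Actions repository secrets); the endpoint, bucket and object names below are made up:

```python
import os

import s3fs
import xarray as xr

# Credentials come from the environment (e.g. GitHub Actions secrets
# exposed via the workflow's `env:` block), never from the repo itself.
fs = s3fs.S3FileSystem(
    key=os.environ["OBJECT_STORE_ACCESS_KEY"],
    secret=os.environ["OBJECT_STORE_SECRET_KEY"],
    client_kwargs={"endpoint_url": "https://object-store.example.org"},
)

# Zarr stores can be opened straight through a filesystem mapper.
ds_zarr = xr.open_zarr(fs.get_mapper("my-bucket/converted.zarr"))

# netCDF files are opened as file-like objects (h5netcdf can read from these).
with fs.open("my-bucket/original.nc", "rb") as f:
    ds_nc = xr.open_dataset(f, engine="h5netcdf")
```

Storing the two keys as repository secrets and passing them to the job as environment variables is the standard GitHub Actions pattern, and presumably what the FDRI time-series work already does.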

Lots of questions!!
