Create example for the UKCEH GEAR-1hrly dataset #9
16/08/2024
Next thing to do is to test out the adapted recipe with the GEAR dataset, running locally on the JASMIN sci machines first with a subset of the data, then with all the data on the LOTUS cluster. I'll use the 'DirectRunner' of Apache Beam, which is the un-optimised, one-size-fits-all runner.
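For reference, a minimal sketch of how such a local DirectRunner run might be invoked (`pattern` and `recipe` are placeholder names for the GEAR FilePattern and the adapted chain of pangeo-forge transforms, not code from this repo):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: `pattern` is the FilePattern for the GEAR netCDF files and
# `recipe` is the adapted chain of pangeo-forge-recipes transforms.
options = PipelineOptions(runner="DirectRunner", direct_num_workers=1)
with beam.Pipeline(options=options) as p:
    p | beam.Create(pattern.items()) | recipe
```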
22/08/2024 What I've found out after testing the adapted recipe.
To do next:
The solution might be as simple as adding the concat dimension to the offending variables. This seems to be the relevant issue: pangeo-forge/pangeo-forge-recipes#644, from which we can narrow the problem down further to "pangeo-forge-recipes requires all variables that are not coordinates to have the concat_dim in them". The proposed solution would be fine if only xarray's open_dataset function (which pangeo-forge-recipes's OpenWithXarray transform uses) accepted preprocess as a kwarg; in reality only xarray's open_mfdataset accepts preprocess. So to implement this solution we would have to develop our own custom OpenWithXarray that uses open_mfdataset instead, which definitely seems like overkill. Instead we might have to create our own Beam transform (similar to here) that does the same thing. It might be worth putting a PR together to add the suggested, more helpful error message.
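To illustrate the limitation (the file list and the `fix_coords` function are hypothetical):

```python
import xarray as xr

# open_mfdataset applies a user-supplied preprocess callable to each file...
ds = xr.open_mfdataset(gear_files, preprocess=fix_coords, combine="by_coords")

# ...but open_dataset, which OpenWithXarray calls per file, has no preprocess
# argument, so the equivalent fix has to live in a separate Beam step.
ds = xr.open_dataset(gear_files[0])
```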
23/08/2024 Today has been a day of trying to understand how 'preprocessors' might work in a Beam pipeline and pangeo-forge-recipes context. I've figured out the general structure and syntax of preprocessors just through looking at multiple examples, but there's definitely a shortfall in the pangeo-forge-recipes documentation here. I've been doing this so that I can add a preprocessor that converts the annoying extra variables in the netCDF files to coordinates, so that the StoreToZarr pangeo-forge-recipes Beam PTransform can handle the dataset, as suggested above. I found an example in another recipe that should do more or less what I want. The general structure of preprocessors seems to be something like the sketch below:
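(A generic sketch pieced together from other recipes; the variable names being promoted are placeholders, not the actual GEAR variable names.)

```python
import apache_beam as beam

class SetExtraVarsAsCoords(beam.PTransform):
    """Promote the extra (non-gridded) data variables to coordinates."""

    @staticmethod
    def _to_coords(item):
        # pangeo-forge passes (index, xarray.Dataset) pairs between transforms
        index, ds = item
        ds = ds.set_coords(["x_bnds", "y_bnds"])  # placeholder variable names
        return index, ds

    def expand(self, pcoll):
        return pcoll | beam.Map(self._to_coords)
```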
then in the pipeline:
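Presumably something like this (the store name, target location and exact transform arguments are assumptions on my part):

```python
import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | SetExtraVarsAsCoords()  # the preprocessor sketched above
    | StoreToZarr(
        target_root=target_root,       # placeholder output location
        store_name="gear-1hrly.zarr",  # placeholder store name
        combine_dims=pattern.combine_dim_keys,
    )
)
```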
A cleaner alternative seems like it could be to define the preprocessing as a standalone function and apply it directly in the pipeline.
Have now implemented the preprocessor and it seems to be working, albeit taking a long time (~1 hr) to process on a single core, even when pruned to only 12 files (which, I've just noticed, are in the wrong order: it's doing all the Januaries from the first 12 years instead of the first 12 months of the first year...).
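One possible way to get the pruned subset in calendar order, assuming the pattern can be built from a flat, sorted file list (file names below are placeholders):

```python
from pangeo_forge_recipes.patterns import pattern_from_file_sequence

# If the file list is sorted chronologically before the pattern is built,
# prune(nkeep=12) keeps the first twelve months of the first year rather
# than twelve Januaries.
files = sorted(all_gear_files)  # e.g. ".../CEH-GEAR-1hr_199001.nc", ".../CEH-GEAR-1hr_199002.nc", ...
pattern = pattern_from_file_sequence(files, concat_dim="time")
subset = pattern.prune(nkeep=12)
```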
Key takeaway is that some manual writing of a preprocessing function is likely going to be necessary for each dataset, so plenty of training material on this should be made available. This could realistically just be in the form of: "all you need to worry about is the bit inside the _name_of_preprocess function (what you actually want to do in your preprocessing of the data); the rest is just copy-and-paste-able around it".
28/08/2024 findings: Some of the dataset's global attributes were dropped during the processing and don't appear in the final (Zarr) version of the dataset. I guess this makes sense, as the dataset has now been modified and these attributes need to change, but presumably there is no code in pangeo-forge-recipes yet that can do this. The meaning of 'date_created' is also unclear after modification of the 'official' version of the dataset: is it the creation date of the original official version, or of the now-modified one? Otherwise the datasets do appear identical, at least at an 'ncdump'-style first glance.
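A quick way to go beyond the 'ncdump'-style first glance and list exactly which global attributes were lost (both paths are placeholders):

```python
import xarray as xr

# Compare global attributes between an original netCDF file and the Zarr output.
nc = xr.open_dataset("CEH-GEAR-1hr_199001.nc")  # placeholder path
zarr = xr.open_zarr("gear-1hrly.zarr")          # placeholder path
print("dropped attributes:", set(nc.attrs) - set(zarr.attrs))
```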
29/08/2024 - 25/09/2024 Getting a strange error when trying to run it in parallel, which I haven't seen before...
which is a C++ error in a basic erase function that is trying to erase index 16 of a string that doesn't exist, because the string ends at index 15. It is something to do with my Python environment: a particular version of a particular package must be breaking things, as the original workflow for the G2G data runs fine in my old environment but not in the new one. To figure out which package is the culprit I will clone the old environment and gradually update the key packages in it until it breaks, at which point I can look and see what dependencies the new package required, and repeat the process with those dependencies until I have it. The problematic packages seem to be:
Fortunately, older versions of the packages seem to work, specifically:
Any pyarrow version above 8.0.1 seems to reproduce the error. This limits the environment to python<=3.10, which isn't ideal, but at least it works for now and I can proceed with the rest of the project. I will spin this post off into an issue of its own, which merits further debugging, starting with the suggestions in the Pangeo forum thread: https://discourse.pangeo.io/t/strange-error-using-pangeo-forge-recipes-apache-beam-in-parallel/4540
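To record which versions the working environment actually ends up with (the package list below is just the obvious suspects, not the definitive culprits):

```python
from importlib.metadata import version

# Print the installed versions of the suspect packages so the working
# combination can be pinned and reproduced later.
for pkg in ["apache-beam", "pyarrow", "pangeo-forge-recipes", "xarray"]:
    print(pkg, version(pkg))
```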
Have posted a query in the Pangeo Discourse forums to see if anyone else there might be able to shed more light on the problem!
Successfully ran with parallelisation with this environment. Though I've now noticed that the pipeline has added a time dimension to the variables that we converted from data variables to coordinate variables in the preprocessing function, so those coordinates now carry a spurious time dimension.
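A possible follow-up fix in the preprocessor (a sketch only; the coordinate names are placeholders and this assumes the extra time dimension is identical for every time step):

```python
import xarray as xr

def drop_time_from_coords(ds: xr.Dataset) -> xr.Dataset:
    """Remove the accidental time dimension from the promoted coordinates."""
    for name in ["x_bnds", "y_bnds"]:  # placeholder coordinate names
        if name in ds.coords and "time" in ds[name].dims:
            ds = ds.assign_coords({name: ds[name].isel(time=0, drop=True)})
    return ds
```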
There is some relevant discussion over at the UKCEH RSE Space: NERC-CEH/rse_group#21 (comment)
The only remaining thing to do on this particular issue is to check that it runs on LOTUS on JASMIN.
All other outstanding tasks have been spun off into their own issues.
As part of the FDRI project work package 2 I am (co-)developing a "Gridded Data Tools" product that aims to allow easy conversion of gridded data to ARCO format, easy upload to object storage, cataloguing of the data, and easy access and analysis of that data.
I thought this object_store_tutorial repo would be a good place to start, at least for the "ingestion" stage (conversion and upload to object storage), replicating the workflow shown in the README, scripts and notebooks for the UKCEH GEAR-1hrly dataset. Initially this will stay away from any complicated cloud infrastructure like Kubernetes, Argo, Airflow etc., and just run manually, locally or on JASMIN.
I'll attempt to document progress here, and add a further script and/or notebook to the repo in the respective folder. Development work will be on the 'GEAR' branch.