Feature/simple cli for chunking local or remote NetCDF files #319

Open
wants to merge 4 commits into
base: main

Conversation


@steph-ben steph-ben commented Mar 15, 2023

Hello, thanks for this lib!

I ended up rewriting the scan and consolidate parts from your tutorial several times, so I thought this small CLI would be of interest when working outside a notebook. Happy to hear your views on this.

Usage example:

$ kerchunk-nc -i s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc -i s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc
INFO:kercli:Scanning s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc ...
INFO:kercli:Scanning s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc ...
INFO:kercli:Data loaded from json/mydataset : 2 found
INFO:kercli:Consolidating to zarr/mydataset.zarr ...

This results in:

$ tree json/ zarr/
json/
└── mydataset
    └── s3:
        └── era5-pds
            └── 2020
                ├── 01
                │   └── data
                │       └── air_pressure_at_mean_sea_level.json
                └── 02
                    └── data
                        └── air_pressure_at_mean_sea_level.json
zarr/
└── mydataset.zarr
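
The layout above suggests that each input URL is mapped to a JSON file under the json directory and dataset name. A minimal sketch of such a mapping follows; the helper name `json_path_for` is hypothetical and the rule (split on "/", swap the suffix for ".json") is inferred from the tree, not taken from the PR code.

```python
from pathlib import PurePosixPath


def json_path_for(url: str, json_dir: str = "json", name: str = "mydataset") -> str:
    """Map an input URL to the path of its per-file reference JSON.

    Hypothetical helper: it mirrors the tree shown above, where the
    protocol ("s3:") becomes a directory and ".nc" becomes ".json".
    """
    # "s3://bucket/key.nc" -> ["s3:", "bucket", "key.nc"]; empty parts are dropped
    parts = [p for p in url.split("/") if p]
    target = PurePosixPath(json_dir, name, *parts)
    return str(target.with_suffix(".json"))


print(json_path_for("s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc"))
```

Keeping the full URL in the path means two files with the same basename (here, the January and February files) cannot overwrite each other.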

The help output looks like:

$ kerchunk-nc --help
Usage: kerchunk-nc [OPTIONS]

  Cli for ker-chunking local or remote NetCDF files

Options:
  --name TEXT           Dataset name  [default: mydataset]
  -i, --input TEXT      Input file url, readable by fsspec  [required]
  --input-format TEXT
  --input-fs-args TEXT  Arguments that will be passed to fsspec.open()
                        [default: {'anon': True}]
  --json-dir TEXT       Where to store scan output as json
  --zarr-output TEXT    Output of fully merged kerchunk zarr file
  --force-scan          Force scanning input file, even if json file exists
  -v, --verbose
  --help                Show this message and exit.
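
The `--force-scan` option implies that an existing JSON file is otherwise reused instead of rescanning the input. A minimal sketch of that behaviour, assuming a hypothetical `load_or_scan` helper and an injected `scan` callable standing in for the real kerchunk scan:

```python
import json
import os
from typing import Callable


def load_or_scan(url: str, json_path: str,
                 scan: Callable[[str], dict], force_scan: bool = False) -> dict:
    """Return references for `url`, reusing `json_path` unless forced.

    `scan` is a stand-in for whatever produces the reference dict;
    it is only called when no cached JSON exists or force_scan is set.
    """
    if os.path.exists(json_path) and not force_scan:
        with open(json_path) as f:
            return json.load(f)

    refs = scan(url)
    # Create the nested output directory (e.g. json/mydataset/s3:/...) on demand
    os.makedirs(os.path.dirname(json_path) or ".", exist_ok=True)
    with open(json_path, "w") as f:
        json.dump(refs, f)
    return refs
```

Passing the scanner in as a callable keeps the caching logic independent of the file type being scanned.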

@martindurant
Member

I wonder, are you aware of pangeo-forge? It provides a recipe-runner abstraction for reading various xarray supported file types and converting them for storage. That conversion can be via kerchunk to produce JSON files like you are doing. The target is mostly for automatic running of recipes on various cloud backends, so very large datasets; but you can execute a recipe locally in a way that is probably quite similar to the CLI here.
I am not saying that I am opposed to the CLI, but if pangeo-forge is simple enough to use for the same purpose, it seems better not to duplicate effort. Would you mind having a look to see whether what is there makes sense to you and whether you can easily reach the same workflow?

If we decide to go ahead here, could we extend to multiple file types? This is one of kerchunk's great strengths. Each file type, of course, takes a different set of options and may have other semantic differences (a grib2 file produces a list of reference sets, for instance).
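
One way to absorb the difference noted above (one reference set per file for some formats, a list of reference sets for grib2) is to normalise every scanner's result to a list. The sketch below uses stand-in scanners rather than kerchunk calls; `scan_as_list` and the registry are hypothetical names.

```python
from typing import Callable, Dict, List, Union

Refs = dict  # one reference set
Scanner = Callable[[str], Union[Refs, List[Refs]]]


def scan_as_list(url: str, fmt: str, scanners: Dict[str, Scanner]) -> List[Refs]:
    """Run the scanner registered for `fmt` and always return a list."""
    try:
        scanner = scanners[fmt]
    except KeyError:
        known = ", ".join(sorted(scanners))
        raise ValueError(f"Unsupported format {fmt!r}; expected one of: {known}") from None
    result = scanner(url)
    return list(result) if isinstance(result, list) else [result]


# Stand-in scanners (assumptions for illustration, not kerchunk functions)
def fake_nc_scan(url: str) -> Refs:
    return {"version": 1, "refs": {"source": url}}


def fake_grib_scan(url: str) -> List[Refs]:
    return [{"version": 1, "refs": {"message": i}} for i in range(2)]


SCANNERS = {"nc": fake_nc_scan, "grib2": fake_grib_scan}
print(len(scan_as_list("a.nc", "nc", SCANNERS)), len(scan_as_list("a.grib2", "grib2", SCANNERS)))
```

Downstream code can then treat every input as "zero or more reference sets" regardless of file type.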

@martindurant
Member

Also, before I forget: the auto_dask function also does much of the job of automating scanning multiple files and combining the results in a single call (with parallelised tree reduction). Might be worth calling that rather than writing a class to do the same thing, however short that class may be.
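
To illustrate what "parallelised tree reduction" means here, the following sketch combines items in batches, level by level, until one result remains. It uses standard-library threads rather than dask and is not auto_dask's implementation; `tree_reduce` and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tree_reduce(items: Sequence[T], combine: Callable[[List[T]], T],
                batch_size: int = 2, max_workers: int = 4) -> T:
    """Combine `items` batch by batch until a single result remains.

    `combine` must accept a list of one or more items and return one item.
    """
    if not items:
        raise ValueError("nothing to combine")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")

    level = list(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(level) > 1:
            # Split the current level into batches and combine them concurrently
            batches = [level[i:i + batch_size] for i in range(0, len(level), batch_size)]
            level = list(pool.map(combine, batches))  # map() preserves batch order
    return level[0]


def concat(batch: List[List[int]]) -> List[int]:
    return [x for part in batch for x in part]


print(tree_reduce([[1], [2], [3], [4], [5]], concat))
```

With many per-file reference sets, this keeps each combine step small instead of merging everything in one pass.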

@steph-ben
Author

I had a quick look at pangeo-forge before this PR, but it seems very cloud-oriented to me, and a bit too much of a "big thing" for what I want to do.

My use case was really to tackle the simple case, easy to demonstrate and explain, where everything goes well and there is no need to write any Python.

As I understand it, pangeo-forge aims to cover all use cases (hence the need to write some Python in recipe.py) and to provide a cloud-ready CI (which is a really, really great job!).

Possible ways forward:

  • Continue this CLI in kerchunk, covering the simple use case directly from this lib, without external libs
  • Build a simple-use-case CLI in pangeo-forge-recipes

Happy to get your view on this.

@steph-ben
Author

Also, before I forget: the auto_dask function also does much of the job of automating scanning multiple files and combining the results in a single call (with parallelised tree reduction). Might be worth calling that rather than writing a class to do the same thing, however short that class may be.

Thanks, I totally missed this function! I will certainly use it if we go ahead.

If we decide to go ahead here, could we extend to multiple file types? This is one of kerchunk's great strengths. Each file type, of course, takes a different set of options and may have other semantic differences (a grib2 file produces a list of reference sets, for instance).

I have started with NetCDF files for now, but yes, I need to cover GRIB as well.

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

logger = logging.getLogger("kercli")
Member

"kerchunk-cli-nc" is fine :)

@@ -31,6 +31,9 @@
"FITSVarBintable = kerchunk.codecs:VarArrCodec",
"record_member = kerchunk.codecs.RecordArrayMember",
],
'console_scripts': [
'kerchunk-nc = kerchunk.cli.chunk_nc:cli',
Member

Let's just call it kerchunk, and either infer the file type from the URL extension or provide a --format option to select nc, the only format available right now.
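
A minimal sketch of the suggestion above: take an explicit format if given, otherwise infer it from the URL extension, and fail with a useful message for anything unsupported. The function name, the extension table, and the error wording are assumptions for illustration.

```python
import os
from typing import Optional
from urllib.parse import urlparse

# Hypothetical tables; "nc" is the only format available in this PR
EXTENSION_TO_FORMAT = {".nc": "nc", ".nc4": "nc"}
SUPPORTED_FORMATS = ("nc",)


def resolve_format(url: str, explicit: Optional[str] = None) -> str:
    """Return the format to use for `url`, preferring an explicit choice."""
    supported = ", ".join(SUPPORTED_FORMATS)
    if explicit is not None:
        if explicit not in SUPPORTED_FORMATS:
            raise ValueError(
                f"Unsupported format {explicit!r}; supported formats: {supported}"
            )
        return explicit

    # Look only at the path component so query strings do not confuse the match
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    try:
        return EXTENSION_TO_FORMAT[ext]
    except KeyError:
        raise ValueError(
            f"Cannot infer format from {url!r}; pass --format (one of: {supported})"
        ) from None


print(resolve_format("s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc"))
```

In a click-based CLI, the explicit-choice check could instead be delegated to `click.Choice`, which rejects unknown values with its own message.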

logger = logging.getLogger("kercli")


class NetcdfChunker:
Member

Surely NetCDFKerchunker!

value = json.loads(value)
return value

@click.command()
Member

We need a general description

@click.option("--input", "-i",
help="Input file url, readable by fsspec", required=True,
multiple=True)
@click.option("--input-format", default="nc")
Member

Ah, I see you already thought of this - but we don't do anything with this argument, right? We should raise a useful message if anything other than "nc" is provided.

@abkfenris

Would it make sense to make this a subcommand? Then there could be another subcommand for combining reference JSON files.

@martindurant
Member

Would it make sense to make this a subcommand?

I have no preference between subcommands and passing extra arguments, it's just a matter of style.
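
For comparison, the subcommand style could look roughly like the sketch below. It uses the standard-library argparse rather than click (which the PR uses) to stay dependency-free; the subcommand names `scan` and `combine` and their options are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per step (names are illustrative)."""
    parser = argparse.ArgumentParser(prog="kerchunk")
    sub = parser.add_subparsers(dest="command", required=True)

    scan = sub.add_parser("scan", help="Scan input files into reference JSON")
    scan.add_argument("-i", "--input", action="append", required=True,
                      help="Input file url (repeatable)")
    scan.add_argument("--json-dir", default="json",
                      help="Where to store scan output as json")

    combine = sub.add_parser("combine", help="Combine reference JSON files")
    combine.add_argument("refs", nargs="+", help="Reference JSON files to merge")
    combine.add_argument("--output", required=True, help="Combined output path")
    return parser


args = build_parser().parse_args(["scan", "-i", "a.nc", "-i", "b.nc"])
print(args.command, args.input)
```

The alternative, a single command with extra arguments, is what the PR currently does with `--input`, `--json-dir`, and `--zarr-output`.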

@NikosAlexandris

Hello, what is the status of this "feature"?

@steph-ben
Author

Hello, what is the status of this "feature"?

Hello, I currently don't have time to work on this. Happy if someone wants to take over.

@NikosAlexandris

Hello, what is the status of this "feature"?

Hello, I currently don't have time to work on this. Happy if someone wants to take over.

I am working something out over at https://github.com/NikosAlexandris/rekx.

@NikosAlexandris

Step by step, I now have a very rough draft, without tests, at https://github.com/NikosAlexandris/rekx/tree/main/rekx in a 'works-for-me' state. @martindurant any interest in seeing this grow?

@martindurant
Member

any interest in seeing this growing?

I wouldn't use it personally, but it seems that some in here would, so I'd be happy to include something like this.

@NikosAlexandris

I am working on it: https://github.com/NikosAlexandris/rekx#examples. These are just a small part of what rekx can already crunch. In time I will add examples for Kerchunking massive datasets.

@martindurant
Member

@NikosAlexandris , I see you've already put a decent amount of effort into it! I'd be happy to link to it from the kerchunk documentation or include it right here if you think it appropriate - whenever you reckon it's ready for a wider audience.

@NikosAlexandris

NikosAlexandris commented Jan 5, 2024

@NikosAlexandris , I see you've already put a decent amount of effort into it! I'd be happy to link to it from the kerchunk documentation or include it right here if you think it appropriate - whenever you reckon it's ready for a wider audience.

I'd appreciate guidance on all things Kerchunk and, of course, I'd be grateful for suggestions that would eventually make this effort meaningful beyond my own needs. Some examples:

Maybe we can shape it better before asking for exposure?

PS: A larger tutorial using SARAH3 products is on its way, thanks also to the good people at the German weather service (DWD) who actually produce these data.

@NikosAlexandris

@martindurant And of course, in case I wasn't clear: if this goes well, I don't mind either scenario, whether integrating it directly here or linking to it. Whatever works better.

@martindurant
Member

I have a slight preference to integrate it into kerchunk, using command kerchunk if possible, since it's so tightly coupled to this repo's functionality. For tutorials, they should probably become normal documentation pages, or (if being executable is useful) pythia cookbooks (like https://projectpythia.org/kerchunk-cookbook/README.html ).

@NikosAlexandris

I have a slight preference to integrate it into kerchunk, using command kerchunk if possible, since it's so tightly coupled to this repo's functionality. For tutorials, they should probably become normal documentation pages, or (if being executable is useful) pythia cookbooks (like https://projectpythia.org/kerchunk-cookbook/README.html ).

Would you prefer a rather clean Kerchunking interface (i.e. kerchunk reference, kerchunk combine, and whatever else in the Kerchunk API makes sense to expose on the command line)? Or would you also accept keeping, in some form, some of the inspect, shapes, select/read-performance and rechunk-generator commands?

@martindurant
Member

would you accept keeping also

Yes, I think it's fine to have all those commands - they can be helpful shortcuts in some places.

@NikosAlexandris

NikosAlexandris commented Jan 18, 2024

I am working further on rekx, as it serves my own work. The idea is to bring it into cleaner shape before integrating it into Kerchunk. I feel some important bits are currently rather messy. My main concern is to achieve a clean and logical correspondence between commands (based on Typer, currently defined in https://github.com/NikosAlexandris/rekx/blob/main/rekx/cli.py), which consume CLI modules (e.g. https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/inspect.py and https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/shapes.py), which in turn consume something like an API (i.e. https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/netcdf_metadata.py and https://github.com/NikosAlexandris/rekx/blob/bc56436e7e50f7ff3f1ea2c78e1e6f83e08890ce/rekx/diagnose.py).

Ah, and testing... of course! This front needs some love.

It would be good, however, to start a discussion about the integration at some point (overall requirements, dependencies, things to do and things not to do) and to formalise the tasks to do.

@martindurant
Member

I am planning a kerchunk (virtual) get-together to discuss all manner of topics, and this would be a good one.

on the integration, requirements overall, dependencies, things to do and things not to do

Since nothing exists yet, I am not too worried. Probably it's reasonable to add typer to the requirements, but the actual file type readers have extra requirements, so it would be best if the CLI produced reasonable error messages when extra packages are needed.
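
The point about error messages for missing extras can be sketched as a small guard around optional imports. The helper name `require` and the message wording are hypothetical; the module names in the usage note are examples only.

```python
import importlib
from types import ModuleType


def require(module: str, feature: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or explain what is missing and why.

    `feature` describes what the user asked for; `install_hint` is the
    command they could run. Both are free text supplied by the caller.
    """
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional package {module!r}, which is not "
            f"installed. Try: {install_hint}"
        ) from exc


# A module from the standard library is used here so the example always runs
json_module = require("json", "JSON output", "(bundled with Python)")
print(json_module.__name__)
```

A CLI could call, for example, `require("h5py", "NetCDF4/HDF5 scanning", "pip install h5py")` only when that file type is actually requested, so users of other formats never need the package.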

@NikosAlexandris

I am planning a kerchunk (virtual) get-together to discuss all manner of topics, and this would be a good one.

I hope I can join.

on the integration, requirements overall, dependencies, things to do and things not to do

Since nothing exists yet, I am not too worried. Probably it's reasonable to add typer to the requirements, but the actual file type readers have extra requirements, so it would be best if the CLI produced reasonable error messages when extra packages are needed.

You are right, I will try to contribute useful things while I expect to learn a lot from the interaction and the experience.

@martindurant
Member

https://discourse.pangeo.io/t/kerchunk-planning/4002/2 for the kerchunk planning thread


5 participants