Parallel COG / TIFF Storage #257
-
Can you clarify your desired output file(s)? Do you want a single COG or many (one per chunk)?
-
Preferably one file, though an alternative that's easily explorable in e.g. QGIS is fine as well (such as the VRT I mentioned). The rest of the pipeline uses dask and xarray, so this is mostly relevant for interactive quality checks on intermediate data.
-
It's fine to use `distributed.Lock()`, as that should be fairly fast to pass between the different jobs in the queue. Generally, I expect the most time-consuming part to be the actual compute to memory, before writing to disk (or streaming to blob storage).
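For concreteness, a minimal sketch of passing such a lock to rioxarray's `to_raster`, following the pattern in rioxarray's docs (paths and chunk sizes here are placeholders):

```python
import rioxarray
from dask.distributed import Client, Lock

client = Client()  # or the client attached to a gateway cluster

# lock=False lets each worker open its own file handle for reading.
xds = rioxarray.open_rasterio(
    "input_cog.tif",  # placeholder
    chunks={"band": 1, "x": 4096, "y": 4096},
    lock=False,
)

# A single named distributed Lock serializes the GDAL writes;
# the per-chunk compute still runs in parallel on the workers.
xds.rio.to_raster(
    "output.tif",  # placeholder
    tiled=True,
    lock=Lock("rio", client=client),
)
```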
-
We're exploring the Planetary Computer Hub for data processing. We need a file format that's easy to explore interactively (for quality checks etc.) through software like QGIS. COG / plain GeoTIFF seems excellent for this, whereas zarr is harder to explore interactively. Please correct me if I'm wrong!
Thanks to Tom Augspurger for suggesting zarr and the xcog example; xcog could be part of the solution in combination with a VRT (see the sketch below).
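For the VRT route, assuming one COG per chunk, the mosaic could be stitched together with GDAL along these lines (the file names are placeholders):

```python
from glob import glob

from osgeo import gdal

# Build a virtual mosaic over the per-chunk COGs; QGIS opens the
# .vrt as if it were a single raster.
tifs = sorted(glob("chunks/*.tif"))  # placeholder pattern
vrt = gdal.BuildVRT("mosaic.vrt", tifs)
vrt = None  # close the dataset so the .vrt is flushed to disk
```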
Based on rioxarray's documentation (link), we explored parallel COG / TIFF storage using dask. Using rxr.open_rasterio() and ds.rio.to_raster(), we routed operations through the Planetary Hub's gateway cluster. However, the dask dashboard doesn't show the last part of the writing step; instead, the system sits idle for a while before the TIFF write completes.
For reference, our process resembles this example from the PC docs, with an active dask client on a gateway cluster. In our case, the data comes from a COG in blob storage. We're seeking a format that supports dimensions - input as well as output - like `{"band": 10, "x": 100_000, "y": 100_000}`.
I've observed that data writing is parallelized when using rioxarray with the relevant dask args (dask.array.store, https://github.com/corteva/rioxarray/blob/master/rioxarray/raster_writer.py#L294), though the writing to disk is handled by the JupyterHub server node rather than being distributed across the gateway cluster workers. Our goal is to parallelize the COG chunk writing to Azure Blob Storage.
The code snippet below shows the data being buffered through the hub, in memory or over the network. This suggests a close dependency between the hub and the gateway cluster; our intention is to leverage the gateway more effectively, minimizing network I/O costs and the dependency on the hub.
Relevant code snippet from the example in the PC docs:
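A sketch of that pattern - the account URL, container name, credential, and blob name are placeholders, and the exact snippet may differ:

```python
import io

import azure.storage.blob

# xds: the (dask-backed) DataArray computed on the gateway cluster

container_client = azure.storage.blob.ContainerClient(
    "https://<account>.blob.core.windows.net",  # placeholder
    container_name="output",                    # placeholder
    credential="<sas-token>",                   # placeholder
)

# to_raster pulls every computed chunk into this (hub) process,
# serializes the COG into the buffer, then uploads it in one go.
with io.BytesIO() as buffer:
    xds.rio.to_raster(buffer, driver="COG")
    buffer.seek(0)
    container_client.get_blob_client("result.tif").upload_blob(
        buffer, overwrite=True
    )
```

Everything inside the `with` block runs on the hub process, which is exactly the dependency we'd like to avoid.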
While investigating, I found the "put-block-list" method here. However, I'm not sure how relevant this is for the use case above.
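If it were relevant, my understanding is that each worker could stage its own block and one final task commits the ordered list - a hypothetical sketch with azure-storage-blob (the blob names, the chunk source, and the ordering logic are assumptions):

```python
import base64
import uuid

from azure.storage.blob import BlobBlock, BlobClient

blob = BlobClient(
    account_url="https://<account>.blob.core.windows.net",  # placeholder
    container_name="output",                                # placeholder
    blob_name="result.tif",                                 # placeholder
    credential="<sas-token>",                               # placeholder
)

def stage_chunk(chunk_bytes: bytes) -> str:
    """Upload one chunk as an uncommitted block; could run on any worker."""
    block_id = base64.b64encode(uuid.uuid4().bytes).decode()
    blob.stage_block(block_id=block_id, data=chunk_bytes)
    return block_id

# Hypothetical: encoded_chunks holds the file's byte ranges, in order.
encoded_chunks: list[bytes] = []
block_ids = [stage_chunk(chunk) for chunk in encoded_chunks]

# A single commit assembles the staged blocks into the blob, in order.
blob.commit_block_list([BlobBlock(block_id=bid) for bid in block_ids])
```

One caveat: a COG isn't a plain concatenation of chunk bytes (the header carries tile offsets), so the blocks would have to correspond to byte ranges of an already-laid-out file for this to yield a valid COG.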
Any advice?