Currently the las_files, lasindex and metadata assets are partitioned, where one AHN tile is one partition.
This makes it transparent which AHN tiles have been successfully processed on the server.
Additionally, we get parallel processing of the partitions for free with Dagster's parallel executor.
The partitions are generated by downloading the AHN tile index and checking which tiles have data available. The advantage of this method is that in the case of a partially complete AHN version (e.g. AHN5), new unprocessed partitions will show up in Dagster when new AHN tiles become available.
The disadvantage of using partitions is that the partition definitions are generated when the code location is (re)loaded. This means that with the current setup, the partitions need to be known at the time of loading the code; they cannot be the output of an upstream asset. Therefore, in the current setup the tile_index asset does not actually pass down the partition definition to the las_files asset; instead, the partition definition is loaded for each asset separately.
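To illustrate the current behaviour, here is a minimal sketch of deriving the partition keys from the downloaded tile index. It assumes (hypothetically) that the index is a GeoJSON-style FeatureCollection whose features carry a `tile_name` and a `has_data` flag in their properties; the real index may use different field names. In Dagster, the resulting list would typically feed a `StaticPartitionsDefinition`, which is why this code runs at code-location (re)load time:

```python
def partition_keys_from_tile_index(tile_index: dict) -> list[str]:
    """Return one partition key per AHN tile that has data available.

    `tile_index` is assumed to be a GeoJSON-style FeatureCollection where
    each feature has `tile_name` and `has_data` properties. Both names are
    hypothetical and stand in for whatever the real AHN index provides.
    """
    return sorted(
        feature["properties"]["tile_name"]
        for feature in tile_index["features"]
        if feature["properties"].get("has_data", False)
    )
```

Because this runs when the code location loads, every reload re-queries the index, and the keys cannot come from an upstream asset.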
If we want to do a partial run of the AHN assets, we could:

1. Load the full AHN tile index and create all partitions as we do now, but then pick only 1-2 partitions to execute. This is straightforward and doesn't require any changes, but it means the partitions need to be selected manually; we cannot just materialize all.
2. Somehow make the partition-definition code load only the required part of the AHN tile index, so that we get only the 1-2 partitions that we need. This way we could just click "materialize all", but it requires some refactoring, and might not even be possible.
An alternative would be to not use partitions at all: load the tile index in a root asset and pass it downstream. This would allow us to subset the tile index in the root asset and pass only the subset downstream. However, we would then lose some of the transparency and would need to implement parallel execution within the assets.
Discussed a solution today for the AHN partition definitions. The new unified AHN tile index allows us to hardcode the partitions (or load them from a local file), so that the partition definition no longer needs to query the server. #57 and #58 will fix this.
#17 is related to this
What to do?