Reconsider if partitions are needed for the AHN assets #31

Open
balazsdukai opened this issue Apr 24, 2024 · 1 comment

balazsdukai commented Apr 24, 2024

Currently the las_files, lasindex and metadata assets are partitioned, where an AHN tile is one partition.

This makes it transparent which AHN tiles are successfully processed on the server.
Additionally, we get the parallel processing of the partitions for free, with dagster's parallel executor.

The partitions are generated by downloading the AHN tile index and checking which tiles have data available. The advantage of this method is that for a partially complete AHN version (e.g. AHN5), new unprocessed partitions will show up in dagster when new AHN tiles become available.
The disadvantage of using partitions is that the partition definitions are generated when the code location is (re)loaded. This means that with the current setup, the partitions need to be known at the time of loading the code; they cannot be output from an upstream asset. Therefore, the tile_index asset does not actually pass the partition definition down to the las_files asset; instead, the partition definition is loaded for each asset separately.
If we want to do a partial run of the AHN assets, then we could:

  • Load the full AHN tile indices and create all partitions as we do currently, but then pick only 1-2 partitions to execute. This is straightforward and doesn't require any change. But it means that a partition needs to be selected manually; we cannot just materialize all.
  • Somehow make the partition definition code load only the required part of the AHN tile index, so that we get only the 1-2 partitions that we need. This way we can just click materialize all, but it requires some refactoring. Might not even be possible.
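
A minimal sketch of the second option, assuming the wanted subset arrives as a comma-separated string (for example from an environment variable read at code-location load time). The function name `select_partition_keys` and the tile IDs are hypothetical and not taken from this repository. The resulting list could then be handed to something like dagster's `StaticPartitionsDefinition`. Note that this only filters the keys; it does not by itself avoid downloading the full tile index.

```python
def select_partition_keys(all_keys, subset_spec=""):
    """Return the partition keys to expose.

    all_keys: every tile ID found in the tile index.
    subset_spec: comma-separated tile IDs; empty means "use all tiles".
    """
    wanted = {part.strip() for part in subset_spec.split(",") if part.strip()}
    if not wanted:
        return list(all_keys)
    unknown = wanted - set(all_keys)
    if unknown:
        # Fail early at load time instead of silently dropping a typo.
        raise ValueError(f"unknown tile IDs: {sorted(unknown)}")
    # Keep the order of the tile index so the partition list stays stable.
    return [key for key in all_keys if key in wanted]


# Placeholder tile IDs, for illustration only.
ALL_KEYS = ["30gz1", "30gz2", "37en1"]
# In practice the spec might come from os.environ (an assumption, not the repo's setup).
PARTITION_KEYS = select_partition_keys(ALL_KEYS, "30gz1, 37en1")
```

With an empty spec the full list is returned, so the default behaviour stays the same as today.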

An alternative would be to not use partitions at all: load the tile index in a root asset and pass it downstream. This would allow us to subset the tile index in the root asset and pass only the subset downstream. However, we would then lose some of the transparency and would need to implement parallel execution within the assets.
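
To make the trade-off concrete, here is a rough sketch of what the non-partitioned route might involve, using only the Python standard library. It assumes the tile index is a plain mapping of tile ID to download URL; `subset_tile_index`, `process_tiles` and the `worker` callable are hypothetical names, and the real assets would do the actual download and indexing inside `worker`. Recording a per-tile status is one way to win back some of the transparency that partitions give for free.

```python
from concurrent.futures import ThreadPoolExecutor


def subset_tile_index(tile_index, tile_ids):
    """Root-asset step: keep only the requested tiles of the full index."""
    missing = [tid for tid in tile_ids if tid not in tile_index]
    if missing:
        raise KeyError(f"tiles not in index: {missing}")
    return {tid: tile_index[tid] for tid in tile_ids}


def process_tiles(tile_index, worker, max_workers=4):
    """Downstream step: run worker(tile_id, url) for each tile in parallel.

    Returns a per-tile status so failures stay visible for each tile.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            tid: pool.submit(worker, tid, url) for tid, url in tile_index.items()
        }
        for tid, future in futures.items():
            try:
                results[tid] = {"ok": True, "result": future.result()}
            except Exception as exc:  # one failed tile must not stop the others
                results[tid] = {"ok": False, "error": str(exc)}
    return results
```

A thread pool suits download-bound work; CPU-heavy steps would probably need a process pool instead.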

#17 is related to this

What to do?

balazsdukai added this to the Partial run milestone Apr 24, 2024

balazsdukai commented Oct 17, 2024

Discussed a solution today for the AHN partition definitions. The new unified AHN tile index allows us to hardcode the partitions (or load them from a local file), so that the partition definition won't need to query the server any more. #57 and #58 will fix this.
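
One way loading the partitions from a local file could look, assuming a plain-text file with one tile ID per line. The file format and the function name `load_partition_keys` are hypothetical; #57 and #58 may well do this differently.

```python
from pathlib import Path


def load_partition_keys(path):
    """Read tile IDs from a text file, one per line.

    Blank lines and lines starting with '#' are skipped, and duplicates
    are dropped while keeping the first occurrence's position.
    """
    keys = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        key = line.strip()
        if not key or key.startswith("#"):
            continue
        if key not in seen:
            seen.add(key)
            keys.append(key)
    return keys
```

Because the file is read locally, (re)loading the code location would no longer depend on the server being reachable.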
