Currently the las_files, lasindex and metadata assets are partitioned, where one AHN tile is one partition.
This makes it transparent which AHN tiles have been successfully processed on the server.
Additionally, we get parallel processing of the partitions for free with Dagster's parallel executor.
The partitions are generated by downloading the AHN tile index and checking which tiles have data available. The advantage of this method is that in the case of a partially complete AHN version (e.g. AHN5), new unprocessed partitions will show up in Dagster when new AHN tiles become available.
The disadvantage of using partitions is that the partition definitions are generated when the code location is (re)loaded. This means that with the current setup, the partitions need to be known at the time of loading the code; they cannot be the output of an upstream asset. Therefore, in the current setup the tile_index asset does not actually pass down the partition definition to the las_files asset; instead, the partition definition is loaded for each asset separately.
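To illustrate the current behaviour, here is a minimal sketch of deriving the partition keys from the downloaded tile index. It assumes (hypothetically) that the index is a GeoJSON-style FeatureCollection whose features carry a `tile_name` and a `has_data` flag in their properties; the real index may use different field names. In Dagster, the resulting list would typically feed a `StaticPartitionsDefinition`, which is why this code runs at code-location (re)load time:

```python
def partition_keys_from_tile_index(tile_index: dict) -> list[str]:
    """Return one partition key per AHN tile that has data available.

    `tile_index` is assumed to be a GeoJSON-style FeatureCollection where
    each feature has `tile_name` and `has_data` properties. Both names are
    hypothetical and stand in for whatever the real AHN index provides.
    """
    return sorted(
        feature["properties"]["tile_name"]
        for feature in tile_index["features"]
        if feature["properties"].get("has_data", False)
    )
```

Because this runs when the code location loads, every reload re-queries the index, and the keys cannot come from an upstream asset.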
If we want to do a partial run of the AHN assets, we could:

1. Load the full AHN tile index and create all partitions as we do now, but then pick only 1-2 partitions to execute. This is straightforward and doesn't require any changes, but it means the partitions need to be selected manually; we cannot just materialize all.
2. Somehow make the partition-definition code load only the required part of the AHN tile index, so that we get only the 1-2 partitions that we need. This way we could just click "materialize all", but it requires some refactoring, and might not even be possible.
An alternative would be to not use partitions at all: load the tile index in a root asset and pass it downstream. This would allow us to subset the tile index in the root asset and pass only the subset downstream. However, we would then lose some of the transparency and would need to implement parallel execution within the assets.
Discussed a solution today for the AHN partition definitions. The new unified AHN tile index allows us to hardcode the partitions (or load them from a local file), so that the partition definition no longer needs to query the server. #57 and #58 will fix this.
#17 is related to this
What to do?