-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ahn5 job in pipeline #57
base: develop
Are you sure you want to change the base?
Conversation
…5-job-in-pipeline
@@ -8,7 +8,7 @@ | |||
|
|||
class PartitionDefinitionAHN(StaticPartitionsDefinition): | |||
def __init__(self, ahn_version: int): | |||
tile_ids = download_ahn_index_esri(ahn_version, with_geom=False) | |||
tile_ids = download_ahn_index(with_geom=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One note about checking the webservice in this init(): it will be ran at for each new job that depends on the ahn partition, so thousands of times in out pipeline. In the past this led to minor issues, ie. on rare cases the webservice is not reachable and it leads to a job error.
Could we move this download function to an asset perhaps? ie. it downloads to a json and on this line we just read that json? that should fix the above issue. Or maybe there are other solutions.
Doesn't have to be part of this PR, but I think we should fix this at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested all of the ahn5 assets manually, through the UI, and some of the ahn3 assets too. I fixed what was necessary for the assets to run.
In particular:
- Disabled the SSL verification for the AHN downloads, because the server is not configured correctly. Added a comment on this in the code.
- The new-style pdal resource was not passed correctly to pdal_info, fixed it.
- There was a change to pass tuples instead of strings to the hash-members of LAZDownload. I reverted this, because the hash must be formatted as
algo:hash
so that tools can parse it. - The tile_id as primary key in the AHN metadata table prevents the subsequent execution of the metadata asset, so I reverted this. If you really want a primary key then it needs to be tile_id+timestamp.
- Some other minor fixes, like logging, corrected the AHN3/4 licenses, typos.
pdal info
did work for me on the 37cn1
tile of AHN5. Which in my opinion just shows the importance of validating the point cloud files by trying to compute stats on them.
Regarding the AHN partitions, with the new ahn index we don't need query the ahn index each time the partition definition is created. The previous ahn indices from esri had a flag to signal if a tile has available data (eg. half of ahn5 index would not have data). In this case, our dynamically loaded ahn5 partition definition would only contain 700 tiles instead of 1400. We run the 700 partitions and they show up as successful in dagster. When esri updates the index that new tiles are available with data, let's say an additional 100, then next time we reload dagster, those 100 tiles will show up as missing partitions for ahn5. This signals us that new data became available and we need to process it. With the new, unified ahn index we know exactly the available ahn partitions, and they don't change, or they don't change often. This means that we don't need to query the ahn index just to get all the ahn tile id-s for defining the ahn partitions. We can load the tile id-s from a simple text file, in the partition definition code, or even hardcode them directly in python. |
My suggestion for how to tackle the issues with AHN5.
Now we are no longer using the ESRI provider for the AHN index. I don't think that one function per AHN number was necessary so I merged them into one and I modified a little how the index object is structured.
Since we can only use pdal for getting the info, I added some error handling there to avoid failures. When pdal fails (as it happens with AHN5 file, then the metadata are set as {}. This results in this empty json in the DB: