Ahn5 job in pipeline #57

Draft · wants to merge 25 commits into base: develop

Conversation

GinaStavropoulou
Member

My suggestion for how to tackle the issues with AHN5.

  • We are no longer using the ESRI provider for the AHN index. I don't think that one function per AHN version was necessary, so I merged them into one and slightly modified how the index object is structured.

  • Since we can only use pdal for getting the info, I added some error handling there to avoid failures. When pdal fails (as happens with some AHN5 files), the metadata are set to {}. This results in this empty JSON in the DB:

[screenshot: empty metadata JSON value in the DB]
  • Regarding the validation of the files: since no checksums are available for AHN5, I had to separate the downloading from the validation. For AHN3 and 4 we download and then validate, but in the case of AHN5 we only download. I made as few changes as I could, but I think the overall design should be improved.
  • I also added a PK to the metadata_ahn tables, because otherwise new rows kept being created.
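The pdal error handling described above could look like the sketch below. This is not the PR's actual code; the function name `pdal_info_or_empty` and the exact exception set are assumptions, but the fallback-to-`{}` behaviour matches the description.

```python
import json
import subprocess


def pdal_info_or_empty(path: str) -> dict:
    """Run `pdal info --metadata` and fall back to {} on any failure.

    Some AHN5 files make pdal fail; instead of failing the whole asset,
    an empty JSON object is stored in the metadata table.
    """
    try:
        result = subprocess.run(
            ["pdal", "info", "--metadata", path],
            capture_output=True, text=True, check=True, timeout=600,
        )
        return json.loads(result.stdout)
    except (subprocess.SubprocessError, FileNotFoundError, json.JSONDecodeError):
        # pdal crashed, is missing, timed out, or produced unparseable output
        return {}
```

A failing tile then simply yields `{}` downstream, which is what shows up as the empty JSON in the DB.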

```diff
@@ -8,7 +8,7 @@
 class PartitionDefinitionAHN(StaticPartitionsDefinition):
     def __init__(self, ahn_version: int):
-        tile_ids = download_ahn_index_esri(ahn_version, with_geom=False)
+        tile_ids = download_ahn_index(with_geom=False)
```
Member

One note about checking the webservice in this __init__(): it will be run for each new job that depends on the AHN partition, so thousands of times in our pipeline. In the past this led to minor issues, e.g. in rare cases the webservice is not reachable, which leads to a job error.

Could we perhaps move this download function to an asset? I.e. it downloads to a JSON file, and on this line we just read that JSON. That should fix the above issue. Or maybe there are other solutions.

This doesn't have to be part of this PR, but I think we should fix it at some point.
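A minimal sketch of that suggestion, assuming a hypothetical upstream asset has already materialized the tile ids to a JSON file (the file name and function name are illustrative, not from the repo):

```python
import json
import tempfile
from pathlib import Path


def load_tile_ids(index_file: Path) -> list[str]:
    # Instead of calling the webservice in __init__ on every job,
    # read the JSON that an upstream tile-index asset already wrote.
    return json.loads(index_file.read_text())


# Round-trip demo with a temp file standing in for the asset's output.
with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "ahn_tile_index.json"
    index_file.write_text(json.dumps(["37cn1", "37cn2", "37dn1"]))
    tile_ids = load_tile_ids(index_file)
```

The partition definition then only does a local file read per job initialisation, so an unreachable webservice can no longer fail a job at that point.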

Member Author

I see the issue; it is what is causing this when running Dagster:

[screenshot: Dagster error]

I am not familiar with Partition Definitions yet but I will look into this

@balazsdukai
Member

I tested all of the ahn5 assets manually, through the UI, and some of the ahn3 assets too. I fixed what was necessary for the assets to run.
In particular:

  • Disabled the SSL verification for the AHN downloads, because the server is not configured correctly. Added a comment on this in the code.
  • The new-style pdal resource was not passed correctly to pdal_info; I fixed it.
  • There was a change to pass tuples instead of strings to the hash-members of LAZDownload. I reverted this, because the hash must be formatted as algo:hash so that tools can parse it.
  • The tile_id as primary key in the AHN metadata table prevents the subsequent execution of the metadata asset, so I reverted it. If you really want a primary key, it needs to be tile_id + timestamp.
  • Some other minor fixes, like logging, corrected the AHN3/4 licenses, typos.

pdal info did work for me on the 37cn1 tile of AHN5, which in my opinion just shows the importance of validating the point cloud files by trying to compute stats on them.
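The `algo:hash` point above can be illustrated with a short sketch. `verify_download` is a hypothetical helper, not the repo's code; it only shows why a single colon-separated string is easier for tools to consume than a tuple:

```python
import hashlib
import tempfile
from pathlib import Path


def verify_download(path: Path, hash_member: str) -> bool:
    # The hash member stays a single "algo:hash" string
    # (e.g. "sha256:9f86d0..."), so any tool can split it
    # on the first ":" and pick the digest algorithm.
    algo, _, expected = hash_member.partition(":")
    digest = hashlib.new(algo, path.read_bytes()).hexdigest()
    return digest == expected


# Demo with a temp file standing in for a downloaded LAZ tile.
with tempfile.TemporaryDirectory() as tmp:
    laz = Path(tmp) / "C_37CN1.LAZ"
    laz.write_bytes(b"fake point cloud bytes")
    member = "sha256:" + hashlib.sha256(b"fake point cloud bytes").hexdigest()
    ok = verify_download(laz, member)
```

The same string format is what utilities like `sha256sum -c` style workflows expect, whereas a tuple would need custom (de)serialization everywhere.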

@balazsdukai
Member

Regarding the AHN partitions: with the new ahn index we don't need to query the ahn index each time the partition definition is created.

The previous ahn indices from ESRI had a flag to signal whether a tile has available data (e.g. half of the ahn5 index would not have data). In that case, our dynamically loaded ahn5 partition definition would only contain 700 tiles instead of 1400. We run the 700 partitions and they show up as successful in Dagster. When ESRI updates the index so that new tiles with data become available, let's say an additional 100, then the next time we reload Dagster those 100 tiles will show up as missing partitions for ahn5. This signals to us that new data became available and we need to process it.
More importantly, the ahn indices of the two versions (3, 4) were not the same. There were a couple of tiles missing from one or the other.

With the new, unified ahn index we know exactly the available ahn partitions, and they don't change, or at least not often. This means that we don't need to query the ahn index just to get all the ahn tile ids for defining the ahn partitions. We can load the tile ids from a simple text file in the partition definition code, or even hardcode them directly in Python.
What we do need to detect is whether the ahn index has changed compared to what we've hardcoded. Effectively, this is a validation step that we need to run every now and then, like with each new ahn release, I guess.
The validation could become a separate op in the ahn tile index asset, which could make the tile index asset a multi-asset (https://docs.dagster.io/concepts/assets/software-defined-assets#graph-backed-assets-and-multi-assets). Then if the validation op fails, the tile index asset fails and everything downstream is marked as outdated.
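The validation step described above can be sketched as a plain function (the op body, stripped of Dagster decorators). The tile ids and names here are illustrative placeholders, not the real index:

```python
# Illustrative placeholder; the real set would be the full hardcoded ahn index.
HARDCODED_AHN5_TILES = frozenset({"37cn1", "37cn2", "37dn1"})


def validate_ahn_index(live_index: set[str]) -> None:
    # Run occasionally (e.g. around each new AHN release): fail loudly
    # when the live index no longer matches the hardcoded partition set,
    # so the tile index asset fails and downstream assets are marked
    # outdated.
    added = sorted(live_index - HARDCODED_AHN5_TILES)
    removed = sorted(HARDCODED_AHN5_TILES - live_index)
    if added or removed:
        raise ValueError(f"ahn index changed: added={added}, removed={removed}")


# An unchanged index passes silently; a changed index raises.
validate_ahn_index({"37cn1", "37cn2", "37dn1"})
try:
    validate_ahn_index({"37cn1", "37cn2"})
    index_change_detected = False
except ValueError:
    index_change_detected = True
```

Wired up as an op inside the tile index multi-asset, the raised error is exactly the failure signal that marks everything downstream as outdated.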
