Ahn5 job in pipeline #57

Draft · wants to merge 25 commits into base: develop

Conversation

GinaStavropoulou
Member

My suggestion for how to tackle the issues with AHN5.

  • We are no longer using the ESRI provider for the AHN index. I don't think that one function per AHN version was necessary, so I merged them into one and slightly modified how the index object is structured.

  • Since we can only use pdal for getting the info, I added some error handling there to avoid failures. When pdal fails (as happens with some AHN5 files), the metadata are set to {}. This results in this empty JSON in the DB:

[screenshot: empty metadata JSON value in the DB]
  • Regarding the validation of the files: since no checksums are available for AHN5, I had to separate the downloading from the validation. For AHN3 and 4 we download and then validate, but in the case of AHN5 we only download. I made as few changes as I could, but I think the overall design should be improved.
  • I also added a PK to the metadata_ahn tables, because otherwise new rows kept being created.
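The pdal error handling described above could look like the sketch below. This is not the PR's actual code; the function name `pdal_info_or_empty` and the exact exception set are assumptions, but the fallback-to-`{}` behaviour matches the description.

```python
import json
import subprocess


def pdal_info_or_empty(path: str) -> dict:
    """Run `pdal info --metadata` and fall back to {} on any failure.

    Some AHN5 files make pdal fail; instead of failing the whole asset,
    an empty JSON object is stored in the metadata table.
    """
    try:
        result = subprocess.run(
            ["pdal", "info", "--metadata", path],
            capture_output=True, text=True, check=True, timeout=600,
        )
        return json.loads(result.stdout)
    except (subprocess.SubprocessError, FileNotFoundError, json.JSONDecodeError):
        # pdal crashed, is missing, timed out, or produced unparseable output
        return {}
```

A failing tile then simply yields `{}` downstream, which is what shows up as the empty JSON in the DB.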

```diff
@@ -8,7 +8,7 @@
 class PartitionDefinitionAHN(StaticPartitionsDefinition):
     def __init__(self, ahn_version: int):
-        tile_ids = download_ahn_index_esri(ahn_version, with_geom=False)
+        tile_ids = download_ahn_index(with_geom=False)
```
Member

One note about checking the webservice in this __init__(): it will be run for each new job that depends on the AHN partition, so thousands of times in our pipeline. In the past this led to minor issues, e.g. in rare cases the webservice is not reachable, which leads to a job error.

Could we perhaps move this download function to an asset? I.e. it downloads to a JSON file, and on this line we just read that JSON. That should fix the above issue. Or maybe there are other solutions.

This doesn't have to be part of this PR, but I think we should fix it at some point.
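A minimal sketch of that suggestion, assuming a hypothetical upstream asset has already materialized the tile ids to a JSON file (the file name and function name are illustrative, not from the repo):

```python
import json
import tempfile
from pathlib import Path


def load_tile_ids(index_file: Path) -> list[str]:
    # Instead of calling the webservice in __init__ on every job,
    # read the JSON that an upstream tile-index asset already wrote.
    return json.loads(index_file.read_text())


# Round-trip demo with a temp file standing in for the asset's output.
with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "ahn_tile_index.json"
    index_file.write_text(json.dumps(["37cn1", "37cn2", "37dn1"]))
    tile_ids = load_tile_ids(index_file)
```

The partition definition then only does a local file read per job initialisation, so an unreachable webservice can no longer fail a job at that point.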

Member Author

I see the issue; it is what is causing this when running Dagster:

[screenshot: Dagster error]

I am not familiar with Partition Definitions yet but I will look into this

@balazsdukai
Member

I tested all of the ahn5 assets manually, through the UI, and some of the ahn3 assets too. I fixed what was necessary for the assets to run.
In particular:

  • Disabled the SSL verification for the AHN downloads, because the server is not configured correctly. Added a comment on this in the code.
  • The new-style pdal resource was not passed correctly to pdal_info; I fixed it.
  • There was a change to pass tuples instead of strings to the hash-members of LAZDownload. I reverted this, because the hash must be formatted as algo:hash so that tools can parse it.
  • The tile_id as primary key in the AHN metadata table prevents the subsequent execution of the metadata asset, so I reverted it. If you really want a primary key, it needs to be tile_id + timestamp.
  • Some other minor fixes, like logging, corrected the AHN3/4 licenses, typos.

pdal info did work for me on the 37cn1 tile of AHN5, which in my opinion just shows the importance of validating the point cloud files by trying to compute stats on them.
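The `algo:hash` point above can be illustrated with a short sketch. `verify_download` is a hypothetical helper, not the repo's code; it only shows why a single colon-separated string is easier for tools to consume than a tuple:

```python
import hashlib
import tempfile
from pathlib import Path


def verify_download(path: Path, hash_member: str) -> bool:
    # The hash member stays a single "algo:hash" string
    # (e.g. "sha256:9f86d0..."), so any tool can split it
    # on the first ":" and pick the digest algorithm.
    algo, _, expected = hash_member.partition(":")
    digest = hashlib.new(algo, path.read_bytes()).hexdigest()
    return digest == expected


# Demo with a temp file standing in for a downloaded LAZ tile.
with tempfile.TemporaryDirectory() as tmp:
    laz = Path(tmp) / "C_37CN1.LAZ"
    laz.write_bytes(b"fake point cloud bytes")
    member = "sha256:" + hashlib.sha256(b"fake point cloud bytes").hexdigest()
    ok = verify_download(laz, member)
```

The same string format is what utilities like `sha256sum -c` style workflows expect, whereas a tuple would need custom (de)serialization everywhere.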

@balazsdukai
Member

Regarding the AHN partitions: with the new ahn index we don't need to query the ahn index each time the partition definition is created.

The previous ahn indices from ESRI had a flag to signal whether a tile has available data (e.g. half of the ahn5 index would not have data). In that case, our dynamically loaded ahn5 partition definition would only contain 700 tiles instead of 1400. We run the 700 partitions and they show up as successful in Dagster. When ESRI updates the index so that new tiles with data become available, let's say an additional 100, then the next time we reload Dagster those 100 tiles will show up as missing partitions for ahn5. This signals to us that new data became available and we need to process it.
More importantly, the ahn indices of the two versions (3, 4) were not the same. There were a couple of tiles missing from one or the other.

With the new, unified ahn index we know exactly the available ahn partitions, and they don't change, or at least not often. This means that we don't need to query the ahn index just to get all the ahn tile ids for defining the ahn partitions. We can load the tile ids from a simple text file in the partition definition code, or even hardcode them directly in Python.
What we do need to detect is whether the ahn index has changed compared to what we've hardcoded. Effectively, this is a validation step that we need to run every now and then, like with each new ahn release, I guess.
The validation could become a separate op in the ahn tile index asset, which could make the tile index asset a multi-asset (https://docs.dagster.io/concepts/assets/software-defined-assets#graph-backed-assets-and-multi-assets). Then if the validation op fails, the tile index asset fails and everything downstream is marked as outdated.
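The validation step described above can be sketched as a plain function (the op body, stripped of Dagster decorators). The tile ids and names here are illustrative placeholders, not the real index:

```python
# Illustrative placeholder; the real set would be the full hardcoded ahn index.
HARDCODED_AHN5_TILES = frozenset({"37cn1", "37cn2", "37dn1"})


def validate_ahn_index(live_index: set[str]) -> None:
    # Run occasionally (e.g. around each new AHN release): fail loudly
    # when the live index no longer matches the hardcoded partition set,
    # so the tile index asset fails and downstream assets are marked
    # outdated.
    added = sorted(live_index - HARDCODED_AHN5_TILES)
    removed = sorted(HARDCODED_AHN5_TILES - live_index)
    if added or removed:
        raise ValueError(f"ahn index changed: added={added}, removed={removed}")


# An unchanged index passes silently; a changed index raises.
validate_ahn_index({"37cn1", "37cn2", "37dn1"})
try:
    validate_ahn_index({"37cn1", "37cn2"})
    index_change_detected = False
except ValueError:
    index_change_detected = True
```

Wired up as an op inside the tile index multi-asset, the raised error is exactly the failure signal that marks everything downstream as outdated.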
