Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve timing of checks depending on changes since last check #163

Open
wants to merge 71 commits into
base: main
Choose a base branch
from

Conversation

bolinocroustibat
Copy link
Contributor

@bolinocroustibat bolinocroustibat commented Sep 10, 2024

Closes datagouv/data.gouv.fr#1312.

̶-̶ ̶R̶e̶m̶o̶v̶e̶s̶ ̶S̶Q̶L̶ ̶q̶u̶e̶r̶y̶ ̶w̶r̶a̶p̶p̶e̶r̶ ̶̶s̶e̶l̶e̶c̶t̶_̶r̶o̶w̶s̶_̶b̶a̶s̶e̶d̶_̶o̶n̶_̶q̶u̶e̶r̶y̶̶ ̶w̶h̶i̶c̶h̶ ̶w̶a̶s̶ ̶u̶s̶e̶d̶ ̶t̶o̶ ̶c̶r̶e̶a̶t̶e̶ ̶a̶ ̶t̶e̶m̶p̶o̶r̶a̶r̶y̶ ̶t̶a̶b̶l̶e̶ ̶a̶n̶d̶ ̶t̶e̶m̶p̶o̶r̶a̶r̶i̶l̶y̶ ̶m̶a̶r̶k̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶b̶e̶i̶n̶g̶ ̶c̶r̶a̶w̶l̶e̶d̶ ̶a̶s̶ ̶"̶c̶r̶a̶w̶l̶i̶n̶g̶"̶.̶
̶S̶i̶n̶c̶e̶ ̶w̶e̶ ̶u̶p̶d̶a̶t̶e̶ ̶t̶h̶e̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶ ̶s̶t̶a̶t̶u̶s̶ ̶i̶n̶ ̶t̶h̶e̶ ̶c̶h̶e̶c̶k̶ ̶c̶o̶d̶e̶ ̶e̶l̶s̶e̶w̶h̶e̶r̶e̶,̶ ̶i̶t̶ ̶s̶h̶o̶u̶l̶d̶ ̶n̶o̶t̶ ̶b̶e̶ ̶n̶e̶c̶e̶s̶s̶a̶r̶y̶ ̶a̶n̶y̶m̶o̶r̶e̶ ̶t̶o̶ ̶h̶a̶v̶e̶ ̶t̶h̶i̶s̶ ̶t̶e̶m̶p̶o̶r̶a̶r̶y̶ ̶t̶a̶b̶l̶e̶.̶
̶̶̶O̶n̶ ̶r̶e̶v̶i̶e̶w̶,̶ ̶p̶l̶e̶a̶s̶e̶ ̶d̶o̶u̶b̶l̶e̶-̶c̶h̶e̶c̶k̶ ̶t̶h̶e̶ ̶v̶a̶l̶i̶d̶i̶t̶y̶ ̶o̶f̶ ̶t̶h̶i̶s̶ ̶r̶e̶m̶o̶v̶a̶l̶̶̶

̶-̶ ̶A̶d̶d̶ ̶̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶_̶D̶E̶F̶A̶U̶L̶T̶̶ ̶c̶o̶n̶f̶i̶g̶ ̶v̶a̶r̶:̶ ̶t̶i̶m̶e̶ ̶a̶f̶t̶e̶r̶ ̶w̶h̶i̶c̶h̶ ̶a̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶ ̶m̶u̶s̶t̶ ̶b̶e̶ ̶c̶h̶e̶c̶k̶ ̶i̶f̶ ̶w̶e̶ ̶d̶o̶n̶'̶t̶ ̶k̶n̶o̶w̶ ̶a̶b̶o̶u̶t̶ ̶i̶t̶s̶ ̶l̶a̶s̶t̶ ̶m̶o̶d̶i̶f̶i̶c̶a̶t̶i̶o̶n̶ ̶d̶a̶t̶e̶ ̶(̶r̶e̶p̶l̶a̶c̶e̶s̶ ̶t̶h̶e̶ ̶p̶r̶e̶v̶i̶o̶u̶s̶ ̶̶S̶I̶N̶C̶E̶̶ ̶c̶o̶n̶f̶i̶g̶ ̶v̶a̶r̶)̶

  • Add CHECK_DELAYS config var: list of delays (̶p̶o̶s̶t̶g̶r̶e̶s̶q̶l̶ ̶i̶n̶t̶e̶r̶v̶a̶l̶ ̶s̶y̶n̶t̶a̶x̶)̶ between two checks, if the resource has not been modified

̶-̶ ̶U̶p̶d̶a̶t̶e̶ ̶t̶h̶e̶ ̶̶s̶e̶l̶e̶c̶t̶_̶b̶a̶t̶c̶h̶_̶r̶e̶s̶o̶u̶r̶c̶e̶s̶_̶t̶o̶_̶c̶h̶e̶c̶k̶̶ ̶m̶e̶t̶h̶o̶d̶ ̶w̶i̶t̶h̶ ̶t̶h̶e̶ ̶f̶o̶l̶l̶o̶w̶i̶n̶g̶ ̶l̶o̶g̶i̶c̶:̶
̶ ̶ ̶ ̶ ̶1̶.̶ ̶F̶i̶r̶s̶t̶ ̶a̶d̶d̶s̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶i̶t̶h̶ ̶p̶r̶i̶o̶r̶i̶t̶y̶=̶T̶r̶u̶e̶ ̶t̶o̶ ̶t̶h̶e̶ ̶b̶a̶t̶c̶h̶ ̶(̶_̶t̶h̶i̶s̶ ̶d̶o̶e̶s̶n̶'̶t̶ ̶c̶h̶a̶n̶g̶e̶_̶)̶
̶ ̶ ̶ ̶ ̶2̶.̶ ̶T̶h̶e̶n̶ ̶a̶d̶d̶s̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶i̶t̶h̶o̶u̶t̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶(̶=̶w̶i̶t̶h̶ ̶n̶o̶ ̶c̶h̶e̶c̶k̶,̶ ̶n̶e̶v̶e̶r̶ ̶h̶a̶v̶e̶ ̶b̶e̶e̶n̶ ̶c̶h̶e̶c̶k̶e̶d̶)̶ ̶t̶o̶ ̶t̶h̶e̶ ̶b̶a̶t̶c̶h̶ ̶(̶_̶t̶h̶i̶s̶ ̶d̶o̶e̶s̶n̶'̶t̶ ̶c̶h̶a̶n̶g̶e̶_̶)̶
̶ ̶ ̶ ̶ ̶3̶.̶ ̶T̶h̶e̶n̶,̶ ̶i̶f̶ ̶t̶h̶e̶ ̶t̶o̶t̶a̶l̶ ̶n̶u̶m̶b̶e̶r̶ ̶o̶f̶ ̶s̶e̶l̶e̶c̶t̶e̶d̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶i̶s̶ ̶s̶t̶i̶l̶l̶ ̶l̶e̶s̶s̶ ̶t̶h̶a̶n̶ ̶t̶h̶e̶ ̶b̶a̶t̶c̶h̶ ̶s̶i̶z̶e̶:̶
̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶3̶.̶1̶.̶ ̶a̶d̶d̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶h̶i̶c̶h̶ ̶d̶o̶n̶'̶t̶ ̶h̶a̶v̶e̶ ̶a̶t̶ ̶l̶e̶a̶s̶t̶ ̶t̶w̶o̶ ̶c̶h̶e̶c̶k̶s̶ ̶y̶e̶t̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶i̶s̶ ̶o̶l̶d̶e̶r̶ ̶t̶h̶a̶n̶ ̶̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶_̶D̶E̶F̶A̶U̶L̶T̶̶
̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶3̶.̶2̶.̶ ̶a̶d̶d̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶h̶i̶c̶h̶ ̶a̶t̶ ̶l̶e̶a̶s̶t̶ ̶o̶n̶e̶ ̶t̶h̶e̶ ̶t̶h̶e̶i̶r̶ ̶t̶w̶o̶ ̶m̶o̶s̶t̶ ̶r̶e̶c̶e̶n̶t̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶s̶ ̶h̶a̶s̶ ̶a̶n̶ ̶u̶n̶k̶n̶o̶w̶n̶ ̶m̶o̶d̶i̶f̶i̶e̶d̶ ̶d̶a̶t̶e̶ ̶(̶n̶o̶ ̶̶d̶e̶t̶e̶c̶t̶e̶d̶_̶l̶a̶s̶t̶_̶m̶o̶d̶i̶f̶i̶e̶d̶_̶a̶t̶̶,̶ ̶s̶o̶ ̶w̶e̶ ̶c̶a̶n̶n̶o̶t̶ ̶c̶o̶m̶p̶a̶r̶e̶ ̶w̶i̶t̶h̶ ̶a̶n̶o̶t̶h̶e̶r̶ ̶c̶h̶e̶c̶k̶)̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶i̶s̶ ̶o̶l̶d̶e̶r̶ ̶t̶h̶a̶n̶ ̶̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶_̶D̶E̶F̶A̶U̶L̶T̶̶
̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶3̶.̶3̶.̶ ̶a̶d̶d̶ ̶r̶e̶s̶s̶o̶u̶r̶c̶e̶s̶ ̶t̶o̶ ̶w̶h̶i̶c̶h̶ ̶t̶h̶e̶i̶r̶ ̶t̶w̶o̶ ̶m̶o̶s̶t̶ ̶r̶e̶c̶e̶n̶t̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶s̶ ̶h̶a̶v̶e̶ ̶c̶h̶a̶n̶g̶e̶d̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶i̶s̶ ̶o̶l̶d̶e̶r̶ ̶t̶h̶a̶n̶ ̶̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶S̶[̶0̶]̶̶ ̶(̶t̶h̶i̶s̶ ̶i̶s̶ ̶t̶o̶ ̶r̶e̶-̶c̶h̶e̶c̶k̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶t̶h̶a̶t̶ ̶w̶e̶ ̶k̶n̶o̶w̶ ̶i̶t̶ ̶h̶a̶s̶ ̶c̶h̶a̶n̶g̶e̶d̶)̶
̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶3̶.̶4̶.̶ ̶a̶d̶d̶ ̶r̶e̶s̶s̶o̶u̶r̶c̶e̶s̶ ̶t̶o̶ ̶w̶h̶i̶c̶h̶ ̶t̶h̶e̶i̶r̶ ̶l̶a̶s̶t̶ ̶t̶w̶o̶ ̶c̶h̶e̶c̶k̶s̶ ̶h̶a̶v̶e̶ ̶n̶o̶t̶ ̶c̶h̶a̶n̶g̶e̶d̶,̶ ̶t̶h̶e̶ ̶t̶w̶o̶ ̶c̶h̶e̶c̶k̶s̶ ̶h̶a̶v̶e̶ ̶d̶o̶n̶e̶ ̶b̶e̶t̶w̶e̶e̶n̶ ̶t̶w̶o̶ ̶d̶e̶l̶a̶y̶s̶ ̶i̶n̶ ̶̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶S̶̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶i̶s̶ ̶o̶l̶d̶e̶r̶ ̶t̶h̶a̶n̶ ̶t̶h̶e̶ ̶s̶a̶m̶e̶ ̶d̶e̶l̶a̶y̶ ̶i̶n̶ ̶̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶S̶̶ ̶(̶t̶h̶i̶s̶ ̶i̶s̶ ̶i̶n̶ ̶o̶r̶d̶e̶r̶ ̶t̶o̶ ̶a̶v̶o̶i̶d̶ ̶c̶h̶e̶c̶k̶i̶n̶g̶ ̶t̶o̶o̶ ̶o̶f̶t̶e̶n̶ ̶t̶h̶e̶ ̶s̶a̶m̶e̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶ ̶w̶h̶i̶c̶h̶ ̶d̶o̶e̶s̶n̶'̶t̶ ̶c̶h̶a̶n̶g̶e̶)̶ ̶
̶
̶-̶ ̶A̶d̶a̶p̶t̶ ̶t̶h̶e̶ ̶w̶a̶y̶ ̶w̶e̶ ̶c̶o̶u̶n̶t̶ ̶̶p̶e̶n̶d̶i̶n̶g̶_̶c̶h̶e̶c̶k̶s̶̶ ̶a̶n̶d̶ ̶̶f̶r̶e̶s̶h̶_̶c̶h̶e̶c̶k̶s̶̶ ̶i̶n̶ ̶t̶h̶e̶ ̶̶/̶a̶p̶i̶/̶s̶t̶a̶t̶u̶s̶/̶c̶r̶a̶w̶l̶e̶r̶̶ ̶e̶n̶d̶p̶o̶i̶n̶t̶ ̶r̶e̶s̶p̶o̶n̶s̶e̶:̶
̶ ̶ ̶ ̶ ̶-̶ ̶̶p̶e̶n̶d̶i̶n̶g̶_̶c̶h̶e̶c̶k̶s̶̶ ̶a̶r̶e̶ ̶t̶h̶e̶ ̶n̶u̶m̶b̶e̶r̶ ̶o̶f̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶i̶t̶h̶ ̶n̶o̶ ̶c̶h̶e̶c̶k̶,̶ ̶p̶l̶u̶s̶ ̶t̶h̶e̶ ̶n̶u̶m̶b̶e̶r̶ ̶w̶i̶t̶h̶ ̶o̶u̶t̶d̶a̶t̶e̶d̶ ̶c̶h̶e̶c̶k̶s̶ ̶(̶o̶u̶t̶d̶a̶t̶e̶d̶ ̶c̶h̶e̶c̶k̶s̶ ̶a̶r̶e̶ ̶n̶o̶w̶ ̶c̶o̶u̶n̶t̶e̶d̶ ̶d̶i̶f̶f̶e̶r̶e̶n̶t̶l̶y̶,̶ ̶d̶e̶p̶e̶n̶d̶i̶n̶g̶ ̶o̶n̶ ̶t̶h̶e̶ ̶d̶e̶l̶a̶y̶s̶)̶
̶ ̶ ̶ ̶ ̶ ̶-̶ ̶̶f̶r̶e̶s̶h̶_̶c̶h̶e̶c̶k̶s̶̶ ̶a̶r̶e̶ ̶t̶h̶e̶ ̶n̶u̶m̶b̶e̶r̶ ̶o̶f̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶i̶t̶h̶ ̶a̶ ̶c̶h̶e̶c̶k̶,̶ ̶m̶i̶n̶u̶s̶ ̶t̶h̶e̶ ̶n̶u̶m̶b̶e̶r̶ ̶w̶i̶t̶h̶ ̶o̶u̶t̶d̶a̶t̶e̶d̶ ̶c̶h̶e̶c̶k̶s̶ ̶(̶o̶u̶t̶d̶a̶t̶e̶d̶ ̶c̶h̶e̶c̶k̶s̶ ̶a̶r̶e̶ ̶n̶o̶w̶ ̶c̶o̶u̶n̶t̶e̶d̶ ̶d̶i̶f̶f̶e̶r̶e̶n̶t̶l̶y̶,̶ ̶d̶e̶p̶e̶n̶d̶i̶n̶g̶ ̶o̶n̶ ̶t̶h̶e̶ ̶d̶e̶l̶a̶y̶s̶)̶

  • Add a migration to add a next_check_at datetime column to the checks table

  • Add logic to calculate next_check_at datetime when processing the check, depending on the date f the previous check and if it has changed since then, in a dedicated calculate_next_check_datemethod

  • Modify logic for select_batch in order to select resources that have a related last check that have an expired next check date

  • Refactor store_last_modified_date method to also store the next check date, and rename it to update_check_with_modification_and_next_dates

  • Rename process_check_data to preprocess_check_data for more clarity

@bolinocroustibat bolinocroustibat changed the title docs: update changelog improve unavalaible resources management Sep 10, 2024
@bolinocroustibat bolinocroustibat self-assigned this Sep 10, 2024
@bolinocroustibat bolinocroustibat changed the title improve unavalaible resources management Improve unavalaible resources management Sep 11, 2024
@bolinocroustibat bolinocroustibat marked this pull request as ready for review September 12, 2024 12:10
@bolinocroustibat bolinocroustibat changed the title Improve unavalaible resources management Improve management of unavalaible resources with variable delays between two checks Sep 12, 2024
Copy link
Contributor

@magopian magopian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what was the use of the temporary table, was it to make sure that it's threadsafe?

Other than a small nit, this looks good to me ;)

udata_hydra/config_default.toml Outdated Show resolved Hide resolved
@magopian
Copy link
Contributor

Also, didn't investigate why, but it seems you've got a failing test.

@bolinocroustibat bolinocroustibat changed the title Improve management of unavalaible resources with variable delays between two checks Improve timing of checks depending on resources last modification date Sep 18, 2024
Copy link
Contributor

@magopian magopian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, this is very nice and impressive, pretty complex piece of code!

I believe there's a typo that breaks the behavior that you may want to fix, other than that, I'm afraid I can't tell much more, in particular about the SQL query 😱

I'll approve this PR and let you decide if you want to go forward with this PR as is, or if you want to add tests to validate that the SQL query does what it's meant to do? Maybe there area already some that were there previously?

udata_hydra/db/resource.py Show resolved Hide resolved
udata_hydra/crawl/process_check_data.py Outdated Show resolved Hide resolved
udata_hydra/crawl/process_check_data.py Outdated Show resolved Hide resolved
Comment on lines 45 to 59
# Resource has not been modified since last check:
# if the time since last check is greater than the longest delay in CHECK_DELAYS, next check will be after the longest delay
# if the time since last check is less than CHECK_DELAYS[i], next check will be after this CHECK_DELAYS[i]
previous_delay: timedelta = datetime.now(timezone.utc) - datetime.fromisoformat(
last_check["created_at"]
)
if previous_delay > timedelta(hours=config.CHECK_DELAYS[-1]):
next_check = datetime.now(timezone.utc) + timedelta(hours=config.CHECK_DELAYS[-1])
else:
for d in config.CHECK_DELAYS:
if previous_delay <= timedelta(hours=d):
next_check = datetime.now(timezone.utc) + timedelta(hours=d)
break
check_data["next_check"] = next_check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a hard time understanding what that was doing, maybe it's just me...

Would that be clearer like so (I may be missing some detail or edge cases):

Suggested change
# Resource has not been modified since last check:
# if the time since last check is greater than the longest delay in CHECK_DELAYS, next check will be after the longest delay
# if the time since last check is less than CHECK_DELAYS[i], next check will be after this CHECK_DELAYS[i]
previous_delay: timedelta = datetime.now(timezone.utc) - datetime.fromisoformat(
last_check["created_at"]
)
if previous_delay > timedelta(hours=config.CHECK_DELAYS[-1]):
next_check = datetime.now(timezone.utc) + timedelta(hours=config.CHECK_DELAYS[-1])
else:
for d in config.CHECK_DELAYS:
if previous_delay <= timedelta(hours=d):
next_check = datetime.now(timezone.utc) + timedelta(hours=d)
break
check_data["next_check"] = next_check
# Resource has not been modified since last check:
# Re-check it after a delay just greater than the time elapsed since the last check.
since_last_check: timedelta = datetime.now(timezone.utc) - datetime.fromisoformat(
last_check["created_at"]
)
now: datetime = datetime.now(timezone.utc)
# At most, re-check after the greatest delay.
next_check = now + timedelta(hours=config.CHECK_DELAYS[-1])
# Should we re-check it sooner? Loop through the CHECK_DELAYS from longest to shortest.
for delay in reversed(config.CHECK_DELAYS):
delay_delta: timedelta = timedelta(hours=delay)
if since_last_check <= delay_delta:
next_check = now + delay_delta
else:
# Previous delay was the last one greater than `since_last_check`, keep it.
break
check_data["next_check"] = next_check

I realize my rewrite is slightly longer, it feels clearer to me, but it could very well be because i just (re)wrote it, feel free to simply dismiss this.

Copy link
Contributor Author

@bolinocroustibat bolinocroustibat Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks, although I wonder if I prefer going from the shorter delay to the longest... As the shortest should be the default (more safe), and going to longer delays should be considered as "fallbacks", if that makes sense.

I might rewrite a bit as a mix between your solution and mine though!

udata_hydra/routes/status.py Outdated Show resolved Hide resolved
@bolinocroustibat bolinocroustibat changed the title feat: improve timing of checks depending on resources last modification date feat: improve timing of checks depending on resources changes Oct 30, 2024
# Conflicts:
#	CHANGELOG.md
#	udata_hydra/crawl/helpers.py
#	udata_hydra/crawl/process_check_data.py
#	udata_hydra/routes/status.py
@bolinocroustibat bolinocroustibat changed the title feat: improve timing of checks depending on resources changes feat: improve timing of checks depending on resource/check changes Nov 6, 2024
@bolinocroustibat bolinocroustibat changed the title feat: improve timing of checks depending on resource/check changes feat: improve timing of checks depending on changes since last check Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[hydra] Improve management of unavalaible resources with variable delays between two checks
3 participants