-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: improve timing of checks depending on changes since last check #163
base: main
Are you sure you want to change the base?
Conversation
715a695
to
739ef30
Compare
32ac865
to
87d2c52
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what was the use of the temporary table, was it to make sure that it's threadsafe?
Other than a small nit, this looks good to me ;)
Also, didn't investigate why, but it seems you've got a failing test. |
# Conflicts: # CHANGELOG.md
776e189
to
463f229
Compare
13608ac
to
04d324a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, this is very nice and impressive, pretty complex piece of code!
I believe there's a typo that breaks the behavior that you may want to fix, other than that, I'm afraid I can't tell much more, in particular about the SQL query 😱
I'll approve this PR and let you decide if you want to go forward with this PR as is, or if you want to add tests to validate that the SQL query does what it's meant to do? Maybe there area already some that were there previously?
# Resource has not been modified since last check: | ||
# if the time since last check is greater than the longest delay in CHECK_DELAYS, next check will be after the longest delay | ||
# if the time since last check is less than CHECK_DELAYS[i], next check will be after this CHECK_DELAYS[i] | ||
previous_delay: timedelta = datetime.now(timezone.utc) - datetime.fromisoformat( | ||
last_check["created_at"] | ||
) | ||
if previous_delay > timedelta(hours=config.CHECK_DELAYS[-1]): | ||
next_check = datetime.now(timezone.utc) + timedelta(hours=config.CHECK_DELAYS[-1]) | ||
else: | ||
for d in config.CHECK_DELAYS: | ||
if previous_delay <= timedelta(hours=d): | ||
next_check = datetime.now(timezone.utc) + timedelta(hours=d) | ||
break | ||
check_data["next_check"] = next_check | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a hard time understanding what that was doing, maybe it's just me...
Would that be clearer like so (I may be missing some detail or edge cases):
# Resource has not been modified since last check: | |
# if the time since last check is greater than the longest delay in CHECK_DELAYS, next check will be after the longest delay | |
# if the time since last check is less than CHECK_DELAYS[i], next check will be after this CHECK_DELAYS[i] | |
previous_delay: timedelta = datetime.now(timezone.utc) - datetime.fromisoformat( | |
last_check["created_at"] | |
) | |
if previous_delay > timedelta(hours=config.CHECK_DELAYS[-1]): | |
next_check = datetime.now(timezone.utc) + timedelta(hours=config.CHECK_DELAYS[-1]) | |
else: | |
for d in config.CHECK_DELAYS: | |
if previous_delay <= timedelta(hours=d): | |
next_check = datetime.now(timezone.utc) + timedelta(hours=d) | |
break | |
check_data["next_check"] = next_check | |
# Resource has not been modified since last check: | |
# Re-check it after a delay just greater than the time elapsed since the last check. | |
since_last_check: timedelta = datetime.now(timezone.utc) - datetime.fromisoformat( | |
last_check["created_at"] | |
) | |
now: datetime = datetime.now(timezone.utc) | |
# At most, re-check after the greatest delay. | |
next_check = now + timedelta(hours=config.CHECK_DELAYS[-1]) | |
# Should we re-check it sooner? Loop through the CHECK_DELAYS from longest to shortest. | |
for delay in reversed(config.CHECK_DELAYS): | |
delay_delta: timedelta = timedelta(hours=delay) | |
if since_last_check <= delay_delta: | |
next_check = now + delay_delta | |
else: | |
# Previous delay was the last one greater than `since_last_check`, keep it. | |
break | |
check_data["next_check"] = next_check |
I realize my rewrite is slightly longer, it feels clearer to me, but it could very well be because i just (re)wrote it, feel free to simply dismiss this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thanks, although I wonder if I prefer going from the shorter delay to the longest... As the shortest should be the default (more safe), and going to longer delays should be considered as "fallbacks", if that makes sense.
I might rewrite a bit as a mix between your solution and mine though!
# Conflicts: # CHANGELOG.md
# Conflicts: # CHANGELOG.md # udata_hydra/crawl/helpers.py # udata_hydra/crawl/process_check_data.py # udata_hydra/routes/status.py
…ification date, as a standalone function
572c3e7
to
5df8670
Compare
# Conflicts: # CHANGELOG.md
# Conflicts: # CHANGELOG.md
# Conflicts: # CHANGELOG.md # udata_hydra/analysis/resource.py # udata_hydra/crawl/check_resources.py
# Conflicts: # CHANGELOG.md
Closes datagouv/data.gouv.fr#1312.
̶-̶ ̶R̶e̶m̶o̶v̶e̶s̶ ̶S̶Q̶L̶ ̶q̶u̶e̶r̶y̶ ̶w̶r̶a̶p̶p̶e̶r̶ ̶
̶s̶e̶l̶e̶c̶t̶_̶r̶o̶w̶s̶_̶b̶a̶s̶e̶d̶_̶o̶n̶_̶q̶u̶e̶r̶y̶
̶ ̶w̶h̶i̶c̶h̶ ̶w̶a̶s̶ ̶u̶s̶e̶d̶ ̶t̶o̶ ̶c̶r̶e̶a̶t̶e̶ ̶a̶ ̶t̶e̶m̶p̶o̶r̶a̶r̶y̶ ̶t̶a̶b̶l̶e̶ ̶a̶n̶d̶ ̶t̶e̶m̶p̶o̶r̶a̶r̶i̶l̶y̶ ̶m̶a̶r̶k̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶b̶e̶i̶n̶g̶ ̶c̶r̶a̶w̶l̶e̶d̶ ̶a̶s̶ ̶"̶c̶r̶a̶w̶l̶i̶n̶g̶"̶.̶̶S̶i̶n̶c̶e̶ ̶w̶e̶ ̶u̶p̶d̶a̶t̶e̶ ̶t̶h̶e̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶ ̶s̶t̶a̶t̶u̶s̶ ̶i̶n̶ ̶t̶h̶e̶ ̶c̶h̶e̶c̶k̶ ̶c̶o̶d̶e̶ ̶e̶l̶s̶e̶w̶h̶e̶r̶e̶,̶ ̶i̶t̶ ̶s̶h̶o̶u̶l̶d̶ ̶n̶o̶t̶ ̶b̶e̶ ̶n̶e̶c̶e̶s̶s̶a̶r̶y̶ ̶a̶n̶y̶m̶o̶r̶e̶ ̶t̶o̶ ̶h̶a̶v̶e̶ ̶t̶h̶i̶s̶ ̶t̶e̶m̶p̶o̶r̶a̶r̶y̶ ̶t̶a̶b̶l̶e̶.̶
̶̶̶O̶n̶ ̶r̶e̶v̶i̶e̶w̶,̶ ̶p̶l̶e̶a̶s̶e̶ ̶d̶o̶u̶b̶l̶e̶-̶c̶h̶e̶c̶k̶ ̶t̶h̶e̶ ̶v̶a̶l̶i̶d̶i̶t̶y̶ ̶o̶f̶ ̶t̶h̶i̶s̶ ̶r̶e̶m̶o̶v̶a̶l̶̶̶
̶-̶ ̶A̶d̶d̶ ̶
̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶_̶D̶E̶F̶A̶U̶L̶T̶
̶ ̶c̶o̶n̶f̶i̶g̶ ̶v̶a̶r̶:̶ ̶t̶i̶m̶e̶ ̶a̶f̶t̶e̶r̶ ̶w̶h̶i̶c̶h̶ ̶a̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶ ̶m̶u̶s̶t̶ ̶b̶e̶ ̶c̶h̶e̶c̶k̶ ̶i̶f̶ ̶w̶e̶ ̶d̶o̶n̶'̶t̶ ̶k̶n̶o̶w̶ ̶a̶b̶o̶u̶t̶ ̶i̶t̶s̶ ̶l̶a̶s̶t̶ ̶m̶o̶d̶i̶f̶i̶c̶a̶t̶i̶o̶n̶ ̶d̶a̶t̶e̶ ̶(̶r̶e̶p̶l̶a̶c̶e̶s̶ ̶t̶h̶e̶ ̶p̶r̶e̶v̶i̶o̶u̶s̶ ̶̶S̶I̶N̶C̶E̶
̶ ̶c̶o̶n̶f̶i̶g̶ ̶v̶a̶r̶)̶CHECK_DELAYS
config var: list of delays (̶p̶o̶s̶t̶g̶r̶e̶s̶q̶l̶ ̶i̶n̶t̶e̶r̶v̶a̶l̶ ̶s̶y̶n̶t̶a̶x̶)̶ between two checks, if the resource has not been modified̶-̶ ̶U̶p̶d̶a̶t̶e̶ ̶t̶h̶e̶ ̶
̶s̶e̶l̶e̶c̶t̶_̶b̶a̶t̶c̶h̶_̶r̶e̶s̶o̶u̶r̶c̶e̶s̶_̶t̶o̶_̶c̶h̶e̶c̶k̶
̶ ̶m̶e̶t̶h̶o̶d̶ ̶w̶i̶t̶h̶ ̶t̶h̶e̶ ̶f̶o̶l̶l̶o̶w̶i̶n̶g̶ ̶l̶o̶g̶i̶c̶:̶̶ ̶ ̶ ̶ ̶1̶.̶ ̶F̶i̶r̶s̶t̶ ̶a̶d̶d̶s̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶i̶t̶h̶ ̶p̶r̶i̶o̶r̶i̶t̶y̶=̶T̶r̶u̶e̶ ̶t̶o̶ ̶t̶h̶e̶ ̶b̶a̶t̶c̶h̶ ̶(̶_̶t̶h̶i̶s̶ ̶d̶o̶e̶s̶n̶'̶t̶ ̶c̶h̶a̶n̶g̶e̶_̶)̶
̶ ̶ ̶ ̶ ̶2̶.̶ ̶T̶h̶e̶n̶ ̶a̶d̶d̶s̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶i̶t̶h̶o̶u̶t̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶(̶=̶w̶i̶t̶h̶ ̶n̶o̶ ̶c̶h̶e̶c̶k̶,̶ ̶n̶e̶v̶e̶r̶ ̶h̶a̶v̶e̶ ̶b̶e̶e̶n̶ ̶c̶h̶e̶c̶k̶e̶d̶)̶ ̶t̶o̶ ̶t̶h̶e̶ ̶b̶a̶t̶c̶h̶ ̶(̶_̶t̶h̶i̶s̶ ̶d̶o̶e̶s̶n̶'̶t̶ ̶c̶h̶a̶n̶g̶e̶_̶)̶
̶ ̶ ̶ ̶ ̶3̶.̶ ̶T̶h̶e̶n̶,̶ ̶i̶f̶ ̶t̶h̶e̶ ̶t̶o̶t̶a̶l̶ ̶n̶u̶m̶b̶e̶r̶ ̶o̶f̶ ̶s̶e̶l̶e̶c̶t̶e̶d̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶i̶s̶ ̶s̶t̶i̶l̶l̶ ̶l̶e̶s̶s̶ ̶t̶h̶a̶n̶ ̶t̶h̶e̶ ̶b̶a̶t̶c̶h̶ ̶s̶i̶z̶e̶:̶
̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶3̶.̶1̶.̶ ̶a̶d̶d̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶h̶i̶c̶h̶ ̶d̶o̶n̶'̶t̶ ̶h̶a̶v̶e̶ ̶a̶t̶ ̶l̶e̶a̶s̶t̶ ̶t̶w̶o̶ ̶c̶h̶e̶c̶k̶s̶ ̶y̶e̶t̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶i̶s̶ ̶o̶l̶d̶e̶r̶ ̶t̶h̶a̶n̶ ̶
̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶_̶D̶E̶F̶A̶U̶L̶T̶
̶̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶3̶.̶2̶.̶ ̶a̶d̶d̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶h̶i̶c̶h̶ ̶a̶t̶ ̶l̶e̶a̶s̶t̶ ̶o̶n̶e̶ ̶t̶h̶e̶ ̶t̶h̶e̶i̶r̶ ̶t̶w̶o̶ ̶m̶o̶s̶t̶ ̶r̶e̶c̶e̶n̶t̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶s̶ ̶h̶a̶s̶ ̶a̶n̶ ̶u̶n̶k̶n̶o̶w̶n̶ ̶m̶o̶d̶i̶f̶i̶e̶d̶ ̶d̶a̶t̶e̶ ̶(̶n̶o̶ ̶
̶d̶e̶t̶e̶c̶t̶e̶d̶_̶l̶a̶s̶t̶_̶m̶o̶d̶i̶f̶i̶e̶d̶_̶a̶t̶
̶,̶ ̶s̶o̶ ̶w̶e̶ ̶c̶a̶n̶n̶o̶t̶ ̶c̶o̶m̶p̶a̶r̶e̶ ̶w̶i̶t̶h̶ ̶a̶n̶o̶t̶h̶e̶r̶ ̶c̶h̶e̶c̶k̶)̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶i̶s̶ ̶o̶l̶d̶e̶r̶ ̶t̶h̶a̶n̶ ̶̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶_̶D̶E̶F̶A̶U̶L̶T̶
̶̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶3̶.̶3̶.̶ ̶a̶d̶d̶ ̶r̶e̶s̶s̶o̶u̶r̶c̶e̶s̶ ̶t̶o̶ ̶w̶h̶i̶c̶h̶ ̶t̶h̶e̶i̶r̶ ̶t̶w̶o̶ ̶m̶o̶s̶t̶ ̶r̶e̶c̶e̶n̶t̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶s̶ ̶h̶a̶v̶e̶ ̶c̶h̶a̶n̶g̶e̶d̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶i̶s̶ ̶o̶l̶d̶e̶r̶ ̶t̶h̶a̶n̶ ̶
̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶S̶[̶0̶]̶
̶ ̶(̶t̶h̶i̶s̶ ̶i̶s̶ ̶t̶o̶ ̶r̶e̶-̶c̶h̶e̶c̶k̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶t̶h̶a̶t̶ ̶w̶e̶ ̶k̶n̶o̶w̶ ̶i̶t̶ ̶h̶a̶s̶ ̶c̶h̶a̶n̶g̶e̶d̶)̶̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶ ̶3̶.̶4̶.̶ ̶a̶d̶d̶ ̶r̶e̶s̶s̶o̶u̶r̶c̶e̶s̶ ̶t̶o̶ ̶w̶h̶i̶c̶h̶ ̶t̶h̶e̶i̶r̶ ̶l̶a̶s̶t̶ ̶t̶w̶o̶ ̶c̶h̶e̶c̶k̶s̶ ̶h̶a̶v̶e̶ ̶n̶o̶t̶ ̶c̶h̶a̶n̶g̶e̶d̶,̶ ̶t̶h̶e̶ ̶t̶w̶o̶ ̶c̶h̶e̶c̶k̶s̶ ̶h̶a̶v̶e̶ ̶d̶o̶n̶e̶ ̶b̶e̶t̶w̶e̶e̶n̶ ̶t̶w̶o̶ ̶d̶e̶l̶a̶y̶s̶ ̶i̶n̶ ̶
̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶S̶
̶,̶ ̶a̶n̶d̶ ̶t̶h̶e̶ ̶l̶a̶s̶t̶ ̶c̶h̶e̶c̶k̶ ̶i̶s̶ ̶o̶l̶d̶e̶r̶ ̶t̶h̶a̶n̶ ̶t̶h̶e̶ ̶s̶a̶m̶e̶ ̶d̶e̶l̶a̶y̶ ̶i̶n̶ ̶̶C̶H̶E̶C̶K̶_̶D̶E̶L̶A̶Y̶S̶
̶ ̶(̶t̶h̶i̶s̶ ̶i̶s̶ ̶i̶n̶ ̶o̶r̶d̶e̶r̶ ̶t̶o̶ ̶a̶v̶o̶i̶d̶ ̶c̶h̶e̶c̶k̶i̶n̶g̶ ̶t̶o̶o̶ ̶o̶f̶t̶e̶n̶ ̶t̶h̶e̶ ̶s̶a̶m̶e̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶ ̶w̶h̶i̶c̶h̶ ̶d̶o̶e̶s̶n̶'̶t̶ ̶c̶h̶a̶n̶g̶e̶)̶ ̶̶
̶-̶ ̶A̶d̶a̶p̶t̶ ̶t̶h̶e̶ ̶w̶a̶y̶ ̶w̶e̶ ̶c̶o̶u̶n̶t̶ ̶
̶p̶e̶n̶d̶i̶n̶g̶_̶c̶h̶e̶c̶k̶s̶
̶ ̶a̶n̶d̶ ̶̶f̶r̶e̶s̶h̶_̶c̶h̶e̶c̶k̶s̶
̶ ̶i̶n̶ ̶t̶h̶e̶ ̶̶/̶a̶p̶i̶/̶s̶t̶a̶t̶u̶s̶/̶c̶r̶a̶w̶l̶e̶r̶
̶ ̶e̶n̶d̶p̶o̶i̶n̶t̶ ̶r̶e̶s̶p̶o̶n̶s̶e̶:̶̶ ̶ ̶ ̶ ̶-̶ ̶
̶p̶e̶n̶d̶i̶n̶g̶_̶c̶h̶e̶c̶k̶s̶
̶ ̶a̶r̶e̶ ̶t̶h̶e̶ ̶n̶u̶m̶b̶e̶r̶ ̶o̶f̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶i̶t̶h̶ ̶n̶o̶ ̶c̶h̶e̶c̶k̶,̶ ̶p̶l̶u̶s̶ ̶t̶h̶e̶ ̶n̶u̶m̶b̶e̶r̶ ̶w̶i̶t̶h̶ ̶o̶u̶t̶d̶a̶t̶e̶d̶ ̶c̶h̶e̶c̶k̶s̶ ̶(̶o̶u̶t̶d̶a̶t̶e̶d̶ ̶c̶h̶e̶c̶k̶s̶ ̶a̶r̶e̶ ̶n̶o̶w̶ ̶c̶o̶u̶n̶t̶e̶d̶ ̶d̶i̶f̶f̶e̶r̶e̶n̶t̶l̶y̶,̶ ̶d̶e̶p̶e̶n̶d̶i̶n̶g̶ ̶o̶n̶ ̶t̶h̶e̶ ̶d̶e̶l̶a̶y̶s̶)̶̶ ̶ ̶ ̶ ̶ ̶-̶ ̶
̶f̶r̶e̶s̶h̶_̶c̶h̶e̶c̶k̶s̶
̶ ̶a̶r̶e̶ ̶t̶h̶e̶ ̶n̶u̶m̶b̶e̶r̶ ̶o̶f̶ ̶r̶e̶s̶o̶u̶r̶c̶e̶s̶ ̶w̶i̶t̶h̶ ̶a̶ ̶c̶h̶e̶c̶k̶,̶ ̶m̶i̶n̶u̶s̶ ̶t̶h̶e̶ ̶n̶u̶m̶b̶e̶r̶ ̶w̶i̶t̶h̶ ̶o̶u̶t̶d̶a̶t̶e̶d̶ ̶c̶h̶e̶c̶k̶s̶ ̶(̶o̶u̶t̶d̶a̶t̶e̶d̶ ̶c̶h̶e̶c̶k̶s̶ ̶a̶r̶e̶ ̶n̶o̶w̶ ̶c̶o̶u̶n̶t̶e̶d̶ ̶d̶i̶f̶f̶e̶r̶e̶n̶t̶l̶y̶,̶ ̶d̶e̶p̶e̶n̶d̶i̶n̶g̶ ̶o̶n̶ ̶t̶h̶e̶ ̶d̶e̶l̶a̶y̶s̶)̶Add a migration to add a
next_check_at
datetime column to thechecks
tableAdd logic to calculate
next_check_at
datetime when processing the check, depending on the date f the previous check and if it has changed since then, in a dedicatedcalculate_next_check_date
methodModify logic for
select_batch
in order to select resources that have a related last check that have an expirednext check
dateRefactor
store_last_modified_date
method to also store the next check date, and rename it toupdate_check_with_modification_and_next_dates
Rename
process_check_data
topreprocess_check_data
for more clarity