Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for change in UTC offset when calculating next schedule #35887

Merged
merged 8 commits into from
Dec 6, 2023

Conversation

uranusjr
Copy link
Member

@uranusjr uranusjr commented Nov 27, 2023

Restarting the effort in #30083 from scratch from some failing test cases we want to fulfill. Some changes in the other PR may be cherry-picked into here at some point.

Fix #7999 eventually.

@uranusjr
Copy link
Member Author

uranusjr commented Dec 1, 2023

Hey this is actually works better than I expected, just one failing test that doesn’t seem to be related to folding. Maybe the actual logic isn’t that complicated.

@uranusjr
Copy link
Member Author

uranusjr commented Dec 3, 2023

I did some additional debugging and it seems like the root issue is croniter does not take account fold (or Pendulum’s transition rule, they’re fundamentally the same thing) correctly. I’ve illustrated the issue in an upstream bug report: kiorky/croniter#56

Since we essentially treats croniter as a black box, I don’t think there is a way to detect this issue in Airflow, and every workaround would only result in another bug somewhere else. I believe the only viable fix would be to fix croniter, unfortunately.

@uranusjr
Copy link
Member Author

uranusjr commented Dec 3, 2023

A minor plus is I found a bug in CPython while debugging this. python/cpython#112638

@uranusjr
Copy link
Member Author

uranusjr commented Dec 4, 2023

I implemented a workaround for “common” cases that I can think of. The only logic I really changed is how we decide to apply the DST workaround—instead of the weird “fixed” concept, I peaked into the cron expression to figure out whether the DAG is run hourly or more frequentm and the workaround is applied if it is. This covers all the cron cases I can think of, and should not induce a backward incompatiblity (outside of fixing the logically incorrect behaviour as described in the original bug).

@uranusjr uranusjr marked this pull request as ready for review December 4, 2023 11:01
@timkpaine
Copy link
Contributor

This looks similar to something I tried to do on the other PR but it didn't work in all cases. I think the example that broke was for both DST transitions, does '0 9 * * *' and '0 9,10 * * *' start at the same time (9am local) the day of the transition.

@uranusjr
Copy link
Member Author

uranusjr commented Dec 5, 2023

There is a test case for 7,8 and that works. I don’t think 9,10 is somehow different? And how is 0 9 * * * special? It’s not logically different from the provided test_exiting_exact.

@timkpaine
Copy link
Contributor

Let me throw a few tests I have locally at this PR and see if anything fails

@timkpaine
Copy link
Contributor

Ok nvm! I think this is good, here is my test (adapted to Zurich):

import pendulum
import datetime
from airflow.models.dag import DAG
from airflow.utils import timezone
from airflow.timetables.interval import CronDataIntervalTimetable

# Zurich (Switzerland) is chosen since it is +1/+2 DST, making it a bit easier
# to get my head around the mental timezone conversion.
# In 2023, DST entered on 26th Mar, 2am local clocks (1am UTC) were turned
# forward to 3am. DST exited on 29th Oct, 3am local clocks (1am UTC) were
# turned backward to 2am (making the 2:XX hour fold).

# Interval cron
cron_interval = CronDataIntervalTimetable("0 0,3 * * *", timezone=pendulum.timezone("Europe/Zurich"))
# Fixed cron
cron_fixed = CronDataIntervalTimetable("0 3 * * *", timezone=pendulum.timezone("Europe/Zurich"))

# ENTERING
# These two last executed at 00:00 today and 03:00 yesterday, respectively.
# Both of them should next execute at 3am local time today
last_run_interval = pendulum.datetime(2023, 3, 26, 0, tz="Europe/Zurich").in_timezone("UTC")
last_run_fixed = pendulum.datetime(2023, 3, 25, 3, tz="Europe/Zurich").in_timezone("UTC")

# these should be 2023-03-25T23:00:00+00:00 and 2023-03-24T23:00:00+00:00
expected_last_run_interval_utc_iso = "2023-03-25T23:00:00+00:00"
expected_last_run_fixed_utc_iso = "2023-03-25T02:00:00+00:00"
print(last_run_interval.isoformat(), last_run_interval.isoformat() == expected_last_run_interval_utc_iso)
print(last_run_fixed.isoformat(), last_run_fixed.isoformat() == expected_last_run_fixed_utc_iso)

# these should be 2023-03-26T00:00:00+01:00 and 2023-03-25T00:00:00+01:00
expected_last_run_interval_local_iso = "2023-03-26T00:00:00+01:00"
expected_last_run_fixed_local_iso = "2023-03-25T03:00:00+01:00"
print(last_run_interval.in_timezone("Europe/Zurich").isoformat(), last_run_interval.in_timezone("Europe/Zurich").isoformat() == expected_last_run_interval_local_iso)
print(last_run_fixed.in_timezone("Europe/Zurich").isoformat(), last_run_fixed.in_timezone("Europe/Zurich").isoformat() == expected_last_run_fixed_local_iso)


# Now do next run
next_run_interval = cron_interval._get_next(last_run_interval)
next_run_fixed = cron_fixed._get_next(last_run_fixed)

# these should be 2023-03-26T01:00:00+00:00
expected_next_run_interval_utc_iso = "2023-03-26T01:00:00+00:00"
expected_next_run_fixed_utc_iso = "2023-03-26T01:00:00+00:00"
print(next_run_interval.isoformat(), next_run_interval.isoformat() == expected_next_run_interval_utc_iso)
print(next_run_fixed.isoformat(), next_run_fixed.isoformat() == expected_next_run_fixed_utc_iso)

expected_next_run_interval_local_iso = "2023-03-26T03:00:00+02:00"
expected_next_run_fixed_local_iso = "2023-03-26T03:00:00+02:00"
print(next_run_interval.in_timezone("Europe/Zurich").isoformat(), next_run_interval.in_timezone("Europe/Zurich").isoformat() == expected_next_run_interval_local_iso)
print(next_run_fixed.in_timezone("Europe/Zurich").isoformat(), next_run_fixed.in_timezone("Europe/Zurich").isoformat() == expected_next_run_fixed_local_iso)


# LEAVING
# These two last executed at 00:00 today and 03:00 yesterday, respectively.
# Both of them should next execute at 3am local time today
last_run_interval = pendulum.datetime(2023, 10, 29, 0, tz="Europe/Zurich").in_timezone("UTC")
last_run_fixed = pendulum.datetime(2023, 10, 28, 3, tz="Europe/Zurich").in_timezone("UTC")

# these should be 2023-10-28T22:00:00+00:00 and 2023-10-28T01:00:00+00:00
expected_last_run_interval_utc_iso = "2023-10-28T22:00:00+00:00"
expected_last_run_fixed_utc_iso = "2023-10-28T01:00:00+00:00"
print(last_run_interval.isoformat(), last_run_interval.isoformat() == expected_last_run_interval_utc_iso)
print(last_run_fixed.isoformat(), last_run_fixed.isoformat() == expected_last_run_fixed_utc_iso)

# these should be 2023-10-29T00:00:00+02:00 and 2023-10-28T03:00:00+02:00
expected_last_run_interval_local_iso = "2023-10-29T00:00:00+02:00"
expected_last_run_fixed_local_iso = "2023-10-28T03:00:00+02:00"
print(last_run_interval.in_timezone("Europe/Zurich").isoformat(), last_run_interval.in_timezone("Europe/Zurich").isoformat() == expected_last_run_interval_local_iso)
print(last_run_fixed.in_timezone("Europe/Zurich").isoformat(), last_run_fixed.in_timezone("Europe/Zurich").isoformat() == expected_last_run_fixed_local_iso)


# Now do next run
next_run_interval = cron_interval._get_next(last_run_interval)
next_run_fixed = cron_fixed._get_next(last_run_fixed)

# these should be 2023-03-26T01:00:00+00:00
expected_next_run_interval_utc_iso = "2023-10-29T02:00:00+00:00"
expected_next_run_fixed_utc_iso = "2023-10-29T02:00:00+00:00"
print(next_run_interval.isoformat(), next_run_interval.isoformat() == expected_next_run_interval_utc_iso)
print(next_run_fixed.isoformat(), next_run_fixed.isoformat() == expected_next_run_fixed_utc_iso)

expected_next_run_interval_local_iso = "2023-10-29T03:00:00+01:00"
expected_next_run_fixed_local_iso = "2023-10-29T03:00:00+01:00"
print(next_run_interval.in_timezone("Europe/Zurich").isoformat(), next_run_interval.in_timezone("Europe/Zurich").isoformat() == expected_next_run_interval_local_iso)
print(next_run_fixed.in_timezone("Europe/Zurich").isoformat(), next_run_fixed.in_timezone("Europe/Zurich").isoformat() == expected_next_run_fixed_local_iso)

Main:

2023-03-25T23:00:00+00:00 True
2023-03-25T02:00:00+00:00 True
2023-03-26T00:00:00+01:00 True
2023-03-25T03:00:00+01:00 True
2023-03-26T02:00:00+00:00 False
2023-03-26T01:00:00+00:00 True
2023-03-26T04:00:00+02:00 False
2023-03-26T03:00:00+02:00 True
2023-10-28T22:00:00+00:00 True
2023-10-28T01:00:00+00:00 True
2023-10-29T00:00:00+02:00 True
2023-10-28T03:00:00+02:00 True
2023-10-29T01:00:00+00:00 False
2023-10-29T02:00:00+00:00 True
2023-10-29T02:00:00+01:00 False
2023-10-29T03:00:00+01:00 True
2023-03-25T23:00:00+00:00 True
2023-03-25T02:00:00+00:00 True
2023-03-26T00:00:00+01:00 True
2023-03-25T03:00:00+01:00 True
2023-03-26T01:00:00+00:00 True
2023-03-26T01:00:00+00:00 True
2023-03-26T03:00:00+02:00 True
2023-03-26T03:00:00+02:00 True
2023-10-28T22:00:00+00:00 True
2023-10-28T01:00:00+00:00 True
2023-10-29T00:00:00+02:00 True
2023-10-28T03:00:00+02:00 True
2023-10-29T02:00:00+00:00 True
2023-10-29T02:00:00+00:00 True
2023-10-29T03:00:00+01:00 True
2023-10-29T03:00:00+01:00 True

Copy link
Contributor

@timkpaine timkpaine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just needs a news fragment, mine was here: https://github.com/timkpaine/airflow/blob/ebfbbfa82c625e9dc70994c92dcac4ccf59e6d3b/newsfragments/30083.significant.rst but probably can do better wording now that you've fully investigated the problems

Comment on lines +867 to +878

from airflow.timetables._cron import CronMixin

if not isinstance(self.timetable, CronMixin):
return True

from croniter import croniter

cron = croniter(self.timetable._expression)
next_a = cron.get_next(datetime.datetime)
next_b = cron.get_next(datetime.datetime)
return next_b.minute == next_a.minute and next_b.hour == next_a.hour
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's not used anywhere, and should not be used anywhere, why make a change to it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah because you remove _should_fix_dst

Comment on lines +48 to +52
While this technically happens for all runs (in such a timezone), we only
really care about runs that happen at least once an hour, and can
provide a somewhat reasonable rationale to skip the fold hour for things
such as ``*/2`` (every two hour). So we try to *minially* peak into croniter
internals to work around the issue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
While this technically happens for all runs (in such a timezone), we only
really care about runs that happen at least once an hour, and can
provide a somewhat reasonable rationale to skip the fold hour for things
such as ``*/2`` (every two hour). So we try to *minially* peak into croniter
internals to work around the issue.
While this technically happens for all cron schedules (in such a timezone), we only
really care about schedules that run at least once an hour; for schedules that run
less frequently, such as ``*/2`` (every two hours), it seems reasonable to skip the
fold hour. So we try to *minially* peak into croniter internals to work around the issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mainly this suggestion

(1) tries to clarify "happens for all runs" vs "happens for all cron schedules". A given run is just a single instant in time but this is more about the schedule, i.e. computing next run from this run

and (2) suggests to just state the rationale rather than saying "we can provide" but don't provide

@uranusjr
Copy link
Member Author

uranusjr commented Dec 6, 2023

Does this need a significant fragment? I’m assuming it does not since there’s no backward incompatibility.

# Last run before DST. Interval starts and ends on 2am UTC (local time is +1).
next_info = timetable.next_dagrun_info(last_automated_data_interval=None, restriction=restriction)
assert next_info and next_info.data_interval == DataInterval(
pendulum.datetime(2023, 3, 24, 2, tz=TIMEZONE),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it strikes me that since this assumes TIMEZONE will be UTC (i think) we should just specify UTC here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me, fwiw, i think it would also be easier to reason about the tests if we compared in native UTC datetimes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think all TIMEZONE in this file should be replaced by UTC. I did this for consistency so a later PR can just change them all.

Copy link
Contributor

@dstandish dstandish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@uranusjr uranusjr merged commit c4549d7 into apache:main Dec 6, 2023
49 checks passed
@uranusjr uranusjr deleted the cron-dst-follow branch December 6, 2023 07:40
@uranusjr
Copy link
Member Author

uranusjr commented Dec 6, 2023

Ah damn I forgot to apply the docstring change. Will do a follow up PR.

@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.0 milestone Dec 6, 2023
@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Dec 6, 2023
ephraimbuddy pushed a commit that referenced this pull request Dec 6, 2023
@timkpaine
Copy link
Contributor

Does this need a significant fragment? I’m assuming it does not since there’s no backward incompatibility.

Doesn't make a difference to me but this does change how certain things are scheduled. Does a schedule being run differently count as significant?

@dstandish
Copy link
Contributor

There's a bugfix newsfragment type I think. Isn't this a bugfix? Doesn't every bugfix imply some kind of behavior change? I guess what's the big change here in behavior? Previously when a dag with this kind of schedule crossed a boundary the dag would get stuck (and if catchup=True, stuck forever). That should no longer be the case.

@timkpaine
Copy link
Contributor

It changes when things will get scheduled, see my code above. So the same schedule will run at different times after upgrade

@dstandish
Copy link
Contributor

dstandish commented Dec 6, 2023

Well, let's assess and articulate what the actual behavior change is. We'll need that anyway, however we categorize it.

I took a look at your code and refactored a bit to better understand what the actual change is.

Here's my version

from __future__ import annotations

from datetime import datetime

import pendulum

from airflow.timetables.interval import CronDataIntervalTimetable

# Zurich (Switzerland) is chosen since it is +1/+2 DST, making it a bit easier
# to get my head around the mental timezone conversion.
# In 2023, DST entered on 26th Mar, 2am local clocks (1am UTC) were turned
# forward to 3am. DST exited on 29th Oct, 3am local clocks (1am UTC) were
# turned backward to 2am (making the 2:XX hour fold).

CET = "Europe/Zurich"


def to_local_iso(val: str):
    return pendulum.parse(val).in_timezone(CET).isoformat()


def run_check(starting_run, timetable, exp_utc_iso):
    print(f"when last run is\t\t\t{starting_run.in_timezone(CET)} / {starting_run}...")
    exp_local = to_local_iso(exp_utc_iso)
    print(f"then next run should be\t\t{exp_local} / {exp_utc_iso}")
    next_run = timetable._get_next(starting_run)
    if (dt := next_run.isoformat()) and dt == exp_utc_iso:
        print("and it is!")
    else:
        print(f"but instead it is\t\t\t{to_local_iso(dt)} / {dt}")
        delta = (pendulum.parse(dt) - pendulum.parse(exp_utc_iso)).in_minutes()
        print(f"this is {abs(delta)} minutes {'ahead of' if delta < 0 else 'behind'} schedule")
    print()


timetable_irregular = CronDataIntervalTimetable("0 0,3 * * *", timezone=pendulum.timezone(CET))
starting_run1 = pendulum.datetime(2023, 3, 26, 0, tz=CET).in_timezone("UTC")
run_check(
    starting_run=starting_run1,
    timetable=timetable_irregular,
    exp_utc_iso="2023-03-26T01:00:00+00:00",
)
starting_run2 = pendulum.datetime(2023, 10, 29, 0, tz=CET).in_timezone("UTC")
expected_next_run = (
    pendulum.instance(datetime(2023, 10, 29, 3, fold=1, tzinfo=pendulum.timezone(CET)))
    .in_timezone("UTC")
    .isoformat()
)
assert expected_next_run == "2023-10-29T02:00:00+00:00"
run_check(
    starting_run=starting_run2,
    timetable=timetable_irregular,
    exp_utc_iso=expected_next_run,  # runs on the second occurrence of 3am
)

and i think this is what the change boils down to...

before the change this is the output:

when last run is			2023-03-26T00:00:00+01:00 / 2023-03-25T23:00:00+00:00...
then next run should be		2023-03-26T03:00:00+02:00 / 2023-03-26T01:00:00+00:00
but instead it is			2023-03-26T04:00:00+02:00 / 2023-03-26T02:00:00+00:00
this is 60 minutes behind schedule

when last run is			2023-10-29T00:00:00+02:00 / 2023-10-28T22:00:00+00:00...
then next run should be		2023-10-29T03:00:00+01:00 / 2023-10-29T02:00:00+00:00
but instead it is			2023-10-29T02:00:00+01:00 / 2023-10-29T01:00:00+00:00
this is 60 minutes ahead of schedule

Caveat, this is all very confusing and hard to wrap one's head around but that said, let's discusse the two cases.

In case 1, spring forward, the next run is supposed to at 3am (2 hours after the last run, since an hour is skipped) but instead the next run is calculated as 4am, which is simply the wrong time by any assessment. So one behavior change is, instead of running at 4am, it will run at 3am, the correct clock time. This to me seems like a bugfix.

Side note: we might also want to verify the behavior for a 0,2 schedule, since i think in spring ahead, there is never a 2am.

In case 2, fall back, the next run is supposed to be the 3am. In actuality, this is an unambiguous time reference. At fall back, the clock only strikes 3am one time. I.e. after 2:59 am, it goes to 2:00 again -- not first to 3am and then back to 2am. Which is to say, the next run is at the wrong clock time again -- 2am instead of 3am. So again, to me this looks like a bugfix.

The other thing to note is that we also observed this year that with fall back, the dag would actually get stuck (if catchup=True, then forever, or until next run is past transition otherwise). So, presuming this fixes that, it's a bugfix on another layer.

Note, I'm not arguing to "win", just trying to further our own understandings of what this is and how best to represent this to users. Do I understand this correctly? Do you think I'm describing the details correctly?

@timkpaine
Copy link
Contributor

timkpaine commented Dec 6, 2023

There is a broader conversation around "bug fix" vs "feature change" in the original ticket but yes on all parts and my stance on the original ticket is that it's bug fixes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect DAG scheduling after DST
4 participants