fix: controlling issues #5756

Merged · 16 commits into develop from sandbox/control-issues · Aug 13, 2024

Conversation

@rangoo94 (Member) commented on Aug 13, 2024

Pull request description

Checklist (choose what's happened)

  • breaking change! (describe)
  • tested locally
  • tested on cluster
  • added new dependencies
  • updated the docs
  • added a test

Breaking changes

Changes

Fixes

@rangoo94 marked this pull request as ready for review on August 13, 2024 13:33
@rangoo94 requested a review from a team as a code owner on August 13, 2024 13:33
-if nodeName != prevNodeName || podIP != prevPodIP || prevStatus != status || prevCurrent != current {
+// TODO: the final status should always have the finishedAt too,
+// there should be no need for checking isFinished diff
+if nodeName != prevNodeName || isFinished != prevIsFinished || podIP != prevPodIP || prevStatus != status || prevCurrent != current {
Collaborator commented:

Such code will not be simple to maintain :(

@rangoo94 (Member, Author) replied on Aug 13, 2024:

I fully agree, although this code is not really meant to be maintained. This PR, along with the actual bug fixes, is meant to "enable" the new orchestration with a similar watching system.

WatchInstrumentedPod and TestWorkflowResult (mainly) have so many edge cases handled, so much clock calibration, and so many auto-healing mechanisms implemented that at this point it's probably better to just rewrite them based on the observations we have from the last few months of running Test Workflows. After that, we will likely not need conditions like these at all.

I'm guessing that half of the healing mechanisms and edge-case handlers are no longer needed, considering the iterations of orchestration improvements.
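For context, here is a minimal sketch of the snapshot-diff pattern the condition in the diff above implements: emit an update only when a field the watcher cares about has actually changed. The type and function names are hypothetical, not the actual Testkube code:

```go
package watcher

// snapshot is a hypothetical, trimmed-down view of the fields the
// watcher compares between consecutive updates.
type snapshot struct {
	nodeName   string
	podIP      string
	status     string
	current    string
	isFinished bool
}

// changed reports whether any watched field differs between the
// previous and current snapshot. Per the TODO in the diff, the
// isFinished comparison should become redundant once the final
// status reliably carries finishedAt.
func changed(prev, curr snapshot) bool {
	return curr.nodeName != prev.nodeName ||
		curr.podIP != prev.podIP ||
		curr.status != prev.status ||
		curr.current != prev.current ||
		curr.isFinished != prev.isFinished
}
```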

@rangoo94 merged commit fe27bb7 into develop on Aug 13, 2024
35 checks passed
@rangoo94 deleted the sandbox/control-issues branch on August 13, 2024 20:16
rangoo94 added a commit that referenced this pull request Aug 14, 2024
* fix: continue the paused container when an abort is requested

* fix: ensure the lightweight container watcher will get `finishedAt` timestamp

* chore: add minor todos

* fix: configure no preemption policy by default for Test Workflows

* fix: allow Test Workflow status notifier to update "Aborted" status with details

* fix: ensure the parallel workers will not end without result

* fix: properly build timestamps and detect the finished result in the TestWorkflowResult model

* fix: use Pod/Job StatusConditions for detecting the status, make watching more resilient to external problems, expose more Kubernetes error details

* chore: do not require job/pod events when fetching logs of parallel workers and services

* fixup unit tests

* fix: delete preemption policy setup

* fixup unit tests

* fix: adjust resume time to avoid negative duration (see the sketch after this list)

* fix: calibrate clocks (see the sketch after this list)

* chore: use consts

* fixup unit tests
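A minimal sketch of the two timestamp fixes from the list above ("adjust resume time to avoid negative duration" and "calibrate clocks"); all names are hypothetical, not the actual Testkube implementation:

```go
package timing

import "time"

// calibrate shifts a timestamp taken from a remote (e.g. in-pod) clock
// by the measured offset between the local and remote clocks.
func calibrate(remote time.Time, offset time.Duration) time.Time {
	return remote.Add(offset)
}

// pausedDuration computes how long a step was paused, clamping to zero
// so that uncalibrated clocks can never yield a negative duration.
func pausedDuration(pausedAt, resumedAt time.Time) time.Duration {
	d := resumedAt.Sub(pausedAt)
	if d < 0 {
		return 0
	}
	return d
}
```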
rangoo94 added a commit that referenced this pull request Aug 14, 2024 (same commit message as above)