Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-2011] Fix bug where another host could report the job as skipped, then the … #3888

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Will-Lo
Copy link
Contributor

@Will-Lo Will-Lo commented Mar 1, 2024

…skipped jobstatus is used to check for concurrent flows

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    There's a bug that causes GaaS multileader to kick off unintended concurrent flows which happens in the order described below:
  1. Host A checks the latest flow execution status to ensure the prior flow is not running, sees that the prior execution is still running.
  2. Host A fails the flow pending execution as it cannot run concurrent flow, this emits a FAILED event to GaaS which is ingested by the JobStatusMonitor.
  3. Host B checks the latest flow execution status, sees the current flow execution ID which is FAILED (considered a finished flow).
  4. Host B kicks off the pending flow execution when it shouldn't be.

To resolve this, we need to ensure that we are looking at the past 2 flow executions, and follow the behavior:

  1. If there is no prior execution, kick off the pending flow
  2. If the prior execution is IN PROGRESS, we want to indicate that there is a concurrent flow and block the pending execution.
  3. If the prior execution is FINISHED, then we want to kick off the pending execution (rely on the DagManager for deduplication of flows because we do not know if the host managing this pending flow is running behind the other hosts).

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Unit tests

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

…skipped jobstatus is used to check for concurrent flows
@codecov-commenter
Copy link

codecov-commenter commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 62.50000% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 46.66%. Comparing base (a78ee54) to head (74360c4).
Report is 1 commits behind head on master.

Files Patch % Lines
...obblin/service/monitoring/FlowStatusGenerator.java 62.50% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3888      +/-   ##
============================================
- Coverage     46.67%   46.66%   -0.01%     
+ Complexity    11154    11152       -2     
============================================
  Files          2219     2219              
  Lines         87657    87660       +3     
  Branches       9621     9624       +3     
============================================
- Hits          40911    40910       -1     
- Misses        43055    43057       +2     
- Partials       3691     3693       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

if (flowStatusList == null || flowStatusList.isEmpty()) {
return false;
}
FlowStatus flowStatus = flowStatusList.get(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a comment to make clear that the first one is the most recent and may or may not match the pending flowExecutionId attempt and the second index is an older one? It's a bit hard to keep track of what's going on here without reading ur commit desc

@arjun4084346
Copy link
Contributor

I think we should not set the status "Failed" when the last execution is running. We should instead emit a new event "SKIPPED". With this any further execution should be able to correctly decide whether a new job should run or not.
This will make the code more maintainable.

@Will-Lo
Copy link
Contributor Author

Will-Lo commented Mar 2, 2024

@arjun4084346 I agree that we should emit a new event SKIPPED but we will need to make sure that this is compatible with all clients, which can cause issues if they are set to expect only certain outputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants