[JENKINS-58085] Fix issue where stages are shown as waiting to start when they are running/completed #2017

Merged
merged 2 commits into jenkinsci:master from dwnusbaum:JENKINS-58085 on Sep 3, 2019

Conversation

dwnusbaum
Member

@dwnusbaum dwnusbaum commented Aug 21, 2019

Description

See JENKINS-58085/JENKINS-53816 and some tickets linked from those.

This PR tweaks the fix for JENKINS-53816 added in #1961.

In some cases, when a build's flow graph changes from one build to the next (because the Pipeline was changed, or perhaps due to the result of a when condition or some dynamic behavior in a Scripted Pipeline) but stays similar enough that Blue Ocean considers the overall graph structure the same, Blue Ocean would assign one of the nodes in the current build the node ID of what it thought was the corresponding node from the previous build. If the node's ID was not actually the same in both builds, the new node would be shown as never having started and would have no steps associated with it, because the API calls for that node ID targeted some other node entirely, perhaps one that wasn't even a stage and so had no children. Once the build completed, the node ID would be replaced with the actual node ID from the build, and things would be displayed correctly.
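
To make the failure mode concrete, here is a minimal, self-contained sketch of the id drift (the stage names and ids are illustrative assumptions, not taken from a real build, and plain maps stand in for the real flow graph types):

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: reusing a node id from the previous build can
// point at a completely different node in the current build.
public class NodeIdDriftSketch {
    public static void main(String[] args) {
        // Previous build: stage name -> FlowNode id.
        Map<String, String> previousBuild = new LinkedHashMap<>();
        previousBuild.put("Par 2", "17");

        // Current build: an added `when` block shifted the ids of later nodes.
        Map<String, String> currentBuild = new LinkedHashMap<>();
        currentBuild.put("when condition", "17"); // old id now names another node
        currentBuild.put("Par 2", "19");

        // Matching "Par 2" structurally but keeping the old id means API calls
        // for the running build hit a node with no step children, so the stage
        // renders as "waiting to start".
        String reusedId = previousBuild.get("Par 2");
        System.out.println("Reused id " + reusedId + " points at: "
                + nameOf(currentBuild, reusedId)); // -> "when condition"
    }

    private static String nameOf(Map<String, String> build, String id) {
        return build.entrySet().stream()
                .filter(e -> e.getValue().equals(id))
                .map(Map.Entry::getKey)
                .findFirst().orElse("nothing");
    }
}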

I'm not sure if this covers all of the issues users describe in JENKINS-58085/JENKINS-53816, but it fixes at least one problem I found. Just opening a draft PR for now while I try to come up with a regression test for the bug.

Here is the Pipeline I was using for testing:
pipeline {
    agent none
    stages {
        stage("Before Parallel") {
            steps {
                echo "starting"
                sleep 15
                echo "done"
            }
        }
        stage("Parallel") {
            parallel {
                stage("Par 1") {
                    stages {
                        stage("Par 1 - Seq 1") {
                            steps {
                                echo "starting"
                                sleep 5
                                echo "done"
                            }
                        }
                        stage("Par 1 - Seq 2") {
                            steps {
                                echo "starting"
                                sleep 10
                                echo "done"
                            }
                        }
                    }
                }
                stage("Par 2") {
                    // Comment and uncomment this `when` from one build to the next to see the bug I am trying to fix.
                    when {
                        expression { true == true }
                    }
                    stages {
                        stage("Par 2 - Seq 1") {
                            steps {
                                echo "starting"
                                sleep 10
                                echo "done"
                            }
                        }
                        stage("Par 2 - Seq 2") {
                            steps {
                                echo "starting"
                                sleep 10
                                echo "done"
                            }
                        }
                    }
                }
                stage("Par 3") {
                    steps {
                        echo "starting"
                        sleep 5
                        echo "done"
                    }
                }
            }
        }
        stage("After Parallel") {
            steps {
                echo "starting"
                sleep 15
                echo "done"
            }
        }
    }
}

Submitter checklist

  • Link to JIRA ticket in description, if appropriate.
  • Change is code complete and matches issue description
  • Appropriate unit or acceptance tests, or an explanation as to why this change has no tests
  • Reviewer's manual test instructions provided in PR description. See Reviewer's first task below.

Reviewer checklist

  • Ran the changes and verified that the change matches the issue description
  • Reviewed the code
  • Verified that the appropriate tests have been written or valid explanation given

@@ -710,14 +711,14 @@ public BluePipelineStep getPipelineNodeStep(String id, Link parent) {

// New wrapper object based on current execution
FlowNodeWrapper newNodeWrapper = new FlowNodeWrapper(
oldNodeWrapper.getNode(), // Use old node because we want to show the final id not intermediary
currentNodeWrapper.getNode(),
Member Author

This is the main fix. The old comment doesn't make sense to me. My understanding is that in this case, we have a real node from the current Pipeline. Its ID will never change, even when the build completes, so we already have the "final" ID. Because the code checks for a match using findNodeWrapperByParentageIn if it doesn't find a direct ID match, the ID of the real node might not match the ID of the corresponding node from the old build. This leads to a bug when we use that ID in reference to the current build in other contexts where we expect it to be a stage/branch or whatever, since that ID might be something totally different in the current build.

Member Author

Maybe we should just delete newNodeWrapper and use currentNodeWrapper directly now that we aren't changing the fields?

Collaborator

The root problem is the node ids do change during a run, thanks to the way this code walks the graph backwards instead of forwards.

Collaborator

Changing this will fix some of the bad graphs, and break others that are currently working. I used the old node because it's from a finished run, which has the "final" node ids.

Member Author

> The root problem is the node ids do change during a run, thanks to the way this code walks the graph backwards instead of forwards.

The ID of an actual Pipeline FlowNode in a build will never change once it is added to the flow graph; whether we walk the graph in reverse does not matter. In this particular loop, as far as I can tell, we are looping through FlowNodeWrappers for FlowNodes that already exist in the flow graph of the current build, so their IDs will never change.

If you mean the problem is that a FlowNodeWrapper that references a node from the current build can end up referencing a different node once the build progresses further (because we are mutating the FlowNodeWrapper, or because something is unstable in the visitor), that seems like a bug, and I am happy to look into fixing that here too if we can come up with a reproduction case.

> I used the old node because it's from a finished run, which has the "final" node ids.

The problem is that you absolutely cannot count on the node IDs from a different build matching the node IDs in the current build. It happens to work in some cases, but findNodeByParentageIn is not strict enough to guarantee that two nodes matching by that criteria also have matching IDs, and nothing in a Pipeline will change the ID of a FlowNode after it has been added to the flow graph to make the IDs match after the fact.

Further on in this code, the places where we use the IDs from the old build in what we return are also a bit suspect, because one of the old IDs might refer to a different node in the current Pipeline; but since those nodes are marked as having not started, it seems to not really matter. It would be safer to give the fake FlowNodeWrappers IDs that could not possibly be valid in the current Pipeline, but I don't know what assumptions the rest of the code makes about the IDs of the fake FlowNodeWrappers.
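
A minimal sketch of that "impossible id" idea (the helper and prefix are my assumption, not anything the code does today; real FlowNode ids are numeric strings assigned by the flow execution, so a non-numeric prefix can never collide with a valid id):

// Hypothetical helper, not part of this PR: derive a placeholder id for a
// fake FlowNodeWrapper that cannot collide with any real FlowNode id.
static String placeholderId(String oldBuildNodeId) {
    return "future-" + oldBuildNodeId; // e.g. "future-17"
}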

Member Author

Ok, I came up with two minimal test cases where nodes that show up while the build is in progress do not appear in the final graph. Both of them involve a parallel step with only one branch, so they're not particularly realistic, but they should help to figure out what is going on (and to check if my change affects the behavior or not).

Here is the Declarative test case:

pipeline {
  agent none
  stages {
    stage('outer') {
      parallel {
        stage('inner') {
          steps {
            sleep 5
          }
        }
      }
    }
  }
}

Here is the scripted test case:

stage('outer') {
  parallel inner: {
    stage('inner') {
        sleep 5
    }
  }
}

In both cases, outer is shown in a separate bubble preceding inner until the build completes, at which point only inner is shown.

Member Author

Ok, looking into those test cases, the main difference in the graph of FlowNodeWrappers between a running and a completed build is that while the build is running, the FlowNodeWrapper.type of the node for inner is STAGE, but once the build completes it becomes PARALLEL. I'll keep looking to understand why, and will either fix it or at least add an ignored test case, although I'll probably open a separate PR for that.

Collaborator

I couldn't reproduce all of the situations that cause node drift reliably; it's definitely timing-sensitive. I think it occurs only when parallel stages are just starting or just ending, such that the nodes exist but not all of the events for those nodes do yet. I don't have more detail to give, other than my conclusion that it would be quicker to build a new graph visitor that uses the newer Pipeline API behind a feature flag than to debug the exact event sequence that triggers it. Fresh eyes may be able to find it, though.

Collaborator

Another possibility might be adopting a different id scheme that is tied to the structure of the graph (akin to an XPath) rather than reusing the ids given to the Pipeline nodes by Jenkins.
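
A rough sketch of what such a structural id might look like (entirely hypothetical; nothing like this exists in the plugin today, and the path format is an assumption):

import java.util.List;

// Hypothetical structural id: the path of enclosing stage/branch names plus a
// sibling index, which stays stable across builds as long as the graph shape
// does, unlike the raw FlowNode ids.
static String structuralId(List<String> enclosingNames, int siblingIndex) {
    return "/" + String.join("/", enclosingNames) + "[" + siblingIndex + "]";
}

// structuralId(List.of("Parallel", "Par 2", "Par 2 - Seq 1"), 0)
//   -> "/Parallel/Par 2/Par 2 - Seq 1[0]"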

Member Author

Yeah, from what I can tell, the problem for my test cases from #2017 (comment) is that ForkScanner.visitSimpleChunks, which is used by PipelineNodeGraphVisitor, makes it difficult to identify what chunks belong to a parallel branch when the branch is not yet completed. In particular, when the build is still in progress, this condition is true for inner:

Because of that, parallelStart exits early, and we won't identify inner as having type parallel until after it is complete, when parallelBranchEndNode is called to set up the data for parallelStart. I don't see any obvious fix for that over here; I think it would make more sense to adapt ForkScanner to call some new SimpleChunkVisitor method named inProgressParallelBranchEnd or something in this case, so that PipelineNodeGraphVisitor doesn't have to reverse engineer what is happening.
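
A sketch of what that callback might look like (the interface name, method, and signature are my assumption; SimpleChunkVisitor, ForkScanner, and FlowNode are the real workflow-api types):

import org.jenkinsci.plugins.workflow.graph.FlowNode;
import org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner;
import org.jenkinsci.plugins.workflow.graphanalysis.SimpleChunkVisitor;

// Entirely hypothetical extension point: ForkScanner would invoke this when it
// reaches the head of a parallel branch that has started but not yet ended, so
// visitors could classify in-progress branches as parallel immediately instead
// of waiting for parallelBranchEnd to fire after the branch completes.
public interface InProgressAwareChunkVisitor extends SimpleChunkVisitor {
    void inProgressParallelBranchEnd(FlowNode parallelStartNode,
                                     FlowNode branchHead,
                                     ForkScanner scanner);
}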

That said, my change doesn't affect these cases as far as I can tell; they are broken with or without my fix, and the problem is the FlowNodeWrapper.type, not the ID. Because the type doesn't match, FlowNodeWrapper.probablySameNode will return false in PipelineNodeGraphVisitor.union here:

Because the overall condition is true, the graphs will be marked as incompatible, and union will break out of the loop that contains my changes and just show the nodes from the current build, so these cases are unaffected by my changes.

For now, I can't think of an example where my changes make things worse, so I am going to mark this PR as ready for review. If anyone comes up with such an example, or if the PR causes any regressions, I am happy to look into them.

@@ -687,7 +687,8 @@ public BluePipelineStep getPipelineNodeStep(String id, Link parent) {

boolean graphsAreCompatible = true;

Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size()); // indexed by id
// A map from the ID of a node from previousNodes to the corresponding node for the current build.
Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size());
Member Author

I thought it would be useful to clarify that the IDs here come from the old build, not the new build, and I updated the names of the variables that get put into the map to try to make that clearer as well.

@halkeye
Member

halkeye commented Aug 21, 2019

AFAIK, Josh has worked the most in this space, so I added him as a reviewer to get feedback.

currentNodeWrapper.getStatus(),
currentNodeWrapper.getTiming(),
currentNodeWrapper.getInputStep(),
run
);

newNodes.put(nodeId, newNodeWrapper);
newNodes.put(oldNodeWrapper.getId(), newNodeWrapper);
Member Author

Ok, so with some more testing, it turns out that this is actually the only change needed to fix the test case I am working with. The problem is that nodeId is an ID from the current build, while all other gets and puts on newNodes use an ID from the old build, so when the IDs don't match across builds, the logic in the rest of this method can get mixed up. Regardless of the discussion in #2017 (comment), I think we should use the ID from the old node here so that all of the keys in newNodes are IDs from the old build.
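
To state the keying invariant in code (a simplified sketch, not the actual method; findCorrespondingCurrentNode is a hypothetical stand-in for the direct-id and findNodeWrapperByParentageIn lookups):

// Simplified sketch: every key in newNodes must be an id from the *previous*
// build, never from the current one, so all gets and puts stay consistent.
Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size());
for (FlowNodeWrapper oldNodeWrapper : previousNodes) {
    FlowNodeWrapper currentNodeWrapper = findCorrespondingCurrentNode(oldNodeWrapper);
    if (currentNodeWrapper != null) {
        // Key by the old build's id, not currentNodeWrapper.getId().
        newNodes.put(oldNodeWrapper.getId(), currentNodeWrapper);
    }
}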

{
List<FlowNodeWrapper> nodes = unionPipelineNodes(b1, b2);
assertStageAndEdges(nodes, "first", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, "second");
assertStageAndEdges(nodes, "second", BlueRun.BlueRunState.RUNNING, BlueRun.BlueRunResult.UNKNOWN, "third");
Member Author

If you revert the changes in src/main, all assertions expecting BlueRunState.RUNNING and BlueRunResult.UNKNOWN fail because the actual values are null instead.

@dwnusbaum
Member Author

dwnusbaum commented Aug 23, 2019

Test failures on ci.blueocean.io are PipelineNodeTest.sequentialParallelStagesLongRun and GithubEditorTest.testEditor/GithubEditorTest.testEditorSetEnvironmentVariables. The latter two seem unrelated. The former could be related to the issues discussed in #2017 (comment), so I'll look into it, but that test was flaky before these changes as well, see #1954 (comment).

@dwnusbaum dwnusbaum marked this pull request as ready for review August 27, 2019 18:23
@dwnusbaum
Member Author

Rebuilding because I suspect the test failures are preexisting flakes, see #2017 (comment).

@dwnusbaum dwnusbaum closed this Aug 27, 2019
@dwnusbaum dwnusbaum reopened this Aug 27, 2019
@halkeye
Member

halkeye commented Aug 27, 2019

ATH is super flaky; I don't think I've seen those tests break before, but you could have fixed something that made other tests less flaky :)

@dwnusbaum dwnusbaum requested a review from olamy August 29, 2019 20:49
@vivek vivek left a comment (Collaborator)

LGTM 👍

@vivek
Collaborator

vivek commented Aug 29, 2019

@dwnusbaum Let's get @olamy's review and then merge and release it. Great job 🎉

@olamy olamy left a comment (Member)

LGTM

@dwnusbaum dwnusbaum merged commit a7661c7 into jenkinsci:master Sep 3, 2019
@dwnusbaum dwnusbaum deleted the JENKINS-58085 branch September 3, 2019 20:19
@dwnusbaum dwnusbaum added the bug label Sep 3, 2019
@graingert

@olamy do you know when this will be released?

@olamy
Member

olamy commented Sep 4, 2019

@graingert 1.19.0 has been released with this fix.

runzexia pushed a commit to runzexia/ks-installer that referenced this pull request Oct 17, 2019
@feliwir

feliwir commented Jan 14, 2020

I'm still getting this after upgrading to 1.19.0

@dwnusbaum
Member Author

@feliwir Please post comments like that to the corresponding Jira issue or the jenkinsci-users mailing list instead of GitHub. Make sure to include steps to reproduce the problem, such as the Jenkinsfile you are using; otherwise we are unlikely to investigate, because we do not have any leads on what the problem might be.
