[JENKINS-58085] Fix issue where stages are shown as waiting to start when they are running/completed #2017

Merged
merged 2 commits into jenkinsci:master from dwnusbaum:JENKINS-58085 on Sep 3, 2019

Conversation

dwnusbaum
Member

@dwnusbaum dwnusbaum commented Aug 21, 2019

Description

See JENKINS-58085/JENKINS-53816 and some tickets linked from those.

This PR tweaks the fix for JENKINS-53816 added in #1961.

In some cases, when a build's flow graph changes from one build to the next (because the Pipeline was changed, or perhaps due to the result of a when condition or some dynamic behavior in a Scripted Pipeline) but stays similar enough that Blue Ocean considers the overall graph structure the same, Blue Ocean would assign one of the nodes in the current build the node ID of what it thought was the corresponding node from the previous build. If the node's ID was not actually the same in both builds, the new node would be shown as never having started and would have no steps associated with it, because the API calls for that node ID targeted some other node entirely, perhaps one that wasn't even a stage and so had no children. Once the build completed, the node ID would be replaced with the actual node ID from the build, and things would be displayed correctly.
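
To make the failure mode concrete, here is a minimal, self-contained sketch of the id drift (the stage names and ids are illustrative assumptions, not taken from a real build, and plain maps stand in for the real flow graph types):

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: reusing a node id from the previous build can
// point at a completely different node in the current build.
public class NodeIdDriftSketch {
    public static void main(String[] args) {
        // Previous build: stage name -> FlowNode id.
        Map<String, String> previousBuild = new LinkedHashMap<>();
        previousBuild.put("Par 2", "17");

        // Current build: an added `when` block shifted the ids of later nodes.
        Map<String, String> currentBuild = new LinkedHashMap<>();
        currentBuild.put("when condition", "17"); // old id now names another node
        currentBuild.put("Par 2", "19");

        // Matching "Par 2" structurally but keeping the old id means API calls
        // for the running build hit a node with no step children, so the stage
        // renders as "waiting to start".
        String reusedId = previousBuild.get("Par 2");
        System.out.println("Reused id " + reusedId + " points at: "
                + nameOf(currentBuild, reusedId)); // -> "when condition"
    }

    private static String nameOf(Map<String, String> build, String id) {
        return build.entrySet().stream()
                .filter(e -> e.getValue().equals(id))
                .map(Map.Entry::getKey)
                .findFirst().orElse("nothing");
    }
}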

I'm not sure if this covers all of the issues users describe in JENKINS-58085/JENKINS-53816, but it fixes at least one problem I found. Just opening a draft PR for now while I try to come up with a regression test for the bug.

Here is the Pipeline I was using for testing:
pipeline {
    agent none
    stages {
        stage("Before Parallel") {
            steps {
                echo "starting"
                sleep 15
                echo "done"
            }
        }
        stage("Parallel") {
            parallel {
                stage("Par 1") {
                    stages {
                        stage("Par 1 - Seq 1") {
                            steps {
                                echo "starting"
                                sleep 5
                                echo "done"
                            }
                        }
                        stage("Par 1 - Seq 2") {
                            steps {
                                echo "starting"
                                sleep 10
                                echo "done"
                            }
                        }
                    }
                }
                stage("Par 2") {
                    // Comment and uncomment this `when` from one build to the next to see the bug I am trying to fix.
                    when {
                        expression { true == true }
                    }
                    stages {
                        stage("Par 2 - Seq 1") {
                            steps {
                                echo "starting"
                                sleep 10
                                echo "done"
                            }
                        }
                        stage("Par 2 - Seq 2") {
                            steps {
                                echo "starting"
                                sleep 10
                                echo "done"
                            }
                        }
                    }
                }
                stage("Par 3") {
                    steps {
                        echo "starting"
                        sleep 5
                        echo "done"
                    }
                }
            }
        }
        stage("After Parallel") {
            steps {
                echo "starting"
                sleep 15
                echo "done"
            }
        }
    }
}

Submitter checklist

  • Link to JIRA ticket in description, if appropriate.
  • Change is code complete and matches issue description
  • Appropriate unit or acceptance tests, or an explanation as to why this change has no tests
  • Reviewer's manual test instructions provided in PR description. See Reviewer's first task below.

Reviewer checklist

  • Ran the changes and verified that the change matches the issue description
  • Reviewed the code
  • Verified that the appropriate tests have been written or valid explanation given

@@ -710,14 +711,14 @@ public BluePipelineStep getPipelineNodeStep(String id, Link parent) {

// New wrapper object based on current execution
FlowNodeWrapper newNodeWrapper = new FlowNodeWrapper(
oldNodeWrapper.getNode(), // Use old node because we want to show the final id not intermediary
currentNodeWrapper.getNode(),
Member Author

This is the main fix. The old comment doesn't make sense to me. My understanding is that in this case, we have a real node from the current Pipeline. Its ID will never change, even when the build completes, so we already have the "final" ID. Because the code checks for a match using findNodeWrapperByParentageIn if it doesn't find a direct ID match, the ID of the real node might not match the ID of the corresponding node from the old build. This leads to a bug when we use that ID in reference to the current build in other contexts where we expect it to be a stage/branch or whatever, since that ID might be something totally different in the current build.

Member Author

Maybe we should just delete newNodeWrapper and use currentNodeWrapper directly now that we aren't changing the fields?

Collaborator

The root problem is the node ids do change during a run, thanks to the way this code walks the graph backwards instead of forwards.

Collaborator

Changing this will fix some of the bad graphs, and break others that are currently working. I used the old node because it's from a finished run, which has the "final" node ids.

Member Author

> The root problem is the node ids do change during a run, thanks to the way this code walks the graph backwards instead of forwards.

The ID of an actual Pipeline FlowNode in a build will never change once it is added to the flow graph; whether we walk the graph in reverse does not matter. In this particular loop, as far as I can tell, we are looping through FlowNodeWrappers for FlowNodes that already exist in the flow graph of the current build, so their IDs will never change.

If you mean the problem is that a FlowNodeWrapper that references a node from the current build can end up referencing a different node once the build progresses further (because we are mutating the FlowNodeWrapper, or because something is unstable in the visitor), that seems like a bug, and I am happy to look into fixing that here too if we can come up with a reproduction case.

> I used the old node because it's from a finished run, which has the "final" node ids.

The problem is that you absolutely cannot count on the node IDs from a different build matching the node IDs in the current build. It happens to work in some cases, but findNodeByParentageIn is not strict enough to guarantee that two nodes matching by that criteria also have matching IDs, and nothing in a Pipeline will change the ID of a FlowNode after it has been added to the flow graph to make the IDs match after the fact.

Further on in this code, the places where we use the IDs from the old build in what we return are also a bit suspect, because one of the old IDs might refer to a different node in the current Pipeline; but since those nodes are marked as having not started, it seems to not really matter. It would be safer to give the fake FlowNodeWrappers IDs that could not possibly be valid in the current Pipeline, but I don't know what assumptions the rest of the code makes about the IDs of the fake FlowNodeWrappers.
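
A minimal sketch of that "impossible id" idea (the helper and prefix are my assumption, not anything the code does today; real FlowNode ids are numeric strings assigned by the flow execution, so a non-numeric prefix can never collide with a valid id):

// Hypothetical helper, not part of this PR: derive a placeholder id for a
// fake FlowNodeWrapper that cannot collide with any real FlowNode id.
static String placeholderId(String oldBuildNodeId) {
    return "future-" + oldBuildNodeId; // e.g. "future-17"
}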

Member Author

Ok, I came up with two minimal test cases where nodes that show up while the build is in progress do not appear in the final graph. Both of them involve a parallel step with only one branch, so they're not particularly realistic, but they should help to figure out what is going on (and to check if my change affects the behavior or not).

Here is the Declarative test case:

pipeline {
  agent none
  stages {
    stage('outer') {
      parallel {
        stage('inner') {
          steps {
            sleep 5
          }
        }
      }
    }
  }
}

Here is the scripted test case:

stage('outer') {
  parallel inner: {
    stage('inner') {
        sleep 5
    }
  }
}

In both cases, outer is shown in a separate bubble preceding inner until the build completes, at which point only inner is shown.

Member Author

Ok, looking into those test cases, the main difference in the graph of FlowNodeWrappers between a running and a completed build is that while the build is running, the FlowNodeWrapper.type of the node for inner is STAGE, but once the build completes it becomes PARALLEL. I'll keep looking to understand why, and will either fix it or at least add an ignored test case, although I'll probably open a separate PR for that.

Collaborator

I couldn't reproduce all of the situations that cause node drift reliably; it's definitely timing-sensitive. I think it occurs only when parallel stages are just starting or just ending, such that the nodes exist but not all of the events for those nodes do yet. I don't have more detail to give, other than my conclusion that it would be quicker to build a new graph visitor that uses the newer Pipeline API behind a feature flag than to debug the exact event sequence that triggers it. Fresh eyes may be able to find it, though.

Collaborator

Another possibility might be adopting a different id scheme that is tied to the structure of the graph (akin to an XPath) rather than reusing the ids given to the Pipeline nodes by Jenkins.
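
A rough sketch of what such a structural id might look like (entirely hypothetical; nothing like this exists in the plugin today, and the path format is an assumption):

import java.util.List;

// Hypothetical structural id: the path of enclosing stage/branch names plus a
// sibling index, which stays stable across builds as long as the graph shape
// does, unlike the raw FlowNode ids.
static String structuralId(List<String> enclosingNames, int siblingIndex) {
    return "/" + String.join("/", enclosingNames) + "[" + siblingIndex + "]";
}

// structuralId(List.of("Parallel", "Par 2", "Par 2 - Seq 1"), 0)
//   -> "/Parallel/Par 2/Par 2 - Seq 1[0]"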

Member Author

Yeah, from what I can tell, the problem for my test cases from #2017 (comment) is that ForkScanner.visitSimpleChunks, which is used by PipelineNodeGraphVisitor, makes it difficult to identify what chunks belong to a parallel branch when the branch is not yet completed. In particular, when the build is still in progress, this condition is true for inner:

Because of that, parallelStart exits early, and we won't identify inner as having type parallel until after it is complete, when parallelBranchEndNode is called to set up the data for parallelStart. I don't see any obvious fix for that over here; I think it would make more sense to adapt ForkScanner to call some new SimpleChunkVisitor method named inProgressParallelBranchEnd or something in this case, so that PipelineNodeGraphVisitor doesn't have to reverse engineer what is happening.
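
A sketch of what that callback might look like (the interface name, method, and signature are my assumption; SimpleChunkVisitor, ForkScanner, and FlowNode are the real workflow-api types):

import org.jenkinsci.plugins.workflow.graph.FlowNode;
import org.jenkinsci.plugins.workflow.graphanalysis.ForkScanner;
import org.jenkinsci.plugins.workflow.graphanalysis.SimpleChunkVisitor;

// Entirely hypothetical extension point: ForkScanner would invoke this when it
// reaches the head of a parallel branch that has started but not yet ended, so
// visitors could classify in-progress branches as parallel immediately instead
// of waiting for parallelBranchEnd to fire after the branch completes.
public interface InProgressAwareChunkVisitor extends SimpleChunkVisitor {
    void inProgressParallelBranchEnd(FlowNode parallelStartNode,
                                     FlowNode branchHead,
                                     ForkScanner scanner);
}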

That said, my change doesn't affect these cases as far as I can tell; they are broken with or without my fix, and the problem is the FlowNodeWrapper.type, not the ID. Because the type doesn't match, FlowNodeWrapper.probablySameNode will return false in PipelineNodeGraphVisitor.union here:

Because the overall condition is true, the graphs will be marked as incompatible, and union will break out of the loop that contains my changes and just show the nodes from the current build, so these cases are unaffected by my changes.

For now, I can't think of an example where my changes make things worse, so I am going to mark this PR as ready for review. If anyone comes up with such an example, or if the PR causes any regressions, I am happy to look into them.

@@ -687,7 +687,8 @@ public BluePipelineStep getPipelineNodeStep(String id, Link parent) {

boolean graphsAreCompatible = true;

Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size()); // indexed by id
// A map from the ID of a node from previousNodes to the corresponding node for the current build.
Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size());
Member Author

I thought it would be useful to clarify that the IDs here come from the old build, not the new build, and I updated the names of the variables that get put into the map to try to make that clearer as well.

@halkeye
Member

halkeye commented Aug 21, 2019

AFAIK, Josh has worked the most in this space, so I added him as a reviewer to get feedback.

currentNodeWrapper.getStatus(),
currentNodeWrapper.getTiming(),
currentNodeWrapper.getInputStep(),
run
);

newNodes.put(nodeId, newNodeWrapper);
newNodes.put(oldNodeWrapper.getId(), newNodeWrapper);
Member Author

Ok, so with some more testing, it turns out that this is actually the only change needed to fix the test case I am working with. The problem is that nodeId is an ID from the current build, while all other gets and puts on newNodes use an ID from the old build, so when the IDs don't match across builds, the logic in the rest of this method can get mixed up. Regardless of the discussion in #2017 (comment), I think we should use the ID from the old node here so that all of the keys in newNodes are IDs from the old build.
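
To state the keying invariant in code (a simplified sketch, not the actual method; findCorrespondingCurrentNode is a hypothetical stand-in for the direct-id and findNodeWrapperByParentageIn lookups):

// Simplified sketch: every key in newNodes must be an id from the *previous*
// build, never from the current one, so all gets and puts stay consistent.
Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size());
for (FlowNodeWrapper oldNodeWrapper : previousNodes) {
    FlowNodeWrapper currentNodeWrapper = findCorrespondingCurrentNode(oldNodeWrapper);
    if (currentNodeWrapper != null) {
        // Key by the old build's id, not currentNodeWrapper.getId().
        newNodes.put(oldNodeWrapper.getId(), currentNodeWrapper);
    }
}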

{
List<FlowNodeWrapper> nodes = unionPipelineNodes(b1, b2);
assertStageAndEdges(nodes, "first", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, "second");
assertStageAndEdges(nodes, "second", BlueRun.BlueRunState.RUNNING, BlueRun.BlueRunResult.UNKNOWN, "third");
Member Author

If you revert the changes in src/main, all assertions expecting BlueRunState.RUNNING and BlueRunResult.UNKNOWN fail because the actual values are null instead.

@dwnusbaum
Member Author

dwnusbaum commented Aug 23, 2019

Test failures on ci.blueocean.io are PipelineNodeTest.sequentialParallelStagesLongRun and GithubEditorTest.testEditor/GithubEditorTest.testEditorSetEnvironmentVariables. The latter two seem unrelated. The former could be related to the issues discussed in #2017 (comment), so I'll look into it, but that test was flaky before these changes as well, see #1954 (comment).

@dwnusbaum dwnusbaum marked this pull request as ready for review August 27, 2019 18:23
@dwnusbaum
Member Author

Rebuilding because I suspect the test failures are preexisting flakes, see #2017 (comment).

@dwnusbaum dwnusbaum closed this Aug 27, 2019
@dwnusbaum dwnusbaum reopened this Aug 27, 2019
@halkeye
Member

halkeye commented Aug 27, 2019

ATH is super flaky; I don't think I've seen those tests break before, but you could have fixed something that made other tests less flaky :)

@dwnusbaum dwnusbaum requested a review from olamy August 29, 2019 20:49
@vivek vivek left a comment (Collaborator)

LGTM 👍

@vivek
Collaborator

vivek commented Aug 29, 2019

@dwnusbaum Let's get @olamy's review and then merge and release it. Great job 🎉

@olamy olamy left a comment (Member)

LGTM

@dwnusbaum dwnusbaum merged commit a7661c7 into jenkinsci:master Sep 3, 2019
@dwnusbaum dwnusbaum deleted the JENKINS-58085 branch September 3, 2019 20:19
@dwnusbaum dwnusbaum added the bug label Sep 3, 2019
@graingert

@olamy do you know when this will be released?

@olamy
Member

olamy commented Sep 4, 2019

@graingert 1.19.0 has been released with this fix.

runzexia pushed a commit to runzexia/ks-installer that referenced this pull request Oct 17, 2019
@feliwir

feliwir commented Jan 14, 2020

I'm still getting this after upgrading to 1.19.0

@dwnusbaum
Member Author

@feliwir Please post comments like that to the corresponding Jira issue or the jenkinsci-users mailing list instead of GitHub. Make sure to include steps to reproduce the problem, such as the Jenkinsfile you are using; otherwise we are unlikely to investigate, because we do not have any leads on what the problem might be.
