Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-58085] Fix issue where stages are shown as waiting to start when they are running/completed #2017

Merged
merged 2 commits into from
Sep 3, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -687,7 +687,8 @@ public List<BluePipelineNode> union(List<FlowNodeWrapper> previousNodes, final L

boolean graphsAreCompatible = true;

Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size()); // indexed by id
// A map from the ID of a node from previousNodes to the corresponding node for the current build.
Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would be useful to clarify that the IDs here come from the old build, not the new build, and I updated the names of the variables that get put into the map to try to make that clearer as well.


// Start with the currently-executing nodes
for (FlowNodeWrapper currentNodeWrapper : nodes) {
Expand All @@ -710,14 +711,14 @@ public List<BluePipelineNode> union(List<FlowNodeWrapper> previousNodes, final L

// New wrapper object based on current execution
FlowNodeWrapper newNodeWrapper = new FlowNodeWrapper(
oldNodeWrapper.getNode(), // Use old node because we want to show the final id not intermediary
currentNodeWrapper.getNode(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main fix. The old comment doesn't make sense to me. My understanding is that in this case, we have a real node from the current Pipeline. Its ID will never change, even when the build completes, so we already have the "final" ID. Because the code checks for a match using findNodeWrapperByParentageIn if it doesn't find a direct ID match, the ID of the real node might not match the ID of the corresponding node from the old build. This leads to a bug when we use that ID in reference to the current build in other contexts where we expect it to be a stage/branch or whatever, since that ID might be something totally different in the current build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just delete newNodeWrapper and use currentNodeWrapper directly now that we aren't changing the fields?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root problem is the node ids do change during a run, thanks to the way this code walks the graph backwards instead of forwards.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this will fix some of the bad graphs, and break others that are currently working. I used the old node because it's from a finished run, which has the "final" node ids.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root problem is the node ids do change during a run, thanks to the way this code walks the graph backwards instead of forwards.

The ID for an actual Pipeline FlowNode in a build will never change once it is added to the flow graph. Whether we walk the graph in reverse or not does not matter. In this particular loop, as far as I can tell, we are looping through FlowNodeWrapperss for FlowNodes that already exist in the flow graph of the current build, so their IDs will never change.

If you mean the problem is that a single FlowNodeWrapper that reference nodes from the current build can end up referencing a different node once the build progresses further because we are mutating FlowNodeWrapper or something is unstable in the visitor, that seems like a bug, and I am happy to look into fixing that here too if we can come up with a reproduction case.

I used the old node because it's from a finished run, which has the "final" node ids.

The problem is that expecting the node IDs from a different build to match the node IDs in the current build is absolutely not something you can count on. It happens to work in some cases, but findNodeByParentageIn is not strict enough to guarantee that if two nodes match by that criteria their IDs also match, and nothing in a Pipeline will change the ID of a FlowNode once it has been added to the flow graph to make them match after the fact or something.

Further on in this code, where we use the IDs from the old build in what we return, is also a bit suspect, because one of the old IDs might refer to a different node in the current Pipeline, but since those nodes are marked as having not started it seems to not really matter. It would be safer to make the node IDs for the fake FlowNodeWrappers some ID that could not possibly be a valid ID in the current Pipeline, but I don't know what assumptions the rest of the code makes about the IDs for the fake FlowNodeWrappers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I came up with two minimal test cases where nodes that show up while the build is in progress do not appear in the final graph. Both of them involve a parallel step with only one branch, so they're not particularly realistic, but they should help to figure out what is going on (and to check if my change affects the behavior or not).

Here is the Declarative test case:

pipeline {
  agent none
  stages {
    stage('outer') {
      parallel {
        stage('inner') {
          steps {
            sleep 5
          }
        }
      }
    }
  }
}

Here is the scripted test case:

stage('outer') {
  parallel inner: {
    stage('inner') {
        sleep 5
    }
  }
}

In both cases, outer is shown in a separate bubble preceding inner until the build completes, at which point only inner is shown.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looking into those test cases, the main difference in the graph of FlowNodeWrappers when the build is running vs completed is that when the build is running, the FlowNodeWrapper.type of the node for inner is STAGE, but when the build completes the FlowNodeWrapper.type for inner becomes PARALLEL. I'll keep looking to understand why and either fix it or at add an ignored test case, although I'll probably open a separate PR for that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't reproduce all of the situations that cause node drift reliably, it's definitely timing-sensitive. I think it occurs only when parallel stages are just starting or just ending, such that the nodes exist but not all the events for those nodes do yet. I don't really have more detail to give, other than my conclusion that it'd be quicker building a new graph visitor that uses the newer pipeline API via feature flag rather than trying to debug the exact event sequence that triggers it. Fresh eyes may be able to find it though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility might be adopting a different id scheme, that is tied to the structure of the graph (akin to an xpath or something) rather than just reusing the ids given to the pipeline nodes by Jenkins.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, from what I can tell, the problem for my test cases from #2017 (comment) is that ForkScanner.visitSimpleChunks, which is used by PipelineNodeGraphVisitor, makes it difficult to identify what chunks belong to a parallel branch when the branch is not yet completed. In particular, when the build is still in progress, this condition is true for inner:

Because of that, parallelStart exits early and we won't identify inner as having type parallel until after it is complete so that parallelBranchEndNode is called to set up the data for parallelStart. I don't really see any obvious fix for that over here, I think it would make more sense to adapt ForkScanner to have it call some new SimpleChunkVisitor method named inProgressParallelBranchEnd or something in this case so PipelineNodeGraphVisitor doesn't have to reverse engineer what is happening.

That said, my change doesn't affect these cases as far as I can tell, they are broken with or without my fix, and the problem is the FlowNodeWrapper.type, not the ID. Because the type doesn't match, FlowNodeWrapper.probablySameNode will return false in PipelineNodeGraphVisitor.union here:

Because the overall condition is true, the graphs will be marked as incompatible, and so union will break out of the loop that contains my changes and just show the nodes from the current build, so my changes won't affect these cases.

For now, I can't think of an example where my changes will make things worse, so I am going to mark this PR as ready for review. If we come up with an example where it does make things worse, I am happy to look into it, and if there are any regressions from the PR, I am happy to look into them as well.

currentNodeWrapper.getStatus(),
currentNodeWrapper.getTiming(),
currentNodeWrapper.getInputStep(),
run
);

newNodes.put(nodeId, newNodeWrapper);
newNodes.put(oldNodeWrapper.getId(), newNodeWrapper);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so with some more testing, it turns out that this is actually the only change needed to fix the test case I am working with. The problem is that nodeId is an ID from the current build, but all other gets/puts on newNodes use an ID from the old build, so when the IDs don't match across builds the logic in the rest this method can get mixed up. Regardless of the discussion in #2017 (comment), I think we should use the ID from the old node here so that all of the keys in newNodes are IDs from the old build.


} else {
// Node does not exist in previous graph, user probably changed pipleine definition.
Expand All @@ -729,9 +730,9 @@ public List<BluePipelineNode> union(List<FlowNodeWrapper> previousNodes, final L
// Walk the old graph, create new wrappers for any stages not yet started
if (graphsAreCompatible) {
for (FlowNodeWrapper oldNodeWrapper : previousNodes) {
final String nodeId = oldNodeWrapper.getId();
final String oldNodeId = oldNodeWrapper.getId();

if (!newNodes.containsKey(nodeId)) {
if (!newNodes.containsKey(oldNodeId)) {
FlowNodeWrapper newNodeWrapper = new FlowNodeWrapper(
oldNodeWrapper.getNode(),
new NodeRunStatus(null, null),
Expand All @@ -740,17 +741,17 @@ public List<BluePipelineNode> union(List<FlowNodeWrapper> previousNodes, final L
run
);

newNodes.put(nodeId, newNodeWrapper);
newNodes.put(oldNodeId, newNodeWrapper);
}
}
}

// Re-create edges and parentage based on previous run
if (graphsAreCompatible) {
for (FlowNodeWrapper oldNodeWrapper : previousNodes) {
final String nodeId = oldNodeWrapper.getId();
final String oldNodeId = oldNodeWrapper.getId();

FlowNodeWrapper newNodeWrapper = newNodes.get(nodeId);
FlowNodeWrapper newNodeWrapper = newNodes.get(oldNodeId);

for (FlowNodeWrapper oldEdge : oldNodeWrapper.edges) {
FlowNodeWrapper newEdge = newNodes.get(oldEdge.getId());
Expand All @@ -773,8 +774,8 @@ public List<BluePipelineNode> union(List<FlowNodeWrapper> previousNodes, final L
// code that makes too many assumptions

for (FlowNodeWrapper oldNodeWrapper : previousNodes) {
final String nodeId = oldNodeWrapper.getId();
FlowNodeWrapper newNodeWrapper = newNodes.get(nodeId);
final String oldNodeId = oldNodeWrapper.getId();
FlowNodeWrapper newNodeWrapper = newNodes.get(oldNodeId);
newList.add(new PipelineNodeImpl(newNodeWrapper, () -> parent, run));
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,18 +4,21 @@
import com.google.common.io.Resources;
import hudson.model.Result;
import io.jenkins.blueocean.listeners.NodeDownstreamBuildAction;
import io.jenkins.blueocean.rest.hal.Link;
import io.jenkins.blueocean.rest.model.BlueRun;
import org.apache.commons.lang3.StringUtils;
import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
import org.jenkinsci.plugins.workflow.job.WorkflowRun;
import org.jenkinsci.plugins.workflow.test.steps.SemaphoreStep;
import org.junit.Assert;
import org.junit.Test;
import org.jvnet.hudson.test.Issue;

import java.net.URL;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

import static org.junit.Assert.assertEquals;

Expand Down Expand Up @@ -409,6 +412,53 @@ public void unstableSmokes() throws Exception {
assertEquals("Unexpected stages in graph", 4, nodes.size());
}

/**
* Builds a basic Pipeline where the overall graph structure (in terms of stages) is the same from one build
* to the next, but there are differences in the number of FlowNodes (and thus different node IDs for
* corresponding nodes like the starts of stages) in each build.
*
* If this test breaks, it probably means that something is mixing up node IDs across different builds when it
* should not be doing so.
*/
@Test
public void unionDifferentNodeIdsSameStructure() throws Exception {
WorkflowJob p = createJob("unionDifferentNodeIdsSameStructure", "unionDifferentNodeIdsSameStructure.jenkinsfile");
// Run the first build, it should complete successfully. We don't care about its structure.
WorkflowRun b1 = p.scheduleBuild2(0).waitForStart();
SemaphoreStep.waitForStart("second/1", b1);
SemaphoreStep.success("second/1", null);
SemaphoreStep.waitForStart("third/1", b1);
SemaphoreStep.success("third/1", null);
j.waitForCompletion(b1);
j.assertBuildStatus(Result.SUCCESS, b1);
// Run the second build. Its graph as computed by PipelineNodeGraphVisitor will be created by combining the current build with the previous build.
WorkflowRun b2 = p.scheduleBuild2(0).waitForStart();
SemaphoreStep.waitForStart("second/2", b2);
{
List<FlowNodeWrapper> nodes = unionPipelineNodes(b1, b2);
assertStageAndEdges(nodes, "first", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, "second");
assertStageAndEdges(nodes, "second", BlueRun.BlueRunState.RUNNING, BlueRun.BlueRunResult.UNKNOWN, "third");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you revert the changes in src/main, all assertions expecting BlueRunState.RUNNING and BlueRunResult.UNKNOWN fail because they are null instead.

assertStageAndEdges(nodes, "third", null, (BlueRun.BlueRunResult) null);
}
SemaphoreStep.success("second/2", null);
SemaphoreStep.waitForStart("third/2", b2);
{
List<FlowNodeWrapper> nodes = unionPipelineNodes(b1, b2);
assertStageAndEdges(nodes, "first", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, "second");
assertStageAndEdges(nodes, "second", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, "third");
assertStageAndEdges(nodes, "third", BlueRun.BlueRunState.RUNNING, BlueRun.BlueRunResult.UNKNOWN);
}
SemaphoreStep.success("third/2", null);
j.waitForCompletion(b2);
j.assertBuildStatus(Result.SUCCESS, b2);
{
List<FlowNodeWrapper> nodes = unionPipelineNodes(b1, b2);
assertStageAndEdges(nodes, "first", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, "second");
assertStageAndEdges(nodes, "second", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, "third");
assertStageAndEdges(nodes, "third", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS);
}
}

private FlowNodeWrapper assertStageAndEdges(Collection<FlowNodeWrapper> searchNodes, String stageName, String... edgeNames) {
return assertStageAndEdges(searchNodes, stageName, BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, edgeNames);
}
Expand Down Expand Up @@ -500,4 +550,19 @@ private WorkflowJob createJob(String jobName, String jenkinsFileName) throws jav
job.setDefinition(new CpsFlowDefinition(jenkinsFile, true));
return job;
}

/**
* Compute the union of the specified builds using {@link PipelineNodeGraphVisitor#union}.
* @param b1 the old build.
* @param b2 the current build.
* @return the union of the builds converted to {@code List<FlowNodeWrapper>} to be used with methods like {@link #assertStageAndEdges}.
*/
private List<FlowNodeWrapper> unionPipelineNodes(WorkflowRun b1, WorkflowRun b2) {
List<FlowNodeWrapper> oldNodes = NodeGraphBuilder.NodeGraphBuilderFactory.getInstance(b1).getPipelineNodes();
return NodeGraphBuilder.NodeGraphBuilderFactory.getInstance(b2)
.union(oldNodes, new Link("unused"))
.stream()
.map(bpn -> ((PipelineNodeImpl) bpn).getFlowNodeWrapper())
.collect(Collectors.toList());
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
stage('first') {
// Changes the number of nodes in this stage on every build, so any code that
// relies on matching Node IDs across builds will break.
for (int i = 0; i < currentBuild.number; i++) {
echo "${i}"
}
}
stage('second') {
semaphore 'second'
}
stage('third') {
semaphore 'third'
}