Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JENKINS-39203 - Only show blocks containing a WarningAction as unstable #1954

Merged
merged 12 commits into from
Jun 6, 2019

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Apr 18, 2019

Description

See jenkinsci/workflow-api-plugin#91, jenkinsci/pipeline-graph-analysis-plugin#25, jenkinsci/workflow-basic-steps-plugin#83, and JENKINS-39203. (I am intentionally not updating the status of that ticket or assigning it to myself until I am ready to merge/release all relevant changes to avoid giving users premature hope of a fix and so that I have clear answers for what will and won't work and when any fixes will be released.)

This PR adjusts the nodes used when calling StatusAndTiming.computeChunkStatus2 for stages. Previously, before was the stage step start, firstNode was the first atom node in the stage, lastNode was the end of the stage step's body, and after was the stage step end. Now, firstNode is the start of the stage step's body. This change matches the standard behavior of StageChunkFinder, which means that we can create tests in pipeline-graph-analysis that actually match the way the APIs are used by Blue Ocean rather than being subtly different, and it makes it easier to look through the current block inside of computeChunkStatus2 (I am a little curious about whether firstExecuted can be removed entirely, but haven't tested that out yet).

This PR also adds regression tests for JENKINS-43292.

Screenshot with the changes (without the changes, all of the yellow exclamation points would be green check marks:

Screen Shot 2019-04-24 at 17 02 02

Submitter checklist

  • Link to JIRA ticket in description, if appropriate.
  • Change is code complete and matches issue description
  • Appropriate unit or acceptance tests or explanation to why this change has no tests
  • Reviewer's manual test instructions provided in PR description. See Reviewer's first task below.

Reviewer checklist

  • Run the changes and verified the change matches the issue description
  • Reviewed the code
  • Verified that the appropriate tests have been written or valid explanation given

@@ -27,7 +27,7 @@
<revision>1.15.0</revision>
<changelist>-SNAPSHOT</changelist>
<java.level>8</java.level>
<jenkins.version>2.121.1</jenkins.version> <!-- Should be kept in sync with the version resulting from Dockerfile FROM statement-->
<jenkins.version>2.138.4</jenkins.version> <!-- Should be kept in sync with the version resulting from Dockerfile FROM statement-->
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required for the workflow-api update. Current LTS is 2.164.2, not sure if you want to go all the way to that version though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you need to update some Dockerfile as the comment asks?

Copy link
Member Author

@dwnusbaum dwnusbaum Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Dockerfile uses FROM jenkins/jenkins:lts, and that version is currently 2.164.2. I wasn't sure if we want to update to such a new version, hence my comment in the PR. Perhaps the comment in pom.xml should just be deleted now that the Dockerfile always tracks the latest LTS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the comment in pom.xml should just be deleted now that the Dockerfile always tracks the latest LTS.

Or the Dockerfile should be made to pin a version. Depends on what it is being used for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would however update the one in Jenkinsfile (but that looks like its already 2.138.4, so it doesn't matter)

Copy link
Member Author

@dwnusbaum dwnusbaum Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point, this line needs to be deleted or modified since we wouldn't be able to test on a baseline older than 2.138.4 anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go ahead.

I did learn recently when i cleaned up the jenkinsfile, the weekly ath isn't run anymore anyways so no fears of breaking things

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d5861ae.

@@ -45,6 +45,7 @@
<git.version>3.8.0</git.version>
<access-modifier-checker.failOnError>true</access-modifier-checker.failOnError>
<workflow-support.version>3.2</workflow-support.version>
<pipeline-model-definition.version>1.3.7</pipeline-model-definition.version>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to a property to make it easy to keep everything in sync, and the update is required for the JENKINS-43292 regression tests to work.

</dependency>
<dependency>
<groupId>org.jenkinsci.plugins</groupId>
<artifactId>pipeline-model-extensions</artifactId>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one of the other Declarative plugins, not sure why it wasn't here already.

assertStageAndEdges(nodes, "stage 1 marked as unstable", BlueRun.BlueRunState.NOT_BUILT, BlueRun.BlueRunResult.NOT_BUILT, "stage 2 wait");
assertStageAndEdges(nodes, "stage 2 wait", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.UNSTABLE);
assertStageAndEdges(nodes, "stage 1 marked as unstable", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.UNSTABLE, "stage 2 wait");
assertStageAndEdges(nodes, "stage 2 wait", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test changed, does it mean existing behaviour might break?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this test in particular is the exact case described in JENKINS-39203. stage 2 wait succeeded, but previously it would have been shown as unstable. The fix is that only the stage that actually set the status to unstable should be marked as unstable. I'm not sure why stage 1 was previously computed to be NOT_BUILT, but I will see if I can figure it out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its probably all good, i just get super nervous when tests change. Josh will be able to give better feedback next week

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this test is definitely expected to change since it was asserting exactly the behavior that we don't want anymore (i.e. it was testing the bug). No rush on reviewing, this is on top of 3 other PRs and will likely change over time, so probably better to wait to review until I upgrade it from draft status to real PR status anyway.

@@ -1,6 +1,6 @@
node {
stage ("stage 1 marked as unstable") {
currentBuild.result = 'UNSTABLE';
unstable('foo')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should look into whether RunWrapper.setResult can be deprecated. Tricky because this was done as a method on a POJO rather than a Step, so we have no TaskListener available. Ideally currentBuild would be deprecated as a whole: jenkinsci/workflow-basic-steps-plugin#71

@@ -27,7 +27,7 @@
<revision>1.15.0</revision>
<changelist>-SNAPSHOT</changelist>
<java.level>8</java.level>
<jenkins.version>2.121.1</jenkins.version> <!-- Should be kept in sync with the version resulting from Dockerfile FROM statement-->
<jenkins.version>2.138.4</jenkins.version> <!-- Should be kept in sync with the version resulting from Dockerfile FROM statement-->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you need to update some Dockerfile as the comment asks?

@halkeye
Copy link
Member

halkeye commented Apr 22, 2019

I can't seem to reply to the comment, but I think the comment was resolved a long time ago as https://github.com/jenkinsci/blueocean-plugin/blob/master/docker/official/Dockerfile says lts? maybe?

@halkeye
Copy link
Member

halkeye commented Apr 23, 2019

Hrm, want to label this correctly, is it a bug or enhancement? I'm thinking bug fix more than new feature

@dwnusbaum
Copy link
Member Author

dwnusbaum commented Apr 23, 2019

@halkeye I guess it's somewhat on the border between them, but from a user's perspective it probably makes the most sense to treat it as a bug.

@halkeye
Copy link
Member

halkeye commented Apr 23, 2019

I wonder what release drafter will do if its labeled as both :)

NodeGraphBuilder graph = NodeGraphBuilder.NodeGraphBuilderFactory.getInstance(run);
List<FlowNodeWrapper> nodes = graph.getPipelineNodes();

// top should be BlueRunResult.FAILURE, but the status is currently computed from the beginning of top to the
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because StandardChunkVisitor doesn't handled nested stages. See further discussion and an ignored test here: https://github.com/jenkinsci/pipeline-graph-analysis-plugin/pull/25/files#r277844096.

@halkeye halkeye changed the title [JENKINS-39203] Only show blocks containing a WarningAction as unstable JENKINS-39203 - Only show blocks containing a WarningAction as unstable Apr 24, 2019
@dwnusbaum
Copy link
Member Author

Closing and reopening to see if those ATH failures were flakes or not.

@dwnusbaum dwnusbaum closed this May 14, 2019
@dwnusbaum dwnusbaum reopened this May 14, 2019
@halkeye
Copy link
Member

halkeye commented May 15, 2019

Restart stage is still slightly flakey, but the other one is a new error

@dwnusbaum
Copy link
Member Author

Tried reproducing the PipelineNodeTest.nodeWithPartialParallels failure but wasn't able to. I'll try to track down the source of the ATH tests, perhaps they are looking for a specific result or something that has intentionally changed.

@dwnusbaum dwnusbaum marked this pull request as ready for review May 15, 2019 19:37
@dwnusbaum dwnusbaum requested a review from imeredith as a code owner May 15, 2019 19:37
@dwnusbaum
Copy link
Member Author

All downstream changes have been released. Strictly speaking, the only necessary non-test change in this plugin is the one in NodeRunStatus.java that will change the per-step status displayed to support WarningAction. The change in PipelineNodeGraphVisitor.java described in the PR description could be backed out, but the change makes the usage of StatusAndTiming.computeChunkStatus2 here much easier to test elsewhere and more consistent with no drawbacks as far as I can tell.

Copy link
Collaborator

@vivek vivek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one @dwnusbaum, going to make lots of folks happy 🎉

@halkeye
Copy link
Member

halkeye commented May 15, 2019

When I get a few min I'll poke at the test and see if I can figure out what's wrong

@dwnusbaum
Copy link
Member Author

Ok, so from the last 3 builds, which don't have any interesting changes between them, it looks like there are some test flakes.

In build 8, the following 2 tests failed:

  • AllTest.javaTest
  • RestartStageTest.restartStage

In build 9, 4 tests failed:

  • AllTest.javaTest
  • FavoritesCardsTest.testFreestyle
  • PipelineNodeTest.nodesWithPartialParallels
  • RestartStageTest.restartStage

In Build 10, 2 tests failed:

  • AllTest.javaTest
  • FavoritesCardsTest.testFreestyle

The AllTest.javaTest failure is consistent, but I don't understand what that test is doing. It looks like it just reruns the integration test using mvn test, but as part of the ATH? Or maybe that test is just a wrapper for all of the other tests, so it fails if any ATH test fails?

Based on the screenshot for FavoritesCardTest, it looks like maybe the build completed before the status was detected as running, so the test timed out because the build was already done, but I'm not sure.

@halkeye
Copy link
Member

halkeye commented May 15, 2019 via email

Copy link
Member

@olamy olamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM good job!

status = new NodeRunStatus(
StatusAndTiming.computeChunkStatus2(run, nodeBefore, firstExecuted, lastNode, nodeAfter));
StatusAndTiming.computeChunkStatus2(run, nodeBefore, firstNode, lastNode, nodeAfter));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this change wasn't strictly necessary, but it turns out that it is, and it fixes things like JENKINS-57579. I can't quite tell what firstExecuted is expected to be, and it seems like it will vary based on context, but firstNode is always (even for in-progress builds), going to be the start of a stage step block or parallel branch block, and that is what we expect over in StatusAndTiming.computeChunkStatus2 anyway.

My guess as to the problem is that in some cases firstExecuted ends up being identical to or later than lastNode in the graph, and so setting up the graph traversal here in StatusAndTiming.findWarningsBetween fails and the method returns null.

@dwnusbaum
Copy link
Member Author

@halkeye Anything in particular you want me to look into here? It looks like we are just seeing a bunch of different flaky test failures. I looked into the PipelineNodeTest.nodesWithPartialParallels failure since you hadn't seen it before, but I haven't been able to reproduce it, and looking at the test I don't really see how my changes would make it flaky unless it was already flaky.

@halkeye
Copy link
Member

halkeye commented Jun 3, 2019

nope, not right now, I've been poking at trying to fix them (#1962) but they are mostly timing based e2e tests that are failing.

I've moved projects so I only have personal time to commit to this project now.

@vivek
Copy link
Collaborator

vivek commented Jun 5, 2019

Failing test in latest build#12 has been flakily failing on master as well across multiple build: https://ci.blueocean.io/blue/organizations/jenkins/blueocean/detail/PR-1954/12/tests/. I can't get ATH to run locally to reproduce to see why this test is flaky. In any case, fixing this flaky test is out of scope of this PR.

Given wider interest in this PR, I propose we merge it and address flaky ATH tests separately. @michaelneale @halkeye @kshultzCB

@michaelneale
Copy link
Member

looks like a good improvement - not sure of impact of dependency changes (and the new pipeline dependency - but likely it is fine). Looks like an improvement of things anyway.

@michaelneale
Copy link
Member

@dwnusbaum so mostly flakes that fail that you see?

@dwnusbaum
Copy link
Member Author

and the new pipeline dependency

pipeline-model-extensions was already a dependency through pipeline-model-definition, it just wasn't declared explicitly in pom.xml.

so mostly flakes that fail that you see?

Yes, as far as I can tell the ATH failures are not consistent. The one non-ATH failure that seems a bit sketchy is PipelineNodeTest.nodesWithPartialParallels, but I haven't been able to reproduce that failure locally. I recreated the test in a live instance based on this PR, and still wasn't able to reproduce it.

I went ahead and pushed 629d57f, which changes that test from guessing how long to wait using Thread.sleep to using the semaphore step which was designed exactly for this purpose for testing. Now, we know that when we make the other calls in the test the Pipeline must have progressed to the point where there are definitely 3 steps in the left branch and 2 steps in the right branch. If we don't get there, then the test will time out, showing us that something has gone wrong earlier. (If we actually cared about the inputs in the test or the visualization, changing them wouldn't work, but as it is they are just used to stop the execution for the test)

@dwnusbaum
Copy link
Member Author

Rebuilding due to infra failure:

Could not transfer artifact org.jenkins-ci.plugins:pipeline-graph-analysis:jar:1.10 from/to repo.jenkins-ci.org (http://repo.jenkins-ci.org/public/): Failed to transfer file: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/pipeline-graph-analysis/1.10/pipeline-graph-analysis-1.10.jar. Return code is: 502 , ReasonPhrase:Bad Gateway.

@dwnusbaum dwnusbaum closed this Jun 5, 2019
@dwnusbaum dwnusbaum reopened this Jun 5, 2019
@dwnusbaum
Copy link
Member Author

dwnusbaum commented Jun 6, 2019

Latest test failure is PipelineNodeTest.sequentialParallelStagesLongRun. Not sure what is wrong with that one, but I haven't been able to reproduce it locally, and since it doesn't have anything to do with statuses I don't think it's related to my change. Best guess is some unusual race condition being hit in this loop that runs every 100ms. Closing and reopening for a fresh build.

@dwnusbaum dwnusbaum closed this Jun 6, 2019
@dwnusbaum dwnusbaum reopened this Jun 6, 2019
@michaelneale
Copy link
Member

woo hoo green !

@halkeye halkeye merged commit 6696ce5 into jenkinsci:master Jun 6, 2019
@dwnusbaum dwnusbaum deleted the JENKINS-39203 branch June 7, 2019 14:17
@rodrigc
Copy link
Contributor

rodrigc commented Jul 12, 2019

@dwnusbaum nice job, and nice blog post: https://jenkins.io/blog/2019/07/05/jenkins-pipeline-stage-result-visualization-improvements/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants