
Investigate the spike in flaky test failures as a function of the Gradle check configuration and Jenkins runner instance sizing #321

Closed

nknize opened this issue Jul 7, 2023 · 8 comments

Labels: agents, enhancement (New feature or request), packer

nknize commented Jul 7, 2023

Is your feature request related to a problem? Please describe

Coming out of this public Slack discussion, I'd like to explore what appears to be a spike in flaky test failures during gradlew check runs on PRs in the OpenSearch core repository during regular business hours.

The concrete test failures we're noticing are similar to:

Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
	at org.opensearch.nio.SocketChannelContext.connect(SocketChannelContext.java:157) ~[opensearch-nio-2.9.0-SNAPSHOT.jar:2.9.0-SNAPSHOT]

As can be seen in this one instance, the failures seem mostly related to socket issues in the runner and tend to occur in "aggressive" integration tests (e.g., those using the Scope.TEST level, which fires up a new cluster for each test method).

With Jenkins spinning up its own runner for each invocation, I wouldn't expect the high level of activity (e.g., multiple PRs throughout the day) to contribute, so maybe this is more related to the test intensity, the --parallel Gradle invocation, and the size of the runner instance?
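
One knob worth experimenting with here (a minimal sketch, assuming the build doesn't already pin these values; the cap of 32 is only illustrative) is limiting Gradle's worker count so the number of concurrent test JVMs is sized to the runner's memory rather than defaulting to its vCPU count:

    # gradle.properties on the Jenkins runner (or passed as -D properties)
    org.gradle.parallel=true
    # default worker count is the number of vCPUs (96 on c5.24xlarge); cap it explicitly
    org.gradle.workers.max=32

    # equivalent one-off invocation from the CLI
    ./gradlew check --parallel --max-workers=32

Comparing failure rates at a few different --max-workers values on the same instance type would help separate "too much concurrency per GB of memory" from a genuine socket leak.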

Describe the solution you'd like

As a parallel effort to trying to lean out the intense integration tests in the core repo, I'd like for us to see if we can root-cause these timeouts as a function of instance resources (e.g., CPU, memory) and the test configuration (e.g., number of concurrent integration tests, number of sockets).

It may be that we just aren't closing the sockets in the core IntegrationTest class (we can explore that separately).
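
For illustration only, here is a minimal sketch of the kind of cleanup we'd be checking for; the class and method names below are hypothetical, not the actual OpenSearch test framework API:

    import java.io.IOException;
    import java.net.Socket;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical base class: track every socket a test opens and close them all in teardown.
    public abstract class SocketTrackingTestCase {
        private final List<Socket> openedSockets = new ArrayList<>();

        protected Socket openSocket(String host, int port) throws IOException {
            Socket socket = new Socket(host, port);
            openedSockets.add(socket);
            return socket;
        }

        // Invoked from the framework's per-test teardown (e.g. an @After method).
        protected void closeTrackedSockets() {
            for (Socket socket : openedSockets) {
                try {
                    socket.close();
                } catch (IOException e) {
                    // best-effort cleanup; a socket that fails to close here is a candidate leak
                }
            }
            openedSockets.clear();
        }
    }

If sockets like these are left open across Scope.TEST methods that each start a fresh cluster, exhausting ephemeral ports or file descriptors on a busy runner could plausibly surface as the Connection refused failures above.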

Describe alternatives you've considered

  • Check that the core integration test harness is properly closing sockets.
  • Check the socket pool configuration in the core test framework.
  • ... other core improvements not explicitly mentioned here.

Additional context

Thank you!

nknize added the enhancement (New feature or request) and untriaged (Issues that have not yet been triaged) labels on Jul 7, 2023
peterzhuamazon (Member) commented:

We will try to create a new runner with @nknize's own env specs: 32 vCPU / 128 GB, similar to m5.8xlarge.
It is possible that Nick has 32/128 while we have 96/192, which means --parallel creates three times more parallel tasks on our instance and each job is assigned half the memory.

Also, the desktop env setup means his CPU's single-core frequency is way higher than actual Intel server CPUs. That needs to be taken into account as well. I will start investigating this next week.
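
To put numbers on that (assuming Gradle spawns roughly one test worker per vCPU under --parallel): 192 GB / 96 workers ≈ 2 GB per worker on the 96/192 instance, versus 128 GB / 32 workers = 4 GB per worker on the 32/128 desktop, i.e. three times the concurrency with half the memory per task.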

Thanks.

peterzhuamazon (Member) commented:

(Screenshots of build results, taken Jul 20, 2023.)

Several days of data show the new setup at roughly a 90% unstable rate vs. a 10% success rate, but we have yet to see a complete failure.

So it is possible the new m5.8xlarge spec is better than the original c5.24xlarge setup.

Thanks.

peterzhuamazon (Member) commented:

We have decided to test switching the default runner to m5.8xlarge next week.

peterzhuamazon (Member) commented Jul 25, 2023

New spec is live. Monitoring for a bit.

peterzhuamazon (Member) commented:

(Screenshot of build results.)
More successful runs.

bbarani (Member) commented Jul 31, 2023

Closing this issue as the changes were completed.
