[BUG] Multi-node integ tests are flaky #461
Comments
Set up a test to run 15 times (all JDKs from 17 to 21 against 3 OSes). Failures occurred on JDKs 17, 18, 19, and 21 on macOS, and on JDK 21 on Windows, with some of the tests failing consistently. So far 5 out of 6 match the above log style with the null src. One other type:
Commonality is still that this is the first test executed and the one that initializes the REST clients.
The client initialization log message comes from However, in the multi-node case, there's no guarantee that the other node(s) have finished their initialization before we try to call them. I still think this points to the
What is the bug?
Multi-node integ tests are failing randomly. The common symptom is that the cluster does not appear to have started up when the first test(s) run, which leads to a 1 in 3 chance of failure if the REST request hits a node that does not contain the primary shard for a queried index.
How can one reproduce the bug?
Occurs randomly on GitHub Actions. Latest failures:
What is the expected behavior?
Passing tests
What is your host/environment?
GitHub Actions. This has occurred on all OSes. So far we've only done multi-node tests on JDK 21; we will continue to evaluate other JDKs.
Do you have any additional context?
The common line tends to be this exception, always on the ::integTest-0 node. Other nodes have failures ~20s later, associated with the cluster manager (which was probably this node).
Note the sequence and timestamps involved here: we do an HTTP request here
We start the test at 16:24:31,158
The exception occurs at an unknown time (the log entry is prior to the initialization line)
We start initializing the REST clients at 16:24:31,180
We finish the test at 16:24:41.
Other tests seem to complete successfully.
Another example, where the exception appears after the "after test" log but was clearly thrown in the test:
Again, later tests complete successfully.
Occasionally we've seen multiple test failures but it's still the "first N" failures followed by success.
These are different internal tests, indicating some other commonality.
I'll be investigating these client connections and means of confirming they're connected before running tests.
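One way to confirm readiness before the first test is to poll a check until it passes or a timeout expires. The following is a minimal sketch only: `ClusterReadiness` and `waitUntil` are hypothetical names, not part of this repository or of the OpenSearch test framework. The readiness check is passed in as a supplier; in a real test it might wrap a REST call to each node or a cluster health request that waits for the expected node count, but that wiring is an assumption and is not shown here.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (illustration only): polls a readiness check until it
// succeeds or the timeout elapses.
final class ClusterReadiness {

    private ClusterReadiness() {}

    /**
     * Repeatedly evaluates the given check.
     *
     * @param ready         returns true once the cluster is considered ready
     * @param timeoutMillis maximum total time to wait
     * @param pollMillis    pause between attempts
     * @return true if the check passed before the timeout, false otherwise
     */
    static boolean waitUntil(BooleanSupplier ready, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (ready.getAsBoolean()) {
                return true;
            }
            // Compare via subtraction so the check is safe against nanoTime overflow.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

A test setup method could call something like `ClusterReadiness.waitUntil(check, 30_000, 500)` and fail fast with a clear message when it returns false, which would distinguish "cluster not ready" from a genuine test failure.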