Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Multi-node integ tests are flaky #461

Closed
dbwiddis opened this issue Jan 26, 2024 · 2 comments · Fixed by #463
Closed

[BUG] Multi-node integ tests are flaky #461

dbwiddis opened this issue Jan 26, 2024 · 2 comments · Fixed by #463
Labels
bug Something isn't working untriaged

Comments

@dbwiddis
Copy link
Member

What is the bug?

Multi-node integ tests are failing randomly. The common symptom is that the cluster does not appear to be started up when the first test(s) run, which leads to 1 in 3 chance of failure if the REST request hits a node that does not contain the primary shard for a queried index.

How can one reproduce the bug?

Occurs randomly on GitHub Actions. Latest failures :

What is the expected behavior?

Passing tests

What is your host/environment?

GitHub actions. This has occurred on all OS's. So far we've only done multi-node tests on JDK 21, will continue to evaluate other JDKs.

Do you have any additional context?

The common line tends to be this exception, always on ::integTest-0 node. Other nodes have failures ~20s later, associated with cluster manager (which was probably this node).

Note the sequence and timestamps involved here: we do an HTTP request here

1> [2024-01-26T16:24:31,158][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testCreateAndProvisionLocalModelWorkflow] before test
  2> org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:37145], URI [/_plugins/_flow_framework/workflow/S90ER40BNKhClHx_ZtQu/_provision], status line [HTTP/1.1 500 Internal Server Error]
    {"error":"Cannot invoke \"String.getBytes(java.nio.charset.Charset)\" because \"src\" is null"}
        at __randomizedtesting.SeedInfo.seed([3366CB2459279EBB:C09AA671C3648979]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:116)
        at app//org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:89)
        at app//org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:78)
        at app//org.opensearch.flowframework.FlowFrameworkRestTestCase.provisionWorkflow(FlowFrameworkRestTestCase.java:381)
        at app//org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testCreateAndProvisionRemoteModelWorkflow(FlowFrameworkRestApiIT.java:224)
  1> [2024-01-26T16:24:31,180][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testCreateAndProvisionLocalModelWorkflow] initializing REST clients against [http://[::1]:38359, http://127.0.0.1:36629, http://[::1]:38133, http://127.0.0.1:37145/, http://[::1]:39713, http://127.0.0.1:40[83](https://github.com/opensearch-project/flow-framework/actions/runs/7671893205/job/20911274115#step:4:84)5]
  1> [2024-01-26T16:24:41,246][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testCreateAndProvisionLocalModelWorkflow] after test

We start the test at 16:24:31,158
The exception occurs at an unknown time (the log entry is prior to the initialization line)
We start initializing the REST clients at 16:24:31,180
We finish the test at 16:24:41.

Other tests seem to complete successfully.

Another example, where the exception appears after the "after test" log but was clearly thrown in the test:

  1> [2024-01-26T15:25:32,304][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testFailedUpdateWorkflow] before test
  1> [2024-01-26T15:25:32,304][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testFailedUpdateWorkflow] initializing REST clients against [http://[::1]:64663, http://127.0.0.1:64664,/ http://[::1]:64668, http://127.0.0.1:64669,/ http://[::1]:64675, http://127.0.0.1:64676]/
  1> [2024-01-26T15:25:52,299][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testFailedUpdateWorkflow] There are still tasks running after this test that might break subsequent tests [internal:index/shard/recovery/start_recovery].
  1> [2024-01-26T15:25:52,299][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testFailedUpdateWorkflow] after test
  2> org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:64669], URI [/_plugins/_flow_framework/workflow/el0FR40BN7x2uhV1Ihk7/_provision], status line [HTTP/1.1 500 Internal Server Error]
    {"error":"Cannot invoke \"String.getBytes(java.nio.charset.Charset)\" because \"src\" is null"}
        at __randomizedtesting.SeedInfo.seed([4FF4E0B23ACF7636:8C5D06D[74](https://github.com/opensearch-project/flow-framework/actions/runs/7671893205/job/20911275308#step:4:75)3C59EC4]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:3[76](https://github.com/opensearch-project/flow-framework/actions/runs/7671893205/job/20911275308#step:4:77))
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:116)
        at app//org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:89)
        at app//org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:[78](https://github.com/opensearch-project/flow-framework/actions/runs/7671893205/job/20911275308#step:4:79))
        at app//org.opensearch.flowframework.FlowFrameworkRestTestCase.provisionWorkflow(FlowFrameworkRestTestCase.java:3[81](https://github.com/opensearch-project/flow-framework/actions/runs/7671893205/job/20911275308#step:4:82))
        at app//org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testFailedUpdateWorkflow(FlowFrameworkRestApiIT.java:92)

Again, later tests complete successfully.

Occasionally we've seen multiple test failures but it's still the "first N" failures followed by success.

These are different internal tests, indicating some other commonality.

I'll be investigating these client connections and means of confirming they're connected before running tests.

@dbwiddis dbwiddis added bug Something isn't working untriaged labels Jan 26, 2024
@dbwiddis
Copy link
Member Author

Set up a test to run 15 times (all JDKs from 17 to 21 against 3 OS's). Failures on JDKs 17, 18, 19, and 21 on macOS, and 21 on Windows.

Consistently failing some of the tests. So far 5 out of 6 match the above log style with the null src. One other type:

org.opensearch.flowframework.rest.FlowFrameworkRestApiIT > testCreateAndProvisionRemoteModelWorkflow FAILED
    org.junit.ComparisonFailure: expected:<[COMPLET]ED> but was:<[FAIL]ED>
        at __randomizedtesting.SeedInfo.seed([AFD09CC584B20298:5C2CF1901EF1155A]:0)
        at org.junit.Assert.assertEquals(Assert.java:117)
        at org.junit.Assert.assertEquals(Assert.java:146)
        at org.opensearch.flowframework.FlowFrameworkRestTestCase.getAndAssertWorkflowStatus(FlowFrameworkRestTestCase.java:[57](https://github.com/dbwiddis/flow-framework/actions/runs/7673708105/job/20917202324#step:4:58)5)
        at org.opensearch.flowframework.FlowFrameworkRestTestCase.lambda$getResourcesCreated$7(FlowFrameworkRestTestCase.java:602)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1089)
        at org.opensearch.flowframework.FlowFrameworkRestTestCase.getResourcesCreated(FlowFrameworkRestTestCase.java:601)
        at org.opensearch.flowframework.rest.FlowFrameworkRestApiIT.testCreateAndProvisionRemoteModelWorkflow(FlowFrameworkRestApiIT.java:231)

  1> [2024-01-26T22:[59](https://github.com/dbwiddis/flow-framework/actions/runs/7673708105/job/20917202324#step:4:60):32,181][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testCreateAndProvisionRemoteModelWorkflow] before test

  1> [2024-01-26T22:59:32,184][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testCreateAndProvisionRemoteModelWorkflow] initializing REST clients against [http://[::1]:49243, http://127.0.0.1:49244,/ http://[::1]:49247, http://127.0.0.1:49248,/ http://[::1]:49251, http://127.0.0.1:49252]/
Suite: Test class org.opensearch.flowframework.rest.FlowFrameworkRestApiIT
  1> [2024-01-26T23:00:24,111][INFO ][o.o.f.r.FlowFrameworkRestApiIT] [testCreateAndProvisionRemoteModelWorkflow] after test

Commonality is still that this is the first test executed and the one that initializes the REST clients.

@dbwiddis
Copy link
Member Author

The client initialization log message comes from initClient() in OpenSearchRestTestCase which has @Before JUnit annotation.

However, in the multi-node case, there's no guarantee that the other node(s) have finished their initialization before we try to call them.

I still think this points to the build.gradle not waiting for the correct conditions before starting these tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working untriaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant