[BUG] org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileSyncingSegments is flaky #11165

Closed
reta opened this issue Nov 11, 2023 · 4 comments
Assignees
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run

Comments

@reta
Collaborator

reta commented Nov 11, 2023

Describe the bug
The test case org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.classMethod is flaky:

org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.classMethod

java.lang.RuntimeException: file handle leaks: [FileChannel(/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT_ECA5AE23FF43898D-001/tempDir-003/node_t1-shared/bkfrcfdQKM/0/mb396qQtR8S09Afh8Pj9ww/0/translog/translog-10.ckp), FileChannel(/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT_ECA5AE23FF43898D-001/tempDir-003/node_t1-shared/bkfrcfdQKM/0/mb396qQtR8S09Afh8Pj9ww/0/translog/translog-9.ckp), FileChannel(/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT_ECA5AE23FF43898D-001/tempDir-003/node_t1-shared/bkfrcfdQKM/0/mb396qQtR8S09Afh8Pj9ww/0/translog/translog-8.ckp), FileChannel(/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT_ECA5AE23FF43898D-001/tempDir-003/node_t1-shared/bkfrcfdQKM/0/mb396qQtR8S09Afh8Pj9ww/0/translog/translog-8.tlog), FileChannel(/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT_ECA5AE23FF43898D-001/tempDir-003/node_t1-shared/bkfrcfdQKM/0/mb396qQtR8S09Afh8Pj9ww/0/translog/translog-10.tlog), FileChannel(/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT_ECA5AE23FF43898D-001/tempDir-003/node_t1-shared/bkfrcfdQKM/0/mb396qQtR8S09Afh8Pj9ww/0/translog/translog-9.tlog)]
	at __randomizedtesting.SeedInfo.seed([ECA5AE23FF43898D]:0)
	at org.apache.lucene.tests.mockfile.LeakFS.onClose(LeakFS.java:63)
	at org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:69)
	at org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:70)
	at org.apache.lucene.tests.util.TestRuleTemporaryFilesCleanup.afterAlways(TestRuleTemporaryFilesCleanup.java:223)
	at com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterAlways(TestRuleAdapter.java:31)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.Exception
	at org.apache.lucene.tests.mockfile.LeakFS.onOpen(LeakFS.java:46)
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:82)
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:202)
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:171)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
	at org.opensearch.index.translog.transfer.FileSnapshot.<init>(FileSnapshot.java:46)
	at org.opensearch.index.translog.transfer.FileSnapshot$TransferFileSnapshot.<init>(FileSnapshot.java:113)
	at org.opensearch.index.translog.transfer.FileSnapshot$CheckpointFileSnapshot.<init>(FileSnapshot.java:196)
	at org.opensearch.index.translog.transfer.TranslogCheckpointTransferSnapshot$Builder.build(TranslogCheckpointTransferSnapshot.java:167)
	at org.opensearch.index.translog.RemoteFsTranslog.upload(RemoteFsTranslog.java:338)
	at org.opensearch.index.translog.RemoteFsTranslog.prepareAndUpload(RemoteFsTranslog.java:310)
	at org.opensearch.index.translog.RemoteFsTranslog.sync(RemoteFsTranslog.java:365)
	at org.opensearch.index.translog.InternalTranslogManager.syncTranslog(InternalTranslogManager.java:196)
	at org.opensearch.index.engine.InternalEngine.syncTranslog(InternalEngine.java:610)
	at org.opensearch.index.shard.IndexShard.postActivatePrimaryMode(IndexShard.java:3449)
	at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:727)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4052)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4022)
	at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$37(IndexShard.java:3973)
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	... 1 more

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileSyncingSegments" -Dtests.seed=ECA5AE23FF43898D 

Expected behavior
The test must always pass

Plugins
Standard

Additional context

nov 11, 2023 10:36:16 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[opensearch[node_t1][generic][T#4],5,TGRP-SegmentReplicationUsingRemoteStoreDisruptionIT]
java.lang.AssertionError:  inconsistent generation 
	at __randomizedtesting.SeedInfo.seed([ECA5AE23FF43898D]:0)
	at org.opensearch.index.translog.transfer.TranslogCheckpointTransferSnapshot$Builder.build(TranslogCheckpointTransferSnapshot.java:180)
	at org.opensearch.index.translog.RemoteFsTranslog.upload(RemoteFsTranslog.java:338)
	at org.opensearch.index.translog.RemoteFsTranslog.prepareAndUpload(RemoteFsTranslog.java:310)
	at org.opensearch.index.translog.RemoteFsTranslog.sync(RemoteFsTranslog.java:365)
	at org.opensearch.index.translog.InternalTranslogManager.syncTranslog(InternalTranslogManager.java:196)
	at org.opensearch.index.engine.InternalEngine.syncTranslog(InternalEngine.java:610)
	at org.opensearch.index.shard.IndexShard.postActivatePrimaryMode(IndexShard.java:3449)
	at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:727)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4052)
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4022)
	at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$37(IndexShard.java:3973)
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
@reta reta added bug Something isn't working untriaged flaky-test Random test failure that succeeds on second run labels Nov 11, 2023
@kartg
Member

kartg commented Dec 27, 2023

Note: the stack trace in the additional context section is similar to the one in #11255, which I just closed because I could not reproduce the failure even after several thousand retries.

Repro commands from the CI run:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileSyncingSegments" -Dtests.seed=ECA5AE23FF43898D -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nb-NO -Dtests.timezone=Asia/Ujung_Pandang -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT" -Dtests.seed=ECA5AE23FF43898D -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Etc/UTC -Druntime.java=17

This run appears to be against the 2.x branch and is dated Nov 11, while the backport of the flaky test fix was merged to 2.x only on Nov 17. It's possible that this failure has been mitigated. I'll run the repro commands to try and reproduce the error.

@kartg kartg self-assigned this Dec 27, 2023
@kartg
Member

kartg commented Dec 28, 2023

While the command loops, I dug a little deeper into the stack trace:

  • The failure seems to stem from a Lucene test class named LeakFS, which is designed to throw an exception (reporting file handle leaks) when onClose is invoked on the file system while handles are still open. The onOpen behavior is simply to add an Exception object to an in-memory map.
  • The "caused by" portion of the trace shows that onOpen was invoked by an OpenSearch code path (via RemoteFsTranslog).
  • However, the onClose portion of the trace does not show any OpenSearch package names. It is instead triggered by Lucene's TestRuleTemporaryFilesCleanup class, whose purpose is to clean up temporary files and close file systems. Calling close on the LeakFS predictably results in the exception being thrown.
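
The mechanism described in the bullets above can be sketched roughly as follows. This is a minimal, hypothetical illustration and not Lucene's LeakFS source: the class name LeakTracker and its method signatures are assumptions made for the example. It only shows why the reported exception carries the opening call site as its "Caused by" while being thrown from the cleanup path.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a leak-detecting file system wrapper (not Lucene's actual code). */
class LeakTracker {
    // Each open handle maps to an Exception whose stack trace records where it was opened.
    private final Map<Object, Exception> openHandles = new ConcurrentHashMap<>();

    /** Called when a handle is opened: remember the opening call site. */
    void onOpen(Object handle) {
        openHandles.put(handle, new Exception("opened here"));
    }

    /** Called when a single handle is closed: stop tracking it. */
    void onClose(Object handle) {
        openHandles.remove(handle);
    }

    /** Called when the whole file system is closed: fail if any handle is still open. */
    void close() {
        if (!openHandles.isEmpty()) {
            // Attach one recorded opening site as the cause, so the report points at the opener.
            Exception cause = openHandles.values().iterator().next();
            throw new RuntimeException("file handle leaks: " + openHandles.keySet(), cause);
        }
    }
}
```

Under this sketch, the thread that calls close() is the test cleanup rule, which is consistent with the top of the trace containing no OpenSearch frames, while the cause points back to whichever code opened the handle and never closed it.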

@kotwanikunal kotwanikunal changed the title [BUG] org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.classMethod is flaky [BUG] org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreDisruptionIT.testCancelReplicationWhileSyncingSegments is flaky Jan 2, 2024
@mch2
Member

mch2 commented Jan 2, 2024

The issue here seems to be the failure during https://build.ci.opensearch.org/job/gradle-check/29837/testReport/junit/org.opensearch.remotestore/SegmentReplicationUsingRemoteStoreDisruptionIT/testCancelReplicationWhileSyncingSegments/, which leaves open files on disk. The failure is caused by an "inconsistent generation" assertion error thrown here, during translog upload.
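
As a rough illustration of how an error raised after files have been opened can leave them open, here is a minimal sketch. It is not the OpenSearch code and not the fix that was merged: SnapshotBuilderSketch, TrackedHandle, and the expectedCount check are hypothetical names standing in for a builder that opens several handles and then validates them. The sketch shows one general way to guard against the leak, by closing whatever was opened before rethrowing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical builder that opens handles, validates them, and cleans up on failure. */
class SnapshotBuilderSketch {

    /** Hypothetical resource that records whether it has been closed. */
    static final class TrackedHandle implements AutoCloseable {
        final String name;
        boolean closed;

        TrackedHandle(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /**
     * Opens one handle per name, then runs a consistency check.
     * The audit list exists only so a caller can inspect every handle that was opened.
     */
    static List<TrackedHandle> build(List<String> names, int expectedCount, List<TrackedHandle> audit) {
        List<TrackedHandle> opened = new ArrayList<>();
        try {
            for (String name : names) {
                TrackedHandle handle = new TrackedHandle(name);
                opened.add(handle);
                audit.add(handle);
            }
            // Stand-in for the consistency check that failed in the reported trace.
            if (opened.size() != expectedCount) {
                throw new IllegalStateException("inconsistent generation");
            }
            return opened;
        } catch (RuntimeException | Error e) {
            // Without this cleanup, every handle opened so far would stay open.
            for (TrackedHandle handle : opened) {
                handle.close();
            }
            throw e;
        }
    }
}
```

If the catch block were removed, the handles created before the failed check would remain open, which is the kind of state a leak-detecting test file system reports at teardown.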

@mch2
Member

mch2 commented Jan 3, 2024

The fix for the translog gen assertion went into 2.x on Nov 17th, which is after this issue was cut. @reta's build was against 2.x on the 11th. I've run this test and all other tests in this class over 2k times today without failures. Closing this as I believe it's fixed; please re-open if it occurs again.

@mch2 mch2 closed this as completed Jan 3, 2024