
[BUG] RemoteIndexShardTests.testSegRepSucceedsOnPreviousCopiedFiles is flaky - suite timeout was reached #10885

Closed
peternied opened this issue Oct 24, 2023 · 7 comments · Fixed by #11786
Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run)


@peternied
Member

Describe the bug
The org.opensearch.index.shard.RemoteIndexShardTests.testSegRepSucceedsOnPreviousCopiedFiles test is flaky.

Stacktrace

java.lang.Exception: Test abandoned because suite timeout was reached.
	at __randomizedtesting.SeedInfo.seed([AABBFC4C7FB026AC]:0)
java.lang.Exception: Suite timeout exceeded (>= 1200000 msec).
java.lang.AssertionError: Replication should not fail

To Reproduce
CI - https://build.ci.opensearch.org/job/gradle-check/28624/testReport/org.opensearch.index.shard/RemoteIndexShardTests/testSegRepSucceedsOnPreviousCopiedFiles/

Expected behavior
The test should always pass.

@peternied peternied added bug Something isn't working untriaged flaky-test Random test failure that succeeds on second run labels Oct 24, 2023
@peternied
Member Author

Impacted - #10800 (comment)

@amkhar
Contributor

amkhar commented Oct 25, 2023

#10915 (comment)

@mch2
Member

mch2 commented Oct 25, 2023

@dreamer-89 Can you please take a look at this?

@mch2
Member

mch2 commented Oct 25, 2023

@dreamer-89 - I think what's happening here is that the test is throwing an error from getSegmentFiles outside of the actual remote fetch.

                    // The real remote fetch completes first; the test hook below runs
                    // afterwards, outside of the fetch itself, and is what throws.
                    super.getSegmentFiles(replicationId, checkpoint, filesToFetch, indexShard, (fileName, bytesRecovered) -> {}, listener);
                    runAfterGetFiles[index.getAndIncrement()].run();

From the trace, it looks like the IndexInput is left open because the first fetch has not issued a cancel on the source: fail first notifies listeners and only then invokes cancel on the target:

    public void fail(ReplicationFailedException e, boolean sendShardFailure) {
        if (finished.compareAndSet(false, true)) {
            try {
                logger.debug("marking target " + description() + " as failed", e);
                notifyListener(e, sendShardFailure);
            } finally {
                try {
                    cancellableThreads.cancel("failed" + description() + "[" + ExceptionsHelper.stackTrace(e) + "]");
                } finally {
                    // release the initial reference. replication files will be cleaned as soon as ref count goes to zero, potentially now
                    decRef();
                }
            }
        }
    }

I haven't been able to repro this race, but I think we can avoid it by adding an assertBusy check that the first target has a refcount of 0, similar to this test.
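
A minimal sketch of that check, assuming the test keeps a handle to the first round's target (the firstTarget name is hypothetical; refCount() comes from AbstractRefCounted, which ReplicationTarget extends):

    // Wait until the failed first-round target has released all of its
    // references (refs hit 0 once its replication files are cleaned up)
    // before kicking off the next round of segment replication.
    assertBusy(() -> assertEquals("first target should have released all refs", 0, firstTarget.refCount()));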

@dreamer-89
Member

Looking into it.

@dreamer-89
Member

From build failures 28991 and 28624, the failure is caused by an unclosed IndexOutput on the _0.si segment file (stack trace below).

[2023-10-25T20:45:32,399][INFO ][o.o.i.s.RemoteIndexShardTests] [testSegRepSucceedsOnPreviousCopiedFiles] before test
[2023-10-25T20:45:32,890][INFO ][o.o.i.s.RemoteStoreRefreshListener] [testSegRepSucceedsOnPreviousCopiedFiles] [test][0] Scheduled retry with didRefresh=true
[2023-10-25T20:45:32,935][ERROR][o.o.i.s.IndexShard       ] [testSegRepSucceedsOnPreviousCopiedFiles] [test][0] skip local recovery as no index commit found
[2023-10-25T20:45:33,117][INFO ][o.o.i.r.RecoverySourceHandler] [org.opensearch.index.shard.RemoteIndexShardTests] [test][0][recover to s1] finalizing recovery took [16.4ms]
[2023-10-25T20:45:39,554][ERROR][o.o.i.s.RemoteIndexShardTests] [org.opensearch.index.shard.RemoteIndexShardTests] Replication error
org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:532) [main/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [main/:?]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:169) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:515) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:501) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: java.lang.RuntimeException: Simulated
	at org.opensearch.index.shard.RemoteIndexShardTests.lambda$testSegRepSucceedsOnPreviousCopiedFiles$1(RemoteIndexShardTests.java:374) ~[test/:?]
	at org.opensearch.index.shard.RemoteIndexShardTests$1.getSegmentFiles(RemoteIndexShardTests.java:396) ~[test/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:173) ~[main/:?]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 14 more
[2023-10-25T20:45:39,613][ERROR][o.o.i.s.RemoteIndexShardTests] [org.opensearch.index.shard.RemoteIndexShardTests] Replication error
org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:532) [main/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [main/:?]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:169) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:515) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:501) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: java.io.UncheckedIOException: Error reading name [_0.si], length [325], checksum [r5w8xw], writtenBy [9.8.0]
	at org.opensearch.indices.replication.SegmentReplicationTarget.validateLocalChecksum(SegmentReplicationTarget.java:246) ~[main/:?]
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:200) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:170) ~[main/:?]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 14 more
Caused by: java.nio.file.AccessDeniedException: MockDirectoryWrapper: file "_0.si" is still open for writing
	at org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:808) ~[lucene-test-framework-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.indices.replication.SegmentReplicationTarget.validateLocalChecksum(SegmentReplicationTarget.java:236) ~[main/:?]
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:200) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:170) ~[main/:?]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 14 more
Caused by: java.lang.RuntimeException: unclosed IndexOutput: _0.si
	at org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783) ~[lucene-test-framework-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.tests.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:719) ~[lucene-test-framework-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:153) ~[main/:?]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.store.Directory.copyFrom(Directory.java:181) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:978) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationSource$ReplicationStatsDirectoryWrapper.copyFrom(SegmentReplicationSource.java:88) ~[main/:?]
	at org.opensearch.index.store.RemoteStoreFileDownloader.lambda$copyOneFile$2(RemoteStoreFileDownloader.java:149) ~[main/:?]
	at org.opensearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:124) ~[main/:?]
	at org.opensearch.index.store.RemoteStoreFileDownloader.lambda$copyOneFile$3(RemoteStoreFileDownloader.java:148) ~[main/:?]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[main/:?]
	... 3 more

@dreamer-89
Member

The usual suspect here is an incomplete download of files in the first round of segment replication. The incomplete download leaves open IndexOutput objects, which causes the subsequent round of segment replication to fail. We recently made the download logic async, so with the current test logic it is possible to have in-progress file downloads when the next round of segment replication is invoked.

Caused by: java.lang.RuntimeException: unclosed IndexOutput: _0.si
	at org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783) ~[lucene-test-framework-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.tests.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:719) ~[lucene-test-framework-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:153) ~[main/:?]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.store.Directory.copyFrom(Directory.java:181) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
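
To make the race concrete, here is a self-contained sketch with latches standing in for the remote download (all names are illustrative, not the test's actual code): the async copy keeps the output for _0.si open until the download finishes, so a second replication round started in that window sees an unclosed file.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class AsyncDownloadRaceSketch {
        public static void main(String[] args) throws Exception {
            CountDownLatch copyStarted = new CountDownLatch(1);
            CountDownLatch copyMayFinish = new CountDownLatch(1);
            ExecutorService downloader = Executors.newSingleThreadExecutor();

            // First round: async copy of "_0.si", standing in for
            // Directory.copyFrom on the downloader's thread pool. The output
            // stays open until the download completes.
            downloader.submit(() -> {
                copyStarted.countDown();      // "_0.si" is now open for writing
                try {
                    copyMayFinish.await();    // simulate a slow remote download
                } catch (InterruptedException ignored) {
                }
                // only here would the IndexOutput be closed
            });

            copyStarted.await();
            // Second round starts while the first copy is still in flight:
            // this is the window in which MockDirectoryWrapper reports that
            // "_0.si" is still open for writing and the round fails.
            System.out.println("second round would fail: _0.si still open");

            copyMayFinish.countDown();
            downloader.shutdown();
        }
    }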
