[BUG] Test org.opensearch.indices.replication.SegmentReplicationSuiteIT is flaky #9499

Closed
sachinpkale opened this issue Aug 23, 2023 · 9 comments · Fixed by #11977
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Indexing:Replication Issues and PRs related to core replication framework eg segrep v2.11.0 Issues and PRs related to version 2.11.0

Comments

@sachinpkale
Member

Using the same seed does not always fail the test; it has to be run multiple times to hit the failure (locally, I hit it on the 13th retry).
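Because a fixed seed does not pin down thread timing, reproducing this kind of failure usually means rerunning until it shows up. A minimal sketch of such a rerun loop follows; `RepeatUntilFailure` and `firstFailingAttempt` are hypothetical names for illustration, and the lambda is a stand-in for a real test run, not the actual suite.

```java
import java.util.function.IntPredicate;

/** Hypothetical helper: reruns a flaky check until it fails or the attempt budget runs out. */
final class RepeatUntilFailure {

    /**
     * Runs attempts 1..maxAttempts in order.
     * Returns the number of the first failing attempt, or -1 if every attempt passed.
     */
    static int firstFailingAttempt(int maxAttempts, IntPredicate attemptPasses) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!attemptPasses.test(attempt)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for a flaky test: it passes on every attempt except the 13th.
        int failedAt = firstFailingAttempt(50, attempt -> attempt != 13);
        System.out.println("first failure on attempt " + failedAt);
    }
}
```

In practice the predicate would wrap whatever launches the test and inspects its result; the loop itself stays the same.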

Build where it failed: https://build.ci.opensearch.org/job/gradle-check/23203/

I was able to reproduce with main

  2> java.lang.IllegalStateException: Some shards are still open after the threadpool terminated. Something is leaking index readers or store references.
        at __randomizedtesting.SeedInfo.seed([CFC3DCBFE313A077]:0)
        at org.opensearch.node.Node.awaitClose(Node.java:1541)
        at org.opensearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1129)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:114)
        at org.opensearch.test.InternalTestCluster.close(InternalTestCluster.java:966)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
        at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:114)
        at org.opensearch.test.OpenSearchIntegTestCase.clearClusters(OpenSearchIntegTestCase.java:576)
        at org.opensearch.test.OpenSearchIntegTestCase.afterClass(OpenSearchIntegTestCase.java:2283)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:578)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:901)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.base/java.lang.Thread.run(Thread.java:1623)
  2> REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationSuiteIT" -Dtests.seed=CFC3DCBFE313A077 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-US -Dtests.timezone=UTC -Druntime.java=20
Tests with failures:
 - org.opensearch.indices.replication.SegmentReplicationSuiteIT.testFullRestartDuringReplication
 - org.opensearch.indices.replication.SegmentReplicationSuiteIT.testDropRandomNodeDuringReplication
 - org.opensearch.indices.replication.SegmentReplicationSuiteIT.testDeleteIndexWhileReplicating
 - org.opensearch.indices.replication.SegmentReplicationSuiteIT.testBasicReplication
 - org.opensearch.indices.replication.SegmentReplicationSuiteIT.classMethod
@sachinpkale sachinpkale added bug Something isn't working untriaged flaky-test Random test failure that succeeds on second run and removed untriaged labels Aug 23, 2023
@sachinpkale sachinpkale added v2.10.0 Indexing:Replication Issues and PRs related to core replication framework eg segrep and removed untriaged labels Aug 23, 2023
@mch2
Member

mch2 commented Aug 23, 2023

I left this running overnight on main (before the #9480 merge) and did not see the failure after ~12k iterations. I will try pulling in the latest changes. We likely have a race closing commit refs before shutdown.
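To illustrate how an unreleased reference can trip a shutdown check like the one in the stack trace above, here is a minimal single-threaded sketch. `RefCountedStore`, `RefLeakDemo`, and `awaitClose` are hypothetical stand-ins written for this example, not OpenSearch's actual classes, and the interleaving is forced by hand rather than produced by a real race.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a ref-counted resource such as a shard store. */
final class RefCountedStore {
    // Starts at 1: the owner's own reference, released by close().
    private final AtomicInteger refs = new AtomicInteger(1);

    /** Takes an extra reference; returns false if the store is already fully released. */
    boolean tryIncRef() {
        while (true) {
            int current = refs.get();
            if (current <= 0) {
                return false;
            }
            if (refs.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Releases one reference. */
    void decRef() {
        if (refs.decrementAndGet() < 0) {
            throw new IllegalStateException("released more references than were taken");
        }
    }

    /** The owner gives up its initial reference. */
    void close() {
        decRef();
    }

    boolean isFullyReleased() {
        return refs.get() == 0;
    }
}

final class RefLeakDemo {
    /** Mimics a shutdown check that refuses to finish while references are outstanding. */
    static void awaitClose(RefCountedStore store) {
        if (!store.isFullyReleased()) {
            throw new IllegalStateException("store still has open references");
        }
    }

    public static void main(String[] args) {
        RefCountedStore store = new RefCountedStore();
        store.tryIncRef(); // an in-flight replication takes a reference
        store.close();     // shutdown starts before that reference is released
        try {
            awaitClose(store);
        } catch (IllegalStateException e) {
            System.out.println("leak detected: " + e.getMessage());
        }
        store.decRef();    // the late release arrives
        awaitClose(store); // now passes
        System.out.println("closed cleanly after release");
    }
}
```

If the shutdown check runs between `close()` and the late `decRef()`, it reports a leak even though the reference would have been released moments later, which is the kind of timing window a race would open.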

@dreamer-89
Member

Another occurrence: https://build.ci.opensearch.org/job/gradle-check/23440/

Test Result (2 failures / +1)
 - org.opensearch.indices.replication.SegmentReplicationSuiteIT.testFullRestartDuringReplication
 - org.opensearch.indices.replication.SegmentReplicationSuiteIT.testDeleteIndexWhileReplicating

The failure happens on the primary shard's node, where the requested shard does not exist in IndexService.

...
Caused by: org.opensearch.transport.RemoteTransportException: [node_s1][127.0.0.1:40705][internal:index/shard/replication/get_checkpoint_info]
Caused by: org.opensearch.index.shard.ShardNotFoundException: no such shard
	at org.opensearch.index.IndexService.getShard(IndexService.java:337) ~[main/:?]
	at org.opensearch.indices.replication.OngoingSegmentReplications.getCachedCopyState(OngoingSegmentReplications.java:84) ~[main/:?]
	at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:141) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:129) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:110) ~[main/:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[main/:?]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:454) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1623) ~[?:?]
[2023-08-25T09:13:39,638][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_s0] [shardId [test-idx-1][2]] [replication id 94] Replication failed, timing data: {INIT=0, REPLICATING=0}
org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:528) [main/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:84) [main/:?]
	at org.opensearch.core.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) [main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1483) [main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:421) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [node_s1][127.0.0.1:40705][internal:index/shard/replication/get_checkpoint_info]
Caused by: org.opensearch.index.shard.ShardNotFoundException: no such shard
	at org.opensearch.index.IndexService.getShard(IndexService.java:337) ~[main/:?]
	at org.opensearch.indices.replication.OngoingSegmentReplications.getCachedCopyState(OngoingSegmentReplications.java:84) ~[main/:?]
	at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:141) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:129) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:110) ~[main/:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[main/:?]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:454) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
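
The failure mode in the log above, a checkpoint-info request reaching a node that does not hold the shard, can be sketched in isolation. Everything below (`ShardRegistry`, `NoSuchShardException`, `CheckpointRequestDemo`) is a hypothetical simplification for illustration and does not reflect the real `IndexService` or `SegmentReplicationSourceService` code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical exception mirroring the "no such shard" failure in the log. */
final class NoSuchShardException extends RuntimeException {
    NoSuchShardException(int shardId) {
        super("no such shard: " + shardId);
    }
}

/** Hypothetical, simplified registry of the shards hosted on one node. */
final class ShardRegistry {
    private final Map<Integer, String> shards = new ConcurrentHashMap<>();

    void addShard(int shardId, String description) {
        shards.put(shardId, description);
    }

    void removeShard(int shardId) {
        shards.remove(shardId);
    }

    /** Throws if the shard is not (or is no longer) registered on this node. */
    String getShard(int shardId) {
        String shard = shards.get(shardId);
        if (shard == null) {
            throw new NoSuchShardException(shardId);
        }
        return shard;
    }
}

final class CheckpointRequestDemo {
    /** Simulates the primary-side handler and how a replica would see the outcome. */
    static String handleCheckpointInfoRequest(ShardRegistry primaryNode, int shardId) {
        try {
            return "checkpoint for " + primaryNode.getShard(shardId);
        } catch (NoSuchShardException e) {
            return "replication failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        ShardRegistry primaryNode = new ShardRegistry();
        primaryNode.addShard(2, "primary[2]");
        System.out.println(handleCheckpointInfoRequest(primaryNode, 2));

        // The shard goes away (for example, index deletion or a restart) before the next request.
        primaryNode.removeShard(2);
        System.out.println(handleCheckpointInfoRequest(primaryNode, 2));
    }
}
```

The second request fails only because of ordering: the lookup itself is correct, but the shard was removed between the two calls.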

@andrross
Member

andrross commented Oct 5, 2023

Another failure: #10388 (comment)

@ashking94
Member

#11021 (comment)

@reta
Collaborator

reta commented Apr 29, 2024

java.lang.IllegalStateException: Some shards are still open after the threadpool terminated. Something is leaking index readers or store references.
	at __randomizedtesting.SeedInfo.seed([776E1C21CC3D36FF:CEFFDF69ECCD0FF6]:0)
	at org.opensearch.node.Node.awaitClose(Node.java:1740)
	at org.opensearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1130)
	at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:89)
	at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:131)
	at org.opensearch.common.util.io.IOUtils.close(IOUtils.java:114)
	at org.opensearch.test.InternalTestCluster.close(InternalTestCluster.java:967)
	at org.opensearch.test.OpenSearchTestClusterRule.afterInternal(OpenSearchTestClusterRule.java:325)
	at org.opensearch.test.OpenSearchTestClusterRule.after(OpenSearchTestClusterRule.java:188)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:374)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1583)

@reta reta reopened this Apr 29, 2024
@reta
Collaborator

reta commented Apr 29, 2024

This issue is not gone or fixed: #13446

@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@reta Thanks for reopening this recurring issue.

@mch2
Member

mch2 commented May 7, 2024

This issue is not gone or fixed: #13446

@reta I believe the failure in https://build.ci.opensearch.org/job/gradle-check/37929/ is a different test, SegmentReplicationIT.testReplicaAlreadyAtCheckpoint, which belongs to a different test suite. I opened issue #13593 for that.

@mch2 mch2 closed this as completed May 7, 2024