
[CI] o.o.cluster.ClusterHealthIT.testHealthOnMasterFailover test failure #1693

Closed

opensearch-ci-bot opened this issue Dec 10, 2021 · 10 comments

Labels: bug (Something isn't working), cicd, flaky-test (Random test failure that succeeds on second run), >test-failure (Test failure from CI, local build, etc.), v2.0.0 (Version 2.0.0)

Comments

@opensearch-ci-bot

@opensearch-project/opensearch-core team, please take a look at Build 1420 in the gradle_check log for more information.

opensearch-ci-bot added the bug and cicd labels on Dec 10, 2021
@tlfeng commented Dec 10, 2021

The test failure in the gradle check log:


> Task :server:internalClusterTest

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=93B51E5BAF1F40EE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=America/Rosario -Druntime.java=17

org.opensearch.cluster.ClusterHealthIT > testHealthOnMasterFailover FAILED
    java.util.concurrent.ExecutionException: RemoteTransportException[[node_s3][127.0.0.1:43689][cluster:monitor/health]]; nested: MasterNotDiscoveredException[NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}];
        at __randomizedtesting.SeedInfo.seed([93B51E5BAF1F40EE:D5A3176ACF55C08]:0)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:281)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:268)
        at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:99)
        at org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover(ClusterHealthIT.java:393)

        Caused by:
        RemoteTransportException[[node_s3][127.0.0.1:43689][cluster:monitor/health]]; nested: MasterNotDiscoveredException[NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}];

            Caused by:
            MasterNotDiscoveredException[NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}];
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:275)
                at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:369)
                at org.opensearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:174)
                at org.opensearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:142)
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.retry(TransportMasterNodeAction.java:258)
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.retryOnMasterChange(TransportMasterNodeAction.java:239)
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.access$200(TransportMasterNodeAction.java:143)
                at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.handleException(TransportMasterNodeAction.java:224)
                at org.opensearch.transport.TransportService$6.handleException(TransportService.java:742)
                at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1357)
                at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:345)
                at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:296)
                at org.opensearch.transport.InboundHandler.handleException(InboundHandler.java:343)
                at org.opensearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:335)
                at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:152)
                at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:108)
                at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:759)
                at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:170)
                at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:145)
                at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110)
                at org.opensearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:341)
                at org.opensearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:246)
                at org.opensearch.nio.BytesChannelContext.read(BytesChannelContext.java:59)
                at org.opensearch.nio.EventHandler.handleRead(EventHandler.java:152)
                at org.opensearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:167)
                at org.opensearch.nio.NioSelector.handleRead(NioSelector.java:438)
                at org.opensearch.nio.NioSelector.processKey(NioSelector.java:264)
                at org.opensearch.nio.NioSelector.singleLoop(NioSelector.java:191)
                at org.opensearch.nio.NioSelector.runLoop(NioSelector.java:148)
                at java.lang.Thread.run(Thread.java:833)

                Caused by:
                NodeClosedException[node closed {node_s0}{5ZgaEvMvTSyRAfSIhO0oJA}{ZDF2DemYSeO5FtbS_BaoKg}{127.0.0.1}{127.0.0.1:32949}{dimr}{shard_indexing_pressure_enabled=true}]
                    at org.opensearch.action.admin.cluster.health.TransportClusterHealthAction$3.onClusterServiceClose(TransportClusterHealthAction.java:274)
                    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:362)
                    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:269)
                    at org.opensearch.cluster.service.ClusterApplierService.doStop(ClusterApplierService.java:189)
                    at org.opensearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:92)
                    at org.opensearch.cluster.service.ClusterService.doStop(ClusterService.java:135)
                    at org.opensearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:92)
                    at org.opensearch.node.Node.stop(Node.java:1187)
                    at org.opensearch.node.Node.close(Node.java:1212)
                    at org.opensearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:1098)
                    at org.opensearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:1040)
                    at org.opensearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1971)
                    at org.opensearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1937)
                    at org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover(ClusterHealthIT.java:380)

@andrross

The assertion that fails is here:

https://github.com/andrross/OpenSearch/blob/996d33adb22ac6a523962166f3677f8dc6ac9c1f/server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java#L393

which asserts that the response to the get-health call is a success, but it fails with a MasterNotDiscoveredException. I was able to semi-regularly reproduce this error on my Mac and tracked it down in the debugger to the timeout here:

https://github.com/andrross/OpenSearch/blob/996d33adb22ac6a523962166f3677f8dc6ac9c1f/server/src/main/java/org/opensearch/action/support/master/TransportMasterNodeAction.java#L275
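
For context, a rough sketch of the pattern the test exercises, assuming the usual OpenSearchIntegTestCase and InternalTestCluster APIs. This is not the exact ClusterHealthIT code; the class name, callback, and options are illustrative:

```java
import org.opensearch.action.ActionFuture;
import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
import org.opensearch.common.Priority;
import org.opensearch.test.InternalTestCluster;
import org.opensearch.test.OpenSearchIntegTestCase;

public class ClusterHealthFailoverSketchIT extends OpenSearchIntegTestCase {
    public void testHealthOnMasterFailoverSketch() throws Exception {
        // Fire a cluster health request that must be retried across a master failover.
        ActionFuture<ClusterHealthResponse> healthFuture = client().admin().cluster()
            .prepareHealth()
            .setWaitForEvents(Priority.LANGUID)
            .execute();

        // Restart the elected master while the health request is in flight.
        internalCluster().restartNode(internalCluster().getMasterName(), new InternalTestCluster.RestartCallback());

        // Roughly ClusterHealthIT.java:393 - get() rethrows the failure as an
        // ExecutionException wrapping MasterNotDiscoveredException when the retry
        // loop in TransportMasterNodeAction times out before a new master is found.
        assertFalse(healthFuture.get().isTimedOut());
    }
}
```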

nknize changed the title from "Jenkins Randomized Gradle Check Failed on main" to "[CI] o.o.cluster.ClusterHealthIT.testHealthOnMasterFailover test failure" on Dec 16, 2021
@nknize commented Dec 16, 2021

Another failure on PR #1742.

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=58E01485FD8F1669 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m -Djava.security.manager=allow" -Dtests.locale=sr-Latn-RS -Dtests.timezone=America/Montserrat -Druntime.java=17
org.opensearch.cluster.ClusterHealthIT > testHealthOnMasterFailover FAILED
    java.util.concurrent.ExecutionException: MasterNotDiscoveredException[NodeClosedException[node closed {node_s1}{LW9HHvIqQ0qop98CQxk5og}{qu7gbOicQA6-BH7hVPBpHw}{127.0.0.1}{127.0.0.1:42515}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s1}{LW9HHvIqQ0qop98CQxk5og}{qu7gbOicQA6-BH7hVPBpHw}{127.0.0.1}{127.0.0.1:42515}{dimr}{shard_indexing_pressure_enabled=true}];
        at __randomizedtesting.SeedInfo.seed([58E01485FD8F1669:C60F3BA8FE650A8F]:0)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:281)
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:268)
        at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:99)
        at org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover(ClusterHealthIT.java:393)

        Caused by:
        MasterNotDiscoveredException[NodeClosedException[node closed {node_s1}{LW9HHvIqQ0qop98CQxk5og}{qu7gbOicQA6-BH7hVPBpHw}{127.0.0.1}{127.0.0.1:42515}{dimr}{shard_indexing_pressure_enabled=true}]]; nested: NodeClosedException[node closed {node_s1}{LW9HHvIqQ0qop98CQxk5og}{qu7gbOicQA6-BH7hVPBpHw}{127.0.0.1}{127.0.0.1:42515}{dimr}{shard_indexing_pressure_enabled=true}];
            at app//org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:275)
            at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:369)

@dblock commented Dec 22, 2021

Another one in #1791

@dreamer-89

Looking into it

@dreamer-89

After running the test multiple times, the results show that the failure is far more frequent when index creation is enabled and the master node timeout is shorter.

@dreamer-89 commented Dec 23, 2021

Failure frequency depends on the master node timeout and on whether index creation is enabled.

A summary of test failures based on these two properties:

| Failures %               | Index Not Used | Index Used |
|--------------------------|----------------|------------|
| Timeout 60 sec (current) | 0%             | ~1%        |
| Timeout 10 sec           | <0-1%          | ~50-70%    |
| Timeout 1 sec            | <1%            | >90%       |

@dreamer-89

The experiment shows that the master node generally takes longer to boot up when indices are used in the test.

Next Steps:

  1. Reduce the flakiness of the test by increasing the timeout to 2 minutes (see the sketch after this list).
  2. Dissect master bootup time and check why it is taking longer when indices are used.
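
A hypothetical illustration of step 1, assuming the test's health request goes through the standard client API inside an OpenSearchIntegTestCase-based test; the exact change in the test may look different:

```java
import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
import org.opensearch.common.Priority;
import org.opensearch.common.unit.TimeValue;

// ... inside the test method (illustrative only, not the merged fix):
// give the health request a two-minute master node timeout so
// TransportMasterNodeAction's retry loop has longer to find the newly elected master.
ClusterHealthResponse response = client().admin().cluster()
    .prepareHealth()
    .setWaitForEvents(Priority.LANGUID)
    .setMasterNodeTimeout(TimeValue.timeValueMinutes(2))  // raised from the current 60s
    .get();
assertFalse(response.isTimedOut());
```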

anasalkouz added the flaky-test label and removed the untriaged label on Dec 28, 2021
@dreamer-89

Created new issue #1828 to debug/improve the longer master bootup time when indices are used.
