Fixing flaky test testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness #3646

Merged
merged 2 commits into from
Jun 27, 2022

Conversation

imRishN
Member

@imRishN imRishN commented Jun 22, 2022

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>

Description

Caused by #3563

org.opensearch.cluster.allocation.AwarenessAllocationIT > testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness FAILED
    java.lang.AssertionError: unexpected
        at org.opensearch.test.InternalTestCluster.removeExclusions(InternalTestCluster.java:1912)
        at org.opensearch.test.InternalTestCluster.stopNodesAndClients(InternalTestCluster.java:1777)
        at org.opensearch.test.InternalTestCluster.stopNodesAndClient(InternalTestCluster.java:1764)
        at org.opensearch.test.InternalTestCluster.stopRandomNode(InternalTestCluster.java:1672)
        at org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness(AwarenessAllocationIT.java:425)

        Caused by:
        java.util.concurrent.ExecutionException: MasterNotDiscoveredException[null]
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286)
            at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:273)
            at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:104)
            at org.opensearch.test.InternalTestCluster.removeExclusions(InternalTestCluster.java:1910)
            ... 4 more

            Caused by:
            MasterNotDiscoveredException[null]
                at app//org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:282)
                at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
                at app//org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
                at app//org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697)
                at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739)
                at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
                at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
                at java.base@17.0.3/java.lang.Thread.run(Thread.java:833)

    MasterNotDiscoveredException[null]
        at app//org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:282)
        at app//org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
        at app//org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
        at app//org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:697)
        at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739)
        at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base@17.0.3/java.lang.Thread.run(Thread.java:833)
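
The nesting in the trace above (an AssertionError wrapping an ExecutionException wrapping the timeout exception) can be reproduced in miniature with the JDK alone. This is only an analogy, assuming a cleanup step that blocks on a future and rethrows any failure as "unexpected". `WrappedTimeoutSketch`, `ManagerNotDiscovered`, and `blockingCleanup` are hypothetical names, and `CompletableFuture` stands in for the OpenSearch `BaseFuture`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class WrappedTimeoutSketch {
    // Hypothetical stand-in for MasterNotDiscoveredException; not the OpenSearch class.
    static class ManagerNotDiscovered extends RuntimeException {
    }

    // Blocks on the future and converts any failure into an AssertionError,
    // mirroring the shape (not the code) of the failing cleanup step in the trace.
    static void blockingCleanup(CompletableFuture<Void> response) {
        try {
            response.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new AssertionError("unexpected", e);
        }
    }

    // Walks the cause chain and joins the simple class names, outermost first.
    static String describeChain(Throwable t) {
        StringBuilder sb = new StringBuilder();
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (sb.length() > 0) {
                sb.append(" <- ");
            }
            sb.append(cur.getClass().getSimpleName());
        }
        return sb.toString();
    }

    // Completes a future with the stand-in exception and returns the resulting cause chain.
    static String runExample() {
        CompletableFuture<Void> response = new CompletableFuture<>();
        response.completeExceptionally(new ManagerNotDiscovered());
        try {
            blockingCleanup(response);
            return "no failure";
        } catch (AssertionError e) {
            return describeChain(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runExample());
    }
}
```

Read this way, the innermost "Caused by" is the actual problem (no cluster manager was found before the timeout); the outer two layers only reflect how the failure was surfaced.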

Issues Resolved

#3603

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@imRishN imRishN requested review from a team and reta as code owners June 22, 2022 07:08
…onIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness by adding dedicated cluster manager node

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@opensearch-ci-bot
Collaborator

✅   Gradle Check success 9b30f1a711d9a06dfa96996513124b21a60f263c
Log 6210

Reports 6210

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 06b2acd
Log 6211

Reports 6211

…lueAndLoadAwareness

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 7095265
Log 6218

Reports 6218

@imRishN
Member Author

imRishN commented Jun 22, 2022

Ran the test 100 times now. Succeeds every time.

for i in {1..100}
do
	echo "Task $i"
	./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness" -Dtests.seed=CD3B9289D31206B8 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl -Dtests.timezone=Asia/Katmandu -Druntime.java=17 --rerun-tasks
	response_code=$?
	echo "Response code $response_code"
	# Treat any non-zero exit code as a failed run and stop at the first failure.
	if [[ $response_code -ne 0 ]]; then
		echo "Test $i failed"
		break
	else
		echo "Test $i passed. Sleeping 5 seconds"
		sleep 5
	fi
done
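
How much 100 consecutive passes says about a flaky test depends on how rarely it fails. A rough sketch, assuming each run fails independently with a fixed probability (an assumption; the loop above pins the seed, so its runs may be more alike than independent draws). `FlakyRunOdds` and `allPassProbability` are hypothetical names used only for this illustration.

```java
public class FlakyRunOdds {
    // Probability that `runs` independent runs all pass
    // when each run fails with probability `failureRate`.
    static double allPassProbability(double failureRate, int runs) {
        return Math.pow(1.0 - failureRate, runs);
    }

    public static void main(String[] args) {
        // Illustrative failure rates only; none of these are measured values.
        for (double p : new double[] {0.001, 0.01, 0.05}) {
            System.out.printf("failure rate %.3f -> P(100 passes) = %.3f%n",
                p, allPassProbability(p, 100));
        }
    }
}
```

Under that assumption, a test that fails once in a hundred runs still passes 100 times in a row roughly a third of the time, so a clean loop is weak evidence on its own.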

@kartg
Member

kartg commented Jun 23, 2022

Seems like both gradle check failures are from a flaky test - #3650

Refiring.

@kartg
Member

kartg commented Jun 23, 2022

start gradle check

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 7095265
Log 6267

Reports 6267

@dreamer-89
Member

dreamer-89 commented Jun 25, 2022

Thank you @imRishN for this PR. I appreciate you taking the time to fix this flaky test.

In the past, flaky tests have rarely failed when run in isolation as a single test, so I suspect this test would still pass without your fix. Running the entire gradle check would give a better picture, since that is what runs on CI today. Can you give it a try?

@@ -364,18 +364,22 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E
.put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
.put("cluster.routing.allocation.load_awareness.skew_factor", "0.0")
.put("cluster.routing.allocation.load_awareness.provisioned_capacity", Integer.toString(nodeCountPerAZ * 3))
.put("cluster.routing.allocation.allow_rebalance", "indices_primaries_active")
Member

@imRishN From the test failure trace (MasterNotDiscoveredException), it is not clear whether this is an actual issue or a flaky one. Can you explain how the existing test was identified as flaky and how the changes here fix it?

Member Author

The test now adds a dedicated cluster manager node. Previously there was no dedicated cluster manager, and the test was randomly killing half the nodes in a particular zone. I assume the MasterNotDiscoveredException occurred when a stopped node happened to be the active master at that time, which is why the exception was thrown only sometimes.
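
The reasoning in this reply can be illustrated with a toy model. It is a sketch only and does not use the OpenSearch test framework. It assumes three zones, one node stopped per trial (the real test stops several), and that without a dedicated cluster manager any node is equally likely to be the elected one. `DedicatedManagerSketch`, `countManagerStops`, and the node names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DedicatedManagerSketch {
    // Counts, over `trials` simulated runs, how often the randomly stopped node
    // is also the elected cluster manager.
    static int countManagerStops(boolean dedicatedManager, int nodesPerZone, int trials, long seed) {
        Random random = new Random(seed);
        int managerStops = 0;
        for (int t = 0; t < trials; t++) {
            // Nodes the test is allowed to stop: the zone nodes.
            List<String> stoppable = new ArrayList<>();
            for (String zone : new String[] {"a", "b", "c"}) {
                for (int n = 0; n < nodesPerZone; n++) {
                    stoppable.add(zone + "-" + n);
                }
            }
            // With a dedicated manager, the elected node sits outside the stoppable pool;
            // otherwise any zone node may hold the role (toy assumption).
            String elected = dedicatedManager
                ? "manager-0"
                : stoppable.get(random.nextInt(stoppable.size()));
            String stopped = stoppable.get(random.nextInt(stoppable.size()));
            if (stopped.equals(elected)) {
                managerStops++;
            }
        }
        return managerStops;
    }

    public static void main(String[] args) {
        System.out.println("without dedicated manager: " + countManagerStops(false, 5, 1000, 42L));
        System.out.println("with dedicated manager:    " + countManagerStops(true, 5, 1000, 42L));
    }
}
```

In this model the dedicated-manager case can never stop the elected node, while the mixed-role case does so occasionally, which matches the intermittent nature of the failure described above.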

Member

Thanks @imRishN for the clarification.

@imRishN
Member Author

imRishN commented Jun 26, 2022

The build passes locally

@Bukhtawar
Collaborator

start gradle check

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure 7095265
Log 6348

Reports 6348

@dreamer-89
Member

Test (flaky) failure. Tracked in #3579

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testHighWatermarkNotExceeded" -Dtests.seed=3CBC6279C41EB13E -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=cs-CZ -Dtests.timezone=Asia/Dili -Druntime.java=17

org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT > testHighWatermarkNotExceeded FAILED
    java.lang.AssertionError: Mismatching shard routings: []
    Expected: a collection with size <1>
         but: collection size was <0>
        at __randomizedtesting.SeedInfo.seed([3CBC6279C41EB13E:D59D83CB44D878D0]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:964)
        at org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.lambda$assertBusyWithDiskUsageRefresh$5(DiskThresholdDeciderIT.java:362)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1049)
        at org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.assertBusyWithDiskUsageRefresh(DiskThresholdDeciderIT.java:355)
        at org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testHighWatermarkNotExceeded(DiskThresholdDeciderIT.java:188)

@dreamer-89
Member

start gradle check

@opensearch-ci-bot
Collaborator

✅   Gradle Check success 7095265
Log 6352

Reports 6352

Member

@dreamer-89 dreamer-89 left a comment

LGTM!

@@ -364,18 +364,22 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E
.put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
.put("cluster.routing.allocation.load_awareness.skew_factor", "0.0")
.put("cluster.routing.allocation.load_awareness.provisioned_capacity", Integer.toString(nodeCountPerAZ * 3))
.put("cluster.routing.allocation.allow_rebalance", "indices_primaries_active")
     nodeCountPerAZ,
     Settings.builder().put(commonSettings).put("node.attr.zone", "a").build()
 );
-List<String> nodes_in_zone_b = internalCluster().startNodes(
+List<String> nodes_in_zone_b = internalCluster().startDataOnlyNodes(
Member

nit: Looks like nodes_in_zone_b is not used after declaration. In that case, it can be removed.

     nodeCountPerAZ,
     Settings.builder().put(commonSettings).put("node.attr.zone", "b").build()
 );
-List<String> nodes_in_zone_c = internalCluster().startNodes(
+List<String> nodes_in_zone_c = internalCluster().startDataOnlyNodes(
Member

Same as above

@Bukhtawar Bukhtawar merged commit 22b42e4 into opensearch-project:main Jun 27, 2022
saratvemulapalli pushed a commit that referenced this pull request Jun 27, 2022
…reness (#3646)

* Fixing flaky test org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness by adding dedicated cluster manager node

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@Poojita-Raj Poojita-Raj mentioned this pull request Nov 15, 2022
37 tasks