Fixing flaky test testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness #3646

Merged 2 commits on Jun 27, 2022
@@ -364,18 +364,22 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E
.put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c")
.put("cluster.routing.allocation.load_awareness.skew_factor", "0.0")
.put("cluster.routing.allocation.load_awareness.provisioned_capacity", Integer.toString(nodeCountPerAZ * 3))
+ .put("cluster.routing.allocation.allow_rebalance", "indices_primaries_active")
Member

@imRishN From the test failure trace (MasterNotDiscoveredException), it is not clear whether this is an actual issue or a flaky test. Can you explain how the existing test was identified as flaky and how the changes here fix it?

Member Author

The test now starts a dedicated cluster manager node, whereas previously there was no dedicated cluster manager and the test randomly killed half the nodes in a particular zone. I assume MasterNotDiscoveredException was thrown when one of the stopped nodes was the active master at that time, which is why the failure was intermittent.

Member

Thanks @imRishN for the clarification.
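
The explanation in this thread can be illustrated with a minimal, self-contained sketch. It is not the OpenSearch test code: `Node`, `zoneNodes`, and `stopMayRemoveElectedManager` are hypothetical names introduced here, and the sketch assumes that nodes started with `startNodes` are cluster-manager eligible while nodes started with `startDataOnlyNodes` are not. The value 5 for `nodeCountPerAZ` is inferred from the "starting 15 nodes" log message across three zones.

```java
import java.util.ArrayList;
import java.util.List;

class ClusterManagerStopSketch {
    // Hypothetical stand-in for a cluster node; not an OpenSearch class.
    static final class Node {
        final String name;
        final boolean clusterManagerEligible;

        Node(String name, boolean clusterManagerEligible) {
            this.name = name;
            this.clusterManagerEligible = clusterManagerEligible;
        }
    }

    // Builds the nodes of one zone, all with the same eligibility flag.
    static List<Node> zoneNodes(String zone, int count, boolean clusterManagerEligible) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            nodes.add(new Node(zone + "-" + i, clusterManagerEligible));
        }
        return nodes;
    }

    // Returns true if at least one stop candidate is cluster-manager eligible,
    // i.e. stopping a random subset could take down the elected cluster manager.
    static boolean stopMayRemoveElectedManager(List<Node> stopCandidates) {
        for (Node node : stopCandidates) {
            if (node.clusterManagerEligible) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int nodeCountPerAZ = 5; // assumption: 15 nodes over 3 zones

        // Before the change (assumed): every zone node is cluster-manager eligible.
        List<Node> before = zoneNodes("a", nodeCountPerAZ, true);
        // After the change: zone nodes are data-only; the manager runs on a separate node.
        List<Node> after = zoneNodes("a", nodeCountPerAZ, false);

        System.out.println("before: " + stopMayRemoveElectedManager(before)); // prints "before: true"
        System.out.println("after: " + stopMayRemoveElectedManager(after));   // prints "after: false"
    }
}
```

Under these assumptions, stopping random nodes in a zone can only interrupt the cluster manager when the stop candidates themselves are eligible for that role, which matches the intermittent failure described above.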

.build();

+ logger.info("--> starting a dedicated cluster manager node");
+ internalCluster().startClusterManagerOnlyNode();

logger.info("--> starting 15 nodes on zones 'a' & 'b' & 'c'");
- List<String> nodes_in_zone_a = internalCluster().startNodes(
+ List<String> nodes_in_zone_a = internalCluster().startDataOnlyNodes(
nodeCountPerAZ,
Settings.builder().put(commonSettings).put("node.attr.zone", "a").build()
);
- List<String> nodes_in_zone_b = internalCluster().startNodes(
+ List<String> nodes_in_zone_b = internalCluster().startDataOnlyNodes(
Member

nit: Looks like nodes_in_zone_b is not used after declaration. In that case, it can be removed.

nodeCountPerAZ,
Settings.builder().put(commonSettings).put("node.attr.zone", "b").build()
);
- List<String> nodes_in_zone_c = internalCluster().startNodes(
+ List<String> nodes_in_zone_c = internalCluster().startDataOnlyNodes(
Member

Same as above

nodeCountPerAZ,
Settings.builder().put(commonSettings).put("node.attr.zone", "c").build()
);
@@ -395,7 +399,7 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E
.setIndices("test-1")
.setWaitForEvents(Priority.LANGUID)
.setWaitForGreenStatus()
- .setWaitForNodes(Integer.toString(nodeCountPerAZ * 3))
+ .setWaitForNodes(Integer.toString(nodeCountPerAZ * 3 + 1))
.setWaitForNoRelocatingShards(true)
.setWaitForNoInitializingShards(true)
.execute()
@@ -431,7 +435,7 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E
.prepareHealth()
.setIndices("test-1")
.setWaitForEvents(Priority.LANGUID)
- .setWaitForNodes(Integer.toString(nodeCountPerAZ * 3 - nodesToStop))
+ .setWaitForNodes(Integer.toString(nodeCountPerAZ * 3 - nodesToStop + 1))
.setWaitForNoRelocatingShards(true)
.setWaitForNoInitializingShards(true)
.execute()
@@ -452,7 +456,7 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E
.prepareHealth()
.setIndices("test-1", "test-2")
.setWaitForEvents(Priority.LANGUID)
- .setWaitForNodes(Integer.toString(nodeCountPerAZ * 3 - nodesToStop))
+ .setWaitForNodes(Integer.toString(nodeCountPerAZ * 3 - nodesToStop + 1))
.setWaitForNoRelocatingShards(true)
.setWaitForNoInitializingShards(true)
.execute()
@@ -477,7 +481,7 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E
.prepareHealth()
.setIndices("test-1", "test-2")
.setWaitForEvents(Priority.LANGUID)
- .setWaitForNodes(Integer.toString(nodeCountPerAZ * 3))
+ .setWaitForNodes(Integer.toString(nodeCountPerAZ * 3 + 1))
.setWaitForGreenStatus()
.setWaitForActiveShards(2 * numOfShards * (numOfReplica + 1))
.setWaitForNoRelocatingShards(true)
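
The `+ 1` in each `setWaitForNodes` argument across these hunks accounts for the dedicated cluster manager node, which now counts toward the cluster size alongside the data nodes. The arithmetic can be sketched on its own, outside the test framework. `expectedNodes` is a hypothetical helper, and the concrete values (5 nodes per zone, 2 stopped nodes) are assumptions chosen for illustration, not values taken from the test.

```java
class ExpectedNodeCountSketch {
    // Hypothetical helper: total nodes the health request should wait for,
    // formatted as a string the way the diff passes it to setWaitForNodes.
    static String expectedNodes(int nodeCountPerAZ, int zones, int stoppedDataNodes, int dedicatedClusterManagers) {
        int dataNodes = nodeCountPerAZ * zones - stoppedDataNodes;
        return Integer.toString(dataNodes + dedicatedClusterManagers);
    }

    public static void main(String[] args) {
        int nodeCountPerAZ = 5; // assumption: 15 data nodes over 3 zones
        int nodesToStop = 2;    // assumption: illustrative value only

        // All data nodes up, plus one dedicated cluster manager.
        System.out.println(expectedNodes(nodeCountPerAZ, 3, 0, 1));           // prints 16
        // Some data nodes stopped; the cluster manager is still counted.
        System.out.println(expectedNodes(nodeCountPerAZ, 3, nodesToStop, 1)); // prints 14
    }
}
```

Without the extra node in the expected count, the health request would wait for a cluster size that no longer matches the number of running nodes.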