tests/partition_balancer: more robust wait_until_status

Previously, when the controller leader node was suspended during the test, all status requests failed with a timeout error. This was true for all nodes, not just the suspended one, because the status request is proxied to the controller leader, so the internal retries in the admin API wrapper didn't help. We increase the status request timeout and add 504 to the retriable status codes so that the internal retries can handle this situation.

(cherry picked from commit dc83a7b)
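
To illustrate why treating 504 as retriable helps, here is a minimal, self-contained sketch of the retry behaviour. It is not the actual `Admin` wrapper from rptest: `FakeCluster`, `get_with_retries`, `max_attempts`, and `backoff_sec` are hypothetical names invented for this example, and the response body is a placeholder.

```python
import time

# Mirrors retry_codes=[503, 504] from the diff; everything else is hypothetical.
RETRY_CODES = {503, 504}


class FakeCluster:
    """Hypothetical stand-in for a cluster whose controller leader is suspended.

    The first `failures` requests return 504 (the proxied request timed out).
    After that, a new leader is assumed to have been elected and requests succeed.
    """

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def get_status(self):
        self.calls += 1
        if self.calls <= self.failures:
            return 504, None
        return 200, {"status": "ready"}  # placeholder body


def get_with_retries(request, retry_codes=RETRY_CODES, max_attempts=5,
                     backoff_sec=0.0):
    """Call `request` until it returns a non-retriable code or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        code, body = request()
        if code not in retry_codes:
            return code, body
        if attempt < max_attempts:
            time.sleep(backoff_sec)
    # All attempts returned a retriable code; hand back the last response.
    return code, body


# With 504 retriable, the caller rides out the leader re-election.
cluster = FakeCluster(failures=2)
code, body = get_with_retries(cluster.get_status)

# With only 503 retriable, the first 504 is returned to the caller immediately.
strict_cluster = FakeCluster(failures=2)
strict_code, _ = get_with_retries(strict_cluster.get_status, retry_codes={503})
```

In this sketch the first call pattern succeeds on the third attempt, while the second gives up after a single request, which corresponds to the failure mode described in the commit message.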
redpanda-data · Aug 15, 2022 · 7ae38b0
1 parent 68b54be
Showing 1 changed file with 4 additions and 2 deletions.
diff --git a/tests/rptest/tests/partition_balancer_test.py b/tests/rptest/tests/partition_balancer_test.py
@@ -77,13 +77,15 @@ def node2partition_count(self):
         return ret
 
     def wait_until_status(self, predicate, timeout_sec=120):
-        admin = Admin(self.redpanda)
+        # We may get a 504 if we proxy a status request to a suspended node.
+        # It is okay to retry (the controller leader will get re-elected in the meantime).
+        admin = Admin(self.redpanda, retry_codes=[503, 504])
         start = time.time()
 
         def check():
             req_start = time.time()
 
-            status = admin.get_partition_balancer_status(timeout=1)
+            status = admin.get_partition_balancer_status(timeout=10)
             self.logger.info(f"partition balancer status: {status}")
 
             if "seconds_since_last_tick" not in status: