
Failure in MaintenanceTest.test_maintenance_sticky.use_rpk=True #4772

Closed

dimitriscruz opened this issue May 17, 2022 · 9 comments · Fixed by #5426

@dimitriscruz
Contributor

Build (v22.1.x): https://buildkite.com/redpanda/redpanda/builds/10184#2e1a4495-7ef5-4a77-9d1c-0d1002b71ed8

FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/3 runs)
  failure at 2022-05-16T14:38:18.499Z: AssertionError('`rpk cluster maintenance status` has changed: [\'Request\', \'error,\', \'trying\', \'another\', \'node:\', \'request\', \'failed:\', \'Service\', \'Unavailable,\', \'body:\', \'"{\\\\"message\\\\":\', \'\\\\"Unable\', \'to\', \'get\', \'cluster\', \'health:\', \'Currently\', \'there\', \'is\', \'no\', \'leader\', \'controller\', \'elected\', \'in\', \'the\', \'cluster\\\\",\', \'\\\\"code\\\\":\', \'503}"\']')
      in job https://buildkite.com/redpanda/redpanda/builds/10184#2e1a4495-7ef5-4a77-9d1c-0d1002b71ed8

Error

test_id:    rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True
status:     FAIL
run time:   9 minutes 1.238 seconds

AssertionError('`rpk cluster maintenance status` has changed: [\'Request\', \'error,\', \'trying\', \'another\', \'node:\', \'request\', \'failed:\', \'Service\', \'Unavailable,\', \'body:\', \'"{\\\\"message\\\\":\', \'\\\\"Unable\', \'to\', \'get\', \'cluster\', \'health:\', \'Currently\', \'there\', \'is\', \'no\', \'leader\', \'controller\', \'elected\', \'in\', \'the\', \'cluster\\\\",\', \'\\\\"code\\\\":\', \'503}"\']')

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/maintenance_test.py", line 225, in test_maintenance_sticky
    self._verify_cluster(None, False)
  File "/root/tests/rptest/tests/maintenance_test.py", line 175, in _verify_cluster
    wait_until(
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/maintenance_test.py", line 176, in <lambda>
    lambda: self._verify_maintenance_status(node, expect),
  File "/root/tests/rptest/tests/maintenance_test.py", line 87, in _verify_maintenance_status
    statuses = self.rpk.cluster_maintenance_status()
  File "/root/tests/rptest/clients/rpk.py", line 553, in cluster_maintenance_status
    return list(filter(None, map(parse, output.splitlines())))
  File "/root/tests/rptest/clients/rpk.py", line 528, in parse
    assert len(
AssertionError: `rpk cluster maintenance status` has changed: ['Request', 'error,', 'trying', 'another', 'node:', 'request', 'failed:', 'Service', 'Unavailable,', 'body:', '"{\\"message\\":', '\\"Unable', 'to', 'get', 'cluster', 'health:', 'Currently', 'there', 'is', 'no', 'leader', 'controller', 'elected', 'in', 'the', 'cluster\\",', '\\"code\\":', '503}"']
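
For context on where the assert comes from: the rpk wrapper in rptest/clients/rpk.py parses the tabular output of `rpk cluster maintenance status` by whitespace-splitting each line and asserting on the column count, so when rpk prints an error string instead of the table (here a 503 because no controller leader was elected), the split produces the wrong number of tokens and the assert fires. Below is a minimal sketch of that parsing pattern; the column names and count are illustrative assumptions, not the exact ones in rpk.py.

```python
# Sketch of the parsing pattern behind RpkTool.cluster_maintenance_status.
# Column names/count here are illustrative; the real parser lives in
# rptest/clients/rpk.py (the assert around line 528 in the traceback).
from collections import namedtuple

MaintenanceStatus = namedtuple(
    "MaintenanceStatus",
    ["node_id", "draining", "finished", "errors", "partitions", "eligible",
     "transferring", "failed"])


def parse_maintenance_status(output: str):
    def parse(line: str):
        if not line.strip() or line.startswith("NODE-ID"):
            return None  # skip blank lines and the header row
        parts = line.split()
        # This is the check that fired: an rpk error message ("Request
        # error, trying another node: ...") splits into a different number
        # of tokens than the status table, so the column-count assert fails.
        assert len(parts) == len(MaintenanceStatus._fields), \
            f"`rpk cluster maintenance status` has changed: {parts}"
        return MaintenanceStatus(*parts)

    return list(filter(None, map(parse, output.splitlines())))
```
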
@dimitriscruz added the kind/bug and ci-failure labels on May 17, 2022
@dotnwat self-assigned this on May 17, 2022
@jcsp
Contributor

jcsp commented Jul 8, 2022

This one is a TimeoutError, but it's in the part of the test that waits for leadership, so it's conceivably the same issue:
https://buildkite.com/redpanda/redpanda/builds/12286#0181dc4d-53e3-40ee-b939-2a2e58085864

@jcsp
Contributor

jcsp commented Jul 8, 2022

Hmm, this just failed twice on dev right after I merged #5159, so maybe that wasn't a coincidence.

https://buildkite.com/redpanda/redpanda/builds/12297#0181dd0a-fbe1-4f73-915e-5e5e8dd3ad0c

That said, that particular PR shouldn't have made the leader balancer any more aggressive; it was about throttling it.

@jcsp
Contributor

jcsp commented Jul 8, 2022

Looking at a failure log, it seems like the leader balancer is trying to move leaderships to a node in maintenance mode. That fails, those groups get muted, and then when the test expects them to be migrated after maintenance mode is over, that doesn't happen because we're still in the mute period.

It could be that the test used to work because leader balance movements weren't fast enough to all run through and trigger mutes right away.
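
To make the suspected interaction concrete, here is a toy model only (the names, timings, and bookkeeping are illustrative, not Redpanda's actual C++ balancer logic): a failed transfer to a node in maintenance mode mutes the group for a fixed window, so even after maintenance ends, those groups are skipped until the mute expires and the test's wait_until times out first.

```python
# Toy model of the suspected interaction; MUTE_TIMEOUT_S and the mute
# bookkeeping are illustrative assumptions, not Redpanda's implementation.
import time

MUTE_TIMEOUT_S = 300  # hypothetical mute window after a failed transfer

muted_until = {}  # group_id -> wall-clock time when the mute expires


def try_transfer(group_id, target_node, nodes_in_maintenance):
    now = time.time()
    if muted_until.get(group_id, 0) > now:
        return False  # still muted: the balancer skips this group entirely
    if target_node in nodes_in_maintenance:
        # A transfer to a node in maintenance mode fails, and the failure
        # mutes the group for the whole mute window.
        muted_until[group_id] = now + MUTE_TIMEOUT_S
        return False
    return True

# The test ends maintenance mode and immediately expects leadership to
# move back, but groups muted during maintenance stay untouched until
# their mute window expires, so the wait_until() in _verify_cluster
# times out first.
```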

@ballard26 self-assigned this on Jul 8, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jul 8, 2022
The real fix will be to make the leader balancer aware
of maintenance mode, but the test has become much more
unstable since recent leader balancer changes to do
more movements concurrently, so for the moment just
run the test with the leader balancer disabled.

Related: redpanda-data#4772
jcsp added a commit to jcsp/redpanda that referenced this issue Jul 8, 2022
The real fix will be to make the leader balancer aware
of maintenance mode, but the test has become much more
unstable since recent leader balancer changes to do
more movements concurrently, so it's worth mitigating
that.

The workaround is to set a short mute timeout so that
muting nodes has no real effect, and a short idle timeout
so that post-maintenance leader movements happen promptly.

Related: redpanda-data#4772
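
If the mitigation is applied on the test side, it would presumably look something like overriding the balancer timeouts when the test cluster is built. The sketch below is a guess based on the commit message, not a copy of the actual change in #5426: the cluster property names (leader_balancer_mute_timeout, leader_balancer_idle_timeout), their units, and the values are all assumptions.

```python
# Sketch only: property names, units (assumed to be milliseconds), and
# values are assumptions based on the commit message above, not the fix.
from rptest.tests.redpanda_test import RedpandaTest  # base class used elsewhere in rptest


class MaintenanceTest(RedpandaTest):
    def __init__(self, ctx, *args, **kwargs):
        super().__init__(
            ctx,
            *args,
            extra_rp_conf={
                # Short mute timeout so that muting a group after a failed
                # transfer has no lasting effect on the test.
                'leader_balancer_mute_timeout': 10000,
                # Short idle timeout so post-maintenance leader movements
                # happen promptly rather than waiting for the next long tick.
                'leader_balancer_idle_timeout': 10000,
            },
            **kwargs)
```
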
BenPope pushed a commit to BenPope/redpanda that referenced this issue Jul 13, 2022
The real fix will be to make the leader balancer aware
of maintenance mode, but the test has become much more
unstable since recent leader balancer changes to do
more movements concurrently, so it's worth mitigating
that.

The workaround is to set a short mute timeout so that
muting nodes has no real effect, and a short idle timeout
so that post-maintenance leader movements happen promptly.

Related: redpanda-data#4772
@LenaAn
Contributor

LenaAn commented Jul 19, 2022

@twmb
Contributor

twmb commented Jul 20, 2022

@twmb
Contributor

twmb commented Jul 20, 2022

@twmb
Contributor

twmb commented Jul 20, 2022

@LenaAn
Contributor

LenaAn commented Jul 21, 2022

@rystsov
Contributor

rystsov commented Jul 22, 2022
