Shutdown hang (Failure AvailabilityTests.test_recovery_after_catastrophic_failure) #5178

Closed
LenaAn opened this issue Jun 21, 2022 · 9 comments · Fixed by #5756

LenaAn commented Jun 21, 2022

From PandaResults:

Unstable test: AvailabilityTests.test_recovery_after_catastrophic_failure (1/17 runs failed)
  Latest failure at 2022-06-21T07:35:34.963Z: TimeoutError('Producer failed to produce messages for 180s.')
             in job https://buildkite.com/redpanda/redpanda/builds/11508#018184e3-ca51-40ed-b5ec-a2bc6a579d02

Assigning to @jcsp since you recently closed #5078

LenaAn added the kind/bug label Jun 21, 2022

jcsp commented Jul 5, 2022

(2/738 failures in last 30 days, most recent is the one linked above)

jcsp commented Jul 27, 2022

This is unrelated to #5078: that was a log error, whereas this is a producer hang.

jcsp commented Jul 27, 2022

Analyzed https://buildkite.com/redpanda/redpanda/builds/12962#018228b4-7137-45b5-a4db-e3d995339a6a

This is a shutdown hang. docker-rp-22 is supposed to be stopped and then started again, but it fails to stop. failure_injector catches and logs the resulting exception, but then never schedules the action to recover the node, so it is left in a shut-down-but-still-running state. Only one node in the cluster remains alive, so clients make no progress.

jcsp commented Jul 27, 2022

The last log message from a particular subsystem stopping is

INFO  2022-07-23 02:18:01,292 [shard 0] cluster - partition_balancer_backend.cc:93 - stopping...

That isn't necessarily the subsystem causing the problem, because unfortunately not all subsystems log a message when they start stopping. The next one in the shutdown order that does log a message is config_manager, so the hang is in one of these steps:

          .then([this] { return _partition_balancer.stop(); })
          .then([this] { return _metrics_reporter.stop(); })
          .then([this] { return _feature_manager.stop(); })
          .then([this] { return _hm_frontend.stop(); })
          .then([this] { return _hm_backend.stop(); })
          .then([this] { return _health_manager.stop(); })
          .then([this] { return _members_backend.stop(); })


jcsp commented Jul 27, 2022

This motivates improvement of the controller shutdown logging: #5672
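
For illustration, here is a minimal, self-contained model of that kind of instrumentation (plain C++ with made-up names, not the actual patch in #5672): name each shutdown step before running it, so the last line logged identifies the service that hangs.

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    // Simplified, single-threaded model of the logging idea. Redpanda's real
    // controller chains seastar futures (as in the snippet above) rather than
    // calling plain lambdas, but the diagnostic value is the same.
    int main() {
        std::vector<std::pair<std::string, std::function<void()>>> shutdown_order = {
          {"partition_balancer", [] { /* _partition_balancer.stop() */ }},
          {"metrics_reporter", [] { /* _metrics_reporter.stop() */ }},
          {"feature_manager", [] { /* _feature_manager.stop() */ }},
          {"health_monitor_backend", [] { /* _hm_backend.stop() */ }},
        };
        for (auto& [name, stop] : shutdown_order) {
            // If shutdown hangs, the last "stopping ..." line names the culprit.
            std::printf("stopping %s...\n", name.c_str());
            stop();
        }
        return 0;
    }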

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 27, 2022
Previously, a hang on SIGINT would be logged, but then
the test would proceed with a node in this zombie state.

Since shutdown hangs are always a bug, it is correct to fail
the test as early as we can with the right reason, so just
re-raise the exception instead of trying to proceed.

Related: redpanda-data#5178

jcsp commented Jul 27, 2022

This also motivates improving the test to fail at the point of the shutdown hang rather than later: #5677

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 28, 2022

jcsp commented Aug 1, 2022

Logs appear to point to health_monitor_backend not shutting down properly while it had a request in flight to the suspended peer:

WARN  2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:445 - Failed to refresh cluster health.
TRACE 2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:508 - skipping tick, not leader
INFO  2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:119 - Stopping Health Monitor Backend...
DEBUG 2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:405 - aborting current refresh request to 1
DEBUG 2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:327 - refreshing health cache, leader id: {1}
INFO  2022-07-30 01:05:40,747 [shard 0] rpc - transport.cc:144 - Request timeout to {host: docker-rp-16, port: 33145}, correlation id: 63 (2 in flight)
WARN  2022-07-30 01:05:40,747 [shard 0] cluster - health_monitor_backend.cc:359 - unable to get cluster health metadata from 1 - rpc::errc::client_request_timeout
INFO  2022-07-30 01:05:48,747 [shard 0] rpc - transport.cc:144 - Request timeout to {host: docker-rp-16, port: 33145}, correlation id: 65 (1 in flight)
WARN  2022-07-30 01:05:48,747 [shard 0] cluster - health_monitor_backend.cc:359 - unable to get cluster health metadata from 1 - rpc::errc::client_request_timeout
TRACE 2022-07-30 01:08:18,746 [shard 0] cluster - health_manager.cc:183 - Health: skipping tick as non-leader

jcsp changed the title from "CI Failure AvailabilityTests.test_recovery_after_catastrophic_failure" to "Shutdown hang (Failure AvailabilityTests.test_recovery_after_catastrophic_failure)" Aug 1, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 1, 2022
If refresh_cluster_health_cache was waiting on _refresh_mutex
while ::stop ran, and another fiber had a refresh in progress,
then ::stop cancels the other fiber's refresh + the first fiber
proceeds to try and refresh again, holding the gate open
while ::stop is waiting for it to close.

Fixes redpanda-data#5178
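
For readers following along, here is a simplified, self-contained model of the race described in that commit message. Plain std types and made-up member names (refresh_mutex, gate_count, stopping) stand in for seastar's gate/mutex/abort machinery; this is a sketch of the failure mode and the guard that closes it, not the actual patch.

    #include <atomic>
    #include <mutex>

    struct health_monitor_model {
        std::mutex refresh_mutex;           // stands in for _refresh_mutex
        std::atomic<int> gate_count{0};     // pending operations tracked by the gate
        std::atomic<bool> stopping{false};  // set once ::stop has begun

        void refresh_cluster_health_cache() {
            ++gate_count;                   // enter the gate
            std::lock_guard<std::mutex> lock(refresh_mutex);
            // The essential guard: a fiber that was queued on the mutex while
            // stop() ran must not start a fresh refresh. Without this check it
            // would hold the gate open while stop() waits for it to close,
            // which is the hang seen in this issue.
            if (!stopping.load()) {
                do_refresh();               // e.g. RPCs to the (suspended) leader
            }
            --gate_count;                   // leave the gate
        }

        void stop() {
            stopping.store(true);           // abort the in-flight refresh
            while (gate_count.load() > 0) {
                // stop() waits for the gate to drain; a late refresh that
                // slipped past the guard above would keep it waiting forever.
            }
        }

        void do_refresh() { /* refresh the cluster health cache */ }
    };

    int main() {
        health_monitor_model hm;
        hm.refresh_cluster_health_cache();  // normal path, no stop in progress
        hm.stop();                          // gate is already drained, returns
        return 0;
    }
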
jcsp closed this as completed in #5756 Aug 1, 2022
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Aug 1, 2022

(cherry picked from commit d32c9a0)
andrwng pushed a commit to andrwng/redpanda that referenced this issue Aug 5, 2022