Shutdown hang (Failure AvailabilityTests.test_recovery_after_catastrophic_failure) #5178

Closed
LenaAn opened this issue Jun 21, 2022 · 9 comments · Fixed by #5756

LenaAn commented Jun 21, 2022

From PandaResults:

Unstable test: AvailabilityTests.test_recovery_after_catastrophic_failure (1/17 runs failed)
  Latest failure at 2022-06-21T07:35:34.963Z: TimeoutError('Producer failed to produce messages for 180s.')
             in job https://buildkite.com/redpanda/redpanda/builds/11508#018184e3-ca51-40ed-b5ec-a2bc6a579d02

Assigning to @jcsp since you recently closed #5078

LenaAn added the kind/bug label Jun 21, 2022

jcsp commented Jul 5, 2022

(2/738 failures in last 30 days, most recent is the one linked above)

jcsp commented Jul 27, 2022

This is unrelated to #5078: that was a log error, whereas this is a producer hang.

jcsp commented Jul 27, 2022

Analyzed https://buildkite.com/redpanda/redpanda/builds/12962#018228b4-7137-45b5-a4db-e3d995339a6a

This is a shutdown hang. docker-rp-22 is supposed to be stopped and then started again, but it fails to stop. failure_injector catches and logs the resulting exception, but then never schedules the action to recover the node, so it is left in a shut-down-but-still-running state. Only one node in the cluster remains alive, so clients make no progress.

jcsp commented Jul 27, 2022

The last log message from a particular subsystem stopping is

INFO  2022-07-23 02:18:01,292 [shard 0] cluster - partition_balancer_backend.cc:93 - stopping...

That isn't necessarily the subsystem causing the problem, because unfortunately not all subsystems log a message when they start stopping. The next one in the shutdown order that does log a message is config_manager, so the hang is in one of these steps:

          .then([this] { return _partition_balancer.stop(); })
          .then([this] { return _metrics_reporter.stop(); })
          .then([this] { return _feature_manager.stop(); })
          .then([this] { return _hm_frontend.stop(); })
          .then([this] { return _hm_backend.stop(); })
          .then([this] { return _health_manager.stop(); })
          .then([this] { return _members_backend.stop(); })


jcsp commented Jul 27, 2022

This motivates improvement of the controller shutdown logging: #5672
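
For illustration, here is a minimal, self-contained model of that kind of instrumentation (plain C++ with made-up names, not the actual patch in #5672): name each shutdown step before running it, so the last line logged identifies the service that hangs.

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    // Simplified, single-threaded model of the logging idea. Redpanda's real
    // controller chains seastar futures (as in the snippet above) rather than
    // calling plain lambdas, but the diagnostic value is the same.
    int main() {
        std::vector<std::pair<std::string, std::function<void()>>> shutdown_order = {
          {"partition_balancer", [] { /* _partition_balancer.stop() */ }},
          {"metrics_reporter", [] { /* _metrics_reporter.stop() */ }},
          {"feature_manager", [] { /* _feature_manager.stop() */ }},
          {"health_monitor_backend", [] { /* _hm_backend.stop() */ }},
        };
        for (auto& [name, stop] : shutdown_order) {
            // If shutdown hangs, the last "stopping ..." line names the culprit.
            std::printf("stopping %s...\n", name.c_str());
            stop();
        }
        return 0;
    }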

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 27, 2022
Previously, a hang on SIGINT would be logged, but then
the test would proceed with a node in this zombie state.

Since shutdown hangs are always a bug, it is correct to fail
the test as early as we can with the right reason, so just
re-raise the exception instead of trying to proceed.

Related: redpanda-data#5178

jcsp commented Jul 27, 2022

This also motivates improving the test to fail at the point of the shutdown hang rather than later: #5677

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 28, 2022

jcsp commented Aug 1, 2022

Logs appear to point to health_monitor_backend not shutting down properly while it had a request in flight to the suspended peer:

WARN  2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:445 - Failed to refresh cluster health.
TRACE 2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:508 - skipping tick, not leader
INFO  2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:119 - Stopping Health Monitor Backend...
DEBUG 2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:405 - aborting current refresh request to 1
DEBUG 2022-07-30 01:05:38,747 [shard 0] cluster - health_monitor_backend.cc:327 - refreshing health cache, leader id: {1}
INFO  2022-07-30 01:05:40,747 [shard 0] rpc - transport.cc:144 - Request timeout to {host: docker-rp-16, port: 33145}, correlation id: 63 (2 in flight)
WARN  2022-07-30 01:05:40,747 [shard 0] cluster - health_monitor_backend.cc:359 - unable to get cluster health metadata from 1 - rpc::errc::client_request_timeout
INFO  2022-07-30 01:05:48,747 [shard 0] rpc - transport.cc:144 - Request timeout to {host: docker-rp-16, port: 33145}, correlation id: 65 (1 in flight)
WARN  2022-07-30 01:05:48,747 [shard 0] cluster - health_monitor_backend.cc:359 - unable to get cluster health metadata from 1 - rpc::errc::client_request_timeout
TRACE 2022-07-30 01:08:18,746 [shard 0] cluster - health_manager.cc:183 - Health: skipping tick as non-leader

jcsp changed the title from "CI Failure AvailabilityTests.test_recovery_after_catastrophic_failure" to "Shutdown hang (Failure AvailabilityTests.test_recovery_after_catastrophic_failure)" Aug 1, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 1, 2022
If refresh_cluster_health_cache was waiting on _refresh_mutex
while ::stop ran, and another fiber had a refresh in progress,
then ::stop cancels the other fiber's refresh + the first fiber
proceeds to try and refresh again, holding the gate open
while ::stop is waiting for it to close.

Fixes redpanda-data#5178
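
For readers following along, here is a simplified, self-contained model of the race described in that commit message. Plain std types and made-up member names (refresh_mutex, gate_count, stopping) stand in for seastar's gate/mutex/abort machinery; this is a sketch of the failure mode and the guard that closes it, not the actual patch.

    #include <atomic>
    #include <mutex>

    struct health_monitor_model {
        std::mutex refresh_mutex;           // stands in for _refresh_mutex
        std::atomic<int> gate_count{0};     // pending operations tracked by the gate
        std::atomic<bool> stopping{false};  // set once ::stop has begun

        void refresh_cluster_health_cache() {
            ++gate_count;                   // enter the gate
            std::lock_guard<std::mutex> lock(refresh_mutex);
            // The essential guard: a fiber that was queued on the mutex while
            // stop() ran must not start a fresh refresh. Without this check it
            // would hold the gate open while stop() waits for it to close,
            // which is the hang seen in this issue.
            if (!stopping.load()) {
                do_refresh();               // e.g. RPCs to the (suspended) leader
            }
            --gate_count;                   // leave the gate
        }

        void stop() {
            stopping.store(true);           // abort the in-flight refresh
            while (gate_count.load() > 0) {
                // stop() waits for the gate to drain; a late refresh that
                // slipped past the guard above would keep it waiting forever.
            }
        }

        void do_refresh() { /* refresh the cluster health cache */ }
    };

    int main() {
        health_monitor_model hm;
        hm.refresh_cluster_health_cache();  // normal path, no stop in progress
        hm.stop();                          // gate is already drained, returns
        return 0;
    }
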
jcsp closed this as completed in #5756 Aug 1, 2022
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Aug 1, 2022

(cherry picked from commit d32c9a0)
andrwng pushed a commit to andrwng/redpanda that referenced this issue Aug 5, 2022