cluster: fix shutdown hang in health_monitor_backend

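If one fiber was waiting on _refresh_mutex in refresh_cluster_health_cache while ::stop ran, and another fiber had a refresh in progress, ::stop cancelled the in-progress refresh and the waiting fiber then went on to attempt another refresh. That refresh held the gate open while ::stop was waiting for it to close, so shutdown hung.

Break _refresh_mutex in ::stop so that a fiber waiting on it gets ss::broken_semaphore and returns errc::shutting_down instead of refreshing.

Fixes redpanda-data#5178

(cherry picked from commit d32c9a0)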
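
The shutdown ordering described above can be illustrated with a small standalone sketch. It does not use Seastar: it assumes plain std::thread in place of fibers, and every name in it (breakable_mutex, broken_mutex_error, refresh_result, try_refresh, run_shutdown_demo) is hypothetical and not taken from the redpanda sources. It only models the idea that breaking the mutex wakes a blocked waiter with an exception, which the waiter maps to a "shutting down" result instead of starting a new refresh.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for the exception delivered to waiters of a broken mutex.
struct broken_mutex_error : std::runtime_error {
    broken_mutex_error() : std::runtime_error("mutex broken") {}
};

// Hypothetical mutex that can be "broken": blocked and future lockers throw.
class breakable_mutex {
public:
    void lock() {
        std::unique_lock<std::mutex> guard(_m);
        ++_waiters;
        _cv.wait(guard, [this] { return _broken || !_locked; });
        --_waiters;
        if (_broken) {
            throw broken_mutex_error();
        }
        _locked = true;
    }

    void unlock() {
        {
            std::lock_guard<std::mutex> guard(_m);
            assert(_locked);
            _locked = false;
        }
        _cv.notify_one();
    }

    // Mark the mutex as broken and wake every blocked locker.
    void broken() {
        {
            std::lock_guard<std::mutex> guard(_m);
            _broken = true;
        }
        _cv.notify_all();
    }

    // Number of callers currently inside lock().
    int waiters() {
        std::lock_guard<std::mutex> guard(_m);
        return _waiters;
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    bool _locked = false;
    bool _broken = false;
    int _waiters = 0;
};

enum class refresh_result { refreshed, shutting_down };

// Take the mutex and "refresh"; report shutting_down if the mutex was broken.
refresh_result try_refresh(breakable_mutex& refresh_mutex) {
    try {
        refresh_mutex.lock();
    } catch (const broken_mutex_error&) {
        // The waiter was parked on the mutex during shutdown.
        return refresh_result::shutting_down;
    }
    // A real refresh would run here.
    refresh_mutex.unlock();
    return refresh_result::refreshed;
}

// One refresh is in progress, a second caller waits, then "stop" breaks the mutex.
refresh_result run_shutdown_demo() {
    breakable_mutex refresh_mutex;
    refresh_mutex.lock(); // models the refresh already in progress

    refresh_result waiter_result = refresh_result::refreshed;
    std::thread waiter([&] { waiter_result = try_refresh(refresh_mutex); });

    // Wait until the second caller is blocked on the mutex.
    while (refresh_mutex.waiters() == 0) {
        std::this_thread::yield();
    }

    refresh_mutex.broken(); // models the new call in stop()
    refresh_mutex.unlock(); // models the cancelled refresh finishing
    waiter.join();
    return waiter_result;
}
```

In this model, calling run_shutdown_demo() yields refresh_result::shutting_down: the waiter is woken by broken() and never starts a second refresh. Without the broken() call, the waiter would acquire the mutex after unlock() and refresh again, which corresponds to the hang described in the commit message.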
vbotbuildovich · Aug 1, 2022 · 6e75078
1 parent 08e95c3
Showing 1 changed file with 4 additions and 0 deletions.
diff --git a/src/v/cluster/health_monitor_backend.cc b/src/v/cluster/health_monitor_backend.cc
@@ -112,6 +112,7 @@ ss::future<> health_monitor_backend::stop() {
       _leadership_notification_handle);
 
     auto f = _gate.close();
+    _refresh_mutex.broken();
     abort_current_refresh();
     _tick_timer.cancel();
 
@@ -426,6 +427,9 @@ health_monitor_backend::maybe_refresh_cluster_health(
                   err.message());
                 co_return err;
             }
+        } catch (const ss::broken_semaphore&) {
+            // Refresh was waiting on _refresh_mutex during shutdown
+            co_return errc::shutting_down;
         } catch (const ss::timed_out_error&) {
             vlog(
               clusterlog.info,