Systemd prevents pause_minority re-join by restarting RabbitMQ #3289

Closed
dumbbell opened this issue Aug 10, 2021 Discussed in #3262 · 0 comments · Fixed by #3280

@dumbbell
Member

Discussed in #3262

Originally posted by robertdahlem August 4, 2021
I have a three node cluster to test pause_minority. All nodes run RHEL 7.9, Erlang 23.3.4.5 and RabbitMQ 3.9.0. I use RPMs from https://github.com/rabbitmq (erlang-rpm and rabbitmq-server). Nodes are joined to the cluster manually, and I use the same rabbitmq.conf on all three nodes (rabbit1, rabbit2, rabbit3).

When I pull the network cable from rabbit2, a minute later it detects minority status and stops the applications. 90 seconds later, systemd detects that something is wrong with rabbitmq-server and restarts it:

systemd: rabbitmq-server.service: main process exited, code=killed, status=9/KILL
systemd: Unit rabbitmq-server.service entered failed state.
systemd: rabbitmq-server.service failed.
systemd: rabbitmq-server.service holdoff time over, scheduling restart.
systemd: Stopped RabbitMQ broker.
systemd: Starting RabbitMQ broker...

After that, nothing else happens, even though I reconnect the cable. I would expect rabbit2 to re-join the cluster, but that seems to be sabotaged by systemd restarting RabbitMQ.

The node re-joins the cluster when I reconnect the cable before 90 seconds, but systemd mercilessly kills and restarts RabbitMQ anyway after 90 seconds.

Here is the timeline of what I did:

17:57:08 disconnect eth0
17:58:10 Node rabbit2 detects loss of connectivity
17:59:40 systemd reports: stop-sigterm timed out. Killing
18:01:41 reconnect eth0

Log files and rabbitmq.conf attached.
rabbit@rabbit2.log
/var/log/messages
rabbitmq.conf
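
The 90 seconds in this timeline matches systemd's default stop timeout (DefaultTimeoutStopSec=90s), which starts counting once the unit is considered to be stopping. As a sketch only (the drop-in file name is hypothetical, and stretching the timeout merely delays the SIGKILL; the actual fix is the reporting change below), the effective value can be inspected and overridden like this:

    # effective stop timeout for the unit (90s unless overridden)
    systemctl show rabbitmq-server.service --property=TimeoutStopUSec

    # /etc/systemd/system/rabbitmq-server.service.d/stop-timeout.conf (hypothetical drop-in)
    [Service]
    TimeoutStopSec=15min

    # reload unit configuration so the drop-in takes effect
    systemctl daemon-reload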

@dumbbell dumbbell self-assigned this Aug 10, 2021
dumbbell added a commit that referenced this issue Aug 10, 2021
The problem is that we only know about the state of the `rabbit` Erlang
application, i.e. when it is started and stopped. We can't know the fate
of the Erlang VM itself, except when `rabbit:stop_and_halt()` is called;
that function is not called when, for instance, `init:stop()` is used or
a SIGTERM is received.

systemd is interested in the state of the system process (the Erlang
VM), not what's happening inside. But inside, we have multiple
situations where the Erlang application is stopped, but not the Erlang
VM. For instance:

    * When clustering, the Erlang application is stopped before the
      cluster is created or expanded. The application is restarted once
      done. This is controlled either manually or by the peer discovery
      plugins.

    * The `pause_minority` or `pause_if_all_down` partition strategies
      both stop the Erlang application for an indefinite period of time,
      but RabbitMQ as a service is still up (even though it is managing
      its own degraded mode and no connections are accepted).

In both cases, the service is still running from the system's service
manager's point of view.

As said above, we can never tell with confidence that "the VM is being
terminated"; we can only know about the Erlang application itself.
Therefore, it is best to report the latter as a systemd status
description and not to report the "STOPPING=1" state at all. systemd
will figure out on its own that the Erlang VM exited anyway.

Before this change, we reported the "STOPPING=1" state to systemd every
time the Erlang application was stopped. The problem was that systemd
then expected the system process (the Erlang VM) to exit within a
configured period of time (90 seconds by default) or to report that it
was ready again ("READY=1"). This issue went unnoticed when the cluster
was created or expanded, because that usually completed within the time
frame. However, it surfaced with the partition handling strategies,
because a partition may last longer than 90 seconds. When that happened,
the Erlang VM was killed (SIGKILL) and the service was restarted.

References #3262.
Fixes #3289.
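
For context on the mechanism the commit message describes: a Type=notify service talks to systemd via the sd_notify protocol, i.e. it writes newline-separated "KEY=VALUE" datagrams such as "READY=1", "STATUS=...", or "STOPPING=1" to the Unix socket named by the NOTIFY_SOCKET environment variable. Below is a minimal Erlang sketch of that protocol; the module and function names are hypothetical and this is not RabbitMQ's actual implementation.

    %% sd_notify_sketch.erl -- illustrative only.
    %% Abstract-namespace sockets (a NOTIFY_SOCKET value starting with "@")
    %% would need extra handling and are omitted here.
    -module(sd_notify_sketch).
    -export([notify/1]).

    %% Send one sd_notify datagram, e.g. notify("READY=1") or
    %% notify("STATUS=standing by in a minority partition").
    notify(State) ->
        case os:getenv("NOTIFY_SOCKET") of
            false ->
                ignored;   %% not running under systemd with Type=notify
            Path ->
                {ok, Socket} = gen_udp:open(0, [local]),
                ok = gen_udp:send(Socket, {local, Path}, 0, State),
                gen_udp:close(Socket)
        end.

The distinction the commit relies on is which message gets sent when: "STATUS=..." is purely descriptive and carries no deadline, while "STOPPING=1" starts systemd's stop timeout, which is why it is no longer emitted when only the `rabbit` application stops.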
mergify bot pushed a commit that referenced this issue Aug 10, 2021

(cherry picked from commit 23c71b2)