This repository has been archived by the owner on May 28, 2021. It is now read-only.

MySQL cluster doesn't survive Kubernetes cluster restart #288

Open
tomclark opened this issue Aug 5, 2019 · 0 comments
tomclark commented Aug 5, 2019

QUESTION

This is a question because I'm sure I'm doing this wrong, or misunderstanding how things are intended to work rather than it being a bug.

I found this issue which seems to be resolved, but I don't quite understand the fix and the steps don't work for me (or I'm doing it wrong):

#94

Versions

MySQL Operator Version: 0.2.1

Environment: Three node Kubernetes cluster (Rancher)

  • Kubernetes version (use kubectl version): 1.14.2
  • Cloud provider or hardware configuration: 3 x Hetzner CX41 (4 x vCPU, 16G RAM, 160G SSD) plus a RancherOS Nginx load balancer that sits in front of them as a reverse proxy
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.2 LTS
  • Kernel (e.g. uname -a): 4.15.0-54-generic #58-Ubuntu SMP
  • Others:

What happened?

I built a three-member cluster with a persistent volume for data located on an NFS share (using nfs-client-provisioner-1.2.6 v 3.1.0), and a ConfigMap for mysqld.cnf that changes the default authentication plugin to mysql_native_password and adds skip-host-cache and skip-name-resolve. All works as expected: the cluster comes up, I create services for read-write and read-only endpoints, and replication works absolutely fine - everything works great.
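For reference, the ConfigMap is along these lines (a sketch from memory - the data key name and exact option spellings may differ slightly from what I actually have deployed):

```yaml
# Sketch of the custom my.cnf ConfigMap referenced by spec.config below.
# Key name and option list reproduced from memory; may not be exact.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mycnf
  namespace: mysql
data:
  mysqld.cnf: |
    [mysqld]
    default_authentication_plugin=mysql_native_password
    skip-host-cache
    skip-name-resolve
```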

The problem comes when rebooting the MySQL master. If I restart either of the slaves, they sync up and rejoin the cluster as expected. However, if I restart the box that is running the master, the two slaves end up unable to rejoin the cluster, with the following messages about their IP addresses not being in the IP whitelist repeated in the logs:

2019-08-05T02:40:40.445890Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Connection attempt from IP address 10.42.2.131 refused. Address is not in the IP whitelist.'
2019-08-05T02:40:40.546951Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Connection attempt from IP address 10.42.2.131 refused. Address is not in the IP whitelist.'
2019-08-05T02:40:40.649888Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Connection attempt from IP address 10.42.2.131 refused. Address is not in the IP whitelist.'
2019-08-05T02:40:40.751126Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Connection attempt from IP address 10.42.2.131 refused. Address is not in the IP whitelist.'
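For what it's worth, the whitelist that group replication complains about can be inspected (and in principle adjusted at runtime) from a mysql session on each member - this is standard MySQL 8.0 syntax, shown here just as a sketch of what I've been poking at:

```sql
-- Inspect the effective whitelist on a member
SELECT @@group_replication_ip_whitelist;

-- In principle it can be widened and replication restarted by hand:
STOP GROUP_REPLICATION;
SET GLOBAL group_replication_ip_whitelist = '10.0.0.0/8,192.168.0.0/16';
START GROUP_REPLICATION;
```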

What you expected to happen?

I would expect the slaves to rejoin the cluster and sync with the master when it comes back up. As I say, I could well be misunderstanding how this is supposed to work, and perhaps a complete K8s outage simply means the cluster won't come back to life on its own.

How to reproduce it (as minimally and precisely as possible)?

  1. Build the cluster as usual using the definition below.
  2. Once the cluster is up and running, connect to the primary node and create a database, perhaps restore some data (i.e. do something that will be replicated, basically).
  3. Connect to one of the slaves and check that the database created in step two exists, and any data you restored has been replicated.
  4. Reboot one or both of the slaves and check that they come back up and the cluster has all pods running.
  5. Reboot the master and check the mysql pod logs ("kubectl logs -n mysql cluster-mysql-0 mysql"). Note the error messages saying that the slave nodes are unable to connect.
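The steps above can be sketched as the following commands (illustrative only - pod and namespace names match the cluster definition below, but the root password secret name is from memory and may differ):

```shell
# Root password from the operator-generated secret (name may differ)
MYSQL_ROOT_PASSWORD=$(kubectl get secret -n mysql cluster-mysql-root-password \
  -o jsonpath='{.data.password}' | base64 --decode)

# 2. Create something that will be replicated, on the primary
kubectl exec -n mysql cluster-mysql-0 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e 'CREATE DATABASE repltest;'

# 3. Confirm it replicated to a slave
kubectl exec -n mysql cluster-mysql-1 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'

# 5. After rebooting the master, watch a slave's logs for the whitelist errors
kubectl logs -n mysql cluster-mysql-1 mysql --follow
```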

Anything else we need to know?

Please note that as far as I can see, the group_replication_ip_whitelist is set to 10.0.0.0/8:

2019-08-05T17:20:45.281581Z 356 [Note] [MY-011694] [Repl] Plugin group_replication reported: 'Initialized group communication with configuration: group_replication_group_name: '32656835-b79a-11e9-9e70-6ed86065d123'; group_replication_local_address: 'cluster-mysql-1.cluster-mysql:33061'; group_replication_group_seeds: 'cluster-mysql-0.cluster-mysql:33061'; group_replication_bootstrap_group: 'false'; group_replication_poll_spin_loops: 0; group_replication_compression_threshold: 1000000; group_replication_ip_whitelist: '10.0.0.0/8'; group_replication_communication_debug_options: 'GCS_DEBUG_NONE''
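To see which members the group currently knows about (and their states) after the reboot, I can query the standard MySQL 8.0 performance_schema view from any member, e.g.:

```sql
-- List group members as this node sees them; after the master reboot
-- the slaves show up alone or in ERROR/RECOVERING states.
SELECT member_host, member_state, member_role
FROM performance_schema.replication_group_members;
```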

I'm a newbie, so I'm sure I'm missing something - apologies in advance if this is a stupid question! Here's the cluster definition:

apiVersion: mysql.oracle.com/v1alpha1
kind: Cluster
metadata:
  name: cluster-mysql
  namespace: mysql
spec:
  members: 3
  multiMaster: false
  config:
    name: custom-mycnf
    namespace: mysql
  volumeClaimTemplate:
    metadata:
      name: mysql-volume
      namespace: mysql
    spec:
      storageClassName: nfs-client
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 8Gi