
Agent loadbalancer may deadlock when servers are removed #6208

Closed · tmmorin opened this issue Jun 14, 2024 · 34 comments
Labels: kind/bug

tmmorin commented Jun 14, 2024

Environmental Info:

RKE2 Version: 1.28.9+rke2r1

Cluster Configuration:

  • 3 control plane nodes
  • 3 worker nodes

Context:

The context is an RKE2 cluster deployed and upgraded using Cluster API (using https://github.com/rancher-sandbox/cluster-api-provider-rke2)

This implies that on some operations (RKE2 upgrade, OS upgrade, etc.) all nodes are replaced: old nodes are drained and deleted while new nodes are created (one at a time for CP nodes, possibly more than one at a time for workers). This is called a "node rolling update" and is similar to Pod replacement in a ReplicaSet.

Describe the bug:

After a node rolling update, I observe that some of the old nodes (ones due to be replaced by the rolling update) are stuck in the "NotReady,SchedulingDisabled" state.

The RKE2 logs show the following error repeated over and over:

root@management-cluster-md-md0-1ba074d313-qljfm:/var/lib/rancher/rke2/agent/etc# journalctl -xeu rke2-agent | tail 
Jun 14 12:37:02 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:02Z" level=info msg="Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:04 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:04Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 172.20.129.65:9345: connect: no route to host"
Jun 14 12:37:04 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:04Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 172.20.129.65:9345: connect: no route to host" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:05 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:05Z" level=info msg="Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:07 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:07Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 172.20.129.65:9345: connect: no route to host"
Jun 14 12:37:07 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:07Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 172.20.129.65:9345: connect: no route to host" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:08 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:08Z" level=info msg="Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:10 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:10Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 172.20.129.65:9345: connect: no route to host"
Jun 14 12:37:10 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:10Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 172.20.129.65:9345: connect: no route to host" url="wss://172.20.129.65:9345/v1-rke2/connect"
Jun 14 12:37:11 management-cluster-md-md0-1ba074d313-qljfm rke2[1209]: time="2024-06-14T12:37:11Z" level=info msg="Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect"

A more readable extract of the end of these lines:

Failed to connect to proxy. Empty dialer response" error="dial tcp 172.20.129.65:9345: connect: no route to host
Remotedialer proxy error; reconecting..." error="dial tcp 172.20.129.65:9345: connect: no route to host" url="wss://172.20.129.65:9345/v1-rke2/connect
Connecting to proxy" url="wss://172.20.129.65:9345/v1-rke2/connect

What is striking is that the unreachable IP does not belong to any live node: it is the IP of a CP node that has already been drained and deleted (the underlying VM no longer exists) but which still exists as a Kubernetes Node object. (I think I have also observed occurrences of this issue where connection attempts were made to an IP that did not belong to any currently existing Node at all.)

On disk I find the following:

root@management-cluster-md-md0-1ba074d313-qljfm:/var/lib/rancher/rke2/agent/etc# grep . *json
rke2-agent-load-balancer.json:{
rke2-agent-load-balancer.json:  "ServerURL": "https://172.20.129.32:9345",
rke2-agent-load-balancer.json:  "ServerAddresses": [
rke2-agent-load-balancer.json:    "172.20.129.185:9345",
rke2-agent-load-balancer.json:    "172.20.129.42:9345",
rke2-agent-load-balancer.json:    "172.20.129.57:9345",
rke2-agent-load-balancer.json:    "172.20.129.65:9345"
rke2-agent-load-balancer.json:  ],
rke2-agent-load-balancer.json:  "Listener": null
rke2-agent-load-balancer.json:}
rke2-api-server-agent-load-balancer.json:{
rke2-api-server-agent-load-balancer.json:  "ServerURL": "https://172.20.129.32:6443",
rke2-api-server-agent-load-balancer.json:  "ServerAddresses": [
rke2-api-server-agent-load-balancer.json:    "172.20.129.185:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.42:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.57:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.65:6443"
rke2-api-server-agent-load-balancer.json:  ],
rke2-api-server-agent-load-balancer.json:  "Listener": null
rke2-api-server-agent-load-balancer.json:}
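
These files are a JSON serialization of each agent load balancer's state. A minimal Go struct that would round-trip the dumps above could look roughly like this (a sketch based only on the field names visible in the JSON; the actual type lives in the k3s agent loadbalancer package and may differ):

package main

import (
	"encoding/json"
	"fmt"
)

// lbState mirrors the fields visible in the rke2-*-load-balancer.json dumps
// above; hypothetical, for illustration only.
type lbState struct {
	ServerURL       string      `json:"ServerURL"`
	ServerAddresses []string    `json:"ServerAddresses"`
	Listener        interface{} `json:"Listener"` // always null in these dumps
}

func main() {
	data := []byte(`{
	  "ServerURL": "https://172.20.129.32:9345",
	  "ServerAddresses": ["172.20.129.185:9345", "172.20.129.42:9345"],
	  "Listener": null
	}`)
	var s lbState
	if err := json.Unmarshal(data, &s); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", s)
}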

The currently existing nodes are:

$ k get nodes -o wide        
NAME                                         STATUS                        ROLES                       AGE     VERSION          INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
management-cluster-cp-43246891df-968lm       Ready                         control-plane,etcd,master   22h     v1.28.9+rke2r1   172.20.129.164   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.11-k3s2
management-cluster-cp-43246891df-vqb8d       Ready                         control-plane,etcd,master   24h     v1.28.9+rke2r1   172.20.129.42    <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.11-k3s2
management-cluster-cp-fc5b1c9850-6dpm8       Ready                         control-plane,etcd,master   47h     v1.28.9+rke2r1   172.20.129.57    <none>        Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.7.11-k3s2
management-cluster-cp-fc5b1c9850-7mmxr       NotReady,SchedulingDisabled   control-plane,etcd,master   2d      v1.28.9+rke2r1   172.20.129.65    <none>        Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.7.11-k3s2
management-cluster-cp-fc5b1c9850-7rrw2       Ready,SchedulingDisabled      control-plane,etcd,master   2d      v1.28.9+rke2r1   172.20.129.185   <none>        Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.7.11-k3s2
management-cluster-md-md0-1ba074d313-9tdpg   Ready,SchedulingDisabled      <none>                      24h     v1.28.9+rke2r1   172.20.129.93    <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.11-k3s2
management-cluster-md-md0-1ba074d313-d9xb5   Ready                         <none>                      3h19m   v1.28.9+rke2r1   172.20.129.156   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.11-k3s2
management-cluster-md-md0-1ba074d313-qljfm   NotReady,SchedulingDisabled   <none>                      24h     v1.28.9+rke2r1   172.20.129.178   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.11-k3s2
management-cluster-md-md0-1ba074d313-s479p   Ready                         <none>                      24h     v1.28.9+rke2r1   172.20.129.253   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.11-k3s2

So the rke2-*-load-balancer.json files have:

  • ServerURL: https://172.20.129.32 -> this is the cluster virtual IP (managed with MetalLB), which is always reachable (except during short handovers when the node holding it is replaced)
  • ServerAddresses:
    • 172.20.129.65 -> the IP failing in the logs above
    • 172.20.129.185 -> in these JSON files but not appearing in the logs; this CP node is Ready,SchedulingDisabled (the next CP node that Cluster API is currently trying to drain)
    • 172.20.129.42 -> in these JSON files but not appearing in the logs
    • 172.20.129.57 -> in these JSON files but not appearing in the logs
    • the address 172.20.129.164 (new CP node) does not appear in these files

When the RKE2 agent is in such a state, the only solution I found was to restart it; this is sufficient for it to properly rejoin the cluster. Waiting, even for a long time (tens of hours), is not sufficient.

The questions would be:

  • why does the RKE2 agent fail to join the cluster?
  • why isn't the agent trying to use the other CP nodes that it has in the rke2-*-load-balancer.json files?
  • why isn't the new CP node (.164) present in the rke2-*-load-balancer.json files?
  • why doesn't the agent try to use the cluster VIP, which is almost always reachable, to find the current nodes?

Expected behavior:

An already existing node needs to keep working even when some or all CP nodes are drained/deleted/recreated during a Cluster API node rolling update.

Additional elements:

Details on node creation times:

$ k get nodes -o custom-columns='CREATION:.metadata.creationTimestamp,NAME:.metadata.name,IP:.status.addresses[?(@.type=="InternalIP")].address' | sort          
2024-06-12T12:46:28Z   management-cluster-cp-fc5b1c9850-7mmxr       172.20.129.65
2024-06-12T12:57:51Z   management-cluster-cp-fc5b1c9850-7rrw2       172.20.129.185
2024-06-12T13:16:33Z   management-cluster-cp-fc5b1c9850-6dpm8       172.20.129.57
2024-06-13T12:55:05Z   management-cluster-md-md0-1ba074d313-qljfm   172.20.129.178   <<<<<<<<<<<<< the node on which we see errors
2024-06-13T12:57:01Z   management-cluster-cp-43246891df-vqb8d       172.20.129.42
2024-06-13T14:42:25Z   management-cluster-cp-43246891df-968lm       172.20.129.164

(other workers omitted from the list)

Alternative scenario:

I'll report more if I can reproduce it, but I'm pretty sure I observed cases where the rke2-*-load-balancer.json files contained no currently existing control plane nodes and listed only old/deleted ones. I suspect this would typically be observed on a worker node created before the rolling update, when the CP node rollout completes before the worker node itself is replaced.

I'll try to reproduce this variant and report back here.

Reference:

I've wondered whether k3s-io/k3s#10241 might be remotely related (in the issue I report here, it is also a matter of a single remaining server being attempted).


tmmorin commented Jun 14, 2024

hello @brandond -- I'm thinking that this one is possibly in code that you own

thanks in advance

brandond commented:

I believe this is a duplicate of #5949


tmmorin commented Jun 17, 2024

I believe this is a duplicate of #5949

Indeed, it seems to be.

Thanks!


tmmorin commented Jun 19, 2024

@brandond -- I would like to reopen this issue.

I've been testing with v1.28.11-rc3+rke2r1 (which incorporates the fix for #5949) and the problem I described here is still present:

  • a worker node remains NotReady
  • the RKE2 agent logs loop, mentioning nodes that no longer exist:
Remotedialer proxy error; reconnecting... error=dial tcp 172.20.129.164:9345: connect: no route to host url=wss://172.20.129.164:9345/v1-rke2/connect
Failed to connect to proxy. Empty dialer response error=dial tcp 172.20.129.92:9345: connect: no route to host
Remotedialer proxy error; reconnecting... error=dial tcp 172.20.129.92:9345: connect: no route to host url=wss://172.20.129.92:9345/v1-rke2/connect
Connecting to proxy url=wss://172.20.129.164:9345/v1-rke2/connect
...
  • a systemctl restart rke2-agent solves the issue
  • the rke2-*-load-balancer.json files contain a mix of existing and non-existing CP nodes, and do not contain all current CP nodes (one is missing)

Details....

The errors were triggered by doing a node rolling update on control plane nodes only (with all nodes starting on v1.28.11+rke2r1):

k get nodes -o wide
NAME                                         STATUS     ROLES                       AGE    VERSION           INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
management-cluster-cp-7ac0c5d32c-2l6fv       Ready      control-plane,etcd,master   116m   v1.28.11+rke2r1   172.20.129.181   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1
management-cluster-cp-7ac0c5d32c-hbcgb       Ready      control-plane,etcd,master   111m   v1.28.11+rke2r1   172.20.129.232   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1
management-cluster-cp-7ac0c5d32c-qlf6g       Ready      control-plane,etcd,master   122m   v1.28.11+rke2r1   172.20.129.216   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1
management-cluster-md-md0-be39655a08-fm9qd   Ready      <none>                      5h3m   v1.28.11+rke2r1   172.20.129.40    <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1
management-cluster-md-md0-be39655a08-jvgrd   NotReady   <none>                      5h8m   v1.28.11+rke2r1   172.20.129.145   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1
management-cluster-md-md0-be39655a08-w2fq6   Ready      <none>                      5h8m   v1.28.11+rke2r1   172.20.129.157   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1
k get nodes -o custom-columns='CREATION:.metadata.creationTimestamp,NAME:.metadata.name,IP:.status.addresses[?(@.type=="InternalIP")].address' | sort                  
2024-06-19T10:10:28Z   management-cluster-md-md0-be39655a08-w2fq6   172.20.129.157
2024-06-19T10:10:40Z   management-cluster-md-md0-be39655a08-jvgrd   172.20.129.145   <<<<< node in error
2024-06-19T10:15:18Z   management-cluster-md-md0-be39655a08-fm9qd   172.20.129.40
2024-06-19T13:16:17Z   management-cluster-cp-7ac0c5d32c-qlf6g       172.20.129.216
2024-06-19T13:22:14Z   management-cluster-cp-7ac0c5d32c-2l6fv       172.20.129.181
2024-06-19T13:27:31Z   management-cluster-cp-7ac0c5d32c-hbcgb       172.20.129.232

Logs:

....
Failed to connect to proxy. Empty dialer response error=dial tcp 172.20.129.196:9345: connect: connection timed out
Remotedialer proxy error; reconnecting... error=dial tcp 172.20.129.196:9345: connect: connection timed out url=wss://172.20.129.196:9345/v1-rke2/connect
Connecting to proxy url=wss://172.20.129.196:9345/v1-rke2/connect
Failed to connect to proxy. Empty dialer response error=dial tcp 172.20.129.164:9345: connect: no route to host
Remotedialer proxy error; reconnecting... error=dial tcp 172.20.129.164:9345: connect: no route to host url=wss://172.20.129.164:9345/v1-rke2/connect
Failed to connect to proxy. Empty dialer response error=dial tcp 172.20.129.92:9345: connect: no route to host
Remotedialer proxy error; reconnecting... error=dial tcp 172.20.129.92:9345: connect: no route to host url=wss://172.20.129.92:9345/v1-rke2/connect
Connecting to proxy url=wss://172.20.129.164:9345/v1-rke2/connect
Connecting to proxy url=wss://172.20.129.92:9345/v1-rke2/connect
...

All of the IPs in the logs above correspond to CP nodes that no longer exist.

root@management-cluster-md-md0-be39655a08-jvgrd:/var/lib/rancher/rke2/agent/etc# grep . *json
rke2-agent-load-balancer.json:{
rke2-agent-load-balancer.json:  "ServerURL": "https://172.20.129.32:9345",
rke2-agent-load-balancer.json:  "ServerAddresses": [
rke2-agent-load-balancer.json:    "172.20.129.164:9345",
rke2-agent-load-balancer.json:    "172.20.129.196:9345",
rke2-agent-load-balancer.json:    "172.20.129.216:9345",
rke2-agent-load-balancer.json:    "172.20.129.92:9345"
rke2-agent-load-balancer.json:  ],
rke2-agent-load-balancer.json:  "Listener": null
rke2-agent-load-balancer.json:}
rke2-api-server-agent-load-balancer.json:{
rke2-api-server-agent-load-balancer.json:  "ServerURL": "https://172.20.129.32:6443",
rke2-api-server-agent-load-balancer.json:  "ServerAddresses": [
rke2-api-server-agent-load-balancer.json:    "172.20.129.164:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.196:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.216:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.92:6443"
rke2-api-server-agent-load-balancer.json:  ],
rke2-api-server-agent-load-balancer.json:  "Listener": null
rke2-api-server-agent-load-balancer.json:}

In these files:

  • 172.20.129.164 -> no CP node at this address anymore
  • 172.20.129.196 -> no CP node at this address anymore
  • 172.20.129.216 -> a CP node exists at this address
  • 172.20.129.92 -> no CP node at this address anymore

On the worker nodes which are not in error, there are no comparable errors in the logs, and these files list all current CP nodes:

root@management-cluster-md-md0-be39655a08-fm9qd:/var/lib/rancher/rke2/agent/etc# grep . *json
rke2-agent-load-balancer.json:{
rke2-agent-load-balancer.json:  "ServerURL": "https://172.20.129.32:9345",
rke2-agent-load-balancer.json:  "ServerAddresses": [
rke2-agent-load-balancer.json:    "172.20.129.181:9345",
rke2-agent-load-balancer.json:    "172.20.129.216:9345",
rke2-agent-load-balancer.json:    "172.20.129.232:9345"
rke2-agent-load-balancer.json:  ],
rke2-agent-load-balancer.json:  "Listener": null
rke2-agent-load-balancer.json:}
rke2-api-server-agent-load-balancer.json:{
rke2-api-server-agent-load-balancer.json:  "ServerURL": "https://172.20.129.32:6443",
rke2-api-server-agent-load-balancer.json:  "ServerAddresses": [
rke2-api-server-agent-load-balancer.json:    "172.20.129.181:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.216:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.232:6443"
rke2-api-server-agent-load-balancer.json:  ],
rke2-api-server-agent-load-balancer.json:  "Listener": null
rke2-api-server-agent-load-balancer.json:}


brandond commented Jun 19, 2024

@tmmorin can you confirm that

  1. The server in all agent kubeconfigs points to the local load-balancer address:
    root@rke2-agent-1:/# grep server: /var/lib/rancher/rke2/agent/*.kubeconfig
    /var/lib/rancher/rke2/agent/kubelet.kubeconfig:    server: https://127.0.0.1:6443
    /var/lib/rancher/rke2/agent/kubeproxy.kubeconfig:    server: https://127.0.0.1:6443
    /var/lib/rancher/rke2/agent/rke2controller.kubeconfig:    server: https://127.0.0.1:6443
  2. The load-balancer at 172.20.129.32:6443 is reachable and contains only currently available cluster members
  3. You're replacing the control-plane nodes with enough time between node replacements that agents are able to reconnect to another node and discover the updated endpoint list, before all of the previous endpoints have been replaced. Waiting a few minutes between node replacements should be sufficient.

If all of these are true, please run your agent nodes with debug: true and attach logs from one of the nodes that becomes isolated from the cluster following control-plane node replacement.

The behavior you're describing makes it sound like you're replacing all of the control-plane nodes before the agent's health-check fails and forces it to switch over to a new server, at which point all of the servers it was previously aware of are unavailable and it has nowhere to fall back to. There could possibly be an issue with the agent failing to fall back to the default ServerURL, because there are still servers present in the pool? But I'd need more information from you to confirm that.


tmmorin commented Jun 19, 2024

On (1): I'll re-break my environment to check that and report here.

On (2):

The load-balancer at 172.20.129.32:6443 is reachable and contains only currently available cluster members

(not entirely sure if this is your question, but:)
The k8s cluster at 172.20.129.32:6443 lists both CP nodes and worker nodes, as in the output above (all CP nodes are up/ready/current, and all worker nodes are OK except the NotReady one).

Or are you asking about RKE2 cluster members? (But then I'd expect you to ask about :9345, and I would need to know how to list cluster members via the RKE2 server API: is there some curl I can do to see that?)

On (3):

You're replacing the control-plane nodes with enough time between node replacements that agents are able to reconnect to another node and discover the updated endpoint list

The rolling update started at 13:12, first node was up at 13:16, next at 13:22, last at 13:27.

By default, Cluster API does not wait at all between the last created node becoming Ready and starting the drain of the next one. I need to search a bit; this is possibly tunable.

The behavior you're describing makes it sound like you're replacing all of the control-plane nodes before the agent's health-check fails and forces it to switch over to a new server, at which point all of the servers it was previously aware of are unavailable and it has nowhere to fall back to. There could possibly be an issue with the agent failing to fall back to the default ServerURL, because there are still servers present in the pool? But I'd need more information from you to confirm that.

My questions would be:

  • what is the health-check period of the agents? How long should the delay between node replacements be to be safe? (Can it be tuned to a shorter value?)
  • is the hypothetical "nowhere to fall back" scenario compatible with the observation I reported above that the rke2-*-load-balancer.json files do contain one valid CP node IP address (172.20.129.216 in my dumps above)? Why isn't my agent trying to use that address?


brandond commented Jun 19, 2024

The k8s cluster at 172.20.129.32:6443 list both CP nodes and worker nodes, as in the output above (all CP nodes are up/ready/current, and all worker nodes are ok except the NotReady one)
or are you asking about RKE2 cluster members ? (but then I'd expect you to ask about :9345 and then I would need to know how to list cluster members on RKE2 server API: is there some curl I can do to see that ?)

The server address you are using in your agent configuration 172.20.129.32 is a load-balancer or virtual IP, not just the address of a single node, correct? What we refer to in the HA docs as a fixed registration address that is placed in front of server nodes. How are you keeping this in sync with the current list of available control-plane nodes, as you cycle nodes in and out of the cluster?

The rolling update started at 13:12, first node was up at 13:16, next at 13:22, last at 13:27.
Cluster API does by default not wait anytime between the last created node being Ready and starting the drain of the next one. I need to search a bit, this possibly is tunable.

Run kubectl get endpoints kubernetes -w on one of the control-plane nodes while you're rotating other nodes in and out of the cluster, and check to see when the new nodes are added to the list, and when the old ones are removed. Confirm that the new node is present in this list, before the next old node is being taken down by CAPI.

what is the health-check period of the agents ? how long should be set the delay to be safe ? (can we tune it to a shorter delay ?)

The agent load-balancers use connectivity to the websocket tunnel endpoint on a server as a health-check. When the websocket tunnel is disconnected, or the server is removed from the list, all connections using that server are closed, and clients will reconnect through the load-balancer to another server that IS passing health checks.

If the server is being hard powered off by CAPI, and you have to wait for the client to get an IO Timeout from the network stack instead of the connection being actively closed when the server is stopped, then the health-checks may take longer to respond to outages. Ideally you would be stopping RKE2 on the nodes before removing them from the cluster, rather than just unceremoniously dropping them off the network.
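
To make the mechanism described above concrete, here is a rough standalone sketch of the idea (hypothetical code, not the actual k3s/remotedialer implementation): each server's health is driven by the state of its websocket tunnel, dials skip servers that are currently failing, and the fixed registration address is the fallback when nothing is healthy.

package main

import "fmt"

// server is a hypothetical stand-in for a load balancer backend whose health
// is driven by the state of the websocket tunnel to that server.
type server struct {
	address string
	healthy bool // flipped to false when the tunnel drops or the server is removed
}

// pickServer returns the first healthy server, or the default (fixed
// registration) address if none of the known servers pass their health check.
func pickServer(servers []server, defaultAddr string) string {
	for _, s := range servers {
		if s.healthy {
			return s.address
		}
	}
	return defaultAddr
}

func main() {
	servers := []server{
		{address: "172.20.129.84:9345", healthy: false}, // node gone, tunnel down
		{address: "172.20.129.21:9345", healthy: true},
	}
	fmt.Println(pickServer(servers, "172.20.129.32:9345")) // prints 172.20.129.21:9345
}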

is the hypothetic scenario of "nowhere to fallback" compatible with the observation that I reported above that the rke2-*load-balancer.json files does contain one valid CP node IP address ? (172.20.129.216 in my dumps above) why isn't my agent trying to use this address ?

You're not running with debug enabled, so we can't see whether or not it is trying to use it. All that your logs show is that it's trying to reconnect to the old server's websocket endpoints, because it hasn't seen them get removed from the kubernetes service's endpoints list.


tmmorin commented Jun 19, 2024

The server address you are using in your agent configuration 172.20.129.32 is a load-balancer or virtual IP, not just the address of a single node, correct? What we refer to in the HA docs as a fixed registration address that is placed in front of server nodes. How are you keeping this in sync with the current list of available control-plane nodes, as you cycle nodes in and out of the cluster?

Yes, 172.20.129.32 is a virtual IP.

It's managed by MetalLB, which acts as the load balancer class for a few LB services, including one for the k8s API itself and RKE2's port 9345:

$ k get services -n kube-system kubernetes-vip
kubernetes-vip  LoadBalancer   100.73.199.33   172.20.129.32       6443:30614/TCP,9345:31483/TCP         65d

MetalLB essentially handles the VIP failover; once the traffic hits the node holding the VIP, kube-proxy does the rest.

... If the server is being hard powered off by CAPI, ...

Well, CAPI is (ceremoniously) doing a drain, then deleting the Node from k8s; the machine is then shut down via (in my case) the OpenStack API, which I presume triggers a systemd-based system shutdown (but I would need to check that).

At this point I think we'll need to loop in the team working on https://github.com/rancher-sandbox/cluster-api-provider-rke2 - they may have ideas on this topic.

/cc @alexander-demicev @belgaied2 @Danil-Grigorev @furkatgofurov7

... you have to wait for the client to get an IO Timeout from the network stack instead of the connection being actively closed when the server is stopped ...

As a side note: a server being brutally shut down is something that can happen, and I would typically expect a health-check mechanism to have some form of keepalive that reacts faster than the TCP stack does.

why isn't my agent trying to use [the address of the working CP nodes it knows ] ?

You're not running with debug enabled, so we can't see whether or not it is trying to use it.

Ah, I get it. So I gather that the error logs I see do not necessarily mean that the agent has no connection to the control plane nodes. But then the next question would be: why does it get stuck NotReady?


tmmorin commented Jun 19, 2024

I just did a run, and here is the information I gathered (I started the watch after one new CP node had already been introduced).

$ (while true; do kubectl get endpoints -n kube-system kubernetes-vip -o yaml -w; sleep 1; done) > watch-kubernetes-vip-endpoints.yaml
grep -E 'ip:|---' watch-kubernetes-vip-endpoints.yaml
  - ip: 172.20.129.145
  - ip: 172.20.129.19
  - ip: 172.20.129.64
---
  - ip: 172.20.129.145
  - ip: 172.20.129.19
  - ip: 172.20.129.64
---
  - ip: 172.20.129.145
  - ip: 172.20.129.19
  - ip: 172.20.129.64
  - ip: 172.20.129.84    # new node
---
  - ip: 172.20.129.145
  - ip: 172.20.129.19  
# .64 is gone
  - ip: 172.20.129.84
---
  - ip: 172.20.129.145
  - ip: 172.20.129.19
  - ip: 172.20.129.21   # new node
  - ip: 172.20.129.84
---
  - ip: 172.20.129.145
# .19 is gone
  - ip: 172.20.129.21
  - ip: 172.20.129.84

So it seems like the changes always add the endpoint for a new node before removing the old one.

Confirm that the new node is present in this list, before the next old node is being taken down by CAPI.

One question though: in a baremetal environment, where we typically do not have spare nodes, a control plane node rolling update may happen by first draining/deleting a CP node and only then recreating one (on the same baremetal server). Is this something that would be problematic for RKE2?

Also, on the node that ended up NotReady, I checked what you asked about in a previous comment:

root@management-cluster-md-md0-be39655a08-q46b9:/home/node-admin# grep server: /var/lib/rancher/rke2/agent/*.kubeconfig
/var/lib/rancher/rke2/agent/kubelet.kubeconfig:    server: https://127.0.0.1:6443
/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig:    server: https://127.0.0.1:6443
/var/lib/rancher/rke2/agent/rke2controller.kubeconfig:    server: https://127.0.0.1:6443

brandond commented:

Well, debug logs and kubeconfigs should show more... but a couple different things could be happening:

  • Kubelet and supervisor aren't using the load balancer
  • Load balancer isn't getting updated with new servers, because all the old servers are going away before it can fail over and update the list
  • Something else?

Grab that info when you can.


tmmorin commented Jun 20, 2024

Here are the RKE2 logs from a worker node, taken during a CP node rolling update.

The rolling update was triggered around 10:19.

Before the rolling update, the CP nodes were:

NAME                                         STATUS     ROLES                       AGE   VERSION           INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
management-cluster-cp-7ac0c5d32c-29wkb       Ready      control-plane,etcd,master   13h   v1.28.11+rke2r1   172.20.129.84    <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1
management-cluster-cp-7ac0c5d32c-cqxjm       Ready      control-plane,etcd,master   13h   v1.28.11+rke2r1   172.20.129.21    <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1
management-cluster-cp-7ac0c5d32c-z6667       Ready      control-plane,etcd,master   13h   v1.28.11+rke2r1   172.20.129.145   <none>        Ubuntu 22.04.4 LTS   5.15.0-112-generic   containerd://1.7.17-k3s1

After the rolling update:

$ k get nodes -o custom-columns='CREATION:.metadata.creationTimestamp,NAME:.metadata.name,IP:.status.addresses[?(@.type=="InternalIP")].address,VERSION:.status.nodeInfo.kubeletVersion' | sort
2024-06-19T16:49:50Z   management-cluster-md-md0-be39655a08-6v428   172.20.129.102   v1.28.11+rke2r1
2024-06-19T16:50:17Z   management-cluster-md-md0-be39655a08-q46b9   172.20.129.240   v1.28.11+rke2r1
2024-06-19T16:54:09Z   management-cluster-md-md0-be39655a08-2njjb   172.20.129.157   v1.28.11+rke2r1
2024-06-20T10:15:10Z   management-cluster-cp-7ac0c5d32c-fpxw9       172.20.129.250   v1.28.11+rke2r1
2024-06-20T10:21:29Z   management-cluster-cp-7ac0c5d32c-2w5pv       172.20.129.211   v1.28.11+rke2r1
2024-06-20T10:27:24Z   management-cluster-cp-7ac0c5d32c-9fxjs       172.20.129.241   v1.28.11+rke2r1

Here are the logs of management-cluster-md-md0-be39655a08-q46b9 (172.20.129.240):
rke2.log

After the rolling update, I have this on that node (2 old/invalid IPs, and 2 current/valid ones):

rke2-agent-load-balancer.json:{
rke2-agent-load-balancer.json:  "ServerURL": "https://172.20.129.32:9345",
rke2-agent-load-balancer.json:  "ServerAddresses": [
rke2-agent-load-balancer.json:    "172.20.129.211:9345",
rke2-agent-load-balancer.json:    "172.20.129.21:9345",
rke2-agent-load-balancer.json:    "172.20.129.250:9345",
rke2-agent-load-balancer.json:    "172.20.129.84:9345"
rke2-agent-load-balancer.json:  ],
rke2-agent-load-balancer.json:  "Listener": null
rke2-agent-load-balancer.json:}
rke2-api-server-agent-load-balancer.json:{
rke2-api-server-agent-load-balancer.json:  "ServerURL": "https://172.20.129.32:6443",
rke2-api-server-agent-load-balancer.json:  "ServerAddresses": [
rke2-api-server-agent-load-balancer.json:    "172.20.129.211:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.21:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.250:6443",
rke2-api-server-agent-load-balancer.json:    "172.20.129.84:6443"
rke2-api-server-agent-load-balancer.json:  ],
rke2-api-server-agent-load-balancer.json:  "Listener": null
rke2-api-server-agent-load-balancer.json:}


tmmorin commented Jun 20, 2024

This surprises me -- one of the old CP node addresses still appears in the RKE2 logs, even after log lines indicating that RKE2 saw this CP node go away (172.20.129.145):

$ sed -e 's/.*]: //' rke2.log | tr -d '"' | grep 172.20.129.145
time=2024-06-20T10:15:08Z level=info msg=Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [172.20.129.145:6443 172.20.129.21:6443 172.20.129.250:6443 172.20.129.84:6443] [default: 172.20.129.32:6443]
time=2024-06-20T10:15:08Z level=info msg=Updated load balancer rke2-agent-load-balancer server addresses -> [172.20.129.145:9345 172.20.129.21:9345 172.20.129.250:9345 172.20.129.84:9345] [default: 172.20.129.145:9345]
time=2024-06-20T10:16:35Z level=info msg=Removing server from load balancer rke2-api-server-agent-load-balancer: 172.20.129.145:6443
time=2024-06-20T10:16:35Z level=info msg=Removing server from load balancer rke2-agent-load-balancer: 172.20.129.145:9345
time=2024-06-20T10:16:35Z level=info msg=Updated load balancer rke2-agent-load-balancer server addresses -> [172.20.129.21:9345 172.20.129.250:9345 172.20.129.84:9345] [default: 172.20.129.145:9345]
time=2024-06-20T10:16:35Z level=info msg=Stopped tunnel to 172.20.129.145:9345
time=2024-06-20T10:16:35Z level=info msg=Proxy done err=context canceled url=wss://172.20.129.145:9345/v1-rke2/connect
time=2024-06-20T10:21:31Z level=info msg=Updated load balancer rke2-agent-load-balancer server addresses -> [172.20.129.211:9345 172.20.129.21:9345 172.20.129.250:9345 172.20.129.84:9345] [default: 172.20.129.145:9345]

So:

  • at 10:15:08 we have an update (the addition of 172.20.129.250) and we learn that 172.20.129.145 is the current default for the rke2-agent-load-balancer
  • at 10:16:35 we have messages indicating that 172.20.129.145 is being removed (this is the first CP node to be dismantled, one minute after the new one, 172.20.129.250, was added)
Removing server from load balancer rke2-api-server-agent-load-balancer: 172.20.129.145:6443
Removing server from load balancer rke2-agent-load-balancer: 172.20.129.145:9345
  • but then, we see the following messages (at 10:16:35 and 10:21:31) still mentioning 172.20.129.145 as "default":
2024-06-20T10:16:35Z Updated load balancer rke2-agent-load-balancer server addresses ->
       [172.20.129.21:9345 172.20.129.250:9345 172.20.129.84:9345] [default: 172.20.129.145:9345]
                                                                         ^^^^^^^^^^^^^^^^^^^^^
...
2024-06-20T10:21:31Z Updated load balancer rke2-agent-load-balancer server addresses ->
        [172.20.129.211:9345 172.20.129.21:9345 172.20.129.250:9345 172.20.129.84:9345] [default: 172.20.129.145:9345]
                                                                                               ^^^^^^^^^^^^^^^^^^^^^

It does not seem right at all that 172.20.129.145 would still appear even though it was removed, does it?

EDIT: this seems to possibly be the reason - https://github.com/k3s-io/k3s/blame/aa4794b37223156c5f651d94e23670bd7e581607/pkg/agent/loadbalancer/servers.go#L88

// Don't delete the default server from the server map, in case we need to fall back to it.
if removedServer != lb.defaultServerAddress {
	delete(lb.servers, removedServer)
}

Also, I can't help wondering why the agent load-balancer isn't using the VIP as its default (contrary to the api-server LB, which does have the VIP as "default")...

EDIT (after reading bits of code around https://github.com/k3s-io/k3s/blob/master/pkg/agent/loadbalancer/loadbalancer.go): I'm far from having a good grasp of the k3s proxy/loadbalancer code, but I'm thinking the answer may be related to how loadbalancer.SetDefault and proxy.SetSupervisorDefault are called with per-server addresses.

EDIT: after checking logs:

2024-06-20T09:55:52Z Getting list of apiserver endpoints from server
2024-06-20T09:55:52Z  Waiting for Ready condition to be updated for Kubelet Port assignment
2024-06-20T09:55:52Z Updated load balancer rke2-agent-load-balancer default server address -> 172.20.129.145:9345

So the default for the agent load-balancer was set, from https://github.com/k3s-io/k3s/blob/aa4794b37223156c5f651d94e23670bd7e581607/pkg/agent/tunnel/tunnel.go#L131, to the address of one of the API servers known at the time.
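
A tiny standalone sketch of the effect described above (hypothetical code meant only to illustrate the behaviour; the real call site is the tunnel.go line linked above, and the real method signatures may differ):

package main

import "fmt"

// supervisorProxy is a hypothetical stand-in for the k3s proxy object whose
// SetSupervisorDefault method is mentioned above.
type supervisorProxy struct {
	defaultAddress string // starts out as the VIP from --server / config.yaml
}

func (p *supervisorProxy) SetSupervisorDefault(address string) {
	p.defaultAddress = address
}

func main() {
	p := &supervisorProxy{defaultAddress: "172.20.129.32:9345"} // the VIP
	// Once the agent has fetched the list of apiserver endpoints, one of the
	// per-node addresses replaces the VIP as the default, matching the
	// "Updated load balancer rke2-agent-load-balancer default server address"
	// log line above.
	endpoints := []string{"172.20.129.145:9345"}
	if len(endpoints) > 0 {
		p.SetSupervisorDefault(endpoints[0])
	}
	fmt.Println("default is now", p.defaultAddress) // no longer the VIP
}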


tmmorin commented Jun 20, 2024

My recap of the timeline of this log would be:

  • 10:15: .250 node appears, added as backend on the two load balancers (agent and api-server)
  • 10:16: .145 node goes away, removed from the list of backends (but as noticed above, remains as "default" for the agent load balancer)
  • 10:21: .211 node appears, added as backend on the two load balancers (agent and api-server)
  • 10:23:20-10:23:30: .84 node goes away (checked from RKE2 logs of other nodes in the cluster "Removing etcd member from cluster due to node delete"), but there is no message indicating that our node notices it
  • from this point, lots of repeated errors:
2024-06-20T10:23:22Z debug Dial error from load balancer rke2-api-server-agent-load-balancer: dial tcp 172.20.129.84:6443: connect: no route to host
2024-06-20T10:23:22Z debug Failed over to new server for load balancer rke2-api-server-agent-load-balancer: 172.20.129.84:6443 -> 172.20.129.21:6443
  • then slightly different ones:
2024-06-20T10:23:31Z error Remotedialer proxy error; reconnecting... error=read tcp 172.20.129.240:33456->172.20.129.84:9345: i/o timeout url=wss://172.20.129.84:9345/v1-rke2/connect
2024-06-20T10:23:31Z debug Wrote ping
2024-06-20T10:23:32Z debug Dial error from load balancer rke2-api-server-agent-load-balancer: dial tcp 172.20.129.84:6443: connect: no route to host
2024-06-20T10:23:32Z info Connecting to proxy url=wss://172.20.129.84:9345/v1-rke2/connect
2024-06-20T10:23:33Z debug Wrote ping
2024-06-20T10:23:35Z error Failed to connect to proxy. Empty dialer response error=dial tcp 172.20.129.84:9345: connect: no route to host
2024-06-20T10:23:35Z error Remotedialer proxy error; reconnecting... error=dial tcp 172.20.129.84:9345: connect: no route to host url=wss://172.20.129.84:9345/v1-rke2/connect
2024-06-20T10:23:36Z debug Wrote ping

these ones repeat over and over

  • at 10:29, node .21 is removed (seen from other nodes in the cluster: "Removing etcd member from cluster due to node delete"), and then we have these error messages repeated:
2024-06-20T10:29:54Z info Connecting to proxy url=wss://172.20.129.84:9345/v1-rke2/connect
2024-06-20T10:29:56Z error Failed to connect to proxy. Empty dialer response error=dial tcp 172.20.129.21:9345: connect: no route to host
2024-06-20T10:29:56Z error Remotedialer proxy error; reconnecting... error=dial tcp 172.20.129.21:9345: connect: no route to host url=wss://172.20.129.21:9345/v1-rke2/connect
2024-06-20T10:29:56Z error Failed to connect to proxy. Empty dialer response error=dial tcp 172.20.129.84:9345: connect: no route to host
2024-06-20T10:29:56Z error Remotedialer proxy error; reconnecting... error=dial tcp 172.20.129.84:9345: connect: no route to host url=wss://172.20.129.84:9345/v1-rke2/connect
2024-06-20T10:29:56Z debug Wrote ping
2024-06-20T10:29:57Z info Connecting to proxy url=wss://172.20.129.21:9345/v1-rke2/connect
2024-06-20T10:29:57Z info Connecting to proxy url=wss://172.20.129.84:9345/v1-rke2/connect
2024-06-20T10:29:58Z debug Wrote ping


tmmorin commented Jun 20, 2024

I also want to share another observation I made on CP nodes. I fully realize that it may be noise and unrelated to this issue, but it is sufficiently correlated (in time) and surprising to be worth mentioning.

Around the time of the rolling update, I see this in rke2 logs:

----------------------------------- 172.20.129.250 ------------------------------------
2024/06/20 10:16:33 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
----------------------------------- 172.20.129.211 ------------------------------------
2024/06/20 10:22:57 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 10:26:57 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 10:34:57 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 10:50:57 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 11:22:57 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 12:26:57 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 14:34:57 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
----------------------------------- 172.20.129.241 ------------------------------------
2024/06/20 10:28:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 10:32:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 10:40:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 10:56:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 11:28:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 12:32:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/06/20 14:40:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".

(the times at which these errors pop up match the times at which nodes appear/disappear)

brandond self-assigned this Jun 25, 2024
brandond reopened this Jun 25, 2024
brandond added the kind/bug label Jun 25, 2024

tmmorin commented Jun 25, 2024

To try to nail down why the kubelet isn't able to connect to https://localhost:6443, I tried with curl.

root@management-cluster-md-md0-f2d2b6e8a6-r7wsq:/var/lib/rancher/rke2# curl https://127.0.0.1:6443 -vvv
*   Trying 127.0.0.1:6443...
* Connected to 127.0.0.1 (127.0.0.1) port 6443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* SSL connection timeout
* Closing connection 0

This gives no response, until it hits "curl: (28) SSL connection timeout"...

I let it run in one terminal.

In another terminal:

root@management-cluster-md-md0-f2d2b6e8a6-r7wsq:/home/node-admin# ss -atpn |grep curl
ESTAB      0      0                    127.0.0.1:57024               127.0.0.1:6443  users:(("curl",pid=2553734,fd=5))                          
root@management-cluster-md-md0-f2d2b6e8a6-r7wsq:/home/node-admin# ss -atpn |grep 57024
ESTAB      517    0                    127.0.0.1:6443                127.0.0.1:57024 users:(("rke2",pid=2451265,fd=4728))                       
ESTAB      0      0                    127.0.0.1:57024               127.0.0.1:6443  users:(("curl",pid=2553734,fd=5))                          

This indicates that the rke2 process isn't really consuming the data on this socket (note the Recv-Q of 517 bytes on the rke2 side of the connection: the TLS Client Hello sits unread in the kernel buffer).

Doing this again while running a tcpdump capture gives me a packet dump in which I see:

  • the TCP handshake
  • the client sending the TLS Client Hello
  • a TCP ACK from the server acknowledging this data
  • but no TLS answer

This seems to indicate that the connection is stuck somewhere in the rke2 process, possibly close to the TLS layer.
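
As an aside, the "accepted but never read" state is easy to reproduce in isolation. The standalone sketch below (unrelated to the rke2 code itself) shows a listener whose handler never reads: the client's bytes are ACKed by the kernel and sit in the server-side Recv-Q, and the client only ever sees a timeout, just like the curl/ss/tcpdump observations above.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		conn, _ := ln.Accept()
		_ = conn  // accepted, but the handler never reads from it
		select {} // block forever, like the stuck goroutine suspected in rke2
	}()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	// The write succeeds (the kernel buffers and ACKs it), but no
	// application-level answer ever comes back.
	client.Write([]byte("pretend this is a TLS Client Hello"))
	client.SetReadDeadline(time.Now().Add(2 * time.Second))
	buf := make([]byte, 1)
	_, err = client.Read(buf)
	fmt.Println("read result:", err) // i/o timeout, as with curl's SSL connection timeout
}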

alexander-demicev commented:

@tmmorin Hi, there is likely a bug in CAPRKE2. If you're using registrationMethod: address in RKE2ControlPlane object, it should use the provided IP as the registration endpoint for your nodes.


tmmorin commented Jun 25, 2024

@tmmorin Hi, there is likely a bug in CAPRKE2. If you're using registrationMethod: address in RKE2ControlPlane object, it should use the provided IP as the registration endpoint for your nodes.

I'm not sure I follow you @alexander-demicev

We do use registrationMethod: address with our MetalLB VIP:

apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
kind: RKE2ControlPlane
...
spec:
...
  registrationAddress: 172.20.129.32
  registrationMethod: address

And the rke2 config is generated accordingly:

root@management-cluster-md-md0-f2d2b6e8a6-r7wsq:/home/node-admin# grep server: /etc/rancher/rke2/config.yaml 
server: https://172.20.129.32:9345

This IP is then properly used by the agent for its initial connections to the cluster.

It's only later that the default address for the supervisor agent load-balancer becomes a node-specific address:

2024-06-20T09:55:52Z Updated load balancer rke2-agent-load-balancer default server address -> 172.20.129.145:9345
                                                                                             ^^not the VIP^^

This just does not seem right, and the place where this happens seems to be pretty clear (the tunnel.go code linked in my earlier comment).

Now, what I do not get is whether (or how) this relates to the root cause of our issue: the fact that the kubelet can't talk to https://127.0.0.1:6443 anymore, resulting in the node becoming NotReady soon after.


tmmorin commented Jun 25, 2024

I realize that I should have provided more context:

This issue is seen in the context of the Sylva project, where we have been successfully testing and deploying RKE2-based clusters and doing node rolling updates.

This has been working incrementally better through the past year, and reached a fairly satisfying maturity months ago. Despite sometimes hitting some issues (e.g. side-effects of #5614), in the past months we've been experiencing a lot of successful Cluster API node rolling updates, without ever hitting the issue I describe here.

This issue started to pop up in Sylva with the integration of 1.28.9 (replacing 1.28.8).

In this time frame there was no significant change that we could relate to this issue in how we use Cluster API, how we configure the Cluster API bootstrap provider for RKE2, or how fast we shut down nodes.

As a further confirmation, I tried today to reproduce the issue with RKE2 1.28.8 on the worker nodes -- doing the exact same things, in the exact same setup (control plane nodes remaining on the RKE2 1.28.11-rc5 I had been using in the past days) -- and I cannot reproduce the issue with RKE2 1.28.8 on the worker nodes, which leads me to believe the issue was introduced in 1.28.9.


tmmorin commented Jun 25, 2024

I took the current release-1.28 branches of rke2 and k3s and added debug logs in various places, including before and after the lb.mutex Lock/RLock calls in nextServer and setServers (https://github.com/k3s-io/k3s/blob/5773a3444740c69b86019d82e6cfb00a76b3e148/pkg/agent/loadbalancer/servers.go#L68 and https://github.com/k3s-io/k3s/blob/5773a3444740c69b86019d82e6cfb00a76b3e148/pkg/agent/loadbalancer/servers.go#L127).

I did this in a quick'n'dirty way, but I observe many more "before lock" logs than "after lock" ones.
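
The instrumentation was roughly of the following shape (a reconstruction of the quick'n'dirty patch, shown here against a hypothetical stand-in type rather than the real k3s LoadBalancer; only the two Debugf lines are the addition):

package main

import (
	"sync"

	"github.com/sirupsen/logrus"
)

// loadBalancer is a hypothetical stand-in for the k3s LoadBalancer type; only
// the pieces needed to show where the debug lines were added are included.
type loadBalancer struct {
	serviceName          string
	mutex                sync.RWMutex
	currentServerAddress string
}

// nextServer sketches where the "(before lock)" / "(after lock)" debug lines
// sit around the existing RLock call; the real server selection logic is omitted.
func (lb *loadBalancer) nextServer(failedServer string) string {
	logrus.Debugf("lb.%s nextServer %s (before lock)", lb.serviceName, failedServer)
	lb.mutex.RLock()
	defer lb.mutex.RUnlock()
	logrus.Debugf("lb.%s nextServer %s (after lock)", lb.serviceName, failedServer)
	return lb.currentServerAddress
}

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	lb := &loadBalancer{serviceName: "rke2-api-server-agent-load-balancer", currentServerAddress: "172.20.129.237:6443"}
	lb.nextServer("172.20.129.237:6443")
}

With that in place, the journal shows many "(before lock)" lines with no matching "(after lock)" line: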

Jun 25 15:56:21 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:21Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:21 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:21Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (after lock)"
Jun 25 15:56:21 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:21Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:21 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:21Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (after lock)"
Jun 25 15:56:21 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:21Z" level=debug msg="setServers rke2-api-server-agent-load-balancer (before lock): [172.20.129.109:6443 172.20.129.184:6443 172.20.129.253:6443] [hasOriginalServer:%!s(bool=false)]"
Jun 25 15:56:24 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:24Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: message repeated 2 times: [ time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"]
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:37 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:37Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:41 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:41Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:49 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:49Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:53 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:53Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:53 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:53Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:53 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:53Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:53 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:53Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:53 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:53Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:53 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:53Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
Jun 25 15:56:56 management-cluster-md-md0-f2d2b6e8a6-r7wsq rke2[3328578]: time="2024-06-25T15:56:56Z" level=debug msg="lb.rke2-api-server-agent-load-balancer nextServer 172.20.129.237:6443 (before lock)"
(... the same "before lock" debug message repeats continuously ...)

These code paths were touched by k3s-io/k3s#9757, which was introduced in 1.28.9.

I wanted to share this with you, although I may very well (likely!) be off the mark. Please take this with a grain of salt.

I'll clean up my debug log additions and share complete logs tomorrow.

@alexander-demicev
Copy link
Member

@tmmorin Sorry, I only had a chance to get a better look at the issue now. I don't have much RKE2 internal knowledge but my impression was that workers would always use the VIP.

It's up to the CAPI infra provider to shut down the machine. In my experience, CAPI providers usually terminate instances gracefully. We could make a change in CAPRKE2 and add a systemd service to the user-data that stops the RKE2 service on shutdown. @brandond Can this help nodes discover the updated endpoint list?

@brandond
Copy link
Contributor

The NA team is all at an onsite through Wednesday, so I'm tied up with travel and meetings and won't have time to sit down and focus on this until later in the week. Any fixes wouldn't be merged until the July cycle anyway.

@tmmorin
Copy link
Author

tmmorin commented Jun 26, 2024

Any fixes wouldn't be merged until the July cycle anyway.

Ok, point taken.

For Sylva, we'll downgrade to 1.28.8 and see how it goes.

We had been seeing #5614 occur on our rolling updates, so it's hard to quantify how often this issue will trigger, but we're likely to be back to the situation of ~3 months ago. Let's also hope that issues fixed between Kubernetes 1.28.8 and 1.28.9 won't suddenly pop back up.

@brandond
Copy link
Contributor

brandond commented Jul 3, 2024

Yes, 172.20.129.32 is a virtual IP.

It's managed by MetalLB, which acts as the loadbalancer class for a few LB services, including one for the k8s API itself and RKE2's port 9345.

Can you share the spec for the kubernetes-vip service that you're using with MetalLB, along with any additional info you can provide on your MetalLB deployment - i.e. are you using L2 or BGP mode, etc.? I am still trying to replicate this locally.

If you could also provide full debug-level logs from an rke2 agent from startup, through the period of a control-plane node rotation that disrupts the agent's connection to the server, that would be helpful. I'm not able to follow the full state of the loadbalancer that seems to be triggering this behavior with logs that show only its state during the rotation.

One potential oddity that I am seeing is that we do not update the default address for the apiserver loadbalancer, just the supervisor loadbalancer. I need to do some thinking about the current logic there. There are a couple of different cases that I want to make sure we handle properly:

  • Agent is started with a node as the server, and that node is removed from the cluster, and then the agent is restarted - the agent should be able to switch over and start up properly as long as at least one of the other nodes that it was aware of are still available.
  • Agent is started with a VIP as the server, and all of the servers it was aware of are removed and replaced - the agent should switch over and start up properly as long as the VIP is still available.
  • Etcd-only Server started pointing at itself (cluster-init) or another etcd-only Server node (joining), and needs to later switch over to another control-plane Server node, once one is available.

This indicates that the rke2 process isn't really consuming the data on this socket.

No, it won't consume any data on this socket until after it is able to dial a connection to the backend and bidirectional pipes are set up to relay data back and forth. If it can't dial any backend, the connection should eventually be closed by the server, but not until it tries all the servers in the list and fails - which it seems like might take a while in your environment. The logs don't provide enough information to make it clear how long the dials take before getting a "no route to host" error, but the slow failure may also be related. We might need a shorter timeout on the dials to the loadbalancer servers to ensure that we don't get stuck on a bad server if connections take a long time to fail.
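
To make the "shorter timeout" idea concrete, here is a minimal sketch (not the actual k3s dialer; the addresses and the 5-second timeout are assumptions) of a per-backend dial with a bounded timeout, so that one unreachable server only costs the timeout rather than an open-ended wait:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// tryBackends dials each candidate server with a bounded per-dial timeout
// and returns the first connection that succeeds. A backend that blackholes
// traffic can only delay us by the timeout, not indefinitely.
func tryBackends(ctx context.Context, servers []string) (net.Conn, error) {
	d := net.Dialer{Timeout: 5 * time.Second} // assumed value, not the k3s default
	for _, s := range servers {
		conn, err := d.DialContext(ctx, "tcp", s)
		if err != nil {
			fmt.Printf("dial %s failed: %v\n", s, err)
			continue
		}
		return conn, nil
	}
	return nil, fmt.Errorf("all %d servers failed", len(servers))
}

func main() {
	// hypothetical backend list for demonstration only
	servers := []string{"172.20.129.237:6443", "172.20.129.32:6443"}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if conn, err := tryBackends(ctx, servers); err == nil {
		conn.Close()
	}
}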

@tmmorin
Copy link
Author

tmmorin commented Jul 11, 2024

Sorry for the delayed answer; we were busy preparing a release (based on 1.28.8).

Thanks for your answers on my point on "This indicates that the rke2 process isn't really consuming the data on this socket".

Can you share the spec for the kubernetes-vip service that you're using with MetalLB, along with any additional info you can provide on your MetalLB deployment - ie are you using L2 or BGP mode, etc. I am still trying to replicate this locally.

apiVersion: v1
kind: Service
metadata:
  annotations:
...
    metallb.universe.tf/allow-shared-ip: cluster-external-ip
    metallb.universe.tf/ip-allocated-from-pool: lbpool
...
  name: kubernetes-vip
  namespace: kube-system
...
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 100.73.199.33
  clusterIPs:
  - 100.73.199.33
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Local
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerIP: 172.20.129.32
  ports:
  - name: https
    nodePort: 30614
    port: 6443
    protocol: TCP
    targetPort: 6443
  - name: rancher
    nodePort: 31483
    port: 9345
    protocol: TCP
    targetPort: 9345
  selector:
    component: kube-apiserver
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 172.20.129.32

If you could also provide full debug-level logs from an rke2 agent from startup, through the period of a control-plane node rotation that disrupts the agent's connection to the server, that would be helpful. I'm not able to follow the full state of the loadbalancer that seems to be triggering this behavior with logs that show only its state during the rotation.

Yes, I will.

I could not find time to work on this issue in the past 2 weeks, but I plan to run a reproducer with an RKE2 binary based on the latest 1.28.11-rc with cleaner logs (including logs of the time taken to acquire the lock, just in case this proves useful).

One potential oddity that I am seeing is that we do not update the default address for the apiserver loadbalancer, just the supervisor loadbalancer.

Yes, I noticed this as well.

I can't see how it makes sense to set the default address of the supervisor loadbalancer to a node IP, though: node IPs are meant to change, so this default address may at some point become stale (and will become stale during a full control-plane node rolling update).

The other thing I noticed is that for the default address of an LB, the healthCheck always seems to be the "return true" function. I'm wondering whether this may defeat the backend health check done in dialContext (https://github.com/k3s-io/k3s/blob/1c9cffd97fd5f85a5c2c36e86c976e20820cb084/pkg/agent/loadbalancer/loadbalancer.go#L178).

Then there is also the fact that the default address is never removed from the list of servers (https://github.com/k3s-io/k3s/blob/5773a3444740c69b86019d82e6cfb00a76b3e148/pkg/agent/loadbalancer/servers.go#L94).

Taking these three things together, it seems to me (granted, I don't know much about the k3s code) that a stale server may be kept in the server list, breaking the ability to fall back on another one.
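
To make that hypothesis concrete, here is a minimal sketch (not the actual k3s types, and the real selection logic is more involved; the second node address is made up for illustration) of how an always-healthy default entry that is never pruned could keep winning selection even after its address has gone stale:

package main

import "fmt"

type server struct {
	address     string
	isDefault   bool
	healthCheck func() bool
}

// nextServer returns the first server whose health check passes. If the
// default entry keeps the stub "return true" check, a stale default address
// is never skipped, so healthy replacements behind it may never be tried.
func nextServer(servers []server) (string, bool) {
	for _, s := range servers {
		if s.healthCheck() {
			return s.address, true
		}
	}
	return "", false
}

func main() {
	servers := []server{
		// stale node IP kept as the default, still using the stub check
		{address: "172.20.129.237:6443", isDefault: true, healthCheck: func() bool { return true }},
		// healthy replacement node (hypothetical address), with a real check
		{address: "172.20.129.240:6443", healthCheck: func() bool { return true }},
	}
	addr, _ := nextServer(servers)
	fmt.Println("selected:", addr) // always the stale default
}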

There are a couple different cases that I want to make sure we handle properly: ...

Let me focus on one scenario that you raised:

Agent is started with a VIP as the server, and all of the servers it was aware of are removed and replaced - the agent should switch over and start up properly as long as the VIP is still available.

Two things:

Assuming that the agent keeps getting endpoint updates from the k8s API, I understand that it should keep populating the loadbalancer server backends with the addresses of new nodes as they come up... so it should never need to fall back to the VIP.

Now, if it ever were to fall back to the VIP (but see below, I did not observe this), then in our setup it would not necessarily work, because kube-proxy sets up local iptables rules that bypass the normal IP forwarding to the VIP (where traffic is forwarded to whoever answers ARP queries for the VIP) with DNAT rules pointing to CP node addresses. This gives working connectivity to the VIP only while those addresses are still valid; it breaks if they go stale, which can happen when kube-proxy has lost access to the k8s API (that access goes through RKE2's localhost:6443, so there is a possible deadlock here). However, by checking the counters of the kube-proxy iptables rules (iptables -nvL -t nat), I observed no hits on these DNAT rules.

This last issue would be a robustness problem: we would hit this deadlock if a node is disconnected from the network during a full CP node rolling update and reconnected afterwards - it would have no valid node address, and the VIP forwarding would be broken. Only a flush of these iptables rules would break the deadlock. The solution we have in mind is to switch to kube-vip.

I'll try to find time tomorrow to give you more logs to chew on.

Please be aware that I'll be off for a few weeks starting from tomorrow EOB; my colleague Rémi Le Trocquer has followed the topic with me and will take over.

@brandond
Copy link
Contributor

brandond commented Jul 11, 2024

The other thing I noticed is that on the default address of an LB, the healthCheck seems to always be the "return true" function.

Yes, by design the default health-check just returns true, to ensure that endpoints aren't marked failed before their health-check callbacks are registered. The health checks are registered later, when the health check goroutines are started.

If the default server address is a LB VIP and is not associated with an actual node endpoint to provide a health-check, then it is assumed to always be healthy - which is what we want.
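
As a rough illustration of that design (a sketch, not the actual k3s API): each entry starts with a permissive stub check, and a real node-backed check is swapped in later, while a VIP entry simply keeps the stub.

package main

import "fmt"

type backend struct {
	address     string
	healthCheck func() bool
}

// newBackend registers a server with a permissive default check so it is not
// marked failed before any real health information is available.
func newBackend(address string) *backend {
	return &backend{address: address, healthCheck: func() bool { return true }}
}

// setHealthCheck replaces the stub once a node-backed check exists.
func (b *backend) setHealthCheck(check func() bool) {
	b.healthCheck = check
}

func main() {
	vip := newBackend("172.20.129.32:6443")   // VIP keeps the always-true stub
	node := newBackend("172.20.129.237:6443") // node gets a real check later
	node.setHealthCheck(func() bool { return false }) // e.g. the tunnel to this node is down
	fmt.Println("vip healthy:", vip.healthCheck(), "node healthy:", node.healthCheck())
}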

Now, if it was to ever fallback on the VIP (but see below, I did not observe this), then in our setup, it would not necessarily always work because kube-proxy sets up local iptables rules that bypass the normal IP forwarding to the VIP (where the traffic is forwarded to whoever answers ARP queries for the VIP), with DNAT rules pointing to CP node addresses -- this will give working connectivity to the VIP only if these addresses are still valid, but will not if these IP are stale, which might occur if kube-proxy has lost access to k8s

Ahhh, that is interesting. Usually when we deploy with an LB for the fixed registration endpoint we use standalone haproxy, an ELB endpoint, or a DNS alias; I've not tried using kube-vip, which WOULD interact with kube-proxy due to the use of a Kubernetes service to expose the endpoint. I suspect that you are probably on to something with this.

@brandond
Copy link
Contributor

brandond commented Jul 11, 2024

xref:

You might try setting .status.loadBalancer.ingress.ipMode to Proxy on your kubernetes-vip service. As per that KEP, this should prevent kube-proxy from intercepting the VIP traffic with local iptables rules.

As the loadbalancer status is usually managed by the load-balancer controller, this may need to be supported by kube-vip to avoid having kube-vip remove that field from the status.
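
For reference, the status shape that would express this on the existing kubernetes-vip service (whether the load-balancer controller sets or preserves the field is exactly the open question):

status:
  loadBalancer:
    ingress:
    - ip: 172.20.129.32
      ipMode: Proxy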

@brandond
Copy link
Contributor

brandond commented Jul 12, 2024

OK, so ignore that. With the kube-vip manifest from #6304 I was able to reproduce this by dropping the network on the server that an agent was connected to. The kubelet and rke2 agent process reconnected to another server, but then hung, and all further attempts to connect through the apiserver load-balancer timed out.

You were correct that it is a locking issue, caused by reentrant use of a rwmutex read lock. I will try to get a fix in for next week's releases.
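
For anyone unfamiliar with this Go pitfall, here is a minimal, self-contained illustration (of the general pattern, not the k3s code itself): a writer queues up while a read lock is held, and a nested attempt to take the read lock then blocks behind the writer, which itself blocks behind the outer read lock.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // outer read lock, e.g. held while walking the server list
	go func() {
		mu.Lock() // a writer (e.g. a server-list update) queues behind the reader
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

	done := make(chan struct{})
	go func() {
		// In the real bug the same call path re-acquires the read lock; a
		// separate goroutine is used here only so the stall can be reported.
		mu.RLock() // blocks behind the pending writer
		mu.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("no deadlock")
	case <-time.After(time.Second):
		fmt.Println("deadlock: nested RLock waits on the writer, the writer waits on the outer RLock")
	}
	mu.RUnlock()
}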

@zifeo
Copy link

zifeo commented Jul 12, 2024

@brandond thanks for the efforts!

@tmmorin
Copy link
Author

tmmorin commented Jul 12, 2024

Thank you @brandond for nailing this down 👍

I was going to post a reproducer log, but I'll just share this as a confirmation that there is locking involved, even if I understand that at this stage you don't need this:

# grep "rke2.*before lock" /var/log/syslog | wc -l
124577
# grep "rke2.*after lock" /var/log/syslog | wc -l
132

@tmmorin
Copy link
Author

tmmorin commented Jul 12, 2024

If the default server address is a LB VIP and is not associated with an actual node endpoint to provide a health-check, then it is assumed to always be healthy - which is what we want.

Well, I'm not sure:

  1. first, because of https://github.com/k3s-io/k3s/blob/58ab25927f98fa1597a01ee65ebf4a41f9e87fa0/pkg/agent/tunnel/tunnel.go#L134, even if we start with a VIP, the default address will become a node address -- for which we would need a real healthCheck (this happens only for supervisor/9345)
  2. even if the default address remains the VIP (the case for api-server/6443), perhaps we would also like to be robust to the case where the VIP HA fails?

On (1) I would like to ask you: could we remove this behavior, at least optionally, so that when our server URL is a VIP we could tell RKE2 to never set the default address to a node address (since those may become stale at the next rolling update)?

This is highly related to the code at https://github.com/k3s-io/k3s/blob/5773a3444740c69b86019d82e6cfb00a76b3e148/pkg/agent/loadbalancer/servers.go#L94 .

Now, if it was to ever fallback on the VIP (but see below, I did not observe this), then in our setup, it would not necessarily always work because kube-proxy sets up local iptables rules that bypass the normal IP forwarding to the VIP (where the traffic is forwarded to whoever answers ARP queries for the VIP), with DNAT rules pointing to CP node addresses -- this will give working connectivity to the VIP only if these addresses are still valid, but will not if these IP are stale, which might occur if kube-proxy has lost access to k8s

Ahhh, that is interesting. Usually when we deploy with an LB for the fixed registration endpoint we use standalone haproxy, an ELB endpoint, or DNS alias; I've not tried using kube-vip which WOULD interact with kube-proxy due to use of a Kubernetes service to expose the endpoint. I suspect that you are probably on to something with this.

One misunderstanding here: the issue I describe happens with MetalLB because it does rely on k8s Services and kube-proxy; it would not happen with kube-vip (which does not).

You might try setting .status.loadBalancer.ingress.ipMode to Proxy on your kubernetes-vip service. As per that KEP, this should prevent kube-proxy from intercepting the VIP traffic with local iptables rules.
As the loadbalancer status is usually managed by the load-balancer controller, this may need to be supported by kube-vip to avoid having kube-vip remove that field from the status.

Interesting!

So, as said above, in our context that would have to be supported by MetalLB, but after checking the code and GitHub issues, I find no trace of support for status.loadBalancer.ingress.ipMode in MetalLB.

@brandond
Copy link
Contributor

brandond commented Jul 12, 2024

even if we start with a VIP, the default address will become a node address -- for which we would need a real healthCheck (this happens only for supervisor/9345)
could we remove this behavior, at least optionally, so that when our server URL is a VIP we could tell RKE2 to never set the default address to a node address

I will take a look at that for the next release cycle. I am not sure that this is as critical an issue, as the supervisor lb isn't actually used for anything after the agent has started. Once everything is up and running, the only thing the agent needs a connection to is the apiserver - the supervisor addresses targeted by the remotedialer tunnel are all seeded based on the endpoints returned by the kubernetes apiserver. When you restart the agent, the VIP is restored as the default server address since that is what the --server value is set to, so the VIP's presence is never really missed.

@brandond brandond changed the title RKE2 agent loadbalancer proxy errors when CP nodes are deleted/replaced Agent loadbalancer may deadlock when servers are removed from the network Jul 12, 2024
@brandond brandond changed the title Agent loadbalancer may deadlock when servers are removed from the network Agent loadbalancer may deadlock when servers are removed Jul 12, 2024
@tmmorin
Copy link
Author

tmmorin commented Jul 12, 2024

I am not sure that this is as critical of an issue, as the supervisor lb isn't actually used for anything after the agent has started

Ah, interesting to know. Then indeed I understand that this is of relatively low priority.

Still not nice to see old unused IPs popping up in logs... ;-)

@brandond
Copy link
Contributor

I think I can actually get that in now: k3s-io/k3s#10511

@aganesh-suse
Copy link

Closing based on: #6321
