
Loadbalancer healthchecks report error despite node ports connectable #212

Closed
process0 opened this issue Jun 22, 2021 · 1 comment

process0 commented Jun 22, 2021

Creating this issue here so others don't waste time. Maybe this should be in the documentation.

Setup:

  • K8S
  • Calico
  • hcloud CCM
  • hcloud CSI
  • Istio

I annotated the istio-ingressgateway Service with all the information needed to use a pre-provisioned (Terraformed) load balancer. The http, https, and tcp node ports on the worker nodes were all connectable, yet the HCloud load balancer health checks kept reporting them as unreachable / not healthy.
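
For reference, this is roughly how the gateway Service was annotated. A sketch only: the annotation names are the ones I know from the hcloud CCM docs, the load balancer name "dev1-lb" and the istio-system namespace are placeholders for my setup, and your annotation set may differ:

# Attach the Service to the existing (Terraformed) load balancer by name
# and let it reach the nodes over the private network.
kubectl -n istio-system annotate service istio-ingressgateway \
  load-balancer.hetzner.cloud/name=dev1-lb \
  load-balancer.hetzner.cloud/use-private-ip=true \
  --overwrite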

Checking tcpdump on the interface and filtering on the node port, it's clear the health check packets never get ACKed:

root@dev1-worker-2:~# tcpdump -i ens10 port 31945
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens10, link-type EN10MB (Ethernet), capture size 262144 bytes
22:49:07.439005 IP 10.9.8.5.46310 > dev1-worker-2.cluster.local.31945: Flags [S], seq 2304935557, win 64860, options [mss 1410,sackOK,TS val 1914248445 ecr 0,nop,wscale 7], length 0
22:49:08.461531 IP 10.9.8.5.46310 > dev1-worker-2.cluster.local.31945: Flags [S], seq 2304935557, win 64860, options [mss 1410,sackOK,TS val 1914249468 ecr 0,nop,wscale 7], length 0
22:49:10.477634 IP 10.9.8.5.46310 > dev1-worker-2.cluster.local.31945: Flags [S], seq 2304935557, win 64860, options [mss 1410,sackOK,TS val 1914251484 ecr 0,nop,wscale 7], length 0

The health check source IP (10.9.8.5, the load balancer's private IP) is also bound locally to the kube-ipvs0 interface:

6: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
    link/ether 0e:83:d9:ad:a1:e3 brd ff:ff:ff:ff:ff:ff
    ...
    inet 10.9.8.5/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever

It became clear why: a local route for the HCloud load balancer IP exists on every worker node, so the replies to the health checks never leave the node and the HCloud load balancer never receives a response.
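
To see the offending route on a worker node (commands only; exact output varies by cluster):

# The kernel's local routing table has an entry for the load balancer IP on kube-ipvs0,
# so replies to 10.9.8.5 are delivered locally instead of going back out over ens10.
ip route show table local | grep 10.9.8.5

# The address itself sits on the dummy interface that kube-proxy creates in ipvs mode.
ip addr show kube-ipvs0 | grep 10.9.8.5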

I came across this useful comment in #58 (comment):

I think I'm hitting a similar issue and took a deeper look into it. Actually, the issue seems to be "quite known" in the Kubernetes community (metallb/metallb#153, kubernetes/kubernetes#79976, kubernetes/kubernetes#66607, kubernetes/kubernetes#92312, kubernetes/enhancements#1392, kubernetes/kubernetes#79783, kubernetes/kubernetes#59976).

TLDR: I think the problem is the following:

Using hcloud-cloud-controller-manager, LoadBalancer Services get to know their external IPs. This IP gets added to the kube-ipvs0 interface to allow cluster-internal access to the load balancer. A local route pointing to this IP is also created on all nodes (ip route show table local). When the (Hetzner) load balancer then sends a health check packet, the cluster's reply stays within the cluster, since the route points to the kube-ipvs0 interface instead of the internal network's interface.

There are a lot of solutions in discussion, but as far as I know, nothing helpful so far. The only workaround seems to be to use iptables instead of ipvs as the kube-proxy mode (I haven't tried it with a Hetzner load balancer yet). However, this comes with a performance drawback (https://www.projectcalico.org/comparing-kube-proxy-modes-iptables-or-ipvs/).
As a very dirty hack and experiment, I temporarily removed the local route (ip route del local $internal_loadbalancer_ip dev kube-ipvs0 table local) and the health checks turned green immediately. However, this ugly workaround will not survive a reboot.

I'm currently reading up on whether it might be possible to replace kube-proxy/ipvs with Cilium, but I've only just started trying to understand things there. For now, I guess, only iptables will "work". But I'm happy to discuss and work with you and Hetzner staff to find a solution.
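
For anyone who wants to try the iptables workaround mentioned above, a rough sketch for a kubeadm-managed cluster (assumes the usual kube-proxy ConfigMap and DaemonSet; other installers configure the mode elsewhere):

# 1. Change mode: "ipvs" to mode: "iptables" in the KubeProxyConfiguration
#    stored in the kube-proxy ConfigMap.
kubectl -n kube-system edit configmap kube-proxy

# 2. Restart kube-proxy so it picks up the new mode.
kubectl -n kube-system rollout restart daemonset kube-proxy

# The kube-ipvs0 interface and its addresses may linger until the node is rebooted
# or the interface is removed manually (ip link delete kube-ipvs0).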

And a fix: #58 (comment), which is to annotate the Service with load-balancer.hetzner.cloud/hostname: your-ingress.acme.corp
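
In kubectl terms that is just the following (istio-system namespace assumed; your-ingress.acme.corp is a placeholder that should resolve to the load balancer IP). As far as I understand it, reporting a hostname instead of an IP in the Service's LoadBalancer status keeps the IP off kube-ipvs0, so the health check replies actually leave the node:

kubectl -n istio-system annotate service istio-ingressgateway \
  load-balancer.hetzner.cloud/hostname=your-ingress.acme.corp \
  --overwrite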

@github-actions

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.
